Commit

Enable customization of pypgx-bundle directory location (#100):
* :issue:`100`: Add new method :meth:`sdk.utils.get_bundle_path` to enable customization of the ``pypgx-bundle`` directory's location instead of the user's home directory.
sbslee committed Aug 26, 2023
1 parent 27bb79e commit 06ebd3d
Showing 12 changed files with 67 additions and 27 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.rst
@@ -4,6 +4,8 @@ Changelog
0.22.0 (in development)
-----------------------

+* :issue:`100`: Add new method :meth:`sdk.utils.get_bundle_path` to enable customization of the ``pypgx-bundle`` directory's location instead of the user's home directory.
+
0.21.0 (2023-08-25)
-------------------

16 changes: 13 additions & 3 deletions README.rst
@@ -229,19 +229,29 @@ structural variant classifier files in PyPGx are moved to the
(only those files are moved; other files such as ``allele-table.csv`` and
``variant-table.csv`` are intact). Therefore, the user must clone the
``pypgx-bundle`` repository with matching PyPGx version to their home
-directory in order for PyPGx to correctly access the moved files:
+directory in order for PyPGx to correctly access the moved files (i.e. replace
+``x.x.x`` with the version number of PyPGx you're using, such as ``0.18.0``):

.. code-block:: text

    $ cd ~
-    $ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
+    $ git clone --branch x.x.x --depth 1 https://github.com/sbslee/pypgx-bundle

This is undoubtedly annoying, but absolutely necessary for portability
reasons because PyPGx has been growing exponentially in file size due to the
increasing number of genes supported and their variation complexity, to the
point where it now exceeds upload size limit for PyPI (100 Mb). After removal
of those files, the size of PyPGx has reduced from >100 Mb to <1 Mb.

+Starting with version 0.22.0, you can now specify a custom location for the
+``pypgx-bundle`` directory instead of using the home directory. This can be
+achieved by setting the bundle location using the ``PYPGX_BUNDLE`` environment
+variable:
+
+.. code-block:: text
+
+    $ export PYPGX_BUNDLE=/path/to/pypgx-bundle
+
Structural variation detection
==============================

@@ -756,7 +766,7 @@ For getting help on the CLI:
test-cnv-caller Test CNV caller for target gene.
train-cnv-caller Train CNV caller for target gene.
-optional arguments:
+options:
-h, --help Show this help message and exit.
-v, --version Show the version number and exit.
12 changes: 6 additions & 6 deletions docs/cli.rst
@@ -65,7 +65,7 @@ For getting help on the CLI:
test-cnv-caller Test CNV caller for target gene.
train-cnv-caller Train CNV caller for target gene.
-optional arguments:
+options:
-h, --help Show this help message and exit.
-v, --version Show the version number and exit.
@@ -409,7 +409,7 @@ estimate-phase-beagle
-h, --help Show this help message and exit.
--panel PATH VCF file (compressed or uncompressed) corresponding to a
reference haplotype panel. By default, the 1KGP panel in
-the ~/pypgx-bundle directory will be used.
+the pypgx-bundle directory will be used.
--impute Perform imputation of missing genotypes.
filter-samples
@@ -700,7 +700,7 @@ predict-cnv
Optional arguments:
-h, --help Show this help message and exit.
--cnv-caller PATH Archive file with the semantic type Model[CNV]. By
-default, a pre-trained CNV caller in the ~/pypgx-bundle
+default, a pre-trained CNV caller in the pypgx-bundle
directory will be used.
prepare-depth-of-coverage
@@ -813,7 +813,7 @@ run-chip-pipeline
(choices: 'GRCh37', 'GRCh38').
--panel PATH VCF file corresponding to a reference haplotype panel
(compressed or uncompressed). By default, the 1KGP
-panel in the ~/pypgx-bundle directory will be used.
+panel in the pypgx-bundle directory will be used.
--impute Perform imputation of missing genotypes.
--force Overwrite output directory if it already exists.
--samples TEXT [TEXT ...]
@@ -911,7 +911,7 @@ run-ngs-pipeline
(choices: 'GRCh37', 'GRCh38').
--panel PATH VCF file corresponding to a reference haplotype panel
(compressed or uncompressed). By default, the 1KGP panel
-in the ~/pypgx-bundle directory will be used.
+in the pypgx-bundle directory will be used.
--force Overwrite output directory if it already exists.
--samples TEXT [TEXT ...]
Specify which samples should be included for analysis
@@ -926,7 +926,7 @@ run-ngs-pipeline
--do-not-plot-allele-fraction
Do not plot allele fraction profile.
--cnv-caller PATH Archive file with the semantic type Model[CNV]. By
-default, a pre-trained CNV caller in the ~/pypgx-bundle
+default, a pre-trained CNV caller in the pypgx-bundle
directory will be used.
[Example] To genotype the CYP3A5 gene, which does not have SV, from WGS data:
14 changes: 12 additions & 2 deletions docs/create.py
@@ -256,19 +256,29 @@
(only those files are moved; other files such as ``allele-table.csv`` and
``variant-table.csv`` are intact). Therefore, the user must clone the
``pypgx-bundle`` repository with matching PyPGx version to their home
-directory in order for PyPGx to correctly access the moved files:
+directory in order for PyPGx to correctly access the moved files (i.e. replace
+``x.x.x`` with the version number of PyPGx you're using, such as ``0.18.0``):

.. code-block:: text

    $ cd ~
-    $ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
+    $ git clone --branch x.x.x --depth 1 https://github.com/sbslee/pypgx-bundle

This is undoubtedly annoying, but absolutely necessary for portability
reasons because PyPGx has been growing exponentially in file size due to the
increasing number of genes supported and their variation complexity, to the
point where it now exceeds upload size limit for PyPI (100 Mb). After removal
of those files, the size of PyPGx has reduced from >100 Mb to <1 Mb.

+Starting with version 0.22.0, you can now specify a custom location for the
+``pypgx-bundle`` directory instead of using the home directory. This can be
+achieved by setting the bundle location using the ``PYPGX_BUNDLE`` environment
+variable:
+
+.. code-block:: text
+
+    $ export PYPGX_BUNDLE=/path/to/pypgx-bundle
+
Structural variation detection
==============================
6 changes: 3 additions & 3 deletions pypgx/api/pipeline.py
@@ -32,7 +32,7 @@ def run_chip_pipeline(
Reference genome assembly.
panel : str, optional
VCF file corresponding to a reference haplotype panel (compressed or
-uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
directory will be used.
impute : bool, default: False
If True, perform imputation of missing genotypes.
@@ -166,7 +166,7 @@ def run_ngs_pipeline(
Reference genome assembly.
panel : str, optional
VCF file corresponding to a reference haplotype panel (compressed or
-uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
directory will be used.
force : bool, default : False
Overwrite output directory if it already exists.
@@ -184,7 +184,7 @@ def run_ngs_pipeline(
Do not plot allele fraction profile.
cnv_caller : str or pypgx.Archive, optional
Archive file or object with the semantic type Model[CNV]. By default,
-a pre-trained CNV caller in the ``~/pypgx-bundle`` directory will be
+a pre-trained CNV caller in the ``pypgx-bundle`` directory will be
used.
"""
if not core.is_target_gene(gene):
10 changes: 4 additions & 6 deletions pypgx/api/utils.py
@@ -794,7 +794,7 @@ def estimate_phase_beagle(
VCF's contig names.
panel : str, optional
VCF file corresponding to a reference haplotype panel (compressed or
-uncompressed). By default, the 1KGP panel in the ``~/pypgx-bundle``
+uncompressed). By default, the 1KGP panel in the ``pypgx-bundle``
directory will be used.
impute : bool, default: False
If True, perform imputation of missing genotypes.
@@ -819,8 +819,7 @@ def estimate_phase_beagle(
    metadata['Program'] = 'Beagle'

    if panel is None:
-        home = os.path.expanduser('~')
-        panel = f'{home}/pypgx-bundle/1kgp/{assembly}/{gene}.vcf.gz'
+        panel = f'{sdk.get_bundle_path()}/1kgp/{assembly}/{gene}.vcf.gz'

    has_chr_prefix = pyvcf.has_chr_prefix(panel)

@@ -1203,7 +1202,7 @@ def predict_cnv(copy_number, cnv_caller=None):
Archive file or object with the semantic type CovFrame[CopyNumber].
cnv_caller : str or pypgx.Archive, optional
Archive file or object with the semantic type Model[CNV]. By default,
-a pre-trained CNV caller in the ``~/pypgx-bundle`` directory will be
+a pre-trained CNV caller in the ``pypgx-bundle`` directory will be
used.
Returns
@@ -1218,8 +1217,7 @@ def predict_cnv(copy_number, cnv_caller=None):

    gene = copy_number.metadata['Gene']
    assembly = copy_number.metadata['Assembly']
-    home = os.path.expanduser('~')
-    model_file = f'{home}/pypgx-bundle/cnv/{assembly}/{gene}.zip'
+    model_file = f'{sdk.get_bundle_path()}/cnv/{assembly}/{gene}.zip'

    if cnv_caller is None:
        cnv_caller = sdk.Archive.from_file(model_file)
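
The two defaults changed in this file share one layout under the bundle directory: `1kgp/{assembly}/{gene}.vcf.gz` for the reference haplotype panel and `cnv/{assembly}/{gene}.zip` for the pre-trained CNV caller. A minimal sketch of that layout follows. The helper names `default_panel_path` and `default_cnv_model_path` are hypothetical and not part of PyPGx, the bundle directory is passed in as a plain argument instead of being read through `sdk.get_bundle_path()`, and `/data/pypgx-bundle` is only an illustrative location.

```python
import posixpath


def default_panel_path(bundle_dir, assembly, gene):
    """Hypothetical helper: default 1KGP reference panel for a gene."""
    # Mirrors f'{bundle}/1kgp/{assembly}/{gene}.vcf.gz' from the hunk above.
    return posixpath.join(bundle_dir, '1kgp', assembly, f'{gene}.vcf.gz')


def default_cnv_model_path(bundle_dir, assembly, gene):
    """Hypothetical helper: default pre-trained CNV caller for a gene."""
    # Mirrors f'{bundle}/cnv/{assembly}/{gene}.zip' from the hunk above.
    return posixpath.join(bundle_dir, 'cnv', assembly, f'{gene}.zip')


if __name__ == '__main__':
    # Illustrative bundle location, e.g. the value of PYPGX_BUNDLE.
    bundle = '/data/pypgx-bundle'
    print(default_panel_path(bundle, 'GRCh37', 'CYP3A5'))
    print(default_cnv_model_path(bundle, 'GRCh37', 'CYP3A5'))
```

One detail worth noticing when reading the `predict_cnv` hunk: `model_file` is built before the `cnv_caller is None` check, so the bundle lookup appears to run even when the caller supplies their own `cnv_caller`.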
2 changes: 1 addition & 1 deletion pypgx/cli/estimate_phase_beagle.py
@@ -41,7 +41,7 @@ def create_parser(subparsers):
help=
"""VCF file (compressed or uncompressed) corresponding to a
reference haplotype panel. By default, the 1KGP panel in
-the ~/pypgx-bundle directory will be used."""
+the pypgx-bundle directory will be used."""
)
parser.add_argument(
'--impute',
2 changes: 1 addition & 1 deletion pypgx/cli/predict_cnv.py
@@ -39,7 +39,7 @@ def create_parser(subparsers):
metavar='PATH',
help=
"""Archive file with the semantic type Model[CNV]. By
-default, a pre-trained CNV caller in the ~/pypgx-bundle
+default, a pre-trained CNV caller in the pypgx-bundle
directory will be used."""
)

2 changes: 1 addition & 1 deletion pypgx/cli/run_chip_pipeline.py
@@ -57,7 +57,7 @@ def create_parser(subparsers):
help=
"""VCF file corresponding to a reference haplotype panel
(compressed or uncompressed). By default, the 1KGP
-panel in the ~/pypgx-bundle directory will be used."""
+panel in the pypgx-bundle directory will be used."""
)
parser.add_argument(
'--impute',
4 changes: 2 additions & 2 deletions pypgx/cli/run_ngs_pipeline.py
@@ -103,7 +103,7 @@ def create_parser(subparsers):
help=
"""VCF file corresponding to a reference haplotype panel
(compressed or uncompressed). By default, the 1KGP panel
-in the ~/pypgx-bundle directory will be used."""
+in the pypgx-bundle directory will be used."""
)
parser.add_argument(
'--force',
@@ -151,7 +151,7 @@ def create_parser(subparsers):
metavar='PATH',
help=
"""Archive file with the semantic type Model[CNV]. By
-default, a pre-trained CNV caller in the ~/pypgx-bundle
+default, a pre-trained CNV caller in the pypgx-bundle
directory will be used."""
)

4 changes: 2 additions & 2 deletions pypgx/sdk/__init__.py
@@ -1,3 +1,3 @@
-from .utils import (Archive, add_cn_samples, compare_metadata, simulate_copy_number)
+from .utils import (Archive, add_cn_samples, compare_metadata, get_bundle_path, simulate_copy_number)

-__all__ = ['Archive', 'add_cn_samples', 'compare_metadata', 'simulate_copy_number']
+__all__ = ['Archive', 'add_cn_samples', 'compare_metadata', 'get_bundle_path', 'simulate_copy_number']
20 changes: 20 additions & 0 deletions pypgx/sdk/utils.py
@@ -33,6 +33,9 @@ class SemanticTypeNotFoundError(Exception):
class VariantNotFoundError(Exception):
    """Raise if specified variant is not present in the variant table."""

+class BundleNotFoundError(Exception):
+    """Raise if the given path to the pypgx-bundle directory does not exist."""
+
class Archive:
    """
    Class for storing various data.
@@ -364,3 +367,20 @@ def add_cn_samples(target, source, samples):
    df = source.data.df[samples]
    target.data.df = pd.concat([target.data.df, df], axis=1)
    return target
+
+def get_bundle_path():
+    """
+    Return the path to the pypgx-bundle directory.
+
+    This function utilizes the ``PYPGX_BUNDLE`` environment variable to
+    ascertain the filesystem location of the ``pypgx-bundle`` directory. In
+    cases where the environment variable is not defined, the function defaults
+    to the user's home directory.
+    """
+    try:
+        bundle_path = os.environ['PYPGX_BUNDLE']
+    except KeyError:
+        bundle_path = os.path.expanduser('~') + '/pypgx-bundle'
+    if not os.path.exists(bundle_path):
+        raise BundleNotFoundError(f'{bundle_path}')
+    return bundle_path
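
The lookup order described in the docstring (use `PYPGX_BUNDLE` if it is set, otherwise fall back to `~/pypgx-bundle`, then fail if the directory does not exist) can be tried outside PyPGx with a small stand-in. This is a sketch under stated assumptions, not the code from this commit: `resolve_bundle_path` and its `environ` parameter are hypothetical, added so the behaviour can be exercised without touching the real process environment, and the exception class is redefined locally so the snippet runs without PyPGx installed.

```python
import os
import tempfile


class BundleNotFoundError(Exception):
    """Raised when the resolved pypgx-bundle directory does not exist."""


def resolve_bundle_path(environ=None):
    """Hypothetical stand-in for get_bundle_path(); not the PyPGx code."""
    environ = os.environ if environ is None else environ
    # Prefer PYPGX_BUNDLE; otherwise fall back to ~/pypgx-bundle.
    default = os.path.join(os.path.expanduser('~'), 'pypgx-bundle')
    bundle_path = environ.get('PYPGX_BUNDLE', default)
    if not os.path.exists(bundle_path):
        raise BundleNotFoundError(bundle_path)
    return bundle_path


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        # An existing directory named by PYPGX_BUNDLE is returned unchanged.
        print(resolve_bundle_path({'PYPGX_BUNDLE': tmp}) == tmp)
        try:
            # A path that does not exist triggers the error.
            resolve_bundle_path({'PYPGX_BUNDLE': os.path.join(tmp, 'missing')})
        except BundleNotFoundError as error:
            print(f'bundle not found: {error}')
```

Passing the environment as a mapping is a testing convenience only; the function in this commit reads `os.environ` directly.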
