Running Pypgx with Phased variants and Predicted CNV #32

NTNguyen13 · 2021-12-02T17:43:13Z

Hi, I have read the manual and the code, I found that Pypgx always needs to run Beagle to phase the variants before genotyping. and the CNV is detected by read depth. However, in some cases, our variants will be phased by another workflow, so does CNV! How can I input phased VCF and predicted CNV (say, in some common format) into predict_allele and call_genotypes ?

In the past, when I used Stargazer, I saw in the code that it will detect the GT separator '|' or '/' before doing phasing, but Pypgx will always unphase all variants first, then phase, and finally consolidate both files. I tried to tinker with the new Pypgx code, but it turned out to be harder than I thought, and I'm not sure that I understand everything.

Thank you very much.

The text was updated successfully, but these errors were encountered:

sbslee · 2021-12-03T00:17:29Z

@NTNguyen13, thank you for asking these important questions! What you are asking is precisely why I started PyPGx! I wanted to give people a unified platform for PGx research where no matter what type of input data a user might have (e.g. phased vs. unphased VCF), he or she can use PyPGx to process their data and get PGx results. That being said, let's delve into the specifics.

It's true that with the current version of PyPGx, it's not easy to work with phased VCF directly. Updating PyPGx to accept phased VCF is definitely on top of my to-do list. I will be releasing the official version of 0.9.0 this weekend, so I will prioritize this change in the next version (0.10.0-dev). I also appreciate you trying to tinker with PyPGx to do this yourself :)

Regarding this update, one important thing to note is that the predict-alleles command only requires a VcfFrame[Consolidated] archive (no CNV data is required at this point). That means, if you can turn your already phased VCF file into a VcfFrame[Consolidated] archive, you should be able to run the predict-alleles command. To qualify as a VcfFrame[Consolidated] archive, your VCF file needs to be fully phased. The only difference between VcfFrame[Consolidated] and VcfFrame[Phased] is that the latter is missing some variants because they were filtered from a phasing tool (e.g. Beagle). Therefore, if your phased VCF contains every variant you need and is fully phased, then you can treat it as VcfFrame[Consolidated].

Creating an archive is easy! Create a directory that contains your VCF file (it must be named data.vcf) and a metadata file (it must be named metadata.txt). Note that the VCF file should be "sliced" to contain only variants for the target gene. The metadata should look something like this:

$ cat metadata.txt 
Platform=WGS
Gene=CYP2D6
Assembly=GRCh37
SemanticType=VcfFrame[Consolidated]
Program=PROGRAM_NAME

Let's say you named this directory test_dir. Then run the following command in Terminal:

$ zip -r test_dir.zip test_dir

The resulting file, test_dir.zip, should now be recognized as a VcfFrame[Consolidated] archive.

Now switching gear, let's talk about CNV data. You mentioned that you have your own CNV data, which is awesome! The current version of PyPGx implements a support vector machine-based classifier to detect PGx CNVs using read depth. If you somehow can generate your own CNV data, 1) please consider exporting the algorithm to PyPGx so others can enjoy your algorithm (your work will be properly credited of course), 2) you can have CNV results in an archive called SampleTable[CNVCalls]. This archive has two files: data.tsv and metadata.txt.

The most important thing here is that your CNV results must conform to the CNV names that I have defined because only these names will be recognized by the call-genotype command. For example, for the CYP2D6 gene, there are currently 8 CNVs defined: Normal, DeletionHet, Duplication, Tandem1, Tandem2A, Tandem2B, DeletionHet,Tandem1 Duplication,Tandem1. You can find these names in pypgx > pypgx > api > data > cnv-table.csv.

I hope this helps. Please let me know if you have any questions or comments.

NTNguyen13 · 2021-12-06T08:13:59Z

Thank you for the suggestion.

Hi, I think I found one solution for the phased VCF. I made a new function import_phased_variants based on the function import_variants by first removing the preprocessing step (add_af, unphase), then edit the SemanticType. I also edited some code and argument parsers inside pipeline.py and run_ngs_pipeline.py to run it properly.

I have updated the code on my fork, if you open another release branch, I will make my PR ASAP.

NTNguyen13 · 2021-12-06T09:34:29Z

I found that the current method for loading and processing VCF files is quite memory-demanding and time-consuming, especially for cohort WGS. I'm thinking about using cyvcf for the region slicing and then load the data into pandas Dataframe, do you think it's appropriate for Pypgx?

About CNV, currently I cannot open the CNV definition files in pypgx/pypgx/api/cnv/GRCh38 because of the *.sav extension.

sbslee · 2021-12-06T10:26:10Z

@NTNguyen13, thanks for your enthusiasm on PyPGx :)

I will soon create a new development branch, 0.10.0-dev. But please do not make PRs yet because I have some plans to update the api.utils.import_variants method to accept phased VCF as input. I understand you want to solve the issue by creating a new method (i.e. import_phased_variants), but I prefer to modify existing methods to creating new ones. Basically, I'm planning to update api.utils.import_variants to detect the phasing status of the input VCF and then output a VcfFrame[Imported] if the input is not phased and VcfFrame[Consolidated] if it's fully phased. I will then update the api.pipeline.run_ngs_pipeline method so that if the input VCF is phased, it skips phasing. I will try to do this update by the end of this weekend.
I'm aware the api.utils.import_variants method is slow for large VCF files. Under the hood, it uses the pyvcf.VcfFrame.from_file method from thefuc package (it's also written by me). I recently updated the pyvcf.VcfFrame.from_file method to use random access to speed up VCF loading for specified region (via the pysam package). So I will update api.utils.import_variants to use this feature of pyvcf.VcfFrame.from_file. It should solve the VCF loading issue for large files.
The .sav files are CNV classifiers. You need to go pypgx > pypgx > api > data > cnv-table.csv to see their definitions (labels). If you want to know which dataset was used to train the classifiers, then take a look at https://github.com/sbslee/pypgx-data/tree/main/cnv-data.

Thanks for your patience.

NTNguyen13 · 2021-12-06T10:55:30Z

That would be great, I can't wait to check out the latest features.

* :issue:`32`: Update :command:`import-variants` command to accept phased VCF as input. It will output VcfFrame[Consolidated] if the input VCF is fully phased or otherwise VcfFrame[Imported] as usual.

* Update :meth:`api.plot.plot_vcf_allele_fraction` method to accept both VcfFrame[Imported] and VcfFrame[Consolidated] * :issue:`32`: Update :command:`run-ngs-pipeline` method to accept phased VCF as input. In this case, the method will skip statistical haplotype phasing.

sbslee · 2021-12-08T11:58:39Z

@NTNguyen13,

Just letting you know that I've made updates so that the run-ngs-pipeline command can accept phased VCF as input. I still need to update it more so that VCF loading is faster for large files, but I wanted to let you know :)

Here's some example using the GeT-RM tutorial:

(fuc) sbslee@Seung-beens-MacBook-Air getrm-wgs-tutorial % pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline \
--variants grch37-variants.vcf.gz \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip --force
Saved VcfFrame[Imported] to: grch37-CYP2D6-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: grch37-CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline/results.zip

Unzip the grch37-CYP2D6-pipeline/consolidated-variants.zip file to extract the phased VCF:

(fuc) sbslee@Seung-beens-MacBook-Air getrm-wgs-tutorial % pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline-test \
--variants data.vcf \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip \
--force
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline-test/imported-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline-test/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline-test/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline-test/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline-test/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline-test/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline-test/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline-test/results.zip

Notice how above does not create VcfFrame[Imported] and VcfFrame[Phased].

If you are planning to do your own testing, please make sure to also update your fuc package to use the 0.29.0-dev branch. I added a new property to that branch (pyvcf.VcfFrame.phased) which tests whether the given VCF is fully phased.

NTNguyen13 closed this as completed Dec 6, 2021

sbslee mentioned this issue Dec 8, 2021

Speeding up VCF loading for large files #34

Closed

sbslee added the enhancement New feature or request label Dec 10, 2021

sbslee linked a pull request Mar 17, 2022 that will close this issue

0.10.0 dev #40

Merged

NTNguyen13 mentioned this issue Apr 19, 2022

Integration CNV/SV with PyPGx pipeline #56

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running Pypgx with Phased variants and Predicted CNV #32

Running Pypgx with Phased variants and Predicted CNV #32

NTNguyen13 commented Dec 2, 2021

sbslee commented Dec 3, 2021

NTNguyen13 commented Dec 6, 2021 •

edited

NTNguyen13 commented Dec 6, 2021

sbslee commented Dec 6, 2021 •

edited

NTNguyen13 commented Dec 6, 2021

sbslee commented Dec 8, 2021

Running Pypgx with Phased variants and Predicted CNV #32

Running Pypgx with Phased variants and Predicted CNV #32

Comments

NTNguyen13 commented Dec 2, 2021

sbslee commented Dec 3, 2021

NTNguyen13 commented Dec 6, 2021 • edited

NTNguyen13 commented Dec 6, 2021

sbslee commented Dec 6, 2021 • edited

NTNguyen13 commented Dec 6, 2021

sbslee commented Dec 8, 2021

NTNguyen13 commented Dec 6, 2021 •

edited

sbslee commented Dec 6, 2021 •

edited