Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running Pypgx with Phased variants and Predicted CNV #32

Closed
NTNguyen13 opened this issue Dec 2, 2021 · 6 comments · Fixed by #40
Closed

Running Pypgx with Phased variants and Predicted CNV #32

NTNguyen13 opened this issue Dec 2, 2021 · 6 comments · Fixed by #40
Labels
enhancement New feature or request

Comments

@NTNguyen13
Copy link
Contributor

Hi, I have read the manual and the code, I found that Pypgx always needs to run Beagle to phase the variants before genotyping. and the CNV is detected by read depth. However, in some cases, our variants will be phased by another workflow, so does CNV! How can I input phased VCF and predicted CNV (say, in some common format) into predict_allele and call_genotypes ?

In the past, when I used Stargazer, I saw in the code that it will detect the GT separator '|' or '/' before doing phasing, but Pypgx will always unphase all variants first, then phase, and finally consolidate both files. I tried to tinker with the new Pypgx code, but it turned out to be harder than I thought, and I'm not sure that I understand everything.

Thank you very much.

@sbslee
Copy link
Owner

sbslee commented Dec 3, 2021

@NTNguyen13, thank you for asking these important questions! What you are asking is precisely why I started PyPGx! I wanted to give people a unified platform for PGx research where no matter what type of input data a user might have (e.g. phased vs. unphased VCF), he or she can use PyPGx to process their data and get PGx results. That being said, let's delve into the specifics.

It's true that with the current version of PyPGx, it's not easy to work with phased VCF directly. Updating PyPGx to accept phased VCF is definitely on top of my to-do list. I will be releasing the official version of 0.9.0 this weekend, so I will prioritize this change in the next version (0.10.0-dev). I also appreciate you trying to tinker with PyPGx to do this yourself :)

Regarding this update, one important thing to note is that the predict-alleles command only requires a VcfFrame[Consolidated] archive (no CNV data is required at this point). That means, if you can turn your already phased VCF file into a VcfFrame[Consolidated] archive, you should be able to run the predict-alleles command. To qualify as a VcfFrame[Consolidated] archive, your VCF file needs to be fully phased. The only difference between VcfFrame[Consolidated] and VcfFrame[Phased] is that the latter is missing some variants because they were filtered from a phasing tool (e.g. Beagle). Therefore, if your phased VCF contains every variant you need and is fully phased, then you can treat it as VcfFrame[Consolidated].

Creating an archive is easy! Create a directory that contains your VCF file (it must be named data.vcf) and a metadata file (it must be named metadata.txt). Note that the VCF file should be "sliced" to contain only variants for the target gene. The metadata should look something like this:

$ cat metadata.txt 
Platform=WGS
Gene=CYP2D6
Assembly=GRCh37
SemanticType=VcfFrame[Consolidated]
Program=PROGRAM_NAME

Let's say you named this directory test_dir. Then run the following command in Terminal:

$ zip -r test_dir.zip test_dir

The resulting file, test_dir.zip, should now be recognized as a VcfFrame[Consolidated] archive.

Now switching gear, let's talk about CNV data. You mentioned that you have your own CNV data, which is awesome! The current version of PyPGx implements a support vector machine-based classifier to detect PGx CNVs using read depth. If you somehow can generate your own CNV data, 1) please consider exporting the algorithm to PyPGx so others can enjoy your algorithm (your work will be properly credited of course), 2) you can have CNV results in an archive called SampleTable[CNVCalls]. This archive has two files: data.tsv and metadata.txt.

The most important thing here is that your CNV results must conform to the CNV names that I have defined because only these names will be recognized by the call-genotype command. For example, for the CYP2D6 gene, there are currently 8 CNVs defined: Normal, DeletionHet, Duplication, Tandem1, Tandem2A, Tandem2B, DeletionHet,Tandem1 Duplication,Tandem1. You can find these names in pypgx > pypgx > api > data > cnv-table.csv.

I hope this helps. Please let me know if you have any questions or comments.

@NTNguyen13
Copy link
Contributor Author

NTNguyen13 commented Dec 6, 2021

Thank you for the suggestion.

Hi, I think I found one solution for the phased VCF. I made a new function import_phased_variants based on the function import_variants by first removing the preprocessing step (add_af, unphase), then edit the SemanticType. I also edited some code and argument parsers inside pipeline.py and run_ngs_pipeline.py to run it properly.

I have updated the code on my fork, if you open another release branch, I will make my PR ASAP.

@NTNguyen13
Copy link
Contributor Author

I found that the current method for loading and processing VCF files is quite memory-demanding and time-consuming, especially for cohort WGS. I'm thinking about using cyvcf for the region slicing and then load the data into pandas Dataframe, do you think it's appropriate for Pypgx?

About CNV, currently I cannot open the CNV definition files in pypgx/pypgx/api/cnv/GRCh38 because of the *.sav extension.

@sbslee
Copy link
Owner

sbslee commented Dec 6, 2021

@NTNguyen13, thanks for your enthusiasm on PyPGx :)

  1. I will soon create a new development branch, 0.10.0-dev. But please do not make PRs yet because I have some plans to update the api.utils.import_variants method to accept phased VCF as input. I understand you want to solve the issue by creating a new method (i.e. import_phased_variants), but I prefer to modify existing methods to creating new ones. Basically, I'm planning to update api.utils.import_variants to detect the phasing status of the input VCF and then output a VcfFrame[Imported] if the input is not phased and VcfFrame[Consolidated] if it's fully phased. I will then update the api.pipeline.run_ngs_pipeline method so that if the input VCF is phased, it skips phasing. I will try to do this update by the end of this weekend.
  2. I'm aware the api.utils.import_variants method is slow for large VCF files. Under the hood, it uses the pyvcf.VcfFrame.from_file method from thefuc package (it's also written by me). I recently updated the pyvcf.VcfFrame.from_file method to use random access to speed up VCF loading for specified region (via the pysam package). So I will update api.utils.import_variants to use this feature of pyvcf.VcfFrame.from_file. It should solve the VCF loading issue for large files.
  3. The .sav files are CNV classifiers. You need to go pypgx > pypgx > api > data > cnv-table.csv to see their definitions (labels). If you want to know which dataset was used to train the classifiers, then take a look at https://github.com/sbslee/pypgx-data/tree/main/cnv-data.

Thanks for your patience.

@NTNguyen13
Copy link
Contributor Author

That would be great, I can't wait to check out the latest features.

sbslee added a commit that referenced this issue Dec 6, 2021
* :issue:`32`: Update :command:`import-variants` command to accept 
phased VCF as input. It will output VcfFrame[Consolidated] if the input 
VCF is fully phased or otherwise VcfFrame[Imported] as usual.
sbslee added a commit that referenced this issue Dec 8, 2021
* Update :meth:`api.plot.plot_vcf_allele_fraction` method to accept both 
VcfFrame[Imported] and VcfFrame[Consolidated]
* :issue:`32`: Update :command:`run-ngs-pipeline` method to accept 
phased VCF as input. In this case, the method will skip statistical 
haplotype phasing.
@sbslee
Copy link
Owner

sbslee commented Dec 8, 2021

@NTNguyen13,

Just letting you know that I've made updates so that the run-ngs-pipeline command can accept phased VCF as input. I still need to update it more so that VCF loading is faster for large files, but I wanted to let you know :)

Here's some example using the GeT-RM tutorial:

(fuc) sbslee@Seung-beens-MacBook-Air getrm-wgs-tutorial % pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline \
--variants grch37-variants.vcf.gz \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip --force
Saved VcfFrame[Imported] to: grch37-CYP2D6-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: grch37-CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline/results.zip

Unzip the grch37-CYP2D6-pipeline/consolidated-variants.zip file to extract the phased VCF:

(fuc) sbslee@Seung-beens-MacBook-Air getrm-wgs-tutorial % pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline-test \
--variants data.vcf \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip \
--force
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline-test/imported-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline-test/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline-test/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline-test/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline-test/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline-test/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline-test/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline-test/results.zip

Notice how above does not create VcfFrame[Imported] and VcfFrame[Phased].

If you are planning to do your own testing, please make sure to also update your fuc package to use the 0.29.0-dev branch. I added a new property to that branch (pyvcf.VcfFrame.phased) which tests whether the given VCF is fully phased.

@sbslee sbslee added the enhancement New feature or request label Dec 10, 2021
@sbslee sbslee linked a pull request Mar 17, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants