CYP2B6*17 calling issue caused by having multiple variant synonyms #53

Closed
sbslee opened this issue Mar 31, 2022 · 0 comments · Fixed by #55
Labels
enhancement New feature or request

sbslee commented Mar 31, 2022

CYP2B6*17 is currently defined with three GRCh37 variants 19-41497286-A-T, 19-41497292-G-GGC, and 19-41497294-CCG-C. If you look closely, you will note that the latter two variants actually overlap such that:

GRCh37:          GRCh38:
   * **             * **
999999999        888889999
012345678        567890123
ATGACCGCC        ATGACCGCC
ATGgCacCC        ATGgCacCC

Therefore, another way to represent 19-41497292-G-GGC and 19-41497294-CCG-C is to have three separate SNVs instead: 19-41497293-A-G, 19-41497295-C-A, and 19-41497296-G-C.
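The equivalence of the two representations can be checked by applying each set of variants to the reference bases from the diagram above. This is only a minimal sketch: `apply_variants`, `REF_START`, and `REF_SEQ` are hypothetical names made up for illustration and are not part of PyPGx.

```python
# Reference bases for GRCh37 chr19:41497290-41497298, copied from the diagram.
REF_START = 41497290
REF_SEQ = "ATGACCGCC"


def apply_variants(variants):
    """Apply (pos, ref, alt) variants to REF_SEQ and return the edited sequence.

    Hypothetical helper for illustration only; it assumes the REF spans
    of the variants do not overlap each other.
    """
    seq = REF_SEQ
    # Work from right to left so earlier edits do not shift later coordinates.
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - REF_START
        assert seq[i:i + len(ref)] == ref, f"REF mismatch at {pos}"
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


indels = [(41497292, "G", "GGC"), (41497294, "CCG", "C")]
snvs = [(41497293, "A", "G"), (41497295, "C", "A"), (41497296, "G", "C")]

print(apply_variants(indels))  # ATGGCACCC
print(apply_variants(snvs))    # ATGGCACCC
```

Both calls produce the same haplotype, which matches the second sequence row in the diagram (ATGgCacCC).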

In fact, different genotype callers use different variant representations: GATK4 HaplotypeCaller outputs the two indels (as does gnomAD: 19-41497292-G-GGC and 19-41497294-CCG-C), while the newly introduced create-input-vcf command in PyPGx (which uses bcftools internally) outputs three separate SNVs.

This poses a problem for PyPGx because it needs to be able to call the same CYP2B6*17 allele but with different variant representations depending on the input VCF.

Usually, this type of problem is handled fairly easily with the idea of variant "synonyms". For example, the variant-table.csv file has a GRCh37Synonym column (e.g. 2-234668879-C-CAT in the UGT1A1 gene has 2-234668879-CAT-CATAT as its synonym).

However, in the case of CYP2B6*17 we have two indels vs. three SNVs, which breaks the paradigm of one synonym per variant. Therefore, starting with the 0.14.0-dev version, we will abandon this paradigm: 19-41497292-G-GGC will have 19-41497293-A-G,19-41497295-C-A as its synonyms, while 19-41497294-CCG-C will have 19-41497296-G-C as its synonym:

>>> import pypgx
>>> pypgx.get_variant_synonyms('CYP2B6')
{'19-41497293-A-G': '19-41497292-G-GGC', '19-41497295-C-A': '19-41497292-G-GGC', '19-41497296-G-C': '19-41497294-CCG-C'}

Note that this means both 19-41497293-A-G and 19-41497295-C-A will point to 19-41497292-G-GGC. Therefore, when it comes to star allele calling, technically, having either 19-41497293-A-G or 19-41497295-C-A is equivalent to having 19-41497292-G-GGC.
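To make the consequence concrete, here is a rough sketch of how such a mapping could be used to normalize observed variants before comparing them to the CYP2B6*17 definition. The dictionary is copied from the output above; `to_canonical` is a hypothetical helper for illustration and is not how `predict_alleles` is actually implemented.

```python
# Synonym -> canonical variant, copied from the get_variant_synonyms output.
SYNONYMS = {
    "19-41497293-A-G": "19-41497292-G-GGC",
    "19-41497295-C-A": "19-41497292-G-GGC",
    "19-41497296-G-C": "19-41497294-CCG-C",
}

# CYP2B6*17 as defined in this issue (GRCh37).
STAR17 = {"19-41497286-A-T", "19-41497292-G-GGC", "19-41497294-CCG-C"}


def to_canonical(observed):
    """Replace each observed variant with its canonical form, if it has one.

    Hypothetical helper, not part of the PyPGx API.
    """
    return {SYNONYMS.get(variant, variant) for variant in observed}


# Input VCF that represents the allele with three SNVs.
three_snvs = {
    "19-41497286-A-T",
    "19-41497293-A-G",
    "19-41497295-C-A",
    "19-41497296-G-C",
}
print(to_canonical(three_snvs) == STAR17)  # True

# The caveat: one of the two SNVs mapped to the insertion is missing,
# yet the normalized set still matches the *17 definition.
partial = {"19-41497286-A-T", "19-41497293-A-G", "19-41497296-G-C"}
print(to_canonical(partial) == STAR17)  # True
```

The second check is the non-ideal behavior described above: a sample carrying only one of 19-41497293-A-G and 19-41497295-C-A is treated the same as one carrying both.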

Obviously this is not ideal, but given how rare this issue is, I think this is a reasonable solution that has minimal impact on the overall data structure in PyPGx.

@sbslee sbslee added the enhancement New feature or request label Mar 31, 2022
sbslee added a commit that referenced this issue Mar 31, 2022
* :issue:`53` Update CYP2B6\*17 variants to have synonyms. Update 
:meth:`api.core.get_variant_synonyms` and 
:meth:`api.utils.predict_alleles` methods to allow mapping of single 
variant to multiple synonyms.
@sbslee sbslee closed this as completed Mar 31, 2022
@sbslee sbslee linked a pull request Apr 2, 2022 that will close this issue