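CYP2B6*17 is currently defined by three GRCh37 variants: 19-41497286-A-T, 19-41497292-G-GGC, and 19-41497294-CCG-C. If you look closely, you will note that the latter two variants sit right next to each other and cancel out in length: applied together, they change the reference sequence GACCG at positions 41497292-41497296 into GGCAC.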
Therefore, another way to represent 19-41497292-G-GGC and 19-41497294-CCG-C is to have three separate SNVs instead: 19-41497293-A-G, 19-41497295-C-A, and 19-41497296-G-C.
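The equivalence of the two representations can be checked by applying each set of variants to the same reference window. The snippet below is a minimal sketch, not PyPGx code: the `apply_variants` helper is hypothetical, and the reference bases `GACCG` are inferred from the REF alleles of the variants listed above rather than read from a genome file.

```python
# Reference window for GRCh37 chr19:41497292-41497296.
# Assumption: bases are inferred from the REF alleles of the variants above.
REF_START = 41497292
REF_SEQ = "GACCG"

def apply_variants(ref_start, ref_seq, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference window."""
    pieces = []
    cursor = 0  # index into ref_seq of the next base not yet copied
    for pos, ref, alt in sorted(variants):
        offset = pos - ref_start
        if offset < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if ref_seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at {pos}")
        pieces.append(ref_seq[cursor:offset])  # unchanged bases before the variant
        pieces.append(alt)                     # the variant's ALT allele
        cursor = offset + len(ref)
    pieces.append(ref_seq[cursor:])            # unchanged tail of the window
    return "".join(pieces)

# Representation 1: two indels (as output by GATK4 HaplotypeCaller and gnomAD).
indels = [(41497292, "G", "GGC"), (41497294, "CCG", "C")]

# Representation 2: three SNVs.
snvs = [(41497293, "A", "G"), (41497295, "C", "A"), (41497296, "G", "C")]

hap_indels = apply_variants(REF_START, REF_SEQ, indels)
hap_snvs = apply_variants(REF_START, REF_SEQ, snvs)
print(hap_indels, hap_snvs, hap_indels == hap_snvs)
```

Both calls produce the same five-base haplotype, which is why a caller is free to report either form.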
In fact, different genotype callers use different variant representations: GATK4 HaplotypeCaller outputs the two indels (as does gnomAD: 19-41497292-G-GGC and 19-41497294-CCG-C), while the newly introduced create-input-vcf command in PyPGx (which uses bcftools internally) outputs three separate SNVs.
This poses a problem for PyPGx because it must call the same CYP2B6*17 allele regardless of which variant representation the input VCF uses.
Usually, this type of problem is handled fairly easily with the idea of variant "synonyms". For example, if you look at the variant-table.csv file, there is a GRCh37Synonym column (e.g. 2-234668879-C-CAT in the UGT1A1 gene has 2-234668879-CAT-CATAT as its synonym).
However, in the case of CYP2B6*17 we have two indels vs. three SNVs, which breaks the paradigm of one synonym per variant. Therefore, starting with the 0.14.0-dev version, we will abandon this paradigm: 19-41497292-G-GGC will have 19-41497293-A-G,19-41497295-C-A as its synonyms, while 19-41497294-CCG-C will have 19-41497296-G-C as its synonym.
Note that this means both 19-41497293-A-G and 19-41497295-C-A will point to 19-41497292-G-GGC. Therefore, when it comes to star allele calling, technically, having either 19-41497293-A-G or 19-41497295-C-A is equivalent to having 19-41497292-G-GGC.
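To make the consequence concrete, here is a minimal sketch of a one-to-many synonym lookup. It is not the PyPGx implementation: the `SYNONYM_CELLS` dictionary, the comma-separated cell format, and the helpers `build_reverse_map`, `normalize`, and `has_allele` are all hypothetical names chosen for illustration.

```python
# Hypothetical synonym table: canonical variant -> comma-separated synonyms,
# mimicking a single table cell that holds more than one synonym.
SYNONYM_CELLS = {
    "19-41497292-G-GGC": "19-41497293-A-G,19-41497295-C-A",
    "19-41497294-CCG-C": "19-41497296-G-C",
}

# Core variants defining CYP2B6*17, as listed in this issue.
CYP2B6_17 = ["19-41497286-A-T", "19-41497292-G-GGC", "19-41497294-CCG-C"]

def build_reverse_map(cells):
    """Map every synonym back to its canonical variant."""
    reverse = {}
    for canonical, cell in cells.items():
        for synonym in cell.split(","):
            reverse[synonym] = canonical
    return reverse

def normalize(observed, cells):
    """Replace any observed synonym with its canonical variant."""
    reverse = build_reverse_map(cells)
    return {reverse.get(variant, variant) for variant in observed}

def has_allele(observed, core_variants, cells):
    """True if all core variants are present after synonym normalization."""
    return set(core_variants) <= normalize(observed, cells)

# Input using the two-indel representation.
indel_input = ["19-41497286-A-T", "19-41497292-G-GGC", "19-41497294-CCG-C"]

# Input using the three-SNV representation.
snv_input = ["19-41497286-A-T", "19-41497293-A-G",
             "19-41497295-C-A", "19-41497296-G-C"]

# Input with only ONE of the two SNVs that map to 19-41497292-G-GGC.
partial_input = ["19-41497286-A-T", "19-41497293-A-G", "19-41497296-G-C"]

print(has_allele(indel_input, CYP2B6_17, SYNONYM_CELLS))
print(has_allele(snv_input, CYP2B6_17, SYNONYM_CELLS))
print(has_allele(partial_input, CYP2B6_17, SYNONYM_CELLS))
```

In this sketch all three inputs match, including the partial one. That last case is the trade-off described above: a single SNV is enough to stand in for the whole indel.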
Obviously this is not ideal, but given how rare this issue is, I think this is a reasonable solution that has minimal impact on the overall data structure in PyPGx.
* :issue:`53` Update CYP2B6\*17 variants to have synonyms. Update the
  :meth:`api.core.get_variant_synonyms` and
  :meth:`api.utils.predict_alleles` methods to allow mapping of a single
  variant to multiple synonyms.