# Clean up

In this notebook we go through every data set and check that the ClinVar records can be matched to protein sequences in our ortholog collections. Along the way we detect and correct errors in the data.

## Extract transcript IDs from variants

Pull the transcript record IDs from the ClinVar records of each gene.

**Output:** `vartranscripts` dictionary with genes as keys and lists of transcripts as values.
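The helper `get_transcripts_from_variants` lives in the `kondrashov` module and is not shown in this notebook. As a rough sketch of what such an extraction could look like, assuming each ClinVar `Name` starts with a versioned RefSeq accession followed by the gene symbol in parentheses, the hypothetical function below collects the unique transcript IDs in order of first appearance. The function name and the regular expression are illustrative assumptions, not the notebook's actual code.

```python
import re

# Assumed shape of a well-formed ClinVar name: 'NM_000518.5(HBB):c.431A>C (...)'
REFSEQ_MRNA = re.compile(r'^(NM_\d+\.\d+)\(')

def transcripts_from_names(names):
    """Return unique transcript IDs in order of first appearance.

    Names that do not start with an NM_ accession (e.g. 'G6PD A-') are
    kept verbatim so that a later step can flag them.
    """
    seen = []
    for name in names:
        m = REFSEQ_MRNA.match(name)
        tid = m.group(1) if m else name
        if tid not in seen:
            seen.append(tid)
    return seen

# Names copied from the G6PD table shown later in this notebook
names = [
    'NM_001360016.2(G6PD):c.209A>G (p.Tyr70Cys)',
    'NM_000402.4(G6PD):c.298T>C (p.Tyr100His)',
    'G6PD A-',
    'NM_001360016.2(G6PD):c.152C>T (p.Thr51Ile)',
]
result = transcripts_from_names(names)
print(result)
```

Keeping malformed names in the list, rather than silently dropping them, is what allows the matching step to report them as invalid.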

In [1]:
run kondrashov

In [2]:
i = 0  # running total of transcripts across all genes
vartranscripts = {}
for gene in loci:
    vt = get_transcripts_from_variants(gene, True)
    i += len(vt)
    print(gene, vt, i, 'transcripts')
    vartranscripts[gene] = vt

ABCD1 ['NM_000033.4'] 1 transcripts
ALPL ['NM_000478.6'] 2 transcripts
AR ['NM_000044.6'] 3 transcripts
ATP7B ['NM_000053.4', 'NM_000053.3'] 5 transcripts
BTK ['NM_000061.3', 'NM_000061.2', 'NM_001287345.1'] 8 transcripts
CASR ['NM_000388.4'] 9 transcripts
CBS ['NM_000071.3', 'NM_000071.2', 'NM_001178009.3'] 12 transcripts
CFTR ['NM_000492.4', 'NM_000492.3'] 14 transcripts
CYBB ['NM_000397.3', 'NM_000397.4'] 16 transcripts
F7 ['NM_019616.4'] 17 transcripts
F8 ['NM_019863.2', 'NM_000132.4', 'NM_000132.3'] 20 transcripts
F9 ['NM_000133.3', 'NM_000133.4', 'NM_001313913.1'] 23 transcripts
G6PD ['NM_001042351.1', 'NM_001360016.2', 'NM_000402.4'] 26 transcripts
GALT ['NM_000155.2', 'NM_000155.4'] 28 transcripts
GBA ['NM_001005741.2', 'NM_000157.3', 'NM_000157.4'] 31 transcripts
GJB1 ['NM_000166.6'] 32 transcripts
HBB ['NM_000518.4', 'NM_000518.5'] 34 transcripts
HPRT1 ['NM_000194.3', 'NM_000194.2'] 36 transcripts
IL2RG ['NM_000206.2', 'NM_000206.3'] 38 transcripts
KCNH2 ['NM_000238.4', 'NM_1

In [3]:
vartranscripts

{'ABCD1': ['NM_000033.4'],
 'ALPL': ['NM_000478.6'],
 'AR': ['NM_000044.6'],
 'ATP7B': ['NM_000053.4', 'NM_000053.3'],
 'BTK': ['NM_000061.3', 'NM_000061.2', 'NM_001287345.1'],
 'CASR': ['NM_000388.4'],
 'CBS': ['NM_000071.3', 'NM_000071.2', 'NM_001178009.3'],
 'CFTR': ['NM_000492.4', 'NM_000492.3'],
 'CYBB': ['NM_000397.3', 'NM_000397.4'],
 'F7': ['NM_019616.4'],
 'F8': ['NM_019863.2', 'NM_000132.4', 'NM_000132.3'],
 'F9': ['NM_000133.3', 'NM_000133.4', 'NM_001313913.1'],
 'G6PD': ['NM_001042351.1', 'NM_001360016.2', 'NM_000402.4'],
 'GALT': ['NM_000155.2', 'NM_000155.4'],
 'GBA': ['NM_001005741.2', 'NM_000157.3', 'NM_000157.4'],
 'GJB1': ['NM_000166.6'],
 'HBB': ['NM_000518.4', 'NM_000518.5'],
 'HPRT1': ['NM_000194.3', 'NM_000194.2'],
 'IL2RG': ['NM_000206.2', 'NM_000206.3'],
 'KCNH2': ['NM_000238.4', 'NM_172056.2', 'NM_000238.3'],
 'KCNQ1': ['NM_000218.3', 'NM_181798.1', 'NM_000218.2'],
 'L1CAM': ['NM_001278116.2'],
 'LDLR': ['NM_000527.5', 'NM_000527.4', 'NM_001195803.2', 'NM_00119

You can use this `dict` to look up the transcripts associated with a given gene in the ClinVar data.

In [3]:
vartranscripts['TTR']

['NM_000371.4', 'NM_000371.3']

## Extract human protein sequence records from the fasta files

For each gene, parse the FASTA file of ortholog sequences and keep the records whose species is *Homo sapiens*.

**Output:** `protrecs` dictionary with genes as keys and lists of protein records as values.
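The `get_species` helper used below is also defined in `kondrashov` and not shown. A minimal sketch of one way to recover the species, assuming the NCBI convention of putting it in square brackets at the end of the FASTA description, could look like this. The function name `species_from_description` is hypothetical, and it takes the description string rather than a `SeqRecord` so that it runs without Biopython.

```python
import re

def species_from_description(description):
    """Return the text inside the last [...] of a FASTA description, or None."""
    found = re.findall(r'\[([^\[\]]+)\]', description)
    return found[-1] if found else None

# Description copied from one of the TYR records in this notebook
desc = 'XP_011541272.1 tyrosinase isoform X1 [Homo sapiens]'
species = species_from_description(desc)
print(species)
```

Taking the last bracketed group is a deliberate choice: some protein names contain their own brackets, while the species tag comes at the end.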

In [4]:
i = 1
protrecs = {}
for gene in loci:
    protrecs[gene] = []
    for record in SeqIO.parse('fasta/{0}.fasta'.format(gene), 'fasta'):
        sp = get_species(record)
        if sp == 'Homo sapiens':
            protrecs[gene].append(record)
            print(i, gene, record.id, len(record))
            i += 1

1 ABCD1 NP_000024.2 745
2 ALPL NP_001356734.1 524
3 ALPL XP_016856392.1 472
4 ALPL NP_001170991.1 447
5 ALPL NP_001120973.2 469
6 AR NP_001334993.1 572
7 AR NP_001334992.1 648
8 AR NP_001334990.1 644
9 AR NP_001011645.1 388
10 AR NP_000035.2 920
11 ATP7B NP_001317508.1 1381
12 ATP7B NP_001317507.1 1387
13 ATP7B XP_016876116.1 1433
14 ATP7B XP_005266488.1 1453
15 ATP7B XP_005266487.1 1465
16 ATP7B NP_001230111.1 1354
17 ATP7B NP_001005918.1 1258
18 BTK NP_001274274.1 483
19 BTK NP_001274273.1 693
20 BTK NP_000052.1 659
21 CASR NP_000379.3 1078
22 CASR NP_001171536.2 1088
23 CASR XP_005247894.1 917
24 CBS XP_024307908.1 446
25 CBS XP_024307905.1 568
26 CBS XP_024307904.1 582
27 CBS XP_016883980.1 551
28 CBS XP_011528085.1 460
29 CBS XP_011528079.1 565
30 CFTR NP_000483.3 1480
31 CYBB NP_000388.2 570
32 F7 XP_006720026.2 412
33 F7 XP_011535777.2 433
34 F7 XP_011535776.2 495
35 F7 XP_011535778.1 364
36 F7 NP_001254483.1 382
37 F7 NP_062562.1 444
38 F7 NP_000122.1 466
39 F8 NP_063916.1 216


You can use this `dict` to look up the unique human proteins in the ortholog collections associated with each gene.

In [5]:
protrecs['TYR']

[SeqRecord(seq=Seq('MLLAVLYCLLWSFQTSAGHFPRACVSSKNLMEKECCPPWSGDRSPCGQLSGRGS...FPG'), id='XP_011541272.1', name='XP_011541272.1', description='XP_011541272.1 tyrosinase isoform X1 [Homo sapiens]', dbxrefs=[]),
 SeqRecord(seq=Seq('MLLAVLYCLLWSFQTSAGHFPRACVSSKNLMEKECCPPWSGDRSPCGQLSGRGS...SHL'), id='NP_000363.1', name='NP_000363.1', description='NP_000363.1 tyrosinase precursor [Homo sapiens]', dbxrefs=[])]

Let's look more carefully at the first of the TYR protein records. It is a Biopython `SeqRecord` object with, among others, the following attributes:

* name
* id
* seq
* description

In [6]:
prot = protrecs['TYR'][0]
prot

SeqRecord(seq=Seq('MLLAVLYCLLWSFQTSAGHFPRACVSSKNLMEKECCPPWSGDRSPCGQLSGRGS...FPG'), id='XP_011541272.1', name='XP_011541272.1', description='XP_011541272.1 tyrosinase isoform X1 [Homo sapiens]', dbxrefs=[])

In [7]:
prot.name

'XP_011541272.1'

In [8]:
prot.id

'XP_011541272.1'

In [9]:
prot.description

'XP_011541272.1 tyrosinase isoform X1 [Homo sapiens]'

In [10]:
prot.seq

Seq('MLLAVLYCLLWSFQTSAGHFPRACVSSKNLMEKECCPPWSGDRSPCGQLSGRGS...FPG')

## Match variant transcript IDs with ortholog proteins

For each variant transcript ID, query the NCBI database, extract the corresponding protein, and look for an identical sequence among the human records in the ortholog set. Flag transcripts that have an invalid name or no matching protein.

**Output:** `translation` dictionary with transcript IDs as keys and the ID of the matching protein as values.
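The decision logic of this step can be sketched independently of NCBI access. The function below is a hypothetical illustration (its name, arguments and return values are assumptions, and the sequences are toy strings rather than real proteins): it classifies one transcript as invalid, unmatched, or matched, and returns every protein ID with an identical sequence.

```python
def match_transcript(transcript_id, protein, candidates):
    """Classify one transcript against candidate human proteins.

    transcript_id : e.g. 'NM_000033.4'
    protein       : str, the protein encoded by the transcript
    candidates    : dict mapping protein ID -> sequence string
    Returns (status, list of matching protein IDs).
    """
    if not transcript_id.startswith('NM_'):
        return 'invalid transcript name', []
    hits = [pid for pid, seq in candidates.items() if seq == protein]
    if not hits:
        return 'no matching protein', []
    return 'match', hits

# Toy data for illustration only
candidates = {'NP_toy1': 'MKTAYIAK', 'NP_toy2': 'MKTAYIAR'}
matched = match_transcript('NM_toy.1', 'MKTAYIAR', candidates)
unmatched = match_transcript('NM_toy.2', 'MKTAYIAQ', candidates)
invalid = match_transcript('G6PD A-', '', candidates)
print(matched, unmatched, invalid)
```

Returning all hits makes it visible when several ortholog records share the same sequence. A dictionary keyed by transcript ID, by contrast, keeps only the last match written to it.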

In [11]:
%%time
translation = {}
i = 1
for gene in loci:
    for transcript in vartranscripts[gene]:
        if transcript[:3] == 'NM_':
            rec = get_transcript(transcript)
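            seq = get_CDS(rec)
            match = False
            for prot in protrecs[gene]:
                # seq[4] holds the protein sequence encoded by the transcript
                if seq[4] == prot.seq:
                    print(i, gene, transcript, prot.id)
                    match = True
                    # if several records match, the last one is kept
                    translation[transcript] = prot.id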
            if not match:
                print(i, gene, transcript, 'no matching protein         <<<<<')
        else:
            print(i, gene, transcript, 'invalid transcript name         *****')
        i += 1

1 ABCD1 NM_000033.4 NP_000024.2
2 ALPL NM_000478.6 NP_001356734.1
3 AR NM_000044.6 NP_000035.2
4 ATP7B NM_000053.3 XP_005266487.1
5 ATP7B NM_000053.4 XP_005266487.1
6 BTK NM_000061.2 NP_000052.1
7 BTK NM_001287345.1 NP_001274274.1
8 BTK NM_000061.3 NP_000052.1
9 CASR NM_000388.4 NP_000379.3
10 CBS NM_000071.2 XP_016883980.1
11 CBS NM_001178009.3 XP_016883980.1
12 CBS NM_000071.3 XP_016883980.1
13 CFTR NM_000492.3 NP_000483.3
14 CFTR NM_000492.4 NP_000483.3
15 CYBB NM_000397.3 NP_000388.2
16 CYBB NM_000397.4 NP_000388.2
17 F7 NM_019616.4 NP_062562.1
18 F8 NM_000132.3 NP_000123.1
19 F8 NM_000132.4 NP_000123.1
20 F8 NM_019863.2 NP_063916.1
21 F9 NM_000133.4 NP_000124.1
22 F9 NM_001313913.1 NP_001300842.1
23 F9 NM_000133.3 NP_000124.1
24 G6PD NM_001042351.1 NP_001346945.1
25 G6PD NM_000402.4 NP_000393.4
26 G6PD NM_001360016.2 NP_001346945.1
27 G6PD G6PD A- invalid transcript name         *****
28 GALT NM_000155.4 NP_000146.2
29 GALT NM_000155.2 NP_000146.2
30 GBA NM_000157.4 NP_001005742.1

## Invalid transcript names

The following transcripts have invalid names:

* [G6PD A-](https://www.ncbi.nlm.nih.gov/clinvar/variation/10361/): a single record that combines two variants (two allele IDs, two dbSNP IDs).  **Delete.**

* [Hb Little Rock](https://www.ncbi.nlm.nih.gov/clinvar/variation/15250/): should be `NM_000518.5(HBB):c.432C>R (p.His144Gln)`, where `R` is the IUPAC code for A or G.  **Fix.**

* [FH Aarhus](https://www.ncbi.nlm.nih.gov/clinvar/variation/3737/): a single record that combines two variants (two allele IDs).  **Delete.**

### G6PD A-

In [12]:
vardata = pd.read_csv('pathogenic/G6PD_clinvar.csv')
i = vardata['Name'].tolist().index('G6PD A-')
vardata[(vardata.index<i+3) & (vardata.index>i-3)]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
48,NM_001360016.2(G6PD):c.209A>G (p.Tyr70Cys),G6PD,"Y100C, Y70C",not provided,"Likely pathogenic(Last reviewed: Oct 13, 2017)","criteria provided, single submitter",VCV000449323,X,153764210,X,154535995,449323,446587,rs782090947,NC_000023.11:154535994:T:C,
49,NM_000402.4(G6PD):c.298T>C (p.Tyr100His),G6PD,"Y100H, Y70H","G6PD NAMORU|Anemia, nonspherocytic hemolytic, ...",Likely pathogenic,"criteria provided, single submitter",VCV000010416,X,153764211,X,154535996,10416,25455,rs137852349,NC_000023.11:154535995:A:G,
50,G6PD A-,G6PD,"V68M, V98M, N126D, N156D",Glucose 6 phosphate dehydrogenase deficiency,"Pathogenic(Last reviewed: Mar 31, 2000)",no assertion criteria provided,VCV000010361,X|X,153764217,X|X,154536002,10361,25400|25399,rs1050828|rs1050829,NC_000023.11:154536001:C:T|NC_000023.11:154535...,
51,NM_001360016.2(G6PD):c.152C>T (p.Thr51Ile),G6PD,"T51I, T81I","Anemia, nonspherocytic hemolytic, due to G6PD ...","Likely pathogenic(Last reviewed: Apr 28, 2021)","criteria provided, single submitter",VCV001098498,X,153764362,X,154536147,1098498,1087248,,NC_000023.11:154536146:G:A,
52,NM_001360016.2(G6PD):c.98T>C (p.Ile33Thr),G6PD|IKBKG,"I33T, I63T",not provided,"Pathogenic(Last reviewed: Sep 6, 2012)","criteria provided, single submitter",VCV000093504,X,153774273,X,154546058,93504,99409,rs398123552,NC_000023.11:154546057:A:G,


In [15]:
vardata.drop(i, inplace=True)
vardata.to_csv('pathogenic/G6PD_clinvar.csv', index=False)
vardata = pd.read_csv('pathogenic/G6PD_clinvar.csv')
vardata[vardata.index>i-3]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
48,NM_001360016.2(G6PD):c.209A>G (p.Tyr70Cys),G6PD,"Y100C, Y70C",not provided,"Likely pathogenic(Last reviewed: Oct 13, 2017)","criteria provided, single submitter",VCV000449323,X,153764210,X,154535995,449323,446587,rs782090947,NC_000023.11:154535994:T:C,
49,NM_000402.4(G6PD):c.298T>C (p.Tyr100His),G6PD,"Y100H, Y70H","G6PD NAMORU|Anemia, nonspherocytic hemolytic, ...",Likely pathogenic,"criteria provided, single submitter",VCV000010416,X,153764211,X,154535996,10416,25455,rs137852349,NC_000023.11:154535995:A:G,
50,NM_001360016.2(G6PD):c.152C>T (p.Thr51Ile),G6PD,"T51I, T81I","Anemia, nonspherocytic hemolytic, due to G6PD ...","Likely pathogenic(Last reviewed: Apr 28, 2021)","criteria provided, single submitter",VCV001098498,X,153764362,X,154536147,1098498,1087248,,NC_000023.11:154536146:G:A,
51,NM_001360016.2(G6PD):c.98T>C (p.Ile33Thr),G6PD|IKBKG,"I33T, I63T",not provided,"Pathogenic(Last reviewed: Sep 6, 2012)","criteria provided, single submitter",VCV000093504,X,153774273,X,154546058,93504,99409,rs398123552,NC_000023.11:154546057:A:G,


### FH Aarhus

In [16]:
vardata = pd.read_csv('pathogenic/LDLR_clinvar.csv')
i = vardata['Name'].tolist().index('FH Aarhus')
vardata[(vardata.index<i+3) & (vardata.index>i-3)]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
630,NM_000527.5(LDLR):c.1687C>T (p.Pro563Ser),LDLR,"P563S, P395S, P522S, P436S",Familial hypercholesterolemia 1,"Likely pathogenic(Last reviewed: Dec 16, 2016)","criteria provided, multiple submitters, no con...",VCV000251970,19,11226870,19,11116194,251970,246275,rs879254986,NC_000019.10:11116193:C:T,
631,NM_000527.5(LDLR):c.1688C>A (p.Pro563His),LDLR,"P563H, P395H, P436H, P522H",Familial hypercholesterolemia 1,"Likely pathogenic(Last reviewed: Mar 25, 2016)","criteria provided, single submitter",VCV000251971,19,11226871,19,11116195,251971,246276,rs879254987,NC_000019.10:11116194:C:A,
632,FH Aarhus,LDLR,"N564H, N437H, N523H, N396H",Homozygous familial hypercholesterolemia|Famil...,"Pathogenic(Last reviewed: Dec 7, 2018)","criteria provided, multiple submitters, no con...",VCV000003737,19|19,11226873,19|19,11116197,3737,18776|71434,,NC_000019.10:11116196:A:C|NC_000019.10:1112951...,
633,NM_000527.5(LDLR):c.1691A>G (p.Asn564Ser),LDLR,"N564S, N396S, N523S, N437S",Familial hypercholesterolemia|Familial hyperch...,Pathogenic/Likely pathogenic(Last reviewed: Ma...,"criteria provided, multiple submitters, no con...",VCV000224616,19,11226874,19,11116198,224616,226401,rs758194385,NC_000019.10:11116197:A:G,
634,NM_000527.5(LDLR):c.1694G>C (p.Gly565Ala),LDLR,"G565A, G438A, G524A, G397A",Familial hypercholesterolemia 1,Pathogenic/Likely pathogenic(Last reviewed: Ma...,"criteria provided, multiple submitters, no con...",VCV000226366,19,11226877,19,11116201,226366,228178,rs28942082,NC_000019.10:11116200:G:C,


In [17]:
vardata.drop(i, inplace=True)
vardata.to_csv('pathogenic/LDLR_clinvar.csv', index=False)
vardata = pd.read_csv('pathogenic/LDLR_clinvar.csv')
vardata[(vardata.index<i+3) & (vardata.index>i-3)]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
630,NM_000527.5(LDLR):c.1687C>T (p.Pro563Ser),LDLR,"P563S, P395S, P522S, P436S",Familial hypercholesterolemia 1,"Likely pathogenic(Last reviewed: Dec 16, 2016)","criteria provided, multiple submitters, no con...",VCV000251970,19,11226870,19,11116194,251970,246275,rs879254986,NC_000019.10:11116193:C:T,
631,NM_000527.5(LDLR):c.1688C>A (p.Pro563His),LDLR,"P563H, P395H, P436H, P522H",Familial hypercholesterolemia 1,"Likely pathogenic(Last reviewed: Mar 25, 2016)","criteria provided, single submitter",VCV000251971,19,11226871,19,11116195,251971,246276,rs879254987,NC_000019.10:11116194:C:A,
632,NM_000527.5(LDLR):c.1691A>G (p.Asn564Ser),LDLR,"N564S, N396S, N523S, N437S",Familial hypercholesterolemia|Familial hyperch...,Pathogenic/Likely pathogenic(Last reviewed: Ma...,"criteria provided, multiple submitters, no con...",VCV000224616,19,11226874,19,11116198,224616,226401,rs758194385,NC_000019.10:11116197:A:G,
633,NM_000527.5(LDLR):c.1694G>C (p.Gly565Ala),LDLR,"G565A, G438A, G524A, G397A",Familial hypercholesterolemia 1,Pathogenic/Likely pathogenic(Last reviewed: Ma...,"criteria provided, multiple submitters, no con...",VCV000226366,19,11226877,19,11116201,226366,228178,rs28942082,NC_000019.10:11116200:G:C,
634,NM_000527.5(LDLR):c.1694G>T (p.Gly565Val),LDLR,"G565V, G524V, G438V, G397V",Homozygous familial hypercholesterolemia|Famil...,Pathogenic/Likely pathogenic(Last reviewed: Ma...,"criteria provided, multiple submitters, no con...",VCV000003688,19,11226877,19,11116201,3688,18727,rs28942082,NC_000019.10:11116200:G:T,


### Hb Little Rock

In [18]:
vardata = pd.read_csv('pathogenic/HBB_clinvar.csv')
i = vardata['Name'].tolist().index('Hb Little Rock')
vardata[(vardata.index<i+3) & (vardata.index>i-3)]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
6,NM_000518.5(HBB):c.436T>C (p.Tyr146His),HBB|LOC107133510|LOC110006319,Y146H,none provided|HEMOGLOBIN BETHESDA|Erythrocytos...,"Pathogenic(Last reviewed: Jan 2, 2020)","criteria provided, single submitter",VCV000015112,11,5246836,11,5225606,15112,30151,rs33949869,NC_000011.10:5225605:A:G,
7,NM_000518.5(HBB):c.435G>C (p.Lys145Asn),HBB|LOC107133510|LOC110006319,K145N,HEMOGLOBIN ANDREW-MINNEAPOLIS|Erythrocytosis 6...,"Pathogenic(Last reviewed: Jan 27, 2019)","criteria provided, single submitter",VCV000015096,11,5246837,11,5225607,15096,30135,rs35020585,NC_000011.10:5225606:C:G,
8,Hb Little Rock,HBB|LOC107133510|LOC110006319,H144Q,"Erythrocytosis 6, familial","Pathogenic(Last reviewed: Jan 1, 1987)",no assertion criteria provided,VCV000015250,11,5246840,11,5225610,15250,30289,,NC_000011.10:5225609:G:Y,
9,NM_000518.5(HBB):c.431A>C (p.His144Pro),HBB|LOC107133510|LOC110006319,H144P,"Erythrocytosis 6, familial|HEMOGLOBIN SYRACUSE...","Pathogenic(Last reviewed: Jun 20, 2019)","criteria provided, single submitter",VCV000015365,11,5246841,11,5225611,15365,30404,rs33918338,NC_000011.10:5225610:T:G,
10,NM_000518.5(HBB):c.431A>G (p.His144Arg),HBB|LOC107133510|LOC110006319,H144R,not provided|none provided|HEMOGLOBIN ABRUZZO,Pathogenic/Likely pathogenic(Last reviewed: Ju...,"criteria provided, multiple submitters, no con...",VCV000015090,11,5246841,11,5225611,15090,30129,rs33918338,NC_000011.10:5225610:T:C,


In [19]:
vardata.loc[i, 'Name'] = 'NM_000518.5(HBB):c.432C>R (p.His144Gln)'
vardata.to_csv('pathogenic/HBB_clinvar.csv', index=False)
vardata[(vardata.index<i+3) & (vardata.index>i-3)]

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
6,NM_000518.5(HBB):c.436T>C (p.Tyr146His),HBB|LOC107133510|LOC110006319,Y146H,none provided|HEMOGLOBIN BETHESDA|Erythrocytos...,"Pathogenic(Last reviewed: Jan 2, 2020)","criteria provided, single submitter",VCV000015112,11,5246836,11,5225606,15112,30151,rs33949869,NC_000011.10:5225605:A:G,
7,NM_000518.5(HBB):c.435G>C (p.Lys145Asn),HBB|LOC107133510|LOC110006319,K145N,HEMOGLOBIN ANDREW-MINNEAPOLIS|Erythrocytosis 6...,"Pathogenic(Last reviewed: Jan 27, 2019)","criteria provided, single submitter",VCV000015096,11,5246837,11,5225607,15096,30135,rs35020585,NC_000011.10:5225606:C:G,
8,NM_000518.5(HBB):c.432C>R (p.His144Gln),HBB|LOC107133510|LOC110006319,H144Q,"Erythrocytosis 6, familial","Pathogenic(Last reviewed: Jan 1, 1987)",no assertion criteria provided,VCV000015250,11,5246840,11,5225610,15250,30289,,NC_000011.10:5225609:G:Y,
9,NM_000518.5(HBB):c.431A>C (p.His144Pro),HBB|LOC107133510|LOC110006319,H144P,"Erythrocytosis 6, familial|HEMOGLOBIN SYRACUSE...","Pathogenic(Last reviewed: Jun 20, 2019)","criteria provided, single submitter",VCV000015365,11,5246841,11,5225611,15365,30404,rs33918338,NC_000011.10:5225610:T:G,
10,NM_000518.5(HBB):c.431A>G (p.His144Arg),HBB|LOC107133510|LOC110006319,H144R,not provided|none provided|HEMOGLOBIN ABRUZZO,Pathogenic/Likely pathogenic(Last reviewed: Ju...,"criteria provided, multiple submitters, no con...",VCV000015090,11,5246841,11,5225611,15090,30129,rs33918338,NC_000011.10:5225610:T:C,
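The corrected name uses the ambiguity code `R` because the SPDI for this record ends in `G:Y`, and the coding-strand alleles are the complement of the SPDI alleles (compare `c.436T>C` with `A:G` in the table). The sketch below checks that either substitution gives the same protein change. It assumes the reference codon 144 is `CAC`, which is consistent with `c.431A`, `c.432C` and a histidine at that position. The helper name and the reduced lookup tables are illustrative only.

```python
# Subset of the IUPAC nucleotide ambiguity codes
IUPAC = {'R': 'AG', 'Y': 'CT'}
# Only the codons needed here (standard genetic code)
CODON_TABLE = {'CAC': 'His', 'CAT': 'His', 'CAA': 'Gln', 'CAG': 'Gln'}

def expand_substitution(codon, pos_in_codon, alt):
    """Map each alternate codon allowed by an ambiguity code to its amino acid."""
    out = {}
    for base in IUPAC.get(alt, alt):
        new = codon[:pos_in_codon] + base + codon[pos_in_codon + 1:]
        out[new] = CODON_TABLE[new]
    return out

# c.432 is the third base of codon 144 (432 = 3 * 144), i.e. index 2 in the codon
result = expand_substitution('CAC', 2, 'R')
print(result)
```

Both expansions code for glutamine, so `p.His144Gln` holds whichever base the record actually refers to.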


## Transcripts not matching any human protein

### VWF NM_000552.4

We compare the protein encoded by this transcript with that of the later version, NM_000552.5:

* They have the same length.

* They differ at only one site.  

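Thus, we can analyse variants referencing the old transcript as long as we avoid position 851 (a 0-based index, as printed by the loop below; residue 852 in 1-based protein numbering).  I add the old transcript to the `translation` dictionary in the [find_CPDs.ipynb](find_CPDs.ipynb) notebook.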

In [20]:
old_rec = get_transcript('NM_000552.4')
old_prot = get_CDS(old_rec)[4]
old_prot

Seq('MIPARFAGVLLALALILPGTLCAEGTRGRSSTARCSLFGSDFVNTFDGSMYSFA...CSK')

In [21]:
new_rec = get_transcript('NM_000552.5')
new_prot = get_CDS(new_rec)[4]
new_prot

Seq('MIPARFAGVLLALALILPGTLCAEGTRGRSSTARCSLFGSDFVNTFDGSMYSFA...CSK')

In [22]:
print(len(old_prot), len(new_prot))

2813 2813


In [23]:
for i in range(len(old_prot)):  # both sequences have the same length
    if old_prot[i] != new_prot[i]:
        print(i, old_prot[i], new_prot[i])  # i is a 0-based index

851 R Q
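The comparison above can be wrapped in a small reusable helper. The sketch below is an illustrative assumption rather than part of the `kondrashov` module: it refuses sequences of unequal length and reports each mismatch with both its 0-based index and its 1-based residue number. The strings are toy sequences, and the function should work equally on any two sequences that can be iterated character by character.

```python
def protein_differences(a, b):
    """List (index0, residue1, in_a, in_b) for every mismatching site."""
    if len(a) != len(b):
        raise ValueError('sequences differ in length: %d vs %d' % (len(a), len(b)))
    return [(i, i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Toy sequences that differ at the last site
old = 'MIPARFAGVR'
new = 'MIPARFAGVQ'
diffs = protein_differences(old, new)
print(diffs)
```

Reporting both numbering schemes avoids off-by-one mistakes when the mismatch is later compared with HGVS protein positions, which count from 1.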


### PAH NM_004316.4

The following cells show that this unmatched transcript belongs to the ASCL1 gene, not to PAH.  **I delete the variant associated with this transcript.**

In [24]:
# should have been PAH
transcript = 'NM_004316.4'
rec = get_transcript(transcript)
rec['GBSeq_definition']

'Homo sapiens achaete-scute family bHLH transcription factor 1 (ASCL1), mRNA'

In [25]:
# example of a PAH transcript (control)
transcript = 'NM_000277.3'
rec = get_transcript(transcript)
rec['GBSeq_definition']

'Homo sapiens phenylalanine hydroxylase (PAH), transcript variant 1, mRNA'

In [26]:
vardata = pd.read_csv('pathogenic/PAH_clinvar.csv')
vardata['Gene(s)'].value_counts()

PAH          357
PAH|ASCL1      1
Name: Gene(s), dtype: int64
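A complementary check reads the gene symbol out of the `Name` field itself and compares it with the gene the file is supposed to describe. This is a minimal sketch under the assumption that well-formed names look like `ACCESSION(GENE):change`. The function names and the regular expression are hypothetical, and plain lists stand in for the DataFrame column.

```python
import re

# Assumed shape: accession, gene symbol in parentheses, then a colon
NAME_GENE = re.compile(r'^[^()\s]+\(([^()]+)\):')

def gene_in_name(name):
    """Return the gene symbol embedded in a variant name, or None."""
    m = NAME_GENE.match(name)
    return m.group(1) if m else None

def mismatched(names, expected_gene):
    """Return (row position, name) for names whose gene symbol differs."""
    return [(k, n) for k, n in enumerate(names) if gene_in_name(n) != expected_gene]

# Names copied from the PAH table in this notebook
names = [
    'NM_000277.3(PAH):c.1A>G (p.Met1Val)',
    'NM_004316.4(ASCL1):c.52C>A (p.Pro18Thr)',
]
flagged = mismatched(names, 'PAH')
print(flagged)
```

Names without an embedded gene symbol yield `None` and are flagged as well, so the same check would also catch entries such as free-text variant names.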

In [27]:
i = vardata['Gene(s)'].tolist().index('PAH|ASCL1')
i

357

In [28]:
vardata.tail()

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
353,NM_000277.3(PAH):c.2T>C (p.Met1Thr),PAH,M1T,not provided|Phenylketonuria,"Likely pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000203873,12,103310907,12,102917129,203873,200225,rs62508575,NC_000012.12:102917128:A:G,
354,NM_000277.3(PAH):c.2T>G (p.Met1Arg),PAH,M1R,not provided|Phenylketonuria,"Pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000102647,12,103310907,12,102917129,102647,108383,rs62508575,NC_000012.12:102917128:A:C,
355,NM_000277.3(PAH):c.1A>T (p.Met1Leu),PAH,M1L,Phenylketonuria|not provided,"Pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000102626,12,103310908,12,102917130,102626,108362,rs62514891,NC_000012.12:102917129:T:A,
356,NM_000277.3(PAH):c.1A>G (p.Met1Val),PAH,M1V,"not provided|Hyperphenylalaninemia, non-pku|Ph...","Pathogenic(Last reviewed: Aug 7, 2018)",reviewed by expert panel,VCV000000586,12,103310908,12,102917130,586,15625,rs62514891,NC_000012.12:102917129:T:C,
357,NM_004316.4(ASCL1):c.52C>A (p.Pro18Thr),PAH|ASCL1,P18T,Congenital central hypoventilation,"Pathogenic(Last reviewed: Dec 1, 2003)",no assertion criteria provided,VCV000018332,12,103352074,12,102958296,18332,33371,rs267606667,NC_000012.12:102958295:C:A,


In [29]:
vardata.drop(i, inplace=True)
vardata.to_csv('pathogenic/PAH_clinvar.csv', index=False)
vardata = pd.read_csv('pathogenic/PAH_clinvar.csv')
vardata.tail()

Unnamed: 0,Name,Gene(s),Protein change,Condition(s),Clinical significance (Last reviewed),Review status,Accession,GRCh37Chromosome,GRCh37Location,GRCh38Chromosome,GRCh38Location,VariationID,AlleleID(s),dbSNP ID,Canonical SPDI,Unnamed: 15
352,NM_000277.3(PAH):c.3G>A (p.Met1Ile),PAH,M1I,Phenylketonuria|not provided,"Pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000000622,12,103310906,12,102917128,622,15661,rs62514893,NC_000012.12:102917127:C:T,
353,NM_000277.3(PAH):c.2T>C (p.Met1Thr),PAH,M1T,not provided|Phenylketonuria,"Likely pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000203873,12,103310907,12,102917129,203873,200225,rs62508575,NC_000012.12:102917128:A:G,
354,NM_000277.3(PAH):c.2T>G (p.Met1Arg),PAH,M1R,not provided|Phenylketonuria,"Pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000102647,12,103310907,12,102917129,102647,108383,rs62508575,NC_000012.12:102917128:A:C,
355,NM_000277.3(PAH):c.1A>T (p.Met1Leu),PAH,M1L,Phenylketonuria|not provided,"Pathogenic(Last reviewed: Jul 7, 2019)",reviewed by expert panel,VCV000102626,12,103310908,12,102917130,102626,108362,rs62514891,NC_000012.12:102917129:T:A,
356,NM_000277.3(PAH):c.1A>G (p.Met1Val),PAH,M1V,"not provided|Hyperphenylalaninemia, non-pku|Ph...","Pathogenic(Last reviewed: Aug 7, 2018)",reviewed by expert panel,VCV000000586,12,103310908,12,102917130,586,15625,rs62514891,NC_000012.12:102917129:T:C,
