# Find putative CPDs

## Rerun clean up commands

Since the ClinVar data was modified, we begin by rerunning the commands in the [clean_up.ipynb](clean_up.ipynb) notebook.

### Extract transcript IDs from variants

Pull the transcript record IDs from the ClinVar records of each gene.

**Output:** `vartranscripts` dictionary with genes as keys and lists of transcripts as values.  The new `dict` has 68 transcripts instead of 72.
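
The helper `get_transcripts_from_variants` lives in the `kondrashov` module and is not shown in this notebook. As a rough sketch of the idea, assuming each ClinVar variant name starts with a RefSeq transcript accession followed by the gene symbol in parentheses (the format of the names printed later in this notebook), the distinct transcript IDs could be collected as follows. The function name `transcripts_from_names` and the regular expression are hypothetical, not the module's actual code.

```python
import re

# Assumed name format: 'NM_000053.3(ATP7B):c.3809A>G'
NAME_RE = re.compile(r'^(?P<transcript>[A-Z]{2}_\d+\.\d+)\((?P<gene>[^)]+)\)')

def transcripts_from_names(names, gene):
    """Return the distinct transcript IDs for `gene`, in first-seen order."""
    seen = []
    for name in names:
        m = NAME_RE.match(name)
        if m and m.group('gene') == gene and m.group('transcript') not in seen:
            seen.append(m.group('transcript'))
    return seen

# Variant names copied from the error listing later in this notebook
names = [
    'NM_000053.3(ATP7B):c.3809A>G',
    'NM_000053.3(ATP7B):c.[3443T>C;3526G>A]',
    'NM_000061.3(BTK):c.763C>T',
]
print(transcripts_from_names(names, 'ATP7B'))
```

Keeping first-seen order rather than using a `set` makes the printed lists reproducible from run to run.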

In [1]:
run kondrashov

In [2]:
# Collect the transcript IDs cited by the ClinVar variants of each gene
i = 0  # running total of transcripts
vartranscripts = {}
for gene in loci:
    vt = get_transcripts_from_variants(gene, True)
    i += len(vt)
    print(gene, vt, i, 'transcripts')
    vartranscripts[gene] = vt

ABCD1 ['NM_000033.4'] 1 transcripts
ALPL ['NM_000478.6'] 2 transcripts
AR ['NM_000044.6'] 3 transcripts
ATP7B ['NM_000053.3', 'NM_000053.4'] 5 transcripts
BTK ['NM_001287345.1', 'NM_000061.3', 'NM_000061.2'] 8 transcripts
CASR ['NM_000388.4'] 9 transcripts
CBS ['NM_001178009.3', 'NM_000071.3', 'NM_000071.2'] 12 transcripts
CFTR ['NM_000492.3', 'NM_000492.4'] 14 transcripts
CYBB ['NM_000397.3', 'NM_000397.4'] 16 transcripts
F7 ['NM_019616.4'] 17 transcripts
F8 ['NM_019863.2', 'NM_000132.4', 'NM_000132.3'] 20 transcripts
F9 ['NM_000133.4', 'NM_001313913.1', 'NM_000133.3'] 23 transcripts
G6PD ['NM_001042351.1', 'NM_001360016.2', 'NM_000402.4'] 26 transcripts
GALT ['NM_000155.2', 'NM_000155.4'] 28 transcripts
GBA ['NM_000157.3', 'NM_001005741.2', 'NM_000157.4'] 31 transcripts
GJB1 ['NM_000166.6'] 32 transcripts
HBB ['NM_000518.4', 'NM_000518.5'] 34 transcripts
HPRT1 ['NM_000194.2', 'NM_000194.3'] 36 transcripts
IL2RG ['NM_000206.2', 'NM_000206.3'] 38 transcripts
KCNH2 ['NM_172056.2', 'NM_0

### Extract human protein sequence records from the fasta files

For each gene, parse the FASTA file of ortholog sequences and keep the records whose species is *Homo sapiens*.

**Output:** `protrecs` dictionary with genes as keys and lists of protein records as values. Unchanged.
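
The cell below relies on Biopython's `SeqIO` and on `get_species` from the `kondrashov` module. For readers without those at hand, here is a minimal standard-library sketch of the same filtering step. It assumes NCBI-style headers in which the species name appears in square brackets at the end of the description line; the helper names `read_fasta` and `species_of` and the toy records are hypothetical.

```python
import io
import re

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def species_of(header):
    """Return the text inside the last [...] of the header, or None."""
    found = re.findall(r'\[([^\[\]]+)\]', header)
    return found[-1] if found else None

# Toy records with made-up IDs and sequences
fasta = io.StringIO(
    '>NP_TOY1.1 toy protein [Homo sapiens]\nMKV\nLA\n'
    '>XP_TOY2.1 toy protein [Pan troglodytes]\nMKVLT\n'
)
human = [(h.split()[0], seq) for h, seq in read_fasta(fasta)
         if species_of(h) == 'Homo sapiens']
print(human)
```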

In [3]:
# Keep only the human records from each gene's ortholog FASTA file
i = 1  # running record counter
protrecs = {}
for gene in loci:
    protrecs[gene] = []
    for record in SeqIO.parse('fasta/{0}.fasta'.format(gene), 'fasta'):
        if get_species(record) == 'Homo sapiens':
            protrecs[gene].append(record)
            print(i, gene, record.id, len(record))
            i += 1

1 ABCD1 NP_000024.2 745
2 ALPL NP_001356734.1 524
3 ALPL XP_016856392.1 472
4 ALPL NP_001170991.1 447
5 ALPL NP_001120973.2 469
6 AR NP_001334993.1 572
7 AR NP_001334992.1 648
8 AR NP_001334990.1 644
9 AR NP_001011645.1 388
10 AR NP_000035.2 920
11 ATP7B NP_001317508.1 1381
12 ATP7B NP_001317507.1 1387
13 ATP7B XP_016876116.1 1433
14 ATP7B XP_005266488.1 1453
15 ATP7B XP_005266487.1 1465
16 ATP7B NP_001230111.1 1354
17 ATP7B NP_001005918.1 1258
18 BTK NP_001274274.1 483
19 BTK NP_001274273.1 693
20 BTK NP_000052.1 659
21 CASR NP_000379.3 1078
22 CASR NP_001171536.2 1088
23 CASR XP_005247894.1 917
24 CBS XP_024307908.1 446
25 CBS XP_024307905.1 568
26 CBS XP_024307904.1 582
27 CBS XP_016883980.1 551
28 CBS XP_011528085.1 460
29 CBS XP_011528079.1 565
30 CFTR NP_000483.3 1480
31 CYBB NP_000388.2 570
32 F7 XP_006720026.2 412
33 F7 XP_011535777.2 433
34 F7 XP_011535776.2 495
35 F7 XP_011535778.1 364
36 F7 NP_001254483.1 382
37 F7 NP_062562.1 444
38 F7 NP_000122.1 466
39 F8 NP_063916.1 216


### Match variant transcript IDs with ortholog proteins

For each variant transcript ID, query the NCBI database, extract the protein encoded by the transcript, and look for an identical sequence among the human records of the ortholog set.  Flag transcripts with an invalid name or no matching protein.

**Output:** `translation` dictionary with transcript IDs as keys and the matching protein ID as values.
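
The matching step can be illustrated without the NCBI query. The sketch below translates a coding sequence with the standard genetic code and reports which protein records have exactly that sequence. It is only an approximation of what `get_CDS` and the comparison in the next cell do; `translate`, `match_protein`, and the toy protein IDs are hypothetical names introduced for illustration.

```python
# Standard genetic code, codons enumerated in TCAG order
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[pos:pos + 3].upper()]
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)

def match_protein(cds, proteins):
    """Return the IDs of proteins whose sequence equals the CDS translation."""
    target = translate(cds)
    return [pid for pid, seq in proteins.items() if seq == target]

# Toy protein records; the IDs are made up
proteins = {'NP_TOY1.1': 'MVHLTPEEKS', 'NP_TOY2.1': 'MVHLTPEEK'}
print(match_protein('ATGGTGCATCTGACTCCTGAGGAGAAGTCTTAA', proteins))
```

Requiring an exact match is strict: a single residue difference between the transcript's translation and every ortholog-set record yields no match, which is what the `<<<<<` flag in the cell below reports.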

In [4]:
%%time
translation = {}
i = 1  # running transcript counter
for gene in loci:
    for transcript in vartranscripts[gene]:
        if transcript.startswith('NM_'):
            # Fetch the transcript record and extract its CDS
            rec = get_transcript(transcript)
            seq = get_CDS(rec)
            match = False
            for prot in protrecs[gene]:
                # Element 4 of the get_CDS result is compared with the
                # protein sequence; only an exact match counts
                if seq[4] == prot.seq:
                    print(i, gene, transcript, prot.id)
                    match = True
                    translation[transcript] = prot.id
            if not match:
                print(i, gene, transcript, 'no matching protein         <<<<<')
        else:
            print(i, gene, transcript, 'invalid transcript name         *****')
        i += 1

1 ABCD1 NM_000033.4 NP_000024.2
2 ALPL NM_000478.6 NP_001356734.1
3 AR NM_000044.6 NP_000035.2
4 ATP7B NM_000053.3 XP_005266487.1
5 ATP7B NM_000053.4 XP_005266487.1
6 BTK NM_001287345.1 NP_001274274.1
7 BTK NM_000061.3 NP_000052.1
8 BTK NM_000061.2 NP_000052.1
9 CASR NM_000388.4 NP_000379.3
10 CBS NM_001178009.3 XP_016883980.1
11 CBS NM_000071.3 XP_016883980.1
12 CBS NM_000071.2 XP_016883980.1
13 CFTR NM_000492.3 NP_000483.3
14 CFTR NM_000492.4 NP_000483.3
15 CYBB NM_000397.3 NP_000388.2
16 CYBB NM_000397.4 NP_000388.2
17 F7 NM_019616.4 NP_062562.1
18 F8 NM_019863.2 NP_063916.1
19 F8 NM_000132.4 NP_000123.1
20 F8 NM_000132.3 NP_000123.1
21 F9 NM_000133.4 NP_000124.1
22 F9 NM_001313913.1 NP_001300842.1
23 F9 NM_000133.3 NP_000124.1
24 G6PD NM_001042351.1 NP_001346945.1
25 G6PD NM_001360016.2 NP_001346945.1
26 G6PD NM_000402.4 NP_000393.4
27 GALT NM_000155.2 NP_000146.2
28 GALT NM_000155.4 NP_000146.2
29 GBA NM_000157.3 NP_001005742.1
30 GBA NM_001005741.2 NP_001005742.1
31 GBA NM_000157

### Add VWF NM_000552.4

As explained in the [clean_up.ipynb](clean_up.ipynb) notebook, we add the VWF transcript `NM_000552.4` to the `translation` dictionary by hand.

In [5]:
translation.update({'NM_000552.4': 'NP_000543.3'})

## Extract individual variants

Process each ClinVar dataset and extract variants for each transcript.

**Output:** `variants` dictionary with transcript IDs as keys and tuples of (sites, wild-type residues, mutant residues, errors) as values.

In [6]:
print('gene transcript protein changes errors')
for gene in loci:
    transcripts = vartranscripts[gene]
    for transcript in transcripts:
        sites, wilds, muts, errs = get_all_changes(gene, transcript, True)
        print(gene, transcript, translation[transcript], len(sites), len(errs))

gene transcript protein changes errors
ABCD1 NM_000033.4 NP_000024.2 134 0
ALPL NM_000478.6 NP_001356734.1 76 0
AR NM_000044.6 NP_000035.2 132 0
ATP7B NM_000053.3 XP_005266487.1 0 2
ATP7B NM_000053.4 XP_005266487.1 183 0
BTK NM_001287345.1 NP_001274274.1 0 8
BTK NM_000061.3 NP_000052.1 25 1
BTK NM_000061.2 NP_000052.1 68 0
CASR NM_000388.4 NP_000379.3 108 1
CBS NM_001178009.3 XP_016883980.1 1 0
CBS NM_000071.3 XP_016883980.1 21 0
CBS NM_000071.2 XP_016883980.1 46 0
CFTR NM_000492.3 NP_000483.3 225 7
CFTR NM_000492.4 NP_000483.3 153 6
CYBB NM_000397.3 NP_000388.2 38 0
CYBB NM_000397.4 NP_000388.2 17 0
F7 NM_019616.4 NP_062562.1 26 0
F8 NM_019863.2 NP_063916.1 1 0
F8 NM_000132.4 NP_000123.1 33 0
F8 NM_000132.3 NP_000123.1 235 0
F9 NM_000133.4 NP_000124.1 51 0
F9 NM_001313913.1 NP_001300842.1 1 5
F9 NM_000133.3 NP_000124.1 81 0
G6PD NM_001042351.1 NP_001346945.1 0 1
G6PD NM_001360016.2 NP_001346945.1 25 1
G6PD NM_000402.4 NP_000393.4 25 0
GALT NM_000155.2 NP_000146.2 0 1
GALT NM_000155.4 

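Let's look more closely at the variants that produced errors.  **They appear to be variants whose names do not specify an amino acid change**, for example intronic or splice-site changes such as `c.1609-2A>G` and compound alleles such as `c.[3443T>C;3526G>A]`.  We can ignore them.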
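
To see why such names end up in the error list, here is a small sketch of how an amino acid change could be parsed from a variant name. It assumes the protein change, when present, is written in three-letter HGVS style inside parentheses, e.g. `(p.Glu7Lys)`. The function `parse_change` is hypothetical and is not the code behind `get_all_changes`. The first example name is illustrative: it follows the assumed format and corresponds to the site 7 E-to-K change listed for `NM_000518.4` further down.

```python
import re

THREE_TO_ONE = {
    'Ala': 'A', 'Arg': 'R', 'Asn': 'N', 'Asp': 'D', 'Cys': 'C',
    'Gln': 'Q', 'Glu': 'E', 'Gly': 'G', 'His': 'H', 'Ile': 'I',
    'Leu': 'L', 'Lys': 'K', 'Met': 'M', 'Phe': 'F', 'Pro': 'P',
    'Ser': 'S', 'Thr': 'T', 'Trp': 'W', 'Tyr': 'Y', 'Val': 'V',
}
# Assumed protein-change format: '(p.Glu7Lys)'
CHANGE_RE = re.compile(r'\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)')

def parse_change(name):
    """Return (site, wild, mut) for a missense name, or None if it has none."""
    m = CHANGE_RE.search(name)
    if not m:
        return None
    wild, site, mut = m.groups()
    if wild not in THREE_TO_ONE or mut not in THREE_TO_ONE:
        return None  # e.g. 'Ter' (stop codon) is not an amino acid substitution
    return int(site), THREE_TO_ONE[wild], THREE_TO_ONE[mut]

print(parse_change('NM_000518.4(HBB):c.19G>A (p.Glu7Lys)'))  # illustrative name
print(parse_change('NM_000388.4(CASR):c.1609-2A>G'))         # from the error list
```

A name without a `p.` part returns `None`, which is the kind of record the `errs` list collects.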

In [7]:
print('gene transcript protein changes errors')
for gene in loci:
    transcripts = vartranscripts[gene]
    for transcript in transcripts:
        sites, wilds, muts, errs = get_all_changes(gene, transcript, True)
        if len(errs)>0:
            print(gene, transcript, translation[transcript], len(sites), len(errs))
            print('   ', errs)

gene transcript protein changes errors
ATP7B NM_000053.3 XP_005266487.1 0 2
    [('NM_000053.3(ATP7B):c.3809A>G', 'VCV000003859'), ('NM_000053.3(ATP7B):c.[3443T>C;3526G>A]', 'VCV000003863')]
BTK NM_001287345.1 NP_001274274.1 0 8
    [('NM_001287345.1(BTK):c.1039-1398T>C', 'VCV000571649'), ('NM_001287345.1(BTK):c.1039-1411G>T', 'VCV000656776'), ('NM_001287345.1(BTK):c.1039-1418C>A', 'VCV000011375'), ('NM_001287345.1(BTK):c.1038+1529A>G', 'VCV000011343'), ('NM_001287345.1(BTK):c.1038+1516C>A', 'VCV000011374'), ('NM_001287345.1(BTK):c.1038+1464T>C', 'VCV000011373'), ('NM_001287345.1(BTK):c.1038+813T>G', 'VCV000011371'), ('NM_001287345.1(BTK):c.1038+44A>G', 'VCV000011346')]
BTK NM_000061.3 NP_000052.1 25 1
    [('NM_000061.3(BTK):c.763C>T', 'VCV000011363')]
CASR NM_000388.4 NP_000379.3 108 1
    [('NM_000388.4(CASR):c.1609-2A>G', 'VCV000854168')]
CFTR NM_000492.3 NP_000483.3 225 7
    [('NM_000492.3(CFTR):c.[350G>A;1210-12[5]]', 'VCV000209047'), ('NM_000492.3(CFTR):c.[1075C>A;1079C>A]', 'V

We collect all valid variants in a `variants` dictionary.

In [8]:
variants = {}
for gene in loci:
    print(gene, end=' ... ')
    transcripts = vartranscripts[gene]
    for transcript in transcripts:
        sites, wilds, muts, errs = get_all_changes(gene, transcript, True)
        if len(sites)>0:
            variants.update({transcript: (sites, wilds, muts, errs)})

ABCD1 ... ALPL ... AR ... ATP7B ... BTK ... CASR ... CBS ... CFTR ... CYBB ... F7 ... F8 ... F9 ... G6PD ... GALT ... GBA ... GJB1 ... HBB ... HPRT1 ... IL2RG ... KCNH2 ... KCNQ1 ... L1CAM ... LDLR ... MPZ ... MYH7 ... TYR ... PAH ... PMM2 ... RHO ... TP53 ... TTR ... VWF ... 

In [9]:
sites, wilds, muts, errs = variants['NM_000518.4']
print(sites)
print(wilds)
print(muts)

[147, 147, 147, 143, 128, 125, 122, 111, 107, 107, 102, 101, 100, 100, 100, 100, 100, 100, 98, 95, 93, 93, 92, 90, 89, 83, 83, 83, 69, 68, 61, 46, 43, 37, 33, 31, 29, 27, 24, 9, 7]
['H', 'H', 'H', 'A', 'Q', 'P', 'E', 'L', 'L', 'L', 'E', 'P', 'D', 'D', 'D', 'D', 'D', 'D', 'H', 'D', 'H', 'H', 'L', 'S', 'L', 'K', 'K', 'K', 'L', 'V', 'V', 'F', 'F', 'P', 'L', 'R', 'L', 'E', 'V', 'K', 'E']
['P', 'L', 'D', 'D', 'R', 'R', 'K', 'P', 'R', 'Q', 'K', 'L', 'A', 'G', 'V', 'Y', 'H', 'N', 'L', 'H', 'N', 'Y', 'P', 'N', 'P', 'T', 'M', 'E', 'H', 'E', 'E', 'S', 'V', 'H', 'Q', 'T', 'Q', 'G', 'F', 'E', 'K']


## Extract variable sites

Process each alignment and extract variable sites for each protein.

**Output:** `variable` dictionary with transcript IDs as keys and, as values, dictionaries mapping each variable site to the list of alternative residues observed there.

In [10]:
variable = {}
for gene in loci:
    print(gene, end=' ... ')
    transcripts = vartranscripts[gene]
    for transcript in transcripts:
        protein = translation[transcript]
        variable.update({transcript: get_variable_sites(gene, protein, False)})

ABCD1 ... ALPL ... AR ... ATP7B ... BTK ... CASR ... CBS ... CFTR ... CYBB ... F7 ... F8 ... F9 ... G6PD ... GALT ... GBA ... GJB1 ... HBB ... HPRT1 ... IL2RG ... KCNH2 ... KCNQ1 ... L1CAM ... LDLR ... MPZ ... MYH7 ... TYR ... PAH ... PMM2 ... RHO ... TP53 ... TTR ... VWF ... 
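
`get_variable_sites` is defined in the `kondrashov` module. A minimal sketch of the idea, under the assumptions that the alignment is given as equal-length gapped strings and that sites are numbered by 1-based position in the ungapped human protein, might look like this. The function `variable_sites` and the toy alignment are hypothetical.

```python
def variable_sites(human_aln, other_alns):
    """Map human positions (1-based, ungapped) to residues seen in other sequences.

    Only residues that differ from the human residue are recorded; gaps are ignored.
    """
    variable = {}
    human_pos = 0
    for col, human_res in enumerate(human_aln):
        if human_res == '-':
            continue  # column absent from the human protein
        human_pos += 1
        others = []
        for seq in other_alns:
            res = seq[col]
            if res != '-' and res != human_res and res not in others:
                others.append(res)
        if others:
            variable[human_pos] = others
    return variable

# Toy alignment with made-up sequences
human = 'MK-RTA'
orthologs = ['MKGRSA', 'MR-RTA', 'MK-QT-']
print(variable_sites(human, orthologs))
```

Columns where the human sequence has a gap are skipped, so an insertion in another species never shifts the human numbering.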

In [11]:
!open .

## Identify putative CPDs

A ClinVar variant is reported as a putative CPD when its mutant residue also occurs at the same site in at least one other sequence of the alignment.  The `var` column lists the alternative residues seen at that site.

In [12]:
print('gene\ttranscript\tsite\twild\tmut\tvar')
for gene in loci:
    for transcript in vartranscripts[gene]:
        # Skip transcripts that have no valid amino acid changes
        if transcript not in variants:
            continue
        sites, wilds, muts, errs = variants[transcript]
        variable_sites = variable[transcript]
        for site, wild, mut in zip(sites, wilds, muts):
            # Putative CPD: the mutant residue is among the residues
            # observed at this site in the alignment
            if site in variable_sites and mut in variable_sites[site]:
                print(gene, transcript, site, wild, mut, variable_sites[site], sep='\t')

gene	transcript	site	wild	mut	var
ALPL	NM_000478.6	491	G	R	['R']
AR	NM_000044.6	611	N	K	['E', 'K']
AR	NM_000044.6	788	M	V	['V', 'C', 'Y', 'I']
ATP7B	NM_000053.4	1178	T	A	['A']
ATP7B	NM_000053.4	969	R	Q	['Q']
CFTR	NM_000492.4	1	M	R	['L', 'P', 'Q', 'R']
CFTR	NM_000492.4	13	S	F	['G', 'F']
F8	NM_000132.3	2319	P	L	['L']
F8	NM_000132.3	2185	L	S	['S']
F8	NM_000132.3	2183	M	V	['G', 'V']
F8	NM_000132.3	2038	N	S	['H', 'K', 'S']
F8	NM_000132.3	1979	G	V	['V']
F8	NM_000132.3	585	I	T	['M', 'T']
F8	NM_000132.3	584	Q	K	['K', 'S', 'P']
F8	NM_000132.3	494	I	T	['T']
F9	NM_000133.3	75	R	Q	['Q']
GALT	NM_000155.4	23	T	A	['A', 'W']
GALT	NM_000155.4	97	N	D	['Q', 'D', 'E']
GALT	NM_000155.4	114	H	L	['L']
GALT	NM_000155.4	129	M	L	['L']
GALT	NM_000155.4	186	H	Y	['Y']
GALT	NM_000155.4	198	I	T	['D', 'T', 'V']
GALT	NM_000155.4	204	R	P	['P']
GALT	NM_000155.4	212	Q	P	['P']
GALT	NM_000155.4	226	L	P	['P']
GALT	NM_000155.4	319	H	Q	['Q']
GALT	NM_000155.4	329	S	P	['P']
GALT	NM_000155.4	333	R	L	['L', 'S']
GALT	NM_000155.4	3

## Look at particular sites

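The following command lets you double-check particular sites.  It prints the human and alignment coordinates of the site, the residue counts in that alignment column, and the residue carried by each sequence, with its protein ID and species.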

In [13]:
get_alleles('ATP7B', translation['NM_000053.4'], 969, True)

     Gene: ATP7B
    Human: 969 R
Alignment: 1102 R
R    105
-      6
Q      1
dtype: int64

R XP_012668202.1 Otolemur garnettii
R XP_023363838.1 Otolemur garnettii
R XP_023363839.1 Otolemur garnettii
R XP_012638114.1 Microcebus murinus
R XP_012638117.1 Microcebus murinus
R XP_020145970.1 Microcebus murinus
R XP_012638108.1 Microcebus murinus
R XP_012504533.1 Propithecus coquereli
R XP_012504538.1 Propithecus coquereli
R XP_012504534.1 Propithecus coquereli
R XP_012504529.1 Propithecus coquereli
R XP_008070488.1 Carlito syrichta
R XP_008070485.1 Carlito syrichta
R XP_008070483.1 Carlito syrichta
- XP_008070484.1 Carlito syrichta
R XP_008070486.1 Carlito syrichta
R XP_021561720.1 Carlito syrichta
R XP_008070482.1 Carlito syrichta
R XP_011920046.1 Cercocebus atys
R XP_016780802.1 Pan troglodytes
R XP_005266488.1 Homo sapiens
R XP_016876116.1 Homo sapiens
R NP_001005918.1 Homo sapiens
R NP_001230111.1 Homo sapiens
R NP_001317507.1 Homo sapiens
R NP_001317508.1 Homo sapiens
R XP_005266487.

(True,
 969,
 'R',
 R    105
 -      6
 Q      1
 dtype: int64)
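
The output above reports two coordinates for the same site, the human position (969) and the alignment column (1102), followed by residue counts for that column. The sketch below shows one way to do the coordinate conversion and the counting with the standard library. It assumes the alignment is a dictionary of equal-length gapped strings; `alignment_column`, `allele_counts`, and the toy data are hypothetical and are not the code behind `get_alleles`.

```python
from collections import Counter

def alignment_column(human_aln, site):
    """Return the 1-based alignment column of the 1-based human `site`."""
    seen = 0
    for col, res in enumerate(human_aln, start=1):
        if res != '-':
            seen += 1
            if seen == site:
                return col
    raise ValueError('site {0} is beyond the human sequence'.format(site))

def allele_counts(alignment, human_id, site):
    """Count the residues (and gaps) in the column holding human `site`."""
    col = alignment_column(alignment[human_id], site)
    counts = Counter(seq[col - 1] for seq in alignment.values())
    return col, counts

# Toy alignment with made-up sequences
alignment = {
    'human': 'M-KRT',
    'sp1':   'MAKRT',
    'sp2':   'M-KQT',
    'sp3':   'M-K-T',
}
col, counts = allele_counts(alignment, 'human', 3)
print(col, counts.most_common())
```

Gaps are counted alongside residues, as in the output above, so a column dominated by `-` is easy to spot.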

In [14]:
get_alleles('F8', translation['NM_000132.3'], 2038, True)

     Gene: F8
    Human: 2038 N
Alignment: 2964 N
N    46
-    14
H     4
S     3
K     3
dtype: int64

N XP_021524500.1 Aotus nancymaae
K XP_003802782.1 Otolemur garnettii
K XP_012517042.1 Propithecus coquereli
K XP_020140823.1 Microcebus murinus
N XP_017376032.2 Cebus imitator
N XP_032121193.1 Sapajus apella
N XP_032121192.1 Sapajus apella
N XP_032121191.1 Sapajus apella
N XP_032121190.1 Sapajus apella
N XP_003943974.1 Saimiri boliviensis boliviensis
S XP_035144731.1 Callithrix jacchus
S XP_035144732.1 Callithrix jacchus
S XP_035144729.1 Callithrix jacchus
- XP_008960582.2 Pan paniscus
- XP_030861623.1 Gorilla gorilla gorilla
- XP_030662513.1 Nomascus leucogenys
N XP_032612179.1 Hylobates moloch
N XP_003279379.1 Nomascus leucogenys
- NP_063916.1 Homo sapiens
N NP_000123.1 Homo sapiens
N XP_034806192.1 Pan paniscus
N XP_003810005.2 Pan paniscus
N XP_034806193.1 Pan paniscus
N XP_016800115.1 Pan troglodytes
N XP_009438145.1 Pan troglodytes
N XP_016800113.1 Pan troglodytes
N XP_01680011

(True,
 2038,
 'N',
 N    46
 -    14
 H     4
 S     3
 K     3
 dtype: int64)

In [18]:
translation['NM_000157.4']

'NP_001005742.1'

In [15]:
get_alleles('TTR', translation['NM_000371.3'], 142, True)

     Gene: TTR
    Human: 142 V
Alignment: 144 V
V    26
-     2
L     1
I     1
dtype: int64

L XP_011788672.1 Colobus angolensis palliatus
- XP_010333769.1 Saimiri boliviensis boliviensis
V XP_010333768.1 Saimiri boliviensis boliviensis
V XP_032149005.1 Sapajus apella
V XP_012302669.1 Aotus nancymaae
V NP_001254679.1 Callithrix jacchus
V XP_003784812.1 Otolemur garnettii
V XP_012638014.1 Microcebus murinus
V XP_012512049.1 Propithecus coquereli
V NP_001127064.1 Pongo abelii
V NP_000362.1 Homo sapiens
V XP_004059337.1 Gorilla gorilla gorilla
V XP_032008476.1 Hylobates moloch
V XP_003830303.2 Pan paniscus
V NP_001009137.1 Pan troglodytes
V XP_011930274.1 Cercocebus atys
I XP_007972558.2 Chlorocebus sabaeus
V XP_011829762.1 Mandrillus leucophaeus
V XP_025220794.1 Theropithecus gelada
V XP_025220793.1 Theropithecus gelada
V XP_023063511.2 Piliocolobus tephrosceles
V XP_017713537.1 Rhinopithecus bieti
V XP_010379114.1 Rhinopithecus roxellana
V XP_011788674.1 Colobus angolensis palliatus
V

(True,
 142,
 'V',
 V    26
 -     2
 L     1
 I     1
 dtype: int64)

In [16]:
get_alleles('GBA', translation['NM_000157.4'], 535, True)

     Gene: GBA
    Human: 535 R
Alignment: 547 R
R    36
L     3
H     1
-     1
dtype: int64

- XP_037855735.1 Chlorocebus sabaeus
R XP_012514200.1 Propithecus coquereli
R XP_012623306.1 Microcebus murinus
R XP_012623305.1 Microcebus murinus
R XP_023372167.1 Otolemur garnettii
R XP_012664043.1 Otolemur garnettii
R XP_008048281.1 Carlito syrichta
R XP_021564251.1 Carlito syrichta
R XP_008048283.1 Carlito syrichta
R XP_011932225.1 Cercocebus atys
R XP_021528768.1 Aotus nancymaae
R XP_039315825.1 Saimiri boliviensis boliviensis
R XP_010346697.1 Saimiri boliviensis boliviensis
R XP_032129098.1 Sapajus apella
R XP_017368970.1 Cebus imitator
H XP_032010219.1 Hylobates moloch
R XP_030866469.1 Gorilla gorilla gorilla
R XP_030866465.1 Gorilla gorilla gorilla
R NP_001165282.1 Homo sapiens
R NP_001165283.1 Homo sapiens
R NP_001005742.1 Homo sapiens
R XP_030680202.1 Nomascus leucogenys
R XP_024098242.1 Pongo abelii
R NP_001127488.1 Pongo abelii
L XP_005541629.1 Macaca fascicularis
L XP_005541627.

(True,
 535,
 'R',
 R    36
 L     3
 H     1
 -     1
 dtype: int64)

In [17]:
get_alleles('GALT', translation['NM_000155.4'], 23, True)

     Gene: GALT
    Human: 23 T
Alignment: 43 T
T    37
-    24
W     1
A     1
dtype: int64

T XP_035106695.1 Callithrix jacchus
T XP_008068994.1 Carlito syrichta
- XP_034823261.1 Pan paniscus
T XP_034823259.1 Pan paniscus
T XP_034823258.1 Pan paniscus
T XP_034823256.1 Pan paniscus
- XP_023059055.1 Piliocolobus tephrosceles
- XP_037854236.1 Chlorocebus sabaeus
- XP_033094123.1 Trachypithecus francoisi
- XP_017747549.1 Rhinopithecus bieti
- XP_028690929.1 Macaca mulatta
- XP_011805906.1 Colobus angolensis palliatus
- NP_001245261.1 Homo sapiens
- XP_034823265.1 Pan paniscus
T XP_011924614.1 Cercocebus atys
W XP_011924616.1 Cercocebus atys
T XP_011924615.1 Cercocebus atys
- XP_011924617.1 Cercocebus atys
T XP_021783112.2 Papio anubis
T XP_025214557.1 Theropithecus gelada
T XP_021783108.2 Papio anubis
- XP_021783109.2 Papio anubis
- XP_021783111.1 Papio anubis
T XP_009186860.2 Papio anubis
T XP_033094120.1 Trachypithecus francoisi
- XP_033094122.1 Trachypithecus francoisi
- XP_023059054.

(True,
 23,
 'T',
 T    37
 -    24
 W     1
 A     1
 dtype: int64)

In [18]:
get_alleles('F9', translation['NM_000133.3'], 75, True)

     Gene: F9
    Human: 75 R
Alignment: 75 R
R    39
Q     1
dtype: int64

R XP_002763378.1 Callithrix jacchus
R XP_012325320.1 Aotus nancymaae
R XP_017355402.1 Cebus imitator
R XP_032124214.1 Sapajus apella
R XP_003940777.1 Saimiri boliviensis boliviensis
R XP_011930388.1 Cercocebus atys
R XP_037844800.1 Chlorocebus sabaeus
R XP_007991031.1 Chlorocebus sabaeus
R NP_001103153.1 Macaca mulatta
R XP_028697500.1 Macaca mulatta
R XP_011739181.1 Macaca nemestrina
R XP_028697499.1 Macaca mulatta
R XP_025228623.1 Theropithecus gelada
R XP_025228622.1 Theropithecus gelada
R XP_017809908.1 Papio anubis
R XP_003918402.1 Papio anubis
R XP_023058339.1 Piliocolobus tephrosceles
R XP_023058338.1 Piliocolobus tephrosceles
R XP_033057437.1 Trachypithecus francoisi
R XP_010387043.1 Rhinopithecus roxellana
R XP_011816431.1 Colobus angolensis palliatus
R XP_011816430.1 Colobus angolensis palliatus
R XP_009233598.2 Pongo abelii
R XP_002832230.2 Pongo abelii
R XP_032612351.1 Hylobates moloch
R XP_03261235

(True,
 75,
 'R',
 R    39
 Q     1
 dtype: int64)

In [14]:
translation['NM_000527.4']

'NP_000518.1'

In [13]:
get_alleles('LDLR', translation['NM_000527.4'], 215, True)

     Gene: LDLR
    Human: 215 R
Alignment: 351 R
R    35
-     9
H     9
dtype: int64

R XP_020139390.1 Microcebus murinus
R XP_012496802.1 Propithecus coquereli
H XP_021573646.1 Carlito syrichta
H XP_023373820.1 Otolemur garnettii
H XP_012665677.1 Otolemur garnettii
H XP_021532454.1 Aotus nancymaae
H XP_035141328.1 Callithrix jacchus
H XP_002761791.1 Callithrix jacchus
H XP_039321728.1 Saimiri boliviensis boliviensis
H XP_037583448.1 Cebus imitator
H XP_032107631.1 Sapajus apella
R XP_011810343.1 Colobus angolensis palliatus
R XP_032025606.1 Hylobates moloch
R XP_030676686.1 Nomascus leucogenys
R XP_024092494.1 Pongo abelii
R XP_024092493.1 Pongo abelii
- XP_024206936.1 Pan troglodytes
- NP_001182729.1 Homo sapiens
R NP_001182728.1 Homo sapiens
- NP_001182732.1 Homo sapiens
R XP_011526312.1 Homo sapiens
R NP_001182727.1 Homo sapiens
R NP_000518.1 Homo sapiens
R XP_018871452.1 Gorilla gorilla gorilla
R XP_018871450.1 Gorilla gorilla gorilla
- XP_034800282.1 Pan paniscus
R XP_034800281

(True,
 215,
 'R',
 R    35
 -     9
 H     9
 dtype: int64)

# Validate CPDs

Here we test whether a putative CPD is valid by examining the sequence around the site, 10 amino acids on either side.

This function returns the species and protein IDs of the sequences that carry a particular amino acid at a site.

In [19]:
get_species_with_allele('GBA', translation['NM_000157.4'], 535, 'L')

[('Macaca fascicularis', 'XP_005541629.1'),
 ('Macaca fascicularis', 'XP_005541627.1'),
 ('Macaca fascicularis', 'XP_005541626.1')]

In [20]:
get_species_with_allele('GALT', translation['NM_000155.4'], 23, 'A')

[('Otolemur garnettii', 'XP_003800307.1')]

This function compares the human and nonhuman sequences from 10 positions before the site to 10 positions after it.  The tuple printed at the bottom gives the numbers of:

* differences between the sequences
* gaps
* sites out of range

For example, the GBA site lies near the end of the sequence, so 9 positions fall outside the range.  The GALT sequence shows five amino acid differences in the flanking positions.

In [21]:
local_compare_to_human('GBA', translation['NM_000157.4'], 535, 'XP_005541629.1')

-10 G G
-9 Y Y
-8 S S
-7 I I
-6 H H
-5 T T
-4 Y Y
-3 L L
-2 W W
-1 R C
0 R L
1 Q Q
2 outside sequence
3 outside sequence
4 outside sequence
5 outside sequence
6 outside sequence
7 outside sequence
8 outside sequence
9 outside sequence
10 outside sequence


(1, 0, 9)

In [22]:
local_compare_to_human('GALT', translation['NM_000155.4'], 23, 'XP_003800307.1')

-10 Q Q
-9 A A
-8 S S
-7 E E
-6 A A
-5 D D
-4 A I
-3 A P
-2 A V
-1 A A
0 T A
1 F F
2 R Q
3 A A
4 N S
5 D D
6 H H
7 Q Q
8 H H
9 I I
10 R R


(5, 0, 0)

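This function prints a site-by-site comparison of the human and nonhuman sequences over their whole length.  Only the differences are shown.  Positions appear to be alignment columns rather than human protein positions (the GBA site 535 is listed as 547).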

In [23]:
global_compare_to_human('GBA', translation['NM_000157.4'], 'XP_005541629.1')

5 S -
6 S -
7 P -
10 E K
16 L S
17 S G
24 G A
67 F L
68 D E
70 P L
94 M T
96 P T
101 H R
213 R H
344 V -
345 L -
346 T -
347 D -
348 P -
349 E -
350 A -
351 A -
352 K -
353 Y -
354 V -
355 H -
356 G -
357 I -
358 A -
359 V -
360 H -
361 W -
362 Y -
363 L -
364 D -
365 F -
366 L -
367 A -
368 P -
369 A -
370 K -
371 A -
372 T -
373 L -
374 G -
375 E -
376 T -
377 H -
378 R -
379 L -
380 F -
381 P -
382 N -
383 T -
384 M -
385 L -
386 F -
387 A -
388 S -
389 E -
390 A -
391 C -
392 V -
393 G -
394 S -
395 K -
396 F -
397 W -
398 E -
399 Q -
401 S -
402 V -
403 R -
404 L -
405 G -
406 S -
407 W -
408 D -
409 R -
410 G -
411 M -
412 Q -
413 Y -
414 S -
415 H -
416 S -
417 I -
418 I -
419 T -
452 I V
496 A T
507 A T
546 R C
547 R L


In [24]:
global_compare_to_human('GALT', translation['NM_000155.4'], 'XP_003800307.1')

23 R C
26 T A
28 P L
29 Q D
30 Q R
39 A I
40 A P
41 A V
43 T A
45 R Q
47 N S
106 I T
115 Q H
118 S A
153 K E
154 S A
177 V M
203 A V
216 A S
248 Q R
251 K Q
256 E V
268 L I
269 R K
289 T V
310 P S
360 E A
363 A D
366 N D
669 H C
671 G R
828 T A
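
A whole-sequence comparison of this kind comes down to walking two aligned strings in parallel and reporting the columns where they differ. The sketch below assumes equal-length gapped strings and 1-based alignment columns; `global_compare` and the toy sequences are hypothetical.

```python
def global_compare(human_aln, other_aln):
    """Return (column, human_residue, other_residue) for every differing column."""
    return [(col, h, o)
            for col, (h, o) in enumerate(zip(human_aln, other_aln), start=1)
            if h != o]

# Toy aligned sequences
for col, h, o in global_compare('MK-RTA', 'MR-RSA'):
    print(col, h, o)
```

Columns where both sequences have a gap compare equal and are not reported, while a gap in only one sequence shows up as a difference, as in the `5 S -` lines of the GBA output.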


In [25]:
get_species_with_allele('CFTR', translation['NM_000492.4'], 13, 'F')

[('Otolemur garnettii', 'XP_003789873.1')]

In [26]:
local_compare_to_human('CFTR', translation['NM_000492.4'], 13, 'XP_003789873.1')

-10 P P
-9 L L
-8 E K
-7 K K
-6 A L
-5 S N
-4 V L
-3 V L
-2 - -
-1 - -
0 S F
1 K F
2 L L
3 F Y
4 F F
5 S S
6 W W
7 T T
8 R R
9 P P
10 I I


(7, 0, 0)

In [28]:
translation['NM_000132.3']

'NP_000123.1'

In [29]:
get_species_with_allele('F8', 'NP_000123.1', 2183, 'V')

[('Hylobates moloch', 'XP_032612179.1')]

In [30]:
local_compare_to_human('F8', 'NP_000123.1', 2183, 'XP_003789873.1')

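UnboundLocalError: local variable 'nonaln' referenced before assignment

The call fails because `'XP_003789873.1'` is the CFTR protein returned in cell 25, not an F8 sequence, so it is presumably missing from the F8 alignment and `nonaln` is never assigned.  The F8 protein returned in cell 29 is `'XP_032612179.1'`.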