## Extract pathogenic variants from ClinVar data

I renamed the folder to `clinvar`, then surveyed all *Clinical significance* categories in the data.

In [1]:
run kondrashov

In [2]:
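signif = []
for gene in loci:
    # Each gene has one tab-separated ClinVar table in the clinvar/ folder
    vardata = pd.read_table('clinvar/' + gene + '_clinvar.txt.txt', sep='\t')
    n = len(vardata)
    for i in range(n):
        tmp = vardata['Clinical significance (Last reviewed)'][i]
        # Keep only the category, dropping the "(Last reviewed ...)" part
        signif.append(tmp.split('(Last')[0])
    # gene, variants in this gene, running total across genes
    print(gene, n, len(signif))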

ABCD1 325 325
ALPL 187 512
AR 189 701
ATP7B 509 1210
BTK 160 1370
CASR 770 2140
CBS 228 2368
CFTR 1289 3657
CYBB 151 3808
F7 50 3858
F8 363 4221
F9 181 4402
G6PD 119 4521
GALT 300 4821
GBA 160 4981
GJB1 415 5396
HBB 473 5869
HPRT1 56 5925
IL2RG 115 6040
KCNH2 1084 7124
KCNQ1 673 7797
L1CAM 148 7945
LDLR 1459 9404
MPZ 289 9693
MYH7 1625 11318
TYR 139 11457
PAH 602 12059
PMM2 109 12168
RHO 198 12366
TP53 1054 13420
TTR 147 13567
VWF 583 14150


In [3]:
categs = pd.Series(signif).value_counts()
categs

Uncertain significance                                       6400
Pathogenic                                                   2562
Likely pathogenic                                            1721
Conflicting interpretations of pathogenicity                 1178
not provided                                                  769
Pathogenic/Likely pathogenic                                  691
Likely benign                                                 285
other                                                         237
Benign                                                        126
Benign/Likely benign                                           65
Pathogenic, other                                              63
drug response                                                  16
no interpretation for the single variant                       12
Pathogenic, drug response                                      10
Pathogenic/Likely pathogenic, drug response                     6
Conflictin
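
As a minimal sketch of the survey step, the snippet below strips the review date from a few field values and counts the resulting categories using only the standard library. The sample strings are made up, and the `<category>(Last reviewed: <date>)` layout is an assumption inferred from the `split('(Last')` call above.

```python
from collections import Counter

# Made-up field values; the "<category>(Last reviewed: <date>)" layout is assumed
raw = [
    'Pathogenic(Last reviewed: Jan 1, 2018)',
    'Uncertain significance(Last reviewed: Feb 2, 2018)',
    'Pathogenic(Last reviewed: Mar 3, 2018)',
    'not provided',
]

# Keep the text before "(Last" and drop any stray whitespace
labels = [value.split('(Last')[0].strip() for value in raw]

# Tally the categories, most frequent first
counts = Counter(labels)
for label, count in counts.most_common():
    print(label, count)
```

The `strip()` call is an addition of this sketch; it guards against a trailing space before the parenthesis and does not affect the prefix tests used later.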

We then construct a condition that keeps only the categories beginning with *Pathogenic* or *Likely pathogenic*, including compound ones such as *Pathogenic, drug response*.

In [4]:
clin = categs.index.tolist()
for i in clin:
    if (i[:4]=='Path') or (i[:11]=='Likely path'):
        print(i)

Pathogenic
Likely pathogenic
Pathogenic/Likely pathogenic
Pathogenic, other
Pathogenic, drug response
Pathogenic/Likely pathogenic, drug response
Pathogenic/Likely pathogenic, other
Likely pathogenic, drug response
Pathogenic, risk factor
Pathogenic/Likely pathogenic, risk factor
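
The prefix test can also be wrapped in a small helper, which makes it easy to check against individual category names. This is only a sketch: `is_pathogenic` is a hypothetical name, not something defined in the notebook.

```python
def is_pathogenic(label):
    """Return True if the label starts with 'Path' or 'Likely path'."""
    # str.startswith accepts a tuple and matches any of the prefixes
    return label.startswith(('Path', 'Likely path'))


# Category names taken from the survey above
examples = [
    'Pathogenic/Likely pathogenic, drug response',
    'Likely pathogenic',
    'Likely benign',
    'Conflicting interpretations of pathogenicity',
]
for label in examples:
    print(label, is_pathogenic(label))
```

Because the test looks only at the start of the string, *Conflicting interpretations of pathogenicity* is rejected even though it contains the word "pathogenicity".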


The remaining categories are excluded.

In [5]:
for i in clin:
    if not ((i[:4]=='Path') or (i[:11]=='Likely path')):
        print(i)

Uncertain significance
Conflicting interpretations of pathogenicity
not provided
Likely benign
other
Benign
Benign/Likely benign
drug response
no interpretation for the single variant
Conflicting interpretations of pathogenicity, other
Conflicting interpretations of pathogenicity, risk factor


Finally, we write new spreadsheets containing only the pathogenic entries to a new folder, `pathogenic`.  **We focus on those data from now on.**

In [6]:
!mkdir pathogenic

In [7]:
for gene in loci:
    pathog = []
    vardata = pd.read_table('clinvar/' + gene + '_clinvar.txt.txt', sep='\t')
    n = len(vardata)
    for i in range(n):
        signif = vardata['Clinical significance (Last reviewed)'][i]
        # Same prefix test as above: Pathogenic or Likely pathogenic
        pathog.append((signif[:4]=='Path') or (signif[:11]=='Likely path'))
    vardata['Pathogenic'] = pathog
    # gene, total variants, pathogenic variants
    print(gene, n, sum(pathog))
    # Keep the pathogenic rows and drop the helper column before saving
    tmp = vardata[vardata['Pathogenic']]
    del tmp['Pathogenic']
    tmp.to_csv('pathogenic/' + gene + '_clinvar.csv', index=False)

ABCD1 325 134
ALPL 187 76
AR 189 132
ATP7B 509 185
BTK 160 102
CASR 770 109
CBS 228 68
CFTR 1289 391
CYBB 151 55
F7 50 26
F8 363 269
F9 181 138
G6PD 119 53
GALT 300 174
GBA 160 103
GJB1 415 111
HBB 473 127
HPRT1 56 46
IL2RG 115 55
KCNH2 1084 206
KCNQ1 673 210
L1CAM 148 54
LDLR 1459 808
MPZ 289 97
MYH7 1625 270
TYR 139 72
PAH 602 358
PMM2 109 56
RHO 198 105
TP53 1054 241
TTR 147 77
VWF 583 149
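
The row-by-row loop above can also be written as a single boolean mask. The sketch below shows the idea on a small made-up table; the `Name` column and the row values are illustrative assumptions, and no files are read or written.

```python
import pandas as pd

COLUMN = 'Clinical significance (Last reviewed)'

# Made-up rows standing in for one gene's ClinVar table
vardata = pd.DataFrame({
    'Name': ['variant A', 'variant B', 'variant C', 'variant D'],
    COLUMN: [
        'Pathogenic(Last reviewed: Jan 1, 2018)',
        'Uncertain significance(Last reviewed: Feb 2, 2018)',
        'Likely pathogenic, drug response(Last reviewed: Mar 3, 2018)',
        'Likely benign(Last reviewed: Apr 4, 2018)',
    ],
})

# str.match anchors the pattern at the start of each value,
# so this is True where the label begins with 'Path' or 'Likely path'
mask = vardata[COLUMN].str.match(r'Path|Likely path')

# Boolean indexing keeps only the matching rows
pathogenic = vardata[mask]
print(len(vardata), int(mask.sum()))
```

With a mask there is no need for a temporary `Pathogenic` column; the filtered frame could be passed to `to_csv` directly, as in the cell above.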


In [8]:
ls pathogenic/

ABCD1_clinvar.csv  CYBB_clinvar.csv   HBB_clinvar.csv    MYH7_clinvar.csv
ALPL_clinvar.csv   F7_clinvar.csv     HPRT1_clinvar.csv  PAH_clinvar.csv
AR_clinvar.csv     F8_clinvar.csv     IL2RG_clinvar.csv  PMM2_clinvar.csv
ATP7B_clinvar.csv  F9_clinvar.csv     KCNH2_clinvar.csv  RHO_clinvar.csv
BTK_clinvar.csv    G6PD_clinvar.csv   KCNQ1_clinvar.csv  TP53_clinvar.csv
CASR_clinvar.csv   GALT_clinvar.csv   L1CAM_clinvar.csv  TTR_clinvar.csv
CBS_clinvar.csv    GBA_clinvar.csv    LDLR_clinvar.csv   TYR_clinvar.csv
CFTR_clinvar.csv   GJB1_clinvar.csv   MPZ_clinvar.csv    VWF_clinvar.csv


In [1]:
!open .