In [1]:
import pandas as pd
import re

In [2]:
datadir = '/home/xavier/data/'
clinvar_file = 'variant_summary_02.txt' # Updated 5/21/2018

# First, download and survey the ClinVar data

## ➜ ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

+ A phenotype results from the expression of an organism's genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. [Wikipedia](https://en.wikipedia.org/wiki/Phenotype)
+ File is from **[THIS FTP DIRECTORY](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)** (which is updated monthly)
+ **Download this file: [Variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)**, which is an 809K-row dataset
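
The download is gzipped, and pandas can read a gzipped tab-delimited file without unpacking it first. The sketch below illustrates this on a tiny in-memory sample; the three columns and two rows are hypothetical stand-ins for the real file, not actual ClinVar records.

```python
import gzip
import io

import pandas as pd

# Hypothetical two-row sample standing in for variant_summary.txt.gz
sample_tsv = (
    "#AlleleID\tType\tAssembly\n"
    "15041\tindel\tGRCh37\n"
    "15041\tindel\tGRCh38\n"
)
compressed = gzip.compress(sample_tsv.encode("utf-8"))

# compression="gzip" is explicit because the input is an in-memory buffer;
# for a path ending in .gz, pandas infers the compression from the extension
sample = pd.read_csv(io.BytesIO(compressed), sep="\t", compression="gzip")
print(sample.shape)
```

With the real download, the same call would take the path to the `.gz` file in place of the buffer.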

In [3]:
clinvar = pd.read_csv(datadir + clinvar_file, sep='\t', lineterminator='\n')



In [4]:
list(clinvar)

['#AlleleID',
 'Type',
 'Name',
 'GeneID',
 'GeneSymbol',
 'HGNC_ID',
 'ClinicalSignificance',
 'ClinSigSimple',
 'LastEvaluated',
 'RS# (dbSNP)',
 'nsv/esv (dbVar)',
 'RCVaccession',
 'PhenotypeIDS',
 'PhenotypeList',
 'Origin',
 'OriginSimple',
 'Assembly',
 'ChromosomeAccession',
 'Chromosome',
 'Start',
 'Stop',
 'ReferenceAllele',
 'AlternateAllele',
 'Cytogenetic',
 'ReviewStatus',
 'NumberSubmitters',
 'Guidelines',
 'TestedInGTR',
 'OtherIDs',
 'SubmitterCategories',
 'VariationID']

In [5]:
# count the rows of data
clinvar.shape

(811417, 31)

In [6]:
# Keep only the rows mapped to human reference genome assembly GRCh38
latest = clinvar[clinvar['Assembly'] == 'GRCh38']
latest.shape

(389089, 31)

In [7]:
# Further filter the dataset down to rows with one of these
# 'authoritative' values in the 'ReviewStatus' column

authoritative = latest[
    (latest['ReviewStatus'] == 'criteria provided, multiple submitters, no conflicts') | 
    (latest['ReviewStatus'] == 'practice guideline') | 
    (latest['ReviewStatus'] == 'reviewed by expert panel') 
]
authoritative.shape

(62166, 31)
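
The three-way `|` filter above can also be written with `Series.isin`, which keeps the list of accepted statuses in one place. This is a sketch on a hypothetical four-row frame (`latest_demo`), not the real data; the two non-matching status strings are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the `latest` frame
latest_demo = pd.DataFrame({
    "ReviewStatus": [
        "practice guideline",
        "no assertion criteria provided",
        "reviewed by expert panel",
        "criteria provided, single submitter",
    ]
})

# The three review statuses treated as authoritative in this notebook
authoritative_statuses = [
    "criteria provided, multiple submitters, no conflicts",
    "practice guideline",
    "reviewed by expert panel",
]

# isin() selects the same rows as chaining three == comparisons with |
authoritative_demo = latest_demo[
    latest_demo["ReviewStatus"].isin(authoritative_statuses)
]
print(authoritative_demo.shape)
```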

## Time to start matching the OMIM tags

+ The first step is to find the OMIM tags that match the **59 ACMG Recommendations Diseases**
+ Import this file: [ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing](https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/)

In [8]:
# Import Disease names and OMIM tags and convert to a list
acmg = pd.read_csv(datadir + 'ACMG_Conditions_OMIM.csv')
acmgtags = acmg.ACMGTAGS.tolist()

In [9]:
# Function to write OMIM tags to a new column
def omim_tags(x):
    # Find every OMIM tag, from 1 to 6 digits long
    res = re.findall(r"OMIM:(\d{1,6})", x)
    # Return the list of tags, or the "NA" sentinel when there are none
    return res if res else "NA"
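
To make the pattern concrete, here is what `re.findall` returns for one string with an OMIM tag and one without. Both input strings are hypothetical; the real `PhenotypeIDS` values may use different separators or prefixes.

```python
import re

# Hypothetical PhenotypeIDS value; the real column format may differ
phenotype_ids = "MedGen:C0027672,OMIM:114480;MedGen:CN169374"

# The capture group returns only the digits, not the "OMIM:" prefix
tags = re.findall(r"OMIM:(\d{1,6})", phenotype_ids)
print(tags)  # ['114480']

# No OMIM tag present: findall returns an empty list, which is falsy
no_tags = re.findall(r"OMIM:(\d{1,6})", "MedGen:CN169374")
print(no_tags)  # []
```

Note that the captured tags are strings, which is why the ACMG matching step converts them with `int()` before comparing.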

In [10]:
authoritative['OMIMTAGS'] = authoritative.PhenotypeIDS.apply(omim_tags)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
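
This warning appears because `authoritative` was created by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original. One common way to avoid it is to take an explicit `.copy()` of the filtered frame before adding columns. A minimal sketch, using a hypothetical three-row frame and a hypothetical `HAS_OMIM` column:

```python
import pandas as pd

# Hypothetical miniature of the ClinVar frame; the IDs are illustrative
demo = pd.DataFrame({
    "Assembly": ["GRCh37", "GRCh38", "GRCh38"],
    "PhenotypeIDS": [
        "MedGen:CN169374",
        "OMIM:114480",
        "MedGen:C0027672,OMIM:604370",
    ],
})

# .copy() makes the filtered frame independent of `demo`, so adding a
# column to it is unambiguous and does not trigger SettingWithCopyWarning
subset = demo[demo["Assembly"] == "GRCh38"].copy()
subset["HAS_OMIM"] = subset["PhenotypeIDS"].str.contains("OMIM:")
print(subset)
```

In this notebook, the equivalent change would be appending `.copy()` where `authoritative` and `onlyacmg` are first created.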


In [11]:
authoritative.shape

(62166, 32)

In [12]:
# function to write "OMIM > ACMG" matches to a new Column

def acmg_matches(x):
    # Find every OMIM tag, from 4 to 6 digits long
    res = re.findall(r"OMIM:(\d{4,6})", x)
    # Keep only the tags that appear in the ACMG list (as a set of ints)
    mymatch = {int(match) for match in res if int(match) in acmgtags}
    # Return the set of matches, or the "NOACMG" sentinel when there are none
    return mymatch if mymatch else "NOACMG"
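
Because `acmgtags` is a list, each `in` test scans it from the start. A possible variant, sketched below, keeps the ACMG numbers in a set and intersects it with the tags found in each row. The names `acmg_lookup` and `acmg_matches_fast` are hypothetical, and the three OMIM numbers are only an illustrative subset taken from matches shown later in this notebook; the full list would come from `ACMG_Conditions_OMIM.csv`.

```python
import re

# Illustrative subset of ACMG OMIM numbers (assumption: the real set would
# be built from acmgtags, e.g. frozenset(acmgtags))
acmg_lookup = frozenset([604370, 612555, 194070])

def acmg_matches_fast(x):
    # Hypothetical variant of acmg_matches using set intersection
    found = {int(tag) for tag in re.findall(r"OMIM:(\d{4,6})", x)}
    matched = found & acmg_lookup
    # Same sentinel as the original function when nothing matches
    return matched if matched else "NOACMG"

print(acmg_matches_fast("OMIM:604370,OMIM:114480"))
print(acmg_matches_fast("MedGen:C0027672"))
```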

In [13]:
authoritative['ACMGTAGS'] = authoritative.PhenotypeIDS.apply(acmg_matches)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [14]:
authoritative.shape

(62166, 33)

In [15]:
onlyomim = authoritative[authoritative['OMIMTAGS'] != "NA"]
onlyacmg = authoritative[authoritative['ACMGTAGS'] != "NOACMG"]

In [16]:
onlyacmg.shape

(12138, 33)

In [17]:
onlyomim.shape

(38590, 33)

In [18]:
onlyacmg.to_csv(datadir + 'onlyacmg_v5.csv')

In [19]:
onlyacmg.iloc[0:1500].to_csv(datadir + 'onlyacmg_v5_1500.csv')

In [20]:
onlyacmg["MATCHCOUNT"] = onlyacmg["ACMGTAGS"].apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [29]:
multgroup = onlyacmg.groupby("MATCHCOUNT").size()
multgroup

MATCHCOUNT
1    12028
2       97
3       10
4        3
dtype: int64

In [34]:
int(multgroup[2])

97


## ...⇧
> **Less than 1% of the ACMG-matched rows list two or more ACMG diseases**

+ This may be due to duplications of the same tag
    + 0.8% have 2 diseases
    + 0.08% have 3 diseases

In [36]:
(97/12138) * 100

0.7991431866864394

In [37]:
(10/12138) * 100

0.08238589553468446
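
The same percentages can be derived in one step from the group counts rather than by typing the numbers in by hand. The sketch below rebuilds the counts from the `groupby` output above as a standalone Series (the variable names are illustrative) and also sums the share of rows with two or more matches.

```python
import pandas as pd

# Counts copied from the groupby("MATCHCOUNT").size() output above
counts = pd.Series({1: 12028, 2: 97, 3: 10, 4: 3}, name="rows")

# Share of rows per match count, in percent
share = counts / counts.sum() * 100

# Combined share of rows with two or more ACMG matches
multi = share[share.index >= 2].sum()

print(share.round(2))
print(round(multi, 2))
```

In the notebook itself, `multgroup / multgroup.sum() * 100` would give the same per-group shares.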

In [24]:
multiples = onlyacmg[onlyacmg["MATCHCOUNT"] != 1].filter(['ACMGTAGS','MATCHCOUNT'])

In [25]:
multiples

| | ACMGTAGS | MATCHCOUNT |
|---|---|---|
| 5184 | {603776, 143890} | 2 |
| 5659 | {613688, 192500} | 2 |
| 9509 | {608456, 132600} | 2 |
| 9514 | {608456, 132600} | 2 |
| 15426 | {192600, 115197} | 2 |
| 16701 | {604370, 612555} | 2 |
| 16719 | {604370, 612555, 194070} | 3 |
| 16728 | {612555, 194070} | 2 |
| 16730 | {604370, 612555} | 2 |
| 16779 | {601144, 192500, 603830} | 3 |


## ...⇧
> **~110 rows out of ~12K rows have two or more diseases**

+ We're not leaving too much behind until we figure out how to display two diseases
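
One possible way to display rows with several diseases, offered only as a sketch, is to reshape them to one row per disease with `DataFrame.explode`. This assumes pandas 0.25 or later, which may be newer than the version used in this notebook. The first two rows below are copied from the `multiples` output; the third row (index 1) is a hypothetical single-match row added for contrast.

```python
import pandas as pd

demo = pd.DataFrame(
    {"ACMGTAGS": [{603776, 143890}, {604370, 612555, 194070}, {192500}]},
    index=[5184, 16719, 1],  # index 1 is a hypothetical single-match row
)

# Sets have no defined order, so turn each into a sorted list first;
# this makes the exploded output deterministic
demo["ACMGTAGS"] = demo["ACMGTAGS"].apply(sorted)

# One output row per (variant, disease) pair; the index value repeats
long_form = demo.explode("ACMGTAGS")
print(long_form)
```

The repeated index values keep each disease tied to its original variant row, so the long form can be joined back to the other columns if needed.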