In [6]:
import pandas as pd
import re

In [7]:
datadir = '/home/xavier/data/'
clinvar_file = 'variant_summary_02.txt' # Updated 5/21/2018

# First, download and survey the ClinVar data

## ➜ ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

+ A phenotype results from the expression of an organism's genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. [Wikipedia](https://en.wikipedia.org/wiki/Phenotype)
+ The file is from **[THIS FTP DIRECTORY](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)** (which is updated monthly)
+ **Download this file: [Variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)** which is an 809K-row dataset listing 

In [8]:
clinvar = pd.read_csv(datadir + clinvar_file, sep='\t', lineterminator='\n')
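
When a large tab-delimited file has a column that mixes numeric and text labels, pandas may infer inconsistent types while reading it in chunks. The sketch below shows one way to pin a column to a single type with the `dtype` argument of `pd.read_csv`. It assumes the `Chromosome` column holds labels such as `7` and `X`; the sample rows are made up for illustration and are not real ClinVar records.

```python
import io

import pandas as pd

# Made-up three-column excerpt; the IDs are illustrative, not real ClinVar rows.
sample = (
    "#AlleleID\tChromosome\tAssembly\n"
    "1\t7\tGRCh38\n"
    "2\tX\tGRCh38\n"
)

# Reading 'Chromosome' as str keeps '7' and 'X' in one consistent type.
demo = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"Chromosome": str})
print(demo["Chromosome"].tolist())
```

The same `dtype={...}` argument could be passed to the `read_csv` call on the full file, if mixed types turn out to be a problem there.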



In [9]:
list(clinvar)

['#AlleleID',
 'Type',
 'Name',
 'GeneID',
 'GeneSymbol',
 'HGNC_ID',
 'ClinicalSignificance',
 'ClinSigSimple',
 'LastEvaluated',
 'RS# (dbSNP)',
 'nsv/esv (dbVar)',
 'RCVaccession',
 'PhenotypeIDS',
 'PhenotypeList',
 'Origin',
 'OriginSimple',
 'Assembly',
 'ChromosomeAccession',
 'Chromosome',
 'Start',
 'Stop',
 'ReferenceAllele',
 'AlternateAllele',
 'Cytogenetic',
 'ReviewStatus',
 'NumberSubmitters',
 'Guidelines',
 'TestedInGTR',
 'OtherIDs',
 'SubmitterCategories',
 'VariationID']

In [10]:
# count the rows of data
clinvar.shape

(811417, 31)

In [11]:
# Work with Human Reference Genome #38
latest = clinvar[clinvar['Assembly'] == 'GRCh38']
latest.shape

(389089, 31)

In [12]:
# Further filter the dataset down to these 'authoritative' values
# of the 'ReviewStatus' column

authoritative = latest[
    (latest['ReviewStatus'] == 'criteria provided, multiple submitters, no conflicts') | 
    (latest['ReviewStatus'] == 'practice guideline') | 
    (latest['ReviewStatus'] == 'reviewed by expert panel') 
]
authoritative.shape

(62166, 31)
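
The three chained `==` comparisons above could also be written as a single membership test. This is a minimal sketch using `Series.isin` on a toy frame; the names `AUTHORITATIVE_STATUSES`, `toy`, and `toy_authoritative` are hypothetical, and the toy rows stand in for the real table.

```python
import pandas as pd

# The three review statuses treated as authoritative in the filter above.
AUTHORITATIVE_STATUSES = [
    "criteria provided, multiple submitters, no conflicts",
    "practice guideline",
    "reviewed by expert panel",
]

# Toy stand-in for the real table (hypothetical rows).
toy = pd.DataFrame({
    "ReviewStatus": [
        "practice guideline",
        "no assertion criteria provided",
        "reviewed by expert panel",
    ]
})

# Keep only rows whose status is one of the listed values.
toy_authoritative = toy[toy["ReviewStatus"].isin(AUTHORITATIVE_STATUSES)]
print(toy_authoritative.shape)
```

Keeping the statuses in one list makes it easier to add or drop a status later without touching the filter expression.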

## Time to start matching the OMIM tags

+ The first order of business is to find OMIM tags that match the **59 ACMG Recommendations Diseases**
+ Source for the list: [ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing](https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/)

In [13]:
# Import Disease names and OMIM tags and convert to a list
acmg = pd.read_csv(datadir + 'ACMG_Conditions_OMIM.csv')
acmgtags = acmg.ACMGTAGS.tolist()

In [14]:
# Function to write OMIM tags to a new column
def omim_tags(x):
    # Find every OMIM tag (1 to 6 digits long) in the string
    res = re.findall(r"OMIM:(\d{1,6})", x)
    # Return the list of matches, or the sentinel "NA" when there are none
    return res if res else "NA"
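
To make the behavior of the tag extraction concrete, here is a small worked example. It is a sketch, not the notebook's own function: `omim_tags_safe` and `OMIM_PATTERN` are hypothetical names, the sample string is made up, and the extra `isinstance` check is an added assumption to cover cells that are missing (NaN) rather than strings.

```python
import re

# Same pattern as above: the literal prefix "OMIM:" followed by 1 to 6 digits.
OMIM_PATTERN = re.compile(r"OMIM:(\d{1,6})")


def omim_tags_safe(value):
    """Return the OMIM numbers found in value, or "NA" if there are none."""
    if not isinstance(value, str):  # e.g. NaN from a missing cell
        return "NA"
    found = OMIM_PATTERN.findall(value)
    return found if found else "NA"


# Made-up identifier string in a 'source:id' style.
example = "MedGen:C0000001,OMIM:114480;MedGen:C0000002,OMIM:604370"
print(omim_tags_safe(example))       # ['114480', '604370']
print(omim_tags_safe(float("nan")))  # NA
```

Note that the matches come back as strings; converting them to integers is a separate step.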

In [15]:
authoritative['OMIMTAGS'] = authoritative.PhenotypeIDS.apply(omim_tags)

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [16]:
authoritative.shape

(62166, 32)
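
Adding a column to a DataFrame that was itself produced by filtering another DataFrame is a common trigger for pandas' copy-versus-view warning. One usual way to avoid the ambiguity is to take an explicit `.copy()` when the subset is created. The sketch below uses a toy frame; `toy`, `subset`, and `HAS_OMIM` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical rows).
toy = pd.DataFrame({
    "Assembly": ["GRCh37", "GRCh38", "GRCh38"],
    "PhenotypeIDS": ["OMIM:1", "OMIM:22", "MedGen:C1"],
})

# .copy() makes the subset an independent DataFrame,
# so assigning a new column to it is unambiguous.
subset = toy[toy["Assembly"] == "GRCh38"].copy()
subset["HAS_OMIM"] = subset["PhenotypeIDS"].str.contains("OMIM:")

print(subset)
print("HAS_OMIM" in toy.columns)  # the original frame is untouched
```

Applied here, that would mean appending `.copy()` to the expressions that build `latest` and `authoritative`.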

In [17]:
# Function to write "OMIM > ACMG" matches to a new column

def acmg_matches(x):
    # Find every OMIM tag (4 to 6 digits long) in the string
    res = re.findall(r"OMIM:(\d{4,6})", x)
    # Keep only the tags that appear in the ACMG list (duplicates are kept)
    mymatch = [int(match) for match in res if int(match) in acmgtags]
    # Return the list of matches, or the sentinel "NOACMG" when there are none
    return mymatch if mymatch else "NOACMG"
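
Because the membership test runs once per tag per row, a `set` may be a better container for the ACMG tags than a list. The following is a sketch under that assumption; `acmg_tag_set` and `acmg_matches_set` are hypothetical names, and the two tag values are a small stand-in for the full `acmgtags` list.

```python
import re

# Hypothetical stand-in for the full ACMG tag list, stored as a set of ints.
acmg_tag_set = {192500, 613688}


def acmg_matches_set(value, tags=acmg_tag_set):
    """Return the OMIM numbers in value that are in tags, else "NOACMG"."""
    found = [int(m) for m in re.findall(r"OMIM:(\d{4,6})", value)]
    matches = [tag for tag in found if tag in tags]  # set lookup per tag
    return matches if matches else "NOACMG"


print(acmg_matches_set("OMIM:192500,OMIM:613688,OMIM:100100"))  # [192500, 613688]
print(acmg_matches_set("MedGen:C0000001"))                      # NOACMG
```

With the real data, the set could be built once with `set(acmgtags)`.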

In [18]:
authoritative['ACMGTAGS'] = authoritative.PhenotypeIDS.apply(acmg_matches)



In [19]:
authoritative.shape

(62166, 33)

In [20]:
onlyomim = authoritative[authoritative['OMIMTAGS'] != "NA"]
onlyacmg = authoritative[authoritative['ACMGTAGS'] != "NOACMG"]

In [21]:
onlyacmg.shape

(12138, 33)

In [22]:
onlyomim.shape

(38590, 33)

In [23]:
onlyacmg.to_csv(datadir + 'onlyacmg_v4.csv')

In [24]:
onlyacmg.iloc[0:1500].to_csv(datadir + 'onlyacmg_v4_1500.csv')

In [25]:
onlyacmg["MATCHCOUNT"] = onlyacmg["ACMGTAGS"].apply(len)



In [26]:
onlyacmg.groupby("MATCHCOUNT").size()

MATCHCOUNT
1    11898
2      204
3       30
4        4
5        1
6        1
dtype: int64

## ...⇧
> **~2% of ACMG-matched records carry more than one ACMG tag** 

+ This may be due to duplication of the same tag
    + 1.68% have 2 tags
    + 0.25% have 3 tags

In [27]:
(204/12138) * 100

1.680672268907563

In [28]:
(30/12138) * 100

0.2471576866040534
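
The two hand-computed percentages can also be derived for every match count at once. This sketch re-enters the counts from the `groupby` output above into a Series; `counts`, `percent`, and `multi_share` are hypothetical names.

```python
import pandas as pd

# Row counts per MATCHCOUNT, copied from the groupby output above.
counts = pd.Series({1: 11898, 2: 204, 3: 30, 4: 4, 5: 1, 6: 1}, name="rows")

# Share of each match count as a percentage of all ACMG-matched rows.
percent = counts / counts.sum() * 100

# Combined share of rows with more than one matched tag.
multi_share = percent[percent.index > 1].sum()

print(percent.round(2))
print(round(multi_share, 2))
```

On the real frame, `onlyacmg["MATCHCOUNT"].value_counts(normalize=True) * 100` should give the same shares without retyping the counts.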

In [34]:
multiples = onlyacmg[onlyacmg["MATCHCOUNT"] != 1].filter(['ACMGTAGS','MATCHCOUNT'])

In [36]:
multiples

                               ACMGTAGS  MATCHCOUNT
4023                   [193300, 193300]           2
5184                   [143890, 603776]           2
5631                   [192500, 192500]           2
5659                   [192500, 613688]           2
6284                   [194070, 194070]           2
6928           [277900, 277900, 277900]           3
6936                   [277900, 277900]           2
9509           [608456, 608456, 132600]           3
9514   [608456, 608456, 608456, 132600]           4
12349                  [168000, 168000]           2


## ...⇧
> **Lots of duplicate tags** 

+ Need to filter out duplicate tags
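
One possible way to filter the duplicates is to collapse each tag list to its unique values while keeping first-seen order, then recount. This is only a sketch: `dedupe_tags`, `ACMGTAGS_UNIQUE`, and `MATCHCOUNT_UNIQUE` are hypothetical names, and the toy rows reuse three of the tag lists shown above.

```python
import pandas as pd


def dedupe_tags(tags):
    """Drop repeated tags while preserving the order of first appearance."""
    return list(dict.fromkeys(tags))


# Toy frame built from three of the tag lists in the output above.
toy = pd.DataFrame({
    "ACMGTAGS": [
        [193300, 193300],
        [143890, 603776],
        [608456, 608456, 608456, 132600],
    ]
})

toy["ACMGTAGS_UNIQUE"] = toy["ACMGTAGS"].apply(dedupe_tags)
toy["MATCHCOUNT_UNIQUE"] = toy["ACMGTAGS_UNIQUE"].apply(len)
print(toy[["ACMGTAGS_UNIQUE", "MATCHCOUNT_UNIQUE"]])
```

After de-duplication, only rows that match genuinely different ACMG tags keep a count above 1.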