In [22]:
import pandas as pd
import re

In [13]:
datadir = '/home/xavier/data/'
clinvar_file = 'variant_summary_02.txt' # Updated 5/21/2018

# First, download and survey the ClinVar data

## ➜ ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

+ A phenotype results from the expression of an organism's genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. [Wikipedia](https://en.wikipedia.org/wiki/Phenotype)
+ File is from **[THIS FTP DIRECTORY](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)** (which is updated monthly)
+ **Download this file: [Variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)**, which is an 809K-row dataset listing 

In [3]:
clinvar = pd.read_csv(datadir + clinvar_file, sep='\t', lineterminator='\n')



In [12]:
# count the rows of data
clinvar.shape

(811417, 31)

## ...⇧
> **811K rows of data:** The Variant Summary updated on 5/21 is 2K rows longer than the previous release

+ The previous release had 809K rows.

- - -

In [8]:
# What are the Reference Genomes in this data?
clinvar.groupby('Assembly').size()

Assembly
GRCh37    397447
GRCh38    389089
NCBI36     19545
dtype: int64

In [6]:
397447 + 389089 + 19545

806081

## ...⇧
> **806K Rows of "Reference Genome" data:** This is an anomaly!!

+ Data has **811K** rows
+ Yet the Assembly counts **only 806K** rows

In [7]:
clinvar.Assembly

0         GRCh37
1         GRCh38
2         GRCh37
3         GRCh38
4         GRCh37
5         GRCh38
6         GRCh37
7         GRCh38
8         GRCh37
9         GRCh38
10        GRCh37
11        GRCh38
12        NCBI36
13        GRCh37
14        GRCh38
15        GRCh38
16        GRCh37
17        GRCh37
18        GRCh38
19        GRCh37
20        GRCh38
21        GRCh37
22        GRCh38
23        GRCh37
24        GRCh38
25        GRCh37
26        GRCh38
27        GRCh37
28        GRCh38
29        GRCh37
           ...  
811387    GRCh38
811388    GRCh37
811389    GRCh38
811390    GRCh38
811391    GRCh37
811392    GRCh37
811393    GRCh38
811394    GRCh37
811395    GRCh38
811396    GRCh38
811397    GRCh37
811398    GRCh38
811399    GRCh37
811400    GRCh38
811401    GRCh37
811402    GRCh37
811403    GRCh38
811404    GRCh38
811405    GRCh37
811406    GRCh38
811407    GRCh37
811408    GRCh38
811409    GRCh37
811410       NaN
811411       NaN
811412       NaN
811413       NaN
811414       N

## ...⇧
> **~5K rows have a "Reference Genome" that is none of 36, 37, or 38:**

+ For example, the NaN values in the last few rows listed above.
+ So we will only work with rows explicitly assigned to Human Reference Genome #38 (GRCh38)
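
The gap between the 811K total and the 806K grouped rows is what you would expect from pandas `groupby`, which drops rows whose key is NaN by default. The following is a minimal sketch on a made-up frame (the `toy` DataFrame is hypothetical and is not ClinVar data) showing two ways to surface those rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the ClinVar 'Assembly' column
toy = pd.DataFrame({'Assembly': ['GRCh37', 'GRCh38', 'NCBI36', np.nan, np.nan]})

# groupby() drops NaN keys by default, so the group sizes undercount the rows
grouped_total = int(toy.groupby('Assembly').size().sum())

# value_counts(dropna=False) keeps NaN as its own bucket
counts_with_nan = toy['Assembly'].value_counts(dropna=False)

# isna().sum() counts the rows with no assembly directly
missing = int(toy['Assembly'].isna().sum())

print(counts_with_nan)
print(grouped_total, missing, len(toy))
```

Running the same two calls on `clinvar['Assembly']` should show whether NaN accounts for the whole difference.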

In [11]:
# Work with Human Reference Genome #38
latest = clinvar[clinvar['Assembly'] == 'GRCh38']
latest.shape

(389089, 31)

## ...⇧

> **389K rows of unfiltered data for Reference Genome #38**

- - -

## Now let's cross-reference this data using other tools

+ The [ClinVar Advanced Search Builder](https://www.ncbi.nlm.nih.gov/clinvar/advanced/) allows you to slice ClinVar data using custom variables
+ We're interested in the most authoritative data.

> Let's start with "multiple submitters, no conflict." Should be **~53K rows**

![Should be ~ 53K rows](https://i.imgur.com/2cmVd4jl.png)


- - -

In [15]:
# Let's double check some stats on this dataframe:
latest.groupby('ReviewStatus').size()


ReviewStatus
criteria provided, conflicting interpretations           18415
criteria provided, multiple submitters, no conflicts     52969
criteria provided, single submitter                     249069
no assertion criteria provided                           48076
no assertion provided                                    10707
no interpretation for the single variant                   656
practice guideline                                          23
reviewed by expert panel                                  9174
dtype: int64

## ...⇧

> **Matching up pretty well**

+ Discrepancies in 'No Conflicts' and 'Reviewed by Experts' can be explained by the NaN values, which we have already filtered out.

![Matchup](https://i.imgur.com/pswXi9Zl.jpg)

In [17]:
# Further filter the dataset down by these 'authoritative' flags
# from the 'Review Status' column

authoritative = latest[
    (latest['ReviewStatus'] == 'criteria provided, multiple submitters, no conflicts') | 
    (latest['ReviewStatus'] == 'practice guideline') | 
    (latest['ReviewStatus'] == 'reviewed by expert panel') 
]
authoritative.shape

(62166, 31)

In [19]:
52969 + 23 + 9174

62166

## ...⇧

> **62K Rows of 'Authoritative' reviews**

+ So of the original **811K** rows of data, only **62K** rows are 'meaningful'...
+ Only about **7.7%** of this data is relevant, based on our filters (62,166 of 811,417 rows), or about **16%** of the 389K GRCh38 rows !!
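
As a cross-check, the three-way filter can also be written with `isin`, and the shares can be recomputed from the row counts reported in this notebook. This is only a sketch: the `toy` frame and the `AUTHORITATIVE_STATUSES` name are hypothetical, while the status strings and row counts are copied from the outputs above.

```python
import pandas as pd

# Hypothetical toy rows; the status strings match the ReviewStatus values above
toy = pd.DataFrame({'ReviewStatus': [
    'criteria provided, multiple submitters, no conflicts',
    'criteria provided, single submitter',
    'practice guideline',
    'reviewed by expert panel',
    'no assertion provided',
]})

# Hypothetical constant naming the three 'authoritative' review statuses
AUTHORITATIVE_STATUSES = [
    'criteria provided, multiple submitters, no conflicts',
    'practice guideline',
    'reviewed by expert panel',
]

# isin() replaces the chain of == comparisons joined with |
toy_authoritative = toy[toy['ReviewStatus'].isin(AUTHORITATIVE_STATUSES)]

# Shares based on the row counts reported in this notebook
share_of_all = 62166 / 811417      # authoritative rows / all rows
share_of_grch38 = 62166 / 389089   # authoritative rows / GRCh38 rows
print(f"{share_of_all:.1%} of all rows, {share_of_grch38:.1%} of GRCh38 rows")
```

The second share is arguably the fairer one, since the 811K total counts most variants once per assembly.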

- - -

# Idea for marketing

> Show the rise in 'authoritative' data on clinical significance over time by repeating this exercise on archived releases of the dataset. Nice little marketing tidbit.

![marketing ideas](https://i.imgur.com/gshsM2Z.png?1)

## Time to start matching the OMIM tags

+ The first step is to extract the OMIM tags into a new column

In [34]:
# Function to collect OMIM tag matches for a new column
def omim_matches(x):
    """Return every OMIM tag found in x, or "NA" if there is none."""
    # An OMIM tag is 'OMIM:' followed by 1 to 6 digits
    res = re.findall(r"(OMIM:\d{1,6})", x)
    if res:
        return res
    return "NA"
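
One caveat: `re.findall` raises a `TypeError` if a cell is NaN rather than a string. The following is a sketch of a NaN-tolerant variant using the pandas `.str.findall` accessor with the same regex. The `sample` Series is hypothetical; its values only imitate the style of the `PhenotypeIDS` column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the style of the PhenotypeIDS column
sample = pd.Series([
    'MedGen:C2673611;OMIM:235200;OMIM:612635',
    'MedGen:C2673611;MedGen:C1853063',
    np.nan,
])

# str.findall applies the regex to every cell; NaN cells stay NaN
tags = sample.str.findall(r'OMIM:\d{1,6}')

# True only where at least one OMIM tag was found
has_omim = tags.apply(lambda found: isinstance(found, list) and len(found) > 0)

print(tags)
print(has_omim)
```

Unlike `omim_matches`, this returns an empty list rather than the string "NA" for rows without a tag, so a later filter would use `has_omim` instead of comparing against "NA".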

In [35]:
# Take an explicit copy so the new column is not assigned to a slice of `latest`
authoritative = authoritative.copy()
authoritative['OMIMTAGS'] = authoritative.PhenotypeIDS.apply(omim_matches)
authoritative['OMIMTAGS']


11                     [OMIM:252010]
20        [OMIM:235200, OMIM:612635]
57                     [OMIM:613616]
70                     [OMIM:606068]
72                     [OMIM:606068]
74                     [OMIM:606068]
105                    [OMIM:613559]
107                    [OMIM:614852]
184                    [OMIM:275350]
223                    [OMIM:236200]
227                    [OMIM:236200]
229                    [OMIM:236200]
233                    [OMIM:236200]
239                    [OMIM:236200]
241                    [OMIM:236200]
246                    [OMIM:236200]
252                    [OMIM:236200]
254                    [OMIM:236200]
380       [OMIM:608091, OMIM:603194]
471                               NA
517       [OMIM:613172, OMIM:613426]
519                    [OMIM:613172]
521                    [OMIM:613172]
523                    [OMIM:613172]
572       [OMIM:193400, OMIM:613554]
657                    [OMIM:162200]
659                    [OMIM:162200]
6

In [36]:
authoritative.shape

(62166, 33)

In [39]:
onlyomim = authoritative[authoritative['OMIMTAGS'] != "NA"]
onlyomim.shape

(38590, 33)

## ...⇧
> **38K rows contain OMIM tags**: 62% of the 'authoritative' data

+ ...that's a lot of data loss: 38% of the rows drop out because they carry no OMIM tag.
+ We may want to come back and try to match by disease description instead of just the OMIM tag

In [51]:
# with a small set of rows, look at a Typical 'raw' row of 
# data. Trying to understand what a typical cell contains

typical = clinvar[clinvar['PhenotypeIDS'].str.contains("MedGen:C2673611", na=False)]

# drop columns for a simpler 'preview' of these few rows
# (remember, this is a throwaway snapshot, just trying to get a view
# of the phenotype data)

# Reassign instead of using inplace=True, which would modify a slice of `clinvar`
typical = typical.drop(columns=['#AlleleID',
     'RS# (dbSNP)',
     'ClinSigSimple',
     'nsv/esv (dbVar)',
     'RCVaccession',
     'Origin',
     'Start',
     'Stop',
     'Assembly',
     'ChromosomeAccession',
     'Chromosome',
     'ReferenceAllele',
     'AlternateAllele',
     'Cytogenetic',
     'NumberSubmitters',
     'Guidelines',
     'TestedInGTR',
     'OtherIDs',
     'SubmitterCategories',
     'VariationID'])
typical


| | Type | Name | GeneID | GeneSymbol | HGNC_ID | ClinicalSignificance | ClinSigSimple | LastEvaluated | PhenotypeIDS | PhenotypeList | OriginSimple | ReviewStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31029 | single nucleotide variant | COL7A1, IVS3DS, A-G, -2 | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Oct 01, 2006 | MedGen:C2673611;MedGen:C1853063 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31030 | single nucleotide variant | COL7A1, 5820G-A | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Nov 01, 1998 | MedGen:C2673611 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31031 | single nucleotide variant | COL7A1, IVS95DS, G-A, -1 | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Oct 01, 2006 | MedGen:C2673611;MedGen:C1853063 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31032 | single nucleotide variant | NM_000094.3(COL7A1):c.4039G>C (p.Gly1347Arg) | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Feb 29, 2016 | MedGen:C2673611;MedGen:CN517202 | Epidermolysis bullosa dystrophica, autosomal r... | germline | criteria provided, single submitter |
| 31033 | single nucleotide variant | NM_000094.3(COL7A1):c.4039G>C (p.Gly1347Arg) | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Feb 29, 2016 | MedGen:C2673611;MedGen:CN517202 | Epidermolysis bullosa dystrophica, autosomal r... | germline | criteria provided, single submitter |


## ...⇧
> **FINDING:** Looks like a typical cell contains a list of MedGen references

+ MedGen:Number
+ separated by semicolons
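
If that structure holds, each cell can be broken into one row per reference. The following is a rough sketch, assuming the semicolon is the only separator (the five preview rows suggest this but do not prove it for the whole file). The names `ids`, `exploded` and `parts` are made up for illustration.

```python
import pandas as pd

# Values copied from the PhenotypeIDS column of the preview rows
ids = pd.Series(['MedGen:C2673611;MedGen:C1853063', 'MedGen:C2673611'])

# One list of references per cell
id_lists = ids.str.split(';')

# One row per reference
exploded = id_lists.explode().reset_index(drop=True)

# Split 'Source:Identifier' at the first colon into two columns
parts = exploded.str.split(':', n=1, expand=True)
parts.columns = ['source', 'identifier']

print(parts)
```

Dropping `reset_index(drop=True)` keeps the original row index on every exploded row, which makes it possible to join the references back to their variants.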

In [50]:
# typical.to_csv(datadir + 'typical_clinvar_rows.csv')

## NOTES FROM ALICE

+ Expert Panel, Practice Guideline, multiple submitters, no conflict
+ Check out information on drug response
    + Add Drug Response (Clinical Significance > Drug Response)
+ Assembly
    + Only Keep 38


