In [1]:
import pandas as pd
import re

In [2]:
datadir = '/home/xavier/data/'
clinvar_file = 'variant_summary_02.txt' # Updated 5/21/2018

# First thing, download and survey the 'ClinVar' data

## ➜ ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

+ A phenotype results from the expression of an organism's genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. [Wikipedia](https://en.wikipedia.org/wiki/Phenotype)
+ File is from **[THIS FTP DIRECTORY](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)** (which is updated monthly)
+ **Download this file: [Variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)**, which is an 809K-row dataset.
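
As a rough sketch of the download step: pandas can read a gzip-compressed, tab-delimited file directly, so the `.gz` from the FTP directory does not need to be unpacked first. The tiny file written below is a made-up stand-in (hypothetical file name and rows), not real ClinVar data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Stand-in for variant_summary.txt.gz: two made-up, tab-delimited rows.
sample = "AlleleID\tAssembly\n1\tGRCh37\n1\tGRCh38\n"

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "variant_summary_sample.txt.gz")
with gzip.open(path, "wt") as handle:
    handle.write(sample)

# Compression is inferred from the .gz extension, so no manual unzip is needed.
demo = pd.read_csv(path, sep="\t")
print(demo.shape)
```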

In [3]:
clinvar = pd.read_csv(datadir + clinvar_file, sep='\t', lineterminator='\n')



In [4]:
# count the rows of data
clinvar.shape

(811417, 31)

## ...⇧
> **811K Rows of data:** The Variant Summary updated on 5/21 is 2K rows longer than the previous version

+ Previous was 809K rows in this dataset.

- - -

In [5]:
# What are the Reference Genomes in this data?
clinvar.groupby('Assembly').size()

Assembly
GRCh37    397447
GRCh38    389089
NCBI36     19545
dtype: int64

In [6]:
397447 + 389089 + 19545

806081

## ...⇧
> **806K Rows of "Reference Genome" data:** This is an anomaly!!

+ Data has **811K** rows
+ Yet the Assembly counts **only 806K** rows
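
The gap (811417 - 806081 = 5336 rows) is consistent with how `groupby` works: it drops rows whose key is NaN by default, so they never show up in the group sizes. A minimal sketch on a toy frame (made-up values, not ClinVar data) of how to surface them:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Assembly column (made-up values).
toy = pd.DataFrame({"Assembly": ["GRCh37", "GRCh38", "NCBI36", np.nan, np.nan]})

# groupby() drops NaN keys by default, so the group sizes undercount the rows.
grouped_total = toy.groupby("Assembly").size().sum()

# value_counts(dropna=False) keeps the missing values as their own bucket.
full_counts = toy["Assembly"].value_counts(dropna=False)

# Direct count of rows with no assembly at all.
missing = toy["Assembly"].isna().sum()
print(grouped_total, len(toy), missing)
```

On the real frame, `clinvar['Assembly'].isna().sum()` would be the number to compare against the 5336-row gap.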

In [7]:
clinvar.Assembly

0         GRCh37
1         GRCh38
2         GRCh37
3         GRCh38
4         GRCh37
5         GRCh38
6         GRCh37
7         GRCh38
8         GRCh37
9         GRCh38
10        GRCh37
11        GRCh38
12        NCBI36
13        GRCh37
14        GRCh38
15        GRCh38
16        GRCh37
17        GRCh37
18        GRCh38
19        GRCh37
20        GRCh38
21        GRCh37
22        GRCh38
23        GRCh37
24        GRCh38
25        GRCh37
26        GRCh38
27        GRCh37
28        GRCh38
29        GRCh37
           ...  
811387    GRCh38
811388    GRCh37
811389    GRCh38
811390    GRCh38
811391    GRCh37
811392    GRCh37
811393    GRCh38
811394    GRCh37
811395    GRCh38
811396    GRCh38
811397    GRCh37
811398    GRCh38
811399    GRCh37
811400    GRCh38
811401    GRCh37
811402    GRCh37
811403    GRCh38
811404    GRCh38
811405    GRCh37
811406    GRCh38
811407    GRCh37
811408    GRCh38
811409    GRCh37
811410       NaN
811411       NaN
811412       NaN
811413       NaN
811414       N

## ...⇧
> **~5K Rows of "Reference Genome" are neither 36, 37, nor 38:**

+ For example, the NaN values in the last few rows listed above.
+ So we will only work with rows explicitly assigned to Human Reference Genome #38 (GRCh38)

In [8]:
# Work with Human Reference Genome #38
latest = clinvar[clinvar['Assembly'] == 'GRCh38']
latest.shape

(389089, 31)

## ...⇧

> **389K Unfiltered rows of data for Reference Genome #38**

- - -

## Now let's cross-reference some counts from this data using other tools

+ The [ClinVar Advanced Search Builder](https://www.ncbi.nlm.nih.gov/clinvar/advanced/) allows you to slice ClinVar data using custom variables
+ We're interested in the most authoritative data.

> Let's start with "multiple submitters, no conflict." Should be **~53K rows**

![Should be ~ 53K rows](https://i.imgur.com/2cmVd4jl.png)


- - -

In [9]:
# Let's double check some stats on this dataframe:
latest.groupby('ReviewStatus').size()


ReviewStatus
criteria provided, conflicting interpretations           18415
criteria provided, multiple submitters, no conflicts     52969
criteria provided, single submitter                     249069
no assertion criteria provided                           48076
no assertion provided                                    10707
no interpretation for the single variant                   656
practice guideline                                          23
reviewed by expert panel                                  9174
dtype: int64

## ...⇧

> **Matching up pretty well**

+ Discrepancies in the 'No Conflicts' and 'Reviewed by Experts' counts can be explained by NaN values, which are filtered out of our dataset but remain in the 'Advanced Search Builder' results

![Matchup](https://i.imgur.com/pswXi9Zl.jpg)

In [10]:
# Further filter the dataset down by these 'authoritative' flags
# from the 'Review Status' column

authoritative = latest[
    (latest['ReviewStatus'] == 'criteria provided, multiple submitters, no conflicts') | 
    (latest['ReviewStatus'] == 'practice guideline') | 
    (latest['ReviewStatus'] == 'reviewed by expert panel') 
]
authoritative.shape

(62166, 31)

In [11]:
# double check
52969 + 23 + 9174

62166

In [12]:
# percentage of total Ref #38 data
(62166 / 389089) * 100

15.97732138405352

## ...⇧

> **62K Rows of 'Authoritative' reviews**

+ So of the original **389K** rows of data, only **62K** rows are 'meaningful'.
+ Only **16%** of this data is 'authoritative', based on our filters.
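
The same filter can be written more compactly with `isin`, which also makes the share easy to compute from the boolean mask. This is only a sketch on a toy frame with made-up rows; the three status labels are the ones used in the filter above, and `AUTHORITATIVE_STATUSES` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with made-up rows; only the status labels come from the real data.
toy = pd.DataFrame({
    "ReviewStatus": [
        "criteria provided, multiple submitters, no conflicts",
        "criteria provided, single submitter",
        "practice guideline",
        "reviewed by expert panel",
        "no assertion provided",
    ]
})

# Hypothetical constant holding the three 'authoritative' flags.
AUTHORITATIVE_STATUSES = [
    "criteria provided, multiple submitters, no conflicts",
    "practice guideline",
    "reviewed by expert panel",
]

mask = toy["ReviewStatus"].isin(AUTHORITATIVE_STATUSES)
subset = toy[mask].copy()   # .copy() so new columns can be added safely later
share = 100 * mask.mean()   # mean of a boolean mask = fraction of True rows
print(len(subset), share)
```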

- - -

# Idea for marketing

> Show the rise in 'authoritative' data on clinical significance by applying this exercise back in time on retrospective datasets. ➜ This would demonstrate how rapidly authoritative correlations in markers are rising, over a period of months. (Nice little marketing tidbit.)

![marketing ideas](https://i.imgur.com/gshsM2Zm.png?1)

## Time to start matching the OMIM tags

+ First order is to filter for only OMIM tags into a new column

In [224]:
# Function to write OMIM tag matches to a new column
def omim_matches(x):
    # Find every OMIM tag, from 1 to 6 digits long; tags come back as strings
    res = re.findall(r"OMIM:(\d{1,6})", x)
    if res:
        return res
    # No OMIM tag found: return the "NA" sentinel that is filtered on later
    return "NA"
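
To see what the pattern captures, here is a small stand-alone sketch. The sample strings are made up for illustration (the real `PhenotypeIDS` format may differ), and the `isinstance` guard is an addition that is not in the function above: `re.findall` raises a `TypeError` on non-string input such as NaN.

```python
import re

def omim_tags(text):
    """Return all OMIM numbers in text as a list of strings (empty if none)."""
    if not isinstance(text, str):
        # Guard for missing values (e.g. NaN floats), which re.findall rejects.
        return []
    return re.findall(r"OMIM:(\d{1,6})", text)

# Made-up sample strings.
two_tags = omim_tags("MedGen:C0000001,OMIM:235200;MedGen:C0000002,OMIM:612635")
no_tags = omim_tags("MedGen:C0000003")
not_text = omim_tags(float("nan"))
print(two_tags, no_tags, not_text)
```

Note that the captured tags are strings, not integers, which matters when comparing them against numeric tag lists later.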

In [225]:
authoritative['OMIMTAGS'] = authoritative.PhenotypeIDS.apply(omim_matches)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [226]:
authoritative['OMIMTAGS']

11                [252010]
20        [235200, 612635]
57                [613616]
70                [606068]
72                [606068]
74                [606068]
105               [613559]
107               [614852]
184               [275350]
223               [236200]
227               [236200]
229               [236200]
233               [236200]
239               [236200]
241               [236200]
246               [236200]
252               [236200]
254               [236200]
380       [608091, 603194]
471                     NA
517       [613172, 613426]
519               [613172]
521               [613172]
523               [613172]
572       [193400, 613554]
657               [162200]
659               [162200]
663               [162200]
688       [162210, 162200]
691               [162200]
                ...       
808067            [605259]
808068            [209950]
808070            [236200]
808073            [600920]
808074            [105400]
808089            [612067]
8

In [16]:
authoritative.shape

(62166, 32)
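
The "value is trying to be set on a copy of a slice" warning appears because `authoritative` was produced by filtering another DataFrame. One common way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit `.copy()` right after filtering and only then add columns. `HAS_OMIM` is a hypothetical column used just for this illustration.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Assembly": ["GRCh37", "GRCh38", "GRCh38"],
    "PhenotypeIDS": ["OMIM:1", "OMIM:2", "MedGen:C1"],
})

# .copy() makes the filtered frame independent of its parent,
# so assigning a new column is unambiguous.
latest_toy = toy[toy["Assembly"] == "GRCh38"].copy()
latest_toy["HAS_OMIM"] = latest_toy["PhenotypeIDS"].str.contains("OMIM:")

print(latest_toy)
```

The parent frame `toy` is left untouched by the assignment.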

In [17]:
onlyomim = authoritative[authoritative['OMIMTAGS'] != "NA"]

In [18]:
onlyomim.shape

(38590, 32)

In [19]:
# percentage of 'authoritative' rows kept after requiring an OMIM tag
(onlyomim.shape[0]/authoritative.shape[0])*100

62.075732715632334

## ...⇧
> **38K rows contain OMIM tags**: 62% of total "Expert Reviewed, No Conflict" data

+ ...that's a lot of data loss: the other 38% of rows are dropped because they carry no OMIM tag.
+ We may want to come back and try to match by disease description instead of just the OMIM tag

In [20]:
# create a column that counts the number of matches
onlyomim["MATCHCOUNT"] = onlyomim["OMIMTAGS"].apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [21]:
# Table of the number of matches
onlyomim.groupby('MATCHCOUNT').size()

MATCHCOUNT
1     31562
2      4628
3      1164
4       383
5       688
6        57
7        48
8        21
9         9
10       21
11        4
12        1
13        2
14        1
19        1
dtype: int64

## ...⇧
> **Significant number have multiple OMIM tags**: 

+ 12% of rows have 2 OMIM tags
+ 3% of rows have 3 OMIM tags
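
The percentages can be derived in one step rather than one cell per count. A small sketch using the first three counts from the table above and the 38590-row total:

```python
import pandas as pd

# Counts copied from the MATCHCOUNT table above (rows with 1, 2, and 3 tags).
counts = pd.Series({1: 31562, 2: 4628, 3: 1164}, name="rows")
total = 38590  # onlyomim.shape[0]

# Share of all OMIM-tagged rows, in percent, rounded to one decimal.
pct = (100 * counts / total).round(1)
print(pct)
```

On the live frame, `onlyomim['MATCHCOUNT'].value_counts(normalize=True)` should give the same fractions for every count at once.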


In [22]:
(1164/38590)*100

3.0163254729204456

In [23]:
(4628/38590)*100

11.992744234257579

In [24]:
# write the first 1000 rows to a swapfile
onlyomim.iloc[0:1000].to_csv(datadir + 'onlyomim_matches-1000.csv')

# Not sure how to handle multiple OMIM matches to the ACMG diseases.

+ Let's not overthink it. 
+ What really matters is how many OMIM tags match...NOT how many OMIM tags exist.
## ...⇧
> **Let's match the OMIM tags against the ACMG Disease Names**: 

+ Working with [ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing](https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/)

In [165]:
acmg = pd.read_csv(datadir + 'ACMG_Conditions_OMIM.csv')

In [166]:
acmg.shape

(59, 2)

In [167]:
len(set(acmg.DISEASE))

59

## ...⇧
> **59 Unique Disease names listed**: 

+ 59 rows in the file, 59 unique disease names
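
A slightly more direct uniqueness check than `len(set(...))` is sketched below on a toy frame with made-up disease names; it also shows how duplicated rows could be listed if the counts ever disagreed.

```python
import pandas as pd

# Toy frame with made-up names; "B" is deliberately repeated.
toy = pd.DataFrame({"DISEASE": ["A", "B", "B"], "TAGS": [1, 2, 3]})

all_unique = toy["DISEASE"].is_unique          # False here, because of "B"
n_unique = toy["DISEASE"].nunique()            # number of distinct names
dupes = toy[toy["DISEASE"].duplicated(keep=False)]  # every row of a repeated name
print(all_unique, n_unique, len(dupes))
```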

In [217]:
acmgtags = acmg.TAGS.tolist()

In [218]:
acmgtags

[175100,
 132900,
 611788,
 604400,
 607450,
 609040,
 610193,
 610476,
 604370,
 612555,
 601144,
 604772,
 115200,
 130050,
 301500,
 143890,
 192600,
 115196,
 115197,
 600858,
 613690,
 608751,
 608758,
 612098,
 155240,
 603776,
 174900,
 601494,
 151623,
 609192,
 610168,
 608967,
 610380,
 613795,
 192500,
 613688,
 603830,
 120435,
 145600,
 154700,
 131100,
 171400,
 162300,
 608456,
 101000,
 311250,
 168000,
 601650,
 605373,
 115310,
 175200,
 132600,
 153480,
 180200,
 191100,
 613254,
 193300,
 194070,
 277900]

In [187]:
# set(acmgtags)

In [201]:
slice = onlyomim.iloc[80:90]

In [206]:
slice.OMIMTAGS.iloc[0:1]

1386    [612285]
Name: OMIMTAGS, dtype: object

In [203]:
# tags = pd.Series(slice.OMIMTAGS).reset_index(drop=True)
# tags

In [207]:
for item in slice.OMIMTAGS:
    print(len(item))

1
1
1
2
1
1
1
1
1
1


In [213]:
mylist = [612285,100100,100111]
mylist

[612285, 100100, 100111]

In [215]:
slice["NEW"] = slice.OMIMTAGS.apply(lambda x: any(item in x for item in mylist))
# slice["NEW"] = slice.OMIMTAGS.apply(lambda x: any(item in x for item in acmgtags))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [216]:
slice.NEW

1386    False
1389    False
1484    False
1490    False
1501    False
1503    False
1506    False
1510    False
1514    False
1518    False
Name: NEW, dtype: bool
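
Every row comes back `False`, including row 1386, whose only tag is 612285. The likely cause is a type mismatch: the regex returns tags as strings, while `mylist` (and `acmgtags` as listed above) holds integers, and `612285 in ['612285']` is `False`. A minimal sketch of one fix, on a toy frame with made-up rows, is to normalise the lookup values to strings and intersect as sets. The column names `MATCHED` and `HAS_MATCH` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the OMIMTAGS column: lists of tag strings.
toy = pd.DataFrame({"OMIMTAGS": [["612285"], ["216360"], ["175100", "114500"]]})
wanted_ints = [612285, 100100, 100111]

# Mismatched types: an int is never "in" a list of strings.
naive = toy["OMIMTAGS"].apply(lambda tags: any(item in tags for item in wanted_ints))

# Normalise the lookup values to strings once, then intersect as sets.
wanted = {str(item) for item in wanted_ints}
toy["MATCHED"] = toy["OMIMTAGS"].apply(lambda tags: sorted(wanted.intersection(tags)))
toy["HAS_MATCH"] = toy["MATCHED"].apply(bool)  # empty list -> False

print(naive.tolist())
print(toy)
```

Keeping the matched tags themselves, rather than only a boolean, fits the earlier note that what matters is how many OMIM tags match: `len` of `MATCHED` gives that count per row.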

In [179]:
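def matcher(x):
    # Debug output: show the two sets being compared (printed once per row)
    print(set(x))
    print(set(mylist))
    # Return the tags shared by this row and mylist (an empty set if none match)
    return set(x).intersection(mylist)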

In [180]:
acmgtags

0     175100
1     132900
2     611788
3     604400
4     607450
5     609040
6     610193
7     610476
8     604370
9     612555
10    601144
11    604772
12    115200
13    130050
14    301500
15    143890
16    192600
17    115196
18    115197
19    600858
20    613690
21    608751
22    608758
23    612098
24    155240
25    603776
26    174900
27    601494
28    151623
29    609192
30    610168
31    608967
32    610380
33    613795
34    192500
35    613688
36    603830
37    120435
38    145600
39    154700
40    131100
41    171400
42    162300
43    608456
44    101000
45    311250
46    168000
47    601650
48    605373
49    115310
50    175200
51    132600
52    153480
53    180200
54    191100
55    613254
56    193300
57    194070
58    277900
Name: TAGS, dtype: object

{'612285'}
{'100100', '001001', '612285'}
{'216360'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100', '114500'}
{'100100', '001001', '612285'}
{'175100', '114500'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}
{'175100'}
{'100100', '001001', '612285'}


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [122]:
slice.NEW

1386    None
1389    None
1484    None
1490    None
1501    None
1503    None
1506    None
1510    None
1514    None
1518    None
Name: NEW, dtype: object

## NOTES FROM ALICE

+ Expert Panel, Practice Guideline, multiple submitters, no conflict
+ Check out information on drug response
    + Add Drug Response (clinical significance > Drug Response)
+ Assembly
    + Only Keep 38
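
One possible reading of these notes, sketched on a toy frame with made-up rows: keep GRCh38 rows that are either in one of the three trusted review statuses or flagged as drug response. Both the `ClinicalSignificance` column name and the "drug response" label are assumptions here, as is combining the two conditions with OR; the notes do not say how they should be combined.

```python
import pandas as pd

# Toy frame with made-up rows.
toy = pd.DataFrame({
    "Assembly": ["GRCh38", "GRCh38", "GRCh37", "GRCh38"],
    "ReviewStatus": [
        "reviewed by expert panel",
        "criteria provided, single submitter",
        "practice guideline",
        "criteria provided, single submitter",
    ],
    # Assumed column name and value label.
    "ClinicalSignificance": ["Pathogenic", "drug response", "Pathogenic", "Benign"],
})

trusted = toy["ReviewStatus"].isin([
    "criteria provided, multiple submitters, no conflicts",
    "practice guideline",
    "reviewed by expert panel",
])
drug = toy["ClinicalSignificance"].str.contains("drug response", case=False, na=False)

# Only keep assembly 38, then accept trusted reviews OR drug-response rows.
keep = toy[(toy["Assembly"] == "GRCh38") & (trusted | drug)]
print(keep.index.tolist())
```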


