In [1]:
import pandas as pd

In [2]:
datadir = '/home/xavier/data/'
clinvar_file = 'variant_summary_02.txt'

# First, download and survey the ClinVar data

## ➜ ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

+ A phenotype results from the expression of an organism's genetic code, its genotype, as well as the influence of environmental factors and the interactions between the two. [Wikipedia](https://en.wikipedia.org/wiki/Phenotype)
+ The file is from **[THIS FTP DIRECTORY](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)** (which is updated monthly)
+ **Download this file: [Variant_summary.txt](ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz)**, which is an 809K-row dataset.

In [3]:
clinvar = pd.read_csv(datadir + clinvar_file, sep='\t', lineterminator='\n')



In [11]:
# count the rows of data
clinvar.shape

(811417, 31)

## ...⇧
> **811K Rows of data:** The Variant Summary updated on 5/21 is 2K rows longer than the previous version.

+ The previous version of this dataset had 809K rows.

In [14]:
# Count rows per reference assembly (the goal is to work only with GRCh38)
clinvar.groupby('Assembly').size()

Assembly
GRCh37    397447
GRCh38    389089
NCBI36     19545
dtype: int64

In [15]:
397447 + 389089 + 19545

806081

## ...⇧
> **806K Rows of "Reference Genome" data:** This is an anomaly!!

+ The data has 811K rows
+ Yet the Assembly counts sum to only 806K rows

In [18]:
clinvar.Assembly

0         GRCh37
1         GRCh38
2         GRCh37
3         GRCh38
4         GRCh37
5         GRCh38
6         GRCh37
7         GRCh38
8         GRCh37
9         GRCh38
10        GRCh37
11        GRCh38
12        NCBI36
13        GRCh37
14        GRCh38
15        GRCh38
16        GRCh37
17        GRCh37
18        GRCh38
19        GRCh37
20        GRCh38
21        GRCh37
22        GRCh38
23        GRCh37
24        GRCh38
25        GRCh37
26        GRCh38
27        GRCh37
28        GRCh38
29        GRCh37
           ...  
811387    GRCh38
811388    GRCh37
811389    GRCh38
811390    GRCh38
811391    GRCh37
811392    GRCh37
811393    GRCh38
811394    GRCh37
811395    GRCh38
811396    GRCh38
811397    GRCh37
811398    GRCh38
811399    GRCh37
811400    GRCh38
811401    GRCh37
811402    GRCh37
811403    GRCh38
811404    GRCh38
811405    GRCh37
811406    GRCh38
811407    GRCh37
811408    GRCh38
811409    GRCh37
811410       NaN
811411       NaN
811412       NaN
811413       NaN
811414       N

## ...⇧
> **~5K Rows of "Reference Genome" are not 36, 37, or 38:**

+ For example, the NaN values above.
+ So we will only work with rows explicitly referenced as Human Reference Genome #38 (GRCh38)
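
The gap between the two totals (811,417 - 806,081 = 5,336 rows) is consistent with how `groupby` treats missing values: by default it leaves `NaN` keys out of the result, so rows with no Assembly are never counted. The sketch below uses a tiny made-up frame (the rows are illustrative, not taken from the ClinVar file) to show one way to count the missing entries and keep only GRCh38.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ClinVar table; these rows are invented for illustration.
toy = pd.DataFrame({
    'Assembly': ['GRCh37', 'GRCh38', 'NCBI36', np.nan, 'GRCh38', np.nan],
    'GeneSymbol': ['A', 'A', 'B', 'C', 'D', 'E'],
})

# groupby() skips NaN keys by default, so the group sizes sum to fewer rows
grouped_total = toy.groupby('Assembly').size().sum()

# value_counts(dropna=False) keeps NaN as its own bucket
counts_with_nan = toy['Assembly'].value_counts(dropna=False)

# Number of rows with no Assembly value at all
n_missing = toy['Assembly'].isna().sum()

# Keep only rows explicitly tagged GRCh38; NaN never equals 'GRCh38',
# so the rows without an Assembly drop out as well
grch38 = toy[toy['Assembly'] == 'GRCh38']

print(len(toy), grouped_total, n_missing, len(grch38))
```

On the real data, `clinvar['Assembly'].isna().sum()` should account for the rows missing from the grouped counts, if missing values are indeed the whole explanation.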

In [44]:
# Group by phenotype and count the rows for each PhenotypeIDS value
phenotypes = clinvar.groupby('PhenotypeIDS').size().reset_index(name='count')
phenotypes.sort_values(by='count', ascending=False, inplace=True)
phenotypes

| | PhenotypeIDS | count |
|---|---|---|
| 18216 | MedGen:CN169374 | 169984 |
| 20990 | na | 53995 |
| 20811 | MedGen:CN517202 | 48007 |
| 4288 | MedGen:C0027672,SNOMED CT:699346009 | 37184 |
| 18567 | MedGen:CN230736 | 5518 |
| 4017 | MedGen:C0020445,OMIM:143890,SNOMED CT:39791500... | 4720 |
| 12785 | MedGen:C2675520,OMIM:612555 | 4633 |
| 8483 | MedGen:C1333990,Orphanet:ORPHA144,SNOMED CT:31... | 4444 |
| 12948 | MedGen:C2676676,OMIM:604370 | 4322 |
| 4686 | MedGen:C0027672,SNOMED CT:699346009;MedGen:CN1... | 4186 |


## ...⇧
> **FINDING:** 21K unique groupings of Phenotypes

+ The Phenotype IDs vary a lot. Each one is a `source:code` pair (dict-like), with sources including:
    + MedGen
    + SNOMED CT
    + Human Phenotype Ontology
+ Disparate ways of listing the code
    + Human Phenotype Ontology:HP:0007018
+ Disparate separators
    + comma
    + semicolon
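
To make the `source:code` structure concrete, here is a minimal parsing sketch. The helper name `parse_phenotype_ids` is hypothetical, and the separator roles are an assumption read off the values above (semicolons between phenotypes, commas between the IDs of one phenotype); check them against the ClinVar README before relying on this.

```python
def parse_phenotype_ids(cell):
    """Split one PhenotypeIDS cell into a list of {source: code} dicts."""
    phenotypes = []
    # Missing values and the literal 'na' placeholder yield an empty list
    if not isinstance(cell, str) or cell == 'na':
        return phenotypes
    for group in cell.split(';'):        # assumed: ';' separates phenotypes
        ids = {}
        for pair in group.split(','):    # assumed: ',' separates IDs of one phenotype
            # Split on the FIRST colon only, so codes such as 'HP:0007018'
            # keep their own internal colon
            source, _, code = pair.strip().partition(':')
            if code:
                ids[source] = code
        if ids:
            phenotypes.append(ids)
    return phenotypes


print(parse_phenotype_ids('MedGen:C0027672,SNOMED CT:699346009'))
print(parse_phenotype_ids('MedGen:C2673611;MedGen:C1853063'))
print(parse_phenotype_ids('Human Phenotype Ontology:HP:0007018'))
```

Splitting on the first colon only is what handles the "disparate ways of listing the code": the source name ends at the first colon and everything after it is treated as the code.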


In [51]:
# Take a small set of rows and look at a typical 'raw' row of data,
# to understand what a typical PhenotypeIDS cell contains.
# .copy() makes `typical` an independent DataFrame, so the in-place
# drop below does not trigger a SettingWithCopyWarning.

typical = clinvar[clinvar['PhenotypeIDS'].str.contains("MedGen:C2673611", na=False)].copy()

# Drop columns for a simpler 'preview' of these few rows
# (remember, this is a throwaway snapshot, just to get a view
# of the phenotype data)

typical.drop(['#AlleleID',
     'RS# (dbSNP)',
     'ClinSigSimple',
     'nsv/esv (dbVar)',
     'RCVaccession',
     'Origin',
     'Start',
     'Stop',
     'Assembly',
     'ChromosomeAccession',
     'Chromosome',
     'ReferenceAllele',
     'AlternateAllele',
     'Cytogenetic',
     'NumberSubmitters',
     'Guidelines',
     'TestedInGTR',
     'OtherIDs',
     'SubmitterCategories',
     'VariationID'], axis=1, inplace=True)
typical


| | Type | Name | GeneID | GeneSymbol | HGNC_ID | ClinicalSignificance | ClinSigSimple | LastEvaluated | PhenotypeIDS | PhenotypeList | OriginSimple | ReviewStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31029 | single nucleotide variant | COL7A1, IVS3DS, A-G, -2 | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Oct 01, 2006 | MedGen:C2673611;MedGen:C1853063 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31030 | single nucleotide variant | COL7A1, 5820G-A | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Nov 01, 1998 | MedGen:C2673611 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31031 | single nucleotide variant | COL7A1, IVS95DS, G-A, -1 | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Oct 01, 2006 | MedGen:C2673611;MedGen:C1853063 | Epidermolysis bullosa dystrophica, autosomal r... | germline | no assertion criteria provided |
| 31032 | single nucleotide variant | NM_000094.3(COL7A1):c.4039G>C (p.Gly1347Arg) | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Feb 29, 2016 | MedGen:C2673611;MedGen:CN517202 | Epidermolysis bullosa dystrophica, autosomal r... | germline | criteria provided, single submitter |
| 31033 | single nucleotide variant | NM_000094.3(COL7A1):c.4039G>C (p.Gly1347Arg) | 1294 | COL7A1 | HGNC:2214 | Pathogenic | 1 | Feb 29, 2016 | MedGen:C2673611;MedGen:CN517202 | Epidermolysis bullosa dystrophica, autosomal r... | germline | criteria provided, single submitter |


## ...⇧
> **FINDING:** Looks like a typical cell contains a list of MedGen references

+ each in the form `MedGen:<ID>`, for example `MedGen:C2673611`
+ separated by semicolons
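
Since a cell holds several IDs, counting rows per cell (as the earlier group-by does) is not the same as counting rows per individual ID. A small sketch of the per-ID count, using only the five PhenotypeIDS values from the preview above; it assumes a pandas version that provides `Series.explode` (0.25 or later), which may be newer than the one used in this notebook.

```python
import pandas as pd

# The PhenotypeIDS values of the five preview rows
ids = pd.Series([
    'MedGen:C2673611;MedGen:C1853063',
    'MedGen:C2673611',
    'MedGen:C2673611;MedGen:C1853063',
    'MedGen:C2673611;MedGen:CN517202',
    'MedGen:C2673611;MedGen:CN517202',
])

# Split each cell on ';' and give every ID its own row
per_id = ids.str.split(';').explode()

# Count how often each individual MedGen ID appears
id_counts = per_id.value_counts()
print(id_counts)
```

Here `MedGen:C2673611` is counted once for each of the five rows, while the other two IDs are counted twice each.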

In [50]:
# typical.to_csv(datadir + 'typical_clinvar_rows.csv')

## NOTES FROM ALICE

+ Expert Panel, Practice Guideline, multiple submitters, no conflict
+ Check out information on drug response
    + Add Drug Response (clinical significance > Drug Response)
+ Assembly
    + Only Keep 38
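
One possible reading of these notes, sketched as boolean filters on a toy frame. The `ReviewStatus` and `ClinicalSignificance` strings below are assumed spellings and the `trusted_status` list is hypothetical; both should be checked against the values that actually occur in the file (for example with `clinvar['ReviewStatus'].value_counts()`).

```python
import pandas as pd

# Toy rows; the status and significance strings are assumed spellings.
toy = pd.DataFrame({
    'Assembly': ['GRCh38', 'GRCh37', 'GRCh38', 'GRCh38'],
    'ReviewStatus': [
        'reviewed by expert panel',
        'reviewed by expert panel',
        'no assertion criteria provided',
        'criteria provided, multiple submitters, no conflicts',
    ],
    'ClinicalSignificance': ['drug response', 'Pathogenic', 'Pathogenic', 'Benign'],
})

# Hypothetical list of review statuses to keep, following the notes above
trusted_status = [
    'reviewed by expert panel',
    'practice guideline',
    'criteria provided, multiple submitters, no conflicts',
]

# One boolean mask per note
is_grch38 = toy['Assembly'] == 'GRCh38'
is_trusted = toy['ReviewStatus'].isin(trusted_status)
is_drug_response = toy['ClinicalSignificance'].str.contains(
    'drug response', case=False, na=False)

# Rows on GRCh38 with a trusted review status
kept = toy[is_grch38 & is_trusted]

# Rows on GRCh38 that report a drug response
drug_rows = toy[is_grch38 & is_drug_response]

print(len(kept), len(drug_rows))
```

Keeping each condition as its own named mask makes it easy to combine or relax them once the exact criteria are settled.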


