# Creating Annotated Table of Global Distinct Variants

All tables containing variants in Impala will be used to create a global table of distinct variants. These variants will be given the following annotations:

- SnpEff coding consequence predictions
- DANN scores
- Ensembl gene annotations
- dbSNP rsID
- ClinVar clinical significance rating
- Kaviar allele frequency

To be added when available:

- PFAM
- CADD

### Script Requirements

- Ibis: install Ibis from git master https://github.com/cloudera/ibis
- Impyla: install impyla from git master https://github.com/cloudera/impyla
- SnpEff: http://snpeff.sourceforge.net/
- Reference (downloaded automatically): GRCh37.74, to match the VCF files

### Connect to Impala with Ibis

Before running this cell, update the 'user' argument of the hdfs connection to your own user name.

In [60]:
import ibis
import os

# connect to impala with ibis
hdfs_port = os.environ.get('glados20', 50070)
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True
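
As an alternative to editing the host and user names by hand, the connection settings could be collected from environment variables. This is only a sketch: the variable names `HDFS_HOST`, `HDFS_PORT`, `HDFS_USER`, `IMPALA_HOST` and `IMPALA_PORT` are hypothetical, and the defaults simply repeat the values used in the cell above.

```python
import os


def connection_settings(env=None):
    """Collect connection settings, falling back to the notebook's values.

    The environment variable names are hypothetical; they are not defined
    by Ibis or Impala.
    """
    env = os.environ if env is None else env
    return {
        'hdfs_host': env.get('HDFS_HOST', 'glados20'),
        'hdfs_port': int(env.get('HDFS_PORT', 50070)),
        'hdfs_user': env.get('HDFS_USER', 'selasady'),
        'impala_host': env.get('IMPALA_HOST', 'glados19'),
        'impala_port': int(env.get('IMPALA_PORT', 21050)),
    }


settings = connection_settings()
print(settings['hdfs_host'], settings['impala_host'])
```

The resulting values would then be passed to `ibis.hdfs_connect` and `ibis.impala.connect` in place of the literals.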

## Add All Variants and Annotations from Reference Tables

All reference sources containing variants were outer joined to compile a list of all variants found in Impala.
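
To illustrate what the outer join does, here is a minimal pure-Python sketch that merges two sources keyed on (chrom, pos, ref, alt). It is not the Ibis code used in this notebook; the function name and the toy records are invented for illustration, and `None` stands in for SQL NULL.

```python
def full_outer_join(left, right, left_cols, right_cols):
    """Full outer join of two dicts keyed by (chrom, pos, ref, alt)."""
    joined = {}
    for key in set(left) | set(right):
        # start with every column set to None, which plays the role of NULL
        row = dict.fromkeys(left_cols + right_cols)
        row.update(left.get(key, {}))
        row.update(right.get(key, {}))
        joined[key] = row
    return joined


# toy records, invented for illustration only
kaviar_rows = {('1', 100, 'A', 'G'): {'kav_freq': 0.01},
               ('2', 200, 'C', 'T'): {'kav_freq': 0.20}}
clinvar_rows = {('2', 200, 'C', 'T'): {'clin_sig': '5'},
                ('3', 300, 'G', 'A'): {'clin_sig': '2'}}

merged = full_outer_join(kaviar_rows, clinvar_rows, ['kav_freq'], ['clin_sig'])
for key in sorted(merged):
    print(key, merged[key])
```

A variant seen in only one source still gets a row; the columns from the other source stay empty, which matches the NaN entries in the preview of the combined table.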

TODO: Variants from CGI and Illumina will be added after normalized files are available. 

### Create connections to each needed table

In [61]:
db = 'p7_ref_grch37'

# connect to variant tables
kaviar = con.table('kaviar_isb', database=db)
clinvar = con.table('clinvar', database=db) 
dbsnp = con.table('dbsnp', database=db)

### Create Views with Distinct Subsets of each table

In [62]:
# limit dbsnp to build 142
dbsnp = dbsnp[dbsnp.dbsnpbuildid == '142']

# subset the data for distinct elements from the needed columns
kav_sub = kaviar['chrom', 'pos', 'ref', 'alt', 'allele_freq', 'sources'].distinct()
clin_sub = clinvar['chrom', 'pos', 'ref', 'alt', 'clin_sig', 'clin_dbn'].distinct()
dbsnp_sub = dbsnp['chrom', 'pos', 'ref', 'alt', 'rs_id'].distinct()

In [33]:
# create views on impala
con.create_view('kav_distinct', kav_sub, database='training')
con.create_view('clin_distinct', clin_sub, database='training')
con.create_view('dbsnp_distinct', dbsnp_sub, database='training')

In [63]:
# create an object for each view
kav = con.table('kav_distinct', database='training')
clin = con.table('clin_distinct', database='training')
dbsnp = con.table('dbsnp_distinct', database='training')

### Outer Join between Kaviar_ISB and ClinVar

The Kaviar_ISB and ClinVar tables were outer joined, keeping all variants found in either table, along with the ClinVar significance rating and phenotype name.

Kaviar was parsed separately for SNPs and for indels.
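
The split between SNPs and other variants can be sketched in plain Python. The rule used here, that a SNP has a single-base ref and a single-base alt, is an assumption about the intended definition; the function names are hypothetical, and the alleles are taken from the table preview later in this notebook.

```python
def is_snp(ref, alt):
    """True when both alleles are exactly one base long."""
    return len(ref) == 1 and len(alt) == 1


def variant_class(ref, alt):
    """Label a variant as a SNP or as an indel/other multi-base change."""
    return 'snp' if is_snp(ref, alt) else 'indel_or_other'


for ref, alt in [('A', 'G'), ('AG', 'A'), ('TG', 'TT')]:
    print(ref, alt, variant_class(ref, alt))
```

Note that under this rule a deletion such as AG to A is not a SNP even though its alt allele is one base long, so testing the alt length alone would misclassify it.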

In [65]:
# join clinvar and kaviar for SNPs

# create statements to join clinvar and kaviar
# note: the earlier test len(str(kav.alt)) == 1 is evaluated by Python on the
# expression object, not on the column values, and this cell raised
# NotImplementedError; the Ibis string method length() is used instead
clin_expr = [kav.chrom == clin.chrom, kav.pos == clin.pos, kav.ref == clin.ref,
             kav.alt == clin.alt, kav.ref.length() == 1, kav.alt.length() == 1]

# outer join clinvar with kaviar_isb
clin_kav_snp = kav.outer_join(clin, clin_expr)[kav.chrom, kav.pos,
           kav.ref, kav.alt, kav.allele_freq.name('kav_freq'),
           kav.sources.name('kav_source'), clin.clin_sig, clin.clin_dbn]

In [None]:
# join clinvar and kaviar for indels

# create statements to join clinvar and kaviar; an indel is taken to be any
# variant whose ref or alt allele is longer than one base
clin_expr = [kav.chrom == clin.chrom, kav.pos == clin.pos, kav.ref == clin.ref,
             kav.alt == clin.alt, (kav.ref.length() > 1) | (kav.alt.length() > 1)]

# outer join clinvar with kaviar_isb
clin_kav_indel = kav.outer_join(clin, clin_expr)[kav.chrom, kav.pos,
           kav.ref, kav.alt, kav.allele_freq.name('kav_freq'),
           kav.sources.name('kav_source'), clin.clin_sig, clin.clin_dbn]

In [6]:
# create statements to join the clin_kav table (the combined Kaviar/ClinVar join) with dbsnp
dbsnp_exprs = [clin_kav.chrom == dbsnp.chrom, clin_kav.pos == dbsnp.pos, \
               clin_kav.ref == dbsnp.ref, clin_kav.alt == dbsnp.alt]

# outer join clin_kav with dbsnp
all_vars = clin_kav.outer_join(dbsnp, dbsnp_exprs)[clin_kav, dbsnp.rs_id]

In [87]:
# reduce all_vars to distinct entries and save as a table on impala for downstream analysis
con.create_table('all_vars', all_vars.distinct(), database='training')

In [2]:
# create a connection to the all_vars table and preview
all_vars = con.table('all_vars', database='training')
print(all_vars.limit(5))

  chrom        pos ref alt  kav_freq  \
0     5  174017416  TG  TT  0.002075   
1     1  106839757   A   G  0.001153   
2     X    9421656  AG   A  0.000038   
3     7  101028028   A   G  0.000038   
4    20   14800837   A   G  0.000038   

                                          kav_source clin_sig clin_dbn  \
0  Inova_Illumina_founders|SS6004468|UK10K|phase3...      NaN      NaN   
1  ADNI|ISB_founders|Inova_CGI_founders|Inova_Ill...      NaN      NaN   
2                                              UK10K      NaN      NaN   
3                                 Inova_CGI_founders      NaN      NaN   
4                                        GS000016338      NaN      NaN   

         rs_id  
0          NaN  
1  rs538323235  
2          NaN  
3          NaN  
4          NaN  


## Add Regional Annotations

Region-based annotations will be added to speed up downstream analysis and lookups. 

TODO: Add CADD and PFAM annotations when available. 

In [5]:
# connect to regional annotation tables
dann = con.table('dann', database='p7_ref_grch37')
ensembl = con.table('ensembl_genes', database='p7_ref_grch37')
dbnsfp = con.table('dbnsfp_variant', database='p7_ref_grch37')

### Create a subset from each table

In [6]:
# subset tables for only required columns and distinct rows
ensembl_sub = ensembl['chrom', 'start', 'stop', 'strand', 'gene_name', 'gene_id'].distinct()
dbnsfp_sub = dbnsfp['chrom', 'pos', 'ref', 'alt', 'sift_score', 'sift_pred', 'polyphen2_hdiv_score', \
             'polyphen2_hvar_score', 'cadd_raw', 'interpro_domain'].distinct()

In [8]:
# create views on impala
con.create_view('ens_distinct', ensembl_sub, database='training')
con.create_view('dbnsfp_distinct', dbnsfp_sub, database='training')

In [9]:
# create a connection to each view
ens = con.table('ens_distinct', database='training')
dbnsfp_dist = con.table('dbnsfp_distinct', database='training')

### Add Regional Annotations

The following annotations will be added one chromosome at a time, as these tables are getting very large.
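
A minimal sketch of the per-chromosome batching, assuming the distinct chromosome labels have already been fetched into a Python list (with `None` standing for NULL). The helper name is hypothetical.

```python
def chromosomes_to_process(distinct_chroms, already_done=()):
    """Return chromosome labels to process, skipping NULLs and finished ones."""
    done = set(already_done)
    return [c for c in distinct_chroms if c is not None and c not in done]


# labels as they might come back from a DISTINCT query
found = ['14', '6', '21', 'X', '1', None, '2', '11']
for chrom in chromosomes_to_process(found, already_done=['1']):
    print('processing chromosome', chrom)
```

Comparing whole labels matters here: removing the character '1' from a string of labels would also alter '11', '14' and '21'.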

In [45]:
# create case statement to pick the DANN score that matches the alt allele
dann_case = (ibis.case()
            .when(all_vars.alt == 'A', dann.score_a)
            .when(all_vars.alt == 'T', dann.score_t)
            .when(all_vars.alt == 'C', dann.score_c)
            .when(all_vars.alt == 'G', dann.score_g)
            .else_(ibis.NA)
            .end())    

# create table with variants and dann score annotations
dann_exp = [all_vars.chrom == '1', dann.chrom == all_vars.chrom, all_vars.pos == dann.pos]
con.create_table('vars_dann', all_vars.join(dann, dann_exp)[all_vars, dann_case.name('dann_score')].distinct(), \
                 database='training')
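
The case statement above selects one of four score columns according to the alt allele. The same selection can be sketched in plain Python; the function name is hypothetical and the score values are made up for illustration.

```python
def dann_score_for_alt(alt, scores):
    """Pick the DANN score matching the alt allele, or None for anything else.

    `scores` maps each base to its score, mirroring the score_a, score_t,
    score_c and score_g columns of the dann table.
    """
    return scores.get(alt) if alt in ('A', 'T', 'C', 'G') else None


row_scores = {'A': 0.91, 'T': 0.35, 'C': 0.12, 'G': 0.58}  # made-up values
print(dann_score_for_alt('G', row_scores))
print(dann_score_for_alt('AG', row_scores))
```

Multi-base alt alleles fall through to the else branch and receive no score.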

In [55]:
# list the distinct chromosomes in the data set, dropping NULLs
chroms = all_vars.chrom.distinct().execute().dropna().tolist()
# remove chromosome 1, which was loaded when vars_dann was created
# (compare whole labels so that 11, 12, 21, ... are kept)
chroms = [c for c in chroms if c != '1']

print(all_vars.chrom.distinct())

# append each remaining chromosome to the vars_dann table
#for this_chrom in chroms:
#     # create table with variants and dann score annotations
#     dann_exp = [all_vars.chrom == this_chrom, dann.chrom == all_vars.chrom, all_vars.pos == dann.pos]
#     con.insert('vars_dann', all_vars.join(dann, dann_exp)[all_vars, dann_case.name('dann_score')].distinct(), \
#                  database='training')


0      14
1       6
2      21
3       X
4       1
5     NaN
6       2
7       7
8      11
9      12
10     16
11      9
12     13
13      5
14     18
15      Y
16      4
17     20
18     17
19     22
20      3
21      8
22     19
23     10
24     15
Name: chrom, dtype: object


In [36]:
# connect to vars_dann table and run compute stats
vars_dann = con.table('vars_dann', database='training')
vars_dann.compute_stats()

# create table with ensembl gene annotations
ensembl_exp = [vars_dann.chrom == ens.chrom, vars_dann.pos >= ens.start, \
               vars_dann.pos <= ens.stop]
vars_dann_ens = vars_dann.join(ens, ensembl_exp)[vars_dann, ens.gene_name.name('ens_gene')].distinct()

# create a table with variants, dann and ensembl annotations
con.create_table('vars_dann_ens', vars_dann_ens, database= 'training')
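
The Ensembl join above is a range join: a variant matches a gene when its position lies between the gene's start and stop on the same chromosome. A small pure-Python sketch of that test follows; the helper name and the gene records are invented for illustration.

```python
def genes_at(chrom, pos, genes):
    """Names of genes whose [start, stop] interval contains pos on chrom."""
    return [g['gene_name'] for g in genes
            if g['chrom'] == chrom and g['start'] <= pos <= g['stop']]


# toy gene records, invented for illustration only
genes = [
    {'chrom': '1', 'start': 100, 'stop': 500, 'gene_name': 'GENE_A'},
    {'chrom': '1', 'start': 400, 'stop': 900, 'gene_name': 'GENE_B'},
    {'chrom': '2', 'start': 100, 'stop': 500, 'gene_name': 'GENE_C'},
]
print(genes_at('1', 450, genes))
print(genes_at('1', 950, genes))
```

A position inside two overlapping genes yields two matches, so the joined table can hold more than one row per variant. Assuming `join` performs an inner join here, a variant that falls outside every gene would not appear in the result at all.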

In [17]:
#  create a connection to variants with dann and ensembl annotations view 
add_dbnsfp = con.table('vars_dann_ens', database = 'training')
add_dbnsfp.compute_stats()

# add SIFT, PolyPhen-2, CADD and InterPro domain annotations from the dbNSFP table
dbnsfp_exp = [add_dbnsfp.chrom == dbnsfp_dist.chrom, add_dbnsfp.pos == dbnsfp_dist.pos, \
              add_dbnsfp.ref == dbnsfp_dist.ref, add_dbnsfp.alt == dbnsfp_dist.alt]

global_vars = add_dbnsfp.join(dbnsfp_dist, dbnsfp_exp)[add_dbnsfp, dbnsfp_dist.sift_score,  \
                         dbnsfp_dist.polyphen2_hdiv_score, dbnsfp_dist.polyphen2_hvar_score, \
                         dbnsfp_dist.cadd_raw, dbnsfp_dist.interpro_domain].distinct()

## Save Global Variants as Impala Table

The table is too large to add SnpEff annotations without first saving it to Impala.

In [19]:
con.create_table('global_vars', global_vars.distinct(), database='training')

OperationalError: Operation is in ERROR_STATE

### Clean up temp tables

TODO:

- Add normalized variants from Illumina and CGI
- Add PFAM domains
- Add SnpEff annotations