# Creating Annotated Table of Global Distinct Variants

All Impala tables containing variants will be used to create a global table of distinct variants. These variants will be given the following annotations:

    - SnpEff coding consequence predictions  
    - DANN scores  
    - Ensembl gene annotations  
    - dbSNP rsid  
    - ClinVar clinical significance rating  
    - Kaviar allele frequency  
    
To be added when available:   
    - PFAM  
    - CADD  

### Script Requirements

- Ibis: pip install ibis-framework
- SnpEff: http://snpeff.sourceforge.net/
- Reference (downloaded automatically): GRCh37.74, to match the VCF files

### Connect to Impala with Ibis

Update the 'user' argument of hdfs_connect to your own HDFS username.

In [23]:
import ibis
import os

# connect to impala with ibis
# WebHDFS port: read from the 'glados20' environment variable if set, else default to 50070
hdfs_port = int(os.environ.get('glados20', 50070))
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True

## Add All Variants and Annotations from Reference Tables

All reference sources containing variants were outer joined to compile a list of all variants found on Impala.

TODO: Variants from CGI and Illumina will be added after normalized files are available. 

### Create connections to each needed table

In [2]:
db = 'p7_ref_grch37'

# connect to variant tables
kaviar = con.table('kaviar_isb', database=db)
clinvar = con.table('clinvar', database=db) 
dbsnp = con.table('dbsnp', database=db)

### Create Views with Distinct Subsets of each table

In [3]:
# limit dbsnp to build 142
dbsnp = dbsnp[dbsnp.dbsnpbuildid == '142']

# subset the data for distinct elements from the needed columns
kav_sub = kaviar['chrom', 'pos', 'ref', 'alt', 'allele_freq', 'sources'].distinct()
clin_sub = clinvar['chrom', 'pos', 'ref', 'alt', 'clin_sig', 'clin_dbn'].distinct()
dbsnp_sub = dbsnp['chrom', 'pos', 'ref', 'alt', 'rs_id'].distinct()
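
As a rough illustration of what the column subsetting plus distinct() step does, the following plain-Python sketch projects a list of rows onto a few columns and drops repeated combinations. The function name distinct_subset and the sample rows are made up for this example; the real work is done by Impala through the Ibis expressions above.

```python
def distinct_subset(rows, columns):
    """Keep only `columns` from each row and drop duplicate combinations."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[col] for col in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


# illustrative rows only, not real dbSNP records
example_rows = [
    {'chrom': '1', 'pos': 100, 'ref': 'A', 'alt': 'G', 'rs_id': 'rs_example1', 'extra': 'x'},
    {'chrom': '1', 'pos': 100, 'ref': 'A', 'alt': 'G', 'rs_id': 'rs_example1', 'extra': 'y'},
    {'chrom': '2', 'pos': 200, 'ref': 'C', 'alt': 'T', 'rs_id': 'rs_example2', 'extra': 'z'},
]

example_distinct = distinct_subset(example_rows, ['chrom', 'pos', 'ref', 'alt', 'rs_id'])
```

The two rows that differ only in the dropped 'extra' column collapse into one, which is why subsetting before calling distinct() can shrink a table considerably.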

In [33]:
# create views on impala
con.create_view('kav_distinct', kav_sub, database='training')
con.create_view('clin_distinct', clin_sub, database='training')
con.create_view('dbsnp_distinct', dbsnp_sub, database='training')

In [4]:
# create an object for each view
kav = con.table('kav_distinct', database='training')
clin = con.table('clin_distinct', database='training')
dbsnp = con.table('dbsnp_distinct', database='training')

### Outer Join between Kaviar_ISB and ClinVar

The Kaviar_ISB and ClinVar tables were outer joined, keeping all variants found in either table along with the ClinVar significance rating and explanation.

In [5]:
# create statements to join clinvar and kaviar on the full variant key
clin_expr = [kav.chrom == clin.chrom, kav.pos == clin.pos,
             kav.ref == clin.ref, kav.alt == clin.alt]

# outer join clinvar with kaviar_isb
# NOTE: the key columns are selected from kav only, so variants present
# only in clinvar come back with NULL chrom/pos/ref/alt
clin_kav = kav.outer_join(clin, clin_expr)[kav.chrom, kav.pos,
           kav.ref, kav.alt, kav.allele_freq.name('kav_freq'),
           kav.sources.name('kav_source'), clin.clin_sig, clin.clin_dbn]
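
The intent of the outer join can be sketched in plain Python, assuming each table is reduced to a dict keyed by (chrom, pos, ref, alt). The helper full_outer_join and the ClinVar values are hypothetical and only illustrate the semantics. The sketch builds the key columns from whichever side has the variant, which is one way to avoid losing the coordinates of variants that exist in only one table.

```python
KEY_COLS = ('chrom', 'pos', 'ref', 'alt')


def full_outer_join(left, right, left_cols, right_cols):
    """Join two dicts keyed by variant, keeping keys found in either one."""
    joined = []
    for key in sorted(set(left) | set(right)):
        # key columns come from the shared key, so they are never missing
        row = dict(zip(KEY_COLS, key))
        for col in left_cols:
            row[col] = left.get(key, {}).get(col)
        for col in right_cols:
            row[col] = right.get(key, {}).get(col)
        joined.append(row)
    return joined


# kaviar-like rows (first two values taken from the preview in this notebook)
kav_rows = {
    ('10', 52596386, 'G', 'A'): {'kav_freq': 0.000038, 'kav_source': 'Inova_CGI_founders'},
    ('19', 5093376, 'C', 'T'): {'kav_freq': 0.000038, 'kav_source': 'phase3-STU'},
}
# clinvar-like rows with placeholder values, not real ClinVar records
clin_rows = {
    ('19', 5093376, 'C', 'T'): {'clin_sig': 'example_sig', 'clin_dbn': 'example_disease'},
    ('7', 1000, 'A', 'G'): {'clin_sig': 'example_sig', 'clin_dbn': 'example_disease'},
}

joined_rows = full_outer_join(kav_rows, clin_rows,
                              ['kav_freq', 'kav_source'],
                              ['clin_sig', 'clin_dbn'])
```

Here the variant found only in the ClinVar-like dict keeps its chrom, pos, ref and alt, with the Kaviar columns set to None.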

In [6]:
# create statements to join the clin_kav table with dbsnp
dbsnp_exprs = [clin_kav.chrom == dbsnp.chrom, clin_kav.pos == dbsnp.pos, \
               clin_kav.ref == dbsnp.ref, clin_kav.alt == dbsnp.alt]

# outer join clin_kav with dbsnp
all_vars = clin_kav.outer_join(dbsnp, dbsnp_exprs)[clin_kav, dbsnp.rs_id]

In [87]:
# reduce all_vars to distinct entries and save as a table on impala for downstream analysis
con.create_table('all_vars', all_vars.distinct(), database='training')

In [7]:
# create a connection to the all_vars table and preview
all_vars = con.table('all_vars', database='training')
print(all_vars.limit(5))

  chrom        pos                ref alt  kav_freq  \
0    10   52596386                  G   A  0.000038   
1     5   82238314                CCT   C  0.000038   
2     X  108229812  ATTGATCCACTGGTTGT   A  0.001153   
3     1  166194426                  G  GT  0.000038   
4    19    5093376                  C   T  0.000038   

                                          kav_source clin_sig clin_dbn  \
0                                 Inova_CGI_founders      NaN      NaN   
1                                       ISB_founders      NaN      NaN   
2  ADNI|UK10K|phase3-ACB|phase3-ASW|phase3-ESN|ph...      NaN      NaN   
3                                 Inova_CGI_founders      NaN      NaN   
4                                         phase3-STU      NaN      NaN   

         rs_id  
0          NaN  
1          NaN  
2          NaN  
3          NaN  
4  rs573375168  


## Add Regional Annotations

Region-based annotations will be added to speed up downstream analysis and lookups. 

TODO: Add CADD and PFAM annotations when available. 

In [8]:
# connect to regional annotation tables
dann = con.table('dann', database=db)
ensembl = con.table('ensembl_genes', database=db)
dbnsfp = con.table('dbnsfp_variant', database=db)

### Create a subset from each table

In [9]:
# subset tables for only required columns and distinct rows
ensembl_sub = ensembl['chrom', 'start', 'stop', 'strand', 'gene_name', 'gene_id'].distinct()
dbnsfp_sub = dbnsfp['chrom', 'pos', 'ref', 'alt', 'sift_score', 'sift_pred', 'polyphen2_hdiv_score', \
             'polyphen2_hvar_score', 'cadd_raw', 'interpro_domain'].distinct()

In [15]:
# create views on impala
con.create_view('ens_distinct', ensembl_sub, database='training')
con.create_view('dbnsfp_distinct', dbnsfp_sub, database='training')

In [16]:
# create a connection to each view
ens = con.table('ens_distinct', database='training')
dbnsfp_dist = con.table('dbnsfp_distinct', database='training')

### Join Regional Annotations to Variants

In [12]:
# create case statement to pick the DANN score matching the alt allele
dann_case = (ibis.case()
            .when(all_vars.alt == 'A', dann.score_a)
            .when(all_vars.alt == 'T', dann.score_t)
            .when(all_vars.alt == 'C', dann.score_c)
            .when(all_vars.alt == 'G', dann.score_g)
            .else_(ibis.NA)
            .end())    

# create view with variants and dann score annotations
dann_exp = [all_vars.chrom == dann.chrom, all_vars.pos == dann.pos]
con.create_view('vars_dann', all_vars.join(dann, dann_exp)[all_vars, dann_case.name('dann_score')], \
                 database='training')
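
The case statement above can be read as a simple lookup: the alt allele decides which of the four DANN score columns is used, and anything that is not a single A, T, C or G (for example an indel) gets no score. A minimal plain-Python sketch of that logic follows; the function name dann_score_for_alt and the score values are made up for illustration.

```python
# maps an alt allele to the DANN column that holds its score
DANN_COLUMN_FOR_ALT = {
    'A': 'score_a',
    'T': 'score_t',
    'C': 'score_c',
    'G': 'score_g',
}


def dann_score_for_alt(alt, dann_row):
    """Return the DANN score for a single-base alt allele, else None."""
    column = DANN_COLUMN_FOR_ALT.get(alt)
    if column is None:
        return None
    return dann_row.get(column)


# illustrative scores for one position, not real DANN values
example_dann_row = {'score_a': 0.1, 'score_t': 0.2, 'score_c': 0.3, 'score_g': 0.4}

snv_score = dann_score_for_alt('T', example_dann_row)
indel_score = dann_score_for_alt('GT', example_dann_row)
```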

In [25]:
# connect to vars_dann view
vars_dann = con.table('vars_dann', database='training')

# create view with ensembl gene annotations
ensembl_exp = [vars_dann.chrom == ens.chrom, vars_dann.pos >= ens.start, \
               vars_dann.pos <= ens.stop]
vars_dann_ens = vars_dann.join(ens, ensembl_exp)[vars_dann, ens.gene_name.name('ens_gene')]

# create a table with variants, dann and ensembl annotations
con.create_table('vars_dann_ens', vars_dann_ens, database= 'training')

OperationalError: Operation is in ERROR_STATE
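
The Ensembl join differs from the earlier ones because it matches on a range (gene start <= variant position <= gene stop) rather than on equal keys. The sketch below shows that matching rule in plain Python; the function genes_overlapping and the gene intervals are hypothetical and only illustrate the condition, not how Impala executes it.

```python
def genes_overlapping(chrom, pos, genes):
    """Return names of genes on `chrom` whose [start, stop] range contains `pos`."""
    return [gene['gene_name'] for gene in genes
            if gene['chrom'] == chrom and gene['start'] <= pos <= gene['stop']]


# made-up gene intervals, not real Ensembl records
example_genes = [
    {'chrom': '1', 'start': 100, 'stop': 500, 'gene_name': 'GENE_A'},
    {'chrom': '1', 'start': 400, 'stop': 900, 'gene_name': 'GENE_B'},
    {'chrom': '2', 'start': 100, 'stop': 500, 'gene_name': 'GENE_C'},
]

hits_two_genes = genes_overlapping('1', 450, example_genes)
hits_no_gene = genes_overlapping('1', 950, example_genes)
```

Two consequences are worth keeping in mind: a variant inside overlapping genes yields one row per gene, and with an inner join a variant that falls inside no gene is dropped from the result.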

In [17]:
# create a connection to the table of variants with dann and ensembl annotations
add_dbsnfp = con.table('vars_dann_ens', database = 'training')

# add SIFT, PolyPhen2, CADD and InterPro domain annotations from the dbNSFP table
dbnsfp_exp = [add_dbsnfp.chrom == dbnsfp_dist.chrom, add_dbsnfp.pos == dbnsfp_dist.pos, \
              add_dbsnfp.ref == dbnsfp_dist.ref, add_dbsnfp.alt == dbnsfp_dist.alt]

global_vars = add_dbsnfp.join(dbnsfp_dist, dbnsfp_exp)[add_dbsnfp, dbnsfp_dist.sift_score,  \
                         dbnsfp_dist.polyphen2_hdiv_score, dbnsfp_dist.polyphen2_hvar_score, \
                         dbnsfp_dist.cadd_raw, dbnsfp_dist.interpro_domain]

## Save global variants as impala table

The table is too large to add SnpEff annotations without first saving it to Impala.

In [19]:
con.create_table('global_vars', global_vars.distinct(), database='training')

OperationalError: Operation is in ERROR_STATE

TODO: 
    - Add normalized variants from Illumina and CGI
    - Add PFAM domains
    - Add SnpEff annotations