# Creating Annotated Table of Global Distinct Variants

All tables containing variants in Impala will be used to create a global table of distinct variants. These variants will be given the following annotations: 

    - SnpEff coding consequence predictions  
    - DANN scores  
    - Ensembl gene annotations  
    - dbSNP rsid  
    - Clinvar clinical significance rating  
    - Kaviar allele frequency  
    
To be added when available:   
    - PFAM  
    - CADD  

### Script Requirements

- Ibis: pip install ibis-framework
- SnpEff: http://snpeff.sourceforge.net/
- Reference (downloaded automatically): GRCh37.74, to match the VCF files

### Connect to Impala with Ibis

For the HDFS connection, update the 'user' argument to your own username. 

In [1]:
import ibis
import os

# connect to impala with ibis
hdfs_port = os.environ.get('glados20', 50070)
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True

## Add All Variants and Annotations from Reference Tables

All reference sources containing variants were outer joined to compile a list of all variants found on Impala. 
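
As an illustration of what an outer join on the variant key does, here is a minimal pure-Python sketch. It is not the ibis code used in this notebook: the toy rows, the `full_outer_join` helper and the column values are all hypothetical, and the only assumption carried over from the notebook is that a variant is identified by (chrom, pos, ref, alt).

```python
# Hypothetical toy rows keyed by (chrom, pos, ref, alt); values are illustrative only.
kaviar_rows = {
    ('1', 100, 'A', 'G'): {'kav_freq': 0.01},
    ('1', 200, 'C', 'T'): {'kav_freq': 0.25},
}
clinvar_rows = {
    ('1', 200, 'C', 'T'): {'clin_sig': '5'},
    ('2', 300, 'G', 'A'): {'clin_sig': '2'},
}


def full_outer_join(left, right, columns):
    """Merge two keyed dicts, keeping every key found in either one."""
    merged = {}
    for key in set(left) | set(right):
        row = dict.fromkeys(columns)      # every column starts as None
        row.update(left.get(key, {}))     # fill in values from the left table
        row.update(right.get(key, {}))    # fill in values from the right table
        merged[key] = row
    return merged


joined = full_outer_join(kaviar_rows, clinvar_rows, ['kav_freq', 'clin_sig'])
for key in sorted(joined):
    print(key, joined[key])
```

A variant present in only one source still gets a row; the columns from the other source are simply left empty.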

TODO: Variants from CGI and Illumina will be added after normalized files are available. 

### Create connections to each needed table

In [2]:
db = 'p7_ref_grch37'

# connect to variant tables
kaviar = con.table('kaviar_isb', database=db)
clinvar = con.table('clinvar', database=db)
#dbnsfp = con.table('dbnsfp_variant', database=db) TODO: Add this when bug fixed in ibis
dbsnp = con.table('dbsnp', database=db)

### Outer Join between Kaviar_ISB and ClinVar

The Kaviar_ISB and ClinVar tables were outer joined, keeping the variants found in either table along with the ClinVar significance rating and explanation.

In [None]:
# outer join clinvar and kaviar_isb
clin_exprs = [kaviar.chrom == clinvar.chrom, kaviar.pos == clinvar.pos,
              kaviar.ref == clinvar.ref, kaviar.alt == clinvar.alt]
clin_kav = kaviar.outer_join(clinvar, clin_exprs)[
    kaviar.chrom, kaviar.pos, kaviar.ref, kaviar.alt,
    kaviar.allele_freq.name('kav_freq'), clinvar.clin_sig, clinvar.clin_dbn]

The code above is equivalent to the following SQL statement: 

In [None]:
# compile the joined table expression (not the list of join predicates) to SQL
print clin_kav.compile()

In [4]:
# outer join clin_kav with dbsnp (build 142 only)
dbsnp_exprs = [clin_kav.chrom == dbsnp.chrom, clin_kav.pos == dbsnp.pos, \
               clin_kav.ref == dbsnp.ref, clin_kav.alt == dbsnp.alt, dbsnp.dbsnpbuildid == '142']
all_vars = clin_kav.outer_join(dbsnp, dbsnp_exprs)[clin_kav, dbsnp.rs_id]

In [None]:
# compile the joined table expression (not the list of join predicates) to SQL
print all_vars.compile()

In [5]:
# reduce all_vars to distinct entries before joining with other tables
all_vars = all_vars.distinct()

## Add Regional Annotations

Region-based annotations will be added to speed up downstream analysis and lookups. 

TODO: Add CADD and PFAM annotations when available. 

### Connect to Regional Annotation Tables

In [6]:
# connect to regional tables
dann = con.table('dann', database=db)
ensembl = con.table('ensembl_genes', database=db)

In [7]:
# add dann scores
dann_case = (ibis.case()
            .when(all_vars.alt == 'A', dann.score_a)
            .when(all_vars.alt == 'T', dann.score_t)
            .when(all_vars.alt == 'C', dann.score_c)
            .when(all_vars.alt == 'G', dann.score_g)
            .else_(ibis.NA)
            .end())    

dann_exp = [all_vars.chrom == dann.chrom, all_vars.pos == dann.pos]
all_vars = all_vars.join(dann, dann_exp)[all_vars, dann_case.name('dann_score')]
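
The per-allele score selection in the cell above can be sketched in plain Python. This is a simplified illustration, not the notebook's query: the `dann_rows` layout, the `dann_score` helper and the scores are hypothetical. One deliberate difference is that a position missing from the lookup returns None here, whereas the `join` in the cell above is not an outer join.

```python
# Assumed layout: one DANN row per (chrom, pos) with a score per alternate base.
# The scores below are made up for illustration.
dann_rows = {
    ('15', 62562863): {'A': 0.31, 'C': 0.22, 'G': 0.18, 'T': None},
}


def dann_score(chrom, pos, alt, dann):
    """Return the score for a single-base alt allele, or None otherwise."""
    row = dann.get((chrom, pos))
    if row is None:
        return None          # position not covered by the lookup
    return row.get(alt)      # multi-base alleles such as 'TG' are not keys


print(dann_score('15', 62562863, 'G', dann_rows))
print(dann_score('15', 62562863, 'TG', dann_rows))
```

This mirrors why the insertion and deletion rows in the preview further down have no `dann_score`: only the four single-base alternate alleles are matched by the case expression.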

In [8]:
# keep only one transcript per gene, instead of each exon
ensembl = ensembl['chrom', 'start', 'stop', 'gene_name', 'gene_id'].distinct()

# add ensembl gene annotations
ensembl_exp = [all_vars.chrom == ensembl.chrom, all_vars.pos >= ensembl.start, \
               all_vars.pos <= ensembl.stop]
all_vars = all_vars.join(ensembl, ensembl_exp)[all_vars, ensembl.gene_name.name('ens_gene'), 
                                               ensembl.gene_id.name('ens_geneid')]
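
The gene annotation above is a range join: a variant matches a gene when the chromosomes agree and the position falls between the gene's start and stop, inclusive. A minimal sketch of that containment test follows, with a hypothetical `genes` list, hypothetical coordinates and a hypothetical `genes_at` helper.

```python
# (chrom, start, stop, gene_name, gene_id): hypothetical coordinates and names
genes = [
    ('15', 100, 200, 'GENE_A', 'ID_A'),
    ('15', 150, 300, 'GENE_B', 'ID_B'),
    ('16', 100, 200, 'GENE_C', 'ID_C'),
]


def genes_at(chrom, pos, gene_table):
    """List (gene_name, gene_id) for every gene whose range contains pos."""
    return [(name, gene_id)
            for c, start, stop, name, gene_id in gene_table
            if c == chrom and start <= pos <= stop]


print(genes_at('15', 175, genes))   # position inside two overlapping genes
print(genes_at('17', 150, genes))   # chromosome with no genes in the table
```

Because gene ranges can overlap, a single variant may match more than one gene, so a join of this kind can return several rows for the same variant.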

### Preview Results

In [9]:
print all_vars.limit(5)

  chrom       pos ref alt  kav_freq clin_sig clin_dbn rs_id  \
0    15  62560626   T   C  0.001460     None     None  None   
1    15  62562863   T   G  0.000038     None     None  None   
2    15  62562863   T  TG  0.000038     None     None  None   
3    15  62562863  TG   T  0.000038     None     None  None   
4    15  62566380   T   G  0.000038     None     None  None   

            dann_score       ens_gene       ens_geneid  
0  0.60789313687215341  RP11-299H22.5  ENSG00000260062  
1  0.17816605345991465  RP11-299H22.5  ENSG00000260062  
2                 None  RP11-299H22.5  ENSG00000260062  
3                 None  RP11-299H22.5  ENSG00000260062  
4  0.50747597957204360  RP11-299H22.6  ENSG00000261296  


## Save Global Variants as Impala Table

The table is too large to add SnpEff annotations without first saving it to Impala. 

In [None]:
con.create_table('global_variants', all_vars, database='p7_ref_grch37')

TODO: 
    - Add normalized variants from Illumina and CGI
    - Add CADD Scores
    - Add PFAM domains