# Creating an Annotated Table of Global Distinct Variants

All tables containing variants in Impala are combined into a global table of distinct variants. These variants are then given the following annotations:
- SnpEff coding consequence predictions
- DANN scores
- Ensembl gene annotations
- dbSNP rsID
- ClinVar clinical significance rating
- Kaviar allele frequency

To be added as available:
- PFAM
- CADD

### Script Requirements

- Ibis: pip install ibis-framework
- SnpEff: http://snpeff.sourceforge.net/
- Reference (downloaded automatically): GRCh37.74, to match the VCF files

### Connect to Impala

In [97]:
import ibis
import os

# connect to impala with ibis
# WebHDFS port: use the environment value if set (it arrives as a string), else 50070
hdfs_port = int(os.environ.get('glados20', 50070))
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True
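
The hosts and ports above are mostly hard-coded. As a minimal sketch, the same settings could be read from the environment with the notebook's values as defaults. The variable names `IMPALA_HOST`, `IMPALA_PORT`, `HDFS_HOST` and `WEBHDFS_PORT` are hypothetical and are not defined anywhere in this notebook. Environment values are strings, so the ports are cast to `int`.

```python
import os


def read_connection_settings(env=None):
    """Return connection settings, falling back to the values used in this notebook.

    The environment variable names are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    return {
        'impala_host': env.get('IMPALA_HOST', 'glados19'),
        'impala_port': int(env.get('IMPALA_PORT', 21050)),
        'hdfs_host': env.get('HDFS_HOST', 'glados20'),
        'hdfs_port': int(env.get('WEBHDFS_PORT', 50070)),
    }


# Example with an explicit mapping instead of the real environment.
settings = read_connection_settings({'IMPALA_PORT': '21051'})
print(settings)
```

The resulting dictionary can then be passed on to `ibis.hdfs_connect` and `ibis.impala.connect` in place of the literals.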

## Add All Variants and Annotations from Reference Tables

In [98]:
db = 'p7_ref_grch37'

# connect to variant tables
kaviar = con.table('kaviar_isb', database=db)
clinvar = con.table('clinvar', database=db)
#dbnsfp = con.table('dbnsfp_variant', database=db) TODO: Add this when bug fixed in ibis
dbsnp = con.table('dbsnp', database=db)

In [99]:
# outer join clinvar and kaviar_isb on chrom, pos, ref and alt
clin_exprs = [kaviar.chrom == clinvar.chrom, kaviar.pos == clinvar.pos,
              kaviar.ref == clinvar.ref, kaviar.alt == clinvar.alt]
# note: key columns come from kaviar, so clinvar-only rows have NULL chrom/pos/ref/alt
clin_kav = kaviar.outer_join(clinvar, clin_exprs)[
    kaviar.chrom, kaviar.pos, kaviar.ref, kaviar.alt,
    kaviar.allele_freq.name('kav_freq'), clinvar.clin_sig, clinvar.clin_dbn]
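
To make the join semantics concrete, here is a small plain-Python sketch of a full outer join on the variant key `(chrom, pos, ref, alt)`. It uses made-up toy rows and dictionaries rather than Impala tables, so it only illustrates the idea and is not what Ibis executes. Unlike the projection in the cell above, the key here is kept from whichever side has the row.

```python
def full_outer_join(left, right):
    """Full outer join of two dicts keyed by (chrom, pos, ref, alt).

    Returns {key: (left_value_or_None, right_value_or_None)}.
    """
    joined = {}
    for key in sorted(set(left) | set(right)):
        joined[key] = (left.get(key), right.get(key))
    return joined


# Toy data, invented for illustration: allele frequency and clinical significance.
kaviar_rows = {('1', 100, 'A', 'G'): 0.01, ('1', 200, 'C', 'T'): 0.25}
clinvar_rows = {('1', 200, 'C', 'T'): '5', ('2', 300, 'G', 'A'): '2'}

clin_kav_rows = full_outer_join(kaviar_rows, clinvar_rows)
for key, (kav_freq, clin_sig) in clin_kav_rows.items():
    print(key, kav_freq, clin_sig)
```

Variants present in only one source keep their key and get `None` for the missing annotation, which mirrors the NULLs an outer join produces.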

In [100]:
# outer join clin_kav with dbsnp, matching only dbSNP build 142 entries
dbsnp_exprs = [clin_kav.chrom == dbsnp.chrom, clin_kav.pos == dbsnp.pos,
               clin_kav.ref == dbsnp.ref, clin_kav.alt == dbsnp.alt, dbsnp.dbsnpbuildid == '142']
all_vars = clin_kav.outer_join(dbsnp, dbsnp_exprs)[clin_kav, dbsnp.rs_id]

In [101]:
# reduce all_vars to distinct entries before joining with other tables
all_vars = all_vars.distinct()
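
Joins can emit the same row more than once, for example when a reference table lists a variant twice. A conceptual sketch of what `distinct()` achieves, using made-up tuples in plain Python (this is not how Impala implements it):

```python
# Toy rows: (chrom, pos, ref, alt, rs_id); the rs IDs are invented.
rows = [
    ('1', 100, 'A', 'G', 'rs1'),
    ('1', 100, 'A', 'G', 'rs1'),  # exact duplicate, dropped
    ('1', 100, 'A', 'G', 'rs2'),  # differs in one column, kept
]

# dict.fromkeys keeps the first occurrence of each row, in input order.
distinct_rows = list(dict.fromkeys(rows))
print(distinct_rows)
```

Only rows that are identical in every column collapse, so a variant with two different annotations still appears twice.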

## Add Regional Annotations

In [102]:
# connect to regional tables
dann = con.table('dann', database=db)
ensembl = con.table('ensembl_genes', database=db)

In [103]:
# add dann scores: pick the score column that matches the alt allele;
# alts that are not a single base (e.g. indels) get NULL
dann_case = (ibis.case()
            .when(all_vars.alt == 'A', dann.score_a)
            .when(all_vars.alt == 'T', dann.score_t)
            .when(all_vars.alt == 'C', dann.score_c)
            .when(all_vars.alt == 'G', dann.score_g)
            .else_(ibis.NA)
            .end())

# join() is an inner join: variants with no matching dann row are dropped
dann_exp = [all_vars.chrom == dann.chrom, all_vars.pos == dann.pos]
all_vars = all_vars.join(dann, dann_exp)[all_vars, dann_case.name('dann_score')]
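
The case expression above can be read as a lookup from alternate allele to score column. A plain-Python sketch of the same selection, assuming (as the column names suggest) one DANN row per position with one score per base. The scores below are invented.

```python
def pick_dann_score(alt, scores):
    """Return the score for the alternate allele, or None if alt is not a single base."""
    column_for_alt = {'A': 'score_a', 'T': 'score_t', 'C': 'score_c', 'G': 'score_g'}
    column = column_for_alt.get(alt)
    return scores.get(column) if column else None


# Made-up scores for a single genomic position.
dann_row = {'score_a': 0.12, 'score_t': 0.48, 'score_c': 0.91, 'score_g': 0.33}

print(pick_dann_score('C', dann_row))   # single-base alt: returns the score_c value
print(pick_dann_score('CT', dann_row))  # multi-base alt: no score available
```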

In [104]:
# reduce to distinct gene intervals so a gene is not repeated for each exon
ensembl = ensembl['chrom', 'start', 'stop', 'gene_name', 'gene_id'].distinct()

# add ensembl gene annotations: a variant matches when its position falls
# inside the gene interval (inner join, so unmatched variants are dropped)
ensembl_exp = [all_vars.chrom == ensembl.chrom, all_vars.pos >= ensembl.start,
               all_vars.pos <= ensembl.stop]
all_vars = all_vars.join(ensembl, ensembl_exp)[all_vars, ensembl.gene_name.name('ens_gene'),
                                               ensembl.gene_id.name('ens_geneid')]
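
The gene annotation is a range join: a variant is assigned to every gene whose interval contains its position. A minimal plain-Python sketch of that predicate, using invented gene names, IDs and coordinates:

```python
def genes_overlapping(chrom, pos, genes):
    """Return (gene_name, gene_id) for every gene whose interval contains pos (inclusive)."""
    return [
        (g['gene_name'], g['gene_id'])
        for g in genes
        if g['chrom'] == chrom and g['start'] <= pos <= g['stop']
    ]


# Made-up gene intervals; GENE_A and GENE_B overlap on chromosome 1.
genes = [
    {'chrom': '1', 'start': 100, 'stop': 500, 'gene_name': 'GENE_A', 'gene_id': 'ID_A'},
    {'chrom': '1', 'start': 400, 'stop': 900, 'gene_name': 'GENE_B', 'gene_id': 'ID_B'},
    {'chrom': '2', 'start': 100, 'stop': 500, 'gene_name': 'GENE_C', 'gene_id': 'ID_C'},
]

print(genes_overlapping('1', 450, genes))   # inside both GENE_A and GENE_B
print(genes_overlapping('1', 1000, genes))  # intergenic: no match
```

A variant inside overlapping genes yields one output row per gene, and an intergenic variant yields none, which is why an inner join drops it.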

### Save Global Variants to Impala

The table is too large to add SnpEff annotations without first saving it to Impala.

In [None]:
con.create_table('global_variants', all_vars, database='p7_ref_grch37')

TODO:
- Add normalized variants from Illumina and CGI
- Add CADD scores
- Add PFAM domains