# Run SnpEff

Genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes (such as amino acid changes). 

### Run VCF Through SnpEff

Requires: 
- SnpEff: http://snpeff.sourceforge.net/
- Reference (downloaded automoatically) GRCh37.74 to match vcf files
- GATK: to validate vcf file https://www.broadinstitute.org/gatk/download/
- 1000 genomes hg37 reference fasta: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz  
- Matching index file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
- Mathching dict file (created using Picard tools) 
- Ibis: 'pip install ibis-framework' 

In [21]:
! java -jar D:/Documents/tools/picard/src/java/picard/sam/CreateSequenceDictionary.java \
    R=D:/Documents/GitHub/impala_scripts/annotation/human_g1k_v37.fasta \
    O=D:/Documents/GitHub/impala_scripts/annotation/human_g1k_v37.dict


Error: Invalid or corrupt jarfile D:/Documents/tools/picard/src/java/picard/sam/CreateSequenceDictionary.java


### Create VCF from distinct kaviar variants

For each distinct variant found in ISB's Kaivar table, a vcf file will be created and run through SnpEff. 

### Connect to Impala

For HDFS connection, update user argument. 

In [5]:
import ibis
import os

# connect to impala with ibis
hdfs_port = os.environ.get('glados20', 50070)
hdfs = ibis.hdfs_connect(host='glados20', port=hdfs_port, user='selasady')
con = ibis.impala.connect(host='glados19', port=21050, timeout=120)

# enable interactive mode
ibis.options.interactive = True

### Download Distinct Variant Table

This pipeline begins with the table of global distcint variants in impala. Here we are downloading them into an ibis table object to output to vcf and run through snpeff for predicting coding consequences. 

In [6]:
def create_vcf(tbl_name, db_name):
    # create ibis object from distinct vars table
    distinct_vars = con.table(tbl_name, database=db_name) 
    # limit table to just the columns we need to output to vcf
    distinct_df = distinct_vars['chrom', 'pos', 'ref', 'alt']
    # download table from ibis table connection object to local memory
    distinct_df = distinct_df.execute(limit=100000000000)
    # add blank fields for vcf format
    distinct_df['id'] = '.'
    distinct_df['QUAL'] = '30'
    distinct_df['FILTER'] = 'PASS'
    distinct_df['INFO'] = '.'
    distinct_df['FORMAT'] = 'GT:'
    distinct_df['subject'] = './.'
    # rename chrom column to match vcf header
    distinct_df =distinct_df.rename(columns = {'CHROM':'#CHROM'})
    # uppercase column names to match vcf header
    distinct_df.columns = [x.upper() for x in distinct_df.columns]
    # remove duplicated rows
    distinct_df.drop_duplicates(inplace=True)
    return distinct_df

In [7]:
# download table from ibis table connection object to local memory
distinct_df = create_vcf('global_vars', 'training')

In [8]:
print distinct_df.head(5) 
print "\n Distinct rows = " + str(len(distinct_df))

  CHROM        POS  REF  ALT ID QUAL FILTER INFO FORMAT SUBJECT
0    20    5918258    T    C  .   30   PASS    .    GT:     ./.
1    15   28356557    G   GA  .   30   PASS    .    GT:     ./.
2     2  187867377  CAG    C  .   30   PASS    .    GT:     ./.
3    11  108158205   TC    T  .   30   PASS    .    GT:     ./.
4     1   74837216    T  TAG  .   30   PASS    .    GT:     ./.

 Distinct rows = 53065


### Output Distinct Variants as VCF File

In [9]:
import time
import pandas as pd

# disable extraneous pandas warning
pd.options.mode.chained_assignment = None

In [10]:
# create vcf header
def create_header(outfile_name):
   # create vcf header
    lines=[]
    lines.append('##fileformat=VCFv4.0')
    lines.append('##fileDate='+ time.strftime("%y%m%d"))
    lines.append('##reference=grch37 v.74 \n')
    header = '\n'.join(lines)
    out = open(outfile_name, 'wb')
    out.write(header)
    out.close()
    
# create vcf body and append to file with header
def impala_to_vcf(input_df, outfile_name):
    # add blank columns for vcf format and format col names
    input_df.columns = [x.upper() for x in input_df.columns]
    input_df= input_df.rename(columns = {'CHROM':'#CHROM'})    
    # order chromosomes to match ref fastas
    chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'MT']
    input_df['#CHROM'] = input_df['#CHROM'].astype("category")
    input_df['#CHROM'].cat.set_categories(chroms, inplace=True)
    # sort file by chrom then pos
    input_df = input_df.sort(['#CHROM', 'POS'])
    # write to file for conversion to vcf
    input_df.to_csv(outfile_name, header=True, encoding='utf-8', sep="\t", index=False, mode='a')

In [11]:
vcf_out = 'distinct_vars.vcf'

create_header(vcf_out)
impala_to_vcf(distinct_df, vcf_out)



### Verify VCF format using GATK ValidateVariants

The GATK's validate variants method is used to verify only the VCF formatting with the --validationTypeToExclude ALL to avoid checks for allele count, etc. 

In [16]:
!java -Xmx16g -jar D:/Documents/tools/GenomeAnalysisTK.jar  -T ValidateVariants \
    -R D:/Documents/GitHub/impala_scripts/annotation/human_g1k_v37.fasta \
        -V distinct_vars.vcf --validationTypeToExclude ALL

INFO  20:22:29,559 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  20:22:29,562 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12 
INFO  20:22:29,562 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  20:22:29,562 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  20:22:29,566 HelpFormatter - Program Args: -T ValidateVariants -R D:/Documents/GitHub/impala_scripts/annotation/human_g1k_v37.fasta -V distinct_vars.vcf --validationTypeToExclude ALL 
INFO  20:22:29,575 HelpFormatter - Executing as Summer@Beast on Windows 8.1 6.3 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_51-b16. 
INFO  20:22:29,576 HelpFormatter - Date/Time: 2015/11/02 20:22:29 
INFO  20:22:29,576 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  20:22:29,576 HelpFormatter - ------------------------------------

### Run VCF File through SnpEff to predict coding consequences

Running SnpEff on output vcf file: 
    - -Xmx16g : specify amount of java memory 
    - -t : multi-threaded mode, cannot spedify threads, determines automatically
    - -v : verbose mode
    - -noStats : save time by not calculating stats on the variants found
    - GRCh37.74 : specifies reference database to use, matched with VCF files

In [17]:
# command to run snpeff on mac
#!java -Xmx16g -jar /Users/selasady/tools/snpEff/snpEff.jar -t -v -noStats GRCh37.74 distinct_vars.vcf > distinct_snpeff.vcf

# command to run snpeff windows
!java -Xmx16g -jar D:/Documents/tools/snpEff/snpEff.jar -t -noStats GRCh37.74 distinct_vars.vcf > distinct_snpeff.vcf

### Output SnpEff effects as tsv file, one effect per line

Convert the snpeff from vcf format to tsv, with one line per effect, instead of grouped with multiple effects for each position. 

NOTE: vcfEffOnePerLine.pl does not work properly in older verions of SnpEff, make sure you have downloaded the latest version. The perl script is containted within the /snpeff/scripts directory that comes with snpeff. 

In [18]:
cat distinct_snpeff.vcf \
    | /Users/selasady/tools/snpEff/scripts/vcfEffOnePerLine.pl \
    | java -jar /Users/selasady/tools/snpEff/SnpSift.jar extractFields \
    - CHROM POS REF ALT "ANN[*].GENE" "ANN[*].GENEID" "ANN[*].EFFECT" "ANN[*].IMPACT" \
    "ANN[*].FEATURE" "ANN[*].FEATUREID" "ANN[*].BIOTYPE" "ANN[*].RANK" \
    "ANN[*].HGVS_C" "ANN[*].HGVS_P" > distinct.tsv 

SyntaxError: invalid syntax (<ipython-input-18-90aeeedd9941>, line 1)

### Remove Header and convert to CSV

To prepare for upload to HDFS and conversion into impala table, the header will be removed so we can specify our own column names in the schema below and parse. 

In [27]:
!sed '1d' distinct.tsv | tr '\t' ',' > distinct_snpeff.csv

## Save Results as an Impala Table

### Upload Results to HDFS

The CSV file is uploaded to HDFS, in a directory called 'snpeff'. 

In [28]:
path = '/user/selasady/training/'

# create directory
hdfs.mkdir(path)

# upload file
file_name = 'distinct_snpeff.csv'
admix_file = path + '/' + file_name

# upload admix file
hdfs.put(path, file_name, verbose=True)



'/user/selasady/training/distinct_snpeff.csv'

### Convert TSV into Ibis Object

Once the TSV file is in HDFS, we can use ibis to convert it into an impala table by pairing the file with a schema. 

In [29]:
# define table schema for tsv file
schema = ibis.schema([
    ('chrom', 'string'), 
    ('pos', 'int32'),
    ('ref', 'string'),
    ('alt', 'string'),
    ('gene', 'string'),
    ('gene_id', 'string'),
    ('effect', 'string'),
    ('impact', 'string'),
    ('feature', 'string'),
    ('feature_id', 'string'),
    ('biotype', 'string'),
    ('rank', 'int32'),
    ('hgvs_c', 'string'),
    ('hgvs_p', 'string')
])

# create ibis object from  tsv
snpeff = con.delimited_file(path, schema)

In [36]:
print snpeff.limit(5)

  chrom     pos ref alt gene gene_id effect impact feature feature_id biotype  \
0     1  949696  CG   .                                                         
1     1  949696  CG   .                                                         
2     1  949739   T   .                                                         
3     1  949739   T   .                                                         
4     1  985955   C   .                                                         

   rank hgvs_c hgvs_p  
0  None                
1  None                
2  None                
3  None                
4  None                


### Create impala table

Once the TSV file is paired with a schema, we can save it as a table in impala. 

In [38]:
con.create_table('coding_consequences', snpeff, database='p7_ref_grch37')