# Introduction to VAPr: package for the aggregation of genomic variant data 

#### Author: C. Mazzaferro & Kathleen Fisch
#### Email: cmazzafe@ucsd.edu
#### Date: June 2016
 
## Outline of Notebook
<a id = "toc"></a>
1. <a href = "#background">Background</a>
2. <a href = "#setup">Set Up File and Libraries</a>
   * <a href = "#naming">Set paths and names</a>
3. <a href = "#filtering">Implement a filter</a>
4. <a href = "#export">Export CSV and VCF files</a>

<a id = "background"></a>
## Background

This notebook walks through the basic steps of annotating variants from a VCF file efficiently and thoroughly using the VAPr package. In particular, the package provides a way to retrieve variant information using [ANNOVAR](http://annovar.openbioinformatics.org/en/latest/) and [myvariant.info](http://myvariant.info) and to consolidate it into convenient formats. It is well suited for bioinformaticians interested in aggregating variant information into a single database for ease of use and for more advanced analysis.

For a more complete description of the functionalities of the package, you can visit the [VAPr Sample Usage Notebook](https://github.com/ucsd-ccbb/VAPr/blob/master/VAPr%20Sample%20Usage.ipynb) and/or check the full documentation on GitHub. 
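
To make the consolidation idea concrete, here is a minimal sketch. It is not VAPr's actual implementation. It assumes that annotations from the two sources arrive as plain dictionaries keyed by an HGVS-style variant ID such as `chr1:g.35367G>A` (the ID style myvariant.info uses for single-nucleotide variants). The helper names `snv_hgvs_id` and `consolidate`, the field names, and the sample values are all hypothetical and chosen only for illustration.

```python
def snv_hgvs_id(chrom, pos, ref, alt):
    """Build an HGVS-style genomic ID for a single-nucleotide variant."""
    chrom = chrom if chrom.startswith("chr") else "chr" + chrom
    return f"{chrom}:g.{pos}{ref}>{alt}"


def consolidate(annovar_records, myvariant_records):
    """Merge two {variant_id: annotations} mappings into one document per variant."""
    merged = {}
    for variant_id in set(annovar_records) | set(myvariant_records):
        doc = {"hgvs_id": variant_id}
        doc.update(annovar_records.get(variant_id, {}))
        doc.update(myvariant_records.get(variant_id, {}))
        merged[variant_id] = doc
    return merged


# Toy inputs (illustrative values, not real annotations)
variant_id = snv_hgvs_id("1", 35367, "G", "A")
annovar = {variant_id: {"Func_knownGene": "exonic"}}
myvariant = {variant_id: {"cadd": {"phred": 1.0}}}

documents = consolidate(annovar, myvariant)
print(documents[variant_id])
```

Keying both sources on the same ID is what allows one document per variant to be stored in a single database collection.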

### Download databases and run ANNOVAR

In [111]:
import pandas as pd

# VAPr functions
from VAPr import parser_models, annovar_suprocess
import importlib
importlib.reload(parser_models)
importlib.reload(annovar_suprocess)

<module 'VAPr.annovar_suprocess' from '/Users/carlomazzaferro/Documents/CCBB/ucsd-ccbb/VAPr/VAPr/annovar_suprocess.py'>

In [142]:
IN_PATH = "/Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf"
OUT_PATH = "/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/"
ANNOVAR_PATH = '/Volumes/Carlo_HD1/CCBB/annovar/'   #location of the scripts and databases

sub_process = annovar_suprocess.AnnovarWrapper(IN_PATH, OUT_PATH, ANNOVAR_PATH)
sub_process.check_for_database_updates()

['perl /Volumes/Carlo_HD1/CCBB/annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avdblist /Volumes/Carlo_HD1/CCBB/annovar/humandb/']
Currently downloading database file: hg19_avdblist

Annovar finished dowloading on file : hg19_avdblist. A .txt file has been created in the ANNOVAR_PATH directory

2016-11-28 00:00:00 clinvar_20161128.txt.gz
Database clinvar_20161128.txt.gz outdated, will download newer version
clinvar_20161128
['perl /Volumes/Carlo_HD1/CCBB/annovar/annotate_variation.pl -buildver hg19 -downdb -webfrom annovar clinvar_20161128 /Volumes/Carlo_HD1/CCBB/annovar/humandb/']
Currently downloading database file: hg19_clinvar_20161128
Currently downloading database file: hg19_clinvar_20161128

Annovar finished dowloading on file : hg19_clinvar_20161128. A .txt file has been created in the ANNOVAR_PATH directory
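
The log above suggests that database versions are identified by a date embedded in the file name (for example `clinvar_20161128`). How `check_for_database_updates()` performs the comparison internally is not shown in this notebook, so the following is only a sketch of one way such a check could work. The function names `database_date` and `is_outdated`, and the local database name used in the example, are hypothetical.

```python
import re
from datetime import datetime


def database_date(name):
    """Return the date embedded in a name like 'clinvar_20161128', or None."""
    match = re.search(r"_(\d{8})(?:\D|$)", name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        # Eight digits that do not form a valid calendar date
        return None


def is_outdated(local_name, remote_name):
    """True if both names carry a date and the local one is older."""
    local, remote = database_date(local_name), database_date(remote_name)
    if local is None or remote is None:
        return False
    return local < remote


print(database_date("clinvar_20161128.txt.gz"))
print(is_outdated("clinvar_20150629", "clinvar_20161128.txt.gz"))
```

Names without an eight-digit date, such as `knownGene` or `1000g2015aug`, are simply treated as not comparable in this sketch.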



In [66]:
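# Count the supported databases that match an entry in db_dict.
# NOTE: db_dict is not defined anywhere in this notebook; it is assumed to be
# a dict keyed by ANNOVAR database names that was created in an earlier session.
is_ = 0
for db in sub_process.supported_databases:
    for db_ in db_dict:
        if db_.startswith(db):
            print(db)
            is_ += 1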

esp6500siv2_all
knownGene
knownGene
popfreq_all_20150413
nci60
cosmic70
1000g2015aug
clinvar_20161128


In [75]:
int('03')

3


In [26]:
sub_process.annovar_command_str

'perl /Volumes/Carlo_HD1/CCBB/annovar/table_annovar.pl /Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf /Volumes/Carlo_HD1/CCBB/annovar/humandb/ -buildver hg19 -out /Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/Normal_targeted_seq_annotated -remove -protocol genomicSuperDups,cosmic70,esp6500siv2_all,nci60,cytoBand,clinvar_20161128,targetScanS,tfbsConsSites,knownGene -operation r,f,f,f,r,f,r,r,g -nastring . -otherinfo -vcfinput'
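
The command string above pairs each database in `-protocol` with an operation code in `-operation` (`g` for gene-based, `r` for region-based, `f` for filter-based annotation). As a rough sketch of how such a string could be assembled, and not the code `AnnovarWrapper` actually uses, the hypothetical helper below builds the same layout from a list of `(database, operation)` pairs. The function name and the example paths are assumptions.

```python
import os


def build_table_annovar_command(vcf_path, out_dir, annovar_dir, databases,
                                buildver="hg19"):
    """Assemble a table_annovar.pl command from (database, operation) pairs."""
    base = os.path.splitext(os.path.basename(vcf_path))[0]
    protocols = ",".join(name for name, _ in databases)
    operations = ",".join(op for _, op in databases)
    return " ".join([
        "perl", os.path.join(annovar_dir, "table_annovar.pl"),
        vcf_path,
        os.path.join(annovar_dir, "humandb/"),
        "-buildver", buildver,
        "-out", os.path.join(out_dir, base + "_annotated"),
        "-remove",
        "-protocol", protocols,
        "-operation", operations,
        "-nastring", ".",
        "-otherinfo",
        "-vcfinput",
    ])


# Hypothetical paths, for illustration only
command = build_table_annovar_command(
    "/data/in/sample.vcf",
    "/data/out/",
    "/opt/annovar/",
    [("cytoBand", "r"), ("cosmic70", "f"), ("knownGene", "g")],
)
print(command)
```

Keeping the database and its operation together in one pair makes it harder for the two comma-separated lists to drift out of step when a database is added or removed.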

In [21]:
sub_process.run_annovar()

Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_genomicSuperDups
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cosmic70_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cosmic70_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_nci60_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_nci60_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cytoBand
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_clinvar_20161128_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_clinvar_20161128_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_targetS

'Finished running ANNOVAR on /Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf'

<a id = "setup"></a>
## Import libraries and specify file paths

<a id = "naming"></a>
### Set paths and names

- Annotated CSV file (this assumes it has already been annotated with ANNOVAR; if not, refer to the documentation for instructions on how to perform the annotation).
- VCF (the original file containing the variants).
- MongoDB collection and database names. A MongoDB instance must be running when the main VAPr function `push_to_db()` is called.
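
Because `push_to_db()` needs a running MongoDB instance, it can help to check for one first. The sketch below only tests whether something accepts TCP connections on MongoDB's default port (27017) using the standard library. It does not verify that the listener is really MongoDB, and the function name `mongo_is_reachable` is hypothetical. It also assumes the server runs on `localhost` with default settings.

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if not mongo_is_reachable():
    print("No server answered on localhost:27017; "
          "start mongod before calling push_to_db().")
```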




In [24]:
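# Generic placeholders: replace with the paths to your own files
vcf_file = '.../normal_targeted_seq.vcf'
csv_file = '.../annotated.hg19_multianno.txt'

# ANNOVAR output produced by run_annovar() above
csv_ = '/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/Normal_targeted_seq_annotated.hg19_multianno.txt'

# MongoDB collection and database that will hold the annotated variants
collection_name = 'My_Variant_Info_Collection_Full'
db_name = 'My_Variant_Database'

# Parse the original VCF (IN_PATH) together with its ANNOVAR annotations
pars = parser_models.VariantParsing(IN_PATH, collection_name, db_name, annotated_file=csv_)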

26653


In [25]:
pars.push_to_db()

querying 1-602...done.


KeyError: '1000g20XX'

<a id = "filtering"></a>
## Implement a filter

 - filter 1: ThousandGenomeAll < 0.05 or info not available
 - filter 2: ESP6500siv2_all < 0.05 or info not available
 - filter 3: cosmic70 information is present
 - filter 4: Func_knownGene is exonic, splicing, or both
 - filter 5: ExonicFunc_knownGene is not "synonymous SNV"
 - filter 6: Read Depth (DP) > 10
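
The six conditions above can be expressed as a single predicate over one variant document. The sketch below is not the query `MongoDB_querying.Filters` runs. It assumes each variant is a flat dictionary whose keys match the names in the list, that read depth is stored under `DP`, and that missing values are either absent or the ANNOVAR placeholder `.`. The function name and the toy variants are hypothetical.

```python
MISSING = (None, ".")


def passes_rare_cancer_filter(doc, max_freq=0.05, min_depth=10):
    """Apply the six filters listed above to one variant document (a dict)."""

    def rare_or_missing(key):
        value = doc.get(key)
        return value in MISSING or float(value) < max_freq

    return (
        rare_or_missing("ThousandGenomeAll")                       # filter 1
        and rare_or_missing("ESP6500siv2_all")                     # filter 2
        and doc.get("cosmic70") not in MISSING                     # filter 3
        and doc.get("Func_knownGene")
        in ("exonic", "splicing", "exonic;splicing")               # filter 4
        and doc.get("ExonicFunc_knownGene") != "synonymous SNV"    # filter 5
        and doc.get("DP", 0) > min_depth                           # filter 6
    )


# Toy documents (illustrative values only)
kept = {"ThousandGenomeAll": 0.01, "cosmic70": "ID=EXAMPLE",
        "Func_knownGene": "exonic",
        "ExonicFunc_knownGene": "nonsynonymous SNV", "DP": 35}
synonymous = dict(kept, ExonicFunc_knownGene="synonymous SNV")
shallow = dict(kept, DP=5)

for name, doc in [("kept", kept), ("synonymous", synonymous), ("shallow", shallow)]:
    print(name, passes_rare_cancer_filter(doc))
```

In the first toy document `ESP6500siv2_all` is absent, which counts as "info not available" and therefore passes filter 2.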

In [12]:
#Apply filter.
from VAPr import MongoDB_querying, file_writer
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
#rare_cancer_variants = filter_collection.rare_cancer_variant()

#Create writer object for filtered lists:
my_writer = file_writer.FileWriter(db_name, collection_name)

#cancer variants filtered files
#my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
#my_writer.generate_annotated_vcf(rare_cancer_variants,input_vcf_compressed, rare_cancer_variants_vcf)

In [16]:
my_writer.generate_unfiltered_annotated_csv('/Volumes/Carlo_HD1/fffs.csv')

'Finished writing annotated CSV file'

<a id = "export"></a>
## Write Filtered Annotated Output to CSV and VCF
A compressed `.gz` VCF file is required if VCF output is needed. Refer to the documentation for more info.

In [None]:
# Create output files (if needed): specify the names and paths of the files

rare_cancer_variants_csv = ".../out_files/filtered_csv.csv"
rare_cancer_variants_vcf = ".../out_files/filtered_vcf.vcf"

input_vcf_compressed = '.../test_vcf/Tumor_RNAseq_variants.vcf.gz'

# Apply the filter (this call is commented out in the filtering cell above)
rare_cancer_variants = filter_collection.rare_cancer_variant()

# Create writer object for filtered lists
# (file_writer is the module imported in the filtering cell)
my_writer = file_writer.FileWriter(db_name, collection_name)

# Write the filtered cancer variants to CSV and VCF
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
my_writer.generate_annotated_vcf(rare_cancer_variants, input_vcf_compressed, rare_cancer_variants_vcf)