# Introduction to VAPr: a package for the aggregation of genomic variant data

#### Authors: C. Mazzaferro & Kathleen Fisch
#### Email: cmazzafe@ucsd.edu
#### Date: June 2016
 
## Outline of Notebook
<a id = "toc"></a>
1. <a href = "#background">Background</a>
2. <a href = "#setup">Set Up File and Libraries</a>
  * <a href = "#naming">Set paths and names</a>
3. <a href = "#filtering">Implement a filter</a>
4. <a href = "#export">Export CSV and VCF files</a>

<a id = "background"></a>
## Background

This notebook walks you through the basic steps of annotating variants from a VCF file efficiently and thoroughly using the VAPr package. In particular, the package provides a way of retrieving variant information using [ANNOVAR](http://annovar.openbioinformatics.org/en/latest/) and [myvariant.info](http://myvariant.info) and consolidating it in convenient formats. It is well suited for bioinformaticians interested in aggregating variant information into a single database, both for ease of use and to support more powerful downstream analysis.

For a more complete description of the package's functionality, see the [VAPr Sample Usage Notebook](...) and/or the full documentation on GitHub.

<a id = "setup"></a>
## Import libraries and specify file paths

In [124]:
import pandas as pd

# VAPr modules
from VAPr import parser_models, annovar_suprocess

# Reload the modules so that local edits are picked up without restarting the kernel
import importlib
importlib.reload(parser_models)
importlib.reload(annovar_suprocess)

<module 'VAPr.annovar_suprocess' from '/Users/carlomazzaferro/Documents/CCBB/ucsd-ccbb/VAPr/VAPr/annovar_suprocess.py'>

<a id = "naming"></a>
### Set paths and names

- Annotated CSV file (assuming it has already been annotated with ANNOVAR; if not, refer to the documentation for instructions on how to perform the annotation).
- VCF file (the original file containing the variants).
- MongoDB collection and database names. A MongoDB instance must be running when the main VAPr function `push_to_db()` is called.
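
Because `push_to_db()` needs a running MongoDB instance, it can help to confirm that the server is reachable before starting a long annotation run. The following is a minimal sketch that uses only the standard library. The helper name `mongo_is_reachable` is hypothetical and not part of VAPr, and the sketch assumes MongoDB listens on `localhost` at its default port, 27017. It only checks that something accepts TCP connections on that port, not that the listener is actually MongoDB.

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper (not part of VAPr). The defaults assume a local
    MongoDB instance on its default port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names
        return False


if __name__ == "__main__":
    if mongo_is_reachable():
        print("Port 27017 is open; push_to_db() should be able to connect.")
    else:
        print("Nothing is listening on localhost:27017; start MongoDB first.")
```

If your instance runs on another host or port, pass those values explicitly.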




In [122]:
vcf_file = '/Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf'
csv_file = '/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/annotated.hg19_multianno.txt'

collection_name = 'My_Variant_Info_Collection_Full'
db_name = 'My_Variant_Database'

pars = parser_models.VariantParsing(vcf_file, collection_name, db_name, annotated_file=csv_file)

In [123]:
pars.push_to_db()

querying 1-601...done.
601 601
querying 1-607...done.
607 607
querying 1-605...done.
605 605
querying 1-605...done.
605 605
querying 1-101...done.
101 101


'Done'

<a id = "filtering"></a>
## Implement a filter

 - filter 1: ThousandGenomeAll < 0.05 or info not available
 - filter 2: ESP6500siv2_all < 0.05 or info not available
 - filter 3: cosmic70 information is present
 - filter 4: Func_knownGene is exonic, splicing, or both
 - filter 5: ExonicFunc_knownGene is not "synonymous SNV"
 - filter 6: Read Depth (DP) > 10
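
To make the criteria above concrete, here is a rough sketch of the same logic as a plain Python predicate. It is not VAPr's implementation. It assumes each variant is a flat dictionary keyed by the field names in the list, that "info not available" means the key is missing or `None`, and that "exonic, splicing, or both" corresponds to the values `exonic`, `splicing`, and `exonic;splicing`. The function names and the sample records are made up for illustration.

```python
# Func_knownGene values treated as "exonic, splicing, or both" (assumed spellings)
CODING_OR_SPLICING = {"exonic", "splicing", "exonic;splicing"}


def is_rare(doc, field, threshold=0.05):
    """True if the frequency is below the threshold or not available."""
    value = doc.get(field)
    return value is None or float(value) < threshold


def passes_rare_cancer_filter(doc):
    """Apply the six listed criteria to one flat variant record (a dict)."""
    depth = doc.get("DP")
    return (
        is_rare(doc, "ThousandGenomeAll")                         # filter 1
        and is_rare(doc, "ESP6500siv2_all")                       # filter 2
        and doc.get("cosmic70") is not None                       # filter 3
        and doc.get("Func_knownGene") in CODING_OR_SPLICING       # filter 4
        and doc.get("ExonicFunc_knownGene") != "synonymous SNV"   # filter 5
        and depth is not None and int(depth) > 10                 # filter 6
    )


# Made-up records for illustration only
records = [
    {"ThousandGenomeAll": 0.01, "cosmic70": "ID=COSM0", "Func_knownGene": "exonic",
     "ExonicFunc_knownGene": "nonsynonymous SNV", "DP": 35},
    {"ThousandGenomeAll": 0.30, "cosmic70": "ID=COSM0", "Func_knownGene": "exonic",
     "ExonicFunc_knownGene": "nonsynonymous SNV", "DP": 35},
]
kept = [r for r in records if passes_rare_cancer_filter(r)]
print(len(kept))  # prints 1: the second record is too common in ThousandGenomeAll
```

In the notebook itself the filtering is done by the package against the MongoDB collection, so the record layout used here may differ from what is actually stored.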

In [9]:
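# Apply the filter.
# NOTE: MongoDB_querying and create_output_files are not imported in the cells above,
# and the output paths used below are defined in the export section of this notebook;
# both must be available before this cell is run.
filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_cancer_variants = filter_collection.rare_cancer_variant()

# Create a writer object for the filtered lists
my_writer = create_output_files.FileWriter(db_name, collection_name)

# Write the filtered cancer variants to CSV and VCF
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
my_writer.generate_annotated_vcf(rare_cancer_variants, input_vcf_compressed, rare_cancer_variants_vcf)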

Variants found that match rarity criteria: 11


'Finished writing annotated, filtered VCF file'

<a id = "export"></a>
## Write Filtered Annotated Output to CSV and VCF
A compressed (`.gz`) copy of the input VCF is required if VCF output is needed. Refer to the documentation for more information.
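
If only an uncompressed VCF is at hand, a `.gz` copy can be produced with Python's standard library, as in the sketch below. The helper name `gzip_copy` is hypothetical. The source does not say whether plain gzip is enough or whether block-gzip output (as produced by `bgzip`) is expected, so treat plain gzip as an assumption and check the documentation.

```python
import gzip
import shutil


def gzip_copy(src_path, dest_path=None):
    """Write a gzip-compressed copy of src_path and return the new path.

    Hypothetical helper (not part of VAPr). The original file is left in place.
    """
    if dest_path is None:
        dest_path = src_path + ".gz"
    # Stream the bytes so that large VCF files are not loaded into memory at once
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


# Example, using the vcf_file variable defined earlier in this notebook:
# input_vcf_compressed = gzip_copy(vcf_file)
```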

In [None]:
# Create output files (if needed): specify the names and paths of the files

rare_cancer_variants_csv = ".../out_files/filtered_csv.csv"
rare_cancer_variants_vcf = ".../out_files/filtered_vcf.vcf"

input_vcf_compressed = '.../test_vcf/Tumor_RNAseq_variants.vcf.gz'

# Create a writer object for the filtered lists
my_writer = create_output_files.FileWriter(db_name, collection_name)

# Write the filtered cancer variants to CSV and VCF
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)
my_writer.generate_annotated_vcf(rare_cancer_variants, input_vcf_compressed, rare_cancer_variants_vcf)
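
After the files are written, a quick sanity check is to count the rows in the exported CSV and compare the result with the number of variants reported by the filter. The sketch below is one way to do that with the standard library. The helper name `count_csv_rows` is hypothetical, and the sketch assumes the export is comma-delimited with a single header line and one row per variant, which the source does not confirm.

```python
import csv


def count_csv_rows(csv_path):
    """Count the data rows in a CSV file, excluding the header line.

    Hypothetical helper (not part of VAPr).
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)  # consumes the first line as the header
        return sum(1 for _ in reader)


# Example, using the output path defined above:
# n_rows = count_csv_rows(rare_cancer_variants_csv)
# print(n_rows)
```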