# VAPr Quick-Start Guide

#### Authors: C. Mazzaferro, A. Birmingham, Kathleen Fisch
#### Contact Email: kfisch@ucsd.edu
#### Last Update: February 2017
 
## Table of Contents

* [Introduction](#Introduction)
* [Installation](#Installation)
    * [Python 3 and MongoDB](#Python-3-and-MongoDB)
    * [ANNOVAR](#ANNOVAR)
    * [VAPr](#VAPr)
    * [Dataset Downloads](#Dataset-Downloads)
* [Usage](#Usage)
    * [Novel Variant Annotation](#Novel-Variant-Annotation)
    * [Known Variant Annotation and Storage](#Known-Variant-Annotation-and-Storage)
    * [Variant Prioritization](#Variant-Prioritization)
    * [File Export](#File-Export)
* [Citations](#Citations)

## Introduction

This notebook walks you through the basic steps of annotating variants from a VCF file efficiently and thoroughly using the VAPr package. VAPr retrieves variant information using [ANNOVAR](http://annovar.openbioinformatics.org/en/latest/) and/or [myvariant.info](http://myvariant.info) through Python calls and consolidates it in convenient formats. It is well suited for bioinformaticians interested in aggregating variant information for ease of use and to support more advanced analyses.

For a more complete description of the functionalities of the package, you can visit the [VAPr Sample Usage Notebook](https://github.com/ucsd-ccbb/VAPr/blob/master/VAPr%20Sample%20Usage.ipynb) and/or check the full documentation on GitHub. 

*Note to reviewers*: this notebook is in all its functionalities the same as the one in the official documentation (https://github.com/ucsd-ccbb/VAPr/blob/master/VAPr%20Quick-Start%20Guide.ipynb). We have added some notes here regarding performance and regarding the file structure of this instance.

## Installation

VAPr requires a Python 3 environment with the pandas library and MongoDB installed.  Here, we provide instructions for how to set up such an environment using  `conda`, a cross-platform package and environment manager.  (If you wish to follow this approach but do not already have either the full or minimal version of `conda` (`Anaconda` or `miniconda`, respectively) installed on your system, visit [https://conda.io/docs/get-started.html](https://conda.io/docs/get-started.html) to decide which is right for you and then follow the instructions provided there to install it.)

### Python 3 and MongoDB

VAPr is written in Python 3 and stores variant annotations in a NoSQL database, using a locally installed instance of MongoDB.  If you wish to run the Jupyter notebooks provided with VAPr locally, it is also necessary to have a local Jupyter notebook server installed. To set up these requirements in your machine's default environment using `conda`, run the following command:

    conda install python=3 pandas mongodb pymongo jupyter notebook

MongoDB also needs a location to store its data, so create a directory for this in the location of your choice, e.g.:

    mkdir -p /Temp/MongoDbData
    
(or, on Windows systems, `md \Temp\MongoDbData`). 

Then start MongoDB with the command

    mongod --dbpath /Temp/MongoDbData

    
To check whether MongoDB is currently running (on Linux systems that manage MongoDB as a service), run:

    service mongod status
    
which should return    
    
    mongod (pid 591) is running...
    
591 is the process ID, so it may differ each time MongoDB is started. If MongoDB is not running, the command will return:

    mongod dead but subsys locked
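
The `service` command is not available on every platform, so a platform-independent check can be useful. The sketch below simply tests whether anything accepts TCP connections on MongoDB's default port, 27017. The function name `mongod_is_listening` is hypothetical, and the check assumes MongoDB was started on the default host and port; it confirms only that the port is open, not that the listener is really MongoDB.

```python
import socket


def mongod_is_listening(host="localhost", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `mongod_is_listening()` after starting `mongod` should return `True`; a `False` result suggests that the server is not running or is listening on a different port.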
    
    
### VAPr

VAPr is available from PyPI.  Once the above requirements have been installed, VAPr itself can be installed by running:

    pip install VAPr


### ANNOVAR
(It is possible to proceed without installing ANNOVAR. In that case, only the variants found in MyVariant.info will be annotated and sent to MongoDB, and users can skip the next steps and go straight to the section **Known Variant Annotation and Storage**.)


Users who wish to annotate novel variants will also need to have a local installation of the popular command-line software ANNOVAR ([1](#Citations)), which VAPr wraps with a Python interface.  If you use ANNOVAR's functionality through VAPr, please remember to cite the ANNOVAR publication (see #1 in [Citations](#Citations))!

The base ANNOVAR program must be installed by each user individually, since its license agreement does not permit redistribution.  Please visit the ANNOVAR download form at [http://www.openbioinformatics.org/annovar/annovar_download_form.php](http://www.openbioinformatics.org/annovar/annovar_download_form.php), ensure that you meet the requirements for a free license, and fill out the required form.  You will then receive an email providing a link to the latest ANNOVAR release file.  Download this file (which will usually have a name like `annovar.latest.tar.gz`) and place it in the location on your machine in which you would like the ANNOVAR program and its data to be installed--the entire disk size of the databases will be around 25 GB, so make sure you have such space available!  Record the complete path to this location in the `ANNOVAR_PATH` variable below so that VAPr will know where to access it:


In [3]:
# NOTE: You MUST fill in this variable with the location of ANNOVAR *on your system* 
ANNOVAR_PATH = # e.g., /User/Me/Applications/annovar/

Note that if execution of the above cell caused a syntax error, you have probably forgotten to fill in the path to ANNOVAR on your system; please add that information to the above cell and run it again!

Then simply extract the ANNOVAR tar.gz file in that location to complete the software installation. 
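
If you prefer to perform the extraction from Python rather than from the command line, the standard-library `tarfile` module can do it. The following is a minimal sketch; the function name `extract_annovar` is hypothetical, and it assumes the downloaded file is a gzip-compressed tar archive, as the `.tar.gz` name suggests.

```python
import os
import tarfile


def extract_annovar(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the names found there."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python versions offer extraction filters that reject unsafe members
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

The returned listing shows what the archive unpacked to, which helps confirm the exact directory to record in `ANNOVAR_PATH`.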

### Dataset Downloads

ANNOVAR requires a number of external data files containing annotation information.  VAPr provides wrapper functions that automatically download these data files and, when desired, update them. The data sources currently supported by default are:
 
- knownGene
- tfbsConsSites
- cytoBand
- genomicSuperDups
- esp6500siv2_all
- 1000g2015aug_all
- popfreq_all
- clinvar_20140929
- cosmic70
- nci60

These can be expanded or restricted by the user, depending on his or her research needs. 

The simple Python code below downloads and installs the necessary databases for ANNOVAR:

In [4]:
# third-party libraries
import pandas as pd

# VAPr libraries
from VAPr import parser_models, annovar_suprocess

IN_PATH = # e.g., ".../normal_targeted_seq.vcf"
OUT_PATH = # e.g., ".../csv_files/"
sub_process = annovar_suprocess.AnnovarWrapper(IN_PATH, OUT_PATH, ANNOVAR_PATH)
sub_process.download_dbs()

Currently downloading database file: annovar_downdb
Currently downloading database file: hg19_knownGene
Currently downloading database file: hg19_kgXref
Currently downloading database file: hg19_knownGeneMrna
Currently downloading database file: hg19_knownGene

Annovar finished dowloading on file : hg19_knownGene. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: tfbsConsSites
Currently downloading database file: hg19_kgXref

Annovar finished dowloading on file : hg19_kgXref. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: hg19_knownGeneMrna
Currently downloading database file: cytoBand
Currently downloading database file: cytoBand

Annovar finished dowloading on file : cytoBand. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: targetScanS
Currently downloading database file: targetScanS

Annovar finished dowloading on file : targetScanS. A .tx

'Finished downloading databases to /Applications/annovar/humandb/'

## Usage

### Novel Variant Annotation

Variant calling pipelines often identify novel variants that have not been annotated in existing data sources. VAPr encourages (but does not require) the use of ANNOVAR to assign even these novel variants minimal annotation such as predicted mutation type.  Such primary annotation can be run through VAPr as shown in the below example:

In [4]:
# Get an estimate of the number of variants in the VCF file
# (this counts every line, so header lines are included in the total)
with open(IN_PATH) as vcf:
    print("Number of variants in vcf file: %i" % sum(1 for line in vcf))

Number of variants in vcf file: 26625
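
The figure above is a raw line count. In the VCF format, header and metadata lines begin with `#`, so a closer count of the actual variant records can be obtained by skipping them. The sketch below assumes an uncompressed VCF file; the function name `count_vcf_records` is hypothetical.

```python
def count_vcf_records(path):
    """Count data lines in an uncompressed VCF, skipping '#' header lines and blank lines."""
    count = 0
    with open(path) as vcf:
        for line in vcf:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Calling `count_vcf_records(IN_PATH)` would then report the number of records rather than the number of lines.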


In [28]:
%%time
sub_process.run_annovar()

Currently working on VCF file: test_file_one_annotated, field hg19_tfbsConsSites
Currently working on VCF file: test_file_one_annotated, field hg19_nci60_filtered
Currently working on VCF file: test_file_one_annotated, field hg19_nci60_dropped
Currently working on VCF file: test_file_one_annotated, field hg19_cytoBand
Currently working on VCF file: test_file_one_annotated, field hg19_clinvar_20161128_filtered
Currently working on VCF file: test_file_one_annotated, field hg19_clinvar_20161128_dropped
Currently working on VCF file: test_file_one_annotated, field hg19_esp6500siv2_all_filtered
Currently working on VCF file: test_file_one_annotated, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: test_file_one_annotated, field 2015_08_filtered
Currently working on VCF file: test_file_one_annotated, field 2015_08_dropped
Currently working on VCF file: test_file_one_annotated, field hg19_cosmic70_filtered
Currently working on VCF file: test_file_one_annotated, field hg19_cos

'Finished running ANNOVAR on /data/vcf_files/test_file_one.vcf'

~1 minute for 25k variants

Note that the screen output of `run_annovar` includes a statement like the following:

    Annovar finished working on file : test_file_one_annotated. A .txt file has been created in the OUT_PATH directory
    
This file name and path will be used in the next step.

### Known Variant Annotation and Storage

Most often, researchers are primarily interested in variants about which some information is already available (such as population frequency or disease association).  VAPr utilizes the MyVariant.info web service ([2](#Citations)) to identify up-to-date annotation information from more than a dozen sources for such known variants; the resulting annotations are then stored to the local MongoDB database.  If additional annotations should be included--such as those generated by ANNOVAR for novel variants--these can be provided as a CSV-formatted file.  

To attach known annotations to variants and store them to the local database, it is only necessary to provide three pieces of information:
 1. The path to the VCF file containing the variants to be annotated and stored
 2. The name of the MongoDB database to which the variants will be stored.  Note that this database need not be created ahead of time by the user: if a database of the input name does not exist, VAPr will automatically create one.
 3. The name of the MongoDB collection to which the variants will be stored.  As for the database, this collection need not pre-exist, since VAPr will automatically create it if necessary.
 
Optionally, the user may also provide the path to a CSV file containing additional annotations to be stored. These are the annotations produced by ANNOVAR. If the argument `annotated_file=csv_file` is not provided, the package retrieves annotations only from MyVariant.info.

Below is an example of the simple Python required to annotate and store variants, including novel variant annotations produced earlier with ANNOVAR:

In [29]:
vcf_file = # e.g., '/Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf'
csv_file = # e.g., '/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/Normal_targeted_seq_annotated.hg19_multianno.txt'
db_name = # e.g., 'My_Variant_Database'
collection_name = # e.g., 'My_Variant_Collection_File_One'

pars = parser_models.VariantParsing(vcf_file, collection_name, db_name, annotated_file=csv_file)

### annotate_and_save


The main method of the package is `annotate_and_save`. It reads your variants from the VCF file, extracts their HGVS IDs, retrieves the corresponding annotations from MyVariant.info, and stores them in MongoDB. 

In [30]:
%%time 
pars.annotate_and_save() 

querying 1-602...done.
querying 1-604...done.
querying 1-605...done.
querying 1-602...done.
querying 1-601...done.
querying 1-604...done.
querying 1-603...done.
querying 1-601...done.
querying 1-603...done.
querying 1-601...done.
querying 1-604...done.
querying 1-604...done.
querying 1-606...done.
querying 1-601...done.
querying 1-608...done.
querying 1-606...done.
querying 1-604...done.
querying 1-603...done.
querying 1-602...done.
querying 1-602...done.
querying 1-604...done.
querying 1-605...done.
querying 1-600...done.
querying 1-604...done.
querying 1-605...done.
querying 1-600...done.
querying 1-605...done.
querying 1-603...done.
querying 1-601...done.
querying 1-600...done.
querying 1-606...done.
querying 1-604...done.
querying 1-608...done.
querying 1-600...done.
querying 1-604...done.
querying 1-603...done.
querying 1-605...done.
querying 1-603...done.
querying 1-604...done.
querying 1-603...done.
querying 1-601...done.
querying 1-607...done.
querying 1-605...done.
querying 1-

'Done'

~3 minutes for 25k variants
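
Variants are looked up by HGVS identifiers such as `chr20:g.25194768A>G`. To illustrate how such an identifier relates to the fields of a variant, the sketch below builds one from ANNOVAR-style values, in which `-` marks the missing allele of an insertion or deletion. It is not VAPr's own implementation: the function name `hgvs_id` is hypothetical, and only substitutions, insertions, and deletions are handled.

```python
def hgvs_id(chrom, start, ref, alt):
    """Build an HGVS-style genomic ID from ANNOVAR-style fields ('-' marks an indel)."""
    chrom = str(chrom)
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    start = int(start)
    if ref == "-":
        # Insertion between positions start and start + 1
        return "%s:g.%d_%dins%s" % (chrom, start, start + 1, alt)
    if alt == "-":
        # Deletion of len(ref) bases beginning at start
        end = start + len(ref) - 1
        if end == start:
            return "%s:g.%ddel" % (chrom, start)
        return "%s:g.%d_%ddel" % (chrom, start, end)
    if len(ref) == 1 and len(alt) == 1:
        # Single-nucleotide substitution
        return "%s:g.%d%s>%s" % (chrom, start, ref, alt)
    raise ValueError("only substitutions, insertions and deletions are handled")
```

For example, `hgvs_id("20", 25194768, "A", "G")` yields the identifier quoted above.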

A great introduction to the concepts relevant to MongoDB, and specifically to interacting with it from Python, can be found [here](https://docs.mongodb.com/getting-started/python/introduction/). Of particular interest is the explanation of how documents are formatted and stored inside the database. Variants follow that format as well: a sample database entry looks roughly as follows (the document for this variant, with HGVS ID chr20:g.25194768A>G, has been reduced to half of its actual size). 


```python
{
  "_id": ObjectId("5887c6d3bc644e51c028971a"),
  "chr": "chr19",
  "cytoband": {
    "Region": "13",
    "Sub_Band": "41",
    "Chromosome": 19,
    "Band": "q",
    "Name": "19q13.41"
  },
  "alt": "C",
  "hg19": {
    "end": 25194768,
    "start": 25194768
  },
  
  ...
  
  "otherinfo": [
    "GT:AD:DP:GQ:PL",
    "1/1:0,2:2:6:89,6,0"
  ],
  "hgvs_id": "chr20:g.25194768A>G",
  "esp6500siv2_all": "0.71",
  "ref": "T",
  "chrom": "20",
  "grasp": {
    "exclusively_male_female": "n",
    "initial_sample_description": [
      "Up to 46186 EA individuals",
      [
        "Up to 3445 EA cases",
        " 6935 EA controls"
      ],
      "69395 EA individuals",
      "69395 EA individuals"
    ],
    "creation_date": "8/17/12",
    "gwas_ancestry_description": "European",
    "srsid": 6083780,
    "platform_snps_passing_qc": [
      "Affymetrix & Illumina [~2.5 million] (imputed)",
      "Illumina [528745]",
      [
        "Affymetrix",
        " Illumina & Perlegen [~2.5 million] (imputed)"
      ],
      [
        "Affymetrix",
        " Illumina & Perlegen [~2.5 million] (imputed)"
      ]
    ],
    "hg19": {
      "chr": 20,
      "pos": 25194768
    },
    "publication": [
      {
        "pmid": 20081858,
        "phenotype": "HOMA-B",
        "p_value": 0.033930000000000001825,
        "snpid": "rs6083780",
        "paper_phenotype_description": [
          "Glucose homeostasis traits (fasting glucose",
          " fasting insulin",
          " HOMA-B",
          " HOMA-IR)"
        ],
        "date_pub": "1/17/2010",
        "title": "New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk.",
        "location_within_paper": "FullData",
        "paper_phenotype_categories": "Quantitative trait(s);Type 2 diabetes (T2D);Blood-related",
        "journal": "Nat Genet"
      },
      {
        "pmid": 20522523,
        "phenotype": "Partial epilepsy",
        "p_value": 0.022849999999999998784,
        "snpid": "rs6083780",
        "paper_phenotype_description": "Epilepsy (partial epilepsy)",
        "date_pub": "6/22/2010",
        "title": "Common genetic variation and susceptibility to partial epilepsies: a genome-wide association study.",
        "location_within_paper": "FullScan",
        "paper_phenotype_categories": "Neuro;Epilepsy",
        "journal": "Brain"
      }
    ],
    "last_curation_date": "8/17/12",
    "replication": [
      {
        "total_samples": 76558,
        "european": 76558
      },
      {
        "total_samples": 133661,
        "european": 133661
      }
    ],
    "discovery": [
      {
        "total_samples": 46186,
        "european": 46186
      },
      {
        "total_samples": 10380,
        "european": 10380
      }
    ],
    "hupfield": "Jan2014",
    "includes_male_female_only_analyses": "n",
    "in_gene": "(ENTPD6)",
    "replication_sample_description": [
      "up to 76558 EA individuals",
      "NR",
      "133661 EA individuals",
      "133661 EA individuals"
    ]
  },
  "start": 51447065
}
```


The richness of the data derives from the MyVariant.info services and the high availability of the datasets they host. Because these databases are updated at least monthly, they can be expected to deliver accurate and relevant data for specific variants. 
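
Because each document is deeply nested, reaching a single value can require several levels of dictionary access. The helper below is a small sketch, not part of VAPr, that follows a dotted path in the style of MongoDB's dot notation and returns a default when any level is missing. The name `get_nested` is hypothetical.

```python
def get_nested(doc, dotted_key, default=None):
    """Follow a dotted path such as 'grasp.hg19.pos' through nested dictionaries."""
    current = doc
    for part in dotted_key.split("."):
        # Stop as soon as a level is missing or is not a dictionary
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# A trimmed-down version of the sample document shown above
sample = {"hgvs_id": "chr20:g.25194768A>G",
          "grasp": {"hg19": {"chr": 20, "pos": 25194768}}}
position = get_nested(sample, "grasp.hg19.pos")
missing = get_nested(sample, "cadd.annotype", default="not annotated")
```

Returning a default instead of raising `KeyError` is convenient here because not every variant carries annotations from every source.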

### Variant Prioritization

Once annotated, variants can be prioritized for further examination by applying built-in or custom filters.  For example, the built-in filter function `rare_cancer_variant()` will identify variants with annotation that satisfies all of the below filter conditions:

 - filter 1: ThousandGenomeAll < 0.05 or info not available
 - filter 2: ESP6500siv2_all < 0.05 or info not available
 - filter 3: cosmic70 information is present
 - filter 4: Func_knownGene is exonic, splicing, or both
 - filter 5: ExonicFunc_knownGene is not "synonymous SNV"
 - filter 6: Read Depth (DP) > 10
 
Identifying these variants with VAPr requires very little coding.  (Note that in this case, the database name and collection name *do* need to refer to existing items, since they specify the variant data that will be searched).

In [35]:
from VAPr import MongoDB_querying

collection_name = 'My_Variant_Collection_File_One'
db_name = 'My_Variant_Database'

filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_cancer_variants = filter_collection.rare_cancer_variant()

Variants found that match rarity criteria: 1
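
To make the six conditions listed above concrete, the sketch below applies them to a single annotation dictionary in plain Python. It is not the code behind `rare_cancer_variant()`: the function names are hypothetical, the key names are assumptions based on the lower-case keys that appear in the stored documents, and the read depth is assumed to be the `DP` entry of the `otherinfo` field.

```python
def _as_float(value):
    """Convert an annotation value to float; return None when missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def read_depth(otherinfo):
    """Extract DP from a ['GT:AD:DP:GQ:PL', '0/1:9,8:17:99:240,0,335'] style pair."""
    try:
        keys, values = otherinfo[0].split(":"), otherinfo[1].split(":")
        return int(dict(zip(keys, values))["DP"])
    except (IndexError, KeyError, TypeError, ValueError, AttributeError):
        return None


def is_rare_cancer_variant(doc, max_freq=0.05, min_depth=10):
    """Apply the six filter conditions to one annotation dictionary."""
    # Filters 1 and 2: population frequency below threshold, or not available
    for key in ("1000g2015aug_all", "esp6500siv2_all"):
        freq = _as_float(doc.get(key))
        if freq is not None and freq >= max_freq:
            return False
    # Filter 3: cosmic70 information is present
    if "cosmic70" not in doc:
        return False
    # Filter 4: exonic, splicing, or both (assumed to be ';'-separated)
    func = set(str(doc.get("func_knowngene", "")).split(";"))
    if not func <= {"exonic", "splicing"}:
        return False
    # Filter 5: not a synonymous SNV
    if doc.get("exonicfunc_knowngene") == "synonymous SNV":
        return False
    # Filter 6: read depth greater than the minimum
    depth = read_depth(doc.get("otherinfo"))
    return depth is not None and depth > min_depth
```

A list comprehension such as `[d for d in docs if is_rare_cancer_variant(d)]` would then select matching variants from any list of annotation dictionaries `docs`.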


Custom filters can be implemented using the keys of the annotation dictionary and the standard [pymongo query syntax](http://api.mongodb.com/python/current/tutorial.html), developed by MongoDB, as shown below:

In [54]:
from pymongo import MongoClient

client = MongoClient()
db = getattr(client, db_name)
collection = getattr(db, collection_name)


filtered = collection.find({"$and": [
                                   {"$or": [{"esp6500siv2_all": {"$lt": 0.1}}, {"esp6500siv2_all": {"$exists": False}}]},
                                   {"$or": [{"func_knowngene": "exonic"}, {"func_knowngene": "splicing"}]},
                                   {"genotype.filter_passing_reads_count": {"$gte": 1}},
                                   {"cosmic70": {"$exists": True}},
                                   {"1000g2015aug_all": {"$exists": True}}
                         ]})

as_list = list(filtered)
len(as_list)

9
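
Queries like the one above are ordinary Python dictionaries, so recurring clauses can be produced by small helper functions. The sketch below rebuilds an equivalent query; the helper names `below_or_missing` and `one_of` are hypothetical, and `$in` is used in place of the `$or` over equality tests, which matches the same documents.

```python
def below_or_missing(field, threshold):
    """Match documents where `field` is below `threshold` or absent altogether."""
    return {"$or": [{field: {"$lt": threshold}}, {field: {"$exists": False}}]}


def one_of(field, values):
    """Match documents where `field` equals any of `values`."""
    return {field: {"$in": list(values)}}


query = {"$and": [
    below_or_missing("esp6500siv2_all", 0.1),
    one_of("func_knowngene", ["exonic", "splicing"]),
    {"genotype.filter_passing_reads_count": {"$gte": 1}},
    {"cosmic70": {"$exists": True}},
    {"1000g2015aug_all": {"$exists": True}},
]}
```

The resulting `query` dictionary can be passed to `collection.find(query)` in the same way as the literal dictionary used above.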

The returned annotation for each variant matching a filter is provided as a Python dictionary that can easily be accessed from a script.  For example, the full annotation for a single variant might look like this:

In [59]:
rare_cancer_variants[0]

{'1000g2015aug_all': 0.00159744,
 '_id': ObjectId('589e35893c5b990f8441ed78'),
 'alt': 'G',
 'cadd': {'1000g': {'af': 0.5, 'afr': 0.5, 'amr': 0.5, 'asn': 0.5, 'eur': 0.5},
  '_license': 'http://goo.gl/bkpNhq',
  'alt': 'G',
  'anc': 'A',
  'annotype': 'CodingTranscript',
  'bstatistic': 872,
  'chmm': {'bivflnk': 0.0,
   'enh': 0.0,
   'enhbiv': 0.008,
   'het': 0.016,
   'quies': 0.504,
   'reprpc': 0.087,
   'reprpcwk': 0.354,
   'tssa': 0.0,
   'tssaflnk': 0.0,
   'tssbiv': 0.008,
   'tx': 0.0,
   'txflnk': 0.008,
   'txwk': 0.008,
   'znfrpts': 0.0},
  'chrom': 7,
  'consdetail': 'missense',
  'consequence': 'NON_SYNONYMOUS',
  'consscore': 7,
  'cpg': 0.03,
  'dna': {'helt': 0.03, 'mgw': 0.4, 'prot': 3.05, 'roll': 3.04},
  'encode': {'exp': 40.15,
   'h3k27ac': 7.04,
   'h3k4me1': 3.0,
   'h3k4me3': 3.0,
   'nucleo': 3.7},
  'exon': '4/5',
  'fitcons': 0.527649,
  'gc': 0.55,
  'gene': {'ccds_id': 'CCDS5872.1',
   'cds': {'cdna_pos': 514,
    'cds_pos': 508,
    'rel_cdna_pos': 0.

A particular element of the annotation, such as the genomic environment of the annotated variant, can be retrieved with simple Python dictionary syntax. In the output above, `annotype` is nested inside the `cadd` entry:

In [None]:
rare_cancer_variants[0]["cadd"]["annotype"]

Alternately, the list of annotation dictionaries can be loaded into a data frame in the [pandas](http://pandas.pydata.org/) Python data analysis library and then manipulated with all the numerous options provided by that package:

In [56]:
df = pd.DataFrame(as_list)
df.columns

Index(['1000g2015aug_all', '_id', 'alt', 'cadd', 'chr', 'chrom', 'cosmic',
       'cosmic70', 'cytoband', 'dbnsfp', 'dbsnp', 'end', 'exac',
       'exac_nontcga', 'exonicfunc_knowngene', 'func_knowngene',
       'gene_knowngene', 'geno2mp', 'genomicsuperdups', 'genotype', 'hg19',
       'hgvs_id', 'mutdb', 'nci60', 'otherinfo', 'ref', 'snpeff', 'start',
       'tfbsconssites', 'vcf', 'wellderly'],
      dtype='object')

In [57]:
df.head(4)

Unnamed: 0,1000g2015aug_all,_id,alt,cadd,chr,chrom,cosmic,cosmic70,cytoband,dbnsfp,...,hgvs_id,mutdb,nci60,otherinfo,ref,snpeff,start,tfbsconssites,vcf,wellderly
0,0.39397,589e35563c5b990f8441c699,A,"{'cpg': 0.24, 'consdetail': ['synonymous', 're...",chr1,1,"{'alt': 'T', 'chrom': '1', 'ref': 'C', 'mut_fr...","ID=COSM1320072;OCCURENCE=2(thyroid),1(ovary)","{'Band': 'p', 'Chromosome': 1, 'Sub_Band': '2'...",,...,chr1:g.120612006G>A,,0.22,"[GT:AD:DP:GQ:PL, 0/1:9,8:17:99:240,0,335]",G,"{'ann': [{'feature_type': 'transcript', 'rank'...",120612006,,"{'alt': 'A', 'position': '120612006', 'ref': 'G'}","{'adviser_score': '5~NOTCH2~Common, Predicted ..."
1,0.289137,589e35593c5b990f8441c96e,AGACCATGGCCCCGCCCAGTCCCT,,chr1,1,,ID=COSM1745478;OCCURENCE=4(urinary_tract),"{'Band': 'q', 'Chromosome': 1, 'Sub_Band': '1'...",,...,chr1:g.203186950_203186951insAGACCATGGCCCCGCCC...,,,"[GT:AD:DP:GQ:PL, 1/1:0,3:3:9:135,9,0]",-,"{'lof': {'gene_id': 'CHIT1', 'genename': 'CHIT...",203186950,,"{'alt': 'CAGACCATGGCCCCGCCCAGTCCCT', 'position...","{'adviser_score': '5~CHIT1~Common, Predicted N..."
2,0.706669,589e35593c5b990f8441ca33,-,,chr1,1,,"ID=COSM244564;OCCURENCE=1(NS),2(pancreas),2(la...","{'Band': 'q', 'Chromosome': 1, 'Name': '1q43',...",,...,chr1:g.240255569_240255571del,,,"[GT:AD:DP:GQ:PL, 1/1:0,2:2:6:80,6,0]",GGC,"{'ann': {'feature_type': 'transcript', 'rank':...",240255569,,"{'alt': 'G', 'position': '240255568', 'ref': '...","{'alt': 'G', 'chrom': '1', 'gene': 'FMN2', 're..."
3,0.062899,589e355e3c5b990f8441ce6a,G,"{'cpg': 0.05, 'consdetail': 'intron', 'gene': ...",chr2,2,,ID=COSM3836683;OCCURENCE=1(breast),"{'Band': 'q', 'Chromosome': 2, 'Sub_Band': '2'...",,...,chr2:g.120015164A>G,,,"[GT:AD:DP:GQ:PL, 0/1:1,4:5:30:156,0,30]",A,"{'ann': [{'feature_id': 'NM_182915.2', 'gene_i...",120015164,,"{'alt': 'G', 'position': '120015164', 'ref': 'A'}","{'alt': 'G', 'chrom': '2', 'gene': 'STEAP3', '..."
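
Some columns hold packed strings rather than numbers. The `cosmic70` values shown above, for instance, follow the pattern `ID=...;OCCURENCE=n(tissue),...`. The sketch below splits such a string into its parts. The function name `parse_cosmic70` is hypothetical, the format is inferred only from the rows displayed here, and the key is spelled `OCCURENCE` because that is how it appears in the data.

```python
import re

# Matches entries such as "2(thyroid)": a count followed by a tissue in parentheses
_OCCURRENCE = re.compile(r"(\d+)\(([^)]*)\)")


def parse_cosmic70(value):
    """Split 'ID=...;OCCURENCE=n(tissue),...' into IDs and per-tissue counts."""
    fields = dict(item.split("=", 1) for item in value.split(";") if "=" in item)
    ids = fields["ID"].split(",") if fields.get("ID") else []
    counts = {}
    for count, tissue in _OCCURRENCE.findall(fields.get("OCCURENCE", "")):
        counts[tissue] = counts.get(tissue, 0) + int(count)
    return {"ids": ids, "occurrences": counts}
```

With pandas, the function could be applied to the whole column, for example `df["cosmic70"].apply(parse_cosmic70)`, provided every row holds a string in this format.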


### File Export

Full or filtered sets of variants and their annotations can be exported to CSV:

In [61]:
from VAPr import file_writer

rare_cancer_variants_csv = '/data/out_files/rare_vars.csv'

my_writer = file_writer.FileWriter(db_name, collection_name)
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)

'Finished writing annotated, filtered CSV file'

Full sets of variants and their annotations can also be output to VCF using a similar but slightly more complex syntax.  Annotations from VAPr will be added to the INFO field:

In [None]:
rare_cancer_variants_vcf = "/data/out_files/rare_variants.vcf"
input_vcf_compressed = "/data/vcf_files/test_file_one.vcf.gz"
# Reuse the list built earlier: the `filtered` cursor was already consumed by list(filtered)
filtered_variant_list = as_list

my_writer = file_writer.FileWriter(db_name, collection_name)
my_writer.generate_annotated_vcf(filtered_variant_list, input_vcf_compressed, rare_cancer_variants_vcf)

Outputting filtered sets of variants to VCF requires somewhat more work, because re-creating a VCF file from a list of variants requires an index of the original file from which the filtered version will be built. This can be achieved using the tabix tool (see http://genometoolbox.blogspot.com/2013/11/installing-tabix-on-unix.html for installation instructions).

**Step 1**: Download Tabix, then extract and build it:

`tar xvjf tabix-0.2.6.tar.bz2`  
`cd tabix-0.2.6`    
`make`    
`export PATH=$PATH:/path_to_tabix/tabix-0.2.6`   

**Step 2**: Create a .vcf.gz file from the original VCF:

`bgzip -c file.vcf > file.vcf.gz`


**Step 3**: Run tabix on the compressed VCF file:

`tabix -p vcf file.vcf.gz`
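
Steps 2 and 3 can also be driven from Python, which is convenient inside a notebook. The sketch below wraps the same two commands with the standard-library `subprocess` module. It assumes that `bgzip` and `tabix` are on your `PATH`; the function name `compress_and_index` and its `run` parameter (which exists only so that the command runner can be swapped out) are hypothetical.

```python
import subprocess


def compress_and_index(vcf_path, run=subprocess.run):
    """Run bgzip and tabix on vcf_path; return the path of the compressed file."""
    gz_path = vcf_path + ".gz"
    with open(gz_path, "wb") as out:
        # Equivalent to: bgzip -c file.vcf > file.vcf.gz
        run(["bgzip", "-c", vcf_path], stdout=out, check=True)
    # Equivalent to: tabix -p vcf file.vcf.gz
    run(["tabix", "-p", "vcf", gz_path], check=True)
    return gz_path
```

The returned path is the compressed file that can then be supplied as the input VCF when writing annotated VCF output.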

## Citations

1. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010 Sep;38(16):e164. doi: 10.1093/nar/gkq603. PubMed PMID: 20601685; PubMed Central PMCID: PMC2938201.
2. Xin J, Mark A, Afrasiabi C, Tsueng G, Juchler M, Gopal N, Stupp GS, Putman TE, Ainscough BJ, Griffith OL, Torkamani A, Whetzel PL, Mungall CJ, Mooney SD, Su AI, Wu C (2016) High-performance web services for querying gene and variant annotation. Genome Biology 17(1):1-7.