create a gemini-compatible database from a VCF:
python vcf2db.py diseaseX.anno.vcf.gz disease_x.ped x.db python vcf2db.py diseaseX.anno.vcf.gz disease_x.ped "postgres://brentp:password@localhost/gemini" python vcf2db.py diseaseX.anno.vcf.gz disease_x.ped "mysql://brentp:password@localhost/gemini"
With sqlite3. This inserts at about 1200 variants / second including time to index.
NOTE while this allows loading into
postgres, you will need gemini version
from github to use the database once it is loaded into
postgres. Due to some idiosyncrasies, Amazon's Elastic File Storage (EFS) is not supported for the creation of sqlite3 databases. Elastic Block Storage (EBS) is suitable for this step.
git clone https://github.com/quinlan-lab/vcf2db cd vcf2db conda install -y gcc snappy # install the C library for snappy conda install -c conda-forge python-snappy conda install -c bioconda cyvcf2 peddy pip install -r requirements.txt
vcf2db now supports using bcftools csq so you can annotate with bcftools like:
wget ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.chr.gff3.gz ./bcftools csq --local-csq -p R -g Homo_sapiens.GRCh37.82.chr.gff3.gz -f $fasta $vcf
How It Works
Previously (and currently), gemini kept a bunch of vetted annotations along with the gemini install and annotated an incoming VCF with those annotations as it was loaded into gemini. This is nice for users but by de-coupling the annotation from the loading, we have more flexiblility.
This script pulls annotations that are defined in the INFO field, using the types defined in the header,
to create a database schema. It expects a
CSQ tag from VEP or a
ANN tag from snpEff in order to
determine the associated gene and consequence.
This means that the user is responsible for annotating their own VCF--though we will provide a simple means to do this with vcfanno.
At this point, the script works and creates a gemini-compatible database. It is therefore possible to use gemini with GRCh38 or other organisms. But the utility will depend on the resources that are available for the given genome build and organism.
If there are available resources that indicate common variants, for example, 1KG or ExAC for GRCh37, then it is useful to annotate with the allele frequencies in those populations.
Use VEP or snpEff to annotate the VCF by consequence.
Gather a PED file and load with the script:
python vcf2db.py some.annotated.vcf.gz some.ped my.gemini.db
To have the sample fields expanded into separate tables so that they can be used INFO SQL queries directly, use:
python vcf2db.py some.annotated.vcf.gz some.ped my.gemini.db --expand gt_types --expand gt_ref_depths --expand gt_alt_depths