FVC

FVC: an adaptive method for filtering variant calls from different analysis pipelines

FVC is freely available for academic use. However, users should check the licensing terms of the GIAB database, which cover the original gold-standard variants and sequencing results. More details are available in the paper: https://www.nature.com/articles/s42003-022-03397-7

Introduction

Whole-genome sequencing (WGS) is frequently used in large-scale population genetic research and in the medical diagnosis of congenital or acquired disorders. Here, we present FVC, an adaptive method for quality control of whole-genome sequencing variants identified by different variant callers, and test it on variants identified by GATK HaplotypeCaller, Mutect2, Varscan2, and DeepVariant.

FVC removed far more false variants than the current state-of-the-art filtering methods and recalled ~51-99% of the true variants that the other methods filtered out. Moreover, FVC decreased the loss of true variants from 0.05-1661 to 0.02-0.57 per false variant filtered. Furthermore, it is the only method that consistently eliminates more false-positive variants than it loses true-positive variants (log OFO < 0).
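The comparison above can be made concrete with a small calculation. This is only a sketch: it assumes that OFO is the number of true variants lost divided by the number of false variants removed, which is how the paragraph describes it. The function name, the base-10 logarithm, and the counts are illustrative and are not taken from the paper.

```python
import math


def log_ofo(true_lost: int, false_removed: int) -> float:
    """Return log10(true variants lost / false variants removed).

    Assumption: OFO is this ratio. The base of the logarithm is a guess,
    but the sign of the result does not depend on it.
    """
    if true_lost <= 0 or false_removed <= 0:
        raise ValueError("both counts must be positive")
    return math.log10(true_lost / false_removed)


# Illustrative counts only: 20 true variants lost while removing 1000 false ones.
value = log_ofo(20, 1000)
# A negative value means more false variants were removed than true ones lost.
print(value < 0)
```

Under this reading, "log OFO < 0" is equivalent to losing fewer than one true variant per false variant filtered.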

Usage

Filtering (with an existing pre-trained model)

(optional step) VCF with multiple samples
Split a multi-sample VCF file into one VCF file per sample:

perl multi_samples_to_single.pl $in_vcf $outDir
input file: $in_vcf
output files: ${outDir}/${sampleId}.split.vcf
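For readers who want to see what the split amounts to, here is a minimal Python sketch of the same idea. It is not the code of multi_samples_to_single.pl: the function names are hypothetical, and it keeps every record for every sample, which may differ from what the Perl script does with missing or reference-only genotypes. Only the output naming (${sampleId}.split.vcf) follows the description above.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def split_vcf_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Split multi-sample VCF text into one list of lines per sample."""
    meta: List[str] = []
    samples: List[str] = []
    per_sample: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)  # meta lines are copied to every output
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            samples = cols[9:]  # columns 10 onward are sample names
            for name in samples:
                per_sample[name] = meta + ["\t".join(cols[:9] + [name])]
        elif line:
            cols = line.split("\t")
            for i, name in enumerate(samples):
                # keep the 9 fixed columns plus this sample's genotype column
                per_sample[name].append("\t".join(cols[:9] + [cols[9 + i]]))
    return per_sample


def write_split(in_vcf: str, out_dir: str) -> None:
    """Write <out_dir>/<sampleId>.split.vcf for each sample in in_vcf."""
    with open(in_vcf) as handle:
        per_sample = split_vcf_lines(handle)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for name, content in per_sample.items():
        (Path(out_dir) / f"{name}.split.vcf").write_text("\n".join(content) + "\n")
```

The sketch reads uncompressed VCF text only; a bgzip-compressed input would have to be decompressed first.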

step1. Features construction

singularity exec FVC_image.sif bash ${absolute_path}/Features_construction.sh \
    -i ${absolute_path}/input.vcf \
    -b ${absolute_path}/input.bam \
    -r ${absolute_path}/hg1kv37.fa \
    -t 40 \
    -p ${absolute_path}/out_snv.record \
    -q ${absolute_path}/out_indel.record

-t: number of CPU cores (about 23 min with 40 cores for each WGS VCF file); we suggest using as many as you can in this step.
-r: the human reference sequence used for read alignment.
-p, -q: output feature record files for SNVs and indels, which are used as input in step2.
The 'input.vcf' contains the variants called from the aligned-reads file 'input.bam'.
Note: users should provide the full path of each file.
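Because every file has to be given with its full path, it can help to build the command programmatically. The sketch below is one possible wrapper, not part of FVC: the function name and the check_exists switch are hypothetical, while the flags (-i, -b, -r, -t, -p, -q) and the image name are the ones shown in the command above.

```python
import os
from typing import List


def features_command(script: str, vcf: str, bam: str, ref: str,
                     snv_record: str, indel_record: str,
                     threads: int = 40, image: str = "FVC_image.sif",
                     check_exists: bool = True) -> List[str]:
    """Build the argument list for the feature-construction step."""
    inputs = {"script": script, "vcf": vcf, "bam": bam, "ref": ref}
    if check_exists:
        # fail early if any input file is missing
        missing = [name for name, path in inputs.items()
                   if not os.path.isfile(path)]
        if missing:
            raise FileNotFoundError("missing input(s): " + ", ".join(missing))
    full = os.path.abspath  # the step expects full paths
    return [
        "singularity", "exec", image, "bash", full(script),
        "-i", full(vcf), "-b", full(bam), "-r", full(ref),
        "-t", str(threads),
        "-p", full(snv_record), "-q", full(indel_record),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with paths that contain spaces.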

step2. Filtering

singularity exec FVC_image.sif bash ${absolute_path}/Supervised_learning_filtering.sh \
    -f filter \
    -i ${absolute_path}/input.vcf \
    -o ${absolute_path}/input_filtered.vcf \
    -p ${absolute_path}/out_snv.record \
    -q ${absolute_path}/out_indel.record \
    -m ${absolute_path}/snv.model \
    -n ${absolute_path}/indel.model

pre-trained models: '-m': snv.model; '-n': indel.model

Adapt to specified variant caller

step1. Constructing training data and features

singularity exec FVC_image.sif bash ${absolute_path}/Data_construction.sh \
    -i ${absolute_path}/config.json \
    -t 40

Required
1 Training data: raw VCF and BAM files generated by the variant calling pipeline to be adapted
2 Gold-standard files: high-confidence variants (VCF file) and the sequencing regions (BED file)
3 Human reference sequence: FASTA format file

Example of config.json

{
    "data": [
        {
            "unfiltered_VCF": "HG001_gatk.vcf",
            "BAM_file": "HG001_aligned.bam",
            "goldStandard_VCF": "HG001_highConfidence.vcf",
            "goldStandard_bedFile": "HG001_highConfidence_region.bed"
        },
        {
            "unfiltered_VCF": "HG002_gatk.vcf",
            "BAM_file": "HG002_aligned.bam",
            "goldStandard_VCF": "HG002_highConfidence.vcf",
            "goldStandard_bedFile": "HG002_highConfidence_region.bed"
        }
    ],
    "refSeq": "/data/refSeq/hg19.fa",
    "outDir": "/data/output"
}
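Before launching Data_construction.sh, the configuration can be checked for missing keys. The sketch below is a hypothetical helper, not part of FVC, and it assumes that every key shown in the example is required; the key names themselves are copied from the example above.

```python
import json
from typing import List

ENTRY_KEYS = ("unfiltered_VCF", "BAM_file",
              "goldStandard_VCF", "goldStandard_bedFile")
TOP_KEYS = ("data", "refSeq", "outDir")


def config_problems(config: dict) -> List[str]:
    """Return human-readable problems found in a parsed config.json."""
    problems = [f"missing top-level key: {key}"
                for key in TOP_KEYS if key not in config]
    data = config.get("data")
    if "data" in config and (not isinstance(data, list) or not data):
        problems.append("'data' must be a non-empty list")
    for index, entry in enumerate(data if isinstance(data, list) else []):
        if not isinstance(entry, dict):
            problems.append(f"data[{index}] must be an object")
            continue
        for key in ENTRY_KEYS:
            if not entry.get(key):
                problems.append(f"data[{index}] is missing {key}")
    return problems


def load_and_check(path: str) -> List[str]:
    """Parse the JSON file at path and return its problems (empty if none)."""
    with open(path) as handle:
        return config_problems(json.load(handle))
```

An empty list means the structure matches the example; the helper does not check that the referenced files exist.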

step2. Supervised learning

singularity exec FVC_image.sif bash ${absolute_path}/Supervised_learning_filtering.sh \
    -f train \
    -a ${outDir}/Training_tp_snv.record \
    -b ${outDir}/Training_fp_snv.record \
    -c ${outDir}/Training_tp_indel.record \
    -d ${outDir}/Training_fp_indel.record \
    -j ${outDir}/pipeline_adapted_snv.model \
    -k ${outDir}/pipeline_adapted_indel.model

output models: ${prefix_name}.snv.model and ${prefix_name}.indel.model

step3. Filtering
Same as steps 1-2 of the Filtering section above, with the newly trained models passed to '-m' and '-n'.

Computational Environment

Users can download the Singularity image from our software repository.

FVC Docker image

The FVC_image.sif can be downloaded from http://bmap.sjtu.edu.cn/softstorage/details/31 (or https://drive.google.com/file/d/1PkcX2MZFYyi86wqkjC5pia-qElvhZOcP/view?usp=sharing)

Or manually install the requirements below, at the listed version or later:

Manually install

Python Requirements

python v3.6.10
xgboost v1.1.1
scikit-learn v0.23.0
pandas v1.0.4
numpy v1.18.4
lightgbm v3.1.0
imbalanced-learn v0.7.0
re v2.2.1
argparse v1.1
os
collections
math
itertools
datetime
sys
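For a manual installation, the third-party versions listed above can be compared with what is installed. The sketch below is a hypothetical helper, not part of FVC. It uses importlib.metadata, which is in the standard library only from Python 3.8 onward, so it will not run on the minimum Python version listed above; the version comparison looks at the first three numeric components only.

```python
import re
from importlib import metadata  # standard library from Python 3.8 onward

# Minimum versions copied from the list above (third-party packages only).
MINIMUMS = {
    "xgboost": "1.1.1",
    "scikit-learn": "0.23.0",
    "pandas": "1.0.4",
    "numpy": "1.18.4",
    "lightgbm": "3.1.0",
    "imbalanced-learn": "0.7.0",
}


def version_tuple(version: str) -> tuple:
    """Turn '1.18.4' into (1, 18, 4), padding short versions with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def unmet_requirements(minimums: dict = MINIMUMS) -> dict:
    """Map each package that is missing or too old to its installed state."""
    unmet = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            unmet[name] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            unmet[name] = installed
    return unmet
```

Calling unmet_requirements() returns an empty dictionary when every listed package meets its minimum.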

Perl Requirements

perl v5.0
Getopt::Long
FindBin
other customized modules (###.pm) are released in the FVC folder

Java Requirements

jdk1.8

Other

GATK (version 4.1.9)

Docker image for variant calling

http://bmap.sjtu.edu.cn/softstorage/details/21 (or https://drive.google.com/file/d/1oGmi8wnV6GDMJq43BDzwjhPtohE9KKH3/view?usp=sharing)

Training and testing data

Read alignment, duplicate marking, and local realignment were performed using the BWA-MEM, Dedup, and Realigner tools integrated into Sentieon.
The germline variants were identified using GATK HaplotypeCaller (version 4.0.11, with default parameters), Mutect2 (integrated in GATK version 4.1.9, with default parameters), and Varscan2 (version 2.3.9, with default parameters except --min-coverage 3, --p-value 0.10, --min-var-freq 0.01).

The variant calling process can be found in 'Variant Calling/variant_calling.sh'.

The true-positive and false-positive variants were defined by the consistency of the variant calls with the high-confidence variant calls from NIST’s GIAB consortium (version 3.3.2), using the RTG vcfeval method and ignoring zygosity differences by setting the argument --squash-ploidy.
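The labelling idea can be illustrated with a much simpler comparison than the one RTG vcfeval performs. The sketch below is an assumption-laden stand-in, not the method used for FVC: it matches calls to the gold standard by exact (chromosome, position, REF, ALT) and ignores genotypes entirely, whereas vcfeval compares variants at the haplotype level and is restricted to the high-confidence regions. The function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys: Set[VariantKey] = set()
    for raw in vcf_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and headers
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def label_calls(call_lines: Iterable[str],
                gold_lines: Iterable[str]) -> Dict[str, Set[VariantKey]]:
    """Label calls found in the gold set as 'tp' and all others as 'fp'."""
    gold = variant_keys(gold_lines)
    calls = variant_keys(call_lines)
    # genotype columns are never read, so zygosity differences are ignored
    return {"tp": calls & gold, "fp": calls - gold}
```

Exact matching will mislabel variants that are represented differently in the two files (for example, differently normalized indels), which is one reason a dedicated comparison tool is used in practice.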

The details of the data can be found in the folder 'Data for Training and Testing'.

Leave one individual out cross validation study data

The training and testing data used for the leave-one-individual-out cross-validation study are available at:

http://bmap.sjtu.edu.cn/datastorage/main/40
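In a leave-one-individual-out design, each individual is held out once for testing while the model is trained on all the others. The sketch below shows only how such folds can be enumerated; the function name is hypothetical, and the sample identifiers in the usage lines are illustrative rather than the list used in the study.

```python
from typing import Iterable, Iterator, List, Tuple


def leave_one_individual_out(
        sample_ids: Iterable[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training individuals, held-out individual) for each fold."""
    ids = list(sample_ids)
    for held_out in ids:
        yield [s for s in ids if s != held_out], held_out


# Illustrative identifiers only.
for train_ids, test_id in leave_one_individual_out(["HG001", "HG002", "HG003"]):
    print("train on", train_ids, "test on", test_id)
```

Splitting by individual, rather than by variant, keeps all variants of the test individual out of training.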

High-confidence variants in VCF format

The high-confidence variants released by GIAB can be downloaded from:

https://github.com/genome-in-a-bottle/giab_latest_release

Raw sequencing data in FASTQ or BAM format

The raw sequencing data used in the paper are available at:

https://github.com/genome-in-a-bottle/giab_data_indexes/tree/master/
