A tool suite for a simple, streamlined and rapid evaluation of variant callsets

rki-mf1/cievad

CIEVaD

Continuous Integration and Evaluation for Variant Detection. This repository provides a tool suite for simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and a plain containment check between sets of variants. The tool suite uses the conda package management system and the nextflow workflow language.

Contents:

  1. System requirements
  2. Installation
  3. Usage
  4. Help
  5. Citation

System requirements:

This tool suite was developed for Linux, which is the only officially supported operating system. Having any derivative of the conda package management system installed is the only strict system requirement. A recent version (≥20.04.0) of nextflow is required to execute the workflows, but it can easily be installed via conda. For instructions on installing nextflow via conda, see Installation.

🖥️ Tested setups:

Requirement                      Tested with
64-bit Linux operating system    Ubuntu 20.04.5 LTS
Conda                            vers. 23.5.0, 24.1.2
Nextflow                         vers. 20.04.0, 23.10.1

Installation:

  1. Download the repository:
git clone https://github.com/rki-mf1/cievad.git
  2. [Optional] Install nextflow if it is not yet on your system. As good practice, use a new conda environment:
conda deactivate
conda create -n cievad -c bioconda nextflow
conda activate cievad
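
To confirm that the installed nextflow meets the ≥20.04.0 requirement, a small version comparison can help. The following is a minimal sketch: the helper name `version_at_least` is hypothetical, it relies on `sort -V` (GNU coreutils), and it expects you to pass in the version string reported by `nextflow -version` rather than parsing that output itself.

```shell
# Hypothetical helper: succeeds if the installed version ($1) is at least
# the required minimum ($2). Relies on `sort -V` from GNU coreutils.
version_at_least() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    # If the minimum sorts first (or both are equal), the requirement is met.
    [ "$lowest" = "$2" ]
}

# Example with one of the tested nextflow versions:
if version_at_least "23.10.1" "20.04.0"; then
    echo "nextflow version is recent enough"
else
    echo "nextflow is older than 20.04.0, please update"
fi
```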

Usage:

This tool suite provides multiple functional features to generate synthetic sequencing data, generate sets of ground truth variants (truthsets) and evaluate sets of predicted variants (callsets). There are two main workflows, hap.nf and eval.nf. Both workflows are executed via the nextflow command line interface (CLI).

⚠️ Run commands from the root directory: Please run all commands from a terminal in the top folder (root directory) of this repository. Otherwise, relative paths within the workflows might be invalid.

Generating haplotype data

The minimal command to generate haplotype data is

nextflow run hap.nf -profile local,conda

This generates the following data within the <project_root>/results/ directory:

  • a haplotype (FASTA), which is a copy of the provided reference sequence but deviates by a set of synthetic genomic variants
  • the variant set (VCF) of synthetic genomic variants in the haplotype
  • a set of reads (FASTQ) representing a sequencing experiment from the haplotype
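
As a quick sanity check of the generated files, the number of records in each output can be counted with standard tools. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the exact file names and layout under results/ are not specified here (so the paths in the comments are placeholders), and FASTQ records are assumed to span exactly four lines. `gzip -cdf` is used so that both plain and gzip-compressed files can be read.

```shell
# Hypothetical helpers for inspecting hap.nf outputs.

# Count variant records in a VCF: every non-empty line not starting with '#'.
count_vcf_records() {
    gzip -cdf "$1" | awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Count reads in a FASTQ, assuming four lines per record.
count_fastq_reads() {
    gzip -cdf "$1" | awk 'END { print int(NR / 4) }'
}

# Count sequences in a FASTA: one header line starting with '>' per sequence.
count_fasta_sequences() {
    gzip -cdf "$1" | awk '/^>/ { n++ } END { print n + 0 }'
}

# Usage with placeholder paths (adjust to the files in your results/ folder):
#   count_vcf_records     results/<truthset>.vcf
#   count_fastq_reads     results/<reads>.fastq
#   count_fasta_sequences results/<haplotype>.fasta
```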

Evaluating variant calls

The minimal command to evaluate the accordance between a truthset (generated data) and a callset is

nextflow run eval.nf -profile local,conda --callsets_dir <path/to/callsets>

where --callsets_dir is the parameter to specify a folder containing the callset VCF files. Currently, a callset within this folder has to follow the naming convention callset_<X>.vcf[.gz], where <X> is the integer of the corresponding truthset. Alternatively, one can provide a sample sheet of comma-separated values (CSV file) with the columns "index", "truthset" and "callset", where "index" is an integer from 1 to n (number of samples) and "callset"/"truthset" are paths to the pairwise matching VCF files. Callsets can optionally be gzip-compressed. The command for the sample sheet input is

nextflow run eval.nf -profile local,conda --sample_sheet <path/to/sample_sheet>
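
For illustration, a sample sheet with two samples could be written and sanity-checked as sketched below. The helper names and all file paths are hypothetical placeholders, and the exact header line (lowercase column names in the order index, truthset, callset) is an assumption based on the column description above. The check only verifies the header, the column count, and that the indices run from 1 to n.

```shell
# Hypothetical helper: write an example sample sheet to the path given as $1.
# The VCF paths are placeholders, not files shipped with the repository.
write_example_sheet() {
    cat > "$1" <<'EOF'
index,truthset,callset
1,data/truthset_1.vcf,calls/callset_1.vcf.gz
2,data/truthset_2.vcf,calls/callset_2.vcf.gz
EOF
}

# Hypothetical helper: succeed only if the sheet at $1 has the assumed header,
# three columns per row, at least one sample, and indices running 1..n.
check_sample_sheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "index,truthset,callset") {
                print "unexpected header: " $0; failed = 1; exit 1
            }
            next
        }
        NF != 3        { print "line " NR ": expected 3 columns"; failed = 1; exit 1 }
        $1 != (NR - 1) { print "line " NR ": index should be " (NR - 1); failed = 1; exit 1 }
        END { if (!failed && NR < 2) { print "no samples listed"; exit 1 } }
    ' "$1"
}
```

For example, `write_example_sheet my_samples.csv && check_sample_sheet my_samples.csv` should finish silently with exit status 0, after which the file can be passed to --sample_sheet.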

This generates the following data within the <project_root>/results/ directory:

  • a report (CSV, JSON) about accordance between the synthetic variant set and a given corresponding callset
  • a report (CSV) with statistics across all tested individuals
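
When using --callsets_dir, it can be useful to verify beforehand that every file in the folder follows the callset_<X>.vcf[.gz] naming convention. The following is a minimal sketch with a hypothetical helper name. It flags any file whose name does not match the pattern; whether eval.nf tolerates additional files in that folder is not stated here, so treat the output as a hint rather than a hard error.

```shell
# Hypothetical helper: print file names in directory $1 that do not match
# callset_<X>.vcf or callset_<X>.vcf.gz (with <X> an integer).
# Returns 1 if at least one such file is found, 0 otherwise.
check_callset_names() {
    status=0
    for path in "$1"/*; do
        [ -e "$path" ] || continue    # empty folder: the glob stays unexpanded
        name="$(basename "$path")"
        if ! printf '%s\n' "$name" | grep -Eq '^callset_[0-9]+\.vcf(\.gz)?$'; then
            echo "unexpected file name: $name"
            status=1
        fi
    done
    return "$status"
}

# Usage with a placeholder path:
#   check_callset_names path/to/callsets && echo "all callset names look fine"
```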

Tuning the workflow parameters

CIEVaD exposes the vast majority of the parameters of its internal software tools for fine-tuning. The parameters to adjust the workflows are listed on their respective help pages. To inspect a help page, add --help after the script name, e.g. nextflow run hap.nf --help for the hap.nf workflow. Parameters can be adjusted via the CLI or directly within the nextflow.config file. Note that parameters provided via the CLI override parameters set in the config file. More information about tuning crucial parameters, e.g. read quality and genome coverage, can be found in the Wiki.

Help:

Visit the project wiki for more detailed information on parameters, help and FAQs.
Please file issues, bug reports and questions in the issues section.

Citation:

A manuscript describing CIEVaD is available. If you use CIEVaD, please cite:

@article{krannich2024cievad,
  title={CIEVaD: A Lightweight Workflow Collection for the Rapid and On-Demand Deployment of End-to-End Testing for Genomic Variant Detection},
  author={Krannich, Thomas and Ternovoj, Dmitrii and Paraskevopoulou, Sofia and Fuchs, Stephan},
  journal={Viruses},
  volume={16},
  number={9},
  pages={1444},
  year={2024},
  doi={10.3390/v16091444}
}