- Abstract
- Motivation
- Description of Framework
- Installation
- Data Download
- Execution
- Contribute
- Citation
-
Synth Data Generation: Synthetic genomics data is generated based on the TP53 gene using the NEATv3.3 simulator in order to create synthetic datasets that mimic real cancer genome data. The "Ground Truth" is established by creating 10 individual datasets (each one of the same characteristics) containing Single Nucleotide Polymorphisms (SNPs) and Insertions/Deletions (INDELs). The genomic regions where variants accure with 100% Allele Frequency are chosen. The reason behind this choice is to avoid variants that are related to errors and products of noise. Then all these datasets are merged into one single file and the allele frequency is again measured at these genomic regions of interest.
-
Benchmarking Variant Callers: Somatic variant callers are evaluated using this synthetic Ground Truth dataset. GATK-Mutect2, Freebayes, VarDict, VarScan2 and LoFreq variant callers are assessed for their performance on our synthetic ground truth dataset. Their impact at low frequencies (≤10%) is explored, as these are particularly challenging to detect accurately.
The framework's overall aim is to provide a robust framework for evaluating the performance of tumor-only somatic variant calling algorithms by using synthetic datasets. By having a reliable ground truth, we can thoroughly test and improve the accuracy of variant calling algorithms for cancer genomics applications. This framework represents an essential step towards more precise and effective identification of genetic lesions associated with cancer and other diseases.
All data are open and available in Zenodo.
- To create the conda environment that was used for the analysis run
conda env create -f environment.yml
and to activate it runconda activate synth4bench
. - To install NEATv3.3, download version v3.3. To call the main script run the command
python gen_reads.py --help
. For any further info please see the README.md file from the downloaded files of version 3.3. - To install bam-readcount follow their instructions and then run
build/bin/bam-readcount --help
to see that it has being installed properly. - To install R packages dependencies run this command
install.packages(c("stringr", "data.table", "vcfR", "ggplot2", "ggvenn", "ggforce", "ggsci", "patchwork"))
. - The extra script
vscan_pileup2cns2vcf.py
for VarScan can be found here.
1_synth4bench.sh
- This bash script is the basis of synth4bench workflow. It calls NEATv3.3 in order to generate 10 individual synthetic data datasets, create one Merged bam file, performs some preprocess steps before implementing somatic variant calling using GATK-Mutect2, Freebayes, VarDict, VarScan and LoFreq and produces bam report files with the genomic content at certain chromosomal positionsusing bam-readcount. Please replace all path/to/
with desired paths in all commands.
-
Input: fasta reference file
-
Output: fastq files with pair end reads, "golden" bam file and bai index file, "golden" vcf file, Merged bam file, processed bam files and vcf file with all variants that were detected, tsv file with the genomic content
2_downstream_analysis_*.R
- This R script compares the variants that a selected caller reported against the ground truth. Firsty it identifies the variants with 100% Allele Frequency(AF) in the individual bam files and then caclulates their AF in the final Merged bam file.
-
Input: bam-readcount tsv reports, vcf file from a selected caller
-
Output: tsv file containing information regarding the ground truth variants
3_plot_patchwork_*.R
- This R script produces the final Figure of the Benchmarking of a selected caller.
-
Input: annotated tsv file, ground truth vcf, a selected caller vcf
-
Output: final Figure for the Benchmarking of a selected caller
helpers_*.R
- This R script incudes all necessary functions for 02_patchwork_*.R
and 03_patchwork_*.R
scripts.
libraries.R
- This R script incudes all necessary libraries for 02_patchwork_*.R
and 03_patchwork_*.R
scripts.
Note that * is one of the following = (Mutect2, Freebayes, VarDict, VarScan, LoFreq).
paper_plots.R
- This R script produces the final Multipanel Figure for the paper.
-
Input: comparison tsv file, ground truth vcf, caller's vcf
-
Output: Paper Multipanel Figure
For the case of VarScan an extra step was required to convert its output to the standard VCF format. The script vscan_pileup2cns2vcf.py
can be found here.
4.1_report_varbp.R
- A script to report all ground truth variants in each chromosomal position.
-
Input: individual "golden" bam files produced by NEAT
-
Output: reports in tsv format with the variants in each chromosomal position for each individual "golden" bam file
4.2_explore_mut_pos.R
- A script to report variants that were either detected or not detected by each caller. The read positions were divided in bins to study possible correlated trends.
-
Input: tsv file with reported variants from caller, reports in tsv format with the variants in each chromosomal position for each individual "golden" bam file
-
Output: reports with variants that were either detected or not detected by each caller. The read positions were divided in bins.
5.1_Prediction_Coverage_Length.R
- This script calculates the Kandall's tau coefficient of the Coverage and Read length with the modified Accuracy (mAC), False negative (FN) and False positive (FP) rates.
-
Input: an xlsx file with a column containing the callers (named Caller), a column specifying the AC, FN, FP rate (named Variants), and a column for each coverage (in a Sheet named Coverage) or read length (in a Sheet named Read_length) with the respective rates.
-
Output: A xlsx file with the results in two sheets: Coverage and Read_length
5.2_Statistical_Analysis.Rmd
- This script runs the statistical analysis for a given dataset.
-
Input: A csv file with variants that were either detected or not detected by each caller.
-
Output: A word document with the results of the fatalistically analyses.
5.3_ROC_Curves.R
- This script produces a ROC curve for a given dataset.
-
Input: A csv file with variants that were either detected or not detected by each caller.
-
Output: A ROC curve.
5.4_ROC_Curves_Merged.R
- This script produces a merged figure with the ROC curves for a given caller, each curve corresponding to a specific Coverage and Read length.
-
Input: A folder with the files containing all the datasets and the 5.3_ROC_Curves.R" file.
-
Output: A figure with the ROC curves for each caller.
We welcome and greatly appreciate any sort of feedback and/or contribution!
If you have any questions, please either open an issue or write us at sfragkoul@certh.gr
or inab.bioinformatics@lists.certh.gr
.
Our work has been submitted to the bioRxiv preprint repository . Please, use the citation as follows:
S.-C. Fragkouli, N. Pechlivanis, A. Anastasiadou, G. Karakatsoulis, A. Orfanou, P. Kollia, A. Agathangelidis, and F. E. Psomopoulos, “Synth4bench: a framework for generating synthetic genomics data for the evaluation of tumor-only somatic variant calling algorithms.” 2024, doi:10.1101/2024.03.07.582313.
-
Styliani-Christina Fragkouli, Nikolaos Pechlivanis, Aspasia Orfanou, Anastasia Anastasiadou, Andreas Agathangelidis and Fotis Psomopoulos, Synth4bench: a framework for generating synthetic genomics data for the evaluation of somatic variant calling algorithms, 17th Conference of Hellenic Society for Computational Biology and Bioinformatics (HSCBB), Oct 2023, Thessaloniki, Greece, doi:10.5281/zenodo.8432060
-
Styliani-Christina Fragkouli, Nikolaos Pechlivanis, Andreas Agathangelidis and Fotis Psomopoulos, Synthetic Genomics Data Generation and Evaluation for the Use Case of Benchmarking Somatic Variant Calling Algorithms, 31st Conference in Intelligent Systems For Molecular Biology and the 22nd European Conference On Computational Biology (ISΜB-ECCB23), Jul 2023, Lyon, France doi:10.7490/f1000research.1119575.1