Xiaoxu Yang edited this page Jul 17, 2024 · 14 revisions

Welcome to the DeepMosaic wiki!

This page extends the current DeepMosaic tutorial page. Most Q&As and commonly reported questions are addressed here. We are working toward a more detailed wiki for DeepMosaic.

Q&A

  1. Q: How do I run DeepMosaic for multiple samples most efficiently?

    A: If each file contains a large number of variants, run DeepMosaic in parallel by submitting each file as an independent input file. If each file contains relatively few variants but you have many files (samples), combine them into one input file. If you have a very large VCF, you can split it into smaller VCFs and run them in parallel (for both visualization and quantification). You only need to split the VCF, not the BAM file.
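The splitting step can be sketched in a few lines of Python. This is a minimal illustration rather than part of DeepMosaic: the function name `split_vcf`, the chunk size, and the output naming scheme are all assumptions, and the sketch expects an uncompressed VCF. Each chunk repeats the full header so it remains a valid VCF on its own.

```python
from pathlib import Path
from typing import List


def split_vcf(vcf_path: str, out_dir: str, records_per_chunk: int = 1000) -> List[Path]:
    """Split an uncompressed VCF into chunks that each repeat the full header.

    Hypothetical helper: the name, chunk size, and file naming are illustrative.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")

    src = Path(vcf_path)
    dest = Path(out_dir)
    dest.mkdir(parents=True, exist_ok=True)

    header: List[str] = []   # all lines starting with '#'
    chunk: List[str] = []    # variant records waiting to be written
    written: List[Path] = []

    def flush() -> None:
        # Write the pending records (if any) behind a copy of the header.
        if not chunk:
            return
        out_path = dest / f"{src.stem}.part{len(written) + 1:04d}.vcf"
        out_path.write_text("".join(header + chunk))
        written.append(out_path)
        chunk.clear()

    with src.open() as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line)
            else:
                chunk.append(line)
                if len(chunk) >= records_per_chunk:
                    flush()
    flush()  # write the final, possibly smaller, chunk
    return written
```

Each resulting chunk can then be listed in its own input file and paired with the same, unsplit BAM.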

  2. Q: How do I balance/further filter the variants based on DeepMosaic output?

    A: For WGS variants, the exclusion of annotated homopolymer and dinucleotide repeats will remove false positives and increase the validation rate, but decrease the sensitivity.

  3. Q: What do Score 1, Score 2, and Score 3 mean in the output file?

    A: The three scores combine information from the complex features extracted by the neural network. In our experience, Score 1 behaves like a "het and homo probability", while Scores 2 and 3, especially Score 3, behave like a "potential mosaic probability". In other words, the higher Score 1 is, the more likely the candidate is a germline variant, whereas the higher Score 3 is, the more likely the candidate is a mosaic variant. Both categories contain many potential artifacts, which is why the final output relies on a more complex classifier.

  4. Q: How do I handle the mitochondrial genome and sex chromosomes?

    A: First, choose a reference genome that includes the mitochondrial genome as a separate chromosome. DeepMosaic is not specifically trained on mitochondrial variants, so we cannot guarantee the results; we therefore suggest removing MT variants from the DeepMosaic input. For sex chromosomes, DeepMosaic takes the biological sex of the input sample into consideration and treats the pseudoautosomal regions separately.
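Removing mitochondrial records before running DeepMosaic could look like the sketch below. It is only an illustration: the helper name `drop_mito_records` and the set of contig names are assumptions, so adjust them to the naming used by your reference genome.

```python
from typing import Iterable, Iterator

# Assumed names for the mitochondrial contig; adjust to your reference genome.
MITO_CONTIGS = frozenset({"MT", "M", "chrM", "chrMT"})


def drop_mito_records(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines unchanged, skipping records on a mitochondrial contig."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Header lines are always kept.
            yield line
            continue
        chrom = line.split("\t", 1)[0]
        if chrom not in MITO_CONTIGS:
            yield line
```

The function works on any iterable of lines, so it can be applied directly to an open file handle and the result written to a new VCF.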

  5. Q: Can I use DeepMosaic for cancer somatic mutation detection without control?

    A: The current DeepMosaic models do not support cancer samples. According to benchmarks, the specificity is high (0.97) while the sensitivity is low. We are training new models that support accurate single-sample detection of somatic mutations in cancer.

  6. Q: What genome versions does DeepMosaic support?

    A: DeepMosaic is benchmarked on GRCh37 (hg19). We are working on tests for GRCh38 (hg38) and are providing some scripts here. The model is still the same, so the main differences lie in the coordinates. We will make further updates when we finish new models trained on GRCh38 or CHM13. As most of our current benchmark experiments were carried out on GRCh37, we cannot guarantee the performance on GRCh38.

  7. Q: Why do I get errors about pickle_module.load(f, **pickle_load_args)?

    A: This error occurs when DeepMosaic was not fully downloaded; the entire model folder should be more than 200 MB. Please refer to the git-lfs section in the tutorial.
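A quick way to check for an incomplete download is to measure the model folder and look for files that are still Git LFS pointers (small text stubs that begin with the Git LFS spec line) instead of real model weights. The sketch below is illustrative only: `inspect_model_dir` is a hypothetical helper, and the model folder location is whatever path your clone uses.

```python
from pathlib import Path
from typing import List, Tuple

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def inspect_model_dir(model_dir: str) -> Tuple[int, List[Path]]:
    """Return (total size in bytes, files that are still Git LFS pointers)."""
    total_bytes = 0
    pointer_files: List[Path] = []
    for path in sorted(Path(model_dir).rglob("*")):
        if not path.is_file():
            continue
        total_bytes += path.stat().st_size
        with path.open("rb") as handle:
            # Only the first bytes are needed to recognize a pointer stub.
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointer_files.append(path)
    return total_bytes, pointer_files
```

If the total is far below 200 MB or any pointer files are reported, the models were probably not fetched through git-lfs.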

  8. Q: How are the current DeepMosaic models trained and benchmarked?

    A: Three independent simulated and six biologically validated non-cancer mosaic mutation datasets were used in the training and benchmarking of DeepMosaic.

SimData1: For the initial training procedure, 10,000 variants were randomly generated on chromosome 22 to get the list of alternative bases. Pysim (PMID 28361688) was then used to generate paired-end sequencing reads with random errors generated from the Illumina HiSeq sequencer error model based on the reference genome sequences. Alternative reads were generated by replacing the genomic bases with the alternative bases in the list, with the same error model. Alternative and reference reads were randomly mixed to generate an alternative AF of 0, 1, 2, 3, 4, 5, 10, 15, 20, 25, and 50%. The data were randomly sampled for a targeted depth of 30, 50, 100, 120, 150, 200, 250, 300, 400, and 500x. FASTQ files were aligned to the GRCh37d5 human reference genome with BWA (v0.7.17) mem command. Aligned data were processed by GATK (v3.8.1) and Picard (v2.18.27) for marking duplicates, sorting, INDEL realignment, base quality recalibration, and germline variant calling. The up- and down-sampling expanded this dataset into a pool of 990,000 different variants. Depth ratios were calculated as defined. To avoid the situation that randomly generated mutations fall on a common SNP position in the genome, which would bias the training and benchmarking, gnomAD allele frequencies were randomly assigned from 0 to 0.001 for simulated mosaic positive and from 0 to 1 for simulated negative variants, which were established as homozygous or heterozygous.
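To make the read-mixing step concrete, the sketch below shows one simple way to combine reference and alternative reads at a target AF and depth. It is not the procedure used by Pysim or by the DeepMosaic authors: the function `mix_reads`, the rounding rule, and the fixed seed are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def mix_reads(
    ref_reads: Sequence[str],
    alt_reads: Sequence[str],
    depth: int,
    alt_af: float,
    seed: int = 0,
) -> List[str]:
    """Draw `depth` reads so that about `alt_af` of them carry the alternative base.

    Hypothetical helper: the alternative read count is simply round(depth * alt_af).
    """
    if not 0.0 <= alt_af <= 1.0:
        raise ValueError("alt_af must be between 0 and 1")
    n_alt = round(depth * alt_af)
    n_ref = depth - n_alt
    if n_alt > len(alt_reads) or n_ref > len(ref_reads):
        raise ValueError("not enough reads to reach the requested depth")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pile = rng.sample(list(ref_reads), n_ref) + rng.sample(list(alt_reads), n_alt)
    rng.shuffle(pile)
    return pile
```

For example, a depth of 100 at an AF of 0.10 yields 10 alternative and 90 reference reads.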

SimData2: To compare the performance of DeepMosaic and other software in detecting mosaicism on simulated data, we randomly generated another simulation dataset with Pysim. All procedures were similar to SimData1, with the following modifications: 1] only 7610 variants in the non-repetitive region of chromosome 22 were considered true positive genomic positions; 2] random errors were generated from the Illumina NovaSeq sequencer error model; 3] data were randomly down-sampled and up-sampled for a targeted depth of 50, 100, 200, 300, 400, and 500x. A total of 439,200 different variants were generated. FASTQ files were aligned and processed with BWA (v0.7.17), SAMtools (v1.9), and Picard (v2.18.27). The data were subjected to DeepMosaic as well as MuTect2 (GATK v4.0.4, both paired mode and single mode), Strelka2 (v2.9.2), MosaicHunter (v1.0.0), and MosaicForecast (v8-13-2019) with different models trained for different read depths (250x model for depth≥300x).

SimData3: We further generated another simulation dataset in a way that was fundamentally different from the training data, with a positive:negative ratio similar to real data, to compare the performance of DeepMosaic and other software for the detection of mosaic variants. We selected 30,090 genomic positions with reference homozygous genotype from a different genomic region (the entire chromosome 1) of the whole-genome deep sequences from the ‘Genome In a Bottle’ sample HG002 (PMID 30858580). These 30,090 positions were genotyped as homozygous and fulfilled the following additional criteria: 1] zero alternative bases in the raw sequencing data; 2] no detectable insertions/deletions at the position of interest; 3] a genomic distance of at least 1000 bases from each other. On this clear background, 15,471 positions were labeled as “true negative” with reference homozygous genotype; 6868 were labeled as “true positive” mosaic variants with expected alternative AF 0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, and 0.25 (on average 763 variants for each AF level); and 7751 were labeled as “true negative” heterozygous variants with alternative AF 0.50. The latest version of a different software, BAMSurgeon (updated 24 Dec 2020), was used to generate this simulation dataset and retain the sequencing errors from the original biological samples. The original BAM file was first up-sampled, and alternative reads were replaced to generate the expected AF, mapped back to the genome, and merged back into the BAM file, according to the software manual (PMID 25984700). BAM files with and without simulated data were down-sampled to 500x, 400x, 300x, 200x, 100x, and 50x. The data were subjected to DeepMosaic as well as MuTect2 (GATK v4.0.4, both paired mode and single mode), Strelka2 (v2.9.2), MosaicHunter (v1.0.0), and MosaicForecast (v8-13-2019) with different models trained for different read depths (250x model for depth≥300x). The performance on the 180,540 points was evaluated.
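The SimData3 counts are internally consistent, which the short check below illustrates. It uses only the numbers stated above; the variable names are arbitrary.

```python
# Counts stated for SimData3.
true_negative_hom_ref = 15_471
true_positive_mosaic = 6_868
true_negative_het = 7_751

mosaic_afs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20, 0.25]
depths = [500, 400, 300, 200, 100, 50]

# The three labeled groups together make up the selected positions.
total_positions = true_negative_hom_ref + true_positive_mosaic + true_negative_het

# Average number of mosaic positions per expected AF.
mean_mosaic_per_af = true_positive_mosaic / len(mosaic_afs)

# Every position is evaluated once per down-sampled depth.
evaluated_points = total_positions * len(depths)

print(total_positions, round(mean_mosaic_per_af), evaluated_points)
```

The three labeled groups sum to the 30,090 selected positions, and evaluating each position at six depths gives the 180,540 points.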

BioData1: Variant information and raw sequencing reads from 80-120x PCR-amplified PE-150 WGS data of 29 samples from 6 normal individuals were extracted from published data (PMID 25312340, 29763432) on SRA (SRP028833, SRP100797, and SRP136305). 921 variants identified from WGS of samples from different organs of the donors and validated by orthogonal experiments were selected and labeled as mosaic positive. 492 genomic positions from the control samples validated with 0% AF were selected and labeled as negative. 162 variants with known sequencing artifacts were first filtered by MosaicHunter, manually selected, and labeled as negative. The 1575 genomic positions were also down-sampled and up-sampled for a targeted depth of 30, 50, 100, 150, 200, 250, 300, 400, and 500x, to expand this dataset into a pool of 14,175 different conditions. Depth ratios were calculated accordingly, and gnomAD allele frequencies, segmental duplication, and repeat masker information were annotated. Categories of technical artifacts include a) variants with multiple alternative alleles, because in our experience the chance of a noncancer MV occurring twice at the same genomic position at the early embryonic development stage is low; b) alignment artifacts, evidenced by short truncated or hard-clipped reads mapping to a certain genomic region, resulting in small truncated mapped reads piling up; c) ultra-low mapping quality and base reads; d) ultra-high allelic fraction variants, because they are not expected in postzygotic noncancer situations. The entire BioData1 and a random subsample of SimData1 were combined to generate a training and validation dataset with approximately 200,000 variants from the 1,000,000 training variants. 180,000 variants were selected for model training, 45% from SimData1 and 55% from resampling of BioData1. This dataset was used for model training and for evaluating the sensitivity and specificity of the selected model, and its features, including AF distribution and biological appearance, were very similar to published biological data.
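The expansion of BioData1 and the training split can be worked through as simple arithmetic. The per-source training counts computed at the end are not stated in the text; they only follow from applying the stated percentages to the 180,000 training variants.

```python
# Labeled genomic positions in BioData1, as stated in the text.
mosaic_positive = 921
validated_negative = 492
artifact_negative = 162

target_depths = [30, 50, 100, 150, 200, 250, 300, 400, 500]

positions = mosaic_positive + validated_negative + artifact_negative
conditions = positions * len(target_depths)  # one condition per position and depth

# Training split implied by the stated percentages (derived, not quoted).
training_variants = 180_000
from_simdata1 = training_variants * 45 // 100
from_biodata1 = training_variants * 55 // 100

print(positions, conditions, from_simdata1, from_biodata1)
```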

BioData2: To estimate the performance of the pre-trained models and select the model with the best performance for DeepMosaic-CM, we introduced an independent gold-standard dataset (PMID 33781308). Variants were computationally detected from replicated sequencing experiments generated by 6 distinct sequencing centers and validated in 5 different centers, known as the common reference tissue project from the Brain Somatic Mosaicism Network (ref. 16). 400 variants underwent multiple levels of computational validation, including haplotype phasing, CNV exclusion, and population shared exclusion, as well as experimental validation such as whole-genome single-cell sequencing, Chromium Linked-read sequencing (10X Genomics), PCR amplicon sequencing, and droplet digital PCR. After validation, 43 true positive MVs and 357 false positive variants were determined as gold-standard evaluation sets for low-fraction single nucleotide MVs from the 250x WGS data (ref. 16). We extracted deep whole-genome sequences for those variants, labeled them accordingly, and used them as the gold-standard validation set for model selection.

BioData3: To evaluate the performance of DeepMosaic-CM trained on a different portion of biological variants, we included another large-scale validation experiment we recently generated (PMID 35444276). Variant information and raw sequencing reads of 300x PCR-free PE150-only WGS of 18 samples from 9 different brain regions, cerebellum, heart, liver, and both kidneys of one individual were extracted from the capstone project of the Brain Somatic Mosaicism Network (ref. 19). 1400 genomic positions with variants identified from the WGS sample and reference homozygous/heterozygous controls validated by orthogonal experiments were selected and labeled as positive and negative according to the experimental validation result. The 1400 genomic positions were also down- and up-sampled for a targeted depth of 30, 50, 100, 150, 200, 250, 300, 400, and 500x. Depth ratios were calculated accordingly, and gnomAD allele frequencies, segmental duplication, and repeat masker information were annotated.

BioData4: This additional WGS dataset (PMID 31873310, 34388390) was used to compare the performance of DeepMosaic and other mosaic variant callers on biological samples. 16 WGS samples from the blood and sperm of 8 individuals were sequenced at 200x (ref. 28; PRJNA588332). WGS was performed using an Illumina TruSeq PCR-free kit with a 350 bp insert size and sequenced on an Illumina HiSeq sequencer. Reads were aligned to the GRCh37d5 genome with BWA (v0.7.15) mem, duplicates were removed with sambamba (v0.6.6), and base quality was recalibrated by GATK (v3.5.0). Processed BAM files were subjected to DeepMosaic as well as MuTect2 (GATK v4.0.4, both paired mode and single mode), Strelka2 (v2.9.2), MosaicHunter (v1.0.0), and MosaicForecast (v8-13-2019) with 200x models trained for the specific depth. Data from one of the individuals (F02) were down-sampled to 150x, 100x, 50x, and 30x with the SAMtools (v1.9) view command for further benchmarking of DeepMosaic.
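Down-sampling with `samtools view` uses the `-s` option, whose integer part seeds the random number generator and whose fractional part is the fraction of read pairs to keep. The sketch below computes that argument for each target depth. It is an illustration under assumptions: the helper name `subsample_args`, the seed, and the file names are hypothetical, the starting depth is taken as the nominal 200x, and the exact commands used by the authors are not given in the text.

```python
from typing import Dict, Iterable


def subsample_args(
    current_depth: float, target_depths: Iterable[int], seed: int = 42
) -> Dict[int, str]:
    """Map each target depth to a `samtools view -s` argument (seed.fraction)."""
    args: Dict[int, str] = {}
    for target in target_depths:
        if not 0 < target < current_depth:
            raise ValueError("target depth must be positive and below the current depth")
        # Cap the fraction so that four-decimal formatting never rounds up to 1.0000.
        fraction = min(target / current_depth, 0.9999)
        frac_digits = f"{fraction:.4f}".split(".")[1]
        args[target] = f"{seed}.{frac_digits}"
    return args


if __name__ == "__main__":
    # Hypothetical file names; the starting depth is the nominal 200x.
    for depth, arg in subsample_args(200, [150, 100, 50, 30]).items():
        print(f"samtools view -b -s {arg} -o F02.{depth}x.bam F02.bam")
```

Because the fraction is computed from the nominal depth, the resulting coverage is approximate; measuring the actual depth of the BAM first gives a closer match.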

BioData5: We included an additional WES dataset (DOI: 10.1101/2022.04.07.487401) that was used to compare the performance of DeepMosaic and other mosaic variant calling pipelines on WES data. 181 WES samples from the brain and blood/saliva of 101 individuals were sequenced at ~300x (NDA). gDNA was extracted from pulverized brain and white blood cells/buccal epithelial samples using Qiagen Miniprep and Maxiprep kits according to the protocols provided by the manufacturer. Genomic DNA samples were prepared for whole-exome sequencing using the Agilent SureSelect XT Human All Exon v.5 kits and sequenced on an Illumina HiSeq 2500 sequencer at a targeted depth of ~300x. Reads were aligned to the GRCh37d5 genome with BWA (v0.7.17) mem, and duplicates were removed and base quality recalibrated by GATK (v4.0.4) according to the established best-practice pipeline (ref. 16). Processed BAM files were subjected to the DeepMosaic pipeline followed by MuTect2 (GATK v4.0.4) single mode as well as GATK (v4.0.4) HaplotypeCaller (“ploidy” 50) and previously established filters (ref. 16).

BioData6: We assessed the performance of DeepMosaic on a large-scale tumor dataset. We downloaded and analyzed 2430 WES samples from 1215 individuals from six different cancer types from the TCGA-MC3 collection (PMID 29596782). 468 were patients with Skin Cutaneous Melanoma (SKCM), 406 with Bladder Urothelial Carcinoma (BLCA), 157 with Glioblastoma Multiforme (GBM), 112 with Breast invasive carcinoma (BRCA), 50 with Lung Squamous Cell Carcinoma (LUSC), and 23 with Colon Adenocarcinoma (COAD). Performance was compared with call sets provided in their respective original publications. Data were downloaded from the GDC portal (https://portal.gdc.cancer.gov/, sample IDs provided with variants in Supplementary Table 3). FASTQ files were generated using Picard SamToFastq and aligned to the GRCh37d5 genome with BWA (v0.7.17) mem. Duplicates were removed, reads near INDEL regions were realigned, and base quality scores were recalibrated with GATK v3.8.1 and Picard v2.20.7. Processed BAM files were subjected to the DeepMosaic pipeline followed by MuTect2 (GATK v4.0.4) single mode, then the final call set was compared with the TCGA-MC3 call set detected by MuSE (PMID 27557938), MuTect (PMID 23396013), SomaticSniper (PMID 22155872), VarScan2 (PMID 22300766), and Radia (PMID 25405470) using the publicly released gold standard (https://gdc.cancer.gov/about-data/publications/mc3-2017) from the same dataset (ref. 34). Part of the computing resources and CPU consumption were also estimated from this dataset with the Linux time command.

  1. Q: Can I use DeepMosaic for existing low-depth sequencing data (WGS or WES)?

    A: Yes, the models were trained with low-depth input. According to a recent benchmark analysis, DeepMosaic performs better at lower depth.

  2. Q: How should I prepare the input bam file for DeepMosaic?

    A: DeepMosaic, like other mosaic callers such as MosaicForecast or MosaicHunter, requires sorted, indel-realigned, and base-quality-recalibrated (BQSR) BAM files. You can follow the BSMN common pipeline or this pipeline for GRCh37. Note that the current GATK4 relies on MuTect2 and HaplotypeCaller for SNV calling, both of which perform indel realignment internally, so the GATK4 best practice no longer includes a separate indel realignment step. If you want to use third-party tools that do not realign indels internally, you should follow these pipelines.

  3. Q: How do I resolve the dependency issues for DeepMosaic most efficiently?

    A: We now provide a Singularity image for DeepMosaic, which resolves the dependency issues most efficiently. The current image is only built for GRCh37 with the current best-performing model. However, we provide tutorials on how users can train/import customized models using Singularity.

  4. Q: Why is my result on the demo data slightly different from the provided demo outputs?

    A: Depending on the machine and on whether you are using a CPU or a GPU, the output scores from the neural network can vary by ~0.1%, which won't significantly affect the decision of the final output. Larger differences are caused by different software versions. We therefore recommend that users choose the Singularity image.
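When comparing your own scores with the demo outputs, a tolerance-based comparison is more informative than an exact match. The sketch below assumes the ~0.1% figure is a relative difference, which the text does not state explicitly; the function name and the small absolute floor are also assumptions.

```python
import math
from typing import List, Sequence


def scores_outside_tolerance(
    expected: Sequence[float], observed: Sequence[float], rel_tol: float = 1e-3
) -> List[int]:
    """Return the indices where two score lists differ by more than `rel_tol`.

    Assumes ~0.1% means a relative difference; abs_tol guards scores near zero.
    """
    if len(expected) != len(observed):
        raise ValueError("score lists must have the same length")
    return [
        i
        for i, (e, o) in enumerate(zip(expected, observed))
        if not math.isclose(e, o, rel_tol=rel_tol, abs_tol=1e-6)
    ]
```

An empty result means every score agrees within the tolerance; a non-empty one points to the rows worth inspecting, for example for a software version mismatch.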

  5. Q: Why do I see NAs in the prediction output?

    A: Please first check whether the demo input works for you. If you are using the correct genome and annotation versions (matching your VCF coordinates), then either you have low-quality calls (such as many Ns in the BAM) or some of the annotations/feature extractions do not match.
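One common source of mismatches is contig naming, for example `chr1` in the VCF versus `1` in the reference. The sketch below compares the contigs used in a VCF with those listed in a FASTA index (`.fai`), where the first tab-separated column is the sequence name. It is a generic check, not a DeepMosaic feature, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def read_fai_contigs(fai_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from the first column of a FASTA index (.fai)."""
    return {line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()}


def contigs_missing_from_reference(
    vcf_lines: Iterable[str], fai_lines: Iterable[str]
) -> List[str]:
    """Return the sorted VCF contig names that are absent from the reference index."""
    reference = read_fai_contigs(fai_lines)
    vcf_contigs = {
        line.split("\t", 1)[0]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(vcf_contigs - reference)
```

A non-empty result suggests that the VCF and the reference use different naming schemes or genome builds.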
