Assessing imputation accuracy #95

kiran-lee · 2024-03-20T16:52:26Z

Do you have any suggestion for estimating imputation performance, beyond the plots produced? As stated in your README, "(Plots) are not meant to substitute for a more in depth investigation of imputation performance in your setting".

I have imputed ~1900 samples (~3x coverage on average, per sample) aligned to a reference genome for a recently bottlenecked non-model organism. We do not have a high coverage reference panel.

As far as I am aware, most imputation softwares use a “reference panel” of high coverage genomes, which is used to benchmark imputation success. This involves downsampling a panel of high coverage samples, performing imputation and then assessing concordance of original and downsampled genotypes.

I am thinking to downsample all 1900 samples to eg. 1x and 0.1x coverage using the downsampleToCov = argument, imputing and then testing concordance with the original, albeit not-so-great-coverage-to-start-with samples using, for example: https://github.com/LindoNkambule/VCFCompare

I attach the plots for a chromosome (and the parameters used), which I think has generally run well, though I am yet to optimise k and nGen parameter settings.

STITCH(tempdir = tempdir(), chr="ENA|OU383783|OU383783.1_RagTag", bamlist="bamlistrg.txt", posfile="ENA|OU383783.txt", sampleNames_file="bamlist.txt", outputdir= paste0(getwd(), "/"), K=4, nGen=100, nCores=1)
[2024-02-28 13:55:38] Running STITCH(chr = ENA|OU383783|OU383783.1_RagTag, nGen = 100, posfile = ENA|OU383783.txt, K = 4, S = 1, outputdir = /mnt/fastdata/bop21kgl/RawData/allplates/, nStarts = , tempdir = /scratch/3496373/RtmpiFUn7y, bamlist = bamlistrg.txt, cramlist = , sampleNames_file = bamlist.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)

Thank you for providing and supporting STITCH.

Kiran

Zilong-Li · 2024-03-22T12:58:17Z

Hey Kiran,

You can take a look at the vcfcomp function in https://github.com/Zilong-Li/vcfppR. You need to prepare 3 files, i.e imputed vcf, true vcf and a tsv file of allele frequency https://github.com/Zilong-Li/lcWGS-imputation-workflow/blob/main/config/af.tsv.

Then you can calculate correlation (r2) between the imputed dosage and the truth genotypes via

vcfcomp(test = "imputed.vcf.gz", truth = "true.vcf.gz", af = "af.tsv", stats = "r2")

You need to install the latest vcfppR by remotes::install_github("Zilong-Li/vcfppR"), since it is not yet on CRAN.

Hope it helps.

kiran-lee · 2024-03-26T11:28:28Z

Hi Zilong-Li,

Is the vcfcomp correlation (r2) different from the plot produced in STITCH ? It seems like they report the same r2.

Zilong-Li · 2024-03-26T12:36:57Z

Ah, sorry. I thought you have high coverage sequencing samples that can be used as true genotypes. If you only have maximum 3x samples, then you can only filter the variants based on INFO score (say >0.8) , Minor Allele Frequency (>0.05) and so on.

kiran-lee · 2024-03-26T14:18:46Z

Unfortunately we do not have high coverage samples. This might be a good compromise to not having a high coverage reference panel? :

downsample the 25 best coverage (mean >8x) samples to 1x
impute these downsamples samples with all the other ~1900 samples for a single chromosome
compare r2 of the imputed downsampled samples to their "true" genotypes at 8x mean coverage

Zilong-Li added help wanted accuracy labels Mar 22, 2024

kiran-lee closed this as completed Mar 26, 2024

kiran-lee reopened this Mar 26, 2024

kiran-lee mentioned this issue Apr 4, 2024

vcfcomp returns NAs Zilong-Li/vcfppR#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assessing imputation accuracy #95

Assessing imputation accuracy #95

kiran-lee commented Mar 20, 2024 •

edited

Zilong-Li commented Mar 22, 2024 •

edited

kiran-lee commented Mar 26, 2024

Zilong-Li commented Mar 26, 2024

kiran-lee commented Mar 26, 2024

Assessing imputation accuracy #95

Assessing imputation accuracy #95

Comments

kiran-lee commented Mar 20, 2024 • edited

Zilong-Li commented Mar 22, 2024 • edited

kiran-lee commented Mar 26, 2024

Zilong-Li commented Mar 26, 2024

kiran-lee commented Mar 26, 2024

kiran-lee commented Mar 20, 2024 •

edited

Zilong-Li commented Mar 22, 2024 •

edited