-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assessing imputation accuracy #95
Comments
Hey Kiran, You can take a look at the Then you can calculate correlation (r2) between the imputed dosage and the truth genotypes via
You need to install the latest vcfppR by Hope it helps. |
Hi Zilong-Li, Is the vcfcomp correlation (r2) different from the plot produced in STITCH ? It seems like they report the same r2. |
Ah, sorry. I thought you have high coverage sequencing samples that can be used as true genotypes. If you only have maximum 3x samples, then you can only filter the variants based on INFO score (say >0.8) , Minor Allele Frequency (>0.05) and so on. |
Unfortunately we do not have high coverage samples. This might be a good compromise to not having a high coverage reference panel? :
|
Do you have any suggestion for estimating imputation performance, beyond the plots produced? As stated in your README, "(Plots) are not meant to substitute for a more in depth investigation of imputation performance in your setting".
I have imputed ~1900 samples (~3x coverage on average, per sample) aligned to a reference genome for a recently bottlenecked non-model organism. We do not have a high coverage reference panel.
As far as I am aware, most imputation softwares use a “reference panel” of high coverage genomes, which is used to benchmark imputation success. This involves downsampling a panel of high coverage samples, performing imputation and then assessing concordance of original and downsampled genotypes.
I am thinking to downsample all 1900 samples to eg. 1x and 0.1x coverage using the downsampleToCov = argument, imputing and then testing concordance with the original, albeit not-so-great-coverage-to-start-with samples using, for example: https://github.com/LindoNkambule/VCFCompare
I attach the plots for a chromosome (and the parameters used), which I think has generally run well, though I am yet to optimise k and nGen parameter settings.
STITCH(tempdir = tempdir(), chr="ENA|OU383783|OU383783.1_RagTag", bamlist="bamlistrg.txt", posfile="ENA|OU383783.txt", sampleNames_file="bamlist.txt", outputdir= paste0(getwd(), "/"), K=4, nGen=100, nCores=1)
[2024-02-28 13:55:38] Running STITCH(chr = ENA|OU383783|OU383783.1_RagTag, nGen = 100, posfile = ENA|OU383783.txt, K = 4, S = 1, outputdir = /mnt/fastdata/bop21kgl/RawData/allplates/, nStarts = , tempdir = /scratch/3496373/RtmpiFUn7y, bamlist = bamlistrg.txt, cramlist = , sampleNames_file = bamlist.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
Thank you for providing and supporting STITCH.
Kiran
The text was updated successfully, but these errors were encountered: