Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assessing imputation accuracy #95

Open
kiran-lee opened this issue Mar 20, 2024 · 4 comments
Open

Assessing imputation accuracy #95

kiran-lee opened this issue Mar 20, 2024 · 4 comments

Comments

@kiran-lee
Copy link

kiran-lee commented Mar 20, 2024

Do you have any suggestion for estimating imputation performance, beyond the plots produced? As stated in your README, "(Plots) are not meant to substitute for a more in depth investigation of imputation performance in your setting".

I have imputed ~1900 samples (~3x coverage on average, per sample) aligned to a reference genome for a recently bottlenecked non-model organism. We do not have a high coverage reference panel.

As far as I am aware, most imputation softwares use a “reference panel” of high coverage genomes, which is used to benchmark imputation success. This involves downsampling a panel of high coverage samples, performing imputation and then assessing concordance of original and downsampled genotypes.

I am thinking to downsample all 1900 samples to eg. 1x and 0.1x coverage using the downsampleToCov = argument, imputing and then testing concordance with the original, albeit not-so-great-coverage-to-start-with samples using, for example: https://github.com/LindoNkambule/VCFCompare

I attach the plots for a chromosome (and the parameters used), which I think has generally run well, though I am yet to optimise k and nGen parameter settings.

hapSum_log ENA|OU383783|OU383783 1_RagTag s 1
hapSum ENA|OU383783|OU383783 1_RagTag s 1
metricsForPostImputationQC ENA|OU383783|OU383783 1_RagTag sample
metricsForPostImputationQCChromosomeWide ENA|OU383783|OU383783 1_RagTag sample
r2 ENA|OU383783|OU383783 1_RagTag goodonly

STITCH(tempdir = tempdir(), chr="ENA|OU383783|OU383783.1_RagTag", bamlist="bamlistrg.txt", posfile="ENA|OU383783.txt", sampleNames_file="bamlist.txt", outputdir= paste0(getwd(), "/"), K=4, nGen=100, nCores=1)
[2024-02-28 13:55:38] Running STITCH(chr = ENA|OU383783|OU383783.1_RagTag, nGen = 100, posfile = ENA|OU383783.txt, K = 4, S = 1, outputdir = /mnt/fastdata/bop21kgl/RawData/allplates/, nStarts = , tempdir = /scratch/3496373/RtmpiFUn7y, bamlist = bamlistrg.txt, cramlist = , sampleNames_file = bamlist.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)

Thank you for providing and supporting STITCH.

Kiran

@Zilong-Li
Copy link
Collaborator

Zilong-Li commented Mar 22, 2024

Hey Kiran,

You can take a look at the vcfcomp function in https://github.com/Zilong-Li/vcfppR. You need to prepare 3 files, i.e imputed vcf, true vcf and a tsv file of allele frequency https://github.com/Zilong-Li/lcWGS-imputation-workflow/blob/main/config/af.tsv.

Then you can calculate correlation (r2) between the imputed dosage and the truth genotypes via

vcfcomp(test = "imputed.vcf.gz", truth = "true.vcf.gz", af = "af.tsv", stats = "r2")

You need to install the latest vcfppR by remotes::install_github("Zilong-Li/vcfppR"), since it is not yet on CRAN.

Hope it helps.

@kiran-lee
Copy link
Author

Hi Zilong-Li,

Is the vcfcomp correlation (r2) different from the plot produced in STITCH ? It seems like they report the same r2.

@Zilong-Li
Copy link
Collaborator

Ah, sorry. I thought you have high coverage sequencing samples that can be used as true genotypes. If you only have maximum 3x samples, then you can only filter the variants based on INFO score (say >0.8) , Minor Allele Frequency (>0.05) and so on.

@kiran-lee
Copy link
Author

Unfortunately we do not have high coverage samples. This might be a good compromise to not having a high coverage reference panel? :

  1. downsample the 25 best coverage (mean >8x) samples to 1x
  2. impute these downsamples samples with all the other ~1900 samples for a single chromosome
  3. compare r2 of the imputed downsampled samples to their "true" genotypes at 8x mean coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants