Model Fit vs k-mer size #32
Hi Ben,
Thanks for your interest. I would say this is great consistency between the
two runs: a 1% variance in genome size and a 5% variance in the expected
heterozygosity rate is within the limits of the data. I noticed your kmer
peaks are truncated at 100,000, which probably contributes to some of this
variance, especially for the genome size. If you increase this to 1,000,000
I would expect the values to be even closer.
For most genomes, we recommend a kmer length of 21 to balance out
repetitive elements vs the impact of sequencing errors and heterozygosity.
This is more-or-less in agreement with the value Merqury suggests over
typical genome sizes (a few megabases to a few gigabases). If you use a
shorter kmer, then more of the genome becomes repetitive until the peaks
are poorly resolved. If you use a longer kmer, more of the kmers will be
unique, although this will increasingly obscure how the heterozygosity is
distributed within the genome. For a given amount of heterozygosity
observed in the kmers, GenomeScope infers the level of heterozygosity at
the nucleotide level by assuming the heterozygosity is distributed at a
uniform rate. This was necessary for modeling purposes, but in practice,
heterozygosity is not uniformly distributed due to variable selective
pressures (coding/non-coding etc) so there can be some variability
depending on the kmer length used.
As an extreme example, in a typical human genome there is a heterozygous
variant about every 1,000 bp. So, if you used 1,000-mers or longer, almost
every kmer in a human genome would look heterozygous, but it would be
ambiguous whether every nucleotide is heterozygous or only 1 in 1,000. As
you shrink the kmer length, fewer and fewer kmers will overlap a
heterozygous nucleotide, until the true value of about 0.1% can be recovered.
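The intuition above can be sketched with a back-of-envelope calculation (not GenomeScope's actual model, just the uniform-rate assumption taken at face value): if heterozygous sites occur independently at rate r per base, the chance that a kmer of length k overlaps at least one is 1 - (1 - r)^k.

```python
# Back-of-envelope: fraction of kmers expected to overlap at least one
# heterozygous site, assuming het sites occur independently at rate r
# per base (the uniform-rate assumption; real genomes are lumpier).
def het_kmer_fraction(r: float, k: int) -> float:
    return 1.0 - (1.0 - r) ** k

# Human-like rate: one het site per 1,000 bp
r = 0.001
print(f"k=21:   {het_kmer_fraction(r, 21):.1%}")    # ~2% of kmers look het
print(f"k=1000: {het_kmer_fraction(r, 1000):.1%}")  # most kmers look het
```

Plugging in the human example: at k=1,000 roughly 63% of kmers overlap a het site, while at k=21 only about 2% do, so short kmers let the model resolve the underlying 0.1% per-base rate.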
Hope this helps!
Mike
…On Mon, May 4, 2020 at 10:49 PM Ben Mansfeld wrote:
Hi Mike,
Hope all is well with you
I'm revisiting our GenomeScope analyses with cleaner reads and was curious
about a trend I see in many papers. As well as looking at other k-mer based
assembly evaluation tools such as Merqury
<https://github.com/marbl/merqury>
It seems that, in the absence of knowing what size k-mer to use, many people
(self included) run Jellyfish with different k's and, after using GenomeScope
to estimate their genome parameters, use the Model Fit reported by GS to
select the optimal value of k.
The fit values usually range within the 90s, so I wonder how
useful/accurate this approach is...
Furthermore, it seems that in some cases people report moderately
different results from GS using different k's.
In my case it's rather marginal to the final result (704 vs 711 Mb size;
1.4 vs 1.33% het), but using different k's results in visibly different
histograms.
k=21 -
http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO
k=31 -
http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe
Which makes me wonder if there is a "correct" value for k? I've done some
searching but the answers are confusing.
The tool I mention above (Merqury) has a bash script
<https://github.com/marbl/merqury/blob/master/best_k.sh> to calculate the
best k, which seems to be a rather simple equation based on Fofanov et al. (2004).
Software like kmergenie estimates the best kmer length for assembly by
finding the k that maximizes the number of distinct kmers. Is this
appropriate for size and heterozygosity estimation?
In any case any advice would be welcome. How should one find the best k
for running GS?
Thanks in advance,
Ben
In general do you have best practice re
Mike, as usual thanks for the quick and informative reply. The example helps a lot.
Oh, and I agree, I think these numbers are all fair estimates for both values of k.
I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as the max kmer coverage in GS as well. Am I confusing these terms?
Yes, Merqury suggested a k-mer of 20 as optimal for our genome size, whether diploid or haploid. But looking at the distributions I noticed that the 1x het peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k, more unique kmers.
So, where does the model fit... um, fit? It seems arbitrary to me to use this as a "kmer selection method". What are your thoughts?
Thanks again,
Ben
Since jellyfish imposes a maximum count on the histogram, there might be
kmers that occur more than 100,000 times; they are just reported as 100,000
in the file. The number (and sequence) of kmers that exceed this limit will
depend in part on the kmer length -- short kmers are less impacted by the
ends of the reads and sequencing errors, and so tend to have higher
coverage. So you might partially stabilize the results by increasing the
maximum kmer count to around 500,000 so that all kmers are considered in
both runs.
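The effect of that truncation on a genome-size estimate can be seen with a small simulation. This is my own sketch using the simple total-kmer-mass / peak-coverage estimator, not GenomeScope's actual fitting procedure, and the toy histogram values are made up:

```python
# Sketch: how capping the histogram's maximum count biases a simple
# genome-size estimate (total kmer mass / kmer coverage at the main peak).
# Illustration only, not GenomeScope's model fit.

def genome_size(hist, peak_coverage):
    """hist: {coverage: number_of_distinct_kmers}."""
    total_kmer_mass = sum(cov * n for cov, n in hist.items())
    return total_kmer_mass / peak_coverage

def truncate(hist, max_cov):
    """Report every kmer above max_cov as max_cov, like jellyfish's cap."""
    out = {}
    for cov, n in hist.items():
        c = min(cov, max_cov)
        out[c] = out.get(c, 0) + n
    return out

# Toy histogram: a homozygous peak at 50x plus a handful of very
# repetitive kmers whose true coverage exceeds the cap.
hist = {25: 1_000_000, 50: 10_000_000, 100: 50_000, 500_000: 10}
full = genome_size(hist, 50)
capped = genome_size(truncate(hist, 100_000), 50)
print(f"untruncated: {full:,.0f}  capped at 100k: {capped:,.0f}")
```

Capping discards kmer mass from the highest-coverage repeats, so the capped estimate comes out smaller, on the order of the ~1% variance seen between the two runs.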
And for your second question, the relative heights of the peaks directly
depends on the choice of k since a single heterozygous base leads to 2*k
heterozygous kmers (k from the maternal haplotype and k from the paternal).
Consequently, the heterozygous peak for 31mers will be about 50% taller
than it will be for 21mers from the same data (it is actually a little more
subtle than this, since changing k will also change the average kmer
coverage). The GenomeScope model knows how to correct for this, which is
why the estimates differ by only a few percent and not 50%. I would
encourage you to read the GenomeScope 1.0 supplement, where these factors
are discussed -- you may even want to implement Supplemental equations 2
and 3 in R/python so you can play with how different parameters impact the
shapes of the curves. The nls_4peak function here will get you started:
https://github.com/schatzlab/genomescope/blob/master/genomescope.R
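The two effects mentioned above -- 2*k het kmers per het site, and k-dependent average kmer coverage -- can be sketched numerically. This is a toy calculation using the standard read-length relation for kmer coverage, not the supplemental equations themselves, and the coverage/read-length values are assumed:

```python
# Toy arithmetic for the two k-dependent effects described above.
# Assumes the standard relation lambda_k = C * (L - k + 1) / L for average
# kmer coverage, given base coverage C and read length L.

def het_kmers_per_site(k: int) -> int:
    # A single het base creates k affected kmers on each haplotype.
    return 2 * k

def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    return base_coverage * (read_len - k + 1) / read_len

ratio = het_kmers_per_site(31) / het_kmers_per_site(21)
print(f"het peak height, 31mers vs 21mers: ~{ratio:.2f}x (about 50% taller)")

C, L = 50, 150  # assumed base coverage and read length
print(f"avg kmer coverage at k=21: {kmer_coverage(C, L, 21):.1f}x")
print(f"avg kmer coverage at k=31: {kmer_coverage(C, L, 31):.1f}x")
```

The 31/21 ratio is about 1.48, which is the "~50% taller" figure, while the coverage relation shows why the peak positions also shift slightly with k.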
Hope this helps
Mike
Thanks Mike! I'll re-read the supplement.