Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model Fit vs k-mer size #32

Open
bmansfeld opened this issue May 5, 2020 · 4 comments
Open

Model Fit vs k-mer size #32

bmansfeld opened this issue May 5, 2020 · 4 comments

Comments

@bmansfeld
Copy link

Hi Mike,
Hope all is well with you
I'm revisiting our GenomeScope analyses with cleaner reads and was curious about a trend I see in many papers. As well as looking at other k-mer based assembly evaluation tools such as Merqury

It seems that in lieu of knowing what size k-mer to use, many people (self included) run Jellyfish with different k's and after using GenomeScope to estimate their genome parameters use the Model fit reported by GS to select the optimal value for k.

The fit values usually range within the 90's so I wonder how useful/accurate this approach is...

Furthermore, it seems that in some cases people report moderately different results from GS using different k's.

In my case its rather marginal to the final result (704 vs 711Mb size; 1.4 vs 1.33% het) , but using different ks results in visibly different histograms.
k=21 - http://qb.cshl.edu/genomescope/analysis.php?code=IKX3YGi4tZUIurkadcZO
k=31 - http://qb.cshl.edu/genomescope/analysis.php?code=Dzsv8ydHbwFiRgfkmyOe

Which makes me wonder if there is a "correct" value for k? I've done some searching but answers are confusing.

The tool I mention above (Merqury) has bash script to calculate best k which seems a rather simple equation based on Fofanov et al (2004).

Software like kmergenie estimate the best kmer length for assembly by finding the max distinct kmers. Is this appropriate for size and heterzygosity estimation?
In any case any advice would be welcome. How should one find the best k for running GS?
Thanks in advance,
Ben

In general do you have best practice re

@mschatz
Copy link
Contributor

mschatz commented May 7, 2020 via email

@bmansfeld
Copy link
Author

Mike as usual thanks for the quick and informative reply. The example helps understand a lot.
Oh and I agree I think these numbers are all fair estimates in both values of k.

I noticed your kmer
peaks are truncated at 100,000 which probably contributes to some of this
variance, especially for the genome size. If you increase this to 1,000,000
I would be the values to be even closer.

I'm not sure I understand your meaning about truncated peaks. I've included an upper count of 1e6 in jellyfish and set that as max kmer coverage in GS as well. Am I confusing these terms?

Yes, Merqury suggested a k-mer of 20 as optimal for our genome size whether diploid or haploid.
But looking at the distributions I noticed that the 1x-het-peak looks shorter with that value, and was wondering if we were excluding potential hets. As you say, larger k more unique kmers.

So, where does the model fit... um fit? It seems arbitrary to me to use this as a "kmer selection method"? What are your thoughts?

Thanks again,
Ben

@mschatz
Copy link
Contributor

mschatz commented May 11, 2020 via email

@bmansfeld
Copy link
Author

bmansfeld commented May 13, 2020

Thanks mike! I'll re-read the supplement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants