Seed for minhash algorithm #11

Open
CBorreda opened this issue Feb 10, 2020 · 8 comments

@CBorreda

I've read the kmer-db paper and wasn't able to find whether kmer-db uses a seed for minhashing during the build step. I've run kmer-db build twice (from KMC-counted k-mers) and it seems to use the same seed every time, since the results are identical. Is there a way to change this seed? I'd like to generate ~100-200 distance matrices and use them as support values for the distance estimates, but I would need to minhash with a different seed each time.

Best

Carles

@agudys agudys self-assigned this Feb 10, 2020
@agudys
Member

agudys commented Feb 10, 2020

Dear Carles,

At the moment it is not possible to use different seeds - we will add this feature in the next release. In the meantime, you can try generating a distance matrix without minhashing (no -f parameter specified) to obtain more stable results. How large is your dataset?

Regards,
Adam

@CBorreda
Author

Yes, I know I could use the full set of k-mers, but that is too much to run on my machine.

I am analyzing 75 samples resequenced with Illumina. I used -ci5 in KMC to get rid of erroneous k-mers (those with a count lower than 5) since, as far as I understand, they might inflate the RAM usage of kmer-db build. I checked, and 5 is the highest abundance threshold I can filter by in my samples, mainly because of some low-coverage samples I need to keep in the dataset.

I have run the whole pipeline (build, all2all and distance) on 3% of the k-mers, and it took about 40% of my RAM. I could try increasing the fraction to 5 or 10%, but I don't think I'll be able to use the whole dataset. Still, the tree looks good so far; I just want to give it some bootstrap support. Since I have other projects to work on, I could switch to one of them for a while and come back later to check whether the feature has been implemented. I see this project is under constant development.

Best
Carles

@agudys
Member

agudys commented Feb 10, 2020

Actually, there is something you could use. There is an undocumented option, -f-start, that was designed to process all k-mers in portions. It sets the relative minimum threshold of the minhash filter (while -f is the filter width). Therefore, you can, for instance, run kmer-db 10 times, with each run analyzing a different 10% of the k-mers:

-f 0.1 
-f 0.1 -f-start 0.1
-f 0.1 -f-start 0.2
...
-f 0.1 -f-start 0.9

It's not exactly bootstrapping (no replacement in sampling), but maybe you will find it useful.
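To illustrate the idea of processing k-mers in portions, here is a minimal Python sketch. It assumes the filter maps each k-mer to a position in a hash range and keeps those that land inside a window. The hash function (BLAKE2b), the helper names `kmer_hash`, `window_index` and `window_options`, and the toy 6-mers are all hypothetical choices for illustration; they are not taken from kmer-db's source. Only the `-f` and `-f-start` option strings follow the listing above.

```python
import hashlib
from itertools import product

HASH_SPACE = 2**64  # size of the 64-bit hash range used in this sketch


def kmer_hash(kmer):
    """Illustrative 64-bit hash; kmer-db's real hash function may differ."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_index(kmer, n_windows):
    """Index of the equal-width hash window that this k-mer falls into."""
    return kmer_hash(kmer) * n_windows // HASH_SPACE


def window_options(n_windows):
    """Option strings for each window, in the form used in the listing."""
    width = 1 / n_windows
    options = []
    for i in range(n_windows):
        option = f"-f {width:g}"
        if i > 0:  # the first run in the listing omits -f-start
            option += f" -f-start {i / n_windows:g}"
        options.append(option)
    return options


# Toy data: all 4096 DNA 6-mers.
kmers = ["".join(p) for p in product("ACGT", repeat=6)]
n_windows = 10

# Every k-mer lands in exactly one window, so the windows partition the set.
counts = [0] * n_windows
for kmer in kmers:
    counts[window_index(kmer, n_windows)] += 1

for option, count in zip(window_options(n_windows), counts):
    print(f"{option:<20} {count} k-mers")
```

Because each k-mer belongs to exactly one window, the ten runs together see every k-mer once, which is why this scheme is sampling without replacement.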

@agudys
Member

agudys commented Feb 10, 2020

I accidentally sent you only half of the comment, but it's been edited now :)

@CBorreda
Author

Very nice! You're right, this is not exactly what I was looking for (due to the lack of replacement in sampling), but it will certainly let me test the robustness of the tree. Still, I'll check for updates on the main request about seeding.

I was wondering how this option would handle overlapping windows, say

-f 0.1 -f-start 0
-f 0.1 -f-start 0.01
-f 0.1 -f-start 0.02

I guess it would resample (though not randomly) part of the k-mers?

Best
Carles
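The overlap in question can be sketched in Python under the same kind of assumption: a k-mer is kept when its hash falls in the window [start, start + width) of the hash range. The hash (BLAKE2b) and the `select` helper are hypothetical and only mimic the described behaviour of `-f` and `-f-start`; they are not kmer-db code.

```python
import hashlib
from itertools import product

HASH_SPACE = 2**64  # size of the 64-bit hash range used in this sketch


def kmer_hash(kmer):
    """Illustrative 64-bit hash; kmer-db's real hash function may differ."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select(kmers, start, width):
    """K-mers whose hash lies in [start, start + width) of the hash range."""
    low = int(start * HASH_SPACE)
    high = int((start + width) * HASH_SPACE)
    return {k for k in kmers if low <= kmer_hash(k) < high}


# Toy data: all 4096 DNA 6-mers.
kmers = ["".join(p) for p in product("ACGT", repeat=6)]

a = select(kmers, 0.00, 0.1)  # like -f 0.1 -f-start 0
b = select(kmers, 0.01, 0.1)  # like -f 0.1 -f-start 0.01
c = select(kmers, 0.02, 0.1)  # like -f 0.1 -f-start 0.02

print("window sizes:", len(a), len(b), len(c))
print("shared between first and second window:", len(a & b))
print("shared between first and third window:", len(a & c))
```

In this sketch, windows shifted by 0.01 share the k-mers hashed into the common part of their ranges, so consecutive runs reuse most of the same k-mers and the shared part shrinks as the shift grows.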

@agudys
Member

agudys commented Feb 10, 2020

Exactly, you'll have overlapping k-mer spectra used in distance calculation. To have real bootstrapping, different seeds are needed. We'll work on that.
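As a rough illustration of why a seed matters, the following Python sketch mixes a seed into the hash so that each seed keeps a different subset of k-mers. This is only an assumption about how a seeded filter could behave: the use of BLAKE2b with its `salt` parameter and the names `seeded_hash` and `subsample` are hypothetical, and nothing here reflects a kmer-db feature or option.

```python
import hashlib
from itertools import product

HASH_SPACE = 2**64  # size of the 64-bit hash range used in this sketch


def seeded_hash(kmer, seed):
    """Illustrative seeded 64-bit hash (seed passed as the BLAKE2b salt)."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def subsample(kmers, fraction, seed):
    """Keep k-mers whose seeded hash falls below fraction * HASH_SPACE."""
    threshold = int(fraction * HASH_SPACE)
    return {k for k in kmers if seeded_hash(k, seed) < threshold}


# Toy data: all 4096 DNA 6-mers.
kmers = ["".join(p) for p in product("ACGT", repeat=6)]

# One subset per seed; the same seed always reproduces the same subset.
replicates = [subsample(kmers, 0.1, seed) for seed in range(5)]
for seed, replicate in enumerate(replicates):
    shared = len(replicate & replicates[0])
    print(f"seed {seed}: {len(replicate)} k-mers, {shared} shared with seed 0")
```

Each replicate here still samples k-mers without replacement, so this sketch shows seed-dependent subsampling rather than a full bootstrap.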

@CBorreda
Author

Hi there,

Have you managed to implement a way to specify a seed for the minhash algorithm, as we discussed? I even tried digging into your source code, but without C knowledge I can't really understand what's going on there.

Best,

Carles

@agudys
Member

agudys commented Jun 29, 2020

Hello!

We had some ideas about it, but we didn't want to provide a solution without testing whether it's properly random. We'll dig into it again soon and let you know.

Adam
