Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setting "scaled=1" doesn't retrieve all protein k-mers #701

Closed
apcamargo opened this issue Jul 21, 2019 · 4 comments
Closed

Setting "scaled=1" doesn't retrieve all protein k-mers #701

apcamargo opened this issue Jul 21, 2019 · 4 comments
Labels

Comments

@apcamargo
Copy link

apcamargo commented Jul 21, 2019

Hi!

I was evaluating sourmash as a fast option to create signatures from the full k-mer set of single protein sequences. According to the documentation, setting scaled=1 would retrieve all the k-mers of the input sequence.

However, in my tests I observed that the sketches were much smaller than I'd expect. Indeed, the number of hashes in the MinHash objects is much smaller than the real number of k-mers (the magnitude of this difference depends on the protein sequence).

sourmash_protein_test.txt

I'd expect this behavior for nucleotide sequences, due to canonical k-mers, but not for protein sequences. I don't think that hash collision would be causing this. I'm I missing something?

Thank you!

@luizirber luizirber added the bug label Aug 23, 2019
@ctb
Copy link
Contributor

ctb commented Aug 23, 2019

hi @apcamargo thanks for posting this issue, and especially for providing the code you used!

The core discrepancy is because our documentation is bad and you are using MinHash.add_sequence rather than MinHash.add_protein to hash amino acid sequences.
MinHash.add_sequence is meant for DNA sequence and will do all the translations etc. etc. This explains the weird hashing results - it's ignoring all the non-ATCG k-mers :). I'm going to dig into this a bit more and figure out what our documentation should say... and maybe identify some warnings that we should be providing...

When I update your code to use MinHash.add_protein and print out the numbers, I get:

226
226
95
99

indicating that the hash count for amino acid 11-mers is off by 4 from the set based approach on the 11-mers. I'm not sure why this is but will dig into that separately.

apologies & thanks again!

@ctb
Copy link
Contributor

ctb commented Aug 23, 2019

ok, the second discrepancy (95 vs 99 k-mers in the second protein MinHash) is because add_protein divides the k-mer size by 3. So in your example code you would need to set the k-size to 33, not 11. Then not only do the sizes of the k-mer sets match up, but the contents do as well.

Documentation fail and quite possible a design WTF on our side. Apologies!

@ctb
Copy link
Contributor

ctb commented Jan 8, 2020

Closed cf #720

@ctb ctb closed this as completed Jan 8, 2020
@apcamargo
Copy link
Author

I'm very sorry. I think I missed your response at the time! Thank you @ctb !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants