Setting "scaled=1" doesn't retrieve all protein k-mers #701

apcamargo · 2019-07-21T21:22:01Z

Hi!

I was evaluating sourmash as a fast option to create signatures from the full k-mer set of single protein sequences. According to the documentation, setting scaled=1 would retrieve all the k-mers of the input sequence.

However, in my tests I observed that the sketches were much smaller than I'd expect. Indeed, the number of hashes in the MinHash objects is much smaller than the real number of k-mers (the magnitude of this difference depends on the protein sequence).
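To make the expectation concrete, here is a minimal sketch of a scaled sketch in plain Python. It does not use sourmash and is not its implementation: the `kmers` and `scaled_sketch` helpers, the blake2b stand-in hash, and the sequence are all made up for illustration. The point is only that with `scaled=1` the threshold covers the whole 64-bit hash range, so every distinct k-mer should be kept.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def scaled_sketch(seq, k, scaled):
    """Keep only hashes below 2**64 // scaled.

    Assumption: blake2b truncated to 64 bits stands in for the real
    hash function; this is an illustration, not sourmash's code.
    """
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h < max_hash:
            kept.add(h)
    return kept


protein = "MKVLAAGIVGLLLAQWERTYIPASDFGHKLCVNM"  # made-up sequence
k = 5

distinct = set(kmers(protein, k))
full = scaled_sketch(protein, k, scaled=1)
subsampled = scaled_sketch(protein, k, scaled=1000)

print(len(distinct), len(full), len(subsampled))
```

Under these assumptions, the `scaled=1` sketch has one hash per distinct k-mer, while a larger `scaled` value keeps a subset of those hashes.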

sourmash_protein_test.txt

I'd expect this behavior for nucleotide sequences, due to canonical k-mers, but not for protein sequences. I don't think hash collisions would be causing this. Am I missing something?

Thank you!


ctb · 2019-08-23T23:27:09Z

hi @apcamargo thanks for posting this issue, and especially for providing the code you used!

The core discrepancy arises because our documentation is bad and you are using MinHash.add_sequence rather than MinHash.add_protein to hash amino acid sequences.
MinHash.add_sequence is meant for DNA sequences and will do all the translations, etc. This explains the weird hashing results: it's ignoring all the non-ATCG k-mers :). I'm going to dig into this a bit more and figure out what our documentation should say... and maybe identify some warnings that we should be providing...
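As a rough illustration of why so few hashes survive, the sketch below counts how many k-mers of a protein string happen to consist only of the letters A, C, G and T. It is plain Python with a made-up sequence and a hypothetical helper (`dna_only_kmers`); it deliberately leaves out reverse complements and translation, so it is not what MinHash.add_sequence actually does, only a picture of the filtering effect described above.

```python
DNA_LETTERS = set("ACGT")


def dna_only_kmers(seq, k):
    """Return the distinct k-mers of seq made up solely of A, C, G, T."""
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= DNA_LETTERS:
            found.add(kmer)
    return found


protein = "MKTAGCATLLEGACTAGWQRS"  # made-up sequence
k = 3

all_kmers = {protein[i:i + k] for i in range(len(protein) - k + 1)}
valid = dna_only_kmers(protein, k)

print(f"{len(valid)} of {len(all_kmers)} distinct {k}-mers look like DNA")
```

Because A, C, G and T are also amino acid codes, some k-mers pass the filter by chance, and how many depends on the protein, which would be consistent with the sequence-dependent shortfall reported in the original post.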

When I update your code to use MinHash.add_protein and print out the numbers, I get:

indicating that the hash count for amino acid 11-mers is off by 4 from the set-based approach on the 11-mers. I'm not sure why this is, but I will dig into that separately.

apologies & thanks again!

ctb · 2019-08-23T23:45:25Z

ok, the second discrepancy (95 vs 99 k-mers in the second protein MinHash) is because add_protein divides the k-mer size by 3. So in your example code you would need to set the k-size to 33, not 11. Then not only do the sizes of the k-mer sets match up, but the contents do as well.
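The k-size arithmetic can be sketched in plain Python as follows. The helper names and the sequence are hypothetical, and the use of integer division is an assumption: the comment above only says that the k-mer size is divided by 3. The sketch also reflects only the behaviour described in this thread at the time.

```python
def effective_protein_k(ksize):
    """Amino acid k-mer length implied by a given ksize.

    Assumption: integer division by 3, per the description above.
    """
    return ksize // 3


def distinct_kmers(seq, k):
    """Set-based reference: all distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


protein = "MKVLAAGIVGLLLAQWERTYIPASDFGHKLCVNMKVLAAG"  # made-up sequence

for ksize in (11, 33):
    k = effective_protein_k(ksize)
    count = len(distinct_kmers(protein, k))
    print(f"ksize={ksize} -> amino acid k={k}, {count} distinct k-mers")
```

Under this assumption, a k-size of 11 yields 3-mers rather than 11-mers, so the count differs from a set of true 11-mers, while a k-size of 33 yields the intended amino acid 11-mers.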

Documentation fail and quite possibly a design WTF on our side. Apologies!

ctb · 2020-01-08T14:22:22Z

Closed cf #720

apcamargo · 2020-01-09T22:12:05Z

I'm very sorry. I think I missed your response at the time! Thank you @ctb !

luizirber added the bug label Aug 23, 2019

ctb mentioned this issue Aug 24, 2019

Document and/or improve protein MinHash API. #720

Open

ctb closed this as completed Jan 8, 2020

ctb mentioned this issue May 25, 2020

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed

ctb mentioned this issue May 15, 2021

summary: further improvements to protein handling in sourmash #1525

Open
