Setting "scaled=1" doesn't retrieve all protein k-mers #701
Comments
hi @apcamargo thanks for posting this issue, and especially for providing the code you used! The core discrepancy is because our documentation is bad and you are using When I update your code to use
indicating that the hash count for amino acid 11-mers is off by 4 from the set-based approach on the 11-mers. I'm not sure why this is, but I will dig into that separately. Apologies, and thanks again!
OK, the second discrepancy (95 vs 99 k-mers in the second protein MinHash) is because Documentation fail and quite possibly a design WTF on our side. Apologies!
Closed, cf. #720
I'm very sorry. I think I missed your response at the time! Thank you, @ctb!
Hi!
I was evaluating sourmash as a fast option to create signatures from the full k-mer set of single protein sequences. According to the documentation, setting `scaled=1` would retrieve all the k-mers of the input sequence. However, in my tests the sketches were much smaller than I expected: the number of hashes in the MinHash objects is much smaller than the actual number of k-mers (the size of the difference depends on the protein sequence).
sourmash_protein_test.txt
I'd expect this behavior for nucleotide sequences, due to canonical k-mers, but not for protein sequences. I don't think hash collisions would cause this. Am I missing something?
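For reference, the expected k-mer count can be checked independently of sourmash with a plain set-based count. This is only a minimal sketch: the `distinct_kmers` helper and the example sequence are hypothetical and are not taken from the attached script. Note that even without canonicalization, the number of distinct k-mers can fall below `len(sequence) - k + 1` when the sequence contains repeats.

```python
def distinct_kmers(sequence, k):
    """Return the set of distinct k-mers in a sequence (no canonicalization)."""
    sequence = sequence.strip().upper()
    # If k is longer than the sequence, range() is empty and so is the set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Hypothetical protein sequence whose first 12 residues are repeated at the end.
protein = "MKVLAAGIVGLLLAQPSFAMKVLAAGIVGLL"
k = 11

total_windows = len(protein) - k + 1   # number of k-mer positions
kmers = distinct_kmers(protein, k)     # distinct k-mers only

print(f"{total_windows} windows, {len(kmers)} distinct {k}-mers")
```

The size of this set is the number I would expect to see as the hash count of a protein MinHash built with `scaled=1`, assuming the same k-mer length is used on both sides.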
Thank you!