[MRG] add MinHash.kmers_and_hashes(...) and sourmash sig kmers #1695
Codecov Report
@@ Coverage Diff @@
## latest #1695 +/- ##
========================================
Coverage 82.63% 82.64%
========================================
Files 113 114 +1
Lines 11994 12189 +195
Branches 1513 1554 +41
========================================
+ Hits 9911 10073 +162
- Misses 1828 1855 +27
- Partials 255 261 +6
I was just thinking about sth similar! That's a 🆒 feature!
@bluegenes @Glfrey @elzerac @olgabot @mr-eyes ok, I'm getting close to done with the main features, and will round out the tests and the documentation next. At this stage I would very much appreciate feedback on the Note that there is still a big potential problem with invalid DNA sequence, so, um, don't give it bad DNA sequences, k? 😆
ok @mr-eyes we've reached the time that I was dreading... I added some Bad DNA 🧬 into the sequences, and it all broke 😭 this is because bad k-mers are skipped. what do you think about modifying your code from #1653 to yield
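To illustrate why skipping bad k-mers breaks the k-mer/hash pairing, here is a minimal, self-contained sketch. It does not use sourmash; `toy_hash`, `hashes_skipping`, and `hashes_with_none` are hypothetical helpers made up for this example, and the hash is a stand-in rather than the hash sourmash actually uses. The point is only that a hash list with invalid windows dropped can no longer be zipped against the list of k-mers, whereas a `None` placeholder per invalid window keeps positions aligned.

```python
VALID = set("ACGT")

def kmers(seq, k):
    """All k-length windows of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def toy_hash(kmer):
    """Stand-in hash for illustration only (NOT sourmash's hash function)."""
    return sum((i + 1) * ord(c) for i, c in enumerate(kmer))

def hashes_skipping(seq, k):
    """Hashes for valid k-mers only; invalid windows are silently dropped."""
    return [toy_hash(km) for km in kmers(seq, k) if set(km) <= VALID]

def hashes_with_none(seq, k):
    """One entry per window; None marks a window containing an invalid base."""
    return [toy_hash(km) if set(km) <= VALID else None for km in kmers(seq, k)]

seq, k = "ACGNTACG", 3
windows = kmers(seq, k)             # ACG CGN GNT NTA TAC ACG
skipped = hashes_skipping(seq, k)   # only 3 hashes: ACG, TAC, ACG
aligned = hashes_with_none(seq, k)  # 6 entries, 3 of them None

# Zipping against the skipped list pairs 'CGN' with the hash of 'TAC'.
mispaired = list(zip(windows, skipped))
# Zipping against the None-padded list keeps every k-mer with its own hash.
paired = list(zip(windows, aligned))
```

With the placeholder approach, a caller can simply ignore pairs whose hash is `None` and still trust every remaining pair.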
Hello, Thank you so much for working on this and sorry for my late reply. I had to learn for the first time how to use git to go into the developer software. Here is my implementation and it fails with the following error:
Here is what I did to get here:
I think this is ready for review @sourmash-bio/devs!
I was testing the sig kmers CLI and have some comments.
All files used in the testing are attached in the zip file.
Creating the sketch
sourmash sketch dna ref.fa -o ref.fa.sig -p k=31,scaled=1 --check-sequence
Run sig kmers on a typical sequence against ref.fa
sourmash sig kmers --signatures ref.fa.sig --sequences query_1.fa --save-sequences matches_1.fasta --save-kmers kmer-matches_1.csv --check-sequence
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures
merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35
now processing sequence files for matches!
opening sequence file 'query_1.fa'
DONE.
searched 1 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 1 sequences with 0.0 Mbp.
matched and saved a total of 35 k-mers.
found 35 distinct matching hashes (100.0%)
Run sig kmers against ref.fa on a typical sequence with an extra valid k-mer
sourmash sig kmers --signatures ref.fa.sig --sequences query_3.fa --save-sequences matches_3.fasta --save-kmers kmer-matches_3.csv --check-sequence
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures
merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35
now processing sequence files for matches!
opening sequence file 'query_3.fa'
DONE.
searched 1 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 1 sequences with 0.0 Mbp.
matched and saved a total of 35 k-mers.
found 36 distinct matching hashes (100.0%)
Q1. I think there are only 35 distinct matching hashes, not 36. Am I getting it right?
Run sig kmers against ref.fa on a typical sequence with an extra (bad) k-mer
sourmash sig kmers --signatures ref.fa.sig --sequences query_2.fa --save-sequences matches_2.fasta --save-kmers kmer-matches_2.csv --check-sequence --force
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures
merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35
now processing sequence files for matches!
opening sequence file 'query_2.fa'
ERROR in sequence 'with_badkmer_at_the_end', file 'query_2.fa'
invalid DNA character in input k-mer: GCATCGACTAGCTACGGCGATCGACTAAACN
(continuing)
DONE.
searched 0 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 0 sequences with 0.0 Mbp.
matched and saved a total of 0 k-mers.
found 0 distinct matching hashes (0.0%)
Comment: I think the extra bad k-mer at the end should be skipped and the run should still report 35 matched hashes, so the result files in this step should match the query_1 output files.
Thanks @mr-eyes - fixed, with extra tests now! 😎
You are correct. I was counting all of the hashes in the sequence, not just the ones in the query sig!
The Ready for re-review :)
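The counting bug acknowledged above (reporting every distinct hash in the sequence rather than only those present in the query signature) can be sketched with plain sets. This is an illustrative reconstruction, not the code from the PR; `count_matches` and the integer hash values are made up for the example.

```python
def count_matches(query_hashes, seq_hashes):
    """Return (distinct hashes in sequence, distinct hashes also in query).

    None entries stand for invalid k-mers and are ignored.
    """
    query = set(query_hashes)
    seen = {h for h in seq_hashes if h is not None}
    return len(seen), len(seen & query)

# Hypothetical values: the query signature holds 3 hashes; the sequence
# contains those 3 plus one extra valid k-mer whose hash (44) is not in it.
query_hashes = [11, 22, 33]
seq_hashes = [11, 22, None, 33, 44, 11]

in_sequence, matching = count_matches(query_hashes, seq_hashes)
percent = 100.0 * matching / len(set(query_hashes))
```

Here `in_sequence` is 4 but `matching` is 3, which mirrors the 36 versus 35 discrepancy raised in Q1: only the intersection with the query signature should be reported as matching.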
Thank you!
So,
yes, that's what Note that without Incidentally, I did update the error message to make it clear it was skipping the entire sequence.
Looks great, thank you!
will merge when tests pass.
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
hah! I thought I'd merged this already and was surprised to receive a suggestion from @luizirber 😆 . Thanks for the reminder (and the fix!)
🎉
This PR builds on #1653 to provide a way to get the k-mers and/or sequences underlying MinHash sketches - including DNA sketches, translated sketches, and protein/dayhoff/hp sketches.
The first addition is a new MinHash method, kmers_and_hashes(seq), that provides matched tuples of (kmer, hashval) from the sequence.
The second addition is a new command-line method, sourmash sig kmers, that when given one or more signatures and some sequence files, will retrieve all k-mers and/or sequences that correspond to the hashes in the signatures. The usage looks like the following:
with the following output:
Fixes #1372.
Fixes #1692.
Fixes #483.
Replaces #724.
Ref #477.
The code and tests are not yet complete so various edge cases may not quite work yet, but it's headed in the right direction!
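To make the (kmer, hashval) pairing described above concrete, here is a rough, standalone Python sketch of the idea for DNA. It is not sourmash's implementation: `toy_kmers_and_hashes`, `toy_hash`, and `revcomp` are hypothetical names, and `hashlib.blake2b` is swapped in as a stand-in hash, so the values will not match real sourmash hashes. The sketch assumes the hash is computed on the canonical form of each k-mer (the lexicographically smaller of the k-mer and its reverse complement) and that invalid k-mers are reported with `None`.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    """Reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]

def toy_hash(kmer):
    """64-bit stand-in hash (assumption: NOT the hash sourmash uses)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def toy_kmers_and_hashes(seq, k):
    """Yield (kmer, hashval) for every window; hashval is None if invalid."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            canonical = min(kmer, revcomp(kmer))
            yield kmer, toy_hash(canonical)
        else:
            yield kmer, None

pairs = list(toy_kmers_and_hashes("ATGGCANTT", 4))

# Retrieve the k-mers behind a set of hashes, as a sig-kmers-style lookup.
query = {toy_hash("ATGG")}
matches = [(kmer, h) for kmer, h in pairs if h in query]
```

Yielding one tuple per window, valid or not, keeps the k-mers and hashes aligned, and the final list comprehension shows how hashes from a signature could be mapped back to the k-mers that produced them.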
TODO:
MinHash.seq_to_hashes return None for invalid k-mers (#1751)