Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add MinHash.kmers_and_hashes(...) and sourmash sig kmers #1695

Merged
merged 35 commits into from
Aug 7, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 29, 2021

This PR builds on #1653 to provide a way to get the k-mers and/or sequences underlying MinHash sketches - including DNA sketches, translated sketches, and protein/dayhoff/hp sketches.

The first addition is a new MinHash method, kmers_and_hashes(seq), that provides matched tuples of (kmer, hashval) from the sequence.

The second addition is a new command-line method, sourmash sig kmers, that when given one or more signatures and some sequence files, will retrieve all k-mers and/or sequences that correspond to the hashes in the signatures. The usage looks like the following:

sourmash sig kmers --sig xxx.sig --seq podar-ref/1.fa podar-ref/2.fa \
       --save-kmers xxx.csv --save-sequences xxx.fa 

with the following output:

== This is sourmash version 4.2.1.dev9+g02f2128. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 signatures total, from 1 files
loaded and merged 1 signatures

merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=100000 seed=42
total hashes in merged signature: 6

now processing sequence files for matches!
opening sequence file 'podar-ref/1.fa'
... searched 1486778 from 1 files so far
opening sequence file 'podar-ref/2.fa'
... searched 4150880 from 2 files so far
DONE.
searched 2 sequences from 2 files, containing a total of 4.2 Mbp.
matched and saved a total of 1 sequences with 1.5 Mbp.
matched and saved a total of 6 k-mers.
found 6 distinct matching hashes (100.0%)

Fixes #1372.
Fixes #1692.
Fixes #483.
Replaces #724.
Ref #477.

The code and tests are not yet complete so various edge cases may not quite work yet, but it's headed in the right direction!

TODO:

@codecov
Copy link

codecov bot commented Jul 29, 2021

Codecov Report

Merging #1695 (44decbe) into latest (cdc1fc6) will increase coverage by 0.00%.
The diff coverage is 81.28%.

Impacted file tree graph

@@           Coverage Diff            @@
##           latest    #1695    +/-   ##
========================================
  Coverage   82.63%   82.64%            
========================================
  Files         113      114     +1     
  Lines       11994    12189   +195     
  Branches     1513     1554    +41     
========================================
+ Hits         9911    10073   +162     
- Misses       1828     1855    +27     
- Partials      255      261     +6     
Flag Coverage Δ
python 89.99% <85.49%> (-0.11%) ⬇️
rust 66.41% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/ffi/minhash.rs 0.00% <0.00%> (ø)
src/sourmash/cli/sketch/dna.py 100.00% <ø> (ø)
src/sourmash/sig/__main__.py 90.66% <78.46%> (-3.23%) ⬇️
src/sourmash/cli/sig/__init__.py 100.00% <100.00%> (ø)
src/sourmash/cli/sig/kmers.py 100.00% <100.00%> (ø)
src/sourmash/minhash.py 92.37% <100.00%> (+0.75%) ⬆️
src/sourmash/sourmash_args.py 93.56% <100.00%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cdc1fc6...44decbe. Read the comment docs.

@mr-eyes
Copy link
Member

mr-eyes commented Jul 29, 2021

I was just thinking about sth similar! That's a 🆒 feature!

@ctb
Copy link
Contributor Author

ctb commented Jul 30, 2021

@bluegenes @Glfrey @elzerac @olgabot @mr-eyes ok, I'm getting close to done with the main features, and will round out out the tests and the documentation next. At this stage I would very much appreciate feedback on the sourmash sig kmers output, the output formats, and anything else you are interested in seeing this code do!

Note that there is still a big potential problem with invalid DNA sequence, so, um, don't give it bad DNA sequences, k? 😆

@ctb
Copy link
Contributor Author

ctb commented Jul 30, 2021

ok @mr-eyes we've reached the time that I was dreading... I added some Bad DNA 🧬 into the sequences, and it all broke 😭

b3f82e0

this is because bad k-mers are skipped.

what do you think about modifying your code from #1653 to yield None as the hash value for k-mers in Python when 'force' is used to hash bad DNA? That seems to be fairly parsimonious to me, and shouldn't break existing code 🤞 .

@ctb ctb changed the title [WIP] add MinHash.kmers_and_hashes(...) and sourmash sig kmers, to retrieve k-mers for hashes [WIP] add MinHash.kmers_and_hashes(...) and sourmash sig kmers Jul 30, 2021
@elzerac
Copy link

elzerac commented Jul 30, 2021

Hello, Thank you so much for working on this and sorry for my late reply. I had to learn for the first time how to use git to go into the developer software. Here is my implementation and it fails with the following error:

ERROR when merging signature 'Lyn_proteins.fasta' (b8c953b9) from file Lyn_proteins.fasta.sig different ksizes cannot be compared

Here is what I did to get here:


#use the development version of Sourmash:

#follow the protocol here: https://github.com/sourmash-bio/sourmash/blob/add/sig_kmers/doc/developer.md#running-tests-and-checks

#first install mamba

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh

mamba create -n sourmash_dev tox-conda rust git compilers pandoc
#must re-start shell for it to work

conda activate sourmash_dev

git clone https://github.com/sourmash-bio/sourmash.git

cd sourmash

pip install -r requirements.txt

#make sure that I have the correct version from github
git branch —all

git checkout add/sig_kmers

git status

#now try to get protein signatures and use sig kmers to extract

sourmash sketch protein -p k=7,scaled=1,dayhoff -p k=8,scaled=1,dayhoff Lyn_proteins.fasta

sourmash sig kmers --sig Lyn_proteins.fasta.sig --seq Lyn_proteins.fasta --save-kmers test.csv --save-sequences test_seqs.fa

@ctb ctb changed the title [WIP] add MinHash.kmers_and_hashes(...) and sourmash sig kmers [MRG] add MinHash.kmers_and_hashes(...) and sourmash sig kmers Aug 3, 2021
@ctb
Copy link
Contributor Author

ctb commented Aug 3, 2021

I think this is ready for review @sourmash-bio/devs!

Copy link
Member

@mr-eyes mr-eyes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was testing the sig kmers cli and have some comments.
All files used in the testing is attached in the zip file.

sig_kmers_mo_test_files.zip

Creating the sketch

sourmash sketch dna ref.fa -o ref.fa.sig -p k=31,scaled=1 --check-sequence

Perform sig kmers on a typical sequence to the ref.fa

sourmash sig kmers --signatures ref.fa.sig --sequences query_1.fa --save-sequences matches_1.fasta --save-kmers kmer-matches_1.csv --check-sequence
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures

merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35

now processing sequence files for matches!
opening sequence file 'query_1.fa'
DONE.
searched 1 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 1 sequences with 0.0 Mbp.
matched and saved a total of 35 k-mers.
found 35 distinct matching hashes (100.0%)

Perform sig kmers on a typical sequence to the ref.fa with an extra valid kmer

sourmash sig kmers --signatures ref.fa.sig --sequences query_3.fa --save-sequences matches_3.fasta --save-kmers kmer-matches_3.csv --check-sequence
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures

merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35

now processing sequence files for matches!
opening sequence file 'query_3.fa'
DONE.
searched 1 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 1 sequences with 0.0 Mbp.
matched and saved a total of 35 k-mers.
found 36 distinct matching hashes (100.0%)

Q1. I think there are only 35 distinct matching hashes, not 36. Am I getting it right?



Perform sig kmers on a typical sequence to the ref.fa with an extra (bad) kmer

sourmash sig kmers --signatures ref.fa.sig --sequences query_2.fa --save-sequences matches_2.fasta --save-kmers kmer-matches_2.csv --check-sequence --force
loaded 1 signatures total, from 1 files
loaded and merged 1 signatures

merged signature has the following properties:
k=31 molecule=DNA num=0 scaled=1 seed=42
total hashes in merged signature: 35

now processing sequence files for matches!
opening sequence file 'query_2.fa'
ERROR in sequence 'with_badkmer_at_the_end', file 'query_2.fa'
invalid DNA character in input k-mer: GCATCGACTAGCTACGGCGATCGACTAAACN
(continuing)
DONE.
searched 0 sequences from 1 files, containing a total of 0.0 Mbp.
matched and saved a total of 0 sequences with 0.0 Mbp.
matched and saved a total of 0 k-mers.
found 0 distinct matching hashes (0.0%)

Comment: I think the last extra bad kmer should be skipped and show that there's another 35 matched hashes, so the result files in this step should match the query_1 output files.

@ctb
Copy link
Contributor Author

ctb commented Aug 4, 2021

Thanks @mr-eyes - fixed, with extra tests now! 😎

Q1. I think there are only 35 distinct matching hashes, not 36. Am I getting it right?

You are correct. I was counting all of the hashes in the sequence, not just the ones in the query sig!

Comment: I think the last extra bad kmer should be skipped and show that there's another 35 matched hashes, so the result files in this step should match the query_1 output files.

The MinHash.add_sequence(...) raises an exception if there's a bad character in the sequence itself, so we don't have the granularity to add the individual k-mers at this time.

Ready for re-review :)

@mr-eyes
Copy link
Member

mr-eyes commented Aug 4, 2021

Thank you!

The MinHash.add_sequence(...) raises an exception if there's a bad character in the sequence itself, so we don't have the granularity to add the individual k-mers at this time.

So, --force should skip the whole sequence instead of just the bad kmers in it and continue querying the rest?

@ctb
Copy link
Contributor Author

ctb commented Aug 4, 2021

Thank you!

The MinHash.add_sequence(...) raises an exception if there's a bad character in the sequence itself, so we don't have the granularity to add the individual k-mers at this time.

So, --force should skip the whole sequence instead of just the bad kmers in it and continue querying the rest?

yes, that's what --force with --check-sequence does. Whether it should is maybe a different question 😄

Note that without --check-sequence, the sequence is processed normally.

Incidentally, I did update the error message to make it clear it was skipping the entire sequence.

Copy link
Member

@mr-eyes mr-eyes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

@ctb
Copy link
Contributor Author

ctb commented Aug 4, 2021

will merge when tests pass.

doc/api-example.md Outdated Show resolved Hide resolved
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@ctb
Copy link
Contributor Author

ctb commented Aug 7, 2021

hah! I thought I'd merged this already and was surprised to receive a suggestion from @luizirber 😆 . Thanks for the reminder (and the fix!)

@ctb ctb merged commit 79cb35e into latest Aug 7, 2021
@ctb ctb deleted the add/sig_kmers branch August 7, 2021 14:45
@ctb
Copy link
Contributor Author

ctb commented Aug 7, 2021

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants