[WIP] Add utility scripts for converting hash values to k-mers. #724
Codecov Report
@@            Coverage Diff             @@
##           latest     #724      +/-   ##
==========================================
+ Coverage   89.17%   94.42%   +5.25%
==========================================
  Files         123       96      -27
  Lines       18615    15001    -3614
  Branches     1433     1433
==========================================
- Hits        16599    14164    -2435
+ Misses       1780      601    -1179
  Partials      236      236
|
Yayy! So excited |
So I tested this out to figure out where the signal for the difference in similarity between two mouse bladder cells, with and without abundance, is coming from. I'm not sure I'm doing it quite right, because these 14 hashes (ksize=21, moltype=DNA, num_hashes=500) were the only ones that overlapped between these two cells when reading in their
But then using these scripts, only 11 hashes from the files were extracted, though I'm pretty sure that I used the same fastq files as the original. The ksize (21) and molecule (DNA) are also identical.
hashvals-to-signature.py command and output
stderr:
hashvals-to-signature.py command & output
stderr:
BLAST hits on the k-mers
If I'm doing this right, then it's a little depressing, because it looks to me like a bunch of these hits are from crappy genome assemblies and these k-mers are from lab environment contaminants rather than from mouse bladder-specific genes. That suggests most of the Jaccard similarity is coming from random signal. Or maybe these are real k-mers that are non-mouse, but actually present in the cells and unique to certain cell types? It's especially exciting to see ERCCs here ... 🤦‍♀️
Do you have a sense of what could be happening here? |
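Here's an excerpt from one of the signatures, just to make sure I'm providing all possible information:
import json
from IPython.lib.pretty import pretty
# `folder` is the directory holding the .sig files (defined earlier in the notebook)
with open(f"{folder}/A1-B000610-3_56_F-1-1_S28.sig") as f:
    signature_data = json.load(f)
# Truncate long sequences (such as the list of hash values) to 10 items when printing
print(pretty(signature_data, max_seq_length=10))
Here is the summarized json: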
|
OK, I updated the scripts to be less clever :). I also added a script The key function to look at (and you can use this directly in a notebook if you like...) is In terms of what went wrong before, I was not accounting for using 'num' minhashes, I think, and was assuming you were using scaled minhashes. Mea culpa. The latest code is less clever and should be more robust. |
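For readers following the 'num' versus 'scaled' distinction mentioned above, here is a minimal sketch of the two sketch types. It is not the code from this PR: `toy_hash` is a stand-in built on `hashlib` (sourmash itself uses MurmurHash3), and the function names `num_sketch` and `scaled_sketch` are hypothetical. The point is only that a 'num' sketch keeps a fixed count of the smallest hash values, while a 'scaled' sketch keeps every hash value below a threshold of roughly 2**64 / scaled.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # size of a 64-bit hash space


def toy_hash(kmer):
    """Stand-in 64-bit hash (an assumption; NOT sourmash's MurmurHash3)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def kmers(seq, ksize):
    """Yield every k-mer of length `ksize` in `seq`."""
    return (seq[i:i + ksize] for i in range(len(seq) - ksize + 1))


def num_sketch(seq, ksize, num):
    """'num' style: keep the `num` smallest distinct hash values."""
    return sorted({toy_hash(k) for k in kmers(seq, ksize)})[:num]


def scaled_sketch(seq, ksize, scaled):
    """'scaled' style: keep every distinct hash value below 2**64 // scaled."""
    max_hash = MAX_HASH_SPACE // scaled
    return sorted({h for h in map(toy_hash, kmers(seq, ksize)) if h < max_hash})


seq = "ATGGCGTACGTTAGCCGATAGGCTAACGTTAGC"
print(len(num_sketch(seq, 5, 10)))    # fixed size, regardless of input length
print(len(scaled_sketch(seq, 5, 4)))  # size grows with the number of distinct k-mers
```

A tool that assumes a threshold (`max_hash`) exists will behave differently on a 'num' sketch, where membership depends on rank among all hashes rather than on a fixed cutoff.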
@ctb so I'm testing this (yay!) I would expect that if I calc the sig of some genome and then use the hash2kmer script, I'd obtain the same number of kmers as the number of hash vals used in the sig.
sourmash compute -k 31 -n 100 GCF_000281435.2_ASM28143v2_genomic.fna.gz
python signature-to-kmers.py --output-kmers kmers.txt GCF_000281435.2_ASM28143v2_genomic.fna.gz.sig GCF_000281435.2_ASM28143v2_genomic.fna.gz
# read 5767822 bp, found 56 kmers matching hashvals
Am I doing something wrong? I noticed that the number of returned kmers is roughly half the number of hash vals, which makes me suspect that there is a problem with canonical kmers -- do they "get lost" during backtranslation from hash to kmer? |
Hmm the updates from @ctb seem to still be too clever :) I'm still getting 11/14 hashes back, and it's the same ones as are missing from this comment, so at least it's consistent! Is there another filtering step happening? |
Aaaaaand this is why I should always write tests, ladies and gentlemen... It was an issue of not being minimally clever, @olgabot - I wasn't choosing the correct canonical k-mer representation for reverse complements, so approximately 50% of the hashes were not being found. Tested all three scripts like so:
and got identical output yielding all 1,000 k-mers. |
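As an illustration of the canonical k-mer issue described above, here is a minimal sketch, assuming the usual convention that a DNA k-mer and its reverse complement are represented by whichever of the two sorts first. The helper names are hypothetical and this is not the code from the scripts in this PR. If a lookup hashes the k-mer as read instead of its canonical form, roughly half of the k-mers will produce a hash that is not in the signature.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


forward = "TTTGCA"
reverse = reverse_complement(forward)

# Both strands map to the same canonical form, so both give the same hash.
print(forward, reverse, canonical(forward), canonical(reverse))
```

Hashing `canonical(kmer)` rather than `kmer` means the same hash value is produced no matter which strand a read came from.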
Awesome, thank you! This is yielding the same number of k-mers as before now. Thank you! |
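@olgabot it sounds like this works for you. @luizirber any objection to me merging this and creating issues for:
* adding `hashvals-to-signatures` functionality to `sourmash signature import/export` instead
* generalizing the `get_kmers_for_hashvals` function in `signature-to-kmers` by refactoring some of the code in `_minhash.pyx` to expose a k-mer extraction function on MinHash objects
? |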
I just tried utils/hashvals-to-signature.py. It ran almost instantaneously on a file with 10 hashes, but it never finished after running for 24 hours on a file with 48 million hashes. I'm trying it now with 9 million instead and hoping for a slightly faster run time. |
For 9 million hashes, the run time was 7 hours. |
just to confirm: you ran hashvals-to-signature, not signature-to-kmers? I have to say at that scale I'm ok with a custom solution :). I was not thinking of signatures with millions of hashes when we designed... well, any of sourmash! |
...but the good news is we can cannibalize code from those scripts now that we know they work properly... |
yes! I ran Originally, I generated a set of all abundance 1 hashes across the data set -- this was 48 million hashes (scaled 2000, k 31). I was going to use Then, I decided to make a set of all hashes greater than abundance 1. This was 9 million hashes, and I used Cannibalize away :) |
I'm fine with this plan. Is this ready for review and merge (despite the `[WIP]` in the title), @ctb? |
there aren't any tests, tho. and it's not integrated into sourmash. so...
concerned about that.
--
C. Titus Brown, ctbrown@ucdavis.edu
|
It's also possible that @pranathivemuri's parallelized `compute` code could
be extracted out to do a parallelized hash extraction and filtering
---
Olga Botvinnik, PhD
olgabotvinnik.com <http://www.olgabotvinnik.com>
|
Fair point. Since it doesn't touch anything that is actually changing with #424, I won't worry about merging this soon. |
Would be massively sped up by #1214 |
see #1695 |
closing in favor of #1695. |
Adds two utility scripts in `utils/`, `hashvals-to-signatures.py` and `signature-to-kmers.py`, for the purpose of converting hashes into k-mers.
`hashvals-to-signatures` takes a collection of hash values and turns them into a sourmash signature.
`signature-to-kmers` takes a signature and one or more sequence files, and outputs the k-mers that correspond to the hash values and the sequences that contain those k-mers.
Random musings -
* `hashvals-to-signatures` functionality could/should be added to `sourmash signature import/export` instead
* `get_kmers_for_hashvals` function in `signature-to-kmers` is not very general. I think we should refactor some of the code in `_minhash.pyx` to expose a k-mer extraction function on MinHash objects, which would support a much more general function.
ref #483
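The hash-to-k-mer lookup described above can be sketched roughly as follows. This is an illustrative sketch, not the `get_kmers_for_hashvals` function from the PR: `find_kmers_for_hashvals` and `toy_hash` are hypothetical names, the hash is a `hashlib` stand-in rather than sourmash's MurmurHash3, and sequences are passed as in-memory `(name, sequence)` pairs instead of being read from files. Because hashes cannot be inverted, the only way back to k-mers is to re-hash every k-mer in the input sequences and keep those whose hash is in the target set.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Smaller of a k-mer and its reverse complement (assumed convention)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def toy_hash(kmer):
    """Stand-in 64-bit hash (an assumption; NOT sourmash's MurmurHash3)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def find_kmers_for_hashvals(records, hashvals, ksize):
    """Map each target hash to the first (k-mer, sequence name) that produces it."""
    wanted = set(hashvals)  # set membership keeps each lookup O(1)
    found = {}
    for name, seq in records:
        seq = seq.upper()
        for i in range(len(seq) - ksize + 1):
            kmer = seq[i:i + ksize]
            h = toy_hash(canonical(kmer))  # hash the canonical form, not the raw k-mer
            if h in wanted:
                found.setdefault(h, (kmer, name))
    return found


# read2 is the reverse complement of read1, so both contain every target k-mer.
records = [("read1", "ATGGCGTACG"), ("read2", "CGTACGCCAT")]
targets = [toy_hash(canonical("GGCGT")), toy_hash(canonical("TTTTT"))]
hits = find_kmers_for_hashvals(records, targets, ksize=5)
for hashval, (kmer, name) in hits.items():
    print(hashval, kmer, name)
```

The k-mer is reported as it appears in the sequence, while the hash is computed from its canonical form, so a hit is found on either strand.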
* `make test` Did it pass the tests?
* `make coverage` Is the new code covered?
* [...] without a major version increment. Changing file formats also requires a major version number increment.
* [...] changes were made?