-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return k-mers in addition to sketch/hash values #2073
Comments
what format would be desirable here? CSV hashes <=> k-mers? |
I was initially thinking another entry in the JSON output, similar to how the mins and counts are currently stored as ordered lists. I also forgot to mention, in order to get #2039 to to work properly, you'll want to make a uniform decision about what to do with collisions. The most straightforward way to do this is that if two or more k-mers hash to the same value, then store the lexicographically smallest k-mer. |
I think this can be achieved by creating another vector of kmers in the Rust side alongside the two main vectors sourmash/src/core/src/sketch/minhash.rs Lines 50 to 53 in 13bf0d5
Maybe saving it in memory as a two-bit representation? but this will limit the kmer-size. As long as the output is a text format, then just the kmer characters will be okay to be saved in the JSON. |
Sorry for the confusion. Two-bit representation can't work due to encodings other than DNA. @dkoslicki Will sorting the kmers as a post-process after saving the sketch work? Instead of serializing the kmers in their lexicographical order? |
@mr-eyes I don't know if I quite understand. Do you mean some sort of second pass (when you mentioned post processing)? I imagine that during the min hashing, when inserting hashes, you keep another list in parallel and insert k-mers in it. The lexicographic part is only for resolving collisions. So the k-mers are kept in the same order as the hashes. It's just if ever you have two distinct k-mers hashing to the same value, just save the k-mers with the smaller lexico value. Or do I misunderstand your question? |
@dkoslicki @ctb Please correct me if I am wrong. My guess about collisions is that minhashes are less likely to produce collisions, especially when they are scaled. If my assumption is correct, I would suggest implementing this part without adding extra conditions to resolve collisions for speed.
Yes, exactly! |
collisions do happen - we've found one or two in the wild, over in spacegraphcats! - and since scaled is applied after hashing, the scaled factor does not affect them.
we've talked about this elsewhere - originally in #211, and there's an extended discussion here. I think the place to start is to return them when sketching. Saving them into the signature JSON is a whole separate thing and likely to take a while to happen; if there's an intermediate output format that would work for you, @dkoslicki, that would probably be much easier. |
Collisions will invariably happen; you'll have a list of hashes/mins with values [v1,v2,v3] and a list of corresponding k-mers [k1, k2, k3], and then you have another k-mer k4 with hash v2. Yet k4 != k2, so what to do!? Replace k2 with k4, or leave k2 as is? The idea is to check which is lexicographically smaller: k2 or k4. If k4 is smaller, then your lists become [v1,v2,v3] and [k1,k4,k3], if k2 is smaller, then the lists stay [v1,v2,v3] and [k1,k2,k3]. There is no speed penalty since there is exactly one extra comparison to be had (is k2<=k4 or k2>k4) |
I see! Then a simple csv in the same order as the JSON hashes would be great |
This is a feature request (as discussed with @bluegenes earlier today) to have the k-mers that correspond to the hash values also returned when doing a
sourmash sketch
. I was made aware of sourmash sig kmers, but this is a two-pass method that can be quite slow on large files (the intended use-case). There exists in the API a way to do this, but it would be most useful as a flag tosourmash sketch
(say,--save-kmers
).The intended use cases include:
ArgMinHash
-based FracMinHash #2039The text was updated successfully, but these errors were encountered: