Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return k-mers in addition to sketch/hash values #2073

Open
dkoslicki opened this issue Jun 2, 2022 · 9 comments
Open

Return k-mers in addition to sketch/hash values #2073

dkoslicki opened this issue Jun 2, 2022 · 9 comments

Comments

@dkoslicki
Copy link
Collaborator

This is a feature request (as discussed with @bluegenes earlier today) to have the k-mers that correspond to the hash values also returned when doing a sourmash sketch. I was made aware of sourmash sig kmers, but this is a two-pass method that can be quite slow on large files (the intended use-case). There exists in the API a way to do this, but it would be most useful as a flag to sourmash sketch (say, --save-kmers).

The intended use cases include:

  1. Progress towards ArgFracMinHash as discussed in investigate new sketch type, ArgMinHash-based FracMinHash #2039
  2. I plan on implementing a ternary search tree data structure that will help with large databases of sketches. This will be used for functional profiling investigations, and may help @bluegenes too in speeding up large-scale queries.
@ctb
Copy link
Contributor

ctb commented Jun 2, 2022

what format would be desirable here? CSV hashes <=> k-mers?

@dkoslicki
Copy link
Collaborator Author

I was initially thinking another entry in the JSON output, similar to how the mins and counts are currently stored as ordered lists.

I also forgot to mention, in order to get #2039 to to work properly, you'll want to make a uniform decision about what to do with collisions. The most straightforward way to do this is that if two or more k-mers hash to the same value, then store the lexicographically smallest k-mer.

@mr-eyes
Copy link
Member

mr-eyes commented Jun 2, 2022

I think this can be achieved by creating another vector of kmers in the Rust side alongside the two main vectors mins and abunds.

mins: Vec<u64>,
#[builder(default)]
abunds: Option<Vec<u64>>,

Maybe saving it in memory as a two-bit representation? but this will limit the kmer-size. As long as the output is a text format, then just the kmer characters will be okay to be saved in the JSON.

@mr-eyes
Copy link
Member

mr-eyes commented Jun 2, 2022

Sorry for the confusion. Two-bit representation can't work due to encodings other than DNA.

@dkoslicki Will sorting the kmers as a post-process after saving the sketch work? Instead of serializing the kmers in their lexicographical order?

@dkoslicki
Copy link
Collaborator Author

@mr-eyes I don't know if I quite understand. Do you mean some sort of second pass (when you mentioned post processing)? I imagine that during the min hashing, when inserting hashes, you keep another list in parallel and insert k-mers in it.

The lexicographic part is only for resolving collisions. So the k-mers are kept in the same order as the hashes. It's just if ever you have two distinct k-mers hashing to the same value, just save the k-mers with the smaller lexico value.

Or do I misunderstand your question?

@mr-eyes
Copy link
Member

mr-eyes commented Jun 2, 2022

@dkoslicki @ctb Please correct me if I am wrong. My guess about collisions is that minhashes are less likely to produce collisions, especially when they are scaled.

If my assumption is correct, I would suggest implementing this part without adding extra conditions to resolve collisions for speed.

I imagine that during the min hashing, when inserting hashes, you keep another list in parallel and insert k-mers in it.

Yes, exactly!

@ctb
Copy link
Contributor

ctb commented Jun 2, 2022

@dkoslicki @ctb Please correct me if I am wrong. My guess about collisions is that minhashes are less likely to produce collisions, especially when they are scaled.

collisions do happen - we've found one or two in the wild, over in spacegraphcats! - and since scaled is applied after hashing, the scaled factor does not affect them.

I was initially thinking another entry in the JSON output, similar to how the mins and counts are currently stored as ordered lists.

we've talked about this elsewhere - originally in #211, and there's an extended discussion here.

I think the place to start is to return them when sketching. Saving them into the signature JSON is a whole separate thing and likely to take a while to happen; if there's an intermediate output format that would work for you, @dkoslicki, that would probably be much easier.

@dkoslicki
Copy link
Collaborator Author

Collisions will invariably happen; you'll have a list of hashes/mins with values [v1,v2,v3] and a list of corresponding k-mers [k1, k2, k3], and then you have another k-mer k4 with hash v2. Yet k4 != k2, so what to do!? Replace k2 with k4, or leave k2 as is? The idea is to check which is lexicographically smaller: k2 or k4. If k4 is smaller, then your lists become [v1,v2,v3] and [k1,k4,k3], if k2 is smaller, then the lists stay [v1,v2,v3] and [k1,k2,k3].

There is no speed penalty since there is exactly one extra comparison to be had (is k2<=k4 or k2>k4)

@dkoslicki
Copy link
Collaborator Author

Saving them into the signature JSON is a whole separate thing and likely to take a while to happen; if there's an intermediate output format that would work for you, @dkoslicki, that would probably be much easier.

I see! Then a simple csv in the same order as the JSON hashes would be great

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants