Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to go from the hash value to the k-mer sequence? #483

Closed
phiweger opened this issue May 30, 2018 · 14 comments · Fixed by #1695
Closed

Is there a way to go from the hash value to the k-mer sequence? #483

phiweger opened this issue May 30, 2018 · 14 comments · Fixed by #1695

Comments

@phiweger
Copy link

I.e. any switch that tracks which k-mer sequence is present in the signature?

Thank you.

@luizirber
Copy link
Member

No (yet). See #211

There is a recent PR with a similar structure (hashes to ID in reads) in #477, but it's not going to be merged. #482 implements the changes needed in the C++/Cython layers.

But since it's something people are asking more and more, probably it's a good idea to implement =]

@ctb
Copy link
Contributor

ctb commented May 30, 2018 via email

@phiweger
Copy link
Author

well let's see

  1. it would be nice to align the kmers to the target genome to get a feeling for MinHash "kmer sampling" -- this is probably just instructional
  2. I am looking into pangenomes at the moment, trying to quickly match up genomes on shared hashes, and use more costly aligners around those seeds -- the idea being that you not only want the similarity of two genomes (or 1000) but what actually it is (proteins, ncRNA ...) that is similar. However, you'd need to know where the minhashed kmer is on the genome to do this.

@ctb
Copy link
Contributor

ctb commented May 31, 2018

right! (1) we have a script for, somewhere... @taylorreiter do you remember where this is? (The project really needs to figure out a little utility script solution... see #201)

For both (1) and (2), I think it's important to realize that the power of MinHash comes from the fact that these k-mers don't mean anything. So looking at them directly can be useful for developing intuition but is not so useful for anything else. In particular, they are definitely not seeds and shouldn't be used as such.

For example, there are two reasons you might get a k-mer match: one is that there are lots of exact k-mer matches due to a long stretch of high similarity! the other is that there is a lot of low similarity and statistically you're going to find a few k-mers with high similarity. In either case, there is no guarantee of locality (or non-locality) in the k-mers that are chosen by MinHash, so they could be right next to each other and ignoring patches of k-mers elsewhere, OR they could be "seeding" patches of high similarity that are spread throughout.

There are other, better ways to look systematically at all shared k-mers (this is enabled by khmer and other projects) and quickly do alignments (mashmap and nucmer before that). That software is very fast once you've identified which samples need to be aligned, and it's what we've used.

tl;dr? use sourmash to figure out which samples to compare using a more detailed approach, but please don't pay any attention at all to which specific k-mers are chosen for comparison purposes - they are intentionally meaningless.

@olgabot
Copy link
Collaborator

olgabot commented Jul 16, 2018

What about for gene expression analyses? I'm hashing RNA expression signatures and it would be useful to map those kmers back to genes, even probabilistically like kalliso/salmon's "transcript equivalency" count values.

@ctb
Copy link
Contributor

ctb commented Jul 17, 2018 via email

@taylorreiter
Copy link
Contributor

As @luizirber mentioned, he had implemented a way to save which k-mer sequence is underlying a hash in #477. We tried this on a metagenome and it "worked", but it wasn't incredibly helpful. Because the signatures are scaled down to 1/2000 (or 1/10000 etc), knowing these sequences wasn't terribly helpful because it gave such a limited picture.

I wasn't able to find the script mentioned by @ctb here.

@ctb
Copy link
Contributor

ctb commented Jul 18, 2018

Not entirely on topic, but: here is a script that outputs reads or sequences containing any hashes from a signature.

output-sequences-with-scaled-hashes.txt

@olgabot
Copy link
Collaborator

olgabot commented Jul 31, 2019

Related: #678

Here's a potential use case:

In single-cell ATAC seq data shows open chromatin regions, and often people are interested in "what TF motif(s) are available in this cell type?" The processing is rather long and laborious, involving alignment and then calling peaks and then using the peak calls to find similarities. I think the alignment could be skipped and the sequences could be used directly to find cell-cell similarities and cluster them.

Given the output k-mers, one could find most abundant k-mers and match them to e.g. JASPAR or other databases to find related TFs.

... something to think about!

@ctb
Copy link
Contributor

ctb commented Aug 2, 2019

hi olga, re the use case - do you see subsampling as being problematic here?

my current approach with @taylorreiter is this: we have started using spacegraphcats [ref] to index large data sets and retrieve the neighborhoods of specific hashes. For an identified hash-of-interest, we can get a guarantee on retrieval (has to do with recovering everything within the cDBG within some radius).

but, this is not what you're proposing above, because we are ok with the hashes of interest in our case being a subsample of all possible hashes of interest. in your case it seems like you'd really want to guarantee that you had all interesting k-mers.

you might be interested in looking at kevlar [ref] for ideas.

in sum, there's lots of uses for k-mers and De Bruijn graphs and cDBGs, but sourmash is more about subsampling than k-mers and graphs. Maybe we need to revisit these functions in libraries like khmer instead?

@olgabot
Copy link
Collaborator

olgabot commented Aug 21, 2019

hi olga, re the use case - do you see subsampling as being problematic here?

True, the subsampling may be problematic as to find the TF binding sites, we'd want all the relevant reads. One option is to use the hashes to find cell-cell similarities, build a KNN graph and do leiden clustering, then extract hashes present in only certain clusters, and extract the reads containing these hashes.

Re finding reads, I think I'm doing something wrong because when I use hash_murmur directly on extracted k-mers to hash, they don't match what minhash.add_sequence outputs! I'm trying to understand why ignore_abundance=True and ignore_abundance=False produce vastly different k-nearest neighbor (KNN) graphs, and whether there's difference in coverage of the same k-mers from sample to sample.

ignore_abundance=True

image

This is very close to what we expect for these celltypes, which are very different.

ignore_abundance=False

However, if we include abundance, then we get mixing of the celltypes!

image

the code

Here is the notebook, and here is the most relevant excerpt, which is based on output-sequences-with-scaled-hashes.txt above.

%%time
query_hashes = set(hashes_of_interest.index)
minhash = sig1.minhash.copy_and_clear()


# now, iterate over the input sequences and output those that add
# hashes!
n = 0
m = 0

NOTIFY_EVERY_BP=1e6

all_kmers_to_hashes_of_interest = {}

watermark = NOTIFY_EVERY_BP

with open("bladder_test_cells_different_hashes.fasta", 'w') as f:

    for fastq in fastqs:
        sample_id = os.path.basename(fastq).split("_R")[0]
        for record in screed.open(fastq):
            n += len(record.sequence)
            while n >= watermark:
                sys.stderr.write('... {} {}\r'.format(watermark, fastq))
                watermark += NOTIFY_EVERY_BP

            minhash.add_sequence(record.sequence, force=True)
            hashes = set(minhash.get_hashes())
            match = hashes.intersection(query_hashes)
            if match:
                n_kmers = len(record.sequence) - ksize + 1
                kmers = [record.sequence[i:(i+ksize)] for i in range(n_kmers)]
                hashed_kmers = [hash_murmur(kmer) for kmer in kmers]
                kmer_to_hash = dict(zip(kmers, hashed_kmers))

                kmer_to_hashes_of_interest = {kmer: h for kmer, h in kmer_to_hash.items() if h in query_hashes}
                if len(kmer_to_hashes_of_interest) != len(match):
                    print("Something is weird! Not all the hashes were found...")
                    print(match)
                    break
                all_kmers_to_hashes_of_interest.update(kmer_to_hashes_of_interest)
            if len(all_kmers_to_hashes_of_interest) == len(query_hashes):
                print("Found all hashes! Exiting.")
                break

Is there anything obviously wrong going on here? The kmer size and molecule type are the same, and the default seed of 42 was used.

@ctb
Copy link
Contributor

ctb commented Aug 24, 2019

hi @olgabot I can look into this but could you give some guidance as to a quick test case that is problematic? The notebook looks like it might take a long time to run. (Please do just tell me if it's not and I'll go ahead and do it :)

@olgabot
Copy link
Collaborator

olgabot commented Aug 26, 2019 via email

@ctb
Copy link
Contributor

ctb commented Sep 1, 2019

See #724 for scripts that convert between k-mers and hashes. Still WIP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants