Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add utility scripts for converting hash values to k-mers. #724

Closed
wants to merge 17 commits into from

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Sep 1, 2019

Adds two utility scripts in utils/, hashvals-to-signatures.py and signature-to-kmers.py, for the purpose of converting hashes into k-mers.

hashvals-to-signatures takes a collection of hash values and turns them into a sourmash signature.

signature-to-kmers takes a signature and one or more sequence files, and outputs the k-mers that correspond to the hash values and the sequences that contain those k-mers.

Random musings -

  • hashvals-to-signatures functionality could/should be added to sourmash signature import/export instead
  • the get_kmers_for_hashvals function in signature-to-kmers is not very general. I think we should refactor some of the code in _minhash.pyx to expose a k-mer extraction function on MinHash objects, which would support a much more general function.

ref #483

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Sep 1, 2019

Codecov Report

Merging #724 (fedeab0) into latest (13cd95f) will increase coverage by 5.25%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest     #724      +/-   ##
==========================================
+ Coverage   89.17%   94.42%   +5.25%     
==========================================
  Files         123       96      -27     
  Lines       18615    15001    -3614     
  Branches     1433     1433              
==========================================
- Hits        16599    14164    -2435     
+ Misses       1780      601    -1179     
  Partials      236      236              
Flag Coverage Δ
python 94.42% <ø> (ø)
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/errors.rs
src/core/src/index/search.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/from.rs
src/core/src/index/linear.rs
src/core/src/ffi/utils.rs
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/sketch/nodegraph.rs
src/core/src/encodings.rs
src/core/tests/test.rs
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 13cd95f...fedeab0. Read the comment docs.

@olgabot
Copy link
Collaborator

olgabot commented Sep 2, 2019

Yayy! So excited

@olgabot
Copy link
Collaborator

olgabot commented Sep 9, 2019

So I tested this out to figure out where the signal for two mouse bladder cells' similarity differences for with and without abundance is coming from. I'm not sure I'm doing it quite right, because these 14 hashes (ksize=21, moltype=DNA, num_hashes=500) were the only ones that overlapped between these two cells when reading in their .sig files. The number is the abundance of the k-mer:

hashval A1-B000610-3_56_F-1-1 A1-B002764-3_38_F-1-1
171794956858086 2 1
394031160505449 1 1
513049510280006 6 1
635563321059591 55 1
734478954482604 1 1
790330283259107 1 1
860437213521678 1 10
936637619733676 3 7
937624453106451 301 4976
1012839673718406 5 245
1026077759632952 39 315
1067106972354716 1 1
1215285019941982 2 2
1225906413461687 1 1

But then using these scripts, only 11 hashes from the files were extracted, though I'm pretty sure that I used the same fastq files as the original. The ksize (21) and molecule (DNA) are also identical.

hashvals-to-signature.py command and output

%%bash
/home/olga/code/sourmash/utils/hashvals-to-signature.py \
    --ksize 21 \
    --output bladder_combined_hashes.sig \
    bladder_combined_hashes.txt

stderr:

loaded 14 distinct hashes from bladder_combined_hashes.txt
setting --num automatically from the number of hashes.
wrote signature to bladder_combined_hashes.sig

hashvals-to-signature.py command & output

%%bash

CELL1_R1=/home/olga/pureScratch/czb-maca/Plate_Seq/3_month/170914_A00111_0058_AH3FYKDMXX__170914_A00111_0057_BH3FY7DMXX/fastqs/A1-B000610-3_56_F-1-1_R1_001.fastq.gz
CELL1_R2=/home/olga/pureScratch/czb-maca/Plate_Seq/3_month/170914_A00111_0058_AH3FYKDMXX__170914_A00111_0057_BH3FY7DMXX/fastqs/A1-B000610-3_56_F-1-1_R2_001.fastq.gz
CELL2_R1=/home/olga/pureScratch/czb-maca/Plate_Seq/3_month/170914_A00111_0058_AH3FYKDMXX__170914_A00111_0057_BH3FY7DMXX/fastqs/A1-B002764-3_38_F-1-1_R1_001.fastq.gz
CELL2_R2=/home/olga/pureScratch/czb-maca/Plate_Seq/3_month/170914_A00111_0058_AH3FYKDMXX__170914_A00111_0057_BH3FY7DMXX/fastqs/A1-B002764-3_38_F-1-1_R2_001.fastq.gz

/home/olga/code/sourmash/utils/signature-to-kmers.py \
    --output-kmers bladder_combined_hashes_kmers.txt \
    --output-sequences bladder_combined_hashes_sequences.txt \
    bladder_combined_hashes.sig $CELL1_R1 $CELL1_R2 $CELL2_R1 $CELL2_R2

stderr:

read 311332200 bp, wrote 1177900 bp in matching sequencese_Seq/3_month/170914_A00111_0058_AH3FYKDMXX__170914_A00111_0057_BH3FY7DMXX/fastqs/A1-B002764-3_38_F-1-1_R2_001.fastq.gz
read 311332200 bp, found 11 kmers matching hashvals

BLAST hits on the k-mers

If I'm doing this right, then it's a little depressing because it looks to me that a bunch of these hits are from crappy genome assemblies and that these k-mers are from lab environment contaminants, rather than from mouse bladder-specific genes. Suggesting that most of the jaccard similarity is coming from random signal. Or maybe real k-mers that are non-mouse, but actually present in the cells and are unique to certain cell types?

It's especially exciting to see ERCCs here ... 🤦‍♀

hashval cell1 A1-B000610-3_56_F-1-1 cell2 A1-B002764-3_38_F-1-1 kmer alignment Best BLAST match (mouse if possible)
171794956858086 2 1 AAAAATAAAAAAAAAAACACA Query 1 AAAAATAAAAAAAAAAACACA 21 |||||||||||||||||||||Sbjct 7105 AAAAATAAAAAAAAAAACACA 7125 PREDICTED: Mus musculus eukaryotic translation initiation factor 4 gamma, 3 (Eif4g3), transcript variant X48, mRNA
394031160505449 1 1 CATGATTAAGAGTGACTGCCG Query 1 CATGATTAAGAGTGACT 17 |||||||||||||||||Sbjct 37777 CATGATTAAGAGTGACT 37793 Mus musculus BAC clone RP24-160P1 from chromosome 18, complete sequence
513049510280006 6 1 AAGAGGGTCGGCCGGGGGCAT Query 1 AAGAGGGTCGGCCGGGGG 18 ||||||||||||||||||Sbjct 39855788 AAGAGGGTCGGCCGGGGG 39855771 Ovis canadensis canadensis isolate 43U chromosome 14 sequence
635563321059591 55 1 AGTGGACATGAAAGATGTGGA Query 1 AGTGGACATGAAAGATGTGGA 21 |||||||||||||||||||||Sbjct 2673 AGTGGACATGAAAGATGTGGA 2693 PREDICTED: Mus musculus solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2 (Slc3a2), transcript variant X1, mRNA
734478954482604 1 1      
790330283259107 1 1      
860437213521678 1 10 AAAGTCGTCGGAAGATCCTGA Query 4 GTCGTCGGAAGATCCTGA 21 ||||||||||||||||||Sbjct 1875693 GTCGTCGGAAGATCCTGA 1875676 Bacillus sp. KBS0812 chromosome, complete genome
936637619733676 3 7 ACTGTGGTAATTCTAGAGCTG Query 1 ACTGTGGTAATTCTAGAGCTG 21 |||||||||||||||||||||Sbjct 145 ACTGTGGTAATTCTAGAGCTG 165 PREDICTED: Ctenocephalides felis small subunit ribosomal RNA (LOC113371995), rRNA
937624453106451 301 4976 ATCGAGGTCATGACTGGACAC Query 1 ATCGAGGTCATGACTGGACAC 21 |||||||||||||||||||||Sbjct 461 ATCGAGGTCATGACTGGACAC 441 Synthetic construct external RNA control ERCC-00096 sequence
1012839673718406 5 245 GGAGTGGGTAAATTGCGCGCC Query 1 GGAGTGGGTAAATTGCGCGCC 21 |||||||||||||||||||||Sbjct 405 GGAGTGGGTAAATTGCGCGCC 385 Castiarina simulata voucher BUP0006 18S ribosomal RNA gene, partial sequence
1026077759632952 39 315 GCTCAGTTATTGACTACGAGC Query 1 GCTCAGTTATTGACTACGAGC 21 |||||||||||||||||||||Sbjct 983 GCTCAGTTATTGACTACGAGC 1003 PREDICTED: Mus musculus annexin A2 (Anxa2), transcript variant X1, mRNA
1067106972354716 1 1 GAGGATGGTGAAGTAAAGACC Query 1 GAGGATGGTGAAGTAAAG 18 ||||||||||||||||||Sbjct 1065 GAGGATGGTGAAGTAAAG 1048 Mus musculus mitochondrion DNA, including ATP6, COX3, tRNA-Gly, ND3, tRNA-Arg, ND4L, ND4 genes, complete sequence, sequence_id: (8070..10988)_29
1215285019941982 2 2 GTGTAATATAAAAAAAAACCA Query 2 TGTAATATAAAAAAAAACCA 21 ||||||||||||||||||||Sbjct 29340858 TGTAATATAAAAAAAAACCA 29340877 Sphaeramia orbicularis genome assembly, chromosome: 20
1225906413461687 1 1      

Do you have a sense of what could be happening here?

@olgabot
Copy link
Collaborator

olgabot commented Sep 9, 2019

Here's an excerpt from one of the signatures just to make sure I"m providing all possible information:

import json

from IPython.lib.pretty import pretty

with open(f"{folder}/A1-B000610-3_56_F-1-1_S28.sig") as f:
    signature_data = json.load(f)

print(pretty(signature_data, max_seq_length=10))

Here is the summarized json:


[{'class': 'sourmash_signature',
  'email': '',
  'filename': '/arg/2/0',
  'hash_function': '0.murmur64',
  'license': 'CC0',
  'name': 'A1-B000610-3_56_F-1-1_S28',
  'signatures': [{'abundances': [5, 1, 14, 32, 8, 2, 1, 1, 46, 1, ...],
    'ksize': 21,
    'max_hash': 0,
    'md5sum': '280b9babf31a3b7f820f9695f948808c',
    'mins': [321797371588609,
     224101973024771,
     113000222478351,
     1333158846758931,
     1102392644880415,
     587674942934704,
     586113718319139,
     278315472003108,
     875509323118629,
     399207939020840,
     ...],
    'molecule': 'DNA',
    'num': 500,
    'seed': 42},
   {'abundances': [7, 11, 1, 2, 1, 1, 2, 8, 5, 1, ...],
    'ksize': 31,
    'max_hash': 0,
    'md5sum': '9d82f6c550e83cdd97a9190e2fa7d1f0',
    'mins': [1098995281889281,
     494274597793794,
     805332503111685,
     944591797540874,
     433029384222735,
     777578192238617,
     815900320305157,
     729433322275847,
     593759110850609,
     823805458110523,
     ...],
    'molecule': 'DNA',
    'num': 500,
    'seed': 42},
   {'abundances': [1, 1, 1, 1, 8, 1, 11, 1, 1, 1, ...],
    'ksize': 51,
    'max_hash': 0,
    'md5sum': 'dd5c95634d6bd0eab9de215fd2bc6401',
    'mins': [947940305352714,
     98079127326732,
     970020576107181,
     108820536326160,
     652103673985041,
     562565376081938,
     714932558929943,
     703817730994207,
     65688630462498,
     679891240570919,
     ...],
    'molecule': 'DNA',
    'num': 500,
    'seed': 42}],
  'version': 0.4}]

@ctb
Copy link
Contributor Author

ctb commented Sep 10, 2019

OK, I updated the scripts to be less clever :). I also added a script utils/hashvals-to-kmers.py.

The key function to look at (and you can use this directly in a notebook if you like...) is get_kmers_for_hashvals(sequence, hashvals, ksize), which is hopefully self-explanatory in its use (but also has a docstring!!!!) This is a generator function that yields pairs of kmer, hashval for a given DNA sequence. Please let me know if they work better for you!


In terms of what went wrong before, I was not accounting for using 'num' minhashes, I think, and was assuming you were using scaled minhashes. Mea culpa. The latest code is less clever and should be more robust.

@phiweger
Copy link

@ctb so I'm testing this (yay!)

I would expect that if I calc the sig of some genome, and then use the hash2kmer script, to obtain the same number of kmers as the number of hash vals used in the sig.

sourmash compute -k 31 -n 100  GCF_000281435.2_ASM28143v2_genomic.fna.gz

python signature-to-kmers.py --output-kmers kmers.txt GCF_000281435.2_ASM28143v2_genomic.fna.gz.sig  GCF_000281435.2_ASM28143v2_genomic.fna.gz

# read 5767822 bp, found 56 kmers matching hashvals

Am I doing something wrong? I noticed that the number of returned kmers is roughly half the number of hash vals, which makes me suspect that there is a problem with canonical kmers -- do they "get lost" during backtranslation from hash to kmer?

@olgabot
Copy link
Collaborator

olgabot commented Sep 17, 2019

Hmm the updates from @ctb seem to still be too clever :) I'm still getting 11/14 hashes back, and it's the same ones as are missing from this comment, so at least it's consistent! Is there another filtering step happening?

@ctb
Copy link
Contributor Author

ctb commented Sep 17, 2019

Aaaaaand this is why I should always write tests, ladies and gentleman...

It was an issue of not being minimally clever, @olgabot - I wasn't choosing the correct canonical k-mer representation for reverse complements, so approximately 50% of the hashes were not being found.

Tested all three scripts like so:

sourmash compute -n 1000 data/GCF_000005845.2_ASM584v2_genomic.fna.gz -o test.sig  -k 31

utils/signature-to-kmers.py test.sig data/GCF_000005845.2_ASM584v2_genomic.fna.gz --output-kmers=test.csv

utils/hashvals-to-kmers.py test.hashes.txt data/GCF_000005845.2_ASM584v2_genomic.fna.gz --output-kmers=test2.csv

utils/hashvals-to-signature.py test.hashes.txt -k 31 -o test3.sig
utils/signature-to-kmers.py test3.sig data/GCF_000005845.2_ASM584v2_genomic.fna.gz --output-kmers=test3.csv

and got identical output yielding all 1,000 k-mers.

@olgabot
Copy link
Collaborator

olgabot commented Sep 18, 2019

Awesome, thank you! This is yielding the same number of k-mers as before now. Thank you!

@ctb
Copy link
Contributor Author

ctb commented Oct 22, 2019

@olgabot it sounds like this works for you. @luizirber any object to me merging this and creating issues for:

  • adding hashvals-to-signatures functionality to sourmash signature import/export instead
  • generalizing the get_kmers_for_hashvals function in signature-to-kmers by refactoring some of the code in _minhash.pyx to expose a k-mer extraction function on MinHash objects

?

@taylorreiter
Copy link
Contributor

I just tried utils/hashvals-to-signature.py. It ran almost instantaneously on a file with 10 hashes, but it never finished after running for 24 hours on a file with 48 million hashes. I'm trying it now with 9 million instead and hoping for a slightly faster run time.

@taylorreiter
Copy link
Contributor

For 9 million hashes, the run time was 7 hours.

@ctb
Copy link
Contributor Author

ctb commented Nov 21, 2019

just to confirm: you ran hashvals-to-signature, not signature-to-kmers?

I have to say at that scale I'm ok with a custom solution :). I was not thinking of signatures with millions of hashes when we designed... well, any of sourmash!

@ctb
Copy link
Contributor Author

ctb commented Nov 21, 2019

...but the good news is we can cannibalize code from those scripts now that we know they works properly...

@taylorreiter
Copy link
Contributor

yes! I ran hashvals-to-signature.py! I wanted to filter all hashes that had abundance 1 across 1000 data sets -- filtering all abundance 1 hashes would have removed some real hashes, but if they were abundance 1 across the entire dataset, they were more likely to be errors.

Originally, I generated a set of all abundance 1 hashes across the data set -- this was 48 million hashes (scaled 2000, k 31). I was going to use sourmash siganture subtract with this, but it took too long to convert to a signature. https://github.com/taylorreiter/ibd/blob/master/sandbox/get_low_count_hashes.ipynb

Then, I decided to make a set of all hashes greater than abundance 1. This was 9 million hashes, and I used sourmash signature intersect to retain only these hashes in each sample-specific signature. https://github.com/taylorreiter/ibd/blob/master/sandbox/get_greater_than_1_count_hashes.ipynb

Cannibalize away :)

@luizirber
Copy link
Member

@luizirber any object to me merging this and creating issues for:

* adding `hashvals-to-signatures` functionality to `sourmash signature import/export` instead

* generalizing the `get_kmers_for_hashvals` function in `signature-to-kmers` by refactoring some of the code in `_minhash.pyx` to expose a k-mer extraction function on MinHash objects

In fine with this plan. Is this ready for review and merge (despite the [WIP] in the title), @ctb?

@ctb
Copy link
Contributor Author

ctb commented Nov 22, 2019 via email

@olgabot
Copy link
Collaborator

olgabot commented Nov 22, 2019 via email

@luizirber
Copy link
Member

there aren't any tests, tho. and it's not integrated into sourmash. so... concerned about that.

Fair point. Since it doesn't touch anything that is actually changing with #424, I won't worry about merging this soon.

@olgabot olgabot mentioned this pull request Mar 13, 2020
5 tasks
olgabot added a commit to czbiohub-sf/nf-predictorthologs that referenced this pull request Apr 17, 2020
@ctb ctb changed the base branch from master to latest March 6, 2021 18:44
@ctb
Copy link
Contributor Author

ctb commented Mar 18, 2021

Would be massively sped up by #1214

@ctb
Copy link
Contributor Author

ctb commented Jul 29, 2021

see #1695

@ctb
Copy link
Contributor Author

ctb commented Aug 2, 2021

closing in favor of #1695.

@ctb ctb closed this Aug 2, 2021
@ctb ctb deleted the add/hashes_to_kmer branch August 2, 2021 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants