[MRG] Explore addition of a `sketch` command. #1159

ctb · 2020-08-09T16:44:39Z

A new `sourmash sketch` command

In this experimental PR, I implement a new command-line submodule, sketch, which has three main subcommands:

sourmash sketch dna|rna <sketch params> [ files ]

sourmash sketch aa|prot|protein <sketch params> [ files]

sourmash sketch translate <sketch params> [ files ]

Note that:

sketch dna is a replacement for compute.
sketch protein is a replacement for compute --input-is-protein --no-dna.
sketch translate replaces compute --protein --no-dna.
sketch protein --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna --input-is-protein.
sketch translate --dayhoff|--hp replaces compute --dayhoff|--hp --no-dna.
k-mer sizes for sketch protein and sketch translate are now 1/3rd of what they are for compute!!!
all of these have different, and appropriate, default parameters!!
- dna: k=31,scaled=1000,noabund
- protein: k=21,scaled=200,noabund
- dayhoff: k=19,scaled=200,noabund
- hp: k=30,scaled=200,noabund
you can pass multiple param strings with very different parameters and they should all work!

Sketch parameters

The <sketch params> arguments cover the common sketch config params: ksize, num/scaled, and track_abund. We use good defaults per #219, and use explicit per-MinHash notation, e.g. -p k=31,scaled=1000,abund to construct each MinHash (with support for multiple -p).

Examples of param strings that work :):

-p k=31,scaled=1000,abund
-p k=31,noabund
-p k=51 - default to a scaled and abund (based on moltype/command)
- for dna, scaled=1000, noabund
- for protein, scaled=200, noabund
-p k=31,k=51,k=21 - compute multiple ksizes, defaults otherwise
-p k=20,num=500,protein -p k=19,num=400,dayhoff,abund
-p k=30,scaled=200,hp -p k=30,scaled=200,seed=58 - computes multiple ksizes, moltypes, scaled/num, and even uses a different seed.

Notes and brainstorming

we could provide additional subcommands, like sourmash sketch reads, sourmash sketch genome, and sourmash sketch ncbi, that choose good defaults for those kinds of inputs
- e.g. for reads, scaled=1000,abund
could add sourmash sketch 10x too!
similarly, we could alias sourmash sketch dayhoff and sourmash sketch hp.
could add sourmash sketch 16s (see e.g. classification with full length 16S gene #1000)
sketch dna -p k=21,protein and sketch protein -p k=21,dna are errors.

Additional thoughts:

@luizirber proposed a passthrough option, where you could chain the inputs together:
- sourmash sketch dna --passthrough <input> | sourmash sketch protein <input> - this would yield two signatures. Not sure how to do the output tho - where does the signature file go? do we need to specify multiple different -o options?!
translate should take options to specify top-strand/rc only, for RNA.
can we make the sketch module suitable for direct use by Python, per Idea: imports match CLI? #1112?
- e.g. sourmash.sketch.dna(filenames=...)
the sketch module should auto-recognize signature files and not try to load them as FASTA... sourmash compute doesn't recognize signature files #814
might be good to add automated traverse functionality to discover all FASTA files, if possible, at least for specific commands...

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-08-09T16:49:49Z

Codecov Report

Merging #1159 into latest will increase coverage by 25.15%.
The diff coverage is 94.52%.

@@             Coverage Diff             @@
##           latest    #1159       +/-   ##
===========================================
+ Coverage   67.86%   93.01%   +25.15%     
===========================================
  Files          24       76       +52     
  Lines        3314     5986     +2672     
===========================================
+ Hits         2249     5568     +3319     
+ Misses       1065      418      -647

Flag	Coverage Δ
#rusttests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
sourmash/command_sketch.py	`90.12% <90.12%> (ø)`
sourmash/cli/__init__.py	`98.92% <100.00%> (ø)`
sourmash/cli/sketch/__init__.py	`100.00% <100.00%> (ø)`
sourmash/cli/sketch/dna.py	`100.00% <100.00%> (ø)`
sourmash/cli/sketch/protein.py	`100.00% <100.00%> (ø)`
sourmash/cli/sketch/translate.py	`100.00% <100.00%> (ø)`
sourmash/command_compute.py	`97.70% <100.00%> (ø)`
src/core/src/from.rs
src/core/src/index/storage.rs
src/core/src/index/mod.rs
... and 97 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90db4c4...ca14aa9. Read the comment docs.

ctb · 2020-08-10T19:05:29Z

OK. I think this addresses most of the UX problems with sourmash compute that were mentioned in #999. Would love feedback to see what's missing!

sourmash/command_sketch.py

olgabot · 2020-08-11T13:43:50Z

This is really exciting!!

olgabot · 2020-08-11T13:59:28Z

Notes and brainstorming

we could provide additional subcommands, like sourmash sketch reads, sourmash sketch genome, and sourmash sketch ncbi, that choose good defaults for those kinds of inputs

e.g. for reads, scaled=1000,abund

could add sourmash sketch 10x too!

Interesting! We've found that for "realistic" 10x data that we need to do several steps before getting to the sketch part. Check out the kmermaid pipeline for implementation details, and @pranathivemuri feel free to correct me where I'm wrong.

Extract aligned reads from the 10x bam file into a fastq, with per-cell barcode and per-molecular barcode (aka "UMI") tags --> one mega-fastq per bam file (via samtools fastq)
Extract unaligned reads from the 10x bam file into a fastq, with per-cell barcode and per-molecular barcode (aka "UMI") tags --> one mega-fastq per bam file (via samtools fastq)
Count number of reads/UMIs per cell barcode (via bam2fasta count_umis_percell) --> text file of number molecular barcodes per cell barcode
Extract per-cell fastqs from the mega-fastq of aligned reads, using only the cell barcodes which have UMIs greater than a threshold (e.g. minimum 1000 UMIs) --> per-cell fastqs (via bam2fasta make_fastqs_percell) Must be parallelized or will take ~3-30 days/bam file!
Extract per-cell fastqs from the mega-fastq of unaligned reads, using only the cell barcodes which have UMIs greater than a threshold (e.g. minimum 1000 UMIs) --> per-cell fastqs (via bam2fasta make_fastqs_percell) Must be parallelized or will take ~3-30 days/bam file!
Trim adapters and poly-X from reads --> fastqs (via fastp)
Remove ribosomal RNA sequences --> fastqs (via sortMeRNA)
Translate both aligned/unaligned per-cell fastqs to protein --> per-cell protein fastas (via sencha translate)
NOW we can do sourmash compute --input-is-protein :)

Skipping step (3) results in ~700,000 cell barcode files per bam, because that's the total number of theoretical cell barcodes per 10x run (even more with the v3 chemistry I think) so it is necessary. The "real" number of cell barcodes per run is closer to 2,000-5,000. If you allow all ~700,000 through, then most of your "cells" have ~1-10 reads which usually get removed from downstream analyses anyway. Min 1000 UMIs is a pretty reasonable threshold. I've seen pipelines that filter for min 1000 UMIs/cell and min 100 genes/cell.

similarly, we could alias sourmash sketch dayhoff and sourmash sketch hp.

This would be great!

could add sourmash sketch 16s (see e.g. classification with full length 16S gene #1000)

Additional thinking/tests needed:

what should/does sourmash sketch dna -p k=21,protein do?

Maybe the same as now, where it does the 6-frame translation?

what should/does sourmash sketch protein -p k=21,dna do?

I think this should either:

Throw an error
Compute a protein sketch with protein ksize k=21, then back-convert the protein sequences to all possible DNA sequences that could have generated it, and create DNA sketches with DNA ksize = 3*21 = 63

Additional thoughts:

@luizirber proposed a passthrough option, where you could chain the inputs together:

sourmash sketch dna --passthrough <input> | sourmash sketch protein <input> - this would yield two signatures. Not sure how to do the output tho - where does the signature file go? do we need to specify multiple different -o options?!

Interesting! Maybe --output-signature and --output-reads ?

translate should take options to specify top-strand/rc only, for RNA.

Does this assume that the RNA library prep is strand-specific? If it's not strand-specific, would both sides need to be used?

can we make the sketch module suitable for direct use by Python, per Idea: imports match CLI? #1112?

e.g. sourmash.sketch.dna(filenames=...)

This would be awesome!

the sketch module should auto-recognize signature files and not try to load them as FASTA... sourmash compute doesn't recognize signature files #814

👍

might be good to add automated traverse functionality to discover all FASTA files, if possible, at least for specific commands...

👍

EDIT: Forgot trimming & ribosomal RNA removal steps

ctb · 2020-08-13T14:24:47Z

OK, after talking to @bluegenes she's 👍 on this functionality and already started using the terminology in slides 😂

So, I'm planning to put this up for merge; remaining items:

add multiple ksizes in a single param string per [MRG] Explore addition of a sketch command. #1159 (comment)
create issue outlining potential next steps/additions for sourmash sketch
add issue for SourmashSignature.minhashes stuff [MRG] Explore addition of a sketch command. #1159 (comment)
add CLI tests for sourmash sketch with various bad params strings
resolve the question of what sourmash sketch dna -p k=21,protein does :)
write some documentation and recipes (or punt that to a new issue)

ctb · 2020-08-13T18:20:21Z

ok, so currently:

sourmash sketch dna -p k=63,protein

is identical to

sourmash sketch translate -p k=21

which makes sense!

I might have it throw an error anyway, but it's nice the code "just works"...

conveniently,

sourmash sketch protein -p k=21,dna tests/test-data/ecoli.faa -o foo.sig

throws an error already :). I'll have to make this a nicer error message, is all!

== This is sourmash version 3.5.0a0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

computing signatures for files: tests/test-data/ecoli.faa
Computing a total of 1 signature(s).
... reading sequences from tests/test-data/ecoli.faa
Traceback (most recent call last):
  File "/Users/t/miniconda3/envs/py37/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/Users/t/dev/sourmash/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/Users/t/dev/sourmash/sourmash/cli/sketch/protein.py", line 69, in main
    return sourmash.command_sketch.protein(args)
  File "/Users/t/dev/sourmash/sourmash/command_sketch.py", line 220, in protein
    _execute_sketch(args, signatures_factory)
  File "/Users/t/dev/sourmash/sourmash/command_sketch.py", line 175, in _execute_sketch
    _compute_individual(args, signatures_factory)
  File "/Users/t/dev/sourmash/sourmash/command_compute.py", line 228, in _compute_individual
    args.input_is_protein, args.check_sequence)
  File "/Users/t/dev/sourmash/sourmash/command_compute.py", line 281, in add_seq
    sig.add_protein(seq)
  File "/Users/t/dev/sourmash/sourmash/signature.py", line 155, in add_protein
    self._methodcall(lib.signature_add_protein, to_bytes(sequence))
  File "/Users/t/dev/sourmash/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/Users/t/dev/sourmash/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidHashFunction { function: "dna" }' at src/core/src/signature.rs:367

luizirber · 2020-08-13T19:29:15Z

throws an error already :). I'll have to make this a nicer error message, is all!
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called Result::unwrap()on anErr value: InvalidHashFunction { function: "dna" }' at src/core/src/signature.rs:367

Yikes, a Panic... Oh, I see, this is the code where it is being thrown:

self.signatures 
    .iter_mut()  
    .for_each(|sketch| { 
        sketch.add_protein(&seq).unwrap(); }  
    );

I didn't know about try_for_each before, which allows errors, so the fix is

self.signatures 
    .iter_mut()  
    .try_for_each(|sketch| { 
        sketch.add_protein(&seq) }  
    )?;

This is going to raise a ValueError, which you can then capture and give a better error message

ctb · 2020-08-14T15:51:26Z

Ready for review and maybe merge. See #1169 for "next steps" and additions, #1168 for "write docs before 4.0", and #1166 and #1167 for rust fixes/improvements related to issues discovered herein.

ctb · 2020-08-14T21:46:35Z

🎉

ctb added 6 commits August 9, 2020 06:22

add sourmash sketch command

fedf987

it's alive! sourmash sketch dna works!

6f1d422

refactor make_minhashes code to allow more flexibility

d639919

refactor internal compute code to better reflect reality

130ca5b

update sketch dna as well

baf5df2

pluralize signature factory and handling to get ready for sketch

5978f85

ctb added 9 commits August 9, 2020 16:33

initial implementation with parameter strings

2d67fab

add a variety of tests

6cdc770

add sketch protein and sketch translate

db7a2cd

add --hp and --dayhoff to translate and protein

5354f64

add aa|prot and rna aliases for sketch command

17a0ce2

hacked and slashed ksize*3 :)

ad26c6e

refactor and simplify sketch code

d4cbe02

clean up command_sketch a bit

c93e8ab

provide better defaults for protein, translate

1ee28dc

ctb mentioned this pull request Aug 10, 2020

new behavior for protein k-mer size calculations - gathering the issues together. #999

Closed

This comment has been minimized.

Sign in to view

ctb added 6 commits August 10, 2020 08:52

allow param strings to set moltype and use moltype defaults

28c5fbc

add some tests for default parameters, etc

e95215f

refactoring of names

08ce05c

check bad param string

140094f

fix scaled/num checking and (some) associated tests

3171e3e

do tests for multiple output sigs

3d1e271

bluegenes reviewed Aug 10, 2020

View reviewed changes

sourmash/command_sketch.py Outdated Show resolved Hide resolved

Merge branch 'latest' into explore_sketch

2fd5f51

This comment has been minimized.

Sign in to view

ctb added 4 commits August 13, 2020 14:12

add invalid parameter checking on dna/override protein, etc

6edb3c5

remove 10x tests, rename compute -> sketch for remaining

c1dc1ee

add tests for incompatible sketch commands

6615759

multiple ksizes within one param string

0580c3f

This was referenced Aug 14, 2020

fix Rust panic error in signature creation #1166

Closed

fix rust Signature API for minhashes method. #1167

Closed

write documentation for sourmash sketch #1168

Closed

potential next steps for the sourmash sketch command #1169

Open

ctb changed the title ~~[EXP] Explore addition of a sketch command.~~ [MRG] Explore addition of a sketch command. Aug 14, 2020

ctb added 2 commits August 14, 2020 08:56

cleanup of comments

02b2748

Merge branch 'latest' into explore_sketch

a1bf702

ctb requested review from luizirber and bluegenes August 14, 2020 19:01

luizirber approved these changes Aug 14, 2020

View reviewed changes

Merge branch 'latest' of github.com:dib-lab/sourmash into explore_sketch

ca14aa9

ctb merged commit d921fc3 into latest Aug 14, 2020

ctb deleted the explore_sketch branch August 14, 2020 21:37

ctb mentioned this pull request Aug 17, 2020

What defaults do we want? #219

Closed

This was referenced Nov 1, 2020

[MRG] remove 10x support #1229

Merged

Document migration steps from 3.x to 4.x #1069

Closed

ctb mentioned this pull request Jan 10, 2021

rename 'compute' to 'sketch' or compare to similarity? #663

Closed

ctb mentioned this pull request Feb 7, 2021

Draft release notes for v4.0 #1309

Closed

ctb mentioned this pull request May 3, 2021

[MRG] add flexible & iterative support for outputting signatures in variety of formats #1493

Merged

6 tasks

ctb mentioned this pull request May 15, 2021

summary: further improvements to protein handling in sourmash #1525

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Explore addition of a `sketch` command. #1159

[MRG] Explore addition of a `sketch` command. #1159

ctb commented Aug 9, 2020 •

edited

Loading

codecov bot commented Aug 9, 2020 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

ctb commented Aug 10, 2020

olgabot commented Aug 11, 2020

olgabot commented Aug 11, 2020 •

edited

Loading

Notes and brainstorming

ctb commented Aug 13, 2020 •

edited

Loading

This comment has been minimized.

ctb commented Aug 13, 2020 •

edited by luizirber

Loading

luizirber commented Aug 13, 2020 •

edited

Loading

ctb commented Aug 14, 2020

ctb commented Aug 14, 2020

[MRG] Explore addition of a sketch command. #1159

[MRG] Explore addition of a sketch command. #1159

Conversation

ctb commented Aug 9, 2020 • edited Loading

A new sourmash sketch command

Sketch parameters

Notes and brainstorming

codecov bot commented Aug 9, 2020 • edited Loading

Codecov Report

This comment has been minimized.

This comment has been minimized.

ctb commented Aug 10, 2020

olgabot commented Aug 11, 2020

olgabot commented Aug 11, 2020 • edited Loading

Notes and brainstorming

ctb commented Aug 13, 2020 • edited Loading

This comment has been minimized.

ctb commented Aug 13, 2020 • edited by luizirber Loading

luizirber commented Aug 13, 2020 • edited Loading

ctb commented Aug 14, 2020

ctb commented Aug 14, 2020

[MRG] Explore addition of a `sketch` command. #1159

[MRG] Explore addition of a `sketch` command. #1159

ctb commented Aug 9, 2020 •

edited

Loading

A new `sourmash sketch` command

codecov bot commented Aug 9, 2020 •

edited

Loading

olgabot commented Aug 11, 2020 •

edited

Loading

ctb commented Aug 13, 2020 •

edited

Loading

ctb commented Aug 13, 2020 •

edited by luizirber

Loading

luizirber commented Aug 13, 2020 •

edited

Loading