[MRG] Add 'sourmash lca' commands for kraken-style lowest-common-ancestor calculations. #367
Codecov Report
@@            Coverage Diff            @@
##           master     #367     +/-   ##
=========================================
+ Coverage   87.42%   88.03%   +0.6%
=========================================
  Files          15       23      +8
  Lines        2338     3016    +678
  Branches       36       36
=========================================
+ Hits         2044     2655    +611
- Misses        293      360     +67
  Partials        1        1
I was thinking about doing a group code review on this with @bluegenes and other people in the lab, do you think it's ready for a first pass?
sourmash_lib/lca/__main__.py
Outdated
import sys
import argparse

from . import classify, index
aren't fully relative imports discouraged for 3?
I don't know - ref?
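For reference: as far as I know, Python 3 dropped only *implicit* relative imports (PEP 328). Explicit ones such as `from . import classify` are still supported, and PEP 8 lists them as an acceptable alternative to absolute imports. The sketch below builds a throwaway package (`demo_pkg` and everything in it are hypothetical names, not sourmash code) and checks that both `from . import ...` and `from ..logging import ...` resolve under Python 3.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Write a tiny hypothetical package that mirrors the layout in the diff."""
    pkg = Path(root) / "demo_pkg"
    sub = pkg / "lca"
    sub.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "logging.py").write_text(
        "def notify(msg):\n    return 'notify: ' + msg\n"
    )
    (sub / "__init__.py").write_text("")
    (sub / "classify.py").write_text("NAME = 'classify'\n")
    # Explicit relative imports: sibling module and parent-package module.
    (sub / "main.py").write_text(
        "from . import classify\n"
        "from ..logging import notify\n"
        "RESULT = notify(classify.NAME)\n"
    )


def run_demo():
    """Import the demo module and return the value it computed at import time."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_package(root)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pkg.lca.main")
            return module.RESULT
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(root)
            for name in [n for n in sys.modules if n.startswith("demo_pkg")]:
                del sys.modules[name]


print(run_demo())
```

Note that `from ..logging import notify` picks up the package's own `logging` module rather than the standard library one, which matches the imports shown in this diff.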
sourmash_lib/lca/__main__.py
Outdated
import argparse

from . import classify, index
from ..logging import set_quiet, error
But this is annoying.
sourmash_lib/lca/command_index.py
Outdated
import sourmash_lib
from . import lca_utils
from ..logging import notify, error
cough
sourmash_lib/lca/command_index.py
Outdated
import sourmash_lib
from . import lca_utils
from ..logging import notify, error
from .. import sourmash_args
cough cough
sourmash_lib/lca/command_index.py
Outdated
@@ -0,0 +1,186 @@
#! /usr/bin/env python
"""
Build a least-common-ancestor database with given taxonomy and genome sigs.
@brooksph says it is 'lowest' or 'last', @halexand supports 'lowest', @bluegenes says 'last'. In any case, @titus is wrong.
(and probably phylogenetics is also wrong)
sourmash_lib/lca/command_index.py
Outdated
p.add_argument('--scaled', default=10000, type=float)
p.add_argument('-k', '--ksize', default=31, type=int)
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-1', '--start-column', default=2, type=int,
`--1`? seems confusing...
further conversation confirms that this is definitely confusing
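One possible way to address this, sketched under the assumption that the numeric short flag is simply dropped and only the long `--start-column` form is kept (the thread does not say what was finally chosen, and `build_parser` is a hypothetical helper name). The other options are copied from the hunk above.

```python
import argparse


def build_parser():
    """Hypothetical parser for the index command without the '-1' short flag."""
    p = argparse.ArgumentParser()
    p.add_argument('--scaled', default=10000, type=float)
    p.add_argument('-k', '--ksize', default=31, type=int)
    p.add_argument('-d', '--debug', action='store_true')
    # Long option only: avoids a short flag that reads like a negative number.
    p.add_argument('--start-column', default=2, type=int)
    return p


args = build_parser().parse_args(['--start-column', '3'])
print(args.start_column, args.ksize)
```

A side effect worth knowing: once a parser defines an option that looks like a negative number, argparse treats a `-1` token on the command line as that option rather than as a numeric value, which is another reason to avoid it.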
sourmash_lib/lca/command_index.py
Outdated
ksize = int(args.ksize)

# parse spreadsheet!
r = csv.reader(open(args.csv, 'rt'))
please use a context manager:

    with open(args.csv, 'rt') as f:
        r = csv.reader(f)
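A slightly fuller sketch of the suggestion above, assuming Python 3 and a hypothetical helper name `read_taxonomy_rows`. `csv.reader` is lazy, so the rows have to be consumed while the file is still open. Here they are collected into a list before the `with` block exits.

```python
import csv


def read_taxonomy_rows(path):
    """Return all rows of a CSV file; the file is closed on exit, even on error."""
    # newline='' is what the csv module docs recommend for files passed to csv.reader.
    with open(path, 'rt', newline='') as f:
        # Materialize inside the block: iterating the reader after the
        # file is closed would raise ValueError.
        return list(csv.reader(f))
```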
@ctb
It is also breaking on py27 because
thx!
Another one missing:
thx fixed
A few comments (punt to issues?), but that `filter_null` would be better as a function right now =]
First, install sourmash from the LCA branch:
```
pip install -U https://github.com/dib-lab/sourmash/archive/add/lca.zip
Probably bump the version to 2.0.0a3 after merging this, and say to install a version `>=2.0.0a3` instead?
doc/tutorials-lca.md
Outdated
taxonomy between all the k-mers (`sourmash classify`) or it can summarize
the mixture of k-mers present in one or more signatures (`sourmash summarize`).

The `sourmash index` command can be used to prepare custom taxonomy
`sourmash index` or `sourmash lca index`?
# filter function to replace blank/na/null with 'unassigned'
filter_null = lambda x: 'unassigned' if x.strip() in \
    ('[Blank]', 'na', 'null', '') else x
null_names = set(['[Blank]', 'na', 'null'])
make `null_names` into a constant (frozenset?), use it in `filter_null` (defined as a function, not as a lambda).

    NULL_NAMES = {'[Blank]', 'na', 'null', ''}

    def filter_null(x):
        if x.strip() in NULL_NAMES:
            return 'unassigned'
        return x
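The same idea with the `frozenset` variant floated in the comment, as a minimal sketch. The sample lineage values are made up for illustration.

```python
# Immutable module-level constant, so callers cannot accidentally mutate it.
NULL_NAMES = frozenset(['[Blank]', 'na', 'null', ''])


def filter_null(x):
    """Replace blank/na/null taxonomy entries with 'unassigned'."""
    if x.strip() in NULL_NAMES:
        return 'unassigned'
    return x


# Hypothetical lineage row with padded and empty cells.
lineage = [filter_null(name) for name in ['Bacteria', ' na ', '', 'Proteobacteria']]
print(lineage)
```

Like the original lambda, this strips whitespace only for the membership test; non-null values are returned unchanged.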
def debug(*args):
    if _print_debug:
        pprint.pprint(args)
Should this `debug` function be moved into `sourmash_lib/logging.py` instead?
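A rough sketch of what a shared helper could look like if it were moved. This is not the actual contents of `sourmash_lib/logging.py`; `set_debug`, the `_debug` flag, and the `file` parameter are hypothetical names, and the keyword-only argument makes it Python 3 only.

```python
import pprint
import sys

_debug = False  # hypothetical module-level switch, off by default


def set_debug(flag):
    """Turn debug output on or off for the whole process."""
    global _debug
    _debug = bool(flag)


def debug(*args, file=None):
    """Pretty-print the arguments to stderr (or `file`) when debugging is on."""
    if not _debug:
        return
    if file is None:
        file = sys.stderr
    pprint.pprint(args, stream=file)
```

Keeping the flag next to the function means command modules only need to call `set_debug(args.debug)` once instead of each carrying their own `_print_debug` global.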
lineage_dict[int(k)] = tuple(vv)

# convert hashval -> lineage index keys to integers (looks like
# JSON doesn't have a 64 bit type so stores them as strings)
Wait, what? JS doesn't support 64-bit ints, but the Python JSON module does the right thing and saves the number anyway (it should be up to JS to do the proper thing).
(NOTE: all signatures would be wrong if this assumption were true...)
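A quick check of both points, using only the standard `json` module. Integer *values* round-trip exactly, even above 2**63. Dictionary *keys* come back as strings, because JSON object keys are always strings and `json.dumps` coerces int keys. My reading is that this key coercion is what the code comment in the diff ran into; the key names `value` and `index` below are made up for the demo.

```python
import json

hashval = 2**64 - 1  # largest unsigned 64-bit value

payload = {"value": hashval, "index": {hashval: 7}}
loaded = json.loads(json.dumps(payload))

# Values keep their integer type and full precision.
print(loaded["value"] == hashval)

# Keys are coerced to strings on dump, so they need converting back.
print(list(loaded["index"]))
restored = {int(k): v for k, v in loaded["index"].items()}
print(restored == {hashval: 7})
```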
(Let's keep this branch around because there are public tutorials using it)
New commands:
- `sourmash lca index taxonomy.csv save.db [list of signatures]` - build `save.db` using given taxonomy and list of signatures.
- `sourmash lca classify --db save.db --query [list of signatures]` - output a taxonomic classification of query signatures.
- `sourmash lca summarize --db save.db --query [list of signatures]` - output a taxonomic summarization of query signatures.
- `sourmash lca rankinfo [list of databases]` - output summary of database content by taxonomic level.

TODO:
In the future:
Ref #302.
`make test` Did it pass the tests?
`make coverage` Is the new code covered?
without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?