Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add 'sourmash lca' commands for kraken-style lowest-common-ancestor calculations. #367

Merged
merged 83 commits into from Dec 2, 2017

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Nov 5, 2017

New commands:

  • sourmash lca index taxonomy.csv save.db [list of signatures] - build save.db using given taxonomy and list of signatures.
  • sourmash lca classify --db save.db --query [list of signatures] - output a taxonomic classification of query signatures.
  • sourmash lca summarize --db save.db --query [list of signatures] - output a taxonomic summarization of query signatures.
  • sourmash lca rankinfo [list of databases] - output summary of database content by taxonomic level

TODO:

  • allow gzipped .lca.json files
  • support indexing of SBT directly?
  • add --traverse-directory
  • move functions out of command_classify.py
  • share taxlist
  • have classify output summary info as in this blog post
  • add metacodeR-compatible output as here
  • figure out how to detect and report taxonomy mismatches when merging databases

In the future:

  • output k-mers directly, not just hashes
  • include (?) validation/consonance of multiple tax dbs a la genbank double check

Ref #302.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov-io
Copy link

codecov-io commented Nov 5, 2017

Codecov Report

Merging #367 into master will increase coverage by 0.6%.
The diff coverage is 90.14%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #367     +/-   ##
=========================================
+ Coverage   87.42%   88.03%   +0.6%     
=========================================
  Files          15       23      +8     
  Lines        2338     3016    +678     
  Branches       36       36             
=========================================
+ Hits         2044     2655    +611     
- Misses        293      360     +67     
  Partials        1        1
Impacted Files Coverage Δ
sourmash_lib/sourmash_args.py 94.48% <100%> (+0.11%) ⬆️
sourmash_lib/lca/__init__.py 100% <100%> (ø)
sourmash_lib/__main__.py 100% <100%> (ø) ⬆️
sourmash_lib/lca/__main__.py 78.94% <78.94%> (ø)
sourmash_lib/lca/command_summarize.py 83.33% <83.33%> (ø)
sourmash_lib/lca/command_classify.py 87.2% <87.2%> (ø)
sourmash_lib/lca/command_index.py 89.14% <89.14%> (ø)
sourmash_lib/lca/command_rankinfo.py 89.36% <89.36%> (ø)
sourmash_lib/lca/command_compare_csv.py 92.3% <92.3%> (ø)
sourmash_lib/lca/lca_utils.py 95.97% <95.97%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29af52b...cc643fe. Read the comment docs.

@luizirber
Copy link
Member

I was thinking about doing a group code review on this with @bluegenes and other people in the lab, do you think it's ready for a first pass?

import sys
import argparse

from . import classify, index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't fully relative imports discouraged for 3?

@ctb
Copy link
Contributor Author

ctb commented Nov 6, 2017 via email

import argparse

from . import classify, index
from ..logging import set_quiet, error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is annoying.


import sourmash_lib
from . import lca_utils
from ..logging import notify, error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cough

import sourmash_lib
from . import lca_utils
from ..logging import notify, error
from .. import sourmash_args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cough cough

@@ -0,0 +1,186 @@
#! /usr/bin/env python
"""
Build a least-common-ancestor database with given taxonomy and genome sigs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brooksph says it is 'lowest' or 'last', @halexand supports 'lowest', @bluegenes says 'last'. In any case, @titus is wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and probably phylogenetics is also wrong)

p.add_argument('--scaled', default=10000, type=float)
p.add_argument('-k', '--ksize', default=31, type=int)
p.add_argument('-d', '--debug', action='store_true')
p.add_argument('-1', '--start-column', default=2, type=int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--1? seems confusing...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further conversation confirms that this is definitely confusing

ksize = int(args.ksize)

# parse spreadsheet!
r = csv.reader(open(args.csv, 'rt'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use a context manager:

with open(args.csv, 'rt') as f:
    r = csv.reader(f)

@ctb ctb mentioned this pull request Nov 25, 2017
5 tasks
@luizirber
Copy link
Member

@ctb tests/test-data/lca-root/tax.csv is missing

@luizirber
Copy link
Member

It is also breaking on py27 because sourmash_lib/lca/command_index.py is opening the delmont supp table file as rt, but it contains windows newline chars (\r\n). One way to fix it is opening the CSV as rtU (for universal newlines) in this line.

@ctb
Copy link
Contributor Author

ctb commented Nov 27, 2017 via email

@luizirber
Copy link
Member

Another one missing: tests/test-data/lca-root/TOBG_MED-875.fna.gz.sig

@ctb
Copy link
Contributor Author

ctb commented Nov 27, 2017 via email

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments (punt to issues?), but that filter_null would be better as a function right now =]


First, install sourmash from the LCA branch:
```
pip install -U https://github.com/dib-lab/sourmash/archive/add/lca.zip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably bump version to 2.0.0a3 after merging this, and saying to install a version >=2.0.0a3 instead?

taxonomy between all the k-mers (`sourmash classify`) or it can summarize
the mixture of k-mers present in one or more signatures (`sourmash summarize`).

The `sourmash index` command can be used to prepare custom taxonomy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sourmash index or sourmash lca index?

# filter function toreplace blank/na/null with 'unassigned'
filter_null = lambda x: 'unassigned' if x.strip() in \
('[Blank]', 'na', 'null', '') else x
null_names = set(['[Blank]', 'na', 'null'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make null_names into a constant (frozenset?), use it in filter_null (defined as a function, not as a lambda).

NULL_NAMES = {'[Blank]', 'na', 'null', ''}

def filter_null(x):
    if x.strip() in NULL_NAMES:
        return 'unassigned'
    return x


def debug(*args):
if _print_debug:
pprint.pprint(args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this debug function be moved into sourmash_lib/logging.py instead?

lineage_dict[int(k)] = tuple(vv)

# convert hashval -> lineage index keys to integers (looks like
# JSON doesn't have a 64 bit type so stores them as strings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, what? JS doesn't support 64-bit ints, but the python JSON module does the right thing and save the number anyway (it should be up to JS to do the proper thing).

(NOTE: all signatures would be wrong if this assumption was true...)

@luizirber luizirber merged commit 47e2c31 into master Dec 2, 2017
@luizirber luizirber deleted the add/lca branch December 2, 2017 00:28
@luizirber luizirber restored the add/lca branch December 6, 2017 23:28
@luizirber
Copy link
Member

(Let's keep this branch around because there are public tutorials using it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants