utility: kraken-style LCA classification on banded signatures #302

Closed
ctb opened this issue Aug 23, 2017 · 6 comments

@ctb
Contributor

ctb commented Aug 23, 2017

See https://gist.github.com/ctb/9deb40a68108256ab4fd84c6b8e92e01 for implementation.

cc @brooksph @taylorreiter

@ctb
Contributor Author

ctb commented Aug 23, 2017

Some more thoughts:

  • scaling Kraken database building and search by a factor of 100x would already be pretty amazing;
  • we should evaluate Bracken on banded output; that would be the major test;
  • we should support FASTA input in addition to signatures;
  • we should support fast merging/updating/etc. of the databases;
  • supporting multiple k-mer sizes is also a great idea!

@ctb
Contributor Author

ctb commented Aug 24, 2017

  • propose to add as sourmash lca
  • support multiple query databases? would need nodes.dmp.
  • deliver/store as .json file with --scaled and ksize values for multiple lca k-mer size dbs
  • kraken compatible output possible?

@ctb
Contributor Author

ctb commented Aug 25, 2017

Current output:

% kraken/classify.py genbank/nodes.dmp genbank/names.dmp genbank-k31.lca sig-to-classify.sig
loading taxonomic nodes from: genbank/nodes.dmp
loading taxonomic names from: genbank/names.dmp
loading k-mer DB from: genbank-k31.lca
loading signatures from 1 signature files
loaded 1 signatures total at k=31
downsampling to scaled value: 10000
found LCA classifications for 411 of 411 hashes
percent below   at node code    taxid   name
100.0   411     0       -       131567  cellular organisms
100.0   411     23      -       2       Bacteria
94.4    388     15      -       1783272 Terrabacteria group
90.75   373     0       P       201174  Actinobacteria
90.75   373     7       C       1760    Actinobacteria
89.05   366     8       O       85007   Corynebacteriales
87.1    358     0       F       1762    Mycobacteriaceae
87.1    358     25      G       1763    Mycobacterium
81.02   333     329     -       77643   Mycobacterium tuberculosis complex
0.97    4       4       S       1773    Mycobacterium tuberculosis
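
Reading the table above: the "at node" counts sum to 411, and each "below" value looks like the node's own count plus everything assigned beneath it. A minimal sketch of that roll-up, assuming this reading is right; the `PARENTS` map and the per-hash assignments are toy data, and `summarize` is a hypothetical helper, not the code in the gist:

```python
from collections import Counter

# Toy taxonomy (child taxid -> parent taxid); a hypothetical subset, root is 1.
PARENTS = {131567: 1, 2: 131567, 1783272: 2, 201174: 1783272}


def summarize(lca_taxids, parents, root=1):
    """Return {taxid: (below, at_node)} for a list of per-hash LCA taxids."""
    at_node = Counter(lca_taxids)
    below = Counter()
    for taxid, count in at_node.items():
        # Credit the count to the node itself and to every ancestor below the root.
        node = taxid
        while node != root:
            below[node] += count
            node = parents[node]
    return {taxid: (below[taxid], at_node[taxid]) for taxid in below}


# Made-up per-hash LCA assignments: 3 hashes at Bacteria, 2 at Terrabacteria, 5 deeper.
assignments = [2] * 3 + [1783272] * 2 + [201174] * 5
report = summarize(assignments, PARENTS)

total = len(assignments)
for taxid, (below, at) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{100 * below / total:.1f}\t{below}\t{at}\t{taxid}")
```

With a real nodes.dmp the parent map would be loaded from the file rather than hard-coded.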

LCA files for genbank-k21, k31, and k51 are available on the OSF under sourmash-lca-mark1.

They were built with the command

python gist/extract.py genbank*.csv.gz nodes.dmp --traverse-directory .sbt.genbank-k21/ --savename genbank-k21.lca -k 21 --scaled 10000

and each took roughly 2 to 3 hours and about 6 GB of RAM to build on the MSU HPCC; one job reported:

lca.o46877939:    resources_used.walltime = 02:51:58
lca.o46877939:    resources_used.vmem = 6294516kb
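
For anyone wondering what the build step amounts to, here is a rough sketch of the Kraken-style idea: every retained hash is mapped to the lowest common ancestor of all genomes that contain it. This is not the code from the gist. The function names, the toy taxonomy, and the rule that downsampling keeps hashes below `2**64 / scaled` are all assumptions made for illustration:

```python
def lineage(taxid, parents, root=1):
    """Path from taxid up to the root, inclusive."""
    path = [taxid]
    while path[-1] != root:
        path.append(parents[path[-1]])
    return path


def lca(a, b, parents, root=1):
    """Lowest common ancestor of two taxids."""
    ancestors_a = set(lineage(a, parents, root))
    # Walk up from b; the first node shared with a's lineage is the LCA.
    for node in lineage(b, parents, root):
        if node in ancestors_a:
            return node
    return root


def build_lca_db(genomes, parents, scaled):
    """genomes: iterable of (taxid, hashes). Returns {hash: LCA taxid}."""
    max_hash = 2**64 // scaled  # assumed downsampling threshold
    db = {}
    for taxid, hashes in genomes:
        for h in hashes:
            if h >= max_hash:
                continue  # dropped by downsampling
            db[h] = lca(db[h], taxid, parents) if h in db else taxid
    return db


# Toy data: a truncated, hypothetical tree and made-up hash values.
parents = {1773: 77643, 77643: 1763, 1763: 1}
genomes = [(1773, [5, 10]), (77643, [10, 20]), (1763, [20, 2**63])]
db = build_lca_db(genomes, parents, scaled=4)
```

Hashes seen in only one genome keep that genome's taxid; shared hashes move up the tree, which is why most hashes in the output above sit at "Mycobacterium tuberculosis complex" rather than at the species.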

@ctb
Contributor Author

ctb commented Aug 25, 2017

More TODO items:

  • add to sourmash as sourmash lca search and sourmash lca index (?), w/tests etc.
  • support a short-read search mode.

@ctb
Contributor Author

ctb commented Aug 25, 2017

Random other thoughts:

  • nodes/names should be distributed with LCA databases, maybe;
  • do caching of nodes/names?
  • LCA databases are small enough to distribute all together, at least for scaled=10k;
  • provide JSON file linking LCA databases, ksize, nodes/name files;
  • maybe support sqlite or leveldb databases? and/or redis?
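
One possible shape for the JSON file linking databases, ksize, and the nodes/names files, as a sketch only. The key names (`scaled`, `nodes`, `names`, `databases`, `ksize`, `path`) and the `find_db` helper are hypothetical, not an agreed format; the file names follow the ones mentioned earlier in this thread:

```python
import json

# Hypothetical manifest layout; none of these keys are a settled sourmash format.
manifest = {
    "scaled": 10000,
    "nodes": "nodes.dmp",
    "names": "names.dmp",
    "databases": [
        {"ksize": k, "path": f"genbank-k{k}.lca"} for k in (21, 31, 51)
    ],
}


def find_db(manifest, ksize):
    """Return the LCA database path recorded for the given k-mer size."""
    for entry in manifest["databases"]:
        if entry["ksize"] == ksize:
            return entry["path"]
    raise KeyError(f"no LCA database for ksize={ksize}")


# Round-trip through JSON text, as it would be stored on disk.
text = json.dumps(manifest, indent=2)
loaded = json.loads(text)
print(find_db(loaded, 31))
```

Keeping scaled at the top level assumes all databases in one bundle share it; a per-database value would work just as well.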

@ctb
Contributor Author

ctb commented Feb 18, 2018

Fixed in #367.

ctb closed this as completed Feb 18, 2018