moving toward `sourmash taxonomy` for taxonomy reporting and manipulation from `sourmash gather` results #1515

taylorreiter · 2021-05-11T21:20:18Z

@luizirber and @bluegenes and I have been getting more excited about a sourmash taxonomy command, and potentially tackling pieces of it in a DIB lab hackathon. We had a conversation about this and wanted to summarize the main points here, as well as continue brainstorming.

Goal: command line interface that takes one or multiple sourmash gather csvs and a lineage csv and provides taxonomic rank summarization and downstream formatting for ingestion by popular taxonomy visualization tools.

Relevant Issues:

Relevant Repos:

https://github.com/dib-lab/2018-ncbi-lineages
https://github.com/dib-lab/sourmash_databases
https://github.com/dib-lab/2019-12-12-sourmash_viz
https://github.com/luizirber/2020-cami
[MRG] Building databases from assembly_summary.txt databases#11 <- build databases from assembly_stats.txt. Alternative to 2018-ncbi-lineages that parses the genbank assembly_stats.txt for the assembly accession and taxon id.
- currently on farm /home/irber/sourmash_databases/outputs/lca/lineages

What needs to be included in sourmash taxonomy? Command line interface, inputs and outputs:
What should the command line interface look like? What should the format of the inputs and outputs be? What functionality should be included?

inputs:
- one or multiple sourmash gather csv files
- one or more lineage csv files (one lineage file per SBT used in gather)
  + column 1: dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession)
  + formatting of subsequent columns needs to be standardized, or parsed from standardized column names (e.g. superkingdom)
commands:
- convert/liftover <- convert between taxonomies
  - e.g. GTDB <-> NCBI using GTDB csv map
  - or GTDB version conversion (e.g. r95 -> rs202)
- summarize; similar to lca summarize
  - cami format output <- (from https://github.com/luizirber/2020-cami)
  - krona format output
  - newick format output?
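
A minimal sketch of what the summarize step could look like, assuming gather results have already been parsed into rows and lineages into tuples ordered from superkingdom to species. The names `summarize_at_rank`, `ident`, and `fraction`, as well as the accessions and taxon names, are hypothetical placeholders and not the actual sourmash API or gather column names.

```python
from collections import defaultdict

# Rank order assumed for this sketch.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_rank(gather_rows, lineages, rank):
    """Sum the fraction of each gather match per lineage, truncated at `rank`."""
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in gather_rows:
        lineage = lineages.get(row["ident"])
        if lineage is None:
            continue  # no lineage known for this match; ignored in this sketch
        totals[tuple(lineage[:depth])] += float(row["fraction"])
    # Largest summed fraction first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up gather rows and lineages, for illustration only.
gather_rows = [
    {"ident": "acc1", "fraction": "0.5"},
    {"ident": "acc2", "fraction": "0.25"},
    {"ident": "acc3", "fraction": "0.125"},
]
lineages = {
    "acc1": ("Bacteria", "phylumA", "classA", "orderA", "familyA", "genusA", "speciesA1"),
    "acc2": ("Bacteria", "phylumA", "classA", "orderA", "familyA", "genusA", "speciesA2"),
    "acc3": ("Bacteria", "phylumB", "classB", "orderB", "familyB", "genusB", "speciesB1"),
}

genus_summary = summarize_at_rank(gather_rows, lineages, "genus")
for lineage, fraction in genus_summary:
    print(";".join(lineage), fraction)
```

The sorted `(lineage, fraction)` pairs are the kind of intermediate result that the cami or krona formatters could then consume.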

Think about for the future

lineage as a manifest packaged with the zip sbt database


ctb · 2021-05-12T15:55:05Z

I'm on board!

Is the idea that this would be available for sourmash 4.2? #1481

ctb · 2021-05-15T12:21:56Z

do you envision that the taxonomy spreadsheet format would be changing much, or would those be independent changes?

taylorreiter · 2021-05-15T17:56:50Z

This is something we started talking about. The spreadsheet for NCBI currently has fields (synonymous with) accession, taxonid, superkingdom, phylum, class, order, family, genus, species and I think strain. @bluegenes has made her GTDB lineage spreadsheets without the second column taxonid and without strain I think. @luizirber, @bluegenes and I agreed that the first column should remain as the dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession). After that, we like taxonid for NCBI, but that doesn't necessarily fit with GTDB...and it may be handy to have excess columns over the lineage. So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.

One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

ctb · 2021-05-18T01:10:06Z

a few quick thoughts based on thinking while running - feel free to reject

- presumably sourmash tax/taxonomy would be another set of subcommands?
- I like lca classify for single genomes and lca summarize for multiple
- the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
- column names == better than what we're doing in lca! I think it's fine to be extra special and hardcode ncbi and gtdb taxon names in as things that can be recognized and combined, since they're so ubiquitous.
- combining multiple taxonomies like you say above is a really good use case...

bluegenes · 2021-05-19T14:58:58Z

re @taylorreiter comments --

> So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.
> One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.
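
A minimal sketch of the column-name approach described above, assuming each lineage csv has a header row. The identifier column name `ident`, the function name `load_lineages`, and all accessions and taxon names are hypothetical; only the rank names come from the discussion. Required columns are looked up by name, so column order and extra columns (taxid, strain) do not matter, and several spreadsheets can be read separately into one mapping.

```python
import csv
import io

# Columns every lineage csv must provide in this sketch; "ident" is a made-up name.
REQUIRED = ["ident", "superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def load_lineages(handles):
    """Read one or more lineage csvs (open file handles) into {ident: lineage tuple}."""
    lineages = {}
    for handle in handles:
        reader = csv.DictReader(handle)
        missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"lineage csv is missing required columns: {missing}")
        for row in reader:
            # Extra columns such as taxid or strain are simply ignored.
            lineages[row["ident"]] = tuple(row[rank] for rank in REQUIRED[1:])
    return lineages


# Two made-up spreadsheets: one NCBI-like with extra columns, one GTDB-like
# with a different column order and no taxid.
NCBI_CSV = (
    "ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "acc1,1234,Eukaryota,phylumA,classA,orderA,familyA,genusA,speciesA,\n"
)
GTDB_CSV = (
    "species,genus,family,order,class,phylum,superkingdom,ident\n"
    "speciesB,genusB,familyB,orderB,classB,phylumB,Bacteria,acc2\n"
)

lineages = load_lineages([io.StringIO(NCBI_CSV), io.StringIO(GTDB_CSV)])
print(lineages["acc2"])
```

Failing early on a missing required column also gives people building their own lineage spreadsheets a clear error to work from.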

re: @ctb comments --

> presumably sourmash tax/taxonomy would be another set of subcommands?

yep!

> I like lca classify for single genomes and lca summarize for multiple

yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

> the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.

I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.

ctb · 2021-05-19T15:05:42Z

> re @taylorreiter comments --
>
> > So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.
> > One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.
>
> I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.

good!

> re: @ctb comments --
>
> > I like lca classify for single genomes and lca summarize for multiple
>
> yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

absolutely, especially since we won't be using LCA methods in the same way :)

note that (b/c of semantic versioning) we won't be removing the lca commands completely until v6 at the earliest. But we can deprecate them for v5.

> We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

k!

> > the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
>
> I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.

yep!

bluegenes · 2021-05-19T19:13:28Z

Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)
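
One way this could be handled, as a sketch: strip a trailing numeric version suffix from identifiers before using them as lookup keys. The helper name `strip_version` and the accessions below are made up for illustration.

```python
def strip_version(ident):
    """Drop a trailing '.N' version suffix (e.g. '.1', '.2') if present."""
    base, sep, version = ident.rpartition(".")
    if sep and version.isdigit():
        return base
    return ident  # no dot, or the suffix is not a number: leave unchanged


# Key the lineage table on version-less identifiers (made-up accession).
lineages = {strip_version("GCF_000000001.1"): ("Bacteria", "phylumA")}

# A gather match against an updated assembly version still finds its lineage.
match = lineages.get(strip_version("GCF_000000001.2"))
print(match)
```

Applying the same normalization on both the lineage side and the gather side keeps the lookup consistent when an assembly is updated or a version is withdrawn.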

ctb · 2021-05-21T14:51:59Z

> Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)

running into exactly this with sourmash lca index! working on some patches here, #1542, would be nice to fix this up front in sourmash taxonomy code!

bluegenes · 2021-05-22T15:11:04Z

Organization question, mainly for @ctb, but also everyone:

How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.

thoughts?

ctb · 2021-05-22T15:15:18Z

> Organization question, mainly for @ctb, but also everyone:
>
> How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.
>
> thoughts?

suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).

To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

bluegenes · 2021-05-22T15:28:49Z

> suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).
>
> To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

wonderful. I didn't want to suggest this because I was worried about backwards compatibility, but ofc, can just reference the functions in lca!

> It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.

bluegenes · 2021-05-22T16:12:32Z

ref dib-lab/charcoal#174:

It's probably a good idea for gather_at_rank to detect and handle/report such ties, and probably pull the taxonomic assignment up to the level above the tie.
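
A rough sketch of the tie handling suggested here, not the actual `gather_at_rank` implementation: sum fractions per lineage at the requested rank, and if the top score is shared by more than one lineage, move up one rank and try again. The function name `classify_with_ties` and the toy lineages are hypothetical, and ties are detected by exact float equality, which a real implementation would probably want to relax.

```python
from collections import defaultdict

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def classify_with_ties(lineage_fractions, rank):
    """Return (rank, lineage, fraction) for the best assignment at or above `rank`.

    `lineage_fractions` is a list of (lineage tuple, fraction) pairs.
    If the top score is tied at a rank, the assignment is pulled up one rank.
    """
    if not lineage_fractions:
        return None, (), 0.0
    depth = RANKS.index(rank) + 1
    while depth > 0:
        totals = defaultdict(float)
        for lineage, fraction in lineage_fractions:
            totals[tuple(lineage[:depth])] += fraction
        best = max(totals.values())
        winners = [lin for lin, total in totals.items() if total == best]
        if len(winners) == 1:
            return RANKS[depth - 1], winners[0], best
        depth -= 1  # tie at this rank: report it by moving one rank up
    return None, (), 0.0  # tied all the way up to superkingdom


genus_a = ("Bacteria", "phylumA", "classA", "orderA", "familyA", "genusA")
tied = [
    (genus_a + ("speciesA1",), 0.25),
    (genus_a + ("speciesA2",), 0.25),
]
result = classify_with_ties(tied, "species")
print(result)
```

In this toy case the two species tie, so the assignment lands at genus with the combined fraction.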

ctb · 2021-05-22T16:40:42Z

> good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.

maybe: copy on write? import from lca until you need to change, then when you need to change, copy. e.g. the tree/LCA stuff is unlikely to need changes, but the taxonomy loading stuff is ...questionable :)

bluegenes · 2021-06-16T15:04:02Z

preserving from slack

bluegenes:feet:
is there any need for a sourmash tax label (or similar), where we just add lineage information into the gather results, with no summarization at all?

titus:speech_balloon:
I kinda like that!
you could imagine something like describe or display that would give you something human readable, OR just have it be straight up CSV output

bluegenes:feet:
ooh, definitely
this is also making me think of some reporting folks might like out of classify — x% of genomes classified at species, etc

titus:speech_balloon:
yep
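
A small sketch of what such a label step might do, assuming gather rows and lineages are already loaded: add a lineage column to each row and write plain CSV, with no summarization. The names `label_rows`, `ident`, `fraction`, and `lineage` are hypothetical placeholders, not the eventual command's column names.

```python
import csv
import io


def label_rows(gather_rows, lineages):
    """Return copies of the gather rows with a ';'-joined 'lineage' column added."""
    labeled = []
    for row in gather_rows:
        lineage = lineages.get(row["ident"], ())  # empty string if unknown
        labeled.append({**row, "lineage": ";".join(lineage)})
    return labeled


# Made-up inputs for illustration.
gather_rows = [
    {"ident": "acc1", "fraction": "0.5"},
    {"ident": "acc9", "fraction": "0.25"},
]
lineages = {"acc1": ("Bacteria", "phylumA", "classA")}

labeled = label_rows(gather_rows, lineages)

# Straight CSV output, one row per gather match.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ident", "fraction", "lineage"])
writer.writeheader()
writer.writerows(labeled)
print(out.getvalue())
```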

ctb · 2021-06-26T13:41:40Z

I think everything in here is covered by #1543! At this point someone(s) should revisit #969 and create a new "summary" issue that contains the remaining ideas, but no urgency.

bluegenes mentioned this issue May 21, 2021

[MRG] add taxonomy subcommand #1543

Merged

14 tasks

ctb mentioned this issue May 22, 2021

[WIP] improve identifier & taxonomy parsing for lca index #1542

Closed

bluegenes added the taxonomy label Jun 11, 2021

ctb closed this as completed Jun 26, 2021
