moving toward sourmash taxonomy for taxonomy reporting and manipulation from sourmash gather results #1515

Closed
taylorreiter opened this issue May 11, 2021 · 15 comments

@taylorreiter
Contributor

@luizirber and @bluegenes and I have been getting more excited about a sourmash taxonomy command, and potentially tackling pieces of it in a DIB lab hackathon. We had a conversation about this and wanted to summarize the main points here, as well as continue brainstorming.

Goal: command line interface that takes one or multiple sourmash gather csvs and a lineage csv and provides taxonomic rank summarization and downstream formatting for ingestion by popular taxonomy visualization tools.

Relevant Issues:

Relevant Repos:

What needs to be included in sourmash taxonomy? Command line interface, inputs and outputs:
What should the command line interface look like? What should the format of the inputs and outputs be? What functionality should be included?

  • inputs:
    • one or multiple sourmash gather csv files
    • one or more lineage csv files (one lineage file per SBT used in gather)
      + column 1: dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession)
      + formatting of subsequent columns needs to be standardized, or parsed from standardized column names (e.g. superkingdom)
  • commands:
    • convert/liftover <- convert between taxonomies
      • e.g. GTDB <-> NCBI using GTDB csv map
      • or GTDB version conversion (e.g. r95 -> rs202)
    • summarize; similar to lca summarize
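As a rough illustration of the summarize step listed above, here is a minimal Python sketch. It is not sourmash code: the function name `summarize_at_rank`, the `RANKS` list, and the rule that the identifier is the first whitespace-separated token of the gather `name` column are all assumptions made for the example, as is the choice of `f_unique_weighted` as the fraction to sum.

```python
from collections import defaultdict

# Rank order assumed for this sketch (NCBI/GTDB-style column names).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def summarize_at_rank(gather_rows, lineages, rank):
    """Sum the gather fraction of all matches that share a lineage down to `rank`.

    gather_rows: iterable of dicts, e.g. rows from csv.DictReader on a gather CSV.
    lineages: dict mapping identifier -> sequence of taxon names in RANKS order.
    """
    depth = RANKS.index(rank) + 1
    totals = defaultdict(float)
    for row in gather_rows:
        # Assumption: the identifier is the first token of the `name` column.
        ident = row["name"].split(" ")[0]
        lineage = lineages.get(ident)
        if lineage is None:
            continue  # no lineage entry for this match; skipped in this sketch
        totals[tuple(lineage[:depth])] += float(row["f_unique_weighted"])
    return dict(totals)


# Illustrative inputs only.
rows = [
    {"name": "GCF_000005845.2 Escherichia coli", "f_unique_weighted": "0.5"},
    {"name": "GCF_000006945.2 Salmonella enterica", "f_unique_weighted": "0.25"},
]
lineages = {
    "GCF_000005845.2": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    "GCF_000006945.2": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
}
print(summarize_at_rank(rows, lineages, "phylum"))
```

Keying the totals on the truncated lineage tuple, rather than on the taxon name alone, keeps same-named taxa under different parents separate.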

Think about for the future

  • lineage as a manifest packaged with the zip sbt database
@ctb
Contributor

ctb commented May 12, 2021

I'm on board!

Is the idea that this would be available for sourmash 4.2? #1481

@ctb
Contributor

ctb commented May 15, 2021

do you envision that the taxonomy spreadsheet format would be changing much, or would those be independent changes?

@taylorreiter
Contributor Author

This is something we started talking about. The spreadsheet for NCBI currently has fields (synonymous with) accession, taxonid, superkingdom, phylum, class, order, family, genus, species and I think strain. @bluegenes has made her GTDB lineage spreadsheets without the second column taxonid and without strain I think. @luizirber, @bluegenes and I agreed that the first column should remain as the dataset identification in database (e.g. unique identifier from SBT [NOT MD5]; assembly accession). After that, we like taxonid for NCBI, but that doesn't necessarily fit with GTDB...and it may be handy to have excess columns over the lineage. So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.

One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

@ctb
Contributor

ctb commented May 18, 2021

a few quick thoughts based on thinking while running - feel free to reject

  • presumably sourmash tax/taxonomy would be another set of subcommands?
  • I like lca classify for single genomes and lca summarize for multiple
  • the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
  • column names == better than what we're doing in lca! I think it's fine to be extra special and hardcode ncbi and gtdb taxon names in as things that can be recognized and combined, since they're so ubiquitous.
  • combining multiple taxonomies like you say above is a really good use case...

@bluegenes
Contributor

re @taylorreiter comments --

> So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.
> One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.

I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.
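To make the column-name idea concrete, here is a minimal sketch of reading several lineage sheets separately, requiring a fixed set of columns and ignoring any extras (such as taxid or strain). Everything named here is an assumption for illustration: `load_lineages`, `REQUIRED_RANKS`, and especially the identifier column name `ident`, which the discussion above does not fix. The rows are illustrative only.

```python
import csv
import io

# Columns every lineage sheet must provide in this sketch.
REQUIRED_RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
IDENT_COLUMN = "ident"  # hypothetical name for the identifier column


def load_lineages(handles):
    """Merge lineage sheets into one dict: identifier -> tuple of taxon names."""
    lineages = {}
    for fp in handles:
        reader = csv.DictReader(fp)
        header = reader.fieldnames or []
        missing = [c for c in [IDENT_COLUMN] + REQUIRED_RANKS if c not in header]
        if missing:
            raise ValueError(f"lineage sheet is missing columns: {missing}")
        for row in reader:
            # Look columns up by name, so position and extra columns do not matter.
            lineages[row[IDENT_COLUMN]] = tuple(row[r] for r in REQUIRED_RANKS)
    return lineages


# A GTDB-style sheet (no taxid, no strain) and an NCBI-style sheet (with both).
gtdb_sheet = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "GCF_000005845,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,"
    "o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli\n"
)
ncbi_sheet = io.StringIO(
    "ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "GCA_000146045,559292,Eukaryota,Ascomycota,Saccharomycetes,"
    "Saccharomycetales,Saccharomycetaceae,Saccharomyces,Saccharomyces cerevisiae,S288C\n"
)
combined = load_lineages([gtdb_sheet, ncbi_sheet])
print(len(combined))
```

In this sketch a later sheet silently overwrites an earlier entry with the same identifier; a real implementation would probably want to detect and report such conflicts.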

re: @ctb comments --

> presumably sourmash tax/taxonomy would be another set of subcommands?

yep!

> I like lca classify for single genomes and lca summarize for multiple

yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

> the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.

I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.

@ctb
Contributor

ctb commented May 19, 2021

> re @taylorreiter comments --
>
> > So, what might make sense is having a flexible format where the code interacts with the lineage spreadsheet via column names instead of hard-coded positions...but I think we left the conversation open ended with lots of room for continued brainstorming.
> > One potential drawback of this that I just thought of is combining lineage sheets from different taxonomies. I recently did a gather run where I used the GTDB database, and then tacked on the protozoa, fungi and viral databases from NCBI. I combined the four lineage spreadsheets to do taxonomy summarization all at once.
>
> I think this use case is a strong argument in support of using column names and enabling multiple lineage spreadsheets to be read in separately. As we read in each lineage csv, we would require that certain columns exist, but otherwise be flexible (e.g. - these can contain additional information that we just ignore). This also would provide a set of standardized guidelines for folks to build their own lineage spreadsheets if they need.

good!

> re: @ctb comments --
>
> > I like lca classify for single genomes and lca summarize for multiple
>
> yes! Though I think we were talking about dropping the lca -- so sourmash tax classify or sourmash tax summarize just to keep things succinct.

absolutely, especially since we won't be using LCA methods in the same way :)

note that (b/c of semantic versioning) we won't be removing the lca commands completely until v6 at the earliest. But we can deprecate them for v5.

> We somewhat decided to focus on the summarize function first, just to narrow the scope. classify is pretty similar to summarize (with a few important diffs), so if we don't get to it now, I will work on it after the hackathon.

k!

> > the --query and --db style to take multiple arguments is clumsy when you have one query vs 1 db, but is really nice when you have multiple. dunno what to do here, but suggest being consistent within the subcommand.
>
> I think what you're saying is +1 to enabling multiple query/db input in a single command? Agree, and I'll also drop in that a --from-file style input would be very useful, at least for classify, which needs to read in the output of gather run on each genome to be classified.

yep!

@bluegenes
Contributor

bluegenes commented May 19, 2021

Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)
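A minimal sketch of dropping the version suffix before lineage lookup. The helper name `strip_version` is hypothetical, and the pattern is slightly broader than the `.[12]` mentioned above: it assumes any trailing `.<digits>` is a version.

```python
import re

# Assumption: the assembly version is a trailing ".<digits>" on the accession.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ident):
    """Return the accession without its trailing version, e.g. '.1' or '.2'."""
    return _VERSION_SUFFIX.sub("", ident)


print(strip_version("GCF_000005845.2"))
```

Applying the same normalization to both the lineage sheet identifiers and the gather match identifiers means an updated or withdrawn assembly version still finds its lineage.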

@ctb
Contributor

ctb commented May 21, 2021

> Note - when processing lineages, we should try to ignore assembly version info (.[12]) -- these shouldn't matter for taxonomy, and make things a bit more complicated when an updated assembly is added or an assembly version is redacted (as is currently the case for one in gtdb-rs202)

running into exactly this with sourmash lca index! working on some patches here, #1542, would be nice to fix this up front in sourmash taxonomy code!

@bluegenes
Contributor

Organization question, mainly for @ctb, but also everyone:

How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.

thoughts?

@ctb
Contributor

ctb commented May 22, 2021

> Organization question, mainly for @ctb, but also everyone:
>
> How do we want to split functionality between lca and tax? Since lca already has excellent handy lineage utilities, a simple way to structure the division would be to keep all lineage parsing / manipulation over in lca_utils, and keep tax functions focused around a) parsing and summarizing gather output (using lca functions internally) and b) conversion to output formats for use in visualization.
>
> thoughts?

suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).

To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

@bluegenes
Contributor

> suggest converse - copy or move functions over to tax, and change lca to reference them (if unchanged) or not (if changed).
>
> To my understanding, lca will become a special case and/or be deprecated in the future, and I think all taxonomy stuff should be moved under tax.

wonderful. I didn't want to suggest this because I was worried about backwards compatibility, but ofc, can just reference the functions in lca!

> It might complicate things during the hackathon, tho, so it's totally OK to just leave things as they are and reference the lca functions from tax; this also gives you the flexibility to customize the tax functions where needed. we can move them over later, as long as everything is tested (similar to what we are doing with sourmash compute vs sourmash sketch).

good point. Will start with copying over the functions we use directly, to allow modification as needed during the hackathon.

@bluegenes
Contributor

ref dib-lab/charcoal#174:

It's probably a good idea for gather_at_rank to detect and handle/report such ties, and probably pull the taxonomic assignment up to the level above the tie.
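One way the tie handling suggested above could look, as a sketch only. This is not charcoal's `gather_at_rank`; the function name `classify_with_tie_handling`, the `(lineage, fraction)` input shape, and the use of `math.isclose` to decide what counts as a tie are all assumptions for the example.

```python
import math
from collections import defaultdict


def classify_with_tie_handling(assignments, depth):
    """Pick the best lineage at `depth` ranks; on a tie, move up one rank and retry.

    assignments: list of (lineage, fraction) pairs, where lineage is a sequence
    of taxon names ordered from the highest rank downward.
    Returns (lineage_tuple, summed_fraction).
    """
    if not assignments:
        raise ValueError("no assignments to classify")
    while depth > 0:
        totals = defaultdict(float)
        for lineage, fraction in assignments:
            totals[tuple(lineage[:depth])] += fraction
        best = max(totals.values())
        winners = [lin for lin, total in totals.items() if math.isclose(total, best)]
        if len(winners) == 1:
            return winners[0], best
        depth -= 1  # tie at this rank: pull the assignment up one level
    # Tied all the way to the top rank: report an empty lineage.
    return (), sum(fraction for _, fraction in assignments)


matches = [
    (("Bacteria", "Proteobacteria", "Gammaproteobacteria"), 0.25),
    (("Bacteria", "Proteobacteria", "Alphaproteobacteria"), 0.25),
    (("Bacteria", "Firmicutes", "Bacilli"), 0.125),
]
print(classify_with_tie_handling(matches, depth=3))
```

Here the two classes tie at the third rank, so the sketch falls back to the phylum level, where their fractions combine into a single winner. A real implementation would likely also report that a tie occurred.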

@ctb
Contributor

ctb commented May 22, 2021 via email

@bluegenes
Contributor

preserving from slack

bluegenes:feet:
is there any need for a sourmash tax label (or similar), where we just add lineage information into the gather results, with no summarization at all?

titus:speech_balloon:
I kinda like that!
you could imagine something like describe or display that would give you something human readable, OR just have it be straight up CSV output

bluegenes:feet:
ooh, definitely
this is also making me think of some reporting folks might like out of classify — x% of genomes classified at species, etc

titus:speech_balloon:
yep
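For the label idea from the slack thread, a minimal sketch of adding a lineage column to gather rows with no summarization, written straight back out as CSV. The function name `label_rows`, the output column name `lineage`, the `;` separator, and the rule that the identifier is the first token of the gather `name` column are assumptions for illustration.

```python
import csv
import io


def label_rows(gather_rows, lineages):
    """Yield a copy of each gather row with an added `lineage` column."""
    for row in gather_rows:
        # Assumption: the identifier is the first token of the `name` column.
        ident = row["name"].split(" ")[0]
        labeled = dict(row)
        # Rows without a known lineage get an empty string.
        labeled["lineage"] = ";".join(lineages.get(ident, ()))
        yield labeled


gather_rows = [
    {"name": "GCF_000005845.2 Escherichia coli", "f_unique_weighted": "0.5"},
    {"name": "unknown_1 unplaced genome", "f_unique_weighted": "0.1"},
]
lineages = {"GCF_000005845.2": ("Bacteria", "Proteobacteria")}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "f_unique_weighted", "lineage"])
writer.writeheader()
writer.writerows(label_rows(gather_rows, lineages))
print(out.getvalue())
```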

@ctb
Contributor

ctb commented Jun 26, 2021

I think everything in here is covered by #1543! At this point someone(s) should revisit #969 and create a new "summary" issue that contains the remaining ideas, but no urgency.

@ctb ctb closed this as completed Jun 26, 2021