Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding lineage manipulation & taxonomy reporting in more places in sourmash? #969

Open
ctb opened this issue Apr 28, 2020 · 22 comments
Open
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Copy link
Contributor

ctb commented Apr 28, 2020

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

The basic idea here is that we combine taxonomy with search and gather on signatures/SBTs/LCAs, by connecting identifiers with lineages -- do dynamically what sourmash lca index does once.

This falls under use cases of taxonomic filtering and reporting.

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

It might be possible to add taxonomy/lineage-aware selector functionality on signatures and databases, e.g. "run a gather on this database, but ignore all results that are not Archaea."

Should also look at sourmash lca gather to see about collapsing results taxonomically...

Thoughts welcome!

related conversations

@luizirber luizirber reopened this Apr 29, 2020
@luizirber
Copy link
Member

luizirber commented Apr 29, 2020

(Sorry, wrong button)

We are getting closer to being able to separate out the lineage stuff from the revindex stuff in LCA, so we could envision combining searches on signatures and SBTs with taxonomic reporting (where available).

What I don't know how to do is actually apply it in reality. A clunky (but perhaps functional) way that I'd been idly thinking of would be to build a taxonomy command-line interface that ingested the CSVs output by sourmash search and gather and did various kinds of taxonomic reporting and manipulation on them, when given a taxonomy or lineage database.

Another example: this is the script I'm using for thesis work and 2020-cami: https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py

It's using pandas (for summarizing at different taxonomic levels) and the taxonomy package for parsing taxonomies, and I avoided bringing this into sourmash because they are heavy-ish deps, but having a sourmash taxonomy subcommand for working with similar tasks (and bringing the lineage parsers necessary for LCA databases, for example) seems a good idea.

(Converting from an SBT to an LCA index like in https://github.com/luizirber/2020-cami/blob/f7f34d3903cd7d2b6bc2ac0471f7d53a42aa86b2/rules/build_indices.smk#L64L66 without depending on external code would be pretty cool. I guess the revindex part is already doable today?)

I don't really want to rewrite / extend gather to also have taxonomy... ugly complex code.

Totally agree, and I think it also hurts functionality. Deposited genomes in genbank or refseq don't change (but they might get new versions), but the accession to taxid mapping CAN change (see: recent Lactobacillus genera being split).

It is so hard to deal with the NCBI taxonomy because it changes and doesn't have stable downloads for a specific date, so while generating taxinfo and putting it into SBTs has the benefit of capturing the taxonomy at one point in time (and being very convenient to use), but the drawback of potentially being outdated.

the ZipStorage work in #648 makes it plausible to package SBTs up together with a lineage.

Yup, but would also like to see some option to override it somehow (re: outdated taxonomies)

@ctb
Copy link
Contributor Author

ctb commented Apr 29, 2020

OK, a few more thoughts and some summarization --

Out of scope --

@ctb
Copy link
Contributor Author

ctb commented Apr 29, 2020

@ctb
Copy link
Contributor Author

ctb commented Apr 29, 2020

@dkoslicki
Copy link
Collaborator

dkoslicki commented Apr 29, 2020

ETE3 is a great way for filtering/manipulating/etc. NCBI taxonomy that I've used FWIW

@ctb
Copy link
Contributor Author

ctb commented May 3, 2020

another piece of taxonomy related functionality - newick output, ref #915

@ctb
Copy link
Contributor Author

ctb commented May 4, 2020

random thoughts --

  • there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.
  • check to make sure what output krona and other programs (metacoder, etc.) take in, make sure we output that.

@luizirber
Copy link
Member

luizirber commented May 4, 2020

there is definitely a Kraken-style use case of "what lineages are in this metagenome", both with and without abundance. This could be something that the taxonomy module does after a gather run, and is a place where multi-level output ("just kingdom, please!" or "all the way to strain!") would be useful.

That's kind of what https://github.com/dib-lab/2019-12-12-sourmash_viz/blob/1223b736add63ea49108eecceb3f4bca85c78492/src/gather_to_opal.py is doing, and since the CAMI profiling output requires summarizing for each level the info is already there too (could also ask for a specific rank too, I guess)

@ctb
Copy link
Contributor Author

ctb commented May 4, 2020

@luizirber
Copy link
Member

luizirber commented May 5, 2020

yep - we have plenty of examples of this (sourmash lca summarize, for example). What I'm suggesting here most specifically is that we provide scripts that do this after gather is run, and separate out the "search for genomes" from the "summarize taxonomy" functionality in a way that currently is only done in ad hoc scripts in other projects. I have a prototype of this myself, over in charcoal.

But... that's what I described 🤣

This is exactly what happens in 2020-cami: I run gather first, and then use the gather output CSV to summarize taxonomy (taking a pre-calculated accession-to-taxid file for a specific database, because it takes a long time to process the full acc2taxid every time).
Here is the order they are called in the sourmash biobox:
https://github.com/sourmash-bio/bioboxes/blob/bf27b93d86fccc34540bf709996abc3076c00122/task_sourmash.pl#L21

@ctb
Copy link
Contributor Author

ctb commented May 5, 2020

@ctb
Copy link
Contributor Author

ctb commented May 7, 2020

I've started playing around with separating revindex and lineage information, over in dib-lab/charcoal, just_taxonomy.py.

I created a LineageDB class that just manages the lineages, and then refactored a few functions to use that instead of the internal lineage handling in LCA_Database.

A few observations from this strategy -

  • we need some new functions on LCA_Database - in particular, get_idents_for_hashval - to provide intermediaries. That can be added without a new major version of sourmash.
  • it's not immediately clear to me how to handle loading/saving of lineage DBs to JSON, and if/how to encapsulate LineageDB within LCA_Database, but I have some thoughts. Will play.

@ctb
Copy link
Contributor Author

ctb commented May 23, 2020

here, from silva, is a nice example of multiple taxonomies being available in a database.

With SILVA release 138 the Opens external link in new windowGenome Taxonomy Database (GTDB) has been adopted (Parks 2018). As a consequence of our efforts the following groups were prone to significant adaptations: Archaea, Enterobacterales, Deltaproteobacteria, Firmicutes, Clostridia. Betaproteobacteriales (formerly known as Betaproteobacteria) is now Burkholderiales, an order of Gammaproteobacteria. Epsilonproteobacteria vanishes within a new phylum Campilobacterota. Tenericutes are gone, they are now all part of Bacilli, inside Firmicutes.

Additionally, every sequence in the SILVA datasets carries the EMBL-EBI/ENA taxonomy assignment. Where available, the RDP and GTDB taxonomies are added for comparison. In releases <138 also the greengenes taxonomy was added.

@ctb
Copy link
Contributor Author

ctb commented May 29, 2020

mmseqs2 does nice stuff with multiple taxonomies, it looks like - search for "taxonomy" on https://github.com/soedinglab/mmseqs2/wiki

@ctb
Copy link
Contributor Author

ctb commented Jul 4, 2020

note gather-to-tax.py

@luizirber
Copy link
Member

luizirber commented Aug 7, 2020

I was talking with @bluegenes and after checking @erikyoung85 work in #1131 and the LineageDB from charcoal that @ctb mentioned, I would like to propose a few things:

  • Create TaxInfo, which will be pretty close to what LineageDB is doing in charcoal.
  • I think we need to keep lid_to_lineage and ident_to_lid in the saved files, and reconstruct/cache other useful derivations (lineage_to_lid, the inverse of lid_to_lineage, is one example).
  • SBT and LCA indices grab info from TaxInfo by ident. They are free to use other mappings they need (ident_to_name, idx_to_lid in the LCA case, for example).
  • Query TaxInfo by ident or by lid. Ideally just by ident, to avoid consistency problems if an ident -> lid mapping changes.
  • TaxInfo objects should be optional in an index.

On the Rust side it would be something like

type LineagePairs = BTreeMap<String, String>;

pub struct TaxInfo {
    lid_to_lineage: HashMap<u32, LineagePairs>
    ident_to_lid: HashMap<String, u32>
}

impl TaxInfo {
  fn get_lineage(&self, ident: &str) -> Option<LineagePairs> {}
}

And I think it can be serialized/deserialized with some processing to the current format.

On the Python side it will be similar to LineageDB.


Feature-wise, I think it is also important to be able to override a TaxInfo in the command line (or provide a different one for Index in Python/Rust). Especially since we "precompute" some desired ranks for LCA today, and they might not match other use cases (the CAMI profile format allows providing your own rank, for example). Critically, the important info in the TaxInfo class is ident, and as long as the idents are the same then two different TaxInfo objects can be interchangeable. This is a very lax definition, but I think it also fits well with other taxonomies (since, fundamentally, LineagePairs can be any list of (rank, value) that the user desires.

@luizirber
Copy link
Member

luizirber commented Aug 7, 2020

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?

$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}

Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

@ctb
Copy link
Contributor Author

ctb commented Aug 10, 2020

re #969 (comment), I am ok with doing the Rust API so that that is easy, but I would suggest getting the current functionality oxidized and merged first before a split. Otherwise you end up with too many moving parts.

@ctb
Copy link
Contributor Author

ctb commented Aug 10, 2020

I was trying to process an existing LCA index with this proposed format, and... is tests/test-data/lca/delmont-1.lca.json broken? I was planning to use ident_to_idx and idx_to_lid to generate ident_to_lid, but idx_to_lid seems to be missing the mapping for idx=1?

$ jq .ident_to_idx tests/test-data/lca/delmont-1.lca.json
{
  "TARA_ASE_MAG_00031": 0,
  "TARA_PSW_MAG_00136": 1
}

$ jq .idx_to_lid tests/test-data/lca/delmont-1.lca.json
{
  "0": 0
}

Or should lineage be optional? That is OK too, I guess. The ident would not be included in any taxonomic commands (classify, lca summarize and so on)

yes, lineage is optional in LCA databases.

@ctb
Copy link
Contributor Author

ctb commented Apr 23, 2021

random aside - splitting revindex and taxonomy would mean that we can update taxonomy without updating revindex, which currently is impossible in LCA databases.

@ctb
Copy link
Contributor Author

ctb commented Apr 23, 2021

(an interesting short research paper might be to compare and contrast gather and LCA approaches on the same database directly)

(and also GTDB and NCBI taxonomy result comparisons by finding the same k-mers, then applying different taxonomies)

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2022

In #1808, I'm adding SqliteIndex, which could also support LCA-style functionality as well as storing taxonomy information more generally, all in the same file; revisiting #1651, this is just the taxonomy table defined in tax/tax_utils.py.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
revisit_me An issue that needs attention and clarification
Projects
None yet
Development

No branches or pull requests

3 participants