sketch names in GTDB database use NCBI taxonomy #3006
Comments
huh. Thanks for finding, @ccbaumler! We used make-gtdb-taxonomy.py file, which you can see here: https://github.com/sourmash-bio/database-releases/pull/3/files#diff-f2b14d2992941e5da6adff7c90f2d2ed91b1cca96a5a84ee8ca0e6e92feb03d7 |
they probably changed it. they do that. |
I can rerun the script to create the lineage spreadsheet, but I don't see how a GenBank update could affect the GTDB database and lineage file. We created those at the same time and they should be consistent with each other. The only way a GenBank update could affect this is if we recreated one file and not the other. |
wait I don't understand. apologies if I'm missing something obvious 😓 GTDB is distinct from GenBank taxonomy. They use the same genome accessions, but the taxonomies are intentionally not concordant. when you say "the genbank link implies..." that is something that genbank changes from time to time. |
here, the genbank link matches what's in the GTDB zipfile, which is an assembly record named based on the GenBank classification. the GTDB folk have decided it's a different Escherichia, and the taxonomy file for GTDB assigns that GenBank accession to s__Escherichia ruysiae. Probably the most confusing thing I see here is that we did not actually rename the sketch in the zip file to match the GTDB classification, but kept it as the GenBank name... computers are hard 😭 |
Thank you for walking through everything! I see my own misunderstanding clearly now. Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier? "Computers are hard." - Titus |
sounds like work |
It's this kind of wisdom that keeps me showing up day after day. |
so, some slightly more sober thoughts ;). I agree that it's confusing that when you find matches with search/gather in a database named 'gtdb', you get the NCBI genome names, which basically use taxonomic names from NCBI. The options to fix this would appear to be, roughly -
I'm tempted to leave this issue open (maybe changing the issue title), or maybe create a new one, and then see what we feel like doing the next time GTDB does a release and we need to update. I agree it is mildly confusing as it is, but we've been doing this for years and you're the first person who has noticed - kudos on paying attention 😆 ! - so maybe it's a minor confusion in a tapestry of many larger confusions? :). As it is, GTDB mostly agrees with NCBI taxonomy at the family/genus level, too, so it's not that jarring a disconnect in practice. |
I think if we add more info into the signature name and discuss it in the documentation, it would alleviate any confusion. It could also show in the output that your return tax has some contradictions to investigate as well. Last night I was looking through the script @bluegenes linked and the metadata files that were used to create the lineage files. I think we could update the sig names throughout the database to include the
signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1 (GTDB_GCA_021307345.1 GTDB_Escherichia ruysiae)
Appendix
First few columns of
The two columns in
|
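A minimal sketch of the renaming proposed above: keep the existing NCBI-derived name and append the GTDB accession and species in parentheses. The function name `gtdb_annotated_name` and its arguments are hypothetical (not part of sourmash or the database-release scripts); the input values are the ones quoted in this thread.

```python
def gtdb_annotated_name(ncbi_name, gtdb_accession, gtdb_species):
    """Append GTDB accession and species to an NCBI-derived signature name.

    Hypothetical helper for illustration only; not a sourmash API.
    """
    species = gtdb_species
    # GTDB lineage entries prefix the species rank with "s__"; drop it if present.
    if species.startswith("s__"):
        species = species[len("s__"):]
    return f"{ncbi_name} (GTDB_{gtdb_accession} GTDB_{species})"


new_name = gtdb_annotated_name(
    "GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1",
    "GCA_021307345.1",
    "s__Escherichia ruysiae",
)
print(new_name)
```

Keeping the original name as the prefix means the accession stays the first whitespace-delimited token, so anything that keys on it should keep working; that is an assumption about downstream tooling, not something verified here.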
Back when I added NCBI genome calculation to
There is a whole discussion to be had on splitting "fast-changing" metadata (taxonomy, names) from "slow-changing" metadata/data (accession ID, the actual hashes in the minhash), and "name" is one of these fields that fall more in "fast" than "slow". But it is sort of the only one we have at the signature level. But, turns out we have a better solution for collections nowadays: manifests! We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in |
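The manifest idea can be sketched roughly as follows, assuming a simplified CSV stand-in for a manifest with a hypothetical `preferred_name` column next to `name`. This is not the real sourmash manifest layout or API, and the second row is a made-up placeholder; it only shows the "use the override if present, else fall back" rule.

```python
import csv
import io

# Simplified stand-in for a manifest; "preferred_name" is a hypothetical column.
MANIFEST_CSV = """\
name,preferred_name
"GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1",GCA_000398885.1 s__Escherichia ruysiae
GCA_000000000.0 placeholder genome,
"""


def display_name(row):
    """Return the override name when it is non-empty, else the stored name."""
    return row.get("preferred_name") or row["name"]


rows = list(csv.DictReader(io.StringIO(MANIFEST_CSV)))
names = [display_name(row) for row in rows]
for shown in names:
    print(shown)
```

The signature itself is never rewritten in this scheme: the "slow-changing" data stays as it is and only the manifest carries the "fast-changing" name.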
👍 |
Thanks for all the great explanations, ideas and details! One point I'm having trouble reconciling in my mind is making signatures and taxonomies from GTDB data but naming each signature with the NCBI taxonomic name. Intuitively, I would expect the sig "name" to be the GTDB taxonomic name. |
Separation of name and taxonomy! A sourmash superpower (in many ways) is that we can key multiple different taxonomies on the same accession. But the flip side is that the name (which contains the accession plus some human-readable text) is distinct from taxonomy. (IMO the content is what matters mostly, anyway. Names are ephemeral. Taxonomies change. Content is (closer to) eternal!) |
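As an illustration of keying several taxonomies on one accession, here is a small sketch using plain dictionaries. The table contents are just the two species names discussed in this thread, and `accession_of` / `species_for` are hypothetical helpers, not sourmash functions.

```python
# Two taxonomies keyed on the same accession (values taken from this thread).
TAXONOMIES = {
    "ncbi": {"GCA_000398885.1": "Escherichia coli"},
    "gtdb": {"GCA_000398885.1": "Escherichia ruysiae"},
}


def accession_of(sig_name):
    """Assume the accession is the first whitespace-delimited token of the name."""
    return sig_name.split(" ", 1)[0]


def species_for(sig_name, taxonomy):
    """Look up the species for a signature name under the chosen taxonomy."""
    return TAXONOMIES[taxonomy].get(accession_of(sig_name))


sig_name = "GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1"
print(species_for(sig_name, "ncbi"))
print(species_for(sig_name, "gtdb"))
```

The human-readable part of the name plays no role in the lookup, which is why the name and the taxonomy can disagree without breaking anything.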
Truth, the representative genomic signature is the most important part of the database. While names are ephemeral, we are still using them. My understanding is we are using them in contradictory ways by making the database name and the signature names from two different sources. And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB? Then when using alternate taxonomic naming databases, they would expect to see some opposing taxonomies... ? |
but 'gather' doesn't produce taxonomic output... What we'd really want to call the database is something like |
Please correct me if I misunderstood your comment about liking to see the output. I was referring to the fifth column of gather which returns the signature names of the database against the query. If I use the GTDB sourmash database but it is returning NCBI taxonomies in that output... I don't know. Doesn't seem to be presenting the output intuitively in my mind. Especially when the output could differ from GTDB like the initial comment on this thread points out. |
yes, I understand. someone needs to propose a specific solution, and then someone needs to do the work. The person doing the work generally gets a large say in which solution is chosen :). |
After talking through this with @ccbaumler a bit, I am endorsing this solution a bit more strongly -
|
if we're going to upgrade manifest contents to include
this would (in my opinionated opinion) be distinct from adding other metadata fields, tags, and so on, as explored in my comments on sourmash-bio/branchwater#18 |
I am working on this pangenome database idea at https://github.com/ctb/2022-database-covers/. I may have found an error in the taxonomic classification while trying to figure some stuff out with the scripts.
For example, from the gtdb-rs214.lineages.csv:
When I extract that signature from the GTDB db:
The GenBank link implies that this should be considered E. coli
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000398885.1/