support gzip and/or zipped taxonomy CSVs? #2012

Closed
ctb opened this issue Apr 30, 2022 · 8 comments · Fixed by #2195 · May be fixed by #2202

Comments

@ctb
Contributor

ctb commented Apr 30, 2022

we should support .gz for our input CSVs.

we could also support .zip files that contain a SOURMASH-TAXONOMY.csv file or something. Or even have standard names for genbank, gtdb, and lins taxonomies.
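
As a rough illustration of the `.gz` part of this request, here is a minimal sketch, not sourmash's actual implementation. The helper names `open_taxonomy_csv` and `load_taxonomy_rows` are hypothetical, and the sketch assumes the CSV is UTF-8 with a header row. It detects gzip by its magic bytes rather than by file extension:

```python
import csv
import gzip


def open_taxonomy_csv(path):
    """Open a taxonomy CSV as text, whether or not it is gzip-compressed.

    Hypothetical helper: detection uses the two-byte gzip magic number
    instead of the file extension, so a misnamed file is still handled.
    """
    with open(path, "rb") as fp:
        is_gzipped = fp.read(2) == b"\x1f\x8b"
    if is_gzipped:
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def load_taxonomy_rows(path):
    """Return the CSV rows as a list of dicts keyed by the header row."""
    with open_taxonomy_csv(path) as fp:
        return list(csv.DictReader(fp))
```

With this approach the calling code does not need to know whether it was handed `tax.csv` or `tax.csv.gz`.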

@ctb
Contributor Author

ctb commented Aug 7, 2022

taxonomy csv.gz files implemented in #2178.

@ctb
Contributor Author

ctb commented Aug 12, 2022

so #2195 is implementing CSV loading from zipfiles, and it currently uses SOURMASH-TAXONOMY.csv as the filename inside the zip from which it will load taxonomy information. Is this a good name?

Separately, note that there is no sourmash tax command that takes in both a zipfile database and a taxonomy so at least for the moment this code doesn't simplify the CLI that much. However, it could simplify sourmash lca index a bit.
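
For readers following along, loading a CSV from a fixed member name inside a zipfile might look roughly like the sketch below. This is an illustration only and not the code in #2195. The function name `load_taxonomy_from_zip` is hypothetical, and the member name is the one floated in this comment, which was still under discussion:

```python
import csv
import io
import zipfile

# Member name taken from the discussion above; treat it as a placeholder.
TAXONOMY_MEMBER = "SOURMASH-TAXONOMY.csv"


def load_taxonomy_from_zip(zip_path, member=TAXONOMY_MEMBER):
    """Return taxonomy rows from `member` inside the zip, or None if absent."""
    with zipfile.ZipFile(zip_path) as zf:
        if member not in zf.namelist():
            return None
        with zf.open(member) as raw:
            # zf.open() yields bytes; wrap it so csv can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Returning `None` for a missing member lets the caller fall back to a separately supplied taxonomy file.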

@bluegenes
Contributor

bluegenes commented Aug 12, 2022

> so #2195 is implementing CSV loading from zipfiles, and it currently uses SOURMASH-TAXONOMY.csv as the filename inside the zip from which it will load taxonomy information. Is this a good name?

Idea to allow multiple taxonomies per db here: #2195 (comment)

> Separately, note that there is no sourmash tax command that takes in both a zipfile database and a taxonomy so at least for the moment this code doesn't simplify the CLI that much. However, it could simplify sourmash lca index a bit.

Right... I guess I'm thinking tax commands could take in a database and just read the taxonomy file(s) from it. That way folks can just use a single file, rather than needing to mess with two separate ones.

@ctb
Contributor Author

ctb commented Aug 12, 2022

> Right... I guess I'm thinking tax commands could take in a database and just read the taxonomy file(s) from it. That way folks can just use a single file, rather than needing to mess with two separate ones.

ok - so it doesn't simplify the CLI, but does remove the need to deal with an extra file. Got it!

@ctb
Contributor Author

ctb commented Aug 13, 2022

note, removed the SOURMASH-TAXONOMY name from #2195; will address over in #2154.

@ctb
Contributor Author

ctb commented Aug 14, 2022

Trying to figure out how distributing multiple taxonomies in a zip file would work at the command line.

The most obvious idea is:

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --gtdb

which would load GTDB-TAXONOMY.csv from gtdb-xyz.zip, vs

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --ncbi

which would load NCBI-TAXONOMY.csv from gtdb-xyz.zip.

Then we could potentially add --lins later on for #1813.

Alternative command-line switches would be --tax-type ncbi or something but I feel like --ncbi and --gtdb are probably simplest and easiest to remember.
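
To make the proposal concrete, here is a minimal argparse sketch of how `--gtdb` and `--ncbi` could select the member name to load. It is only an illustration of the idea above and not sourmash's CLI code. The program name, the long option names `--gather-csv` and `--taxonomy`, and the `MEMBER_FOR_TAXONOMY` mapping are assumptions made for the example:

```python
import argparse

# Hypothetical mapping from taxonomy type to the CSV name inside the zip.
MEMBER_FOR_TAXONOMY = {
    "gtdb": "GTDB-TAXONOMY.csv",
    "ncbi": "NCBI-TAXONOMY.csv",
}


def build_parser():
    parser = argparse.ArgumentParser(prog="tax-classify-sketch")
    parser.add_argument("-g", "--gather-csv", required=True)
    parser.add_argument("-t", "--taxonomy", required=True)
    # Exactly one taxonomy type must be chosen; giving both is an error.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--gtdb", dest="tax_type",
                       action="store_const", const="gtdb")
    group.add_argument("--ncbi", dest="tax_type",
                       action="store_const", const="ncbi")
    return parser


def taxonomy_member(argv):
    """Return the zip member name implied by the command-line flags."""
    args = build_parser().parse_args(argv)
    return MEMBER_FOR_TAXONOMY[args.tax_type]
```

Under this layout, adding `--lins` later would mean one more mapping entry and one more flag in the group.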

@ctb
Contributor Author

ctb commented Aug 14, 2022

@bluegenes thoughts ^^^ ?

ctb closed this as completed in #2195 Aug 15, 2022
@bluegenes
Contributor

I like --gtdb and --ncbi, especially since I can't see us integrating so many taxonomies that having an argument per tax type would be unwieldy.

--lins definitely useful when we get there!
