[MRG] add generic support for gzipped and zipfile CSVs #2195
@bluegenes thoughts welcome :)
Codecov Report
@@            Coverage Diff             @@
##           latest    #2195      +/-   ##
==========================================
+ Coverage   84.53%   91.88%   +7.35%
==========================================
  Files         131      100      -31
  Lines       15458    11232    -4226
  Branches     2207     2218      +11
==========================================
- Hits        13067    10321    -2746
+ Misses       2092      612    -1480
  Partials      299      299
As discussed on slack :) - re: OR, somewhere in database info/metadata (which we don't have yet, but have talked about), add the default for that database? In this case, I'm thinking about database info/metadata as database version (e.g. gtdb-rs207), sourmash signature version, creation date, etc -- and then adding default-taxonomy.
might be fine to support [...] Biggest pragmatic question is when we would start distributing taxonomies with the databases...
hmm - @bluegenes what do you think about me ditching the tl;dr maybe smaller code changes better :)
That'd be ok!
Ready for review & merge @sourmash-bio/devs!
lgtm
Add gzipped CSV input/output and further regularize CSV I/O throughout the codebase.
In brief,
- [...] `multigather` and `tax *` that build their own filenames.
- [...] `lca index` is the exception; see "change `lca` subcommands to use `tax` taxonomy code/handling" (#2198) for why.

This PR also:
- [...] `sourmash prefetch` and `sourmash gather --save-prefetch-csv` to use standard CSV output;
- [...] `sourmash sig check` to use standard CSV loading (now including gz);
- [...] `sourmash sketch fromfile` to use standard CSV loading (now including gz);
- [...] `sourmash_args.FileOutputCSV`;

Implementation details
FIRST, this PR provides a new context manager, `sourmash_args.FileInputCSV`, that provides a `csv.DictReader` iterator on top of straight CSVs, gzipped CSVs, and zipfiles that contain a specially named file.

For example, [...] does what you'd expect. More intriguingly, [...] will look for `SOURMASH-TAXONOMY.csv` in the given zipfile and load rows from it.

This permits "nice" behavior such as [...] to support text CSV, gzip CSV, and zipfiles containing `SOURMASH-TAXONOMY.csv`.
In turn, this enables things like distributing taxonomies within zipfile databases per #2154.
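
The loading behavior described above can be sketched with only the Python standard library. This is a minimal illustration, not the implementation in this PR: the helper name `open_csv_rows` and its `default_csv_name` parameter are assumptions made for the example, and the real `FileInputCSV` may differ in how it detects formats and handles encodings.

```python
import csv
import gzip
import io
import zipfile
from contextlib import contextmanager


@contextmanager
def open_csv_rows(filename, default_csv_name=None):
    """Yield a csv.DictReader over a plain CSV, a gzipped CSV, or a zipfile.

    Hypothetical helper for illustration only; not the sourmash API.
    """
    # Case 1: a zipfile that contains a specially named CSV member.
    if default_csv_name and zipfile.is_zipfile(filename):
        with zipfile.ZipFile(filename) as zf:
            with zf.open(default_csv_name) as raw:
                with io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
                    yield csv.DictReader(text)
        return

    # Case 2: a gzipped CSV, detected by its two-byte magic number.
    with open(filename, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"

    if is_gzip:
        fp = gzip.open(filename, "rt", encoding="utf-8", newline="")
    else:
        # Case 3: a straight text CSV.
        fp = open(filename, "r", encoding="utf-8", newline="")
    with fp:
        yield csv.DictReader(fp)
```

With a helper like this, calling code can be written once (`with open_csv_rows(path, default_csv_name="SOURMASH-TAXONOMY.csv") as reader: ...`) and accept any of the three input forms without branching on the file type.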
The `FileInputCSV` context manager also supports CSVs with version lines (`# KEY: VERSION`), so it can load manifests from zip files (although the manifest loading code does not use `FileInputCSV`, because of its reliance on `ZipStorage`).

This PR also adjusts `FileOutputCSV` to automatically gzip the CSV when `.gz` is present on the end of the output filename. Small change (3 lines), many impacts 😁

Fixes #2188
Fixes #2012
Fixes #1903
Addresses #2154 but doesn't fix it
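
To make the version-line and `.gz` output behavior described in the implementation details concrete, here is a small standard-library round trip. It is a sketch under stated assumptions, not the code from this PR: `open_csv_out` and `read_versioned_csv` are hypothetical helpers, and any header key used with them (for example `EXAMPLE-VERSION`) is made up for illustration.

```python
import csv
import gzip
import itertools


def open_csv_out(filename):
    """Open a text handle for CSV output, gzipping when the name ends in .gz.

    Hypothetical helper; it only mirrors the idea of choosing gzip by suffix.
    """
    if filename.endswith(".gz"):
        return gzip.open(filename, "wt", encoding="utf-8", newline="")
    return open(filename, "w", encoding="utf-8", newline="")


def read_versioned_csv(fp):
    """Read an optional '# KEY: VERSION' first line, then the CSV rows.

    Returns (key, version, rows); key and version are None when the file
    has no version line.
    """
    key = version = None
    first = fp.readline()
    if first.startswith("#"):
        # Split '# KEY: VERSION' at the first colon.
        key, _, version = first[1:].partition(":")
        key, version = key.strip(), version.strip()
        lines = fp
    else:
        # No version line: put the first line back in front of the rest.
        lines = itertools.chain([first], fp)
    return key, version, list(csv.DictReader(lines))
```

Choosing compression from the filename suffix keeps call sites unchanged: a caller passes `out.csv` or `out.csv.gz` and writes rows with `csv.writer` the same way in both cases.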
TODO
- [...] `FileInputCSV` for cases where we already know it's a zipfile - implemented but not yet used anywhere