Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add generic support for gzipped and zipfile CSVs #2195

Merged
merged 22 commits into from
Aug 15, 2022
Merged

[MRG] add generic support for gzipped and zipfile CSVs #2195

merged 22 commits into from
Aug 15, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Aug 10, 2022

Add gzipped CSV input/output and further regularize CSV I/O throughout the codebase.

In brief,

  • MOST commands that write CSVs will now gzip the output if a '.gz' extension is provided. The exceptions are commands like multigather and tax * that build their own filenames.
  • MOST commands that read in CSVs will happily handle .gz input. lca index is the exception, see change lca subcommands to use tax taxonomy code/handling #2198 for why.
  • MOST commands will properly detect gzip input based only on content, not filename; the major exception is manifest loading, which is only used internally and does not yet use standard CSV handling. Manifest loading will happily load a csv.gz file based on filename.
  • importantly, picklists can now be gzipped CSVs 🎉

This PR also:

  • fixes sourmash prefetch and sourmash gather --save-prefetch-csv to use standard CSV output;
  • fixes sourmash sig check to use standard CSV loading (now including gz);
  • fixes sourmash sketch fromfile to use standard CSV loading (now including gz);
  • adds explicit tests for sourmash_args.FileOutputCSV ;

Implementation details

FIRST, this PR provides a new context manager, sourmash_args.FileInputCSV, that provides a csv.DictReader iterator on top of straight CSVs, gzipped CSVs, and zipfiles that contain a specially named file.

For example,

with FileInputCSV('taxonomy.csv.gz') as r:
   for row in r:
       # ...

does what you'd expect. More intriguingly,

with FileInputCSV('taxonomy.zip', default_zip_name='SOURMASH-TAXONOMY.csv') as r:
   for row in r:
       # ...

will look for SOURMASH-TAXONOMY.csv in the given zipfile and load rows from it.

This permits "nice" behavior such as:

with sourmash_args.FileInputCSV(filename,
                               default_zip_name="SOURMASH-TAXONOMY.csv") as r:

to support text CSV, gzip CSV, and zipfiles containing SOURMASH-TAXONOMY.csv.

In turn, this enables things like distributing taxonomies within zipfile databases per #2154.

The FileInputCSV context handler also supports CSVs with version lines (# KEY: VERSION), so can load manifests from zip files (although the manifest loading code does not use FileInputCSV, because of its reliance on ZipStorage).

This PR also adjusts FileOutputCSV to automatically gzip the CSV when .gz is present on the end of the output filename. Small change (3 lines), many impacts 😁

Fixes #2188
Fixes #2012
Fixes #1903
Addresses #2154 but doesn't fix it

TODO

  • explore passing a file handle into FileInputCSV for cases where we already know it's a zipfile - implemented but not yet used anywhere
  • write a lot more tests for various error conditions 😆
  • consider support for version lines
  • document and test the csv.gz writing code throughout the codebase

@ctb
Copy link
Contributor Author

ctb commented Aug 10, 2022

@bluegenes thoughts welcome :)

@codecov
Copy link

codecov bot commented Aug 10, 2022

Codecov Report

Merging #2195 (494747d) into latest (010718c) will increase coverage by 7.35%.
The diff coverage is 94.87%.

@@            Coverage Diff             @@
##           latest    #2195      +/-   ##
==========================================
+ Coverage   84.53%   91.88%   +7.35%     
==========================================
  Files         131      100      -31     
  Lines       15458    11232    -4226     
  Branches     2207     2218      +11     
==========================================
- Hits        13067    10321    -2746     
+ Misses       2092      612    -1480     
  Partials      299      299              
Flag Coverage Δ
python 91.88% <94.87%> (+<0.01%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/command_index.py 89.90% <ø> (ø)
src/sourmash/sourmash_args.py 93.20% <93.44%> (+<0.01%) ⬆️
src/sourmash/command_sketch.py 93.27% <100.00%> (-0.02%) ⬇️
src/sourmash/commands.py 89.05% <100.00%> (ø)
src/sourmash/manifest.py 95.02% <100.00%> (+0.12%) ⬆️
src/sourmash/picklist.py 91.27% <100.00%> (-0.23%) ⬇️
src/sourmash/sig/__main__.py 94.00% <100.00%> (-0.01%) ⬇️
src/sourmash/tax/tax_utils.py 98.25% <100.00%> (-0.03%) ⬇️
src/core/src/index/bigsi.rs
src/core/src/index/mod.rs
... and 29 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bluegenes
Copy link
Contributor

As discussed on slack :) -

re: SOURMASH-TAXONOMY — would you consider GTDB-TAXONOMY and NCBI-TAXONOMY instead, with the default being gtdb?

OR, somewhere in database info/metadata (which we don’t have yet, but have talked about), add the default for that database? In this case, I'm thinking about database info/metadata as database version (e.g. gtdb-rs207), sourmash signature version, creation date, etc -- and then adding default-taxonomy.

@ctb
Copy link
Contributor Author

ctb commented Aug 12, 2022

As discussed on slack :) -

re: SOURMASH-TAXONOMY — would you consider GTDB-TAXONOMY and NCBI-TAXONOMY instead, with the default being gtdb?

might be fine to support SOURMASH-TAXONOMY now and then (later) GTDB-TAXONOMY and NCBI-TAXONOMY via command-line switch.

Biggest pragmatic question is when we would start distributing taxonomies with the databases...

@ctb
Copy link
Contributor Author

ctb commented Aug 12, 2022

hmm - @bluegenes what do you think about me ditching the SOURMASH-TAXONOMY.csv for this PR, but keeping the generic support for gzip and zipfiles? We could then explore the Right Way to include taxonomies in another PR, which could focus more on the UX, while being built on top of the foundation in this PR?

tl;dr maybe smaller code changes better :)

@bluegenes
Copy link
Contributor

hmm - @bluegenes what do you think about me ditching the SOURMASH-TAXONOMY.csv for this PR, but keeping the generic support for gzip and zipfiles? We could then explore the Right Way to include taxonomies in another PR, which could focus more on the UX, while being built on top of the foundation in this PR?

tl;dr maybe smaller code changes better :)

That'd be ok!

@ctb ctb changed the title [WIP] add generic support for gzipped and zipfile CSVs [MRG] add generic support for gzipped and zipfile CSVs Aug 13, 2022
@ctb
Copy link
Contributor Author

ctb commented Aug 13, 2022

Ready for review & merge @sourmash-bio/devs!

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants