Add compression to embedding export #53

Closed · sacdallago opened this issue Aug 28, 2020 · 5 comments
Labels: enhancement (New feature or request), prio:low

sacdallago (Owner) commented Aug 28, 2020

An easy improvement when storing embeddings_file and reduced_embeddings_file: compression is supported by h5py out of the box. It may impact speed, but that's acceptable.

https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

Also, while at it: double-check that the stored datasets use the most fitting dtype.

P.S.: preference for gzip

P.P.S.: would be nice to run this as a test to see "how much it buys". Easy test: take an h5 file and copy all datasets into a new h5 file applying compression. Then we see if this is useful...
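
For reference, a minimal sketch of what the filter pipeline from the linked docs looks like in h5py (file name, dataset name, and array are hypothetical placeholders):

import h5py
import numpy as np

# hypothetical stand-in for one per-residue embedding
embedding = np.random.rand(300, 1024).astype(np.float32)

with h5py.File("embeddings.h5", "w") as f:
    # compression="gzip" enables HDF5's gzip filter for this dataset;
    # compression_opts is the gzip level (0-9, h5py defaults to 4)
    f.create_dataset(
        "some_sequence_id",
        data=embedding,
        dtype="float32",  # set the dtype explicitly rather than inheriting it
        compression="gzip",
        compression_opts=4,
    )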

sacdallago added the enhancement and prio:low labels Aug 28, 2020
sacdallago added this to the Version v0.1.4 milestone Aug 28, 2020
konstin (Collaborator) commented Sep 5, 2020

Initial check with a small fasta test case (sequence lengths [300, 544, 184, 1584, 518]): the embeddings file goes from 37 MB to 26 MB. That's significantly more than I expected and really good for such a small sample. The reduced embeddings go from 24 KB to 32 KB, but at that size it's not even a real benchmark.

What I can't tell from the documentation is whether the compression is applied across datasets or to each dataset individually. That will most likely decide whether it is at all useful for the reduced embeddings.

Code used:

import sys

import h5py
from tqdm import tqdm

lengths = []
with h5py.File(sys.argv[1], "r") as uncompressed, h5py.File(sys.argv[2], "w") as compressed:
    for key, value in tqdm(uncompressed.items()):
        # 3D datasets are per-residue embeddings; record the sequence length (axis 1)
        if len(value.shape) == 3:
            lengths.append(value.shape[1])
        # copy each dataset into the new file with gzip compression enabled
        compressed.create_dataset(key, data=value, compression="gzip")

print(lengths)
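
For what it's worth, HDF5's filter pipeline operates on the chunks of each individual dataset, so compression is applied per dataset, never across datasets; that would also explain why the tiny reduced embeddings grow, since per-dataset chunking and gzip overhead outweigh any savings. Compression is exposed as a per-dataset property, so this is easy to inspect (file name hypothetical):

import h5py

with h5py.File("compressed.h5", "r") as f:
    for key, dset in f.items():
        # compression and chunks are properties of each dataset,
        # not of the file as a whole
        print(key, dset.compression, dset.compression_opts, dset.chunks)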

sacdallago (Owner, Author) commented

This already sounds exceptionally good. These are actually the biggest files we produce, so if we can directly compress the embeddings, we don't even have to worry about zipping results via the pipeline after the run for space reasons.

The only other "worthy" big file produced is the pairwise distance matrix (which, on some internal swissprot vs human test, occupies upwards of 10 GB in CSV form; we might want to save this as h5 soon).
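
A sketch of what that could look like, assuming the matrix is available as a NumPy array (names and sizes are hypothetical):

import h5py
import numpy as np

# hypothetical pairwise distance matrix between n sequences
n = 1000
distances = np.random.rand(n, n).astype(np.float32)

with h5py.File("pairwise_distances.h5", "w") as f:
    # one compressed dataset instead of a multi-GB CSV; row-wise chunks
    # let consumers read slices without loading the whole matrix
    f.create_dataset(
        "pairwise_distances",
        data=distances,
        compression="gzip",
        chunks=(1, n),
    )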

konstin (Collaborator) commented Sep 23, 2020

New numbers! Reduced embeddings: 148 MB -> 207 MB (yes, this file has apparently grown through compression). Normal embeddings: 19 GB -> 17 GB; gzipping the whole file instead also yields 17 GB and was faster. Both took a couple of minutes.
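
For reproducibility, the whole-file comparison can be done with the standard library alone (paths hypothetical):

import gzip
import shutil

# compress the entire HDF5 file in one pass, for comparison with
# per-dataset compression inside the file
with open("embeddings.h5", "rb") as fin, gzip.open("embeddings.h5.gz", "wb") as fout:
    shutil.copyfileobj(fin, fout)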

konstin (Collaborator) commented Nov 29, 2020

Do you still want this now that we have half_precision?

sacdallago (Owner, Author) commented

I guess we can zip files, too. Closing for now.
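
Since half_precision came up: for completeness, a minimal sketch of combining a smaller dtype with compression when copying datasets (paths hypothetical; assumes downstream tasks tolerate float16 precision):

import sys

import h5py

with h5py.File(sys.argv[1], "r") as src, h5py.File(sys.argv[2], "w") as dst:
    for key, value in src.items():
        # casting to float16 halves the storage before gzip even runs
        dst.create_dataset(key, data=value, dtype="float16", compression="gzip")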
