Add support for compressed HDF5 files. #101

Open

DougRzz opened this issue Jul 17, 2018 · 11 comments

@DougRzz
Contributor

DougRzz commented Jul 17, 2018

I would like to expand the amount of data stored in the vaex HDF5 format (for example, more data sources, higher sampling rates, a greater number of channels, etc.). These very large datasets would be more expensive to store (especially as I'm using high-performance SSDs) and harder to manage. Is it possible to add some mild compression to the HDF5 file without a significant performance hit to the binning and histogramming functions in vaex?

On a slightly related side note.... Maarten, I thought your Scipy talk was excellent and it has got me thinking about how I could better visualise and interactively interrogate my time series data in 3D using IPyVolume & Vaex!! Thanks for publishing your work.

@maartenbreddels
Member

Just noticed I didn't answer yet.

Is it possible to add some mild compression to the HDF file without a significant performance hit to the binning and histogramming functions in Vaex?

Yes, I'm thinking of supporting this, also for chunked storage. It will have a performance penalty, and I'm not sure whether it should be supported in exporting (that will be a bit challenging), or whether it can be done with the HDF5 tools (h5copy?). Then, when stored compressed, vaex would only need a different code path for reading the compressed/chunked columns.
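
For reference, a hedged sketch of that second route, assuming h5py is installed (file names and options are illustrative, and this is roughly what h5repack -f GZIP=4 does on the command line): copy each dataset of an existing file into a chunked, gzip-compressed copy that a compression-aware reader could then open.

# A sketch, not vaex code: copy every dataset of an existing HDF5 file into a
# new file with chunked, gzip-compressed storage, keeping groups and attributes.
import h5py

def repack_compressed(src_path, dst_path, compression='gzip', level=4):
    with h5py.File(src_path, 'r') as src, h5py.File(dst_path, 'w') as dst:
        def copy_item(name, obj):
            if isinstance(obj, h5py.Dataset):
                kwargs = {}
                if obj.shape and obj.size:  # chunking needs non-scalar, non-empty data
                    kwargs = dict(chunks=True, compression=compression,
                                  compression_opts=level)
                out = dst.create_dataset(name, data=obj[...], **kwargs)
            else:
                out = dst.require_group(name)
            for key, value in obj.attrs.items():  # preserve column metadata
                out.attrs[key] = value
        src.visititems(copy_item)
        for key, value in src.attrs.items():
            dst.attrs[key] = value

repack_compressed('uncompressed.hdf5', 'compressed.hdf5')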

Maarten, I thought your Scipy talk was excellent and it has got me thinking about how I could better visualise and interactively interrogate my time series data in 3D using IPyVolume & Vaex!! Thanks for publishing your work.

thank you!

@DougRzz
Contributor Author

DougRzz commented Jan 8, 2019

HDF5 compression support in vaex would be great. But since submitting this issue, I had assumed it couldn't reasonably be done because it isn't possible to memory-map compressed data.

I did try to add HDF5 compression support to vaex (with help from one of my colleagues). Initial results looked good, especially with blosc lz4 compression (see the graph at the bottom showing performance vs. file size), but when using this modification in the real world I frequently ran into memory problems. In the end, I gave up on it. But here is some of the code I used to monkey-patch vaex:

# Import PyTables so its blosc compression filters (including lz4 and snappy)
# are registered with the HDF5 library.
import numpy as np
import vaex as vx
__import__('tables')  # <-- import PyTables; __import__ so that linters don't complain

def _map_hdf5_array_compressed(self, data, mask=None):
    # Compressed datasets cannot be memory mapped, so read them fully into memory.
    array = data[:]
    if mask is not None:
        mask_array = self._map_hdf5_array(mask)
        array = np.ma.array(array, mask=mask_array, shrink=False)
        assert array.mask is mask_array, "masked array was copied"
    return array

def _map_hdf5_array(self, data, mask=None):
    # Uncompressed, contiguous datasets are memory mapped from their file offset.
    offset = data.id.get_offset()
    if offset is None:
        raise Exception("column doesn't really exist in hdf5 file")
    dtype = data.dtype
    if "dtype" in data.attrs:
        dtype = data.attrs["dtype"]
        if dtype == 'utf32':
            dtype = np.dtype('U' + str(data.attrs['dlength']))
    #self.addColumn(column_name, offset, len(data), dtype=dtype)
    array = self._map_array(offset, dtype=dtype, length=len(data))
    if mask is not None:
        mask_array = self._map_hdf5_array(mask)
        array = np.ma.array(array, mask=mask_array, shrink=False)
        assert array.mask is mask_array, "masked array was copied"
    return array

#  Use this in the main code to monkey-patch vaex's HDF5 reader
if compressionType == 'None':
    vx.hdf5.dataset.Hdf5MemoryMapped._map_hdf5_array = _map_hdf5_array
else:
    vx.hdf5.dataset.Hdf5MemoryMapped._map_hdf5_array = _map_hdf5_array_compressed
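
For context, a hedged sketch of writing a blosc/lz4-compressed column with PyTables (file and array names are illustrative), i.e. the kind of dataset the patched _map_hdf5_array_compressed above reads back fully into memory:

import numpy as np
import tables

data = np.random.uniform(-1, 1, 5_000_000)
filters = tables.Filters(complevel=5, complib='blosc:lz4')  # mild, fast compression
with tables.open_file('compressed_column.h5', mode='w') as f:
    f.create_carray(f.root, 'x', obj=data, filters=filters)  # chunked + compressed on disk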

[Graph: benchmarkvaexdask_compression — vaex/dask benchmark of performance vs. file size for the tested compression settings]

@maartenbreddels
Member

Nice results, and thanks for sharing that! I'm pretty happy with that plot :)
I think the decompression could be done on the fly as well, but that will most likely not be efficient without proper chunking, although maybe some compression techniques can efficiently decompress from a particular offset.
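
To illustrate the chunking point with a hedged sketch (assuming h5py; sizes and names are illustrative): with chunked, compressed storage, slicing a dataset only decompresses the chunks that overlap the requested range, not the whole column.

import numpy as np
import h5py

with h5py.File('chunked.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(10_000_000, dtype='f8'),
                     chunks=(100_000,), compression='gzip')

with h5py.File('chunked.h5', 'r') as f:
    part = f['x'][250_000:350_000]  # touches only ~2 chunks; the rest stays compressed on disk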

@mkst
Contributor

mkst commented Jul 9, 2020

Hi all,

Has there been any progress on this topic?

We're looking at libraries that could compete with e.g. kdb+ in terms of fast random access. kdb+ does this via a column store of homogeneous data types (indexing is effectively pointer arithmetic), also supports compression (the data structure stores the start of each compressed block, so you only decompress the chunk(s) you need, then it's pointer arithmetic again), and now supports encryption of those chunks.

I think that HDF5 fits the bill for compression (happy to hear if you know of alternatives), and I wrote a simple filter to add AES encryption, so we just need some magic (potentially vaex) that would let us read this data and perform various analytics (finance related).

If no progress has been made, would it be possible to give a rough breakdown of what it would take? We might have a stab at it...

EDIT

I probably should have tested before commenting. While there is no support for compression in export_hdf5, if you write a dataset via h5py (with whatever filter), vaex seems to have no trouble reading it back.

Some example code:

import os

import pandas as pd
import numpy as np
import vaex as vx
import h5py

# create data
a = np.random.uniform(-1, 1, 5000000)
b = np.random.uniform(-1, 1, 5000000)
c = np.random.uniform(-1, 1, 5000000)
d = np.random.uniform(-1, 1, 5000000)
table = dict(zip(['a', 'b', 'c', 'd'], [a, b, c, d]))

print("pandas dataframe:")
print(pd.DataFrame(data=table))
print("")

# set keys for hdf5_aes filter
os.environ['HDF5_AES_IV'] = 'please_change_me'
os.environ['HDF5_AES_KEY'] = 'your_secret_aes256_key_goes_here'

filename = 'example.h5'
# write down using h5py
with h5py.File(filename, 'w') as f:
    for column_name, data in table.items():
        f.create_dataset(column_name,
                         data=data,
                         chunks=True,
                         compression=444)  # ID of the custom hdf5_aes filter; replace with 'gzip' to test without it

# read back in vaex
df = vx.open(filename)
print("vaex dataframe:")
print(df)

There does seem to be a slight performance overhead when reading back an HDF5 file written with h5py compared to one written with export_hdf5, but for our purposes this approach will suffice: write with h5py, read back with vaex.

@erwanp

erwanp commented Jun 2, 2022

Hello! @pmariotto and I are currently working with vaex and >2 TB .hdf5 files, which unfortunately do not fit in the disk space allocated on our clusters. Compression may be helpful even at the cost of performance.

@maartenbreddels Is this on the vaex agenda?
In particular, would I be able to open a compressed file and use selection filters on it?

@mkst did you keep using the solution above in the end?

Best

@JovanVeljanoski
Member

@erwanp you can use Parquet. I suspect it will be substantially smaller and it is vaex compatible, at the expense of some performance of course.

@erwanp

erwanp commented Jun 2, 2022

Thank you @JovanVeljanoski!
Just to confirm, because I haven't used Parquet before (but https://vaex.readthedocs.io/en/docs/example_io.html had all the answers!):

  • vaex.open() will work the same with HDF5 or Parquet files, no change here
  • we'd replace export_hdf5 with export_parquet
  • and the best thing: all the row selection happens within the vaex DataFrame, so our current code will work with no changes whether it's an HDF5 or Parquet file on disk (see the sketch below)

Starting to feel the power of Vaex here, not being bound to a particular filetype ;)
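
A minimal sketch of that switch, assuming export_parquet and vaex.open behave as in the I/O docs linked above (file and column names are illustrative):

import vaex

df = vaex.open('big_data.hdf5')          # existing uncompressed HDF5 file
df.export_parquet('big_data.parquet')    # Parquet output is compressed by default
df2 = vaex.open('big_data.parquet')      # same DataFrame API as before
selection = df2[df2.a > 0]               # hypothetical column 'a'; selections work unchanged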

@JovanVeljanoski
Member

Indeed, correct on all counts. For exporting, if you ever need the flexibility you can just use df.export('my_file.parquet') and the export method will do the right thing depending on the extension used. So if it's .hdf5 it will use HDF5, and so on.

Indeed, especially if you use a binary file format (Arrow, HDF5, Parquet): as long as the schema is the same, the behaviour should be the same!

@erwanp

erwanp commented Jun 2, 2022

Great!
I also noticed Parquet supports many compression methods (brotli, gzip, snappy...).
If I'm creating the Parquet file from vaex, can I specify which algorithm to use?

@JovanVeljanoski
Member

Vaex depends on Arrow for the Parquet I/O, so I would check their docs. If Arrow can do it, vaex can do it. The defaults are usually good enough, though.
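
If you ever do need a specific codec, a hedged sketch of going through pyarrow directly (codec names follow pyarrow's write_table; the file names and the in-memory round trip via to_arrow_table are illustrative and not ideal for multi-TB data in one go):

import pyarrow.parquet as pq
import vaex

df = vaex.open('example.hdf5')
pq.write_table(df.to_arrow_table(), 'example.parquet',
               compression='brotli')  # or 'snappy', 'gzip', 'zstd'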

@Tejalgnrml

Hi,
I would like to know which compression format vaex.export uses when writing a Parquet file. I am using wrangler to export a pandas DataFrame to Parquet and have observed that the Parquet file generated by vaex.export has performance benefits when running aggregations. I would like to know the compression format so that I can configure wrangler to export in a format similar to vaex.export's.
Thank you
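
As a practical aside, a hedged sketch (assuming pyarrow is installed; the file name is illustrative) of reading the codec straight out of a Parquet file's metadata rather than guessing from the writer's defaults:

import pyarrow.parquet as pq

meta = pq.ParquetFile('vaex_export.parquet').metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)           # metadata of one column chunk
    print(col.path_in_schema, col.compression)  # e.g. "a SNAPPY"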
