[BUG-REPORT] Slow HDF5 conversion of file with large number of columns #2154

grafail · 2022-08-05T17:16:18Z

Description
Files with a large number of columns seem to freeze while being converted to HDF5. On a test file with 5000 columns and 10 rows, conversion to arrow takes 0.19s and conversion to parquet takes 0.25s, while the HDF5 export progresses very slowly.

import vaex

df = vaex.open("test_file.csv")
df.export_arrow("test_file.arrow", progress=True)
df.export_parquet("test_file.parquet", progress=True)
df.export_hdf5("test_file.hdf5", progress=True)

export(arrow) [########################################] 100.00% elapsed time  :     0.20s =  0.0m =  0.0h
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h
export(hdf5) [###############-------------------------] 38.01% estimated time:    49.82s =  0.8m =  0.0h

Most of the delay seems to originate from this line:

vaex/packages/vaex-hdf5/vaex/hdf5/writer.py, line 73 in 6339705:

shape = (N, ) + df._shape_of(name)[1:]
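
One way to check where the time goes is to run the export under the standard-library profiler. The following is a minimal sketch, not part of vaex: `profile_top` is a hypothetical helper name, and the commented usage assumes the same `test_file.csv` as above.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=10, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time).

    Hypothetical helper for illustration; it is not part of vaex.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so callers of the slow code appear first.
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


# Usage against the export from this report (requires vaex; shown as comments):
# import vaex
# df = vaex.open("test_file.csv")
# _, report = profile_top(df.export_hdf5, "test_file.hdf5")
# print(report)
```

If the per-column work is the bottleneck, the functions it calls should dominate the cumulative-time column of the report.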

Software information

Vaex version (import vaex; vaex.__version__):

{'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}

Vaex was installed via: pip / conda-forge / from source: pip
OS: Ubuntu 20.04

Additional information
I have uploaded a dataset to help reproduce this issue.
test_file.csv
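
If the attachment is not available, a file with the same shape (5000 columns, 10 rows) can be generated with the standard library. This is only a sketch: the column names (`col_0`, `col_1`, ...) and the random float values are assumptions and do not necessarily match the contents of the uploaded file.

```python
import csv
import random


def write_wide_csv(path, n_columns=5000, n_rows=10, seed=0):
    """Write a CSV with many columns and few rows; return the header list.

    Column names and values are made up for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    header = [f"col_{i}" for i in range(n_columns)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_columns)])
    return header


# Usage: write_wide_csv("test_file.csv") and then run the snippet from the description.
```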


JovanVeljanoski · 2022-08-08T22:28:12Z

Thanks - good catch! Let's see if we can improve it.

PRs are welcome of course!

ttk-kstn · 2023-09-14T12:56:19Z

Hi, I also have a use case with a lot of columns. I tried to reproduce this issue in my environment and observed that, starting with vaex 4.14, even arrow and parquet exports are much slower.

I used Python 3.9.18 on Ubuntu 22.04 and Windows 10. I installed vaex with conda from the conda-forge channel.

With vaex 4.13, only HDF5 export is slow. (Sorry for pasting 4.12.0 results; I copied them from the wrong terminal.)

In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h
 
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.31s =  0.0m =  0.0h
 
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   140.48s =  2.3m =  0.0h

But with vaex 4.14, arrow and parquet exports show a significant slowdown.


In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.13.0',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    76.80s =  1.3m =  0.0h
 
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    79.64s =  1.3m =  0.0h
 
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   274.33s =  4.6m =  0.1h

This means that we can't work around the slow HDF5 export of wide dataframes by using arrow or parquet.
I would love to see this resolved because vaex seems like a good option for my use case.

Thanks,

JovanVeljanoski added the performance label Aug 8, 2022
