Description
Files with a large number of columns seem to freeze when being converted to HDF5. On a test file with 5000 columns and 10 rows, conversion to arrow takes 0.19s and conversion to parquet takes 0.25s, while the HDF5 export progresses very slowly.
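For anyone who wants to reproduce this without the attached file, here is a minimal sketch that generates a comparably wide CSV and times the three exports. The column names (`col_0`, `col_1`, ...), the output file names, and the helper names `write_wide_csv` and `time_exports` are made up for illustration; only `vaex.open` and the `export_*` calls are the ones shown in this thread.

```python
import csv
import random
import time


def write_wide_csv(path, n_cols=5000, n_rows=10, seed=0):
    """Write a CSV with n_cols float columns (hypothetical names col_0, col_1, ...)."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(n_cols)])
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_cols)])
    return path


def time_exports(csv_path):
    """Time the arrow, parquet and HDF5 exports; requires vaex to be installed."""
    import vaex  # imported lazily so the generator above works without vaex

    df = vaex.open(csv_path)
    timings = {}
    # Output file names are placeholders chosen for this sketch.
    for name, export, target in [
        ("arrow", df.export_arrow, "wide.arrow"),
        ("parquet", df.export_parquet, "wide.parquet"),
        ("hdf5", df.export_hdf5, "wide.hdf5"),
    ]:
        start = time.perf_counter()
        export(target)
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `time_exports(write_wide_csv("wide.csv"))` should return a dict of elapsed seconds per format, which can be compared across vaex versions.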
Hi, I also have a use case with a lot of columns. I tried to reproduce this issue in my environment, and observed that after vaex 4.14, even arrow and parquet exports are much slower.
I used Python 3.9.18 on Ubuntu 22.04 and Windows 10. I installed vaex with conda from the conda-forge channel.
With vaex 4.13, only HDF5 export is slow. (Sorry, the results pasted below are from 4.12.0; I copied them from the wrong terminal.)
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.12.3',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.24s = 0.0m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 0.31s = 0.0m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 140.48s = 2.3m = 0.0h
But with vaex 4.14, arrow and parquet exports show a significant slowdown.
In [1]: import vaex
In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
'vaex-viz': '0.5.4',
'vaex-hdf5': '0.13.0',
'vaex-server': '0.8.1',
'vaex-astro': '0.9.3',
'vaex-jupyter': '0.8.1',
'vaex-ml': '0.18.1'}
In [3]: df = vaex.open("/tmp/test_file.csv")
In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 76.80s = 1.3m = 0.0h
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time : 79.64s = 1.3m = 0.0h
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time : 274.33s = 4.6m = 0.1h
This means that we can't work around the slow HDF5 export of wide dataframes by using arrow or parquet.
I would love to see this resolved because vaex seems like a good option for my use case.
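To put numbers on the regression, this small sketch computes the slowdown factor per format from the elapsed times in the two sessions above. The dictionary layout and the `slowdown` helper are just for illustration; the values are copied from the progress output.

```python
# Elapsed times in seconds, copied from the two sessions above,
# keyed by the reported vaex-core version.
timings = {
    "arrow": {"4.12.0": 0.24, "4.14.0": 76.80},
    "parquet": {"4.12.0": 0.31, "4.14.0": 79.64},
    "hdf5": {"4.12.0": 140.48, "4.14.0": 274.33},
}


def slowdown(old, new):
    """Return how many times longer the newer run took."""
    return new / old


factors = {fmt: slowdown(t["4.12.0"], t["4.14.0"]) for fmt, t in timings.items()}
for fmt, factor in factors.items():
    print(f"{fmt}: {factor:.1f}x slower with vaex-core 4.14.0")
```

On these numbers, arrow and parquet come out at roughly 320x and 257x slower, while HDF5 is about 2x slower.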
Most of the delay seems to originate from this line:
vaex/packages/vaex-hdf5/vaex/hdf5/writer.py
Line 73 in 6339705
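One way to check where the time goes is to run the export under the standard-library profiler. This is a generic sketch, not something specific to vaex: `profile_call` is a hypothetical helper name, and it assumes nothing about what the referenced line in `writer.py` does.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `top` entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Example usage with the dataframe from this thread (requires vaex):
#   _, report = profile_call(df.export_hdf5, "test_file.hdf5")
#   print(report)
```

The entries at the top of the report are where the cumulative time is spent, which should show whether the writer line above dominates.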
Software information
Vaex version (import vaex; vaex.__version__):
Additional information
I have uploaded a dataset to help reproduce this issue.
test_file.csv