[BUG-REPORT] Slow HDF5 conversion of file with large number of columns #2154

Open
grafail opened this issue Aug 5, 2022 · 2 comments

@grafail

grafail commented Aug 5, 2022

Description
Files with a large number of columns seem to freeze during conversion to HDF5. On a test file with 5000 columns and 10 rows, conversion to arrow takes 0.19s and conversion to parquet takes 0.25s, while the hdf5 export progresses very slowly.

import vaex

df = vaex.open("test_file.csv")
df.export_arrow("test_file.arrow", progress=True)
df.export_parquet("test_file.parquet", progress=True)
df.export_hdf5("test_file.hdf5", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.20s =  0.0m =  0.0h
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h
export(hdf5) [###############-------------------------] 38.01% estimated time:    49.82s =  0.8m =  0.0h

Most of the delay seems to originate from this line:

shape = (N, ) + df._shape_of(name)[1:]
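One way to check where the time goes is to run the export under a profiler. The following is a minimal sketch using only `cProfile` and `pstats` from the standard library; `profile_call` is a hypothetical helper name, and the vaex usage in the trailing comment assumes the `df` and file names from the snippet above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return a text report sorted by cumulative time.

    `profile_call` is an illustrative helper, not part of vaex.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show only the `top` entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


# Intended usage (assumes vaex is installed and test_file.csv exists):
#   import vaex
#   df = vaex.open("test_file.csv")
#   print(profile_call(df.export_hdf5, "test_file.hdf5"))
```

Sorting by cumulative time puts the calls that dominate the export near the top of the report, which should make it easier to confirm or rule out a particular line.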

Software information

  • Vaex version (import vaex; vaex.__version__):
{'vaex': '4.11.1', 'vaex-core': '4.11.1', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
  • Vaex was installed via: pip / conda-forge / from source: pip
  • OS: Ubuntu 20.04

Additional information
I have uploaded a dataset to help reproduce this issue.
test_file.csv
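A file with the same shape (5000 columns, 10 rows) can also be generated locally. This is a minimal sketch using only the standard library; the helper name `write_wide_csv`, the column names, and the random float values are assumptions and will not match the contents of the uploaded file.

```python
import csv
import random


def write_wide_csv(path, n_columns=5000, n_rows=10, seed=0):
    """Write a CSV with many float columns and few rows; return the header.

    Column names and values are made up for illustration.
    """
    rng = random.Random(seed)
    header = [f"col_{i}" for i in range(n_columns)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([rng.random() for _ in range(n_columns)])
    return header


if __name__ == "__main__":
    # Hypothetical output file name, chosen so it does not overwrite test_file.csv.
    write_wide_csv("wide_test_file.csv")
```

Changing `n_columns` while keeping `n_rows` fixed makes it possible to see how export time grows with the number of columns.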

@JovanVeljanoski
Member

Thanks - good catch! Let's see if we can improve it.

PRs are welcome of course!

@ttk-kstn

Hi, I also have a use case with a lot of columns. I tried to reproduce this issue in my environment and observed that from vaex 4.14 onward, even arrow and parquet exports are much slower.

I used Python 3.9.18 on Ubuntu 22.04 and Windows 10, and installed vaex with conda from the conda-forge channel.

With vaex 4.13, only HDF5 export is slow. (Sorry, the results pasted below are from 4.12.0. I copied from the wrong terminal.)

In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.12.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.24s =  0.0m =  0.0h
 
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :     0.31s =  0.0m =  0.0h
 
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   140.48s =  2.3m =  0.0h

But with vaex 4.14, arrow and parquet exports show a significant slowdown.


In [1]: import vaex

In [2]: vaex.__version__
Out[2]:
{'vaex-core': '4.14.0',
 'vaex-viz': '0.5.4',
 'vaex-hdf5': '0.13.0',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.3',
 'vaex-jupyter': '0.8.1',
 'vaex-ml': '0.18.1'}

In [3]: df = vaex.open("/tmp/test_file.csv")

In [4]: df.export_arrow("test_file.arrow", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    76.80s =  1.3m =  0.0h
 
In [5]: df.export_parquet("test_file.parquet", progress=True)
export(arrow) [########################################] 100.00% elapsed time  :    79.64s =  1.3m =  0.0h
 
In [6]: df.export_hdf5("test_file.hdf5", progress=True)
export(hdf5) [########################################] 100.00% elapsed time  :   274.33s =  4.6m =  0.1h

This means that we can't work around the slow HDF5 export of wide dataframes by using arrow or parquet.
I would love to see this resolved because vaex seems like a good option for my use case.

Thanks,
