Faster hdf reading #512

Merged
merged 4 commits into from Nov 11, 2021

@pmrv pmrv commented Nov 11, 2021

I'm waiting for pyiron to submit a large number of jobs right now, so I thought I'd have a look at how to make it faster.

One of the major bottlenecks is calling list_all()/list_groups()/list_nodes() to check what datasets/groups are inside the HDF5 files. In fact roughly 75%(!) of loading a small lammps job is spent in FileHDFio.list_all().

This makes two changes:

  1. Eagerly try to read data directly with h5io.read_hdf5 instead of first checking whether it exists and then reading it. This makes a simple read like job['output/generic/energy_pot'] faster by about a factor of 5, so that it takes roughly the same amount of time as calling h5io.read_hdf5 directly on a file.
  2. Use list_all() instead of calling list_nodes() and list_groups() separately. This saves opening the HDF5 file once.
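As a rough illustration of the two changes above, here is a toy sketch that does not touch HDF5 at all. `ToyStore`, `get_checked` and `get_eager` are hypothetical names invented for this example and are not part of pyiron or h5io; the store only counts how often the "file" would be opened, assuming every listing or read call costs one open.

```python
class ToyStore:
    """Hypothetical stand-in for an HDF5-backed store; counts file opens."""

    def __init__(self, groups, nodes):
        self._groups = set(groups)
        self._nodes = dict(nodes)
        self.opens = 0  # assumption: every call below opens the file once

    def list_nodes(self):
        self.opens += 1
        return sorted(self._nodes)

    def list_groups(self):
        self.opens += 1
        return sorted(self._groups)

    def list_all(self):
        self.opens += 1
        return {"groups": sorted(self._groups), "nodes": sorted(self._nodes)}

    def read(self, path):
        self.opens += 1
        return self._nodes[path]  # raises KeyError if the node is missing


def get_checked(store, path):
    """Old pattern: list first, then read (two opens for a leaf dataset)."""
    if path in store.list_nodes():
        return store.read(path)
    if path in store.list_groups():
        return ("group", path)
    raise KeyError(path)


def get_eager(store, path):
    """New pattern: try the read first, fall back to one combined listing."""
    try:
        return store.read(path)
    except KeyError:
        listing = store.list_all()  # one open instead of two separate listings
        if path in listing["groups"]:
            return ("group", path)
        raise
```

In this toy model a leaf read costs one open with `get_eager` and two with `get_checked`. The real savings depend on what the listing calls do internally, so treat the counts as illustrative only.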

Both together make loading a lammps job about 10% faster.

I want to mention that this

h5io.read_hdf5(j.project_hdf5.file_name, title=j.name + '/output/generic/energy_pot')

is still about twice as slow as directly doing

with h5py.File(j.project_hdf5.file_name) as f:
    a = f[j.name + '/output/generic/energy_pot']

so there's still room for improvement.
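For anyone who wants to reproduce this kind of comparison, a minimal timing helper could look like the sketch below. `time_reads` is a hypothetical helper written for this example, not something in pyiron or h5io; it only relies on the standard library `timeit` module and takes zero-argument callables, so the two reads above would be wrapped in lambdas or small functions.

```python
import timeit


def time_reads(readers, repeat=5, number=100):
    """Return the best per-call time in seconds for each named callable.

    readers: dict mapping a label to a zero-argument callable.
    """
    results = {}
    for name, fn in readers.items():
        # timeit.repeat returns one total time per repetition; take the best.
        best = min(timeit.repeat(fn, repeat=repeat, number=number))
        results[name] = best / number
    return results


# Usage sketch (assumes a loaded job `j` plus the h5io and h5py imports):
#
# def read_h5io():
#     return h5io.read_hdf5(j.project_hdf5.file_name,
#                           title=j.name + '/output/generic/energy_pot')
#
# def read_h5py():
#     with h5py.File(j.project_hdf5.file_name, 'r') as f:
#         # [()] reads the values; plain indexing only returns a dataset handle
#         return f[j.name + '/output/generic/energy_pot'][()]
#
# print(time_reads({'h5io': read_h5io, 'h5py': read_h5py}))
```

Taking the minimum over several repetitions reduces noise from file system caching, but the first access to a cold file will still look different from the numbers reported here.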

Most read accesses on FileHDFio will want to grab certain specific
datasets.  Optimize this by trying to just read_hdf5 the given path
straight away before trying other options.
@pmrv pmrv added the enhancement New feature or request label Nov 11, 2021
The new fast path already takes care of loading leaf datasets

@jan-janssen jan-janssen left a comment


Looks good to me

@pmrv pmrv merged commit 14b6ea2 into master Nov 11, 2021
@delete-merged-branch delete-merged-branch bot deleted the fast-hdf-access branch November 11, 2021 16:38