
Performance issues with dictdump.dicttoh5 #3586

Closed
pierrepaleo opened this issue Jan 7, 2022 · 2 comments

@pierrepaleo
Contributor

pierrepaleo commented Jan 7, 2022

I have to export some metadata to an HDF5 file. This metadata includes a fairly large dict of str:

import numpy as np
from silx.io.dictdump import dicttoh5

mydict = {}
keys = np.arange(7500)
for key in keys:
    mydict[str(key)] = "file_%05d" % key
%timeit -r2 -n2 dicttoh5(mydict, "/tmp/test.h5")

It takes 9 seconds to export this 7500-key dict.

Now if I export arrays instead:

keys = np.array(list(map(int, mydict.keys())))
vals = np.array(list(mydict.values()))
mydict2 = {"indices": keys, "files": vals}
%timeit -r2 -n2 dicttoh5(mydict2, "/tmp/test.h5")

it takes 38 ms.

I am not sure why silx.io.dictdump.dicttoh5 takes so long to export a (simple, non-nested) dict.
From what I understand (and as suggested by the profiling below), a new dataset is recursively created for each dict key in order to handle nested dicts. If so, we could work around the problem by not creating a dataset for the "tree leaves", i.e. at the last recursion level.

(Profiling screenshots prof1 and prof2 are attached to the original issue.)
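
A quick way to confirm this layout (just a sketch; it assumes the file written by the first snippet above and lists its contents with plain h5py):

import h5py

# Each dict key ends up as its own dataset at the root of the file.
with h5py.File("/tmp/test.h5", "r") as f:
    print(len(f.keys()))  # expected: 7500, one dataset per key
    print(f["0"][()])     # a single short string, e.g. b'file_00000'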

In this case I need to store an "associative array" to keep the mapping between keys and values. Maybe I should switch to something different, like an array of tuples, before dumping?
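
For instance, a minimal sketch of the array-of-tuples idea (my own illustration, assuming dicttoh5 passes a compound-dtype array straight through to h5py; the field names and the S10 width are arbitrary choices):

import numpy as np
from silx.io.dictdump import dicttoh5

# Pack the whole key -> value mapping into one structured array,
# so that a single dataset is written instead of one per key.
records = np.array(
    [(int(k), v) for k, v in mydict.items()],
    dtype=[("index", np.int64), ("file", "S10")],  # "file_%05d" values are 10 chars
)
dicttoh5({"mapping": records}, "/tmp/test_records.h5")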

@vallsv
Contributor

vallsv commented Jan 7, 2022

I am pretty sure there is no choice: leaves are still datasets. Or were you thinking about using something else, like attributes?

Maybe there is a better data structure for your data? For example, do you really need a key-value structure? Or instead a few datasets with attributes, or associative arrays with columns?

For example, if your %05d pattern is well known, all this data could live inside a single dataset, with this key used as an index instead.
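
Something along these lines (just a sketch; the file path, dataset name, attribute, and fixed-width S10 dtype are arbitrary choices):

import h5py
import numpy as np

# The keys are 0..N-1, so the key can simply be the row index:
# row i of "files" holds the value mapped to the key str(i).
values = np.array([mydict[str(i)] for i in range(len(mydict))], dtype="S10")
with h5py.File("/tmp/test_indexed.h5", "w") as f:
    dset = f.create_dataset("files", data=values)
    dset.attrs["description"] = "row i is the file name for key i"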

@pierrepaleo
Contributor Author

Yes, I think the simplest solution is to change my data structure before dumping to HDF5. Although there is no predictable pattern in the values (the %05d was just an example), another structure can easily be implemented without a dict.

The approach of dicttoh5 is quite conservative and it should be kept as is.
