
Performance issues with dictdump.dicttoh5 #3586

Closed
pierrepaleo opened this issue Jan 7, 2022 · 2 comments

@pierrepaleo
Contributor

pierrepaleo commented Jan 7, 2022

I have to export some metadata to an HDF5 file. This metadata includes a fairly large dict of str:

import numpy as np
from silx.io.dictdump import dicttoh5

mydict = {}
keys = np.arange(7500)
for key in keys:
    mydict[str(key)] = "file_%05d" % key
%timeit -r2 -n2 dicttoh5(mydict, "/tmp/test.h5")

It takes 9 seconds to export this 7500-key dict.

Now if I export arrays instead:

keys = np.array(list(map(int, mydict.keys())))
vals = np.array(list(mydict.values()))
mydict2 = {"indices": keys, "files": vals}
%timeit -r2 -n2 dicttoh5(mydict2, "/tmp/test.h5")

it takes 38 ms.

I am not sure why silx.io.dictdump.dicttoh5 takes so long to export a (simple, non-nested) dict.
From what I understand (and as suggested by the profiling below), a new dataset is recursively created for each dict key in order to handle nested dicts. If so, we could work around the problem by not creating a dataset for the "tree leaves", i.e. at the last recursion level.

(Profiling screenshots prof1 and prof2 are attached to the original issue.)
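
A quick way to confirm this layout (just a sketch; it assumes the file written by the first snippet above and lists its contents with plain h5py):

import h5py

# Each dict key ends up as its own dataset at the root of the file.
with h5py.File("/tmp/test.h5", "r") as f:
    print(len(f.keys()))  # expected: 7500, one dataset per key
    print(f["0"][()])     # a single short string, e.g. b'file_00000'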

In this case I need to store an "associative array" to keep the mapping between keys and values. Maybe I should switch to something different, like an array of tuples, before dumping?
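
For instance, a minimal sketch of the array-of-tuples idea (my own illustration, assuming dicttoh5 passes a compound-dtype array straight through to h5py; the field names and the S10 width are arbitrary choices):

import numpy as np
from silx.io.dictdump import dicttoh5

# Pack the whole key -> value mapping into one structured array,
# so that a single dataset is written instead of one per key.
records = np.array(
    [(int(k), v) for k, v in mydict.items()],
    dtype=[("index", np.int64), ("file", "S10")],  # "file_%05d" values are 10 chars
)
dicttoh5({"mapping": records}, "/tmp/test_records.h5")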

@vallsv
Contributor

vallsv commented Jan 7, 2022

I am pretty sure there is no choice: leaves are still datasets. Or were you thinking about using something else, like attributes?

Maybe there is a better data structure for your data? For example, do you really need a key-value structure? Or instead a few datasets with attributes, or associative arrays with columns?

For example, if your %05d pattern is well known, all this data could live inside a single dataset, with this key used as an index instead.
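
Something along these lines (just a sketch; the file path, dataset name, attribute, and fixed-width S10 dtype are arbitrary choices):

import h5py
import numpy as np

# The keys are 0..N-1, so the key can simply be the row index:
# row i of "files" holds the value mapped to the key str(i).
values = np.array([mydict[str(i)] for i in range(len(mydict))], dtype="S10")
with h5py.File("/tmp/test_indexed.h5", "w") as f:
    dset = f.create_dataset("files", data=values)
    dset.attrs["description"] = "row i is the file name for key i"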

@pierrepaleo
Contributor Author

Yes, I think the simplest solution is to change my data structure before dumping to HDF5. Although there is no predictable pattern in the values (the %05d was just an example), another structure can easily be implemented without a dict.

The approach of dicttoh5 is quite conservative and it should be kept as is.
