
Store pandas dataframe as single object in data frame #560

Merged
merged 3 commits into from Nov 25, 2021

Conversation

jan-janssen
Member

Converting the DataFrame to a dictionary first is terribly slow:

from pyiron_base import DataContainer
from pyiron_base import Project
import pandas
with open("test.csv", "w") as f:
    f.write(",a,b\n")
    for i in range(100000): 
        f.write(str(i) + ",1,2\n")
df = pandas.read_csv("test.csv", index_col=0)
pr = Project(".")
hdf = pr.create_hdf(".", "test")
dc = DataContainer(df.to_dict())
dc.to_hdf(hdf)

This takes forever, while with a small change it runs on the order of milliseconds:

dc = DataContainer(df)

@jan-janssen
Member Author

This fixes my issues with #463

@pmrv
Contributor

pmrv commented Nov 25, 2021

Have you checked that this actually stores the data? In my testing conversion from pandas is broken.

from pandas import DataFrame
from pyiron_base import DataContainer

df = DataFrame({'a': range(4), 'b': range(4, 8), 'c': range(8, 12)})
dc = DataContainer(df)
print(dc)
> DataContainer([])
dc = DataContainer(df.to_dict())
print(dc)
> DataContainer({'a': DataContainer([0, 1, 2, 3]), 'b': DataContainer([4, 5, 6, 7]), 'c': DataContainer([8, 9, 10, 11])})

The reason I didn't save it via pandas is that I expect people to put data into a pyiron_table that is HDF-serializable to us, but not to pandas (e.g. structures). I'll have a look today at #463 and #559.

@@ -692,8 +690,7 @@ def from_hdf(self, hdf=None, group_name=None):
     if hdf_version=="0.3.0":
         with self.project_hdf5.open("output") as hdf5_output:
             if "table" in hdf5_output.list_groups():
-                data = hdf5_output["table"].to_object().to_builtin()
-                self._pyiron_table._df = pandas.DataFrame(data)
+                self._pyiron_table._df = pandas.read_hdf(hdf5_output.file_name, hdf5_output.h5_path + "/table")
Contributor

This will break without pytables, won't it?

Member Author

Yes, I rely on pytables, but as long as that is the solution recommended by pandas, I guess it is the only option. Alternatively, we could pickle the object and store the pickled string, but I do not think that would be a better alternative.
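For context, the pickle alternative mentioned above could look like the following sketch (a minimal, hypothetical illustration, not the approach taken in this PR): the whole DataFrame is serialized to a single byte string that could then be stored as one HDF5 dataset.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": range(4), "b": range(4, 8)})

# Serialize the full frame to one byte string; this could be stored as a
# single opaque dataset, at the cost of portability and readability.
blob = pickle.dumps(df)

# Round-tripping restores the complete object, index and dtypes included.
restored = pickle.loads(blob)
assert restored.equals(df)
```

The downside, as noted, is that a pickled blob is opaque: it cannot be inspected with standard HDF5 tools and ties the stored data to the Python/pandas versions used to write it.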

Contributor

Then there's no need to put the DataFrame into the data container at all and we can just use DataFrame.to_hdf/read_hdf, I suppose. Then we also need to put tables back into our environment.yml.
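The direct round trip suggested here can be sketched as follows (file path and HDF5 key are illustrative, assuming the optional PyTables dependency, i.e. the `tables` package, is installed):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(4), "b": range(4, 8), "c": range(8, 12)})

# Write the frame straight to an HDF5 group; pandas delegates this to
# PyTables, which is why `tables` has to go back into environment.yml.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
df.to_hdf(path, key="output/table")

# Read it back without any DataContainer round trip.
restored = pd.read_hdf(path, key="output/table")
assert restored.equals(df)
```

This only works for columns pandas itself can serialize, which is exactly the limitation discussed below for non-tabular entries such as structures.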

Member Author

That is exactly what I do in the latest commit.

@pmrv pmrv Nov 25, 2021

Ah, I missed that. Looks good to me then, we can always come back to this if someone really wants to store "complex" data in their pyiron tables.

EDIT: The dependency still should be added, though.

@jan-janssen
Member Author

> The reason I didn't save it via pandas is that I expect people to put data into a pyiron_table that is HDF-serializable to us, but not to pandas (e.g. structures). I'll have a look today at #463 and #559.

That is definitely a limitation of the current solution. Currently I store the structures as JSON objects before inserting them into the pyiron table, but that requires extra work on the user side. The advantage is the size difference: even when I load only the first 10000 entries of the file in the example above, the old code resulted in a 124MB HDF5 file. So converting to a dictionary and storing it in the DataContainer is definitely not scalable for large pyiron tables.
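The JSON workaround described above could be sketched like this (a minimal, hypothetical example: the dict stands in for a structure, and the column names are made up). The non-tabular object becomes a plain string column, which pandas can store natively via to_hdf.

```python
import json

import pandas as pd

# Hypothetical stand-in for a structure that pandas cannot serialize itself.
structure = {"cell": [[1, 0, 0], [0, 1, 0], [0, 0, 1]], "elements": ["Fe", "Fe"]}

# Serialize it to a JSON string before inserting it into the table, so the
# resulting column is just text from pandas' point of view.
df = pd.DataFrame({"job_id": [0], "structure": [json.dumps(structure)]})

# The user has to deserialize manually when reading the table back.
restored = json.loads(df["structure"].iloc[0])
assert restored == structure
```

The extra serialize/deserialize step is exactly the "extra work on the user side" mentioned above.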

@pmrv
Contributor

pmrv commented Nov 25, 2021

I'm looking into the file size right now, and it seems like an issue with how we write to HDF5 in general. When I run h5repack on the file from the example, I directly get a ~20x decrease in file size.

@pmrv pmrv self-requested a review November 25, 2021 16:32
@jan-janssen jan-janssen merged commit 08dee54 into master Nov 25, 2021
@delete-merged-branch delete-merged-branch bot deleted the pyiron_table_store_using_pandas branch November 25, 2021 17:58