Store pandas dataframe as single object in data frame #560
Conversation
This fixes my issues with #463.
Have you checked that this actually stores the data? In my testing the conversion from pandas is broken:

```python
df = DataFrame({'a': range(4), 'b': range(4, 8), 'c': range(8, 12)})
dc = DataContainer(df)
print(dc)
# DataContainer([])

dc = DataContainer(df.to_dict())
print(dc)
# DataContainer({'a': DataContainer([0, 1, 2, 3]), 'b': DataContainer([4, 5, 6, 7]), 'c': DataContainer([8, 9, 10, 11])})
```

The reason I didn't save it via pandas is that I expect people to put data into a pyiron_table that is HDF-serializable to us, but not to pandas (e.g. structures). I'll have a look today at #463 and #559.
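The asymmetry can be checked with plain pandas, without pyiron: `to_dict()` produces a nested `{column: {index: value}}` mapping, and that mapping round-trips cleanly back into an identical DataFrame, which is presumably why the dict route works where the direct `DataContainer(df)` call does not. A minimal sketch:

```python
import pandas as pd

# Same frame as in the comment above.
df = pd.DataFrame({'a': range(4), 'b': range(4, 8), 'c': range(8, 12)})

# to_dict() yields a nested {column: {index: value}} mapping ...
as_dict = df.to_dict()

# ... which pandas can turn back into an identical DataFrame.
restored = pd.DataFrame(as_dict)
assert restored.equals(df)
```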
```diff
@@ -692,8 +690,7 @@ def from_hdf(self, hdf=None, group_name=None):
         if hdf_version=="0.3.0":
             with self.project_hdf5.open("output") as hdf5_output:
                 if "table" in hdf5_output.list_groups():
-                    data = hdf5_output["table"].to_object().to_builtin()
-                    self._pyiron_table._df = pandas.DataFrame(data)
+                    self._pyiron_table._df = pandas.read_hdf(hdf5_output.file_name, hdf5_output.h5_path + "/table")
```
This will break without pytables, or not?
Yes, I rely on pytables - but as long as that is the recommended solution by pandas I guess that is the only option. Alternatively we could also pickle the object and store the pickled string, but I do not think that is a better alternative.
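For comparison, the pickle alternative dismissed here would look roughly like the sketch below. It avoids the pytables dependency, but it ties the stored bytes to the pickle protocol and to the pandas version used at write time, which supports the point that it is not the better option:

```python
import pickle

import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(4, 8)})

# Serialize the whole DataFrame to an opaque byte string; this could be
# stored as a single HDF5 dataset, but it is not introspectable from the
# file and may fail to unpickle under a different pandas version.
blob = pickle.dumps(df)
restored = pickle.loads(blob)
assert restored.equals(df)
```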
Then there's no need to put the DataFrame into the data container at all and we can just use `DataFrame.to_hdf`/`read_hdf`, I suppose. Then we also need to put `tables` back into our `environment.yml`.
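The `to_hdf`/`read_hdf` round trip suggested here is one call in each direction, at the cost of the optional `tables` (pytables) dependency. A minimal sketch, assuming pytables is installed:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(4, 8)})

# to_hdf/read_hdf delegate to pytables under the hood, so this raises
# ImportError if the optional 'tables' package is missing.
path = os.path.join(tempfile.mkdtemp(), "table.h5")
df.to_hdf(path, key="table")
restored = pd.read_hdf(path, key="table")
assert restored.equals(df)
```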
That is exactly what I do in the latest commit.
Ah, I missed that. Looks good to me then, we can always come back to this if someone really wants to store "complex" data in their pyiron tables.
EDIT: The dependency still should be added, though.
That is definitely a limitation of the current solution. Currently I store the structures as JSON objects before inserting them into the pyiron table, but that requires extra work on the user side. The advantage is the size difference: even when I load only the first 10000 entries of the file in the example above, the old code resulted in a 124 MB HDF5 file. So converting to a dictionary and storing it in the DataContainer is definitely not scalable for large pyiron tables.
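The JSON workaround mentioned above can be sketched as follows; the structure contents here are invented placeholders, since the thread does not show the actual objects:

```python
import json

import pandas as pd

# Hypothetical stand-ins for non-tabular objects (e.g. structures); the
# real objects would need their own to-dict conversion first.
structures = [{"symbols": ["Fe", "Fe"], "a": 2.87},
              {"symbols": ["Al"], "a": 4.05}]

# Serialize each object to a JSON string so the column is plain text and
# survives a pandas HDF round trip without special handling.
df = pd.DataFrame({"structure": [json.dumps(s) for s in structures]})

# The user has to deserialize manually after loading; this is the
# "extra work on the user side" mentioned above.
restored = [json.loads(s) for s in df["structure"]]
assert restored == structures
```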
I'm looking into the file size right now and it seems like an issue with how we write to HDF5 in general. When I do a
The dependency is still included: https://github.com/pyiron/pyiron_base/blob/pyiron_table_store_using_pandas/setup.py#L45
Converting to dictionary is terribly slow:
This takes forever, while with a small change it runs in the order of milliseconds:
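The thread does not show which small change was made, but the general effect is easy to reproduce: `to_dict()` boxes every single cell into a Python object, while extracting whole columns as NumPy arrays stays in compiled code. A hedged benchmark sketch (the fast path here is an assumption for illustration, not the actual commit):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1_000_000).reshape(100_000, 10))

# Slow: to_dict() creates one Python int and one dict entry per cell.
t0 = time.perf_counter()
slow = df.to_dict()
t_slow = time.perf_counter() - t0

# Fast (assumed change): keep whole columns as NumPy arrays instead.
t0 = time.perf_counter()
fast = {c: df[c].to_numpy() for c in df.columns}
t_fast = time.perf_counter() - t0

print(f"to_dict: {t_slow:.3f}s, column arrays: {t_fast:.6f}s")
assert t_fast < t_slow
```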