
Store pandas dataframe as single object in data frame #560

Merged
merged 3 commits into from Nov 25, 2021

Conversation

jan-janssen
Member

Converting the DataFrame to a dictionary first is terribly slow:

from pyiron_base import DataContainer
from pyiron_base import Project
import pandas
with open("test.csv", "w") as f:
    f.write(",a,b\n")
    for i in range(100000): 
        f.write(str(i) + ",1,2\n")
df = pandas.read_csv("test.csv", index_col=0)
pr = Project(".")
hdf = pr.create_hdf(".", "test")
dc = DataContainer(df.to_dict())
dc.to_hdf(hdf)

This takes forever, while with a small change it runs on the order of milliseconds:

dc = DataContainer(df)

@jan-janssen
Member Author

This fixes my issues with #463

@pmrv
Contributor

pmrv commented Nov 25, 2021

Have you checked that this actually stores the data? In my testing conversion from pandas is broken.

from pandas import DataFrame
from pyiron_base import DataContainer

df = DataFrame({'a': range(4), 'b': range(4, 8), 'c': range(8, 12)})
dc = DataContainer(df)
print(dc)
> DataContainer([])
dc = DataContainer(df.to_dict())
print(dc)
> DataContainer({'a': DataContainer([0, 1, 2, 3]), 'b': DataContainer([4, 5, 6, 7]), 'c': DataContainer([8, 9, 10, 11])})

The reason I didn't save it via pandas is that I expect people to put data into a pyiron_table that is HDF-serializable to us, but not to pandas (e.g. structures). I'll have a look today at #463 and #559.

@@ -692,8 +690,7 @@ def from_hdf(self, hdf=None, group_name=None):
     if hdf_version=="0.3.0":
         with self.project_hdf5.open("output") as hdf5_output:
             if "table" in hdf5_output.list_groups():
-                data = hdf5_output["table"].to_object().to_builtin()
-                self._pyiron_table._df = pandas.DataFrame(data)
+                self._pyiron_table._df = pandas.read_hdf(hdf5_output.file_name, hdf5_output.h5_path + "/table")
Contributor

This will break without pytables, won't it?

Member Author

Yes, I rely on pytables, but as long as that is the solution recommended by pandas, I guess it is the only option. Alternatively, we could pickle the object and store the pickled string, but I do not think that would be a better alternative.
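For context, the pickle alternative mentioned above could look like the following sketch (a minimal, hypothetical illustration, not the approach taken in this PR): the whole DataFrame is serialized to a single byte string that could then be stored as one HDF5 dataset.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": range(4), "b": range(4, 8)})

# Serialize the full frame to one byte string; this could be stored as a
# single opaque dataset, at the cost of portability and readability.
blob = pickle.dumps(df)

# Round-tripping restores the complete object, index and dtypes included.
restored = pickle.loads(blob)
assert restored.equals(df)
```

The downside, as noted, is that a pickled blob is opaque: it cannot be inspected with standard HDF5 tools and ties the stored data to the Python/pandas versions used to write it.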

Contributor

Then there's no need to put the DataFrame into the data container at all and we can just use DataFrame.to_hdf/read_hdf, I suppose. Then we also need to put tables back into our environment.yml.
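The direct round trip suggested here can be sketched as follows (file path and HDF5 key are illustrative, assuming the optional PyTables dependency, i.e. the `tables` package, is installed):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(4), "b": range(4, 8), "c": range(8, 12)})

# Write the frame straight to an HDF5 group; pandas delegates this to
# PyTables, which is why `tables` has to go back into environment.yml.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
df.to_hdf(path, key="output/table")

# Read it back without any DataContainer round trip.
restored = pd.read_hdf(path, key="output/table")
assert restored.equals(df)
```

This only works for columns pandas itself can serialize, which is exactly the limitation discussed below for non-tabular entries such as structures.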

Member Author

That is exactly what I do in the latest commit.

@pmrv pmrv Nov 25, 2021

Ah, I missed that. Looks good to me then, we can always come back to this if someone really wants to store "complex" data in their pyiron tables.

EDIT: The dependency still should be added, though.

@jan-janssen
Member Author

> The reason I didn't save it via pandas is that I expect people to put data into a pyiron_table that is HDF-serializable to us, but not to pandas (e.g. structures). I'll have a look today at #463 and #559.

That is definitely a limitation of the current solution. Currently I store the structures as JSON objects before inserting them into the pyiron table, but that requires extra work on the user side. The advantage is the size difference: even when I load only the first 10000 entries of the file in the example above, the old code resulted in a 124MB HDF5 file. So converting to a dictionary and storing it in the DataContainer is definitely not scalable for large pyiron tables.
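The JSON workaround described above could be sketched like this (a minimal, hypothetical example: the dict stands in for a structure, and the column names are made up). The non-tabular object becomes a plain string column, which pandas can store natively via to_hdf.

```python
import json

import pandas as pd

# Hypothetical stand-in for a structure that pandas cannot serialize itself.
structure = {"cell": [[1, 0, 0], [0, 1, 0], [0, 0, 1]], "elements": ["Fe", "Fe"]}

# Serialize it to a JSON string before inserting it into the table, so the
# resulting column is just text from pandas' point of view.
df = pd.DataFrame({"job_id": [0], "structure": [json.dumps(structure)]})

# The user has to deserialize manually when reading the table back.
restored = json.loads(df["structure"].iloc[0])
assert restored == structure
```

The extra serialize/deserialize step is exactly the "extra work on the user side" mentioned above.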

@pmrv
Contributor

pmrv commented Nov 25, 2021

I'm looking into the file size right now, and it seems like an issue with how we write to HDF5 in general. When I run h5repack on the file from the example, I directly get a ~20x decrease in file size.

@pmrv pmrv self-requested a review November 25, 2021 16:32
@jan-janssen jan-janssen merged commit 08dee54 into master Nov 25, 2021
@delete-merged-branch delete-merged-branch bot deleted the pyiron_table_store_using_pandas branch November 25, 2021 17:58