Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better dataframe hashing #7331

Merged
merged 4 commits into from Sep 20, 2023
Merged

Better dataframe hashing #7331

merged 4 commits into from Sep 20, 2023

Conversation

kajarenc
Copy link
Collaborator

@kajarenc kajarenc commented Sep 14, 2023

Describe your changes

Add column names to hash for DataFrame
Remove memoization heuristic for dataframes and numpy array, because they could be modified in place

GitHub Issue Link (if applicable)

#7086

Testing Plan

  • Explanation of why no additional tests are needed
  • Unit Tests (JS and/or Python) DONE
  • E2E Tests
  • Any manual testing needed?

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

@kajarenc kajarenc changed the title remove memoization optimization for datafarme and numpy array, becaus… Better dataframe hashing Sep 15, 2023
@kajarenc kajarenc marked this pull request as ready for review September 15, 2023 18:07
if len(obj) >= _PANDAS_ROWS_LARGE:
obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
try:
return b"%s" % pd.util.hash_pandas_object(obj).sum()
column_hash_bytes = self.to_bytes(
pd.util.hash_pandas_object(obj.columns)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: An alternative to df.columns would be to hash df.dtypes. This also includes the types of the columns in addition to the column titles, and also info on. But I'm not sure if this is required since different types are probably also reflected in the actual hashed data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import pandas as pd

h = hashlib.new("md5")
self.update(h, obj.size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why do we need this for series and not for dataframe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, df.shape also added to dataframe hashing

pd.util.hash_pandas_object(obj.columns)
)
self.update(h, column_hash_bytes)
values_hash_bytes = self.to_bytes(pd.util.hash_pandas_object(obj))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series above uses: pd.util.hash_pandas_object(obj).values.tobytes() instead of self.to_bytes(pd.util.hash_pandas_object(obj)). This probably can be unified or?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea here is that pd.util.hash_pandas_object(obj) returns Series, so when we hash dataframe, return Series, and then recursively call the hashing mechanism for Series, which is separate from dataframe.

So if one day we change/optimize the way of how we hash Series, it will also automatically improve dataframe hashing.

pd.DataFrame(data={"C": [2, 3, 4], "A": [1, 2, 3]}),
False,
),
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add one case with two only slightly different dtypes:

(
    pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="UInt64")}),
    pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="Int64")}),
    False,
),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Copy link
Collaborator

@LukasMasuch LukasMasuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM 👍 My suggestion would be to wait after the 1.27 release to get that merged in. And maybe it makes sense to actually combine the logic for Series & Dataframe into one part with the only difference that if it is a dataframe, it also adds the column hash.

@kajarenc kajarenc merged commit 3b47351 into develop Sep 20, 2023
50 checks passed
@kajarenc kajarenc deleted the fix-7086 branch September 20, 2023 13:43
eric-skydio pushed a commit to eric-skydio/streamlit that referenced this pull request Dec 20, 2023
Add column names to hash for DataFrame
Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Mar 22, 2024
Add column names to hash for DataFrame
Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Apr 16, 2024
Add column names to hash for DataFrame
Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants