New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better dataframe hashing #7331
Better dataframe hashing #7331
Conversation
…e they could be modified in place also add column names to hash for dataframe
if len(obj) >= _PANDAS_ROWS_LARGE: | ||
obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0) | ||
try: | ||
return b"%s" % pd.util.hash_pandas_object(obj).sum() | ||
column_hash_bytes = self.to_bytes( | ||
pd.util.hash_pandas_object(obj.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: An alternative to df.columns
would be to hash df.dtypes
. This also includes the types of the columns in addition to the column titles, and also info on. But I'm not sure if this is required since different types are probably also reflected in the actual hashed data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import pandas as pd | ||
|
||
h = hashlib.new("md5") | ||
self.update(h, obj.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why do we need this for series and not for dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, df.shape
also added to dataframe hashing
pd.util.hash_pandas_object(obj.columns) | ||
) | ||
self.update(h, column_hash_bytes) | ||
values_hash_bytes = self.to_bytes(pd.util.hash_pandas_object(obj)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series above uses: pd.util.hash_pandas_object(obj).values.tobytes()
instead of self.to_bytes(pd.util.hash_pandas_object(obj))
. This probably can be unified or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea here is that pd.util.hash_pandas_object(obj)
returns Series
, so when we hash dataframe, return Series, and then recursively call the hashing mechanism for Series
, which is separate from dataframe.
So if one day we change/optimize the way of how we hash Series, it will also automatically improve dataframe hashing.
pd.DataFrame(data={"C": [2, 3, 4], "A": [1, 2, 3]}), | ||
False, | ||
), | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you add one case with two only slightly different dtypes:
(
pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="UInt64")}),
pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="Int64")}),
False,
),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM 👍 My suggestion would be to wait after the 1.27
release to get that merged in. And maybe it makes sense to actually combine the logic for Series & Dataframe into one part with the only difference that if it is a dataframe, it also adds the column hash.
Add column names to hash for DataFrame Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
Add column names to hash for DataFrame Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
Add column names to hash for DataFrame Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
Describe your changes
Add column names to hash for DataFrame
Remove memoization heuristic for dataframes and numpy array, because they could be modified in place
GitHub Issue Link (if applicable)
#7086
Testing Plan
Contribution License Agreement
By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.