Better dataframe hashing #7331

kajarenc · 2023-09-14T15:09:51Z

Describe your changes

Add column names to the hash for DataFrames
Remove the memoization heuristic for DataFrames and NumPy arrays, because they can be modified in place
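
To illustrate the second point, here is a minimal sketch of why memoizing a hash for a mutable object is risky. It assumes the removed heuristic cached digests keyed on object identity (`id(obj)`). The helper names and the `bytearray` stand-in for a DataFrame or array are hypothetical and are not Streamlit code.

```python
import hashlib

# Hypothetical memo table: digest cached per object identity.
_memo = {}


def memoized_digest(buf):
    """Return a cached digest keyed on id(buf) (assumed heuristic)."""
    key = id(buf)  # identity stays the same when buf is mutated in place
    if key not in _memo:
        _memo[key] = hashlib.md5(bytes(buf)).digest()
    return _memo[key]


def fresh_digest(buf):
    """Always rehash the current contents."""
    return hashlib.md5(bytes(buf)).digest()


data = bytearray(b"\x01\x02\x03")
before_memo = memoized_digest(data)
before_fresh = fresh_digest(data)

data[0] = 0xFF  # in-place modification, same object identity

# The memoized digest is stale, while rehashing notices the change.
stale = memoized_digest(data) == before_memo
detected = fresh_digest(data) != before_fresh
```

With the memo in place, the mutated buffer still maps to the old digest, which is the failure mode the description refers to.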

GitHub Issue Link (if applicable)

#7086

Testing Plan

Explanation of why no additional tests are needed
Unit Tests (JS and/or Python) DONE
E2E Tests
Any manual testing needed?

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

LukasMasuch · 2023-09-15T18:53:01Z

lib/streamlit/runtime/caching/hashing.py

            if len(obj) >= _PANDAS_ROWS_LARGE:
                obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
            try:
-                return b"%s" % pd.util.hash_pandas_object(obj).sum()
+                column_hash_bytes = self.to_bytes(
+                    pd.util.hash_pandas_object(obj.columns)


nit: An alternative to df.columns would be to hash df.dtypes, which includes the column types in addition to the column names. But I'm not sure this is required, since different types are probably also reflected in the hashed data itself.
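
For context on why the column labels need to be hashed at all, the following sketch compares a values-only digest with one that also folds in `df.columns`. The function names are hypothetical, and `hashlib.md5` plus `.values.tobytes()` are used here only for illustration; this is not the exact code in `hashing.py`.

```python
import hashlib

import pandas as pd


def values_only_digest(df):
    """Digest of the per-row hashes only (values and index, no column labels)."""
    row_hashes = pd.util.hash_pandas_object(df)
    return hashlib.md5(row_hashes.values.tobytes()).digest()


def digest_with_columns(df):
    """Same as above, but the column labels are hashed first."""
    h = hashlib.md5()
    h.update(pd.util.hash_pandas_object(df.columns).values.tobytes())
    h.update(pd.util.hash_pandas_object(df).values.tobytes())
    return h.digest()


df_a = pd.DataFrame({"A": [1, 2, 3]})
df_b = pd.DataFrame({"B": [1, 2, 3]})

# Identical values under different column names collide without the labels.
same_without_columns = values_only_digest(df_a) == values_only_digest(df_b)
differs_with_columns = digest_with_columns(df_a) != digest_with_columns(df_b)
```

The first comparison reproduces the behavior reported in the linked issue; the second shows that adding the column hash separates the two frames.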

LukasMasuch · 2023-09-15T18:53:07Z

lib/streamlit/runtime/caching/hashing.py

            import pandas as pd

+            h = hashlib.new("md5")
+            self.update(h, obj.size)


nit: why do we need this for series and not for dataframe?

kajarenc · 2023-09-20

Good catch. df.shape is now also included in the DataFrame hash.

LukasMasuch · 2023-09-15T18:55:15Z

lib/streamlit/runtime/caching/hashing.py

+                    pd.util.hash_pandas_object(obj.columns)
+                )
+                self.update(h, column_hash_bytes)
+                values_hash_bytes = self.to_bytes(pd.util.hash_pandas_object(obj))


The Series branch above uses pd.util.hash_pandas_object(obj).values.tobytes() instead of self.to_bytes(pd.util.hash_pandas_object(obj)). Could these be unified?

kajarenc · 2023-09-20

My idea here is that pd.util.hash_pandas_object(obj) returns a Series. So when we hash a DataFrame, we get a Series back and then recursively call the hashing mechanism for Series, which is separate from the DataFrame one.

If we one day change or optimize how we hash Series, DataFrame hashing improves automatically.
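
The delegation idea could look roughly like the sketch below. It is an assumption-laden outline, not the merged implementation: `series_digest` and `dataframe_digest` are hypothetical names, the sampling step for large frames is omitted, and size or shape is folded in as a string for simplicity.

```python
import hashlib

import pandas as pd


def series_digest(s):
    """Single place that decides how a Series is hashed."""
    h = hashlib.md5()
    h.update(str(s.size).encode())
    h.update(pd.util.hash_pandas_object(s).values.tobytes())
    return h.digest()


def dataframe_digest(df):
    """Hash shape and column labels, then delegate to the Series path."""
    h = hashlib.md5()
    h.update(str(df.shape).encode())
    # hash_pandas_object returns a Series for both an Index and a DataFrame,
    # so both results can go through series_digest.
    h.update(series_digest(pd.util.hash_pandas_object(df.columns)))
    h.update(series_digest(pd.util.hash_pandas_object(df)))
    return h.digest()


df_1 = pd.DataFrame({"A": [1, 2, 3], "C": [2, 3, 4]})
df_2 = pd.DataFrame({"C": [2, 3, 4], "A": [1, 2, 3]})

equal_for_copy = dataframe_digest(df_1) == dataframe_digest(df_1.copy())
differs_for_reordered = dataframe_digest(df_1) != dataframe_digest(df_2)
```

Because `dataframe_digest` only composes calls to `series_digest`, any later change to the Series hashing applies to DataFrames without touching the DataFrame branch.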

LukasMasuch · 2023-09-15T18:58:47Z

lib/tests/streamlit/runtime/caching/hashing_test.py

+                pd.DataFrame(data={"C": [2, 3, 4], "A": [1, 2, 3]}),
+                False,
+            ),
+        ]


nit: Can you add one case where the two dtypes differ only slightly:

    (
        pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="UInt64")}),
        pd.DataFrame(data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="Int64")}),
        False,
    ),
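
As a rough illustration of what this case exercises, the sketch below folds each column's name and dtype name into the digest so that two frames with equal values but different dtypes are guaranteed to differ. `digest_with_dtypes` is a hypothetical helper and this is not how the merged code distinguishes the two frames; whether the value hashes alone differ for `UInt64` and `Int64` is left open here.

```python
import hashlib

import pandas as pd


def digest_with_dtypes(df):
    """Digest that includes column names and dtype names, e.g. 'C:UInt64'."""
    h = hashlib.md5()
    for name, dtype in df.dtypes.items():
        h.update(f"{name}:{dtype}".encode())
    h.update(pd.util.hash_pandas_object(df).values.tobytes())
    return h.digest()


df_unsigned = pd.DataFrame(
    data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="UInt64")}
)
df_signed = pd.DataFrame(
    data={"A": [1, 2, 3], "C": pd.array([1, 2, 3], dtype="Int64")}
)

dtypes_differ = str(df_unsigned["C"].dtype) != str(df_signed["C"].dtype)
digests_differ = digest_with_dtypes(df_unsigned) != digest_with_dtypes(df_signed)
```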

LukasMasuch

Overall LGTM 👍 My suggestion would be to wait until after the 1.27 release to merge this. It might also make sense to combine the logic for Series and DataFrame into one code path, with the only difference being that a DataFrame also adds the column hash.

kajarenc added 2 commits September 14, 2023 19:09

remove memoization optimization for datafarme and numpy array, becaus…

7fd7633

…e they could be modified in place also add column names to hash for dataframe

use better hashing mechanism for pandas dataframes and series

ca21926

kajarenc changed the title ~~remove memoization optimization for datafarme and numpy array, becaus…~~ Better dataframe hashing Sep 15, 2023

kajarenc marked this pull request as ready for review September 15, 2023 18:07

kajarenc added security-assessment-completed change:bugfix impact:internal labels Sep 15, 2023

LukasMasuch reviewed Sep 15, 2023

LukasMasuch approved these changes Sep 15, 2023

kajarenc added 2 commits September 16, 2023 23:14

add suggested test

11d7cf7

minor improvements and tests

c8f6d46

kajarenc merged commit 3b47351 into develop Sep 20, 2023
50 checks passed

kajarenc deleted the fix-7086 branch September 20, 2023 13:43

This was referenced Oct 4, 2023

st.cache_resource does not detect different column headers in pandas dataframes #7086

Closed

New caching methods cannot detect the change of df column name #6236

Closed

sfc-gh-dmatthews added impact:users and removed impact:internal labels Oct 16, 2023

eric-skydio pushed a commit to eric-skydio/streamlit that referenced this pull request Dec 20, 2023

Better dataframe hashing (streamlit#7331)

679e6cd

zyxue pushed a commit to zyxue/streamlit that referenced this pull request Mar 22, 2024

Better dataframe hashing (streamlit#7331)

c83f966

zyxue pushed a commit to zyxue/streamlit that referenced this pull request Apr 16, 2024

Better dataframe hashing (streamlit#7331)

8f84357
