
st.cache_resource does not detect different column headers in pandas dataframes #7086

Closed
naterush opened this issue Jul 27, 2023 · 8 comments
Labels: feature:cache, feature:st.dataframe, priority:P2, type:bug

Comments

naterush commented Jul 27, 2023

Checklist

  • I have searched the existing issues for similar issues.
  • I added a very descriptive title to this issue.
  • I have provided sufficient information below to help reproduce this issue.

Summary

If all of the values in two dataframes are the same, st.cache_resource considers them the same dataframe -- even if they have different column headers.

Clearly, dataframes with different headers are different dataframes!

Reproducible Code Example

import streamlit as st
import pandas as pd

@st.cache_resource
def get_value(df):
    import random
    return random.random()


df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

# See that these two are the same value, despite the dataframes clearly being different
# in a way that is easily checkable -- just hash the headers too!
st.write(get_value(df1))
st.write(get_value(df2))

Steps To Reproduce

  1. Run the Streamlit app
  2. See that the two values written are the same

Expected Behavior

The get_value function should return different values for dataframes with different headers.

Current Behavior

Dataframes are cached without considering column headers. I think the offending line is line 413 in /Users/nathanrush/temps/streamlit/venv/lib/python3.9/site-packages/streamlit/runtime/caching/hashing.py -- specifically, the column headers need to be included in the hash alongside the hash of the pandas object itself.
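
For reference, here is a minimal sketch of the collision (the .sum() aggregation mirrors what I believe hashing.py does; the exact surrounding code may differ):

import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

# hash_pandas_object covers the index and the row values, but not the
# column labels, so both dataframes collapse to the same combined hash.
print(pd.util.hash_pandas_object(df1).sum() == pd.util.hash_pandas_object(df2).sum())  # True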

Is this a regression?

  • Yes, this used to work in a previous version.

Debug info

  • Streamlit version: 1.25.0
  • Python version: 3.9.9
  • Operating System: Mac OSX Monterey

Additional Information

No response

@naterush naterush added status:needs-triage Has not been triaged by the Streamlit team type:bug Something isn't working labels Jul 27, 2023

naterush commented Jul 27, 2023

I'm using this code snippet to work around the issue for now:


import pickle
import streamlit as st
import pandas as pd

def hash_pandas_dataframe(df: pd.DataFrame) -> bytes:
    _PANDAS_ROWS_LARGE = 100000
    _PANDAS_SAMPLE_SIZE = 10000
    
    if len(df) >= _PANDAS_ROWS_LARGE:
        df = df.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
    try:
        # Make sure to include the column names in the hash, as well as the
        # values of the dataframe.
        header_bytes = b"%s" % pd.util.hash_pandas_object(df.columns).sum()
        value_bytes = b"%s" % pd.util.hash_pandas_object(df).sum()
        return header_bytes + value_bytes
    except TypeError:
        # Use pickle if pandas cannot hash the object for example if
        # it contains unhashable objects.
        return b"%s" % pickle.dumps(df, pickle.HIGHEST_PROTOCOL)

@st.cache_resource(hash_funcs={pd.DataFrame: hash_pandas_dataframe})
def get_value(df):
    import random
    return random.random()

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

st.write(get_value(df1))
st.write(get_value(df2))

Streamlit maintainers -- feel free to take this and drop it into the file above, if it's at all helpful. Or let me know if I missed anything!

naterush commented:

Perhaps there's a better choice than the .sum() function for combining hashes of each column?

  1. It's the reason that the column headers aren't considered by default, and why I need the workaround above.
  2. It's commutative, so any reordering of the per-row hashes sums to the same value, which seems collision-prone (see the sketch below).

Looking into better combining methods now -- hopefully this is helpful -- let me know if it's not! :)
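
A quick sketch of that commutativity point -- reordering rows (with their original index labels) leaves the summed hash unchanged:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
shuffled = df.iloc[[2, 0, 1]]  # rows reordered, original index labels kept

# The per-row hashes are a permutation of each other, so their sums
# collide even though the two dataframes are not equal.
print(pd.util.hash_pandas_object(df).sum() == pd.util.hash_pandas_object(shuffled).sum())  # True
print(df.equals(shuffled))  # False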

naterush commented:

Here are the tests I'm running, FWIW:

import pytest
import pandas as pd

def hash_pandas_dataframe(df):
    ...  # body as defined in the workaround above



TEST_HASHES = [
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        True
    ),
    # Extra column
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1, 2, 3]}),
        False
    ),
    # Different values
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 5]}),
        False
    ),
    # Different order
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'B': [1, 2, 3], 'A': [2, 3, 4]}),
        False
    ),
    # Different index
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}, index=[1, 2, 3]),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}, index=[1, 2, 4]),
        False
    ),
    # Missing column
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3]}),
        False
    ),
    # Different sort
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}).sort_values(by=['A']),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}).sort_values(by=['B'], ascending=False),
        False
    ),
    # Different headers
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'C': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        False
    ),
    # Reordered columns
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'C': [2, 3, 4]}),
        pd.DataFrame(data={'C': [2, 3, 4], 'A': [1, 2, 3]}),
        True
    ),
]

@pytest.mark.parametrize("df1, df2, expected", TEST_HASHES)
def test_hash_pandas_dataframe(df1, df2, expected):
    assert (hash_pandas_dataframe(df1) == hash_pandas_dataframe(df2)) == expected
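
To run these locally (assuming the snippet is saved as, say, test_df_hash.py -- the filename is just for illustration):

pytest -q test_df_hash.py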

naterush commented:

Update, my final function that passes all of these tests:

import hashlib
import pickle

import pandas as pd

def _get_dataframe_hash(df: pd.DataFrame) -> bytes:
    """
    Returns a hash for a pandas dataframe that is consistent across runs, notably including:
    1. The column names
    2. The values of the dataframe
    3. The index of the dataframe
    4. The order of all of these
    """
    try:
        return hashlib.md5(
            bytes(str(pd.util.hash_pandas_object(df.columns)), 'utf-8') +
            bytes(str(pd.util.hash_pandas_object(df)), 'utf-8')
        ).digest()
    except TypeError:
        # Use pickle if pandas cannot hash the object for example if
        # it contains unhashable objects.
        return b"%s" % pickle.dumps(df, pickle.HIGHEST_PROTOCOL)
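
Wiring it in is the same as the earlier workaround, e.g. @st.cache_resource(hash_funcs={pd.DataFrame: _get_dataframe_hash}).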

From basic profiling, this does not seem to have a performance disadvantage on large dataframes, and in practice I've seen it perform better than the current implementation. Hopefully this is helpful :)

@carolinedlu carolinedlu added the feature:cache Related to st.cache_data and st.cache_resource label Jul 31, 2023

kajarenc commented Aug 2, 2023

Thanks @naterush for opening this issue, and for your work!

Yes, this is a known issue: we compute dataframe hashes with pd.util.hash_pandas_object, which does not include information about the columns.

Hopefully, with hash_funcs you can now use a custom function to override the hashing behaviour, which should work for your case.

For the Streamlit library itself, there is an open discussion about how exactly we should hash dataframes so that we cover most cases without running into performance issues on large dataframes.

CC: @LukasMasuch since this is something we discussed back at the time.

@kajarenc kajarenc added priority:P2 feature:st.dataframe and removed status:needs-triage Has not been triaged by the Streamlit team labels Aug 2, 2023

naterush commented Aug 11, 2023

@kajarenc check out the hash function I specified above -- I think it's actually faster than the current function in Streamlit (while capturing the column headers as well).

import pandas as pd
import time
import hashlib
import numpy as np

small_df = pd.DataFrame({i: [j for j in range(100000)] for i in range(100)}) 
large_df = pd.DataFrame(np.random.rand(1024 ** 3 // 100, 10))

# Currently in Streamlit
og_hash = lambda df: b"%s" % pd.util.hash_pandas_object(df).sum()
# The workaround I propose above
new_hash = lambda df: hashlib.md5(bytes(str(pd.util.hash_pandas_object(df.columns)), 'utf-8') + bytes(str(pd.util.hash_pandas_object(df)), 'utf-8')).digest()

start = time.time(); og_hash(small_df); elapsed = time.time() - start; print(elapsed) # 0.06832695007324219
start = time.time(); new_hash(small_df); elapsed = time.time() - start; print(elapsed) # 0.07105803489685059

start = time.time(); og_hash(large_df); elapsed = time.time() - start; print(elapsed) # 1.1258411407470703
start = time.time(); new_hash(large_df); elapsed = time.time() - start; print(elapsed) # 1.0529148578643799

This clearly needs better profiling, but my feeling is that hashlib's optimized implementation of a common hash function is going to outperform most other things you can dream up.
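
For anyone re-running the numbers, here is a slightly more robust harness (a sketch using timeit; the dataframe is scaled down so it finishes quickly):

import hashlib
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))

og_hash = lambda d: b"%s" % pd.util.hash_pandas_object(d).sum()
new_hash = lambda d: hashlib.md5(
    bytes(str(pd.util.hash_pandas_object(d.columns)), 'utf-8')
    + bytes(str(pd.util.hash_pandas_object(d)), 'utf-8')
).digest()

# timeit runs each hash several times and reports the total, which
# smooths out the noise of a single time.time() measurement.
print('og_hash: ', timeit.timeit(lambda: og_hash(df), number=5))
print('new_hash:', timeit.timeit(lambda: new_hash(df), number=5))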

sfc-gh-jcarroll commented:

I think this is a dup / same underlying issue as #6236

kajarenc commented Oct 4, 2023

Fixed in #7331

@kajarenc kajarenc closed this as completed Oct 4, 2023