
st.cache_resource does not detect different column headers in pandas dataframes #7086

Closed
naterush opened this issue Jul 27, 2023 · 8 comments
Labels: feature:cache, feature:st.dataframe, priority:P2, type:bug

Comments

naterush commented Jul 27, 2023

Checklist

  • I have searched the existing issues for similar issues.
  • I added a very descriptive title to this issue.
  • I have provided sufficient information below to help reproduce this issue.

Summary

If all of the values in two dataframes are the same, st.cache_resource considers them the same dataframe -- even if they have different column headers.

Clearly, dataframes with different headers are different dataframes!

Reproducible Code Example

import streamlit as st
import pandas as pd

@st.cache_resource
def get_value(df):
    import random
    return random.random()


df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

# See that these two are the same value, despite the dataframes clearly being different
# in a way that is easily checkable -- just hash the headers too!
st.write(get_value(df1))
st.write(get_value(df2))

Steps To Reproduce

  1. Run the Streamlit app
  2. See that the two values written are the same

Expected Behavior

The get_value function should return different values for dataframes with different headers.

Current Behavior

Dataframes are cached without considering column headers. I think the offending line is line 413 in /Users/nathanrush/temps/streamlit/venv/lib/python3.9/site-packages/streamlit/runtime/caching/hashing.py -- specifically, the column headers need to be included in the hash alongside the hash of the pandas object itself.
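
For reference, here is a minimal sketch of the collision (the .sum() aggregation mirrors what I believe hashing.py does; the exact surrounding code may differ):

import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

# hash_pandas_object covers the index and the row values, but not the
# column labels, so both dataframes collapse to the same combined hash.
print(pd.util.hash_pandas_object(df1).sum() == pd.util.hash_pandas_object(df2).sum())  # True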

Is this a regression?

  • Yes, this used to work in a previous version.

Debug info

  • Streamlit version: 1.25.0
  • Python version: 3.9.9
  • Operating System: Mac OSX Monterey

Additional Information

No response

@naterush naterush added status:needs-triage Has not been triaged by the Streamlit team type:bug Something isn't working labels Jul 27, 2023

naterush commented Jul 27, 2023

I'm using this code snippet to work around the issue for now:


import pickle
import streamlit as st
import pandas as pd

def hash_pandas_dataframe(df: pd.DataFrame) -> bytes:
    _PANDAS_ROWS_LARGE = 100000
    _PANDAS_SAMPLE_SIZE = 10000
    
    if len(df) >= _PANDAS_ROWS_LARGE:
        df = df.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
    try:
        # Make sure to include the column names in the hash, as well as the
        # values of the dataframe.
        header_bytes = b"%s" % pd.util.hash_pandas_object(df.columns).sum()
        value_bytes = b"%s" % pd.util.hash_pandas_object(df).sum()
        return header_bytes + value_bytes
    except TypeError:
        # Use pickle if pandas cannot hash the object for example if
        # it contains unhashable objects.
        return b"%s" % pickle.dumps(df, pickle.HIGHEST_PROTOCOL)

@st.cache_resource(hash_funcs={pd.DataFrame: hash_pandas_dataframe})
def get_value(df):
    import random
    return random.random()

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1], 'C': [2]})

st.write(get_value(df1))
st.write(get_value(df2))

Streamlit maintainers -- feel free to take this and drop it into the file above, if it's at all helpful. Or let me know if I missed anything!

naterush commented:

Perhaps there's a better choice than the .sum() function for combining hashes of each column?

  1. It's the reason that the column headers aren't considered by default, and why I need the workaround above.
  2. It's commutative, so any reordering of the per-row hashes sums to the same value, which seems collision-prone (see the sketch below).

Looking into better combining methods now -- hopefully this is helpful -- let me know if it's not! :)
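
A quick sketch of that commutativity point -- reordering rows (with their original index labels) leaves the summed hash unchanged:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
shuffled = df.iloc[[2, 0, 1]]  # rows reordered, original index labels kept

# The per-row hashes are a permutation of each other, so their sums
# collide even though the two dataframes are not equal.
print(pd.util.hash_pandas_object(df).sum() == pd.util.hash_pandas_object(shuffled).sum())  # True
print(df.equals(shuffled))  # False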

naterush commented:

Here are the tests I'm running, FWIW:

import pytest
import pandas as pd

def hash_pandas_dataframe(df):
    ...  # body as defined in the workaround above



TEST_HASHES = [
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        True
    ),
    # Extra column
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1, 2, 3]}),
        False
    ),
    # Different values
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 5]}),
        False
    ),
    # Different order
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'B': [1, 2, 3], 'A': [2, 3, 4]}),
        False
    ),
    # Different index
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}, index=[1, 2, 3]),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}, index=[1, 2, 4]),
        False
    ),
    # Missing column
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3]}),
        False
    ),
    # Different sort
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}).sort_values(by=['A']),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}).sort_values(by=['B'], ascending=False),
        False
    ),
    # Different headers
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'C': [2, 3, 4]}),
        pd.DataFrame(data={'A': [1, 2, 3], 'B': [2, 3, 4]}),
        False
    ),
    # Reordered columns
    (
        pd.DataFrame(data={'A': [1, 2, 3], 'C': [2, 3, 4]}),
        pd.DataFrame(data={'C': [2, 3, 4], 'A': [1, 2, 3]}),
        True
    ),
]

@pytest.mark.parametrize("df1, df2, expected", TEST_HASHES)
def test_hash_pandas_dataframe(df1, df2, expected):
    assert (hash_pandas_dataframe(df1) == hash_pandas_dataframe(df2)) == expected
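
To run these locally (assuming the snippet is saved as, say, test_df_hash.py -- the filename is just for illustration):

pytest -q test_df_hash.py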

naterush commented:

Update, my final function that passes all of these tests:

import hashlib
import pickle

import pandas as pd

def _get_dataframe_hash(df: pd.DataFrame) -> bytes:
    """
    Returns a hash for a pandas dataframe that is consistent across runs, notably including:
    1. The column names
    2. The values of the dataframe
    3. The index of the dataframe
    4. The order of all of these
    """
    try:
        return hashlib.md5(
            bytes(str(pd.util.hash_pandas_object(df.columns)), 'utf-8') +
            bytes(str(pd.util.hash_pandas_object(df)), 'utf-8')
        ).digest()
    except TypeError:
        # Use pickle if pandas cannot hash the object for example if
        # it contains unhashable objects.
        return b"%s" % pickle.dumps(df, pickle.HIGHEST_PROTOCOL)
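
Wiring it in is the same as the earlier workaround, e.g. @st.cache_resource(hash_funcs={pd.DataFrame: _get_dataframe_hash}).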

From basic profiling, this does not seem to have a performance disadvantage on large dataframes, and in practice I've seen it perform better than the current implementation. Hopefully this is helpful :)

@carolinedlu carolinedlu added the feature:cache Related to st.cache_data and st.cache_resource label Jul 31, 2023

kajarenc commented Aug 2, 2023

Thanks @naterush for opening this issue, and for your work!

Yes, this is a known issue: we compute dataframe hashes with pd.util.hash_pandas_object, which does not include information about the columns.

Hopefully, with hash_funcs you can now use a custom function to override the hashing behaviour, which should work for your case.

For the Streamlit library itself, there is an open discussion about how exactly we should hash dataframes so that we cover most cases without running into performance issues on large dataframes.

CC: @LukasMasuch since this is something we discussed back at the time.

@kajarenc kajarenc added priority:P2 feature:st.dataframe and removed status:needs-triage Has not been triaged by the Streamlit team labels Aug 2, 2023

naterush commented Aug 11, 2023

@kajarenc check out the hash function I specified above -- I think it's actually faster than the current function in Streamlit (while capturing the column headers as well).

import pandas as pd
import time
import hashlib
import numpy as np

small_df = pd.DataFrame({i: [j for j in range(100000)] for i in range(100)}) 
large_df = pd.DataFrame(np.random.rand(1024 ** 3 // 100, 10))

# Currently in Streamlit
og_hash = lambda df: b"%s" % pd.util.hash_pandas_object(df).sum()
# The workaround I propose above
new_hash = lambda df: hashlib.md5(bytes(str(pd.util.hash_pandas_object(df.columns)), 'utf-8') + bytes(str(pd.util.hash_pandas_object(df)), 'utf-8')).digest()

start = time.time(); og_hash(small_df); elapsed = time.time() - start; print(elapsed) # 0.06832695007324219
start = time.time(); new_hash(small_df); elapsed = time.time() - start; print(elapsed) # 0.07105803489685059

start = time.time(); og_hash(large_df); elapsed = time.time() - start; print(elapsed) # 1.1258411407470703
start = time.time(); new_hash(large_df); elapsed = time.time() - start; print(elapsed) # 1.0529148578643799

This clearly needs better profiling, but my feeling is that hashlib's optimized implementation of a common hash function is going to outperform most other things you can dream up.
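
For anyone re-running the numbers, here is a slightly more robust harness (a sketch using timeit; the dataframe is scaled down so it finishes quickly):

import hashlib
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))

og_hash = lambda d: b"%s" % pd.util.hash_pandas_object(d).sum()
new_hash = lambda d: hashlib.md5(
    bytes(str(pd.util.hash_pandas_object(d.columns)), 'utf-8')
    + bytes(str(pd.util.hash_pandas_object(d)), 'utf-8')
).digest()

# timeit runs each hash several times and reports the total, which
# smooths out the noise of a single time.time() measurement.
print('og_hash: ', timeit.timeit(lambda: og_hash(df), number=5))
print('new_hash:', timeit.timeit(lambda: new_hash(df), number=5))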

sfc-gh-jcarroll commented:

I think this is a dup / same underlying issue as #6236

kajarenc commented Oct 4, 2023

Fixed in #7331

@kajarenc kajarenc closed this as completed Oct 4, 2023