New caching methods cannot detect the change of df column name #6236

Closed
3 of 5 tasks
PierXuY opened this issue Mar 6, 2023 · 4 comments
Labels: feature:cache (Related to st.cache_data and st.cache_resource), feature:cache-hash-func, priority:P2, status:confirmed (Bug has been confirmed by the Streamlit team), type:bug (Something isn't working)

Comments


PierXuY commented Mar 6, 2023

Checklist

  • I have searched the existing issues for similar issues.
  • I added a very descriptive title to this issue.
  • I have provided sufficient information below to help reproduce this issue.

Summary

With @st.cache_data, when a function takes a pd.DataFrame as an input parameter, a change to the column names is not detected. After the column names of df are changed, the wrong (stale) df is returned!

Reproducible Code Example


import streamlit as st
import pandas as pd
import numpy as np
# Tested with:
#   streamlit 1.19.0
#   pandas 1.5.3
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,3],'c':[1,2,3]})

@st.cache_data
def show(df):
    return df

st.code("""
@st.cache_data
def show(df):
    return df
""")

columns_name = st.text_input("New column name")

if columns_name:
    try:
        df.columns = columns_name.split(",")
    except ValueError:  # raised when the number of names does not match
        df.columns = ['A','B','C']
        st.error('Invalid, please enter three column names separated by commas, such as "q, w, e".')

if st.button('add a column'):
    df['new'] = 4

st.write("st.dataframe(df)")
st.dataframe(df)

st.write("st.dataframe(show(df))")
st.dataframe(show(df))

Steps To Reproduce

1. Run the code above with Streamlit.
2. Enter three column names in the text box, separated by commas.
3. The df returned by the show function is unchanged: the df with the new column names has not been cached again.
4. Click "add a column". The output of st.dataframe(show(df)) changes, but it is still not the correct data.

Expected Behavior

A change to the df column names should be detected as well, so that the cached function is recomputed.

Current Behavior

No response

Is this a regression?

  • Yes, this used to work in a previous version.

Debug info

  • Streamlit version:1.19.0
  • Python version: 3.9.13
  • Operating System:
  • Browser:
  • Virtual environment:

Additional Information

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
@PierXuY PierXuY added status:needs-triage Has not been triaged by the Streamlit team type:bug Something isn't working labels Mar 6, 2023
@PierXuY PierXuY changed the title from "Use @st.cache_data, when the input parameter of the function is pd.DataFrame, the change of the column name cannot be detected, @st.cache_resource is similar." to "st.cache_data cannot detect the change of df column name,st.cache_resource is similar." Mar 6, 2023
@carolinedlu carolinedlu added the feature:cache Related to st.cache_data and st.cache_resource label Mar 7, 2023
@carolinedlu carolinedlu changed the title from "st.cache_data cannot detect the change of df column name,st.cache_resource is similar." to "New caching methods cannot detect the change of df column name" Mar 10, 2023
@carolinedlu
Collaborator

Hey @PierXuY, thank you so much for flagging this behavior! Our team is investigating further.

@LukasMasuch LukasMasuch added the status:confirmed Bug has been confirmed by the Streamlit team label Mar 17, 2023
@LukasMasuch
Collaborator

LukasMasuch commented Mar 17, 2023

Thanks for reporting this. I was able to reproduce it here. The reason is that we hash the dataframe based on its ID. This probably means that changes applied in place do not produce a new ID and therefore do not trigger a recompute of the cached function.

A potential workaround is to create a copy of the dataframe and apply the renaming to the copy.
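To illustrate why an in-place change goes unnoticed while a copy does not, here is a minimal sketch of an identity-keyed cache. It is a toy model only, not Streamlit's actual implementation: the `_cache` dict, the `cached_show` function, and the dict-of-lists stand-in for a dataframe are all assumptions made for the example.

```python
# Toy model of a cache keyed by object identity.
# NOT Streamlit's real implementation; all names here are hypothetical.
_cache = {}


def cached_show(table):
    """Return a cached snapshot of `table`, keyed only by id(table)."""
    key = id(table)  # identity of the object, not its content
    if key not in _cache:
        # Cache miss: store a snapshot of the current content.
        _cache[key] = {name: list(values) for name, values in table.items()}
    return _cache[key]


# A dict of lists stands in for a dataframe (column name -> values).
table = {"a": [1, 2, 3], "b": [1, 2, 3]}
first = cached_show(table)

# In-place rename: same object, same id, so the stale snapshot comes back.
table["A"] = table.pop("a")
stale = cached_show(table)

# Workaround: apply the rename to a new object, which has a different id.
renamed = {name.upper(): values for name, values in table.items()}
fresh = cached_show(renamed)

print(list(stale))    # column names from the stale snapshot
print(sorted(fresh))  # column names from the recomputed snapshot
```

The second call returns the snapshot taken before the rename because the key never changed. Only the new object causes a cache miss.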

@snehankekre
Collaborator

snehankekre commented Jun 15, 2023

Hey @PierXuY 👋

Good news! In the next Streamlit release (1.24.0), we're bringing back hash_funcs, this time for the new caching primitives 😄 (added by #6502).

That lets you override Streamlit's default hashing of dataframes, which is based on their id. Here's an example you can use to verify that it works. It is available in streamlit-nightly==1.23.2.dev20230614:

import pandas as pd
import streamlit as st

df = pd.DataFrame({'a':[1,2,3],'b':[1,2,3],'c':[1,2,3]})

@st.cache_data(hash_funcs={pd.core.frame.DataFrame: lambda x: str(x)})
def show(df):
    return df

columns_name = st.text_input("New column name")

if columns_name:
    try:
        df.columns = columns_name.split(",")
    except ValueError:  # raised when the number of names does not match
        df.columns = ['A','B','C']
        st.error('Invalid, please enter three column names separated by commas, such as "q, w, e".')

if st.button('add a column'):
    df['new'] = 4

st.write("st.dataframe(df)")
st.dataframe(df)

st.write("st.dataframe(show(df))")
st.dataframe(show(df))

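Note: the hash func used above, str, may not be the best choice generally. For example, str(df) truncates large dataframes, so two frames that differ only in the hidden rows produce the same string. It's up to the developer to choose one based on their own criteria for what qualifies as a "good" hash function. For instance, you could replace lambda x: str(x) with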

def hash_dataframe_custom(df):
    h1 = pd.util.hash_pandas_object(df)
    column_names = list(df.columns)
    column_names.append(h1)
    return column_names

@st.cache_data(hash_funcs={pd.core.frame.DataFrame: hash_dataframe_custom})
def show(df):
    return df

and it still works as expected.
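As a variation on the custom hash function above, here is a hedged sketch that folds the column labels and the row hashes into a single string. The name `hash_dataframe` and the use of SHA-256 are assumptions for illustration, not something this thread or Streamlit prescribes. The column labels are added explicitly because `pd.util.hash_pandas_object` hashes index and values only.

```python
import hashlib

import pandas as pd


def hash_dataframe(df: pd.DataFrame) -> str:
    """Hypothetical helper: digest of column labels plus row content."""
    digest = hashlib.sha256()
    # Column labels are not covered by hash_pandas_object, so add them first.
    digest.update(repr(list(df.columns)).encode("utf-8"))
    # One uint64 hash per row, covering the index and the values.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    digest.update(row_hashes.to_numpy().tobytes())
    return digest.hexdigest()


df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})
before = hash_dataframe(df)

df.columns = ["q", "w", "e"]  # in-place rename, same object id
after = hash_dataframe(df)

print(before != after)  # the rename changes the digest
```

With Streamlit 1.24.0 or later, a function like this could be passed as hash_funcs={pd.DataFrame: hash_dataframe} in place of the lambda, on the assumption that a string return value is acceptable there, as it is for str in the example above.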


@kajarenc
Collaborator

kajarenc commented Oct 4, 2023

Fixed in #7331

@kajarenc kajarenc closed this as completed Oct 4, 2023