st.cache_resource does not detect different column headers in pandas dataframes #7086
Comments
I'm using this code-snippet to work around this issue for now:
Streamlit maintainers -- feel free to take this and drop it in that file above, if it's at all helpful. Or let me know if I missed anything!
Perhaps there's a better choice than the
Looking into better combining methods now -- hopefully this is helpful -- let me know if it's not! :)
Here are the tests I'm running, FWIW:
Update, my final function that passes all of these tests:
From basic profiling, this does not seem to have a performance disadvantage on large dataframes, and in practice I've seen it perform better than the current implementation. Hopefully this is helpful :)
Thanks @naterush for opening this issue, and for your work! Yes, this is a known issue: we compute dataframe hashes based on
Hopefully now with
For the Streamlit library itself, there is an open discussion about how exactly we should hash dataframes to cover most cases while avoiding performance issues with large dataframes. CC: @LukasMasuch since this is something we discussed back at the time.
@kajarenc check out the hash function I specified above -- I think it's actually faster than the current function you have in Streamlit (while capturing column headers as well).
This clearly needs better profiling, but my feeling is that hashlib's optimized code (for a common hash function) is going to outperform most other things you can dream up.
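For readers following along, here is a minimal sketch of the general idea discussed in this thread: feed the column headers into a hashlib digest alongside the row hashes. It is not the function posted above and not the fix that landed in Streamlit. The name dataframe_digest, the choice of SHA-256, and the inclusion of dtypes are all assumptions made for illustration.

```python
import hashlib

import pandas as pd


def dataframe_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: digest of a dataframe that includes its headers."""
    hasher = hashlib.sha256()

    # Column labels and dtypes go in first, so that two frames with the
    # same values but different headers produce different digests.
    hasher.update(repr(list(df.columns)).encode("utf-8"))
    hasher.update(repr([str(dtype) for dtype in df.dtypes]).encode("utf-8"))

    # One uint64 hash per row (index included), fed to hashlib as raw bytes.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    hasher.update(row_hashes.to_numpy().tobytes())

    return hasher.hexdigest()


df_a = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_b = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

digest_a = dataframe_digest(df_a)
digest_a_copy = dataframe_digest(df_a.copy())
digest_b = dataframe_digest(df_b)
```

With this sketch, digest_a and digest_a_copy match while digest_b differs, since only the headers changed. A function like this could presumably be supplied per type through the hash_funcs argument of the caching decorators, though whether that is appropriate depends on the Streamlit version in use.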
I think this is a dup / same underlying issue as #6236
Fixed in #7331
Checklist
Summary
If all of the values in two dataframes are the same, st.cache_resource considers them the same dataframe -- even if they have differing column headers. Clearly, dataframes with different headers are different dataframes!
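The underlying effect can be seen with pandas alone. The sketch below assumes the cache key is derived from pd.util.hash_pandas_object, which hashes row values and the index but not the column labels. The frames df_a and df_b are made-up examples, not the ones from the original report.

```python
import pandas as pd

# Two frames with identical values but different column headers.
df_a = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df_b = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# hash_pandas_object returns one uint64 hash per row.
row_hashes_a = pd.util.hash_pandas_object(df_a)
row_hashes_b = pd.util.hash_pandas_object(df_b)

# The row hashes are identical even though the headers are not.
same_row_hashes = bool(row_hashes_a.equals(row_hashes_b))
same_headers = list(df_a.columns) == list(df_b.columns)

print("row hashes equal:", same_row_hashes)
print("headers equal:", same_headers)
```

Any cache key built only from these row hashes cannot tell the two frames apart.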
Reproducible Code Example
Steps To Reproduce
Expected Behavior
The get_value function should return different values for dataframes with different headers.
Current Behavior
Dataframes are cached without considering column headers. I think the offending line is line 413 in /Users/nathanrush/temps/streamlit/venv/lib/python3.9/site-packages/streamlit/runtime/caching/hashing.py -- specifically, you need to include the headers after the hash of the pandas object is taken.
Is this a regression?
Debug info
Additional Information
No response