Add improved type parsing capabilities for st.data_editor
#6551
Conversation
SHARED_DATA_KIND_TEST_CASES = [
    (pd.Series(["a", "b", "c"], dtype=pd.StringDtype()), ColumnDataKind.STRING),
I see `["a", "b", "c"]` and other similar literals duplicated. Should we create a variable for that and reuse it, so that if these tests need to be changed for some reason, you only have to update one spot instead of many? Same for `[1, 2, 3]`, `[1.1, 2.2, 3.3]`, and `[1, 2.2, 3]`?
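A minimal sketch of the suggestion, with hypothetical constant names (these names are illustrative, not from the PR):

```python
import pandas as pd

# Hypothetical shared sample data; updating a fixture then only
# requires touching one place instead of every test case.
STRING_VALUES = ["a", "b", "c"]
INT_VALUES = [1, 2, 3]
FLOAT_VALUES = [1.1, 2.2, 3.3]
MIXED_NUMBER_VALUES = [1, 2.2, 3]

# Test cases reference the shared lists instead of repeating literals.
# The expected-kind strings stand in for the ColumnDataKind enum.
SHARED_DATA_KIND_TEST_CASES = [
    (pd.Series(STRING_VALUES, dtype=pd.StringDtype()), "string"),
    (pd.Series(INT_VALUES), "integer"),
    (pd.Series(FLOAT_VALUES), "float"),
]
```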
Unfortunately, the cases in `SHARED_DATA_KIND_TEST_CASES` are already all the ones that work across every determination method. The remaining cases are specific to individual methods, so it gets a lot harder to share more of them. E.g. some are only supported by one method and others by multiple :(
For example, the string case `["a", "b", "c"]` only works for all methods if the series is explicitly created with `dtype=pd.StringDtype()`. But the Arrow schema and the inferred dtype can also handle this without the dtype being set.
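To illustrate the difference, here is a small standalone sketch (not code from the PR):

```python
import pandas as pd

# Without an explicit dtype, a string series falls back to the generic
# "object" dtype, so the dtype alone does not reveal the value kind.
plain = pd.Series(["a", "b", "c"])
print(plain.dtype)  # object

# pd.api.types.infer_dtype inspects the values themselves and still
# reports "string" for the object-dtype series.
print(pd.api.types.infer_dtype(plain))  # string

# With an explicit pd.StringDtype(), the dtype itself already says "string".
explicit = pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
print(explicit.dtype)  # string
```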
LGTM
* develop: Add improved type parsing capabilities for `st.data_editor` (streamlit#6551)
…t#6551)
* Add functionality to check underlying types
* Remove not-implemented types
* Add comment
* Some cleanup
* Add unit test
* Fix unit tests
* Finish unit test
* Add tests for index columns
* Remove type compatibility checks
* Remove refactoring
* Remove changes to column config object
* Remove final import
* Fix test issue
* Add dtype object to empty series for compatibility
* Add negative int and float to test
* Add a couple of comments about column data kind
📚 Context
This PR introduces logic to determine the data type of the values in a DataFrame column or index. I originally tried to avoid doing this, but some upcoming features of the column configuration project will require it, and it is also necessary for keeping the editing logic performant. Unfortunately, to get the correct column data kind in every possible situation, we need to combine information from the DataFrame column `dtype`, the inferred dtype via `pd.api.types.infer_dtype`, as well as the field type from the Arrow schema :(

These are all the changes done in this PR:
* Add a new `column_config_utils.py` module and move some parts of `data_editor` into this module without applying any code changes: `_INDEX_IDENTIFIER`, `ColumnConfig`, `ColumnConfigMapping`, `_marshall_column_config`
* Also moved: `_apply_cell_edits`, `_apply_row_additions`, `_apply_dataframe_edits`, `_parse_value`
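The combination of dtype checks and value inference described above could be sketched roughly like this. This is a simplified illustration, not the PR's implementation: the function name and the returned strings are hypothetical, and the Arrow-schema check the PR also performs is omitted here.

```python
import pandas as pd

def determine_data_kind(column: pd.Series) -> str:
    """Illustrative sketch: first consult the pandas dtype, then fall
    back to value-based inference for generic "object" columns."""
    dtype = column.dtype
    if pd.api.types.is_bool_dtype(dtype):
        return "boolean"
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    if pd.api.types.is_float_dtype(dtype):
        return "float"
    # An "object" dtype hides the real value kind, so inspect the values.
    inferred = pd.api.types.infer_dtype(column)
    if inferred == "string":
        return "string"
    return "unknown"

print(determine_data_kind(pd.Series([1, 2, 3])))   # integer
print(determine_data_kind(pd.Series(["a", "b"])))  # string
```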
The `st.dataframe` and `st.data_editor` components will have three different notions of data types, so here is an overview to make this a bit less confusing:

* Column data kind (e.g. `integer`, `float`, `string`, `bool`): This is the data type of the values in the column.
* Column type (e.g. `text`, `number`, `selectbox`): The column type is used in the frontend to provide certain display & editing capabilities. A column type can be compatible with multiple data kinds, and a data kind can be edited by different column types.
* Data structure type (e.g. `pd.DataFrame`, list of values, Snowpark `Table`): This is the data structure type of the input data and, in most cases, also the structure that is returned by the `data_editor`.

🧪 Testing Done
Contribution License Agreement
By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.