Add improved type parsing capabilities for st.data_editor #6551

Merged
LukasMasuch merged 17 commits into develop from feature/better-type-parsing on Apr 25, 2023

@LukasMasuch (Collaborator) commented Apr 23, 2023

📚 Context

This PR introduces logic to determine the data type of the values in a DataFrame column or index. I originally tried to avoid this, but some upcoming features of the column configuration project require it, and it is also necessary for keeping the editing logic performant. Unfortunately, to get the correct column data kind in every possible situation, we need to combine information from the DataFrame column dtype, the dtype inferred via pd.api.types.infer_dtype, and the field type from the Arrow schema :(

These are all the changes done in this PR:

  1. Create a new column_config_utils.py module and move some parts of data_editor into this module without applying any code changes: _INDEX_IDENTIFIER, ColumnConfig, ColumnConfigMapping, _marshall_column_config
  2. Implement a way to determine the correct underlying data type (-> column data kind) for any DataFrame column.
  3. Use column schema (data kind) in all methods that apply edits: _apply_cell_edits, _apply_row_additions, _apply_dataframe_edits, _parse_value
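
As a rough sketch of item 3, the snippet below shows how a per-column schema of data kinds could drive value parsing when edits are applied. It is written under assumptions: the parse_value and apply_cell_edits functions, the converter table, and the shape of the edits mapping are all hypothetical and do not reproduce the private _parse_value or _apply_cell_edits methods.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical converters keyed by data kind name (assumed names).
_CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "integer": int,
    "float": float,
    "string": str,
    "boolean": bool,
}


def parse_value(value: Any, data_kind: str) -> Any:
    """Convert an edited value to the type expected for its column."""
    if value is None:
        return None
    converter = _CONVERTERS.get(data_kind)
    # Unknown kinds are passed through unchanged.
    return converter(value) if converter is not None else value


def apply_cell_edits(
    rows: List[Dict[str, Any]],
    edits: Dict[Tuple[int, str], Any],
    schema: Dict[str, str],
) -> None:
    """Apply (row, column) -> value edits in place, parsing by data kind."""
    for (row, column), value in edits.items():
        rows[row][column] = parse_value(value, schema[column])


rows = [{"age": 30, "score": 1.5}]
schema = {"age": "integer", "score": "float"}
apply_cell_edits(rows, {(0, "age"): "31", (0, "score"): 2}, schema)
print(rows)  # [{'age': 31, 'score': 2.0}]
```

Knowing the data kind up front means each edited value is converted exactly once, instead of re-inspecting the column for every edit.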

The st.dataframe and st.data_editor components will have three different notions of data types, so here is an overview to make this a bit less confusing:

  • Column data kind (e.g. integer, float, string, bool): This is the data type of the values in the column.
  • Column type (e.g. text, number, selectbox): The column type is used in the frontend to provide certain display & editing capabilities. A column type can be compatible with multiple data kinds. And a data kind can be edited by different column types.
  • Data format (e.g. pd.DataFrame, List of values, Snowpark Table): This is the data structure type of the input data and, in most cases, also the structure that is returned by the data_editor.
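
The many-to-many relationship between column types and data kinds can be pictured with a small lookup table. This is only an illustrative sketch: ColumnDataKind.STRING appears in this PR's tests, but the other member names, the COMPATIBLE_KINDS table, and both helper functions are assumptions and not Streamlit's API.

```python
from enum import Enum
from typing import Dict, FrozenSet, List


class ColumnDataKind(str, Enum):
    """Illustrative subset of data kinds; member names are assumptions."""

    INTEGER = "integer"
    FLOAT = "float"
    STRING = "string"
    BOOLEAN = "boolean"


# Hypothetical table: frontend column type -> data kinds it can edit.
COMPATIBLE_KINDS: Dict[str, FrozenSet[ColumnDataKind]] = {
    "text": frozenset({ColumnDataKind.STRING}),
    "number": frozenset({ColumnDataKind.INTEGER, ColumnDataKind.FLOAT}),
    "selectbox": frozenset(ColumnDataKind),
}


def is_compatible(column_type: str, data_kind: ColumnDataKind) -> bool:
    """Return True if the column type can edit values of this data kind."""
    return data_kind in COMPATIBLE_KINDS.get(column_type, frozenset())


def column_types_for(data_kind: ColumnDataKind) -> List[str]:
    """List every column type that can edit the given data kind."""
    return sorted(
        name for name, kinds in COMPATIBLE_KINDS.items() if data_kind in kinds
    )


print(is_compatible("number", ColumnDataKind.FLOAT))
print(column_types_for(ColumnDataKind.INTEGER))
```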

🧪 Testing Done

  • Screenshots included
  • Added/Updated unit tests
  • Added/Updated e2e tests

Contribution License Agreement

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.

@LukasMasuch LukasMasuch marked this pull request as ready for review April 24, 2023 15:55


SHARED_DATA_KIND_TEST_CASES = [
(pd.Series(["a", "b", "c"], dtype=pd.StringDtype()), ColumnDataKind.STRING),
Collaborator:

I see ["a", "b", "c"] and similar values duplicated. Should we create variables for them and reuse those, so that if these tests need to change for any reason, you only have to change one spot instead of many? Same for [1, 2, 3], [1.1, 2.2, 3.3], and [1, 2.2, 3]?

Collaborator (Author):

Unfortunately, the cases in SHARED_DATA_KIND_TEST_CASES are already the ones that work across all methods. The other cases are specific to each determination method, so it gets a lot harder to share more of them. E.g. some are supported by only one method and others by several :(

Collaborator (Author):

For example, the string case ["a", "b", "c"] only works for all methods if the series is explicitly created with dtype=pd.StringDtype(). But the Arrow and inferred-type methods can also handle it without the dtype being set.
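
The dtype difference described here can be checked with a short sketch. It assumes pandas is installed, covers only the inferred-type side of the claim (not Arrow), and forces dtype="object" for the second series so the result does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

values = ["a", "b", "c"]

# Explicit string dtype: the column dtype alone identifies the strings.
explicit = pd.Series(values, dtype=pd.StringDtype())

# Object dtype: the column dtype says nothing about the contents.
as_object = pd.Series(values, dtype="object")

print(explicit.dtype, as_object.dtype)

# Inspecting the values recovers the string kind in both cases.
explicit_inferred = pd.api.types.infer_dtype(explicit)
object_inferred = pd.api.types.infer_dtype(as_object)
print(explicit_inferred, object_inferred)
```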

@willhuang1997 (Collaborator) left a comment

LGTM

@LukasMasuch LukasMasuch merged commit c84f17b into develop Apr 25, 2023
76 checks passed
tconkling added a commit to tconkling/streamlit that referenced this pull request Apr 25, 2023
* develop:
  Add improved type parsing capabilities for `st.data_editor` (streamlit#6551)
@sfc-gh-kmcgrady sfc-gh-kmcgrady deleted the feature/better-type-parsing branch October 5, 2023 19:30
eric-skydio pushed a commit to eric-skydio/streamlit that referenced this pull request Dec 20, 2023
…t#6551)

* Add functionality to check underlying types

* Remove not-implemented types

* Add comment

* Some cleanup

* Add unit test

* Fix unit tests

* Finish unit test

* Add tests for index columns

* Remove type compatibility checks

* Remove refactoring

* Remove changes to column config object

* Remove final import

* Fix test issue

* Add dtype object to empty series for compatibility

* Add negative int and float to test

* Add a couple of comments about column data kind
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Mar 22, 2024
zyxue pushed a commit to zyxue/streamlit that referenced this pull request Apr 16, 2024