TabularData: Add ignore_empty flag to drop_duplicates. #7064

daavoo · 2021-11-29T21:42:55Z

TabularData: Add ignore_empty flag to drop_duplicates
TabularData: Fix drop_duplicates for rich_text NaN

tests/unit/test_tabular_data.py

pmrowla · 2021-12-17T05:43:05Z

If the problem here is that TabularData is being filled with a mix of None and "-" strings (i.e. table._fill_value), this seems like a bug that needs to be fixed in whatever is generating the TabularData, and not in specific UI functions that filter/render TabularData. (It seems like this will probably also show up in places other than drop_duplicates at some point)

TabularData should either:

1. always store None values as None (and replace them with _fill_value at the time they are rendered to UI), or
2. always replace None values with _fill_value at the time the value is added to the table/row/cell.

Looking at the code it looks like the 2nd option is what is supposed to be happening (@skshetry?), so my understanding is that drop_duplicates should never be encountering None values in the first place. (In which case this needs to be fixed somewhere else, presumably somewhere in exp show?)
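The two storage strategies described above can be sketched as follows. This is an illustration only, not DVC's TabularData: the class names `FillOnRender` and `FillOnInsert`, the `FILL_VALUE` constant, and the flat list of cells are all hypothetical choices made for the example.

```python
from typing import List, Optional

FILL_VALUE = "-"  # placeholder shown for empty cells (assumed for this sketch)


class FillOnRender:
    """Option 1: keep None internally, substitute the placeholder when rendering."""

    def __init__(self) -> None:
        self.cells: List[Optional[str]] = []

    def append(self, value: Optional[str]) -> None:
        self.cells.append(value)

    def render(self) -> List[str]:
        return [FILL_VALUE if c is None else c for c in self.cells]


class FillOnInsert:
    """Option 2: substitute the placeholder as soon as a value is stored."""

    def __init__(self) -> None:
        self.cells: List[str] = []

    def append(self, value: Optional[str]) -> None:
        self.cells.append(FILL_VALUE if value is None else value)

    def render(self) -> List[str]:
        return list(self.cells)


lazy, eager = FillOnRender(), FillOnInsert()
for value in ["foo", None]:
    lazy.append(value)
    eager.append(value)

# Both render identically, but only the first still records which cell was empty.
print(lazy.cells, lazy.render())
print(eager.cells, eager.render())
```

With the first option, filtering code can test `cell is None` directly; with the second, it has to compare against the placeholder, which only works if every producer inserts exactly the same placeholder object type.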

daavoo · 2021-12-17T10:42:52Z

> If the problem here is that TabularData is being filled with a mix of None and "-" strings (i.e. table._fill_value)

That's not the problem.

The table is being filled with a mix of _fill_value of type string and _fill_value of type rich.Text (from show_experiments).

This is addressed in the first commit b09d353
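A minimal sketch of why mixed placeholder types break an equality check. `StyledText` is a hypothetical stand-in for a rich-text object (it is not the `rich.Text` class), assumed here to compare unequal to plain strings; `is_empty` is likewise an illustrative helper and not necessarily the fix made in the commit.

```python
class StyledText:
    """Hypothetical stand-in for a styled text object wrapping a plain string."""

    def __init__(self, plain: str, style: str = "") -> None:
        self.plain = plain
        self.style = style

    def __str__(self) -> str:
        return self.plain


FILL_VALUE = "-"  # assumed placeholder for empty cells


def is_empty(cell: object) -> bool:
    # Compare the plain-text form so that both placeholder types match.
    return str(cell) == FILL_VALUE


cells = ["-", StyledText("-", style="dim"), "foo"]

# A naive equality check misses the styled placeholder.
naive = [cell == FILL_VALUE for cell in cells]
# Normalizing to plain text first catches both.
normalized = [is_empty(cell) for cell in cells]

print(naive)
print(normalized)
```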

> and not in specific UI functions that filter/render TabularData

The nan_is_value flag for drop_duplicates does not fix a problem; it lets the caller choose between two valid behaviors, depending on the use case.

drop_duplicates should remove columns where all rows have the same value. It is ambiguous how to treat NaNs, e.g. in the following column:

foo
foo
foo
None

For some use cases, the column could be considered all duplicates because None is not considered a relevant value. For others, None is considered a relevant value and thus this column should not be dropped.
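The two behaviors can be sketched with a small standalone function. This is an illustration only, not the implementation in dvc/compare.py: `drop_constant_columns` and the dict-of-columns layout are hypothetical, the parameter name `ignore_empty` is borrowed from the final PR title, and `None` stands in for a missing cell.

```python
from typing import Dict, List, Optional


def drop_constant_columns(
    columns: Dict[str, List[Optional[str]]],
    ignore_empty: bool = True,
) -> Dict[str, List[Optional[str]]]:
    """Return only the columns whose cells are not all the same value.

    With ignore_empty=True, None cells are skipped before comparing, so
    ["foo", "foo", None] counts as constant and is dropped. With
    ignore_empty=False, None is a value like any other and the column is kept.
    """
    kept = {}
    for name, cells in columns.items():
        values = [c for c in cells if c is not None] if ignore_empty else cells
        if len(set(values)) > 1:
            kept[name] = cells
    return kept


table = {
    "a": ["foo", "foo", "foo", None],  # the column from the example above
    "b": ["x", "y", "x", None],
}

print(list(drop_constant_columns(table, ignore_empty=True)))
print(list(drop_constant_columns(table, ignore_empty=False)))
```

Column "a" is dropped only when empty cells are ignored; column "b" varies either way and is always kept.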

> always store None values as None (and replace them with _fill_value at the time they are rendered to UI)

I prefer this option. I was trying not to introduce many changes here, but it's probably the way to go.

pmrowla · 2021-12-17T10:58:37Z

> For some use cases, the column could be considered all duplicates because None is not considered a relevant value. For others, None is considered a relevant value and thus this column should not be dropped.

Ah ok, that makes sense then.

> > always store None values as None (and replace them with _fill_value at the time they are rendered to UI)
>
> I prefer this option. I was trying not to introduce many changes here, but it's probably the way to go.

After reading your explanation, I think this PR is probably fine for now, maybe just open a separate issue regarding making TabularData store None types instead of the fill strings.

As a side note, I would say that NaN is not the same thing as None/null (even if Python doesn't have a dedicated native type for NaN), and I would maybe name this something else, like ignore_none/ignore_empty/skip_empty (to skip/ignore empty cells when determining whether the value changed).
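The distinction drawn here can be checked in plain Python: NaN is a float value that never compares equal to anything, while None is a singleton marking the absence of a value. The helper names `is_missing` and `is_nan` below are hypothetical and exist only for this sketch.

```python
import math

nan = float("nan")


def is_missing(value: object) -> bool:
    # An empty cell: no value was recorded at all.
    return value is None


def is_nan(value: object) -> bool:
    # A recorded numeric value that happens to be "not a number".
    return isinstance(value, float) and math.isnan(value)


print(nan == nan)                        # NaN is not equal even to itself
print(is_nan(nan), is_missing(nan))      # NaN is a value, not an empty cell
print(is_nan(None), is_missing(None))    # None is an empty cell, not NaN
```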

daavoo requested a review from a team as a code owner November 29, 2021 21:42

daavoo requested a review from karajan1001 November 29, 2021 21:42

daavoo mentioned this pull request Nov 29, 2021

exp show: Add parallel coordinates plot. #6933

Merged

2 tasks

karajan1001 approved these changes Dec 2, 2021

tests/unit/test_tabular_data.py

daavoo force-pushed the drop-duplicates-nan branch from 70769af to b2c7623 Compare December 7, 2021 19:52

daavoo changed the title ~~TabularData: Fix drop_duplicates for rich_text NaN.~~ TabularData: Add nan_is_value flag to drop_duplicates. Dec 7, 2021

daavoo marked this pull request as draft December 17, 2021 10:47

daavoo force-pushed the drop-duplicates-nan branch from b2c7623 to 7c2c194 Compare December 17, 2021 16:16

TabularData: Fix drop_duplicates for rich_text NaN.

0f22370

When filling missing values with `ui.rich_text` (i.e. in the experiments show command), those values were not being correctly matched against `self._fill_value`.

daavoo force-pushed the drop-duplicates-nan branch from 7c2c194 to c37db07 Compare December 17, 2021 16:47

daavoo changed the title ~~TabularData: Add nan_is_value flag to drop_duplicates.~~ TabularData: Add ignore_empty flag to drop_duplicates. Dec 17, 2021

daavoo marked this pull request as ready for review December 17, 2021 16:48

daavoo mentioned this pull request Dec 17, 2021

TabularData: Use None internally #7167

Closed

daavoo requested a review from karajan1001 December 20, 2021 17:43

karajan1001 reviewed Dec 22, 2021

dvc/compare.py

tests/unit/test_tabular_data.py

TabularData: Add ignore_empty flag to drop_duplicates.

251b931

Configures whether to consider missing values as relevant or not.

daavoo force-pushed the drop-duplicates-nan branch from c37db07 to 251b931 Compare December 27, 2021 10:40

daavoo merged commit 9a5614d into main Dec 29, 2021

daavoo deleted the drop-duplicates-nan branch December 29, 2021 20:33

efiop added the ui user interface / interaction label Jan 14, 2022

daavoo added the skip-changelog Skips changelog label Jan 27, 2022
