TableVectorize cleanup! #579

Vincent-Maladiere · 2023-06-07T09:13:16Z

This PR addresses what seems to be an inaccuracy in the way we handle numerical values by default in TableVectorizer, along with various readability issues.

Main issue

We currently have:

all_transformers = [
    ("numeric", self.numerical_transformer, numeric_columns),
    ...
]

but this defeats the purpose of cloning the transformers to avoid altering them.

What we want instead is:

all_transformers = [
    ("numeric", self.numerical_transformer_, numeric_columns),
    ...
]

(notice the _)
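
To illustrate why the trailing underscore matters, here is a minimal sketch using plain scikit-learn, not the TableVectorizer code itself. The names user_scaler and fitted_scaler are made up for the example: the first stands in for the object a user passes at construction time, the second for the cloned, underscore-suffixed attribute.

```python
from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted


def is_fitted(estimator):
    """Return True if scikit-learn considers the estimator fitted."""
    try:
        check_is_fitted(estimator)
        return True
    except NotFittedError:
        return False


# Stand-in for the transformer the user passes to the constructor.
user_scaler = StandardScaler()

# Stand-in for the cloned attribute (the one with the trailing underscore).
fitted_scaler = clone(user_scaler)
fitted_scaler.fit([[0.0], [1.0], [2.0]])

# Fitting the clone leaves the user's object untouched.
original_untouched = not is_fitted(user_scaler)
clone_fitted = is_fitted(fitted_scaler)
```

Only the clone carries fitted state, which is the guarantee that is lost when the un-cloned attribute is handed to the ColumnTransformer.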

The difference in behavior is that now to use remainder="drop" on numerical columns, users must do:

tv = TableVectorizer(remainder="drop", numerical_transformer="remainder")

which is a bit more explicit and less error-prone (most users don't want their numerical values to be dropped).
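
The effect can be sketched with a bare scikit-learn ColumnTransformer. This is an illustration under assumptions, not the TableVectorizer internals: column 0 stands in for a numerical column, and the "drop" entry mimics what numerical_transformer="remainder" would presumably resolve to when remainder="drop".

```python
import numpy as np
from sklearn.compose import ColumnTransformer

X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
])

# Column 0: "numerical" column, column 1: handled by another transformer,
# column 2: not listed, so it falls under `remainder`.
kept = ColumnTransformer(
    [("numeric", "passthrough", [0]), ("other", "passthrough", [1])],
    remainder="drop",
).fit_transform(X)

# Assumed equivalent of numerical_transformer="remainder" with
# remainder="drop": the numerical column is dropped explicitly.
dropped = ColumnTransformer(
    [("numeric", "drop", [0]), ("other", "passthrough", [1])],
    remainder="drop",
).fit_transform(X)
```

In the first case the numerical column survives even though remainder="drop", because it is listed explicitly; in the second it is removed only because the user asked for it.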

What do you think?

Other smallish readability issues

1. replace_false_missing() needs to be run before _auto_cast(), so that we don't have to call it twice.
2. low_card_cat_columns and high_card_cat_columns can be created with a single for-loop instead of two.
3. We can use numeric_columns = X.select_dtypes("number") instead of listing all int, uint and float types (uint8 and int8 are missing; is that on purpose?).
4. The loop

   for col in X.columns:
       if col in categorical_columns:

   can be replaced by

   for col in categorical_columns:

   since categorical_columns is a subset of X.columns.
5. It's preferable to have check_is_fitted() on "transformers_" rather than "columns_", since the latter is created at the beginning of the function whereas the former is created during the fit_transform of ColumnTransformer. If an error occurred in between, columns_ would already be defined and the estimator would mistakenly be considered fitted.
6. Fix the docstring of remainder, which currently asserts that the default value is drop, whereas it's actually passthrough.
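
The check_is_fitted() point can be reproduced with a toy estimator. This is a hedged sketch: ToyVectorizer and its failure condition are invented for the example and only mimic the order in which the two attributes are set.

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class ToyVectorizer(BaseEstimator):
    """Hypothetical estimator: sets columns_ early and transformers_ late."""

    def fit(self, X):
        # Set at the very beginning of fit, like columns_ in the PR.
        self.columns_ = list(X)
        # Simulated failure happening before the fit completes.
        if any(len(values) == 0 for values in X.values()):
            raise ValueError("empty column")
        # Only set once everything else has succeeded.
        self.transformers_ = [("passthrough", col) for col in self.columns_]
        return self


def passes_check(estimator, attribute):
    """Return True if check_is_fitted accepts the estimator for `attribute`."""
    try:
        check_is_fitted(estimator, attribute)
        return True
    except NotFittedError:
        return False


est = ToyVectorizer()
try:
    est.fit({"a": []})  # fails half-way through fit
except ValueError:
    pass

columns_check_passes = passes_check(est, "columns_")
transformers_check_passes = passes_check(est, "transformers_")
```

After the failed fit, the check on "columns_" wrongly reports the estimator as fitted, while the check on "transformers_" correctly raises.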

GaelVaroquaux · 2023-06-07T09:27:27Z

The difference in behavior is that now to use remainder="drop" on numerical columns, users must do: tv = TableVectorizer(remainder="drop", numerical_transformer="remainder") which is a bit more explicit and less error-prone (most users don't want their numerical values to be dropped).

Seems relevant from a cursory look.

LilianBoulard · 2023-06-08T16:27:17Z

What we want instead is

Nice, thanks for spotting that, this is absolutely a typo!

replace_false_missing() needs to be run before _auto_cast(), so that we don't have to call it twice

I'm not so sure, I think the logic is that if there are "false missing", _has_missing_values on line 671 won't recognize them as actual missing values, thus not imputing the columns. Which in turn will not trigger the auto-cast with NANs (this assumes pandas has different data types depending on whether the column contains missing values or not, and to my knowledge this is true).

Edit: after reviewing the PR, that makes sense if we remove the function call from _auto_cast :)
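
A small pandas sketch of why the order matters. The list of "false missing" markers below is hypothetical, and pd.to_numeric stands in for the casting step; neither is taken from the skrub code.

```python
import pandas as pd

raw = pd.Series(["1", "2", "N/A", "4"])

# Hypothetical markers that should be treated as missing values.
false_missing = ["N/A", "?", ""]

# Casting first fails, because "N/A" is still a plain string.
try:
    pd.to_numeric(raw)
    cast_before_cleaning_failed = False
except ValueError:
    cast_before_cleaning_failed = True

# Replacing the markers first lets the cast succeed in a single pass.
cleaned = raw.mask(raw.isin(false_missing))
casted = pd.to_numeric(cleaned)
```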

However, I can see a way the logic could be improved (unrelated to the previous paragraph): instead of checking for missing values on X as a whole, we could treat each column independently.
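
As a rough sketch of that idea (the column names are illustrative, and this is not how _has_missing_values is written), the per-column check could look like:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "age": [20.0, np.nan, 35.0],
    "city": ["Paris", "Rome", "Oslo"],
})

# Table-level check: a single missing value flags the whole table.
any_missing_in_table = bool(X.isna().to_numpy().any())

# Column-level check: only the affected columns are flagged.
columns_with_missing = [col for col in X.columns if X[col].isna().any()]
```

With the column-level version, only "age" would need imputation-aware handling, and "city" could be left alone.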

We can use numeric_columns = X.select_dtypes("number") instead of listing all int, uint and float types (uint8 and int8 are missing, is it on purpose?)

Absolutely! And for the 8-bit types, there was most likely a technical reason, but I guess this is not a problem today.

And for 2 and 4, 5 and 6, you have my 👍 😄

LilianBoulard

Looks great, thanks

GaelVaroquaux · 2023-06-11T19:54:19Z

It would be good to have a changelog entry, so that people transitioning from dirty-cat see the difference.

GaelVaroquaux

LGTM, but please add a small changelog entry

GaelVaroquaux · 2023-06-11T19:56:13Z

skrub/_table_vectorizer.py

-                np.uint32,
-                np.uint16,
-            ]
+            include="number"


Nice simplification <3
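
For readers following along, a short pandas sketch of what this simplification changes. The explicit type list below is illustrative and is not the exact list removed by the diff; it only reproduces the situation where int8 and uint8 are left out.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "tiny": np.array([1, 2, 3], dtype=np.int8),
    "flag": np.array([0, 1, 0], dtype=np.uint8),
    "price": [1.5, 2.5, 3.5],
    "city": ["Paris", "Rome", "Oslo"],
})

# Illustrative explicit list that omits the 8-bit integer types.
explicit_types = [
    np.int16, np.int32, np.int64,
    np.uint16, np.uint32, np.uint64,
    np.float16, np.float32, np.float64,
]
explicit_cols = X.select_dtypes(include=explicit_types).columns.to_list()

# "number" selects every numeric dtype, including int8 and uint8.
number_cols = X.select_dtypes(include="number").columns.to_list()
```

The explicit list silently skips the 8-bit columns, whereas include="number" picks them up.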

GaelVaroquaux · 2023-06-11T19:57:14Z

skrub/_table_vectorizer.py

+            if X[col].nunique() < self.cardinality_threshold:
+                low_card_cat_columns.append(col)    
+            else:
+                high_card_cat_columns.append(col)


Much cleaner code. Thanks!
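
A self-contained version of this loop, as a sketch with made-up data. Here cardinality_threshold is a plain variable rather than the estimator attribute, and categorical_columns is given by hand.

```python
import pandas as pd

X = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "name": ["Ada", "Bob", "Cleo", "Dan"],
})
categorical_columns = ["color", "name"]
cardinality_threshold = 3

low_card_cat_columns = []
high_card_cat_columns = []

# One pass over the categorical columns fills both lists.
for col in categorical_columns:
    if X[col].nunique() < cardinality_threshold:
        low_card_cat_columns.append(col)
    else:
        high_card_cat_columns.append(col)
```

Iterating directly over categorical_columns also removes the need for the membership test against X.columns.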

CHANGES.rst

GaelVaroquaux · 2023-06-13T10:25:12Z

I'm committing to this branch to add the PR to the changelog (great for advanced users, and it helps involve a community)

GaelVaroquaux · 2023-06-13T10:58:11Z

Merged!!

LilianBoulard approved these changes Jun 9, 2023

GaelVaroquaux reviewed Jun 11, 2023

Vincent-Maladiere added 4 commits June 12, 2023 16:51

first cleaning

443d078

update tests

97c9e90

fix remainder docstring

849e71d

update the changelog

b9cb1bc

Vincent-Maladiere force-pushed the small_enhencement_table_vectorizer branch 2 times, most recently from d3790bc to b9cb1bc Compare June 12, 2023 15:04

LilianBoulard added the enhancement and no changelog needed labels Jun 12, 2023

LilianBoulard assigned Vincent-Maladiere Jun 12, 2023

LilianBoulard removed the no changelog needed label Jun 12, 2023

GaelVaroquaux reviewed Jun 13, 2023

CHANGES.rst (outdated, resolved)

GaelVaroquaux added 2 commits June 13, 2023 12:25

Add the link to the PR

93762db

Merge branch 'main' into small_enhencement_table_vectorizer

a545d59

GaelVaroquaux merged commit 72846d4 into skrub-data:main Jun 13, 2023
21 of 22 checks passed

Vincent-Maladiere deleted the small_enhencement_table_vectorizer branch November 9, 2023 16:34
