Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TableVectorize cleanup! #579

Conversation

Vincent-Maladiere
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere commented Jun 7, 2023

This PR addresses what seems to be an inaccuracy in the way we handle numerical values by default in TableVectorizer, along with various readability issues.

Main issue

We currently have:

all_transformers = [
          ("numeric", self.numerical_transformer, numeric_columns),
          ...
]

but this defeat the purpose of cloning the transformers to avoid altering them.

What we want instead is:

all_transformers = [
          ("numeric", self.numerical_transformer_, numeric_columns),
          ...
]

(notice the _)

The difference in behavior is that now to use remainder="drop" on numerical columns, users must do:

tv = TableVectorizer(remainder="drop", numerical_transformer="remainder")

which is a bit more explicit and less error-prone (most users don't want their numerical values to be dropped).

What do you think?

Other smallish readability issues

  1. replace_false_missing() needs to be run before _auto_cast(), so that we don't have to call it twice
  2. low_card_cat_columns and high_card_cat_columns can be created with a single for-loop instead of two
  3. We can use numeric_columns = X.select_dtypes("number") instead of listing all int, uint and float types (uint8 and int8 are missing, is it on purpose?)
  4. for col in X.columns:
        if col in categorical_columns:
    can be replaced by
    for col in categorical_columns:
    since categorical_columns is a subset of X.columns
  5. It's preferable to have check_is_fitted() on "transformers_" rather than "columns_", since the latter is created at the beginning of the function whereas the former is created during the fit_transform of ColumnTransformer. In case of error, columns_ would be defined and hence the estimator would be mistakingly considered fitted.
  6. Fix the docstring of remainder, which currently asserts that the default value is drop, whereas it's actually passthrough.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jun 7, 2023 via email

@LilianBoulard
Copy link
Member

LilianBoulard commented Jun 8, 2023

What we want instead is

Nice, thanks for spotting that, this is absolutely a typo!

replace_false_missing() needs to be run before _auto_cast(), so that we don't have to call it twice

I'm not so sure, I think the logic is that if there are "false missing", _has_missing_values on line 671 won't recognize them as actual missing values, thus not imputing the columns. Which in turn will not trigger the auto-cast with NANs (this assumes pandas has different data types depending on whether the column contains missing values or not, and to my knowledge this is true).

Edit: after reviewing the PR, that makes sense if we remove the function call from _auto_cast :)

However, I can see a way the logic could be improved (unrelated to the previous paragraph): instead of checking for missing values on X as a whole, we could treat each column independently.

We can use numeric_columns = X.select_dtypes("number") instead of listing all int, uint and float types (uint8 and int8 are missing, is it on purpose?)

Absolutely! And for 8bits, there was most likely a technical reason, but I guess this is not a problem today.

And for 2 and 4, 5 and 6, you have my 👍 😄

Copy link
Member

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks

@GaelVaroquaux
Copy link
Member

It would be good to have a changelog entry, so that people transiting from dirty-cat see the difference.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please add a small changelog entry

np.uint32,
np.uint16,
]
include="number"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification <3

if X[col].nunique() < self.cardinality_threshold:
low_card_cat_columns.append(col)
else:
high_card_cat_columns.append(col)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner code. Thanks!

CHANGES.rst Outdated Show resolved Hide resolved
@GaelVaroquaux
Copy link
Member

I'm committing to this branch to add the PR in the changelog (great for advanced users, and helps involving a community)

@GaelVaroquaux GaelVaroquaux merged commit 72846d4 into skrub-data:main Jun 13, 2023
21 of 22 checks passed
@GaelVaroquaux
Copy link
Member

Merged!!

@Vincent-Maladiere Vincent-Maladiere deleted the small_enhencement_table_vectorizer branch November 9, 2023 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants