Skip to content

Rework SV logic#303

Merged
LilianBoulard merged 29 commits intoskrub-data:masterfrom
LilianBoulard:rework_sv_logic
Sep 8, 2022
Merged

Rework SV logic#303
LilianBoulard merged 29 commits intoskrub-data:masterfrom
LilianBoulard:rework_sv_logic

Conversation

@LilianBoulard
Copy link
Member

@LilianBoulard LilianBoulard commented Aug 22, 2022

I noticed some inconsistencies in the behavior of the SuperVectorizer, which this PRs aims to fix:

  • The _auto_cast method was called on the data passed to transform, which meant that types of the fit_transform output data could be different from the types of the transform output data.
  • The types_ variable was overwritten during transform, which shouldn't happen as this is a mapping of the types learnt during fit.
  • "False missing" values (for a lack of a better term) were only replaced in the _auto_cast function, which made sense as it was used in the transform as well. Now, it has been moved to its own function
  • Imputation is not consistent. We might want to rework this logic.

@LilianBoulard LilianBoulard added the bug Something isn't working label Aug 22, 2022
@LilianBoulard LilianBoulard self-assigned this Aug 22, 2022
@GaelVaroquaux
Copy link
Member

This PR has conflicts and failing tests

@LilianBoulard
Copy link
Member Author

So, the failing test was because of an issue that affects pandas version 1.1.5 (our min dependency until now), and possibly earlier versions (I haven't checked).
Pandas 1.2.0 was released on December 26th, 2020, only a few weeks after 1.1.5 (December 7th), and fixes this bug, so I guess bumping is not a big deal.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks overall good, but we do have the problem of terminology of "cat" vs "str". It seems to me that it should be "str" quite often.

@LilianBoulard
Copy link
Member Author

So, the PR is overall stable, but there's a last thing I'm concerned about: the imputation strategy.

Currently, the logic is to impute missing values categorical / string with a custom value "missing", depending on the impute_missing parameter.

There is currently no imputation for missing values in datetime and numerical columns.
I'm kind of curious as to why this has not been a problem so far: do most models support missing values out of the box ?
What do you think @GaelVaroquaux ?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Sep 6, 2022 via email

@LilianBoulard
Copy link
Member Author

LilianBoulard commented Sep 7, 2022

Okay so these last few commits add some doc to clarify to the user what we impute and what we don't.

Edit: one last thing we need to address is the way the remainder is used, as currently it cannot be applied to the *_transformer parameters because before, it was used when None was passed. Now, None is transformed to a default transformer as pointed out in the doc.

@LilianBoulard LilianBoulard added this to the 0.3.0 release milestone Sep 7, 2022
Copy link
Member Author

@LilianBoulard LilianBoulard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me!
I'm waiting for your feedback, and I'll merge it right away :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I am following the logic fully. If I understand things right, if I put numerical_transformer="pasthrough" (a typo on "passthrough", the numerical_transformer ends up being the high_card_cat_transformer. This sounds like a dangerous behavior, as it will lead to people having bugs in their pipelines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, sorry that was en error, it should be self.numerical_transformer_ = self.numerical_transformer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And so if there was a typo, it will crash down the line when the ColumnTransformer's fit_transform will call _validate_transformers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. Please add a comment that says that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a small docstring to describe to the developer what this function does.
It does not need to be a properly formatted docstring describing all parameters, as this is an internal function, we just need to describe why this function

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things to modify:

  • The docstring to add
  • The error on the numerical vs categorical transformer

After this, we merge. Don't do more changes, or I'll find other comments :)

@LilianBoulard LilianBoulard merged commit 9e59a1a into skrub-data:master Sep 8, 2022
@LilianBoulard LilianBoulard deleted the rework_sv_logic branch September 8, 2022 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants