This reverts commit 94d02c4.
…rework_sv_logic Conflicts: dirty_cat/super_vectorizer.py
This PR has conflicts and failing tests
So, the failing test was because of an issue that affects pandas version 1.1.5 (our min dependency until now), and possibly earlier versions (I haven't checked).
GaelVaroquaux
left a comment
This looks overall good, but we do have the problem of terminology of "cat" vs "str". It seems to me that it should be "str" quite often.
So, the PR is overall stable, but there's one last thing I'm concerned about: the imputation strategy. Currently, the logic is to impute missing values in categorical / string columns with a custom value "missing", depending on the …
There is currently no imputation for missing values in datetime and numerical columns.
This may need to be done by a sophisticated imputation method (unlike with categorical data), so we leave it to the user to add an Imputer in the pipeline. Another option is to use a model that natively supports missing values.
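The split described here can be sketched roughly as follows. This is a toy, standard-library illustration and not dirty_cat's implementation: the helper name `impute_string_columns` and the list-of-dicts data layout are assumptions made for the example.

```python
from typing import Any

MISSING_PLACEHOLDER = "missing"  # the custom value mentioned in the discussion


def impute_string_columns(
    rows: list[dict[str, Any]], string_columns: set[str]
) -> list[dict[str, Any]]:
    """Hypothetical helper: fill missing values in string columns only."""
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the caller's data is not mutated
        for column in string_columns:
            if new_row.get(column) is None:
                new_row[column] = MISSING_PLACEHOLDER
        imputed.append(new_row)
    return imputed


rows = [
    {"city": "Paris", "age": 31.0},
    {"city": None, "age": None},
]
result = impute_string_columns(rows, string_columns={"city"})
```

In this sketch the missing `age` stays `None`, so it is still up to a downstream imputer, or a model that handles missing values natively, to deal with it.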
…rework_sv_logic Conflicts: dirty_cat/super_vectorizer.py
Okay so these last few commits add some doc to clarify to the user what we impute and what we don't. Edit: one last thing we need to address is the way the …
LilianBoulard
left a comment
Everything looks good to me!
I'm waiting for your feedback, and I'll merge it right away :)
I'm not sure that I am following the logic fully. If I understand things right, if I put numerical_transformer="pasthrough" (a typo on "passthrough"), the numerical_transformer ends up being the high_card_cat_transformer. This sounds like dangerous behavior, as it will lead to people having bugs in their pipelines.
Ah, no, sorry, that was an error; it should be self.numerical_transformer_ = self.numerical_transformer
And so if there was a typo, it will crash down the line when the ColumnTransformer's fit_transform will call _validate_transformers.
OK, thanks. Please add a comment that says that.
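A minimal sketch of the behavior agreed on in this thread, under stated assumptions: `resolve_numerical_transformer` and `validate_transformer` are hypothetical stand-ins written for illustration, not functions from dirty_cat or scikit-learn. The point is only that the user's value is stored untouched, so a typo surfaces as an error at validation time instead of being silently replaced by another transformer.

```python
# Special strings that ColumnTransformer accepts in place of an estimator.
VALID_STRINGS = {"passthrough", "drop"}


def resolve_numerical_transformer(numerical_transformer):
    """Hypothetical stand-in for the corrected assignment:
    keep the user's value as-is, never swap in another transformer."""
    return numerical_transformer


def validate_transformer(transformer):
    """Toy stand-in for the validation that happens later, at fit time."""
    if isinstance(transformer, str) and transformer in VALID_STRINGS:
        return
    if hasattr(transformer, "fit") and hasattr(transformer, "transform"):
        return
    raise TypeError(f"Invalid transformer: {transformer!r}")


resolved = resolve_numerical_transformer("pasthrough")  # typo kept on purpose

try:
    validate_transformer(resolved)
    typo_detected = False
except TypeError:
    typo_detected = True
```

With this shape, the misspelled value fails loudly during validation, while a correctly spelled "passthrough" goes through unchanged.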
dirty_cat/super_vectorizer.py
Outdated
Please add a small docstring to describe to the developer what this function does.
It does not need to be a properly formatted docstring describing all parameters, as this is an internal function, we just need to describe why this function
GaelVaroquaux
left a comment
Two things to modify:
- The docstring to add
- The error on the numerical vs categorical transformer
After this, we merge. Don't do more changes, or I'll find other comments :)
I noticed some inconsistencies in the behavior of the SuperVectorizer, which this PR aims to fix:
- _auto_cast method was called on the data passed to transform, which meant that types of the fit_transform output data could be different from the types of the transform output data.
- types_ variable was overwritten during transform, which shouldn't happen as this is a mapping of the types learnt during fit.
- … _auto_cast function, which made sense as it was used in the transform as well. Now, it has been moved to its own function
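The first two points can be illustrated with a toy example. This is a hedged sketch and not the SuperVectorizer code: `TinyCaster`, its helpers, and the list-of-dicts layout are hypothetical names chosen for the illustration. Types are inferred once in `fit`, stored in `types_`, and only read back in `transform`.

```python
class TinyCaster:
    """Hypothetical toy class: learns column types during fit only."""

    def fit(self, rows):
        # Learn one type per column from the training data.
        self.types_ = {
            column: self._infer_type(rows, column) for column in rows[0]
        }
        return self

    def transform(self, rows):
        # Reuse the learnt types; self.types_ is never reassigned here.
        return [
            {
                column: self._cast(value, self.types_[column])
                for column, value in row.items()
            }
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)

    @staticmethod
    def _infer_type(rows, column):
        values = [row[column] for row in rows if row[column] is not None]
        try:
            for value in values:
                float(value)
            return float
        except (TypeError, ValueError):
            return str

    @staticmethod
    def _cast(value, target_type):
        return None if value is None else target_type(value)


train = [{"price": "1.5", "name": "a"}, {"price": "2", "name": "b"}]
test = [{"price": "3", "name": 7}]

caster = TinyCaster()
train_out = caster.fit_transform(train)
test_out = caster.transform(test)
```

If types were inferred again on `test`, the `name` column would look numeric there; reusing `types_` keeps it a string, so `fit_transform` and `transform` outputs stay consistent.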