
Add SuperVectorizer #167

Merged · 35 commits · Jul 6, 2021

Conversation

LilianBoulard (Member):

Hello

This PR aims to implement an "automatic column vectorizer" as suggested in #54 (closes #54): the SuperVectorizer.

It can be used to automatically apply encoders to columns, depending on implicit characteristics such as the column dtype.
Under the hood, it is an interface to scikit-learn's ColumnTransformer.

Currently, it selects columns based on their dtype, and divides them into several groups:

  • Numerical variables
  • Datetime variables
  • Low cardinality categorical variables
  • High cardinality categorical variables
  • Low cardinality string variables
  • High cardinality string variables

As you may have noticed, we differentiate between categorical and string columns.
The threshold used to classify a column as low or high cardinality can be specified by the user.

An experimental feature currently implemented is auto_cast, which tries to cast each column to the best possible data type.
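
For illustration, here is a minimal usage sketch. The toy data and column names are made up, and the import path reflects the package name at the time of this PR; only SuperVectorizer(auto_cast=True) is taken verbatim from the documentation example further down.

import pandas as pd

from dirty_cat import SuperVectorizer  # top-level export assumed

# Hypothetical table mixing numerical and string columns.
X = pd.DataFrame({
    "age": [23, 45, 31, 27],                     # numerical
    "department": ["HR", "IT", "IT", "HR"],      # low-cardinality string
    "city": ["Paris", "Lyon", "Nice", "Lille"],  # string
})

sup_vec = SuperVectorizer(auto_cast=True)
X_enc = sup_vec.fit_transform(X)
# Each group of columns is routed to a suitable encoder under the hood,
# through scikit-learn's ColumnTransformer.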

Please submit feedback and suggestions; I'd be happy to discuss them!

LilianBoulard marked this pull request as draft on June 22, 2021.
    return array
elif isinstance(array, np.ndarray):
    # Replace missing values for numpy
    array[np.where(np.isnan(array))] = 0
Member:

Here you are replacing the values in place. This is in general dangerous (though certainly not frowned upon) if the programmer is not aware of it.

The challenge of this function is that it does a bit of both: it modifies the input data, but what it returns can be different. It's somewhat bad style, but I have no proposal that is as memory-friendly :).

This should at least be mentioned clearly in a very short docstring, so that later contributors don't make mistakes.
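
To make the point concrete, here is a sketch of the kind of short docstring being asked for, on the _replace_missing helper that appears below; the function body is an assumption reconstructed from the snippet above, not the actual implementation.

import numpy as np
import pandas as pd


def _replace_missing(array):
    """Replace missing values with a sentinel value.

    .. warning::
        For numpy arrays the replacement is done in place: the input
        ``array`` is modified. The returned object can still differ from
        the input (e.g. for pandas objects), so callers should always use
        the return value and never assume the input is left untouched.
    """
    if isinstance(array, pd.DataFrame):
        # Pandas objects are handled out of place here (assumption).
        return array.fillna(0)
    elif isinstance(array, np.ndarray):
        # Replace missing values for numpy, in place.
        array[np.where(np.isnan(array))] = 0
        return array
    return array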

# Detect if the array contains missing values.
if _has_missing_values(X):
    if self.handle_missing == '':
        X = _replace_missing(X)
Member:

Actually, ideally I think we would not handle missing values in the SuperVectorizer, as downstream steps (transformers, or even the final learner) can often do this better.
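
For illustration, a sketch of the alternative being suggested: leave missing values alone in the vectorizer and let a downstream step handle them. The pipeline below is an assumption (a SimpleImputer between the vectorizer and the regressor); learners such as HistGradientBoostingRegressor could also consume NaNs directly.

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from dirty_cat import SuperVectorizer  # import path assumed

pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True),          # no missing-value handling here
    SimpleImputer(strategy="most_frequent"),  # downstream step fills NaNs
    RandomForestRegressor(n_estimators=25, random_state=42),
)
# X_train, y_train: as in the documentation example quoted further down.
pipeline.fit(X_train, y_train)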

numeric_columns = X.select_dtypes(include=['int', 'float']).columns.to_list()
string_columns = X.select_dtypes(include=['string', 'object']).columns.to_list()
categorical_columns = X.select_dtypes(include='category').columns.to_list()
datetime_columns = X.select_dtypes(include='datetime').columns.to_list()
Member:

Shouldn't we check that we have selected all the columns with the above? And issue a warning or an error if not.

Member (Author):

Great question!
Technically yes, but we currently don't include some data types (e.g. timedelta, and perhaps others I forgot).
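
A sketch of what such a check could look like, building on the selection snippet above (the warning text and the exact handling of left-over dtypes such as timedelta are assumptions):

import warnings

selected_columns = (numeric_columns + string_columns
                    + categorical_columns + datetime_columns)
unassigned_columns = set(X.columns) - set(selected_columns)
if unassigned_columns:
    # e.g. timedelta columns, or other dtypes not handled yet.
    warnings.warn(
        f"The following columns were not assigned to any transformer "
        f"and will be ignored: {list(unassigned_columns)}"
    )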

@GaelVaroquaux (Member):

❤️

You have a failing CI. It will need attention at some point.

@LilianBoulard (Member, Author):

Then we need to fix it

Addressed in bd90180 :)

[Review threads on CHANGES.rst and doc/index.rst: outdated, resolved]
Comment on lines 105 to 117
# Let's perform the same workflow, but without the `Pipeline`, so we can
# analyze its mechanisms along the way.

from sklearn.ensemble import RandomForestRegressor


sup_vec = SuperVectorizer(auto_cast=True)
regressor = RandomForestRegressor(n_estimators=25, random_state=42)
# Fit the SuperVectorizer
X_train_enc = sup_vec.fit_transform(X_train, y_train)
X_test_enc = sup_vec.transform(X_test)
# And the regressor
regressor.fit(X_train_enc, y_train)
Member:

Maybe we should move this part to after we have printed the feature names.

Member (Author):

You mean for the RandomForestRegressor?
To get the feature names, the SuperVectorizer must be fitted first.
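
For context, a quick sketch of the ordering constraint being discussed: feature names only exist once the SuperVectorizer has been fitted, because the underlying ColumnTransformer determines its output columns during fit. The get_feature_names accessor is assumed here, mirroring ColumnTransformer's API.

# (continuing from the documentation example above)
sup_vec = SuperVectorizer(auto_cast=True)

# sup_vec.get_feature_names()  # would fail: nothing has been fitted yet

X_train_enc = sup_vec.fit_transform(X_train, y_train)
feature_names = sup_vec.get_feature_names()  # available only after fit
print(feature_names[:10])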

GaelVaroquaux marked this pull request as ready for review on July 6, 2021.
LilianBoulard changed the title from "[WIP] Add SuperVectorizer" to "Add SuperVectorizer" on July 6, 2021.
@GaelVaroquaux (Member):

CI ran, I'm merging!

GaelVaroquaux merged commit b2cfb66 into skrub-data:master on July 6, 2021.
@GaelVaroquaux (Member):

This is exciting!!

LilianBoulard deleted the super-vectorizer branch on October 12, 2021.