
Add SuperVectorizer #167

Merged · 35 commits · Jul 6, 2021

Conversation

LilianBoulard (Member):

Hello

This PR aims to implement an "automatic column vectorizer" as suggested in #54 (closes #54): the SuperVectorizer.

It can be used to automatically apply encoders to columns, depending on implicit characteristics such as the column dtype.
Under the hood, it is an interface to scikit-learn's ColumnTransformer.

Currently, it selects columns based on their dtype, and divides them into several groups:

  • Numerical variables
  • Datetime variables
  • Low cardinality categorical variables
  • High cardinality categorical variables
  • Low cardinality string variables
  • High cardinality string variables

As you may have noticed, we differentiate between categorical and string columns.
The threshold used to classify a column as low or high cardinality can be specified by the user.

An experimental feature currently implemented is auto_cast, which tries to cast each column to the best possible data type.
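
For illustration, here is a minimal usage sketch. The toy data and column names are made up, and the import path reflects the package name at the time of this PR; only SuperVectorizer(auto_cast=True) is taken verbatim from the documentation example further down.

import pandas as pd

from dirty_cat import SuperVectorizer  # top-level export assumed

# Hypothetical table mixing numerical and string columns.
X = pd.DataFrame({
    "age": [23, 45, 31, 27],                     # numerical
    "department": ["HR", "IT", "IT", "HR"],      # low-cardinality string
    "city": ["Paris", "Lyon", "Nice", "Lille"],  # string
})

sup_vec = SuperVectorizer(auto_cast=True)
X_enc = sup_vec.fit_transform(X)
# Each group of columns is routed to a suitable encoder under the hood,
# through scikit-learn's ColumnTransformer.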

Please submit feedback and suggestions; I'd be happy to discuss them!

LilianBoulard marked this pull request as draft on June 22, 2021.
    return array
elif isinstance(array, np.ndarray):
    # Replace missing values for numpy
    array[np.where(np.isnan(array))] = 0
Member:

Here you are replacing the values in place. This is in general dangerous (though certainly not frowned upon) if the programmer is not aware of it.

The challenge of this function is that it does a bit of both: it modifies the input data, but what it returns can be different. It's somewhat bad style, but I have no proposal that is as memory-friendly :).

This should at least be mentioned clearly in a very short docstring, so that later contributors don't make mistakes.
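
To make the point concrete, here is a sketch of the kind of short docstring being asked for, on the _replace_missing helper that appears below; the function body is an assumption reconstructed from the snippet above, not the actual implementation.

import numpy as np
import pandas as pd


def _replace_missing(array):
    """Replace missing values with a sentinel value.

    .. warning::
        For numpy arrays the replacement is done in place: the input
        ``array`` is modified. The returned object can still differ from
        the input (e.g. for pandas objects), so callers should always use
        the return value and never assume the input is left untouched.
    """
    if isinstance(array, pd.DataFrame):
        # Pandas objects are handled out of place here (assumption).
        return array.fillna(0)
    elif isinstance(array, np.ndarray):
        # Replace missing values for numpy, in place.
        array[np.where(np.isnan(array))] = 0
        return array
    return array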

# Detect if the array contains missing values.
if _has_missing_values(X):
    if self.handle_missing == '':
        X = _replace_missing(X)
Member:

Actually, ideally I think we would not handle missing values in the SuperVectorizer, as downstream steps (transformers, or even the final learner) can often do this better.
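
For illustration, a sketch of the alternative being suggested: leave missing values alone in the vectorizer and let a downstream step handle them. The pipeline below is an assumption (a SimpleImputer between the vectorizer and the regressor); learners such as HistGradientBoostingRegressor could also consume NaNs directly.

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from dirty_cat import SuperVectorizer  # import path assumed

pipeline = make_pipeline(
    SuperVectorizer(auto_cast=True),          # no missing-value handling here
    SimpleImputer(strategy="most_frequent"),  # downstream step fills NaNs
    RandomForestRegressor(n_estimators=25, random_state=42),
)
# X_train, y_train: as in the documentation example quoted further down.
pipeline.fit(X_train, y_train)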

numeric_columns = X.select_dtypes(include=['int', 'float']).columns.to_list()
string_columns = X.select_dtypes(include=['string', 'object']).columns.to_list()
categorical_columns = X.select_dtypes(include='category').columns.to_list()
datetime_columns = X.select_dtypes(include='datetime').columns.to_list()
Member:

Shouldn't we check that we have selected all the columns with the above? And issue a warning or an error if not.

Member (Author):

Great question!
Technically yes, but we currently don't include some data types (e.g. timedelta, and perhaps others I forgot).
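
A sketch of what such a check could look like, building on the selection snippet above (the warning text and the exact handling of left-over dtypes such as timedelta are assumptions):

import warnings

selected_columns = (numeric_columns + string_columns
                    + categorical_columns + datetime_columns)
unassigned_columns = set(X.columns) - set(selected_columns)
if unassigned_columns:
    # e.g. timedelta columns, or other dtypes not handled yet.
    warnings.warn(
        f"The following columns were not assigned to any transformer "
        f"and will be ignored: {list(unassigned_columns)}"
    )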

@GaelVaroquaux (Member):

❤️

You have a failing CI. It will need attention at some point.

@LilianBoulard (Member, Author):

Then we need to fix it

Addressed in bd90180 :)

[Review threads on CHANGES.rst and doc/index.rst: outdated, resolved]
Comment on lines 105 to 117
# Let's perform the same workflow, but without the `Pipeline`, so we can
# analyze its mechanisms along the way.

from sklearn.ensemble import RandomForestRegressor


sup_vec = SuperVectorizer(auto_cast=True)
regressor = RandomForestRegressor(n_estimators=25, random_state=42)
# Fit the SuperVectorizer
X_train_enc = sup_vec.fit_transform(X_train, y_train)
X_test_enc = sup_vec.transform(X_test)
# And the regressor
regressor.fit(X_train_enc, y_train)
Member:

Maybe we should move this part to after we have printed the feature names.

Member (Author):

You mean for the RandomForestRegressor?
To get the feature names, the SuperVectorizer must be fitted first.
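
For context, a quick sketch of the ordering constraint being discussed: feature names only exist once the SuperVectorizer has been fitted, because the underlying ColumnTransformer determines its output columns during fit. The get_feature_names accessor is assumed here, mirroring ColumnTransformer's API.

# (continuing from the documentation example above)
sup_vec = SuperVectorizer(auto_cast=True)

# sup_vec.get_feature_names()  # would fail: nothing has been fitted yet

X_train_enc = sup_vec.fit_transform(X_train, y_train)
feature_names = sup_vec.get_feature_names()  # available only after fit
print(feature_names[:10])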

GaelVaroquaux marked this pull request as ready for review on July 6, 2021.
LilianBoulard changed the title from "[WIP] Add SuperVectorizer" to "Add SuperVectorizer" on July 6, 2021.
@GaelVaroquaux (Member):

CI ran, I'm merging!

GaelVaroquaux merged commit b2cfb66 into skrub-data:master on July 6, 2021.
@GaelVaroquaux (Member):

This is exciting!!

LilianBoulard deleted the super-vectorizer branch on October 12, 2021.