Add SuperVectorizer #167
dirty_cat/super_vectorizer.py
    return array
elif isinstance(array, np.ndarray):
    # Replace missing values for numpy
    array[np.where(np.isnan(array))] = 0
Here you are replacing the values in place. This is generally dangerous (though certainly not frowned upon) if the programmer is not aware of it.
The challenge of this function is that it does a bit of both: it modifies the input data, but what it returns can be different. It's somewhat bad style, but I have no proposal that is as memory-friendly :).
This should at least be mentioned clearly in a very short docstring, so that later contributors don't make mistakes.
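To illustrate the concern, here is a minimal sketch of what an in-place replacement with an explicit docstring could look like. The function name `replace_missing_inplace` is hypothetical and is not the helper from this PR; it only shows that the caller's array is mutated unless a copy is passed.

```python
import numpy as np


def replace_missing_inplace(array: np.ndarray) -> np.ndarray:
    """Replace NaN entries with 0, modifying ``array`` in place.

    The same array object is returned, so the caller's data is changed.
    (Hypothetical helper, for illustration only.)
    """
    array[np.isnan(array)] = 0
    return array


original = np.array([1.0, np.nan, 3.0])
result = replace_missing_inplace(original)

# The input was mutated: both names refer to the same, modified array.
assert result is original
assert not np.isnan(original).any()

# Callers who need the untouched data must copy before calling.
untouched = np.array([1.0, np.nan, 3.0])
cleaned = replace_missing_inplace(untouched.copy())
assert np.isnan(untouched[1])
```

Copying before the call costs memory, which is the trade-off mentioned above.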
# Detect if the array contains missing values.
if _has_missing_values(X):
    if self.handle_missing == '':
        X = _replace_missing(X)
Actually, ideally I think that we would not handle missing values in the supervectorizer, as often downstream (transformers or even the final learner) can do this better.
numeric_columns = X.select_dtypes(include=['int', 'float']).columns.to_list()
string_columns = X.select_dtypes(include=['string', 'object']).columns.to_list()
categorical_columns = X.select_dtypes(include='category').columns.to_list()
datetime_columns = X.select_dtypes(include='datetime').columns.to_list()
Shouldn't we check that we have selected all the columns with the above? And issue a warning or an error if not.
Great question!
Technically yes, but we currently don't include some data types (e.g. timedelta, and perhaps others I forgot).
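A possible shape for such a check, as a rough sketch only: collect every column that was assigned to a group and warn about the rest. The example data and the `unselected` variable are made up for illustration, and the string group is left out of the sketch for brevity.

```python
import warnings

import pandas as pd

# Made-up example data with one dtype (timedelta) that no group selects.
X = pd.DataFrame({
    "price": [1.5, 2.0],
    "grade": pd.Categorical(["a", "b"]),
    "sold_on": pd.to_datetime(["2021-01-01", "2021-02-01"]),
    "delay": pd.to_timedelta(["1 days", "2 days"]),
})

numeric_columns = X.select_dtypes(include=["int", "float"]).columns.to_list()
categorical_columns = X.select_dtypes(include="category").columns.to_list()
datetime_columns = X.select_dtypes(include="datetime").columns.to_list()

# Any column that ended up in no group is reported to the user.
selected = set(numeric_columns + categorical_columns + datetime_columns)
unselected = [col for col in X.columns if col not in selected]
if unselected:
    warnings.warn(f"Columns not assigned to any group: {unselected}")
```

Whether this should be a warning or an error is left open here, as in the question above.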
Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
❤️ You have a failing CI. It will need attention at some point.
Addressed in bd90180 :)
# Let's perform the same workflow, but without the `Pipeline`, so we can
# analyze its mechanisms along the way.

from sklearn.ensemble import RandomForestRegressor


sup_vec = SuperVectorizer(auto_cast=True)
regressor = RandomForestRegressor(n_estimators=25, random_state=42)
# Fit the SuperVectorizer
X_train_enc = sup_vec.fit_transform(X_train, y_train)
X_test_enc = sup_vec.transform(X_test)
# And the regressor
regressor.fit(X_train_enc, y_train)
Maybe we should move this part after we have printed the feature names
You mean for the RFR?
To get the feature names, the SV must be fitted.
ce34905 to 01cf92c
CI ran, I'm merging!
This is exciting!!
Hello,
This PR aims at implementing an "automatic column vectorizer" as suggested by #54 (solves #54): the SuperVectorizer.
It can be used to automatically apply encoders to columns, depending on implicit characteristics (such as the column `dtype`).
Under the hood, it is an interface for sklearn's `ColumnTransformer`.
Currently, it selects columns based on their `dtype`, and divides them into several groups.
As you noticed, we differentiate categorical and string.
The threshold used to categorize low or high cardinality can be specified by the user.
An experimental feature currently implemented is `auto_cast`, which will try to cast each column to the best possible data type.
Please submit feedback and suggestions, I'd be happy to discuss it!