
[MRG + 1] ENH: new CategoricalEncoder class #9151

Merged: 37 commits, Nov 21, 2017

Commits:
70d8165
Added CategoricalEncoder class - deprecating OneHotEncoder
vighneshbirodkar Mar 18, 2016
bea23a5
First round of updates
jorisvandenbossche Jun 19, 2017
fda6d27
fix + test specifying of categories
jorisvandenbossche Jun 26, 2017
5f2b403
further clean-up + tests
jorisvandenbossche Jun 27, 2017
e175e4c
fix skipping pandas test
jorisvandenbossche Jun 27, 2017
dfaa9c0
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Aug 1, 2017
4f64648
feedback andy
jorisvandenbossche Aug 1, 2017
01c3bd4
add encoding keyword to support ordinal encoding
jorisvandenbossche Aug 7, 2017
dcef19c
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Aug 9, 2017
2ed91e8
remove y from transform signature
jorisvandenbossche Aug 9, 2017
a589dd9
Remove sparse keyword in favor of encoding='onehot-dense'
jorisvandenbossche Aug 9, 2017
17e5e69
Let encoding='ordinal' follow dtype keyword
jorisvandenbossche Aug 9, 2017
47a88dd
add categories_ attribute
jorisvandenbossche Aug 9, 2017
7b5b476
expand docs on ordinal + feedback
jorisvandenbossche Aug 25, 2017
5f26bdc
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 19, 2017
3dcc07f
feedback Andy
jorisvandenbossche Oct 19, 2017
5f5934f
add whatsnew note
jorisvandenbossche Oct 19, 2017
c6a5d30
for now raise on unsorted passed categories
jorisvandenbossche Oct 20, 2017
ad5fdc7
Implement inverse_transform
jorisvandenbossche Oct 20, 2017
eb2f4b8
fix example to have sorted categories
jorisvandenbossche Oct 20, 2017
ce82c28
backport scipy sparse argmax
jorisvandenbossche Oct 27, 2017
64aeff5
check handle_unknown before computation in fit
jorisvandenbossche Oct 27, 2017
4f8efcf
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 27, 2017
a1c0982
make scipy backport private
jorisvandenbossche Oct 27, 2017
85cf315
Directly construct CSR matrix
jorisvandenbossche Oct 30, 2017
b40bd8e
try to preserve original dtype if resulting dtype is not string
jorisvandenbossche Oct 30, 2017
2d9b4dd
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Oct 31, 2017
a31bb2a
Remove copying of data, only copy when needed in transform + add test
jorisvandenbossche Oct 31, 2017
2ef5fb9
add test for input dtypes / categories_ dtypes
jorisvandenbossche Oct 31, 2017
937446e
doc updates based on feedback
jorisvandenbossche Oct 31, 2017
a83102c
fix docstring example for python 2
jorisvandenbossche Oct 31, 2017
fbe9ea7
Merge remote-tracking branch 'upstream/master' into pr/6559
jorisvandenbossche Nov 7, 2017
21d9c0c
add checking of shape of X in inverse_transform
jorisvandenbossche Nov 7, 2017
929362f
loopify dtype tests
jorisvandenbossche Nov 9, 2017
a6d55d1
reword example on unknown categories
jorisvandenbossche Nov 9, 2017
9aeeb6d
clarify docs
jorisvandenbossche Nov 9, 2017
c39aa0c
remove repeated one
jorisvandenbossche Nov 9, 2017

@@ -1197,7 +1197,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder

@amueller (Member), Jun 18, 2017:
This indicated deprecation of the old class? I'm not sure we want to do that yet.

@jorisvandenbossche (Author, Member), Jun 18, 2017:
Yeah, for now I just rebased the old PR; see my to-do list in the top post. I would leave the OneHotEncoder as is for now. We can always decide to deprecate it later if we want.

@amueller (Member), Jun 18, 2017:
Looked at the code before I looked at your description. I think your description is a good summary.

preprocessing.CategoricalEncoder
preprocessing.PolynomialFeatures
preprocessing.QuantileTransformer
preprocessing.RobustScaler
@@ -461,38 +461,45 @@ not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used
with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
implemented in :class:`OneHotEncoder`. This estimator transforms each
implemented in :class:`CategoricalEncoder`. This estimator transforms each

@jnothman (Member), Aug 22, 2017:
Given the availability of ordinal encoding from CategoricalEncoder, I think it should be described differently here.

@jorisvandenbossche (Author, Member), Aug 25, 2017:
Yes, I added some explanation about the ordinal encoding. The only thing is that, by re-using the current example and flow, I first introduce ordinal encoding and only then one-hot, while the default is one-hot. If you prefer starting with the default behaviour, I can rework this.
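The ordinal encoding being discussed can be sketched as follows. This is a hypothetical minimal illustration, not the PR's implementation: each category is replaced by its integer index in the learned category list, giving one output column per input column instead of one per category.

```python
# Hypothetical sketch of encoding='ordinal' (illustrative only): each
# value is mapped to its index in that column's learned category list.

def ordinal_transform(X, categories):
    """Replace each categorical value by its index in the category list."""
    return [[cats.index(v) for v, cats in zip(row, categories)]
            for row in X]

cats = [['female', 'male'], ['from Europe', 'from US']]
print(ordinal_transform([['male', 'from US']], cats))
# -> [[1, 1]]
```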

categorical feature with ``m`` possible values into ``m`` binary features, with
only one active.

Continuing the example above::

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
>>> enc = preprocessing.CategoricalEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categorical_features='all', classes='auto',
dtype=<... 'numpy.float64'>, handle_unknown='error',
sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari']]).toarray()
array([[ 1., 0., 0., 1., 0., 1.]])
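The fit/transform behaviour shown in the doctest above can be sketched in plain Python. This is a hypothetical minimal illustration, not the PR's code: fit learns the sorted unique categories per column, and transform emits one indicator column per learned category, raising on unseen values as with the default ``handle_unknown='error'``.

```python
# Hypothetical sketch of CategoricalEncoder's default one-hot behaviour
# (illustrative only, not the PR's implementation).

def fit_categories(X):
    """Learn the sorted unique categories for each feature column."""
    n_features = len(X[0])
    return [sorted({row[j] for row in X}) for j in range(n_features)]

def onehot_transform(X, categories):
    """Encode each row as concatenated one-of-K indicator vectors."""
    out = []
    for row in X:
        encoded = []
        for value, cats in zip(row, categories):
            if value not in cats:  # mirrors handle_unknown='error'
                raise ValueError("unknown category: %r" % (value,))
            encoded.extend(1.0 if value == c else 0.0 for c in cats)
        out.append(encoded)
    return out

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
cats = fit_categories(X)
print(onehot_transform([['female', 'from US', 'uses Safari']], cats))
# -> [[1.0, 0.0, 0.0, 1.0, 0.0, 1.0]]
```

The output matches the doctest: categories are sorted per column, so 'female' comes before 'male' and the first indicator is 1.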


By default, how many values each feature can take is inferred automatically from the dataset.
It is possible to specify this explicitly using the parameter ``n_values``.
It is possible to specify this explicitly using the parameter ``classes``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.

Note that, if there is a possibilty that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
>>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
features, one has to explicitly set ``classes``. For example,

>>> genders = ['male', 'female']
>>> locations = ['from Europe', 'from US', 'from Africa', 'from Asia']
>>> browsers = ['uses Safari', 'uses Firefox', 'uses IE', 'uses Chrome']
>>> enc = preprocessing.CategoricalEncoder(classes=[genders, locations, browsers])
>>> # Note that for there are missing categorical values for the 2nd and 3rd
>>> # feature
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS
CategoricalEncoder(categorical_features='all',
classes=[...],
dtype=<... 'numpy.float64'>, handle_unknown='error',
sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[ 1., 0., 0., 0., 0., 1., 1., 0., 0., 0.]])
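The effect of passing explicit category lists (the ``classes`` parameter in this PR) can be sketched as below. This is a hypothetical illustration with sorted lists (the PR raises on unsorted passed categories); the helper name is invented, not the PR's code. Categories absent from the training data still get a column.

```python
# Hypothetical sketch of one-hot encoding against user-supplied category
# lists (illustrative only): every listed category gets an indicator
# column, even if it never appears in the training data.

def onehot_with_classes(row, classes):
    """One-hot encode a single row against explicit category lists."""
    encoded = []
    for value, cats in zip(row, classes):
        encoded.extend(1.0 if value == c else 0.0 for c in cats)
    return encoded

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

print(onehot_with_classes(['female', 'from Asia', 'uses Chrome'],
                          [genders, locations, browsers]))
# -> [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```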
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.
@@ -34,7 +34,7 @@
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import CategoricalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline
@@ -62,7 +62,7 @@

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_enc = CategoricalEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
@@ -72,7 +72,7 @@
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_enc = CategoricalEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
@@ -39,7 +39,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
However, note that this transformer will only do a binary one-hot encoding
when feature values are of type string. If categorical features are
represented as numeric values such as int, the DictVectorizer can be
followed by OneHotEncoder to complete binary one-hot encoding.
followed by :class:`sklearn.preprocessing.CategoricalEncoder` to complete
binary one-hot encoding.
Features that do not occur in a sample (mapping) will have a zero value
in the resulting array/matrix.
@@ -88,8 +89,8 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
See also
--------
FeatureHasher : performs vectorization using only a hash function.
sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitraty data types.

@amueller (Member), Jul 12, 2017:
is it? or strings or integers? What happens with pandas categorical?

@amueller (Member), Aug 28, 2017:
arbitraty -> arbitrary

"""

def __init__(self, dtype=np.float64, separator="=", sparse=True,
@@ -22,6 +22,7 @@
from .data import minmax_scale
from .data import quantile_transform
from .data import OneHotEncoder
from .data import CategoricalEncoder

from .data import PolynomialFeatures

@@ -46,6 +47,7 @@
'QuantileTransformer',
'Normalizer',
'OneHotEncoder',
'CategoricalEncoder',
'RobustScaler',
'StandardScaler',
'add_dummy_feature',