
[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged

43 commits
a4845f1
Added documentation for drop one OneHotEncoder
Dec 28, 2018
f2f5c8a
Added functionality for drop first and drop manually-specified.
Jan 2, 2019
47c61bc
Added tests for OneHotEncoder drop parameter
Jan 3, 2019
335406d
Fixed flake8
Jan 3, 2019
fb83786
Resolved merge conflict in preprocessing/_encoders.py
Jan 3, 2019
76a8274
Added additional code to detect when a column that ought to be droppe…
Jan 3, 2019
a35b8b9
Fixed docstring
Jan 3, 2019
bddbad9
Fixed docstring2
Jan 3, 2019
e368cfe
Finally will clear pytest
Jan 3, 2019
6ae1019
Removed code that is not compatible with numpy 1.11.0
Jan 3, 2019
89a55e1
Fixed docs to match current OneHotEncoder string
Jan 3, 2019
6c10149
Updated error message with language that addresses new drop functiona…
Jan 25, 2019
da3f989
Updated documentation and testing to resolve errors mentioned by jnot…
Jan 25, 2019
82f8f07
Fixed Flake8 bug
Jan 25, 2019
cf1a425
Added new way to encode features to be dropped. Added support for lea…
Jan 28, 2019
f09ada2
Merge branch 'master' into leave_one_out_ohe
Jan 28, 2019
aa25bdf
Implemented basic text corrections and code optimizations recommended
Jan 30, 2019
d74a0f4
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
a5c58f1
fixed Flake8 Literal complaint
Jan 30, 2019
e1942dd
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
702c902
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
337bb2b
Removed unnecessary loop on transform step
Feb 12, 2019
9173550
Added additional tests for categories. Refactored some code in transf…
Feb 12, 2019
b3468d8
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
f7aaa02
Added dtype, type checking for drop_idx_
Feb 13, 2019
903e17a
Fixed small typo. Repush due to random network error
Feb 14, 2019
063e307
incorporated changes proposed by jorisvandenbossche
Feb 16, 2019
576c94c
Fixed flake8 error
Feb 16, 2019
5948d30
Added documentation for new features. Implemented revisions to tests.…
Feb 19, 2019
3ed9d7a
Merge branch 'master' into leave_one_out_ohe
Feb 19, 2019
decad00
Fixed flake8
Feb 19, 2019
f4c8f7c
Fixed error check order to match expected order in pytest
Feb 19, 2019
0e99906
Refactored some code, small changes from @nicolashug
Feb 19, 2019
c1836ce
Added code example
Feb 19, 2019
b1330c5
updated relative reference in rst to OneHotEncoder
Feb 19, 2019
2eb2c16
Changes to documentation for OHE, changes to legacy implementation to…
Feb 20, 2019
c85b5dd
Merge branch 'master' into leave_one_out_ohe
Feb 20, 2019
ac163ee
Removed option to select drop-one only on some columns. This will be …
Feb 25, 2019
da73238
Merge branch 'master' into leave_one_out_ohe
Feb 25, 2019
e351b1f
Reformatting, fixed flake8
Feb 25, 2019
6033765
Small doc fix
Feb 26, 2019
9597543
Fixed n_values/drop compatibility
Feb 26, 2019
7149eac
Removed cruft. Added in test for drop/n_values interaction
Feb 26, 2019
+309 −58
@@ -489,7 +489,7 @@ Continuing the example above::
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
@@ -516,7 +516,7 @@ dataset::
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features=None,
-categories=[...],
+categories=[...], drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
@@ -533,13 +533,31 @@ columns for this feature will be all zeros
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='ignore',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])


It is also possible to encode each column into ``n_categories - 1`` columns
instead of ``n_categories`` columns by using the ``drop`` parameter. This
parameter allows the user to specify a category for each feature to be dropped.
This is useful to avoid co-linearity in the input matrix in some classifiers.
Such functionality is useful, for example, when using non-regularized
regression (:class:`LinearRegression <sklearn.linear_model.LinearRegression>`),
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 20, 2019:

maybe state this a bit broader? (or explicitly make it clear the LinearRegression class is an example, as now it seems that this is the only use case)
since co-linearity would cause the covariance matrix to be non-invertible.
When this parameter is not None, ``handle_unknown`` must be set to
``error``::

>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
[0., 0., 0.]])
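The behaviour of ``drop='first'`` shown in the doctest above can be reproduced with a small numpy-only sketch. The helper `one_hot_drop_first` below is hypothetical, written only for illustration; it is not part of scikit-learn:

```python
import numpy as np

def one_hot_drop_first(X):
    """Hypothetical sketch of OneHotEncoder(drop='first').

    Each feature keeps n_categories - 1 indicator columns; the first
    (lexicographically smallest) category of every feature encodes
    to all zeros.
    """
    X = np.asarray(X, dtype=object)
    cols = []
    for j in range(X.shape[1]):
        cats = np.unique(X[:, j])     # sorted, like categories_
        for cat in cats[1:]:          # drop the first category
            cols.append((X[:, j] == cat).astype(float))
    return np.column_stack(cols)

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]
print(one_hot_drop_first(X))
```

This matches the output above: the first sample is all ones because each of its values is the second (kept) category of its feature, while the second sample holds only dropped categories and becomes all zeros.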

This conversation was marked as resolved by drewmjohnston

NicolasHug (Contributor), Feb 19, 2019:

Code example is missing right?

See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as scalars.

@@ -278,6 +278,11 @@ Support for Python 3.4 and below has been officially dropped.
:class:`preprocessing.StandardScaler`. :issue:`13007` by
:user:`Raffaello Baluyot <baluyotraf>`

- |Feature| :class:`OneHotEncoder` now supports dropping one category per feature
with a new ``drop`` parameter. :issue:`12908` by
:user:`Drew Johnston <drewmjohnston>`.


:mod:`sklearn.tree`
...................
- |Feature| Decision Trees can now be plotted with matplotlib using
@@ -2,7 +2,6 @@
# Joris Van den Bossche <jorisvandenbossche@gmail.com>
# License: BSD 3 clause


import numbers
import warnings

@@ -158,6 +157,18 @@ class OneHotEncoder(_BaseEncoder):
The used categories can be found in the ``categories_`` attribute.
drop : 'first' or a list/array of shape (n_features,), default=None.
Specifies a methodology to use to drop one of the categories per
feature. This is useful in situations where perfectly collinear
features cause problems, such as when feeding the resulting data
into a neural network or an unregularized regression.
- None : retain all features (the default).
- 'first' : drop the first category in each feature. If only one
category is present, the feature will be dropped entirely.
- array : ``drop[i]`` is the category in feature ``X[:, i]`` that
should be dropped.
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
@@ -205,7 +216,13 @@ class OneHotEncoder(_BaseEncoder):
categories_ : list of arrays
The categories of each feature determined during fitting
(in order of the features in X and corresponding with the output
of ``transform``).
of ``transform``). This includes the category specified in ``drop``
(if any).
drop_idx_ : array of shape (n_features,)
``drop_idx_[i]`` is the index in ``categories_[i]`` of the category to
be dropped for each feature. None if all the transformed features will
be retained.
active_features_ : array
Indices for active features, meaning values that actually occur
@@ -243,9 +260,9 @@ class OneHotEncoder(_BaseEncoder):
>>> enc.fit(X)
... # doctest: +ELLIPSIS
... # doctest: +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
-dtype=<... 'numpy.float64'>, handle_unknown='ignore',
-n_values=None, sparse=True)
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
+dtype=<... 'numpy.float64'>, handle_unknown='ignore',
+n_values=None, sparse=True)
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
@@ -257,6 +274,12 @@ class OneHotEncoder(_BaseEncoder):
[None, 2]], dtype=object)
>>> enc.get_feature_names()
array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)
>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

can you use the same samples here as in the example above (without dropping)? That makes it easier to compare the difference in output between both.

drewmjohnston (Author), Feb 15, 2019:

As it stands, the samples used in the first example are not compatible with the drop method. The first example transforms an unknown feature (and has handle_unknown='ignore' set), but handle_unknown='ignore' is incompatible with drop, as it makes inverse transformations impossible.

array([[0., 0., 0.],
[1., 1., 0.]])
See also
--------
@@ -274,14 +297,15 @@ class OneHotEncoder(_BaseEncoder):
"""

def __init__(self, n_values=None, categorical_features=None,
-categories=None, sparse=True, dtype=np.float64,
+categories=None, drop=None, sparse=True, dtype=np.float64,
handle_unknown='error'):
self.categories = categories
self.sparse = sparse
self.dtype = dtype
self.handle_unknown = handle_unknown
self.n_values = n_values
self.categorical_features = categorical_features
self.drop = drop

# Deprecated attributes

@@ -346,28 +370,46 @@ def _handle_deprecations(self, X):
)
warnings.warn(msg, DeprecationWarning)
else:

# check if we have integer or categorical input
try:
check_array(X, dtype=np.int)
except ValueError:
self._legacy_mode = False
self._categories = 'auto'
else:
msg = (
"The handling of integer data will change in version "
"0.22. Currently, the categories are determined "
"based on the range [0, max(values)], while in the "
"future they will be determined based on the unique "
"values.\nIf you want the future behaviour and "
"silence this warning, you can specify "
"\"categories='auto'\".\n"
"In case you used a LabelEncoder before this "
"OneHotEncoder to convert the categories to integers, "
"then you can now use the OneHotEncoder directly."
)
warnings.warn(msg, FutureWarning)
self._legacy_mode = True
if self.drop is None:
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

I'm okay with, rather, drop use also triggering the new mode. Why not?

msg = (
"The handling of integer data will change in "
"version 0.22. Currently, the categories are "
"determined based on the range "
"[0, max(values)], while in the future they "
"will be determined based on the unique "
"values.\nIf you want the future behaviour "
"and silence this warning, you can specify "
"\"categories='auto'\".\n"
"In case you used a LabelEncoder before this "
"OneHotEncoder to convert the categories to "
"integers, then you can now use the "
"OneHotEncoder directly."
)
warnings.warn(msg, FutureWarning)
self._legacy_mode = True
self._n_values = 'auto'
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

is there a reason you added this line?

jorisvandenbossche (Contributor), Feb 18, 2019:

Can you answer this one?

jorisvandenbossche (Contributor), Feb 20, 2019:

Can you remove this line? (it is already set above)

jorisvandenbossche (Contributor), Feb 26, 2019:

@drewmjohnston Can you remove this line? (or, explain why it was needed to add it?)

else:
msg = (
"The handling of integer data will change in "
"version 0.22. Currently, the categories are "
"determined based on the range "
"[0, max(values)], while in the future they "
"will be determined based on the unique "
"values.\n The old behavior is not compatible "
"with the `drop` paramenter. Instead, you "
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 26, 2019, suggested change:

-"with the `drop` paramenter. Instead, you "
+"with the `drop` parameter. Instead, you "
"must manually specify \"categories='auto'\" "
"if you wish to use the `drop` parameter on "
"an array of entirely integer data. This will "
"enable the future behavior."
)
raise ValueError(msg)

# if user specified categorical_features -> always use legacy mode
if self.categorical_features is not None:
@@ -399,6 +441,13 @@ def _handle_deprecations(self, X):
else:
self._categorical_features = 'all'

# Prevents new drop functionality from being used in legacy mode
if self._legacy_mode and self.drop is not None:
raise ValueError(
"The `categorical_features` and `n_values` keywords "
"are deprecated, and cannot be used together "
"with 'drop'.")

def fit(self, X, y=None):
"""Fit OneHotEncoder to X.
@@ -411,10 +460,8 @@ def fit(self, X, y=None):
-------
self
"""
-if self.handle_unknown not in ('error', 'ignore'):
-msg = ("handle_unknown should be either 'error' or 'ignore', "
-"got {0}.".format(self.handle_unknown))
-raise ValueError(msg)

+self._validate_keywords()

self._handle_deprecations(X)

@@ -425,8 +472,59 @@ def fit(self, X, y=None):
return self
else:
self._fit(X, handle_unknown=self.handle_unknown)
self.drop_idx_ = self._compute_drop_idx()
return self

def _compute_drop_idx(self):
if self.drop is None:
return None
elif (isinstance(self.drop, str) and self.drop == 'first'):
return np.zeros(len(self.categories_), dtype=np.int_)
elif not isinstance(self.drop, str):
try:
self.drop = np.asarray(self.drop, dtype=object)
droplen = len(self.drop)
except (ValueError, TypeError):
msg = ("Wrong input for parameter `drop`. Expected "
"'first', None or array of objects, got {}")
raise ValueError(msg.format(type(self.drop)))
if droplen != len(self.categories_):
msg = ("`drop` should have length equal to the number "
"of features ({}), got {}")
raise ValueError(msg.format(len(self.categories_),
len(self.drop)))
missing_drops = [(i, val) for i, val in enumerate(self.drop)
if val not in self.categories_[i]]
if any(missing_drops):
msg = ("The following categories were supposed to be "
"dropped, but were not found in the training "
"data.\n{}".format(
"\n".join(
["Category: {}, Feature: {}".format(c, v)
for c, v in missing_drops])))
raise ValueError(msg)
return np.array([np.where(cat_list == val)[0][0]
for (val, cat_list) in
zip(self.drop, self.categories_)], dtype=np.int_)
else:
msg = ("Wrong input for parameter `drop`. Expected "
"'first', None or array of objects, got {}")
raise ValueError(msg.format(type(self.drop)))
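The index computation in `_compute_drop_idx` can be exercised in isolation. The free function below is a hedged standalone sketch that mirrors the method (here `categories` plays the role of the fitted `categories_` attribute); it is not the scikit-learn code itself:

```python
import numpy as np

def compute_drop_idx(drop, categories):
    """Sketch of the drop-index computation: map each feature's
    drop specification onto an index into its sorted categories."""
    if drop is None:
        return None
    if isinstance(drop, str) and drop == 'first':
        return np.zeros(len(categories), dtype=np.int_)
    drop = np.asarray(drop, dtype=object)
    if len(drop) != len(categories):
        raise ValueError("`drop` should have length equal to the "
                         "number of features ({}), got {}".format(
                             len(categories), len(drop)))
    # collect (feature index, value) pairs not seen during fit
    missing = [(i, val) for i, val in enumerate(drop)
               if val not in categories[i]]
    if missing:
        raise ValueError("categories to drop were not found in the "
                         "training data: {}".format(missing))
    return np.array([np.where(cats == val)[0][0]
                     for val, cats in zip(drop, categories)],
                    dtype=np.int_)

categories = [np.array(['female', 'male'], dtype=object),
              np.array([1, 2, 3], dtype=object)]
print(compute_drop_idx(['male', 2], categories))  # [1 1]
```

Note the sketch tests the `missing` list for emptiness directly, which sidesteps the `any(missing_drops)` subtlety in the diff above (a non-empty tuple is always truthy, so `any` there only checks that the list is non-empty).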

def _validate_keywords(self):
if self.handle_unknown not in ('error', 'ignore'):
msg = ("handle_unknown should be either 'error' or 'ignore', "
"got {0}.".format(self.handle_unknown))
raise ValueError(msg)
# If we have both dropped columns and ignored unknown
# values, there will be ambiguous cells. This creates difficulties
# in interpreting the model.
if self.drop is not None and self.handle_unknown != 'error':
raise ValueError(
"`handle_unknown` must be 'error' when the drop parameter is "
"specified, as both would create categories that are all "
"zero.")
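The ambiguity this check guards against is easy to demonstrate: under `drop`, the dropped category and an ignored unknown value would both encode to all zeros. A small illustrative sketch, reusing values from the docs example (plain numpy, not the estimator API):

```python
import numpy as np

# Fitted categories for one feature; with drop='first' only
# 'uses Safari' keeps an indicator column.
cats = np.array(['uses Firefox', 'uses Safari'], dtype=object)
kept = cats[1:]

def encode(value):
    # indicator columns for a single value of this feature
    return (kept == value).astype(float)

print(encode('uses Firefox'))  # dropped category -> all zeros
print(encode('uses Chrome'))   # unknown value    -> also all zeros
# The two cases are indistinguishable, so inverse_transform could not
# recover the original value if unknowns were silently ignored.
```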

def _legacy_fit_transform(self, X):
"""Assumes X contains only categorical features."""
dtype = getattr(X, 'dtype', None)
@@ -501,10 +599,8 @@ def fit_transform(self, X, y=None):
X_out : sparse matrix if sparse=True else a 2-d array
Transformed input.
"""
-if self.handle_unknown not in ('error', 'ignore'):
-msg = ("handle_unknown should be either 'error' or 'ignore', "
-"got {0}.".format(self.handle_unknown))
-raise ValueError(msg)

+self._validate_keywords()

self._handle_deprecations(X)

@@ -571,11 +667,22 @@ def _transform_new(self, X):

X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)

if self.drop is not None:
to_drop = self.drop_idx_.reshape(1, -1)

# We remove all the dropped categories from mask, and decrement all
# categories that occur after them to avoid an empty column.

keep_cells = X_int != to_drop
X_mask &= keep_cells
X_int[X_int > to_drop] -= 1
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

put a comment here

n_values = [len(cats) - 1 for cats in self.categories_]
else:
n_values = [len(cats) for cats in self.categories_]

mask = X_mask.ravel()
n_values = [cats.shape[0] for cats in self.categories_]
n_values = np.array([0] + n_values)
feature_indices = np.cumsum(n_values)

indices = (X_int + feature_indices[:-1]).ravel()[mask]
indptr = X_mask.sum(axis=1).cumsum()
indptr = np.insert(indptr, 0, 0)
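The mask-and-decrement trick added to `_transform_new` above can be traced on a tiny hand-written example (the integer codes and drop indices below are made up for illustration):

```python
import numpy as np

# Two samples, two features, integer-encoded categories.
X_int = np.array([[1, 0],
                  [0, 2]])
drop_idx = np.array([0, 1])      # per-feature index of the dropped category
to_drop = drop_idx.reshape(1, -1)

X_mask = np.ones_like(X_int, dtype=bool)
keep_cells = X_int != to_drop    # cells holding a dropped category...
X_mask &= keep_cells             # ...contribute no 1 to the output
X_int = X_int.copy()
X_int[X_int > to_drop] -= 1      # shift codes down so no column stays empty

print(X_mask)   # dropped cell (sample 2, feature 1) is masked out
print(X_int)    # remaining codes now index the n_categories - 1 kept columns
```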
@@ -613,7 +720,7 @@ def transform(self, X):
def inverse_transform(self, X):
"""Convert the back data to the original representation.
-In case unknown categories are encountered (all zero's in the
+In case unknown categories are encountered (all zeros in the
one-hot encoding), ``None`` is used to represent this category.
Parameters
@@ -635,7 +742,12 @@ def inverse_transform(self, X):

n_samples, _ = X.shape
n_features = len(self.categories_)
n_transformed_features = sum([len(cats) for cats in self.categories_])
if self.drop is None:
n_transformed_features = sum(len(cats)
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

this is repeated from above with n_values. Refactor it.
for cats in self.categories_)
else:
n_transformed_features = sum(len(cats) - 1
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 28, 2019:

can just do sum(len(cats) - (drop_idx is not None) for cats, drop_idx in zip(self.categories_, self.drop_idx_)

for cats in self.categories_)

# validate shape of passed X
msg = ("Shape of the passed X data is not correct. Expected {0} "
@@ -651,18 +763,35 @@ def inverse_transform(self, X):
found_unknown = {}

for i in range(n_features):
n_categories = len(self.categories_[i])
if self.drop is None:
cats = self.categories_[i]
else:
cats = np.delete(self.categories_[i], self.drop_idx_[i])
n_categories = len(cats)

# Only happens if there was a column with a unique
# category. In this case we just fill the column with this
# unique category value.
if n_categories == 0:
This conversation was marked as resolved by drewmjohnston

NicolasHug (Contributor), Feb 19, 2019:

Please don't remove this comment, unless it's out of date:

                    # Only happens if there was a column with a unique
                    # category. In this case we just fill the column with this
                    # unique category value.
X_tr[:, i] = self.categories_[i][self.drop_idx_[i]]
j += n_categories
continue
sub = X[:, j:j + n_categories]

# for sparse X argmax returns 2D matrix, ensure 1D array
labels = np.asarray(_argmax(sub, axis=1)).flatten()
-X_tr[:, i] = self.categories_[i][labels]

+X_tr[:, i] = cats[labels]
if self.handle_unknown == 'ignore':
-# ignored unknown categories: we have a row of all zero's
+# ignored unknown categories: we have a row of all zero
unknown = np.asarray(sub.sum(axis=1) == 0).flatten()
if unknown.any():
found_unknown[i] = unknown
# drop will either be None or handle_unknown will be error. If
# self.drop is not None, then we can safely assume that all of
# the nulls in each column are the dropped value
elif self.drop is not None:
dropped = np.asarray(sub.sum(axis=1) == 0).flatten()
if dropped.any():
X_tr[dropped, i] = self.categories_[i][self.drop_idx_[i]]

j += n_categories
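The inverse rule added above — an all-zero block decodes to the dropped category, which is safe only because `handle_unknown` must be 'error' — can be sketched for a single feature (plain numpy, not the estimator API):

```python
import numpy as np

cats = np.array(['female', 'male'], dtype=object)   # fitted categories
drop_idx = 1                                        # 'male' was dropped
kept = np.delete(cats, drop_idx)                    # surviving columns

block = np.array([[1.],     # encodes 'female'
                  [0.]])    # all zeros: must be the dropped 'male'
labels = np.argmax(block, axis=1)
decoded = kept[labels].copy()
dropped_rows = block.sum(axis=1) == 0   # unambiguous: no unknowns possible
decoded[dropped_rows] = cats[drop_idx]
print(decoded)
```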
