
[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged

43 commits
a4845f1
Added documentation for drop one OneHotEncoder
Dec 28, 2018
f2f5c8a
Added functionality for drop first and drop manually-specified.
Jan 2, 2019
47c61bc
Added tests for OneHotEncoder drop parameter
Jan 3, 2019
335406d
Fixed flake8
Jan 3, 2019
fb83786
Resolved merge conflict in preprocessing/_encoders.py
Jan 3, 2019
76a8274
Added additional code to detect when a column that ought to be droppe…
Jan 3, 2019
a35b8b9
Fixed docstring
Jan 3, 2019
bddbad9
Fixed docstring2
Jan 3, 2019
e368cfe
Finally will clear pytest
Jan 3, 2019
6ae1019
Removed code that is not compatible with numpy 1.11.0
Jan 3, 2019
89a55e1
Fixed docs to match current OneHotEncoder string
Jan 3, 2019
6c10149
Updated error message with language that addresses new drop functiona…
Jan 25, 2019
da3f989
Updated documentation and testing to resolve errors mentioned by jnot…
Jan 25, 2019
82f8f07
Fixed Flake8 bug
Jan 25, 2019
cf1a425
Added new way to encode features to be dropped. Added support for lea…
Jan 28, 2019
f09ada2
Merge branch 'master' into leave_one_out_ohe
Jan 28, 2019
aa25bdf
Implemented basic text corrections and code optimizations recommended
Jan 30, 2019
d74a0f4
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
a5c58f1
fixed Flake8 Literal complaint
Jan 30, 2019
e1942dd
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
702c902
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
337bb2b
Removed unnecessary loop on transform step
Feb 12, 2019
9173550
Added additional tests for categories. Refactored some code in transf…
Feb 12, 2019
b3468d8
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
f7aaa02
Added dtype, type checking for drop_idx_
Feb 13, 2019
903e17a
Fixed small typo. Repush due to random network error
Feb 14, 2019
063e307
incorporated changes proposed by jorisvandenbossche
Feb 16, 2019
576c94c
Fixed flake8 error
Feb 16, 2019
5948d30
Added documentation for new features. Implemented revisions to tests.…
Feb 19, 2019
3ed9d7a
Merge branch 'master' into leave_one_out_ohe
Feb 19, 2019
decad00
Fixed flake8
Feb 19, 2019
f4c8f7c
Fixed error check order to match expected order in pytest
Feb 19, 2019
0e99906
Refactored some code, small changes from @nicolashug
Feb 19, 2019
c1836ce
Added code example
Feb 19, 2019
b1330c5
updated relative reference in rst to OneHotEncoder
Feb 19, 2019
2eb2c16
Changes to documentation for OHE, changes to legacy implementation to…
Feb 20, 2019
c85b5dd
Merge branch 'master' into leave_one_out_ohe
Feb 20, 2019
ac163ee
Removed option to select drop-one only on some columns. This will be …
Feb 25, 2019
da73238
Merge branch 'master' into leave_one_out_ohe
Feb 25, 2019
e351b1f
Reformatting, fixed flake8
Feb 25, 2019
6033765
Small doc fix
Feb 26, 2019
9597543
Fixed n_values/drop compatibility
Feb 26, 2019
7149eac
Removed cruft. Added in test for drop/n_values interaction
Feb 26, 2019
+254 −45
@@ -489,7 +489,7 @@ Continuing the example above::
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
@@ -516,7 +516,7 @@ dataset::
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features=None,
-categories=[...],
+categories=[...], drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
@@ -533,7 +533,7 @@ columns for this feature will be all zeros
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='ignore',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
@@ -161,6 +161,19 @@ class OneHotEncoder(_BaseEncoder):
The used categories can be found in the ``categories_`` attribute.
drop : 'first' or a list/array of shape (n_features,), default=None.
Specifies a methodology to use to drop one of the categories per
feature. This is useful in situations where perfectly collinear
features cause problems, such as when feeding the resulting data
into a neural network or an unregularized regression. If only one
feature appears in a column, that column will be retained as a
column of ones.
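The collinearity this docstring mentions can be checked directly with NumPy (a standalone toy sketch, not code from the PR): the full set of one-hot columns for a feature sums to the intercept column, so the design matrix is rank-deficient until one category is dropped.

```python
import numpy as np

# One-hot encoding of a 3-category feature for 4 samples (toy data).
onehot = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]], dtype=float)

# Each row sums to 1, so the three columns add up to the intercept column.
X_full = np.hstack([np.ones((4, 1)), onehot])
print(np.linalg.matrix_rank(X_full))   # 3, but 4 columns: collinear

# Dropping the first category removes the redundancy.
X_drop = np.hstack([np.ones((4, 1)), onehot[:, 1:]])
print(np.linalg.matrix_rank(X_drop))   # 3 == 3 columns: full column rank
```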
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

Why should it be retained?

drewmjohnston (Author, Contributor), Jan 28, 2019:

No real preference here. I've changed it so it will be dropped instead.

- None : retain all features.
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 18, 2019:
Suggested change
-- None : retain all features.
+- None : retain all features (the default).
- 'first' : drop the first feature in each category.
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

feature in each category -> category in each feature

- array : ``drop[i]`` is the value of the feature in ``X[:,i]``
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

value of the feature -> category to drop

jnothman (Member), Jan 10, 2019:

Should it be possible for the user to specify that some features do not drop? Perhaps that's unnecessary complexity, and the current design is good.

drewmjohnston (Author, Contributor), Jan 28, 2019:

I added this functionality as well, it was simple to implement.

jorisvandenbossche (Contributor), Feb 15, 2019:

"value of the feature" sounds a bit confusing to me. The line above is also using 'category'.

drewmjohnston (Author, Contributor), Feb 15, 2019:

sure--changed to category.

that should be dropped.
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
@@ -210,6 +223,10 @@ class OneHotEncoder(_BaseEncoder):
(in order of the features in X and corresponding with the output
of ``transform``).
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

Note that drop categories are included here (although this does make it hard to identify the correspondence between columns and categories so I'm not entirely sure that's the right design)

This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

Note that categories_ includes the dropped category.

drop_ : array
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

shape?

jnothman (Member), Jan 10, 2019:

I think the implementation can be simplified with little harm to the API, if we make it drop_idx_ such that categories_[i][drop_idx_[i]] will be dropped for all i. Then fitting drop='first' just involves drop_idx_ = np.zeros(len(self.categories_)). At transform time the masked column indices are drop_idx_ + np.cumsum([0] + [len(c) for c in categories_[:-1]]).

drewmjohnston (Author, Contributor), Jan 25, 2019:

This is a good idea, I think I'll update the implementation to reflect this.
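jnothman's suggested index arithmetic can be sketched standalone (the `categories_` values below are made up for illustration; this is not the code merged in the PR):

```python
import numpy as np

# Fitted categories per feature (made-up example values).
categories_ = [np.array(['Female', 'Male'], dtype=object),
               np.array([1, 2, 3], dtype=object)]

# drop='first' reduces to dropping index 0 of every feature.
drop_idx_ = np.zeros(len(categories_), dtype=int)

# Column offset of each feature in the concatenated one-hot output,
# then the absolute indices of the columns to mask out.
offsets = np.cumsum([0] + [len(c) for c in categories_[:-1]])
masked_cols = drop_idx_ + offsets
print(masked_cols)   # [0 2]: 'Female' is column 0, category 1 is column 2
```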

The category for each feature to be dropped. None if all features
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

reorder: "to be dropped for each feature"

will be retained.
active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.
@@ -246,9 +263,9 @@ class OneHotEncoder(_BaseEncoder):
>>> enc.fit(X)
... # doctest: +ELLIPSIS
... # doctest: +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
-dtype=<... 'numpy.float64'>, handle_unknown='ignore',
-n_values=None, sparse=True)
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
+dtype=<... 'numpy.float64'>, handle_unknown='ignore',
+n_values=None, sparse=True)
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
@@ -260,6 +277,17 @@ class OneHotEncoder(_BaseEncoder):
[None, 2]], dtype=object)
>>> enc.get_feature_names()
array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)
>>> drop_enc = OneHotEncoder(drop='first')
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

Don't need to show the repr again, so do .fit on this line

>>> drop_enc.fit(X)
... # doctest: +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features=None, categories=None, drop='first',
dtype=<class 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

can you use the same samples here as in the example above (without dropping)? That makes it easier to compare the difference in output between both.

drewmjohnston (Author, Contributor), Feb 15, 2019:

As it stands, the samples used in the first example are not compatible with the drop method. The first example transforms an unknown feature (and has handle_unknown='ignore' set), but handle_unknown='ignore' is incompatible with drop, as it makes inverse transformations impossible.
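The incompatibility described here can be sketched without scikit-learn (toy single-feature example; the category names are illustrative): under drop='first' the dropped category encodes to all zeros, which is exactly what handle_unknown='ignore' would emit for an unseen value, so the two become indistinguishable on inverse transform.

```python
import numpy as np

categories = np.array(['Firefox', 'Safari', 'Chrome'], dtype=object)
dropped = 'Firefox'   # what drop='first' would remove

def encode(value):
    # Both the dropped category and (under handle_unknown='ignore')
    # an unknown value produce an all-zero row.
    if value == dropped or value not in categories:
        return np.zeros(len(categories) - 1)
    return (categories == value).astype(float)[categories != dropped]

print(encode('Firefox'))   # [0. 0.]  the dropped category
print(encode('Opera'))     # [0. 0.]  an unknown value: ambiguous!
```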

array([[0., 0., 0.],
[1., 1., 0.]])
See also
--------
@@ -276,7 +304,7 @@ class OneHotEncoder(_BaseEncoder):
matrix indicating the presence of a class label.
"""

-def __init__(self, n_values=None, categorical_features=None,
+def __init__(self, n_values=None, drop=None, categorical_features=None,
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

can you add this keyword after categories? (as it is in the docstring)

drewmjohnston (Author, Contributor), Feb 15, 2019:

sure thing

categories=None, sparse=True, dtype=np.float64,
handle_unknown='error'):
self.categories = categories
@@ -285,6 +313,7 @@ def __init__(self, n_values=None, categorical_features=None,
self.handle_unknown = handle_unknown
self.n_values = n_values
self.categorical_features = categorical_features
self.drop = drop

# Deprecated attributes

@@ -354,21 +383,39 @@ def _handle_deprecations(self, X):
self._legacy_mode = False
self._categories = 'auto'
else:
msg = (
"The handling of integer data will change in version "
"0.22. Currently, the categories are determined "
"based on the range [0, max(values)], while in the "
"future they will be determined based on the unique "
"values.\nIf you want the future behaviour and "
"silence this warning, you can specify "
"\"categories='auto'\".\n"
"In case you used a LabelEncoder before this "
"OneHotEncoder to convert the categories to integers, "
"then you can now use the OneHotEncoder directly."
)
warnings.warn(msg, FutureWarning)
self._legacy_mode = True
self._n_values = 'auto'
if self.drop is None:
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

I'm okay with, rather, drop use also triggering the new mode. Why not?

msg = (
"The handling of integer data will change in "
"version 0.22. Currently, the categories are "
"determined based on the range "
"[0, max(values)], while in the future they "
"will be determined based on the unique "
"values.\nIf you want the future behaviour "
"and silence this warning, you can specify "
"\"categories='auto'\".\n"
"In case you used a LabelEncoder before this "
"OneHotEncoder to convert the categories to "
"integers, then you can now use the "
"OneHotEncoder directly."
)
warnings.warn(msg, FutureWarning)
self._legacy_mode = True
self._n_values = 'auto'
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

is there a reason you added this line?

jorisvandenbossche (Contributor), Feb 18, 2019:

Can you answer this one?


jorisvandenbossche (Contributor), Feb 20, 2019:

Can you remove this line? (it is already set above)

jorisvandenbossche (Contributor), Feb 26, 2019:

@drewmjohnston Can you remove this line? (or, explain why it was needed to add it?)

else:
msg = (
"The handling of integer data will change in "
"version 0.22. Currently, the categories are "
"determined based on the range "
"[0, max(values)], while in the future they "
"will be determined based on the unique "
"values.\n The old behavior is not compatible "
"with the `drop` paramenter. Instead, you "
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 26, 2019:
Suggested change
"with the `drop` paramenter. Instead, you "
"with the `drop` parameter. Instead, you "
"must manually specify \"categories='auto'\" "
"if you wish to use the `drop` parameter on "
"an array of entirely integer data. This will "
"enable the future behavior."
)
raise ValueError(msg)

# if user specified categorical_features -> always use legacy mode
if self.categorical_features is not None:
@@ -400,6 +447,21 @@ def _handle_deprecations(self, X):
else:
self._categorical_features = 'all'

# Prevents new drop functionality from being used in legacy mode
if self._legacy_mode and self.drop is not None:
raise ValueError(
"The `categorical_features` and `n_values` keywords "
"are deprecated, and cannot be used together "
"with 'drop'.")

# If we have both dropped columns and ignored unknown
# values, there will be ambiguous cells. This creates difficulties
# in interpreting the model.
if self.drop is not None and self.handle_unknown != 'error':
raise ValueError(
"`handle_unknown` cannot be 'ignore' when the drop parameter "
"is specified, as this will create ambiguous cells")
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

Can you move this check to fit? (where the value of handle_unknown is also checked) Because this check is unrelated to the deprecation handling, correct?


def fit(self, X, y=None):
"""Fit OneHotEncoder to X.
@@ -426,6 +488,42 @@ def fit(self, X, y=None):
return self
else:
self._fit(X, handle_unknown=self.handle_unknown)
if self.drop is None:
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

let's pull this out into a method. self.drop_idx_ = self._compute_drop_idx()

self.drop_ = None
elif (isinstance(self.drop, six.string_types) and
self.drop == 'first'):
self.drop_ = np.array([cats[0] if len(cats) > 1 else None
for cats in self.categories_],
dtype=object)
elif not isinstance(self.drop, six.string_types):
try:
self.drop = np.asarray(self.drop, dtype=object)
droplen = len(self.drop)
except (ValueError, TypeError):
msg = ("Wrong input for parameter `drop`. Expected "
"'first', None or array of objects, got {}")
raise ValueError(msg.format(type(self.drop)))
if droplen != len(self.categories_):
msg = ("`drop` should have length equal to the number "
"of features ({}), got {}")
raise ValueError(msg.format(len(self.categories_),
len(self.drop)))
missing_drops = [(val, i) for i, val in enumerate(self.drop)
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

I think it's confusing that you reverse the tuple

if val not in self.categories_[i]]
if any(missing_drops):
msg = ("The following features were supposed to be "
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

features -> categories

"dropped, but were not found in the training "
"data.\n{}".format(
"\n".join(["Val: {}, Col: {}".format(v, c)
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

Val -> category
Col -> Feature

for v, c in missing_drops])))
raise ValueError(msg)
self.drop_ = np.array([cat for i, cat in enumerate(self.drop)
if len(self.categories_[i]) > 1],
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 10, 2019:

It's a bit weird to just have entries missing (and hence unaligned to categories_) if there's only one category in that column. Especially so if the user has specified drop rather than 'first' as they have here. Looking at transform below, I find it hard to believe that this behaviour is properly tested.

drewmjohnston (Author, Contributor), Jan 28, 2019:

OK. I've resolved this by incorporating your suggestion from elsewhere in the thread. Now, columns with only a single category are dropped entirely from the transformed version (unless it is manually specified that no category should be dropped from the feature).

dtype=object)
else:
msg = ("Wrong input for parameter `drop`. Expected "
"'first', None or array of objects, got {}")
raise ValueError(msg.format(type(self.drop)))
return self

def _legacy_fit_transform(self, X):
@@ -572,11 +670,32 @@ def _transform_new(self, X):

X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)

if self.drop is not None:
for i in range(n_features):
Xii = X_int[:, i]
Xmi = X_mask[:, i]
uniques = self.categories_[i]
drop_loc = np.where(uniques == self.drop_[i])
drop_value = drop_loc[0][0] if len(drop_loc[0]) == 1 else None
"""
Add the cells where the drop value is present to the list
of those to be masked-out. Decrement the indices of the values

jnothman (Member), Jan 28, 2019:

I see three options for implementing transform:

  1. ignore the added category when encoding, a bit like handle_unknown='ignore'
  2. remove the value once encoded as an int
  3. remove the columns once one-hot encoded

I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

drewmjohnston (Author, Contributor), Jan 30, 2019:

To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.
The approach I took with 2) had some advantages, since I was able to mix it in to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case in 3)).
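Approach (2) can be sketched standalone for a single feature (toy data, not the PR's actual code): mask out cells holding the dropped value at the integer-encoding stage, then shift higher indices down so the dropped column leaves no gap in the one-hot output.

```python
import numpy as np

X_int = np.array([0, 2, 1, 0, 2])           # integer-encoded categories
drop_value = 1                              # index of the category to drop
X_mask = np.ones_like(X_int, dtype=bool)    # True = cell emits a 1

# Cells holding the dropped category are masked out, reusing the same
# mechanism that masks unknown values.
X_mask &= (X_int != drop_value)

# Decrement indices above the dropped one so no blank column remains.
X_int[X_int > drop_value] -= 1

print(X_int)    # [0 1 1 0 1]
print(X_mask)   # [ True  True False  True  True]
```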

drewmjohnston (Author, Contributor), Feb 4, 2019:

@jnothman do you think that one of the other approaches is superior?

above so as not to leave a blank column.
"""
if drop_value is not None:
keep_cells = ~np.in1d(Xii, [drop_value])
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

why are we not just doing np.not_equal(Xii, drop_value)

jnothman (Member), Feb 12, 2019:

More so, is it possible to do that for all features?

keep_cells = X_int != self.drop_idx_.reshape(1, -1)
X_mask &= keep_cells
X_int[X_int > self.drop_idx_.reshape(1, -1)] -= 1

drewmjohnston (Author, Contributor), Feb 12, 2019:

Good point with the first bit. I think this code can be simplified at least to:

Xmi &= np.not_equal(Xii, drop_value)
Xii[Xii > drop_value] -= 1

Regarding the second bit, I don't think that will handle cases in which one of the features has None for the category to be dropped. Is there a better way to do this than looping over columns? Having to mask the columns makes the code longer and inplace modification tricky.

jnothman (Member), Feb 12, 2019:

If at this point we are working with integer encoding of X surely it's easy to replace None with an appropriately impossible int?

drewmjohnston (Author, Contributor), Feb 12, 2019:

Surely it is, how's this:

if self.drop is not None:
    to_drop = np.copy(self.drop_idx_).reshape(1, -1)
    to_drop[to_drop == None] = max(len(cats)
                                   for cats in self.categories_)
    keep_cells = X_int != to_drop
    X_mask &= keep_cells
    X_int[X_int > to_drop] -= 1
    n_values = [len(cats) - 1 if drop_val is not None
                else len(cats)
                for cats, drop_val in
                zip(self.categories_, self.drop_idx_)]
else:
    n_values = [cats.shape[0] for cats in self.categories_]
np.logical_and(Xmi, keep_cells, out=Xmi)
Xii[Xii > drop_value] -= 1
n_values = [len(cats) - 1
if self.drop_[i] is not None
else len(cats)
for i, cats in enumerate(self.categories_)]
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 28, 2019:

use zip rather than enumerate?

drewmjohnston (Author, Contributor), Jan 29, 2019:

Sure, easy enough.

else:
n_values = [cats.shape[0] for cats in self.categories_]

mask = X_mask.ravel()
n_values = [cats.shape[0] for cats in self.categories_]
n_values = np.array([0] + n_values)
feature_indices = np.cumsum(n_values)

indices = (X_int + feature_indices[:-1]).ravel()[mask]
indptr = X_mask.sum(axis=1).cumsum()
indptr = np.insert(indptr, 0, 0)
@@ -636,7 +755,16 @@ def inverse_transform(self, X):

n_samples, _ = X.shape
n_features = len(self.categories_)
n_transformed_features = sum([len(cats) for cats in self.categories_])
if self.drop is None:
n_transformed_features = sum(len(cats)
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Feb 12, 2019:

this is repeated from above with n_values. Refactor it.

for cats in self.categories_)
else:
n_transformed_features = sum(len(cats) - 1
This conversation was marked as resolved by drewmjohnston

jnothman (Member), Jan 28, 2019:

can just do sum(len(cats) - (drop_idx is not None) for cats, drop_idx in zip(self.categories_, self.drop_idx_))

if self.drop_[i] is not None
else len(cats)
for i, cats in
enumerate(self.categories_)
)

# validate shape of passed X
msg = ("Shape of the passed X data is not correct. Expected {0} "
@@ -652,14 +780,26 @@ def inverse_transform(self, X):
found_unknown = {}

for i in range(n_features):
n_categories = len(self.categories_[i])
if self.drop is None:
cats = self.categories_[i]
else:
cats = np.array([cat for cat in self.categories_[i]
if cat != self.drop_[i]])
n_categories = len(cats)
sub = X[:, j:j + n_categories]

# for sparse X argmax returns 2D matrix, ensure 1D array
labels = np.asarray(_argmax(sub, axis=1)).flatten()
X_tr[:, i] = self.categories_[i][labels]
X_tr[:, i] = cats[labels]

if self.handle_unknown == 'ignore':
This conversation was marked as resolved by drewmjohnston

jorisvandenbossche (Contributor), Feb 15, 2019:

instead of removing this check, you could also add a or if self.drop is not None ? (although I don't know how expensive the calculation of unknown is, so how important it is to avoid)

'''
In this case, we can safely assume that all of the nulls in
each column are the dropped value
'''
if self.drop is not None:
unknown = np.asarray(sub.sum(axis=1) == 0).flatten()
if unknown.any():
X_tr[unknown, i] = self.drop_[i]
elif self.handle_unknown == 'ignore':
# ignored unknown categories: we have a row of all zero's
unknown = np.asarray(sub.sum(axis=1) == 0).flatten()
if unknown.any():
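The inverse-transform rule in the code above, that with handle_unknown='error' an all-zero block within one feature can only be the dropped category, can be sketched for a single feature (toy data, not the PR's code):

```python
import numpy as np

kept_cats = np.array(['Safari', 'Chrome'], dtype=object)  # 'Firefox' dropped
dropped = 'Firefox'

# One feature's block of the transformed matrix, one row per sample.
sub = np.array([[1., 0.],
                [0., 0.],    # all zeros: must be the dropped category
                [0., 1.]])

# argmax recovers the kept categories; all-zero rows map to the dropped one.
recovered = kept_cats[sub.argmax(axis=1)]
recovered[sub.sum(axis=1) == 0] = dropped
print(recovered)   # ['Safari' 'Firefox' 'Chrome']
```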