
[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged

Changes from 16 commits (43 commits in total):
a4845f1
Added documentation for drop one OneHotEncoder
Dec 28, 2018
f2f5c8a
Added functionality for drop first and drop manually-specified.
Jan 2, 2019
47c61bc
Added tests for OneHotEncoder drop parameter
Jan 3, 2019
335406d
Fixed flake8
Jan 3, 2019
fb83786
Resolved merge conflict in preprocessing/_encoders.py
Jan 3, 2019
76a8274
Added additional code to detect when a column that ought to be droppe…
Jan 3, 2019
a35b8b9
Fixed docstring
Jan 3, 2019
bddbad9
Fixed docstring2
Jan 3, 2019
e368cfe
Finally will clear pytest
Jan 3, 2019
6ae1019
Removed code that is not compatible with numpy 1.11.0
Jan 3, 2019
89a55e1
Fixed docs to match current OneHotEncoder string
Jan 3, 2019
6c10149
Updated error message with language that addresses new drop functiona…
Jan 25, 2019
da3f989
Updated documentation and testing to resolve errors mentioned by jnot…
Jan 25, 2019
82f8f07
Fixed Flake8 bug
Jan 25, 2019
cf1a425
Added new way to encode features to be dropped. Added support for lea…
Jan 28, 2019
f09ada2
Merge branch 'master' into leave_one_out_ohe
Jan 28, 2019
aa25bdf
Implemented basic text corrections and code optimizations recommended
Jan 30, 2019
d74a0f4
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
a5c58f1
fixed Flake8 Literal complaint
Jan 30, 2019
e1942dd
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
702c902
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
337bb2b
Removed unnecessary loop on transform step
Feb 12, 2019
9173550
Added additional tests for categories. Refactored some code in transf…
Feb 12, 2019
b3468d8
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
f7aaa02
Added dtype, type checking for drop_idx_
Feb 13, 2019
903e17a
Fixed small typo. Repush due to random network error
Feb 14, 2019
063e307
incorporated changes proposed by jorisvandenbossche
Feb 16, 2019
576c94c
Fixed flake8 error
Feb 16, 2019
5948d30
Added documentation for new features. Implemented revisions to tests.…
Feb 19, 2019
3ed9d7a
Merge branch 'master' into leave_one_out_ohe
Feb 19, 2019
decad00
Fixed flake8
Feb 19, 2019
f4c8f7c
Fixed error check order to match expected order in pytest
Feb 19, 2019
0e99906
Refactored some code, small changes from @nicolashug
Feb 19, 2019
c1836ce
Added code example
Feb 19, 2019
b1330c5
updated relative reference in rst to OneHotEncoder
Feb 19, 2019
2eb2c16
Changes to documentation for OHE, changes to legacy implementation to…
Feb 20, 2019
c85b5dd
Merge branch 'master' into leave_one_out_ohe
Feb 20, 2019
ac163ee
Removed option to select drop-one only on some columns. This will be …
Feb 25, 2019
da73238
Merge branch 'master' into leave_one_out_ohe
Feb 25, 2019
e351b1f
Reformatting, fixed flake8
Feb 25, 2019
6033765
Small doc fix
Feb 26, 2019
9597543
Fixed n_values/drop compatibility
Feb 26, 2019
7149eac
Removed cruft. Added in test for drop/n_values interaction
Feb 26, 2019
+267 −48

@@ -489,7 +489,7 @@ Continuing the example above::
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
@@ -516,7 +516,7 @@ dataset::
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
OneHotEncoder(categorical_features=None,
-      categories=[...],
+      categories=[...], drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
@@ -533,7 +533,7 @@ columns for this feature will be all zeros
>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
dtype=<... 'numpy.float64'>, handle_unknown='ignore',
n_values=None, sparse=True)
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
@@ -8,6 +8,7 @@
import warnings

import numpy as np
import six
@NicolasHug (Contributor), Feb 18, 2019 [resolved]:
nor this

from scipy import sparse

from .. import get_config as _get_config
@@ -159,6 +160,21 @@ class OneHotEncoder(_BaseEncoder):
The used categories can be found in the ``categories_`` attribute.
drop : 'first' or a list/array of shape (n_features,), default=None.
Specifies a methodology to use to drop one of the categories per
feature. This is useful in situations where perfectly collinear
features cause problems, such as when feeding the resulting data
into a neural network or an unregularized regression. If only one
feature appears in a column, that column will be dropped from the
transformed version (unless this is overriden by manually
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
I found these two lines very confusing / unclear. From a second read, I suppose this is about columns/features that contain only 1 category, and thus would result in only 1 feature in the transformed result?

If so, I suppose this also only happens for the drop='first' case? (If that is correct, maybe move it to that bullet point?) Or is this always happening? (Which I seem to understand from reading previous reviews, but not from the code.)

@drewmjohnston (Author), Feb 15, 2019:
I see what you're saying--rewrote this so this only appears in the bullet point.

specifying that no category ought to be dropped from that column).
- None : retain all features.
@jorisvandenbossche (Contributor), Feb 18, 2019 [resolved]:
Suggested change:
-    - None : retain all features.
+    - None : retain all features (the default).
- 'first' : drop the first category in each feature.
- array : ``drop[i]`` is the value of the feature in ``X[:,i]``
@jnothman (Member), Jan 10, 2019 [resolved]:
value of the feature -> category to drop

@jnothman (Member), Jan 10, 2019:
Should it be possible for the user to specify that some features do not drop? Perhaps that's unnecessary complexity, and the current design is good.

@drewmjohnston (Author), Jan 28, 2019:
I added this functionality as well; it was simple to implement.

@jorisvandenbossche (Contributor), Feb 15, 2019:
"value of the feature" sounds a bit confusing to me. The line above is also using 'category'.

@drewmjohnston (Author), Feb 15, 2019:
sure--changed to category.

that should be dropped. None means all categories should
be retained.
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
@@ -208,6 +224,10 @@ class OneHotEncoder(_BaseEncoder):
(in order of the features in X and corresponding with the output
of ``transform``).
@jnothman (Member), Jan 10, 2019 [resolved]:
Note that dropped categories are included here (although this does make it hard to identify the correspondence between columns and categories, so I'm not entirely sure that's the right design).

@jnothman (Member), Feb 12, 2019 [resolved]:
Note that categories_ includes the dropped category.

drop_idx_ : array of shape (n_features,)
The index in ``categories_ of`` the category to be dropped for
@jnothman (Member), Jan 28, 2019 [resolved]:
"of" should be outside the markup

each feature. None if all features will be retained.
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Nitpicking a bit, but: it's drop_idx_[i] that is the index into categories_[i], right? It might be good to be explicit about this, to the extent it would still be clear.

Second nitpick: in "None if all features will be retained", I would replace "features" with either "transformed features" or "categories" (as "feature" here has a different meaning than "feature" in the previous sentence on the same line).

active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.
@@ -244,9 +264,9 @@ class OneHotEncoder(_BaseEncoder):
>>> enc.fit(X)
... # doctest: +ELLIPSIS
... # doctest: +NORMALIZE_WHITESPACE
-OneHotEncoder(categorical_features=None, categories=None,
-       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
-       n_values=None, sparse=True)
+OneHotEncoder(categorical_features=None, categories=None, drop=None,
+       dtype=<... 'numpy.float64'>, handle_unknown='ignore',
+       n_values=None, sparse=True)
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
@@ -258,6 +278,12 @@ class OneHotEncoder(_BaseEncoder):
[None, 2]], dtype=object)
>>> enc.get_feature_names()
array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)
>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Can you use the same samples here as in the example above (without dropping)? That makes it easier to compare the difference in output between the two.

@drewmjohnston (Author), Feb 15, 2019:
As it stands, the samples used in the first example are not compatible with the drop method. The first example transforms an unknown feature (and has handle_unknown='ignore' set), but handle_unknown='ignore' is incompatible with drop, as it makes inverse transformations impossible.

array([[0., 0., 0.],
[1., 1., 0.]])
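To make the incompatibility concrete, here is a minimal sketch (written against the public OneHotEncoder API, with the same sample data as the docstring example above) of why the dropped category and an ignored unknown category would collide:

    from sklearn.preprocessing import OneHotEncoder

    X = [['Male', 1], ['Female', 3], ['Female', 2]]
    enc = OneHotEncoder(drop='first').fit(X)

    # The dropped categories ('Female' and 1) already encode to all zeros:
    enc.transform([['Female', 1]]).toarray()
    # array([[0., 0., 0.]])

    # Under handle_unknown='ignore', an unseen category would yield the same
    # all-zero block, so the two cases could not be told apart when
    # inverse-transforming; the encoder raises a ValueError instead.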
See also
--------
@@ -274,7 +300,7 @@ class OneHotEncoder(_BaseEncoder):
matrix indicating the presence of a class label.
"""

-    def __init__(self, n_values=None, categorical_features=None,
+    def __init__(self, n_values=None, drop=None, categorical_features=None,
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Can you add this keyword after categories? (as it is in the docstring)

@drewmjohnston (Author), Feb 15, 2019:
sure thing

                 categories=None, sparse=True, dtype=np.float64,
                 handle_unknown='error'):
        self.categories = categories
@@ -283,6 +309,7 @@ def __init__(self, n_values=None, categorical_features=None,
        self.handle_unknown = handle_unknown
        self.n_values = n_values
        self.categorical_features = categorical_features
        self.drop = drop

# Deprecated attributes

@@ -355,20 +382,39 @@ def _handle_deprecations(self, X):
                self._legacy_mode = False
                self._categories = 'auto'
            else:
-               msg = (
-                   "The handling of integer data will change in version "
-                   "0.22. Currently, the categories are determined "
-                   "based on the range [0, max(values)], while in the "
-                   "future they will be determined based on the unique "
-                   "values.\nIf you want the future behaviour and "
-                   "silence this warning, you can specify "
-                   "\"categories='auto'\".\n"
-                   "In case you used a LabelEncoder before this "
-                   "OneHotEncoder to convert the categories to integers, "
-                   "then you can now use the OneHotEncoder directly."
-               )
-               warnings.warn(msg, FutureWarning)
-               self._legacy_mode = True
                if self.drop is None:
@jnothman (Member), Jan 10, 2019 [resolved]:
I'm okay with, rather, drop use also triggering the new mode. Why not?

                    msg = (
                        "The handling of integer data will change in "
                        "version 0.22. Currently, the categories are "
                        "determined based on the range "
                        "[0, max(values)], while in the future they "
                        "will be determined based on the unique "
                        "values.\nIf you want the future behaviour "
                        "and silence this warning, you can specify "
                        "\"categories='auto'\".\n"
                        "In case you used a LabelEncoder before this "
                        "OneHotEncoder to convert the categories to "
                        "integers, then you can now use the "
                        "OneHotEncoder directly."
                    )
                    warnings.warn(msg, FutureWarning)
                    self._legacy_mode = True
                    self._n_values = 'auto'
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
is there a reason you added this line?

@jorisvandenbossche (Contributor), Feb 18, 2019:
Can you answer this one?

@NicolasHug: [comment minimized]

@jorisvandenbossche (Contributor), Feb 20, 2019:
Can you remove this line? (it is already set above)

@jorisvandenbossche (Contributor), Feb 26, 2019:
@drewmjohnston Can you remove this line? (or explain why it was needed to add it?)

                else:
                    msg = (
                        "The handling of integer data will change in "
                        "version 0.22. Currently, the categories are "
                        "determined based on the range "
                        "[0, max(values)], while in the future they "
                        "will be determined based on the unique "
                        "values.\n The old behavior is not compatible "
                        "with the `drop` paramenter. Instead, you "
@jorisvandenbossche (Contributor), Feb 26, 2019 [resolved]:
Suggested change:
-                        "with the `drop` paramenter. Instead, you "
+                        "with the `drop` parameter. Instead, you "
"must manually specify \"categories='auto'\" "
"if you wish to use the `drop` parameter on "
"an array of entirely integer data. This will "
"enable the future behavior."
)
raise ValueError(msg)

        # if user specified categorical_features -> always use legacy mode
        if self.categorical_features is not None:
@@ -400,6 +446,21 @@ def _handle_deprecations(self, X):
        else:
            self._categorical_features = 'all'

        # Prevents new drop functionality from being used in legacy mode
        if self._legacy_mode and self.drop is not None:
            raise ValueError(
                "The `categorical_features` and `n_values` keywords "
                "are deprecated, and cannot be used together "
                "with 'drop'.")

        # If we have both dropped columns and ignored unknown
        # values, there will be ambiguous cells. This creates difficulties
        # in interpreting the model.
        if self.drop is not None and self.handle_unknown != 'error':
            raise ValueError(
                "`handle_unknown` cannot be 'ignore' when the drop parameter "
                "is specified, as this will create ambiguous cells")
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Can you move this check to fit? (where the value of handle_unknown is also checked) Because this check is unrelated to the deprecation handling, correct?
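To illustrate the integer-data interaction enforced above, a hedged sketch of the behavior at this revision of the PR (the exact error text may differ):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X_int = np.array([[0, 1], [1, 2]])

    # All-integer input without categories='auto' falls back to the legacy,
    # range-based behavior, which is incompatible with `drop`:
    # OneHotEncoder(drop='first').fit(X_int)   # raises ValueError
    OneHotEncoder(categories='auto', drop='first').fit(X_int)  # opts in to the new behavior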


    def fit(self, X, y=None):
        """Fit OneHotEncoder to X.
@@ -426,6 +487,44 @@ def fit(self, X, y=None):
            return self
        else:
            self._fit(X, handle_unknown=self.handle_unknown)
            if self.drop is None:
@jnothman (Member), Feb 12, 2019 [resolved]:
let's pull this out into a method. self.drop_idx_ = self._compute_drop_idx()

                self.drop_idx_ = None
            elif (isinstance(self.drop, six.string_types) and
                    self.drop == 'first'):
                self.drop_idx_ = np.zeros(len(self.categories_),
                                          dtype=np.int8)
            elif not isinstance(self.drop, six.string_types):
                try:
                    self.drop = np.asarray(self.drop, dtype=object)
                    droplen = len(self.drop)
                except (ValueError, TypeError):
                    msg = ("Wrong input for parameter `drop`. Expected "
                           "'first', None or array of objects, got {}")
                    raise ValueError(msg.format(type(self.drop)))
                if droplen != len(self.categories_):
                    msg = ("`drop` should have length equal to the number "
                           "of features ({}), got {}")
                    raise ValueError(msg.format(len(self.categories_),
                                                len(self.drop)))
                missing_drops = [(i, val) for i, val in enumerate(self.drop)
                                 if val not in self.categories_[i] and
                                 val is not None]
                if any(missing_drops):
                    msg = ("The following categories were supposed to be "
                           "dropped, but were not found in the training "
                           "data.\n{}".format(
                               "\n".join(
                                   ["Category: {}, Feature: {}".format(c, v)
                                    for c, v in missing_drops])))
                    raise ValueError(msg)
                self.drop_idx_ = np.array([np.where(cat_list == val)[0][0]
                                           if val in cat_list else None
                                           for (val, cat_list) in
                                           zip(self.drop, self.categories_)])
            else:
                msg = ("Wrong input for parameter `drop`. Expected "
                       "'first', None or array of objects, got {}")
                raise ValueError(msg.format(type(self.drop)))
            return self
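As a usage sketch of the array form validated above (illustrative values; note that the per-column None opt-out shown here was supported at this stage of the PR but removed again in the Feb 25 commit):

    from sklearn.preprocessing import OneHotEncoder

    X = [['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']]

    # Drop one chosen category per feature; None keeps every category of
    # the second feature.
    enc = OneHotEncoder(drop=['female', None, 'uses Safari']).fit(X)
    enc.drop_idx_
    # array([0, None, 1], dtype=object)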

    def _legacy_fit_transform(self, X):
@@ -572,11 +671,30 @@ def _transform_new(self, X):

        X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)

        if self.drop is not None:
            for i in range(n_features):
                Xii = X_int[:, i]
                Xmi = X_mask[:, i]
                drop_value = self.drop_idx_[i]
                """
                Add the cells where the drop value is present to the list
                of those to be masked-out. Decrement the indices of the values

@jnothman (Member), Jan 28, 2019:
I see three options for implementing transform:

  1. ignore the added category when encoding, a bit like handle_unknown='ignore'
  2. remove the value once encoded as an int
  3. remove the columns once one-hot encoded

I think you've implemented #2. Why is this one chosen? What are its pros and cons in terms of code simplicity and computational complexity?

@drewmjohnston (Author), Jan 30, 2019:
To do the first option, it seems like I'd have to modify _transform(), which is in the BaseEncoder class. We're not adding drop options to the other classes that inherit from BaseEncoder, so I thought that changing this in BaseEncoder might not be the best approach.

The approach I took with 2) had some advantages, since I was able to mix it in to reuse some of the logic that was already used for masking unknown values. It also avoids having to do column slicing on an output that might be a CSR matrix (as would be the case in 3)).

@drewmjohnston (Author), Feb 4, 2019:
@jnothman do you think that one of the other approaches is superior?

                above so as not to leave a blank column.
                """
                if drop_value is not None:
                    keep_cells = ~np.in1d(Xii, [drop_value])
@jnothman (Member), Feb 12, 2019 [resolved]:
why are we not just doing np.not_equal(Xii, drop_value)?

@jnothman (Member), Feb 12, 2019:
More so, is it possible to do that for all features?

    keep_cells = X_int != self.drop_idx_.reshape(1, -1)
    X_mask &= keep_cells
    X_int[X_int > self.drop_idx_.reshape(1, -1)] -= 1

@drewmjohnston (Author), Feb 12, 2019:
Good point with the first bit. I think this code can be simplified at least to:

    Xmi &= np.not_equal(Xii, drop_value)
    Xii[Xii > drop_value] -= 1

Regarding the second bit, I don't think that will handle cases in which one of the features has None for the category to be dropped. Is there a better way to do this than looping over columns? Having to mask the columns makes the code longer and in-place modification tricky.

@jnothman (Member), Feb 12, 2019:
If at this point we are working with an integer encoding of X, surely it's easy to replace None with an appropriately impossible int?

@drewmjohnston (Author), Feb 12, 2019:
Surely it is, how's this:

    if self.drop is not None:
        to_drop = np.copy(self.drop_idx_).reshape(1, -1)
        to_drop[to_drop == None] = max(len(cats)
                                       for cats in self.categories_)
        keep_cells = X_int != to_drop
        X_mask &= keep_cells
        X_int[X_int > to_drop] -= 1
        n_values = [len(cats) - 1 if drop_val is not None
                    else len(cats)
                    for cats, drop_val in
                    zip(self.categories_, self.drop_idx_)]
    else:
        n_values = [cats.shape[0] for cats in self.categories_]
                    np.logical_and(Xmi, keep_cells, out=Xmi)
                    Xii[Xii > drop_value] -= 1
            n_values = [len(cats) - 1
                        if self.drop_idx_[i] is not None
                        else len(cats)
                        for i, cats in enumerate(self.categories_)]
@jnothman (Member), Jan 28, 2019 [resolved]:
use zip rather than enumerate?

@drewmjohnston (Author), Jan 29, 2019:
Sure, easy enough.

        else:
            n_values = [cats.shape[0] for cats in self.categories_]

        mask = X_mask.ravel()
-       n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        feature_indices = np.cumsum(n_values)

        indices = (X_int + feature_indices[:-1]).ravel()[mask]
        indptr = X_mask.sum(axis=1).cumsum()
        indptr = np.insert(indptr, 0, 0)
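A toy numpy sketch of the mask-and-decrement trick used in the loop above, shown for a single integer-encoded column (illustrative only, not part of the diff):

    import numpy as np

    # One feature with categories coded 0..2; category 1 is to be dropped.
    Xii = np.array([0, 1, 2, 1, 0])
    Xmi = np.ones(5, dtype=bool)
    drop_value = 1

    np.logical_and(Xmi, np.not_equal(Xii, drop_value), out=Xmi)  # mask dropped cells
    Xii[Xii > drop_value] -= 1  # shift higher codes down so no empty column is left

    # Xmi is now [True, False, True, False, True]; Xii is [0, 1, 1, 1, 0].
    # The masked positions (the old category 1) are simply never written to
    # the sparse output, and the remaining codes stay contiguous.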
@@ -636,7 +754,16 @@ def inverse_transform(self, X):

        n_samples, _ = X.shape
        n_features = len(self.categories_)
-       n_transformed_features = sum([len(cats) for cats in self.categories_])
        if self.drop is None:
            n_transformed_features = sum(len(cats)
@jnothman (Member), Feb 12, 2019 [resolved]:
this is repeated from above with n_values. Refactor it.

                                         for cats in self.categories_)
        else:
            n_transformed_features = sum(len(cats) - 1
@jnothman (Member), Jan 28, 2019 [resolved]:
can just do sum(len(cats) - (drop_idx is not None) for cats, drop_idx in zip(self.categories_, self.drop_idx_))

                                         if self.drop_idx_[i] is not None
                                         else len(cats)
                                         for i, cats in
                                         enumerate(self.categories_)
                                         )

        # validate shape of passed X
        msg = ("Shape of the passed X data is not correct. Expected {0} "
@@ -652,17 +779,31 @@ def inverse_transform(self, X):
        found_unknown = {}

        for i in range(n_features):
-           n_categories = len(self.categories_[i])
            if self.drop is None:
                cats = self.categories_[i]
            else:
                cats = np.array([cat for idx, cat in
@jnothman (Member), Jan 28, 2019 [resolved]:
Can't you just do np.delete(self.categories_[i], self.drop_idx_[i])?

Do you need to handle the drop_idx_[i] is None case here?

@drewmjohnston (Author), Jan 30, 2019:
Sure, that makes more sense.

                                 enumerate(self.categories_[i])
                                 if idx != self.drop_idx_[i]])
                n_categories = len(cats)
            if n_categories == 0:
@NicolasHug (Contributor), Feb 19, 2019 [resolved]:
Please don't remove this comment, unless it's out of date:

    # Only happens if there was a column with a unique
    # category. In this case we just fill the column with this
    # unique category value.
                X_tr[:, i] = self.categories_[i][self.drop_idx_[i]]
                j += n_categories
                continue
            sub = X[:, j:j + n_categories]

            # for sparse X argmax returns 2D matrix, ensure 1D array
            labels = np.asarray(_argmax(sub, axis=1)).flatten()
-           X_tr[:, i] = self.categories_[i][labels]

-           if self.handle_unknown == 'ignore':
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Instead of removing this check, you could also add an "or self.drop is not None"? (although I don't know how expensive the calculation of unknown is, so how important it is to avoid)

-               # ignored unknown categories: we have a row of all zero's
-               unknown = np.asarray(sub.sum(axis=1) == 0).flatten()
-               if unknown.any():
            X_tr[:, i] = cats[labels]
            unknown = np.asarray(sub.sum(axis=1) == 0).flatten()

@jnothman (Member), Feb 12, 2019:
I think you want ~sub.any(axis=1)

@drewmjohnston (Author), Feb 12, 2019:
Hrm, this doesn't play nicely with CSR matrices. Do you think it's worth converting a sparse representation to a dense one to use this?

            if unknown.any():
                '''
                In this case, we can safely assume that all of the nulls in
                each column are the dropped value
@jorisvandenbossche (Contributor), Feb 15, 2019 [resolved]:
Can you use a standard # comment here?

                '''
                if self.drop is not None:
                    X_tr[unknown, i] = self.categories_[i][self.drop_idx_[i]]
                elif self.handle_unknown == 'ignore':
                    # ignored unknown categories: we have a row of all zero's
                    found_unknown[i] = unknown

            j += n_categories
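Taken together, the branch above is what makes round-tripping work with drop: an all-zero block can only be the dropped category, so inverse_transform can recover it. A short sketch against the public API:

    from sklearn.preprocessing import OneHotEncoder

    X = [['Male', 1], ['Female', 3], ['Female', 2]]
    enc = OneHotEncoder(drop='first').fit(X)
    enc.inverse_transform([[0., 0., 0.]])
    # array([['Female', 1]], dtype=object)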