Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG + 2] Add Drop Option to OneHotEncoder. #12908

Merged
Merged
Changes from 1 commit
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
a4845f1
Added documentation for drop one OneHotEncoder
Dec 28, 2018
f2f5c8a
Added functionality for drop first and drop manually-specified.
Jan 2, 2019
47c61bc
Added tests for OneHotEncoder drop parameter
Jan 3, 2019
335406d
Fixed flake8
Jan 3, 2019
fb83786
Resolved merge conflict in preprocessing/_encoders.py
Jan 3, 2019
76a8274
Added additional code to detect when a column that ought to be droppe…
Jan 3, 2019
a35b8b9
Fixed docstring
Jan 3, 2019
bddbad9
Fixed docstring2
Jan 3, 2019
e368cfe
Finally will clear pytest
Jan 3, 2019
6ae1019
Removed code that is not compatible with numpy 1.11.0
Jan 3, 2019
89a55e1
Fixed docs to match current OneHotEncoder string
Jan 3, 2019
6c10149
Updated error message with language that addresses new drop functiona…
Jan 25, 2019
da3f989
Updated documentation and testing to resolve errors mentioned by jnot…
Jan 25, 2019
82f8f07
Fixed Flake8 bug
Jan 25, 2019
cf1a425
Added new way to encode features to be dropped. Added support for lea…
Jan 28, 2019
f09ada2
Merge branch 'master' into leave_one_out_ohe
Jan 28, 2019
aa25bdf
Implemented basic text corrections and code optimizations recommended
Jan 30, 2019
d74a0f4
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
a5c58f1
fixed Flake8 Literal complaint
Jan 30, 2019
e1942dd
Merge branch 'master' into leave_one_out_ohe
Jan 30, 2019
702c902
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
337bb2b
Removed unnecessary loop on transform step
Feb 12, 2019
9173550
Added additional tests for categories. Refactored some code in transf…
Feb 12, 2019
b3468d8
Merge branch 'master' into leave_one_out_ohe
Feb 12, 2019
f7aaa02
Added dtype, type checking for drop_idx_
Feb 13, 2019
903e17a
Fixed small typo. Repush due to random network error
Feb 14, 2019
063e307
incorporated changes proposed by jorisvandenbossche
Feb 16, 2019
576c94c
Fixed flake8 error
Feb 16, 2019
5948d30
Added documentation for new features. Implemented revisions to tests.…
Feb 19, 2019
3ed9d7a
Merge branch 'master' into leave_one_out_ohe
Feb 19, 2019
decad00
Fixed flake8
Feb 19, 2019
f4c8f7c
Fixed error check order to match expected order in pytest
Feb 19, 2019
0e99906
Refactored some code, small changes from @nicolashug
Feb 19, 2019
c1836ce
Added code example
Feb 19, 2019
b1330c5
updated relative reference in rst to OneHotEncoder
Feb 19, 2019
2eb2c16
Changes to documentation for OHE, changes to legacy implementation to…
Feb 20, 2019
c85b5dd
Merge branch 'master' into leave_one_out_ohe
Feb 20, 2019
ac163ee
Removed option to select drop-one only on some columns. This will be …
Feb 25, 2019
da73238
Merge branch 'master' into leave_one_out_ohe
Feb 25, 2019
e351b1f
Reformatting, fixed flake8
Feb 25, 2019
6033765
Small doc fix
Feb 26, 2019
9597543
Fixed n_values/drop compatibility
Feb 26, 2019
7149eac
Removed cruft. Added in test for drop/n_values interaction
Feb 26, 2019
File filter...
Filter file types
Jump to…
Jump to file or symbol
Failed to load files and symbols.
+9 −14
Diff settings

Always

Just for now

Implemented basic text corrections and code optimizations recommended

  • Loading branch information...
Drew Johnston
Drew Johnston committed Jan 30, 2019
commit aa25bdf06b5ab53314b31d16f540da0abffa5a14
@@ -225,7 +225,7 @@ class OneHotEncoder(_BaseEncoder):
of ``transform``).
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jnothman

jnothman Jan 10, 2019

Member

Note that drop categories are included here (although this does make it hard to identify the correspondence between columns and categories so I'm not entirely sure that's the right design)

This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

Note that categories_ includes the dropped category.

drop_idx_ : array of shape (n_features,)
The index in ``categories_ of`` the category to be dropped for
The index in ``categories_`` of the category to be dropped for
each feature. None if all features will be retained.
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Feb 15, 2019

Contributor

Nitpicking a bit, but: it's drop_idx_[i] that is the index into categories_[i], right? It might be good to be explicit about this, to the extent it would still be clear.

Second nitpick: in "None if all features will be retained", I would replace the "features" with either "transformed features" or either "categories" (as the "feature" here has a different meaning as the "feature" in the previous sentence on the same line)

active_features_ : array
@@ -685,10 +685,10 @@ def _transform_new(self, X):
keep_cells = ~np.in1d(Xii, [drop_value])
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

why are we not just doing np.not_equal(Xii, drop_value)

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

More so, is it possible to do that for all features?

keep_cells = X_int != self.drop_idx_.reshape(1, -1)
X_mask &= keep_cells
X_int[X_int > self.drop_idx_.reshape(1, -1)] -= 1

This comment has been minimized.

Copy link
@drewmjohnston

drewmjohnston Feb 12, 2019

Author Contributor

Good point with the first bit. I think this code can be simplified at least to:

Xmi &= np.not_equal(Xii, drop_value)
Xii[Xii > drop_value] -= 1

Regarding the second bit, I don't think that will handle cases in which one of the features has None for the category to be dropped. Is there a better way to do this than looping over columns? Having to mask the columns makes the code longer and inplace modification tricky.

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

If at this point we are working with integer encoding of X surely it's easy to replace None with an appropriately impossible int?

This comment has been minimized.

Copy link
@drewmjohnston

drewmjohnston Feb 12, 2019

Author Contributor

Surely it is, how's this:

if self.drop is not None:
    to_drop = np.copy(self.drop_idx_).reshape(1, -1)
    to_drop[to_drop == None] = max(len(cats)
                                   for cats in self.categories_)
    keep_cells = X_int != to_drop
    X_mask &= keep_cells
    X_int[X_int > to_drop] -= 1
    n_values = [len(cats) - 1 if drop_val is not None
                else len(cats)
                for cats, drop_val in
                zip(self.categories_, self.drop_idx_)]
else:
    n_values = [cats.shape[0] for cats in self.categories_]
np.logical_and(Xmi, keep_cells, out=Xmi)
Xii[Xii > drop_value] -= 1
n_values = [len(cats) - 1
if self.drop_idx_[i] is not None
n_values = [len(cats) - 1 if to_drop is not None
else len(cats)
for i, cats in enumerate(self.categories_)]
for cats, to_drop in
zip(self.categories_, self.drop_idx_)]
else:
n_values = [cats.shape[0] for cats in self.categories_]

@@ -758,12 +758,9 @@ def inverse_transform(self, X):
n_transformed_features = sum(len(cats)
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

this is repeated from above with n_values. Refactor it.

for cats in self.categories_)
else:
n_transformed_features = sum(len(cats) - 1
if self.drop_idx_[i] is not None
else len(cats)
for i, cats in
enumerate(self.categories_)
)
n_transformed_features = sum(len(cats) - (drop_idx is not None)
for cats, drop_idx in
zip(self.categories_, self.drop_idx_))

# validate shape of passed X
msg = ("Shape of the passed X data is not correct. Expected {0} "
@@ -779,12 +776,10 @@ def inverse_transform(self, X):
found_unknown = {}

for i in range(n_features):
if self.drop is None:
if (self.drop is None) or self.drop_idx_[i] is None:
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@jnothman

jnothman Feb 12, 2019

Member

rm unnecessary parentheses

cats = self.categories_[i]
else:
cats = np.array([cat for idx, cat in
enumerate(self.categories_[i])
if idx != self.drop_idx_[i]])
cats = np.delete(self.categories_[i], self.drop_idx_[i])
n_categories = len(cats)
if n_categories == 0:
This conversation was marked as resolved by drewmjohnston

This comment has been minimized.

Copy link
@NicolasHug

NicolasHug Feb 19, 2019

Contributor

Please don't remove this comment, unless it's out of date:


                    # Only happens if there was a column with a unique
                    # category. In this case we just fill the column with this
                    # unique category value.
X_tr[:, i] = self.categories_[i][self.drop_idx_[i]]
ProTip! Use n and p to navigate between commits in a pull request.
You can’t perform that action at this time.