
ENH Add 'if_binary' option to drop argument of OneHotEncoder #16245

Merged (27 commits), Feb 7, 2020

Conversation

rushabh-v
Contributor

Fixes #12502

Member

@glemaitre glemaitre left a comment

Thanks. It looks pretty good.

@glemaitre glemaitre changed the title Add 'if_binary' option to drop argument of OneHotEncoder ENH Add 'if_binary' option to drop argument of OneHotEncoder Jan 28, 2020
@rth
Member

rth commented Jan 28, 2020

A bit related to categories="dtype" in #15396: I could imagine if_binary=True being the only useful option long term. Given the backward-compatibility constraints, one possibility could be to implement if_binary='warn' and raise a deprecation warning indicating that the default will change from if_binary=False to True in the future, then some versions later remove if_binary altogether.

The alternative could be to make this breaking change in 1.0. WDYT? cc @thomasjpfan
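For illustration only, a minimal sketch of the 'warn' deprecation cycle described above; the if_binary parameter and the helper below are hypothetical and were never added in the end:

import warnings

def _resolve_if_binary(if_binary):
    # Hypothetical helper: 'warn' keeps the old behaviour (False) for now,
    # while announcing that the default will flip to True in a future release.
    if if_binary == 'warn':
        warnings.warn(
            "The default of if_binary will change from False to True. "
            "Set if_binary explicitly to silence this warning.",
            FutureWarning)
        return False
    return if_binary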

@rushabh-v
Contributor Author

if_binary=True being the only useful option long term.

Do you mean drop='if_binary'? @rth

@rth
Member

rth commented Jan 28, 2020

Do you mean drop='if_binary'

Yes, never mind my comment for this PR, particularly since this doesn't add a new parameter. The comment was more about how to make this the default in the long term.

@amueller
Member

@thomasjpfan would be great if you could have a look ;)

Member

@rth rth left a comment

Thanks for the PR @rushabh-v !

So this currently drops the first column for any input with 2 classes, not just binary ones,

>>> OneHotEncoder(drop=None).fit_transform([['a'], ['b'], ['a']]).A
array([[1., 0.],
       [0., 1.],
       [1., 0.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['b'], ['a']]).A
array([[0.],
       [1.],
       [0.]])

I would expect this to only take effect for actually binary columns, not any input with 2 classes. You could add it as a counter-example in the unit tests.

Enabling it, when drop='if_binary', only if the column is of dtype bool might be easier. For int/float I suppose one could apply it if np.unique(X[col]) is [0, 1], but that's a bit more expensive. Let's wait for input from other reviewers before making changes.
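As a rough sketch of that check (a hypothetical helper, not part of this PR), a column would count as strictly binary only if it is boolean or its unique values are a subset of {0, 1}:

import numpy as np

def _is_strictly_binary(col):
    # Hypothetical helper: boolean dtype, or numeric values limited to 0/1.
    col = np.asarray(col)
    if col.dtype == bool:
        return True
    try:
        uniques = np.unique(col.astype(float))
    except (TypeError, ValueError):
        # non-numeric categories such as ['JP', 'DE'] are not binary
        return False
    return set(uniques).issubset({0.0, 1.0})

For example, _is_strictly_binary([0, 1, 1]) is True while _is_strictly_binary(['JP', 'DE']) is False.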

@glemaitre
Member

I would expect to this only take effect for actually binary column, not any input with 2 classes. You could add it as a counter example in unit tests.

@rth how would you tag a feature as "binary"? I would somehow assume that a feature with 2 categories should be close to a binary feature, shouldn't it?

@jnothman
Member

jnothman commented Jan 29, 2020 via email

@rth
Member

rth commented Jan 29, 2020

how would you tag a feature as "binary"? I would somehow assume that a feature with 2 categories should be close to a binary feature, shouldn't it?

I think here we are assuming that two categories means binary.

OK, that also works. Though I think "binary feature" has a very specific meaning: either a column with a boolean dtype or a column with [0, 1]. A column with ["yes", "no"] or ["true", "false"] can be considered binary, but IMO it's up to the initial preprocessing to convert it to binary (also, I have rarely seen this in practice). It might be best to avoid that vocabulary in this PR. Say you have a column with countries ['JP', 'DE']: that's definitely not a binary feature, and it can suggest that other categories are possible, just not present in the data.

It might be OK to keep the name drop="if_binary" but document it without using the "binary feature" term,

  • 'if_binary' : drop the first category for features that have 1 or 2 categories.

(since this PR also drops columns with 1 category, I think). WDYT?

@rushabh-v
Contributor Author

rushabh-v commented Jan 29, 2020

since this PR also drops columns with 1 category, I think

No, it doesn't drop columns with 1 category.

@rth
Member

rth commented Jan 29, 2020

No, it doesn't drop columns with 1 category.

Thanks for the confirmation. It is currently behaving strangely in that case, though, which probably needs fixing:

>>> OneHotEncoder().fit_transform([['a'], ['a'], ['a']]).A
array([[1.],
       [1.],
       [1.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['a'], ['a']]).A
array([[0.],
       [0.],
       [0.]])

In general, just to clarify: implementation-wise, is this expected to behave identically to drop='first' but only apply with 2 classes (in which case we might as well just reuse the drop='first' implementation conditionally), or is there additional logic added in?

@rushabh-v
Contributor Author

In every case where there is a binary feature, drop='if_binary' will work the same as drop='first'. But for drop=None the implementation is different, and that's why in the above example the outputs for drop='if_binary' and drop=None are different.

But IMO, for features with one category drop=None should also return all 0s.
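For reference, the intended end result on a mixed input (one two-category column, one three-category column) would look roughly like this; the output below is what the option as discussed is expected to produce, with categories sorted alphabetically, not a verified transcript:

>>> X = [['Male', 'US'], ['Female', 'JP'], ['Female', 'DE']]
>>> OneHotEncoder(drop='if_binary').fit_transform(X).A
array([[1., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])

Only the first (binary) column has its first category dropped; the country column keeps all three categories.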

@rth
Member

rth commented Jan 29, 2020

In every case where there is a binary feature, drop='if_binary' will work the same as drop='first'. But for drop=None the implementation is different, and that's why in the above example the outputs for drop='if_binary' and drop=None are different.

The above example with drop='first' returns an empty array, which is more like what I would expect:

>>> OneHotEncoder(drop='first').fit_transform([['a'], ['a'], ['a']]).A
array([], shape=(3, 0), dtype=float64)

In terms of implementation, I think we might want to change the signature of _compute_drop_idx to take X as input and do the computation there (or otherwise rethink where we compute self.drop_idx_), as currently it is not correct in some cases:

>>> est = OneHotEncoder(drop='if_binary')
>>> est.fit(np.atleast_2d(['a', 'b', 'c']))
OneHotEncoder(categories='auto', drop='if_binary',
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              sparse=True)
>>> est.drop_idx_
array([0, 0, 0])

In general we should compute drop_idx_ and then apply it, instead of adding custom logic after it is computed. I have not looked at the OHE code enough to say what would be the fastest way to get there.

@rushabh-v
Contributor Author

drop_idx_ contains the index of the category to be dropped for each feature. What should we put there in the case of non-binary features, then?

@rushabh-v
Contributor Author

rushabh-v commented Jan 29, 2020

And even if we change the logic of _compute_drop_idx (like putting None or something there for non-binary features), I think the dropping logic would become more expensive in the case of drop='if_binary'. Can't we correct drop_idx_ by adding some extra conditions or something?

@rth
Member

rth commented Jan 29, 2020

And even if we change the logic of _compute_drop_idx, then I think the dropping logic would become more expensive in the case of drop='if_binary'. Can't we correct drop_idx_ by adding some extra conditions or something?

Yes, it may have some additional cost, but basically the logic for drop should be:

  • in fit compute the columns to drop and store them in drop_idx_
  • in transform drop according to drop_idx_

so we cannot fix drop_idx_ in transform, as the determination of "binary" features needs to happen based on the training data in fit.

@rushabh-v
Contributor Author

rushabh-v commented Jan 29, 2020

What I meant was that we can edit _compute_drop_idx so that drop_idx_ would be correct every time.

And in the case of drop='if_binary', instead of using drop_idx_ we could use a temporary variable drop_bin = np.zeros(len(self.categories_), dtype=np.int_) inside transform (because that's what it expects every time), and use that for all the computations.

Correct me if I am wrong!

@rth
Member

rth commented Jan 29, 2020

What I meant was that we can edit _compute_drop_idx so that drop_idx_ would be correct every time.

Yes, that is what I mean.

Something like: in fit,

self.drop_idx_ = self._compute_drop_idx(self.categories_)

with

def _compute_drop_idx(self, categories):
    ...
    elif (isinstance(self.drop, str) and self.drop == 'if_binary'):
        return np.array([0 if len(col_cats) == 2 else None
                         for col_cats in categories], dtype=object)

and then in transform mostly what we have now, with a check for None in drop_idx_[i]. Though that would mean that drop_idx_ can be an array of objects, while previously we always had an array of ints, which might need some adjustment and documentation. Or we could use an int -1 (or some other negative value) to indicate that a column should not be dropped, but that's also suboptimal.
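To make the transform side concrete, a minimal illustrative sketch (the names here are made up, not the final implementation) of how an object-dtype drop_idx_ with None entries could be turned into a mask of one-hot output columns to keep:

import numpy as np

def _kept_columns_mask(categories, drop_idx):
    # Illustrative only: one boolean entry per one-hot output column.
    mask = []
    for cats, to_drop in zip(categories, drop_idx):
        keep = np.ones(len(cats), dtype=bool)
        if to_drop is not None:
            keep[to_drop] = False  # drop this category's column
        mask.append(keep)
    return np.concatenate(mask)

For example, with categories [['a', 'b'], ['x', 'y', 'z']] and drop_idx np.array([0, None], dtype=object), this keeps 1 of the first feature's 2 columns and all 3 of the second feature's columns.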

@rushabh-v
Contributor Author

rushabh-v commented Jan 30, 2020

So right now I am considering implementing it with an object array, just to check whether it works fine or not. Then we will change it if ints are required.

@rushabh-v rushabh-v requested a review from NicolasHug February 4, 2020 15:24
Member

@NicolasHug NicolasHug left a comment

LGTM, thanks @rushabh-v !

@NicolasHug
Member

I'm still getting a 500 from codecov, not sure what's going on. Not merging just yet in case @thomasjpfan wants to have a look?

@rushabh-v rushabh-v requested a review from thomasjpfan February 4, 2020 16:35
Member

@thomasjpfan thomasjpfan left a comment

inverse_transform needs to be updated and tested with the new 'if_binary' option.

@jnothman
Member

jnothman commented Feb 4, 2020

Another small decision: if we look across the codebase, do we tend to use underscores, spaces, hyphens or other separators for words in param values? Should it be 'if_binary' or 'if binary' or 'ifbinary'??

@rth
Member

rth commented Feb 4, 2020

Should it be 'if_binary' or 'if binary' or 'ifbinary'??

+1 for 'if_binary'

@rushabh-v
Contributor Author

rushabh-v commented Feb 5, 2020

@thomasjpfan I have updated the inverse_transform and added the test, can you review, please?

@rushabh-v rushabh-v requested a review from thomasjpfan February 5, 2020 19:00
Member

@thomasjpfan thomasjpfan left a comment

I am happy with if_binary as the keyword option.

@@ -265,6 +265,16 @@ def test_one_hot_encoder_inverse(sparse_, drop):
        enc.inverse_transform(X_tr)


def test_one_hot_encoder_inverse_if_binary():
    X = [['Male', 1],
Member

We can make the input a numpy array:

    X = np.array([['Male', 1],
                  ['Female', 3],
                  ['Female', 2]])

and then

assert_array_equal(ohe.inverse_transform(X_tr), X)
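Put together, the suggested test might look roughly like this (a sketch only: the imports, the encoder construction, and the fit_transform line are assumed; only the array input and the assert_array_equal call come from the suggestion above):

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.preprocessing import OneHotEncoder

def test_one_hot_encoder_inverse_if_binary():
    # round-trip: encode with drop='if_binary', then invert back to the input
    X = np.array([['Male', 1],
                  ['Female', 3],
                  ['Female', 2]])
    ohe = OneHotEncoder(drop='if_binary', sparse=False)
    X_tr = ohe.fit_transform(X)
    assert_array_equal(ohe.inverse_transform(X_tr), X)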

Contributor Author

done

@@ -448,8 +474,13 @@ def inverse_transform(self, X):
            n_transformed_features = sum(len(cats)
                                         for cats in self.categories_)
        else:
            n_transformed_features = sum(len(cats) - 1
                                         for cats in self.categories_)
        if isinstance(self.drop, str) and self.drop == 'if_binary':
Member

Please use elif if it is appropriate here

Contributor Author

done

@rushabh-v
Contributor Author

Why does it say "Cmd.exe exited with code '1'"?

@rushabh-v
Contributor Author

All green now!

@rth
Member

rth commented Feb 7, 2020

Thanks @rushabh-v and all reviewers!

Successfully merging this pull request may close these issues.

RFC Behavior of OneHotEncoder on binary features