
[MRG + 1] ENH: new CategoricalEncoder class #9151

Merged: 37 commits into scikit-learn:master from jorisvandenbossche:pr/6559 on Nov 21, 2017

@jorisvandenbossche (Contributor) commented Jun 18, 2017

This is currently simply a rebase of PR #6559 by @vighneshbirodkar.

Some context: #6559 was the first of a series of PRs related to this; it added a CategoricalEncoder. It was then decided, instead of adding a new class, to add this functionality to the existing OneHotEncoder (#8793 and #7327), an effort recently taken up by @stephen-hoover in #8793.

At the sprint we discussed this, and @amueller put a summary of that in #8793 (comment).
The main reason not to add this to OneHotEncoder is that the behaviour is fundamentally different: OneHotEncoder determines the categories from the range of the positive integer values passed in, while the new CategoricalEncoder would determine them from the unique values. Moreover, almost all keywords, attributes and behaviour of the current OneHotEncoder would need to be deprecated, which would make a single class supporting both the deprecated and the new behaviour overly complex.
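To make the contrast concrete, here is a minimal sketch using the pre-0.20 OneHotEncoder API (the CategoricalEncoder behaviour described above is the proposal, not existing code):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0], [2]])

# The existing OneHotEncoder treats values as non-negative integers: with
# n_values=3 it allocates a column for every value in range(3), including
# the category 1 that never occurs in the data.
enc = OneHotEncoder(n_values=3)
enc.fit_transform(X).toarray()
# array([[ 1.,  0.,  0.],
#        [ 0.,  0.,  1.]])

# The proposed CategoricalEncoder would instead derive the categories from
# the unique values actually seen (here 0 and 2), giving only two columns.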

The basics already work nicely with the rebased PR:

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]: 
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Some changes I would like to make to the current PR:

  • add some more tests
  • rename the 'classes' keyword to 'categories' (IMO this is more in line with the name of the class, 'CategoricalEncoder')
  • possibly remove the categorical_features keyword (the ability to select certain columns) for now to keep things simpler, as this can always be achieved in combination with the ColumnTransformer
  • add a categories_ attribute that is a dict of {column number/name: [categories]}, and maybe hide the underlying LabelEncoders from users (they are currently stored in the label_encoders_ attribute)
  • add support for pandas DataFrames (they already work, but it would be nice to keep the column -> categories information, see the previous point)
  • don't deprecate OneHotEncoder for now (we can leave this for a separate discussion)
  • move to sklearn.experimental (if we stick to this for the ColumnTransformer)
  • add a get_feature_names() method?

But before doing that, I wanted to check if we agree on this way forward (separate CategoricalEncoder class). @jnothman are you OK with this or still in favor of doing the changes in the OneHotEncoder?

If we want to keep the name OneHotEncoder, another option would be to implement the 'new' OneHotEncoder in e.g. a 'sklearn.future' module, so people can still move to it gradually and the current one can be deprecated, while keeping the implementations separate.

Closes #7375, closes #7327, closes #4920, closes #3956

Related issues that can possibly be closed as well: #3599, #8136

@@ -1197,7 +1197,7 @@ See the :ref:`metrics` section of the user guide for further details.
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.Normalizer
preprocessing.OneHotEncoder

@amueller (Member) Jun 18, 2017

Does this indicate deprecation of the old class? I'm not sure we want to do that yet.

@jorisvandenbossche (Author, Contributor) Jun 18, 2017

Yeah, for now I just rebased the old PR; see my to-do list in the top post. I would leave the OneHotEncoder as-is for now. We can always decide to deprecate it later if we want.

@amueller (Member) Jun 18, 2017

Looked at the code before I looked at your description. I think your description is a good summary.

@amueller (Member) commented Jun 18, 2017

I'm generally good with the to-dos, though the categories_ attribute is lower priority to me than the rest.

@amueller (Member) commented Jun 18, 2017

And obviously I would prefer not to move this to a different module. I'd be fine with adding a note to the docs that this is experimental and might change, but I won't press that point so we can move forward.

@amueller modified the milestone: 0.19 Jun 19, 2017

@jnothman (Member) left a comment

At least for the record, could you please remind me why this is superior to DictVectorizer().fit(frame.to_dict(orient='records')) and to ColumnTransformer([(f, LabelEncoder(), f) for f in fields])?

I appreciate the difference of this from OHE, and that it provides a more definitive interface for this kind of operation. We should at the same time clarify what OHE is for (ordinals; #8628 should get a similar interface) and what LabelEncoder is not for.
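For reference, a minimal sketch of the DictVectorizer route (note it needs to_dict rather than to_json; the frame and its column names are made up for illustration). As discussed below, it dummy-encodes string columns but passes numeric columns through:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

frame = pd.DataFrame({'city': ['London', 'Paris'], 'temp': [21, 19]})

dv = DictVectorizer(sparse=False)
dv.fit_transform(frame.to_dict(orient='records'))
# array([[  1.,   0.,  21.],
#        [  0.,   1.,  19.]])

dv.get_feature_names()
# ['city=London', 'city=Paris', 'temp']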

Parameters
----------
classes : 'auto', 2D array of ints or strings or both.

@jnothman (Member) Jun 19, 2017

not 2d. A list of lists of values.

- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.
categorical_features : 'all' or array of indices or mask

@jnothman (Member) Jun 19, 2017

I'd support getting rid of this as unnecessary complexity.

Values per feature.
- 'auto' : Determine classes automatically from the training data.
- array: ``classes[i]`` holds the classes expected in the ith column.

@jnothman (Member) Jun 19, 2017

Does this specify the order of these values in the output, or just the set? Must be clear.

dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
encoding of dictionary items or strings.
"""

@jnothman (Member) Jun 19, 2017

Surely we need See Also to also describe the relationship to / distinction from OHE

@amueller (Member) commented Jun 19, 2017

ColumnTransformer([(f, LabelEncoder(), f) for f in fields])

Followed by some version of the one-hot-encoder, right?
Or I guess with LabelBinarizer() it would be fine? If that's a correct implementation that allows for an inverse_transform and get_feature_names. I'm all for it.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

If we actually decide (which was the consensus at the sprint between @GaelVaroquaux, @jorisvandenbossche and me) that we always want to transform all columns, then maybe one of these implementations could actually work.

I would like something discoverable and with good feature names and the possibility to have some feature provenance in the future.

Maybe someone can write a blogpost about all the subtle differences between these lol.
I think that DictVectorizer().fit(frame.to_dict(orient='records')) is a bit obscure, and it throws away the dtype of the columns, right?


@jorisvandenbossche (Author, Contributor) commented Jun 19, 2017

ColumnTransformer([(f, LabelBinarizer(), f) for f in fields])

If that's a correct implementation that allows for an inverse_transform and get_feature_names. I'm all for it.

The problem with LabelBinarizer is that it currently only works on strings, not numerical values (which could maybe be fixed), and that it doesn't play nice with the ColumnTransformer because both X and y are passed to the transformers (but there is a PR to fix this?).
It also doesn't give us get_feature_names out of the box.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform which we decided is out-of-scope?

DictVectorizer seems to work and gives us get_feature_names, but it treats string and numerical values differently: strings are dummy-encoded, while integers are just passed through. So not fully the behaviour we want.
I also think the conversion to dicts (instead of working on the columns as arrays) can become quite costly for larger datasets.

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

Indeed, using CountVectorizer(analyzer=lambda x: [x]) gives us more or less exactly what we want. It also gives us get_feature_names (we would only have to fix it to deal with mixed string and numerical values).
So this could be used under the hood instead of LabelEncoder/OneHotEncoder, but given the quite different original use case, I am not sure this is a good way to go.

Full experimentation of the different possibilities: http://nbviewer.ipython.org/d6a79e96b490872905e74202d0818ab2
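For the record, a minimal sketch of that CountVectorizer hack on a single column:

from sklearn.feature_extraction.text import CountVectorizer

col = ['a', 'b', 'a', 'c']

# The analyzer wraps each value in a list, so every "document" consists of
# exactly one "token" and the count matrix becomes a dummy encoding.
vec = CountVectorizer(analyzer=lambda x: [x])
vec.fit_transform(col).toarray()
# array([[1, 0, 0],
#        [0, 1, 0],
#        [1, 0, 0],
#        [0, 0, 1]])

vec.get_feature_names()
# ['a', 'b', 'c']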


@jorisvandenbossche (Author, Contributor) commented Jun 19, 2017

LabelEncoder just does np.unique(y) to determine the distinct classes in fit, np.unique(y, return_inverse=True) for the conversion to integer codes in fit_transform, and np.searchsorted in transform.
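In NumPy terms, roughly:

import numpy as np

y = np.array(['b', 'a', 'c', 'a'])

# fit: the sorted unique values become the classes
classes = np.unique(y)                           # array(['a', 'b', 'c'], dtype='<U1')

# fit_transform: return_inverse directly yields the integer codes
classes, codes = np.unique(y, return_inverse=True)
codes                                            # array([1, 0, 2, 0])

# transform on new data: binary search against the learned classes
np.searchsorted(classes, np.array(['c', 'a']))   # array([2, 0])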


@amueller modified the milestones: 0.20, 0.19 Jun 19, 2017

First round of updates
- remove compat code for numpy < 1.8
- remove categorical_features keyword
- make label_encoders_ private
- rename classes to categories

@stephen-hoover referenced this pull request Jun 22, 2017: [MRG] Support for strings in OneHotEncoder #8793 (closed, 4 of 4 tasks complete)
@Trion129 (Contributor) commented Jun 26, 2017

Hi Sci-kittens! :D I recently suggested on the mailing list having a drop_one parameter in the OneHotEncoder, so that one of the columns at the beginning or end of the encoded array is dropped (keeping it is like duplicating a column, and dropping it benefits some models like LinearRegression). I was pointed to this PR; could this be added to the new CategoricalEncoder?

further clean-up + tests
- check that it works on pandas frames
- fix doctests
- un-deprecate OneHotEncoder
- undo changes in _transform_selected (as we no longer need those changes for CategoricalEncoder)
- add see also to OneHotEncoder and vice versa
- for now remove the self.feature_indices_ attribute

@jorisvandenbossche force-pushed the jorisvandenbossche:pr/6559 branch from 6c764e0 to 5f2b403 Jun 27, 2017

@raghavrv self-requested a review Jun 27, 2017

@jorisvandenbossche (Author, Contributor) commented Jun 27, 2017

OK, I cleaned up the code a bit further and added some more tests; I think this is ready for more detailed review.
The current PR is basically the simplest version of a CategoricalEncoder, which just does what you need in most cases and should be rather uncontroversial, but without many additional features/attributes (so e.g. no attributes yet to inspect the categories, no get_feature_names, no inverse_transform, ...).

@Trion129 That's indeed a possible extension of the current PR. If this is desired, we could add a keyword that determines this behaviour, with the default being to not drop any column (the current behaviour). As a reference, pandas' get_dummies uses a drop_first keyword for this.
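For reference, the pandas behaviour looks like this:

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

pd.get_dummies(s)
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  1  0  0
# 3  0  0  1

# drop_first=True drops the first level, removing the redundant column:
pd.get_dummies(s, drop_first=True)
#    b  c
# 0  0  0
# 1  1  0
# 2  0  0
# 3  0  1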

@jorisvandenbossche (Author, Contributor) commented Jun 27, 2017

(The docs should probably be updated further, as for now it just changes the example using OneHotEncoder to use CategoricalEncoder. But we might want to keep an example with OneHotEncoder as well, if there is a good one.)

@amueller (Member) commented Jul 12, 2017

Btw, in our tutorial @agramfort mentioned using integer encodings for categorical variables, as that works reasonably well for trees. Should we implement that? If so, in this estimator? But probably not for now.

@amueller (Member) left a comment

Mostly looks good. Needs attribute documentation and then we can talk about what a good way to expose the per-feature categories is. Maybe also get_feature_names?

sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features
encoded as columns of integers.
sklearn.preprocessing.CategoricalEncoder : handles nominal/categorical
features encoded as columns of arbitraty data types.

@amueller (Member) Jul 12, 2017

is it? or strings or integers? What happens with pandas categorical?

@amueller (Member) Aug 28, 2017

arbitraty -> arbitrary

@@ -1730,13 +1728,14 @@ def _transform_selected(X, transform, selected="all", copy=True):


class OneHotEncoder(BaseEstimator, TransformerMixin):
-    """Encode categorical integer features using a one-hot aka one-of-K scheme.
+    """Encode ordinal integer features using a one-hot aka one-of-K scheme.

@amueller (Member) Jul 12, 2017

I don't think ordinal is right as they are not ordered.

@jorisvandenbossche (Author, Contributor) Jul 25, 2017

I used "ordinal" to indicate that the actual set of values do mean something (eg the values [1,4,5] means that there are also categories 2 and 3 which are not present in the data), but I agree that this is not the same as being ordered.
That said, I also find 'categorical' misleading given the above behaviour.

@amueller (Member) Jul 25, 2017

I would say "categorical" and, in the next line, clarify that they need to be 0 to (n_categories - 1).

handle_unknown : str, 'error' or 'ignore'
Whether to raise an error or ignore if an unknown categorical feature is
present during transform.

@amueller (Member) Jul 12, 2017

we always raise an error when categories is not auto and an unknown category is encountered during training, right? Maybe we should make that explicit?

@amueller (Member) Jul 12, 2017

We should probably say what happens with the unknown value: all columns will be zero.

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

we always raise an error when categories is not auto and an unknown category is encountered during training, right?

Shouldn't we also let that depend on this keyword? So when you specify the categories yourself, you can also ignore such errors for unknown values with this keyword (the default is to raise anyhow).

@amueller (Member) Aug 1, 2017

Ok, I'm fine with that.

Examples
--------
Given a dataset with three features and two samples, we let the encoder
find the maximum value per feature and transform the data to a binary

@amueller (Member) Jul 12, 2017

maximum value? Unique values, right? Maybe add in a 999? or just a 5?

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

Indeed, a leftover from OneHotEncoder.

I will add another number. Ideally a string as well, but that's not possible when just using simple lists for the example...

@amueller (Member) Aug 28, 2017

still says maximum ;)

See also
--------
sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
integer ordinal features. This transformer assumes that input features

@amueller (Member) Jul 12, 2017

I don't think ordinal is right. "This" -> OneHotEncoder (otherwise I feel it's ambiguous)

enc = CategoricalEncoder(sparse=False)
Xtr3 = enc.fit_transform(X)

assert_allclose(Xtr1.toarray(), Xtr2.toarray())

@amueller (Member) Jul 12, 2017

I don't understand this test. Determinism?

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

Good catch. Before, they tested something different (a different selection of the columns, but I removed that feature), so this is not needed anymore.

assert_allclose(Xtr, [[1, 0, 1, 0], [0, 1, 0, 1]])

Xtr = CategoricalEncoder().fit_transform(X)
assert_allclose(Xtr.toarray(), [[1, 0, 1, 0, 1], [0, 1, 0, 1, 1]])

@amueller (Member) Jul 12, 2017

We should probably explain the handling of constant features. Does this make the most sense? Not sure what else....

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

So you mean the case where there is just one category, which now gets one column with ones?
I'm also not sure what else we could do for that.

@amueller (Member) Aug 1, 2017

Yes. We could also remove it, or make it zeros. But ones is maybe fine? We should just be explicit about what we do.
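For illustration, assuming the CategoricalEncoder from this PR: a feature with a single category currently encodes to one column of ones.

X = [['a'], ['a'], ['a']]

# one category -> one output column, which is then constant 1
CategoricalEncoder().fit_transform(X).toarray()
# array([[ 1.],
#        [ 1.],
#        [ 1.]])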

enc.fit(X)

X[0][0] = -1
msg = re.escape('Unknown feature(s) [-1] in column 0')

@amueller (Member) Jul 12, 2017

Shouldn't it be feature value? Unknown category?

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

Indeed. I went with 'category'.

assert_array_equal(enc.fit_transform(X).toarray(), exp)

# don't follow order of passed categories, but sort them
enc = CategoricalEncoder(categories=[['c', 'b', 'a']])

@amueller (Member) Jul 12, 2017

That's not documented, is it?

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

See last bullet point in #9151 (but indeed, this is not yet documented)

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

For now, I documented it as it currently behaves.

enc = CategoricalEncoder(categories=[['a', 'b']])
assert_raises(ValueError, enc.fit, X)
enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X)

@amueller (Member) Jul 12, 2017

We should check the outcome for that. We only get two columns, right?

@jorisvandenbossche (Author, Contributor) Aug 1, 2017

Currently this gives a row of all zeros (with indeed only two columns). Is this the desired outcome? (Not sure what else to do.)
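A hypothetical illustration of that behaviour, assuming the API in this PR:

import numpy as np

X = np.array([['a'], ['b'], ['c']], dtype=object)

enc = CategoricalEncoder(categories=[['a', 'b']], handle_unknown='ignore')
enc.fit(X[:2])
enc.transform(X).toarray()
# array([[ 1.,  0.],
#        [ 0.,  1.],
#        [ 0.,  0.]])   # the unknown 'c' becomes an all-zero row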


@amueller (Member) commented Jul 21, 2017

Any news on this? ;)

@jorisvandenbossche (Author, Contributor) commented Nov 20, 2017

We're gonna do get_feature_names in a follow-up PR, right?

Yes, that is on my list of follow-ups (#9151 (comment)), although there are some questions about what it should exactly do (#9151 (comment))

The question is whether we want dataframe in -> dataframe out?

I didn't consider 'dataframe out' (that could also be a consideration, but a much bigger one, I think). Here it was more about having some special code inside the transformer to prevent converting a dataframe with different dtypes to an 'object'-dtyped array (check_array). This conversion is not really needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the dtypes per column. The transformed output will always have a uniform dtype anyway, so it is fine for that to be an array.
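A small sketch of the conversion in question (the frame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'b'], 'num': [1, 2]})

# check_array turns a mixed-dtype frame into a single object-dtyped array:
np.asarray(df).dtype        # dtype('O')

# working column by column would keep each column's own dtype:
df['num'].values.dtype      # dtype('int64')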

@amueller (Member) commented Nov 20, 2017

@jorisvandenbossche ah, that makes more sense. I would leave it as-is. Merge?

@amueller (Member) commented Nov 20, 2017

I think the proposal with something like ['0__female', '0__male', '1__0', '1__1'] is good. I would do it as in PolynomialFeatures, which uses x0, x1, etc. (maybe), with an option to pass in input feature names to transform them. That would allow preserving the semantics more easily.
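A hypothetical sketch of that scheme (make_feature_names is invented for illustration, not part of this PR):

def make_feature_names(categories, input_features=None):
    # default to x0, x1, ... as PolynomialFeatures does, but allow the
    # caller to pass in the real input feature names
    if input_features is None:
        input_features = ['x%d' % i for i in range(len(categories))]
    return ['%s__%s' % (name, cat)
            for name, cats in zip(input_features, categories)
            for cat in cats]

make_feature_names([['female', 'male'], [0, 1]])
# ['x0__female', 'x0__male', 'x1__0', 'x1__1']

make_feature_names([['female', 'male'], [0, 1]], input_features=['sex', 'AB'])
# ['sex__female', 'sex__male', 'AB__0', 'AB__1']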

@jorisvandenbossche (Author, Contributor) commented Nov 21, 2017

Merge?

Yes, I think somebody can merge it!

@jnothman (Member) commented Nov 21, 2017

Sure. Let's see what this thing does in the wild!

@jnothman merged commit a2ebb8c into scikit-learn:master Nov 21, 2017

5 of 6 checks passed:

  • codecov/patch: 90.09% of diff hit (target 96.19%)
  • ci/circleci: Your tests passed on CircleCI!
  • codecov/project: 96.16% (-0.04%) compared to abb43c1
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • lgtm analysis: Python: No alert changes

@jnothman (Member) commented Nov 21, 2017

Congratulations! And thanks. Please feel free to make follow-up issues.

@jorisvandenbossche (Author, Contributor) commented Nov 21, 2017

And thanks a lot for all the review!

@jorisvandenbossche deleted the jorisvandenbossche:pr/6559 branch Nov 21, 2017

@amueller (Member) commented Nov 21, 2017

My excitement about this is pretty much through the roof lol ;)

@amueller (Member) commented Nov 21, 2017

I would appreciate it if you could focus on #9012; maybe we can leave get_feature_names for someone else, it shouldn't be too tricky.

@vighneshbirodkar (Contributor) commented Nov 21, 2017

Congratulations and thanks @jorisvandenbossche
It is nice to see this finally merged.

@austinmw commented Jan 11, 2018

Hi, any chance you could add a drop_first parameter like in pandas.get_dummies()? It'd make it easier to put this in a pipeline without requiring something additional to drop a column.


@jorisvandenbossche (Author, Contributor) commented Jan 23, 2018

For people who are subscribed here: I opened a new issue with some questions on the API and naming: #10521

@rragundez (Contributor) commented Mar 11, 2019

CategoricalEncoder is now refactored into OneHotEncoder and OrdinalEncoder. See #10523.
