[MRG + 1] ENH: new CategoricalEncoder class #9151

Merged

merged 37 commits into scikit-learn:master from jorisvandenbossche:pr/6559 on Nov 21, 2017

Conversation

8 participants
jorisvandenbossche (Contributor) commented Jun 18, 2017

This is currently simply a rebase of PR #6559 by @vighneshbirodkar .

Some context: PR #6559 was the first of a series of related PRs; it added a CategoricalEncoder. It was then decided, instead of adding a new class, to add this functionality to the existing OneHotEncoder (#8793 and #7327), most recently taken up by @stephen-hoover in #8793.

At the sprint we discussed this, and @amueller put a summary of that in #8793 (comment).
The main reason not to add this to OneHotEncoder is that it is fundamentally different behaviour (OneHotEncoder determines the categories based on the range of the positive integer values passed in, while the new CategoricalEncoder would determine them based on the unique values), and that almost all keywords, attributes and behaviour of the current OneHotEncoder would have to be deprecated, which would make a single-class implementation (deprecated + new behaviour) overly complex.
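To make the difference concrete, here is a minimal sketch (it uses the old OneHotEncoder API with n_values from that era; the CategoricalEncoder calls are shown as comments since the class only exists in this PR, and the shapes shown for it are assumptions based on the description above):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_int = np.array([[0], [2], [2]])

# The existing OneHotEncoder treats values as indices into a 0..n_values-1
# range, so explicitly passing n_values=5 yields five output columns:
OneHotEncoder(n_values=5, sparse=False).fit_transform(X_int).shape   # (3, 5)

# The proposed CategoricalEncoder would instead derive the categories from the
# unique values seen during fit, so only two columns (for 0 and 2) result,
# and string-valued input works out of the box:
# CategoricalEncoder().fit_transform(X_int).toarray().shape          # (3, 2)
# CategoricalEncoder().fit_transform(np.array([['a'], ['c'], ['c']]))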

The basics already work nicely with the rebased PR:

In [30]: cat = CategoricalEncoder()

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]: 
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Some changes I would like to make to the current PR:

  • add some more tests
  • rename the 'classes' keyword to 'categories' (IMO this is more in line with the name of the class 'CategoricalEncoder')
  • possibly remove the categorical_features keyword (the ability to select certain columns) for now to keep it simpler, as this can always be achieved in combination with the ColumnTransformer
  • add a categories_ attribute that is a dict of {column number/name: [categories]} (see the sketch after this list). And maybe the underlying LabelEncoders can be hidden from the user (they are currently stored in the label_encoders_ attribute)
  • add support for pandas DataFrames (they already work, but it would be nice to keep the column -> categories information, see the previous point)
  • don't deprecate OneHotEncoder for now (we can leave this for a separate discussion)
  • move to sklearn.experimental (if we keep to this for the ColumnTransformer)
  • add a get_feature_names() method?
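A hypothetical illustration of the proposed categories_ layout, computed by hand from the example data above (the attribute layout is an assumption, not what this PR currently implements):

import numpy as np

X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

# per-column sorted unique values, keyed by column index
# (or by column name in the DataFrame case):
categories_ = {i: np.unique(X[:, i]) for i in range(X.shape[1])}
print(categories_)
# {0: array(['a', 'b', 'c'], dtype=object), 1: array([0, 1], dtype=object)}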

But before doing that, I wanted to check if we agree on this way forward (separate CategoricalEncoder class). @jnothman are you OK with this or still in favor of doing the changes in the OneHotEncoder?

If we want to keep the name OneHotEncoder, another option would be to implement the 'new' OneHotEncoder in eg a 'sklearn.future' module, so people can still move to it gradually and the current one can be deprecated, but keeping the implementations separate.

Closes #7375, closes #7327, closes #4920, closes #3956

Related issues that can possibly be closed as well: #3599, #8136

amueller (Member) commented Jun 18, 2017

I'm generally good with the to-dos, though the categories_ attribute is lower priority to me than the rest.

amueller (Member) commented Jun 18, 2017

And obviously I would prefer not to move this to a different module; I'd be fine with adding a note to the docs that this is experimental and might change, but I won't press that point so we can move forward.

@amueller amueller modified the milestone: 0.19 Jun 19, 2017

@jnothman

At least for the record, could you please remind me why this is superior to DictVectorizer().fit(frame.to_dict(orient='records')) and to ColumnTransformer([(f, LabelEncoder(), f) for f in fields])?

I appreciate the difference of this from OHE, and that it provides a more definitive interface for this kind of operation. We should at the same time clarify what OHE is for (ordinals; #8628 should get a similar interface) and what LabelEncoder is not for.
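For readers unfamiliar with that alternative, a small sketch of the DictVectorizer route (the example frame is made up; get_feature_names is the API of that era, newer scikit-learn versions use get_feature_names_out):

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

frame = pd.DataFrame({'city': ['London', 'Paris', 'London'],
                      'temp': [12.0, 18.0, 10.0]})

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(frame.to_dict(orient='records'))
print(vec.get_feature_names())   # ['city=London', 'city=Paris', 'temp']
print(X)
# [[ 1.  0. 12.]
#  [ 0.  1. 18.]
#  [ 1.  0. 10.]]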

(Inline review comment on a See Also section in sklearn/preprocessing/data.py:)

dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
encoding of dictionary items or strings.
"""

jnothman (Member) commented Jun 19, 2017

Surely we need See Also to also describe the relationship to / distinction from OHE

amueller (Member) commented Jun 19, 2017

ColumnTransformer([(f, LabelEncoder(), f) for f in fields])

Followed by some version of the one-hot encoder, right?
Or I guess with LabelBinarizer() it would be fine? If that's a correct implementation that allows for an inverse_transform and get_feature_names, I'm all for it.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform, which we decided is out-of-scope?

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

If we actually decide (which was the consensus at the sprint between @GaelVaroquaux, @jorisvandenbossche and me) that we always want to transform all columns, then maybe one of these implementations could actually work.

I would like something discoverable, with good feature names and the possibility to have some feature provenance in the future.

Maybe someone can write a blogpost about all the subtle differences between these lol.
I think that DictVectorizer().fit(frame.to_dict(orient='records')) is a bit obscure, and it throws away the dtype of the columns, right?

jnothman (Member) commented Jun 19, 2017
jorisvandenbossche (Contributor) commented Jun 19, 2017

ColumnTransformer([(f, LabelBinarizer(), f) for f in fields])

If that's a correct implementation that allows for an inverse_transform and get_feature_names, I'm all for it.

Problem with the LabelBinarizer is that it currently only works on strings, not numerical values (which could maybe be fixed), and that it doesn't play nice with the ColumnTransformer because both X and y are passed to the transformers (but there is a PR to fix this?).
It also doesn't give us a get_feature_names out of the box.

DictVectorizer().fit(frame.to_dict(orient='records'))

Tbh I'm not familiar enough with how DictVectorizer treats integers and floats for that. Maybe a good argument would be the possibility of an exact inverse_transform, which we decided is out-of-scope?

DictVectorizer seems to work and gives us get_feature_names, but it treats string and numerical values differently: strings are dummy-encoded, while integers are just passed through. So not fully the behaviour we want.
I also think the conversion to a dict (instead of working on the columns as arrays) can become quite costly for larger datasets.

There is also somewhere a hack that uses CountVectorizer(analyzer=lambda x: x) or something like that, and that also works for a single column.

It is indeed CountVectorizer(analyzer=lambda x: [x]) that gives us more or less exactly what we want. It also gives us get_feature_names (we only have to fix it to be able to deal with mixed string and numerical values).
So this could be used under the hood instead of LabelEncoder/OneHotEncoder. But given the quite different original use case, I am not sure this is a good way to go.

Full experimentation with the different possibilities: http://nbviewer.ipython.org/d6a79e96b490872905e74202d0818ab2
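For reference, a minimal sketch of the CountVectorizer hack mentioned above, on a single string-valued column (get_feature_names is the API of that era; newer scikit-learn versions use get_feature_names_out):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

col = np.array(['a', 'b', 'a', 'c'])

# Treat each cell as a "document" containing exactly one token, so the
# term-document matrix is effectively a one-hot encoding of the column:
vec = CountVectorizer(analyzer=lambda x: [x])
print(vec.fit_transform(col).toarray())
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]]
print(vec.get_feature_names())  # ['a', 'b', 'c']
# Note: this breaks down for mixed string/numeric columns, as discussed above.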

jnothman (Member) commented Jun 19, 2017

jorisvandenbossche (Contributor) commented Jun 19, 2017

LabelEncoder just does a np.unique(y) to determine the different classes in fit, a np.unique(y, return_inverse=True) for the conversion to integer codes in fit_transform, and a np.searchsorted in transform.
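Spelled out as a small sketch of those numpy calls (illustrative only, not LabelEncoder's actual source):

import numpy as np

y = np.array(['b', 'a', 'c', 'a'])

# fit: the sorted unique values become classes_
classes = np.unique(y)                          # ['a' 'b' 'c']

# fit_transform: return_inverse directly yields the integer codes
classes, codes = np.unique(y, return_inverse=True)
print(codes)                                    # [1 0 2 0]

# transform on new data: look up positions in the already-fitted classes
y_new = np.array(['c', 'b'])
print(np.searchsorted(classes, y_new))          # [2 1]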

jnothman (Member) commented Jun 19, 2017

@amueller amueller modified the milestones: 0.20, 0.19 Jun 19, 2017

First round of updates
- remove compat code for numpy < 1.8
- remove categorical_features keyword
- make label_encoders_ private
- rename classes to categories

@stephen-hoover stephen-hoover referenced this pull request Jun 22, 2017

Closed

[MRG] Support for strings in OneHotEncoder #8793

4 of 4 tasks complete
Trion129 (Contributor) commented Jun 26, 2017

Hi Sci-kittens! :D I recently suggested on the mailing list having a drop_one parameter in the OneHotEncoder, so that one of the columns at the beginning or end of the encoded array is dropped (the full set of dummy columns is redundant, since each column can be inferred from the others); that would benefit some models like LinearRegression. I got pointed to this PR, and would like to know if it can be added to the new CategoricalEncoder?

jorisvandenbossche added some commits Jun 27, 2017

further clean-up + tests
- check that it works on pandas frames
- fix doctests
- un-deprecate OneHotEncoder
- undo changes in _transform_selected (as we no longer need those changes for CategoricalEncoder)
- add see also to OneHotEncoder and vice versa
- for now remove the self.feature_indices_ attribute

@raghavrv raghavrv self-requested a review Jun 27, 2017

jorisvandenbossche (Contributor) commented Jun 27, 2017

OK, I cleaned up the code a bit further and added some more tests; I think this is ready for more detailed review.
The current PR is basically the simplest version of a CategoricalEncoder, which just does what you need in most cases and should be rather uncontroversial, but without many additional features/attributes (so e.g. no attributes yet to inspect the categories, no get_feature_names, no inverse_transform, ...).

@Trion129 That's indeed a possible extension of the current PR. If this is desired, we could add a keyword that determines this behaviour, with the default being to not drop any column (the current behaviour). As a reference, pandas' get_dummies uses a drop_first keyword for this.
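For reference, the pandas behaviour mentioned above (output shown with integer dummies as in older pandas; recent versions return boolean columns):

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'])

print(pd.get_dummies(s))
#    a  b  c
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0

# drop_first=True removes the first dummy column, which is redundant
# (it can be reconstructed from the remaining columns):
print(pd.get_dummies(s, drop_first=True))
#    b  c
# 0  0  0
# 1  1  0
# 2  0  1
# 3  0  0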

jorisvandenbossche (Contributor) commented Jun 27, 2017

(The docs should probably be further updated, as for now this just changes the example using OneHotEncoder to use CategoricalEncoder. But we might want to keep an example with OneHotEncoder as well, if there is a good one.)

amueller (Member) commented Jul 12, 2017

Btw, in our tutorial @agramfort mentioned using integer encodings for categoricals, as that works reasonably well for trees. Should we implement that? If so, in this estimator? But probably not for now.

@amueller

Mostly looks good. Needs attribute documentation and then we can talk about what a good way to expose the per-feature categories is. Maybe also get_feature_names?

(Outdated inline review comments on sklearn/feature_extraction/dict_vectorizer.py, sklearn/preprocessing/data.py and sklearn/preprocessing/tests/test_data.py.)
jnothman (Member) commented Jul 13, 2017

amueller (Member) commented Jul 21, 2017

Any news on this? ;)

jorisvandenbossche (Contributor) commented Jul 25, 2017

Back from holidays :)

using integer encodings for categorical

To be explicit, do you mean you want to convert something like ['a', 'b', 'c', 'a'] to [0, 1, 2, 0] (but 2D), instead of [[1,0,0], [0,1,0], [0,0,1], [1,0,0]]? That is what the current LabelEncoder does (but only for 1D y):

In [11]: LabelEncoder().fit_transform(np.array(['a', 'b', 'c', 'a']))
Out[11]: array([0, 1, 2, 0])

In [12]: CategoricalEncoder().fit_transform(np.array(['a', 'b', 'c', 'a']).reshape(-1, 1)).toarray()
Out[12]: 
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])

If this is what is meant, I am not sure I agree that CategoricalEncoder should behave like that. It would put two different (but admittedly related) ways of converting categorical data to numerical features into one transformer. How would we toggle between both? Using a keyword?

On the other hand, I also agree that being able to just convert your categoricals to numerical codes is useful, and it would be nice to provide a good way to do this.
Ideally, I would do this in two separate transformers, as they give different output shapes. But it would indeed get a bit messy with the already existing classes LabelEncoder and LabelBinarizer.

amueller (Member) commented Nov 20, 2017

Sorry for the delay. The question is whether we want dataframe in -> dataframe out? That might be nice, but I'd rather merge without that and possibly add it later.

amueller (Member) commented Nov 20, 2017

We're gonna do get_feature_names in a follow-up PR, right?

jorisvandenbossche (Contributor) commented Nov 20, 2017

We're gonna do get_feature_names in a follow-up PR, right?

Yes, that is on my list of follow-ups (#9151 (comment)), although there are some questions about what it should exactly do (#9151 (comment)).

The question is whether we want dataframe in -> dataframe out?

I didn't consider 'dataframe out' (that can also be a consideration, but a much bigger one, I think). Here it was more about having some special code inside the transformer to prevent check_array from converting a dataframe with different dtypes to an 'object'-dtyped array. That conversion is not really needed, as the transformer encodes the input column by column anyway, so it would be rather easy to preserve the dtypes per column. The transformed output will always have a uniform dtype anyway, so it is fine for that to be an array.
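A small sketch of the conversion being discussed (the example frame is made up; dtype=None asks check_array to keep the input dtype where possible):

import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({'city': ['London', 'Paris'], 'age': [20, 30]})

# Converting the whole mixed-dtype frame at once upcasts everything to object:
print(check_array(df, dtype=None).dtype)   # object

# Handling the frame column by column would keep each column's own dtype:
print(df['age'].values.dtype)              # int64
print(df['city'].values.dtype)             # object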

amueller (Member) commented Nov 20, 2017

@jorisvandenbossche ah, that makes more sense. I would leave it as-is. Merge?

amueller (Member) commented Nov 20, 2017

I think the proposal with something like ['0__female', '0__male', '1__0', '1__1'] is good. I would do it as in PolynomialFeatures, which uses x0, x1, etc. (maybe), with an option to pass in input feature names to transform them. That would allow preserving the semantics more easily.
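A hypothetical sketch of that naming scheme (the helper function and its signature are made up for illustration; this is not what the PR implements):

categories_ = [['female', 'male'], [0, 1]]   # per-column categories after fit

def get_feature_names(input_features=None):
    # Default to x0, x1, ... prefixes as PolynomialFeatures does; callers can
    # pass the original column names to keep the semantics.
    if input_features is None:
        input_features = ['x%d' % i for i in range(len(categories_))]
    return ['%s__%s' % (col, cat)
            for col, cats in zip(input_features, categories_)
            for cat in cats]

print(get_feature_names())
# ['x0__female', 'x0__male', 'x1__0', 'x1__1']
print(get_feature_names(['sex', 'embarked']))
# ['sex__female', 'sex__male', 'embarked__0', 'embarked__1']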

jorisvandenbossche (Contributor) commented Nov 21, 2017

Merge?

Yes, I think somebody can merge it!

jnothman (Member) commented Nov 21, 2017

Sure. Let's see what this thing does in the wild!

@jnothman jnothman merged commit a2ebb8c into scikit-learn:master Nov 21, 2017

5 of 6 checks passed:

codecov/patch: 90.09% of diff hit (target 96.19%)
ci/circleci: Your tests passed on CircleCI!
codecov/project: 96.16% (-0.04%) compared to abb43c1
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
lgtm analysis (Python): No alert changes
jnothman (Member) commented Nov 21, 2017

Congratulations! And thanks. Please feel free to make follow-up issues.

jorisvandenbossche (Contributor) commented Nov 21, 2017

And thanks a lot for all the review!

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:pr/6559 branch Nov 21, 2017

amueller (Member) commented Nov 21, 2017

My excitement about this is pretty much through the roof lol ;)

amueller (Member) commented Nov 21, 2017

I would appreciate it if you could focus on #9012; maybe we can leave the get_feature_names for someone else, it shouldn't be too tricky.

vighneshbirodkar (Contributor) commented Nov 21, 2017

Congratulations and thanks @jorisvandenbossche
It is nice to see this finally merged.

austinmw commented Jan 11, 2018

Hi, any chance you could add a drop_first parameter like in pandas.get_dummies()? It'd make it easier to put this in a pipeline without requiring something additional to drop a column.

jnothman (Member) commented Jan 11, 2018

jorisvandenbossche (Contributor) commented Jan 23, 2018

For people who are subscribed here: I opened a new issue with some questions on the API and naming: #10521
