
[MRG] Unary encoder -- continued #12893

Closed

Conversation

NicolasHug
Member

Reference Issues/PRs

Closes #8652

What does this implement/fix? Explain your changes.

This PR finishes the work in #8652 that implements a unary encoder.

Changes from #8652:

I haven't changed any of the existing tests, nor added new ones (except for inverse_transform), considering that the original PR was already accepted by @jnothman.

Any other comments?

I have added some review/comments where I need additional input from reviewers.

arjunjauhari and others added 22 commits November 28, 2017 02:27
…rameter. Also added a new test case test_unary_encoder_n_values_array
  1. Fixed test failures
  2. Updated docs
  3. UnaryEncoder: Changed handle_unknown to handle_greater

Updated docs

UnaryEncoder: Changed handle_unknown to handle_greater
transform.
Parameters checking being done in fit method. Plus, new test cases.

Updated docs

Updating implementation of UnaryEncoder

checking handle_greater parameter value in fit

New test cases
Removing fit_transform for UnaryEncoder and relying on one defined in TransformerMixin
Update handle_greater=error error message

Removing fit_transform for UnaryEncoder and relying on one defined in TransformerMixin

Update handle_greater=error error message
Updating warn mode

Making warn as default mode
Explaining handle_greater options

Adding test cases to test warn option of handle_greater parameter

Updated warning message

updated feature_matrix generation function and made test deterministic

making test cases clearer by using n_features

n_values and n_features cleanup
Unary encoding
--------------

For some ordinal features, it does not necessarily make sense to use
Member Author

I am not sure about how to motivate this encoder properly. Please feel free to suggest anything better.

Member

I don't like "if the difference between ordered categories is uneven".

How about

Ordinal variables can also be represented in unary, where one feature is created for each possible value, as in one-hot encoding, but several features are active at a time (1, 3 and 5 would be encoded as row vectors [1, 0, 0, 0, 0], [1, 1, 1, 0, 0] and [1, 1, 1, 1, 1] respectively, for instance).

For generalized linear models, a one-hot encoding requires learning a weight for each value of an ordinal variable; an ordinal encoding learns a single weight for the variable, assuming a smooth (e.g. linear) relationship between the variable and target. Unary encoding is not strictly more expressive than one-hot, but allows the model to learn the non-linear effect of increasing an ordinal variable with fewer large or non-zero parameters. For similar reasons, unary may be a worthwhile encoding for linear modelling of discretized variables.
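To make the suggested encoding concrete, here is a minimal NumPy sketch (the helper name `unary_encode` is hypothetical, not part of this PR):

```python
import numpy as np

def unary_encode(column, max_value):
    """Unary-encode a 1-D array of non-negative integers:
    value v maps to max_value features, the first v set to 1."""
    column = np.asarray(column)
    # Compare each value against the thresholds 0..max_value-1.
    return (np.arange(max_value) < column[:, None]).astype(int)

# 1, 3 and 5 encoded with 5 features each, as described above.
print(unary_encode([1, 3, 5], 5))
# [[1 0 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]]
```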

Member

Maybe an example of an ordinal variable would then be helpful


Parameters
----------
n_values : 'auto', int or array of ints, optional (default='auto')
Member Author

Should this be changed to categories to be consistent with other encoders?

Note that in the other encoders categories is a list of lists, while here it would only be a list of ints (the per-feature maximums), since the input matrix X is supposed to contain non-negative integers.

Member

I think we should switch to categories. Maybe we should support strings here.

Member Author

As far as I understand the variables are supposed to be ordinal. How would we represent ordinal features with strings?

Member

I think we should support things like

enc = preprocessing.UnaryEncoder(categories=[["a", "b", "c"]])
enc.fit_transform([["a"], ["b"], ["c"]])

Although users can combine UnaryEncoder with OrdinalEncoder.

Member Author

Yes I think OrdinalEncoder should be used for this since its output is precisely the input that UnaryEncoder requires. I'll add an example in the user guide.
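For illustration, OrdinalEncoder already produces exactly the kind of integer input described here (the category names below are made up):

```python
from sklearn.preprocessing import OrdinalEncoder

# Map ordered string categories to the non-negative integers
# that a UnaryEncoder would take as input.
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
out = enc.fit_transform([["low"], ["high"], ["medium"]])
print(out)
# [[0.]
#  [2.]
#  [1.]]
```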

Member

So you force users to use OrdinalEncoder first? I'd argue that it's not user-friendly :)
I think OrdinalEncoder should be similar to OneHotEncoder, but the categories parameter here is actually similar to the n_values parameter we previously had in OneHotEncoder.

Member Author

How about we make categories being exactly the same as in OrdinalEncoder, and rework the implementation of UnaryEncoder to first use a OrdinalEncoder under the hood, in order to have integer input (as currently required)?

Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder())

Member

Basically: make UnaryEncoder work exactly like make_pipeline(OrdinalEncoder(categories), CurrentUnaryEncoder())

This is exactly what I thought, I'll vote +1.

Member Author

Ok I think this is a good idea too, just want input from @jnothman before reworking this

Member

Yes, I'm happy with making this have input the same as OrdinalEncoder. But I also don't require it for this to be merged.

sparse : boolean, optional (default=False)
Will return a sparse matrix if set to True, else an array.

handle_greater : str, 'warn', 'error' or 'clip', optional (default='warn')
Member Author

The default here is warn while for other encoders it's error. See https://github.com/scikit-learn/scikit-learn/pull/8652/files#r161743736.

@NicolasHug NicolasHug changed the title [WIP] Unary encoder -- continued [MRG] Unary encoder -- continued Dec 30, 2018
@NicolasHug
Member Author

So you want to determine the category values with np.unique instead of max?

@qinhanmin2014
Member

I mean that any value unseen in training between 1 and max will have a feature generated for it, but it will be constant in training hence obsolescent.

This seems unavoidable if we keep current API design (we only have parameter max_value and attribute max_value_)?

@jnothman
Member

jnothman commented Jan 28, 2019 via email

@qinhanmin2014
Member

It's a bit of a tricky decision, huh? Using unique values would certainly make it a more well-behaved transformer in a pipeline context.

I think the problem is how to design the API; what's the solution in your mind? @jnothman
I can't come up with an elegant solution (i.e., using np.unique when we only have the parameter max_value and attribute max_value_), unless we change the current API (e.g., the solution I proposed previously in #12893 (comment)).
Regarding redundant features, we also have a similar problem when we combine OneHotEncoder with KBinsDiscretizer, so maybe we need a separate issue to discuss it.

@jnothman
Member

I think the constant features are more likely when dealing with count-valued data than categorical. Sorry I don't have a clear design in mind yet and in general am not being very constructive here and have pushed the conversation around in circles.

We are interested in the following use cases:

  1. Count-valued inputs
  2. Ordered categoricals
  3. Discretised numerics
  4. ?performing arbitrary discretisation directly with specified bin edges.

I'm guessing that 3 will be the most common use case with 1 and 2 following.

In all cases, it's probably best to only output features for values either seen in the training or explicitly requested by the user. This adds some code complexity but at least avoids generating useless features.

For 1 and 4 we need to consider the input as numeric and output the value corresponding to the nearest bin edge below the input.

3 can be achieved directly by supporting 1.

So maybe what we really want is to support arbitrary numeric input with bin_edges, and require pipelining for case 2. Current behaviour could be more or less accomplished with bin_edges=[arange(max_value)]*n_features.
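A rough sketch of that proposal (the helper name and semantics are speculative, not an agreed API): each numeric value is mapped to the nearest bin edge below it, and the resulting bin index is unary-encoded:

```python
import numpy as np

def unary_encode_binned(column, bin_edges):
    """Hypothetical: bin numeric values by the given edges, then
    unary-encode the bin index (len(bin_edges) - 1 output features)."""
    column = np.asarray(column, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    # Index of the nearest edge at or below each value.
    idx = np.searchsorted(edges, column, side="right") - 1
    idx = np.clip(idx, 0, len(edges) - 1)
    return (np.arange(len(edges) - 1) < idx[:, None]).astype(int)

# With bin_edges = arange(max_value + 1) this reduces to the
# current integer behaviour.
print(unary_encode_binned([0.2, 1.5, 3.7], np.arange(5)))
# [[0 0 0 0]
#  [1 0 0 0]
#  [1 1 1 0]]
```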

@qinhanmin2014
Member

So maybe what we really want is to support arbitrary numeric input with bin_edges, and require pipelining for case 2. Current behaviour could be more or less accomplished with bin_edges=[arange(max_value)]*n_features.

I think the encoder should focus on encoding data. It's not the duty of an encoder to preprocess data. We can have a discretizer to bin continuous data according to bin_edges specified by users (or somehow add this feature to KBinsDiscretizer), but I guess we don't need to consider it here.

I think the main issue here is how to determine the categories.
(1) If we determine the categories according to max values, our implementation will be similar to the current PR (parameter max_value + attribute max_value_, unable to support strings directly).
(2) If we determine the categories according to unique values, our implementation will be similar to the current OneHotEncoder (parameter categories + attribute categories_, easy to support strings directly).

@NicolasHug
Member Author

NicolasHug commented Jan 29, 2019

Agreed with @qinhanmin2014 that we should restrict the scope of this encoder. Anything bin related should be KBinsDiscretizer's job to me. And now that we need to support so many different scenarios, I would also add that anything string related should be handled by OrdinalEncoder.

Supporting so many different scenarios means that the encoder must accept a restricted type of data and behave in a simple, predictable way. What happens before and after should be delegated to other tools (again, like in Unix). We can use pipelines for that; it's literally what they're designed for.


It seems that so far every single option we have proposed has been dismissed. I guess that means it is impossible to design a perfect UnaryEncoder that satisfies all the constraints. I think it's time, though, that we settle on something. It's better to have something usable than to have nothing.

My proposition is to keep it exactly as it is:

  • no string support
  • only integer-valued features from 0 to max
  • at predict a feature > max will be clipped
  • anything < 0 raises an error
  • users can use pipeline
  • drawback: this creates constant features if a value is not observed during fit. I personally don't see this being such a big problem

EDIT: again, I'm not familiar with the encoder and less acquainted with the whole sklearn tooling, so I'll yield to anything you'd deem more relevant.
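The proposed transform-time behaviour (clip above max, error below 0) can be sketched as follows; the helper name is hypothetical:

```python
import numpy as np

def transform_with_clip(column, max_value):
    """Sketch of the proposed behaviour: values above max_value
    are clipped, negative values raise an error."""
    column = np.asarray(column)
    if (column < 0).any():
        raise ValueError("input contains negative values")
    clipped = np.minimum(column, max_value)
    return (np.arange(max_value) < clipped[:, None]).astype(int)

# A value of 7 seen at transform time is clipped to max_value=5.
print(transform_with_clip([2, 7], 5))
# [[1 1 0 0 0]
#  [1 1 1 1 1]]
```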

@jnothman
Member

jnothman commented Jan 29, 2019 via email

@qinhanmin2014
Member

Well a lot of estimators (incl things like feature_importances_ and L2 regularisation) are designed in a way that assumes or prefers no perfectly collinear or redundant features. So when features are counts it's not great to encode them like that.

For redundant features, at least users can get rid of them with sklearn.feature_selection.VarianceThreshold
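For instance, zero-variance columns produced by unseen values can be dropped like this (the toy matrix below is illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# A unary-encoded matrix: column 0 is always 1 and column 2 is
# always 0, e.g. because some values never occurred during fit.
X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 1]])
# The default threshold removes features that are constant in X.
X_reduced = VarianceThreshold().fit_transform(X)
print(X_reduced)
# [[0 0]
#  [1 0]
#  [1 1]]
```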

I'm OK with the API design of the current PR, though it still feels a little strange to me to clip feature > max_value and reject feature < 0.

@jorisvandenbossche any opinion about current API design? Personally I'd like to see more consensus before reviewing it.

@jnothman
Member

@jorisvandenbossche, @NicolasHug: shall we set a meeting time to find consensus on this?

@qinhanmin2014
Member

The main issues here IMO are:
(1) Whether we should support strings directly or rely on an OrdinalEncoder + UnaryEncoder pipeline. If we support strings, it seems that we need to reject unknown features.
(2) If we do not support strings, whether it's reasonable to reject feature < 0 and clip feature > max.
Other issues:
(1) How to treat redundant features? (Maybe we can simply tell users UnaryEncoder might produce constant features and that these can be removed with feature selection algorithms. We did so in KBinsDiscretizer recently, see #13165.)
(2) Do we have a better name? Are there any references/similar implementations? Can someone familiar with this encoder share an application scenario?
(3) How to construct a good example (e.g., which dataset to use)?

After the long discussion here, I still prefer to support strings and always reject unknown categories. This will introduce an encoder which is similar to OneHotEncoder (the new version).

@jnothman
Member

@jorisvandenbossche, we could call it Base1Encoder to distinguish from other meanings of "unary"??

@jnothman
Member

jnothman commented May 1, 2019

What's new will need moving to 0.22... if we're sure we'll ever resolve the disagreements / find adequate need for this feature.

@qinhanmin2014
Member

I added it to the meeting note. If we have time, maybe we can briefly discuss whether we want this feature.

@qinhanmin2014
Member

I think this is not useful for tree-based models, right? and is it useful for other models?

@jnothman
Member

jnothman commented Aug 5, 2019 via email

@qinhanmin2014
Member

Encoders are not really useful for tree based models.

We still need OneHotEncoder to handle strings? By saying not useful, I mean when training tree-based models, we can use OrdinalEncoder instead of UnaryEncoder.

@jnothman
Member

jnothman commented Aug 5, 2019 via email

@jnothman
Member

It might be a good idea to make this encoder available in the discretizer, but not available as a separate public API until we're happy with its use-cases and API.

@qinhanmin2014
Member

It might be a good idea to make this encoder available in the discretizer, but not available as a separate public API until we're happy with its use-cases and API.

this seems acceptable, but I don't think it's a good solution (so +0 from my side)
If we do so, we actually provide users with a way to encode continuous features, but do not provide users with a way to encode categorical features.

@jnothman
Member

jnothman commented Aug 18, 2019 via email

@qinhanmin2014
Member

I don't see how this encoding is appropriate for unordered categorical
features.

I mean, if we think this encoder is reasonable, maybe we should support both (ordered) categorical features and continuous features.

@NicolasHug
Member Author

I'm not a huge fan of the discretizer also doing encoding. It's convenient until the point where you want a more flexible encoding, e.g. drop a column in the OneHotEncoder.

Why don't we encourage users to use pipelines more?

@qinhanmin2014
Member

I'm not a huge fan of the discretizer also doing encoding. It's convenient until the point where you want a more flexible encoding, e.g. drop a column in the OneHotEncoder.

I'm also +1 on relying more on pipelines (e.g., do not let estimators do encoding, do not let encoders do feature selection), especially when we can improve the performance of our implementation by making some assumptions.

@NicolasHug
Member Author

NicolasHug commented Feb 23, 2020

I don't (fore)see this moving forward, so closing.

@NicolasHug NicolasHug closed this Feb 23, 2020
@cmarmo cmarmo mentioned this pull request Dec 16, 2021