MRG added ``classes`` parameter to LabelBinarizer #1643

Closed. Wants to merge 17 commits.

Conversation

6 participants
Owner

amueller commented Jan 31, 2013

Adds a classes parameter to LabelBinarizer. I wanted this in order to add partial_fit to the naive Bayes models.
If classes is specified, there is no need to call fit.
This PR still needs tests.
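For context, a minimal NumPy sketch of what binarizing against prespecified classes looks like (a hypothetical helper for illustration, not the PR's actual code; it assumes every label in y appears in classes):

```python
import numpy as np

def binarize_with_classes(y, classes, neg_label=0, pos_label=1):
    # Hypothetical helper: one-hot encode y against a prespecified class
    # list, so no prior pass over the data (no fit) is needed.
    classes = np.unique(classes)  # sorts and de-duplicates
    Y = np.full((len(y), len(classes)), neg_label)
    for i, label in enumerate(y):
        Y[i, np.searchsorted(classes, label)] = pos_label
    return Y

print(binarize_with_classes([1, 3, 2], classes=[1, 2, 3]))
```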

Owner

ogrisel commented Jan 31, 2013

I think you need a couple of tests :) Please don't forget to test the exceptions, including the actual message value for the one that displays the class difference.

Owner

amueller commented Jan 31, 2013

Definitely will do ;)

Owner

amueller commented Feb 2, 2013

Done :)

Owner

amueller commented Feb 2, 2013

Renamed to MRG

@ogrisel ogrisel commented on the diff Feb 3, 2013

sklearn/tests/test_preprocessing.py
+ y = [[1, 2], [1], []]
+ Y = np.array([[1, 1],
+               [1, 0],
+               [0, 0]])
+ assert_array_equal(lb.transform(y), Y)
+ assert_array_equal(lb.fit_transform(y), Y)
+
+ lb = LabelBinarizer(classes=np.arange(1, 3))
+ assert_raise_message(ValueError, "not fitted with multilabel",
+                      lb.transform, y)
+
+ # check that labels present at fit time that are not in 'classes'
+ # will be ignored but a warning will be shown
+ lb = LabelBinarizer(classes=[1, 2])
+ with warnings.catch_warnings(record=True) as w:
+     transformed = lb.fit_transform([0, 1, 2])
@ogrisel

ogrisel Feb 3, 2013

Owner

Thanks for testing this :)

Owner

ogrisel commented Feb 3, 2013

Looks good to me. Could you please rebase on top of master to fix Travis? +1 for merging when Travis is green.

Owner

amueller commented Feb 3, 2013

Seems like I broke the multilabel stuff... no idea how. Will fix now.

Owner

amueller commented Feb 3, 2013

Ok, fixed. cc @mblondel ;)

Owner

amueller commented Feb 3, 2013

This is not fully backward compatible now :-/ There was an attribute multilabel that should have been called multilabel_, so the semantics of LabelBinarizer.multilabel have changed :-(

Owner

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

What is the use case of the multilabel argument?

Owner

arjoly commented Feb 4, 2013

You should specify what type of representation it will lead to if you specify the classes argument.

Misunderstood the doc. Don't take this comment into account.

Owner

amueller commented Feb 4, 2013

I'm not sure I understand the multilabel question. It is for multi-label input with prespecified classes.

Owner

arjoly commented Feb 5, 2013

I think that you need to add a constructor argument for the self.indicator_matrix_ attribute.

Owner

amueller commented Feb 5, 2013

@arjoly there is no such attribute in master, right? That is only in your branch afaik.
Never mind. I'll add it and also a test.

Owner

amueller commented Feb 8, 2013

Should be good now :)

Owner

amueller commented Feb 8, 2013

Ok, travis is happy now....

Owner

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Instead of booleans, it may be better to use strings? For instance: "multiclass", "indicator_matrix" and "list_tuple_labels"?

Otherwise, it looks good !

Owner

amueller commented Feb 9, 2013

There is a test here. Do you think it is sufficient?
We could do a string argument. Would you then also make the fitted attributes a string?

Owner

arjoly commented Feb 9, 2013

I don't think you treat the case multiclass=False and indicator_matrix=True.

Sorry, I messed up when writing the preceding sentence. I meant multilabel=False and indicator_matrix=True. (You say it is not a multilabel format, but impose a multilabel format.)

Would you then also make the fitted attributes a string?

From my point of view, the string format in the attribute should reduce the number of nested ifs. However, with 3 supported formats, this might not be necessary.

Note that it would be cool to support sparse binary indicator matrix which would bring many more sparse format (csc, csr, ...).

Honestly, do as you think it's the best.

@mblondel mblondel and 2 others commented on an outdated diff Feb 9, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray if int or None (default)
+ Array of possible classes.
+
+ multilabel : bool or None (default)
+ Whether or not data will be multilabel.
+ If None, it will be inferred during fitting.
@mblondel

mblondel Feb 9, 2013

Owner

Why do you need this?

@amueller

amueller Feb 9, 2013

Owner

Would you just infer it on transform? I wanted to match the current behavior when fit is called. Currently it is enforced that the data in transform is of the same kind (multilabel or not) as during fit.

@mblondel

mblondel Feb 10, 2013

Owner

Yes, I think inferring it in transform would be enough. But I don't really see why users would want to fit multiclass data and transform multilabel data.

@amueller

amueller Feb 10, 2013

Owner

Ok, then I would also change the behavior when using fit, to have it unified. I'm not sure who wrote this code in the first place; I guess either you or @larsmans? I also don't see much of a reason for the current behavior, actually...

@amueller

amueller Feb 10, 2013

Owner

Hum... so what do I do if someone calls inverse_transform before calling either fit or transform?

@mblondel

mblondel Feb 11, 2013

Owner

I think the output_type option of inverse_transform could take on the following values:

  • auto: use the format inferred from fit, or multiclass if unknown
  • multiclass: vectors of ints (argmax of the decision_function)
  • multilabel-list: list of lists
  • multilabel-indicator: indicator matrix

(threshold really makes more sense as a parameter to inverse_transform, since it can be used to compute a precision-recall curve.)
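A toy illustration of that point: sweeping a threshold over fixed decision values yields a precision-recall curve without refitting anything (the numbers below are made up, just to show the mechanics):

```python
import numpy as np

dec = np.array([[0.9, 0.4],    # made-up decision_function output
                [0.6, 0.7],
                [0.1, 0.8]])
truth = np.array([[1, 0], [1, 1], [0, 1]])  # true indicator matrix

results = []
for threshold in (0.3, 0.65):
    pred = (dec >= threshold).astype(int)   # the "inverse transform" step
    tp = np.sum((pred == 1) & (truth == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / truth.sum()
    results.append((threshold, float(precision), float(recall)))
    print(threshold, precision, recall)
```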

@mblondel

mblondel Feb 11, 2013

Owner

Hum... so what do I do if someone calls inverse_transform before calling either fit or transform?

Thinking about it, this problems occurs irrespective of whether the parameter is a constructor one or a method one...

@amueller

amueller Feb 11, 2013

Owner

Sorry, I was a bit daft. I thought the multi_label attribute was needed somewhere else. Ok, I got it now.

@amueller

amueller Feb 11, 2013

Owner

Ok, I think I'm giving up.

Currently the behavior is:

  • The user needs to call fit before using the estimator.
  • If the user passes a y to transform that doesn't have the same value of is_multilabel as during fit, an error is raised.
  • The result type of inverse_transform is taken from y during fit.

I wanted to add the possibility to leave out the call to fit, or rather to call transform without having all the data first. For that I added a classes parameter. To cope with the other points above, there are the following possibilities as far as I can see:

  • Pass the multi-label format to inverse_transform and either
    • use the current boolean flags to do weak input validation (is_multilabel has the same value), but only if fit was called.
    • get rid of all input validation in transform.
    • change the boolean attributes to a string with the same possible values as the parameters to inverse_transform and
      • do weak input validation on them in transform if fit was called
      • validate that the format in transform is exactly the same (which changes behavior) in case fit was called
  • Pass the boolean attributes that are inferred in fit optionally as __init__ parameters.

I did the second because I thought it was the closest to the current behavior of the code and I just wanted to do a small addition to the functionality.
I am not sure which of the other possibilities above you are in favor of. It seems you didn't want input validation when fit is not called. I'm not sure why you don't want that. I'm not sure what input validation you want if fit was called and which attributes you would store during fit?

I feel it is weird if calling fit is a no-op apart from the fact that it enforces input validation. That is really weird behavior imho.

@arjoly

arjoly Feb 11, 2013

Owner

Ok, I think I'm giving up.

Don't give up !! this pr is very useful.

@mblondel

mblondel Feb 12, 2013

Owner

Ok, I think I'm giving up.

I gave my opinion. Now, as always in scikit-learn, the majority votes. If more people agree with you, I will respect the majority's opinion.

As far as I see it, there are 2 different issues: whether to use a constructor or method parameter, and whether to use one string parameter or two boolean parameters. I think at least we agree on the latter.

@amueller

amueller Feb 12, 2013

Owner

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

@mblondel

mblondel Feb 12, 2013

Owner

My issue is more that I don't know what to do with the input validation in transform if I do a method parameter.

If the string takes the "auto" value by default, doesn't the same problem exist for the constructor parameter too?

@amueller

amueller Feb 12, 2013

Owner

The way it is implemented currently is that if fit is not called, there is a fall-back to multiclass, i.e. if you don't specify multilabel but pass a multilabel y to transform, an error is raised.

@mblondel mblondel and 2 others commented on an outdated diff Feb 9, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray if int or None (default)
+ Array of possible classes.
+
+ multilabel : bool or None (default)
+ Whether or not data will be multilabel.
+ If None, it will be inferred during fitting.
+
+ indicator_matrix : bool or None (default)
+ Whether ``inverse_transform`` will produce an indicator
+ matrix encoding (if False it will return list of lists).
+ If None, it will be inferred during fitting.
+
@mblondel

mblondel Feb 9, 2013

Owner

I'd rather make it an argument of inverse_transform.

@mblondel

mblondel Feb 10, 2013

Owner

Because it's an option that concerns inverse_transform only. Having it a constructor option implies that you need to recreate a LabelBinarizer object to change the behavior of inverse_transform.

@amueller

amueller Feb 10, 2013

Owner

I see your point, but I feel that is a bit unlike the normal API.
By 'implies' do you mean 'looks like that to the user'? You could just set indicator_matrix to a different value before calling inverse_transform if you wanted.

The reason why I think it doesn't fit well in the API is that we usually have all parameters that don't depend on the data as __init__ parameters. There is a good reason for that: having parameters to the functions breaks grid-searching and pipelining.

So if someone wants to use Pipeline with inverse_transform you get into trouble.
For all other transform based methods, the parameters are always specified for __init__ so as a user I wouldn't expect a parameter in inverse_transform now.

@mblondel

mblondel Feb 10, 2013

Owner

There'd be no point grid searching on the inverse_transform option as it should not affect accuracy. As for pipeline, LabelBinarizer works on y, not X, so I don't think it makes sense either.

This inverse_transform option is only useful if you want to override the behavior inferred in fit (use indicator matrix or list of lists).

@amueller

amueller Feb 10, 2013

Owner

Ok, you are right, pipelining doesn't really make sense.
I still feel that we should keep the "everything in __init__" pattern, as it is easy to remember and options are easy to discover.
If we make it a function argument here, users will have to look it up in the docs every time they want to use it.

The option I introduced was not to overwrite what was done in fit but for the case that fit is not called btw.

@mblondel

mblondel Feb 10, 2013

Owner

The option I introduced was not to overwrite what was done in fit but for the case that fit is not called btw.

I see. In that case, there are three cases: multiclass, list of lists and indicator matrix. So @arjoly's earlier suggestion to use a string makes sense to me (and will avoid mutually inconsistent combination).

I personally still prefer to have it as parameter of inverse_transform if there is no major benefit in having it a constructor parameter. Also note that inverse_transform already has a threshold option.

@amueller

amueller Feb 10, 2013

Owner

hm, ok, in this case... I'll do it ;)
The string option means a lot of deprecations :-/ I'll do it later.

@amueller

amueller Feb 10, 2013

Owner

Ok, I don't think I understand the solution you are proposing. If I make indicator_matrix an argument of inverse_transform, what are the possible string values? transform does not differentiate between the two multi-label formats.

@mblondel

mblondel Feb 11, 2013

Owner

The input to inverse_transform is an indicator matrix (possibly containing real values, since it can accept what decision_function returns). If I'm not mistaken, no matter what data type fit was called with, you can inverse-transform this indicator matrix to any of the 3 possible formats: multiclass, multilabel-list, multilabel-indicator.
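A sketch of that conversion (a hypothetical helper, not the PR's code): given decision values and the class list, each of the three output formats is a cheap post-processing step.

```python
import numpy as np

def inverse_transform_sketch(D, classes, output_type="multiclass", threshold=0.0):
    # Hypothetical helper illustrating the point above: the same real-valued
    # indicator matrix D can be inverse-transformed into any target format.
    D = np.asarray(D)
    classes = np.asarray(classes)
    if output_type == "multiclass":
        return classes[np.argmax(D, axis=1)]    # one label per row
    if output_type == "multilabel-indicator":
        return (D > threshold).astype(int)      # 0/1 matrix
    if output_type == "multilabel-list":
        return [classes[row > threshold].tolist() for row in D]
    raise ValueError("unknown output_type: %r" % output_type)

D = [[1.2, -0.3, 0.4], [-0.5, 0.8, 0.9]]
print(inverse_transform_sketch(D, [0, 1, 2], "multiclass"))
print(inverse_transform_sketch(D, [0, 1, 2], "multilabel-list"))
```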

@GaelVaroquaux

GaelVaroquaux Feb 11, 2013

Owner

Having it a constructor option implies that you need to recreate a LabelBinarizer object to change the behavior of inverse_transform.

No it does not. It just means that it is a parameter that can be set and controlled in the parameter-selection framework. Nothing specifies whether these parameters can be changed or not between the fit and the transform.

@GaelVaroquaux

GaelVaroquaux Feb 11, 2013

Owner

I personally still prefer to have it as parameter of inverse_transform if there is no major benefit in having it a constructor parameter.

I feel like @amueller. It's a question of consistency of the API.

Also note that inverse_transform already has a threshold option.

Arh! Crap.

@mblondel

mblondel Feb 11, 2013

Owner

Arh! Crap.

transform and inverse_transform are not supposed to change the state of the object so I feel that it's really ugly to change the state of the object just to call these methods:

for threshold in thresholds:
    label_binarizer.threshold = threshold
    predictions = label_binarizer.inverse_transform(dec_func)
    # compute precision-recall

For the option specifying the output format of inverse_transform, the user is less likely to want to call it several times with different values, so I feel it is more acceptable to have the option as a constructor parameter, though I don't really see the benefit.

Nothing specifies whether these parameters can be changed or not between the fit and the transform.

From a software engineering point of view, directly changing the attribute of an object requires knowledge of how the object works internally. For fit, it's less of a problem because if you change attributes and call fit again, it's like you reinitialized the object. For transform or inverse_transform, how do you know that the attribute you just changed doesn't also affect the behavior of fit, unless you know the code? Code like this is also more likely to break in case of API changes.

I do agree that constructor parameters that concern transform can be useful for grid search or pipelines, though.

@amueller

amueller Feb 11, 2013

Owner

transform and inverse_transform are not supposed to change the state of the object so I feel that it's really ugly to change the state of the object just to call these methods:

I do agree that constructor parameters that concern transform can be useful for grid search or pipelines, though.

I agree with both of these points, I just don't see what conclusion you want to draw. The pipelining argument is equally valid for inverse_transform, isn't it?
I think pipelining is an integral part of sklearn, so all estimators should work with pipelining. Ok, from a general point of view it is ugly to do that by passing everything as constructor arguments, but that is the way we decided to go for scikit-learn. If you have a better idea, we can discuss changing this approach.
As this is the way we need to do things (maybe not here, because we are working on y, not X), I think we should stick to it. Everything else is just confusing.

@amueller

amueller Feb 11, 2013

Owner

I agree that you need knowledge of the code for doing this, but if you have that, you can write

for threshold in thresholds:
    predictions = label_binarizer.set_params(threshold=threshold).inverse_transform(dec_func)
@GaelVaroquaux

GaelVaroquaux Feb 12, 2013

Owner

Ok, from a general point of view it is ugly to do that by passing everything as constructor arguments.

The way I like to think about it is that the constructor arguments are everything that I need to know about an estimator and define it completely. If we stick to this principle, we have a uniform feeling in scikit-learn and minimize surprises.

Owner

amueller commented Feb 16, 2013

I replaced the two boolean arguments with a single string argument. I didn't adjust the tests yet. I did not replace the boolean attributes. Do you think I should do that, too?

Replacing the __init__ parameters did not result in getting rid of any nested ifs, and I don't know why it should.
The code definitely got more convoluted now :-/

I just also tried replacing the properties with strings, and that led to one more level of ifs and a lot of label_type in ["..", ".."].

@arjoly @mblondel which part did you expect to get simpler with string arguments?

Owner

amueller commented Feb 16, 2013

hm maybe I can fix it somehow...

Owner

amueller commented Feb 16, 2013

Ok, I think it is actually nicer now with strings :)

Owner

amueller commented Feb 16, 2013

@arjoly I introduced a new function _get_label_type that should probably also move to the new utils module. I guess it depends on which PR gets merged first.
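For readers following along, a hypothetical reconstruction of what such a _get_label_type helper might check (the PR's actual implementation may differ):

```python
import numpy as np

def get_label_type(y):
    # Hypothetical reconstruction, not the PR's code: classify y as one of
    # the three supported target formats.
    if isinstance(y, (list, tuple)) and all(
            isinstance(row, (list, tuple)) for row in y):
        return "multilabel-list"
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] > 1:
        return "multilabel-indicator"
    return "multiclass"

print(get_label_type([1, 2, 3]))            # multiclass
print(get_label_type([[1, 2], [1], []]))    # multilabel-list
```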

Owner

amueller commented Feb 16, 2013

Ok so I didn't deprecate the property multilabel_ as it is used in the multiclass module.

Owner

amueller commented Feb 16, 2013

This adds another property to the interface...

Owner

amueller commented Feb 20, 2013

Travis should be happy now. Are the other devs, too? ping @arjoly @ogrisel (@mblondel said he is busy).

@arjoly arjoly commented on an outdated diff Feb 21, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray if int or None (default)
@arjoly

arjoly Feb 21, 2013

Owner

I don't understand this line.

@arjoly arjoly commented on an outdated diff Feb 21, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray if int or None (default)
+ Array of possible classes.
+
+ label_type : string, default="auto"
+ Expected type of y.
+ Possible values are:
+ - "multiclass", y is array of ints
+ - "multilabel-indicator", y is indicator matrix of classes
+ - "multilabel-list", y is list of lists of labels
+ - "auto", form of y is determined during 'fit'. If 'fit' is not
+ called, multiclass is assumed.
@arjoly

arjoly Feb 21, 2013

Owner

When you wrote y is ..., I think that an article is missing.
It may be better to say something like y is in the indicator matrix of classes format, y is in the list of lists of labels format.

@arjoly arjoly commented on an outdated diff Feb 21, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray if int or None (default)
+ Array of possible classes.
+
+ label_type : string, default="auto"
+ Expected type of y.
+ Possible values are:
+ - "multiclass", y is array of ints
@arjoly

arjoly Feb 21, 2013

Owner

I think that we accept any one-dimensional array-like y, e.g. np.array([1, 2, 3]), [1, 2, 3], (1, 2, 3), np.reshape(np.arange(3), (-1, 1)), np.reshape(np.arange(3), (1, -1)).

@arjoly arjoly commented on an outdated diff Feb 21, 2013

sklearn/preprocessing.py
Holds the label for each class.
+ `label_type_` : bool
@arjoly

arjoly Feb 21, 2013

Owner

you mean str?

@arjoly arjoly and 1 other commented on an outdated diff Feb 21, 2013

sklearn/preprocessing.py
@@ -950,19 +979,31 @@ class LabelBinarizer(BaseEstimator, TransformerMixin):
array([1, 2, 3])
"""
- def __init__(self, neg_label=0, pos_label=1):
+ def __init__(self, neg_label=0, pos_label=1, classes=None,
+ label_type='auto'):
if neg_label >= pos_label:
raise ValueError("neg_label must be strictly less than pos_label.")
@arjoly

arjoly Feb 21, 2013

Owner

This shouldn't be in the __init__.

@arjoly

arjoly Feb 21, 2013

Owner

By the way, why do we enforce this constraint?

@amueller

amueller Mar 2, 2013

Owner

Because GridSearchCV uses set_params not __init__ to set the parameters.

Owner

arjoly commented Feb 21, 2013

Can you add a test to check that:

  • the order of labels in the classes argument is properly taken into account (LabelBinarizer(classes=[1, 2, 3]) is not equal to LabelBinarizer(classes=[3, 1, 2]))?
  • redundant labels in the classes argument are treated in some way (LabelBinarizer(classes=[1, 2, 3, 3]))?
  • a call to the transform method with redundant labels and labels in shuffled order works properly for the list of lists of labels format?
  • _get_label_type works? Can you also add some documentation? What should happen when y is not properly formatted?

Can you also

  • update the narrative doc and add one or more examples to illustrate the new features?
  • update whats_new.rst?
Owner

amueller commented Mar 2, 2013

Is there any other class except SGDClassifier and related that has a classes parameter somewhere?
I just checked: SGDClassifier's partial_fit will break if the entries in classes are not sorted or not unique, if I understand the code correctly.

It is probably best if we do respect the ordering of the labels, as long as that is not incompatible with code in other places.

Owner

amueller commented Mar 2, 2013

The current code applies np.unique to the labels, so redundant and shuffled classes will all lead to the same result.
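Concretely, np.unique both sorts and de-duplicates, so these classes arguments all collapse to the same thing:

```python
import numpy as np

# np.unique sorts and de-duplicates, so shuffled or redundant `classes`
# arguments yield the same fitted class list:
print(np.unique([3, 1, 2]).tolist())     # [1, 2, 3]
print(np.unique([1, 2, 3, 3]).tolist())  # [1, 2, 3]
```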

@arjoly arjoly commented on the diff Mar 7, 2013

sklearn/preprocessing.py
"""
self._check_fitted()
-
- if self.multilabel or len(self.classes_) > 2:
- if _is_label_indicator_matrix(y):
- # nothing to do as y is already a label indicator matrix
- return y
-
+ label_type = _get_label_type(y)
+ if label_type != self.label_type_:
+ raise ValueError("label_type was set to %s, but got y of type %s."
+ % (self.label_type_, label_type))
+ if label_type == "multilabel-indicator":
+ # nothing to do as y is already a label indicator matrix
+ return y
+ elif label_type == "multilabel-list" or len(self.classes_) > 2:
@arjoly

arjoly Mar 7, 2013

Owner

Why is the case len(self.classes_) > 2 used here?

@amueller

amueller Mar 7, 2013

Owner

Because for two classes, the output is 1d.

@arjoly

arjoly Mar 7, 2013

Owner

Thanks !

@arjoly arjoly commented on the diff Mar 7, 2013

sklearn/preprocessing.py
@@ -622,6 +623,18 @@ def _is_multilabel(y):
_is_label_indicator_matrix(y))
+def _get_label_type(y):
@arjoly

arjoly Mar 7, 2013

Owner

Now, you can put this in utils.multiclass. :-)

@amueller

amueller Mar 7, 2013

Owner

will do :)

@jnothman

jnothman May 22, 2013

Owner

I propose we replace is_multilabel with something like this, that moreover may handle multiple ys: #1985

@arjoly arjoly commented on the diff Mar 7, 2013

sklearn/preprocessing.py
def _check_fitted(self):
if not hasattr(self, "classes_"):
- raise ValueError("LabelBinarizer was not fitted yet.")
+ if self.classes is not None:
+ self.classes_ = np.unique(self.classes)
@arjoly

arjoly Mar 7, 2013

Owner

The order of the classes won't be preserved.
I don't know if it matters.

@amueller

amueller Mar 7, 2013

Owner

I know. I was wondering about that. I think I'd like to keep it like this for now.

@arjoly

arjoly Mar 7, 2013

Owner

The only place, where I have seen such things is in the test_precision_recall_f1_score_multiclass in test_metrics.py.

@amueller

amueller Mar 7, 2013

Owner

"such things" meaning preserving the class order?

@arjoly

arjoly Mar 7, 2013

Owner

Yes, the class order must be known to output the precision, recall and fscore in the right order (see when the averaging type is None).

Owner

arjoly commented May 2, 2013

Any news on this pr?

Owner

amueller commented May 6, 2013

Sorry, I'm swamped. I actually forgot the state of this. I'm trying to catch up currently.

Owner

jnothman commented May 22, 2013

If I understand correctly, out-of-classes labels will be ignored with a warning at fit time, but will throw an error at transform time. Does this make sense? (Why?) If so, should this be documented?

Owner

jnothman commented May 23, 2013

Also, I think parallel functionality should (as a rule) be available in LabelEncoder and LabelBinarizer. Indeed, LabelBinarizer should use a LabelEncoder, after which its job is simple.
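A minimal sketch of that composition (hypothetical, assuming sorted classes; not scikit-learn's actual implementation): encode labels to integer indices first, then one-hot with an identity matrix.

```python
import numpy as np

def encode(y, classes):
    # LabelEncoder-style step: map each label to its index in `classes`
    # (assumes `classes` is sorted and contains every label in y).
    return np.searchsorted(np.asarray(classes), y)

def binarize(y, classes):
    # LabelBinarizer-style step: one-hot the encoded indices.
    return np.eye(len(classes), dtype=int)[encode(y, classes)]

print(binarize(["b", "a", "c"], classes=["a", "b", "c"]))
```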

@jnothman jnothman commented on the diff May 28, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray of int or None (default)
+ Array of possible classes.
+
+ label_type : string, default="auto"
@jnothman

jnothman May 28, 2013

Owner

can we call this target_type?

@jnothman jnothman commented on the diff May 28, 2013

sklearn/preprocessing.py
Value with which positive labels must be encoded.
+ classes : ndarray of int or None (default)
+ Array of possible classes.
@jnothman

jnothman May 28, 2013

Owner

The purpose of the classes parameter is not clear from the comment. Is it merely so that you needn't fit if you already know the class labels? Or is it to define an ordering of the labels? What happens if an excess label is given? What happens if a label is absent but found in the y passed to fit, or in transform?

Are there cases where finding an absent class means we should raise an error, but others where it makes sense to remove the entry, or substitute a different class? Should we have a handle_unknown parameter selecting among these, or is that excessively flexible? Certainly, the behaviour is implicit at the moment, and that's a bad thing.

There is really no reason the same functionality shouldn't apply to LabelEncoder, and I think they should inherit from a common class (as in jnothman/scikit-learn@8669605).

amueller closed this Apr 28, 2014
