[WIP] Make LabelEncoder more friendly to new labels #3243

Closed
wants to merge 59 commits
@mjbommar

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

  • "raise": current behavior, i.e., raise ValueError; to remain default behavior
  • "nan": return np.nan for unseen/new labels
  • "update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign
  • "label": set newly seen labels to have fixed class new_label_class=-1

Tests and documentation updates included.

(edit: adding "label" to list for quick summary)
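
For illustration, a minimal standalone sketch of the proposed "update" semantics (hypothetical names and data, not this PR's implementation):

```python
import numpy as np

# Hypothetical sketch of new_labels="update": labels seen at fit time keep
# IDs 0..N-1; labels first seen at transform time get fresh IDs N, N+1, ...
classes = np.array(["amsterdam", "paris", "tokyo"])  # fitted classes_, IDs 0..2
y_new = np.array(["tokyo", "rome", "amsterdam"])     # "rome" is unseen

diff = np.setdiff1d(y_new, classes)                  # unseen labels: ["rome"]
classes = np.append(classes, np.sort(diff))          # IDs now run 0..3
mapping = {label: idx for idx, label in enumerate(classes)}
print(np.array([mapping[v] for v in y_new]))         # [2 3 0]
```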

@jnothman
scikit-learn member
@mjbommar

@jnothman, good idea. I would normally hit it with pd.fillna after, but that would be even friendlier.

@mjbommar

@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching.

@jnothman
scikit-learn member

Yes, I think an example would be helpful.

@mjbommar

@jnothman, another subtle point about np.nan here: if we allow np.nan encodings, we need to return the encoded array as float*, not int* (a quick check below shows why).

Which do we want?

  • All encodings returned as np.float64 (handles np.nan)
  • new_labels="nan" returned as np.float64, else np.int64
@coveralls

Coverage Status

Coverage decreased (-0.0%) when pulling ab788f7 on mjbommar:label-encoder-unseen into d298a37 on scikit-learn:master.

@jnothman
scikit-learn member

As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed?

But if it's there, yes, I'd say upcast to a float type (could use find_common_type).
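
(For reference, the promotion can be computed directly; np.find_common_type existed at the time but has since been removed in NumPy 2.0 in favor of np.result_type:)

```python
import numpy as np

# Int labels promoted against a float kind (needed for NaN) -> float64.
print(np.result_type(np.int64, np.float64))  # float64
# Pre-NumPy-2.0 equivalent: np.find_common_type([np.int64], [np.float64])
```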

@mjbommar

Well, this is the same issue that pandas addresses more generally, and Wes seemed to find no better solution than upcasting for np.nan:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless upcast, so do we need to use find_common_type?

@jnothman
scikit-learn member
@mjbommar

OK, the version I have currently pushed has the proposed float/int logic.

@mjbommar

@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure.

@jnothman
scikit-learn member

I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm.

@mjbommar

@jnothman, no worries, no rush on my end. Just wanted to make sure you weren't waiting on me. Safe travels!

@mjbommar

Reminder to myself that this will need to be rebased given other work in preprocessing.label.

@jnothman jnothman and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
@@ -134,7 +156,48 @@ def transform(self, y):
_check_numpy_unicode_bug(classes)
if len(np.intersect1d(classes, self.classes_)) < len(classes):
diff = np.setdiff1d(classes, self.classes_)
- raise ValueError("y contains new labels: %s" % str(diff))
+
+ # If we are mapping new labels, get "new" ID and change in copy.
+ if self.new_labels == "update":
+ # Update the class list with new labels
+ self.classes_ = np.append(self.classes_, np.sort(diff))
@jnothman
jnothman Jul 15, 2014

I don't see how searchsorted will continue to work unless min(diff) > max(self.classes_) here.
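
(A concrete demonstration of the problem, with made-up labels:)

```python
import numpy as np

classes = np.array(["a", "c"])        # sorted classes_ from fit
classes = np.append(classes, ["b"])   # "update" appends: ["a", "c", "b"], no longer sorted
print(np.searchsorted(classes, "b"))  # 1 -- wrong, "b" was assigned ID 2
```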

@mjbommar
mjbommar Jul 15, 2014

You're correct.

I had some performant options that worked if transform was only called once with unseen labels, but these masking+searchsorted approaches fall apart once transform is called twice with unseen labels.

My opinion is that degrading performance to a list index lookup for "update" would be worth the practical benefit of the option.

@mjbommar
mjbommar Jul 15, 2014

@jnothman, committed updates addressing other issues, and enhanced the unit test for this case to generate an error as you highlighted.

Let me know if you have an issue with a dictionary lookup. My thought was to keep the missing mask and:

  • use searchsorted for ~missing entries
  • use dictionary lookup for missing entries
  • extend self.classes_ but keep a private index where the post-fit labels begin
@jnothman jnothman and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
@@ -134,7 +156,48 @@ def transform(self, y):
_check_numpy_unicode_bug(classes)
if len(np.intersect1d(classes, self.classes_)) < len(classes):
diff = np.setdiff1d(classes, self.classes_)
- raise ValueError("y contains new labels: %s" % str(diff))
+
+ # If we are mapping new labels, get "new" ID and change in copy.
+ if self.new_labels == "update":
+ # Update the class list with new labels
+ self.classes_ = np.append(self.classes_, np.sort(diff))
+
+ # Return mapped encoding
+ return np.searchsorted(self.classes_, y)
+ elif self.new_labels == "nan":
+ # Create copy of array and return
+ y_array = np.array(y)
@jnothman
jnothman Jul 15, 2014

You can just let y = np.asarray(y) at the start of transform. Yet you seem to want a copy here, and I don't understand why.

@mjbommar
mjbommar Jul 15, 2014

Was simply trying to be memory-efficient, as no np.ndarray is required in the cases where new_labels \in {"raise", "update"} or no new classes seen.

Do we still want to apply this always?

@jnothman
jnothman Jul 15, 2014

searchsorted will convert to an array anyway.

@jnothman jnothman and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
+ y_array = np.array(y)
+ z = np.zeros(y_array.shape, dtype=int)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+
+ # Populate return array properly and return
+ z[-missing_mask] = np.searchsorted(self.classes_,
+ y_array[-missing_mask])
+ z[missing_mask] = self.new_label_class
+ return z
+ elif self.new_labels == "raise":
+ # Return ValueError, original behavior.
+ raise ValueError("y contains new labels: %s" % str(diff))
+ else:
+ # Raise on invalid argument.
@jnothman
jnothman Jul 15, 2014

Failing on fit is probably kinder to the user.

@jnothman jnothman and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
@@ -134,7 +156,48 @@ def transform(self, y):
_check_numpy_unicode_bug(classes)
if len(np.intersect1d(classes, self.classes_)) < len(classes):
diff = np.setdiff1d(classes, self.classes_)
- raise ValueError("y contains new labels: %s" % str(diff))
+
+ # If we are mapping new labels, get "new" ID and change in copy.
+ if self.new_labels == "update":
+ # Update the class list with new labels
+ self.classes_ = np.append(self.classes_, np.sort(diff))
+
+ # Return mapped encoding
+ return np.searchsorted(self.classes_, y)
+ elif self.new_labels == "nan":
+ # Create copy of array and return
+ y_array = np.array(y)
+ z = np.zeros(y_array.shape, dtype=float)
@jnothman
jnothman Jul 15, 2014

Can we call z out to fit numpy convention?

@mjbommar
mjbommar Jul 15, 2014

Sorry, bit slow tonight; do you mean to literally rename z to be called out?

@jnothman
jnothman Jul 15, 2014

I was ambiguous, but that's what I mean.

@mjbommar mjbommar and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
+
+ # Populate return array properly and return
+ z[-missing_mask] = np.searchsorted(self.classes_,
+ y_array[-missing_mask])
+ z[missing_mask] = np.nan
+ return z
+ elif self.new_labels == "label":
+ # Create copy of array and return
+ y_array = np.array(y)
+ z = np.zeros(y_array.shape, dtype=int)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+
+ # Populate return array properly and return
+ z[-missing_mask] = np.searchsorted(self.classes_,
@mjbommar
mjbommar Jul 15, 2014

Seeing these in nose under bleeding numpy:

......./home/mjbommar/workspace/scikit-learn/sklearn/preprocessing/label.py:184: DeprecationWarning: numpy boolean negative (the unary `-` operator) is deprecated, use the bitwise_xor (the `^` operator) or the logical_xor function instead.
  y_array[-missing_mask])
/home/mjbommar/workspace/scikit-learn/sklearn/preprocessing/label.py:184: DeprecationWarning: numpy boolean negative (the unary `-` operator) is deprecated, use the bitwise_xor (the `^` operator) or the logical_xor function instead.
  y_array[-missing_mask])
./home/mjbommar/workspace/scikit-learn/sklearn/preprocessing/label.py:197: DeprecationWarning: numpy boolean negative (the unary `-` operator) is deprecated, use the bitwise_xor (the `^` operator) or the logical_xor function instead.
  y_array[-missing_mask])
/home/mjbommar/workspace/scikit-learn/sklearn/preprocessing/label.py:197: DeprecationWarning: numpy boolean negative (the unary `-` operator) is deprecated, use the bitwise_xor (the `^` operator) or the logical_xor function instead.
  y_array[-missing_mask])
........
@jnothman
jnothman Jul 15, 2014

Yes, I thought that looked unfamiliar. Use ~ or np.logical_not.
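
(The fix is a one-character change:)

```python
import numpy as np

mask = np.array([True, False, True])
print(~mask)                 # [False  True False], the supported boolean negation
print(np.logical_not(mask))  # same result; `-mask` on booleans is deprecated
```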

@cjauvin

Very nice idea, I'm glad to know that I'm not the only one thinking this is an issue!

Do you agree that the same logic would also apply to the LabelBinarizer (and perhaps to some other encoder types that I'm not thinking of as well)?

@jnothman
scikit-learn member

See also #1643 (abandoned).

@amueller amueller added this to the 0.16 milestone Jul 17, 2014
@amueller amueller and 1 other commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
+ # Setup out
+ out = np.zeros(y.shape, dtype=int)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+ new_class_values = np.sort(diff)
+
+ # Populate return array properly and return
+ out[~missing_mask] = np.searchsorted(self.classes_,
+ y[~missing_mask])
+ out[missing_mask] = np.searchsorted(new_class_values,
+ y[missing_mask]) + \
+ len(self.classes_)
+
+ # Update the class list with new labels
+ self.classes_ = np.append(self.classes_, new_class_values)
@amueller
amueller Jul 17, 2014

wouldn't it be easier to update self.classes_ and then just do the normal searchsorted in the end?

@mjbommar
mjbommar Jul 17, 2014

What if, as @jnothman pointed out, min(new_class_values) < max(self.classes_)? np.searchsorted wouldn't work in that case.

To summarize my thoughts with this:

  • We need to support repeated calls to transform that contain new values, which rules out the split old/new searchsorted approach.
  • To make sure we can support repeated calls, I added a unit test (the failing one).

My opinion is that we ought to keep a dictionary self.new_class_mapping_ that maps from values that weren't in fit to new class labels during transform, which will ensure that repeated calls to transform will map new values consistently.

If no one feels strongly against the performance impact of falling back to dictionary access for handling the unseen values, I'll implement this later today.
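
A minimal sketch of that idea (hypothetical names, not the code eventually pushed): keep ``classes_`` for fit-time labels and a dict for post-fit labels, so repeated transform calls encode the same unseen value consistently:

```python
import numpy as np

def transform_update(classes_, new_class_mapping_, y):
    """Encode y; unseen values get fresh IDs recorded in new_class_mapping_."""
    y = np.asarray(y)
    seen_mask = np.in1d(y, classes_)
    out = np.zeros(y.shape, dtype=int)
    out[seen_mask] = np.searchsorted(classes_, y[seen_mask])  # classes_ stays sorted
    for i in np.where(~seen_mask)[0]:
        value = y[i]
        if value not in new_class_mapping_:  # assign the next free integer ID
            new_class_mapping_[value] = len(classes_) + len(new_class_mapping_)
        out[i] = new_class_mapping_[value]
    return out

classes_ = np.array(["amsterdam", "paris", "tokyo"])
mapping = {}
print(transform_update(classes_, mapping, ["tokyo", "rome"]))      # [2 3]
print(transform_update(classes_, mapping, ["rome", "amsterdam"]))  # [3 0] -- consistent
```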

@arjoly arjoly commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
+ y = np.array(y)
+
+ # If we are mapping new labels, get "new" ID and change in copy.
+ if self.new_labels == "update":
+ # Setup out
+ out = np.zeros(y.shape, dtype=int)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+ new_class_values = np.sort(diff)
+
+ # Populate return array properly and return
+ out[~missing_mask] = np.searchsorted(self.classes_,
+ y[~missing_mask])
+ out[missing_mask] = np.searchsorted(new_class_values,
+ y[missing_mask]) + \
@arjoly
arjoly Jul 17, 2014

Could you avoid the \?

@arjoly arjoly commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
+
+ # Update the class list with new labels
+ self.classes_ = np.append(self.classes_, new_class_values)
+
+ # Return mapped encoding
+ return out
+ elif self.new_labels == "nan":
+ # Setup out
+ out = np.zeros(y.shape, dtype=float)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+
+ # Populate return array properly and return
+ out[~missing_mask] = np.searchsorted(self.classes_,
+ y[~missing_mask])
@arjoly
arjoly Jul 17, 2014

The above lines of code are repeated several times. It would be great to refactor them.

@arjoly
scikit-learn member

LGTM, whenever previous comments are addressed and Travis is green.

@amueller amueller commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
+ missing_mask = np.in1d(y, diff)
+
+ # Populate return array properly and return
+ out[~missing_mask] = np.searchsorted(self.classes_,
+ y[~missing_mask])
+ out[missing_mask] = np.nan
+ return out
+ elif self.new_labels == "label":
+ # Setup out
+ out = np.zeros(y.shape, dtype=int)
+
+ # Find entries with new labels
+ missing_mask = np.in1d(y, diff)
+
+ # Populate return array properly and return
+ out[~missing_mask] = np.searchsorted(self.classes_,
@amueller
amueller Jul 17, 2014

My gut tells me it would be cheaper to update all of out and then use the mask to overwrite as in the second line. That would also get rid of allocating out first.
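
(Roughly what that suggestion looks like, as an untested sketch with made-up data:)

```python
import numpy as np

classes_ = np.array(["amsterdam", "paris", "tokyo"])
y = np.array(["tokyo", "rome"])     # "rome" unseen
missing_mask = ~np.in1d(y, classes_)

out = np.searchsorted(classes_, y)  # unseen entries get garbage indices here...
out[missing_mask] = -1              # ...then the mask overwrites them with the fixed label
print(out)                          # [ 2 -1]
```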

@amueller
scikit-learn member

@cjauvin this should also be in LabelBinarizer and probably OneHotEncoder.

@arjoly
scikit-learn member

> @cjauvin this should also be in LabelBinarizer and probably OneHotEncoder.

+1 for separate pull requests :-)

@mjbommar mjbommar changed the title from Make LabelEncoder more friendly to new labels to [WIP] Make LabelEncoder more friendly to new labels Jul 17, 2014
@GaelVaroquaux
scikit-learn member

I am very much against using nans for anything other than numerical errors. Experience shows that it makes debugging very hard, and tends to force verification code into a lot of the codebase. On top of that, I have a bad feeling about floats creeping into our labels.

I suggest using a specific label for unseen classes, e.g. -9999. However, if we want more flexibility, we can also accept an integer value as a valid value for this argument and use it.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
@@ -43,6 +43,24 @@ def _check_numpy_unicode_bug(labels):
class LabelEncoder(BaseEstimator, TransformerMixin):
"""Encode labels with value between 0 and n_classes-1.
+ Parameters
+ ----------
+
+ new_labels : string, optional (default: "raise")
+ Determines how to handle new labels, i.e., data
+ not seen in the training domain.
+
+ - If ``"raise"``, then raise ValueError.
+ - If ``"update"``, then re-map the new labels to
+ classes ``[N, ..., N+m-1]``, where ``m`` is the number of new labels.
+ - If ``"nan"``, then re-map the new labels to ``numpy.nan``.
+ - If ``"label"``, then use the value of ``new_label_class``.
@GaelVaroquaux
GaelVaroquaux Jul 17, 2014

I would here suppress the nan option, and accept:

  • raise
  • update
  • an integer value, which is then the value used for new labels

And I would suppress the argument 'new_label_class'.

What do people think?
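
A sketch of how that polymorphic argument might be validated, assuming (per the earlier comment) that the check lives in fit; names are hypothetical:

```python
import numbers

def _check_new_labels(new_labels):
    """Validate the proposed polymorphic ``new_labels`` argument."""
    if new_labels in ("raise", "update"):
        return new_labels
    if isinstance(new_labels, numbers.Integral):
        return int(new_labels)  # fixed integer ID to assign to unseen labels
    raise ValueError("new_labels must be 'raise', 'update', or an integer; "
                     "got %r" % (new_labels,))
```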

@mjbommar

Summary:

  • Took @GaelVaroquaux's and @arjoly's suggestions by (1) removing the "nan" option, (2) reworking "label" and new_label_value into a polymorphic argument, and (3) changing the line-break style.
  • Implemented split between fit-time classes in classes_ and post-fit classes in new_label_mapping_, which is robust to repeated calls to transform regardless of ordering of classes.
  • Implemented get_classes method (sketched below) to return an np.ndarray of class values whose order matches the integer encoding in the case of new_labels="update".
  • Some minor cleanup/docstring typos to fix when I rebase to incorporate the other recent work in label, but all unit tests pass and the structure is there for review.
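
(A guess at the shape of such a get_classes method, assuming the classes_/new_label_mapping_ split described above; not the PR's actual code:)

```python
import numpy as np

def get_classes(classes_, new_label_mapping_):
    """Class values ordered to match their integer encoding (sketch)."""
    # Post-fit labels were assigned IDs len(classes_), len(classes_) + 1, ...
    extra = sorted(new_label_mapping_, key=new_label_mapping_.get)
    return np.append(classes_, extra)

print(get_classes(np.array(["amsterdam", "paris", "tokyo"]), {"rome": 3}))
# ['amsterdam' 'paris' 'tokyo' 'rome']
```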

@GaelVaroquaux , regarding your recommendation of -9999, a few thoughts:

  • Do we want to make this default behavior instead of raise/ValueError?
  • Is there appetite for a more holistic discussion of NaN vs. NA vs. NaT vs. ...? Wes faced this same set of decisions in pandas (built on top of numpy/scipy, handling noisy/realistic data of various types, Cython, performance-oriented). Would it be worth starting with a table of support, by pre-processor/estimator, for type and NA/NaN?
@GaelVaroquaux
scikit-learn member
@jnothman
scikit-learn member

I think raise is a kinder default.

@mjbommar

@GaelVaroquaux , agreed, but in the absence of any consistent(ly communicated) option for NA, many substitute NaN. My question was mostly whether we thought it a good use of collective development time to present a consistent message on how to encode "missing" observations throughout sklearn, similar to the way that pandas has a clearly communicated and (mostly) consistent strategy.

@GaelVaroquaux
scikit-learn member
@mjbommar

OK, I get the frustration. That said, as a practitioner relying on pandas daily, I am grateful for the fact that Wes has not taken a similarly pessimistic view. I think there are a number of places like imputation, scaling, encoding, or categorical-friendly estimators that would benefit from a similar approach.

@mjbommar

@jnothman, re: kinder default, my opinion is that it would be new_labels="update" rather than -9999. Any opinion?

@jnothman
scikit-learn member
pvnguyen and others added some commits Jul 21, 2014
@pvnguyen pvnguyen Update outlier_detection.rst
Add reference for One-class SVM.
0b8e63c
@kastnerkyle kastnerkyle Added directory checking for documentation builds, and corrected for Windows pathing
62f1f57
@kastnerkyle kastnerkyle Merge pull request #3477 from kastnerkyle/docbuild_fix
[MRG] Docbuild fixes for #3475
8cd5d85
@larsmans larsmans Merge pull request #3479 from MechCoder/improve_logcv_docs
Improve docstring of Logistic Regression CV model
965b109
@ogrisel ogrisel MAINT More robust windows installation script d814353
@ogrisel ogrisel FIX unstable test on 32 bit windows f3afd4e
@ogrisel ogrisel MAINT move skip for unstable 32bit to _check_transformer 0af2d8f
@agramfort agramfort Merge pull request #3465 from pvnguyen/patch-1
Update outlier_detection.rst
376ac51
@mjbommar mjbommar Adding new_labels argument to LabelEncoder 4b6978e
@mjbommar mjbommar Adding tests for new_labels argument. d990207
@mjbommar mjbommar Changing classes_ update strategy a69840b
@mjbommar mjbommar Adding nan behavior, renaming to fce9fb5
@mjbommar mjbommar Updating tests to include nan case and update name 76921e5
@mjbommar mjbommar Fixing docstring for test-doc pass 0e39a2a
@mjbommar mjbommar Fixing docstring for test-doc pass (for real) 1da2880
@mjbommar mjbommar Updating doctests 926b166
@mjbommar mjbommar Updating constructor documentation 5ef9b85
@mjbommar mjbommar Adding specific "label" option to new_labels 4dfb4cb
@mjbommar mjbommar Adding test for "label" option to ``new_labels`` 392e54b
@mjbommar mjbommar Updating docstring for ``new_labels="label"`` e053635
@mjbommar mjbommar pep8 122a98f
@mjbommar mjbommar Autodoc fix de18372
@mjbommar mjbommar Fixing rst docs d735ca2
@mjbommar mjbommar Changing dtypes for new_labels d276565
@mjbommar mjbommar Adding example for new_labels argument a01f8b0
@mjbommar mjbommar Adding new_labels handling to fit/fit_transform 495347c
@mjbommar mjbommar Improving difficulty of test cases with non-increasing unseen labels dee4ae0
@mjbommar mjbommar Moving ValueError check to fit c297017
@mjbommar mjbommar Improving difficulty for new_labels='update' test to include multiple transform with new labels
f29800b
@mjbommar mjbommar Fixing negative indexing, renamed z->out, failing approach for new_labels=update w/ searchsorted
74b7589
@mjbommar mjbommar PEP8 3e1be5d
@mjbommar mjbommar Removing nan option and corresponding test abf01cc
@mjbommar mjbommar Handling repeated transform calls with new_class_mapping_, refactoring, cleaning after removing np.nan.
f26a902
@mjbommar mjbommar Rebase
0725d4c
@mjbommar

Cleanly rebased final PR pending.

@mjbommar

Closing for PR #3483.

@mjbommar mjbommar closed this Jul 24, 2014
@mjbommar mjbommar referenced this pull request Aug 27, 2014
Open

[WIP] Sparse and Multioutput LabelEncoder #3592

2 of 7 tasks complete