# [WIP] Make LabelEncoder more friendly to new labels #3243

Status: Closed · wants to merge 59 commits · 12 participants

This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

• "raise": current behavior, i.e., raise ValueError; remains the default
• "nan": return np.nan for unseen/new labels
• "update": update classes_ with new IDs [N, ..., N+m-1] for the m new labels and assign those IDs
• "label": map newly seen labels to the fixed class new_label_class=-1

(edit: adding "label" to list for quick summary)
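For concreteness, the four behaviors can be sketched with plain NumPy. This is a hypothetical helper for illustration, not the PR's code; `encode` and its arguments simply mirror the option names above:

```python
import numpy as np

def encode(classes, y, new_labels="raise", new_label_class=-1):
    """Sketch of the proposed transform behaviors (illustrative only).

    `classes` plays the role of the fitted classes_; `y` may contain
    labels that were never seen at fit time.
    """
    classes = np.sort(np.asarray(classes))
    y = np.asarray(y)
    diff = np.setdiff1d(y, classes)          # labels unseen at fit time
    if new_labels == "raise" and diff.size:
        raise ValueError("y contains new labels: %s" % str(diff))
    missing = np.isin(y, diff)
    if new_labels == "update":
        # New labels get the next free integer IDs N .. N+m-1.
        out = np.empty(len(y), dtype=int)
        out[~missing] = np.searchsorted(classes, y[~missing])
        out[missing] = len(classes) + np.searchsorted(np.sort(diff), y[missing])
        return out
    # "nan" needs a float output array; "label"/"raise" stay integer.
    out = np.empty(len(y), dtype=float if new_labels == "nan" else int)
    out[~missing] = np.searchsorted(classes, y[~missing])
    out[missing] = np.nan if new_labels == "nan" else new_label_class
    return out

print(encode(["a", "b"], ["a", "c"], new_labels="label"))  # "c" is unseen
```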

added some commits Jun 4, 2014
 mjbommar Adding new_labels argument to LabelEncoder 4d79789
 mjbommar Adding tests for new_labels argument. 2bc5686
 mjbommar Changing classes_ update strategy 8c1fafe
 mjbommar Adding nan behavior, renaming to 1ffb24a
 mjbommar Updating tests to include nan case and update name 99f65a9
 mjbommar Fixing docstring for test-doc pass af8c6a9
 mjbommar Fixing docstring for test-doc pass (for real) 8ffc839
 mjbommar Updating doctests e6fbc47
 mjbommar Updating constructor documentation 46118d9
scikit-learn member

@jnothman, good idea. I would normally hit it with pd.fillna after, but that would be even friendlier.

added some commits Jun 5, 2014
 mjbommar Adding specific "label" option to new_labels 8d21ec1
 mjbommar Adding test for "label" option to new_labels 343c726
 mjbommar Updating docstring for new_labels="label" be97c14
 mjbommar pep8 cdd7147
 mjbommar Autodoc fix 170d00c
 mjbommar Fixing rst docs 2d87e88

@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching.

scikit-learn member

Yes, I think an example would be helpful.

@jnothman, another subtle point about np.nan here: if we allow for np.nan encodings, we need to make our returned array float*, not int*.

Which do we want?

• All encodings returned as np.float64 (handles np.nan)
• new_labels="nan" returned as np.float64, else np.int64
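Context for the dtype question: a NumPy integer array simply cannot hold np.nan, so the "nan" option forces an upcast somewhere. A minimal illustration:

```python
import numpy as np

int_out = np.array([0, 1, 2])
try:
    int_out[1] = np.nan            # nan is a float; int arrays reject it
except (ValueError, TypeError) as exc:
    print("int array:", exc)

float_out = np.array([0, 1, 2], dtype=np.float64)
float_out[1] = np.nan              # fine once the output dtype is float64
print(float_out)
```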
added some commits Jun 8, 2014
 mjbommar Changing dtypes for new_labels bb8d9a6 mjbommar Adding example for new_labels argument ab788f7

Coverage decreased (-0.0%) when pulling ab788f7 on mjbommar:label-encoder-unseen into d298a37 on scikit-learn:master.

scikit-learn member

As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed?

But if it's there, yes, I'd say upcast to a float type (could use find_common_type).

Well, this is the same issue that pandas addresses more generally, and Wes seemed to find no better solution than upcasting for np.nan:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#na-type-promotions

I think we know deterministically that the indices will be integer unless upcast, so do we need to use find_common_type?
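Right: the codes are integers by construction, so the only promotion in play is int → float when NaN appears. A quick check (shown here with np.result_type, the modern counterpart of the find_common_type mentioned above):

```python
import numpy as np

# Codes are integers; mixing in a NaN fill value promotes to float64,
# the same NA-type promotion pandas documents for integer columns.
print(np.result_type(np.int64, float))   # the common type is float64

codes = np.array([0, 1, 2], dtype=np.int64)
print((codes + np.nan).dtype)            # arithmetic with nan upcasts too
```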

scikit-learn member

OK, the version I have currently pushed has the proposed float/int logic.

@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure.

scikit-learn member

I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm.

@jnothman, no worries, no rush on my end. Just wanted to make sure you weren't waiting on me. Safe travels!

Reminder to myself that this will need to be rebased given other work in preprocessing.label.

and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
```diff
@@ -134,7 +156,48 @@ def transform(self, y):
         _check_numpy_unicode_bug(classes)
         if len(np.intersect1d(classes, self.classes_)) < len(classes):
             diff = np.setdiff1d(classes, self.classes_)
-            raise ValueError("y contains new labels: %s" % str(diff))
+
+            # If we are mapping new labels, get "new" ID and change in copy.
+            if self.new_labels == "update":
+                # Update the class list with new labels
+                self.classes_ = np.append(self.classes_, np.sort(diff))
```

**jnothman** (Jul 15, 2014): I don't see how searchsorted will continue to work unless min(diff) > max(self.classes_) here.

**mjbommar** (Jul 15, 2014): You're correct. I had some performant options that worked if transform was only called once with unseen labels, but these masking+searchsorted approaches fall apart once transform is called twice with unseen labels. My opinion is that degrading performance to a list index lookup for "update" would be worth the practical benefit of the option.

**mjbommar** (Jul 15, 2014): @jnothman, committed updates addressing the other issues, and enhanced the unit test for this case to generate the error you highlighted. Let me know if you have an issue with a dictionary lookup. My thought was to keep the missing mask and: use searchsorted for ~missing entries; use a dictionary lookup for missing entries; extend self.classes_ but keep a private index where the post-fit labels begin.
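The failure mode jnothman describes is easy to reproduce: np.searchsorted assumes its first argument is sorted, so appending a new label that sorts before an existing one silently corrupts the codes:

```python
import numpy as np

classes = np.array(["b", "d"])          # classes_ after fit, sorted
classes = np.append(classes, ["a"])     # "update" appends unseen label "a"
# ["b" "d" "a"] is no longer sorted, so binary search returns bogus codes:
codes = np.searchsorted(classes, ["a", "b", "d"])
print(codes)                            # "a" and "b" collide on the same code
```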
and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
```diff
@@ -134,7 +156,48 @@ def transform(self, y):
         _check_numpy_unicode_bug(classes)
         if len(np.intersect1d(classes, self.classes_)) < len(classes):
             diff = np.setdiff1d(classes, self.classes_)
-            raise ValueError("y contains new labels: %s" % str(diff))
+
+            # If we are mapping new labels, get "new" ID and change in copy.
+            if self.new_labels == "update":
+                # Update the class list with new labels
+                self.classes_ = np.append(self.classes_, np.sort(diff))
+
+                # Return mapped encoding
+                return np.searchsorted(self.classes_, y)
+            elif self.new_labels == "nan":
+                # Create copy of array and return
+                y_array = np.array(y)
```

**jnothman** (Jul 15, 2014): You can just let y = np.asarray(y) at the start of transform. Yet you seem to want a copy here, and I don't understand why.

**mjbommar** (Jul 15, 2014): Was simply trying to be memory-efficient, as no np.ndarray is required in the cases where new_labels ∈ {"raise", "update"} or no new classes are seen. Do we still want to apply this always?

**jnothman** (Jul 15, 2014): searchsorted will convert to an array anyway.
and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
```diff
+                y_array = np.array(y)
+                z = np.zeros(y_array.shape, dtype=int)
+
+                # Find entries with new labels
+                missing_mask = np.in1d(y, diff)
+
+                # Populate return array properly and return
+                z[-missing_mask] = np.searchsorted(self.classes_,
+                                                   y_array[-missing_mask])
+                z[missing_mask] = self.new_label_class
+                return z
+            elif self.new_labels == "raise":
+                # Return ValueError, original behavior.
+                raise ValueError("y contains new labels: %s" % str(diff))
+            else:
+                # Raise on invalid argument.
```

**jnothman** (Jul 15, 2014): Failing on fit is probably kinder to the user.

**mjbommar** (Jul 15, 2014): Addressed.
and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
```diff
@@ -134,7 +156,48 @@ def transform(self, y):
         _check_numpy_unicode_bug(classes)
         if len(np.intersect1d(classes, self.classes_)) < len(classes):
             diff = np.setdiff1d(classes, self.classes_)
-            raise ValueError("y contains new labels: %s" % str(diff))
+
+            # If we are mapping new labels, get "new" ID and change in copy.
+            if self.new_labels == "update":
+                # Update the class list with new labels
+                self.classes_ = np.append(self.classes_, np.sort(diff))
+
+                # Return mapped encoding
+                return np.searchsorted(self.classes_, y)
+            elif self.new_labels == "nan":
+                # Create copy of array and return
+                y_array = np.array(y)
+                z = np.zeros(y_array.shape, dtype=float)
```

**jnothman** (Jul 15, 2014): Can we call z out to fit numpy convention?

**mjbommar** (Jul 15, 2014): Sorry, bit slow tonight; do you mean to literally rename z to be called out?

**jnothman** (Jul 15, 2014): I was ambiguous, but that's what I mean.
and 1 other commented on an outdated diff Jul 15, 2014
sklearn/preprocessing/label.py
```diff
+
+                # Populate return array properly and return
+                z[-missing_mask] = np.searchsorted(self.classes_,
+                                                   y_array[-missing_mask])
+                z[missing_mask] = np.nan
+                return z
+            elif self.new_labels == "label":
+                # Create copy of array and return
+                y_array = np.array(y)
+                z = np.zeros(y_array.shape, dtype=int)
+
+                # Find entries with new labels
+                missing_mask = np.in1d(y, diff)
+
+                # Populate return array properly and return
+                z[-missing_mask] = np.searchsorted(self.classes_,
```

**mjbommar** (Jul 15, 2014): Seeing these in nose under bleeding-edge NumPy (the same warning fires at label.py lines 184 and 197, twice each):

```
DeprecationWarning: numpy boolean negative (the unary - operator) is
deprecated, use the bitwise_xor (the ^ operator) or the logical_xor
function instead.
  y_array[-missing_mask])
```

**jnothman** (Jul 15, 2014): Yes, I thought that looked unfamiliar. Use ~ or np.logical_not.
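A minimal sketch of the supported spellings for inverting a boolean mask (the unary minus form that triggers the warning above has since become an error in NumPy):

```python
import numpy as np

missing_mask = np.array([False, True, False])   # True = unseen label
y = np.array([10, 20, 30])

# ~ (bitwise NOT) and np.logical_not are the supported inversions.
print(y[~missing_mask])                 # the seen entries
print(y[np.logical_not(missing_mask)])  # identical result
```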
added some commits Jul 15, 2014
Adding new_labels handling to fit/fit_transform a597fc3
Improving difficulty of test cases with non-increasing unseen labels 291d752
Moving ValueError check to fit fe0141d
Improving difficulty for new_labels='update' test to include multiple transform with new labels e1b7ed5
Fixing negative indexing, renamed z->out, failing approach for new_labels=update w/ searchsorted 9fd7736
PEP8
e3c14bb

Very nice idea, I'm glad to know that I'm not the only one thinking this is an issue!

Do you agree that the same logic would also apply to the LabelBinarizer (and perhaps to some other encoder types that I'm not thinking of as well)?

scikit-learn member

added this to the 0.16 milestone Jul 17, 2014
and 1 other commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
```diff
+            # Setup out
+            out = np.zeros(y.shape, dtype=int)
+
+            # Find entries with new labels
+            missing_mask = np.in1d(y, diff)
+            new_class_values = np.sort(diff)
+
+            # Populate return array properly and return
+            out[~missing_mask] = np.searchsorted(self.classes_,
+                                                 y[~missing_mask])
+            out[missing_mask] = np.searchsorted(new_class_values,
+                                                y[missing_mask]) + \
+                len(self.classes_)
+
+            # Update the class list with new labels
+            self.classes_ = np.append(self.classes_, new_class_values)
```

**amueller** (Jul 17, 2014): wouldn't it be easier to update self.classes_ and then just do the normal searchsorted in the end?

**mjbommar** (Jul 17, 2014): What if, as @jnothman pointed out, min(new_class_values) < max(self.classes_)? np.searchsorted wouldn't work in that case. To summarize my thoughts: we need to support repeated calls to transform that contain new values, which rules out a new/old searchsorted split; to make sure we can support repeated calls, I added a unit test (the failing one). My opinion is that we ought to keep a dictionary self.new_class_mapping_ that maps values not seen in fit to new class labels during transform, which ensures that repeated calls to transform map new values consistently. If no one feels strongly against the performance impact of falling back to dictionary access for handling the unseen values, I'll implement this later today.
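The dictionary fallback described above can be sketched like this. Illustrative only, not the PR's implementation; the name new_class_mapping_ follows the comment:

```python
import numpy as np

class UpdatingEncoder:
    """Sketch: searchsorted for fitted classes, dict for post-fit labels."""

    def fit(self, y):
        self.classes_ = np.unique(y)       # sorted labels seen at fit time
        self.new_class_mapping_ = {}       # label -> code assigned after fit
        return self

    def transform(self, y):
        y = np.asarray(y)
        known = np.isin(y, self.classes_)
        out = np.empty(len(y), dtype=int)
        out[known] = np.searchsorted(self.classes_, y[known])
        # Unseen labels get stable codes from the persistent mapping, so
        # repeated transform calls encode the same new label consistently.
        next_code = len(self.classes_) + len(self.new_class_mapping_)
        for i in np.flatnonzero(~known):
            label = y[i]
            if label not in self.new_class_mapping_:
                self.new_class_mapping_[label] = next_code
                next_code += 1
            out[i] = self.new_class_mapping_[label]
        return out

enc = UpdatingEncoder().fit(["b", "d"])
print(enc.transform(["b", "a"]))   # "a" is assigned the next free code
print(enc.transform(["a", "c"]))   # "a" keeps its code; "c" gets a new one
```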
sklearn/preprocessing/label.py
```diff
+        y = np.array(y)
+
+        # If we are mapping new labels, get "new" ID and change in copy.
+        if self.new_labels == "update":
+            # Setup out
+            out = np.zeros(y.shape, dtype=int)
+
+            # Find entries with new labels
+            missing_mask = np.in1d(y, diff)
+            new_class_values = np.sort(diff)
+
+            # Populate return array properly and return
+            out[~missing_mask] = np.searchsorted(self.classes_,
+                                                 y[~missing_mask])
+            out[missing_mask] = np.searchsorted(new_class_values,
+                                                y[missing_mask]) + \
```

**arjoly** (Jul 17, 2014): Could you avoid the \?
sklearn/preprocessing/label.py
```diff
+
+            # Update the class list with new labels
+            self.classes_ = np.append(self.classes_, new_class_values)
+
+            # Return mapped encoding
+            return out
+        elif self.new_labels == "nan":
+            # Setup out
+            out = np.zeros(y.shape, dtype=float)
+
+            # Find entries with new labels
+            missing_mask = np.in1d(y, diff)
+
+            # Populate return array properly and return
+            out[~missing_mask] = np.searchsorted(self.classes_,
+                                                 y[~missing_mask])
```

**arjoly** (Jul 17, 2014): The above lines of code are repeated several times. It would be great to refactor them.
scikit-learn member

sklearn/preprocessing/label.py
```diff
+            missing_mask = np.in1d(y, diff)
+
+            # Populate return array properly and return
+            out[~missing_mask] = np.searchsorted(self.classes_,
+                                                 y[~missing_mask])
+            out[missing_mask] = np.nan
+            return out
+        elif self.new_labels == "label":
+            # Setup out
+            out = np.zeros(y.shape, dtype=int)
+
+            # Find entries with new labels
+            missing_mask = np.in1d(y, diff)
+
+            # Populate return array properly and return
+            out[~missing_mask] = np.searchsorted(self.classes_,
```

**amueller** (Jul 17, 2014): My gut tells me it would be cheaper to update all of out and then use the mask to overwrite as in the second line. That would also get rid of allocating out first.
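amueller's suggestion, sketched: compute searchsorted codes for everything in one pass, then overwrite the masked positions. The -1 sentinel here is illustrative:

```python
import numpy as np

classes = np.array(["a", "b", "c"])     # fitted classes, sorted
y = np.array(["a", "z", "c"])           # "z" is unseen
missing_mask = ~np.isin(y, classes)

out = np.searchsorted(classes, y)       # one pass over every entry
out[missing_mask] = -1                  # then overwrite the unseen ones
print(out)
```

This avoids allocating a zeroed array up front and indexes `y` only once, at the cost of computing throwaway codes for the unseen entries.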
scikit-learn member

@cjauvin this should also be in LabelBinarizer and probably OneHotEncoder.

scikit-learn member

> @cjauvin this should also be in LabelBinarizer and probably OneHotEncoder.

+1 for separate pull requests :-)

changed the title from Make LabelEncoder more friendly to new labels to [WIP] Make LabelEncoder more friendly to new labels Jul 17, 2014
scikit-learn member

I am very much against using NaNs for anything other than numerical errors. Experience shows that they make debugging very hard and tend to force verification code throughout the codebase. On top of that, I have a bad feeling about floats creeping into our labels.

I suggest using a specific label for unseen classes. I would suggest -9999. However, if we want more flexibility, we can also accept an integer value as a valid value to this argument and use it.
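A sketch of the polymorphic argument this proposal implies — a string mode or a plain integer used as the code for unseen labels. Hypothetical helper, not scikit-learn API:

```python
import numpy as np

def check_new_labels(new_labels):
    """Validate a polymorphic new_labels argument (illustrative sketch):
    the strings "raise"/"update", or an integer sentinel such as -9999."""
    if new_labels in ("raise", "update"):
        return new_labels
    if isinstance(new_labels, (int, np.integer)):
        return int(new_labels)
    raise ValueError("new_labels must be 'raise', 'update', or an int; "
                     "got %r" % (new_labels,))

print(check_new_labels("raise"))
print(check_new_labels(-9999))     # integer sentinel for unseen labels
```

Validating at construction (or at fit, per the earlier review comment) keeps the bad-argument error out of transform.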

and 1 other commented on an outdated diff Jul 17, 2014
sklearn/preprocessing/label.py
```diff
@@ -43,6 +43,24 @@ def _check_numpy_unicode_bug(labels):
 class LabelEncoder(BaseEstimator, TransformerMixin):
     """Encode labels with value between 0 and n_classes-1.
 
+    Parameters
+    ----------
+
+    new_labels : string, optional (default: "raise")
+        Determines how to handle new labels, i.e., data
+        not seen in the training domain.
+
+        - If "raise", then raise ValueError.
+        - If "update", then re-map the new labels to
+          classes [N, ..., N+m-1], where m is the number of new labels.
+        - If "nan", then re-map the new labels to numpy.nan.
+        - If "label", then use the value of new_label_class.
```

**GaelVaroquaux** (Jul 17, 2014): I would suppress the nan option here, and accept: "raise"; "update"; an integer value, which is then used as the value for new labels. And I would suppress the argument new_label_class. What do people think?
added some commits Jul 19, 2014
 mjbommar Removing nan option and corresponding test fe79736
 mjbommar Handling repeated transform calls with new_class_mapping_, refactoring, cleaning after removing np.nan. b83b37f

Summary:

• Took @GaelVaroquaux's and @arjoly's suggestions by (1) removing the "nan" option, (2) reworking "label" and new_label_value into a polymorphic argument, and (3) changing the line-break style.
• Implemented split between fit-time classes in classes_ and post-fit classes in new_label_mapping_, which is robust to repeated calls to transform regardless of ordering of classes.
• Implemented get_classes method to return an np.ndarray of class values whose order matches the integer encoding in the case of new_label="update".
• Some minor cleanup/docstring/typos to fix when I rebase to incorporate the other work done in label lately, but all unit tests passing and structure there for review.

@GaelVaroquaux , regarding your recommendation of -9999, a few thoughts:

• Do we want to make this default behavior instead of raise/ValueError?
• Is there appetite for a more holistic discussion of NaN vs. NA vs. NaT vs. .... Wes faced this same set of decisions in pandas (built on top of numpy/scipy, handling noisy/realistic data of various types, Cython, performance-oriented). Would it be worth starting with a table of support, by pre-processor/estimator, for type and NA/NAN?
scikit-learn member

I think raise is a kinder default.

@GaelVaroquaux , agreed, but in the absence of any consistent(ly communicated) option for NA, many substitute NaN. My question was mostly whether we thought it a good use of collective development time to present a consistent message on how to encode "missing" observations throughout sklearn, similar to the way that pandas has a clearly communicated and (mostly) consistent strategy.

scikit-learn member

OK, I get the frustration. That said, as a practitioner relying on pandas daily, I am grateful for the fact that Wes has not taken a similarly pessimistic view. I think there are a number of places like imputation, scaling, encoding, or categorical-friendly estimators that would benefit from a similar approach.

@jnothman, re: kinder default, my opinion on this is that would be new_labels="update" instead of -9999. Any opinion?

scikit-learn member
referenced this pull request Jul 21, 2014
Closed

#### LabelBinarizer regression between 0.14.1 and 0.15.0 #3462

and others added some commits Jul 21, 2014
Update outlier_detection.rst: Add reference for One-class SVM. 0b8e63c
Added directory checking for documentation builds, and corrected for Windows pathing 62f1f57
Merge pull request #3477 from kastnerkyle/docbuild_fix ([MRG] Docbuild fixes for #3475) 8cd5d85
Merge pull request #3479 from MechCoder/improve_logcv_docs (Improve docstring of Logistic Regression CV model) 965b109
MAINT More robust windows installation script d814353
FIX unstable test on 32 bit windows f3afd4e
MAINT move skip for unstable 32bit to _check_transformer 0af2d8f
Merge pull request #3465 from pvnguyen/patch-1 (Update outlier_detection.rst) 376ac51
Adding new_labels argument to LabelEncoder 4b6978e
Adding tests for new_labels argument. d990207
Changing classes_ update strategy a69840b
Adding nan behavior, renaming to fce9fb5
Updating tests to include nan case and update name 76921e5
Fixing docstring for test-doc pass 0e39a2a
Fixing docstring for test-doc pass (for real) 1da2880
Updating doctests 926b166
Updating constructor documentation 5ef9b85
Adding specific "label" option to new_labels 4dfb4cb
Adding test for "label" option to new_labels 392e54b
Updating docstring for new_labels="label" e053635
pep8 122a98f
Autodoc fix de18372
Fixing rst docs d735ca2
Changing dtypes for new_labels d276565
Adding example for new_labels argument a01f8b0
Adding new_labels handling to fit/fit_transform 495347c
Improving difficulty of test cases with non-increasing unseen labels dee4ae0
Moving ValueError check to fit c297017
Improving difficulty for new_labels='update' test to include multiple transform with new labels f29800b
Fixing negative indexing, renamed z->out, failing approach for new_labels=update w/ searchsorted 74b7589
PEP8 3e1be5d
Removing nan option and corresponding test abf01cc
Handling repeated transform calls with new_class_mapping_, refactoring, cleaning after removing np.nan. f26a902
Rebase 0725d4c

Cleanly rebased final PR pending.

referenced this pull request Jul 24, 2014
Closed

#### [MRG-0] Make LabelEncoder more friendly to new labels #3483

Closing for PR #3483.

closed this Jul 24, 2014
referenced this pull request Aug 27, 2014
Open