
FEA allow unknowns in OrdinalEncoder transform #17406

Merged
20 commits merged into scikit-learn:master on Aug 5, 2020

Conversation


@FelixWick (Contributor) commented Jun 1, 2020

Sometimes it can be convenient to allow values (categories) in the transform of OrdinalEncoder that were not present in the fit data set. For example, a machine learning method that can set such an unknown sample of the corresponding feature to a neutral (i.e. non-informative) value could use the information from all the sample's other features and still output a prediction (instead of no prediction at all).
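The behaviour proposed here can be sketched in plain NumPy (the ordinal_encode helper below is hypothetical, not sklearn's implementation; in the merged API this corresponds to OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=...)):

```python
import numpy as np

# Minimal sketch of the proposed behaviour: categories seen in fit are
# encoded as 0..n-1; values never seen in fit map to unknown_value.
def ordinal_encode(column, categories, unknown_value=-1):
    lookup = {cat: code for code, cat in enumerate(categories)}
    return np.array([lookup.get(v, unknown_value) for v in column])

categories = ['a', 'b', 'c']   # learned during fit
encoded = ordinal_encode(['b', 'z', 'a'], categories, unknown_value=-1)
# 'z' was never seen during fit, so it receives the unknown_value
```

A downstream model can then treat the unknown code as a neutral value instead of failing on the whole sample.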

Closes #13488
Closes #15108
Closes #16959
Closes #14534
Closes #12045
Closes #13897

@FelixWick (Author)

Lots of discussion about this in here: #16959

@cmarmo (Member) commented Jun 17, 2020

Hi @scikit-learn/core-devs, this discussion was addressed during the last core-dev meeting.
I'm commenting here to draw some attention to this PR, as @FelixWick kindly proposes a simplified version that could perhaps help move things forward. Do you mind having a look before the next meeting? Thanks!

@FelixWick (Author)

@thomasjpfan
Please see changes as discussed in #16959.

@thomasjpfan (Member) left a comment


Thank you for the PR @FelixWick !

@@ -621,6 +622,18 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

handle_unknown : 'error' or 'use_encoded_value', default='error'
Member:

Suggested change
handle_unknown : 'error' or 'use_encoded_value', default='error'
handle_unknown : {'error', 'use_encoded_value}', default='error'

When set to 'error' an error will be raised in case an unknown
categorical feature is present during transform. When set to
'use_encoded_value', the encoded value of unknown categories will be
set to the value given for the parameter unknown_value. In
Member:

Suggested change
set to the value given for the parameter unknown_value. In
set to the value given for the parameter `unknown_value`. In

raise TypeError("Please set unknown_value to an integer "
"value.")
Member:

We can have both checks raise the same error (may need to wrap)

Suggested change
raise TypeError("Please set unknown_value to an integer "
"value.")
raise TypeError(f"Set unknown_value to an integer, got {self.unknown_value}")
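The type check under discussion can be reproduced standalone (the check_unknown_value helper name is hypothetical, not sklearn's actual code):

```python
import numbers

# Sketch of the fit-time type check: unknown_value must be an integer
# (numbers.Integral also accepts NumPy integer scalars) when
# handle_unknown='use_encoded_value'.
def check_unknown_value(unknown_value):
    if not isinstance(unknown_value, numbers.Integral):
        raise TypeError(
            f"unknown_value should be an integer, got {unknown_value}")
    return unknown_value

check_unknown_value(-999)        # fine: a plain int
try:
    check_unknown_value("oops")  # not an integer: should raise
    type_error_raised = False
except TypeError:
    type_error_raised = True
```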

if 0 <= self.unknown_value < len(self.categories_[i]):
raise ValueError(f"The used value for unknown_value "
f"{self.unknown_value} is one of the "
f"values already used for encoding the "
f"seen categories.")
Member:

This validation check should be in fit after the self._fit call.
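The range check quoted above can be sketched as a standalone helper (hypothetical name; codes assigned to seen categories are 0..n_categories-1, so unknown_value must fall outside that range for every feature):

```python
# Sketch of the collision check performed in fit, after categories_ is known.
def check_no_collision(unknown_value, categories_):
    for cats in categories_:
        if 0 <= unknown_value < len(cats):
            raise ValueError(
                f"unknown_value {unknown_value} is already used for "
                f"encoding one of the seen categories")

categories_ = [['a', 'b', 'c'], ['x', 'y']]
check_no_collision(-1, categories_)   # valid: outside 0..2 and 0..1
try:
    check_no_collision(1, categories_)  # collides with an existing code
    collision_raised = False
except ValueError:
    collision_raised = True
```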

# set unknown values to None
if self.handle_unknown == 'use_encoded_value':
X_tr[:, i] = np.where(
labels == self.unknown_value, None,
Member:

We can set unknown_labels = labels == self.unknown_value so the mask does not need to be computed twice.

@@ -553,6 +553,42 @@ def test_ordinal_encoder_raise_missing(X):
ohe.transform(X)


def test_ordinal_encoder_handle_unknowns():
Member:

Let's add separate test function for encoding numerical values.

@@ -731,6 +765,14 @@ def inverse_transform(self, X):

for i in range(n_features):
labels = X[:, i].astype('int64', copy=False)
X_tr[:, i] = self.categories_[i][labels]
# set unknown values to None
unknown_labels = labels == self.unknown_value
Member:

This can be moved into the if statement, since it is not needed when self.handle_unknown='error'.

self.categories_[i][np.where(
unknown_labels, 0, labels)])
Member:

Hmm do you think the following is clearer:

unknown_labels = labels == self.unknown_value
X_tr[:, i] = self.categories_[i][np.where(unknown_labels, 0, labels)]
X_tr[unknown_labels, i] = None

It might be slightly better memory wise as well.
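The suggested pattern can be reproduced as a standalone sketch (toy data, not sklearn's internals): index with a safe placeholder where the label is unknown, then overwrite those positions with None.

```python
import numpy as np

categories = np.array(['a', 'b', 'c'], dtype=object)
labels = np.array([1, -999, 0])   # -999 plays the role of unknown_value

unknown_labels = labels == -999
# 0 is a safe placeholder index; those entries are overwritten below.
decoded = categories[np.where(unknown_labels, 0, labels)]
decoded[unknown_labels] = None    # unknowns become None in inverse_transform
```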

Author:

Not sure about memory, but yes, your suggestion is clearer.

def test_ordinal_encoder_handle_unknowns_numeric():
enc = OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-999)
X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=object)
X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=object)
Member:

Suggested change
def test_ordinal_encoder_handle_unknowns_numeric():
enc = OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-999)
X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=object)
X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=object)
@pytest.mark.parametrize('dtype', [float, int])
def test_ordinal_encoder_handle_unknowns_numeric(dtype):
enc = OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-999)
X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=dtype)
X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=dtype)

(I know it is a little strange to encode floats and ints, but these dtypes are supported in OrdinalEncoder)

Author:

Nice catch. In fact, I forgot an astype in the inverse transform.

unknown_labels = labels == self.unknown_value
X_tr[:, i] = self.categories_[i][np.where(
unknown_labels, 0, labels)]
X_tr = X_tr.astype(object, copy=False)
Member:

Is there a way to have only one call to astype(object), similar to:

# if ignored are found: potentially need to upcast result to
# insert None values
if found_unknown:
if X_tr.dtype != object:
X_tr = X_tr.astype(object)
for idx, mask in found_unknown.items():
X_tr[mask, idx] = None

In that case the unknown masks are stored in a dictionary and the None values are placed in X_tr after the fact.
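The quoted pattern can be run standalone on toy data (found_unknown maps a column index to a boolean mask of the rows that were unknown in that column):

```python
import numpy as np

X_tr = np.array([['a', 'x'], ['b', 'y']])
found_unknown = {0: np.array([False, True])}   # column 0, row 1 was unknown

# Upcast to object once, only if any unknowns were found, then place the Nones.
if found_unknown:
    if X_tr.dtype != object:
        X_tr = X_tr.astype(object)
    for idx, mask in found_unknown.items():
        X_tr[mask, idx] = None
```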

Author:

Yeah, that might be faster.

@jnothman (Member) left a comment

Otherwise LGTM!

sklearn/preprocessing/_encoders.py: 3 resolved review comments (outdated)
@@ -553,6 +553,59 @@ def test_ordinal_encoder_raise_missing(X):
ohe.transform(X)


def test_ordinal_encoder_handle_unknowns_string():
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-2)
Member:

we should test that unknown_value=0 or 1 is also valid (but not for inverse_transform).

Author:

Not sure what you mean. Using unknown_value=0 or 1 would usually lead to an error during fit, because these are values already used for encoding the seen categories.

Member:

Why would that be an error? What's wrong with the user wanting to conflate unknowns with a specific known category? I had thought we would want to handle that case. If we won't allow that case then we should certainly test that an error is raised if it is attempted.

Author:

There is a test already for checking correct raise in case of using an unknown_value that is already used for encoding one of the categories in the fit.

Author:

And nothing wrong with conflating a specific known category with unknowns. I think it is just a matter of taste whether we want to allow conflating or not. However, I would suggest allowing it only in mode use_category (which we said we would include in a later version to keep this PR small, see #16959), not in use_encoded_value, because it would be clearer to explicitly state the category to be conflated. But I am fine with allowing it here for use_encoded_value as well, if you prefer that (it just means deleting a few lines of code ;)).

Member:

TBH, I don't mind whether we allow or disallow it at this point, but we need to explicitly do one or the other, and test it.

Author:

Agreed. Please let me know if the test I added for this (last check in test_ordinal_encoder_handle_unknowns_raise) is not sufficient.

Member:

@thomasjpfan can you confirm whether it was your expectation to allow/disallow unknown_value=0?

Member:

I'm personally fine with being conservative and not allowing this for now. We can allow it later without breaking backward compatibility.

The test is good

@jnothman (Member) commented Aug 2, 2020

Please add an entry to the change log at doc/whats_new/v0.24.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

@FelixWick FelixWick changed the title [WIP] allow unknowns in OrdinalEncoder transform (simplified) [MRG] allow unknowns in OrdinalEncoder transform (simplified) Aug 3, 2020
@jnothman (Member) left a comment

Sorry, a few more things...

enc.fit(X)

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=1)
msg = ("The used value for unknown_value 1 is one of the values already "
Member:

Suggested change
msg = ("The used value for unknown_value 1 is one of the values already "
msg = ("The used value for unknown_value (1) is one of the values already "

I think this is a bit less ambiguous

When the parameter handle_unknown is set to 'use_encoded_value', this
parameter is required and will set the encoded value of unknown
categories.

Member:

please use .. versionadded::

'use_encoded_value', the encoded value of unknown categories will be
set to the value given for the parameter `unknown_value`. In
:meth:`inverse_transform`, an unknown category will be denoted as None.

Member:

please use .. versionadded

When the parameter handle_unknown is set to 'use_encoded_value', this
parameter is required and will set the encoded value of unknown
categories.

Attributes
----------
categories_ : list of arrays
Member:

This says that categories_ will correspond with the output of transform. I suspect this is no longer true. Is there a way we can make them line up?

Author:

Correct, in case of unknown categories there will be outputs of transform that are not in categories_, namely unknown_value. Are you suggesting a code change or just another description here?

Member:

I'm not sure how to change the code here. But the description, yes.

unknown_value : int, default=None
When the parameter handle_unknown is set to 'use_encoded_value', this
parameter is required and will set the encoded value of unknown
categories.
Member:

please specify that unknown_value should be distinct from the value used to encode any categories.


@ogrisel (Member) left a comment

LGTM (besides @jnothman's comments above). Thank you very much @FelixWick.

@ogrisel (Member) commented Aug 4, 2020

@thomasjpfan can you confirm whether it was your expectation to allow/disallow unknown_value=0?

I am not @thomasjfox :) but I wouldn't mind either option. We can always allow later if we want without breaking backward compat while the opposite would require a deprecation cycle.

I agree that the future use_category option (in the input space) is more natural to map unknown values to known categories observed on the training set.

@thomasjfox (Contributor)

@thomasjpfan can you confirm whether it was your expectation to allow/disallow unknown_value=0?

I am not @thomasjfox :) but I wouldn't mind either option.

I'm thomasjfox, but I'm not @thomasjpfan :) I wish I wouldn't often get mentioned on scikit-learn ^^

@ogrisel (Member) commented Aug 4, 2020

Sorry @thomasjfox, github autocomplete fail...

@NicolasHug (Member) left a comment

Thanks a lot @FelixWick for your consistent work!

Made some minor comments but LGTM overall.

When the parameter handle_unknown is set to 'use_encoded_value', this
parameter is required and will set the encoded value of unknown
categories. It has to be distinct from the values used to encode any of
the categories in the fit.
Member:

Suggested change
the categories in the fit.
the categories in `fit`.

Attributes
----------
categories_ : list of arrays
The categories of each feature determined during fitting
(in order of the features in X and corresponding with the output
of ``transform``).
of ``transform`` except for categories not seen during ``fit``).
Member:

I feel like adding a new sentence instead might be clearer, e.g.:

This does not include categories that weren't seen during fit.

@@ -678,8 +699,21 @@ def fit(self, X, y=None):
-------
self
"""
if self.handle_unknown == 'use_encoded_value':
if not isinstance(self.unknown_value, numbers.Integral):
raise TypeError(f"unknown_value should be an integer, got "
Member:

Suggested change
raise TypeError(f"unknown_value should be an integer, got "
raise TypeError(f"unknown_value should be an integer when handle_unknown is 'use_encoded_value', got "

sklearn/preprocessing/_encoders.py: 1 resolved review comment
for i in range(len(self.categories_)):
if 0 <= self.unknown_value < len(self.categories_[i]):
Member:

Couldn't we just use (without using range and len)

for feature_cats in self.categories_:


for i in range(len(self.categories_)):
X_int[~X_mask[:, i], i] = self.unknown_value
Member:

do we need to loop over features here? Can't we just do X_int[~X_mask] = self.unknown_value?
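A quick sketch confirming the vectorized form behaves as intended (toy shapes, not sklearn's internals):

```python
import numpy as np

X_int = np.array([[0, 1],
                  [2, 0],
                  [1, 2]])
X_mask = np.array([[True,  True],
                   [False, True],
                   [True,  False]])   # False marks an unknown entry

# One boolean-mask assignment replaces the per-feature loop: every
# position where the mask is False is set to the unknown code.
X_int[~X_mask] = -999
```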

if X_tr.dtype != object:
X_tr = X_tr.astype(object, copy=False)
Member:

Can we directly convert without checking if X_tr.dtype != object:?

X_tr[:, i] = self.categories_[i][np.where(
unknown_labels, 0, labels)]
found_unknown[i] = unknown_labels
Member:

Can we just do X_tr[:, i] = self.categories_[i][labels] as in else: case?
Since these would be overridden later anyway?

Author:

Unfortunately not, as this results in an out of bounds error.
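A short sketch of why the direct indexing fails (toy data): an out-of-range label such as -999 is not a valid index into the categories array, so NumPy raises IndexError, which is why the placeholder in np.where is needed.

```python
import numpy as np

categories = np.array(['a', 'b', 'c'], dtype=object)
labels = np.array([1, -999, 0])      # -999 encodes an unknown value

try:
    categories[labels]               # -999 is out of bounds for size 3
    index_error_raised = False
except IndexError:
    index_error_raised = True
```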

- |Enhancement| Add value ``use_encoded_value`` for ``handle_unknown``
parameter and add ``unknown_value`` parameter to
Member:

Since the whole handle_unknown parameter is new, maybe this could be something like

Add a new handle_unknown parameter with a use_encoded_value option, along with a new unknown_value parameter to...

@NicolasHug (Member) left a comment

Thanks @FelixWick , LGTM! The test failure seems unrelated (codecov upload fail)

else:
if self.unknown_value is not None:
Member:

nit: we could just use elif

@jnothman (Member) left a comment

Let's ship it :)

Thanks @FelixWick

@jnothman jnothman changed the title [MRG] allow unknowns in OrdinalEncoder transform (simplified) FEA allow unknowns in OrdinalEncoder transform Aug 5, 2020
@jnothman jnothman merged commit 8a213df into scikit-learn:master Aug 5, 2020
7 checks passed
@@ -621,12 +622,29 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

handle_unknown : {'error', 'use_encoded_value}', default='error'
Member:

Typo:

handle_unknown : {'error', 'use_encoded_value'}, default='error'

Member:

thx, fixed ;)

@NicolasHug NicolasHug mentioned this pull request Aug 7, 2020
@Sandy4321:

Can you clarify in which scikit-learn version it will be available? Meanwhile we use jdraines/cardinal_encoder, which implements a scikit-learn CardinalEncoder that differs from OrdinalEncoder in that it handles unknowns.

@jnothman (Member) commented Aug 15, 2020 via email

@Sandy4321:

Great, thanks. And in what version will unknowns (meaning values not seen in the training data) be available in predictions for categorical naive Bayes? We have been waiting for it a long time.


Successfully merging this pull request may close these issues.

Handle Error Policy in OrdinalEncoder
9 participants