FEA allow unknowns in OrdinalEncoder transform #17406
Conversation
Lots of discussion about this in here: #16959
Hi @scikit-learn/core-devs, this discussion was addressed during the last core-dev meeting.
@thomasjpfan |
Thank you for the PR @FelixWick !
sklearn/preprocessing/_encoders.py
Outdated
@@ -621,6 +622,18 @@ class OrdinalEncoder(_BaseEncoder):
    dtype : number type, default np.float64
        Desired dtype of output.

    handle_unknown : 'error' or 'use_encoded_value', default='error'
Suggested change:
-    handle_unknown : 'error' or 'use_encoded_value', default='error'
+    handle_unknown : {'error', 'use_encoded_value}', default='error'
sklearn/preprocessing/_encoders.py
Outdated
        When set to 'error' an error will be raised in case an unknown
        categorical feature is present during transform. When set to
        'use_encoded_value', the encoded value of unknown categories will be
        set to the value given for the parameter unknown_value. In
Suggested change:
-        set to the value given for the parameter unknown_value. In
+        set to the value given for the parameter `unknown_value`. In
sklearn/preprocessing/_encoders.py
Outdated
                raise TypeError("Please set unknown_value to an integer "
                                "value.")
We can have both checks raise the same error (may need to wrap)
Suggested change:
-                raise TypeError("Please set unknown_value to an integer "
-                                "value.")
+                raise TypeError(f"Set unknown_value to an integer, got {self.unknown_value}")
sklearn/preprocessing/_encoders.py
Outdated
            if 0 <= self.unknown_value < len(self.categories_[i]):
                raise ValueError(f"The used value for unknown_value "
                                 f"{self.unknown_value} is one of the "
                                 f"values already used for encoding the "
                                 f"seen categories.")
This validation check should be in fit, after the self._fit call.
sklearn/preprocessing/_encoders.py
Outdated
            # set unknown values to None
            if self.handle_unknown == 'use_encoded_value':
                X_tr[:, i] = np.where(
                    labels == self.unknown_value, None,
We can set unknown_labels = labels == self.unknown_value so the mask does not need to be computed twice.
@@ -553,6 +553,42 @@ def test_ordinal_encoder_raise_missing(X):
        ohe.transform(X)


def test_ordinal_encoder_handle_unknowns():
Let's add a separate test function for encoding numerical values.
sklearn/preprocessing/_encoders.py
Outdated
@@ -731,6 +765,14 @@ def inverse_transform(self, X):

        for i in range(n_features):
            labels = X[:, i].astype('int64', copy=False)
            X_tr[:, i] = self.categories_[i][labels]
            # set unknown values to None
            unknown_labels = labels == self.unknown_value
This can be moved into the if statement, since it is not needed when self.handle_unknown='error'.
sklearn/preprocessing/_encoders.py
Outdated
                self.categories_[i][np.where(
                    unknown_labels, 0, labels)])
Hmm, do you think the following is clearer:

unknown_labels = labels == self.unknown_value
X_tr[:, i] = self.categories_[i][np.where(unknown_labels, 0, labels)]
X_tr[unknown_labels, i] = None

It might be slightly better memory-wise as well.
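A small standalone sketch of this masking pattern in plain NumPy, with hypothetical categories and labels standing in for the encoder's internal state (here -1 plays the role of unknown_value):

```python
import numpy as np

# Hypothetical fitted categories for one feature and its encoded labels;
# -1 marks an unknown category (stand-in for unknown_value).
categories = np.array(['apple', 'banana', 'cherry'], dtype=object)
labels = np.array([0, 2, -1, 1])
unknown_value = -1

# Compute the mask once, index safely using 0 as a placeholder for the
# unknown positions, then overwrite those positions with None.
unknown_labels = labels == unknown_value
X_tr = np.empty(labels.shape, dtype=object)
X_tr[:] = categories[np.where(unknown_labels, 0, labels)]
X_tr[unknown_labels] = None
```

After this, X_tr holds ['apple', 'cherry', None, 'banana']: the mask is computed once and reused for both the safe indexing and the None overwrite.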
Not sure about memory, but yes, your suggestion is clearer.
def test_ordinal_encoder_handle_unknowns_numeric():
    enc = OrdinalEncoder(handle_unknown='use_encoded_value',
                         unknown_value=-999)
    X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=object)
    X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=object)
Suggested change:
-def test_ordinal_encoder_handle_unknowns_numeric():
-    enc = OrdinalEncoder(handle_unknown='use_encoded_value',
-                         unknown_value=-999)
-    X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=object)
-    X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=object)
+@pytest.mark.parametrize('dtype', [float, int])
+def test_ordinal_encoder_handle_unknowns_numeric(dtype):
+    enc = OrdinalEncoder(handle_unknown='use_encoded_value',
+                         unknown_value=-999)
+    X_fit = np.array([[1, 7], [2, 8], [3, 9]], dtype=dtype)
+    X_trans = np.array([[3, 12], [23, 8], [1, 7]], dtype=dtype)
(I know it is a little strange to encode floats and ints, but these dtypes are supported in OrdinalEncoder.)
Nice catch. In fact, I forgot an astype in the inverse transform.
sklearn/preprocessing/_encoders.py
Outdated
            unknown_labels = labels == self.unknown_value
            X_tr[:, i] = self.categories_[i][np.where(
                unknown_labels, 0, labels)]
            X_tr = X_tr.astype(object, copy=False)
Is there a way to have only one astype(object) call, similar to scikit-learn/sklearn/preprocessing/_encoders.py, lines 549 to 556 in ae0d535:
# if ignored are found: potentially need to upcast result to
# insert None values
if found_unknown:
    if X_tr.dtype != object:
        X_tr = X_tr.astype(object)
    for idx, mask in found_unknown.items():
        X_tr[mask, idx] = None
In that case the unknown masks are stored in a dictionary and the None values are placed in X_tr after the fact.
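The referenced pattern, sketched standalone (NumPy only; the feature values and the contents of found_unknown here are made up for illustration):

```python
import numpy as np

# Inverse-transformed values for two features; feature 1 had an unknown
# at row 0, recorded in a dict mapping feature index -> boolean mask.
X_tr = np.array([[1.0, 7.0], [2.0, 8.0]])
found_unknown = {1: np.array([True, False])}

# Upcast to object at most once, and only if any unknowns were found,
# then place the None markers after the fact.
if found_unknown:
    if X_tr.dtype != object:
        X_tr = X_tr.astype(object)
    for idx, mask in found_unknown.items():
        X_tr[mask, idx] = None
```

The benefit of collecting the masks first is that the object upcast (which copies the array) happens a single time, rather than once per feature.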
Yeah, that might be faster.
Otherwise LGTM!
@@ -553,6 +553,59 @@ def test_ordinal_encoder_raise_missing(X):
        ohe.transform(X)


def test_ordinal_encoder_handle_unknowns_string():
    enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-2)
We should test that unknown_value=0 or 1 is also valid (but not for inverse_transform).
Not sure what you mean. Using unknown_value=0 or 1 would usually lead to an error during fit, because these are values already used for encoding the seen categories.
Why would that be an error? What's wrong with the user wanting to conflate unknowns with a specific known category? I had thought we would want to handle that case. If we won't allow that case then we should certainly test that an error is raised if it is attempted.
There is already a test checking that an error is correctly raised when using an unknown_value that is already used for encoding one of the categories in fit.
And nothing wrong with conflating a specific known category with unknowns. I think it is just a matter of taste whether we want to allow conflating or not. However, I would suggest allowing it only in mode use_category (which we said we would include in a later version to keep it small for now, see #16959), not in use_encoded_value, because it would be clearer to explicitly state the category to be conflated. But I am fine to allow it here for use_encoded_value as well, if you prefer that (it just means deleting a few lines of code ;)).
TBH, I don't mind whether we allow or disallow it at this point, but we need to explicitly do one or the other, and test it.
Agreed. Please let me know if the test I added for this (the last check in test_ordinal_encoder_handle_unknowns_raise) is not sufficient.
@thomasjpfan can you confirm whether it was your expectation to allow/disallow unknown_value=0?
I'm personally fine with being conservative and not allowing this for now. We can allow it later without breaking backward compatibility.
The test is good
Please add an entry to the change log at
Sorry, a few more things...
    enc.fit(X)

    enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=1)
    msg = ("The used value for unknown_value 1 is one of the values already "
Suggested change:
-    msg = ("The used value for unknown_value 1 is one of the values already "
+    msg = ("The used value for unknown_value (1) is one of the values already "
I think this is a bit less ambiguous
        When the parameter handle_unknown is set to 'use_encoded_value', this
        parameter is required and will set the encoded value of unknown
        categories.

please use .. versionadded::
        'use_encoded_value', the encoded value of unknown categories will be
        set to the value given for the parameter `unknown_value`. In
        :meth:`inverse_transform`, an unknown category will be denoted as None.

please use .. versionadded
        When the parameter handle_unknown is set to 'use_encoded_value', this
        parameter is required and will set the encoded value of unknown
        categories.

    Attributes
    ----------
    categories_ : list of arrays
This says that categories_ will correspond with the output of transform. I suspect this is no longer true. Is there a way we can make them line up?
Correct, in case of unknown categories there will be outputs of transform that are not in categories_, namely unknown_value. Are you suggesting a code change or just another description here?
I'm not sure how to change the code here. But the description, yes.
sklearn/preprocessing/_encoders.py
Outdated
    unknown_value : int, default=None
        When the parameter handle_unknown is set to 'use_encoded_value', this
        parameter is required and will set the encoded value of unknown
        categories.
please specify that unknown_value should be distinct from the value used to encode any categories.
LGTM (besides @jnothman's comments above). Thank you very much @FelixWick.
I am not @thomasjfox :) but I wouldn't mind either option. We can always allow later if we want without breaking backward compat, while the opposite would require a deprecation cycle. I agree that the future
I'm thomasjfox, but I'm not @thomasjpfan :) I wish I wouldn't often get mentioned on scikit-learn ^^
Sorry @thomasjfox, github autocomplete fail...
Thanks a lot @FelixWick for your consistent work!
Made some minor comments but LGTM overall.
sklearn/preprocessing/_encoders.py
Outdated
        When the parameter handle_unknown is set to 'use_encoded_value', this
        parameter is required and will set the encoded value of unknown
        categories. It has to be distinct from the values used to encode any of
        the categories in the fit.
Suggested change:
-        the categories in the fit.
+        the categories in `fit`.
sklearn/preprocessing/_encoders.py
Outdated
    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting
        (in order of the features in X and corresponding with the output
-        of ``transform``).
+        of ``transform`` except for categories not seen during ``fit``).
I feel like adding a new sentence instead might be clearer, e.g.: "This does not include categories that weren't seen during fit."
sklearn/preprocessing/_encoders.py
Outdated
@@ -678,8 +699,21 @@ def fit(self, X, y=None):
        -------
        self
        """
        if self.handle_unknown == 'use_encoded_value':
            if not isinstance(self.unknown_value, numbers.Integral):
                raise TypeError(f"unknown_value should be an integer, got "
Suggested change:
-                raise TypeError(f"unknown_value should be an integer, got "
+                raise TypeError(f"unknown_value should be an integer when `handle_unknown is 'use_encoded_value'`, got "
sklearn/preprocessing/_encoders.py
Outdated
            for i in range(len(self.categories_)):
                if 0 <= self.unknown_value < len(self.categories_[i]):
Couldn't we just use the following (without range and len)?

for feature_cats in self.categories_:
sklearn/preprocessing/_encoders.py
Outdated
            for i in range(len(self.categories_)):
                X_int[~X_mask[:, i], i] = self.unknown_value
Do we need to loop over features here? Can't we just do X_int[~X_mask] = self.unknown_value?
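The suggested vectorized form should be equivalent to the per-feature loop; a quick sketch with made-up data:

```python
import numpy as np

X_int = np.array([[0, 1], [2, 0], [1, 2]])  # encoded category indices
# False marks an unknown category at that position.
X_mask = np.array([[True, False], [True, True], [False, True]])
unknown_value = -999

# Per-feature loop, as in the current diff.
looped = X_int.copy()
for i in range(X_int.shape[1]):
    looped[~X_mask[:, i], i] = unknown_value

# Vectorized: one boolean-mask assignment over the whole 2D array.
vectorized = X_int.copy()
vectorized[~X_mask] = unknown_value

assert (looped == vectorized).all()
```

A 2D boolean mask selects exactly the same positions as the column-by-column masks, so the single assignment replaces the loop.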
sklearn/preprocessing/_encoders.py
Outdated
        if X_tr.dtype != object:
            X_tr = X_tr.astype(object, copy=False)
Can we directly convert without checking if X_tr.dtype != object?
                X_tr[:, i] = self.categories_[i][np.where(
                    unknown_labels, 0, labels)]
                found_unknown[i] = unknown_labels
Can we just do X_tr[:, i] = self.categories_[i][labels] as in the else: case? Since these would be overridden later anyway?
Unfortunately not, as this results in an out of bounds error.
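To illustrate why (hypothetical values; -999 plays the role of unknown_value): fancy indexing the categories array with the sentinel label itself can fall outside its bounds, whereas replacing the sentinel with a placeholder index first keeps every index valid.

```python
import numpy as np

categories = np.array([1, 2, 3])
labels = np.array([0, -999, 2])  # -999 marks an unknown category

# Direct indexing fails: -999 is not a valid index into a length-3 array.
raised = False
try:
    categories[labels]
except IndexError:
    raised = True

# Substituting a placeholder index (0) for unknown labels works; the
# placeholder entries are overwritten with None afterwards anyway.
unknown_labels = labels == -999
safe = categories[np.where(unknown_labels, 0, labels)]
```

Here safe comes out as [1, 1, 3], and the masked position is later replaced with None, so the placeholder value never reaches the output.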
doc/whats_new/v0.24.rst
Outdated
- |Enhancement| Add value ``use_encoded_value`` for ``handle_unknown``
  parameter and add ``unknown_value`` parameter to
Since the whole handle_unknown parameter is new, maybe this could be something like:

Add a new handle_unknown parameter with a use_encoded_value option, along with a new unknown_value parameter, to...
Thanks @FelixWick , LGTM! The test failure seems unrelated (codecov upload fail)
sklearn/preprocessing/_encoders.py
Outdated
        else:
            if self.unknown_value is not None:
nit: we could just use elif
Let's ship it :)
Thanks @FelixWick
@@ -621,12 +622,29 @@ class OrdinalEncoder(_BaseEncoder):
    dtype : number type, default np.float64
        Desired dtype of output.

    handle_unknown : {'error', 'use_encoded_value}', default='error'
Typo:
handle_unknown : {'error', 'use_encoded_value'}, default='error'
thx, fixed ;)
Can you clarify in which scikit-learn version it will be available?
This will be available in v0.24
Great, thanks!
Sometimes it can be convenient to allow values (categories) in the transform of OrdinalEncoder that were not present in the fit data set. For example, a machine learning method that can set such an unknown sample of the corresponding feature to a neutral (i.e. non-informative) value could use the information from all other features of the sample and still output a prediction (instead of no prediction at all).
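With this PR merged (scikit-learn 0.24+), the new behavior can be exercised roughly like this; the categories and the choice of unknown_value=-1 are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Fit on known categories, then transform data containing an unseen one.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(np.array([['cat'], ['dog'], ['fish']]))

X = np.array([['dog'], ['snake']])  # 'snake' was never seen during fit
encoded = enc.transform(X)          # 'snake' becomes -1 instead of raising

# In inverse_transform, the unknown encoding maps back to None.
decoded = enc.inverse_transform(encoded)
```

Here encoded is [[1.], [-1.]] (categories are sorted, so 'dog' encodes to 1) and decoded is [['dog'], [None]], matching the documented behavior of use_encoded_value.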
Closes #13488
Closes #15108
Closes #16959
Closes #14534
Closes #12045
Closes #13897