[MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 #11042
Conversation
Please add or modify a test
There are 2 calls to _transform_selected which used the default dtype. Check whether there is any trouble with those tests.
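For context, a hedged sketch of what updating such a call site might look like (the call-site shape inside OneHotEncoder is an assumption, not quoted from the PR):

```python
# Hypothetical call site inside OneHotEncoder (sketch only): forward the
# encoder's dtype instead of relying on _transform_selected's old default.
return _transform_selected(X, self._fit_transform, dtype=self.dtype,
                           selected=self.categorical_features, copy=True)
```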
sklearn/preprocessing/data.py (Outdated)

```diff
@@ -1825,7 +1825,7 @@ def add_dummy_feature(X, value=1.0):
     return np.hstack((np.ones((n_samples, 1)) * value, X))


-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, dtype=np.float64, selected="all", copy=True):
```
I don't think that there is a reason to have a default dtype
sklearn/preprocessing/data.py (Outdated)

```diff
@@ -1836,6 +1836,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
     transform : callable
         A callable transform(X) -> X_transformed

+    dtype : number type, default=np.float
```
number type -> dtype, ...
Could you also change this parameter in the OneHotEncoder docstring?
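For illustration, the corrected docstring entry could read roughly as follows (the exact wording is an assumption, not taken from the PR):

```python
dtype : dtype, default=np.float64
    Desired dtype of the output.
```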
sklearn/preprocessing/data.py (Outdated)

```diff
@@ -1872,9 +1875,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
         X_not_sel = X[:, ind[not_sel]]
```
Instead of changing the dtype below, I think that you only need to call astype(dtype) on X_not_sel. The concatenation will then be done with arrays of the same type. You can add a small comment above the line: "The columns of X which are not transformed need to be cast to the desired dtype before concatenation. Otherwise, the stacking will cast to the higher-precision dtype."
Feel free to shorten the comment
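The upcasting behaviour the reviewer refers to is standard NumPy type promotion and is easy to verify with a minimal standalone snippet (independent of the PR):

```python
import numpy as np

a = np.ones((2, 3), dtype=np.float32)
b = np.ones((2, 3), dtype=np.float64)

# Stacking arrays of mixed dtypes promotes to the higher precision.
print(np.hstack((a, b)).dtype)                     # float64

# Casting the untransformed columns first preserves the lower dtype.
print(np.hstack((a, b.astype(np.float32))).dtype)  # float32
```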
```diff
@@ -132,7 +132,7 @@ def test_polynomial_features():
     assert_array_almost_equal(X_poly, P2[:, [0, 1, 2, 4]])

     assert_equal(interact.powers_.shape, (interact.n_output_features_,
-                                          interact.n_input_features_))
+                 interact.n_input_features_))
```
Please revert all the spacing changes. Even if they fix PEP8 issues, we tend not to modify unrelated parts of the code base; it might create merge conflicts in other PRs. You can revert the other spacing changes below as well.
```diff
@@ -1987,6 +1986,48 @@ def test_one_hot_encoder_categorical_features():
     _check_one_hot(X, X2, cat, 5)


+def test_one_hot_encoder_mixed_input_given_type():
```
Could you use pytest.mark.parametrize to make a single test with different dtypes? Also use a bare assert instead of assert_equal. Basically something like this:
```python
@pytest.mark.parametrize("output_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("input_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("sparse", [True, False])
def test_one_hot_encoder_preserve_type(input_dtype, output_dtype, sparse):
    X = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=input_dtype)
    transformer = OneHotEncoder(categorical_features=[0, 1],
                                dtype=output_dtype, sparse=sparse)
    X_trans = transformer.fit_transform(X)
    assert X_trans.dtype == output_dtype
```
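Note that the three parametrize decorators expand into 3 × 3 × 2 = 18 independent test cases, so every combination of input dtype, output dtype, and sparse/dense output gets covered by this single test function.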
@DanielMorales9 Could you address the comments?
@glemaitre Sure.
I've added the requested changes. Sorry for the delay. I am happy to contribute 😄
The CI is failing, can you check?
@DanielMorales9 I made the change regarding the PEP8 issues. I am not sure that the error regarding kmeans was related. This is strange.
@jnothman Could you have a look?
```python
def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
                                                sparse):
    X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]], dtype=input_dtype)
    # Test that one hot encoder raises error for unknown features
```
This appears out of place...
```python
@pytest.mark.parametrize("output_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("input_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("sparse", [True, False])
def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
```
I'm tired, but it's not clear to me how this is distinct from above.
Uhm, I did not see it, but it has an unnecessary test.
The only test required was: #11042 (comment)
lgtm.
I think this is good, but it might break someone's pipeline. Please add a what's new entry.
If that might be the case, then maybe it is not worth doing? Users already have to switch to the new OneHotEncoder behaviour (assuming my PR gets merged, where dtype is already honoured when not using the legacy code), which will change this behaviour anyhow. That said, I don't care too much, and it's fine for me to merge this (it will give merge conflicts with my other PR, but the diff doesn't look that large, so that should be OK).
I think I'd rather merge than not.
Fine for me.
Please add an entry to the change log at
Thanks @DanielMorales9 !!!
Reference Issues/PRs
Original discussion at #11034
What does this implement/fix? Explain your changes.