
Conversation

DanielMorales9
Contributor

Reference Issues/PRs

Original discussion at #11034

What does this implement/fix? Explain your changes.

@DanielMorales9 DanielMorales9 changed the title Ensure that the OneHotEncoder outputs sparse matrix with given dtype #11034 Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 Apr 28, 2018
Member

@jnothman jnothman left a comment

Please add or modify a test

@DanielMorales9 DanielMorales9 changed the title Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 [MRG] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 May 2, 2018
Member

@glemaitre glemaitre left a comment

There are 2 calls to _transform_selected which use the default dtype. Check whether those tests run into any trouble.

@@ -1825,7 +1825,7 @@ def add_dummy_feature(X, value=1.0):
return np.hstack((np.ones((n_samples, 1)) * value, X))


-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, dtype=np.float64, selected="all", copy=True):
Member

I don't think that there is a reason to have a default dtype

@@ -1836,6 +1836,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
transform : callable
A callable transform(X) -> X_transformed

dtype : number type, default=np.float
Member

number type -> dtype, ...

Could you also change this parameter in the OneHotEncoder docstring

@@ -1872,9 +1875,9 @@ def _transform_selected(X, transform, selected="all", copy=True):
X_not_sel = X[:, ind[not_sel]]
Member

Instead of changing the dtype below, I think that you only need to call astype(dtype) on X_not_sel. The concatenation will then be done with arrays of the same dtype. You can add a small comment above the line:

The columns of X which are not transformed need to be cast to the desired dtype before concatenation. Otherwise, the stacking will cast to the higher-precision dtype.
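A minimal sketch of the promotion behavior being described, assuming `X_sel` holds the transformed (selected) columns and `X_not_sel` the untransformed ones (names as in the diff above):

```python
import numpy as np

dtype = np.float32

# Transformed (selected) columns already carry the requested dtype.
X_sel = np.ones((2, 3), dtype=dtype)
# Untransformed columns keep the input dtype, e.g. float64.
X_not_sel = np.zeros((2, 2), dtype=np.float64)

# Without the cast, hstack promotes to the higher-precision dtype.
assert np.hstack((X_sel, X_not_sel)).dtype == np.float64

# Cast the untransformed columns before concatenation; the result
# then keeps the requested dtype.
X_out = np.hstack((X_sel, X_not_sel.astype(dtype)))
assert X_out.dtype == dtype
```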

Member

Feel free to shorten the comment

@@ -132,7 +132,7 @@ def test_polynomial_features():
assert_array_almost_equal(X_poly, P2[:, [0, 1, 2, 4]])

assert_equal(interact.powers_.shape, (interact.n_output_features_,
interact.n_input_features_))
interact.n_input_features_))
Member

Please revert all the spacing changes. Even if they fix PEP8 issues, we tend not to modify unrelated parts of the code base, since it might create merge conflicts in other PRs. You can revert the other spacing changes below as well.

@@ -1987,6 +1986,48 @@ def test_one_hot_encoder_categorical_features():
_check_one_hot(X, X2, cat, 5)


def test_one_hot_encoder_mixed_input_given_type():
Member

Could you use pytest.mark.parametrize to make a single test with the different dtypes. Also use bare assert instead of assert_equal. Basically something like this:

@pytest.mark.parametrize(
    "output_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "input_dtype",
    [np.int32, np.float32, np.float64]
)
@pytest.mark.parametrize(
    "sparse",
    [True, False]
)
def test_one_hot_encoder_preserve_type(input_dtype, output_dtype, sparse):
    X = np.array([[0, 1, 0, 0], [1, 2, 0, 0]], dtype=input_dtype)
    transformer = OneHotEncoder(categorical_features=[0, 1],
                                dtype=output_dtype, sparse=sparse)
    X_trans = transformer.fit_transform(X)
    assert X_trans.dtype == output_dtype

@glemaitre
Member

@DanielMorales9 Could you address the comments?

@DanielMorales9
Contributor Author

@glemaitre sure

@DanielMorales9
Contributor Author

DanielMorales9 commented May 28, 2018

I've added the requested changes. Sorry for the delay. I am happy to contribute 😄

@glemaitre
Member

The CI is failing, can you check?

@glemaitre
Member

@DanielMorales9 I made the change regarding PEP8. I am not sure that the error regarding the kmeans was related. That's strange.

@glemaitre glemaitre changed the title [MRG] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 [MRG + 1] Ensuring that the OneHotEncoder outputs sparse matrix with given dtype #11034 Jun 4, 2018
@glemaitre
Member

@jnothman Could you have a look

def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
                                                sparse):
    X = np.array([[0, 2, 1], [1, 0, 3], [1, 0, 2]], dtype=input_dtype)
    # Test that one hot encoder raises error for unknown features
Member

This appears out of place...

@pytest.mark.parametrize("output_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("input_dtype", [np.int32, np.float32, np.float64])
@pytest.mark.parametrize("sparse", [True, False])
def test_one_hot_encoder_mixed_input_given_type(input_dtype, output_dtype,
Member

I'm tired, but it's not clear to me how this is distinct from above.

Member

Uhm, I did not see it, but it has an unnecessary test.
The only test required was: #11042 (comment)

@amueller
Member

amueller commented Jun 4, 2018

lgtm.

Member

@jnothman jnothman left a comment

I think this is good, but it might break someone's pipeline. Please add a what's new.

@jorisvandenbossche
Member

but it might break someone's pipeline.

If that might be the case, then maybe it is not worth doing? Users already have to switch to the new OneHotEncoder behaviour (assuming my PR gets merged, where dtype is already honoured when not using the legacy code), which will change this behaviour anyhow.

Anyhow, I don't care too much, and it's fine with me to merge this (it will create merge conflicts with my other PR, but the diff doesn't look that large, so that should be OK).

@jnothman
Member

jnothman commented Jun 5, 2018 via email

@jorisvandenbossche
Member

Fine for me

@jnothman
Member

jnothman commented Jun 6, 2018

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:
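An entry along these lines would fit the format of doc/whats_new/v0.20.rst (the wording here is illustrative; the issue number is the one referenced earlier in this thread):

```rst
- Fixed a bug in :class:`preprocessing.OneHotEncoder` where the ``dtype``
  parameter was not honoured for sparse output. :issue:`11042` by
  :user:`DanielMorales9`.
```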

@glemaitre glemaitre merged commit a6028fc into scikit-learn:master Jun 6, 2018
@glemaitre
Member

Thanks @DanielMorales9 !!!
I added the doc entry so that @jorisvandenbossche can go on with the OneHotEncoder, which is a release target.
