
[MRG] DEP remove legacy mode from OneHotEncoder #13855

Merged
merged 8 commits into from May 29, 2019

5 participants
@glemaitre (Contributor) commented May 10, 2019

Remove the legacy code from the OneHotEncoder. Since there are some old tests, I removed and merged some of them without changing what they test.

@glemaitre glemaitre referenced this pull request May 10, 2019

Closed

Remove deprecated functions and classes 0.22 #13797

28 of 28 tasks complete
@glemaitre (Contributor, Author) commented May 10, 2019

@jorisvandenbossche could you have a look once this is MRG? You can already check the changes within the OHE source code. I am fixing the tests.

@jorisvandenbossche (Member) left a comment

You're faster than me :-)

The changes in the OneHotEncoder itself look good, didn't look at the tests yet.

Review comment on sklearn/preprocessing/_encoders.py (resolved, outdated)
@glemaitre (Contributor, Author) commented May 10, 2019

I have one failing test in the common tests: check_fit_idempotent. The issue is that it provides continuous randn data as input, which does not suit the OneHotEncoder. Any idea of the best way to skip this test: (i) add a tag, or (ii) skip it within the check? I would be more in favor of (i).
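For illustration, option (i) could look roughly like the sketch below. This mimics the idea of an estimator tag that a common-test runner consults before feeding continuous data; the `_more_tags` name and the `X_types` key are assumptions here, not the exact scikit-learn tag API of the time.

```python
# Hypothetical sketch of option (i): skipping a common check via an
# estimator tag. Names are illustrative, not the real scikit-learn API.

class MockOneHotEncoder:
    def _more_tags(self):
        # Declare that this estimator only handles categorical input, so
        # checks that feed continuous randn data know to skip it.
        return {"X_types": ["categorical"]}


def check_fit_idempotent(name, estimator):
    tags = getattr(estimator, "_more_tags", lambda: {})()
    if "categorical" in tags.get("X_types", []):
        # Continuous random data would not be valid input; skip the check.
        return "skipped"
    return "ran"


print(check_fit_idempotent("MockOneHotEncoder", MockOneHotEncoder()))
# skipped
```

The advantage over option (ii) is that the skip logic lives on the estimator rather than being special-cased inside each check.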

glemaitre added some commits May 10, 2019

@glemaitre glemaitre changed the title [WIP] DEP remove legacy mode from OneHotEncoder [MRG] DEP remove legacy mode from OneHotEncoder May 10, 2019

@NicolasHug (Contributor) commented May 10, 2019

OneHotEncoder is supposed to work with continuous data too (or at least with floats).

I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().
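For context, `handle_unknown='ignore'` makes the encoder emit an all-zero row for categories unseen at fit time instead of raising an error. A minimal pure-Python sketch of that behaviour for a single feature (not the real implementation):

```python
# Minimal sketch of handle_unknown='ignore' semantics for one feature.
# This is not scikit-learn's implementation; it only illustrates the rule.

def fit_categories(column):
    """Collect the sorted set of categories seen at fit time."""
    return sorted(set(column))

def one_hot_transform(column, categories, handle_unknown="error"):
    rows = []
    for value in column:
        if value in categories:
            row = [1 if value == cat else 0 for cat in categories]
        elif handle_unknown == "ignore":
            row = [0] * len(categories)  # unknown category -> all zeros
        else:
            raise ValueError(f"Found unknown category: {value!r}")
        rows.append(row)
    return rows

cats = fit_categories(["a", "b"])
print(one_hot_transform(["a", "c"], cats, handle_unknown="ignore"))
# [[1, 0], [0, 0]]
```

With `'ignore'`, the continuous randn data from the common tests would at worst encode to all-zero rows rather than raising.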

@NicolasHug (Contributor) left a comment

Please also remove this part from the docstring:

The OneHotEncoder previously assumed that the input features take on values in the range [0, max(values)). This behaviour is deprecated

LGTM when tests are green

@glemaitre (Contributor, Author) commented May 10, 2019

OneHotEncoder is supposed to work with continuous data too (or at least with floats). I think the right fix is to set handle_unknown='ignore' in set_checking_parameters().

I did that but I got some other failures.

@glemaitre (Contributor, Author) commented May 10, 2019

=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.7.2, pytest-3.8.1, py-1.6.0, pluggy-0.7.1 -- /home/glemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/glemaitre/Documents/packages/scikit-learn, inifile: setup.cfg
plugins: cov-2.6.0
collected 5703 items / 5671 deselected                                                                                                                                                     

sklearn/tests/test_common.py::test_parameters_default_constructible[OneHotEncoder-OneHotEncoder] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_dtypes] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_score_takes_y] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_pandas_series] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_list] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_sample_weights_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_fit_returns_self(readonly_memmap=True)] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_complex_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dtype_object] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_empty_data_messages] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_pipeline_consistency] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_nan_inf] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_overwrite_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimator_sparse_data] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_estimators_pickle] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_data_not_an_array] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] FAILED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformers_unfitted] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_transformer_n_iter] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_predict1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_methods_subset_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1sample] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit2d_1feature] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit1d] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_get_params_invariance] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_set_params] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dict_unchanged] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_dont_overwrite_parameters] PASSED
sklearn/tests/test_common.py::test_estimators[OneHotEncoder-check_fit_idempotent] PASSED
sklearn/tests/test_common.py::test_no_attributes_set_in_init[OneHotEncoder-estimator118] PASSED

========================================================================================= FAILURES =========================================================================================
_________________________________________________________________ test_estimators[OneHotEncoder-check_transformer_general] _________________________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = <function check_transformer_general at 0x7f12cc9b7268>

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = <function check_transformer_general at 0x7f12cc9b7268>
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = array([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
        0.61183692, 2.36067187, 0.44584725, 2.6891..., 2.75619882, 0.44342693, 2.44345576, 2.34577095,
        2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
______________________________________________________ test_estimators[OneHotEncoder-check_transformer_general(readonly_memmap=True)] ______________________________________________________

estimator = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
check = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)

    @pytest.mark.parametrize(
            "estimator, check",
            _generate_checks_per_estimator(_yield_all_checks,
                                           _tested_estimators()),
            ids=_rename_partial
    )
    def test_estimators(estimator, check):
        # Common tests for estimator instances
        with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            set_checking_parameters(estimator)
            name = estimator.__class__.__name__
>           check(name, estimator)

check      = functools.partial(<function check_transformer_general at 0x7f12cc9b7268>, readonly_memmap=True)
estimator  = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
name       = 'OneHotEncoder'

sklearn/tests/test_common.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/testing.py:355: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:955: in check_transformer_general
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1054: in _check_transformer
    transformer.transform(X.T)
sklearn/preprocessing/_encoders.py:372: in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
X = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
handle_unknown = 'ignore'

    def _transform(self, X, handle_unknown='error'):
        X_list, n_samples, n_features = self._check_X(X)
    
        X_int = np.zeros((n_samples, n_features), dtype=np.int)
        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
    
        for i in range(n_features):
            Xi = X_list[i]
>           diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                     return_mask=True)
E           IndexError: list index out of range

X          = memmap([[2.51189522, 0.67033055, 0.4992157 , 2.16073928, 2.58181919,
         0.61183692, 2.36067187, 0.44584725, 2.68... 2.75619882, 0.44342693, 2.44345576, 2.34577095,
         2.60256625, 0.57520755, 2.73159826, 2.40841944, 0.26329427]])
X_int      = array([[24,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0...,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
X_list     = [array([2.51189522, 2.6430893 , 2.54847718]), array([0.67033055, 0.66835685, 0.46763126]), array([0.4992157 , 0.448980....58788791, 2.34342981]), array([2.58181919, 2.38607905, 2.28934567]), array([0.61183692, 0.38774913, 0.52384727]), ...]
X_mask     = array([[ True, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  Tru...ue,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True]])
Xi         = array([2.16073928, 2.58788791, 2.34342981])
_          = array([0.03911488, 0.26329427, 0.35089856, 0.36860623, 0.39741998,
       0.44342693, 0.46763126, 0.50367667, 0.523847...53, 2.40841944, 2.44345576, 2.4511805 , 2.53611719,
       2.54847718, 2.60256625, 2.67679634, 2.73159826, 2.75619882])
diff       = [0.44898091995707645, 0.4992156950294815]
encoded    = array([ 0,  0, 13])
handle_unknown = 'ignore'
i          = 3
n_features = 30
n_samples  = 3
self       = OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)
valid_mask = array([False, False,  True])

sklearn/preprocessing/_encoders.py:115: IndexError
@NicolasHug (Contributor) commented May 10, 2019

You need something like

        if n_features != len(self.categories_):
            raise ValueError("OOPS")

in transform() now
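Since the legacy mode used to infer feature counts differently, `transform()` now needs an explicit guard that the input has as many features as were seen in `fit()`. A hedged sketch of such a guard as a standalone function (the error message wording is illustrative):

```python
# Sketch of the feature-count guard suggested for transform().
# categories_ holds one list of learned categories per fitted feature.

def check_n_features(n_features, categories_):
    # The number of input features must match the number of per-feature
    # category lists learned during fit(); otherwise indexing
    # categories_[i] fails with IndexError, as in the traceback above.
    if n_features != len(categories_):
        raise ValueError(
            "The number of features in X ({}) is different from the number "
            "of features seen during fit ({})".format(
                n_features, len(categories_)))

categories_ = [["a", "b"], ["x", "y", "z"]]  # learned for 2 features
check_n_features(2, categories_)   # OK, matches fit
try:
    check_n_features(30, categories_)
except ValueError as exc:
    print(exc)
```

Raising a clear ValueError up front replaces the opaque `IndexError: list index out of range` seen in the failing common tests.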

@thomasjpfan (Member) left a comment

Looks good after the following is removed:

The OneHotEncoder previously assumed that the input features take on
values in the range [0, max(values)). This behaviour is deprecated.

@jorisvandenbossche (Member) left a comment

Apart from the aforementioned doc fix and a minor nit (that can be ignored), looks good to me as well

@glemaitre (Contributor, Author) commented May 29, 2019

I think this is good to be merged. I addressed the last issues.

@jnothman jnothman merged commit 9ee164b into scikit-learn:master May 29, 2019

15 of 16 checks passed

codecov/patch: 75.71% of diff hit (target 96.84%)
LGTM analysis: C/C++: No code changes detected
LGTM analysis: JavaScript: No code changes detected
LGTM analysis: Python: 4 new alerts
ci/circleci: deploy: Your tests passed on CircleCI!
ci/circleci: doc: Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies: Your tests passed on CircleCI!
ci/circleci: lint: Your tests passed on CircleCI!
codecov/project: 96.81% (-0.04%) compared to 7ea7284
scikit-learn.scikit-learn: Build #20190529.18 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas): succeeded
scikit-learn.scikit-learn (Linux py35_np_atlas): succeeded
scikit-learn.scikit-learn (Linux pylatest_conda): succeeded
scikit-learn.scikit-learn (Windows py35_32): succeeded
scikit-learn.scikit-learn (Windows py37_64): succeeded
scikit-learn.scikit-learn (macOS pylatest_conda): succeeded
@jnothman (Member) commented May 29, 2019

Thank you, @glemaitre
