New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] iforest backward compatibility #11553

Merged
merged 8 commits into from Jul 23, 2018

Conversation

@ngoix
Contributor

ngoix commented Jul 16, 2018

Reference Issues/PRs

Fixes #11337

What does this implement/fix? Explain your changes.

Add a parameter which switches old/new behaviour. Default value to be changed in 0.22 and parameter to be removed in 0.24

@ngoix ngoix changed the title from [WIP] iforest backward compati to [WIP] iforest backward compatibility Jul 16, 2018

Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5.
behaviour: str, default == 'old'

This comment has been minimized.

@agramfort

agramfort Jul 16, 2018

Member

should be written

behaviour: str, optional (default='old')

@@ -131,6 +141,7 @@ def __init__(self,
n_estimators=100,
max_samples="auto",
contamination="legacy",
behaviour='old',

This comment has been minimized.

@agramfort

agramfort Jul 16, 2018

Member

should be positioned after n_jobs param

@ngoix ngoix changed the title from [WIP] iforest backward compatibility to [MRG] iforest backward compatibility Jul 16, 2018

@albertcthomas

We need to update the whatsnew entry that you added in the PR about the outlier detection API. Common tests are also failing. The common tests are written for the new API. We can maybe do check_estimator(IsolationForest(behavior='new')) in test_iforest.py and skip IsolationForest in the common test until the end of the deprecation cycle? or just do estimator.set_params(behavior='new') if estimator=IsolationForest in the common tests.

Assuming the behaviour parameter is set to 'old', we always have
offset_ = -0.5.
behaviour: str, optional (default='old')

This comment has been minimized.

@albertcthomas

albertcthomas Jul 16, 2018

Contributor

Should be in the Parameters section of the docstring not Attributes section

@GaelVaroquaux

One cosmetic comment, and then this is good.

Default "behaviour" parameter will change to "new" in version 0.22.
Passing behaviour="new" makes the decision_function change to match
other anomaly detection algorithm API, as explained in the offset_
parameter documentation.

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jul 16, 2018

Member

I think that you should be slightly more explicit as to what switching this parameter entails.

@GaelVaroquaux GaelVaroquaux changed the title from [MRG] iforest backward compatibility to [MRG+1] iforest backward compatibility Jul 16, 2018

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 16, 2018

LGTM. Thank you!

@albertcthomas

This comment has been minimized.

Contributor

albertcthomas commented Jul 16, 2018

for the common tests you can add

    if estimator.__class__.__name__ == "IsolationForest":
        # XXX to be removed in 0.22.
        # this is used because the old IsolationForest does not
        # respect the outlier detection API and thus and does not
        # pass the outlier detection common tests.
        estimator.set_params(behaviour='new')

in the set_checking_parameters(estimator) function located here

@albertcthomas

The examples should also be modified so that IsolationForest is instantiated with behaviour=new.

@jnothman

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

Passing behaviour="new" makes the decision_function change to match
other anomaly detection algorithm API, as explained in details in the
offset_ attribute documentation. Basically, the decision_function
becomes dependent from the contamination parameter, in such a way that

This comment has been minimized.

@jnothman

jnothman Jul 16, 2018

Member

"dependent from" is bad English. Do you mean "dependent on"?

return self
# else, define offset_ wrt contamination parameter, so that the
# threshold_ attribute is implicitely 0 and is not needed anymore:

This comment has been minimized.

@jnothman

jnothman Jul 16, 2018

Member

*implicitly

@jnothman

This comment has been minimized.

Member

jnothman commented Jul 17, 2018

flake8 errors

@jnothman

This comment has been minimized.

Member

jnothman commented Jul 17, 2018

sklearn/ensemble/tests/test_iforest.py:267:27: E127 continuation line over-indented for visual indent
                          'in version 0.22',
                          ^
sklearn/ensemble/tests/test_iforest.py:276:1: E302 expected 2 blank lines, found 1
def test_behaviour_param():
^
@albertcthomas

This comment has been minimized.

Contributor

albertcthomas commented Jul 17, 2018

I've not checked that the examples render similarly to 0.19, which is what we'd aim for...?

Yes we need to check the examples. (Btw the example given in #11337 is not an example from scikit-learn but from a scipy tutorial).

@agramfort

This comment has been minimized.

Member

agramfort commented Jul 17, 2018

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 17, 2018

rebased on master, fixing conflict with #11561
I checked the examples, they seem fine.
I run @glemaitre script from #11337, which now returns:

figure_1

(same as on 0.19)

@amueller

This comment has been minimized.

Member

amueller commented Jul 17, 2018

looks like travis will pass in a sec. should we wait for the rest?

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 17, 2018

@agramfort

This comment has been minimized.

Member

agramfort commented Jul 17, 2018

@ngoix you need to rebase.

self.contamination = contamination
if behaviour == 'old':

This comment has been minimized.

@glemaitre

glemaitre Jul 17, 2018

Contributor

This should be done in fit

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jul 17, 2018

You need to catch the warning using the pytest.mark.filterwarings for when behaviour='old'

@glemaitre

Basically, we turn warning into errors so you can refer to the following traceback:

================================ test session starts ================================
platform linux -- Python 3.6.5, pytest-3.6.1, py-1.5.3, pluggy-0.6.0 -- /home/lemaitre/miniconda3/envs/dev/bin/python
cachedir: .pytest_cache
rootdir: /home/lemaitre/Documents/code/toolbox/scikit-learn, inifile: setup.cfg
collected 14 items                                                                  

sklearn/ensemble/tests/test_iforest.py::test_iforest PASSED                   [  7%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_sparse FAILED            [ 14%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_error FAILED             [ 21%]
sklearn/ensemble/tests/test_iforest.py::test_recalculate_max_depth FAILED     [ 28%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_attribute FAILED     [ 35%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_parallel_regression FAILED [ 42%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_performance FAILED       [ 50%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_works PASSED             [ 57%]
sklearn/ensemble/tests/test_iforest.py::test_max_samples_consistency FAILED   [ 64%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_subsampled_features FAILED [ 71%]
sklearn/ensemble/tests/test_iforest.py::test_iforest_average_path_length PASSED [ 78%]
sklearn/ensemble/tests/test_iforest.py::test_score_samples FAILED             [ 85%]
sklearn/ensemble/tests/test_iforest.py::test_deprecation FAILED               [ 92%]
sklearn/ensemble/tests/test_iforest.py::test_behaviour_param FAILED           [100%]

===================================== FAILURES ======================================
________________________________ test_iforest_sparse ________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_sparse():
        """Check IForest for various parameter settings on sparse input."""
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        grid = ParameterGrid({"max_samples": [0.5, 1.0],
                              "bootstrap": [True, False]})
    
        for sparse_format in [csc_matrix, csr_matrix]:
            X_train_sparse = sparse_format(X_train)
            X_test_sparse = sparse_format(X_test)
    
            for params in grid:
                # Trained on sparse format
                sparse_classifier = IsolationForest(
>                   n_estimators=10, random_state=1, **params).fit(X_train_sparse)

sklearn/ensemble/tests/test_iforest.py:85: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=True, contamination='legacy',
        max_features=1.0, max_samples=0.5, n_estimators=10, n_jobs=1,
        random_state=1, verbose=0)
X = <37x13 sparse matrix of type '<class 'numpy.float64'>'
	with 420 stored elements in Compressed Sparse Column format>
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_iforest_error _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_error():
        """Test that it gives proper exception on deficient input."""
        X = iris.data
    
        # Test max_samples
        assert_raises(ValueError,
>                     IsolationForest(max_samples=-1).fit, X)

sklearn/ensemble/tests/test_iforest.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/_unittest_backport.py:204: in assertRaises
    return context.handle('assertRaises', args, kwargs)
sklearn/utils/_unittest_backport.py:113: in handle
    callable_obj(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=-1, n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_recalculate_max_depth _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_recalculate_max_depth():
        """Check max_depth recalculation when max_samples is reset to n_samples"""
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:145: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
____________________________ test_max_samples_attribute _____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_attribute():
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:153: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_parallel_regression __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_parallel_regression():
        """Check parallel regression."""
        rng = check_random_state(0)
    
        X_train, X_test, y_train, y_test = train_test_split(boston.data,
                                                            boston.target,
                                                            random_state=rng)
    
        ensemble = IsolationForest(n_jobs=3,
>                                  random_state=0).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:177: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=3,
        random_state=0, verbose=0)
X = array([[1.44760e-01, 0.00000e+00, 1.00100e+01, ..., 1.78000e+01,
        3.91500e+02, 1.36100e+01],
       [3.46600e-0...+02, 6.53000e+00],
       [1.40507e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
        3.50500e+01, 2.12200e+01]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_____________________________ test_iforest_performance ______________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_iforest_performance():
        """Test Isolation Forest performs well"""
    
        # Generate train/test data
        rng = check_random_state(2)
        X = 0.3 * rng.randn(120, 2)
        X_train = np.r_[X + 2, X - 2]
        X_train = X[:100]
    
        # Generate some abnormal novel observations
        X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
        X_test = np.r_[X[100:], X_outliers]
        y_test = np.array([0] * 20 + [1] * 20)
    
        # fit the model
>       clf = IsolationForest(max_samples=100, random_state=rng).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:208: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples=100, n_estimators=100, n_jobs=1,
        random_state=<mtrand.RandomState object at 0x7fe2e0dc3360>,
        verbose=0)
X = array([[-1.25027354e-01, -1.68800482e-02],
       [-6.40858829e-01,  4.92081243e-01],
       [-5.38030676e-01, -2.5252....66428750e-01,  5.41342992e-01],
       [ 5.41229421e-02,  1.65949282e-01],
       [ 3.09908720e-01, -9.87007304e-02]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
___________________________ test_max_samples_consistency ____________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_max_samples_consistency():
        # Make sure validated max_samples in iforest and BaseBagging are identical
        X = iris.data
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:238: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
  ...],
       [6.3, 2.9, 5.6, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [7.7, 3.8, 6.7, 2.2],
       [4.6, 3.2, 1.4, 0.2]])
y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________ test_iforest_subsampled_features __________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    @pytest.mark.filterwarnings('ignore:threshold_ attribute')
    def test_iforest_subsampled_features():
        # It tests non-regression for #5732 which failed at predict.
        rng = check_random_state(0)
        X_train, X_test, y_train, y_test = train_test_split(boston.data[:50],
                                                            boston.target[:50],
                                                            random_state=rng)
        clf = IsolationForest(max_features=0.8)
>       clf.fit(X_train, y_train)

sklearn/ensemble/tests/test_iforest.py:251: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=0.8, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = array([[5.20177e+00, 0.00000e+00, 1.81000e+01, 1.00000e+00, 7.70000e-01,
        6.12700e+00, 8.34000e+01, 2.72270e+00...      6.89700e+00, 5.43000e+01, 6.33610e+00, 6.00000e+00, 3.00000e+02,
        1.66000e+01, 3.91250e+02, 1.13800e+01]])
y = array([22.7, 26.2, 17.8, 10.9, 22.2, 18.2, 24.6, 22. , 13.4,  5. ,  9.5,
       36.4, 21.6, 31.5, 14.9, 33.3, 24. , 16... , 15.1, 14.9,
       28.4, 14.4, 16.4, 19. , 14.5, 24.4, 31.1, 23.7, 19.3, 32.9, 33.1,
        8.8, 19.9, 26.6, 22. ])
sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
________________________________ test_score_samples _________________________________

    @pytest.mark.filterwarnings('ignore:default contamination')
    def test_score_samples():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(contamination=0.1).fit(X_train)

sklearn/ensemble/tests/test_iforest.py:271: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination=0.1,
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
                          FutureWarning)
            self._contamination = 0.1
        else:
            self._contamination = self.contamination
    
        if self.behaviour == 'old':
            warnings.warn('Default "behaviour" parameter will change to "new" '
                          'in version 0.22. Passing behaviour="new" makes '
                          'IsolationForest decision_function change to match '
                          'other anomaly detection algorithm API.',
>                         FutureWarning)
E           FutureWarning: Default "behaviour" parameter will change to "new" in version 0.22. Passing behaviour="new" makes IsolationForest decision_function change to match other anomaly detection algorithm API.

sklearn/ensemble/iforest.py:209: FutureWarning
_________________________________ test_deprecation __________________________________

    def test_deprecation():
        X = [[0.0], [1.0]]
        clf = IsolationForest()
    
        assert_warns_message(FutureWarning,
                             'default contamination parameter 0.1 will change '
                             'in version 0.22 to "auto"',
                             clf.fit, X)
    
        assert_warns_message(FutureWarning,
                             'Default "behaviour" parameter will change to "new" '
                             'in version 0.22',
                             clf.fit, X)
    
>       clf = IsolationForest().fit(X)

sklearn/ensemble/tests/test_iforest.py:295: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[0.0], [1.0]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
_______________________________ test_behaviour_param ________________________________

    def test_behaviour_param():
        X_train = [[1, 1], [1, 2], [2, 1]]
>       clf1 = IsolationForest(behaviour='old').fit(X_train)

sklearn/ensemble/tests/test_iforest.py:304: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
        max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
        random_state=None, verbose=0)
X = [[1, 1], [1, 2], [2, 1]], y = None, sample_weight = None

    def fit(self, X, y=None, sample_weight=None):
        """Fit estimator.
    
            Parameters
            ----------
            X : array-like or sparse matrix, shape (n_samples, n_features)
                The input samples. Use ``dtype=np.float32`` for maximum
                efficiency. Sparse matrices are also supported, use sparse
                ``csc_matrix`` for maximum efficiency.
    
            sample_weight : array-like, shape = [n_samples] or None
                Sample weights. If None, then samples are equally weighted.
    
            Returns
            -------
            self : object
            """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

sklearn/ensemble/iforest.py:199: FutureWarning
======================== 11 failed, 3 passed in 0.96 seconds ========================

@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 18, 2018

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 18, 2018

PR ready to be merged I think

@agramfort

This comment has been minimized.

Member

agramfort commented Jul 18, 2018

LGTM

@glemaitre merge if you're happy

@albertcthomas

Sorry to be late but I think we should be clearer about the deprecation cycle.

  1. we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.
  2. I would also add in whatsnew that the default value of behaviour will be new in 0.22
@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jul 19, 2018

we are going to remove the behaviour parameter at some point (0.24 according to the PR description) right? In that case I think we should mention it somewhere (in whastnew, the docstring or the deprecation warning). I don't see something mentioning this except the description of the PR.

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

@albertcthomas

This comment has been minimized.

Contributor

albertcthomas commented Jul 19, 2018

We have to be careful there. In 0.22, we will not be able to make the old behaviour since a bunch of thing will be deprecated or change from default. Therefore, behaviour needs to be introduce and deprecated right now because we will not be able to use behavior=old in 0.22.

Does it make sense (or is it make jet-lag speaking)?

Yes it does. In this case the docstring of the behaviour attribute and the warning are a bit misleading because they say Default "behaviour" parameter will change to "new" in version 0.22 and don't say that there will no longer be a behaviour parameter in 0.22. Same thing for the whatsnew entry which promotes the introduction of a new parameter behaviour and does not mention that it will no longer be here in 0.22.

Should we add in the docstring, the warning and the whatsnew entry that behaviour parameter will be removed in 0.22?

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 20, 2018

Yes we definitely should, thanks for pointing out!

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 20, 2018

Actually I'm not very comfortable in introducing a deprecated parameter. We can't both change the behaviour + remove the parameter in one single version.
I think we should stick to switch default behavior to 'new' in 0.22, and remove it in 0.24.
behaviour='old' will still be usable in 0.22 if the user set contamination to something different than 'auto'.

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jul 20, 2018

We are doing this right now there:
#11585
to keep the wrong behaviour for the smallest amount of time.

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 20, 2018

here this is not a wrong behaviour though, just a different convention...

Moreover, here I think that this is even more problematic since we will not be able to maintain behaviour=old after 0.22 since that we are supposed to remove some deprecated code which allows for such behaviour.

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

@jorisvandenbossche

This comment has been minimized.

Contributor

jorisvandenbossche commented Jul 20, 2018

I don't think we should deprecate behaviour='new' already now. We need to give users the opportunity to already have the new behaviour with 0.20, but without having a deprecation warning that the keyword they need to specify to have that behaviour is deprecated.
So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jul 20, 2018

behaviour=old will be ok in 0.22, though not the default argument, and we remove this param in 0.24 so I don't see the problem?

We are supposed to not have threshold_ anymore, isn't it.

@ngoix

This comment has been minimized.

Contributor

ngoix commented Jul 20, 2018

We are supposed to not have threshold_ anymore, isn't it.

Right, but we can still allow behaviour='old' in 0.22 (deprecated) without allowing to access this attribute anymore.

I'm also ok with @jorisvandenbossche compromise solution, ie remove behavior='old' in 0.22 and remove the param in 0.24

@albertcthomas

This comment has been minimized.

Contributor

albertcthomas commented Jul 20, 2018

So in my view it is only behaviour='old' that is deprecated, so it is fine to remove that option in 0.22, and from then only behaviour='new' will be accepted (and then deprecated at that point).
But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

I'm +1 for @jorisvandenbossche 's solution, even if it takes 2 deprecation cycles. We introduce behaviour now and deprecate the default old value. So in 0.22, only new will be available (and we don't need to maintain the old value anymore). We then deprecate behaviour to remove it in 0.24.

@albertcthomas

This comment has been minimized.

Contributor

albertcthomas commented Jul 20, 2018

But maybe we should be more explicit about it in the deprecation and keyword description that the 'old' option will be removed in 0.22

And +1 for this also if we choose @jorisvandenbossche 's solution

ngoix added some commits Jul 16, 2018

iforest behaviour param
fix + cosmit

cosmit

common test

update examples

cosmit + fix travis

attribute error instead of value + mask warning in test

whatsnew
@glemaitre

I think that we are almost good then. Just a weird thing with the example and I would slightly modify the behaviour docstring.

@@ -0,0 +1,129 @@
"""

This comment has been minimized.

@glemaitre

glemaitre Jul 20, 2018

Contributor

Is it a new file?

This comment has been minimized.

@albertcthomas

albertcthomas Jul 20, 2018

Contributor

This example was removed in PR #10700 (merged yesterday) in favor of plot_anomaly_comparison.py. We don’t need it anymore.

offset_ attribute documentation. Basically, the decision_function
becomes dependent on the contamination parameter, in such a way that
0 becomes its natural threshold to detect outliers.

This comment has been minimized.

@glemaitre

glemaitre Jul 20, 2018

Contributor

Maybe we can use the sphinx syntax to highlight those remarks. I would also remove the optional since that people should probably use 'new' and state that 'new' will be the default in the future.

 behaviour : str, default='old'
    Behaviour of the ``decision_function`` which can be either 'old' or 'new'.
    Passing ``behaviour='new'`` makes the ``decision_function`` change to match
    other anomaly detection algorithm API which will be the default behaviour
    in the future. As explained in details in the ``offset_`` attribute
    documentation, the ``decision_function`` becomes dependent on the
    contamination parameter, in such a way that 0 becomes its natural 
    threshold to detect outliers.

    .. versionadded:: 0.20
       ``behaviour`` is added in 0.20 for back-compatibility purpose.
   
    .. deprecated:: 0.20
       ``behaviour='old'`` is deprecated in 0.20 and will not be possible in 0.22.

    .. deprecated:: 0.22
       ``behaviour`` parameter will be deprecated in 0.22 and removed in 0.24.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Jul 23, 2018

Contributor

You have the explanation duplicated now

ngoix added some commits Jul 23, 2018

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Jul 23, 2018

Thanks @ngoix !!! Merging now.

@glemaitre glemaitre merged commit 53622e8 into scikit-learn:master Jul 23, 2018

5 of 6 checks passed

ci/circleci: pypy3 Your tests failed on CircleCI
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment