
IsolationForest is breaking backward compatibility #11337

Closed
glemaitre opened this issue Jun 21, 2018 · 14 comments · Fixed by #11553

@glemaitre
Member

While preparing the SciPy tutorial, I stumbled upon a backward-compatibility breakage. It was introduced when the outlier detection methods were improved.

The issue is with the following code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, y = make_blobs(n_features=2, centers=3, n_samples=500,
                  random_state=42)

alpha_set = 0.95
n_samples, n_features = X.shape
X_range = np.zeros((n_features, 2))
X_range[:, 0] = np.min(X, axis=0) - 1.
X_range[:, 1] = np.max(X, axis=0) + 1.

h = 0.1  # step size of the mesh
x_min, x_max = X_range[0]
y_min, y_max = X_range[1]
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

grid = np.c_[xx.ravel(), yy.ravel()]

# contamination: expected fraction of outliers in the data
iforest = IsolationForest(n_estimators=300, contamination=0.10)
iforest = iforest.fit(X)
Z_iforest = iforest.decision_function(grid)
Z_iforest = Z_iforest.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_iforest,
                  levels=[iforest.threshold_],
                  colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15,
           fmt={iforest.threshold_: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.show()

So it seems that we now need to add the threshold_ (or, later on, the offset_) to the score given by the decision function.

Results in 0.19

[figure: contour plot of the IsolationForest decision boundary under 0.19]

Results in master

[figure: the same plot on master]

Where do we stand on this breakage? I could not find it in the What's New.
ping @jnothman @ngoix @albertcthomas

@jnothman
Member

jnothman commented Jun 21, 2018 via email

@glemaitre
Member Author

One solution would be to change the value of threshold_ on master so that it works with decision_function. However, if users rely on threshold_ without decision_function, this would be a significant change of behavior. I am not familiar enough with the use cases to judge.

@albertcthomas
Contributor

albertcthomas commented Jun 21, 2018

Using levels = [0] on master should give the expected result. I think this is an intended change: when you use decision_function you should use levels = [0], to be consistent with the other outlier detection estimators.
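A minimal check of the updated convention (a sketch, assuming the post-change semantics on master; `random_state` is added here for reproducibility, and in the plotting snippet above the only change would be passing `levels=[0]` to `plt.contour`):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)

iforest = IsolationForest(n_estimators=300, contamination=0.10,
                          random_state=42).fit(X)
scores = iforest.decision_function(X)
labels = iforest.predict(X)            # +1 for inliers, -1 for outliers

# Roughly `contamination` of the training samples score below 0,
# and predict() agrees with the sign of decision_function().
outlier_fraction = np.mean(scores < 0)
```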

@albertcthomas
Contributor

albertcthomas commented Jun 21, 2018

IsolationForest was not consistent with the other estimators as decision_function was not thresholded by default. See issue #8693.
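For reference, the consistency this change establishes can be sketched as follows (assuming the post-change API, where `score_samples` and `offset_` exist — an assumption about releases after this thread):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_features=2, centers=3, n_samples=200, random_state=42)
iforest = IsolationForest(contamination=0.10, random_state=42).fit(X)

# decision_function is the raw anomaly score shifted by offset_, so that
# the inlier/outlier boundary sits at 0, like other outlier detectors.
shifted = iforest.decision_function(X)
raw = iforest.score_samples(X)
```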

@albertcthomas
Contributor

albertcthomas commented Jun 21, 2018

BTW in your snippet the contamination parameter of IsolationForest should be set to 1 - alpha_set. And if you want to have the same figure, use alpha_set = 0.90.
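In code, the corrected parameter choice reads (a sketch reusing the names from the snippet above):

```python
from sklearn.ensemble import IsolationForest

alpha_set = 0.90                 # mass enclosed by the plotted contour
contamination = 1 - alpha_set    # fraction of samples flagged as outliers

iforest = IsolationForest(n_estimators=300, contamination=contamination)
```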

@glemaitre
Member Author

Yep, but this does not solve the backward-compatibility breakage, which is my main concern here. I have nothing against breaking it, but we should at least mention it in the What's New and give a tip on how to update existing code.

WDYT?

@albertcthomas
Contributor

Yes, I agree, this change should be better explained in the What's New. Concerning the tip on updating code, the IsolationForest example has been updated accordingly and we can add a reference to it.

@albertcthomas
Contributor

Does this rather require a deprecation cycle?

@jnothman
Member

jnothman commented Jun 21, 2018 via email

@albertcthomas
Contributor

I will open a PR. Unless a new contributor wants to take this one :). Thanks @glemaitre!

@albertcthomas
Contributor

This requires a bit more work than I originally thought, as we need to revert the examples involving the decision_function of IsolationForest so that they work with the 0.19 version. Some of the current outlier detection common tests will also not pass for the 0.19 decision_function.

@jnothman
Member

jnothman commented Jul 3, 2018

Maybe we should consider something as blunt as a parameter that switches between the old and new behaviour, changing its default value in 0.22 and being removed in 0.24...
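Such a switch could look roughly like this toy sketch (the parameter name `behaviour`, its default, the stand-in offset of 0.5, and the warning text are all assumptions for illustration, not the API that was eventually merged):

```python
import warnings

class IsolationForestSketch:
    """Toy illustration of an old/new behaviour switch (hypothetical)."""

    def __init__(self, behaviour='old'):
        self.behaviour = behaviour

    def decision_function(self, raw_scores):
        # `raw_scores` stands in for the unshifted 0.19-style scores.
        if self.behaviour == 'old':
            warnings.warn(
                "behaviour='old' is deprecated; use behaviour='new', "
                "which shifts the scores so the threshold sits at 0.",
                FutureWarning)
            return list(raw_scores)
        # New behaviour: shift the scores so the decision threshold is 0
        # (0.5 is an arbitrary stand-in for a fitted offset).
        return [s - 0.5 for s in raw_scores]
```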

@albertcthomas
Contributor

I think that's a good idea.

@agramfort
Member

agramfort commented Jul 3, 2018 via email

@jorisvandenbossche jorisvandenbossche added this to Milestones tagged in scikit-learn 0.20 Jul 16, 2018
scikit-learn 0.20 automation moved this from Issues tagged to Done Jul 23, 2018