IsolationForest(max_features=0.8).predict(X) fails input validation #5732

Closed
betatim opened this Issue Nov 5, 2015 · 6 comments

@betatim
Contributor

betatim commented Nov 5, 2015

When subsampling features, IsolationForest fails input validation when calling predict().

from sklearn.ensemble import IsolationForest
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = IsolationForest(max_features=0.8)
clf.fit(X, y)
clf.predict(X)

gives the following:

scikit-learn/sklearn/tree/tree.pyc in _validate_X_predict(self, X, check_input)
    392                              " match the input. Model n_features is %s and "
    393                              " input n_features is %s "
--> 394                              % (self.n_features_, n_features))
    395
    396         return X

ValueError: Number of features of the model must  match the input. Model n_features is 3 and  input n_features is 4

In predict(), one of the individual fitted estimators is used for input validation: self.estimators_[0]._validate_X_predict(X, check_input=True). But it is passed the full X, which has all the features (with max_features=0.8 and iris's 4 features, each tree is built on int(0.8 * 4) = 3 columns, hence the 3 vs 4 mismatch). After looking into it a bit: bagging.py sub-samples the features itself, whereas forest.py delegates that to the underlying DecisionTree.
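
For what it's worth, the failure is specific to feature sub-sampling; the same pipeline with the default max_features runs fine (clf_ok is just an illustrative name):

# Sanity check (same iris X as above): with the default
# max_features=1.0 each tree sees all 4 columns, so the
# n_features check inside _validate_X_predict passes.
clf_ok = IsolationForest()
clf_ok.fit(X)
clf_ok.predict(X)  # no ValueError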

@agramfort
Member

agramfort commented Nov 5, 2015

@ngoix

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, we can't apply X_test to the base estimators self.estimators_, because X_test has the full number of features whereas the base estimators are built on a smaller subset of features.

The original Isolation Forest algorithm does not consider sub-sampling features. We kept this argument because it was supposed to be an easy extension; it seems that is not the case. I vote for removing the max_features argument.
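
(A quick way to see the mismatch, reusing clf and X from the report above; n_features_ is the tree attribute from the traceback:)

clf = IsolationForest(max_features=0.8).fit(X)
print(X.shape[1])                      # 4 features in the input
print(clf.estimators_[0].n_features_)  # 3 features seen by each tree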

@glouppe
Member

glouppe commented Nov 6, 2015

To me the issue comes from the fact that the validity of X is checked through one of its sub-estimators (see X = self.estimators_[0]._validate_X_predict(X, check_input=True) at line 194). This is not the correct way to do it.

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, but even if you replace this line with just X = check_array(X), you get an error in predict when doing (L202):

for i, tree in enumerate(self.estimators_):
    leaves_index = tree.apply(X)
@glouppe
Member

glouppe commented Nov 6, 2015

Yes, because you should not do that either. Since you inherit from BaseBagging, you need to iterate over zip(self.estimators_, self.estimators_features_) and then compute leaves_index = tree.apply(X[:, features]). Sorry for not catching that in the initial review...
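
(Spelled out, that suggestion would look something like the sketch below; clf and X reuse the reproduction from the report, and this is illustrative rather than the merged fix:)

# estimators_features_ is set by BaseBagging.fit and records which
# columns each tree was trained on; slicing X with it restores the
# shape each tree expects before calling apply().
for tree, features in zip(clf.estimators_, clf.estimators_features_):
    leaves_index = tree.apply(X[:, features])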

@ngoix
Contributor

ngoix commented Nov 6, 2015

Ok great! I didn't know about self.estimators_features_. So it is indeed an easy extension, and I withdraw my vote for removing max_features ;)

@amueller amueller modified the milestone: 0.19 Sep 29, 2016

@jnothman jnothman closed this in 4f3c60c Dec 27, 2016

sergeyf added a commit to sergeyf/scikit-learn that referenced this issue Feb 28, 2017

Sundrique added a commit to Sundrique/scikit-learn that referenced this issue Jun 14, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this issue Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this issue Aug 19, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017
