GBT fails with RF init #2691

Open
agramfort opened this Issue · 12 comments

7 participants

@agramfort
Owner

here is a tiny script which reproduces the crash.

from sklearn.datasets import load_iris
from sklearn import ensemble
from sklearn.cross_validation import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X, y = X[y < 2], y[y < 2]  # make it binary

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit GBT init with RF
rf = ensemble.RandomForestClassifier()
clf = ensemble.GradientBoostingClassifier(init=rf)

clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Accuracy: {:.4f}".format(acc))

It also seems that the init param in GradientBoostingClassifier is
not really tested.

@pprett @glouppe @ogrisel

@pprett
Owner

@agramfort thanks - I'm aware of the issue - but I was cautious to get rid of it because handling this properly would incur quite a test-time performance degradation for single instance prediction (checking isinstance or some try-except block).
I'll solve this soon.

@agramfort
Owner

ok thanks.

@amueller
Owner

Hm I guess we should fix that before a release, right?

@pprett
Owner

Yep, let's set a milestone.

@agramfort
Owner

now that the big refactoring of GBRT is merged, what's needed here?

@pprett
Owner

basically consolidating this check:

if (not hasattr(self.init, 'fit') or not hasattr(self.init, 'predict')):

to check on predict_proba for classification.
Then we need to make sure that if an init estimator has predict_proba we use the log-odds for binary classification and the output of predict_proba for multi-class.
Basically, the following lines have to be changed to accommodate this:

y_pred = self.init_.predict(X)  # in fit

and:

score = self.init_.predict(X).astype(np.float64)  # in _init_decision_function
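
A minimal sketch of the conversion described above, not the actual scikit-learn change. The helper name `init_raw_scores`, the `eps` clipping constant, and the stub estimator `FixedProba` are all hypothetical and only serve the illustration; any classifier exposing `predict_proba` could be passed in instead.

```python
import numpy as np


def init_raw_scores(init_estimator, X, eps=1e-6):
    """Hypothetical helper: turn predict_proba output into initial scores."""
    # Classification needs predict_proba, not just predict.
    if not hasattr(init_estimator, "fit") or not hasattr(init_estimator, "predict_proba"):
        raise ValueError("init estimator must implement fit and predict_proba")

    proba = np.asarray(init_estimator.predict_proba(X), dtype=np.float64)
    if proba.shape[1] == 2:
        # Binary case: log-odds of the positive class, as one column.
        # Clipping (an assumption of this sketch) avoids log(0).
        proba = np.clip(proba, eps, 1.0 - eps)
        return np.log(proba[:, 1] / proba[:, 0]).reshape(-1, 1)
    # Multi-class case: one column per class, taken directly from predict_proba.
    return proba


class FixedProba:
    """Hypothetical stub that always predicts the same class probabilities."""

    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=np.float64)

    def fit(self, X, y):
        return self

    def predict_proba(self, X):
        return np.tile(self.proba, (len(X), 1))


X = np.zeros((4, 2))
binary = init_raw_scores(FixedProba([0.25, 0.75]), X)      # shape (4, 1)
multi = init_raw_scores(FixedProba([0.2, 0.3, 0.5]), X)    # shape (4, 3)
print(binary.shape, multi.shape)
```

Returning a 2-D array in both cases keeps the later `y_pred[:, k]` indexing valid regardless of the number of classes.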

@ogrisel
Owner

I think we need a better implementation of astype somewhere under sklearn/utils. NumPy's ndarray.astype currently always makes a copy of the data, even when the array already has the right type.
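
To illustrate the copy behaviour mentioned above, here is a small sketch. The helper name `as_float64` is hypothetical and is not a function in sklearn/utils; the `copy=False` argument of `ndarray.astype` is only available in newer NumPy releases, so treat this as one possible approach.

```python
import numpy as np


def as_float64(a):
    """Hypothetical helper: convert to float64, copying only when needed."""
    # np.asarray returns the input unchanged if it is already an
    # ndarray with the requested dtype.
    return np.asarray(a, dtype=np.float64)


x = np.arange(5, dtype=np.float64)

copied = x.astype(np.float64)              # default: a new array, even for the same dtype
same = x.astype(np.float64, copy=False)    # no copy when the dtype already matches
converted = as_float64(np.arange(3))       # integer input still gets converted

print(np.shares_memory(copied, x), same is x, as_float64(x) is x, converted.dtype)
```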

@GaelVaroquaux
@kaushik94

Is this the bug?

Traceback (most recent call last):
  File "x.py", line 15, in <module>
    clf.fit(X_train, y_train)
  File "/home/kaushik/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 1124, in fit
    return super(GradientBoostingClassifier, self).fit(X, y, monitor)
  File "/home/kaushik/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 782, in fit
    begin_at_stage, monitor)
  File "/home/kaushik/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 833, in _fit_stages
    criterion, splitter, random_state)
  File "/home/kaushik/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 575, in _fit_stage
    sample_mask, self.learning_rate, k=k)
  File "/home/kaushik/scikit-learn/sklearn/ensemble/gradient_boosting.py", line 194, in update_terminal_regions
    y_pred[:, k])
IndexError: too many indices
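
One plausible reading of the traceback, assuming `y_pred` arrives as a 1-D array (the shape a classifier's `predict` returns) while the boosting code indexes it as 2-D. The snippet below only reproduces that indexing failure with plain NumPy; the variable names are illustrative and it does not touch scikit-learn.

```python
import numpy as np

k = 0
y_pred_1d = np.zeros(5)  # shape (n_samples,), like the output of predict

try:
    y_pred_1d[:, k]      # 2-D style indexing on a 1-D array
    failed = False
except IndexError as exc:
    failed = True
    print("IndexError:", exc)

# Reshaping to (n_samples, 1) makes the same indexing valid.
y_pred_2d = y_pred_1d.reshape(-1, 1)
column = y_pred_2d[:, k]
print(column.shape)
```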

@pprett
Owner

@kaushik94 more context please - arguments and dataset characteristics in particular

@agramfort
Owner
@abhishekkrthakur
@ogrisel ogrisel removed this from the 0.15 milestone
@amueller amueller added this to the 0.15.1 milestone
@amueller amueller modified the milestone: 0.17, 0.16