[MRG + 1] Fix for OvR partial_fit bug #7786

Merged · 5 commits · Nov 5, 2016
Changes from 3 commits
33 changes: 19 additions & 14 deletions sklearn/multiclass.py
@@ -244,26 +244,31 @@ def partial_fit(self, X, y, classes=None):
         self
         """
         if _check_partial_fit_first_call(self, classes):
-            if (not hasattr(self.estimator, "partial_fit")):
-                raise ValueError("Base estimator {0}, doesn't have partial_fit"
-                                 "method".format(self.estimator))
+            if not hasattr(self.estimator, "partial_fit"):
+                raise ValueError("Base estimator {0}, doesn't have "
+                                 "partial_fit method".format(self.estimator))
             self.estimators_ = [clone(self.estimator) for _ in range
                                 (self.n_classes_)]

-        # A sparse LabelBinarizer, with sparse_output=True, has been shown to
-        # outperform or match a dense label binarizer in all cases and has also
-        # resulted in less or equal memory consumption in the fit_ovr function
-        # overall.
-        self.label_binarizer_ = LabelBinarizer(sparse_output=True)
-        Y = self.label_binarizer_.fit_transform(y)
+            # A sparse LabelBinarizer, with sparse_output=True, has been
+            # shown to outperform or match a dense label binarizer in all
+            # cases and has also resulted in less or equal memory consumption
+            # in the fit_ovr function overall.
+            self.label_binarizer_ = LabelBinarizer(sparse_output=True)
+            self.label_binarizer_.fit(self.classes_)
+        if not set(self.classes_).issuperset(y):
Member:
for large y, iteration may be much slower than using np.setdiff1d..?

Contributor Author:
Yes, np.setdiff1d seems to be faster.

Follow-up comment:
due to the use of np.setdiff1d, it is not possible to partial_fit with a sparse y.
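To make the trade-off concrete, here is a minimal sketch (not part of the PR) contrasting the two membership checks discussed above; the toy `classes_` and `y` values are invented for illustration:

```python
import numpy as np

classes_ = np.array([0, 1, 2, 3])   # classes declared at the first partial_fit call
y = [1, 1, 5, 2]                    # mini-batch containing an unseen label (5)

# Set-based check: builds a Python set and iterates over y element by element.
has_unseen_set = not set(classes_).issuperset(y)

# setdiff1d-based check: vectorised in NumPy, typically faster for large y,
# but it converts y to a dense array, which rules out a sparse y.
has_unseen_diff = len(np.setdiff1d(y, classes_)) > 0

assert has_unseen_set == has_unseen_diff   # both detect the unseen label 5
```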

+            raise ValueError(("Mini-batch contains {0} while classes "
+                              "must be subset of {1}").format(np.unique(y),
+                                                              self.classes_))

+        Y = self.label_binarizer_.transform(y)
         Y = Y.tocsc()
         columns = (col.toarray().ravel() for col in Y.T)
-        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(
-            _partial_fit_binary)(self.estimators_[i],
-                                 X, next(columns) if self.classes_[i] in
-                                 self.label_binarizer_.classes_ else
-                                 np.zeros((1, len(y))))
+        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
+            delayed(_partial_fit_binary)(self.estimators_[i], X,
+                                         next(columns))
             for i in range(self.n_classes_))

         return self
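For context, a minimal usage sketch (not part of the diff) of the behaviour this hunk enables: `partial_fit` on mini-batches that do not each contain every class. The key change is that `label_binarizer_` is now fitted once on `self.classes_`, so `transform` on any later batch yields one column per known class, with an all-zero column for a class absent from that batch:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

X = np.abs(np.random.randn(14, 2))   # MultinomialNB requires non-negative X
y = np.array([1, 1, 1, 1, 2, 3, 3, 0, 0, 2, 3, 1, 2, 3])

ovr = OneVsRestClassifier(MultinomialNB())
# The first call must declare the full set of classes ...
ovr.partial_fit(X[:7], y[:7], classes=np.unique(y))
# ... but later mini-batches may contain only a subset of them.
ovr.partial_fit(X[7:], y[7:])
pred = ovr.predict(X)
```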
43 changes: 33 additions & 10 deletions sklearn/tests/test_multiclass.py
@@ -22,7 +22,8 @@
 from sklearn.svm import LinearSVC, SVC
 from sklearn.naive_bayes import MultinomialNB
-from sklearn.linear_model import (LinearRegression, Lasso, ElasticNet, Ridge,
-                                  Perceptron, LogisticRegression)
+from sklearn.linear_model import (LinearRegression, Lasso, ElasticNet, Ridge,
+                                  Perceptron, LogisticRegression,
+                                  SGDClassifier)
 from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
 from sklearn.model_selection import GridSearchCV, cross_val_score
 from sklearn.pipeline import Pipeline
@@ -89,16 +90,39 @@ def test_ovr_partial_fit():
     assert_greater(np.mean(y == pred), 0.65)

     # Test when mini batches doesn't have all classes
+    # with MultinomialNB
+    X = np.abs(np.random.randn(14, 2))
+    y = [1, 1, 1, 1, 2, 3, 3, 0, 0, 2, 3, 1, 2, 3]
     ovr = OneVsRestClassifier(MultinomialNB())
-    ovr.partial_fit(iris.data[:60], iris.target[:60], np.unique(iris.target))
-    ovr.partial_fit(iris.data[60:], iris.target[60:])
-    pred = ovr.predict(iris.data)
-    ovr2 = OneVsRestClassifier(MultinomialNB())
-    pred2 = ovr2.fit(iris.data, iris.target).predict(iris.data)
+    ovr.partial_fit(X[:7], y[:7], np.unique(y))
+    ovr.partial_fit(X[7:], y[7:])
+    pred = ovr.predict(X)
+    ovr1 = OneVsRestClassifier(MultinomialNB())
+    pred1 = ovr1.fit(X, y).predict(X)
+    assert_equal(np.mean(pred == y), np.mean(pred1 == y))

+    # Test when mini batches doesn't have all classes
+    # with SGDClassifier
Member:
As a reader of the test, it's not clear why you would want to test with both of these base estimators.

Also: can we do it in a loop, for clarity that they're the same test?

Contributor Author:
In the previous PR #6239, the author was able to create a fix that worked with MultinomialNB but not with SGDClassifier; that's why I added SGDClassifier as well. But I think testing with only SGDClassifier is enough, @jnothman?

Member:
Yes, happy for it to just be SGDClassifier.

+    ovr = OneVsRestClassifier(SGDClassifier(n_iter=1, shuffle=False,
+                                            random_state=0))
+    ovr.partial_fit(X[:7], y[:7], np.unique(y))
+    ovr.partial_fit(X[7:], y[7:])
+    pred = ovr.predict(X)
+    ovr1 = OneVsRestClassifier(SGDClassifier(n_iter=1, shuffle=False,
+                                             random_state=0))
+    pred1 = ovr1.fit(X, y).predict(X)
+    assert_equal(np.mean(pred == y), np.mean(pred1 == y))

-    assert_almost_equal(pred, pred2)
-    assert_equal(len(ovr.estimators_), len(np.unique(iris.target)))
-    assert_greater(np.mean(iris.target == pred), 0.65)
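For reference, a hedged sketch of the loop form suggested in the review above (hypothetical, not what this commit contains); `clone` from `sklearn.base` gives each run a fresh, unfitted copy of the base estimator:

```python
from sklearn.base import clone

for base_clf in (MultinomialNB(),
                 SGDClassifier(n_iter=1, shuffle=False, random_state=0)):
    ovr = OneVsRestClassifier(clone(base_clf))
    ovr.partial_fit(X[:7], y[:7], np.unique(y))
    ovr.partial_fit(X[7:], y[7:])
    pred = ovr.predict(X)
    ovr1 = OneVsRestClassifier(clone(base_clf))
    pred1 = ovr1.fit(X, y).predict(X)
    assert_equal(np.mean(pred == y), np.mean(pred1 == y))
```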

+def test_ovr_partial_fit_exceptions():
+    ovr = OneVsRestClassifier(MultinomialNB())
+    X = np.abs(np.random.randn(14, 2))
+    y = [1, 1, 1, 1, 2, 3, 3, 0, 0, 2, 3, 1, 2, 3]
+    ovr.partial_fit(X[:7], y[:7], np.unique(y))
+    # A new class value which was not in the first call of partial_fit;
+    # it should raise a ValueError
+    y1 = [5] + y[7:-1]
+    assert_raises(ValueError, ovr.partial_fit, X=X[7:], y=y1)
Member:
Can you use assert_raises_regex to check the error message as well?

Contributor Author:
@lesteve Done!

Member:
but then you can drop this line.
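A hedged sketch of what the suggested `assert_raises_regex` check could look like (the regex is an assumption based on the error message in the diff above; `assert_raises_regex` checks both the exception type and its message):

```python
from sklearn.utils.testing import assert_raises_regex

# Verify the type and (part of) the message of the ValueError
# raised when a mini-batch contains a previously unseen class.
assert_raises_regex(ValueError, "Mini-batch contains .+ while classes",
                    ovr.partial_fit, X=X[7:], y=y1)
```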



def test_ovr_ovo_regressor():
@@ -204,7 +228,6 @@ def test_ovr_multiclass():
     for base_clf in (MultinomialNB(), LinearSVC(random_state=0),
                      LinearRegression(), Ridge(),
                      ElasticNet()):
-
         clf = OneVsRestClassifier(base_clf).fit(X, y)
         assert_equal(set(clf.classes_), classes)
         y_pred = clf.predict(np.array([[0, 0, 4]]))[0]