MRG multiclass probability estimates for SGDClassifier #1814

Closed
wants to merge 3 commits

4 participants

@larsmans
Owner

As discussed on the ML. Not done for loss="modified_huber" since I didn't see how to generalize its way of producing probabilities to the multiclass case.

Cc @pprett.

@mblondel
Owner

I didn't see how to generalize its way of producing probabilities to the multiclass case

Just use a for loop.

@larsmans
Owner

The code isn't the problem; it's the fact that the probabilities may all turn out to be zero. In the (rather ugly, sorry) plot below, there are three classes. I've blacked out the region where

scores = clf.decision_function(X)
prob = (np.clip(scores, -1, 1) + 1) / 2
prob.sum(axis=1)

is exactly zero. Should we just special-case this and return a uniform distribution over that region? (I guess this is unlikely to happen, or even impossible, with properly scaled inputs.)

[Plot: decision regions for three classes, with the region where the clipped probabilities sum to exactly zero blacked out.]
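
For concreteness, here is a minimal NumPy sketch (made-up scores, not code from this PR) showing how the clipped estimates can sum to exactly zero, together with the uniform fallback suggested above:

import numpy as np

# Hypothetical decision_function output for two samples and three
# classes; every score for the second sample is below -1.
scores = np.array([[0.5, -0.2, -1.5],
                   [-2.0, -1.3, -4.0]])

# Per-class binary probability estimates, as in the snippet above.
prob = (np.clip(scores, -1, 1) + 1) / 2
prob_sum = prob.sum(axis=1)        # second row sums to exactly 0

# Fall back to a uniform distribution wherever normalization would
# divide by zero, then normalize as usual.
all_zero = (prob_sum == 0)
prob[all_zero, :] = 1.
prob_sum[all_zero] = prob.shape[1]
prob /= prob_sum[:, np.newaxis]
print(prob)                        # second row is [1/3, 1/3, 1/3]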

@mblondel
Owner

+1 for uniform probability in the multiclass case.

In the multilabel case, all zero probabilities would be a perfectly valid result: it would mean that the instance is associated with no label (but SGDClassifier doesn't support multi-label).

@larsmans
Owner

I'll implement that.

There's one problem left, though: the hard zero/one probabilities make predict_log_proba take log(0), which NumPy flags as a zero division. That's not a new problem, and I won't tackle it now.
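
(For what it's worth, a tiny sketch of that failure mode, not from this PR: hard zeros make np.log emit -inf along with a divide-by-zero warning.)

import numpy as np

p = np.array([[0.0, 1.0]])    # hard probabilities, as modified_huber can produce
with np.errstate(divide="ignore"):
    print(np.log(p))          # [[-inf   0.]]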

@larsmans
Owner

Done.

@mblondel
Owner

+1 for merge, thanks a lot!

@arjoly
Owner

If it isn't SGD-specific, I think you should put the Huber loss in the metrics module, so anybody can re-use your work. :-)

@larsmans
Owner

I didn't implement the Huber loss, it was already there. I just implemented multiclass probability normalization on top of the existing binary probability estimates.

If you want modified Huber loss in metrics, feel free to implement it; I never use it myself (and the fact that it gives hard zero/one probabilities is enough reason not to use it in the near future).

@arjoly
Owner

I was thinking that some of the code could be re-used. There is also a Huber loss function in the gradient boosting module.

Looking more closely at the code, it seems I mistook the two.

@arjoly
Owner

I didn't implement the Huber loss, it was already there. I just implemented multiclass probability normalization on top of the existing binary probability estimates.

If you want modified Huber loss in metrics, feel free to implement it; I never use it myself (and the fact that it gives hard zero/one probabilities is enough reason not to use it in the near future).

Fine, thanks for your reply.

@larsmans
Owner

I can't seem to load the Travis report. Will do a full test on my workstation to see if I can find out why the build fails.

@larsmans
Owner

make clean test passes here, no idea what's up with Travis.

@mblondel
Owner

@arjoly The function in question is not a metric but a way to convert the result of decision_function to a probability.

@larsmans IIRC, this way of computing the probability is equal in expectation to the true probability (see the referenced paper). The fact that it's sparse could be useful if you use it as a feature for some downstream system.
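
A small illustration of that sparsity (made-up scores, not from the PR): the clipped modified_huber estimates contain exact zeros and ones, while the log-loss sigmoid estimates stay strictly inside (0, 1).

import numpy as np

scores = np.array([[3.2, -1.7, -2.4],
                   [-0.3, 0.9, -1.1]])

huber_prob = (np.clip(scores, -1, 1) + 1) / 2   # rows contain hard 0s and 1s
log_prob = 1. / (1. + np.exp(-scores))          # strictly between 0 and 1
print(huber_prob)
print(log_prob)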

@larsmans
Owner

I'm not doubting the correctness or usefulness of this way of computing probabilities. But the way I currently use predict_proba output is in a self-training loop that fits a log-loss variant. That's going to crash unless I artificially clip the probabilities to [ε, 1 − ε].

Now you know why I was so quick to implement this; I've been using a hacked SGDClassifier in my own work for some time now :)
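
A sketch of that clipping workaround, with an arbitrary eps that is my own choice rather than anything from this PR:

import numpy as np

def safe_log_proba(p, eps=1e-12):
    # Keep hard 0/1 probabilities away from the boundary so that a
    # downstream log-loss computation stays finite.
    return np.log(np.clip(p, eps, 1. - eps))

p = np.array([[0.0, 1.0], [0.25, 0.75]])
print(safe_log_proba(p))   # finite everywhere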

@larsmans
Owner

@pprett Any comments on the PR?

@pprett pprett commented on the diff
sklearn/linear_model/stochastic_gradient.py
((21 lines not shown))
elif self.loss == "modified_huber":
- proba[:, 1] = (np.clip(scores, -1, 1) + 1) / 2.
+ binary = (len(self.classes_) == 2)
+ scores = self.decision_function(X)
+
+ if binary:
+ prob2 = np.ones((scores.shape[0], 2))
+ prob = prob2[:, 1]
+ else:
+ prob = scores
+
+ np.clip(scores, -1, 1, prob)
+ prob[:] += 1.
+ prob[:] /= 2.
@pprett Owner
pprett added a note

@larsmans I'm curious, what would be the difference if you omit the [:] ? thx

@larsmans Owner

Nothing, I guess. This is a leftover from the old code; I'll remove it.
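
For readers following along, a quick demonstration (my own sketch, not PR code) of why the [:] is redundant here: prob is a view into prob2, and NumPy's in-place operators write through views either way.

import numpy as np

prob2 = np.ones((3, 2))
prob = prob2[:, 1]    # a view into the second column of prob2

prob += 1.            # in-place add; writes through the view
print(prob2[:, 1])    # [ 2.  2.  2.] -- same effect as prob[:] += 1.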

@pprett
Owner

looks great to me +1 for merge

btw: I learned a lot (not only about subtle issues in probability estimates but also numpy kungfu) - thanks for the feature!

@larsmans larsmans closed this in 0180188
@larsmans
Owner

The test failure turned out to be an mldata issue. Pushed to master.

3  doc/whats_new.rst
@@ -51,6 +51,9 @@ Changelog
converts their ``coef_`` into a sparse matrix, meaning stored models
trained using these estimators can be made much more compact.
+ - :class:`linear_model.SGDClassifier` now produces multiclass probability
+ estimates when trained under log loss or modified Huber loss.
+
- Hyperlinks to documentation in example code on the website by
`Martin Luessi`_.
19 sklearn/linear_model/base.py
@@ -222,6 +222,25 @@ def predict(self, X):
indices = scores.argmax(axis=1)
return self.classes_[indices]
+ def _predict_proba_lr(self, X):
+ """Probability estimation for OvR logistic regression.
+
+ Positive class probabilities are computed as
+ 1. / (1. + np.exp(-self.decision_function(X)));
+ multiclass is handled by normalizing that over all classes.
+ """
+ prob = self.decision_function(X)
+ prob *= -1
+ np.exp(prob, prob)
+ prob += 1
+ np.reciprocal(prob, prob)
+ if len(prob.shape) == 1:
+ return np.vstack([1 - prob, prob]).T
+ else:
+ # OvR normalization, like LibLinear's predict_probability
+ prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))
+ return prob
+
class SparseCoefMixin(object):
"""Mixin for converting coef_ to and from CSR format.
13 sklearn/linear_model/logistic.py
@@ -120,18 +120,7 @@ def predict_proba(self, X):
Returns the probability of the sample for each class in the model,
where classes are ordered as they are in ``self.classes_``.
"""
- # 1. / (1. + np.exp(-scores)), computed in-place
- prob = self.decision_function(X)
- prob *= -1
- np.exp(prob, prob)
- prob += 1
- np.reciprocal(prob, prob)
- if len(prob.shape) == 1:
- return np.vstack([1 - prob, prob]).T
- else:
- # OvR, not softmax, like Liblinear's predict_probability
- prob /= prob.sum(axis=1).reshape((prob.shape[0], -1))
- return prob
+ return self._predict_proba_lr(X)
def predict_log_proba(self, X):
"""Log of probability estimates.
56 sklearn/linear_model/stochastic_gradient.py
@@ -654,7 +654,12 @@ class SGDClassifier(BaseSGDClassifier, SelectorMixin):
def predict_proba(self, X):
"""Probability estimates.
- Probability estimates are only supported for binary classification.
+ Multiclass probability estimates are derived from binary (one-vs.-rest)
+ estimates by simple normalization, as recommended by Zadrozny and
+ Elkan.
+
+ Binary probability estimates for loss="modified_huber" are given by
+ (clip(decision_function(X), -1, 1) + 1) / 2.
Parameters
----------
@@ -668,34 +673,59 @@ def predict_proba(self, X):
References
----------
+ Zadrozny and Elkan, "Transforming classifier scores into multiclass
+ probability estimates", SIGKDD'02,
+ http://www.research.ibm.com/people/z/zadrozny/kdd2002-Transf.pdf
The justification for the formula in the loss="modified_huber"
case is in the appendix B in:
http://jmlr.csail.mit.edu/papers/volume2/zhang02c/zhang02c.pdf
"""
- if len(self.classes_) != 2:
- raise NotImplementedError("predict_(log_)proba only supported"
- " for binary classification")
-
- scores = self.decision_function(X)
- proba = np.ones((scores.shape[0], 2), dtype=np.float64)
if self.loss == "log":
- proba[:, 1] = 1. / (1. + np.exp(-scores))
+ return self._predict_proba_lr(X)
elif self.loss == "modified_huber":
- proba[:, 1] = (np.clip(scores, -1, 1) + 1) / 2.
+ binary = (len(self.classes_) == 2)
+ scores = self.decision_function(X)
+
+ if binary:
+ prob2 = np.ones((scores.shape[0], 2))
+ prob = prob2[:, 1]
+ else:
+ prob = scores
+
+ np.clip(scores, -1, 1, prob)
+ prob[:] += 1.
+ prob[:] /= 2.
+
+ if binary:
+ prob2[:, 0] -= prob
+ prob = prob2
+ else:
+ # the above might assign zero to all classes, which doesn't
+ # normalize neatly; work around this to produce uniform
+ # probabilities
+ prob_sum = prob.sum(axis=1)
+ all_zero = (prob_sum == 0)
+ if np.any(all_zero):
+ prob[all_zero, :] = 1
+ prob_sum[all_zero] = len(self.classes_)
+
+ # normalize
+ prob /= prob_sum.reshape((prob.shape[0], -1))
+
+ return prob
else:
raise NotImplementedError("predict_(log_)proba only supported when"
" loss='log' or loss='modified_huber' "
- "(%s given)" % self.loss)
- proba[:, 0] -= proba[:, 1]
- return proba
+ "(%r given)" % self.loss)
def predict_log_proba(self, X):
"""Log of probability estimates.
- Log probability estimates are only supported for binary classification.
+ When loss="modified_huber", probability estimates may be hard zeros
+ and ones, so taking the logarithm is not possible.
Parameters
----------
51 sklearn/linear_model/tests/test_sgd.py
@@ -319,9 +319,11 @@ def test_sgd_proba(self):
clf = self.factory(loss="hinge", alpha=0.01, n_iter=10).fit(X, Y)
assert_raises(NotImplementedError, clf.predict_proba, [3, 2])
- # the log and modified_huber losses can output "probability" estimates
- for loss in ("log", "modified_huber"):
- clf = self.factory(loss=loss, alpha=0.01, n_iter=10).fit(X, Y)
+ # log and modified_huber losses can output probability estimates
+ # binary case
+ for loss in ["log", "modified_huber"]:
+ clf = self.factory(loss=loss, alpha=0.01, n_iter=10)
+ clf.fit(X, Y)
p = clf.predict_proba([3, 2])
assert_true(p[0, 1] > 0.5)
p = clf.predict_proba([-1, -1])
@@ -332,6 +334,49 @@ def test_sgd_proba(self):
p = clf.predict_log_proba([-1, -1])
assert_true(p[0, 1] < p[0, 0])
+ # log loss multiclass probability estimates
+ clf = self.factory(loss="log", alpha=0.01, n_iter=10).fit(X2, Y2)
+
+ d = clf.decision_function([[.1, -.1], [.3, .2]])
+ p = clf.predict_proba([[.1, -.1], [.3, .2]])
+ assert_array_equal(np.argmax(p, axis=1), np.argmax(d, axis=1))
+ assert_almost_equal(p[0].sum(), 1)
+ assert_true(np.all(p[0] >= 0))
+
+ p = clf.predict_proba([-1, -1])
+ d = clf.decision_function([-1, -1])
+ assert_array_equal(np.argsort(p[0]), np.argsort(d[0]))
+
+ l = clf.predict_log_proba([3, 2])
+ p = clf.predict_proba([3, 2])
+ assert_array_almost_equal(np.log(p), l)
+
+ l = clf.predict_log_proba([-1, -1])
+ p = clf.predict_proba([-1, -1])
+ assert_array_almost_equal(np.log(p), l)
+
+ # Modified Huber multiclass probability estimates; requires a separate
+ # test because the hard zero/one probabilities may destroy the
+ # ordering present in decision_function output.
+ clf = self.factory(loss="modified_huber", alpha=0.01, n_iter=10)
+ clf.fit(X2, Y2)
+ d = clf.decision_function([3, 2])
+ p = clf.predict_proba([3, 2])
+ if not isinstance(self, SparseSGDClassifierTestCase):
+ assert_equal(np.argmax(d, axis=1), np.argmax(p, axis=1))
+ else: # XXX the sparse test gets a different X2 (?)
+ assert_equal(np.argmin(d, axis=1), np.argmin(p, axis=1))
+
+ # the following sample produces decision_function values < -1,
+ # which would cause naive normalization to fail (see comment
+ # in SGDClassifier.predict_proba)
+ x = X.mean(axis=0)
+ d = clf.decision_function(x)
+ if np.all(d < -1): # XXX not true in sparse test case (why?)
+ p = clf.predict_proba(x)
+ assert_array_almost_equal(p[0], [1/3.] * 3)
+
+
def test_sgd_l1(self):
"""Test L1 regularization"""
n = len(X4)