[WIP] Multinomial logistic regression #2814

Closed
wants to merge 1 commit into from

7 participants

@larsmans
Owner

This adds true multinomial (multiclass) logistic regression to linear_model. It uses L-BFGS, which means that only the L2 penalty is supported, and I didn't bother to implement the dual objective.

The docs need a bit more love; the code could be optimized just a little more; and I'm not really sure how to handle the regularization. Currently, the batch training objective is

-1/m * C * log_loss(w, X, y) + ||w||²

and I'm not sure if the 1/m is appropriate; how does Liblinear do this?

@coveralls

Coverage Status

Coverage remained the same when pulling 04685aa on larsmans:multinomial-logreg into 1b73d8b on scikit-learn:master.

@jnothman jnothman closed this
@jnothman jnothman reopened this
@jnothman
Owner

Sorry about that... The liblinear documentation doesn't mention scaling C by m, FWIW.

@larsmans
Owner

I think we actually had some issues with C scaling in the past, but I don't exactly recall what happened.

sklearn/linear_model/logistic.py
@@ -100,12 +105,74 @@ class LogisticRegression(BaseLibLinear, LinearClassifierMixin,
def __init__(self, penalty='l2', dual=False, tol=1e-4, C=1.0,
fit_intercept=True, intercept_scaling=1, class_weight=None,
- random_state=None):
+ random_state=None, algorithm='liblinear'):
@mblondel Owner
mblondel added a note

lbfgs is a solver and can be applied to any objective, so I don't like the idea of using algorithm="lbfgs" to activate multiclass logistic regression. You could use a multi_class option instead, like LinearSVC. But I'd rather create a separate class MulticlassLogisticRegression, since your solver supports neither the dual option nor L1 regularization.

@GaelVaroquaux Owner

I agree with @mblondel: I would favor a new class (with the corresponding see-also).

@larsmans Owner
larsmans added a note

I was trying to avoid creating a new estimator with a name that suggests LogisticRegression cannot handle the multiclass case, but I agree with your points, and the parametrization is actually a bit different as well (unlike Liblinear, I don't penalize the intercept, and intercept_scaling is ignored). I'll refactor.

@larsmans
Owner

My feeling is that not scaling makes it at least easier to use (lower C values needed), and intuitively a bigger training set should not imply a smaller penalty. I'll get rid of the scaling. I'll also change the SVM-style penalty to an SGD-style one, L(w, X, y) + αR(w).
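For reference, a minimal sketch (not part of this PR) of how the two parametrizations relate: the SVM-style objective C·L(w) + ½‖w‖² and the SGD-style objective L(w)/n + ½α‖w‖² differ only by a constant factor when α = 1/(C·n), so they share the same minimizer.

```python
# Minimal sketch (not from this PR): converting between the SVM-style C and
# the SGD-style alpha.  Minimizing  C * L(w) + 0.5 * ||w||^2  is equivalent
# to minimizing  L(w) / n + 0.5 * alpha * ||w||^2  with  alpha = 1 / (C * n),
# because the second objective is just the first one scaled by 1 / (C * n).
def alpha_from_C(C, n_samples):
    return 1.0 / (C * n_samples)

def C_from_alpha(alpha, n_samples):
    return 1.0 / (alpha * n_samples)
```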

@larsmans
Owner

OK, I've made it a separate estimator. The main problem now is performance: somewhere in the ravel/unravel trickery, the w array gets strange strides and the matrix multiplications become very slow.
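The stride problem can be reproduced in isolation; a hypothetical illustration (not the exact code path in this PR):

```python
import numpy as np

# Reshaping the flat parameter vector returned by L-BFGS and slicing off the
# intercept column yields a non-contiguous view, which BLAS-backed dot
# products can handle much more slowly than a contiguous array.
w_flat = np.zeros(3 * 101)              # e.g. 3 classes, 100 features + bias
w = w_flat.reshape(3, 101)[:, :-1]      # drop the intercept column
print(w.flags['C_CONTIGUOUS'])          # False: strides skip the last column
w = np.ascontiguousarray(w)             # explicit copy restores contiguity
print(w.flags['C_CONTIGUOUS'])          # True
```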

@MechCoder
Owner

@larsmans a brief summary of the work that has been done in this PR so far, and of what is left to do, would be really helpful so that I can work on it.

@larsmans
Owner

@Manoj-Kumar-S Work done: I added this estimator, which implements L2-penalized multiclass LR using L-BFGS. L1 penalty is impossible to implement with that solver.

Todo: I believe this thing works, but it's very slow; try it on 20newsgroups to see exactly how slow. Two things could help speed it up: (1) fix the unraveling of parameters in the loss/gradient function, which produces strange strides that might confuse the matrix multiplication routines, and (2) compute the loss and gradient separately, which might actually be more efficient; I have a hunch that some of the function calls in L-BFGS don't need both.
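A possible sketch of point (2), illustrative only and not part of this PR: expose separate loss and gradient callables that share the expensive forward pass through a one-element cache, so a caller that only needs the objective value does not also pay for the gradient. `_loss_grad` stands for the combined function from this patch.

```python
from scipy.optimize import fmin_l_bfgs_b

class _CachedLossGrad(object):
    """Split a combined loss/gradient function into two cached callables."""

    def __init__(self, loss_grad, args):
        self.loss_grad, self.args = loss_grad, args
        self._key = None
        self._val = None

    def _compute(self, w):
        key = w.tobytes()
        if key != self._key:            # recompute only when w changes
            self._key = key
            self._val = self.loss_grad(w, *self.args)
        return self._val

    def loss(self, w):
        return self._compute(w)[0]

    def grad(self, w):
        return self._compute(w)[1]

# cached = _CachedLossGrad(_loss_grad, (X, Y, C, fit_intercept))
# w, loss, info = fmin_l_bfgs_b(cached.loss, w0.ravel(), fprime=cached.grad)
```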

@mblondel
Owner

L1 penalty is impossible to implement with that solver.

It's not impossible. There is a well-known trick to avoid the non-differentiability of the absolute value, which consists of breaking every weight w_j into two non-negative parts: w_j = w+_j - w-_j. This way, |w_j| = w+_j + w-_j, so the problem can be solved by L-BFGS-B. The obvious disadvantage is that you now have an optimization problem with twice as many variables.
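A sketch of that trick for a generic smooth loss/gradient pair (illustrative names, not part of this PR): split w into non-negative halves so the L1 term becomes linear, then hand the box constraints to L-BFGS-B.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def l1_lbfgsb(loss, grad, n_features, l1_strength, maxiter=100):
    """Minimize loss(w) + l1_strength * ||w||_1 via the w = wp - wm split."""
    def f(v):
        wp, wm = v[:n_features], v[n_features:]
        w = wp - wm
        obj = loss(w) + l1_strength * (wp.sum() + wm.sum())
        g = grad(w)
        # d/dwp = grad(w) + l1_strength,  d/dwm = -grad(w) + l1_strength
        return obj, np.concatenate([g + l1_strength, -g + l1_strength])

    v0 = np.zeros(2 * n_features)
    bounds = [(0, None)] * (2 * n_features)   # wp >= 0, wm >= 0
    v_opt, _, _ = fmin_l_bfgs_b(f, v0, bounds=bounds, maxiter=maxiter)
    return v_opt[:n_features] - v_opt[n_features:]
```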

@larsmans
Owner

Ok, it is possible, but it's not really feasible :) The code is slow enough as it is.

@larsmans
Owner

Should I close this one now that #2862 has been merged?

@MechCoder
Owner

@larsmans for a multi-class problem, the present logistic regression CV implements OvA (one-vs-rest). I suppose we need to implement the true multinomial as well.

@agramfort
Owner
@mblondel
Owner

And also a LinearSVCCV for the squared hinge loss, if you have time ;-) This should just be a matter of replacing the gradient and Hessian expressions.

@mblondel
Owner

Good question. The squared hinge is not twice differentiable, but there is a notion of a "generalized Hessian matrix". If you're interested in the maths, this is discussed in "A Finite Newton Method for Classification Problems" by O. L. Mangasarian. In practice, however, it just amounts to computing the Hessian as if the loss were twice differentiable. This is what the TRON solver in liblinear uses (like newton-cg, it only needs matrix-vector products with the Hessian).
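For concreteness, a sketch (assuming a dense X and an L2 term with strength alpha; not part of this PR) of the Hessian-vector product such a solver would need for the squared hinge, using the generalized Hessian that simply ignores the kink at margin = 1:

```python
import numpy as np

def squared_hinge_hessp(w, v, X, y, alpha):
    """Generalized-Hessian-vector product for
    L(w) = sum_i max(0, 1 - y_i * x_i . w)**2 + 0.5 * alpha * ||w||^2,
    i.e. H v = 2 * X_A^T (X_A v) + alpha * v, A = {i : y_i * x_i . w < 1}.
    """
    margins = y * np.dot(X, w)
    X_active = X[margins < 1]            # samples inside the margin
    return 2.0 * np.dot(X_active.T, np.dot(X_active, v)) + alpha * v
```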

@MechCoder MechCoder referenced this pull request
Closed

[MRG+2] Multinomial Logistic Regression #3490

3 of 3 tasks complete
@MechCoder
Owner

I suppose this can be closed in favor of #3490 ?

@larsmans
Owner

If you're willing to take it up, sure!

@larsmans larsmans closed this
Commits on Feb 1, 2014
  1. @larsmans
19 doc/modules/linear_model.rst
@@ -651,13 +651,28 @@ rather than the sum of square residuals (as in ordinary regression).
Logistic regression is also known in the literature as
logit regression, maximum-entropy classification (MaxEnt)
or the log-linear classifier.
+Several estimators are available for logistic regression.
-The :class:`LogisticRegression` class can be used to do L1 or L2 penalized
-logistic regression. L1 penalization yields sparse predicting weights.
+:class:`LogisticRegression` uses a coordinate descent (CD) algorithm based on
+Liblinear and supports L1 or L2-penalized logistic regression.
+L1 penalization yields sparse predicting weights.
For L1 penalization :func:`sklearn.svm.l1_min_c` allows to calculate
the lower bound for C in order to get a non "null" (all feature weights to
zero) model.
+However, the CD algorithm cannot learn a true multinomial (multiclass) model;
+instead, the optimization problem is decomposed in a "one-vs-rest" fashion
+so separate binary classifiers are trained for all classes.
+This happens under the hood, so :class:`LogisticRegression` instances
+behave as multiclass classifiers.
+
+:class:`MultinomialLR` uses the L-BFGS algorithm
+to learn a true multinomial logistic regression model,
+which means that its probability estimates should be better-calibrated
+than those of :class:`LogisticRegression`.
+L-BFGS cannot optimize L1-penalized models, though,
+so :class:`MultinomialLR` does not learn sparse models.
+
.. topic:: Examples:
* :ref:`example_linear_model_plot_logistic_l1_l2_sparsity.py`
4 examples/document_classification_20newsgroups.py
@@ -37,6 +37,7 @@
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
+from sklearn.linear_model import MultinomialLR
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
@@ -235,7 +236,8 @@ def benchmark(clf):
(RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
(Perceptron(n_iter=50), "Perceptron"),
(PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
- (KNeighborsClassifier(n_neighbors=10), "kNN")):
+ (KNeighborsClassifier(n_neighbors=10), "kNN"),
+ (MultinomialLR(), "Multinomial LR")):
print('=' * 80)
print(name)
results.append(benchmark(clf))
2  sklearn/linear_model/__init__.py
@@ -22,7 +22,7 @@
from .stochastic_gradient import SGDClassifier, SGDRegressor
from .ridge import (Ridge, RidgeCV, RidgeClassifier, RidgeClassifierCV,
ridge_regression)
-from .logistic import LogisticRegression
+from .logistic import LogisticRegression, MultinomialLR
from .omp import (orthogonal_mp, orthogonal_mp_gram, OrthogonalMatchingPursuit,
OrthogonalMatchingPursuitCV)
from .passive_aggressive import PassiveAggressiveClassifier
199 sklearn/linear_model/logistic.py
@@ -1,16 +1,60 @@
# Authors: Fabian Pedregosa
# Alexandre Gramfort
+# Lars Buitinck
# License: 3-clause BSD
import numpy as np
+from scipy.optimize import fmin_l_bfgs_b
from .base import LinearClassifierMixin, SparseCoefMixin
+from ..base import BaseEstimator
from ..feature_selection.from_model import _LearntSelectorMixin
+from ..preprocessing import LabelBinarizer
from ..svm.base import BaseLibLinear
+from ..utils import atleast2d_or_csr
+from ..utils.extmath import logsumexp, safe_sparse_dot
-class LogisticRegression(BaseLibLinear, LinearClassifierMixin,
- _LearntSelectorMixin, SparseCoefMixin):
+class _LogRegMixin(LinearClassifierMixin):
+ def predict_proba(self, X):
+ """Probability estimates.
+
+ The returned estimates for all classes are ordered by the
+ label of classes.
+
+ Parameters
+ ----------
+ X : array-like, shape = [n_samples, n_features]
+
+ Returns
+ -------
+ T : array-like, shape = [n_samples, n_classes]
+ Returns the probability of the sample for each class in the model,
+ where classes are ordered as they are in ``self.classes_``.
+ """
+ return self._predict_proba_lr(X)
+
+ def predict_log_proba(self, X):
+ """Log of probability estimates.
+
+ The returned estimates for all classes are ordered by the
+ label of classes.
+
+ Parameters
+ ----------
+ X : array-like, shape = [n_samples, n_features]
+
+ Returns
+ -------
+ T : array-like, shape = [n_samples, n_classes]
+ Returns the log-probability of the sample for each class in the
+ model, where classes are ordered as they are in ``self.classes_``.
+ """
+ return np.log(self.predict_proba(X))
+
+
+class LogisticRegression(BaseLibLinear, _LogRegMixin, _LearntSelectorMixin,
+ SparseCoefMixin):
"""Logistic Regression (aka logit, MaxEnt) classifier.
In the multiclass case, the training algorithm uses a one-vs.-all (OvA)
@@ -96,6 +140,12 @@ class LogisticRegression(BaseLibLinear, LinearClassifierMixin,
methods for logistic regression and maximum entropy models.
Machine Learning 85(1-2):41-75.
http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
+
+
+ See also
+ --------
+ sklearn.linear_model.MultinomialLR
+ sklearn.linear_model.SGDClassifier
"""
def __init__(self, penalty='l2', dual=False, tol=1e-4, C=1.0,
@@ -107,38 +157,133 @@ def __init__(self, penalty='l2', dual=False, tol=1e-4, C=1.0,
fit_intercept=fit_intercept, intercept_scaling=intercept_scaling,
class_weight=class_weight, random_state=random_state)
- def predict_proba(self, X):
- """Probability estimates.
- The returned estimates for all classes are ordered by the
- label of classes.
+class MultinomialLR(BaseEstimator, _LogRegMixin):
+ """Multinomial logistic regression.
- Parameters
- ----------
- X : array-like, shape = [n_samples, n_features]
+ This class implements logistic regression for multiclass problems. While
+ LogisticRegression is able to do multiclass classification out of the box,
+ the probabilities it estimates are not well-calibrated for such problems
+ as it fits one binary model per class. By contrast, multinomial LR
+ estimators solve a single multiclass optimization problem and minimize
+ the cross-entropy (aka. log loss) over the whole probability
+ distribution P(y=k|X).
- Returns
- -------
- T : array-like, shape = [n_samples, n_classes]
- Returns the probability of the sample for each class in the model,
- where classes are ordered as they are in ``self.classes_``.
- """
- return self._predict_proba_lr(X)
+ Parameters
+ ----------
+ alpha : float, optional
+ Strength of L2 penalty (aka. regularization, weight decay).
+ Note that the (optional) intercept term is not regularized.
- def predict_log_proba(self, X):
- """Log of probability estimates.
+ class_weight : {dict, 'auto'}, optional
+ Over-/undersamples the samples of each class according to the given
+ weights. If not given, all classes are supposed to have weight one.
+ The 'auto' mode selects weights inversely proportional to class
+ frequencies in the training set.
- The returned estimates for all classes are ordered by the
- label of classes.
+ fit_intercept : bool, default: True
+ Whether an intercept (bias) term should be learned and added to the
+ decision function.
+
+ Attributes
+ ----------
+ `coef_` : array, shape = [n_classes, n_features]
+ Coefficient of the features in the decision function.
+
+ `intercept_` : array, shape = [n_classes]
+ Intercept (a.k.a. bias) added to the decision function.
+ If `fit_intercept` is set to False, the intercept is set to zero.
+
+ References
+ ----------
+ C. M. Bishop (2006). Pattern Recognition and Machine Learning. Springer,
+ pp. 205-210.
+
+ See also
+ --------
+ sklearn.linear_model.LogisticRegression
+ sklearn.linear_model.SGDClassifier
+
+ """
+ def __init__(self, alpha=1e-4, fit_intercept=True, class_weight=None):
+ self.alpha = alpha
+ self.class_weight = class_weight
+ self.fit_intercept = fit_intercept
+
+ def fit(self, X, y):
+ """Fit logistic regression model to training data X, y.
Parameters
----------
- X : array-like, shape = [n_samples, n_features]
+ X : {array-like, sparse matrix}, shape = [n_samples, n_features]
+ Training vectors, where n_samples is the number of samples and
+ n_features is the number of features.
- Returns
- -------
- T : array-like, shape = [n_samples, n_classes]
- Returns the log-probability of the sample for each class in the
- model, where classes are ordered as they are in ``self.classes_``.
+ y : array-like, shape = [n_samples]
+ Target vector for the samples in X.
"""
- return np.log(self.predict_proba(X))
+
+ X = atleast2d_or_csr(X)
+ lbin = LabelBinarizer()
+ Y = lbin.fit_transform(y)
+ if Y.shape[1] == 1:
+ Y = np.hstack([1 - Y, Y])
+
+ # Fortran-ordered so we can slice off the intercept in loss_grad and
+ # get contiguous arrays.
+ w = np.zeros((Y.shape[1], X.shape[1] + bool(self.fit_intercept)),
+ order='F')
+
+ C = 1. / self.alpha
+ if C < 0:
+ raise ValueError("Penalty term must be positive; got (alpha=%r)"
+ % self.alpha)
+ w, loss, info = fmin_l_bfgs_b(_loss_grad, w.ravel(),
+ args=[X, Y, C, self.fit_intercept])
+ w = w.reshape(Y.shape[1], -1)
+ if self.fit_intercept:
+ intercept = w[:, -1]
+ w = w[:, :-1]
+ else:
+ intercept = np.zeros(Y.shape[1])
+
+ self.classes_ = lbin.classes_
+ if len(self.classes_) == 2:
+ w = w[1].reshape(1, -1)
+ intercept = intercept[1:]
+ self.coef_ = w
+ self.intercept_ = intercept
+
+ return self
+
+
+def _sqnorm(x):
+ x = x.ravel()
+ return np.dot(x, x)
+
+
+def _loss_grad(w, X, Y, C, fit_intercept):
+ # Cross-entropy loss and its gradient for multinomial logistic regression
+ # (Bishop 2006, p. 209) with L2 penalty (weight decay).
+ w = w.reshape(Y.shape[1], -1)
+ if fit_intercept:
+ intercept = w[:, -1]
+ w = w[:, :-1]
+ else:
+ intercept = 0
+
+ p = safe_sparse_dot(X, w.T)
+ p += intercept
+ p -= logsumexp(p, axis=1).reshape(-1, 1)
+
+ loss = (-C * (Y * p).sum()) + .5 * _sqnorm(w)
+
+ p = np.exp(p, p)
+ diff = p - Y
+ grad = safe_sparse_dot(diff.T, X)
+ grad *= C
+ grad += w
+ if fit_intercept:
+ grad = np.hstack([grad, diff.sum(axis=0).reshape(-1, 1)])
+
+ return loss, grad.ravel()
77 sklearn/linear_model/tests/test_logistic.py
@@ -9,7 +9,7 @@
from sklearn.utils.testing import assert_true
from sklearn.utils.testing import raises
-from sklearn.linear_model import logistic
+from sklearn.linear_model import LogisticRegression, MultinomialLR
from sklearn import datasets
X = [[-1, 0], [0, 1], [1, 1]]
@@ -42,28 +42,26 @@ def test_predict_2_classes():
Make sure it predicts the correct result on simple datasets.
"""
- check_predictions(logistic.LogisticRegression(random_state=0), X, Y1)
- check_predictions(logistic.LogisticRegression(random_state=0), X_sp, Y1)
+ check_predictions(LogisticRegression(random_state=0), X, Y1)
+ check_predictions(LogisticRegression(random_state=0), X_sp, Y1)
- check_predictions(logistic.LogisticRegression(C=100, random_state=0),
+ check_predictions(LogisticRegression(C=100, random_state=0), X, Y1)
+ check_predictions(LogisticRegression(C=100, random_state=0), X_sp, Y1)
+
+ check_predictions(LogisticRegression(fit_intercept=False, random_state=0),
X, Y1)
- check_predictions(logistic.LogisticRegression(C=100, random_state=0),
+ check_predictions(LogisticRegression(fit_intercept=False, random_state=0),
X_sp, Y1)
- check_predictions(logistic.LogisticRegression(fit_intercept=False,
- random_state=0), X, Y1)
- check_predictions(logistic.LogisticRegression(fit_intercept=False,
- random_state=0), X_sp, Y1)
-
def test_error():
"""Test for appropriate exception on errors"""
- assert_raises(ValueError, logistic.LogisticRegression(C=-1).fit, X, Y1)
+ assert_raises(ValueError, LogisticRegression(C=-1).fit, X, Y1)
def test_predict_3_classes():
- check_predictions(logistic.LogisticRegression(C=10), X, Y2)
- check_predictions(logistic.LogisticRegression(C=10), X_sp, Y2)
+ check_predictions(LogisticRegression(C=10), X, Y2)
+ check_predictions(LogisticRegression(C=10), X_sp, Y2)
def test_predict_iris():
@@ -71,24 +69,49 @@ def test_predict_iris():
n_samples, n_features = iris.data.shape
target = iris.target_names[iris.target]
- clf = logistic.LogisticRegression(C=len(iris.data)).fit(iris.data, target)
- assert_array_equal(np.unique(target), clf.classes_)
+ for clf in [LogisticRegression(C=len(iris.data)),
+ MultinomialLR(alpha=(1. / len(iris.data)))]:
+ clf.fit(iris.data, target)
- pred = clf.predict(iris.data)
- assert_greater(np.mean(pred == target), .95)
+ assert_array_equal(np.unique(target), clf.classes_)
- probabilities = clf.predict_proba(iris.data)
- assert_array_almost_equal(probabilities.sum(axis=1), np.ones(n_samples))
+ pred = clf.predict(iris.data)
+ assert_greater(np.mean(pred == target), .95)
+
+ probabilities = clf.predict_proba(iris.data)
+ assert_array_almost_equal(probabilities.sum(axis=1),
+ np.ones(n_samples))
+
+ pred = iris.target_names[probabilities.argmax(axis=1)]
+ assert_greater(np.mean(pred == target), .95)
+
+
+def test_multinomial_validation():
+ lr = MultinomialLR(alpha=-1)
+ assert_raises(ValueError, lr.fit, [[0, 1], [1, 0]], [0, 1])
+
+
+def test_multinomial_binary():
+ """Test multinomial LR on a binary problem."""
+ target = (iris.target > 0).astype(np.intp)
+ target = np.array(["setosa", "not-setosa"])[target]
+
+ clf = MultinomialLR().fit(iris.data, target)
+
+ assert_equal(clf.coef_.shape, (1, iris.data.shape[1]))
+ assert_equal(clf.intercept_.shape, (1,))
+ assert_array_equal(clf.predict(iris.data), target)
- pred = iris.target_names[probabilities.argmax(axis=1)]
- assert_greater(np.mean(pred == target), .95)
+ clf = MultinomialLR(fit_intercept=False).fit(iris.data, target)
+ pred = clf.classes_[np.argmax(clf.predict_log_proba(iris.data), axis=1)]
+ assert_greater(np.mean(pred == target), .9)
def test_sparsify():
"""Test sparsify and densify members."""
n_samples, n_features = iris.data.shape
target = iris.target_names[iris.target]
- clf = logistic.LogisticRegression(random_state=0).fit(iris.data, target)
+ clf = LogisticRegression(random_state=0).fit(iris.data, target)
pred_d_d = clf.decision_function(iris.data)
@@ -114,7 +137,7 @@ def test_inconsistent_input():
y_ = np.ones(X_.shape[0])
y_[0] = 0
- clf = logistic.LogisticRegression(random_state=0)
+ clf = LogisticRegression(random_state=0)
# Wrong dimensions for training data
y_wrong = y_[:-1]
@@ -130,7 +153,7 @@ def test_write_parameters():
#rng = np.random.RandomState(0)
#X = rng.random_sample((5, 10))
#y = np.ones(X.shape[0])
- clf = logistic.LogisticRegression(random_state=0)
+ clf = LogisticRegression(random_state=0)
clf.fit(X, Y1)
clf.coef_[:] = 0
clf.intercept_[:] = 0
@@ -145,13 +168,13 @@ def test_nan():
"""
Xnan = np.array(X, dtype=np.float64)
Xnan[0, 1] = np.nan
- logistic.LogisticRegression(random_state=0).fit(Xnan, Y1)
+ LogisticRegression(random_state=0).fit(Xnan, Y1)
def test_liblinear_random_state():
X, y = datasets.make_classification(n_samples=20)
- lr1 = logistic.LogisticRegression(random_state=0)
+ lr1 = LogisticRegression(random_state=0)
lr1.fit(X, y)
- lr2 = logistic.LogisticRegression(random_state=0)
+ lr2 = LogisticRegression(random_state=0)
lr2.fit(X, y)
assert_array_almost_equal(lr1.coef_, lr2.coef_)
14 sklearn/svm/base.py
@@ -653,11 +653,11 @@ def fit(self, X, y):
self : object
Returns self.
"""
- self._enc = LabelEncoder()
- y = self._enc.fit_transform(y)
- if len(self.classes_) < 2:
- raise ValueError("The number of classes has to be greater than"
- " one.")
+ lenc = LabelEncoder()
+ y = lenc.fit_transform(y)
+ self.classes_ = lenc.classes_
+ if len(lenc.classes_) < 2:
+ raise ValueError("The number of classes must be greater than one.")
X = atleast2d_or_csr(X, dtype=np.float64, order="C")
@@ -704,10 +704,6 @@ def fit(self, X, y):
return self
- @property
- def classes_(self):
- return self._enc.classes_
-
def _get_bias(self):
if self.fit_intercept:
return self.intercept_scaling