# [MRG+1] add multinomial SAG solver for LogisticRegression #5251

Merged
merged 1 commit into scikit-learn:master from TomDLT/sag_multi on Dec 4, 2015

## Conversation

Member

### TomDLT commented Sep 11, 2015

I added the multinomial SAG solver in LogisticRegression. The results are identical to the lbfgs and newton-cg multinomial solvers. The benchmark (gist) on the RCV1 dataset is very promising. I only took 4 classes of the dataset, and removed samples that have more than one class.

n_samples_train = 129073
n_samples_test = 23149
n_features = 47236
n_classes = 4

I also added a plot_logistic_multinomial.py example to show the difference between OvR and Multinomial.

In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?
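For readers skimming the thread, here is a minimal usage sketch of the new option combination on a toy dataset (an illustration only, not the RCV1 benchmark from the gist):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# toy 4-class problem standing in for the RCV1 subset used in the benchmark
X, y = make_blobs(n_samples=1000, centers=4, random_state=42)

for multi_class in ("ovr", "multinomial"):
    clf = LogisticRegression(solver="sag", multi_class=multi_class,
                             max_iter=100, random_state=42).fit(X, y)
    print(multi_class, "train score:", clf.score(X, y))
```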

Member

### ogrisel commented Sep 11, 2015

> In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?

Sure, but probably in another PR.

### ogrisel reviewed Sep 11, 2015

 @@ -640,9 +640,11 @@ hyperparameters :math:`\lambda_1` and :math:`\lambda_2`. .. topic:: References: .. [1] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1

#### ogrisel Sep 11, 2015

Member

Have you checked the HTML rendering of this change?

### ogrisel reviewed Sep 11, 2015

 return log(out) + vmax cdef class MultinomialLogLoss:

#### ogrisel Sep 11, 2015

Member

Maybe you could insert a docstring with the math formula of the multinomial loss here.

#### TomDLT Sep 11, 2015

Author Member
"""
The multinomial logistic loss is equal to:
loss = - sw \sum_c \delta_{y,c} (prediction[c] - logsumexp(prediction))
= sw (logsumexp(prediction) - prediction[y])

where:
prediction = dot(x_sample, weights) + intercept
\delta_{y,c} = 1 if (y == c) else 0
sw = sample_weight
"""

and

"""
The gradient of the multinomial logistic loss is equal to:
grad_c = sw * (p[c] - \delta_{y,c})

where:
p[c] = exp(prediction[c] - logsumexp(prediction))
prediction = dot(x_sample, weights) + intercept
\delta_{y,c} = 1 if (y == c) else 0
sw = sample_weight

Note that to obtain the true gradient, this value has to be multiplied
by the sample vector x.
"""

Feel free to suggest a clearer formulation; I don't like this one so much.

#### MechCoder Nov 16, 2015

Member

Is it worth mentioning that this is "the gradient of the multinomial logistic loss for every sample"? (Though that is kind of obvious here.)

Member

### ogrisel commented Sep 11, 2015

Out of curiosity, what are the relative fit times of:

- sag OVR
- sag multinomial

with both n_jobs=1 and n_jobs=4, for instance on MNIST or RCV1?
Member

### amueller commented Sep 11, 2015

 awesome :)
Member

### ogrisel commented Sep 11, 2015

> awesome :)

Indeed, that's awesome.
Member Author

### TomDLT commented Sep 11, 2015

> Out of curiosity what are the relative fit times of: sag OVR / sag multinomial, with both n_jobs=1 and n_jobs=4, for instance on MNIST or RCV1?

RCV1 dataset, simplified to 4 classes, as in the previous benchmark:

| n_jobs | max_iter | ovr (sec) | multinomial (sec) |
|--------|----------|-----------|-------------------|
| 1 | 1 | 0.5536 | 0.3770 |
| 1 | 10 | 5.0621 | 3.3201 |
| 1 | 100 | 50.4057 | 33.0481 |
| 4 | 1 | 0.2245 | 0.4381 |
| 4 | 10 | 1.9564 | 3.4660 |
| 4 | 100 | 18.1681 | 33.0120 |

Yet it converges to different solutions:

| multi_class | train score | test score |
|-------------|-------------|------------|
| ovr | 0.9857 | 0.9781 |
| multinomial | 0.9882 | 0.9786 |
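For context, a rough sketch of how such a timing comparison could be set up (a synthetic stand-in; the RCV1 preprocessing from the gist and the numbers above are not reproduced by this snippet):

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# hypothetical stand-in for the 4-class RCV1 subset used in the benchmark
X, y = make_classification(n_samples=20000, n_features=200, n_informative=50,
                           n_classes=4, random_state=0)

for n_jobs in (1, 4):
    for multi_class in ("ovr", "multinomial"):
        clf = LogisticRegression(solver="sag", multi_class=multi_class,
                                 max_iter=10, n_jobs=n_jobs, random_state=0)
        t0 = time()
        clf.fit(X, y)
        print("n_jobs=%d, %11s: %.4f sec, train score %.4f"
              % (n_jobs, multi_class, time() - t0, clf.score(X, y)))
```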
Member

### mblondel commented Sep 12, 2015

> In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?

I would say no. Sure, SGD can do partial_fit and SAG can't, but this will add a lot of code indirection for little benefit.

Member Author

### TomDLT commented Sep 16, 2015

> awesome :)

Thanks! Feel free to review :)

### agramfort reviewed Sep 16, 2015

 logistic regression model [3]_, which means that its probability estimates should be better calibrated than the default "one-vs-rest" setting. The "lbfgs", "sag" and "newton-cg" solvers cannot optimize L1-penalized models, though, so the "multinomial" setting does not learn sparse models.

#### agramfort Sep 16, 2015

Member

though, so the "multinomial" setting does not learn sparse models.

->

therefore the "multinomial" setting does not learn sparse models.

### agramfort reviewed Sep 16, 2015

 warm-starting. For the multiclass case, if multi_class option is set to "ovr", an optimal C is obtained for each class and if the multi_class option is set to "multinomial", an optimal C is obtained that minimizes the cross- entropy loss.

#### agramfort Sep 16, 2015

Member

an optimal C is obtained that minimizes the cross-entropy loss.
->

an optimal C is obtained by minimizing the cross-entropy loss.

Member

### agramfort commented Sep 16, 2015

That's it for me. Looks pretty clean! Good job.

### ogrisel reviewed Sep 23, 2015

#### ogrisel Sep 23, 2015

Member

Could you please add a test that checks the output of _multinomial_loss_grad or _multinomial_grad_loss_all_samples on a small dataset where it is possible to analytically compute the ground truth from the formula mentioned in the docstrings (e.g. with 2 samples in 2D), and check against the expected value?

#### TomDLT Sep 24, 2015

Author Member

I am not sure I understand what you mean.
Do you want to duplicate _multinomial_loss_grad in the test?
Do you want to compare its output with a hard-coded result?
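One way to read the request: evaluate _multinomial_loss_grad on a tiny, hand-checkable input and compare against values derived directly from the docstring formulas, without duplicating the implementation. A hedged sketch follows; the import path and the exact signature used below (flattened weights, one-hot labels, alpha, sample_weight) are assumptions about the code at the time:

```python
import numpy as np
from numpy.testing import assert_almost_equal
from scipy.special import logsumexp

from sklearn.linear_model.logistic import _multinomial_loss_grad

def test_multinomial_loss_grad_ground_truth():
    # 2 samples in 2D, 2 classes, no regularization (alpha=0)
    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    y = np.array([0, 1])
    Y = np.eye(2)[y]                           # one-hot encoded labels
    w = np.array([[0.1, -0.2], [0.3, 0.4]])    # shape (n_classes, n_features)
    sample_weight = np.ones(2)

    loss, grad, _ = _multinomial_loss_grad(w.ravel(), X, Y, 0.0, sample_weight)

    # ground truth computed directly from the docstring formulas
    pred = X.dot(w.T)                          # shape (n_samples, n_classes)
    lse = logsumexp(pred, axis=1)
    expected_loss = np.sum(lse - pred[np.arange(2), y])
    p = np.exp(pred - lse[:, np.newaxis])      # softmax probabilities
    expected_grad = (p - Y).T.dot(X)           # sum over samples of grad_c * x

    assert_almost_equal(loss, expected_loss)
    assert_almost_equal(grad.reshape(2, 2), expected_grad)
```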

Member

### MechCoder commented Nov 16, 2015

 I can review this in the coming week !
Member Author

### TomDLT commented Nov 17, 2015

 Great !

### MechCoder reviewed Nov 17, 2015

 @@ -186,6 +187,8 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1., # As in SGD, the alpha is scaled by n_samples. alpha_scaled = float(alpha) / n_samples n_classes = int(y.max()) + 1 if loss == 'multinomial' else 1

#### MechCoder Nov 17, 2015

Member

Is it okay to assume y is label encoded?

### MechCoder reviewed Nov 17, 2015

 Loss function that will be optimized. 'log' is used for classification, like in LogisticRegression. 'squared' is used for regression, like in Ridge. loss : 'log' | 'squared' | 'multinomial'

#### MechCoder Nov 17, 2015

Member

Can you let me know how this auto step size is set?

#### MechCoder Nov 17, 2015

Member

Minor: Should we move max_squared_sum to utils at some point?

#### TomDLT Nov 18, 2015

Author Member

Did you check get_auto_step_size's docstring?

About max_squared_sum, I don't know in which case it would be useful, but we can definitely move it to utils, maybe with a caller function to check the input.
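For context, the automatic step size is the inverse of an upper bound on the Lipschitz constant of an individual sample's gradient. A rough sketch of the heuristic for the log/multinomial loss (the exact constants live in get_auto_step_size; treat this as an illustration, not the implementation):

```python
def auto_step_size_sketch(max_squared_sum, alpha_scaled, fit_intercept):
    # For the log loss, the Lipschitz constant of one sample's gradient is
    # bounded by 0.25 * ||x_i||^2 (plus 1 for the intercept "feature"), and
    # the L2 penalty adds alpha_scaled; SAG is stable with a step of ~ 1 / L.
    L = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha_scaled
    return 1.0 / L
```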

#### MechCoder Nov 19, 2015

Member

I at least use row_norms with squared=True frequently. Speaking of which let us take a step back.

```python
from sklearn.utils.extmath import row_norms

def also_get_max_squared_sum(X):
    return row_norms(X, squared=True).max()

X = np.random.rand(1000, 10000)
%timeit get_max_squared_sum(X)
# 100 loops, best of 3: 10.8 ms per loop
%timeit also_get_max_squared_sum(X)
# 100 loops, best of 3: 5.36 ms per loop

X = np.random.rand(10000, 1000)
%timeit get_max_squared_sum(X)
# 100 loops, best of 3: 10.7 ms per loop
%timeit also_get_max_squared_sum(X)
# 100 loops, best of 3: 5.33 ms per loop
```

This should not matter much since it seems like it is done only once, but I feel it is unnecessary code duplication with almost no gain. (Unless I am overlooking something, as usual.)

#### TomDLT Nov 20, 2015

Author Member

I didn't know about row_norms, I will remove get_max_squared_sum.

### MechCoder reviewed Nov 17, 2015

 # the data pointer for X, the training set cdef double *x_data_ptr cdef double *x_data_ptr = NULL

#### MechCoder Nov 17, 2015

Member

Inserting spaces after every cdef might be good. It's kind of difficult to read now with the comments in between

### MechCoder reviewed Nov 17, 2015

 # the index pointer for the column of the data cdef int *x_ind_ptr # the label for the sample cdef int *x_ind_ptr = NULL

#### MechCoder Nov 17, 2015

Member

nit: Then x_col_ptr might be a better name?

#### TomDLT Nov 18, 2015

Author Member

I used the same name as in seq_dataset, which follows the usual naming for the scipy sparse CSR structure.

#### MechCoder Nov 19, 2015

Member

great, makes sense

### MechCoder reviewed Nov 17, 2015

 raise ValueError("Invalid loss parameter: got %s instead of " "one of ('log', 'squared', 'multinomial')" % loss_function) with nogil: start_time = time(NULL) while True:

#### MechCoder Nov 17, 2015

Member

Minor: Do you think it would read better to do `for n_iter in range(max_iter)` here and remove the infinite while loop?

### MechCoder reviewed Nov 17, 2015

 with nogil: start_time = time(NULL) while True: if infinity: break for itr in range(n_samples):

#### MechCoder Nov 17, 2015

Member

itr -> sample_ind?

#### MechCoder Nov 17, 2015

Member

Oh sorry, I see that this is not the sample_ind, but it still might be better to name it sample_itr.

### MechCoder reviewed Nov 17, 2015

 feature_hist[idx] = itr # check to see that the weight is not inf or NaN if not skl_isfinite(weights[idx]): infinity = True

#### MechCoder Nov 17, 2015

Member

Would it be better to raise the ValueError here and do away with the infinity variable, instead of breaking multiple times and checking at the end?

There are only two times where the value of infinity changes.

### MechCoder reviewed Nov 17, 2015

 # helper variable for indexes cdef int idx cdef int idx, f, c, j

#### MechCoder Nov 17, 2015

Member

f -> feature_ind, c -> class_ind

### MechCoder reviewed Dec 2, 2015

 The multinomial logistic loss for one sample is: loss = - sw \sum_c \delta_{y,c} (prediction[c] - logsumexp(prediction)) = sw (logsumexp(prediction) - prediction[y])

#### MechCoder Dec 2, 2015

Member

Did you mean sw * (logsum(prediction) - prediction[c] * y)? Also, I think we can do away with the previous \delta_{y, c} definition then.

#### MechCoder Dec 2, 2015

Member

Sorry, I got confused. We can still remove the previous line though.

#### TomDLT Dec 3, 2015

Author Member

I find it more explicit to reuse the formula of the reference and to simplify only in the second line.
wdyt?

Member

OK.

### MechCoder reviewed Dec 2, 2015

#### MechCoder Dec 2, 2015

Member

Can you document warm_start_mem? Right now I don't know what it is without looking into the code.

Author Member

Done

### MechCoder reviewed Dec 2, 2015

 double sample_weight, double* gradient_ptr) nogil: """Multinomial Logistic regression gradient of the loss. The gradient of the multinomial logistic loss for one sample is:

#### MechCoder Dec 2, 2015

Member

with respect to a class c

#### MechCoder Dec 2, 2015

Member

Out of curiosity what does _dloss stand for?

#### TomDLT Dec 3, 2015

Author Member

loss is the loss, dloss is the derivative of the loss, as used in sgd_fast.pyx

### MechCoder reviewed Dec 2, 2015

 # y is the indice of the correct class of current sample. gradient_ptr[int(y)] -= 1.0 for class_ind in range(n_classes):

#### MechCoder Dec 2, 2015

Member

Nit: You can set gradient_ptr[int(y)] = -1 initially and do this loop along with the first one, preventing iterating twice across n_classes, right?

Author Member

Done

### MechCoder reviewed Dec 2, 2015

 for j in range(xnnz): feature_ind = x_ind_ptr[j] innerprod += (w_data_ptr[feature_ind * n_classes + class_ind] * x_data_ptr[j])

#### MechCoder Dec 2, 2015

Member

Will this be slower due to random memory access patterns? Note that w is C-contiguous across class_ind, so it should help if the inner loop is across n_classes. That way you also prevent repeatedly accessing feature_ind, namely:

```
for class_ind in range(n_classes):
    prediction[class_ind] = 0.0

for j in range(xnnz):
    feature_ind = ...
    x_data = ...
    w_start_ind = feature_ind * n_classes
    for class_ind in range(n_classes):
        prediction[class_ind] += w_data_ptr[w_start_ind + class_ind] * x_data
```

This should not matter much for sparse data but still..

#### TomDLT Dec 3, 2015

Author Member

I just did some benchmarks on dense and sparse data, with multinomial and binary classification, but the gain is small (-3% runtime at best), when it is not a slowdown (up to +11%):

- sparse (162529, 47236), binary: +0%
- sparse (162529, 47236), multinomial: +3%
- dense (26851, 47236), binary: +11%
- dense (26851, 47236), multinomial, 4 classes: -3%

I understand why it should be a bit faster, and I don't know why it's not in practice.

Here is my code, if I missed something:

```
cdef int feature_ind, class_ind, w_start_ind, j
cdef double x_data

for class_ind in range(n_classes):
    prediction[class_ind] = 0.0

# Compute the dot product only on non-zero elements of x
for j in range(xnnz):
    feature_ind = x_ind_ptr[j]
    x_data = x_data_ptr[j]
    w_start_ind = feature_ind * n_classes
    for class_ind in range(n_classes):
        prediction[class_ind] += w_data_ptr[w_start_ind + class_ind] * x_data

for class_ind in range(n_classes):
    prediction[class_ind] *= wscale
    prediction[class_ind] += intercept[class_ind]
```

#### MechCoder Dec 3, 2015

Member

Thank you for the benchmarks. For sparse data, I did not expect a huge gain because of the random memory access into w_data_ptr.

Since we are losing slightly on dense data in the multiclass case, as a last-ditch effort, would it be useful to avoid the final loop by doing something like:

```
for class_ind in range(n_classes):
    prediction[class_ind] ....
    if j == xnnz - 1:
        prediction[class_ind] *= wscale
```


Not sure if it will help or not. Btw, it seems you can do prediction[class_ind] = intercept[class_ind] when you iterate across n_classes initially.

#### TomDLT Dec 4, 2015

Author Member

The high loss (+11%) is in the binary case, so I don't think your suggestion would change much.

#### MechCoder Dec 4, 2015

Member

So the +11% represents a loss? I thought it was a gain. Then it might be better to revert it. It might be worth incrementing w_data_ptr instead of repeatedly accessing it.

#### TomDLT Dec 4, 2015

Author Member

Reverted and squashed

Member

### MechCoder commented Dec 2, 2015

I'm just curious about how the INTERCEPT_DECAY value for sparse data is set to 0.01. Also, the doc for multi_class should be updated; see #5897. LGTM when the minor queries are addressed.

### MechCoder reviewed Dec 2, 2015

 Plot decision surface of multinomial and One-vs-Rest Logistic Regression. The hyperplanes corresponding to the three One-vs-Rest (OVR) classifiers are represented by the dashed lines.

#### MechCoder Dec 2, 2015

Member

Maybe, you should combine this with the already existing plot_iris_logistic.py

#### TomDLT Dec 3, 2015

Author Member

I fear the multinomial example is not very visual with the iris dataset:

[decision-surface plots for ovr and multinomial omitted]

training score: 0.807 (ovr)
training score: 0.833 (multinomial)

#### MechCoder Dec 3, 2015

Member

I did not see that this was make_blobs, sorry.

Member Author

### TomDLT commented Dec 3, 2015

> I'm just curious about how the INTERCEPT_DECAY value for sparse_data is set to 0.01.

The intercept decay was already used in SGD, and comes from Léon Bottou: "The learning rate for the bias is multiplied by 0.01 because this frequently improves the condition number."

> Also the doc for multi_class should be updated. See #5897

Done.


Member

### MechCoder commented Dec 3, 2015

 lgtm apart from that one inline comment
 ENH add multinomial SAG solver for LogisticRegression 
 9f136ff 

Member

### MechCoder commented Dec 4, 2015

 Will merge pending Travis !
Member Author

### TomDLT commented Dec 4, 2015

 Great ! Thanks again @MechCoder for the meticulous review !

### MechCoder added a commit that referenced this pull request Dec 4, 2015

 Merge pull request #5251 from TomDLT/sag_multi 
[MRG+1] add multinomial SAG solver for LogisticRegression
 43e5454 

### MechCoder merged commit 43e5454 into scikit-learn:master Dec 4, 2015

#### 2 checks passed

- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed
Member

### MechCoder commented Dec 4, 2015

 Merged !
Member

### MechCoder commented Dec 4, 2015

 It should be straightforward to support partial_fit now. Do you want me to do that?
Member

### agramfort commented Dec 4, 2015

partial_fit with SAG? How do you handle the gradient memory?
Member

### MechCoder commented Dec 7, 2015

We rewrite it for every partial_fit? I.e., just keep the coefficients and intercept. It should still converge faster than plain SGD for every partial_fit, no?
Member

### agramfort commented Dec 7, 2015

 I don't think this can work or at least this is not SAG anymore.
Member

### amueller commented Dec 10, 2015

 Btw, was the reset of w_scale in the paper / reference implementation or is that a new idea for this implementation?
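For context, a hedged sketch of the scaling trick in question (plain Python with assumed variable names; the actual solver code is Cython and may differ in detail): the coefficients are stored as wscale * weights so that the L2 shrinkage is a single scalar multiplication, and wscale is folded back into the weights when it becomes tiny:

```python
# L2 regularization shrinks all weights by a constant factor each step,
# which is applied lazily to the scalar wscale instead of to every weight.
wscale *= 1.0 - step_size * alpha_scaled

# reset: fold wscale back into the weights before it underflows to zero
if wscale < 1e-9:
    for idx in range(n_features * n_classes):
        weights[idx] *= wscale
    wscale = 1.0
```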
Member

### amueller commented Dec 10, 2015

 can we get a "versionadded" entry in LogisticRegression? Or is there one and I didn't see it?
Member

### amueller commented Dec 10, 2015

 btw, should there be an example showing how this is calibrated and OVR is not?
Member Author

### TomDLT commented Dec 11, 2015

> It should be straightforward to support partial_fit now.

I don't think it makes sense for SAG, or at least it is not straightforward.

> Btw, was the reset of w_scale in the paper / reference implementation or is that a new idea for this implementation?

I don't remember seeing this in the paper, but it makes sense numerically, and it is present in Mark Schmidt's C code.

> can we get a "versionadded" entry in LogisticRegression? Or is there one and I didn't see it?

Will do.

> btw, should there be an example showing how this is calibrated and OVR is not?

What do you mean?


Member

### MechCoder commented Dec 11, 2015

Yes, it does not. I was mistaken.
Member

### amueller commented Dec 11, 2015

If this is inspired by Mark Schmidt's code, we should probably reference that (if we don't already).

For calibration: I think it would be interesting to compare OVR and multinomial in terms of calibration, see http://scikit-learn.org/dev/modules/calibration.html
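One rough way to make that comparison (a sketch using the multiclass log loss of predict_proba as a crude calibration proxy, rather than the reliability diagrams from the linked documentation):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=5000, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for multi_class in ("ovr", "multinomial"):
    clf = LogisticRegression(solver="sag", multi_class=multi_class,
                             max_iter=200, random_state=0)
    clf.fit(X_train, y_train)
    print("%11s: test log loss %.4f"
          % (multi_class, log_loss(y_test, clf.predict_proba(X_test))))
```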