
[MRG+1] add multinomial SAG solver for LogisticRegression #5251

Merged
merged 1 commit into from Dec 4, 2015

Conversation

@TomDLT (Member) commented Sep 11, 2015

I added a multinomial SAG solver to LogisticRegression.
The results are identical to those of the lbfgs and newton-cg multinomial solvers.


The benchmark (gist) on the RCV1 dataset is very promising. I only took 4 classes of the dataset and removed samples that have more than one class.

n_samples_train = 129073
n_samples_test = 23149
n_features = 47236
n_classes = 4

[Benchmark plots: dloss, train_score, test_score]


I also added a plot_logistic_multinomial.py to show the difference between OvR and Multinomial.
[Plots: OvR and multinomial decision surfaces]


In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?

@TomDLT TomDLT changed the title [WIP] add multinomial SAG solver for LogisticRegression [MRG] add multinomial SAG solver for LogisticRegression Sep 11, 2015

@ogrisel (Member) commented Sep 11, 2015

In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?

Sure but probably in another PR.

@@ -640,9 +640,11 @@ hyperparameters :math:`\lambda_1` and :math:`\lambda_2`.

.. topic:: References:

.. [1] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 7.2.1

@ogrisel (Member), Sep 11, 2015:

Have you checked the HTML rendering of this change?

return log(out) + vmax


cdef class MultinomialLogLoss:

@ogrisel (Member), Sep 11, 2015:

Maybe you could insert a docstring with the math formula of the multinomial loss here.

@TomDLT (Author), Sep 11, 2015:
"""
The multinomial logistic loss is equal to:
loss = - sw \sum_c \delta_{y,c} (prediction[c] - logsumexp(prediction))
     = sw (logsumexp(prediction) - prediction[y])

where:
    prediction = dot(x_sample, weights) + intercept
    \delta_{y,c} = 1 if (y == c) else 0
    sw = sample_weight
"""

and

"""
The gradient of the multinomial logistic loss is equal to:
grad_c = sw * (p[c] - \delta_{y,c})

where:
    p[c] = exp(prediction[c] - logsumexp(prediction))
    prediction = dot(x_sample, weights) + intercept
    \delta_{y,c} = 1 if (y == c) else 0
    sw = sample_weight

Note that to obtain the true gradient, this value has to be multiplied
by the sample vector x.
"""

Feel free to suggest a clearer formulation, I don't like this one so much
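
Note: for readers who want to sanity-check these formulas, here is a minimal NumPy/SciPy sketch (not part of the PR; the helper name is hypothetical) that compares the docstring gradient against a finite-difference approximation of the loss:

    import numpy as np
    from scipy.special import logsumexp

    def multinomial_loss_grad(prediction, y, sample_weight=1.0):
        # loss   = sw * (logsumexp(prediction) - prediction[y])
        # grad_c = sw * (p[c] - delta_{y,c}), with p = softmax(prediction)
        n_classes = prediction.shape[0]
        lse = logsumexp(prediction)
        loss = sample_weight * (lse - prediction[int(y)])
        p = np.exp(prediction - lse)            # softmax probabilities
        delta = np.eye(n_classes)[int(y)]       # one-hot indicator delta_{y,c}
        grad = sample_weight * (p - delta)      # still to be multiplied by x
        return loss, grad

    rng = np.random.RandomState(0)
    pred, y, eps = rng.randn(4), 2, 1e-6
    loss, grad = multinomial_loss_grad(pred, y)
    num_grad = np.array([
        (multinomial_loss_grad(pred + eps * np.eye(4)[c], y)[0] - loss) / eps
        for c in range(4)])
    assert np.allclose(grad, num_grad, atol=1e-4)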

@MechCoder (Member), Nov 16, 2015:

Is it worth mentioning that this is "the gradient of the multinomial logistic loss for every sample"? (Though that is kind of obvious here.)

@ogrisel (Member) commented Sep 11, 2015

Out of curiosity what are the relative fit times of:

  • sag OVR
  • sag multinomial

with both n_jobs=1 and n_jobs=4 for instance on MNIST or RCV1?

@amueller (Member) commented Sep 11, 2015

awesome :)

@ogrisel (Member) commented Sep 11, 2015

awesome :)

Indeed that's awesome.

@TomDLT (Author) commented Sep 11, 2015

Out of curiosity what are the relative fit times of:
sag OVR
sag multinomial
with both n_jobs=1 and n_jobs=4 for instance on MNIST or RCV1?

RCV1 dataset, simplified to 4 classes, as in previous benchmark:

n_jobs : 1, max_iter :   1, ovr:  0.5536 sec
n_jobs : 1, max_iter :   1, mul:  0.3770 sec
n_jobs : 1, max_iter :  10, ovr:  5.0621 sec
n_jobs : 1, max_iter :  10, mul:  3.3201 sec
n_jobs : 1, max_iter : 100, ovr: 50.4057 sec
n_jobs : 1, max_iter : 100, mul: 33.0481 sec

n_jobs : 4, max_iter :   1, ovr:  0.2245 sec
n_jobs : 4, max_iter :   1, mul:  0.4381 sec
n_jobs : 4, max_iter :  10, ovr:  1.9564 sec
n_jobs : 4, max_iter :  10, mul:  3.4660 sec
n_jobs : 4, max_iter : 100, ovr: 18.1681 sec
n_jobs : 4, max_iter : 100, mul: 33.0120 sec

Yet it converges to different solutions:

ovr: train_score: 0.9857
mul: train_score: 0.9882
ovr: test_score: 0.9781
mul: test_score: 0.9786
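
Note: a hypothetical sketch of how such a timing comparison could be run. The choice of the first four RCV1 categories and the single-label filtering are assumptions here, not necessarily the same selection as in the benchmark gist:

    import time
    from sklearn.datasets import fetch_rcv1
    from sklearn.linear_model import LogisticRegression

    rcv1 = fetch_rcv1()
    # keep samples that belong to exactly one of the first four categories
    Y4 = rcv1.target[:, :4].toarray()
    single = Y4.sum(axis=1) == 1
    X, y = rcv1.data[single], Y4[single].argmax(axis=1)

    for n_jobs in (1, 4):
        for max_iter in (1, 10, 100):
            for multi_class in ("ovr", "multinomial"):
                clf = LogisticRegression(solver="sag", multi_class=multi_class,
                                         max_iter=max_iter, n_jobs=n_jobs)
                t0 = time.time()
                clf.fit(X, y)
                print("n_jobs=%d max_iter=%3d %s: %.4f sec"
                      % (n_jobs, max_iter, multi_class, time.time() - t0))
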
@mblondel (Member) commented Sep 12, 2015

In the same spirit, I could probably add a multinomial LogisticRegression in SGD (previously proposed in #849). Do we want it?

I would say no. Sure, SGD can do partial_fit and SAG can't, but this would add a lot of code indirection for little benefit.

@TomDLT TomDLT force-pushed the TomDLT:sag_multi branch from 515b613 to 0e1264a Sep 14, 2015

@TomDLT (Author) commented Sep 16, 2015

awesome :)

Thanks!
Feel free to review :)

logistic regression model [3]_, which means that its probability estimates
should be better calibrated than the default "one-vs-rest" setting. The
"lbfgs", "sag" and "newton-cg" solvers cannot optimize L1-penalized models,
though, so the "multinomial" setting does not learn sparse models.

@agramfort (Member), Sep 16, 2015:

though, so the "multinomial" setting does not learn sparse models.

->

therefore the "multinomial" setting does not learn sparse models.

warm-starting. For the multiclass case, if the `multi_class` option is set to
"ovr", an optimal C is obtained for each class, and if the `multi_class` option
is set to "multinomial", an optimal C is obtained that minimizes the cross-
entropy loss.

@agramfort (Member), Sep 16, 2015:

an optimal C is obtained that minimizes the cross-entropy loss.
->

an optimal C is obtained by minimizing the cross-entropy loss.

@agramfort (Member) commented Sep 16, 2015

That's it for me. Looks pretty clean! Good job.

@TomDLT TomDLT force-pushed the TomDLT:sag_multi branch from 0e1264a to 7a14e47 Sep 16, 2015

grad_2 = grad_2[:, :-1].T

assert_array_almost_equal(grad_1, grad_2)
assert_almost_equal(loss_1, loss_2)

@ogrisel (Member), Sep 23, 2015:

Could you please add a test that checks the output of _multinomial_loss_grad or _multinomial_grad_loss_all_samples on a small dataset where it is possible to analytically compute the ground truth from the formula mentioned in the docstrings (e.g. with 2 samples in 2D), and check against the expected value?

@TomDLT (Author), Sep 24, 2015:

I am not sure I understand what you mean.
Do you want to duplicate _multinomial_loss_grad in the test?
Or compare its output with a hard-coded result?
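
Note: one possible reading of the suggestion, sketched here hypothetically: pick a tiny problem whose answer is known in closed form (all-zero weights make the softmax uniform, and regularization is ignored) and compare the hand-computed loss and gradient against the Cython helpers:

    import numpy as np

    # With zero weights every prediction is 0 and the softmax is uniform, so
    #   loss   = sum_i sw_i * log(n_classes)
    #   grad_c = sum_i sw_i * (1 / n_classes - Y[i, c]) * X[i]
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    Y = np.array([[1.0, 0.0, 0.0],     # one-hot labels, 3 classes
                  [0.0, 1.0, 0.0]])
    sample_weight = np.ones(2)
    n_samples, n_classes = Y.shape

    expected_loss = n_samples * np.log(n_classes)
    expected_grad = np.dot((1.0 / n_classes - Y).T, X)  # (n_classes, n_features)

    # ... then assert that _multinomial_loss_grad (resp.
    # _multinomial_grad_loss_all_samples) returns these values for w = 0.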

@TomDLT TomDLT changed the title [MRG] add multinomial SAG solver for LogisticRegression [MRG+1] add multinomial SAG solver for LogisticRegression Oct 9, 2015

@TomDLT TomDLT force-pushed the TomDLT:sag_multi branch 3 times, most recently from 1c79743 to 458c083 Nov 10, 2015

@MechCoder (Member) commented Nov 16, 2015

I can review this in the coming week!

@TomDLT (Author) commented Nov 17, 2015

Great!

@@ -186,6 +187,8 @@ def sag_solver(X, y, sample_weight=None, loss='log', alpha=1.,
# As in SGD, the alpha is scaled by n_samples.
alpha_scaled = float(alpha) / n_samples

n_classes = int(y.max()) + 1 if loss == 'multinomial' else 1

@MechCoder (Member), Nov 17, 2015:

Is it okay to assume y is label encoded?
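
Note: for context, the line above assumes y already contains integer labels 0 .. n_classes - 1. A minimal sketch of that encoding step (LabelEncoder here is illustrative of what the calling code is expected to do, not a quote from this PR):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    y_raw = np.array(["spam", "ham", "spam", "eggs"])
    y = LabelEncoder().fit_transform(y_raw)   # array([2, 1, 2, 0])
    n_classes = int(y.max()) + 1              # 3, == len(np.unique(y))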

Loss function that will be optimized.
'log' is used for classification, like in LogisticRegression.
'squared' is used for regression, like in Ridge.
loss : 'log' | 'squared' | 'multinomial'

@MechCoder (Member), Nov 17, 2015:

Can you let me know how this auto step size is set?

Maybe add a reference below.
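
Note: the automatic step size is the inverse of a bound on the Lipschitz constant of an individual gradient term. A hedged sketch of the idea follows; it is not a verbatim copy of get_auto_step_size:

    from sklearn.utils.extmath import row_norms

    def auto_step_size(X, alpha, loss="log", fit_intercept=True):
        """Inverse Lipschitz bound for one (regularized) loss term."""
        max_squared_sum = row_norms(X, squared=True).max()
        alpha_scaled = alpha / X.shape[0]
        if loss == "log":
            # the second derivative of the logistic loss is bounded by 1/4
            lipschitz = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha_scaled
        elif loss == "squared":
            lipschitz = max_squared_sum + int(fit_intercept) + alpha_scaled
        else:
            raise ValueError("loss %r not handled in this sketch" % loss)
        return 1.0 / lipschitz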

@MechCoder (Member), Nov 17, 2015:

Minor: Should we move max_squared_sum to utils at some point?

@TomDLT (Author), Nov 18, 2015:

Did you check get_auto_step_size's docstring?

About max_squared_sum, I don't know in which case it would be useful, but we can definitely move it to utils, maybe with a wrapper function to check the input.

@MechCoder (Member), Nov 19, 2015:

I at least use row_norms with squared=True frequently. Speaking of which, let us take a step back.

    from sklearn.utils.extmath import row_norms

    def also_get_max_squared_sum(X):
        return row_norms(X, squared=True).max()

    X = np.random.rand(1000, 10000)
    %timeit get_max_squared_sum(X)
    # 100 loops, best of 3: 10.8 ms per loop
    %timeit also_get_max_squared_sum(X)
    # 100 loops, best of 3: 5.36 ms per loop

    X = np.random.rand(10000, 1000)
    %timeit get_max_squared_sum(X)
    # 100 loops, best of 3: 10.7 ms per loop
    %timeit also_get_max_squared_sum(X)
    # 100 loops, best of 3: 5.33 ms per loop

This should not matter much since it seems like it is done only once, but I feel it is unnecessary code duplication with almost no gain. (Unless I am overlooking something, as usual.)

@TomDLT (Author), Nov 20, 2015:

I didn't know about row_norms, I will remove get_max_squared_sum.

# the data pointer for X, the training set
cdef double *x_data_ptr
cdef double *x_data_ptr = NULL

@MechCoder (Member), Nov 17, 2015:

Inserting spaces after every cdef might be good. It's kind of difficult to read now with the comments in between

# the index pointer for the column of the data
cdef int *x_ind_ptr
# the label for the sample
cdef int *x_ind_ptr = NULL

@MechCoder (Member), Nov 17, 2015:

nit: Then x_col_ptr might be a better name?

@TomDLT (Author), Nov 18, 2015:

I used the same name as in seq_dataset, which follows the usual naming for the scipy sparse CSR structure.

@MechCoder (Member), Nov 19, 2015:

great, makes sense

raise ValueError("Invalid loss parameter: got %s instead of "
"one of ('log', 'squared', 'multinomial')"
% loss_function)

with nogil:
start_time = time(NULL)
while True:

@MechCoder (Member), Nov 17, 2015:

Minor: Do you think it would read better to do for n_iter in range(max_iter) here and remove the infinite while loop?

with nogil:
start_time = time(NULL)
while True:
if infinity:
break
for itr in range(n_samples):

@MechCoder (Member), Nov 17, 2015:

itr -> sample_ind?

@MechCoder (Member), Nov 17, 2015:

Oh sorry, I see that this is not the sample_ind, but still might be better to name it as sample_itr

feature_hist[idx] = itr
# check to see that the weight is not inf or NaN
if not skl_isfinite(weights[idx]):
infinity = True

@MechCoder (Member), Nov 17, 2015:

Would it be better to raise the ValueError here and do away with the infinity variable, instead of breaking multiple times and checking at the end?

There are only two times where the value of infinity changes.

# helper variable for indexes
cdef int idx
cdef int idx, f, c, j

@MechCoder (Member), Nov 17, 2015:

f -> feature_ind, c -> class_ind

The multinomial logistic loss for one sample is:
loss = - sw \sum_c \delta_{y,c} (prediction[c] - logsumexp(prediction))
= sw (logsumexp(prediction) - prediction[y])

@MechCoder (Member), Dec 2, 2015:

Did you mean sw * (logsum(prediction) - prediction[c] * y)? Also I think we can do away with the previous \delta_{y, c} definition then.

@MechCoder (Member), Dec 2, 2015:

Sorry, I got confused. We can still remove the previous line though.

@TomDLT (Author), Dec 3, 2015:

I find it more explicit to reuse the formula of the reference and to simplify only in the second line.
wdyt?


if 'intercept_sum_gradient' in warm_start_mem.keys():
intercept_sum_gradient_init = warm_start_mem['intercept_sum_gradient']
intercept_sum_gradient = warm_start_mem['intercept_sum_gradient']

@MechCoder (Member), Dec 2, 2015:

Can you document warm_start_mem? Right now I don't know what it contains without looking into the code.

@TomDLT (Author), Dec 3, 2015:

Done

double sample_weight, double* gradient_ptr) nogil:
"""Multinomial Logistic regression gradient of the loss.
The gradient of the multinomial logistic loss for one sample is:

@MechCoder (Member), Dec 2, 2015:

with respect to a class c

@MechCoder (Member), Dec 2, 2015:

Out of curiosity what does _dloss stand for?

@TomDLT (Author), Dec 3, 2015:

loss is the loss, dloss is the derivative of the loss, as used in sgd_fast.pyx.

# y is the index of the correct class of the current sample.
gradient_ptr[int(y)] -= 1.0

for class_ind in range(n_classes):

@MechCoder (Member), Dec 2, 2015:

Nit: You can do gradient_ptr[int(y)] = -1 initially and do this loop along with the first, and prevent iterating twice across n_classes, right?

@TomDLT (Author), Dec 3, 2015:

Done

for j in range(xnnz):
feature_ind = x_ind_ptr[j]
innerprod += (w_data_ptr[feature_ind * n_classes + class_ind] *
x_data_ptr[j])

@MechCoder (Member), Dec 2, 2015:

Will this be slower due to random memory access patterns? Note that w is C-contiguous across class_ind, so it should help if the inner loop runs across n_classes. That way you also avoid repeatedly accessing feature_ind, namely:

    for class_ind in range(n_classes):
        prediction[class_ind] = 0.0

    for j in range(nnz):
        feature_ind = ...
        x_data = ...
        w_start_ind = feature_ind * n_classes
        for class_ind in range(n_classes):
            prediction[class_ind] += w_data_ptr[w_start_ind + class_ind] * x_data

This should not matter much for sparse data but still...

@TomDLT (Author), Dec 3, 2015:

I just did some benchmarks on dense and sparse data, with multinomial and binary classification, but the gain is small at best (-3% runtime), and in some cases it is a slowdown (up to +11%):

sparse, (162529, 47236), binary: +0%
sparse, (162529, 47236), multinomial: +3%
dense, (26851, 47236), binary: +11%
dense, (26851, 47236), multinomial, 4 classes: -3%

I understand why it should be a bit faster, but I don't know why it isn't in practice.

Here is my code, if I missed something:

cdef int feature_ind, class_ind, w_start_ind, j
cdef double x_data

for class_ind in range(n_classes):
    prediction[class_ind] = 0.0

# Compute the dot product only on non-zero elements of x
for j in range(xnnz):
    feature_ind = x_ind_ptr[j]
    x_data = x_data_ptr[j]
    w_start_ind = feature_ind * n_classes
    for class_ind in range(n_classes):
        prediction[class_ind] += w_data_ptr[w_start_ind + class_ind] * x_data

for class_ind in range(n_classes):
    prediction[class_ind] *= wscale
    prediction[class_ind] += intercept[class_ind]

@MechCoder (Member), Dec 3, 2015:

Thank you for the benchmarks. For sparse data, I did not expect a huge gain because of the random memory access pattern of w_data_ptr.

Since we are losing slightly on dense data for the multiclass case, as a last-ditch effort, would it be useful to avoid the final loop over the classes by doing something like,

for class_ind in range(n_classes):
    prediction[class_ind] ....
    if j == xnnz - 1:
         prediction[class_ind] *= wscale

Not sure if it will help or not. Btw, it seems you can do prediction[class_ind] = intercept[class_ind] when you iterate across n_classes initially.

@TomDLT (Author), Dec 4, 2015:

The big slowdown (+11%) is in the binary case, so I don't think your suggestion would change much.

@MechCoder (Member), Dec 4, 2015:

So the +11% represents a loss? I thought it was a gain. Then it might be better to revert it. It might be worth incrementing w_data_ptr instead of repeatedly indexing into it.

@TomDLT (Author), Dec 4, 2015:

Reverted and squashed

@MechCoder (Member) commented Dec 2, 2015

I'm just curious about how the INTERCEPT_DECAY value for sparse data is set to 0.01. Also the doc for multi_class should be updated. See #5897.

LGTM when the minor queries are addressed.

Plot decision surface of multinomial and One-vs-Rest Logistic Regression.
The hyperplanes corresponding to the three One-vs-Rest (OVR) classifiers
are represented by the dashed lines.

@MechCoder (Member), Dec 2, 2015:

Maybe you should combine this with the existing plot_iris_logistic.py.

@TomDLT (Author), Dec 3, 2015:

I fear the multinomial example is not very visual with the iris dataset:

[Plots: OvR and multinomial decision surfaces on the iris dataset]

training score: 0.807 (ovr)
training score: 0.833 (multinomial)

@MechCoder (Member), Dec 3, 2015:

I did not see that this was make_blobs, sorry.

@TomDLT TomDLT force-pushed the TomDLT:sag_multi branch from 49119c8 to bcd3f3e Dec 3, 2015

@TomDLT (Author) commented Dec 3, 2015

I'm just curious about how the INTERCEPT_DECAY value for sparse_data is set to 0.01.

The intercept decay was already used in SGD, and comes from Léon Bottou:
"The learning rate for the bias is multiplied by 0.01 because this frequently improves the condition number."

Also the doc for multi_class should be updated. See #5897

Done.

@MechCoder (Member) commented Dec 3, 2015

LGTM apart from that one inline comment.

@TomDLT TomDLT force-pushed the TomDLT:sag_multi branch from bcd3f3e to 9f136ff Dec 4, 2015

@MechCoder (Member) commented Dec 4, 2015

Will merge pending Travis!

@TomDLT (Author) commented Dec 4, 2015

Great!
Thanks again @MechCoder for the meticulous review!

MechCoder added a commit that referenced this pull request Dec 4, 2015

Merge pull request #5251 from TomDLT/sag_multi
[MRG+1] add multinomial SAG solver for LogisticRegression

@MechCoder MechCoder merged commit 43e5454 into scikit-learn:master Dec 4, 2015

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@MechCoder (Member) commented Dec 4, 2015

Merged!

@MechCoder (Member) commented Dec 4, 2015

It should be straightforward to support partial_fit now. Do you want me to do that?

@agramfort (Member) commented Dec 4, 2015

@MechCoder (Member) commented Dec 7, 2015

We rewrite it for every partial fit? I.e., just keep the coefficients and intercept.

It should still converge faster than normal SGD for every partial fit, no?

@agramfort (Member) commented Dec 7, 2015

@amueller (Member) commented Dec 10, 2015

Btw, was the reset of w_scale in the paper / reference implementation or is that a new idea for this implementation?

@amueller (Member) commented Dec 10, 2015

can we get a "versionadded" entry in LogisticRegression? Or is there one and I didn't see it?

@amueller (Member) commented Dec 10, 2015

btw, should there be an example showing how this is calibrated and OVR is not?

@TomDLT (Author) commented Dec 11, 2015

It should be straightforward to support partial_fit now.

I don't think it makes sense for SAG, or at least it is not straightforward.

Btw, was the reset of w_scale in the paper / reference implementation or is that a new idea for this implementation?

I don't remember seeing this in the paper, but it makes sense numerically, and it is present in Mark Schmidt's C code (http://www.cs.ubc.ca/%7Eschmidtm/Software/SAG.html).

can we get a "versionadded" entry in LogisticRegression? Or is there one and I didn't see it?

Will do.

btw, should there be an example showing how this is calibrated and OVR is not?

What do you mean?

@MechCoder (Member) commented Dec 11, 2015

Yes it does not. I was mistaken.

@amueller (Member) commented Dec 11, 2015

If this is inspired by Mark Schmidt's code, we should probably reference that (if we don't already).

For calibration:
I think it would be interesting to compare OVR and multinomial in terms of calibration, see http://scikit-learn.org/dev/modules/calibration.html
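
Note: a hypothetical sketch of such a comparison, using calibration_curve on a synthetic three-class problem. The dataset, bin count, and summary metric are assumptions here, not part of this PR:

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_blobs(n_samples=5000, centers=3, cluster_std=3.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for multi_class in ("ovr", "multinomial"):
        clf = LogisticRegression(solver="sag", multi_class=multi_class,
                                 max_iter=200).fit(X_train, y_train)
        proba = clf.predict_proba(X_test)
        for c in range(3):
            prob_true, prob_pred = calibration_curve((y_test == c).astype(int),
                                                     proba[:, c], n_bins=10)
            # a well-calibrated model has prob_true close to prob_pred
            print(multi_class, c, np.abs(prob_true - prob_pred).mean())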

@TomDLT TomDLT deleted the TomDLT:sag_multi branch Dec 20, 2016
