
[MRG+2] Multinomial Logistic Regression #3490

Closed
wants to merge 13 commits into scikit-learn:master from MechCoder:multinomial_logreg

Conversation

MechCoder
Member

  • Refactor the code so that multinomial is an option.
  • Test with more data (e.g. make_blobs).
  • Try speeding it up.

Benchmarks for the Haxby dataset.

When X is dense, across 9 classes, shape (972, 577) (after removing most of the zero features):
[benchmark plot: haxby_dense]

When X is sparse, across 9 classes, shape (972, 163839):
[benchmark plot: haxby_sparse_complete]

Benchmarks for a synthetic dataset, using make_classification, shape (50000, 2000):
[benchmark plot: synthetic_multinomial]

[benchmark plot: newsgroup_multinomial]

For the arcene data (100 x 10000):
[benchmark plot: arcene]

@MechCoder
Member Author

@larsmans I have done a few benchmarks and timed the newsgroups dataset.

It takes around 195 seconds, of which around 80 seconds are spent in the _loss_grad function.

A large fraction of these 80 seconds (64.8%) is spent in these two operations:

p = safe_sparse_dot(X, w.T)
grad = safe_sparse_dot(diff.T, X)

I tried splitting the loss function into separate functions that calculate the loss and the gradient, but that slows things down even more, probably because p gets calculated repeatedly. Do you have any tips on how to proceed?

cc: @agramfort
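A minimal sketch (hypothetical names; dense X and one-hot Y assumed) of the pattern under discussion: returning the loss and gradient together so that p is computed only once per iteration, which scipy's fmin_l_bfgs_b supports directly.

import numpy as np
from scipy.special import logsumexp
from scipy.optimize import fmin_l_bfgs_b

def loss_and_grad(w_flat, X, Y, alpha):
    # w_flat is the flattened (n_classes, n_features) coefficient matrix
    n_classes, n_features = Y.shape[1], X.shape[1]
    w = w_flat.reshape(n_classes, n_features)
    z = X.dot(w.T)                                    # (n_samples, n_classes)
    log_p = z - logsumexp(z, axis=1)[:, np.newaxis]   # log-softmax
    loss = -np.sum(Y * log_p) + 0.5 * alpha * np.dot(w_flat, w_flat)
    diff = np.exp(log_p) - Y                          # p is reused, not recomputed
    grad = diff.T.dot(X) + alpha * w
    return loss, grad.ravel()

# w_opt, f_opt, info = fmin_l_bfgs_b(loss_and_grad, w0, args=(X, Y, 1.0))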

@MechCoder changed the title from "Multinomial Logistic Regression" to "[WIP] Multinomial Logistic Regression" on Jul 29, 2014
@larsmans
Member

I think I actually already optimized those two lines of code. If X is CSR and w is C-ordered (so w.T is Fortran-ordered), safe_sparse_dot(X, w.T) takes the fastest path through scipy.sparse.

Similarly, while grad could be computed as safe_sparse_dot(X.T, diff).T, that would mean multiplying with a CSC matrix on the left, which is slow. One thing to try is

XT = X.tocsc().T
grad = safe_sparse_dot(XT, diff).T

but I'm not convinced that would actually be faster.

Ping @ogrisel, who knows a lot about sparse matmul as well.
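A rough timing sketch on synthetic data (sizes are arbitrary, not the newsgroups matrix) comparing the current dense-times-CSR product with the CSC-transpose variant suggested above:

import numpy as np
import scipy.sparse as sp
from time import time
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
X = sp.random(5000, 20000, density=0.01, format='csr', random_state=rng)
diff = rng.rand(5000, 20)

t0 = time()
grad_csr = safe_sparse_dot(diff.T, X)      # current code path
print("diff.T * X (CSR):   %.3f s" % (time() - t0))

t0 = time()
XT = X.tocsc().T                           # conversion cost deliberately included
grad_csc = safe_sparse_dot(XT, diff).T
print("X.tocsc().T * diff: %.3f s" % (time() - t0))

print("max abs difference:", np.abs(grad_csr - grad_csc).max())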


def _sqnorm(x):
    x = x.ravel()
    return np.dot(x, x)
Member

We have this in sklearn.utils.extmath.squared_norm now.
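For reference, a quick check (arbitrary array) that the existing helper matches _sqnorm above:

import numpy as np
from sklearn.utils.extmath import squared_norm

w = np.arange(6.0).reshape(2, 3)
print(squared_norm(w))                 # 55.0
print(np.dot(w.ravel(), w.ravel()))    # 55.0, same as _sqnorm(w)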

@larsmans
Member

Hold on: w is actually Fortran-ordered because that makes w[:, :-1] Fortran-ordered and contiguous, but as a result, w[:, :-1].T is C-ordered.

scipy.sparse doesn't care very much about contiguous arrays, so maybe making w C-ordered initially can shave off the 9% that it does in my IPython timings. OTOH, that might destroy performance on dense arrays...
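A small sketch (shapes made up) of the contiguity trade-off described above: slicing off the intercept column of a Fortran-ordered w stays contiguous, but its transpose is then C-ordered, while the C-ordered alternative loses contiguity as soon as the intercept column is sliced off.

import numpy as np

w = np.asfortranarray(np.ones((10, 101)))   # (n_classes, n_features + 1)
coef = w[:, :-1]                            # intercept column dropped
print(coef.flags['F_CONTIGUOUS'])           # True
print(coef.flags['C_CONTIGUOUS'])           # False
print(coef.T.flags['C_CONTIGUOUS'])         # True: the transpose is C-ordered

wc = np.ascontiguousarray(w)                # the C-ordered alternative
print(wc[:, :-1].flags['C_CONTIGUOUS'])     # False: slicing breaks C-contiguity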

@MechCoder
Member Author

@larsmans

There was a difference in the tol and max_iter parameters between fmin_l_bfgs_b and the LogisticRegression solver, which I have fixed (this had created the impression of it being much slower).

I played around with a few datasets, and it seems that although MultinomialLR is slower in sparse settings, it is much better than the LogisticRegression solver, especially for multi-class problems.
The first timing in each pair is for Logistic Regression (OvR), the second for the multinomial model.

For the newsgroups dataset:
11.2755260468   # Logistic Regression (OvR)
59.4959349632   # Multinomial

For synthetic datasets from make_classification:

n_samples=10000, n_features=1000, n_informative=500
5.37381219864   # Logistic Regression (OvR)
4.84601402283   # Multinomial

n_samples=10000, n_features=1000, n_informative=2
0.776676177979  # Logistic Regression (OvR)
1.7113292217    # Multinomial

n_samples=10000, n_features=1000, n_classes=4
19.1921730042   # Logistic Regression (OvR)
5.38009214401   # Multinomial

n_samples=10000, n_features=1000, n_classes=4, n_informative=2
6.80555915833   # Logistic Regression (OvR)
4.28371906281   # Multinomial
Thanks for the helpful comments on memory contiguity. I tried those changes but they do not seem to speed things up at all. However, I thought we need to worry about contiguity only for Cython-dependent operations, and that np.dot optimises this internally. Is that not so?

Also, I tried this example, http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_weighted_samples.html, for multinomial and logistic regression with the same settings, but I get different lines. I suppose this is not expected (the dashed line is multinomial)?
[plot: figure_1]

@MechCoder
Member Author

Please ignore the graph above; I made a mistake (I was thinking that the lbfgs solver is the default one). This is the new graph; the lines are slightly different from one another. Is that ok?
[plot: figure_1]

@larsmans
Member

I don't know if this is ok; the buildbots don't seem to like the code... I didn't test the sample weights very well.

Re: contiguity, that doesn't actually involve NumPy in the sparse case. scipy.sparse has its own matmul routines. These accept (AFAIK) arbitrarily strided NumPy arrays, but they may still be slow if the strides are too big. As with NumPy, and BLAS in fact, performance is best when data are packed tightly and presented in the order that they get processed, because that minimizes the number of cache misses.

(Also, if you inspect the hairy beast that is numpy/core/blasdot/_dotblas.c closely, you'll find that it sometimes copies matrix inputs to get them properly contiguous, and for vector dot products it may fall back to a very slow routine because BLAS won't handle negative strides the way NumPy wants.)

@MechCoder
Member Author

Thanks for the info. Yes, I think sample weights are broken; I will fix that.

@jnothman
Member

These accept (AFAIK) arbitrarily strided NumPy arrays, but then still they may be slow if the strides are too big.

For csr_matvecs, it will copy if the data isn't fortran contiguous, as far as I can see.

@MechCoder
Member Author

I tested the scores on the vectorized 20 newsgroups dataset.

I get a score of 0.81704726500265534 for Logistic Regression (OvA) and 0.79660116834838024 for the multinomial model, setting C=10000.

@agramfort
Member

What value of C? Did you cross-validate? You should report the results for a grid of C values.

@MechCoder
Member Author

Yes, that was the best C that I got by using the Logistic Regression CV model.

@MechCoder
Member Author

I shall do that; I'll plot scores on the y-axis and C on the x-axis in a bit.
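A hedged sketch of such a grid (synthetic data here rather than the newsgroups set, and the current sklearn.model_selection API; names like X_test are made up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=100, n_classes=4,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in np.logspace(-2, 4, 7):
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    clf.fit(X_train, y_train)
    print("C=%-10g  test score=%.3f" % (C, clf.score(X_test, y_test)))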

@MechCoder
Member Author

I've added support for class weights. Tests pass now.

@coveralls

Coverage increased (+0.0%) when pulling 0f27c84 on MechCoder:multinomial_logreg into 1b2833a on scikit-learn:master.

@larsmans
Member

Nice!

grad *= C
grad += w
if fit_intercept:
    grad = np.hstack([grad, diff.sum(axis=0).reshape(-1, 1)])
Member Author

@larsmans I suppose this should be

grad = np.hstack([grad, C * diff.sum(axis=0).reshape(-1, 1)])

Or even better, we could just use alpha = 1. / C and multiply the penalty term by it, to make it less confusing.

Member Author

i.e.

grad += alpha * w

and

loss = ... + 0.5 * alpha * squared_norm(w)

Member

I specifically chose not to penalize the intercept because I don't typically center my data. My reasoning was that a flat estimator would still learn the class distribution, not just be all zeros.

Member Author

I guess I've got the math wrong.
I thought we were not multiplying C with the gradient term corresponding to the intercept, and that it's not the penalty term corresponding to the intercept.

Does multiplying the gradient term by C count as penalising the intercept too?

Member

I'm not sure I get you. C is multiplied into grad a few lines up, before the intercept is stacked in.

Member Author

I was just asking whether the total loss is

  1. C * (entire loss) + penalty term (without the intercept), OR
  2. C * (loss without the intercept) + (loss due to the intercept) + penalty term (without the intercept)

I was thinking it was 1, since the loss is

 -C * (sample_weight * Y * p)

and p includes the intercept here.
If it were 1, then when I take the derivative w.r.t. the intercept, the C would remain, right? Sorry if my questions are dumb.

Member Author

That's why I was suggesting it would be better if we write it this way:

Loss = total loss + alpha * penalty (without the intercept)

It would be a bit clearer, since alpha = 1. / C.

Member Author

@larsmans Does this make sense or am I crazy?

Member

No, you're making perfect sense. Using alpha is more in line with most of the LR literature and makes it easier to check the derivations, as I just noticed.
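For reference, a short summary of the two formulations being discussed (Y one-hot, s_i the sample weights, p the softmax probabilities; the penalty never includes the intercept b):

L_C(w, b)      = -C \sum_i s_i \sum_k Y_{ik} \log p_{ik}(w, b) \;+\; \tfrac{1}{2} \lVert w \rVert_2^2
L_\alpha(w, b) = -\sum_i s_i \sum_k Y_{ik} \log p_{ik}(w, b) \;+\; \tfrac{\alpha}{2} \lVert w \rVert_2^2, \qquad \alpha = 1 / C

Since L_\alpha = L_C / C, the two have the same minimiser. In both forms the intercept gradient is just a multiple of \sum_i s_i (p_i - Y_i): with the C form that multiple is C (hence the suggested fix to the hstack above), with the alpha form it is 1. Neither form penalises the intercept.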

@MechCoder
Member Author

@larsmans @agramfort

I ran a few benchmarks for the multinomial vs logistic regression code.

[benchmark plot: first_bench_multinomial]

@agramfort
Member

How many classes did you use?


@MechCoder
Member Author

@agramfort I used all 20 classes of the 20 newsgroups data.

@coveralls

Coverage increased (+0.0%) when pulling 8606ebf on MechCoder:multinomial_logreg into 6c7f029 on scikit-learn:master.

@MechCoder
Member Author

@larsmans @agramfort I could not make the newsgroups example any faster by changing the memory layout. What do you think would be a good argument for taking this PR ahead?


Parameters
----------
X : array-like, shape = [n_samples, n_features]
Member

shape (n_samples, n_features)

@agramfort
Member

What we really miss is an example that uses MLR.

Also, I am not sure about the name MultinomialLR... or even about the need for an extra class.

What would it take to have

multi_class : string, 'ovr' or 'multinomial' (default='ovr')

in LogisticRegression (inspired by LinearSVC)?

Thoughts?
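A hedged sketch of what that API would look like in use (essentially what was eventually merged; the comment's shapes assume 3 classes and 20 features, and recent scikit-learn releases may treat multi_class differently):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

clf = LogisticRegression(solver='lbfgs', multi_class='multinomial')
clf.fit(X, y)
print(clf.coef_.shape)      # (3, 20): one coefficient row per class, fit jointly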

@MechCoder
Member Author

@jnothman Should I go ahead and replace it?

@jnothman
Member

@jnothman Should I go ahead and replace it?

As @vene says it is messy in master, and should probably be done separately from this PR.

@GaelVaroquaux
Member

@jnothman Should I go ahead and replace it?

As @vene says it is messy in master, and should probably be done separately from this PR.

I agree. I think that we should merge this guy (I wanted to give it a last look, but if it has two 👍...) and then you should work on a cleanup.

@MechCoder
Member Author

@GaelVaroquaux Ok just a second, I will address your last comment.

@MechCoder
Member Author

@GaelVaroquaux Please merge if the last commit is ok with you.

@coveralls

Coverage increased (+0.03%) when pulling b2133eb on MechCoder:multinomial_logreg into 6d8ccbc on scikit-learn:master.

@MechCoder changed the title from "[MRG+1] Multinomial Logistic Regression" to "[MRG+2] Multinomial Logistic Regression" on Aug 18, 2014
@coveralls

Coverage increased (+0.03%) when pulling bc0a9a0 on MechCoder:multinomial_logreg into 6d8ccbc on scikit-learn:master.

@coveralls

Coverage increased (+0.02%) when pulling 2f722bb on MechCoder:multinomial_logreg into 05aa804 on scikit-learn:master.

@@ -308,6 +372,13 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
To lessen the effect of regularization on synthetic feature weight
(and therefore on the intercept) intercept_scaling has to be increased.

multi_class : str, optional default 'ovr'
Member

Here, the usual standard is to write, instead of "str", "{'ovr', 'multinomial'}". I think I prefer the latter option.

@MechCoder
Member Author

@GaelVaroquaux Are we good to go now?

@coveralls

Coverage increased (+0.01%) when pulling 9171baf on MechCoder:multinomial_logreg into 05aa804 on scikit-learn:master.

@agramfort
Member

+1 for merge on my side.

@MechCoder
Member Author

@GaelVaroquaux I suppose we can move cleaning up to another PR. So I can haz merge?

@agramfort
Member

shall I rebase and merge?

@GaelVaroquaux
Member

shall I rebase and merge?

Yeah. I am currently in a criss-cross lock-in of urgent matters (papers, code reviews, grant submission). Completely unproductive...

@MechCoder
Member Author

@agramfort Sure. But is there a necessity? 13 commits always look better than a single one :p

@agramfort
Member

rebase != squash :)

I'll do it now.

@agramfort
Member

Merged by rebase.

Nice work @MechCoder!

@agramfort closed this on Aug 21, 2014
@MechCoder deleted the multinomial_logreg branch on August 21, 2014 at 15:39
@MechCoder
Member Author

Thanks everyone for the reviews.
