[MRG+2] Multinomial Logistic Regression #3490
Conversation
@larsmans I have done a few benchmarks, timing the newsgroup dataset. It takes around 195 seconds, of which around 80 seconds are spent in the _loss_grad function. A huge fraction of those 80 seconds (64.8%) is spent in these two operations.
I tried splitting the loss function into functions that separately calculate the loss and the gradient, but that slows it down even more, probably because p gets calculated repeatedly. Do you have any tips on how to proceed? cc: @agramfort
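(Editor's note: not the PR's actual code, but a minimal sketch of why a fused loss-and-gradient callback, e.g. for scipy.optimize.fmin_l_bfgs_b, avoids recomputing p; the names multinomial_loss_grad, Y as a dense label-indicator matrix, and alpha are illustrative.)

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.utils.extmath import safe_sparse_dot, squared_norm


def multinomial_loss_grad(w, X, Y, alpha):
    """Return the multinomial loss and its gradient in one pass.

    Computing both together lets the softmax probabilities ``p`` be
    reused, which is the expensive part when X is large.
    """
    n_classes = Y.shape[1]
    W = w.reshape(n_classes, -1)                     # (n_classes, n_features)
    scores = safe_sparse_dot(X, W.T)                 # (n_samples, n_classes)
    log_p = scores - logsumexp(scores, axis=1)[:, np.newaxis]
    p = np.exp(log_p)
    loss = -np.sum(Y * log_p) + 0.5 * alpha * squared_norm(W)
    diff = p - Y
    grad = safe_sparse_dot(diff.T, X) + alpha * W    # (n_classes, n_features)
    return loss, np.asarray(grad).ravel()
```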
I think I actually already optimized those two lines of code. Similarly, you could try XT = X.tocsc().T and grad = safe_sparse_dot(XT, diff).T, but I'm not convinced that would actually be faster. Ping @ogrisel, who knows a lot about sparse matmul as well.
def _sqnorm(x):
    x = x.ravel()
    return np.dot(x, x)
We have this in sklearn.utils.extmath.squared_norm now.
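(Editor's note: a tiny illustrative check, not from the PR, of how the shared utility relates to the local helper above.)

```python
import numpy as np
from sklearn.utils.extmath import squared_norm

w = np.arange(6.0).reshape(2, 3)
# squared_norm ravels its argument, so it matches the _sqnorm helper above
assert squared_norm(w) == np.dot(w.ravel(), w.ravel())
```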
Hold on: scipy.sparse doesn't care very much about contiguous arrays, so maybe making …
There was a difference in the tol and max_iter parameters between fmin_l_bfgs_b and the LogisticRegression solver, which I have fixed (it had created the impression of it being much slower). I played around with a few datasets, and it seems that though MultinomialLR is slower in sparse settings, it is much better than the LogisticRegression solver, especially for multi-class problems.
Also, I tried this example, http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_weighted_samples.html, with multinomial and logistic under the same settings, but I get different lines. I suppose this is not expected (the dashed line is multinomial).
I don't know if this is OK; the buildbots don't seem to like the code... I didn't test the sample weights very well. Re: contiguity, that doesn't actually involve NumPy in the sparse case: scipy.sparse has its own matmul routines. These accept (AFAIK) arbitrarily strided NumPy arrays, but even then they may be slow if the strides are too big. As with NumPy, and in fact BLAS, performance is best when data are packed tightly and presented in the order in which they are processed, because that minimizes the number of cache misses. (Also, if you inspect the hairy beast that is …
Thanks for the info. Yes, I think sample weights are broken; I will fix it.
For csr_matvecs, it will copy if the data isn't Fortran-contiguous, as far as I can see.
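(Editor's note: a minimal sketch of the kind of layout experiment being discussed; the shapes and the explicit copy are illustrative, and, as noted above, only timing can tell whether it actually helps.)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import safe_sparse_dot

rng = np.random.RandomState(0)
X = sp.random(10000, 500, density=0.01, format='csr', random_state=rng)
diff = rng.randn(10000, 20)

XT = X.T.tocsr()
# Hand the sparse matmul a Fortran-contiguous dense operand explicitly,
# instead of leaving any copy to scipy.sparse itself.
grad = np.asarray(safe_sparse_dot(XT, np.asfortranarray(diff))).T   # (20, 500)
```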
I tested the scores on the vectorized 20 newsgroups dataset; I get a score of 0.81704726500265534 for Logistic Regression (OvA) and 0.79660116834838024 for the multinomial model, setting C=10000.
What value of C? Did you cross-validate? You should report the results for a grid of C.
Yes, that was the best C that I got by using the Logistic Regression CV model.
I shall do that; I'll plot scores on the y-axis and C on the x-axis in a bit.
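(Editor's note: a sketch of such a comparison using the multi_class option this PR eventually grew into; the C grid, CV folds, and max_iter are arbitrary illustrative choices.)

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups_vectorized(subset='train')
X, y = data.data, data.target

# Report scores over a grid of C for both the OvR and multinomial models.
for C in np.logspace(-2, 4, 7):
    for multi_class in ('ovr', 'multinomial'):
        clf = LogisticRegression(C=C, solver='lbfgs',
                                 multi_class=multi_class, max_iter=200)
        score = cross_val_score(clf, X, y, cv=3).mean()
        print("C=%g  %-11s  accuracy=%.3f" % (C, multi_class, score))
```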
I've added support for class weights. Tests pass now.
Nice!
grad *= C
grad += w
if fit_intercept:
    grad = np.hstack([grad, diff.sum(axis=0).reshape(-1, 1)])
@larsmans I suppose this should be
grad = np.hstack([grad, C * diff.sum(axis=0).reshape(-1, 1)])
or, even better, we could just do alpha = 1. / C and multiply it with the penalty term to make it less confusing.
i.e.
grad += alpha * w
and
loss = ... + 0.5 * alpha * squared_norm(w)
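(Editor's note: an illustrative sketch, not the PR's code, of the alpha = 1 / C form being suggested here: the penalty applies to the weights only, and the intercept part of the gradient is left untouched. The function and argument names are hypothetical.)

```python
import numpy as np
from sklearn.utils.extmath import squared_norm


def penalize(data_loss, grad_w, grad_intercept, w, C):
    """Add the L2 term in the alpha = 1 / C parameterization.

    data_loss      : unpenalized data loss, summed over samples
    grad_w         : gradient of the data loss w.r.t. the weights
    grad_intercept : gradient of the data loss w.r.t. the intercept
    """
    alpha = 1.0 / C
    loss = data_loss + 0.5 * alpha * squared_norm(w)
    grad_w = grad_w + alpha * w
    # the intercept is not penalized, so no alpha term is added to its gradient
    return loss, np.hstack([grad_w.ravel(), grad_intercept.ravel()])
```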
I specifically chose not to penalize the intercept, since I don't typically center my data. My reasoning was that a flat estimator would still learn the class distribution, not just be all zeros.
I guess I've got the math wrong.
I thought we were not multiplying C with the grad term corresponding to the intercept, and it's not the penalty term corresponding to the intercept.
Does multiplying the grad term by C count as penalising the intercept too?
I'm not sure I get you. C is multiplied into grad a few lines up, before the intercept is stacked in.
I was just asking if the total loss was
1. C * (entire loss) + penalty term (without the intercept), or
2. C * (loss without the intercept) + (loss due to intercept) + penalty term (without the intercept).
I was thinking it was 1, since the loss is -C * (sample_weight * Y * p) and p includes the intercept here. If it were 1, then when I take the derivative w.r.t. the intercept, the C would remain, right? Sorry if my questions are dumb.
That's why I was suggesting it would be better if we wrote it this way:
Loss = total loss + alpha * penalty (without intercept)
It would be a bit clearer, since alpha = 1. / C.
@larsmans Does this make sense or am I crazy?
No, you're making perfect sense. Using alpha is more in line with most of the LR literature and makes it easier to check the derivations, as I just noticed.
I ran a few benchmarks for the multinomial vs logistic regression code.
How many classes did you use?
@agramfort I used all 20 classes of the 20 newsgroups data.
@larsmans @agramfort I could not make the newsgroup example any faster by making memory-related changes. What do you think would be a good argument to take this PR ahead?
Parameters
----------
X : array-like, shape = [n_samples, n_features]
shape (n_samples, n_features)
What we really miss is an example that uses MLR. Also, I am not sure about the name MultinomialLR... or even about the need for an extra class. What would it take to have this as an option in LogisticRegression (inspired by LinearSVC)? Thoughts?
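(Editor's note: that option is the shape the code ended up taking, as the multi_class docstring later in this thread shows; a minimal usage sketch, with the dataset chosen purely for illustration.)

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', C=1.0)
clf.fit(X, y)
print(clf.coef_.shape)   # (3, 4): one row of coefficients per class
```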
@jnothman Should I go ahead and replace it?
I agree. I think that we should merge this guy (I wanted to give it a …
@GaelVaroquaux Ok just a second, I will address your last comment.
@GaelVaroquaux Please merge if the last commit is ok with you.
@@ -308,6 +372,13 @@ def logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True,
    To lessen the effect of regularization on synthetic feature weight
    (and therefore on the intercept) intercept_scaling has to be increased.

    multi_class : str, optional default 'ovr'
Here, the usual standard is to write, instead of "str", "{'ovr', 'multinomial'}". I think I prefer the latter option.
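(Editor's note: for illustration only, the parameter description rewritten in that convention; the wording is a sketch, not the merged docstring.)

```python
    multi_class : {'ovr', 'multinomial'}, optional (default='ovr')
        If 'ovr', a one-vs-rest problem is fit for each label; if
        'multinomial', the multinomial loss is minimized over the whole
        probability distribution (lbfgs solver only).
```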
@GaelVaroquaux Are we good to go now?
+1 for merge on my side.
@GaelVaroquaux I suppose we can move cleaning up to another PR. So I can haz merge?
shall I rebase and merge?
Yeah. I am currently in a criss-cross lock-in of urgent matters (papers, …
@agramfort Sure. But is there a necessity? 13 commits always look better than a single one :p
rebase != squash :) I'll do it now.
merged by rebase. nice work @MechCoder!
Thanks everyone for the reviews.
Benchmarks for the haxby dataset.
When X is dense and across 9 classes (972, 577) (removing most of the zero features)
When X is sparse and across 9 classes (972, 163839)
Benchmarks for a synthetic dataset, using make_classification (50000, 2000)
For the arcene data (100 * 10000)