[MRG] Just in time SAGA. #40
Conversation
double* w,
int* indices,
double stepsize,
double* w_scale,
w_scale isn't used. Does it mean that you don't support elastic net?
It does, the scale is used through scale_cum. The w_scale is maintained in case some prox wants to overwrite it.
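For readers following along, the just-in-time scaling idea discussed here can be sketched as follows. This is a hypothetical helper (the names `scale_cum`, `last_seen`, and `lazy_scaled_update` are illustrative, not lightning's actual code): rather than rescaling the whole dense vector `w` at every iteration, a cumulative scale product is recorded and only the coordinates touched by the current sparse sample are caught up.

```python
import numpy as np

def lazy_scaled_update(w, scale_cum, t, last_seen, indices, grad, stepsize):
    """Sketch of just-in-time scaling (hypothetical, not lightning's code).

    scale_cum[t] holds the cumulative product of all multiplicative decay
    factors applied up to iteration t; last_seen[j] records the iteration
    at which coordinate j was last brought up to date.
    """
    for k, j in enumerate(indices):
        # Apply, in one multiplication, all the decays coordinate j missed.
        w[j] *= scale_cum[t] / scale_cum[last_seen[j]]
        last_seen[j] = t
        # Then take the usual gradient step on the active coordinate.
        w[j] -= stepsize * grad[k]
```

Untouched coordinates stay stale on purpose; they are only corrected the next time their index appears in a sample, which is what makes the per-iteration cost proportional to the sample's sparsity.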
Thanks for the clarification. I would remove w_scale from the function signature until we have an actual use case. Also, I don't see any test for elastic net so this means that scale_cum is not tested.
This might also be of interest to @adefazio @agramfort @TomDLT
thx how does it compare in terms of perf with sklearn SAG version?
@agramfort We have updated this gist with sklearn's
I have assumed that lightning does OVR for multi-class problems (@mblondel correct me if I misunderstood). The score discrepancy seems to be caused by the different stopping criteria. We have not (yet) rigorously compared the convergence speed (to the optimum) of SAG vs SAGA, the advantage of SAGA over SAG essentially being the possibility to specify an arbitrary proximity operator. In terms of code speed, adding
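To make the SAG/SAGA distinction concrete for readers of this thread: the extra step SAGA takes is a proximal operator applied after each variance-reduced gradient step. A minimal dense sketch (illustrative only, not lightning's Cython implementation) on a least-squares objective:

```python
import numpy as np

def saga_epoch(X, y, w, memory, stepsize, prox, rng):
    """One epoch of plain SAGA on 0.5 * (x_i . w - y_i)^2 per sample.

    memory[i] stores the last gradient seen for sample i; prox is an
    arbitrary proximal operator, which is the advantage SAGA has over
    SAG. Sketch only, assuming dense data.
    """
    n = X.shape[0]
    g_avg = memory.mean(axis=0)
    for _ in range(n):
        i = rng.randint(n)
        g_new = (X[i] @ w - y[i]) * X[i]           # per-sample gradient
        w = w - stepsize * (g_new - memory[i] + g_avg)
        w = prox(w, stepsize)                      # SAGA's extra prox step
        g_avg += (g_new - memory[i]) / n           # keep the average in sync
        memory[i] = g_new
    return w

def soft_threshold(w, t, lam=0.1):
    # Proximal operator of lam * ||w||_1 with step t (for an L1 penalty).
    return np.sign(w) * np.maximum(np.abs(w) - t * lam, 0.0)
```

With `prox` set to the identity this reduces to the unregularized SAGA update; plugging in `soft_threshold` gives L1-regularized estimates instead.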
ok for the benefit of SAGA. in terms of computation time what I read is that it's pretty much the same. thx
@agramfort Yes, that is if we don't use any prox in SAGA as benchmarked above. Otherwise the jit updates associated with the prox will slow the computation down.
The interest of SAGA (for me) is in the support for composite loss functions. For smooth problems they perform more or less the same in my experience.
Also, comparing them is tricky since it ends up depending on how you choose the step size. Once we have an adaptive step size (as described in the Schmidt paper) we could do more meaningful benchmarks.
Also IIRC sklearn's implementation of SAG uses a line search.
No, sklearn's implementation of SAG uses a constant step size computed using the maximum Lipschitz constant over all samples (cf. get_auto_step_size).
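The idea behind that constant step size can be sketched in a few lines. This is in the spirit of sklearn's `get_auto_step_size`, not its exact implementation: take the largest per-sample Lipschitz constant and step at its inverse (for log loss, the curvature of the logistic function is bounded by 1/4).

```python
import numpy as np

def auto_step_size(X, alpha, loss="log"):
    """Constant SAG-style step size from the maximum per-sample Lipschitz
    constant (sketch, assuming dense X and L2 regularization alpha).

    Squared loss: L_i = ||x_i||^2 + alpha
    Log loss:     L_i = 0.25 * ||x_i||^2 + alpha
    """
    max_sq_norm = np.max(np.sum(X ** 2, axis=1))
    if loss == "log":
        lipschitz = 0.25 * max_sq_norm + alpha
    else:  # squared loss
        lipschitz = max_sq_norm + alpha
    return 1.0 / lipschitz
```

This is cheap (one pass over the data) but conservative, which is part of why an adaptive scheme, as mentioned above, could make benchmarks more meaningful.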
One trick that might help you speed SAGA up is to remove the random sampling of data points. Instead, at the beginning of the algorithm make a copy of the data set with the rows permuted at random (i.e. shuffled). Then every odd epoch access the datapoints in the original dataset in order, and every even epoch access them from the shuffled copy of the dataset, also in order. This greatly reduces the number of TLB misses at the expense of requiring twice as much memory.
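The access pattern described above can be sketched as follows (hypothetical helper names; in lightning the permuted copy would live alongside the Cython solver, not in Python):

```python
import numpy as np

def make_access_pattern(X, seed=0):
    """Sketch of the trick above: keep the original data plus one
    row-shuffled copy, then sweep each of them sequentially on
    alternating epochs. Sequential reads reduce TLB misses versus
    uniform random sampling, at the cost of 2x memory for the copy."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(X.shape[0])
    X_shuffled = X[perm]  # one-off copy, paid once at startup

    def rows_for_epoch(epoch):
        # Odd epochs: original order; even epochs: the shuffled copy.
        return X if epoch % 2 == 1 else X_shuffled

    return rows_for_epoch, perm
```

Each epoch still visits every sample exactly once, and alternating between two fixed orders keeps enough randomness for the stochastic analysis to be plausible while the memory traffic stays sequential.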
Thanks for the information @adefazio, I did not know about it. @zermelozf can you take into account Mathieu's comments?
Interesting trick, but intuitively it should work only for dense data. There is also the cost of copying the data, which could take a few dozen seconds for very large data. So not copying the data actually gives a head start.
Force-pushed from 2ac44ae to f93fa0c
I just removed
Can you remove the w_scale from projection_lagged as Mathieu suggested?
Also, you need to add a test for elastic-net (where alpha != 0 AND beta != 0).
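The elastic-net case the review asks about exercises both penalty terms at once; its proximal operator is soft-thresholding (the L1 part) followed by shrinkage (the L2 part). A pure-numpy sketch of what such a test could assert (hypothetical helper, not lightning's test code, with `alpha` the L2 weight and `beta` the L1 weight as in this thread):

```python
import numpy as np

def prox_elastic_net(w, stepsize, alpha, beta):
    """Proximal operator of stepsize * (alpha/2 * ||w||^2 + beta * ||w||_1).

    Sketch only: soft-threshold at level stepsize * beta, then divide by
    (1 + stepsize * alpha). With alpha != 0 AND beta != 0 both terms are
    exercised, which is what an elastic-net test needs to cover.
    """
    st = np.sign(w) * np.maximum(np.abs(w) - stepsize * beta, 0.0)
    return st / (1.0 + stepsize * alpha)
```

A useful sanity check is that setting `beta = 0` recovers pure L2 shrinkage and `alpha = 0` recovers plain soft-thresholding, so a test with both nonzero genuinely covers the code path neither special case reaches.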
Force-pushed from b098612 to 8738319
Elastic net test added and
Sounds great. Thanks for the awesome contrib. Are we good to go? By the way, do you plan to add group lasso later?
Good for me. Just needs to add SAGAClassifier and SAGARegression into
I do have a prox for the group lasso penalty that I'm using. I will contribute it once this is merged if you think it's worth it (I do think it would be a nice addition).
+1
Force-pushed from 8738319 to 93a2f97
Alright, pressing the green button :)
If you've got time, a SAGA vs. SDCA vs. Adagrad comparison for elastic net would be nice. You can build an example based on this: FYI SDCA doesn't work well without L2 regularization (e.g. L1 regularization only).
Yes, I was planning to do that as suggested by @fabianp as well. Thanks for the link, it seems like you have done 99% of the work already. I'll have a look at it next week after I finish a couple of things.
Great work @zermelozf
A squashed version of #38 containing:
- a Penalty base class.