This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

[MRG] Just in time SAGA. #40

Merged

Conversation

zermelozf
Contributor

A squashed version of #38 containing:

  • SAGA algorithm in cython.
  • Basic python version of SAG and SAGA for testing.
  • Support for proximity operators through the Penalty base class.
  • L1 proximity operator with just in time update for sparse data.
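The just-in-time L1 update in the last bullet can be sketched in plain Python (a simplified illustration with hypothetical names, not the Cython code in this PR; it assumes a coordinate receives only prox steps between the iterations that touch it, whereas the real code also folds in the lagged average-gradient contribution):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * |x| (the L1 penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def jit_catch_up(w, last_update, now, j, stepsize, l1):
    """Apply the prox steps coordinate j missed since it was last touched.

    For a constant threshold t, k composed soft-thresholds equal a single
    soft-threshold with threshold k * t, so the catch-up costs O(1) instead
    of O(k) -- this is what makes lazy updates cheap on sparse data.
    """
    missed = now - last_update[j]
    if missed > 0:
        w[j] = soft_threshold(w[j], missed * stepsize * l1)
        last_update[j] = now
    return w[j]
```

Only the coordinates present in the current sparse sample need this catch-up before their gradient step, so a full pass never touches the dense weight vector.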

double* w,
int* indices,
double stepsize,
double* w_scale,
Member


w_scale isn't used. Does it mean that you don't support elastic net?

Member


It does, the scale is used through scale_cum. The w_scale is maintained in case some prox wants to overwrite it.
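For context, the scaling trick being discussed can be sketched as follows (a generic illustration of representing the weights as `scale * v` so the L2 shrinkage is O(1) per step; the names are mine, not the ones in the Cython file):

```python
import numpy as np

class ScaledVector:
    """Represent the weight vector as w = scale * v.

    The L2 shrinkage w <- (1 - stepsize * alpha) * w then only multiplies
    `scale`, and individual coordinates are rescaled lazily on access.
    """

    def __init__(self, n_features):
        self.v = np.zeros(n_features)
        self.scale = 1.0

    def l2_shrink(self, stepsize, alpha):
        # O(1) instead of O(n_features).
        self.scale *= (1.0 - stepsize * alpha)

    def add(self, j, value):
        # Adding to coordinate j must undo the implicit scaling.
        self.v[j] += value / self.scale

    def get(self, j):
        return self.scale * self.v[j]
```

A real implementation also renormalizes (folds `scale` back into `v`) when the scale becomes very small, to avoid numerical underflow.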

Member


Thanks for the clarification. I would remove w_scale from the function signature until we have an actual use case. Also, I don't see any test for elastic net so this means that scale_cum is not tested.

@fabianp
Member

fabianp commented Nov 17, 2015

This might also be of interest to @adefazio @agramfort @TomDLT

@agramfort
Member

thx

how does it compare in terms of perf with sklearn SAG version?

@zermelozf
Contributor Author

@agramfort We have updated this gist (https://gist.github.com/zermelozf/d4670d6cffea09f6e6f3) with sklearn's LogisticRegression.

| Multiclass (OVR) | time (sec.) | score |
| --- | --- | --- |
| SAGClassifier | 8.727 | 0.1676 |
| SAGAClassifier | 12.245 | 0.1757 |
| (SAG)LogisticRegression | 12.061 | 0.1750 |

I have assumed that lightning does OVR for multi class problems (@mblondel correct me if I misunderstood).

The score discrepancy seems to be caused by the different stopping criteria. We have not (yet) rigorously compared the convergence speed (to the optimum) of SAG vs SAGA; the advantage of SAGA over SAG is essentially the possibility to specify an arbitrary proximity operator.

In terms of runtime, setting tol=0 gives sklearn an advantage over 100 iterations on the whole 20newsgroups dataset.

| 100 iterations | time (sec.) | score |
| --- | --- | --- |
| SAGClassifier | 198 | 0.174872665535 |
| SAGAClassifier | 215 | 0.174872665535 |
| (SAG)LogisticRegression | 117 | 0.174872665535 |

@agramfort
Member

ok for the benefit of SAGA.

in terms of computation time what I read is that it's pretty much the same.

thx

@zermelozf
Contributor Author

@agramfort Yes, that is if we don't use any prox in SAGA as benchmarked above. Otherwise the jit updates associated with the prox will slow the computation down.

@fabianp
Member

fabianp commented Nov 18, 2015

The interest of SAGA (for me) is in the support for composite loss functions. For smooth problems they perform more or less the same in my experience.

@fabianp
Member

fabianp commented Nov 18, 2015

Also, comparing them is tricky since it ends up depending on how you choose the step size. Once we have adaptive step sizes (as described in the Schmidt paper) we could do more meaningful benchmarks.

@mblondel
Member

Also IIRC sklearn's implementation of SAG uses a line search.

@TomDLT
Member

TomDLT commented Nov 18, 2015

No, sklearn's implementation of SAG uses a constant step size computed from the maximum Lipschitz constant over all samples (cf. get_auto_step_size).
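The rule behind that step size can be sketched as follows (an illustration of the idea for the logistic loss, not sklearn's exact get_auto_step_size code):

```python
import numpy as np

def auto_step_size(X, alpha, fit_intercept=True):
    """Constant SAG step size from the worst per-sample Lipschitz constant.

    For the logistic loss, the gradient of sample i is
    (||x_i||^2 / 4 + alpha)-Lipschitz, so a safe constant step is the
    inverse of the largest such constant over the dataset.
    """
    max_squared_sum = np.max(np.sum(X ** 2, axis=1))
    L = 0.25 * (max_squared_sum + int(fit_intercept)) + alpha
    return 1.0 / L
```

The `int(fit_intercept)` term accounts for the implicit all-ones intercept feature.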

@adefazio

One trick that might help you speed SAGA up is to remove the random sampling of data points. Instead, at the beginning of the algorithm make a copy of the dataset with the rows permuted at random (i.e. shuffled). Then every odd epoch access the data points in the original dataset in order, and every even epoch access them from the shuffled copy, also in order. This greatly reduces the number of TLB misses at the expense of requiring twice as much memory.

This trick only works for SAGA or SDCA; if you use it with SAG it will diverge.
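The access pattern described above could be sketched like this (a hypothetical illustration; the shuffled copy of the data itself is made once up front, and each yielded order is a sequential pass over one of the two copies):

```python
import numpy as np

def epoch_orders(n_samples, seed=0):
    """Yield the sample order for each epoch.

    Epochs 0, 2, 4, ... walk the original data in order; epochs 1, 3, 5, ...
    walk the pre-shuffled copy in order. Both passes are sequential in
    memory, trading 2x memory for far fewer TLB misses than random sampling.
    """
    perm = np.random.RandomState(seed).permutation(n_samples)
    epoch = 0
    while True:
        if epoch % 2 == 0:
            yield np.arange(n_samples)  # original data, in order
        else:
            yield perm                  # shuffled copy, in order
        epoch += 1
```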

@fabianp
Member

fabianp commented Nov 19, 2015

Thanks for the information @adefazio, I did not know about it.

@zermelozf can you take into account Mathieu's comments?

@mblondel
Member

Interesting trick, but intuitively it should work only for dense data. There is also the cost of copying the data, which could take a few dozen seconds for very large datasets. So not copying the data actually gives a head start.

@zermelozf
Contributor Author

I just removed jj in the last squashed commit.

@fabianp
Member

fabianp commented Nov 19, 2015

Can you remove the w_scale from projection_lagged as Mathieu suggested?

@fabianp
Member

fabianp commented Nov 19, 2015

Also, you need to add a test for elastic-net (where alpha != 0 AND beta != 0)

@zermelozf
Contributor Author

Elastic net test added and w_scale dependency removed in lagged update.

@mblondel
Member

Sounds great. Thanks for the awesome contrib. Are we good to go?

By the way, do you plan to add group lasso later?

@fabianp
Member

fabianp commented Nov 19, 2015

Good for me. We just need to add SAGAClassifier and SAGARegression to doc/references.rst.

I do have a prox for the group lasso penalty that I'm using. I'll contribute it once this is merged if you think it's worthwhile (I do think it would be a nice addition).

@mblondel
Member

+1

@zermelozf
Contributor Author

@fabianp @mblondel Doc reference updated.

@mblondel
Member

Alright, pressing the green button :)

mblondel added a commit that referenced this pull request Nov 19, 2015
@mblondel mblondel merged commit 4a455f8 into scikit-learn-contrib:master Nov 19, 2015
@mblondel
Member

If you've got time, a SAGA vs. SDCA vs. Adagrad comparison for elastic net would be nice. You can build an example based on this:
http://www.mblondel.org/lightning/auto_examples/plot_l2_solvers.html

FYI SDCA doesn't work well without l2 regul (e.g. l1 regul only).

@zermelozf
Contributor Author

Yes, I was planning to do that as suggested by @fabianp as well. Thanks for the link, it seems like you have done 99% of the work already. I'll have a look at it next week after I finish a couple of things.

@fabianp fabianp deleted the fabian-saga-squashed branch November 19, 2015 15:30
@fabianp
Member

fabianp commented Nov 19, 2015

Great work @zermelozf
