
[MRG+1] Updating a random feature with replacement for every iteration in the cd code #3335

Closed
MechCoder wants to merge 7 commits into scikit-learn:master from MechCoder:random_with_replacement

7 participants

@MechCoder
Owner

Testing if random with replacement has the same effect on the duke and arcene datasets.

@MechCoder
Owner

@agramfort @ogrisel These are the benches that I get for the duke and arcene dataset.

Duke:

bench_duke_with_replacement

Arcene:

best_alpha_arcene

@MechCoder
Owner

Umm, they do look similar.

@agramfort
Owner
@agramfort
Owner
@MechCoder
Owner

@ogrisel @agramfort I have pushed the GIL free code now.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 14dadff on MechCoder:random_with_replacement into 82611e8 on scikit-learn:master.

@MechCoder
Owner

Setting tol to 1e-8
bench duke time with alpha_e-8

When I set tol to 1e-12:

bench duke time with alpha

When I set tol to 1e-18:
bench duke time with alpha_e-8

Setting tol to 0:
bench duke time with alpha_0

@agramfort
Owner
@ogrisel
Owner

@MechCoder please also do a train / test split, and for each value of alpha measure the training time and the test score for each method and then put 2 plots:

  • one for training time vs alpha
  • one for test score vs alpha

Use plt.semilogx to have a log scale for the alphas axis. And put a plt.ylim(0, None) to avoid biasing the reading of the y axis.
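
For instance, a minimal plotting sketch along those lines (placeholder data only; in the real benchmark the arrays would come from timing and scoring the fits over the alpha grid):

import numpy as np
import matplotlib.pyplot as plt

# placeholder data, just to illustrate the layout of the two plots
alphas = np.logspace(-4, 0, 20)
train_times = {"cyclic": np.random.rand(20), "random": np.random.rand(20)}
test_scores = {"cyclic": np.random.rand(20), "random": np.random.rand(20)}

fig, (ax_time, ax_score) = plt.subplots(2, 1)
for method in ("cyclic", "random"):
    ax_time.semilogx(alphas, train_times[method], label=method)   # log scale for the alphas
    ax_score.semilogx(alphas, test_scores[method], label=method)
ax_time.set_ylabel("training time (s)")
ax_score.set_ylabel("test score")
ax_score.set_xlabel("alpha")
for ax in (ax_time, ax_score):
    ax.set_ylim(0, None)   # avoid biasing the reading of the y axis
    ax.legend()
plt.show()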

@MechCoder
Owner
@ogrisel
Owner

@MechCoder BTW please try to use more descriptive titles for your PRs. Here it could be "[WIP] Random Coordinate Descent: sampling coordinates with replacement"

@MechCoder MechCoder changed the title from WIP: Random with replacement to WIP: Updating a random feature with replacement for every iteration in the cd code
@MechCoder
Owner

@agramfort @ogrisel I used a grid of 100 alphas for different tolerances, and set the number of iterations as high as possible. I get good results for the random descent. Here are the plots.

Tolerance = 0
tol 0

Tolerance = 1e-18
tol 1e-18

Tolerance = 1e-12
tol 1e-12

Tolerance = 1e-08
tol 1e-08

sklearn/linear_model/coordinate_descent.py
@@ -596,6 +601,13 @@ class ElasticNet(LinearModel, RegressorMixin):
positive: bool, optional
When set to ``True``, forces the coefficients to be positive.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

If set to True, a random coefficient is updated at every iteration
rather than looping over features sequentially.

sklearn/linear_model/coordinate_descent.py
@@ -791,6 +808,13 @@ class Lasso(ElasticNet):
positive : bool, optional
When set to ``True``, forces the coefficients to be positive.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

same here

sklearn/linear_model/coordinate_descent.py
@@ -1192,6 +1220,13 @@ class LassoCV(LinearModelCV, RegressorMixin):
positive : bool, optional
If positive, restrict regression coefficients to be positive
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

sklearn/linear_model/coordinate_descent.py
@@ -1303,6 +1339,13 @@ class ElasticNetCV(LinearModelCV, RegressorMixin):
positive : bool, optional
When set to ``True``, forces the coefficients to be positive.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

also you forgot the " " before : in all the new docstrings.

sklearn/linear_model/coordinate_descent.py
@@ -1431,6 +1475,13 @@ class MultiTaskElasticNet(Lasso):
When set to ``True``, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

sklearn/linear_model/coordinate_descent.py
@@ -1586,6 +1640,13 @@ class MultiTaskLasso(MultiTaskElasticNet):
When set to ``True``, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

sklearn/linear_model/coordinate_descent.py
@@ -1699,6 +1764,13 @@ class MultiTaskElasticNetCV(LinearModelCV, RegressorMixin):
all the CPUs. Note that this is used only if multiple values for
l1_ratio are given.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

sklearn/linear_model/coordinate_descent.py
@@ -1829,6 +1904,13 @@ class MultiTaskLassoCV(LinearModelCV, RegressorMixin):
all the CPUs. Note that this is used only if multiple values for
l1_ratio are given.
+ random_state: int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort Owner

here too

@agramfort
Owner

besides LGTM

@MechCoder
Owner

@agramfort I've updated it and changed it to MRG + 1

@MechCoder MechCoder changed the title from WIP: Updating a random feature with replacement for every iteration in the cd code to [MRG+1] : Updating a random feature with replacement for every iteration in the cd code
@MechCoder MechCoder changed the title from [MRG+1] : Updating a random feature with replacement for every iteration in the cd code to [MRG+1] Updating a random feature with replacement for every iteration in the cd code
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling acd97a5 on MechCoder:random_with_replacement into 82611e8 on scikit-learn:master.

@ogrisel
Owner

@MechCoder thanks for the new benchmarks. Could you please try something intermediate like tol=1e-15 to explore the transition between the tol=1e-12 regime and the tol=1e-18 regime?

  • Is the time on the y axis measured in seconds?

  • I think it would be worth cleaning your script and making it a benchmark script if it runs for more than 30 seconds total.

  • You can use the '--' style for the second line on the score plot to show that the score curves overlap (BTW, it is weird to see a multi-modal regularization path; has anyone else seen this in the past?).

  • Which dataset is this? The dataset name could be added in the title of the figure.

The narrative documentation should be updated with a paragraph explaining the expected impact of shuffle=True: it can speed up convergence, especially when tol is smaller than 1e-10 and regularization is light. The amount of speedup between shuffle=True and shuffle=False is data dependent.

Also please add a "reference" section to the narrative documentation to add the missing reference to the elastic net coordinate descent paper and a new reference to the Stochastic CD paper. @mblondel @agramfort any other paper to reference there?

@ogrisel
Owner

Also in the legend of the benchmark script, "Normal descent" could be replaced by the more explicit "Cyclic descent.".

@ogrisel
Owner

The benchmarks seem to demonstrate that shuffle=True is never worse than shuffle=False. I would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

@agramfort
Owner
@ogrisel
Owner

The narrative documentation for the ElasticNet(CV) class should also state explicitly that coordinate descent is used as optimizer.

@ogrisel
Owner

@MechCoder for the syntax of the reference block, please have a look at other occurrences of such blocks in the existing documentation. You can find them with:

git grep ".. topic:: References:"
@ogrisel
Owner

Ok, I understand the difference between the two regimes: in one case one reaches max_iter for almost any value of alpha, and in the other case max_iter is almost never reached.

@ogrisel
Owner

Could you please relax the upper limit of the tol=1e-12 plot to see whether we reach the flat area of max_iter on the left hand side of the time curves?

@ogrisel
Owner

The benchmarks seem to demonstrate that shuffle=True is never worse than shuffle=False. I would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

Why not, but that should be justified by a more thorough benchmark.

One could try on the Boston dataset to check that it's not getting worse on a dataset that does not have many (noisy) features.

@GaelVaroquaux

:+1: on setting shuffle=True as a default if the Boston dataset validates that it is in general better.

@MechCoder
Owner

Is the time on the y axis measured in seconds?

Yes

would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

There are a number of tests that fail if I do this and most of them look non-trivial.

@ogrisel
Owner

There are a number of tests that fail if I do this and most of them look non-trivial.

Do you mean that there are tests where the random CD algorithm does not converge to a good solution (compared to the cyclic CD algorithm currently used in master)?

Other than that they should not be that complicated to fix. In the worst case we could put shuffle=False individually for some tests.

If we set shuffle=True by default we should not forget to fix the random_state argument in all the tests that run CD models, to still get deterministic test runs.

@MechCoder
Owner

I get different results when I bench with a fixed number of iterations, not allowing it to converge fully. Give me some time and I will post the results here.

@MechCoder
Owner

These are the final benchmarks, and I'm sorry for the confusion: I somehow got the random and normal curves swapped in the last set. These new benchmarks are in line with #3335 (comment)

Setting the number of iterations very high, which essentially means this runs until convergence, i.e. until the condition gap < tol is met.

Tolerance is zero
tol0

Tolerance is 1e-18
tole-18

Tolerance is 1e-15
tole-15

Tolerance is 1e-12
tole-12

Tolerance is 1e-8
tole-8

Setting the tolerance to zero, which means the only parameter is the number of iterations.

n_iteration = 100
n_iter 100

n_iteration = 200
n_iter 200

n_iteration=1000
n_iter 1000

n_iteration=500
tol 500

@MechCoder
Owner

@agramfort @ogrisel

Based on this dataset, can we say that random is useful:

  1. For a lower number of iterations and a smaller alpha, to get a better score.

  2. And maybe with higher tolerances, like 1e-8, to get faster convergence?

@ogrisel
Owner

The first benchmark seems to highlight that random coordinate descent is never significantly faster than cyclic CD, contrary to what you reported previously. This looks quite in contradiction with the plot of nb iterations vs tol.

The second benchmark, which fixes the number of iterations, is weird. In practice one would never want to stop a CD optimizer before convergence. Instead one would fix the tolerance to a reasonable value (e.g. between 1e-4 and 1e-12) and then choose the best value of alpha to trade bias and variance and find the best test score.

@ogrisel
Owner

Could you please push the script you used for that last benchmark as a gist?

@MechCoder
Owner

@ogrisel The benchmark in this comment, #3335 (comment), is dual gap against n_iter and not time. I admit there was some mistake in #3335 (comment) in the tol zero and tol 1e-18 parts.

Alex had told me that in some cases we would not be blessed with / have enough resources for a max number of iterations high enough (around 50k) for the model to fully converge. That is the reason for the second plot: to study the time taken for a fixed number of iterations.

This is the gist: https://gist.github.com/MechCoder/c9ab2b6de8f81c916d15

@MechCoder
Owner

@ogrisel It converges with a lower dual gap, but takes more time. Could it be because of some overhead in generating the random feature?

@ogrisel
Owner

@ogrisel The benchmark in this comment, #3335 (comment), is dual gap against n_iter and not time.

I know that, but if we expect the RNG computation to be negligible, then random should be faster than cyclic as it should need fewer iterations to reach a given tol.

That is the reason for the second plot: to study the time taken for a fixed number of iterations.

This is interesting as a debugging tool for us but IMO it should not be presented as a way for our users to select the best alpha. In my opinion the user should just select a good tol and not care about the max_iter parameter unless they get the convergence warning.

@ogrisel It converges with a lower dual gap, but takes more time. Could it be because of some overhead in generating the random feature?

This could be confirmed or refuted by making the elastic net cython code return the effective number of iterations, storing it as an additional n_iter_ attribute on the ElasticNet & Lasso classes at the end of a call to the fit method, and then plotting: training time vs alpha, n_iter_ vs alpha and test score vs alpha.
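
A rough sketch of such a benchmark loop (assuming the fit stores the effective number of iterations as an n_iter_ attribute as suggested, and using the shuffle parameter from this branch; the dataset and alpha grid are only illustrative):

import time
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import ElasticNet
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later versions

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.33, random_state=0)

alphas = np.logspace(-4, 0, 20)
results = {False: [], True: []}   # keyed by the shuffle flag
for alpha in alphas:
    for shuffle in (False, True):
        model = ElasticNet(alpha=alpha, shuffle=shuffle, random_state=0,
                           tol=1e-8, max_iter=50000)
        t0 = time.time()
        model.fit(X_train, y_train)
        results[shuffle].append((time.time() - t0,       # training time
                                 model.n_iter_,          # assumed new attribute
                                 model.score(X_test, y_test)))
# each list can then be plotted against alphas: time, n_iter_ and test score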

@MechCoder
Owner

@ogrisel

Which would be more useful?
1. To fix the number of iterations and see if random converges to a lower dual gap, or
2. To set the tolerance to a given level (say 1e-4, 1e-8, 1e-12) and see which converges in fewer iterations?

Are you saying the second is more helpful?

@MechCoder
Owner

@ogrisel @agramfort I cherry-picked that commit on top of this branch, and these are the benchmarks that I get. The benching for tol 1e-18 is taking some time. Random descent takes fewer iterations to converge.

Tolerance of 1e-8
tole-8with_iter

Tolerance of 1e-12
tole-12_niter

Tolerance of 1e-15
tole-15withiter

@MechCoder
Owner

I don't really understand why I am getting slightly different benches for time. Is it because of the randomness?

@ogrisel
Owner

@MechCoder Thanks for the benchmarks, those combined plots are much clearer. RCD is not that bad anymore, at least on duke.

The fact that the time improvement is lower than the n_iter improvement over cyclic CD can be caused either by:

  • the computational overhead of sampling from the RNG, but honestly that should be negligible if the number of samples is not too small
  • suboptimal use of the CPU cache because of the random memory access pattern caused by the Random CD algorithm.

Could you please run similar benchmarks for other datasets, e.g. sklearn.datasets.load_boston and Arcene, and maybe California housing (in sklearn datasets as well)? You can do it just for tol=1e-8 or even lower values such as tol=1e-4 to get a first idea without waiting for ages for the model to converge. Please report n_samples and n_features next to your plots (e.g. in the title of the plots) so that we can get an intuition just by looking at the plot.

The difference in CPU cache miss-rate should be measurable using 2 small benchmark scripts and the linux perf tool:

$ apt-get install linux-tools-common
$ perf -h
# install the package matching your kernel version
$ perf stat -e cycles,instructions,cache-misses -a python cyclic_cd_bench_script.py
$ perf stat -e cycles,instructions,cache-misses -a python stochastic_cd_bench_script.py

If we see that the miss-rate is significantly higher for the stochastic benchmark then it probably means that the random access pattern is the culprit.

More info on linux perf here: https://perf.wiki.kernel.org/index.php/Tutorial

There is also PAPI but it sounds more complicated to install and use.

Maybe @larsmans, @jnothman and @glouppe also have ideas to check that hypothesis.

@MechCoder
Owner

There is no use benching for a lower tolerance on the duke dataset; it takes a really long time to run, 200,000+ iterations.

@ogrisel
Owner

Actually I meant higher tolerance like 1e-4, sorry.

@MechCoder
Owner

For the boston dataset.

bostone-12
bostone-15
boston_tole-4
bostontole-8

@GaelVaroquaux

Does this supersede #3300? If so, you should close #3300.

@agramfort
Owner
@vene
Owner

I take issue with calling this shuffle=True just for consistency with SGD.

In SGD (last time I checked), the rows (actually, the indices) are shuffled before each iteration, effectively doing random sampling without replacement, but making sure that each epoch visits each data point once.

Calling it shuffle=True here would lead me to think it does the same thing. Sampling with replacement is supposed to work better and to match theoretical results. This can mislead users; it's fake consistency. Rather than calling them the same and having gigantic warnings in the comments, I'd just rename it to e.g. randomized=True.

More comments:

  • Plots with unlabeled axes are hard to read, and sometimes your plot titles are wrong (the boston benchmarks say duke). For this reason I find it hard to tell, at a glance, how bad the performance decrease is for boston.
  • I would still be curious to see how this fares on a big text dataset like Amazon7, as @mblondel suggested.
  • Can turning this on by default ever change the results of user code, or just the convergence time? In the first case it would warrant deprecation warnings.
@ogrisel
Owner

We could have method="stochastic_cd" vs method="cyclic_cd".

@GaelVaroquaux
@agramfort
Owner
@mblondel
Owner

How about selection="cyclic", selection="random" or coordinate_selection="cyclic", coordinate_selection="random" (longer but more explicit)?

@agramfort
Owner
@ogrisel
Owner

I like @mblondel's suggestion as well. I have no strong opinion on selection vs coordinate_selection.
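
For illustration, the user-facing call would then read roughly like this (a sketch, assuming the parameter lands on ElasticNet as proposed):

from sklearn.linear_model import ElasticNet

# current behaviour: loop over the coordinates in order
enet_cyclic = ElasticNet(alpha=0.1, selection='cyclic')
# proposed option: sample one coordinate with replacement at every update
enet_random = ElasticNet(alpha=0.1, selection='random', random_state=0)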

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 09c4aad on MechCoder:random_with_replacement into fb87430 on scikit-learn:master.

@MechCoder
Owner

These are the benchmarks for the california dataset.

For tolerance 1e-4

california_tole-8

For tolerance 1e-8
california_tole-81

For tolerance 1e-12
califoniatole-12

@MechCoder
Owner

For the California dataset, the benchmarks show that even though the time taken to converge is almost the same, random descent takes more iterations. I shall bench with a bigger dataset now.

@MechCoder
Owner

@ogrisel

For the California dataset.
with shuffle=True

3,24,99,62,303 cycles                    [100.00%]
4,14,50,94,547 instructions              #    1.28  insns per cycle         [100.00%]
   1,37,68,875 cache-misses                                                

   0.973972635 seconds time elapsed

with shuffle=False

3,76,09,80,486 cycles                    [100.00%]
4,56,70,57,043 instructions              #    1.21  insns per cycle         [100.00%]
   1,44,23,273 cache-misses                                                

   1.035316444 seconds time elapsed

It looks like there is not much difference between both.

@MechCoder
Owner

For the duke dataset.

shuffle=False

13,35,96,02,61,099 cycles                    [27.78%]
24,22,96,73,31,327 instructions              #    1.81  insns per cycle         [27.78%]
    2,15,23,11,099 cache-misses                                                 [27.78%]

     330.178694464 seconds time elapsed

shuffle=True

 8,29,74,91,37,140 cycles                    [27.78%]
12,11,26,34,55,288 instructions              #    1.46  insns per cycle         [27.78%]
    1,51,28,28,510 cache-misses                                                 [27.78%]

     209.530373490 seconds time elapsed

This definitely looks better for the duke dataset.

@ogrisel
Owner

This definitely looks better for the duke dataset.

The second perf report for the Duke data set is for shuffle=False, right?

If so it means that shuffle=True introduces more cache-misses than shuffle=False. Therefore Random CD is actually worse (CPU cache-wise) than the original Cyclic CD from the master branch. That could explain that the run time of Random CD is not significantly shorter than for Cyclic CD despite the lower number of iterations to convergence on this data.

Also a general note: please try to make objective statements when you report results rather than "This definitely looks better for the duke dataset.", which doesn't mean anything.

Can you confirm that you still get results in the same order if you re-run the perf cache profiling several times?

Please also be explicit about what is reported for each benchmark. Guessing what is what by implicitly matching the ordering of the past benchmarks is tedious and can lead to interpretation errors. Explicit is better than implicit.

@GaelVaroquaux

It's kinda interesting: what we are finding here is, AFAIK, in contradiction with the published literature :)

@ogrisel
Owner

For the California dataset, the benchmarks show that even though the time taken to converge is almost the same, random descent takes more iterations.

A run time of 2ms is probably dominated by Python boilerplate and input checking overhead rather than the inner Cython loop of the CD optimizer.

I shall bench with a bigger dataset now.

Indeed it's best to bench on datasets where the individual CD optimizer run for at least 100ms to safely ignore the Python boilerplate overhead.

@ogrisel
Owner

It's kinda interesting: what we are finding here is, AFAIK, in contradiction with the published literature :)

The fact that randomized CD destroys CPU cache efficiency to the point of counterbalancing the purely algorithmic benefits is probably very architecture-, implementation- and dataset-specific. It would be worth checking on a dataset larger than duke.

@MechCoder you could try to make the boston dataset more complicated by adding

import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
n_samples, n_true_features = boston.data.shape
rng = np.random.RandomState(42)
n_noise_features = 10000
X = np.hstack([boston.data, rng.normal(size=(n_samples, n_noise_features))])
y = boston.target

You should also try on sparse data, e.g. fetch_20newsgroups_vectorized, to test the sparse code path.

@GaelVaroquaux
@MechCoder
Owner

@ogrisel I'm extremely sorry; by better I meant that the second one (the one with the shorter run time) was for shuffle=True.

@ogrisel
Owner

@ogrisel I'm extremely sorry; by better I meant that the second one (the one with the shorter run time) was for shuffle=True.

This is weird then, as it means that random CD on Duke (for this alpha and this tolerance level, which you did not report BTW) takes 66% of the time of the cyclic CD. This improvement ratio is not what was observed in your previous benchmarks. How do you explain this? What has changed between the 2 benchmarks? The tolerance level?

@MechCoder
Owner

@ogrisel I have benched according to your previous comment, #3335 (comment),

i.e. for the same grid of 100 alphas, however limiting the tolerance to just 1e-4 and 1e-8. I'm also confused as to why this happens. I shall bench one more time just to verify this.

@ogrisel
Owner

It might be the case that random CD is only beneficial for high tolerance (e.g. tol=1e-4) while the benefit vanishes completely for lower tolerance levels from 1e-15 to 1e-18. If this is the case, it means that random CD is still interesting in practice, as for machine learning we don't necessarily want to optimize to tol=1e-15. I think tol=1e-4 is a good default. If confirmed, the impact of tol on the optimal choice of the optimization method should be documented (either in the docstring or the narrative doc, or both).

@agramfort
Owner
@MechCoder
Owner

Dataset : Duke Dataset
alpha : grid of alphas given by the _alpha_grid function.
tol = 1e-4
setting shuffle as False

 1,82,15,75,69,631 cycles                    [100.00%]
 3,68,81,61,31,525 instructions              #    2.02  insns per cycle         [100.00%]
  15,34,97,031 cache-misses                                                

  52.062235532 seconds time elapsed

Setting shuffle as True, i.e. updating a random feature

   80,17,35,60,628 cycles                    [100.00%]
 1,42,44,49,40,765 instructions              #    1.78  insns per cycle         [100.00%]
   4,68,95,408 cache-misses                                                

      25.160964047 seconds time elapsed

decreasing tolerance to 1e-8

This is for the cyclic update.

 10,02,43,19,82,602 cycles                    [100.00%]
19,59,75,79,22,631 instructions              #    1.96  insns per cycle         [100.00%]
1,20,36,69,920 cache-misses                                                

 269.065356493 seconds time elapsed

Updating a random feature, i.e. shuffle = True

 6,57,51,93,29,201 cycles                    [100.00%]
10,25,43,29,89,054 instructions              #    1.56  insns per cycle         [100.00%]
  90,61,38,834 cache-misses                                                

 176.856839635 seconds time elapsed

If we start to think that the higher the tolerance, the greater the benefit, I benched for a still higher tol (1e-3);
the speedup is comparatively less than for a tol of 1e-4.

Random descent:

40,37,26,64,624 cycles                    [100.00%]
62,23,85,20,269 instructions              #    1.54  insns per cycle         [100.00%]
    6,00,46,084 cache-misses                                                
  10.911813916 seconds time elapsed

Cyclic descent:

Performance counter stats for 'system wide':

48,44,05,43,476 cycles                    [100.00%]
1,04,34,12,32,806 instructions              #    2.15  insns per cycle         [100.00%]
   3,43,09,248 cache-misses                                                

  14.878715270 seconds time elapsed

As one final attempt, I benched for a lower tol (1e-12).
For cyclic descent:

14,77,77,24,85,007 cycles                    [100.00%]
28,47,35,97,63,158 instructions              #    1.93  insns per cycle         [100.00%]
    1,87,19,15,116 cache-misses                                                

     389.043936137 seconds time elapsed

Random feature update.

12,87,76,91,23,047 cycles                    [100.00%]
19,75,08,72,86,695 instructions              #    1.53  insns per cycle         [100.00%]
    1,96,13,85,404 cache-misses                                                

     338.385406497 seconds time elapsed

As @ogrisel has mentioned, it is clear that benching at a higher tolerance means random is (more?) beneficial. But I am still not sure whether this depends on the data or is always the case.

I shall run similar benches for the boston, arcene and california datasets to make sure.

@MechCoder
Owner

Data: the noisy Boston data obtained by adding 10000 noise features.
Tolerance: 1e-12
The grid of alphas obtained using _alpha_grid.

For updating a random feature

3,42,87,14,88,365 cycles                    [100.00%]
3,62,28,50,58,064 instructions              #    1.06  insns per cycle         [100.00%]
2,21,30,04,881 cache-misses                                                
  93.064692844 seconds time elapsed

For cyclic update:

Performance counter stats for 'system wide':

 1,59,51,93,39,102 cycles                    [100.00%]
 1,75,90,17,28,045 instructions              #    1.10  insns per cycle         [100.00%]
      73,83,38,189 cache-misses                                                

  42.122770462 seconds time elapsed

For a much higher tolerance (1e-4)
For updating a random feature

1,32,85,76,47,392 cycles                    [100.00%]
1,36,76,34,17,559 instructions              #    1.03  insns per cycle         [100.00%]
  89,10,50,934 cache-misses                                                

  36.352075984 seconds time elapsed

For a cyclic update:

Performance counter stats for 'system wide':

  62,89,26,66,488 cycles                    [100.00%]
  70,01,06,77,576 instructions              #    1.11  insns per cycle         [100.00%]
  35,86,51,120 cache-misses                                                

  18.888608539 seconds time elapsed

Intuition makes me think (I may be wrong) that since there are a great number of noisy features, updating a random feature at a time might mean converging faster, but the benchmarks say otherwise.

@MechCoder
Owner

I'm not able to find a single dataset in which there is a conclusive win other than the Duke dataset. I shall now bench for the arcene dataset since it is explicitly mentioned in the SGD paper that @ogrisel has provided.

@MechCoder
Owner

For the arcene dataset (100 x 10000)
Tol = 1e-4
Setting shuffle = True, i.e. updating a random feature.

Performance counter stats for 'system wide':

43,95,20,98,73,533 cycles                    [41.66%]
46,04,41,52,65,193 instructions              #    1.05  insns per cycle         [41.66%]
49,25,40,44,009 cache-misses                                                 [41.66%]

 1208.838631889 seconds time elapsed

Doing a cyclic update:

1,38,17,80,36,58,203 cycles                    [41.67%]
1,79,41,66,67,42,926 instructions              #    1.30  insns per cycle         [41.67%]
 54,55,25,50,748 cache-misses                                                 [41.67%]

3349.204620314 seconds time elapsed
@MechCoder
Owner

@ogrisel So we can see that there are benefits for the datasets given in the SGD paper, i.e. Duke and Arcene, w.r.t. the time taken to converge. I shall bench for a high-dimensional dataset tomorrow. Is there any detail that I have missed in the above comments?

@ogrisel
Owner

@ogrisel So we can see that there are benefits for the datasets given in the SGD paper, i.e. Duke and Arcene, w.r.t. the time taken to converge. I shall bench for a high-dimensional dataset tomorrow.

Don't confuse SGD with SCD. SGD is unrelated to this PR (randomization happens on the data in the samples dimension).

It would indeed be great to confirm on other datasets. You can take a subset of 20 newsgroups, for instance the Atheism vs Religion classes. Atheism could have target value -1 and Religion +1 to artificially cast this classification dataset as a regression problem to bench Lasso / ElasticNet on it.
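
A minimal sketch of preparing that subset (using the raw 20 newsgroups loader plus a TF-IDF vectorizer; the exact vectorization choices are only illustrative):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(subset='train', categories=categories)
X = TfidfVectorizer().fit_transform(newsgroups.data)      # sparse CSR matrix
# cast the two classes to regression targets: atheism -> -1, religion -> +1
y = np.where(newsgroups.target == 0, -1.0, 1.0)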

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling d6fff51 on MechCoder:random_with_replacement into 376ac51 on scikit-learn:master.

@MechCoder MechCoder FIX: Fix sparse cd for random coordinate descent
cc9c324
@MechCoder
Owner

@ogrisel There was a bug in the random update of the sparse coordinate descent code that I hadn't realised for one and a half days (fixed in the last commit). I thought the slowness of random descent was due to the randomness, then I found out that it was because the start pointer was the same!

@ogrisel
Owner

then I found out that it was because the start pointer was the same

I don't understand this last statement. Could you please explain in more detail?

@MechCoder
Owner

There were two problems.

1] In the previous code, startptr was initialized to X_indptr[0], but now we pick a feature ii at random, so it should be X_indptr[ii].

2] In the previous code, at the end startptr is initialized to the previous endptr. This makes sense when the descent is cyclic; for example, at the end of updating feature 2, startptr is initialized to X_indptr[3].

For the random case that means, after updating (say) feature n, startptr is initialized to X_indptr[n + 1] (in Line 431), however it should be initialised to X_indptr[ii] where ii is the next random feature (in Line 385).

It is a really silly mistake.
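
To illustrate with a small NumPy/SciPy sketch (not the actual cd_fast.pyx code): in a CSC matrix the non-zeros of feature ii live in the slice X_indptr[ii]:X_indptr[ii + 1], so when features are visited in random order the start pointer has to be recomputed from the sampled index rather than carried over from the previous feature.

import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
dense = rng.rand(5, 4) * (rng.rand(5, 4) > 0.5)   # small matrix with some zeros
X = sparse.csc_matrix(dense)

ii = rng.randint(X.shape[1])                      # feature index sampled with replacement
startptr, endptr = X.indptr[ii], X.indptr[ii + 1]
rows = X.indices[startptr:endptr]                 # sample indices with a non-zero in feature ii
vals = X.data[startptr:endptr]                    # the corresponding values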

@MechCoder
Owner

@ogrisel I have finally benched for the newsgroup dataset.

Setting the tolerance to 1e-4, I used the grid of alphas obtained from _alpha_grid (as usual).

These are the benchmarks I got for the random descent

 9,04,42,89,09,287 cycles                    [38.20%]
15,90,55,14,02,131 instructions              #    1.76  insns per cycle         [38.19%]
  71,72,27,118 cache-misses                                                 [38.19%]

 252.283861536 seconds time elapsed

For the cyclic descent,

14,55,14,87,64,893 cycles                    [72.92%]
26,20,67,80,97,173 instructions              #    1.80  insns per cycle         [72.92%]
  81,12,95,268 cache-misses                                                 [72.92%]

 405.704868804 seconds time elapsed

Just to make sure that this is not just an effect of the higher tolerance, I benched for a tolerance of 1e-8.

For the random descent

26,50,19,90,84,400 cycles                    [72.92%]
46,77,46,32,09,368 instructions              #    1.76  insns per cycle         [72.92%]
 1,99,79,21,006 cache-misses                                                 [72.92%]

 704.953551630 seconds time elapsed

For the cyclic descent

41,28,66,12,04,441 cycles                    [50.00%]
74,84,59,17,11,280 instructions              #    1.81  insns per cycle         [50.00%]
2,20,69,84,208 cache-misses                                                 [50.00%]

1133.909704090 seconds time elapsed

These are the plots for number of iterations.
Tolerance 1e-4
tol1e-4

Tolerance 1e-8
tol1e-8

@MechCoder
Owner

@ogrisel So, from the above benches, can we be convinced that keeping this option is helpful?

@vene
Owner

Looking at these last plots it seems helpful indeed. I'd like to see a convergence plot for a fixed problem, with n_iter on the x axis (or even wall time), which is what I'm used to seeing in literature.

@vene
Owner

Also +1 for Alex's question on OOB scores. One more note: I might be dense, but I have no idea how to read/compare the output of your benchmarks (cycles, cache misses, and the percentages). Could you clarify?

@MechCoder
Owner

The cache misses refer to the number of times the CPU was unable to read the data from the cache and had to reload it from memory.

The cycles refer to the number of CPU cycles, and the instructions are simply the number of instructions given to the CPU.

I actually googled and got most of it from this link http://stackoverflow.com/questions/23023193/units-of-perf-stat-statistics after @ogrisel had suggested it.

@MechCoder
Owner

@agramfort @vene

This is a bench of n_iterations against dual gap for the same dataset.

I set the alpha to 0.00010971276060444348, since it was the best alpha that I got from ElasticNetCV.
n_iters_vs_dual_gap

@MechCoder
Owner

This basically shows that, for a given n_iter (for the newsgroup dataset), random descent converges faster.

@MechCoder
Owner

@vene @agramfort

This is the OOB score using the same conditions as the last plot; I did a 2:1 split using train_test_split. I don't think we can conclude from the plot whether random or cyclic descent has a better accuracy score.

oob_score

@agramfort
Owner
@MechCoder
Owner

I have averaged the scores across ten splits, other settings kept the same. It does seem that random wins in some cases.

oob_average_split

@MechCoder
Owner

@ogrisel @vene From the benches above, we can see that random descent performs better for the newsgroup, duke and arcene datasets in terms of speed. Do you have any more comments? I have renamed shuffle to selection as per @mblondel's suggestion.

@MechCoder MechCoder Renamed shuffle to selection and added tests
00f8c62
@ogrisel
Owner

I have averaged the scores across ten splits, other settings kept the same. It does seem that random wins in some cases.

You cannot tell without adding error bars for the standard deviation (and / or the standard error) of the mean. I think they are equivalent on average. It would be interesting to see whether the standard deviation of Random CD is significantly larger than that of Cyclic CD though.
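
Something like plt.errorbar would make that visible (a sketch with placeholder score arrays, one row per split and one column per alpha):

import numpy as np
import matplotlib.pyplot as plt

alphas = np.logspace(-4, 0, 10)
scores = {"cyclic": np.random.rand(10, len(alphas)),   # placeholder: rows = splits
          "random": np.random.rand(10, len(alphas))}

for method, s in scores.items():
    plt.errorbar(alphas, s.mean(axis=0), yerr=s.std(axis=0), label=method)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("mean test score across splits")
plt.legend()
plt.show()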

@ogrisel
Owner

The n_iter vs dual gap plot is very interesting. Indeed Random CD has a better convergence profile.

@MechCoder
Owner

@ogrisel Sorry, I meant to say that random does not perform worse. What more do I need to do to get this merged?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 00f8c62 on MechCoder:random_with_replacement into 5cd2da2 on scikit-learn:master.

@ogrisel
Owner

What more do I need to do to get this merged?

There is no hurry. I still don't know whether it's better to use Random or Cyclic CD by default.

Could you please reproduce the benchmarks for tol=1e-4 and tol=1e-8 for duke and arcene with the pointer bugfix?

Could you also do a (validation score & training score) vs tol plot for duke and newsgroups? Please vary tol on np.logspace(-1, -10, 4).
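
Roughly something like this (a sketch on synthetic data, using the selection parameter from this branch; on duke / newsgroups you would plug in the real train/validation split and the CV-selected alpha):

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later versions

rng = np.random.RandomState(0)
X = rng.randn(100, 500)
y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(100)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, random_state=0)
best_alpha = 1e-3   # placeholder for the alpha selected by ElasticNetCV

for tol in np.logspace(-1, -10, 4):   # 1e-1, 1e-4, 1e-7, 1e-10
    model = ElasticNet(alpha=best_alpha, tol=tol, max_iter=50000,
                       selection='random', random_state=0).fit(X_train, y_train)
    print("tol=%g train=%.3f valid=%.3f"
          % (tol, model.score(X_train, y_train), model.score(X_valid, y_valid)))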

@ogrisel
Owner

For the benchmarks I would like to see time vs alpha, n_iter vs alpha and validation score vs alpha, vertically stacked, for tol=1e-4 and tol=1e-8, for duke, arcene and newsgroups.

@ogrisel
Owner

@vene About the perf statistics output, cyclic CD gets ~1.80 instructions per cycle and random CD ~1.75. Both are good values. The random CD does more random memory access, hence the CPU cache miss rate is slightly higher but apparently not to the point of completely destroying the algorithmic convergence advantage of random CD, at least for high tolerance levels.

@vene
Owner

hence the CPU cache miss rate is slightly higher

That's exactly what I don't understand; I would expect it to be much higher. Or is this because the cyclic one doesn't take advantage of the memory layout and it could be much lower?

@ogrisel
Owner

That's exactly what I don't understand; I would expect it to be much higher. Or is this because the cyclic one doesn't take advantage of the memory layout and it could be much lower?

I don't know the inner workings of CPUs well enough to explain it either.

@MechCoder
Owner

@ogrisel I benched for the 20 newsgroups dataset and the duke dataset (the arcene dataset is currently taking a bit of time).

I followed the convention of blue for random and red for cyclic.
The grid of alphas is the 100 values taken from _alpha_grid.
I did a 2:1 split and took the mean.

Duke dataset (for tolerance 1e-4)
duke_tol4

Duke dataset (tol 1e-8)
duke_tol8

Newsgroup (tol e-4)

newsgroup_tol4

Newsgroup (tol e-8)
newsgroup_tol8

So this is the same as the previous results.

@agramfort
Owner
@ogrisel
Owner

@ogrisel how about keeping cyclic the default?

+1

@MechCoder
Owner

@ogrisel How do we deal with the AppVeyor build failures?

@ogrisel
Owner

So this is the same as the previous results.

Well no: there used to be a peak validation score (~ 0.15 explained variance) for alpha around 0.2 / 0.3 on Duke, and now the score is almost always flat at 0.85. What has changed?

@ogrisel
Owner

@ogrisel How do we deal with the AppVeyor build failures?

This specific AppVeyor failure is spurious; I reported the issue to the AppVeyor developers. Please ignore it in the meantime.

@ogrisel
Owner

Also I would be very interested in the following benchmark:

Could you also do a (validation score & training score) vs tol plot for duke and newsgroups? Please vary tol on np.logspace(-1, -10, 4).

tol on the x-axis and the validation & training scores on the y-axis for a fixed value of alpha, preferably the best value found by CV.

@MechCoder
Owner

@ogrisel

Are you referring to this (#3335 (comment))?

I think this has two reasons

  1. The pointer bug fix.
  2. The previous scores are scores on the training data, and I made a mistake by using the default score of ElasticNet, i.e. the r2 score. The recent benchmarks are the average of scores on the test data, calculated using 3 random splits with the accuracy score, which I thought was a better way to get a proper score while using regressors to bench classification problems.
@ogrisel
Owner

What is the target variable of Duke? Continuous or boolean? Is it a regression or classification task? The coordinate descent models of scikit-learn are linear regression models minimizing the squared loss, so it's ok to use the r2_score or explained variance to evaluate their performance.

I think this has two reasons
The pointer bug fix.

This is probably not the case, as the maximum value of the score was the same for cyclic CD and randomized CD. The pointer bug only impacted the random CD, right?

@MechCoder
Owner

The Duke dataset is a classification problem. So that is the reason I used the accuracy score (which is the default score of Logistic Regression), which I thought would give us a better picture of the accuracy.
Do I do the remaining benches with the r2 score?

maximum value of the score was the same for cyclic CD and randomized CD

Yes you are right, sorry.

@MechCoder
Owner

@ogrisel A general question. I thought that for a classification problem, we are more interested in predicting if the label is right or not, so I thought the accuracy score would be better.

But now it seems that the explained variance score also might be a good idea, since it gives an idea of how close we were to predicting the right label.

Which of these is a better option?

@ogrisel
Owner

The Duke dataset is a classification problem. So that is the reason I used the accuracy score (which is the default score of Logistic Regression)

How did you compute it? Both the ElasticNet and Lasso classes are implementations of linear regression models, hence they predict a continuous variable. accuracy_score expects categorical inputs encoded as integers or booleans. Did you threshold the predictions of the regression models manually? If so, which threshold value? What are the possible target values in the Duke dataset?

which I thought would give us a better picture of the accuracy. Do I do the remaining benches with the r2 score?

Yes, I think it's better to use a regression metric to quantify the quality of a regressor model. Otherwise one needs to properly post-process the output of the regressor to make it behave as a classifier.

@MechCoder
Owner

@ogrisel
The Duke dataset is a binary problem. I handled the output (logically) similarly to how Logistic Regression handles it:

    clf.fit(X_train, y_train)
    # threshold the continuous predictions at 0 to get labels in {-1, +1}
    y_pred = np.sign(clf.predict(X_test))
    c_score4 = accuracy_score(y_test, y_pred)
@MechCoder
Owner

@ogrisel
Oh I am extremely sorry, #3335 (comment) was for the BOSTON dataset. I have mentioned it at the top.

This (#3335 (comment)) and this (#3335 (comment)) should be compared. The difference in scores is definitely due to the usage of the r2 score vs the accuracy score.

I have used the r2 score for these benches (which can explain the difference between these and the above benchmarks).

Blue - random
Red - cyclic
alpha = 0.0001 (best alpha)

This is the score on the training data
training_newsgroup

This is the score on the left out data.
cross_valid_newsgroup

@MechCoder
Owner

@ogrisel This is the bench for duke on left-out data; the benches are comparable to the ones in this:

duke_leftout

The values for that on the training data are
random data = [0.99999290839194466, 0.99999744375869459, 0.99999745712686572, 0.999997457135264]
cyclic data = [0.99999674312730391, 0.99999746925713084, 0.99999745713785693, 0.99999745713522792]

Could you please tell me what you had wanted to confirm from the above benches?

@MechCoder
Owner

@ogrisel Do you want me to bench again using the default r2 score, or is it ok since we understand that random descent does not behave worse on the duke and newsgroup datasets (in terms of time and score)?

@ogrisel
Owner

Oh I am extremely sorry, #3335 (comment) was for the BOSTON dataset. I have mentioned it at the top.

Those were not the benchmarks I was referring to.

This (#3335 (comment)) and this (#3335 (comment)) should be compared. The difference in scores is definitely due to the usage of the r2 score vs the accuracy score.

Those are precisely the 2 benchmarks I was referring to: there is an optimal value for the r2 score for alpha around 0.2. You cannot see that for the accuracy score of the latter, where lower alpha values are always best and the score is flat. I want to know if there is an optimal value of alpha for the regression task.

So, as this discussion is getting really messy, please, once and for all, run the following benchmark for duke and newsgroups, and please double-check that the labeling of the axes and the titles is correct before posting; there is no need to repeat in comments what's written on the axis labels:

  • fixed tol=1e-4

  • 3 vertically stacked plots with alpha on the x-axis

    • y-axis: n_iter, 2 curves one for Random CD, one for Cyclic CD

    • y-axis: fit time, 2 curves one for Random CD, one for Cyclic CD

    • y-axis: mean r2 score on left out folds, 2 curves one for Random CD, one for Cyclic CD

@ogrisel
Owner

Could you please tell me what you had wanted to confirm from the above benches?

I wanted this bench to compare the training and test r2 score on the same plot vs tolerances ranging from tol=0.1 to tol=1e-10 so as to be able to check whether there was a sweet spot for the default value of the tolerance parameter in terms of generalization.

One would expect the training score to always climb with decreasing values of tolerance (we better optimize the model for the training set), while the validation r2 score should climb at the beginning and reach a plateau where it's no longer useful to optimize further, or even see the model overfit and the r2 score decrease a bit for tol ~ 1e-10.

The last benchmark on Duke has tolerance between 1e-4 and 1e-10 so we cannot observe that.

Basically get the same kind of intuition as those curves: http://scikit-learn.org/stable/auto_examples/plot_validation_curve.html but evaluating the impact of the optimizer quality (tol parameter) instead of the regularization strength on the x axis.

@ogrisel
Owner

Also what is the optimal value of alpha for Duke when using the r2 score on held out data to tune it?

@MechCoder
Owner

#3335 (comment) Thanks for this explanation. I had left out the tol of 1e-1 as it had given me a negative r2 score.

The optimal value of alpha that I had used is 0.0013520555661157029, obtained from ElasticNetCV.

@ogrisel
Owner

r2 can be negative; this is not a problem.

@ogrisel
Owner

The optimal value of alpha that I had used is 0.0013520555661157029, obtained from ElasticNetCV.

That sounds much smaller than the past benchmark curves seem to indicate; looking forward to reading the final benchmark on duke.

@MechCoder
Owner

@ogrisel The last and final benchmarks.

duke_final
newsgroup_final

I have double checked everything twice.

@ogrisel
Owner

Thanks @MechCoder, this looks good. Will give this PR a final review.

sklearn/linear_model/coordinate_descent.py
@@ -462,6 +463,14 @@ def enet_path(X, y, l1_ratio=0.5, eps=1e-3, n_alphas=100, alphas=None,
max_iter = params.get('max_iter', 1000)
dual_gaps = np.empty(n_alphas)
n_iters = []
+
+ rng = check_random_state(params.get('random_state', None))
+
+ shuffle = False
+ selection = params.get('selection', 'cyclic')
+ if selection == 'random':
+ shuffle = True
@ogrisel Owner
ogrisel added a note

You should make sure that invalid values for the selection param raise a ValueError with an explicit error message that gives the list of accepted values for the parameter and the actual value passed by the user.

Please add a test for that as well.
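
A sketch of such a test (hypothetical test name; assert_raises is the helper used in the existing test module, now sklearn.utils._testing in recent versions):

from sklearn.linear_model import ElasticNet, Lasso
from sklearn.utils.testing import assert_raises

def test_invalid_selection_raises():
    X = [[1.0], [2.0], [3.0]]
    y = [1.0, 2.0, 3.0]
    for estimator in (ElasticNet, Lasso):
        clf = estimator(selection='invalid')
        assert_raises(ValueError, clf.fit, X, y)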

@MechCoder Owner

I have done this in the higher level modules. Should I make selection a public param for enet_path and add it here too?

sklearn/linear_model/coordinate_descent.py
@@ -1472,6 +1532,14 @@ class MultiTaskElasticNet(Lasso):
When set to ``True``, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
+ random_state : int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
@ogrisel Owner
ogrisel added a note

I would put random_state as the last parameter as is usually done in other sklearn models.

sklearn/linear_model/coordinate_descent.py
@@ -1472,6 +1532,14 @@ class MultiTaskElasticNet(Lasso):
When set to ``True``, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
+ random_state : int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
+
+ selection : str, default 'cyclic'
+ If set to 'random', a random coefficient is updated every iteration
+ rather than looping over features sequentially by default.
@ogrisel Owner
ogrisel added a note

Please add a comment such as:

selection="random" often leads to significantly faster convergence, especially when tol is higher than 1e-4.

@ogrisel ogrisel commented on the diff
sklearn/linear_model/coordinate_descent.py
@@ -1575,9 +1645,16 @@ def fit(self, X, y):
self.coef_ = np.asfortranarray(self.coef_) # coef contiguous in memory
+ if self.selection not in ['random', 'cyclic']:
+ raise ValueError("selection should be either random or cyclic.")
@ogrisel Owner
ogrisel added a note

Good. The same parameter validation should be performed everywhere. And tested in unittests.

sklearn/linear_model/coordinate_descent.py
@@ -1575,9 +1645,16 @@ def fit(self, X, y):
self.coef_ = np.asfortranarray(self.coef_) # coef contiguous in memory
+ if self.selection not in ['random', 'cyclic']:
+ raise ValueError("selection should be either random or cyclic.")
+ shuffle = False
+ if self.selection == 'random':
+ shuffle = True
@ogrisel Owner
ogrisel added a note

Could be written more simply as:

shuffle = (self.selection == 'random')
sklearn/linear_model/tests/test_coordinate_descent.py
@@ -515,6 +515,47 @@ def test_warm_start_convergence_with_regularizer_decrement():
assert_greater(low_reg_model.n_iter_, warm_low_reg_model.n_iter_)
+def test_random_descent():
+ """Test that both random and cyclic selection give the same results
+ when converged fully and using all conditions.
+ """
@ogrisel Owner
ogrisel added a note

PEP 257:

def test_random_descent():
    """Test that both random and cyclic selection give the same results

    Ensure that the test models fully converge fully and check a wide array
    of conditions.

    """
sklearn/linear_model/tests/test_coordinate_descent.py
@@ -515,6 +515,47 @@ def test_warm_start_convergence_with_regularizer_decrement():
assert_greater(low_reg_model.n_iter_, warm_low_reg_model.n_iter_)
+def test_random_descent():
+ """Test that both random and cyclic selection give the same results
+ when converged fully and using all conditions.
+ """
+
+ # This uses the coordinate descent algo using the gram trick.
+ X, y, _, _ = build_dataset(n_samples=50, n_features=20)
@ogrisel Owner
ogrisel added a note

How long is this test? Would it be possible to make it run faster, without warnings, with n_samples=10?

@MechCoder
Owner

@ogrisel I have addressed all your comments other than the one about the tests.

Tests run for 0.732 s in my branch and for 0.674 s in master, which is a difference of about 60 ms. Is that a problem?

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 75b78f2 on MechCoder:random_with_replacement into 5cd2da2 on scikit-learn:master.

sklearn/linear_model/tests/test_coordinate_descent.py
((31 lines not shown))
+ clf_random = ElasticNet(selection='random', tol=1e-8, random_state=42)
+ clf_random.fit(sparse.csr_matrix(X), y)
+ assert_array_almost_equal(clf_cyclic.coef_, clf_random.coef_)
+ assert_almost_equal(clf_cyclic.intercept_, clf_random.intercept_)
+
+ # Multioutput case.
+ new_y = np.hstack((y[:, np.newaxis], y[:, np.newaxis]))
+ clf_cyclic = MultiTaskElasticNet(selection='cyclic', tol=1e-8)
+ clf_cyclic.fit(X, new_y)
+ clf_random = MultiTaskElasticNet(selection='random', tol=1e-8,
+ random_state=42)
+ clf_random.fit(X, new_y)
+ assert_array_almost_equal(clf_cyclic.coef_, clf_random.coef_)
+ assert_almost_equal(clf_cyclic.intercept_, clf_random.intercept_)
+
+
@ogrisel Owner
ogrisel added a note

pep8: single blank line inside body of functions.

@ogrisel
Owner

Tests run for 0.732 s in my branch and for 0.674 s in master, which is a difference of about 60 ms. Is that a problem?

It's fine. Why do you test with tol=1e-8 BTW? Do those tests fail when tol is kept to its default value?

The rule of thumb is to have individual tests shorter than 1 s and most of them shorter than 100 ms. Smoke tests should be as fast as possible.

@ogrisel
Owner

Apart from my last comments, +1 for merge.

sklearn/linear_model/cd_fast.pyx
((5 lines not shown))
- for ii in range(n_features): # Loop over coordinates
+ for f_iter in range(n_features): # Loop over coordinates
+ if shuffle:
@agramfort Owner

Shuffle is now misleading. I would rename to random or random_selection

sklearn/linear_model/tests/test_coordinate_descent.py
@@ -515,6 +515,54 @@ def test_warm_start_convergence_with_regularizer_decrement():
assert_greater(low_reg_model.n_iter_, warm_low_reg_model.n_iter_)
+def test_random_descent():
+ """Test that both random and cyclic selection give the same results.
+
+ Ensure that the test models fully converge fully and check a wide
@agramfort Owner

Fully fully ;)

@MechCoder Owner

just to make sure it actually converges fully :P

@agramfort
Owner

Besides that, +1 for merge. Nice and clean job @MechCoder!

@MechCoder
Owner

Why do you test with tol=1e-8 BTW?
The tests fail, but if I lower the number of decimals checked, they pass. So I thought it would be better to ensure more precision.

@MechCoder MechCoder DOC: Minor docfixes and cosmits
6700abb
@MechCoder
Owner

@ogrisel I have made the minor changes and pushed. Thanks for the reviews!

@agramfort
Owner

Green for me !

@ogrisel
Owner

The tests fail, but if I lower the number of decimals checked, they pass. So I thought it would be better to ensure more precision.

Alright, this is fine then. I will rebase and merge.

@ogrisel
Owner

Squashed & rebased as: cf4cf60 (I also added a whats_new.rst entry).

Thanks @MechCoder!

@ogrisel ogrisel closed this
@MechCoder MechCoder deleted the MechCoder:random_with_replacement branch
@MechCoder
Owner

thanks! @ogrisel

@vene
Owner

Thank you @MechCoder, I'm happy to see this going in. The latest benchmarks paint a nice picture.

Commits on Jul 23, 2014
  1. @MechCoder
  2. @MechCoder: Made the following changes: 1. Added shuffle parameter; 2. Random coordinate update is not GIL free
  3. @MechCoder
  4. @MechCoder
Commits on Jul 24, 2014
  1. @MechCoder: FIX: Fix sparse cd for random coordinate descent
Commits on Jul 25, 2014
  1. @MechCoder: Renamed shuffle to selection and added tests
Commits on Jul 29, 2014
  1. @MechCoder: DOC: Minor docfixes and cosmits