[MRG+1] Updating a random feature with replacement for every iteration in the cd code #3335

Closed
wants to merge 7 commits into scikit-learn:master from MechCoder:random_with_replacement

Conversation

7 participants
Owner

MechCoder commented Jul 2, 2014

Testing if random with replacement has the same effect on the duke and arcene datasets.

Owner

MechCoder commented Jul 2, 2014

@agramfort @ogrisel These are the benches that I get for the duke and arcene dataset.

Duke:

[image: bench_duke_with_replacement]

Arcene:

[image: best_alpha_arcene]

Owner

MechCoder commented Jul 2, 2014

Umm, they do look similar.

Owner

agramfort commented Jul 2, 2014

the arcene bench does not make sense. The dual gap starts at 1e-30, which
is beyond machine precision. The duke looks great.

Owner

agramfort commented Jul 2, 2014

add the shuffle param and get rid of the GIL using the random number tricks from the tree code
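
For context, a minimal pure-Python sketch of the two coordinate-selection strategies under discussion; the real solver lives in Cython in sklearn/linear_model/cd_fast.pyx, and the function name here is illustrative:

    import numpy as np

    def coordinate_order(n_features, n_updates, shuffle=False, random_state=0):
        """Yield the index of the coordinate to update at each inner step."""
        rng = np.random.RandomState(random_state)
        for ii in range(n_updates):
            if shuffle:
                # random selection *with replacement*: any coordinate can
                # be drawn at any step, possibly several times per sweep
                yield rng.randint(n_features)
            else:
                # cyclic selection: sweep over the coordinates in order
                yield ii % n_features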

Owner

MechCoder commented Jul 3, 2014

@ogrisel @agramfort I have pushed the GIL free code now.

Coverage Status

Coverage increased (+0.0%) when pulling 14dadff on MechCoder:random_with_replacement into 82611e8 on scikit-learn:master.

Owner

MechCoder commented Jul 3, 2014

Setting tol to 1e-8:
[image: bench duke time with alpha_e-8]

When I set tol to 1e-12:
[image: bench duke time with alpha]

When I set tol to 1e-18:
[image: bench duke time with alpha_e-8]

Setting tol to 0:
[image: bench duke time with alpha_0]

Owner

agramfort commented Jul 3, 2014

what are your conclusions?

Owner

ogrisel commented Jul 4, 2014

@MechCoder please also do a train / test split, and for each value of alpha measure the training time and the test score for each method and then put 2 plots:

  • one for training time vs alpha
  • one for test score vs alpha

Use plt.semilogx to have a log scale for the alphas axis. And put a plt.ylim(0, None) to avoid biasing the reading of the y axis.
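
A hypothetical sketch of those two plots; `alphas`, the `fit_times_*`, and the `test_scores_*` arrays are placeholders assumed to have been collected beforehand:

    import matplotlib.pyplot as plt

    fig, (ax_time, ax_score) = plt.subplots(2, 1, sharex=True)

    # training time vs alpha, log scale on the alphas axis
    ax_time.semilogx(alphas, fit_times_cyclic, label='Cyclic CD')
    ax_time.semilogx(alphas, fit_times_random, '--', label='Random CD')
    ax_time.set_ylabel('training time (s)')
    ax_time.set_ylim(0, None)  # avoid biasing the reading of the y axis
    ax_time.legend()

    # test score vs alpha
    ax_score.semilogx(alphas, test_scores_cyclic, label='Cyclic CD')
    ax_score.semilogx(alphas, test_scores_random, '--', label='Random CD')
    ax_score.set_xlabel('alpha')
    ax_score.set_ylabel('test score')
    ax_score.set_ylim(0, None)
    plt.show()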

Owner

MechCoder commented Jul 4, 2014

yup. will do


Owner

ogrisel commented Jul 4, 2014

@MechCoder BTW please try to use more descriptive titles for your PRs. Here it could be "[WIP] Random Coordinate Descent: sampling coordinates with replacement"

@MechCoder MechCoder changed the title from WIP: Random with replacement to WIP: Updating a random feature with replacement for every iteration in the cd code Jul 4, 2014

Owner

MechCoder commented Jul 5, 2014

@agramfort @ogrisel I used a grid of 100 alphas for different tolerances, and set the iterations as high as possible. I get good results for the random descent. Here are the plots.

Tolerance = 0
[image: tol 0]

Tolerance = 1e-18
[image: tol 1e-18]

Tolerance = 1e-12
[image: tol 1e-12]

Tolerance = 1e-08
[image: tol 1e-08]

+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort

agramfort Jul 5, 2014

Owner

If set to True, a random coefficient is updated at every iteration
rather than looping over features sequentially.

+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort

agramfort Jul 5, 2014

Owner

same here

+ a random feature to update. Useful only when shuffle is set to True.
+
+ shuffle: bool, default False
+ If set to True, a random coefficient is updated every iteration.
@agramfort

agramfort Jul 5, 2014

Owner

here too

also, you forgot the " " before the ":" in all the new docstrings.

Owner

agramfort commented Jul 5, 2014

besides LGTM

Owner

MechCoder commented Jul 5, 2014

@agramfort I've updated it and changed it to MRG+1

@MechCoder MechCoder changed the title from WIP: Updating a random feature with replacement for every iteration in the cd code to [MRG+1] : Updating a random feature with replacement for every iteration in the cd code Jul 5, 2014

@MechCoder MechCoder changed the title from [MRG+1] : Updating a random feature with replacement for every iteration in the cd code to [MRG+1] Updating a random feature with replacement for every iteration in the cd code Jul 5, 2014

Coverage Status

Coverage increased (+0.0%) when pulling acd97a5 on MechCoder:random_with_replacement into 82611e8 on scikit-learn:master.

Owner

ogrisel commented Jul 6, 2014

@MechCoder thanks for the new benchmarks. Could you please try something intermediate like tol=1e-15 to explore the transition between the tol=1e-12 regime and the tol=1e-18 regime?

  • Is the time on the y axis measured in seconds?
  • I think it would be worth cleaning your script and making it a benchmark script if it runs for more than 30 seconds total.
  • You can use the '--' style for the second line on the score plot to show that the score curves overlap (BTW, it is weird to see a multi-modal regularization path; has anyone else seen this in the past?).
  • Which dataset is this? The dataset name could be added in the title of the figure.

The narrative documentation should be updated with a paragraph explaining the expected impact of shuffle=True: it can speed up convergence, especially when tol is smaller than 1e-10 and regularization is light. The amount of speedup between shuffle=True and shuffle=False is data dependent.

Also please add a "reference" section to the narrative documentation to add the missing reference to the elastic net coordinate descent paper and a new reference to the Stochastic CD paper. @mblondel @agramfort any other paper to reference there?

Owner

ogrisel commented Jul 6, 2014

Also, in the legend of the benchmark script, "Normal descent" could be replaced by the more explicit "Cyclic descent".

Owner

ogrisel commented Jul 6, 2014

The benchmarks seem to demonstrate that shuffle=True is never worse than shuffle=False. I would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

Owner

agramfort commented Jul 6, 2014

The benchmarks seem to demonstrate that shuffle=True is never worse than shuffle=False. I would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

why not but that would be justified by a more thorough benchmark.

Owner

ogrisel commented Jul 6, 2014

The narrative documentation for the ElasticNet(CV) class should also state explicitly that coordinate descent is used as optimizer.

Owner

ogrisel commented Jul 6, 2014

@MechCoder for the syntax of the reference block, please have a look at other occurrences of such blocks in the existing documentation. You can find them with:

git grep '.. topic:: References:'

Owner

ogrisel commented Jul 6, 2014

Ok, I understand the difference between the two regimes: in one case max_iter is reached for almost any value of alpha, and in the other case max_iter is almost never reached.

Owner

ogrisel commented Jul 6, 2014

Could you please relax the upper limit of the tol=1e-12 plot to see whether we reach the flat max_iter area on the left-hand side of the time curves?

Owner

ogrisel commented Jul 6, 2014

The benchmarks seem to demonstrate that shuffle=True is never worse than shuffle=False. I would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

why not but that would be justified by a more thorough benchmark.

One could try the Boston dataset to check that it's not getting worse on a dataset that does not have many (noisy) features.

Owner

GaelVaroquaux commented Jul 6, 2014

👍 on setting shuffle=True as a default if the Boston dataset validates that it is in general better.

Owner

MechCoder commented Jul 7, 2014

Is the time on the y axis measured in seconds?

Yes

would be in favor of setting the default to shuffle=True if others think that those findings would generalize to other common datasets.

There are a number of tests that fail if I do this and most of them look non-trivial.

Owner

ogrisel commented Jul 7, 2014

There are a number of tests that fail if I do this and most of them look non-trivial.

Do you mean that there are tests where the random CD algorithm does not converge to a good solution (compared to the cyclic CD algorithm currently used in master)?

Other than that, they should not be that complicated to fix. In the worst case we could set shuffle=False individually for some tests.

If we set shuffle=True by default we should not forget to fix the random_state argument in all the tests that run CD models, to still get deterministic test runs.
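
For illustration, a hedged sketch of what pinning the seed would look like, assuming the shuffle/random_state signature as it stands at this point of the PR (later renamed to the selection parameter):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X, y = rng.randn(30, 10), rng.randn(30)  # stand-in for a test fixture

    # Pinning random_state keeps the randomized coordinate order, and
    # hence coef_, identical across test runs even with shuffle=True.
    clf = ElasticNet(alpha=0.1, shuffle=True, random_state=42)
    clf.fit(X, y)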

Owner

MechCoder commented Jul 27, 2014

The Duke model is a classification problem, which is the reason I used the accuracy score (the default score of Logistic Regression); I thought it would give us a better picture of the accuracy.
Should I do the remaining benches with the r2 score?

maximum value of the score was the same for cyclic CD and randomized CD

Yes you are right, sorry.

Owner

MechCoder commented Jul 27, 2014

@ogrisel A general question. I thought that for a classification problem we are more interested in predicting whether the label is right or not, so the accuracy score would be better.

But now it seems that the explained variance score might also be a good idea, since it gives an idea of how close we were to predicting the right label.

Which of these is a better option?

Owner

ogrisel commented Jul 27, 2014

The Duke model is a classification problem, which is the reason I used the accuracy score (the default score of Logistic Regression)

How did you compute it? Both the ElasticNet and Lasso classes are implementations of linear regression models, hence they predict a continuous variable. accuracy_score expects categorical inputs encoded as integers or booleans. Did you threshold the predictions of the regression models manually? If so, which threshold value? What are the possible target values in the Duke dataset?

I thought it would give us a better picture of the accuracy. Should I do the remaining benches with the r2 score?

Yes, I think it's better to use a regression metric to quantify the quality of a regressor model. Otherwise one needs to properly post-process the output of the regressor to make it behave as a classifier.

Owner

MechCoder commented Jul 27, 2014

@ogrisel
The Duke dataset is a binary problem. I handled the output (logically) similarly to how Logistic Regression handles it:

    import numpy as np
    from sklearn.metrics import accuracy_score

    clf.fit(X_train, y_train)
    y_pred = np.sign(clf.predict(X_test))  # threshold at 0: labels are -1/+1
    c_score4 = accuracy_score(y_test, y_pred)
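
For comparison, the regression-metric alternative being suggested would skip the thresholding entirely (continuing the snippet above):

    from sklearn.metrics import r2_score

    y_pred = clf.predict(X_test)      # raw continuous predictions
    score = r2_score(y_test, y_pred)  # same as clf.score(X_test, y_test)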
Owner

MechCoder commented Jul 27, 2014

@ogrisel
Oh, I am extremely sorry, #3335 (comment) these were for the Boston dataset. I mentioned it at the top.

This (#3335 (comment)) and this (#3335 (comment)) should be compared. The difference in scores is definitely due to the usage of the r2 score vs the accuracy score.

I have used the r2 score for these benches (which explains the difference between these and the above benchmarks).

Blue - random
Red - cyclic
alpha = 0.0001 (best alpha)

This is the score on the training data:
[image: training_newsgroup]

This is the score on the left-out data:
[image: cross_valid_newsgroup]

Owner

MechCoder commented Jul 27, 2014

@ogrisel This is the bench for duke on left-out data, comparable to the one in this comment:

[image: duke_leftout]

The values for that on the training data are
random data = [0.99999290839194466, 0.99999744375869459, 0.99999745712686572, 0.999997457135264]
cyclic data = [0.99999674312730391, 0.99999746925713084, 0.99999745713785693, 0.99999745713522792]

Could you please tell me what you had wanted to confirm from the above benches?

Owner

MechCoder commented Jul 28, 2014

@ogrisel Do you want me to bench again using the default r2 score, or is it ok, since we understand that the random descent does not behave worse on the duke and newsgroup datasets (in terms of time and score)?

Owner

ogrisel commented Jul 28, 2014

Oh, I am extremely sorry, #3335 (comment) these were for the Boston dataset. I mentioned it at the top.

Those were not the benchmarks I was referring to.

This (#3335 (comment)) and this (#3335 (comment)) should be compared. The difference in scores is definitely due to the usage of the r2 score vs the accuracy score.

Those are precisely the 2 benchmarks I was referring to: there is an optimal value of alpha around 0.2 for the r2 score. You cannot see that with the accuracy score in the latter, where lower alpha values are always best and the score is flat. I want to know if there is an optimal value of alpha for the regression task.

As this discussion is getting really messy, please, once and for all, run the following benchmark for duke and newsgroups. Please double-check that the labeling of the axes and the titles are correct before posting, and there is no need to repeat in comments what's written on the axis labels (a hypothetical data-collection sketch follows the list):

  • fixed tol=1e-4
  • 3 vertically stacked plots with alpha on the x-axis
    • y-axis: n_iter, 2 curves one for Random CD, one for Cyclic CD
    • y-axis: fit time, 2 curves one for Random CD, one for Cyclic CD
    • y-axis: mean r2 score on left out folds, 2 curves one for Random CD, one for Cyclic CD
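
A hypothetical sketch of the data collection behind those three panels, using the selection API as it stands in this PR; the random data and alpha grid are placeholders for the duke / newsgroups loading, and n_iter_ is assumed to be exposed by ElasticNet:

    import time
    import numpy as np
    from sklearn.cross_validation import train_test_split, cross_val_score
    from sklearn.linear_model import ElasticNet

    # Placeholders: substitute the duke / newsgroups data here.
    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 500), rng.randn(100)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    alphas = np.logspace(-4, 1, 20)

    results = {}
    for selection in ('cyclic', 'random'):
        n_iters, fit_times, cv_scores = [], [], []
        for alpha in alphas:
            clf = ElasticNet(alpha=alpha, tol=1e-4, selection=selection,
                             random_state=42)
            tic = time.time()
            clf.fit(X_train, y_train)
            fit_times.append(time.time() - tic)
            n_iters.append(clf.n_iter_)
            cv_scores.append(cross_val_score(clf, X_train, y_train,
                                             scoring='r2').mean())
        results[selection] = (n_iters, fit_times, cv_scores)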
Owner

ogrisel commented Jul 28, 2014

Could you please tell me what you had wanted to confirm from the above benches?

I wanted this bench to compare the training and test r2 scores on the same plot vs tolerances ranging from tol=0.1 to tol=1e-10, so as to be able to check whether there was a sweet spot for the default value of the tolerance parameter in terms of generalization.

One would expect the training score to always climb with decreasing values of the tolerance (we optimize the model better for the training set), while the validation r2 score should climb at the beginning and then reach a plateau where it's no longer useful to optimize further, or even see the model overfit and the r2 score decrease a bit for tol ~ 1e-10.

The last benchmark on Duke has tolerance between 1e-4 and 1e-10 so we cannot observe that.

Basically get the same kind of intuition as those curves: http://scikit-learn.org/stable/auto_examples/plot_validation_curve.html but evaluating the impact of the optimizer quality (tol parameter) instead of the regularization strength on the x axis.
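
A hedged sketch of that tolerance sweep, with placeholder data and best_alpha standing in for the CV-selected value:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cross_validation import train_test_split
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 500), rng.randn(100)  # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    best_alpha = 0.001  # placeholder for the CV-selected alpha

    tols = np.logspace(-1, -10, 10)  # tol = 1e-1 down to 1e-10
    train_scores, test_scores = [], []
    for tol in tols:
        clf = ElasticNet(alpha=best_alpha, tol=tol, max_iter=100000)
        clf.fit(X_train, y_train)
        train_scores.append(clf.score(X_train, y_train))  # r2 on train
        test_scores.append(clf.score(X_test, y_test))     # r2 on held out

    plt.semilogx(tols, train_scores, label='train r2')
    plt.semilogx(tols, test_scores, '--', label='validation r2')
    plt.xlabel('tol')
    plt.ylabel('r2 score')
    plt.legend()
    plt.show()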

Owner

ogrisel commented Jul 28, 2014

Also what is the optimal value of alpha for Duke when using the r2 score on held out data to tune it?

Owner

MechCoder commented Jul 28, 2014

#3335 (comment) Thanks for this explanation. I had left out the tol of 1e-1 as it had given me a negative r2 score.

The optimal value of alpha that I had used is 0.0013520555661157029, obtained from ElasticNetCV.

Owner

ogrisel commented Jul 28, 2014

r2 can be negative, this is not a problem.

Owner

ogrisel commented Jul 28, 2014

The optimal value of alpha that I had used is 0.0013520555661157029, obtained from ElasticNetCV.

That sounds much smaller than what the past benchmark curves seem to indicate; looking forward to reading the final benchmark on duke.

Owner

MechCoder commented Jul 28, 2014

@ogrisel The last and final benchmarks.

[image: duke_final]
[image: newsgroup_final]

I have double checked everything twice.

Owner

ogrisel commented Jul 28, 2014

Thanks @MechCoder, this looks good. I will give this PR a final review.

+ shuffle = False
+ selection = params.get('selection', 'cyclic')
+ if selection == 'random':
+ shuffle = True
@ogrisel

ogrisel Jul 28, 2014

Owner

You should make sure that invalid values for the selection param raise a ValueError with an explicit error message that gives the list of accepted values for the parameter and the actual value passed by the user.

Please add a test for that as well.
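
A hedged sketch of such a test; the exact error message and test name in the PR may differ:

    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.utils.testing import assert_raises

    def test_invalid_selection():
        # an unsupported value should fail loudly at fit time, with the
        # accepted values and the offending value in the message
        rng = np.random.RandomState(0)
        X, y = rng.randn(20, 5), rng.randn(20)
        clf = ElasticNet(selection='invalid')
        assert_raises(ValueError, clf.fit, X, y)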

@MechCoder

MechCoder Jul 28, 2014

Owner

I have done this in the higher-level modules. Should I make selection a public param for enet_path and add it here too?

@@ -1472,6 +1532,14 @@ class MultiTaskElasticNet(Lasso):
When set to ``True``, reuse the solution of the previous call to fit as
initialization, otherwise, just erase the previous solution.
+ random_state : int, RandomState instance, or None (default)
+ The seed of the pseudo random number generator that selects
+ a random feature to update. Useful only when shuffle is set to True.
@ogrisel

ogrisel Jul 28, 2014

Owner

I would put random_state as the last parameter as is usually done in other sklearn models.

+
+ selection : str, default 'cyclic'
+ If set to 'random', a random coefficient is updated every iteration
+ rather than looping over features sequentially by default.
@ogrisel

ogrisel Jul 28, 2014

Owner

Please add a comment such as:

selection="random" often leads to significantly faster convergence, especially when tol is higher than 1e-4.

@@ -1575,9 +1645,16 @@ def fit(self, X, y):
self.coef_ = np.asfortranarray(self.coef_) # coef contiguous in memory
+ if self.selection not in ['random', 'cyclic']:
+ raise ValueError("selection should be either random or cyclic.")
@ogrisel

ogrisel Jul 28, 2014

Owner

Good. The same parameter validation should be performed everywhere. And tested in unittests.

+ raise ValueError("selection should be either random or cyclic.")
+ shuffle = False
+ if self.selection == 'random':
+ shuffle = True
@ogrisel

ogrisel Jul 28, 2014

Owner

Could be written more simply as:

shuffle = (self.selection == 'random')
+def test_random_descent():
+ """Test that both random and cyclic selection give the same results
+ when converged fully and using all conditions.
+ """
@ogrisel

ogrisel Jul 28, 2014

Owner

PEP 257:

def test_random_descent():
    """Test that both random and cyclic selection give the same results

    Ensure that the test models fully converge fully and check a wide array
    of conditions.

    """
+ """
+
+ # This uses the coordinate descent algo using the gram trick.
+ X, y, _, _ = build_dataset(n_samples=50, n_features=20)
@ogrisel

ogrisel Jul 28, 2014

Owner

How long does this test take? Would it be possible to make it run faster, without warnings, with n_samples=10?

Owner

MechCoder commented Jul 28, 2014

@ogrisel I have addressed all your comments other than the one about the tests.

Tests run for 0.732 s in my branch
and for 0.674 s in master, which is a difference of about 60 ms; is that a problem?

Coverage Status

Coverage increased (+0.01%) when pulling 75b78f2 on MechCoder:random_with_replacement into 5cd2da2 on scikit-learn:master.

+ assert_array_almost_equal(clf_cyclic.coef_, clf_random.coef_)
+ assert_almost_equal(clf_cyclic.intercept_, clf_random.intercept_)
+
+
@ogrisel

ogrisel Jul 29, 2014

Owner

pep8: single blank line inside body of functions.

Owner

ogrisel commented Jul 29, 2014

Tests run for 0.732 s in my branch and for 0.674 s in master, which is a difference of about 60 ms; is that a problem?

It's fine. Why do you test with tol=1e-8, BTW? Do those tests fail when tol is kept at its default value?

The rule of thumb is to have individual tests shorter than 1 s, and most of them shorter than 100 ms. Smoke tests should be as fast as possible.

Owner

ogrisel commented Jul 29, 2014

Apart from my last comments, +1 for merge.

sklearn/linear_model/cd_fast.pyx
- for ii in range(n_features): # Loop over coordinates
+ for f_iter in range(n_features): # Loop over coordinates
+ if shuffle:
@agramfort

agramfort Jul 29, 2014

Owner

Shuffle is now misleading. I would rename to random or random_selection

+def test_random_descent():
+ """Test that both random and cyclic selection give the same results.
+
+ Ensure that the test models fully converge fully and check a wide
@agramfort

agramfort Jul 29, 2014

Owner

Fully fully ;)

@MechCoder

MechCoder Jul 29, 2014

Owner

just to make sure it actually converges fully :P

Owner

agramfort commented Jul 29, 2014

Besides +1 for merge. Nice and clean job @MechCoder !

Owner

MechCoder commented Jul 29, 2014

Why do you test with tol=1e-8 BTW?

The tests fail, but if I relax the decimal precision of the assertions, they pass. So I thought it would be better to ensure more precision.

Owner

MechCoder commented Jul 29, 2014

@ogrisel I have made the minor changes and pushed. Thanks for the reviews!

Owner

agramfort commented Jul 29, 2014

Green for me !

Owner

ogrisel commented Jul 29, 2014

The tests fail, but if I relax the decimal precision of the assertions, they pass. So I thought it would be better to ensure more precision.

Alright, this is fine then. I will rebase and merge.

Owner

ogrisel commented Jul 29, 2014

Squashed & rebased as: cf4cf60 (I also added a whats_new.rst entry).

Thanks @MechCoder!

@ogrisel ogrisel closed this Jul 29, 2014

@MechCoder MechCoder deleted the MechCoder:random_with_replacement branch Jul 29, 2014

Owner

MechCoder commented Jul 29, 2014

thanks! @ogrisel

Owner

vene commented Jul 29, 2014

Thank you @MechCoder, I'm happy to see this going in. The latest benchmarks paint a nice picture.
