
[MRG] Add jitter to LassoLars #15179

Merged · 24 commits · Apr 17, 2020

Conversation

angelaambroz (Contributor)

Reference Issues/PRs

See #2746.

What does this implement/fix? Explain your changes.

Following the discussion in the above issue, this PR adds a jitter keyword argument to the Lars and LassoLars classes. The argument, which defaults to 0.0001, applies uniformly distributed noise, bounded by jitter, to the y variable when fitting.
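
For context, a minimal usage sketch of the keyword as it ends up after review (in the merged version jitter defaults to None and a random_state parameter was added for reproducibility; the data is borrowed from the LassoLars doctest shown later in this thread):

    import numpy as np
    from sklearn.linear_model import LassoLars

    X = np.array([[-1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
    y = np.array([-1.0, 0.0, -1.0])

    # Default behavior is unchanged: no noise is added to y.
    plain = LassoLars(alpha=0.01).fit(X, y)

    # With jitter, uniform noise bounded by `jitter` is added to y at
    # fit time; random_state makes the perturbation reproducible.
    noisy = LassoLars(alpha=0.01, jitter=1e-4, random_state=0).fit(X, y)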

Any other comments?

Main question:

  • By adding this random noise, we start to get failing tests in test_least_angle.py from differences in floating point estimates for model coefficients. One way to address this would be loosening the precision of the assert_array_almost_equal() statements. Is this how we'd like to proceed?
  • Does it make sense to set a np.random.seed() and "lock in" the noise so that the model gives predictable results once it's been imported, even if the user keeps instantiating and fitting it? I think no, but... I'm not sure.

Thanks to all the maintainers of sklearn! It's a great project. Thanks for the contributing guidelines also, those were very helpful.

angelaambroz (Contributor, Author)

Addressing MR comments (thanks, @agramfort):

  • Going down one order of magnitude on jitter to 0.001 from 0.0001.
  • Fixing the randomness with RandomState (this answers my original Q - thanks).

Still pending: 3 tests fail now due to floating point precision stuff. Should I just loosen the precision of those to let things pass?

Also: I've moved the jitter default value up into a global var (JITTER, at the top), since I didn't like repeating it in the class definitions twice (for Lars and LassoLars). Not sure if this is stylistically okay.
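
For context, a minimal sketch of the RandomState-based fix (the helper name add_jitter is illustrative; this shows the shape of the eventual fit-time behavior, not the exact diff):

    import numpy as np
    from sklearn.utils import check_random_state

    def add_jitter(y, jitter, random_state=None):
        # check_random_state turns None/int/RandomState into a
        # RandomState instance, so an int seed makes the noise
        # reproducible across fits.
        rng = check_random_state(random_state)
        noise = rng.uniform(high=jitter, size=len(y))  # uniform in [0, jitter)
        return y + noise

    y = np.array([-1.0, 0.0, -1.0])
    y_jittered = add_jitter(y, jitter=1e-4, random_state=0)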

agramfort (Member)

@angelaambroz please do not use a jitter by default. It's working for many people as it is now. Don't affect the behavior of many just for your use case. Thanks.

angelaambroz (Contributor, Author)

@agramfort It's not actually my use case - I was addressing what was discussed in #2746. A default of 10e-5 was discussed there, though I didn't get specific confirmation that it was the agreed-on default value. I can default jitter=None so that it won't affect existing implementations (and the tests), but keep it as a kwarg if folks want to change it. Does that work?

agramfort (Member) commented Oct 21, 2019 via email

agramfort (Member)

@angelaambroz there are unrelated changes in the diff (in Cython files). This needs to be cleaned up. Thanks.

angelaambroz (Contributor, Author)

Ack, unintended! Will fix shortly.

angelaambroz changed the title from "[WIP] Adding jitter to LassoLars fit" to "ENH: Adding jitter to LassoLars fit" on Oct 25, 2019
angelaambroz (Contributor, Author)

The error message I'm seeing in the Azure pipeline is:

    ImportError:
    Importing the multiarray numpy extension module failed.  Most
    likely you are trying to import a failed build of numpy.
    If you're working with a numpy git repo, try `git clean -xdf` (removes all
    files not under version control).  Otherwise reinstall numpy.
    
    Original error was: cannot import name 'multiarray' from partially initialized module 'numpy.core' (most likely due to a circular import) (/tmp/pip-build-env-zfiic5r2/overlay/lib/python3.8/site-packages/numpy/core/__init__.py)

I expect this is something orthogonal to my changes - I don't see how my diff could have affected numpy stuff. Help?

@@ -855,7 +861,8 @@ class Lars(MultiOutputMixin, RegressorMixin, LinearModel):

     def __init__(self, fit_intercept=True, verbose=False, normalize=True,
                  precompute='auto', n_nonzero_coefs=500,
-                 eps=np.finfo(np.float).eps, copy_X=True, fit_path=True):
+                 eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
+                 jitter=DEFAULT_JITTER):
Member (suggested change):

-                 jitter=DEFAULT_JITTER):
+                 jitter=None):

@@ -24,6 +24,7 @@
 from ..exceptions import ConvergenceWarning

 SOLVE_TRIANGULAR_ARGS = {'check_finite': False}
+DEFAULT_JITTER = None
Member: This can be removed.

@@ -1030,6 +1045,11 @@ class LassoLars(Lars):
     setting ``fit_path`` to ``False`` will lead to a speedup, especially
     with a small alpha.

+    jitter : float, default=None
+        Uniform noise parameter, added to the y values, to satisfy \
Member: Please remove trailing \

@@ -1074,7 +1094,7 @@ class LassoLars(Lars):
     >>> reg.fit([[-1, 1], [0, 0], [1, 1]], [-1, 0, -1])
     LassoLars(alpha=0.01)
     >>> print(reg.coef_)
-    [ 0. -0.963257...]
+    [ 0. -0.9632...]
Member: Please revert this, the default shouldn't have changed.

@@ -1092,7 +1112,7 @@ class LassoLars(Lars):
     def __init__(self, alpha=1.0, fit_intercept=True, verbose=False,
                  normalize=True, precompute='auto', max_iter=500,
                  eps=np.finfo(np.float).eps, copy_X=True, fit_path=True,
-                 positive=False):
+                 positive=False, jitter=DEFAULT_JITTER):
Member (suggested change):

-                 positive=False, jitter=DEFAULT_JITTER):
+                 positive=False, jitter=None):

rth (Member) commented Oct 30, 2019

> I expect this is something orthogonal to my changes - I don't see how my diff could have affected numpy stuff. Help?

Yes, merging master in might help.
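
("Merging master in" refers to the usual fork workflow: assuming an upstream remote pointing at scikit-learn/scikit-learn, running `git fetch upstream` followed by `git merge upstream/master` on the feature branch pulls in the fixed build configuration.)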

@@ -963,6 +976,12 @@ def fit(self, X, y, Xy=None):
         else:
             max_iter = self.max_iter

+        if self.jitter:
Member (suggested change):

-        if self.jitter:
+        if self.jitter is not None:
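
The suggestion matters for an explicit jitter=0.0: the truthiness check skips it, while `is not None` treats only None as disabled. A tiny illustration:

    jitter = 0.0

    if jitter:                  # falsy: branch is skipped even though
        print("truthy check")   # the user explicitly passed a value

    if jitter is not None:      # runs: only None means "disabled"
        print("explicit check")

Adding zero noise is a no-op either way, so the explicit check mainly documents intent.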

    y = np.array(y_list)
    expected_output = np.array(expected_y)
    alpha = 0.001
    fit_intercept = False
Member: Why force fit_intercept = False?

angelaambroz (Contributor, Author): This is at the edge of my stats/linear algebra understanding, but I think we need to force it to be False, since the error only occurs for exactly aligned values (e.g. this comment).

Member: From your comment I understand that the test would not be a non-regression test with fit_intercept=True, as you only see an error with fit_intercept=False. However, the X and y you chose below also work with jitter=None. Did you check that the test still passes with fit_intercept=True?

Member: Since the target is constant (-2.5, -2.5), fitting with an intercept would mean the coefficients are all 0 (and the intercept is -2.5).

So I guess it makes sense to leave fit_intercept=False. It properly reproduces the original example from the issue.
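
A quick way to see this claim (a sketch; the X values are illustrative, matching the constant target described above, not necessarily the exact test fixture):

    import numpy as np
    from sklearn.linear_model import LassoLars

    # With a constant target, centering y for the intercept leaves
    # nothing for the features to explain: LARS selects no features.
    X = np.array([[0.0, 0.0, 0.0, -1.0, 0.0],
                  [0.0, -1.0, 0.0, 0.0, 0.0]])
    y = np.array([-2.5, -2.5])

    model = LassoLars(alpha=0.001, fit_intercept=True).fit(X, y)
    print(model.coef_)       # all zeros
    print(model.intercept_)  # -2.5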

NicolasHug (Member) left a comment:

Thanks for the PR and for your patience so far @angelaambroz.

Made a few comments. Are you available to make the changes? Otherwise we'll do it.
This also needs an entry in doc/whats_new/v0.23.rst

Thanks!


    w_nojitter = lars.coef_
    w_jitter = lars_with_jitter.coef_

    assert not np.array_equal(w_jitter, w_nojitter)
Member: This is too easy to pass, so maybe instead check the MSD:

    assert np.mean((w_jitter - w_nojitter)**2) > .1
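
Putting the thread together, the regression test is roughly shaped like this (a sketch; fixture values follow the discussion above and may differ from the merged test):

    import numpy as np
    from sklearn.linear_model import LassoLars

    # Degenerate data per the discussion: exactly aligned values and a
    # constant target (see #2746).
    X = np.array([[0.0, 0.0, 0.0, -1.0, 0.0],
                  [0.0, -1.0, 0.0, 0.0, 0.0]])
    y = np.array([-2.5, -2.5])

    lars = LassoLars(alpha=0.001, fit_intercept=False)
    lars_with_jitter = LassoLars(alpha=0.001, fit_intercept=False,
                                 jitter=1e-8, random_state=0)

    w_nojitter = lars.fit(X, y).coef_
    w_jitter = lars_with_jitter.fit(X, y).coef_

    # Per the review, require a substantial mean squared difference
    # rather than mere inequality of the coefficient vectors.
    assert np.mean((w_jitter - w_nojitter) ** 2) > 0.1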

NicolasHug changed the title from "ENH: Adding jitter to LassoLars fit" to "[MRG] Add jitter to LassoLars" on Apr 6, 2020
NicolasHug added this to the 0.23 milestone on Apr 6, 2020
adrinjalali (Member)

@NicolasHug seems like @angelaambroz may not be available for this one. Would you like to take it over? (trying to clean up the milestone and prepare for release).

NicolasHug (Member)

Yup, I'll do it.
@angelaambroz feel free to check in whenever and take it over again.

@@ -1711,7 +1743,8 @@ class LassoLarsIC(LassoLars):
     """
     def __init__(self, criterion='aic', fit_intercept=True, verbose=False,
                  normalize=True, precompute='auto', max_iter=500,
-                 eps=np.finfo(np.float).eps, copy_X=True, positive=False):
+                 eps=np.finfo(np.float).eps, copy_X=True, positive=False,
+                 random_state=None):
Member: Why add a random_state param here and not jitter?

agramfort merged commit abfb6fd into scikit-learn:master on Apr 17, 2020
agramfort (Member)

thx @angelaambroz and @NicolasHug!

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
* Adding jitter to LassoLars fit

* CircleCI fail

* MR comments

* Jitter becomes default, added test based on issue description

* flake8 fixes

* Removing unexpected cython files

* Better coverage

* PR comments

* PR comments

* PR comments

* PR comments

* PR comments

* Linting

* Apply suggestions from code review

* addressed comments

* added whatnew entry

* test both estimators

* update whatsnew

* removed random_state for lassolarsIC

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020