
ENH Enable the "sufficient stats" mode of LARS #11699

Merged: 66 commits merged into scikit-learn:master on Mar 6, 2019

Conversation

yukuairoy (Contributor) commented Jul 27, 2018

What does this implement/fix? Explain your changes.

We'd like to enable a "Gram and covariance" based mode of the LARS algorithm in lars_path(...). As the original paper by B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani (2004) shows, the LARS algorithm can run from the sufficient statistics alone: the Gram matrix, the covariance vector (Xy), and the sample size.

We'd like to add a lars_path_gram(...) function so that users can run the algorithm when they only have the sufficient statistics and not the original data X and y.

Additional tests have been added to ensure the new lars_path_gram(...) function works as intended.
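A minimal sketch of the intended workflow, assuming the lars_path_gram(Gram, Xy, n_samples, ...) signature discussed later in this thread (the argument order in the finally merged API may differ); X and y are used here only to build the sufficient statistics for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression

# Toy data, used only to construct the sufficient statistics. In the target
# use case the caller would already hold Gram, Xy and n_samples without
# ever materializing X and y.
X, y = make_regression(n_samples=100, n_features=10, random_state=0)

Gram = X.T @ X          # Gram matrix, shape (n_features, n_features)
Xy = X.T @ y            # covariance vector X^T y, shape (n_features,)
n_samples = X.shape[0]  # sample size completes the sufficient statistics

# Proposed call (hypothetical at this point in the thread):
# alphas, active, coefs = lars_path_gram(Gram, Xy, n_samples, method='lasso')
```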

yukuairoy added some commits Jun 15, 2018

agramfort (Member) commented Jul 28, 2018

@yukuairoy you can already pass Gram and Xy as parameters to lars_path. Why is that not enough?

Also, I see a number of cosmetic changes to existing code in this PR. Avoid these when possible so the diff is limited to the new feature code.
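For reference, a sketch of the existing usage @agramfort refers to: Gram and Xy can already be passed as precomputed values, but X and y themselves remain required (the example data and parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# Precomputed Gram and Xy avoid recomputation inside lars_path,
# but the raw X and y still have to be supplied.
alphas, active, coefs = lars_path(X, y, Xy=X.T @ y, Gram=X.T @ X,
                                  method='lasso')
```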

yukuairoy added some commits Jul 28, 2018

agramfort (Member) commented Jul 29, 2018

@yukuairoy can you answer my question above?

yukuairoy (Contributor, Author) commented Jul 29, 2018

@agramfort Thank you for looking at this PR. Currently, users have to pass non-None X and y for lars_path() to work. The "sufficient stats" mode lets users skip providing X and y (passing None as a placeholder) and supply the sufficient statistics instead. In addition to the Gram matrix and the covariance vector (Xy), n_samples is needed to complete the set of sufficient statistics. Since n_samples was not part of the signature of lars_path(), we also need to add it to the signature for this mode to work.

I've sent a separate PR (#11703) to fix the cosmetic issues, and I'll push another commit to make sure this PR only contains the change in functionality.

jnothman (Member) commented Jul 29, 2018

yukuairoy (Contributor, Author) commented Jul 30, 2018

@jnothman Thank you for the comment. Could you clarify what you mean by "n_samples = Xy.shape[0]"? I cannot find that line in my diff. Or are you suggesting that n_samples can be inferred from Xy?

Please correct me if I'm wrong, but I believe Xy.shape[0] gives n_features rather than n_samples. In fact, n_samples cannot be inferred from either Gram or Xy; it has to be provided explicitly by the user.
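A quick shape check illustrating the point (a small sketch, not code from this PR):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 100, 10
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

Gram = X.T @ X  # shape (n_features, n_features) == (10, 10)
Xy = X.T @ y    # shape (n_features,) == (10,)

# Xy.shape[0] equals n_features, not n_samples, and neither shape depends
# on n_samples, so the sample size has to be passed in explicitly.
assert Xy.shape[0] == n_features
assert Gram.shape == (n_features, n_features)
```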

agramfort (Member) commented Jul 31, 2018

Thinking about it, I think it would be cleaner to have a new lars_path_gram function that takes Gram and Xy as input (no X or y), and to deprecate the option to pass Gram and Xy to lars_path. That would simplify the API of lars_path.

jnothman (Member) commented Jul 31, 2018

yukuairoy (Contributor, Author) commented Aug 3, 2018

Hi @jnothman, thanks very much for bringing this up. It looks like the master branch does indeed have a documentation bug in the parameter description of Xy. Once this PR is approved and merged into master, I can open another PR to fix that documentation bug if it helps the community.

yukuairoy (Contributor, Author) commented Aug 3, 2018

Hi @agramfort, thanks very much for the suggestion. To provide a bit more context: in many existing client codebases (as well as several cases exercised by the unit tests), users invoke the "precomputed" mode of lars_path(...), i.e. they pass X and y together with precomputed Gram and Xy. Under the hood, knowing X directly can let the solver use slightly more efficient numerical linear-algebra routines than when X is unknown. This is consistent with the fact that the numerical output of the precomputed mode is not always exactly equal to that of the raw mode (some examples can be found in the test cases; more exist in client codebases, though those are not visible from GitHub). If we deprecate the option to pass Gram and Xy to lars_path(...), the concern is that it will likely break many existing client codebases, which may not be necessary.

agramfort (Member) commented Aug 4, 2018

@yukuairoy you will have to change the code of your clients anyway to support what you are aiming for.

When parameters become required or optional depending on other parameters, documenting the API starts to become a mess. With the current master you always need X and y, and you can pass precomputed values to avoid recomputation. With what you propose, X and y can be None, but then we need to pass n_samples. It starts to become a mess, I think.

I would prefer to have

lars_path(X, y, precompute='auto' | True | False), which would follow the Lars API: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html

and

lars_path_gram(Gram, Xy, n_samples)

Of course, we should do this without code duplication, via a private function.
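A rough sketch of how that split could be organized around a shared private helper (the names, signatures, and reduced parameter list here are illustrative, not the merged implementation):

```python
def _lars_path_solver(X, y, Xy=None, Gram=None, n_samples=None,
                      max_iter=500, alpha_min=0, method='lar'):
    """Single implementation of the LARS iterations.

    Works either from raw data (X, y) or from the sufficient statistics
    (Gram, Xy, n_samples) when X and y are None.
    """
    ...  # shared solver body, omitted in this sketch


def lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0,
              method='lar'):
    """Raw-data entry point; Gram and Xy only serve as precomputed caches."""
    return _lars_path_solver(X, y, Xy=Xy, Gram=Gram, n_samples=X.shape[0],
                             max_iter=max_iter, alpha_min=alpha_min,
                             method=method)


def lars_path_gram(Gram, Xy, n_samples, max_iter=500, alpha_min=0,
                   method='lar'):
    """Sufficient-statistics entry point; no X or y needed."""
    return _lars_path_solver(None, None, Xy=Xy, Gram=Gram,
                             n_samples=n_samples, max_iter=max_iter,
                             alpha_min=alpha_min, method=method)
```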

yukuairoy (Contributor, Author) commented Aug 15, 2018

@agramfort I agree that we should keep the two modes separate. My only concern with your suggestion of lars_path(X, y, precompute='auto' | True | False) is that, in order to support the precompute mode, lars_path() would still need to accept Gram and Xy in addition to a precompute parameter, which makes the precompute parameter redundant.

How about we keep the original

lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=np.finfo(np.float).eps, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)

intact and add an additional

lars_path_gram(Gram, Xy, n_samples, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=np.finfo(np.float).eps, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False)?

This way we get to keep backward compatibility. Of course we'll use a private function to avoid code duplication.
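Under this counter-proposal, the two entry points would be used roughly as follows (a sketch: the lars_path call is the existing API, while the lars_path_gram call follows the signature written above and may differ from what eventually gets merged):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# Raw-data mode: unchanged, so existing client code keeps working.
alphas, active, coefs = lars_path(X, y, method='lasso')

# Sufficient-statistics mode, per the proposed signature above
# (hypothetical here; argument order in the merged API may differ):
# alphas, active, coefs = lars_path_gram(X.T @ X, X.T @ y, X.shape[0],
#                                        method='lasso')
```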

yukuairoy added some commits Aug 15, 2018

jnothman (Member) left a comment

For functionality we already have, this seems to add a lot of testing code. But overall it is looking good.

Resolved review threads (outdated): benchmarks/bench_plot_lasso_path.py, benchmarks/bench_plot_omp_lars.py, sklearn/linear_model/tests/test_least_angle.py

yukuairoy added some commits Jan 3, 2019

yukuairoy (Contributor, Author) commented Jan 5, 2019

Thanks @agramfort and @jnothman for the comments. I've updated the code. Please take a look.

yukuairoy added some commits Jan 5, 2019

Resolved review thread (outdated): sklearn/linear_model/tests/test_least_angle.py

yukuairoy added some commits Jan 8, 2019


jnothman changed the title from 'Enable the "sufficient stats" mode of LARS' to 'ENH Enable the "sufficient stats" mode of LARS' on Jan 8, 2019

yukuairoy (Contributor, Author) commented Jan 15, 2019

@agramfort Do you have further comments?

yukuairoy (Contributor, Author) commented Jan 24, 2019

@agramfort Friendly ping

yukuairoy (Contributor, Author) commented Feb 4, 2019

@agramfort can you please review the current version?

yukuairoy (Contributor, Author) commented Feb 4, 2019

@jnothman Thanks for reviewing this pull request. Is there anything we can do to make sure @agramfort reviews the latest changes?

agramfort (Member) commented Feb 16, 2019

@yukuairoy we need a What's New update before merging.

yukuairoy (Contributor, Author) commented Mar 5, 2019

@agramfort @jnothman thanks for the LGTM. I've updated the What's New entry.

yukuairoy and others added some commits Mar 6, 2019

jnothman (Member) commented Mar 6, 2019

Thanks @yukuairoy!

jnothman merged commit fec7670 into scikit-learn:master on Mar 6, 2019
