Minimal Generalized linear models implementation (L2 + lbfgs) #14300

Merged
merged 292 commits into scikit-learn:master from GLM-minimal on Mar 4, 2020

Conversation

@rth
Member

@rth rth commented Jul 9, 2019

A minimal implementation of Generalized Linear Models (GLM) from #9405 by @lorentzenchr

This only includes the L2 penalty with the lbfgs solver. In other words, it excludes the L1 penalty (and the CD solver), matrix penalties, warm start, some distributions (e.g. BinomialDistribution), and the newton-cg & irls solvers from the original GLM PR.

The goal is to get an easier-to-review initial implementation. Benchmarks were done in #9405 (comment).
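
For context, here is a minimal usage sketch of the scope described above: a Poisson GLM with an L2 penalty fit via lbfgs. It assumes the estimators exposed by this PR match those in later scikit-learn releases (e.g. PoissonRegressor in sklearn.linear_model); treat the parameter values as illustrative, not as defaults defined by this PR.

```python
# Sketch: fit a Poisson GLM with an L2 penalty (the only penalty in this PR),
# using the lbfgs solver (the only solver in this PR).
# Assumes PoissonRegressor is available in sklearn.linear_model (>= 0.23).
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
# Non-negative, count-like targets suited to the Poisson deviance loss.
y = rng.poisson(lam=np.exp(X @ np.array([0.3, -0.2, 0.1])))

# alpha is the L2 regularization strength.
glm = PoissonRegressor(alpha=1.0, max_iter=300)
glm.fit(X, y)
print(glm.coef_, glm.intercept_)
print(glm.score(X, y))  # D^2, the fraction of deviance explained
```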

TODO

  • fix user manual and comments regarding removed functionality
  • examples may need some more work
  • use a standalone Tweedie deviance metric #14263 once it is merged
  • refactor the LBFGS optimizer interface once #14250 is merged
  • use a common function for sample_weight validations #14307
  • expose to sklearn.linear_model the objects that are uncontroversial for now
  • consider using a diagonal preconditioner, as discussed in #15556 and #15583 for logistic regression. Edit: not in this PR -- Roman
Christian Lorentzen and others added 30 commits Feb 17, 2019
* smarter initialization of intercept

* PEP 257 -- Docstring Conventions

* minor docstring changes
* P2 also accepts 1d array and interprets it as diagonal matrix

* improved input checks for P1 and P2
…hecks

* adapt examples of GeneralizedLinearModel to new defaults for
  P1, P2 and selection

* fix precision/decimal issue in test_poisson_enet

* use more robust least squares instead of solve in IRLS

* fix sign error in input checks

* add Binomial distribution

* add Logit link

* tests for binomial against LogisticRegression

* option 'auto' for link

* reduce code duplication by replacing @abstractproperty by @property
* refactor into function _irls_solver

* refactor into function _cd_solver

* replace safe_sparse_dot by the matmul operator @

* more efficient handling of fisher matrix

* sparse coo matrices are converted to csc or csr

* sample weights don't accept sparse matrices

* minor doc changes
* renamed option irls into guess

* removed option least_squares

* updated tests
* new estimator GeneralizedLinearRegressor
* loss functions for Tweedie family and Binomial
* elastic net penalties
* control of penalties by matrix P2 and vector P1
* new solvers: coordinate descent, irls
* tests
* documentation
* example for Poisson regression
* make glm module private
* typos
* improve tests
Member

@NicolasHug NicolasHug left a comment

I made minor pushes to compare the perf of the 2 models side by side in the example (293214c)

CI is green, LGTM. I think most of the comments are either minor or addressed.

Thank you very much for the great work and for your patience @lorentzenchr @rth. Much appreciated. Looking forward to seeing the future improvements!

Feel free to review my last changes and merge when you see fit.

@lorentzenchr
Member

@lorentzenchr lorentzenchr commented Mar 1, 2020

And the second prize of chocolate goes to ... @NicolasHug 🥈 🍫 🎉

@rth
Member Author

@rth rth commented Mar 1, 2020

Thanks a lot for the review @NicolasHug and for quickly addressing comments @lorentzenchr !

Let's give it another couple of days in case @ogrisel has other suggestions.

For whoever merges this: please edit the merge commit description to remove the individual commit messages but keep one of each Co-authored-by tag, as discussed in #16550, to preserve attribution.

@rth rth closed this Mar 1, 2020
@rth rth reopened this Mar 1, 2020
@rth
Member Author

@rth rth commented Mar 1, 2020

(Sorry, accidental key press; I updated the message above with the missing part.)

@jnothman
Member

@jnothman jnothman commented Mar 2, 2020

Very excited for this! Sorry I've not been able to review in a while.

@lorentzenchr
Member

@lorentzenchr lorentzenchr commented Mar 2, 2020

@ogrisel As a last remark: How do you feel about reversing the sorting of the x-axis in the last plot of plot_tweedie_regression_insurance_claims.py from "safest to riskiest" to "riskiest to safest"? This way, both examples would have a comparable "rank" plot at the end.
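
For illustration only, a tiny sketch of the change being discussed, i.e. flipping the ordering of groups on the x-axis of the final "rank" plot. The variable names here are hypothetical and not taken from plot_tweedie_regression_insurance_claims.py.

```python
# Hypothetical sketch: flip the ordering of groups before plotting.
# y_pred stands in for the predicted claim frequency per group of policyholders.
import numpy as np

y_pred = np.array([0.02, 0.10, 0.05, 0.30])

order_safest_first = np.argsort(y_pred)          # current: safest to riskiest
order_riskiest_first = order_safest_first[::-1]  # proposed: riskiest to safest

# The x-axis of the final plot would then follow the reversed order, e.g.
# ax.plot(y_pred[order_riskiest_first])
```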

@rth
Member Author

@rth rth commented Mar 4, 2020

OK, merging with +2 and several partial reviews. Thanks @lorentzenchr and to all reviewers! I'm really glad this is done.

How do you feel about reversing the sorting of the x-axis in the last plot of plot_tweedie_regression_insurance_claims.py from "safest to riskiest" to "riskiest to safest"? This way, both examples would have a comparable "rank" plot at the end.

Feel free to open a follow-up PR. Also looking forward to the additional features we could add from #9405. Will comment there.

@rth rth merged commit 69ea066 into scikit-learn:master Mar 4, 2020
21 checks passed
@rth rth deleted the GLM-minimal branch Mar 4, 2020
@agramfort
Member

@agramfort agramfort commented Mar 4, 2020

🍺 !

@lorentzenchr
Member

@lorentzenchr lorentzenchr commented Mar 4, 2020

Many thanks to all of you! This was hard work and with enough perseverance we got it done. 👏

@jnothman
Member

@jnothman jnothman commented Mar 5, 2020

Awesome! Might be worth announcing on the mailing list and asking people to take it for a ride.

@rth
Member Author

@rth rth commented Mar 5, 2020

Awesome! Might be worth announcing on the mailing list and asking people to take it for a ride.

@lorentzenchr Would you like to write that announcement email? Users can try it with the nightly wheels (https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds) and the dev documentation (https://scikit-learn.org/dev/modules/linear_model.html#generalized-linear-regression).

@lorentzenchr
Member

@lorentzenchr lorentzenchr commented Mar 5, 2020

@rth Thanks for asking. As I'm not (yet) on the scikit-learn mailing list, and as I can't find a precedent announcement for a new feature to use as a template, I'd rather let you go ahead 😊

ashutosh1919 added a commit to ashutosh1919/scikit-learn that referenced this issue Mar 13, 2020
…ikit-learn#14300)

Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
gio8tisu added a commit to gio8tisu/scikit-learn that referenced this issue May 15, 2020
…ikit-learn#14300)

Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>