sgdnet: efficient regularized GLMs for big data
R has excellent support for regularized generalized linear models (GLM), through for instance the glmnet package which fits elastic net penalized GLMs using coordinate descent.
Recent advances, however, have demonstrated that the performance of coordinate descent methods may be outperformed by stochastic gradient (SG) algorithms when n is large and computing the full gradient is prohibitive, which SG-based methods circumvent by subsampling of the training set, thereby making the iteration cost independent of n.
One such algorithm is the SAGA algorithm, which is able to solve convex problem with non-smooth penalty terms, making it ideal for fitting generalized linear models with the elastic net penalty along with several other penalty constructs.
There is not yet any way in R to fully leverage the power of stochastic gradient algorithms for fitting generalized linear models. For Google Summer of Code 2018, however, a project was initialized with the goal of providing a drop-in replacement for glmnet, based on the SAGA algorithm, with the intent of outperforming glmnet for big data problems with many more observations than variables.
That project resulted in the sgdnet package, which, however, fell short of the mark of both equalling glmnet in terms of features and beating it on performance.
This project is a follow-up to a GSOC18 project, which one of the mentors (Johan Larsson) undertook and the other (Michael Weylandt) mentored for. The resulting package is at https://github.com/jolars/sgdnet and you can read the final report at https://github.com/jolars/sgdnet/wiki/GSoC-2018.
- glmnet is a popular package for solving generalized linear models using coordinate descent.
- bigglm fits generalized linear models to bigger-than-memory data sets.
- scikit-learn has a python implementation of SAGA for fitting elastic net penalized glms.
- lightning is a python implementation of, among other algorithms, SAGA for fitting glms.
- tick is a python module with a C++ backbone that uses the SAGA algorithm.
- Ischmael Belghazi developed the bigoptim package for his GSOC18 project, which used a similar algorithm (SAG) to solve L2-regularized problems.
Details of your coding project
The overarching goal of the project is to reach a mature state where the feature set and performance makes it attractive as a replacement for glmnet in settings where n >> g, hopefully resulting in a CRAN submission by the end of GSOC.
- Incorporate feature screening rules (which are a major reason why glmnet is so fast)
- Poisson regression
- Binomial regression for binary (0/1) outputs is already implemented, but it would be nice to also support generic non-negative count data (0, 1, 2, ..., N), i.e. the loss function is the negative log-likelihood of the binomial distribution.
- Cox regression
- Move to cyclic instead of truly stochastic sampling (see https://arxiv.org/abs/1810.11167)
- Observation weights
- Update and refine vignettes
- Loss functions for general (left, right, interval) censoring, e.g. normal and logistic accelerated failure time (AFT) models.
- Negative Binomial Regression
- SCAD penalty
- MCP penalty
- Feature weights
- Box constraints
- Streaming/updating support
- Implement mini-batching for the SAGA algorithm
- Implement an asynchronous version of SAGA
The R community will have an efficient alternative for fitting generalized linear models for large data sets, possibly with additional functionaltiy.
We also expect the package to be easier to extend and more readable than the fortran-based code in glmnet.
Students, please contact mentors below after completing at least one of the tests below.
- Michael Weylandt firstname.lastname@example.org was a mentor for the sgdnet package project last year and is a PhD student at Rice University, involved in research on convex optimization among other things.
- Johan Larsson email@example.com is the author of R package sgdnet and the GSOC student who initially worked on sgdnet. He is also a PhD student in statistics at Lund University.
Students, please do one or more of the following tests before contacting the mentors above.
- Easy: Benchmark sgdnet against glmnet (and similar packages for bonus points) using microbenchmark, bench, or rbenchmark for time and loss for the gaussian, logistic, and multinomial families. Organize your findings in a Rmarkdown or Sweave document.
- Medium: Use RcppEigen to write a class that defines the lasso (l1) penalty, which should contain both a function to evaluate the proximal operator (soft thresholding) and evaluate the loss.
- Hard: Write an R package using RcppEigen to write a function that uses a regular stochastic gradient descent algorithm to fit an ordinary least squares model to a univariate response. Bonus: use the lasso penalty class from the previous test to regularize the model.
Solutions of tests
Students, please post a link to your test results here.