Aditya Samantaray edited this page Mar 17, 2019 · 15 revisions

Background

Interval regression is a class of machine learning models which is useful when predicted values should be real numbers, but outputs in the training data set may be partially observed. A common example is survival analysis, in which data are patient survival times.

For example, say that Alice and Bob came into the hospital and were treated for cancer on the same day in 2000. Now we are in 2016 and we would like to study the treatment efficacy. Say Alice died in 2010, and Bob is still alive. The survival time for Alice is 10 years, and although we do not know Bob’s survival time, we know it is in the interval (16, Infinity).

Say that we also measured some covariates (input variables) for Alice and Bob (age, sex, gene expression). We can fit an Accelerated Failure Time (AFT) model which takes those input variables and outputs a predicted survival time. L1-regularized AFT models are of interest when there are many input variables and we would like the model to automatically ignore those which are un-informative (do not help predict survival time). Several papers describe L1-regularized AFT models.

Interval regression (or interval censoring) is a generalization in which any kind of interval is an acceptable output in the training data. Any real-valued or positive-valued probability distribution may be used to model the outputs (e.g. normal or logistic if output is real-valued, log-normal or log-logistic if output is positive-valued like a survival time). For more details read this 1-page explanation of un-regularized parametric AFT models.

| output | interval | likelihood | censoring |
|---|---|---|---|
| exactly 10 | (10, 10) | density function | none |
| at least 16 | (16, Infinity) | cumulative distribution function | right |
| at most 3 | (-Infinity, 3) | cumulative distribution function | left |
| between -4 and 5 | (-4, 5) | cumulative distribution function | interval |
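The likelihood contributions in the table above can be sketched numerically. This is a minimal illustration (not from the iregnet docs) using SciPy's normal distribution; the mean `mu` and scale `sigma` are hypothetical values standing in for a model's predictions:

```python
# Likelihood contributions for each censoring type under a normal
# output distribution, matching the four rows of the table.
from scipy.stats import norm

mu, sigma = 0.0, 1.0  # hypothetical predicted mean and scale

lik_exact = norm.pdf(10, mu, sigma)        # none: density at the observed value
lik_right = norm.sf(16, mu, sigma)         # right: P(T > 16), survival fn = 1 - CDF
lik_left = norm.cdf(3, mu, sigma)          # left: P(T < 3)
lik_interval = norm.cdf(5, mu, sigma) - norm.cdf(-4, mu, sigma)  # interval: P(-4 < T < 5)
```

Using `norm.sf` rather than `1 - norm.cdf` avoids catastrophic cancellation far in the right tail.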

Another application of interval regression is in learning penalty functions for detecting change-points and peaks in genomic data (data viz).

The iregnet package was coded during GSOC 2016 by @anujkhare. It is the first R package to support

  • general interval output data (including left and interval censoring; not just observed and right-censored data typical of survival analysis),
  • elastic net (L1 + L2) regularization, and
  • a fast glmnet-like coordinate descent solver, coded in C++ by following the mathematics of Simon et al (JSS).
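To make the coordinate descent idea concrete, here is a hedged sketch of glmnet-style cyclic coordinate descent for a plain lasso linear regression. This is a generic illustration of the technique, not iregnet's actual C++ implementation (which handles censored AFT likelihoods); all function names here are assumptions for the sketch:

```python
# Cyclic coordinate descent for lasso:
#   minimize (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1
# Each coefficient update is a soft-thresholded univariate least-squares fit.
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave coefficient j out of the fit.
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return b
```

With `lam = 0` this reduces to ordinary least squares; for large enough `lam` all coefficients are shrunk exactly to zero, which is the sparsity property that motivates the L1 penalty.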

Related R packages

  • AdapEnetClass::WEnetCC.aft (arXiv paper) fits two different models, both with AFT weighted square loss and elastic net regularization.
  • glmnet fits models for elastic net regularization with several loss functions, but neither AFT nor interval regression losses are supported.
  • interval::icfit and survival::survreg provide solvers for non-regularized interval regression models.
  • The penaltyLearning package contains a solver which uses the FISTA algorithm to fit an L1-regularized model for general interval output data. However, there are two issues: (1) it is not as fast as the coordinate descent algorithm implemented in glmnet, and (2) it does not support L2 regularization.
| function/pkg | censoring | regularization | loss | algorithm |
|---|---|---|---|---|
| glmnet | none, right | L1 + L2 | Cox | coordinate descent |
| glmnet | none | L1 + L2 | normal, logistic | coordinate descent |
| AdapEnetClass | none, right | L1 + L2 | normal | LARS |
| coxph | none, right, left, interval | none | Cox | ? |
| survreg | none, right, left, interval | none | normal, logistic, Weibull | Newton-Raphson |
| PeakSegDP | left, right, interval | L1 | squared hinge, log | FISTA |
| iregnet | none, right, left, interval | L1 + L2 | normal, logistic, Weibull | coordinate descent |

Coding project: iregnet on CRAN

The main goal of this GSOC project is to get the iregnet package on CRAN.

  • start by finishing PR#54 which implements the cv.iregnet function.
  • make sure docs and tests pass all CRAN checks.
  • write a vignette which explains the statistical model, optimization problem, and shows how to use iregnet on several data sets.
  • write a vignette with speed/optimization accuracy comparisons (glmnet, survival, iregnet).

Tests that make sure it works for several data sets:

  • data("neuroblastomaProcessed", package="penaltyLearning") is about learning a function for predicting breakpoints in DNA copy number profiles.
  • data("penalty.learning", package="iregnet") is about learning a function for predicting peaks in epigenomic data.
  • there are several more data sets on penalty function learning for predicting peaks in epigenomic data:
    • Labels/intervals: https://raw.githubusercontent.com/tdhock/feature-learning-benchmark/master/labeled_problems_targets.csv
    • Inputs/features: https://raw.githubusercontent.com/tdhock/feature-learning-benchmark/master/labeled_problems_features.csv

Expected impact

The iregnet package will be feature-complete, well-documented, and available on CRAN.

Mentors

  • Toby Dylan Hocking <tdhock5@gmail.com> proposed this project and can mentor.
  • Anuj Khare <khareanuj18@gmail.com> coded iregnet in GSOC2016 and can mentor.

Tests

Students, please complete as many tests as possible before emailing the mentors. If we do not find a student who can complete the Hard test, then we should not approve this GSOC project.

  • Easy: perform a side-by-side comparison of iregnet and glmnet for a lasso problem with no censored data. Consider the prostate cancer data set, which has no censored data. Use the microbenchmark package to time the iregnet and glmnet functions. Do the two functions return the same result? Which is faster? Plot time versus data set size (one plot varying the number of rows, one varying the number of columns); this kind of plot makes it very easy to see the differences in timings.
  • Medium: use iregnet to fit a model on the penaltyLearning::neuroblastomaProcessed data.
  • Hard: use 5-fold CV to compare the iregnet model with penaltyLearning::IntervalRegressionCV in terms of test error. For a test error function you can use the number of predicted values which are outside the corresponding label/interval.
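The test error suggested for the Hard test can be sketched as a small helper. The function name and signature below are assumptions for illustration, not part of any package's API:

```python
# Count predictions that fall outside their target interval --
# the suggested test error for interval regression models.
import numpy as np

def interval_error_count(pred, lower, upper):
    """Number of predicted values strictly outside [lower, upper]."""
    pred = np.asarray(pred, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return int(np.sum((pred < lower) | (pred > upper)))
```

Right-censored labels are handled naturally by setting `upper` to `np.inf` (and left-censored labels by setting `lower` to `-np.inf`).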

Solutions of tests

Students, please post a link to your test results here.

  • Name : Aditya Samantaray

    Email : b517003@iiit-bh.ac.in, aditya.samantaray1@gmail.com

    University : International Institute of Information Technology, Bhubaneswar

    Course : Computer Engineering

    Solution to Easy Test : NewEasyTest OldEasyTest

    Solution to Medium Test : MediumTest

    Solution to Hard Test : HardTest

  • Name : Ao Ni

    Email : niao@mail2.sysu.edu.cn, neo.aoni@gmail.com

    University : Sun Yat-sen University

    Program : BS in Applied Statistics

    Solution to Easy Test : Easy Test

    Solution to Medium Test : Medium Test

    Solution to Hard Test : Hard Test
