postCP change point detection

Anuj Khare edited this page Feb 7, 2016 · 17 revisions

Background

Change-point detection algorithms are useful for analyzing time series data that exhibit abrupt changes in distribution. There are many R packages for detecting K change-points in N data points, but only a few are able to compute error bands (confidence intervals) for change point locations. The only R package that can compute error bands with provably linear O(N) time complexity is the postCP package. However, the last time I checked (25 Nov 2015), CRAN has put the postCP package in the Archive: “Archived on 2014-09-26 as misused .C(DUP = FALSE) and failed its checks on several platforms, including Solaris and Linux under valgrind.”

Related work

Some other R packages that can compute multiple change-point models:

paper package time error bands?
Luong et al postCP O(N) yes
Rigaill et al EBS O(KN^2) yes
Frick et al stepR near-linear yes
Rigaill et al cghseg O(N) no
Killick et al changepoint O(N) no

Coding project ideas

postCP back on CRAN

Start from the most recent version of postCP on CRAN, and make the necessary changes to get it passing checks and back on CRAN. Writing a vignette and tests would also be nice.

Separating model-specific and generic parts of the code

postCP is based on a simple idea: a segmentation of n points into k segments can be seen as a Markov chain constrained to start in segment 1 and to end up in segment k. With this idea, any breakpoint model can be easily rewritten as a constrained Hidden Markov Model (HMM). Therefore postCP has two main steps:

  1. from a matrix of evidence (in logscale) one has to compute the posterior distribution of segmentation using forward-backward. One can easily derive from this quantity the marginal posterior distribution of any breakpoint and/or particular position.
  2. from a posterior distribution and the data, one can update the parameter estimate (in EM context)

It is hence clear that postCP should have a technical core part dedicated to constrained forward-backward for any model (typically coded in Rcpp for speed), and a model specific part (Gaussian regression, Poisson, survival, etc.) for the computation of the log-evidence and for the parameter updates.

Any good implementation of postCP should take this into account in the most R-standard compliant way.

How to handle various classical models the R way ?

The simplest would probably be to include the most common regression model like Gaussian, binomial, Poisson regressions, and even survival analysis. This can be done easily using the standard regression function of R (ex: lm, glm, coxph). The all idea is to exploit both the general regression syntax (see model.matrix()) simply by using the weights option.

Example with Gaussian regression:

  1. eta[i,k] contains the posterior probability of observation i to belong to Segment k (typically computed through the core part of postCP)
  2. fit=lm(y~X,weights=eta[,k]) allows to fit the model for Segment k
  3. coefficients(fit) gives the parameter estimates for Segment k
  4. dnorm(y,predict(fit),log=TRUE) gives the new log-evidence for Segment k

This can be easily adapted to other regression contexts with minimum efforts…

Here is a list of the family models available in glm for example: binomial(link = “logit”) gaussian(link = “identity”) Gamma(link = “inverse”) inverse.gaussian(link = “1/mu^2”) poisson(link = “log”) quasi(link = “identity”, variance = “constant”) quasibinomial(link = “logit”) quasipoisson(link = “log”)

What is currently available ?

  • postCP old implementation: core + three simple model (Gaussian, Poisson, negative binomial); no regression
  • core Rcpp code for Forward-Backward (unpublished, but available on request from G. Nuel <nuel@math.cnrs.fr>)

Expected impact

useRs who work on change-point detection in large data sets will again be able to rely on the postCP package for computing error bands for change-point models in provably linear time. There are no other R packages that can compute error bands for change-point models in provably linear time.

Mentors

Gregory Nuel <nuel@math.cnrs.fr> would be the primary mentor since he knows the theory and C++ code implemented in the postCP package. Guillem Rigaill <guillem.rigaill@evry.inra.fr> could be a co-mentor, since he has developed several other R packages for change-point detection.

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

Solutions of tests

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.