Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
postCP change point detection
Change-point detection algorithms are useful for analyzing time series data that exhibit abrupt changes in distribution. There are many R packages for detecting K change-points in N data points, but only a few are able to compute error bands (confidence intervals) for change point locations. The only R package that can compute error bands with provably linear O(N) time complexity is the postCP package. However, the last time I checked (25 Nov 2015), CRAN has put the postCP package in the Archive: “Archived on 2014-09-26 as misused .C(DUP = FALSE) and failed its checks on several platforms, including Solaris and Linux under valgrind.”
Some other R packages that can compute multiple change-point models:
|Luong et al||postCP||O(N)||yes|
|Rigaill et al||EBS||O(KN^2)||yes|
|Frick et al||stepR||near-linear||yes|
|Rigaill et al||cghseg||O(N)||no|
|Killick et al||changepoint||O(N)||no|
Coding project ideas
postCP back on CRAN
Start from the most recent version of postCP on CRAN, and make the necessary changes to get it passing checks and back on CRAN. Writing a vignette and tests would also be nice.
Separating model-specific and generic parts of the code
postCP is based on a simple idea: a segmentation of n points into k segments can be seen as a Markov chain constrained to start in segment 1 and to end up in segment k. With this idea, any breakpoint model can be easily rewritten as a constrained Hidden Markov Model (HMM). Therefore postCP has two main steps:
- from a matrix of evidence (in logscale) one has to compute the posterior distribution of segmentation using forward-backward. One can easily derive from this quantity the marginal posterior distribution of any breakpoint and/or particular position.
- from a posterior distribution and the data, one can update the parameter estimate (in EM context)
It is hence clear that postCP should have a technical core part dedicated to constrained forward-backward for any model (typically coded in Rcpp for speed), and a model specific part (Gaussian regression, Poisson, survival, etc.) for the computation of the log-evidence and for the parameter updates.
Any good implementation of postCP should take this into account in the most R-standard compliant way.
How to handle various classical models the R way ?
The simplest would probably be to include the most common regression model like Gaussian, binomial, Poisson regressions, and even survival analysis. This can be done easily using the standard regression function of R (ex: lm, glm, coxph). The all idea is to exploit both the general regression syntax (see model.matrix()) simply by using the weights option.
Example with Gaussian regression:
- eta[i,k] contains the posterior probability of observation i to belong to Segment k (typically computed through the core part of postCP)
- fit=lm(y~X,weights=eta[,k]) allows to fit the model for Segment k
- coefficients(fit) gives the parameter estimates for Segment k
- dnorm(y,predict(fit),log=TRUE) gives the new log-evidence for Segment k
This can be easily adapted to other regression contexts with minimum efforts…
Here is a list of the family models available in glm for example: binomial(link = “logit”) gaussian(link = “identity”) Gamma(link = “inverse”) inverse.gaussian(link = “1/mu^2”) poisson(link = “log”) quasi(link = “identity”, variance = “constant”) quasibinomial(link = “logit”) quasipoisson(link = “log”)
What is currently available ?
- postCP old implementation: core + three simple model (Gaussian, Poisson, negative binomial); no regression
- core Rcpp code for Forward-Backward (unpublished, but available on request from G. Nuel <firstname.lastname@example.org>)
useRs who work on change-point detection in large data sets will again be able to rely on the postCP package for computing error bands for change-point models in provably linear time. There are no other R packages that can compute error bands for change-point models in provably linear time.
Gregory Nuel <email@example.com> would be the primary mentor since he knows the theory and C++ code implemented in the postCP package. Guillem Rigaill <firstname.lastname@example.org> could be a co-mentor, since he has developed several other R packages for change-point detection.
Do one or several — doing more hard tests makes you more likely to be selected.
- Easy: download and install the most recent version of postCP on CRAN, then run
example(postCP)and explain why the output is problematic.
- Medium: demonstrate that you know how to write vignettes and package tests. What tests would you suggest to verify that the postCP function is working?
- Hard: demonstrate that you know how to interface C++ code with R using
.Call. Have you already written an R package that uses C code? If not, you may want to read ”When and how to write low-level (C/C++) instead of high-level (R/Python) code?” Then write a solution to this exercise to prove that you can write an interface using