PeakSegOptimal: Optimal Segmentation Subject to Up-Down Constraints
NOTE: the name of the package has been changed from coseg to PeakSegOptimal, to avoid confusion with the CoSeg package on CRAN (which is completely unrelated to this package).
Linear time exact C++ solver for the optimal change points using the Poisson loss and the PeakSeg constraint.
PeakSeg Segment Neighborhood problem
There are two R functions that solve the Segment Neighborhood problem (best models with at most 1,…,S segments). The constraints are first segment mean <= second segment mean >= third segment mean, etc. For S segments and N data points, complexity is O(S N log N) memory and O(S N log N) time.
- PeakSegPDPA is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step. Input parameter is the maximum number of segments S, and outputs are the optimal models from 1 to S segments.
- PeakSegPDPAchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output. Input parameter is the maximum number of peaks P (implying S = P*2+1 segments), and outputs are the optimal models from 0 to P peaks.
PeakSeg Optimal Partitioning problem
There are two R functions that solve the Optimal Partitioning problem (best model for a given penalty parameter). Input parameter is a non-negative penalty constant (bigger for fewer peaks, smaller for more peaks), and output is the model that minimizes the penalized cost using the specified penalty constant. For N data points, complexity is O(N log N) memory and time. The first segment mean is constrained to be down (mu_1 <= mu_2), as is the last (muN-1 >= mu_N).
- PeakSegFPOP is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step.
- PeakSegFPOPchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output.
Note about inequality constraints
Note that as defined in the PeakSeg ICML’15 paper, the PeakSeg problem has strict inequality constraints (segment mean 1 < segment mean 2 > segment mean 3, etc). However, the solvers in this package use non-strict inequality constraints (segment mean 1 <= segment mean 2 >= segment mean 3, etc), and so may return a model with adjacent segment means that are equal. In this case, the minimum of the PeakSeg problem with strict inequality constraints is undefined. For example,
library(coseg) data.vec <- as.integer(c(1, 2, 3)) fit <- PeakSegPDPA(data.vec, max.segments=3L)
For the data 1, 2, 3 the optimal model with at most 3 segments actually has only 2 segments:
> fit$mean.mat[3,]  1.0 2.5 2.5
This implies that the optimal model with strict inequality constraints is undefined for these data. For example, the model with means 1, 3, 2 satisfies the strict inequality constraint, and has a PoissonLoss of
> PoissonLoss(c(1,2,3), c(1,3,2))  1.723334
But that model is not optimal, because there is another model that satisfies the strict inequality constraints, but has a better cost. For example,
> PoissonLoss(c(1,2,3), c(1,2.9,2.1))  1.644766
But that model is not optimal, for the same reason (exercise for the reader).
Related work: peak calling on the whole genome
An on-disk implementation of PeakSegFPOP is available in PeakSegFPOP or PeakSegPipeline. These programs can be used to compute optimal peak models for large bedGraph files, since they stores the optimal cost in a temporary file on disk (rather than in memory). If you want to use PeakSeg to perform peak calling on the whole genome, please try using these other implementations.
|PeakSegPipeline::PeakSegFPOP_disk||O(N log N)||O(log N)||O(N log N)|
|PeakSegOptimal::PeakSegFPOP||O(N log N)||O(N log N)||0|
Note that although both implementations are O(N log N) time complexity for N data points, the PeakSegPipeline implementation is slower due to disk read/write overhead.