Negative Binomial cost function #15

diego-urgell · 2021-07-28T06:08:03Z

My name is Diego (@diego-urgell), and I am developing BinSeg, a changepoint analysis package for Google Summer of Code, using the Binary Segmentation algorithm. One of the objectives of this package is to support several distributions, among them Negative Binomial. The gfpop package also implements it, and I wonder if you could please help me out with an issue I found regarding the cost function. I would really appreciate it 😄.

Background

My current implementation of the cost function is the following:

    double costFunction(int start, int end){
        double lSum = this -> summaryStatistics -> getLinearSum(start, end);
        double mean = this -> summaryStatistics -> getMean(start, end);
        double varN = this -> summaryStatistics -> getVarianceN(start, end, false);
        int N = end - start + 1;
        if (varN <= 0) return INFINITY;
        double var = varN/N;
        double r_dispersion = ceil(fabs(pow(mean, 2)/(var-mean)));
        double p_success = mean/var;
        return (lSum * log(1-p_success) + N * r_dispersion * log(p_success));
    }

As you can see, on every candidate segment I compute the overdispersion parameter r and then the probability of success p, using the following parameterization in terms of the segment mean and variance (Lindén and Mäntyniemi, 2011): r = mean^2 / (var - mean) and p = mean / var. In the code above, r is additionally passed through fabs and ceil.

Then, I calculate the cost using the equation described by Cleynen et al. (2014) (the one used in Segmentor3IsBack).
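To make the computation easier to reproduce outside BinSeg, here is a minimal standalone sketch of the same moment-based estimates and cost expression. The names `NegBinMoments`, `estimateNegBin` and `segmentLogLik` are hypothetical and belong to neither BinSeg nor gfpop. Unlike the snippet above, the sketch works directly on a raw vector, skips the `ceil(fabs(...))` step, and treats any segment with var <= mean as not estimable, since mean/var >= 1 there and log(1 - p) is undefined.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container for the per-segment estimates (not a BinSeg or gfpop type).
struct NegBinMoments {
    double r;    // dispersion parameter, mean^2 / (var - mean)
    double p;    // probability of success, mean / var
    double sum;  // sum of the observations in the segment
    double n;    // number of observations in the segment
};

// Method-of-moments estimates for data[start..end] (inclusive bounds),
// using the biased variance (divide by N), as in the snippet above.
// Returns false when the segment is not overdispersed (var <= mean).
bool estimateNegBin(const std::vector<double>& data, std::size_t start,
                    std::size_t end, NegBinMoments& out) {
    const double n = static_cast<double>(end - start + 1);
    double sum = 0.0, sumSq = 0.0;
    for (std::size_t i = start; i <= end; ++i) {
        sum += data[i];
        sumSq += data[i] * data[i];
    }
    const double mean = sum / n;
    const double var = sumSq / n - mean * mean;
    if (var <= mean) return false;  // assumption: reject instead of using fabs
    out.r = mean * mean / (var - mean);
    out.p = mean / var;
    out.sum = sum;
    out.n = n;
    return true;
}

// Same expression as the return statement in the snippet above:
// sum(x) * log(1 - p) + N * r * log(p). Returns +infinity when the
// parameters cannot be estimated, mirroring the INFINITY early return.
double segmentLogLik(const std::vector<double>& data, std::size_t start,
                     std::size_t end) {
    NegBinMoments m{};
    if (!estimateNegBin(data, start, end, m)) {
        return std::numeric_limits<double>::infinity();
    }
    return m.sum * std::log(1.0 - m.p) + m.n * m.r * std::log(m.p);
}
```

For the six values {0, 2, 0, 6, 1, 9}, the mean is 3 and the biased variance is 34/3, so the sketch gives r = 1.08 and p = 9/34.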

Question

I tested the results against the gfpop package. Even though both packages produce similar changepoints (not always the same, but that is expected given the different algorithms), there are some differences in the overall cost and the parameter estimates that troubled me.

The first thing I found is that gfpop computes the overdispersion parameter once, at the beginning of the algorithm. Windows of size 100 are set, and the dispersion parameter is computed for each of them. Then, the "average over dispersion" is computed and the whole signal is divided by this value. Why did you prefer to do this, instead of computing an overdispersion parameter for each segment?

Also, the overall cost of the model computed by gfpop is different from the one computed by my implementation on BinSeg. I believe you are also optimizing the log likelihood, so I wonder why they differ. Could you please share with me your cost function, and the equations you are using to approximate the probability of success parameter?

Thank you so much for your time!!

References:

Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6

Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1
