Negative Binomial cost function #15

Open
diego-urgell opened this issue Jul 28, 2021 · 0 comments

Hi @vrunge!

My name is Diego (@diego-urgell), and I am developing BinSeg, a changepoint analysis package for Google Summer of Code, using the Binary Segmentation algorithm. One of the objectives of this package is to support several distributions, among them Negative Binomial. The gfpop package also implements it, and I wonder if you could please help me out with an issue I found regarding the cost function. I would really appreciate it 😄.

Background

My current implementation of the cost function is the following:

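    // Cost of the segment [start, end] (both ends inclusive) under a Negative Binomial model.
    double costFunction(int start, int end){
        double lSum = this->summaryStatistics->getLinearSum(start, end);        // sum of the observations
        double mean = this->summaryStatistics->getMean(start, end);
        double varN = this->summaryStatistics->getVarianceN(start, end, false); // N times the variance
        int N = end - start + 1;
        if (varN <= 0) return INFINITY;  // constant segment: the moment estimates are undefined
        double var = varN/N;
        // Moment estimates: r = mean^2 / (var - mean), rounded up to an integer; p = mean / var
        double r_dispersion = ceil(fabs(pow(mean, 2)/(var-mean)));
        double p_success = mean/var;
        // Log-likelihood terms that depend on p
        return (lSum * log(1-p_success) + N * r_dispersion * log(p_success));
    }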

As you can see, on every possible segment I compute the overdispersion parameter r, and then the probability of success p by using the following parameterization that depends on the mean and the variance (Lindén and Mäntyniemi, 2011):

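$$r = \frac{\mu^2}{\sigma^2 - \mu}, \qquad p = \frac{\mu}{\sigma^2}$$

where $\mu$ and $\sigma^2$ are the mean and the variance of the segment.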
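As a small worked example of these moment estimates, here is a minimal standalone sketch. `NegBinParams` and `estimateNegBin` are hypothetical names used only for illustration and are not part of BinSeg. The sketch leaves out the `ceil` and `fabs` applied in `costFunction`, and it is only meaningful when the variance exceeds the mean.

```cpp
#include <cassert>
#include <vector>

// Moment-based Negative Binomial estimates for one segment.
struct NegBinParams {
    double r;  // dispersion parameter
    double p;  // probability of success
};

// Hypothetical helper (not part of BinSeg). Uses the population variance
// (divide by N), mirroring var = varN / N in costFunction.
NegBinParams estimateNegBin(const std::vector<double>& x) {
    assert(!x.empty());
    const double n = static_cast<double>(x.size());
    double sum = 0.0, sumSq = 0.0;
    for (double v : x) {
        sum += v;
        sumSq += v * v;
    }
    const double mean = sum / n;
    const double var = sumSq / n - mean * mean;
    // r = mean^2 / (var - mean), p = mean / var
    return { mean * mean / (var - mean), mean / var };
}
```

For the counts {0, 2, 4, 10}, the mean is 4 and the variance is 14, so `estimateNegBin` gives r = 16 / (14 - 4) = 1.6 and p = 4 / 14 ≈ 0.286.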

Then, I calculate the cost using the equation described by Cleynen et al. (2014) (the one used in Segmentor3IsBack).
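Written out, and assuming the probability mass function $P(Y = y) = \binom{y + r - 1}{y} p^r (1 - p)^y$ (an assumption about the convention, chosen because it matches $p = \mu / \sigma^2$), the log-likelihood of a segment $y_s, \dots, y_e$ with $N = e - s + 1$ observations is:

$$\log L(r, p) = \sum_{i=s}^{e} \log \binom{y_i + r - 1}{y_i} + N \, r \, \log p + \Big( \sum_{i=s}^{e} y_i \Big) \log(1 - p)$$

`costFunction` returns the last two terms as written, so it is the log-likelihood itself rather than its negative. The binomial-coefficient term is left out, even though it depends on $r$ and therefore changes from segment to segment when $r$ is re-estimated on each one.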

Question

I tested the results against the gfpop package. Even though both packages produce similar changepoints (not always the same, but that is expected given the different algorithms), there are some differences in the overall cost and the parameter estimates that troubled me.

The first thing I found is that in gfpop the overdispersion parameter is computed at the beginning of the algorithm. Windows of size 100 are set, and the dispersion parameter is computed for each of them. Then, the "average over dispersion" is computed and the whole signal is divided by this value. Why did you prefer to do this, instead of computing an overdispersion parameter for each segment?
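To make sure I am describing the same procedure, here is a rough sketch of how I understand it. This is not gfpop's code: `averageDispersion` is a hypothetical name, and the per-window estimator (the moment estimate mean^2 / (var - mean)), the skipping of windows whose variance does not exceed the mean, and the dropping of a trailing partial window are all my assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a windowed dispersion estimate; NOT gfpop's actual implementation.
// Assumptions: non-overlapping windows, a trailing partial window is ignored,
// each window uses the moment estimate mean^2 / (var - mean), and windows with
// var <= mean are skipped because the estimate is undefined there.
double averageDispersion(const std::vector<double>& y, std::size_t window = 100) {
    assert(window > 0);
    const double w = static_cast<double>(window);
    double total = 0.0;
    std::size_t used = 0;
    for (std::size_t start = 0; start + window <= y.size(); start += window) {
        double sum = 0.0, sumSq = 0.0;
        for (std::size_t i = start; i < start + window; ++i) {
            sum += y[i];
            sumSq += y[i] * y[i];
        }
        const double mean = sum / w;
        const double var = sumSq / w - mean * mean;
        if (var > mean) {
            total += mean * mean / (var - mean);
            ++used;
        }
    }
    // Returns 0 when no window produced a usable estimate.
    return used > 0 ? total / static_cast<double>(used) : 0.0;
}
```

The whole signal would then be divided by the returned value before segmentation, whereas `costFunction` above re-estimates the dispersion on every candidate segment.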

Also, the overall cost of the model computed by gfpop is different from the one computed by my implementation on BinSeg. I believe you are also optimizing the log likelihood, so I wonder why they differ. Could you please share with me your cost function, and the equations you are using to approximate the probability of success parameter?

Thank you so much for your time!!

References:

Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6

Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1
