Hi @vrunge!
My name is Diego (@diego-urgell), and I am developing BinSeg, a changepoint analysis package based on the Binary Segmentation algorithm, for Google Summer of Code. One of the objectives of this package is to support several distributions, among them the Negative Binomial. The gfpop package also implements it, and I wonder if you could please help me out with an issue I found regarding the cost function. I would really appreciate it 😄.
Background
My current implementation of the cost function is the following:
double costFunction(int start, int end) {
    // Summary statistics over the closed segment [start, end]
    double lSum = this->summaryStatistics->getLinearSum(start, end);
    double mean = this->summaryStatistics->getMean(start, end);
    double varN = this->summaryStatistics->getVarianceN(start, end, false);
    int N = end - start + 1;
    if (varN <= 0) return INFINITY;  // constant segment: variance is zero
    double var = varN / N;
    // Estimates of r and p from the segment mean and variance
    double r_dispersion = ceil(fabs(pow(mean, 2) / (var - mean)));
    double p_success = mean / var;
    return lSum * log(1 - p_success) + N * r_dispersion * log(p_success);
}
As you can see, on every possible segment I compute the overdispersion parameter r, and then the probability of success p, by using the following parameterization that depends on the mean and the variance (Lindén and Mäntyniemi, 2011): r = mean^2 / (var - mean) and p = mean / var.
Then, I calculate the cost using the equation described by Cleynen et al. (2014) (the one used in Segmentor3IsBack).
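To make the two steps above concrete, here is a sketch of what the code computes for a segment y_s, ..., y_e with N = e - s + 1 points. The symbols \hat\mu, \hat\sigma^2 and S are shorthand introduced here for `mean`, `var` and `lSum`; they are not notation taken from either paper.

```latex
\begin{align*}
\hat r &= \left\lceil \left| \frac{\hat\mu^{2}}{\hat\sigma^{2} - \hat\mu} \right| \right\rceil,
&
\hat p &= \frac{\hat\mu}{\hat\sigma^{2}},
\\[4pt]
S &= \sum_{t=s}^{e} y_t,
&
\mathrm{cost}(s, e) &= S \log(1 - \hat p) + N \, \hat r \, \log \hat p .
\end{align*}
```

Without the ceiling and absolute value, the two estimates are consistent with each other, since \hat r / (\hat r + \hat\mu) = \hat\mu / \hat\sigma^2 = \hat p. The cost is the part of the Negative Binomial log-likelihood that depends on p; the term \sum_t \log[\Gamma(y_t + r) / (\Gamma(r)\, y_t!)] is not included. Note also that \log(1 - \hat p) is only defined when \hat\sigma^2 > \hat\mu, that is, when the segment is actually overdispersed.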
Question
I tested the results against the gfpop package. Even though both packages produce similar changepoints (not always the same, but that is expected given the different algorithms), there are some differences in the overall cost and the parameter estimates that troubled me.
The first thing I found is that, in gfpop, the overdispersion parameter is computed once at the beginning of the algorithm. Windows of size 100 are set, and the dispersion parameter is computed for each of them. Then, the "average over dispersion" is computed and the whole signal is divided by this value. Why did you prefer to do this, instead of computing an overdispersion parameter for each segment?
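To check that I understood the procedure, here is a minimal sketch of how I read it. This is not gfpop's actual code: the function names are hypothetical, and I am assuming that the per-window dispersion is the variance-to-mean ratio and that only complete windows are used, neither of which I have confirmed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: dispersion of x[begin, end) taken as variance / mean.
// ASSUMPTION: this ratio is my guess, not gfpop's confirmed estimator.
double windowDispersion(const std::vector<double>& x,
                        std::size_t begin, std::size_t end) {
    const double n = static_cast<double>(end - begin);
    double sum = 0.0, sumSq = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += x[i];
        sumSq += x[i] * x[i];
    }
    const double mean = sum / n;
    const double var = sumSq / n - mean * mean;  // biased (divide-by-n) variance
    return mean > 0.0 ? var / mean : 0.0;        // 0.0 marks "not usable"
}

// Average the dispersion over consecutive non-overlapping complete windows.
double averageDispersion(const std::vector<double>& x,
                         std::size_t windowSize = 100) {
    assert(windowSize > 0);
    double total = 0.0;
    std::size_t count = 0;
    for (std::size_t b = 0; b + windowSize <= x.size(); b += windowSize) {
        const double d = windowDispersion(x, b, b + windowSize);
        if (d > 0.0) {  // skip windows with zero mean or zero variance
            total += d;
            ++count;
        }
    }
    return count > 0 ? total / static_cast<double>(count) : 1.0;  // 1.0: no rescaling
}

// Divide the whole signal by the single, global dispersion estimate.
std::vector<double> rescaleSignal(const std::vector<double>& x,
                                  std::size_t windowSize = 100) {
    const double phi = averageDispersion(x, windowSize);
    std::vector<double> out;
    out.reserve(x.size());
    for (double v : x) out.push_back(v / phi);
    return out;
}
```

The difference from my implementation is that the dispersion here is a single global value fixed before segmentation starts, whereas in BinSeg r is re-estimated for every candidate segment.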
Also, the overall cost of the model computed by gfpop is different from the one computed by my implementation in BinSeg. I believe you are also optimizing the log-likelihood, so I wonder why they differ. Could you please share with me your cost function, and the equations you are using to approximate the probability of success parameter?
Thank you so much for your time!!
References:
Cleynen, A., Koskas, M., Lebarbier, E. et al. Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data. Algorithms Mol Biol 9, 6 (2014). https://doi.org/10.1186/1748-7188-9-6
Lindén, A., & Mäntyniemi, S. (2011). Using the negative binomial distribution to model overdispersion in ecological count data. Ecology, 92(7), 1414-1421. doi:10.1890/10-1831.1