
adding BetaBernoulli distribution with LogScore #132

Draft
wants to merge 4 commits into base: master

Conversation


@guyko81 guyko81 commented Jun 14, 2020

BetaBernoulli distribution implemented with LogScore, Fisher Information included.

Collaborator

@tonyduan tonyduan left a comment


Hi @guyko81,

Thanks for the PR. I've left some comments.

The main issue is I'm not sure how you derived the Fisher information. In my derivation the Fisher Information is not diagonal, and moreover its entries involve the trigamma function.

I don't have 100% confidence in my derivation though, so if you could paste your derivation it'd be very helpful to double check.

)
return D

def metric(self):
Collaborator

How did you derive this? In my derivation the Fisher Information is not diagonal.

Here's what I have; it'd be great if you could paste your independent derivation so that we can double check.

[Screenshot: Fisher information derivation, 2020-06-19]

Author

I made a mistake by not including the off-diagonal entries. My calculation was based on the definition of the FI matrix as the variance of the score, so I simply squared the gradient (but forgot that it's actually a vector, so the square should be S @ S.T).
Can we use the second derivative instead? Are we relying on this:
Claim: The negative expected Hessian of the log-likelihood is equal to the Fisher Information Matrix F.
https://wiseodd.github.io/techblog/2018/03/11/fisher-information/
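For reference, the equivalence of the two definitions being discussed (FI as the variance of the score vs. FI as the negative expected Hessian) can be checked numerically on a toy case. This is a standalone sketch using a Bernoulli with a log-odds parameter, not ngboost code:

```python
import numpy as np

# Toy check: Fisher information computed as the variance of the score
# equals the negative expected Hessian of the log-likelihood.
# Model (hypothetical example): Bernoulli(y | p), parameter theta = log-odds.
theta = 0.7
p = 1 / (1 + np.exp(-theta))

# log-lik: l(y) = y*theta - log(1 + exp(theta)); score: dl/dtheta = y - p
fi_from_score = (1 - p) * (0 - p) ** 2 + p * (1 - p) ** 2  # E[(y - p)^2]
fi_from_hessian = p * (1 - p)  # -d2l/dtheta2 = p(1 - p), independent of y

print(np.isclose(fi_from_score, fi_from_hessian))  # True
```

In one dimension the "square of the score" and S @ S.T coincide, which is why the off-diagonal issue only shows up with two or more parameters.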

Author

As per my calculation the last row is different; it's
[formula image]
However, when I use that formula the model doesn't work: it raises the singular-matrix error again. When I include the full FI matrix from the variance definition it also raises the singular-matrix error, and when I use the diagonal matrix from the Hessian definition it simply doesn't learn.
So the only working solution is the diagonal matrix from the variance definition. I have no clue why.
If you want to try the different approaches, I have updated the code. All you need to do is comment out the off-diagonal terms, or comment out the other metric definition. For now I'm keeping the working version in my pull request, but please check my code, as I'm not sure I didn't miss something or make a typo again.

ngboost/distns/betabernoulli.py (outdated)

def fit(Y):

def fit_alpha_beta_py(impressions, clicks, alpha0=1.5, beta0=5, niter=1000):
Collaborator

Can we clean this function up a bit? In particular it's not clear what impressions / clicks are supposed to be. If impressions is going to be a vector of ones in all cases maybe we can remove it as an argument?

Also it'd be great if we could apply black formatting.

Author

Cleaned up the function.

ngboost/distns/betabernoulli.py (outdated)
@alejandroschuler
Collaborator

@tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

@tonyduan
Collaborator

@tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

There's an identifiability issue so I'm recommending against implementing the BetaBernoulli. We should implement the BetaBinomial instead. Pasting my earlier comment below:

  1. The FI is equivalent to the negative Hessian, so lines 85-89 below should have a negative sign.

  2. After fixing the above, I can confirm the FI derived from the expected negative Hessian is always equal to the FI derived from the variance of the score (as we would expect).

  3. However, the resulting FI is always singular. This is mathematically correct. It didn't occur to me earlier, but there's actually an identifiability issue here when n=1. Specifically, for any value of (alpha, beta) we can scale both by some scalar to (k alpha, k beta) and it'd result in the same Bernoulli distribution p(x) for x in {0, 1}. When n>1 this is not an issue, since the second parameter controls the dispersion of the resulting Binomial distribution and the model is identifiable. So the Beta-Binomial distribution only makes sense when n>1.

  4. Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.
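The identifiability issue in point 3 is easy to see numerically: scaling (alpha, beta) by any k leaves the n=1 likelihood unchanged. A standalone sketch (the helper name is hypothetical, not PR code):

```python
import math

def beta_bernoulli_pmf(y, a, b):
    # P(Y=y) = B(y + a, 1 - y + b) / B(a, b) for y in {0, 1},
    # where B is the beta function
    B = lambda s, t: math.gamma(s) * math.gamma(t) / math.gamma(s + t)
    return B(y + a, 1 - y + b) / B(a, b)

# (k*a, k*b) gives the same Bernoulli distribution for every k,
# since P(Y=1) = a/(a+b) is scale-invariant
for k in (1.0, 2.5, 10.0):
    print(round(beta_bernoulli_pmf(1, 2.0 * k, 3.0 * k), 6))  # 0.4 each time
```

Because the likelihood cannot distinguish these parameter settings, the Fisher information is singular along the scaling direction, which matches the singular-matrix errors observed above.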

@alejandroschuler
Collaborator

@tonyduan got it, thanks for that. @guyko81 could you reimplement along those lines?

@guyko81
Author

guyko81 commented Jul 16, 2020

I didn't express it clearly, but I proposed this branch as an alternative to logistic regression or any binary classifier. Creating the Beta-Binomial would therefore add no benefit for that specific problem. I'm happy to implement the Beta-Binomial too, but I would still like to keep the Beta-Bernoulli as well, probably in this form.

Can I ask whether it's possible that the diagonal version of the FI matrix still helps find the true Beta-Bernoulli distribution (strictly from a mathematical point of view)?

I mean, when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with the extra info of the uncertainty of the predicted probabilities. Or am I under false confidence based just on the expected value, with the uncertainty being essentially a random value?

@alejandroschuler
Collaborator

I mean, when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with the extra info of the uncertainty of the predicted probabilities. Or am I under false confidence based just on the expected value, with the uncertainty being essentially a random value?

Unfortunately I think it's the latter :(

You can tell this just by doing some math with the probability mass function of the beta-bernoulli, which is

P(Y=y) = B(y+a, 1-y+b)/B(a, b)

where a and b are the two parameters and B is the beta function. Plugging in the two cases y=0 and y=1 and using the properties of the beta and gamma functions we arrive at

P(Y=0) = b/(a+b)
P(Y=1) = a/(a+b)

Call these values p*=a/(a+b) and 1-p* = b/(a+b). From the perspective of model fitting, the likelihood or scoring rule therefore only depends on a single parameter p* that is a function of a and b. i.e. the exact values of a and b only matter insofar as they produce a different value of p*. Note also that p* is the mean of p ~ Beta(a,b).

For any given p*, however, there are an infinite number of values for (a,b) that would produce it. As long as a = bp*/(1-p*), you're fine. Since a and b are what determine the uncertainty in your probability p ~ Beta(a,b), that means that you can have infinitely many distributions, all with different uncertainties on p, all of which are equally good in terms of log likelihood or any other scoring rule. So, yes, you will be able to find values of a,b that give good prediction (i.e. p* is good), but the uncertainty estimates for p are meaningless because there is a model with a different uncertainty (any uncertainty, actually) which is equally good.
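The argument above can be illustrated numerically: pairs (a, b) constrained by a = b p*/(1-p*) share the same likelihood but imply arbitrarily different uncertainty about p ~ Beta(a, b). A standalone sketch (not ngboost code):

```python
# Fix p* = 0.3 and set a = b * p* / (1 - p*). Every choice of b gives the
# same Bernoulli likelihood (the mean of Beta(a, b) is always p*), but an
# arbitrarily different uncertainty (variance) on p ~ Beta(a, b).
p_star = 0.3
for b in (0.7, 7.0, 700.0):
    a = b * p_star / (1 - p_star)
    mean = a / (a + b)                          # always 0.3
    var = a * b / ((a + b) ** 2 * (a + b + 1))  # shrinks as b grows
    print(round(mean, 6), var)
```

All three settings fit the data equally well, so the fitted variance carries no information about the data.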

@ryan-wolbeck
Collaborator

Just a suggestion, but I think it would be nice as part of this PR to add the new distribution to the distribution tests here: https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/tests/test_distns.py

@ryan-wolbeck ryan-wolbeck added the enhancement New feature or request label Jul 21, 2020
@ryan-wolbeck ryan-wolbeck moved this from To do to In progress in Additional Distributions Aug 3, 2020
@alejandroschuler
Collaborator

@guyko81 should we close this PR in light of the issues with this model or are you interested in extending it to the n>1 case that @tonyduan described?

Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.

@guyko81
Author

guyko81 commented Sep 11, 2020

@alejandroschuler I had many things on my plate, but I will find the time to do it in the upcoming days. Please keep it open for a few days and I'll come back with an update.
Thanks

@ryan-wolbeck
Collaborator

@guyko81 just checking in on this pr

@ryan-wolbeck ryan-wolbeck marked this pull request as draft December 30, 2020 17:02
@guyko81
Author

guyko81 commented Jan 12, 2021

I'm struggling to figure out how to handle n (the number of trials) in metric/d_score. The number of trials in BetaBinomial should come from the dataset, so it's not a distribution parameter, but the equations expect it in the FI. Can someone help me out?

this is the metric function:
def metric(self):

it has no n in the parameters

and this is the FI[:, 0, 0], but there is n:
alpha, beta = numpy.exp(logalpha), numpy.exp(logbeta)

# weight shared by every term of the sum (factored out of the original expression)
def w(_k):
    return (
        binomial(n, _k)
        * math.gamma(_k + alpha) * math.gamma(n - _k + beta)
        / math.gamma(n + alpha + beta)
        / math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    )

FI_00 = sum(
    w(_k) * (
        alpha * (polygamma(0, n + alpha + beta) + polygamma(0, alpha)
                 - polygamma(0, _k + alpha) - polygamma(0, alpha + beta))
        + alpha**2 * (polygamma(1, n + alpha + beta) + polygamma(1, alpha)
                      - polygamma(1, _k + alpha) - polygamma(1, alpha + beta))
    )
    for _k in range(n + 1)
)

@alejandroschuler
Collaborator

n is not a continuous parameter so it can't be optimized via gradient descent (without some kind of clever trickery I'm not aware of) and thus cannot be a parameter, as you say. It also can't be less than the maximum value observed in the data, so there are large regions of infinite score. So all said we shouldn't worry about optimizing over n.

That means n has to be passed as a fixed parameter during the initialization of the distribution. The way I'd deal with that is to write a factory function that generates the beta binomial of the given size, i.e.

def beta_binomial(n: int):
    class BetaBinomial:
        n = n  # fixed number of trials, captured from the factory argument
        ...

    return BetaBinomial

this is the way I've implemented the categorical distribution, so you can look there for inspiration. If you need the value of n in the score implementations you can simply access self.n, which references the class attribute n that you set in the factory function.
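A minimal runnable sketch of that factory pattern (illustrative only; the real ngboost distribution classes carry scores, parameters, and a fit method):

```python
def beta_binomial(n: int):
    # n is captured at class-creation time and stored as a class attribute,
    # so score/metric implementations can refer to self.n.
    class BetaBinomial:
        pass

    BetaBinomial.n = n
    return BetaBinomial

# usage: build a distribution class for a fixed trial count, then instantiate
BetaBinomial10 = beta_binomial(10)
dist = BetaBinomial10()
print(dist.n)  # 10
```

Because n lives on the class rather than the instance, it is fixed for the whole boosting run and never enters the gradient computation.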
