Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding BetaBernoulli distribution with LogScore #132

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

guyko81
Copy link

@guyko81 guyko81 commented Jun 14, 2020

BetaBernoulli distribution implemented with LogScore, Fisher Information included.

Copy link
Collaborator

@tonyduan tonyduan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @guyko81,

Thanks for the PR. I've left some comments.

The main issue is I'm not sure how you derived the Fisher information. In my derivation the Fisher Information is not diagonal, and moreover its entries involve the trigamma function.

I don't have 100% confidence in my derivation though, so if you could paste your derivation it'd be very helpful to double check.

)
return D

def metric(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you derive this? In my derivation the Fisher Information is not diagonal.

Here's what I have; it'd be great if you could paste your independent derivation so that we can double check.

Screen Shot 2020-06-19 at 4 04 26 PM

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a mistake not including the other diagonal. My calculation was based on the definition of the FI matrix as the variance of the score. Therefore I just simply squared the gradient (but forgot that it's actually a vector and the square should be S*S.T).
Can we use the double derivative? Are we using this:
Claim: The negative expected Hessian of log likelihood is equal to the Fisher Information Matrix F.
https://wiseodd.github.io/techblog/2018/03/11/fisher-information/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my calculation the last row is different, it's
formula
However when I use that formula the model doesn't work - it drops the singular matrix error again. And when I include the full FI matrix from the variance definition it also drops the singular matrix error. When I use the diagonal matrix from the Hessian definition it simply doesn't learn.
So the only working solution is the diagonal matrix from the variance definition. I have no clue why.
If you want to try the different approaches I have updated the code. All you need to do is to comment out the other diagonal or to comment out the other metric definition. For now I keep the working version as my pull request, but please check my code as I'm not sure I didn't miss something or didn't do a typo again.

ngboost/distns/betabernoulli.py Outdated Show resolved Hide resolved

def fit(Y):

def fit_alpha_beta_py(impressions, clicks, alpha0=1.5, beta0=5, niter=1000):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clean this function up a bit? In particular it's not clear what impressions / clicks are supposed to be. If impressions is going to be a vector of ones in all cases maybe we can remove it as an argument?

Also it'd be great if we could apply black formatting.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleared the function

ngboost/distns/betabernoulli.py Outdated Show resolved Hide resolved
@alejandroschuler
Copy link
Collaborator

@tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

@tonyduan
Copy link
Collaborator

@tonyduan are you planning on doing a re-review here? I see some relevant changes have been made.

There's an identifiability issue so I'm recommending against implementing the BetaBernoulli. We should implement the BetaBinomial instead. Pasting my earlier comment below:

  1. The FI is equivalent to the negative Hessian, so lines 85-89 below should have a negative sign.

  2. After fixing the above, I can confirm the FI derived from the expected negative Hessian is always equal to the FI derived from the variance of the score (as we would expect).

  3. However, the resulting FI is always singular. This is mathematically correct. It didn't occur to me earlier, but there's actually an identifiability issue here when n=1. Specifically, for any value of (alpha, beta) we can scale both by some scalar (k alpha, k beta) and it'd result in the same Bernoulli distribution p(x) for x in (0, 1). When n>1 this is not an issue since the second parameter controls the dispersion of the resulting Binomial distribution and model is identifiable. So the Beta-Binomial distribution only makes sense when n>1.

  4. Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.

@alejandroschuler
Copy link
Collaborator

@tonyduan got it, thanks for that. @guyko81 could you reimplement along those lines?

@guyko81
Copy link
Author

guyko81 commented Jul 16, 2020

I didn't express it clearly but I proposed this branch as an alternative for logistic regression or any binary classifier. Therefore creating the Beta-Binomial would add no benefit for that specific problem. Although I'm happy to implement the Beta-Binomial too, but I would still like to keep the Beta-Bernoulli as well, probably in this form.

Can I ask whether it's possible that the diagonal version of the FI matrix still helps finding the true Beta-Bernoulli distribution (strictly from mathematical point of view)?

I mean when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with an extra info of the uncertainty of the predicted probabilities. Or I'm in a false confidence just based on the expected value, and the uncertainty was just a random value?

@alejandroschuler
Copy link
Collaborator

I mean when I ran the model it gave reasonable results. I tested it on Kaggle and the model was at an acceptable level (around the random forest result), but obviously with an extra info of the uncertainty of the predicted probabilities. Or I'm in a false confidence just based on the expected value, and the uncertainty was just a random value?

Unfortunately I think it's the latter :(

You can tell this just by doing some math with the probability mass function of the beta-bernoulli, which is

P(Y=y) = B(y+a, 1-y+b)/B(a+b)

where a and b are the two parameters and B is the beta function. Plugging in the two cases y=0 and y=1 and using the properties of the beta and gamma functions we arrive at

P(Y=0) = b/(a+b)
P(Y=1) = a/(a+b)

Call these values p*=a/(a+b) and 1-p* = b/(a+b). From the perspective of model fitting, the likelihood or scoring rule therefore only depends on a single parameter p* that is a function of a and b. i.e. the exact values of a and b only matter insofar as they produce a different value of p*. Note also that p* is the mean of p ~ Beta(a,b).

For any given p*, however, there are an infinite number of values for (a,b) that would produce it. As long as a = bp*/(1-p*), you're fine. Since a and b are what determine the uncertainty in your probability p ~ Beta(a,b), that means that you can have infinitely many distributions, all with different uncertainties on p, all of which are equally good in terms of log likelihood or any other scoring rule. So, yes, you will be able to find values of a,b that give good prediction (i.e. p* is good), but the uncertainty estimates for p are meaningless because there is a model with a different uncertainty (any uncertainty, actually) which is equally good.

@ryan-wolbeck
Copy link
Collaborator

Just a suggestion but I think it would be nice as part of this PR to add to the distns test here for the new dist: https://github.com/stanfordmlgroup/ngboost/blob/master/ngboost/tests/test_distns.py

@ryan-wolbeck ryan-wolbeck added the enhancement New feature or request label Jul 21, 2020
@alejandroschuler
Copy link
Collaborator

@guyko81 should we close this PR in light of the issues with this model or are you interested in extending it to the n>1 case that @tonyduan described?

Moving forward we should still implement the Beta-Binomial distribution for n>1 (an implementation is provided by [GAMLSS], for example). It most likely makes more sense to follow their implementation using a location-scale parameterization instead of an alpha-beta parameterization. Taking the Hessian seems like it'd be annoying so I'd recommend following your variance derivation.

@guyko81
Copy link
Author

guyko81 commented Sep 11, 2020

@alejandroschuler I had many things on my plate, but I will find the time to do it in the upcoming days. Please keep it open for few days and I come back with an update.
Thanks

@ryan-wolbeck
Copy link
Collaborator

@guyko81 just checking in on this pr

@ryan-wolbeck ryan-wolbeck marked this pull request as draft December 30, 2020 17:02
@guyko81
Copy link
Author

guyko81 commented Jan 12, 2021

I struggle to figure out how to handle n (trials) in the metric/d_score. The number of trials in BetaBinomial should come from the dataset, therefore it's not a parameter, but the equations expect them in the FI. Can someone help me out?

this is the metric function:
def metric(self):

it has no n in the parameters

and this is the FI[:, 0, 0], but there is n:
(sum(-numpy.exp(logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(0, _k +numpy.exp(logalpha))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(-numpy.exp(logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(0,numpy.exp(logalpha) +numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(numpy.exp(logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(0, n +numpy.exp(logalpha) +numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(numpy.exp(logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(0,numpy.exp(logalpha))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(-numpy.exp(2*logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(1, _k +numpy.exp(logalpha))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(-numpy.exp(2*logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(1,numpy.exp(logalpha) +numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(numpy.exp(2*logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(1, n +numpy.exp(logalpha) +numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))) + (sum(numpy.exp(2*logalpha)*math.gamma(_k + numpy.exp(logalpha))*math.gamma(-_k + n + numpy.exp(logbeta))/math.gamma(n + numpy.exp(logalpha) + numpy.exp(logbeta))*binomial(n, _k)*polygamma(1,numpy.exp(logalpha))/math.gamma(numpy.exp(logalpha))*math.gamma(numpy.exp(logbeta))/math.gamma(numpy.exp(logalpha) + numpy.exp(logbeta)) for _k in range(0, n+1))))

@alejandroschuler
Copy link
Collaborator

n is not a continuous parameter so it can't be optimized via gradient descent (without some kind of clever trickery I'm not aware of) and thus cannot be a parameter, as you say. It also can't be less than the maximum value observed in the data, so there are large regions of infinite score. So all said we shouldn't worry about optimizing over n.

That means n has to be passed as a fixed parameter during the initialization of the distribution. The way I'd deal with that is to write a factory function that generates the beta binomial of the given size, i.e.

beta_binomial(n: int) -> BetaBinomial
    class BetaBinomial
        n = n
        ...

    return BetaBinomial

this is the way I've implemented the categorical distribution so you can look there for inspiration. If you need the value of n in the score implementations you can simply call self.n and it will reference the class attribute n that you set in the factory function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants