BUG/ENH: Probit for continuous data #7210

Open
josef-pkt opened this issue Dec 16, 2020 · 10 comments
Labels
comp-discrete type-bug type-bug-wrong serious bugs that silently return incorrect numbers type-enh

Comments

@josef-pkt
Member

josef-pkt commented Dec 16, 2020

I suspect that Probit only works for binary {0, 1} endog.

If I add some noise to get a continuous endog, then Probit does not produce the same result as the GLM version, but it does for Logit.

I need to go through the math to check, but I think the q = 2*self.endog - 1 trick in loglike, score and similar might not work for continuous endog. (I haven't looked at this in many years)
My guess is that this makes it similar to MNLogit and OrderedModel, where we only compute the choice that has been selected in that observation.

Or there are some other problems related to a noncanonical link for Binomial.

This web page has loglike and score with and without the q trick
https://www.statlect.com/fundamentals-of-statistics/probit-model-maximum-likelihood
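
A minimal sketch of the suspected problem, using only the standard library. The function names (`norm_cdf`, `loglike_q`, `loglike_bernoulli`) and the value of `xb` are made up for illustration and are not the statsmodels implementation. It compares the per-observation log-likelihood written with the q trick against the Bernoulli form that weights both outcomes:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, written with erfc to stay accurate in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def loglike_q(y, xb):
    """Per-observation probit log-likelihood using q = 2*y - 1."""
    q = 2.0 * y - 1.0
    return math.log(norm_cdf(q * xb))

def loglike_bernoulli(y, xb):
    """Bernoulli form: y*log(F(xb)) + (1 - y)*log(F(-xb)), using 1 - F(xb) = F(-xb)."""
    return y * math.log(norm_cdf(xb)) + (1.0 - y) * math.log(norm_cdf(-xb))

xb = 0.8  # hypothetical linear predictor
for y in (0.0, 1.0, 0.3):
    # Binary y gives identical values; fractional y = 0.3 does not.
    print(y, loglike_q(y, xb), loglike_bernoulli(y, xb))
```

For y in {0, 1}, q is -1 or +1 and the two forms coincide. For a fractional y, q * xb just rescales the linear predictor, which is a different objective from the Bernoulli quasi-likelihood.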

@josef-pkt josef-pkt added type-bug type-enh comp-discrete type-bug-wrong serious bugs that silently return incorrect numbers labels Dec 16, 2020
@josef-pkt
Member Author

I'm raising this to a serious bug

We are advertising QMLE, but Probit might return some wrong things without complaining or indication that anything is unusual, or that it requires binary endog.

@josef-pkt
Member Author

I think we can drop the "q" trick if that's the culprit.
In the binary case, there is not much to gain computationally. If we compute prob(y=1 | x), then we also have 1 - prob.

If we have 10 levels/choices in multinomial, then saving 9/10 makes a larger difference, also in terms of memory.

@josef-pkt
Member Author

relevant commits

#2044 changed from binary check to interval check, related to fractional Logit issue #2040
#1978 original PR that added binary check for endog

@josef-pkt
Member Author

josef-pkt commented Dec 16, 2020

on the GLM side missing unit test #4598
Using file search, I don't find any unit tests in genmod tests for Binomial with link specified, i.e. for using a noncanonical, non-logit link.
That's weird.

There are generic or parameterized tests that use family=family(link=link()), which don't show up when searching for `Binomial(` or `Binomial(link`

searching for "probit" finds the parameterized tests for gradient optimization without and with weights.

# class TestGlmBernoulliProbit(CheckModelResultsMixin):
#    pass

@josef-pkt
Member Author

the q trick also assumes a symmetric distribution 1 - F(xb) = F(-xb)
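
A quick check of that symmetry condition, as a sketch under stated assumptions: the helper names are hypothetical, and the complementary log-log CDF is included only as an example of an asymmetric distribution (it is not discussed above).

```python
import math

def norm_cdf(z):
    """Standard normal CDF (probit)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def logistic_cdf(z):
    """Logistic CDF (logit)."""
    return 1.0 / (1.0 + math.exp(-z))

def cauchy_cdf(z):
    """Standard Cauchy CDF."""
    return 0.5 + math.atan(z) / math.pi

def cloglog_cdf(z):
    """Inverse of the complementary log-log link; not symmetric around 0."""
    return 1.0 - math.exp(-math.exp(z))

def symmetry_gap(cdf, z):
    """Absolute difference between 1 - F(z) and F(-z); zero for symmetric F."""
    return abs((1.0 - cdf(z)) - cdf(-z))

for name, cdf in [("probit", norm_cdf), ("logit", logistic_cdf),
                  ("cauchy", cauchy_cdf), ("cloglog", cloglog_cdf)]:
    print(name, symmetry_gap(cdf, 0.5))
```

The gap is zero up to rounding for the three symmetric distributions and clearly nonzero for cloglog, so F(q * xb) would not give 1 - F(xb) at y = 0 for an asymmetric F.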

@josef-pkt
Member Author

josef-pkt commented Dec 16, 2020

maybe we should keep the current q trick for Probit
It looks like it should be numerically more stable, and it avoids getting close to 0 * log(0) at the unlikely choice, i.e. y log(prob) is only evaluated where y = 1, and we don't have 0 * log(prob) in loglike.

score and hessian don't have a log term, but they have division by prob and (1 - prob)

an article that includes recommendation for numerical stability, e.g. also recommends a tail approximation of cdf, pdf terms
Demidenko, Eugene. "Computational aspects of probit model." Mathematical Communications 6, no. 2 (2001): 233-247.
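
A small illustration of the stability argument, assuming a hypothetical large linear predictor `xb = 10`. This is only a sketch of the floating point effect, not the tail approximation from the article and not statsmodels code.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via erfc, which keeps small tail probabilities."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

xb = 10.0  # hypothetical large linear predictor, prob(y=1 | x) is essentially 1

# Naive complement: norm_cdf(10) rounds to exactly 1.0 in double precision,
# so 1 - prob collapses to 0.0 and log(1 - prob) is undefined.
naive_complement = 1.0 - norm_cdf(xb)

# Symmetric form, as the q trick does for y = 0: evaluate the CDF at -xb.
stable_complement = norm_cdf(-xb)

print(naive_complement)
print(stable_complement)
print(math.log(stable_complement))  # finite, unlike log(naive_complement)
```

The same cancellation would show up in the division by (1 - prob) in score and hessian if the complement were formed by subtraction.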

@josef-pkt
Member Author

josef-pkt commented Dec 17, 2020

Looks kind of bad here

Logit also uses the q trick in loglike.
With method="newton", loglike is not used in the optimization, so Logit and GLM-Logit agree in params and cov_params, but not in llf.
Using bfgs causes a convergence warning and slightly different numbers.

Stata 14 has fracreg
agrees with GLM and Logit in params
bse using HC0 agree with Stata using a correction factor
bse = bse_stata * np.sqrt((nobs - 1) / nobs)
(I guess this is one of the disagreements between Stata and our GLM cov_type, correction factors are used in unit tests. There should be an option for small sample corrections.)
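
The correction factor above can be written as a pair of helpers. The function names are hypothetical, and the numbers in the usage lines are made up for illustration:

```python
import math

def stata_to_hc0_bse(bse_stata, nobs):
    """Rescale a Stata robust standard error to the HC0 scale."""
    return bse_stata * math.sqrt((nobs - 1) / nobs)

def hc0_to_stata_bse(bse_hc0, nobs):
    """Inverse rescaling: from the HC0 scale to Stata's small sample scale."""
    return bse_hc0 * math.sqrt(nobs / (nobs - 1))

# Made-up values: the factor shrinks toward 1 as nobs grows.
print(stata_to_hc0_bse(0.25, 50))
print(stata_to_hc0_bse(0.25, 5000))
```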

llf loglike differs between Stata and GLM, and Logit is again different.

Note, for interval data we only have QMLE, so loglike and llf cannot be used for inference anyway.

@josef-pkt
Member Author

After fixing precision problems in GLM probit link second derivative #1878, GLM now agrees with Stata fracreg for both Logit and Probit also in standard errors bse (except for the correction factor in the default).

@josef-pkt
Member Author

josef-pkt commented Dec 17, 2020

a picture for fun,

using GLM for fractional regression
endog has a bit of noise added to make it interval data
orange is cauchy link
blue is probit link

cauchy seems to predict better based on the plot
but HC0 p-values for cauchy are very large, none significant. Maybe there is still a bug in my changes to CDFLinks
loglike with cauchy link is larger than with logit and probit.
no aic in summary, and I would like tic and imratio here (waiting for #7166 )

[image: GLM fractional regression fit, cauchy link in orange, probit link in blue]

Stata and SAS also don't have cauchy link, AFAICS

@josef-pkt
Member Author

removing prio high
the PR that adds an exception for continuous data in probit has been merged, #7229

Whether we extend Probit to continuous data QMLE is still an open question.
I'm now leaning toward not supporting it in discrete Probit: after the probit link improvements in #7226, GLM should be a good alternative to discrete Probit that already allows for continuous data.
