
Tweedie log-likelihood returning inf for some values and impacting constrained optimization. Related to the use of scipy's wright_bessel function in computing the likelihood. #9234

Open
diegodebrito opened this issue Apr 26, 2024 · 8 comments

Comments

@diegodebrito

Describe the bug

I ran into these results when running a constrained-optimization GLM using Tweedie(var_power=1.5). I believe this is related to Tweedie's log-likelihood implementation rather than to the constrained optimization itself. Since the likelihood is used when fitting constrained models (instead of iteratively reweighted least squares), this is how I found the potential issue.

I can't share the original modeling dataset (it is quite large and confidential), but I was able to generate some results that illustrated the issue.

When running loglike_obs (first code snippet below), some of the results diverge (inf). Since the optimization uses the sum of these per-observation values as its objective, the sum itself ends up being inf. In the constrained optimization I was trying to run, the first round would start with the objective function = -inf, and the minimizer would be stuck there and essentially do nothing. I ended up with just the start_params as a result and a non-convergence warning.

I used the tweedie package for comparison, and all of its entries have finite values. Notice also that the non-inf values are identical; see the second code snippet.

I was able to track the issue down to the wright_bessel function (see here). It diverges for some entries. I believe the tweedie package implements Dunn and Smyth's method manually instead of using scipy's function, which may be why it works for these cases. I put the relevant part of Statsmodels' source code in a third snippet for your convenience.

Please let me know if there is more data or information I can provide.

Code Sample, a copy-pastable example if possible

import statsmodels.api as sm
import numpy as np

endog = np.array([0, 0, 0, 0, 192.85613765, 7.84301478, 182.15075391, 51.85940469, 39.49500056, 4.07506614, 2.97574021, 92.37706761])
mu = np.array([40.8384544, 40.8384544, 7.26705526, 7.26705526, 192.85613765, 7.26705526, 182.15075391, 40.8384544, 40.8384544, 8.33852205, 2.97574021, 8.33852205])
var_weights = np.array([6.3831906, 0.47627479, 1.1490363, 5.11229578, 13.72221246, 79.00111743, 15.33762454, 29.44406732, 33.02803322, 12.84154581, 6.17631048, 1.73855041])
scale = np.array([1.0])
p = 1.5

tweedie_statsmodels = sm.families.Tweedie(var_power=p)
tweedie_statsmodels.loglike_obs(endog, mu, var_weights=var_weights, scale=scale)
array([-81.58352325,  -6.08726542,  -6.19502375, -27.56291842,
                inf,          inf,          inf,          inf,
                inf,  -7.41592981,  -0.83535226, -58.47804978])

Using Tweedie package

from tweedie import tweedie as td

td(mu=mu, p=p, phi=scale / var_weights).logpdf(endog)
array([-81.58352325,  -6.08726542,  -6.19502375, -27.56291842,
        -3.55638127,  -0.92296862,  -3.45786323,  -8.24816698,
        -2.04396817,  -7.41592981,  -0.83535226, -58.47804978])

Root of the issue (Statsmodels' use of the wright_bessel function). You should be able to run the snippet below after running the code above:

from scipy import special

scale = scale / var_weights
theta = mu ** (1 - p) / (1 - p)
kappa = mu ** (2 - p) / (2 - p)
alpha = (2 - p) / (1 - p)

ll_obs = (endog * theta - kappa) / scale
idx = endog > 0

if np.any(idx):
    if not np.isscalar(endog):
        endog = endog[idx]
    if not np.isscalar(scale):
        scale = scale[idx]
    x = ((p - 1) * scale / endog) ** alpha
    x /= (2 - p) * scale
    wb = special.wright_bessel(-alpha, 0, x)
    ll_obs[idx] += np.log(1/endog * wb)

print(wb)
array([           inf,            inf,            inf,            inf,
                  inf, 2.18259366e+45, 4.16191874e+18, 1.72866117e+29])

The results above are only for endog > 0, so the inf values correspond to the rows with inf in the loglike results.


Expected Output


I'm wondering what the best way to deal with these inf values is. I know that implementing the approximation/normalization from scratch might not be feasible, but I'm curious whether there is a way to handle these inf entries. In my case, I had 10k observations and only 5 had this issue, so even dropping the problematic rows would be an option.
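For what it's worth, a crude stopgap on the user side (my own sketch, not an existing statsmodels option, and it does bias the likelihood by silently discarding observations) would be to mask the non-finite per-observation contributions before summing the objective:

```python
import numpy as np

# Hypothetical per-observation log-likelihoods, some overflowed to inf
ll_obs = np.array([-81.58, np.inf, -6.09, np.inf, -0.84])

finite = np.isfinite(ll_obs)
n_dropped = np.count_nonzero(~finite)  # how many observations we discard
objective = ll_obs[finite].sum()       # finite objective the optimizer can use
print(n_dropped, objective)
```

This only hides the overflow; the proper fix discussed below is to evaluate the Wright Bessel term in the log domain.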

Output of import statsmodels.api as sm; sm.show_versions()


INSTALLED VERSIONS

Python: 3.11.8.final.0

statsmodels

Installed: 0.14.1 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\statsmodels)

Required Dependencies

cython: Not installed
numpy: 1.26.4 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\numpy)
scipy: 1.13.0 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\scipy)
pandas: 2.2.2 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\pandas)
dateutil: 2.9.0.post0 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\dateutil)
patsy: 0.5.6 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\patsy)

Optional Dependencies

matplotlib: 3.8.4 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\matplotlib)
backend: module://matplotlib_inline.backend_inline
cvxopt: Not installed
joblib: 1.4.0 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\joblib)

Developer Tools

IPython: 8.23.0 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\IPython)
jinja2: 3.1.3 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\jinja2)
sphinx: 7.2.6 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\sphinx)
pygments: 2.17.2 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\pygments)
pytest: 8.1.1 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\pytest)
virtualenv: 20.25.1 (C:\Users\VAA5FZ\repos\sagemaker-templates.venv\Lib\site-packages\virtualenv)

@lorentzenchr
Contributor

The first (non-zero) value is x = 1.452e+5 and alpha = -1. Now, wright_bessel(1, 0, 1e5) = 2.356e+275 and np.finfo(np.float64).max = 1.798e+308. So the short answer is that wright_bessel(1, 0, 1.4e5) is simply larger than the largest representable float64.

I think the tweedie package treats it differently and works with the logarithm of Wright's Bessel function, thus avoiding the overflow.
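To illustrate the overflow described above (a quick check with SciPy; for a=1, b=0 the function grows roughly like exp(2*sqrt(x)), so it crosses the float64 ceiling of about exp(709.78) somewhere around x ~ 1.3e5):

```python
import numpy as np
from scipy import special

# Still representable: comparable to the 2.356e+275 value quoted above
big = special.wright_bessel(1.0, 0.0, 1e5)

# Past the float64 ceiling: overflows to inf, which then propagates
# through log() into the per-observation log-likelihood
overflowed = special.wright_bessel(1.0, 0.0, 1.45e5)
print(big, overflowed)
```

Working with the logarithm of the function instead, as the tweedie package effectively does, keeps every intermediate quantity in a representable range.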

@diegodebrito
Author

Thanks for the answer and for opening the Scipy issue, @lorentzenchr

In your opinion, would it be better to wait/push for the Scipy update and then switch to the new function here? Or would it be a good idea to drop wright_bessel and add the more detailed calculation here directly? I noticed that the glum and tweedie packages do the latter. I'm not familiar with Scipy's release cycle, so I'm not sure what the timeline for that feature would be.

Also interested in your input here, @josef-pkt

@lorentzenchr
Contributor

That's a question for @josef-pkt to answer, and it also depends on your expected timeline. I will only pursue the "proper" solution with scipy. Let's see how they respond.

@josef-pkt
Member

(I'm currently stuck in a very different neighborhood of statistics and have not looked at the details for this.)

If there is a workaround that can be added in statsmodels, then it would be good to have it because it would also work with older scipy versions, e.g. we copied some code from scipy to statsmodels.compat to make parts available faster.

However, if there is no good replacement for (compiled) scipy.special functions, then we just have to wait for the fix/enhancement in scipy and the release cycle.

@diegodebrito
Author

Great point, @josef-pkt. An alternative implementation wouldn't limit users to the eventual version of SciPy with the fix. I'm interested in working on this. It would be great to have everything available in Statsmodels without having to rely on the tweedie package.

A question for you both regarding licenses, @josef-pkt @lorentzenchr: can I read code in other packages to help with this, or should I rely on papers only? I think the tweedie and glum licenses are pretty permissive, but I don't want to start looking at the source code if I'm not allowed to.

@josef-pkt
Member

GLUM https://github.com/Quantco/glum is also BSD-3 licensed, so there is no restriction on looking at their code, or even copying parts of it.
@thequackdaddy was the original contributor of tweedie in statsmodels. His package might not have much beyond our code, but I haven't looked at it.

@josef-pkt
Member

related:
the tweedie family is missing get_distribution. We currently do not have a distribution class for it, AFAICS.

@lorentzenchr
Contributor

With scipy/scipy#20646 merged, as of (future) scipy 1.14, statsmodels can replace log(wright_bessel) with log_wright_bessel. This should solve this issue.
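A hedged sketch of what that replacement could look like (log_wright_bessel only exists in SciPy >= 1.14, so this helper is my own illustration with a fallback to log(wright_bessel) on older versions, which is only safe while the function value itself stays finite):

```python
import numpy as np
from scipy import special

def log_wb(a, b, x):
    """Log of Wright's Bessel function; overflow-safe on SciPy >= 1.14."""
    f = getattr(special, "log_wright_bessel", None)
    if f is not None:
        return f(a, b, x)  # evaluates entirely in the log domain
    # Fallback for SciPy < 1.14: overflows to inf for large x
    return np.log(special.wright_bessel(a, b, x))

print(log_wb(1.0, 0.0, 10.0))
```

With this shape, the `np.log(1/endog * wb)` term in loglike_obs would become `-np.log(endog) + log_wb(-alpha, 0, x)`, so no intermediate ever exceeds the float64 range.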
