
ENH: override halfnorm fit #18760

Merged (4 commits, Jun 30, 2023)

Conversation

dschmitz89 (Contributor)

Reference issue

Part of #17832 and #11782

What does this implement/fix?

Adds an analytical fitting method for the half-normal distribution. Compared to most other distributions, this is low-hanging fruit: the default numerical fit method works, but it is slow compared to the simple analytical result.

Additional information

Math

$$ p(x|\mu,\sigma)=\frac{\sqrt{2}}{\sigma\sqrt{\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\ge\mu $$
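The density above can be checked numerically against scipy's implementation; a quick sketch (the helper `halfnorm_pdf` is written here only for illustration):

```python
import numpy as np
from scipy import stats

def halfnorm_pdf(x, mu, sigma):
    # Density of the half-normal distribution for x >= mu, zero otherwise.
    x = np.asarray(x, dtype=float)
    out = np.sqrt(2) / (sigma * np.sqrt(np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
    return np.where(x >= mu, out, 0.0)

# Agrees with scipy's parameterization: loc = mu, scale = sigma.
x = np.linspace(0.5, 5, 9)
mu, sigma = 0.5, 1.3
assert np.allclose(halfnorm_pdf(x, mu, sigma),
                   stats.halfnorm.pdf(x, loc=mu, scale=sigma))
```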

The log likelihood for the parameters $\mu$, $\sigma$ given data ($x_1,...,x_n$) is then

$$ \ln p(\mu,\sigma|x_1,...,x_n)=n\left(\tfrac{1}{2}\ln(2)-\tfrac{1}{2}\ln(\pi)-\ln(\sigma)\right)-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}$$

As $x_i>\mu$ on the support, the log likelihood is a monotonically increasing function of $\mu$. The maximum likelihood is therefore attained at $\mu=x_{min}$: for any larger $\mu$, the minimal sample value $x_{min}$ would lie outside the support of the distribution (similar to the half-logistic distribution in #18695).

Setting the derivative with respect to $\sigma$ to zero yields the usual variance formula, which depends on the location:

$$ \begin{align} \frac{\partial\ln p}{\partial \sigma} &= -\frac{n}{\sigma}+\sum_{i=1}^n\frac{(x_i-\mu)^2}{\sigma^3}=0 \\ \sigma & = \sqrt{\frac{\sum_i(x_i-\mu)^2}{n}} \end{align} $$
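Taken together, the two results give a closed-form fit. A hedged sketch of the estimator (illustrative variable names, not the PR's exact code):

```python
import numpy as np
from scipy import stats

# Draw a half-normal sample with known (illustrative) parameters.
rng = np.random.default_rng(12345)
data = stats.halfnorm.rvs(loc=2.0, scale=1.5, size=10_000, random_state=rng)

# MLE from the derivation above:
# location = sample minimum; scale = RMS deviation from that location.
loc_mle = np.min(data)
scale_mle = np.sqrt(np.mean((data - loc_mle)**2))

print(loc_mle, scale_mle)  # should land close to the true (2.0, 1.5)
```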

Fit results for free parameters


import numpy as np
from scipy import stats
from time import perf_counter
import matplotlib.pyplot as plt
import os

rng = np.random.default_rng(2433006235492611347)
qrng = stats.qmc.Halton(6, seed=rng)

N = 1000
times_pr = np.full(N, np.nan)
times_super = np.full(N, np.nan)
nllfs_super = np.full(N, np.nan)
nllfs_pr = np.full(N, np.nan)

distribution = stats.halfnorm

for i in range(N):
    print(i)
    sample = qrng.random()
    sample = stats.qmc.scale(sample, [-3, -3, -3, 1, -3, -1], [2, 3, 3, 5, 2, 1])[0]
    _, loc, scale, size, floc = 10**sample[:5]
    size = int(size)

    dist = distribution(loc=loc, scale=scale)
    rvs = dist.rvs(size=size, random_state=rng)

    try:
        tic = perf_counter()
        loc_fit, scale_fit = distribution.fit(rvs)
        toc = perf_counter()
        nllf_pr = distribution.nnlf((loc_fit, scale_fit), rvs)
        nllfs_pr[i] = nllf_pr
        times_pr[i] = toc - tic

    except Exception as e:
        print(f"Failure: {e}")

    try:
        tic = perf_counter()
        loc_fit, scale_fit = distribution.fit(rvs, superfit=True)
        toc = perf_counter()
        nllf_super = distribution.nnlf((loc_fit, scale_fit), rvs)
        nllfs_super[i] = nllf_super
        times_super[i] = toc - tic

    except Exception as e:
        print(f"Failure: {e}")

nllf_diff = (nllfs_pr - nllfs_super)/np.abs(nllfs_super)
nllf_diff = nllf_diff[np.isfinite(nllf_diff)]
nllf_diff_good = -nllf_diff[nllf_diff < 0]
nllf_diff_bad = nllf_diff[nllf_diff > 0]

bins = np.arange(-6, 1.5, 0.25)
plt.hist(np.log10(times_pr), bins=bins, alpha=0.5)
plt.hist(np.log10(times_super), bins=bins, alpha=0.5)
plt.legend(('PR', 'main'))
plt.xlabel('log10 of fit execution time (s)')
plt.ylabel('Frequency')
plt.title('Histogram of log10 of fit execution time (s), free')
plt.savefig("Time_free.png")

plt.clf()
plt.hist(np.log10(nllf_diff_good), label="good", alpha=0.5)
plt.hist(np.log10(nllf_diff_bad), label="bad", alpha=0.5)
plt.xlabel('log10 of NLLF differences')
plt.ylabel('Frequency')
plt.title('Histogram of log10 of NLLF differences, free')
plt.legend()
plt.savefig("NNLF_Differences_free.png")

[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, free parameters]

Fit results for fixed location
[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, fixed location]

Fit results for fixed scale
[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, fixed scale]

@dschmitz89 added the enhancement (A new feature or improvement) label Jun 26, 2023
@mdhaber (Contributor) left a comment:

I haven't run this locally yet, but the math is good.
What happens when floc is greater than the minimum of the data?

@dschmitz89 (Contributor, author):
I haven't run this locally yet, but the math is good. What happens when floc is greater than the minimum of the data?

Good point. The minimal sample would be outside of the support of the distribution; the likelihood of observing that data point would be 0. Should we throw an error?

@mdhaber (Contributor) commented Jun 28, 2023:

I think so. In pareto.fit, for example:

# ensure that any fixed parameters don't violate constraints of the
# distribution before continuing.
if floc is not None and np.min(data) - floc < (fscale or 0):
    raise FitDataError("pareto", lower=1, upper=np.inf)

@dschmitz89 (Contributor, author):

I think so. In pareto.fit, for example:

# ensure that any fixed parameters don't violate constraints of the
# distribution before continuing.
if floc is not None and np.min(data) - floc < (fscale or 0):
    raise FitDataError("pareto", lower=1, upper=np.inf)

Ok, I will also add that for the halflogistic distribution.
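For halfnorm, the analogous guard might look like the sketch below. `FitDataError` is SciPy-internal, so this standalone version raises a plain `ValueError` instead, and the helper name `check_halfnorm_floc` is made up for illustration:

```python
import numpy as np

def check_halfnorm_floc(data, floc):
    # A fixed location above the sample minimum would put min(data)
    # outside the support [floc, inf), making the likelihood zero.
    if floc is not None and np.min(data) < floc:
        raise ValueError("The data must all be >= floc for halfnorm.")

data = np.array([0.3, 1.2, 2.5])
check_halfnorm_floc(data, floc=0.0)  # fine: all data within support

try:
    check_halfnorm_floc(data, floc=0.5)
except ValueError as err:
    print(err)  # fitting is impossible with this fixed location
```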

@mdhaber mdhaber merged commit 093f1b2 into scipy:main Jun 30, 2023
23 of 24 checks passed
@mdhaber (Contributor) left a comment:
Confirmed the math and compared the implementation against it. Confirmed the speed and accuracy improvements locally. Confirmed that the appropriate error messages are raised when floc is greater than the minimum of the data and when scale and location are both fixed. Tests look appropriate.

Labels: enhancement (A new feature or improvement), scipy.stats

4 participants