
ENH: override halfnorm fit #18760

Merged (4 commits, Jun 30, 2023)

Conversation

dschmitz89 (Contributor)

Reference issue

Part of #17832 and #11782

What does this implement/fix?

Adds an analytical fitting method for the half-normal distribution. Compared to most other distributions, this is low-hanging fruit: the default numerical fit method works, but it is slow compared to the simple analytical result.

Additional information

Math

$$ p(x|\mu,\sigma)=\frac{\sqrt{2}}{\sigma\sqrt{\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\ge\mu $$
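The density above can be checked numerically against scipy's implementation; a quick sketch (the helper `halfnorm_pdf` is written here only for illustration):

```python
import numpy as np
from scipy import stats

def halfnorm_pdf(x, mu, sigma):
    # Density of the half-normal distribution for x >= mu, zero otherwise.
    x = np.asarray(x, dtype=float)
    out = np.sqrt(2) / (sigma * np.sqrt(np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
    return np.where(x >= mu, out, 0.0)

# Agrees with scipy's parameterization: loc = mu, scale = sigma.
x = np.linspace(0.5, 5, 9)
mu, sigma = 0.5, 1.3
assert np.allclose(halfnorm_pdf(x, mu, sigma),
                   stats.halfnorm.pdf(x, loc=mu, scale=sigma))
```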

The log likelihood for the parameters $\mu$, $\sigma$ given data ($x_1,...,x_n$) is then

$$ \ln p(\mu,\sigma|x_1,...,x_n)=n\left(\tfrac{1}{2}\ln(2)-\tfrac{1}{2}\ln(\pi)-\ln(\sigma)\right)-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}$$

As $x_i>\mu$ on the support, the log likelihood is a monotonically increasing function of $\mu$. The maximum likelihood is therefore attained at $\mu=x_{min}$: for any larger $\mu$, the minimal sample value $x_{min}$ would lie outside the support of the distribution (similar to the half-logistic distribution in #18695).

Setting the derivative with respect to $\sigma$ to zero yields the usual variance formula, which depends on the location:

$$ \begin{align} \frac{\partial\ln p}{\partial \sigma} &= -\frac{n}{\sigma}+\sum_{i=1}^n\frac{(x_i-\mu)^2}{\sigma^3}=0 \\ \sigma & = \sqrt{\frac{\sum_i(x_i-\mu)^2}{n}} \end{align} $$
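Taken together, the two results give a closed-form fit. A hedged sketch of the estimator (illustrative variable names, not the PR's exact code):

```python
import numpy as np
from scipy import stats

# Draw a half-normal sample with known (illustrative) parameters.
rng = np.random.default_rng(12345)
data = stats.halfnorm.rvs(loc=2.0, scale=1.5, size=10_000, random_state=rng)

# MLE from the derivation above:
# location = sample minimum; scale = RMS deviation from that location.
loc_mle = np.min(data)
scale_mle = np.sqrt(np.mean((data - loc_mle)**2))

print(loc_mle, scale_mle)  # should land close to the true (2.0, 1.5)
```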

Fit results for free parameters


import numpy as np
from scipy import stats
from time import perf_counter
import matplotlib.pyplot as plt
import os

rng = np.random.default_rng(2433006235492611347)
qrng = stats.qmc.Halton(6, seed=rng)

N = 1000
times_pr = np.full(N, np.nan)
times_super = np.full(N, np.nan)
nllfs_super = np.full(N, np.nan)
nllfs_pr = np.full(N, np.nan)

distribution = stats.halfnorm

for i in range(N):
    print(i)
    sample = qrng.random()
    sample = stats.qmc.scale(sample, [-3, -3, -3, 1, -3, -1], [2, 3, 3, 5, 2, 1])[0]
    _, loc, scale, size, floc = 10**sample[:5]
    size = int(size)

    dist = distribution(loc=loc, scale=scale)
    rvs = dist.rvs(size=size, random_state=rng)

    try:
        tic = perf_counter()
        loc_fit, scale_fit = distribution.fit(rvs)
        toc = perf_counter()
        nllf_pr = distribution.nnlf((loc_fit, scale_fit), rvs)
        nllfs_pr[i] = nllf_pr
        times_pr[i] = toc - tic

    except Exception as e:
        print(f"Failure: {e}")

    try:
        tic = perf_counter()
        loc_fit, scale_fit = distribution.fit(rvs, superfit=True)
        toc = perf_counter()
        nllf_super = distribution.nnlf((loc_fit, scale_fit), rvs)
        nllfs_super[i] = nllf_super
        times_super[i] = toc - tic

    except Exception as e:
        print(f"Failure: {e}")

nllf_diff = (nllfs_pr - nllfs_super)/np.abs(nllfs_super)
nllf_diff = nllf_diff[np.isfinite(nllf_diff)]
nllf_diff_good = -nllf_diff[nllf_diff < 0]
nllf_diff_bad = nllf_diff[nllf_diff > 0]

bins = np.arange(-6, 1.5, 0.25)
plt.hist(np.log10(times_pr), bins=bins, alpha=0.5)
plt.hist(np.log10(times_super), bins=bins, alpha=0.5)
plt.legend(('PR', 'main'))
plt.xlabel('log10 of fit execution time (s)')
plt.ylabel('Frequency')
plt.title('Histogram of log10 of fit execution time (s), free')
plt.savefig("Time_free.png")

plt.clf()
plt.hist(np.log10(nllf_diff_good), label="good", alpha=0.5)
plt.hist(np.log10(nllf_diff_bad), label="bad", alpha=0.5)
plt.xlabel('log10 of NLLF differences')
plt.ylabel('Frequency')
plt.title('Histogram of log10 of NLLF differences, free')
plt.legend()
plt.savefig("NNLF_Differences_free.png")

[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, free parameters]

Fit results for fixed location
[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, fixed location]

Fit results for fixed scale
[Figures: histograms of log10 fit execution time (PR vs. main) and of log10 NLLF differences, fixed scale]

@dschmitz89 added the enhancement (A new feature or improvement) label Jun 26, 2023
@mdhaber (Contributor) left a comment:

I haven't run this locally yet, but the math is good.
What happens when floc is greater than the minimum of the data?

@dschmitz89 (Contributor, author):
I haven't run this locally yet, but the math is good. What happens when floc is greater than the minimum of the data?

Good point. The minimal sample would be outside of the support of the distribution; the likelihood of observing that data point would be 0. Should we throw an error?

@mdhaber (Contributor) commented Jun 28, 2023:

I think so. In pareto.fit, for example:

# ensure that any fixed parameters don't violate constraints of the
# distribution before continuing.
if floc is not None and np.min(data) - floc < (fscale or 0):
    raise FitDataError("pareto", lower=1, upper=np.inf)

@dschmitz89 (Contributor, author):

I think so. In pareto.fit, for example:

# ensure that any fixed parameters don't violate constraints of the
# distribution before continuing.
if floc is not None and np.min(data) - floc < (fscale or 0):
    raise FitDataError("pareto", lower=1, upper=np.inf)

Ok, I will also add that for the halflogistic distribution.
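For halfnorm, the analogous guard might look like the sketch below. `FitDataError` is SciPy-internal, so this standalone version raises a plain `ValueError` instead, and the helper name `check_halfnorm_floc` is made up for illustration:

```python
import numpy as np

def check_halfnorm_floc(data, floc):
    # A fixed location above the sample minimum would put min(data)
    # outside the support [floc, inf), making the likelihood zero.
    if floc is not None and np.min(data) < floc:
        raise ValueError("The data must all be >= floc for halfnorm.")

data = np.array([0.3, 1.2, 2.5])
check_halfnorm_floc(data, floc=0.0)  # fine: all data within support

try:
    check_halfnorm_floc(data, floc=0.5)
except ValueError as err:
    print(err)  # fitting is impossible with this fixed location
```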

@mdhaber mdhaber merged commit 093f1b2 into scipy:main Jun 30, 2023
23 of 24 checks passed
@mdhaber (Contributor) left a comment:
Confirmed the math and compared the implementation against it. Confirmed the speed and accuracy improvements locally. Confirmed that the appropriate error messages are raised when floc is greater than the minimum of the data and when scale and location are both fixed. Tests look appropriate.

Labels: enhancement (A new feature or improvement), scipy.stats

4 participants