Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameter estimation for : Left skewed distributions #13071

Closed
digitalgemba opened this issue Nov 12, 2020 · 8 comments
Closed

Parameter estimation for : Left skewed distributions #13071

digitalgemba opened this issue Nov 12, 2020 · 8 comments
Labels
query A question or suggestion that requires further information scipy.stats

Comments

@digitalgemba
Copy link

I was going through the code for def fit(self, data, *args, **kwds):, Does the optimisation differentiate between negative and positive skew distribution ?

if yes , please let me know the line link and ignore further.

If not , for a data as below,

temp=([0.035, 0.038, 0.054, 0.049, 0.052, 0.035, 0.035, 0.05 , 0.042, 0.034, 0.052, 0.035, 0.034, 0.037, 0.052, 0.035, 0.041, 0.035, 0.039, 0.04 , 0.04 , 0.043, 0.043, 0.035, 0.037, 0.034, 0.036, 0.037, 0.035, 0.032, 0.034, 0.038, 0.031, 0.05 , 0.042, 0.034, 0.036, 0.037, 0.055, 0.034, 0.036, 0.037, 0.035, 0.055, 0.044,0.04 , 0.031, 0.035, 0.038, 0.054])

print(skewtest(data)[0])
-3.4412103891299326

For Rayleigh Distribution,

rayleigh.fit(data)
(0.01122619903412441, 0.025047969058503067)

This does not make sense to me as this is case of 0 to loc , rather than 0 to Inf.

I have written the following, and I am struggling with ._fitstart .

def func(params):
     n_bad=[]
    loc,scale=params
    data1=(loc-data)/(scale)
    n_log_scale = (len(data)) *log(scale)
    logpdf = rayleigh._logpdf(data1)
    finite_logpdf = np.isfinite(logpdf)
    n_bad += np.sum(~finite_logpdf)
    if n_bad > 0:
        penalty = n_bad * log(np.finfo(float).max) * 100
        return -np.sum(logpdf[finite_logpdf], axis=0)+n_log_scale+penalty
    return -np.sum(logpdf, axis=0)+n_log_scale

x0=rayleigh._fitstart(max(data)*1.01+data)
res=optimize.fmin(func,x0)
array([0.0602275 , 0.01192598])

Please comment.

@AtsushiSakai AtsushiSakai added scipy.stats query A question or suggestion that requires further information labels Nov 13, 2020
@mdhaber
Copy link
Contributor

mdhaber commented Nov 25, 2020

@digitalgemba

Does the optimisation differentiate between negative and positive skew distribution

To my recollection, no. The generic optimization does not check the skew and change the procedure based on that, if that's what you're asking. Should it?

I'm not sure I can help with the rest, but I'll try.

There is data in a variable temp, but then there is rayleigh.fit(data). Is data the same as temp?

This does not make sense to me as this is case of 0 to loc , rather than 0 to Inf.

Can you elaborate on what you are expecting and why what you get does not match your expectation? I don't understand what is not making sense. We may not be as familiar with the distribution as you are, so it is not obvious from the numbers what is wrong.

I have written the following, and I am struggling with ._fitstart

_fitstart is a private method. You are welcome to use it, but we can't support its usage. If there is a bug, let us know.

You may be interested in gh-12097. It added an analytical formula for the scale parameter. It will be available when SciPy 1.6 is released, or you are welcome to build SciPy from master now. You might also be interested in this reference; see the fitting information in Chapter 39. The other first-order condition for Rayleigh, which you can use to get the location if you know the scale, is
image

In master, the fit returned by rayleigh.fit satisfies the first-order conditions very accurately.

Please also check how rayleigh is defined in scipy.stats. It might be different from what you are expecting.

@mdhaber
Copy link
Contributor

mdhaber commented Nov 30, 2020

@digitalgemba I'll go ahead and close this, but if you think there is a bug in master or if you're requesting some other feature, please let me know.

@mdhaber mdhaber closed this as completed Nov 30, 2020
@digitalgemba
Copy link
Author

The data below , is skewed

data=([0.038, 0.042, 0.012, 0.044, 0.049, 0.052, 0.054, 0.044, 0.058,0.039, 0.054, 0.043, 0.057, 0.058, 0.052, 0.051, 0.046, 0.049,0.052, 0.052, 0.051, 0.053, 0.044, 0.037, 0.04 , 0.032, 0.037,0.042, 0.045, 0.042, 0.041, 0.035, 0.052, 0.051, 0.052, 0.054,0.055, 0.052, 0.051, 0.042, 0.047, 0.032, 0.033, 0.036, 0.041,0.051, 0.046, 0.047, 0.042, 0.052])

skewtest(data)

SkewtestResult(statistic=-3.4412103891299326, pvalue=0.0005791180499256706)

image

The parameters does not reflect actual data .

How do we address such issues ?

@mdhaber
Copy link
Contributor

mdhaber commented Dec 3, 2020

I want to make sure I understand your issue. Let me try to restate it:
Your data is skewed. You try to fit a distribution to the data using the fit method, but the fitted distribution is not skewed.
Is that correct?

@digitalgemba
Copy link
Author

Yes.

@mdhaber
Copy link
Contributor

mdhaber commented Dec 3, 2020

Ok, let me try fitting your data. Which distribution do you want me to try? Rayleigh, lognorm, or something else? In your original post you mentioned Rayleigh, but your recent post showed "lognorm" in the plot.

@digitalgemba
Copy link
Author

Please check for lognormal, rest I will try after your feedback.

I measure distribution fitting by r^2.

temp=sorted(data)
xi=morestats._calc_uniform_order_statistic_medians(50)
xj=rayleigh.ppf(xi,loc,scale)
linregress(xj,temp)

@mdhaber
Copy link
Contributor

mdhaber commented Dec 3, 2020

lognorm's shape parameter doesn't support skew in that direction. Take a look at the plots on Wikipedia; they all have long tails to the right whereas your data has a longer tail on the left. If I reflect your data about the vertical axis, the shape parameter begins to be used.

import matplotlib.pyplot as plt
from scipy.stats import lognorm
import numpy as np

data=([0.038, 0.042, 0.012, 0.044, 0.049, 0.052, 0.054, 0.044, 0.058,0.039,
       0.054, 0.043, 0.057, 0.058, 0.052, 0.051, 0.046, 0.049,0.052, 0.052,
       0.051, 0.053, 0.044, 0.037, 0.04 , 0.032, 0.037,0.042, 0.045, 0.042,
       0.041, 0.035, 0.052, 0.051, 0.052, 0.054,0.055, 0.052, 0.051, 0.042,
       0.047, 0.032, 0.033, 0.036, 0.041,0.051, 0.046, 0.047, 0.042, 0.052])
data = -np.array(data)
p = lognorm.fit(data)
x = np.linspace(-0.1, 0.1, 300)
fig, ax = plt.subplots(1, 1)
ax.hist(data, density=True, histtype='stepfilled', alpha=0.2)
ax.plot(x, lognorm(*p).pdf(x))

image

But the other way of looking at this might be that lognormal may not be the appropriate distribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
query A question or suggestion that requires further information scipy.stats
Projects
None yet
Development

No branches or pull requests

2 participants