ENH: Robust skewness, kurtosis and medcouple measures #1234

Merged
merged 17 commits into from Apr 4, 2014

4 participants
@bashtage
Contributor

commented Dec 16, 2013

This addresses #838 by adding medcouple and the Kim & White robust skewness and kurtosis measures.

I'm not sure whether I handle ties correctly in medcouple, so it is probably best to wait a bit before merging.
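
For readers unfamiliar with the medcouple, the following is a minimal sketch of the naive O(n**2) definition, not the code from this PR. The function name `naive_medcouple` is hypothetical, and the sketch deliberately refuses input in which an observation equals the median, since that tie case needs a special kernel and is exactly the part flagged as uncertain above.

```python
import numpy as np


def naive_medcouple(y):
    """Naive O(n**2) medcouple for 1-d data with no value equal to the median.

    Hypothetical helper for illustration only; not the statsmodels function.
    """
    y = np.sort(np.asarray(y, dtype=float))
    z = y - np.median(y)          # center the data at the median
    if np.any(z == 0.0):
        # Ties with the median need a special kernel; not handled in this sketch.
        raise ValueError("observations equal to the median are not supported")
    upper = z[z > 0.0][:, None]   # column of points above the median
    lower = z[z < 0.0][None, :]   # row of points below the median
    # Kernel h(x_i, x_j) = ((x_i - m) - (m - x_j)) / (x_i - x_j) on centered data
    h = (upper + lower) / (upper - lower)
    return float(np.median(h))


symmetric = naive_medcouple([-3.0, -1.0, 1.0, 3.0])
right_skewed = naive_medcouple([1.0, 2.0, 3.0, 10.0])
```

Symmetric data should give a value of zero and right-skewed data a positive value; the all-pairs kernel matrix is what makes this version quadratic in memory and time.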

@coveralls

commented Dec 16, 2013

Coverage Status

Coverage remained the same when pulling 1b20c17 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

commented Dec 16, 2013

Coverage Status

Coverage remained the same when pulling 8523ccf on bashtage:medcouple into 5cf4cad on statsmodels:master.

@@ -90,17 +93,18 @@ def test_jarque_bera():
assert_almost_equal(jb, st_pv_R, 14)

st_pv_R = np.array([78.329987305556, 0.000000000000])
-jb = jarque_bera(x**2)[:2]
+jb = jarque_bera(x ** 2)[:2]

@josef-pkt

josef-pkt Dec 16, 2013

Member

Our (statsmodels, scipy, numpy) convention is not to put spaces around the power operator **; it is the only operator written without spaces.

@bashtage

bashtage Dec 16, 2013

Author Contributor

That's what I get for letting PyCharm reformat the file.

@coveralls

commented Dec 16, 2013

Coverage Status

Coverage remained the same when pulling c782d04 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

commented Dec 16, 2013

Coverage Status

Coverage remained the same when pulling 6fd25b0 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

commented Dec 16, 2013

Coverage Status

Coverage remained the same when pulling 3b9f3c3 on bashtage:medcouple into 5cf4cad on statsmodels:master.

fb = np.percentile(y, 25.0, axis)

kr1 = stats.kurtosis(y, axis)
kr2 = ((e7 - e5) + (e3 - e1)) / (e6 - e2) - 1.2330951154852172

@josef-pkt

josef-pkt Dec 18, 2013

Member

Are these relative to the normal?
1.2330951154852172?

If that's what it is, I think I'd prefer to be able to turn this off with an option. My copy of Kim and White is hiding in a pile of papers right now.

@bashtage

bashtage Dec 18, 2013

Author Contributor

Yes. The thing is that, unlike the usual kurtosis, where everyone knows that 3 is the normal value, these have non-standard values and so are not easy to interpret unless expressed as excesses relative to the value for a normal. I was thinking of making these user-supplied with defaults, but didn't do so because of the extra documentation required.
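
As a check on the constant discussed here, the octile-based measure can be evaluated at the quantiles of a standard normal. The symbols E_1, ..., E_7 mirror the e1 ... e7 variables in the diff; the rounded quantile values are standard normal quantiles supplied for illustration and are not taken from the PR.

```latex
\mathrm{KR}_2 = \frac{(E_7 - E_5) + (E_3 - E_1)}{E_6 - E_2},
\qquad E_i = F^{-1}(i/8)

% Standard normal: E_7 = -E_1 \approx 1.1503, E_5 = -E_3 \approx 0.3186,
% E_6 = -E_2 \approx 0.6745
\mathrm{KR}_2^{\text{normal}}
  \approx \frac{2\,(1.1503 - 0.3186)}{2 \times 0.6745}
  \approx 1.2331
```

This agrees, to the digits shown, with the 1.2330951154852172 subtracted in the diff, which is consistent with reading that constant as the normal reference value.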

@josef-pkt

josef-pkt Dec 18, 2013

Member

When I looked into this some time ago, I thought of adding a similar function that uses the scipy distribution ppf for the quantiles. IIRC there are tables with the values for some standard distributions, which I matched. (I didn't look at medcouple.)

In that case it would be easier to compare the absolute kurtosis with the value for, say, the t distribution. (I don't remember whether it made sense when the distribution parameters need to be estimated.)

One application I have seen somewhere is a robust Jarque-Bera test for the normal distribution.

@bashtage

bashtage Dec 18, 2013

Author Contributor

The best strategy would be to use a kwarg to disable the scaling. However, it would also be useful to have a function that takes the appropriate SciPy distribution and computes the comparable quantities.

Alternatively, just take the SciPy distribution and do the adjustment directly in the function.

[[34.523210399523926, 4.429509162503833, 3.860396220444025],
[3.186985686465249e-08, 9.444780064482572e-06, 1.132033129378485e-04]])
[[34.523210399523926, 4.429509162503833, 3.860396220444025],
[3.186985686465249e-08, 9.444780064482572e-06, 1.132033129378485e-04]])

@josef-pkt

josef-pkt Dec 18, 2013

Member

Is this also automatic reformatting?
If there is space, I'd like to keep the numbers to the right of the equals sign, but I have no idea what PEP 8 says about it.

@bashtage

bashtage Dec 18, 2013

Author Contributor

I've rolled these back, and yes, autoformat.

It doesn't trip a pep-8 warning, so I think this is PyCharm's preference.

test_adnorm()
test_durbin_watson_pandas()

class TestStattools(TestCase):

@josef-pkt

josef-pkt Dec 18, 2013

Member

Subclassing TestCase is not necessary; we usually just subclass object.
I have no idea whether it makes any difference.

@bashtage

bashtage Dec 18, 2013

Author Contributor

It is useful for automated testing in PyCharm, since it identifies unittests and you get more detailed output.

@coveralls

commented Dec 18, 2013

Coverage Status

Coverage remained the same when pulling a640ab1 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

commented Dec 18, 2013

Coverage Status

Coverage remained the same when pulling 3092aed on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

commented Dec 18, 2013

Coverage Status

Coverage remained the same when pulling 36191ff on bashtage:medcouple into 5cf4cad on statsmodels:master.

e3 = np.percentile(y, 37.5, axis=axis)
e5 = np.percentile(y, 62.5, axis=axis)
e6 = np.percentile(y, 75.0, axis=axis)
e7 = np.percentile(y, 87.5, axis=axis)

@josef-pkt

josef-pkt Dec 19, 2013

Member

I forgot to check earlier: np.percentile is available in our minimum numpy version.
np.percentile is also vectorized, e.g.
np.percentile(10*np.arange(6), [0.1, 0.5, 0.75])
which is faster because it doesn't sort repeatedly (starting in 1.8 it will do only a partial sort, which is even faster).
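
To make the vectorization point concrete, here is a small sketch comparing one call per percentile with a single vectorized call, using the octile levels that appear in the diff. The random test data and variable names are illustrative assumptions. Note that np.percentile expects values on a 0 to 100 scale, so 12.5 means the first octile.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal((500, 3))

perc = [12.5, 25.0, 37.5, 62.5, 75.0, 87.5]

# One call per percentile: the data are processed again on every call.
separate = np.array([np.percentile(y, p, axis=0) for p in perc])

# One vectorized call: the requested percentiles index the first axis of the result.
e1, e2, e3, e5, e6, e7 = np.percentile(y, perc, axis=0)

# Octile-based measure, one value per column of y (no normal adjustment applied).
kr2 = ((e7 - e5) + (e3 - e1)) / (e6 - e2)
```

Unpacking the first axis of the result directly into e1 ... e7 is the same pattern used later in the diff.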

@coveralls

commented Dec 19, 2013

Coverage Status

Coverage remained the same when pulling c5feb79 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt

Member

commented Dec 19, 2013

Looks pretty much done to me. Thanks, Kevin.

Do you still have any changes planned?
I'd like to merge it before #1255, because afterwards I would prefer that this be rebased before merging.

@bashtage

Contributor Author

commented Dec 19, 2013

The only things that could still be done are user specification of the cutoffs in some of the estimators and the better algorithm for the medcouple. I might try to do the former, but the latter will have to wait.


@josef-pkt

Member

commented Dec 19, 2013

OK, no need to rush; a rebase is no problem.

I have GEE almost ready for merge, and I would also like to merge IV/GMM #1105 today, and maybe #1225.

@bashtage

Contributor Author

commented Dec 20, 2013

I think this should complete this PR unless you see anything new.

@coveralls

commented Dec 20, 2013

Coverage Status

Coverage remained the same when pulling 4197c48 on bashtage:medcouple into 5cf4cad on statsmodels:master.

----------
y : array-like, 1-d
alpha : float, optional
Lower cut-off for measuring expectation in tail.

@josef-pkt

josef-pkt Jan 6, 2014

Member

Too much indent.

Returns
-------
kr3 : float
Robust kurtosis estimator based on

@josef-pkt

josef-pkt Jan 6, 2014

Member

unfinished sentence ?

ab: iterable, optional
Contains 100*(alpha, beta) in the kr3 measure
db: iterable, optional
Contains 100*(delta, gamma) in the kr4 measure

@josef-pkt

josef-pkt Jan 6, 2014

Member

ab, db, alpha, beta, delta and gamma are not informative unless we read the formulas.

"""
if (axis is None or
(y.squeeze().ndim == 1 and y.ndim != 1)):
y = y.flat[:]

@josef-pkt

josef-pkt Jan 6, 2014

Member

I don't think we need a copy, which is what flat[:] makes AFAICS. Use y.ravel() instead.
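
The difference between the two spellings can be checked directly. This is a small sketch on an arbitrary contiguous array; the variable names are illustrative.

```python
import numpy as np

y = np.arange(6.0).reshape(2, 3)   # C-contiguous array

copied = y.flat[:]                 # indexing the flat iterator builds a new array
viewed = y.ravel()                 # returns a view when no copy is required

print(np.shares_memory(y, copied))  # False
print(np.shares_memory(y, viewed))  # True
```

For non-contiguous input, ravel() may still have to copy, so the saving applies to the common contiguous case.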

e1, e2, e3, e5, e6, e7, fd, f1md, fg, f1mg = np.percentile(y, perc, axis=axis)

expected_value = np.zeros(4)
if excess:

@josef-pkt

josef-pkt Jan 6, 2014

Member

the calculation below should be put into a separate function, I think

@bashtage

Contributor Author

commented Jan 10, 2014

I have implemented all of these suggestions, so hopefully this is ready to be finished.

@coveralls

commented Jan 10, 2014

Coverage Status

Coverage remained the same when pulling 95a6a6b on bashtage:medcouple into 5cf4cad on statsmodels:master.

@bashtage

Contributor Author

commented Feb 11, 2014

I can't remember if I needed to do anything on this - I don't see anything above, so I hope it is finished.

@josef-pkt

Member

commented Feb 11, 2014

Yes, I think it looks good and can be merged.

The two things I thought about that should be changed eventually:

  • expected_robust_kurtosis could take an optional distribution, or a ppf function, so we can also calculate it for other distributions, e.g. to compare with a t-distribution. The same goes for an expected_robust_skewness. (It might also be used for a robust method-of-moments estimation of the distribution parameter, but I haven't looked at those yet.)
  • As we get more robust statistics, especially descriptive statistics, we need a better location than adding to a generic stattools.py
    #838 (comment)
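
The first suggestion can be sketched for the purely quantile-based measure, where only a ppf is needed. The function name `expected_octile_kurtosis` and its signature are hypothetical and are not the statsmodels API; the logistic distribution is used only because its quantile function has a simple closed form.

```python
import math
from statistics import NormalDist


def expected_octile_kurtosis(ppf):
    """Population octile-based kurtosis for a distribution given its ppf.

    `ppf` is any callable mapping a probability in (0, 1) to a quantile.
    Hypothetical helper for illustration; not part of statsmodels.
    """
    e1, e2, e3, e5, e6, e7 = (ppf(k / 8) for k in (1, 2, 3, 5, 6, 7))
    return ((e7 - e5) + (e3 - e1)) / (e6 - e2)


def logistic_ppf(p):
    # Closed-form quantile function of the standard logistic distribution.
    return math.log(p / (1.0 - p))


normal_value = expected_octile_kurtosis(NormalDist().inv_cdf)
logistic_value = expected_octile_kurtosis(logistic_ppf)
excess_logistic = logistic_value - normal_value  # positive: heavier tails than normal
```

The ppf method of a frozen SciPy distribution, for example scipy.stats.t(5).ppf, could be passed in the same way. Measures that depend on tail expectations would need more than a ppf.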

Sorry for being so slow. I got lost again. This time in robust estimation and general M-Estimators, trying to catch up with some theory and getting a better overview.

@bashtage

Contributor Author

commented Feb 11, 2014

I think the optional distribution is probably not really feasible, since some of the statistics do not have (obvious) closed forms, in particular the ones that depend on the expected value in the tails. Neither kr3 nor sk3 has a simple-to-compute form using scipy stats RVs, since the expected value in the tail or the expected absolute deviation is not available.

@bashtage

Contributor Author

commented Feb 11, 2014

These types of direct numerical integration usually don't work well in the tails, and better numerical integration requires a bespoke solution (usually one which transforms from an unbounded to a bounded space)

@josef-pkt

Member

commented Feb 11, 2014

In my experience integrate.quad works pretty well.
There might be only a few exceptions with heavy tails where integrate.quad doesn't work.
We had some cases in scipy issues, but I don't remember the details; IIRC they arose when calculating higher moments that (almost) don't exist.
Some distributions might also have numerical precision problems in the tails.
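
As a small illustration of the two positions in this exchange, the sketch below computes an upper-tail expectation for the standard normal with integrate.quad and compares it with the known closed form. The normal is a benign case chosen for illustration; it says nothing about how quad behaves for heavy-tailed distributions, and the choice alpha = 0.05 is an assumption rather than a value from the PR.

```python
import numpy as np
from scipy import integrate, stats

alpha = 0.05                          # tail probability (illustrative choice)
q = stats.norm.ppf(1.0 - alpha)       # upper-tail cutoff

# Numerical tail expectation E[z | z > q] via adaptive quadrature.
integral, abserr = integrate.quad(lambda z: z * stats.norm.pdf(z), q, np.inf)
numeric = integral / alpha

# Closed form for the standard normal: pdf(q) / P(z > q).
closed_form = stats.norm.pdf(q) / alpha
```

For distributions without a closed form to compare against, the error estimate returned by quad (abserr here) is the only built-in check, which is why the heavy-tail caveats above matter.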

@josef-pkt josef-pkt added the PR label Feb 19, 2014

@bashtage

Contributor Author

commented Mar 9, 2014

This was getting a bit old so I merged upstream master to make sure that it will still be OK.

@jseabold

Member

commented Apr 3, 2014

Ok to merge this?

Can you add a brief note in docs/source/release/version0.6.rst?

Kevin Sheppard added some commits Dec 14, 2013

Kevin Sheppard
ENH: Medcouple for robust skewness estimation
ENH: Function allowing other functions to be applied using only 1d
    at a time
ENH: Robust Skewness from Kim and White
ENH: Robust Kurtosis from Kim and White
ENH: Helper functions for quantile.  Different from numpy.
ENH: Tests for medcouple
Kevin Sheppard
General improvements to robust skewness and kurtosis
ENH: Generalized other texts to work along axis for standards
Simplified code to remove new weighted quantile function, instead
    use np.percentile
Removed incorrect test file in tools/tests/test_statstools.py
Added tests for multi-axis functions in stattools
Kevin Sheppard
Clean-up and documentation for robust skewness and kurtosis measures
Cleaned-up stattools
Added docstring for those missing
Added test for robust kurtosis
Kevin Sheppard
100% test coverage
Conflicts:
	statsmodels/stats/tests/test_statstools.py
Kevin Sheppard
ENH: Added flag to select excess or not
Allows raw values to be returned when excess=False

Conflicts:
	statsmodels/stats/stattools.py
Kevin Sheppard
Added test for excess=False case
Conflicts:
	statsmodels/stats/tests/test_statstools.py
Kevin Sheppard
Vectorized np.percentil use
Switched to vectorized use of np.percentile where appropriate
Cleaned up some excessively verbose code

Conflicts:
	statsmodels/stats/stattools.py
Kevin Sheppard
Documentation improvements and additional options
Added math to robust_kurtosis
Added math to robust_skewness
Made some of the parameters in robust_kurtosis settable
Added tests of new features

Conflicts:
	statsmodels/stats/stattools.py
Kevin Sheppard
Improved documentation of robust kurtosis
Separated the expected kurtotis calculation

Conflicts:
	statsmodels/stats/stattools.py
@bashtage

Contributor Author

commented Apr 4, 2014

I added a note.


out = np.zeros(shape)

for i in np.ndindex(shape):

@jseabold

jseabold Apr 4, 2014

Member

I think this needs a np.ndindex(*shape).
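
A quick way to compare the two call styles is shown below. This is a sketch against a current NumPy, where a single tuple is documented as accepted; whether that held for the minimum NumPy version supported at the time of this PR is not established here. The shape and loop body are illustrative.

```python
import numpy as np

shape = (2, 3)

unpacked = list(np.ndindex(*shape))  # dimensions as separate integer arguments
as_tuple = list(np.ndindex(shape))   # a single tuple, accepted by current NumPy

out = np.zeros(shape)
for i in np.ndindex(*shape):
    out[i] = sum(i)                  # i is an index tuple such as (1, 2)
```

The unpacked form works with either calling convention, so it is the more conservative spelling.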

@@ -565,3 +566,26 @@ class Bunch(dict):
def __init__(self, **kw):
dict.__init__(self, kw)
self.__dict__ = self


def apply_1d_function(a, func, axis=0):

@jseabold

jseabold Apr 4, 2014

Member

Is this different than np.apply_along_axis?

@bashtage

bashtage Apr 4, 2014

Author Contributor

I think this was from an early version that got back in during rebase. I have removed it. It wasn't being used by anything as far as I can tell.
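
For reference, this is how np.apply_along_axis covers the use case suggested by the removed helper's signature. The helper body is not shown in this thread, so the sketch only demonstrates the NumPy function; `spread_1d` is a hypothetical 1-d function.

```python
import numpy as np


def spread_1d(v):
    # Any function that maps a 1-d array to a scalar works here.
    return v.max() - v.min()


a = np.arange(12.0).reshape(3, 4)

per_column = np.apply_along_axis(spread_1d, 0, a)  # one value per column
per_row = np.apply_along_axis(spread_1d, 1, a)     # one value per row
```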

@jseabold

jseabold Apr 4, 2014

Member

Great thanks. Ping me when you push, and I'll merge. It still shows up here?

@bashtage

bashtage Apr 4, 2014

Author Contributor

All done.

Kevin Sheppard added some commits Apr 3, 2014

Kevin Sheppard
Kevin Sheppard

jseabold added a commit that referenced this pull request Apr 4, 2014

Merge pull request #1234 from bashtage/medcouple
ENH: Robust skewness, kurtosis and medcouple measures

@jseabold jseabold merged commit 067b41f into statsmodels:master Apr 4, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed

@bashtage bashtage deleted the bashtage:medcouple branch Apr 4, 2014

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014

Merge pull request statsmodels#1234 from bashtage/medcouple
ENH: Robust skewness, kurtosis and medcouple measures

@josef-pkt josef-pkt added this to the 0.6 milestone Apr 30, 2018
