ENH: Robust skewness, kurtosis and medcouple measures #1234

Merged
merged 17 commits into from Apr 4, 2014

Projects

None yet

4 participants

@bashtage
Contributor

This addresses #838 by adding medcouple and the Kim & White robust skewness and kurtosis measures.

I'm not sure if I handle ties correctly in medcouple, so probably wait a bit for any merge.

@coveralls

Coverage Status

Coverage remained the same when pulling 1b20c17 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 8523ccf on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 16, 2013
statsmodels/stats/tests/test_statstools.py
@@ -90,17 +93,18 @@ def test_jarque_bera():
assert_almost_equal(jb, st_pv_R, 14)
st_pv_R = np.array([78.329987305556, 0.000000000000])
- jb = jarque_bera(x**2)[:2]
+ jb = jarque_bera(x ** 2)[:2]
@josef-pkt
josef-pkt Dec 16, 2013 Member

our (statsmodels, scipy, numpy) convention is not to add spaces around power ** (the only operator for no spaces)

@bashtage
bashtage Dec 16, 2013 Contributor

What I get for letting PyCharm reformat the file.

@coveralls

Coverage Status

Coverage remained the same when pulling c782d04 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling c782d04 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 6fd25b0 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 3b9f3c3 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 18, 2013
statsmodels/stats/stattools.py
+ e1 = np.percentile(y, 12.5, axis=axis)
+ e2 = np.percentile(y, 25.0, axis=axis)
+ e3 = np.percentile(y, 37.5, axis=axis)
+ e5 = np.percentile(y, 62.5, axis=axis)
+ e6 = np.percentile(y, 75.0, axis=axis)
+ e7 = np.percentile(y, 87.5, axis=axis)
+
+ alpha, beta = 5.0, 50.0
+
+ f1ma = np.percentile(y, 97.5, axis)
+ fa = np.percentile(y, 2.5, axis)
+ f1mb = np.percentile(y, 75.0, axis)
+ fb = np.percentile(y, 25.0, axis)
+
+ kr1 = stats.kurtosis(y, axis)
+ kr2 = ((e7 - e5) + (e3 - e1)) / (e6 - e2) - 1.2330951154852172
@josef-pkt
josef-pkt Dec 18, 2013 Member

are these relative to normal?
1.2330951154852172 ?

I think I prefer to turn this off with an option, if that's what it is. My Kim and Wilde is hiding in a pile of papers, right now.

@bashtage
bashtage Dec 18, 2013 Contributor

Yes. The thing is that unlike the usual Kurtosis, where everyone known that 3 is the normal value, these have non-standard values and so are not easy to understand unless expressed as excesses relative to that of a normal. I was thinking of making these user supplied with defaults, and didn't do this due to the extra docs.

@josef-pkt
josef-pkt Dec 18, 2013 Member

When I looked into this some time ago, then I thought of adding the similar function that uses the scipy distribution ppf for the quantiles. IIRC there are the tables with the values for some standard distributions, that I matched. (I didn't look at medcouple.)

In this case it would be easier to compare the absolute kurtosis with the value for the t distribution for example. (I don't remember if it made sense when the distribution parameters need to be estimated.)

One application I have seen somewhere is robust jarque-bera for normal distribution.

@bashtage
bashtage Dec 18, 2013 Contributor

The best strategy would be use a kwarg to disable the scaling. However, it would also be useful to have a function that would take the appropriate SciPy distribution and compute the comparable quantities.

Alternatively, just take the SciPy and do the adjustment directly in the function.

@josef-pkt josef-pkt and 1 other commented on an outdated diff Dec 18, 2013
statsmodels/stats/tests/test_statstools.py
st_pv_R = np.array(
- [[34.523210399523926, 4.429509162503833, 3.860396220444025],
- [3.186985686465249e-08, 9.444780064482572e-06, 1.132033129378485e-04]])
+ [[34.523210399523926, 4.429509162503833, 3.860396220444025],
+ [3.186985686465249e-08, 9.444780064482572e-06, 1.132033129378485e-04]])
@josef-pkt
josef-pkt Dec 18, 2013 Member

Is this also automatic reformatting?
If there is space, I'd like to keep the numbers to the right of the equal sign, but I have no idea about pep-8

@bashtage
bashtage Dec 18, 2013 Contributor

I've rolled these back, and yes, autoformat.

It doesn't trip a pep-8 warning, so I think this is PyCharm's preference.

@josef-pkt josef-pkt commented on the diff Dec 18, 2013
statsmodels/stats/tests/test_statstools.py
assert_almost_equal(durbin_watson(x), durbin_watson(x_series), decimal=13)
-if __name__ == '__main__':
- test_durbin_watson()
- test_omni_normtest()
- test_jarque_bera()
- test_shapiro()
- test_adnorm()
- test_durbin_watson_pandas()
+
+class TestStattools(TestCase):
@josef-pkt
josef-pkt Dec 18, 2013 Member

subclassing TestCase is not necessary, we usually just subclass object.
I have no idea whether it makes any difference.

@bashtage
bashtage Dec 18, 2013 Contributor

It is useful for automated testing in PyCharm, since it identified unittests and you get more detailed output.

@coveralls

Coverage Status

Coverage remained the same when pulling a640ab1 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 3092aed on bashtage:medcouple into 5cf4cad on statsmodels:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 36191ff on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt commented on an outdated diff Dec 19, 2013
statsmodels/stats/stattools.py
+ -----
+ .. [1] Tae-Hwan Kim and Halbert White, "On more robust estimation of
+ skewness and kurtosis," Finance Research Letters, vol. 1, pp. 56-73,
+ March 2004.
+ """
+ if (axis is None or
+ (y.squeeze().ndim == 1 and y.ndim != 1)):
+ y = y.flat[:]
+ axis = 0
+
+ e1 = np.percentile(y, 12.5, axis=axis)
+ e2 = np.percentile(y, 25.0, axis=axis)
+ e3 = np.percentile(y, 37.5, axis=axis)
+ e5 = np.percentile(y, 62.5, axis=axis)
+ e6 = np.percentile(y, 75.0, axis=axis)
+ e7 = np.percentile(y, 87.5, axis=axis)
@josef-pkt
josef-pkt Dec 19, 2013 Member

I forgot to check earlier, np.percentile is available in our minimum numpy version.
np.percentile is vectorized
np.percentile(10*np.arange(6), [0.1, 0.5, 0.75])
which is faster because it doesn't do repeated sorting (starting in 1.8 it will do only partial sorting which is even faster)

@coveralls

Coverage Status

Coverage remained the same when pulling c5feb79 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt
Member

Looks pretty much done to me, Thanks Kevin

Do you still have any changes planned?
I'd like to merge it before #1255 because afterwards I prefer if this is rebased before merging.

@bashtage
Contributor

The only things that could be done would be user specification of the cutoffs in some of the estimators, and the better algorithm for the medcouple. I might try to do the former, but the letter will have to wait.

On Dec 19, 2013 6:13 PM, Josef Perktold notifications@github.com wrote:

Looks pretty much done to me, Thanks Kevin

Do you still have any changes planned?
I'd like to merge it before #1255#1255 because afterwards I prefer if this is rebased before merging.


Reply to this email directly or view it on GitHubhttps://github.com/statsmodels/statsmodels/pull/1234#issuecomment-30951816.

@josef-pkt
Member

Ok no need to rush, a rebase is no problem

I have GEE almost ready for merge and I would like to merge also IV/GMM #1105 today, and maybe #1225

@bashtage
Contributor

I think this should complete this PR unless you see anything new.

@coveralls

Coverage Status

Coverage remained the same when pulling 4197c48 on bashtage:medcouple into 5cf4cad on statsmodels:master.

@josef-pkt josef-pkt commented on an outdated diff Jan 6, 2014
statsmodels/stats/stattools.py
+ sk2 = (q1 + q3 - 2.0 * q2) / (q3 - q1)
+ sk3 = (mu - q2) / np.mean(abs(y - q2_b), axis=axis)
+ sk4 = (mu - q2) / sigma
+
+ return sk1, sk2, sk3, sk4
+
+
+def _kr3(y, alpha=5.0, beta=50.0):
+ """
+ KR3 estimator from Kim & White
+
+ Parameters
+ ----------
+ y : array-like, 1-d
+ alpha : float, optional
+ Lower cut-off for measuring expectation in tail.
@josef-pkt
josef-pkt Jan 6, 2014 Member

too much intend,

@josef-pkt josef-pkt commented on an outdated diff Jan 6, 2014
statsmodels/stats/stattools.py
+def _kr3(y, alpha=5.0, beta=50.0):
+ """
+ KR3 estimator from Kim & White
+
+ Parameters
+ ----------
+ y : array-like, 1-d
+ alpha : float, optional
+ Lower cut-off for measuring expectation in tail.
+ beta : float, optional
+ Lower cut-off for measuring expectation in center.
+
+ Returns
+ -------
+ kr3 : float
+ Robust kurtosis estimator based on
@josef-pkt
josef-pkt Jan 6, 2014 Member

unfinished sentence ?

@josef-pkt josef-pkt commented on an outdated diff Jan 6, 2014
statsmodels/stats/stattools.py
+
+
+def robust_kurtosis(y, axis=0, ab=(5.0, 50.0), dg=(2.5, 25.0), excess=True):
+ """
+ Calculates the four kurtosis measures in Kim & White
+
+ Parameters
+ ----------
+ y : array-like
+ axis : int or None, optional
+ Axis along which the kurtosis measures are computed. If `None`, the
+ entire array is used.
+ ab: iterable, optional
+ Contains 100*(alpha, beta) in the kr3 measure
+ db: iterable, optional
+ Contains 100*(delta, gamma) in the kr4 measure
@josef-pkt
josef-pkt Jan 6, 2014 Member

ab, db, alpha beta delta gamma are not informative if we don't read the formulas.

@josef-pkt josef-pkt commented on an outdated diff Jan 6, 2014
statsmodels/stats/stattools.py
+ -\\hat{E}\\left(y|y<\\hat{q}_{\\beta}\\right)}
+
+ .. math::
+
+ KR_{4}=\\frac{\\hat{q}_{1-\\delta}-\\hat{q}_{\\delta}}
+ {\\hat{q}_{1-\\gamma}-\\hat{q}_{\\gamma}}
+
+ where :math:`\\hat{q}_{p}` is the estimated quantile at :math:`p`.
+
+ .. [1] Tae-Hwan Kim and Halbert White, "On more robust estimation of
+ skewness and kurtosis," Finance Research Letters, vol. 1, pp. 56-73,
+ March 2004.
+ """
+ if (axis is None or
+ (y.squeeze().ndim == 1 and y.ndim != 1)):
+ y = y.flat[:]
@josef-pkt
josef-pkt Jan 6, 2014 Member

I don't think we need a copy which flat[:] does AFAICS. use y.ravel()

@josef-pkt josef-pkt commented on an outdated diff Jan 6, 2014
statsmodels/stats/stattools.py
+ March 2004.
+ """
+ if (axis is None or
+ (y.squeeze().ndim == 1 and y.ndim != 1)):
+ y = y.flat[:]
+ axis = 0
+
+ alpha, beta = ab
+ delta, gamma = dg
+
+ perc = (12.5, 25.0, 37.5, 62.5, 75.0, 87.5,
+ delta, 100.0 - delta, gamma, 100.0 - gamma)
+ e1, e2, e3, e5, e6, e7, fd, f1md, fg, f1mg = np.percentile(y, perc, axis=axis)
+
+ expected_value = np.zeros(4)
+ if excess:
@josef-pkt
josef-pkt Jan 6, 2014 Member

the calculation below should be put into a separate function, I think

@bashtage
Contributor

I have implemented all of these suggestions, so hopefully this is ready to be finished.

@coveralls

Coverage Status

Coverage remained the same when pulling 95a6a6b on bashtage:medcouple into 5cf4cad on statsmodels:master.

@bashtage
Contributor

I can't remember if I needed to do anything on this - I don't see anything above, so I hope it is finished.

@josef-pkt
Member

Yes, I think it looks good and can be merged.

The two things I thought about that should be changed eventually

  • expected_robust_kurtosis could take an optional distribution, or a ppf function, so we can also calculate it for other distributions, e.g. compare with a t-distribution. same for an expected_robust_skewness. (it might also be used for a robust method of moment estimation of the distribution parameter, but I haven't looked at those yet.)
  • as we get more robust statistics, especially descriptive statistics, we need a better location than adding to a generic stattools.py
    #838 (comment)

Sorry for being so slow. I got lost again. This time in robust estimation and general M-Estimators, trying to catch up with some theory and getting a better overview.

@bashtage
Contributor

I think the optional distribution is probably not really possible since some of the statistics do not have (obvious) closed forms, in particular the ones which depend on the expected value in the tails. In particular both kr3 and sk3 do not have simple to compute forms using scipy stats RVs since the expected value in the tail is not available or the expected absolute deviation is not available.

@josef-pkt
Member

I didn't remember this part about kr3 and sk3.

It might still be possible in many case to use the expect method of the scipy distributions
for example, I used it here:
#1372 (comment)

I guess tail integration might not work for all heavy tailed distributions. I don't remember how much I tried. expect just uses scipy.integrate.quad for the continuous distributions.

@bashtage
Contributor

These types of direct numerical integration usually don't work well in the tails, and better numerical integration requires a bespoke solution (usually one which transforms from an unbounded to a bounded space)

@josef-pkt
Member

In my experience integrate.quad works pretty well.
There might only be a few exceptions with heavy tails where integrate.quad doesn't work.
We had some cases in scipy issues, but I don't remember the details, IIRC when calculating higher moments, if they (almost) don't exist.
some distributions might also have numerical precision problems in the tails.

@josef-pkt josef-pkt added the PR label Feb 19, 2014
@bashtage
Contributor
bashtage commented Mar 9, 2014

This was getting a bit old so I merged upstream master to make sure that it will still be OK.

@jseabold
Member
jseabold commented Apr 3, 2014

Ok to merge this?

Can you add a brief note in docs/source/release/version0.6.rst?

Kevin Sheppard added some commits Dec 14, 2013
Kevin Sheppard ENH: Medcouple for robust skewness estimation
ENH: Function allowing other functions to be applied using only 1d
    at a time
ENH: Robust Skewness from Kim and White
ENH: Robust Kurtosis from Kim and White
ENH: Helper functions for quantile.  Different from numpy.
ENH: Tests for medcouple
efd4ac2
Kevin Sheppard BUG: Made use of np.apply_along_axis and replaced custom version d9d6482
Kevin Sheppard General improvements to robust skewness and kurtosis
ENH: Generalized other texts to work along axis for standards
Simplified code to remove new weighted quantile function, instead
    use np.percentile
Removed incorrect test file in tools/tests/test_statstools.py
Added tests for multi-axis functions in stattools
763a83a
Kevin Sheppard Clean-up and documentation for robust skewness and kurtosis measures
Cleaned-up stattools
Added docstring for those missing
Added test for robust kurtosis
33e1fd5
Kevin Sheppard Fixed defect where ties were not handled correctly by medcouple f5d5d60
Kevin Sheppard Reverted space additions around ** operator 9512fd6
Kevin Sheppard Removed unnecessary float powers 2c0cb9f
Kevin Sheppard 100% test coverage
Conflicts:
	statsmodels/stats/tests/test_statstools.py
4ae0f22
Kevin Sheppard Revert auto-spacing fb29e1d
Kevin Sheppard ENH: Added flag to select excess or not
Allows raw values to be returned when excess=False

Conflicts:
	statsmodels/stats/stattools.py
b1bd75d
Kevin Sheppard Added test for excess=False case
Conflicts:
	statsmodels/stats/tests/test_statstools.py
79d4206
Kevin Sheppard Vectorized np.percentil use
Switched to vectorized use of np.percentile where appropriate
Cleaned up some excessively verbose code

Conflicts:
	statsmodels/stats/stattools.py
74bc09b
Kevin Sheppard Documentation improvements and additional options
Added math to robust_kurtosis
Added math to robust_skewness
Made some of the parameters in robust_kurtosis settable
Added tests of new features

Conflicts:
	statsmodels/stats/stattools.py
a0ba265
Kevin Sheppard Improved documentation of robust kurtosis
Separated the expected kurtotis calculation

Conflicts:
	statsmodels/stats/stattools.py
b393264
Kevin Sheppard Added a release note. 07418dd
@bashtage
Contributor
bashtage commented Apr 4, 2014

I added a note.

@jseabold jseabold commented on an outdated diff Apr 4, 2014
statsmodels/tools/tools.py
+def apply_1d_function(a, func, axis=0):
+ """
+ Applies a 1-dimensional function to an arbitrary array along a given axis
+ """
+ # TODO: Better docs
+ if axis is None:
+ return func(a.flat[:])
+
+ shape = list(a.shape)
+ shape.pop(axis)
+ if not shape:
+ return func(a)
+
+ out = np.zeros(shape)
+
+ for i in np.ndindex(shape):
@jseabold
jseabold Apr 4, 2014 Member

I think this needs a np.ndindex(*shape).

@jseabold jseabold and 1 other commented on an outdated diff Apr 4, 2014
statsmodels/tools/tools.py
@@ -565,3 +566,26 @@ class Bunch(dict):
def __init__(self, **kw):
dict.__init__(self, kw)
self.__dict__ = self
+
+
+def apply_1d_function(a, func, axis=0):
@jseabold
jseabold Apr 4, 2014 Member

Is this different than np.apply_along_axis?

@bashtage
bashtage Apr 4, 2014 Contributor

I think this was from an early version that got back in during rebase. I have removed it. It wasn't being used by anything as far as I can tell.

@jseabold
jseabold Apr 4, 2014 Member

Great thanks. Ping me when you push, and I'll merge. It still shows up here?

@bashtage
bashtage Apr 4, 2014 Contributor

All done.

Kevin Sheppard added some commits Apr 3, 2014
@jseabold jseabold merged commit 067b41f into statsmodels:master Apr 4, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@bashtage bashtage deleted the bashtage:medcouple branch Apr 4, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment