Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: specify which definition of skewness is used #9505

Merged
merged 5 commits into from Dec 18, 2018

Conversation

@sjpet
Copy link
Contributor

sjpet commented Nov 19, 2018

Since there are a few ways of estimating skewness, it would be helpful to specify which definition is used by scipy in the docstring of scipy.stats.skew.

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented Nov 21, 2018

@sjpet I remember in yulesimon distribution I had to sit down and convert either central to non-central or vice versa for skew + kurtosis , glad you're calling this out!

I'm not sure what "adjusted Fisher-Pearson" means, Fisher and Pearson have many things attached to their names! The description is probably too specialized for a general user. Looks like it occurs only once on wiki page:

https://en.wikipedia.org/wiki/Skewness

perhaps a :math: equation in the doc-string would help clarify (make explicit) the formula used?

@sjpet

This comment has been minimized.

Copy link
Contributor Author

sjpet commented Nov 26, 2018

I moved the details to Notes as the docstring standard seems to suggest, where I also elaborated a bit and added equations, as suggested by @rlucas7.

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented on scipy/stats/stats.py in bd1281a Nov 27, 2018

Here I would add a parens after skewness and insert (g_1) to be clear to the reader that g_1 is the skewnewss.
Also, beneath, the m_3, m_2 and k_3 could be anything (not defined in the docstring). I'd either use standard math/probability notations (expectation etc) or define the quantities in the docstring.

This comment has been minimized.

Copy link
Contributor Author

sjpet replied Nov 27, 2018

I'm not sure I agree. g_1 is not the skewness, it's a specific estimator of the skewness, so it would be something like

"The sample skewness is defined as the adjusted Fisher-Pearson standardized moment coefficient (g_1), i.e.

g_1 = ..."

which is rather redundant. If one wishes to be abundantly clear,

sample skewness ≜ g_1 = ...

is an option.

As for the second point, I could be persuaded that defining it in terms of expectation is reasonable, but I would expect anyone that is interested in knowing which skewness estimator is used would be familiar enough with statistics to understand what m means in this context.

This comment has been minimized.

Copy link
Member

rlucas7 replied Nov 27, 2018

I'm not sure I agree. g_1 is not the skewness, it's a specific estimator of the skewness,

agreed. The way the docstring is written the only way for a reader to determine this is skew statistic (and not skew parameter) is with the word sample or perhaps if they correctly infer what g_1, m_2 etc denote. If the goal is to make it clear and explicit to the reader which one is used, perhaps write something like:
"""
g_1 = \frac{\sum_{i=1}^N (x_i -\bar{x})^3}{s^3}
where \bar{x} is the sample mean and s is the sample standard deviation.
"""

expect anyone that is interested in knowing which skewness estimator is used would be familiar enough with >statistics to understand what m means in this context.

I'm not sure there are clear conventions on the letters used to denote sample skewness
that is why I'm recommending to be explicit in the docstring via an equation.

@sjpet

This comment has been minimized.

Copy link
Contributor Author

sjpet commented Nov 28, 2018

Using s can be a trap since that is often used for the unbiased sample standard deviation, which isn't quite what is done here. We're using the biased sample third and second central moments, and given their similarity, it would be odd to clarify only one of them.

My new suggestion adds an equation for m_i as the (biased sample) i:th central moment.

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented Nov 30, 2018

@sjpet

This comment has been minimized.

Copy link
Contributor Author

sjpet commented Nov 30, 2018

I'm sorry, but that doesn't appear to be similar at all. The issue you linked stopped during test collection due to an import error, whereas the CircleCI failures here are two testcases failing: one of them (scipy/sparse/linalg/isolve/tests/test_lgmres.py::TestLGMRES::test_arnoldi) due to a crashed worker and the other one (scipy/spatial/tests/test_kdtree.py::test_ckdtree_memuse) due to an apparent memory leak. I dare say none of them is related to a docstring change in scipy.stats and besides, they both pass on my local machine.

@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Nov 30, 2018

These failures are common and can be ignored. We're about to resolve the PyPy issues in another PR.

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented Dec 1, 2018

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented Dec 6, 2018

Using s can be a trap since that is often used for the unbiased sample standard deviation, which isn't quite what is done here. We're using the biased sample third and second central moments, and given their similarity, it would be odd to clarify only one of them.

My new suggestion adds an equation for m_i as the (biased sample) i:th central moment.

Great. This is very clear and explicit for the general reader, thanks for making these changes.

My only remaining suggestion is to not use a superscript on $i^th$. The word $i$th is a word in english;
https://en.wiktionary.org/wiki/ith
unfortunately lots of papers and books use the super scripting needlessly perpetuating the usage.

@sjpet

This comment has been minimized.

Copy link
Contributor Author

sjpet commented Dec 6, 2018

My only remaining suggestion is to not use a superscript on $i^th$. The word $i$th is a word in english;
https://en.wiktionary.org/wiki/ith
unfortunately lots of papers and books use the super scripting needlessly perpetuating the usage.

As far as I am able to find, using superscript for the ordinal indicator is not incorrect, but a matter of style and perhaps somewhat archaic and generally not recommended (http://publish.ucc.ie/doc/wordedit?sectoc=5#noautostyle, https://english.stackexchange.com/questions/111265/should-ordinal-indicators-be-inline). I have updated accordingly.

@rlucas7

This comment has been minimized.

Copy link
Member

rlucas7 commented Dec 7, 2018

@sjpet

This comment has been minimized.

Copy link
Contributor Author

sjpet commented Dec 17, 2018

I think we're ready then. Anything else needed for this to be merged, @rgommers?

@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Dec 18, 2018

Rendered math looks good to me, and this looks like a useful doc improvement. Thanks @sjpet, and thanks @rlucas7 for reviewing!

Merging.

@rgommers rgommers merged commit 0464ae0 into scipy:master Dec 18, 2018
4 checks passed
4 checks passed
ci/circleci: build_docs Your tests passed on CircleCI!
Details
ci/circleci: pypy3 Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@rgommers rgommers added this to the 1.3.0 milestone Dec 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.