Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: quartile coeff dispersion #13475

Closed
wants to merge 34 commits into from

Conversation

YarivLevy81
Copy link

Reference issue

Closes #13385

What does this implement/fix?

Added implementation for quartile coefficient of dispersion, as requested in #13385.
Added test cases, documentation like requested in contributing guidelines.
Tried to generalise things as good as possible, and to have good code coverage - It's not a very complicated issue so I hope it's good enough.

Additional information

I wasn't sure about few things:

  1. We could possibly implement it as quantile_coeff_dispersion with other values expect of 0.25, 0.5, 0.75. What are your thoughts?
  2. my function returns a float or an array_like of float, is it problematic? Is it better to just return an array with a single value?
  3. I'm using np.quantile function to compute the quartiles, this function has more arguments and but I don't necessarily want to use all of them. What would be the best practice here?

@YarivLevy81 YarivLevy81 changed the title ENH: stats: Quartile coeff dispersion (#13385) ENH: stats: Quartile coeff dispersion Jan 31, 2021
@YarivLevy81 YarivLevy81 changed the title ENH: stats: Quartile coeff dispersion ENH: stats: quartile coeff dispersion Jan 31, 2021
@rgommers rgommers added enhancement A new feature or improvement scipy.stats labels Jan 31, 2021
scipy/stats/__init__.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_morestats.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_morestats.py Outdated Show resolved Hide resolved
@mdhaber
Copy link
Contributor

mdhaber commented Jan 31, 2021

We could possibly implement it as quantile_coeff_dispersion with other values expect of 0.25, 0.5, 0.75. What are your thoughts?

I'd prefer to leave out the restrictions on quantile values, personally. The restriction seems unnecessary and it's simpler in documentation and code to leave out the restriction. If there is a strong reason to protect people from using the 20th and 80th percentiles, please correct me!

my function returns a float or an array_like of float, is it problematic? Is it better to just return an array with a single value?

It is preferred to return a float than a 0d array when the axis of a 1d array gets consumed. That is the typical behavior of np.mean([1, 2, 3]), for example, unless keepdims=True, and I don't think we have functions with keepdims=True like NumPy does.

I'm using np.quantile function to compute the quartiles, this function has more arguments and but I don't necessarily want to use all of them. What would be the best practice here?

Do you mean "should I expose them as parameters of this function?" It's OK to leave them out. For the sake of keeping the API simpler, I think it's preferable to leave them out.

@mdhaber
Copy link
Contributor

mdhaber commented Jan 31, 2021

Nice start @YarivLevy81 ! Don't worry about some of the CI failures, as we're having issue in master right now due to the NumPy 1.20 release, but do check out the "scipy.scipy (Lint)" failure (PEP8).

scipy/stats/morestats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few other things to think about, but I think after this it will be very close!

scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_morestats.py Outdated Show resolved Hide resolved
scipy/stats/morestats.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_morestats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me after the suggestions are implemented (assuming CI passes).

@YarivLevy81
Copy link
Author

Looks good to me after the suggestions are implemented (assuming CI passes).
All checks pass.

@mdhaber
Copy link
Contributor

mdhaber commented Feb 9, 2021

@rkern I think this PR is in pretty good shape. Would you like to take a look?

@mdhaber mdhaber requested a review from rkern February 9, 2021 19:41
@mdhaber
Copy link
Contributor

mdhaber commented Feb 9, 2021

Actually, one thought after looking at the reference linked from Wikipedia - it might be desirable in the future to provide a way of computing a coefficient of dispersion confidence interval. If that's the case, please think about how that might be possible in a backwards-compatible way. One option is to have an optional argument that selects the option to return the confidence interval and, if that's the case, the function returns the confidence interval rather than just the sample statistic. Alternatively, other functions are beginning to return objects that have a method to evaluate the confidence interval if desired. (That might be overkill here, but if it might be desirable, it would be harder to do in a backwards-compatible way if we only return a scalar now.) Of course, there could always be a separate function - but in any case, it's something that deserves a little thought now.

Update: email sent to developer mailing list.

scipy/stats/morestats.py Outdated Show resolved Hide resolved
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@mdhaber
Copy link
Contributor

mdhaber commented Feb 18, 2021

@smurfit89 Does this address your proposal on gh-13385?

Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been taking another look at this today, and I think we need to address the discrepancy in quartile calculation conventions. Currently this function cannot replicate the result given in [1]. I submitted YarivLevy81#1 to begin to address this, but I think we need to take a closer look at the literature and see if there is consensus on how to define the quartiles.

Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the change in opinion @YarivLevy81. I didn't notice before that a sample had been added in the data used in tests to make the results match those in Wikipedia.

This actually might change my opinion on whether we should make this quantile_coeff_variation instead of quartile_coeff_variation. A quartile_coeff_variation that uses a more standard convention for calculating quartiles may be more useful than a quantile_coeff_variation which just uses np.quantile (and doesn't agree with examples in textbooks, etc.)

Do you have access to textbooks, etc. in which this statistic is defined and examples are given?

The Wikipedia article refers to this article, which refers to a book by Zwillinger and Kokoska . Here is how they define quartiles:

image

where

image

Then they define the Coefficient of Quartile Variation:

image

I'm beginning to think we should follow that and not attempt to generalize to arbitrary quantiles at this time. What do you think?

@YarivLevy81
Copy link
Author

Sorry for the change in opinion @YarivLevy81. I didn't notice before that a sample had been added in the data used in tests to make the results match those in Wikipedia.

This actually might change my opinion on whether we should make this quantile_coeff_variation instead of quartile_coeff_variation. A quartile_coeff_variation that uses a more standard convention for calculating quartiles may be more useful than a quantile_coeff_variation which just uses np.quantile (and doesn't agree with examples in textbooks, etc.)

Do you have access to textbooks, etc. in which this statistic is defined and examples are given?

The Wikipedia article refers to this article, which refers to a book by Zwillinger and Kokoska . Here is how they define quartiles:

image

where

image

Then they define the Coefficient of Quartile Variation:

image

I'm beginning to think we should follow that and not attempt to generalize to arbitrary quantiles at this time. What do you think?

This absolutely makes sense, I'll work to readjust it.

@mdhaber
Copy link
Contributor

mdhaber commented Feb 19, 2022

@YarivLevy81 were you still interested in completing this? If not, I understand.

@mdhaber
Copy link
Contributor

mdhaber commented Feb 27, 2022

Closing since I haven't heard back, but feel free to reopen if you're still interested!

@mdhaber mdhaber closed this Feb 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

robust_variation aka robust version of coeffient of variation for scipy.stats
3 participants