Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove spurious points from ECDFs of discrete distributions #6480

Closed
Wrzlprmft opened this issue Jan 28, 2020 · 12 comments
Closed

Remove spurious points from ECDFs of discrete distributions #6480

Wrzlprmft opened this issue Jan 28, 2020 · 12 comments

Comments

@Wrzlprmft
Copy link

Consider the following code:

from statsmodels.distributions.empirical_distribution import ECDF
import numpy as np

ecdf = ECDF([1,2,2,2,3])
print(np.vstack((ecdf.x,ecdf.y)))

This returns:

[[-inf  0. ]
 [ 1.   0.2]
 [ 2.   0.4]
 [ 2.   0.6]
 [ 2.   0.8]
 [ 3.   1. ]]

The point (2, 0.6) is not needed here. For an ECDF of a large sample from a discrete distribution, the number of such spurious points can outnumber the required ones by far and make makes plotting, evaluation, etc. more time-consuming.

It would therefore be great if ECDF removed such points by itself. If there is an efficiency trade-off, there should at least be the option to obtain this behaviour.

@josef-pkt
Copy link
Member

ecdf is designed for continuous distributions. Maybe we need a separate ecdf (with partial overlap/inheritance) for discrete distributions that are defined by counts or freq_weights.

(I never thought much about this case, AFAIR.)

One difference will be when we want to sample from the ecdf. Without duplicate removal, each point has the same probability, eg. rvs is equivalent to simple random sampling from the data.
(Another difference, maybe not relevant for ecdf, is that we can add noise or extend it to a piecewise linear cdf in the case of continuous random variables.)

@bashtage
Copy link
Member

Should be pretty trivial to group by the first column and then take the max of the second for each group.

@josef-pkt
Copy link
Member

The user could just do a pandas groupby to count the frequencies, if count or freq_weights are an option.

https://github.com/statsmodels/statsmodels/pull/5701/files#diff-20d6c52bc3c1b608d3bb7305106f5778
for the earlier (incomplete) linear interpolation case

@josef-pkt
Copy link
Member

I used np.bincount for counting integer valued random variables

example for count data, but doesn't use a step function for plotting
https://gist.github.com/josef-pkt/c932904296270d75366a24ee92a4eb2f
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/_diagnostics_count.py#L73

@josef-pkt
Copy link
Member

do we have a plot method for ecdf?

AFAICS, statsmodels.distributions.empirical_distribution.StepFunction is missing an plot_ecdf method.

@josef-pkt
Copy link
Member

adding confint for ecdf #1297 (comment)

if we use ks_test bounds, e.g https://stats.stackexchange.com/questions/298290/plotting-non-parametric-ecdef-confidence-envelopes-for-comparison
then the distribution is assumed to be continuous.

@Wrzlprmft
Copy link
Author

One difference will be when we want to sample from the ecdf. Without duplicate removal, each point has the same probability, eg. rvs is equivalent to simple random sampling from the data.

How can there be a difference?
We are talking about two different ways to represent the exact same distribution here.
In fact, if there were any difference, I would consider this a bug in StepFunction.

For example, for sampling, we end up with the following equivalent cases:

  • Sampling from [1,2,2,2,3] with equal probability.
  • Sampling 1 with p = 0.2, 2 with p = 0.6, and 3 with p = 0.2.

Should be pretty trivial to group by the first column and then take the max of the second for each group.

Looking at the source code of ECDF, you could just do:

needed = np.diff(x,append=np.inf).astype(bool)
super(ECDF, self).__init__(x[needed], y[needed], side=side, sorted=True)

This (O(n)) should take less time than the sorting (O(n·log(n))) anyway.

On another thought, such a thing may be better placed in the __init__ of StepFunction.

do we have a plot method for ecdf?

AFAICS, statsmodels.distributions.empirical_distribution.StepFunction is missing an plot_ecdf method.

FWIW, plt.step(ecdf.x,ecdf.y,where="post") seems to do the right thing.

@josef-pkt
Copy link
Member

I didn't realize we already have the code for simultaneous confidence band in _conf_set function. It's only use in the plot example in __main__
https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality

What I mean here is that the technical solution of extending it to discrete random variables is pretty easy, however, the statistics for mixing discrete and continuous in one class is not.
In the current ecdf class, there is no code specific to continuous variables, but if we add support for discrete, then we admit that it supports discrete random variables.
This would not work with the extensions that we only have under the assumption that the underlying random variable is continuous.
That is, I would like to "future proof" any changes that we make to support ecdf for discrete random variables.

@josef-pkt
Copy link
Member

About the plot method:
The main reason to add a method even if it's just a one-liner is that I didn't and wouldn't remember the where="post" option.
(I made a mistake somewhere before where I added a step function that used the wrong where)

@josef-pkt
Copy link
Member

One possibility would be to add a flag like data_type='continuous' to the ECDF class, so we can can make returns depend on the data type.
I thought initially of just using a binary keyword option like is_continuous=True, but we might have to distinguish, e.g. ordered (arbitrary x with finite (?) support) and count (x are integers based on range(0, xmax)) data.

I guess that the statistics for ordered data with finite support could be based on multinomial distribution.

@josef-pkt
Copy link
Member

pull request #8192 implements ECDFDiscrete which computes unique values and their frequency counts from data. Alternatively, the user can specify the frequencies for uniques as freq_weights

@josef-pkt josef-pkt added this to the 0.15 milestone Dec 21, 2022
@josef-pkt
Copy link
Member

see also #8193 for adding confint

closing this #8192 ECDFDiscrete is merged (for 0.14)

still open are confint and plot methods. Those are separate open issues for 0.15

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants