Remove spurious points from ECDFs of discrete distributions #6480

Wrzlprmft · 2020-01-28T15:04:22Z

Consider the following code:

from statsmodels.distributions.empirical_distribution import ECDF
import numpy as np

ecdf = ECDF([1,2,2,2,3])
print(np.vstack((ecdf.x,ecdf.y)))

This returns:

[[-inf  0. ]
 [ 1.   0.2]
 [ 2.   0.4]
 [ 2.   0.6]
 [ 2.   0.8]
 [ 3.   1. ]]

The point (2, 0.6) is not needed here. For an ECDF of a large sample from a discrete distribution, the number of such spurious points can outnumber the required ones by far and make makes plotting, evaluation, etc. more time-consuming.

It would therefore be great if ECDF removed such points by itself. If there is an efficiency trade-off, there should at least be the option to obtain this behaviour.

The text was updated successfully, but these errors were encountered:

josef-pkt · 2020-01-28T15:28:33Z

ecdf is designed for continuous distributions. Maybe we need a separate ecdf (with partial overlap/inheritance) for discrete distributions that are defined by counts or freq_weights.

(I never thought much about this case, AFAIR.)

One difference will be when we want to sample from the ecdf. Without duplicate removal, each point has the same probability, eg. rvs is equivalent to simple random sampling from the data.
(Another difference, maybe not relevant for ecdf, is that we can add noise or extend it to a piecewise linear cdf in the case of continuous random variables.)

bashtage · 2020-01-28T15:30:35Z

Should be pretty trivial to group by the first column and then take the max of the second for each group.

josef-pkt · 2020-01-28T15:57:11Z

The user could just do a pandas groupby to count the frequencies, if count or freq_weights are an option.

https://github.com/statsmodels/statsmodels/pull/5701/files#diff-20d6c52bc3c1b608d3bb7305106f5778
for the earlier (incomplete) linear interpolation case

josef-pkt · 2020-01-28T16:02:55Z

I used np.bincount for counting integer valued random variables

example for count data, but doesn't use a step function for plotting
https://gist.github.com/josef-pkt/c932904296270d75366a24ee92a4eb2f
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/discrete/_diagnostics_count.py#L73

josef-pkt · 2020-01-28T16:09:43Z

do we have a plot method for ecdf?

AFAICS, statsmodels.distributions.empirical_distribution.StepFunction is missing an plot_ecdf method.

josef-pkt · 2020-01-28T16:24:24Z

adding confint for ecdf #1297 (comment)

if we use ks_test bounds, e.g https://stats.stackexchange.com/questions/298290/plotting-non-parametric-ecdef-confidence-envelopes-for-comparison
then the distribution is assumed to be continuous.

Wrzlprmft · 2020-01-28T17:41:27Z

One difference will be when we want to sample from the ecdf. Without duplicate removal, each point has the same probability, eg. rvs is equivalent to simple random sampling from the data.

How can there be a difference?
We are talking about two different ways to represent the exact same distribution here.
In fact, if there were any difference, I would consider this a bug in StepFunction.

For example, for sampling, we end up with the following equivalent cases:

Sampling from [1,2,2,2,3] with equal probability.
Sampling 1 with p = 0.2, 2 with p = 0.6, and 3 with p = 0.2.

Should be pretty trivial to group by the first column and then take the max of the second for each group.

Looking at the source code of ECDF, you could just do:

needed = np.diff(x,append=np.inf).astype(bool)
super(ECDF, self).__init__(x[needed], y[needed], side=side, sorted=True)

This (O(n)) should take less time than the sorting (O(n·log(n))) anyway.

On another thought, such a thing may be better placed in the __init__ of StepFunction.

do we have a plot method for ecdf?

AFAICS, statsmodels.distributions.empirical_distribution.StepFunction is missing an plot_ecdf method.

FWIW, plt.step(ecdf.x,ecdf.y,where="post") seems to do the right thing.

josef-pkt · 2020-01-28T18:39:12Z

I didn't realize we already have the code for simultaneous confidence band in _conf_set function. It's only use in the plot example in __main__
https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality

What I mean here is that the technical solution of extending it to discrete random variables is pretty easy, however, the statistics for mixing discrete and continuous in one class is not.
In the current ecdf class, there is no code specific to continuous variables, but if we add support for discrete, then we admit that it supports discrete random variables.
This would not work with the extensions that we only have under the assumption that the underlying random variable is continuous.
That is, I would like to "future proof" any changes that we make to support ecdf for discrete random variables.

josef-pkt · 2020-01-28T18:42:01Z

About the plot method:
The main reason to add a method even if it's just a one-liner is that I didn't and wouldn't remember the where="post" option.
(I made a mistake somewhere before where I added a step function that used the wrong where)

josef-pkt · 2020-01-28T19:09:53Z

One possibility would be to add a flag like data_type='continuous' to the ECDF class, so we can can make returns depend on the data type.
I thought initially of just using a binary keyword option like is_continuous=True, but we might have to distinguish, e.g. ordered (arbitrary x with finite (?) support) and count (x are integers based on range(0, xmax)) data.

I guess that the statistics for ordered data with finite support could be based on multinomial distribution.

josef-pkt · 2022-03-26T17:14:43Z

pull request #8192 implements ECDFDiscrete which computes unique values and their frequency counts from data. Alternatively, the user can specify the frequencies for uniques as freq_weights

josef-pkt · 2022-12-21T15:49:54Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove spurious points from ECDFs of discrete distributions #6480

Remove spurious points from ECDFs of discrete distributions #6480

Wrzlprmft commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

bashtage commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

Wrzlprmft commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Mar 26, 2022

josef-pkt commented Dec 21, 2022

Remove spurious points from ECDFs of discrete distributions #6480

Remove spurious points from ECDFs of discrete distributions #6480

Comments

Wrzlprmft commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

bashtage commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

Wrzlprmft commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Jan 28, 2020

josef-pkt commented Mar 26, 2022

josef-pkt commented Dec 21, 2022