Remove spurious points from ECDFs of discrete distributions #6480
Comments
ecdf is designed for continuous distributions. Maybe we need a separate ecdf (with partial overlap/inheritance) for discrete distributions that are defined by counts or freq_weights. (I never thought much about this case, AFAIR.) One difference will be when we want to sample from the ecdf. Without duplicate removal, each point has the same probability, e.g. rvs is equivalent to simple random sampling from the data. |
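To illustrate the sampling point, here is a minimal sketch of inverse-transform sampling from a deduplicated ECDF. The helper name `inverse_ecdf` is hypothetical and not part of statsmodels; it only assumes NumPy. Each unique value ends up being drawn with probability count / n, which matches drawing uniformly from the raw data.

```python
import numpy as np

def inverse_ecdf(data, u):
    """Map uniform draws u in [0, 1) to sample values via the deduplicated ECDF.

    Hypothetical helper for illustration; not a statsmodels function.
    """
    uniq, counts = np.unique(data, return_counts=True)
    cdf = np.cumsum(counts) / counts.sum()
    # First index whose cumulative probability reaches u.
    idx = np.searchsorted(cdf, u, side="left")
    return uniq[idx]

data = np.array([3, 1, 2, 3, 2])
rng = np.random.default_rng(0)
draws = inverse_ecdf(data, rng.random(10_000))
# Each value is drawn with probability counts / n, as it would be
# when resampling `data` uniformly with replacement.
```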
Should be pretty trivial to group by the first column and then take the max of the second for each group. |
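A rough sketch of that grouping step, written with NumPy only rather than an actual groupby. The function name `collapse_steps` is made up for this example, and it assumes `x` is already sorted ascending, as the ECDF points would be.

```python
import numpy as np

def collapse_steps(x, y):
    """Keep one (x, max y) pair per distinct x. Assumes x is sorted ascending.

    Hypothetical helper for illustration; not a statsmodels function.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # For sorted x, the first-occurrence indices are the group starts.
    uniq, starts = np.unique(x, return_index=True)
    # Maximum of y within each group [start_i, start_{i+1}).
    return uniq, np.maximum.reduceat(y, starts)

x = np.array([1, 2, 2, 3, 3])
y = np.arange(1, 6) / 5  # ECDF heights 0.2, 0.4, 0.6, 0.8, 1.0
ux, uy = collapse_steps(x, y)
```

Because the ECDF heights are non-decreasing, the group maximum is simply the last height in each run of equal `x` values.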
The user could just do a pandas groupby to count the frequencies, if count or freq_weights are an option. https://github.com/statsmodels/statsmodels/pull/5701/files#diff-20d6c52bc3c1b608d3bb7305106f5778 |
I used np.bincount for counting integer-valued random variables in an example for count data, but it doesn't use a step function for plotting. |
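For reference, a small sketch of how np.bincount could give the ECDF of non-negative integer data directly. This is not the code from the linked example; `ecdf_from_counts` is a hypothetical name used only here.

```python
import numpy as np

def ecdf_from_counts(data):
    """ECDF of non-negative integer data on the grid 0..max(data).

    Hypothetical helper for illustration; not a statsmodels function.
    """
    data = np.asarray(data)
    counts = np.bincount(data)            # counts[k] = number of observations equal to k
    support = np.arange(counts.size)      # 0, 1, ..., max(data)
    cdf = np.cumsum(counts) / data.size   # cumulative relative frequency
    return support, cdf

support, cdf = ecdf_from_counts([0, 2, 2, 3, 3])
```

Note that the grid includes integers that never occur in the data (here 1), where the ECDF stays flat; filtering with `counts > 0` would drop those. For plotting, something like matplotlib's `plt.step(support, cdf, where="post")` should give the step shape.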
Do we have a plot method for ecdf? AFAICS, statsmodels.distributions.empirical_distribution.StepFunction is missing a plot_ecdf method. |
adding confint for ecdf #1297 (comment) if we use ks_test bounds, e.g. https://stats.stackexchange.com/questions/298290/plotting-non-parametric-ecdef-confidence-envelopes-for-comparison |
How can there be a difference? For example, for sampling, we end up with the following equivalent cases:
Looking at the source code of
needed = np.diff(x, append=np.inf).astype(bool)
super(ECDF, self).__init__(x[needed], y[needed], side=side, sorted=True)
This (O(n)) should take less time than the sorting (O(n·log(n))) anyway. On another thought, such a thing may be better placed in the
FWIW, |
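The np.diff-based filtering quoted in this comment can be tried outside the class. The sketch below is a standalone version under the assumption that `x` is the sorted sample and `y` the usual heights i/n; the function name `drop_spurious_points` is hypothetical.

```python
import numpy as np

def drop_spurious_points(sample):
    """Return ECDF points with only the last (highest) point per distinct value.

    Hypothetical standalone helper; it mirrors the mask from the comment,
    not the actual statsmodels implementation.
    """
    x = np.sort(np.asarray(sample))
    y = np.arange(1, x.size + 1) / x.size
    # True where the next sorted value differs; the appended inf keeps the last point.
    needed = np.diff(x, append=np.inf).astype(bool)
    return x[needed], y[needed]

xs, ys = drop_spurious_points([3, 1, 2, 3, 2])
```

The mask is a single O(n) pass over the already sorted array, so it adds little on top of the O(n·log(n)) sort.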
I didn't realize we already have the code for simultaneous confidence band in What I mean here is that the technical solution of extending it to discrete random variables is pretty easy, however, the statistics for mixing discrete and continuous in one class is not. |
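As a point of comparison, one common simultaneous band is the Dvoretzky-Kiefer-Wolfowitz (DKW) band. Whether that is the band meant in this comment is an assumption on my part; the sketch uses NumPy only, and `dkw_band` is a hypothetical name. The DKW inequality also holds for discrete distributions, though the band is then conservative.

```python
import math
import numpy as np

def dkw_band(cdf_values, n, alpha=0.05):
    """Simultaneous (1 - alpha) band around ECDF heights via the DKW inequality.

    Hypothetical helper for illustration; n is the sample size.
    """
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    cdf_values = np.asarray(cdf_values, dtype=float)
    lower = np.clip(cdf_values - eps, 0.0, 1.0)
    upper = np.clip(cdf_values + eps, 0.0, 1.0)
    return lower, upper

heights = np.array([0.2, 0.6, 1.0])
lower, upper = dkw_band(heights, n=100)
```

The half-width depends only on n and alpha, so it is the same whether or not duplicate points have been removed from the ECDF.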
About the plot method: |
One possibility would be to add a flag like I guess that the statistics for ordered data with finite support could be based on multinomial distribution. |
pull request #8192 implements ECDFDiscrete which computes unique values and their frequency counts from data. Alternatively, the user can specify the frequencies for uniques as |
Consider the following code:
This returns:
The point (2, 0.6) is not needed here. For an ECDF of a large sample from a discrete distribution, the number of such spurious points can outnumber the required ones by far and make plotting, evaluation, etc. more time-consuming.
It would therefore be great if ECDF removed such points by itself. If there is an efficiency trade-off, there should at least be the option to obtain this behaviour.