
Additions to Univariate KDEs #973

Open · wants to merge 19 commits into main

Conversation

@Padarn (Contributor) commented Jul 18, 2013

I've added 'cdf_eval', 'icdf_eval', 'variance' and 'variance_eval' functions to the UnivariateKDEs.

I realise that the format of these changes may not be what is wanted in the long run, but I figured starting a pull request would help push discussion along so some choice is made.

ref: #904

# put here to ensure empty cache after re-fit with new options
self._cache = resettable_cache()

@cache_readonly
- def cdf(self):
+ def cdf_sup(self):
Review comment (Member):

This breaks existing code, there's no good reason to do so here.

@rgommers (Member):

I can see how some of these methods will be useful, but I don't yet have a good feeling about what and how to add to make this class complete (whatever that means).

Radical idea: let fit() construct a scipy.stats.distributions instance, with _pdf returning the density estimate. This gives us all the generic methods of statistical distributions with a familiar API.
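The radical idea above could look roughly like the following sketch, which subclasses scipy.stats.rv_continuous and routes `_pdf` through a density estimate; `KDEDistribution` and the `gaussian_kde` stand-in are illustrative only, not statsmodels API:

```python
import numpy as np
from scipy import stats

class KDEDistribution(stats.rv_continuous):
    """Wrap a density-estimate callable in the rv_continuous machinery."""

    def __init__(self, kde_pdf, a, b):
        super().__init__(a=a, b=b, name="kde")
        self._kde_pdf = kde_pdf

    def _pdf(self, x):
        # rv_continuous calls _pdf both with scalars (during quad
        # integration) and with arrays (from the vectorized pdf method)
        out = self._kde_pdf(np.atleast_1d(x))
        return out.item() if np.ndim(x) == 0 else out

data = np.array([0.1, 0.5, 0.9, 1.4, 2.0])
gkde = stats.gaussian_kde(data)              # stand-in density estimate
dist = KDEDistribution(gkde, a=-5.0, b=8.0)  # finite support for integration

p = dist.cdf(1.0)   # generic cdf: numerical integration of _pdf
q = dist.ppf(0.5)   # generic inverse cdf: root-finding on the cdf
```

The appeal is that cdf, ppf, rvs, and moments all come for free from the generic machinery; the downside (discussed later in this thread) is that every call recomputes from scratch.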

@Padarn (Contributor, Author) commented Jul 21, 2013

Oops, some careless mistakes here sorry, thanks rgommers. Will clean this up when I get a chance later today.

I'm not sure about the radical idea. There are certain operations that are particularly expensive when working with a KDE, so it might be good to keep them separate. Also, you can always add new data/change the bandwidth etc.

@Padarn (Contributor, Author) commented Jul 28, 2013

Sorry it took a while to address this, have just started a new job, so haven't had a lot of time.

I've fixed the careless mistakes (I hope) you pointed out above. The functions that were commented out were two functions I was trying to suggest we add. I have renamed them cdf_eval and icdf_eval for now - they provide the ability to get the cdf and icdf at any given point, not just the point on the 'support grid'.

Would agree these names are not ideal, not sure what the best solution is though.

@josef-pkt (Member):

I don't think we start descriptions with Returns ...

although http://www.python.org/dev/peps/pep-0257/ starts with Return this and that

statsmodels docstring standard? implied or MIA
(often we need to minimize the use of almost redundant words to fit it in one line)

@Padarn (Contributor, Author) commented Jul 28, 2013

Sorry I keep changing my mind. I've formatted the comment to fit in with the description of the other functions in the same file.

Happy to change it if needed - probably not urgent as the other parts of this pull request are probably more likely to offend.

if x <= self.support[0]:
return -1*np.infty

index = bisect_left(self.cdf, x)
Review comment (Member):

Does bisect_left do the same as np.searchsorted?

Reply (Contributor, Author):

It does indeed, I hadn't seen that. np.searchsorted is obviously significantly faster too - will change this.
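The equivalence mentioned above is easy to check: `bisect.bisect_left` and `np.searchsorted(..., side='left')` return the same insertion index on a sorted array, and searchsorted is vectorized for array-valued queries. A small sketch (the grid here is an illustrative stand-in for a precomputed cdf grid):

```python
from bisect import bisect_left

import numpy as np

cdf_grid = np.linspace(0.0, 1.0, 11)  # stand-in for a precomputed cdf grid
x = 0.37

i_bisect = bisect_left(cdf_grid, x)                     # pure-Python bisection
i_np = int(np.searchsorted(cdf_grid, x, side="left"))   # vectorized equivalent
```

Both return the index of the first grid element not less than x, so either can locate the bracketing interval for interpolation.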

@@ -231,19 +265,59 @@ def entr(x,s):
return -integrate.quad(entr, a,b, args=(endog,))[0]

@cache_readonly
- def icdf(self):
+ def icdf(self, sample_quantile=False):
Review comment (Member):

we might need a third option for the exact calculation, then instead of a boolean sample_quantiles we might need a string, something like method='interpolate'

@josef-pkt (Member):

Overall: I think the direction of adding the options works. We might want to review the method names at the end again.

One issue is to have consistent methods/options available across methods, so we can either get the exact calculation (everything based on the kernel density) or the fast methods (with interpolation or other tricks).
I'm not sure yet how this works with icdf_eval.

@Padarn (Contributor, Author) commented Jul 28, 2013

I'll wait to address specific issues until I can tidy it up in a consistent fashion.

Hmm, I suppose it is possible that you would want to 'icdf_eval' on the sample quantiles - but this seems somewhat unlikely to me. Honestly, it seems very odd to even allow icdf to return the sample quantiles, and I would think it more sensible to remove this, but it might break compatibility somewhere?

@josef-pkt (Member):

'icdf_eval' on the sample quantiles

I also think it's not appropriate in the KDE; this is already available in ECDF.

@Padarn (Contributor, Author) commented Jul 28, 2013

Okay, so I remove this as an option entirely? I guess the unit tests should show any problems that crop up.

@josef-pkt (Member):

Yes, remove the sample quantiles. I just skimmed #904 again, and because the initial problem was with the incorrect caching, it didn't work anyway.
I think we should still make some changes to this for 0.5 (before a release).

@Padarn (Contributor, Author) commented Jul 28, 2013

Yes there is the caching issue too.

I think this could be finalized fairly quickly. What do you think about the names 'cdf_eval' etc? It would be nice if you could call 'cdf(method=exact)' or something, but the caching stops this.

@josef-pkt (Member):

Yes there is the caching issue too.

I think this could be finalized fairly quickly. What do you think about the names 'cdf_eval' etc? It would be nice if you could call 'cdf(method=exact)' or something, but the caching stops this.

We still need unittests.

My preference would still be to call the fixed precalculated values xxx_values, e.g. for pdf and cdf.

Did we have the incorrect caching also for cdf? I cannot figure this out anymore from a quick read of issue #885.

Changing cdf from a cached attribute would be a backwards-incompatible change, if it was working before and the change is not a bug fix.

@Padarn (Contributor, Author) commented Jul 29, 2013

Okay, I'll make some changes and then recommit and see what you think.

I think there were/are two caching issues:

  1. Caching behavior for KDEUnivariate icdf (#885) was a true bug, and has now been fixed. It affected all the cached values, I believe.
  2. Not exactly a bug, but due to the way caching is implemented, it is not possible to add optional arguments to @cache_readonly functions.

.xxx_values can still be cached, although if .xxx is changed to take arguments, then .xxx will return a function rather than the array that it currently returns. (apologies, finding it difficult to articulate this well)
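To illustrate the point above: a cached attribute computes once and returns an array, while a method with arguments needs per-call dispatch. One pattern (illustrative only, not statsmodels' actual API) is to keep the fixed grid result in a cached `cdf_values` attribute and let `cdf(x, method=...)` choose an evaluation strategy:

```python
from functools import cached_property

import numpy as np

class SketchKDE:
    """Illustrative pattern only: cached grid values plus a callable method."""

    def __init__(self, support, density):
        self.support = np.asarray(support)
        self.density = np.asarray(density)

    @cached_property
    def cdf_values(self):
        # fixed precomputed grid values; computed once, then cached
        dx = self.support[1] - self.support[0]  # uniform grid assumed
        return np.cumsum(self.density) * dx

    def cdf(self, x, method="interpolate"):
        # a method can take options; a cached attribute cannot
        if method == "interpolate":
            return np.interp(x, self.support, self.cdf_values)
        raise NotImplementedError("'exact' would integrate the kernel density")

grid = np.linspace(-4, 4, 801)
kde = SketchKDE(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi))
p = kde.cdf(0.0)  # close to 0.5 for the standard-normal stand-in density
```

With this split, `.cdf_values` stays a cheap cached array for backwards compatibility, while `.cdf(x, method=...)` becomes a function, which is exactly the API change being debated.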

else:
a,b = kern.domain
func = lambda x,s: kern.density(s,x)
return np.cumsum(self.pdf_values)
Review comment (Member):

cumsum doesn't sound very accurate. Should we still get the integrate.quad solution somewhere?

Also, don't we need to correct for the distance between points?

Reply (Contributor, Author):

The cdf(method='exact') will still return the integrate.quad solution. I wasn't too happy with the cumsum implementation, but my idea was to prioritize speed first, and then have optional accuracy.

I was under the impression the density grid was uniform (my initial scan shows it being built by a np.linspace call).

Reply (Contributor, Author):

Oops, I see what you mean about the distance - yes, good point, I need to change this.

@Padarn (Contributor, Author) commented Oct 23, 2013

I fixed up a bug that josef-pkt pointed out and improved the accuracy of the cdf_values method by doing a 'trapz' integration rather than my silly midpoint. Shouldn't be any slower.

Two notes:

  1. Still failing tests due to some method renaming
  2. Think probably the next thing I should do is write some unit-tests for the KDEUnivariate testing this functionality. Silly bugs keep creeping in.
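The accuracy gain from switching to trapezoidal integration can be sketched as follows; the standard-normal density here is an illustrative stand-in for the KDE grid:

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

grid = np.linspace(-4, 4, 201)
dens = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # stand-in smooth density
dx = grid[1] - grid[0]

# rectangle-rule style cumulative sum (the earlier, rougher approach)
cdf_riemann = np.cumsum(dens) * dx

# trapezoid rule at the same cost; initial=0 keeps the output the same length
cdf_trapz = cumulative_trapezoid(dens, grid, initial=0)
```

Both approximate the cdf on the grid and both scale by the grid spacing, but the trapezoid error is O(dx^2) rather than O(dx), which is visible at interior grid points.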

@josef-pkt (Member):

I think the overall pattern looks good. There might be smaller changes that will still show up.
It needs unit tests for the changes and additions.

(One refactoring that might be useful in the future is to distinguish between model and results, since so many attributes rely on fit being called first. But it's not quite the same, because the methods don't necessarily need the results of fit.)

@josef-pkt (Member):

Related point: fit should return self
https://groups.google.com/d/msg/pystatsmodels/2bJGPadlpXg/6ocU-2C8h1wJ
I don't think we opened an issue, and I don't know what the other KDE classes are doing.

@Padarn (Contributor, Author) commented Nov 16, 2013

I was playing around with my changes the other day for a project I'm working on, and felt like maybe I should rebuild them from the ground up and open a new pull request?

What do you think of the idea of having the interface work more like a scipy.stats distribution? It would be nice to be able to do things like

KDE.rvs(n)

Using the estimated distribution to draw from.
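For a Gaussian kernel, one standard way to implement such an rvs method is the "smoothed bootstrap": resample the observations, then add kernel noise at the fitted bandwidth. A minimal sketch (the function name and bandwidth value are illustrative):

```python
import numpy as np

def kde_rvs(data, bw, size, rng=None):
    """Draw `size` samples from a Gaussian KDE fit to `data` with bandwidth `bw`."""
    rng = np.random.default_rng(rng)
    centers = rng.choice(data, size=size, replace=True)  # resample the data
    return centers + rng.normal(scale=bw, size=size)     # add kernel noise

data = np.array([0.1, 0.5, 0.9, 1.4, 2.0])
draws = kde_rvs(data, bw=0.4, size=1000, rng=0)
```

This draws exactly from the mixture of Gaussians that the KDE represents, without evaluating the density or its inverse cdf at all.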

@josef-pkt (Member):

Adding an rvs method should be part of these kind of classes.

However, I don't think scipy.stats.distributions has a good pattern for this. The main problem is that they don't have a way of storing pre-calculated or temporary variables.
Each stats distribution is almost completely a collection of functions; each method does everything from scratch.
(There are possibilities to adjust frozen distributions, but nothing like that is implemented yet. And in many cases we don't save much in the scipy stats distributions, because there is nothing worthwhile to precalculate and store.)

The design here is much more appropriate because in many cases we can and want to reuse prior calculations to speed up the methods.
A pure non-binned KDE could be made to work from scratch, like the "exact" version, but that would be slow for many use cases.

@Padarn (Contributor, Author) commented Nov 17, 2013

Ah yes, you make a very good point.

Although presumably we could subclass scipy.stats.distributions, redefining everything to make use of pre-calculated values. I'm not sure what value there would be in that, though.

@coveralls:

Coverage Status

Coverage remained the same when pulling b2c35b4 on Padarn:master into 9d4b1f8 on statsmodels:master.

@Padarn (Contributor, Author) commented Nov 17, 2013

I found a fair number of bugs when I was playing with my code, which I've patched up. I also rewrote the ppf stuff - because it was wrong. I'm not really happy with the way it works, but the concerns are the same as expressed earlier.

@jseabold (Member) commented Apr 3, 2014

I didn't really follow all of this. What's the status on this? Is it a mix of enhancements and bug fixes? Are there API changes?

@Padarn (Contributor, Author) commented Apr 4, 2014

It was a while ago, so I can't remember 100%; I'll have to take another look to give you an accurate answer.

But as far as I can remember - yes there were some suggested API changes, along with bug fixes/enhancements. I believe the API changes are all backwards compatible though.

I'll take another look through this to refresh my memory and then summarize the proposed changes.

@jseabold (Member) commented Apr 4, 2014

That would be great. If there are smaller changes that would fit naturally as their own PR feel free to pull them out into a new PR. E.g., splitting up enhancements and bug fixes.

@Padarn (Contributor, Author) commented Apr 4, 2014

Okay, I've had a look over this now, and my overall impression is that it would probably be best to just submit a new issue for the API changes/enhancements, as there was no firm consensus.

The API changes were essentially changing kde properties like kde.pdf to functions which allow a variety of evaluation options, e.g., kde.pdf(x, method='exact'). The pre-computed versions would then be stored in kde.pdf_values instead. This API change does break some tests unfortunately.

Enhancements were mostly in adding 'exact' calculations for the cdf, pdf and icdf (ppf). Also, the kde.cdf values are currently very slow to calculate, so I was adding a rougher approximation for this.

Bug fixes throughout, but most have already been incorporated through other pull requests.

If it sounds good to you, I'll just close this and open an issue to discuss it. Won't be hard to incorporate into a new pull request.

@josef-pkt josef-pkt modified the milestones: 0.6, 0.7 Sep 22, 2014
@josef-pkt josef-pkt modified the milestones: 0.8, 0.7 Jul 3, 2015
@josef-pkt josef-pkt mentioned this pull request Jul 4, 2015
15 tasks
@josef-pkt josef-pkt removed the PR label Mar 5, 2016
@jseabold jseabold removed this from the 0.8 milestone Apr 2, 2021