
Additions to Univariate KDEs #973

Open · wants to merge 19 commits into main

Conversation

@Padarn (Contributor) commented Jul 18, 2013

I've added 'cdf_eval', 'icdf_eval', 'variance' and 'variance_eval' functions to the UnivariateKDEs.

I realise that the format of these changes may not be what is wanted in the long run, but I figured starting a pull request would help push discussion along so some choice is made.

ref: #904

# put here to ensure empty cache after re-fit with new options
self._cache = resettable_cache()

@cache_readonly
- def cdf(self):
+ def cdf_sup(self):
Review comment (Member):

This breaks existing code, there's no good reason to do so here.

@rgommers (Member):

I can see how some of these methods will be useful, but I don't yet have a good feeling about what and how to add to make this class complete (whatever that means).

Radical idea: let fit() construct a scipy.stats.distributions instance, with _pdf returning the density estimate. This gives us all the generic methods of statistical distributions with a familiar API.
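The radical idea above could look roughly like the following sketch, which subclasses scipy.stats.rv_continuous and routes `_pdf` through a density estimate; `KDEDistribution` and the `gaussian_kde` stand-in are illustrative only, not statsmodels API:

```python
import numpy as np
from scipy import stats

class KDEDistribution(stats.rv_continuous):
    """Wrap a density-estimate callable in the rv_continuous machinery."""

    def __init__(self, kde_pdf, a, b):
        super().__init__(a=a, b=b, name="kde")
        self._kde_pdf = kde_pdf

    def _pdf(self, x):
        # rv_continuous calls _pdf both with scalars (during quad
        # integration) and with arrays (from the vectorized pdf method)
        out = self._kde_pdf(np.atleast_1d(x))
        return out.item() if np.ndim(x) == 0 else out

data = np.array([0.1, 0.5, 0.9, 1.4, 2.0])
gkde = stats.gaussian_kde(data)              # stand-in density estimate
dist = KDEDistribution(gkde, a=-5.0, b=8.0)  # finite support for integration

p = dist.cdf(1.0)   # generic cdf: numerical integration of _pdf
q = dist.ppf(0.5)   # generic inverse cdf: root-finding on the cdf
```

The appeal is that cdf, ppf, rvs, and moments all come for free from the generic machinery; the downside (discussed later in this thread) is that every call recomputes from scratch.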

@Padarn (Contributor, Author) commented Jul 21, 2013

Oops, some careless mistakes here sorry, thanks rgommers. Will clean this up when I get a chance later today.

I'm not sure about the radical idea. There are certain operations that are particularly expensive when working with a KDE, so it might be good to keep them separate. Also, you can always add new data/change the bandwidth etc.

@Padarn (Contributor, Author) commented Jul 28, 2013

Sorry it took a while to address this, have just started a new job, so haven't had a lot of time.

I've fixed the careless mistakes (I hope) you pointed out above. The functions that were commented out were two functions I was trying to suggest we add. I have renamed them cdf_eval and icdf_eval for now - they provide the ability to get the cdf and icdf at any given point, not just the point on the 'support grid'.

Would agree these names are not ideal, not sure what the best solution is though.

@josef-pkt (Member):

I don't think we start descriptions with Returns ...

although http://www.python.org/dev/peps/pep-0257/ starts with Return this and that

statsmodels docstring standard? implied or MIA
(often we need to minimize the use of almost redundant words to fit it in one line)

@Padarn (Contributor, Author) commented Jul 28, 2013

Sorry I keep changing my mind. I've formatted the comment to fit in with the description of the other functions in the same file.

Happy to change it if needed - probably not urgent as the other parts of this pull request are probably more likely to offend.

if x <= self.support[0]:
return -1*np.infty

index = bisect_left(self.cdf, x)
Review comment (Member):

Does bisect_left do the same as np.searchsorted?

Reply (Contributor, Author):

It does indeed, I hadn't seen that. np.searchsorted is obviously significantly faster too - will change this.
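The equivalence mentioned above is easy to check: `bisect.bisect_left` and `np.searchsorted(..., side='left')` return the same insertion index on a sorted array, and searchsorted is vectorized for array-valued queries. A small sketch (the grid here is an illustrative stand-in for a precomputed cdf grid):

```python
from bisect import bisect_left

import numpy as np

cdf_grid = np.linspace(0.0, 1.0, 11)  # stand-in for a precomputed cdf grid
x = 0.37

i_bisect = bisect_left(cdf_grid, x)                     # pure-Python bisection
i_np = int(np.searchsorted(cdf_grid, x, side="left"))   # vectorized equivalent
```

Both return the index of the first grid element not less than x, so either can locate the bracketing interval for interpolation.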

@@ -231,19 +265,59 @@ def entr(x,s):
return -integrate.quad(entr, a,b, args=(endog,))[0]

@cache_readonly
- def icdf(self):
+ def icdf(self, sample_quantile=False):
Review comment (Member):

we might need a third option for the exact calculation, then instead of a boolean sample_quantiles we might need a string, something like method='interpolate'

@josef-pkt (Member):

Overall: I think the direction of adding the options works. We might want to review the method names at the end again.

One issue is to have consistent methods/options available across methods, so we can either get the exact calculation (everything based on the kernel density) or the fast methods (with interpolation or other tricks).
I'm not sure yet how this works with icdf_eval.

@Padarn (Contributor, Author) commented Jul 28, 2013

I'll wait to address specific issues until I can tidy it up in a consistent fashion.

Hmm, I suppose it is possible that you would want to 'icdf_eval' on the sample quantiles - but this seems somewhat unlikely to me. Honestly, it seems very odd to even allow icdf to return the sample quantiles, and I would think it more sensible to remove this, but it might break compatibility somewhere?

@josef-pkt (Member):

'icdf_eval' on the sample quantiles

I also think it's not appropriate in the KDE; this is already available in ECDF.

@Padarn (Contributor, Author) commented Jul 28, 2013

Okay, so I remove this as an option entirely? I guess the unit tests should show any problems that crop up.

@josef-pkt (Member):

Yes, remove the sample quantiles. I just skimmed #904 again, and because the initial problem was with the incorrect caching, it didn't work anyway.
I think we should still make some changes to this for 0.5 (before a release).

@Padarn (Contributor, Author) commented Jul 28, 2013

Yes there is the caching issue too.

I think this could be finalized fairly quickly. What do you think about the names 'cdf_eval' etc? It would be nice if you could call 'cdf(method=exact)' or something, but the caching stops this.

@josef-pkt (Member):

Yes there is the caching issue too.

I think this could be finalized fairly quickly. What do you think about the names 'cdf_eval' etc? It would be nice if you could call 'cdf(method=exact)' or something, but the caching stops this.

We still need unittests.

My preference would still be to call the fixed precalculated values xxx_values, e.g. for pdf and cdf.

Did we have the incorrect caching also for cdf? I cannot figure this out anymore from a quick read of issue #885.

Changing cdf from a cached attribute would be a backwards-incompatible change, if it was working before and the change is not a bug fix.

@Padarn (Contributor, Author) commented Jul 29, 2013

Okay, I'll make some changes and then recommit and see what you think.

I think there were/are two caching issues:

  1. Caching behavior for KDEUnivariate icdf (#885) was a true bug, and has now been fixed. It affected all the cached values, I believe.
  2. Not exactly a bug, but due to the way caching is implemented, it is not possible to add optional arguments to @cache_readonly functions.

.xxx_values can still be cached, although if .xxx is changed to take arguments, then .xxx will return a function rather than the array that it currently returns. (apologies, finding it difficult to articulate this well)
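To illustrate the point above: a cached attribute computes once and returns an array, while a method with arguments needs per-call dispatch. One pattern (illustrative only, not statsmodels' actual API) is to keep the fixed grid result in a cached `cdf_values` attribute and let `cdf(x, method=...)` choose an evaluation strategy:

```python
from functools import cached_property

import numpy as np

class SketchKDE:
    """Illustrative pattern only: cached grid values plus a callable method."""

    def __init__(self, support, density):
        self.support = np.asarray(support)
        self.density = np.asarray(density)

    @cached_property
    def cdf_values(self):
        # fixed precomputed grid values; computed once, then cached
        dx = self.support[1] - self.support[0]  # uniform grid assumed
        return np.cumsum(self.density) * dx

    def cdf(self, x, method="interpolate"):
        # a method can take options; a cached attribute cannot
        if method == "interpolate":
            return np.interp(x, self.support, self.cdf_values)
        raise NotImplementedError("'exact' would integrate the kernel density")

grid = np.linspace(-4, 4, 801)
kde = SketchKDE(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi))
p = kde.cdf(0.0)  # close to 0.5 for the standard-normal stand-in density
```

With this split, `.cdf_values` stays a cheap cached array for backwards compatibility, while `.cdf(x, method=...)` becomes a function, which is exactly the API change being debated.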

else:
a,b = kern.domain
func = lambda x,s: kern.density(s,x)
return np.cumsum(self.pdf_values)
Review comment (Member):

cumsum doesn't sound very accurate. Should we still get the integrate.quad solution somewhere?

Also, don't we need to correct for the distance between points?

Reply (Contributor, Author):

The cdf(method='exact') will still return the integrate.quad solution. I wasn't too happy with the cumsum implementation, but my idea was to prioritize speed first, and then have optional accuracy.

I was under the impression the density grid was uniform (my initial scan shows it being built by a np.linspace call).

Reply (Contributor, Author):

Oops, I see what you mean about the distance - yes, good point, I need to change this.

@Padarn (Contributor, Author) commented Oct 23, 2013

I fixed up a bug that josef-pkt pointed out and improved the accuracy of the cdf_values method by doing a 'trapz' integration rather than my silly midpoint. Shouldn't be any slower.

Two notes:

  1. Still failing tests due to some method renaming
  2. Think probably the next thing I should do is write some unit-tests for the KDEUnivariate testing this functionality. Silly bugs keep creeping in.
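The accuracy gain from switching to trapezoidal integration can be sketched as follows; the standard-normal density here is an illustrative stand-in for the KDE grid:

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

grid = np.linspace(-4, 4, 201)
dens = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # stand-in smooth density
dx = grid[1] - grid[0]

# rectangle-rule style cumulative sum (the earlier, rougher approach)
cdf_riemann = np.cumsum(dens) * dx

# trapezoid rule at the same cost; initial=0 keeps the output the same length
cdf_trapz = cumulative_trapezoid(dens, grid, initial=0)
```

Both approximate the cdf on the grid and both scale by the grid spacing, but the trapezoid error is O(dx^2) rather than O(dx), which is visible at interior grid points.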

@josef-pkt (Member):

I think the overall pattern looks good. There might be smaller changes that will still show up.
It needs unit tests for the changes and additions.

(One refactoring that might be useful in the future is to distinguish between model and results, since so many attributes rely on fit being called first. But it's not quite the same, because the methods don't necessarily need the results of fit.)

@josef-pkt (Member):

Related point: fit should return self
https://groups.google.com/d/msg/pystatsmodels/2bJGPadlpXg/6ocU-2C8h1wJ
I don't think we opened an issue, and I don't know what the other KDE classes are doing.

@Padarn (Contributor, Author) commented Nov 16, 2013

I was playing around with my changes the other day for a project I'm working on, and felt like maybe I should rebuild them from the ground up and open a new pull request?

What do you think of the idea of having the interface work more like a scipy.stats distribution? It would be nice to be able to do things like

KDE.rvs(n)

Using the estimated distribution to draw from.
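For a Gaussian kernel, one standard way to implement such an rvs method is the "smoothed bootstrap": resample the observations, then add kernel noise at the fitted bandwidth. A minimal sketch (the function name and bandwidth value are illustrative):

```python
import numpy as np

def kde_rvs(data, bw, size, rng=None):
    """Draw `size` samples from a Gaussian KDE fit to `data` with bandwidth `bw`."""
    rng = np.random.default_rng(rng)
    centers = rng.choice(data, size=size, replace=True)  # resample the data
    return centers + rng.normal(scale=bw, size=size)     # add kernel noise

data = np.array([0.1, 0.5, 0.9, 1.4, 2.0])
draws = kde_rvs(data, bw=0.4, size=1000, rng=0)
```

This draws exactly from the mixture of Gaussians that the KDE represents, without evaluating the density or its inverse cdf at all.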

@josef-pkt (Member):

Adding an rvs method should be part of these kind of classes.

However, I don't think scipy.stats.distributions has a good pattern for this. The main problem is that they don't have a way of storing pre-calculated or temporary variables.
Each stats distribution is almost completely a collection of functions; each method does everything from scratch.
(There are possibilities to adjust frozen distributions, but nothing like that is implemented yet. And in many cases we don't save much in the scipy stats distributions, because there is nothing worthwhile to precalculate and store.)

The design here is much more appropriate because in many cases we can and want to reuse prior calculations to speed up the methods.
A pure non-binned KDE could be made to work from scratch, like the "exact" version, but that would be slow for many use cases.

@Padarn (Contributor, Author) commented Nov 17, 2013

Ah yes, you make a very good point.

Although presumably we could subclass scipy.stats.distributions, redefining everything to make use of pre-calculated values. I'm not sure what value there would be in that, though.

@coveralls:

Coverage Status

Coverage remained the same when pulling b2c35b4 on Padarn:master into 9d4b1f8 on statsmodels:master.

@Padarn (Contributor, Author) commented Nov 17, 2013

I found a fair number of bugs when I was playing with my code, which I've patched up. I also rewrote the ppf stuff - because it was wrong. I'm not really happy with the way it works, but the concerns are the same as expressed earlier.

@jseabold (Member) commented Apr 3, 2014

I didn't really follow all of this. What's the status on this? Is it a mix of enhancements and bug fixes? Are there API changes?

@Padarn (Contributor, Author) commented Apr 4, 2014

It was a while ago, so I can't remember 100%; I'll have to take another look to give you an accurate answer.

But as far as I can remember - yes there were some suggested API changes, along with bug fixes/enhancements. I believe the API changes are all backwards compatible though.

I'll take another look through this to refresh my memory and then summarize the proposed changes.

@jseabold (Member) commented Apr 4, 2014

That would be great. If there are smaller changes that would fit naturally as their own PR feel free to pull them out into a new PR. E.g., splitting up enhancements and bug fixes.

@Padarn (Contributor, Author) commented Apr 4, 2014

Okay, I've had a look over this now, and my overall impression is that it would probably be best to just submit a new issue for the API changes/enhancements, as there was no firm consensus.

The API changes were essentially changing kde properties like kde.pdf to functions which allow a variety of evaluation options, e.g., kde.pdf(x, method='exact'). The pre-computed versions would then be stored in kde.pdf_values instead. This API change does break some tests unfortunately.

Enhancements were mostly in adding 'exact' calculations for the cdf, pdf and icdf (ppf). Also, the kde.cdf values are currently very slow to calculate, so I was adding a rougher approximation for this.

Bug fixes throughout, but most have already been incorporated through other pull requests.

If it sounds good to you, I'll just close this and open an issue to discuss it. Won't be hard to incorporate into a new pull request.

@josef-pkt josef-pkt modified the milestones: 0.6, 0.7 Sep 22, 2014
@josef-pkt josef-pkt modified the milestones: 0.8, 0.7 Jul 3, 2015
@josef-pkt josef-pkt mentioned this pull request Jul 4, 2015
15 tasks
@josef-pkt josef-pkt removed the PR label Mar 5, 2016
@jseabold jseabold removed this from the 0.8 milestone Apr 2, 2021