Additions to Univariate KDEs #973
Conversation
```diff
 # put here to ensure empty cache after re-fit with new options
 self._cache = resettable_cache()

 @cache_readonly
-def cdf(self):
+def cdf_sup(self):
```
This breaks existing code, there's no good reason to do so here.
I can see how some of these methods will be useful, but I don't yet have a good feeling about what and how to add to make this class complete (whatever that means). Radical idea: let
Oops, sorry for some careless mistakes here; thanks rgommers. I'll clean this up when I get a chance later today. I'm not sure about the radical idea. There are certain operations that are particularly expensive when working with a KDE, so it might be good to keep them separate. Also, you can always add new data, change the bandwidth, etc.
Sorry it took a while to address this, have just started a new job, so haven't had a lot of time. I've fixed the careless mistakes (I hope) you pointed out above. The functions that were commented out were two functions I was trying to suggest we add. I have renamed them cdf_eval and icdf_eval for now - they provide the ability to get the cdf and icdf at any given point, not just the points on the 'support grid'. Would agree these names are not ideal, not sure what the best solution is though.
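As a rough illustration of what `cdf_eval`/`icdf_eval` could look like, here is a sketch that evaluates the estimated CDF and inverse CDF at arbitrary points by interpolating on a precomputed support grid. The names, the `scipy.stats.gaussian_kde` stand-in, and the grid size are my assumptions for illustration, not the statsmodels implementation:

```python
import numpy as np
from scipy import stats

# Stand-in for a fitted univariate KDE: build a support grid and the
# cumulative density on it, mimicking .support / .cdf arrays.
data = np.random.RandomState(0).normal(size=500)
kde = stats.gaussian_kde(data)
support = np.linspace(data.min() - 1, data.max() + 1, 512)
pdf_values = kde(support)
cdf_values = np.cumsum(pdf_values) * (support[1] - support[0])
cdf_values /= cdf_values[-1]  # normalize so the grid CDF ends at 1

def cdf_eval(x):
    # CDF at arbitrary x: linear interpolation on the support grid
    return np.interp(x, support, cdf_values)

def icdf_eval(q):
    # inverse CDF at arbitrary quantile q: interpolate the swapped arrays
    return np.interp(q, cdf_values, support)
```

Interpolation keeps both calls cheap; an 'exact' alternative would integrate the kernel density directly at each query point.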
I don't think we start descriptions with that. Although, is there a statsmodels docstring standard? Implied or MIA.
Sorry I keep changing my mind. I've formatted the comment to fit in with the description of the other functions in the same file. Happy to change it if needed - probably not urgent, as the other parts of this pull request are probably more likely to offend.
```python
if x <= self.support[0]:
    return -1*np.infty

index = bisect_left(self.cdf, x)
```
Does `bisect_left` do the same as `np.searchsorted`?
It does indeed, I hadn't seen that. np.searchsorted is obviously significantly faster too - will change this.
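For reference, the two calls agree on sorted input; a tiny check (the toy `cdf` array here is just an illustration):

```python
import numpy as np
from bisect import bisect_left

# bisect_left and np.searchsorted(..., side='left') return the same
# insertion index on a sorted 1-D array; searchsorted is vectorized.
cdf = np.linspace(0.0, 1.0, 11)  # toy monotone CDF grid
x = 0.37
i_bisect = bisect_left(cdf, x)
i_numpy = int(np.searchsorted(cdf, x, side='left'))
```

`searchsorted` additionally accepts a whole array of query points, which is where the speed difference really shows up.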
```diff
@@ -231,19 +265,59 @@ def entr(x,s):
     return -integrate.quad(entr, a,b, args=(endog,))[0]

 @cache_readonly
-def icdf(self):
+def icdf(self, sample_quantile=False):
```
We might need a third option for the exact calculation; then instead of a boolean `sample_quantile` we might need a string, something like `method='interpolate'`.
Overall: I think the direction of adding the options works. We might want to review the method names at the end again. One issue is to have consistent methods/options available across methods, so we can either get the exact calculation (everything based on the kernel density) or the fast methods (with interpolation or other tricks).
I'll wait to address specific issues until I can tidy it up in a consistent fashion. Hmm, I suppose it is possible that you would want to 'icdf_eval' on the sample quantiles - but this seems somewhat unlikely to me. Honestly, it seems very odd to even allow icdf to return the sample quantiles, and I would think it more sensible to remove this, but it might break compatibility somewhere?
I also think it's not appropriate in kde; this is also available in ECDF.
Okay, so I remove this as an option entirely? I guess the unit tests should show any problems that crop up.
Yes, remove the sample quantiles. I just skimmed #904 again. And because the initial problem was with the incorrect caching, it didn't work anyway.
Yes, there is the caching issue too. I think this could be finalized fairly quickly. What do you think about the names 'cdf_eval' etc.? It would be nice if you could call `cdf(method='exact')` or something, but the caching stops this.
We still need unit tests. My preference would still be to call the fixed precalculated values. Did we have the incorrect caching also for cdf? I cannot figure this out. Changing cdf from a cached attribute would be a backwards incompatible change.
Okay, I'll make some changes and then recommit and see what you think. I think there were/are two caching issues:
`.xxx_values` can still be cached, although if `.xxx` is changed to take arguments, then `.xxx` will return a function rather than the array that it currently returns. (Apologies, finding it difficult to articulate this well.)
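A minimal sketch of the pattern being discussed, with hypothetical class and attribute names and `functools.cached_property` standing in for statsmodels' `cache_readonly`: keep `.cdf` as a cached array attribute for backwards compatibility, and add a separate `cdf_eval` method that takes arguments and therefore cannot be a cached attribute:

```python
import numpy as np
from functools import cached_property  # stand-in for cache_readonly

class KDEResultsSketch:
    def __init__(self, support, pdf_values):
        self.support = np.asarray(support)
        self.pdf_values = np.asarray(pdf_values)

    @cached_property
    def cdf(self):
        # computed once per instance, then cached; a re-fit would need
        # to reset this cache (the caching bug discussed above)
        dx = self.support[1] - self.support[0]
        vals = np.cumsum(self.pdf_values) * dx
        return vals / vals[-1]

    def cdf_eval(self, x):
        # takes an argument, so it stays a plain (uncached) method
        return np.interp(x, self.support, self.cdf)
```

Accessing `cdf` stays backwards compatible (it still yields the precomputed array), while `cdf_eval` covers arbitrary evaluation points.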
```python
else:
    a,b = kern.domain
    func = lambda x,s: kern.density(s,x)
return np.cumsum(self.pdf_values)
```
`cumsum` doesn't sound very accurate. Should we still get the `integrate.quad` solution somewhere?
Also, don't we need to correct for the distance between points?
The cdf(method='exact') will still return the integrate.quad solution. I wasn't too happy with the cumsum implementation, but my idea was to prioritize speed first, and then have optional accuracy.
I was under the impression the density grid was uniform (my initial scan shows it being built by an np.linspace call).
Oops, I see what you mean about the distance; yes, good point, need to change this.
I fixed up a bug that josef-pkt pointed out and improved the accuracy of the cdf_values method by doing a 'trapz' integration rather than my silly midpoint. Shouldn't be any slower. Two notes:
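The difference between the plain cumulative sum and the trapezoid version is easy to demonstrate on a known density (a standard normal on a uniform grid, purely for illustration):

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 101)
dx = grid[1] - grid[0]
pdf = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf

# plain cumulative sum scaled by the grid spacing
cdf_cumsum = np.cumsum(pdf) * dx
# trapezoid-rule cumulative integral: averages neighbouring points,
# so it avoids the half-cell bias carried by the running sum
cdf_trapz = np.concatenate(([0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2) * dx))
```

Both are O(n) passes over the grid; on this example the trapezoid version tracks the true CDF closely while the running sum is off by roughly half a grid cell's worth of density.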
I think the overall pattern looks good. There might be smaller changes that will still show up. (One refactoring that might be useful, maybe in the future, is to distinguish between model and results, since many attributes rely on fit being called first. But it's not quite the same, because the methods don't necessarily need the results of fit.)
Related point.
I was playing around with my changes the other day for a project I'm working on, and felt like maybe I should build them from the ground up again and open a new pull request. What do you think of the idea of having the interface work more like a scipy.stats distribution? It would be nice to be able to do things like
Using the estimated distribution to draw samples from.
Adding an rvs method should be part of these kinds of classes. However, I don't think scipy.stats.distribution has a good pattern for this. The main problem is that they don't have a way of storing pre-calculated variables or temp variables. The design here is much more appropriate, because in many cases we can and want to reuse prior calculations to speed up the methods.
Ah yes, you make a very good point. Although presumably we could subclass scipy.stats.distributions, redefining everything to make use of pre-calculated values. Not sure what value there would be in that, though.
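One standard trick for an `rvs` on a Gaussian-kernel KDE is the smoothed bootstrap: resample the data, then jitter each draw with kernel noise scaled by the bandwidth. This is a sketch under assumed names; `bw` and the data here are made up for illustration, not statsmodels attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=500)   # stand-in for the fitted sample
bw = 0.3                      # would come from the fitted bandwidth

def rvs(size):
    # smoothed bootstrap: pick observed points uniformly at random,
    # then add a draw from the Gaussian kernel scaled by the bandwidth
    picks = rng.choice(data, size=size, replace=True)
    return picks + rng.normal(scale=bw, size=size)

samples = rvs(10_000)
```

This avoids the icdf entirely and reuses nothing but the data and bandwidth; for a Gaussian kernel the resulting sample has variance `data.var() + bw**2`, matching the KDE's own variance.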
I found a fair number of bugs when I was playing with my code, which I've patched up. I also rewrote the ppf stuff, because it was wrong. I'm not really happy with the way it works, but the concerns are the same as expressed earlier.
I didn't really follow all of this. What's the status on this? Is it a mix of enhancements and bug fixes? Are there API changes?
It was a while ago, so I can't remember 100%; I'll have to take another look to give you an accurate answer. But as far as I can remember - yes, there were some suggested API changes, along with bug fixes/enhancements. I believe the API changes are all backwards compatible, though. I'll take another look through this to refresh my memory and then summarize the proposed changes.
That would be great. If there are smaller changes that would fit naturally as their own PR, feel free to pull them out into a new PR. E.g., splitting up enhancements and bug fixes.
Okay, I've had a look over this now, and my overall impression is that it would probably be best to just submit a new issue for the API changes/enhancements, as there was no firm consensus. The API changes were essentially changing kde properties like ... Enhancements were mostly in adding 'exact' calculations for the ... Bug fixes were made throughout, but most have already been incorporated through other pull requests. If it sounds good to you, I'll just close this and open an issue to discuss it. Won't be hard to incorporate into a new pull request.
I've added 'cdf_eval', 'icdf_eval', 'variance' and 'variance_eval' functions to the UnivariateKDEs.
I realise that the format of these changes may not be what is wanted in the long run, but I figured starting a pull request would help push discussion along so some choice is made.
ref: #904