KDE implementation #1254
-
Responses to some of Ozzie's comments (out of original order):
I can't speak to whether other statisticians have approached similar use cases, but this does make a difference. For example, we know the exact probability density at each of our samples when we convert from a symbolic distribution, but there's no obvious way to carry this information forward through operations on distributions. Searching for properties that are easier to maintain sounds like a promising direction to me.
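To make the point concrete, here is a minimal sketch (the names `WeightedSample`, `normalPdf`, and `sampleNormalWithDensity` are made up for illustration, not Squiggle's API) of sampling from a symbolic normal while recording the exact pdf value at each draw; the open question above is what operations on distributions should do with those recorded values afterwards.

```typescript
// Hypothetical sketch: sample a symbolic Normal(mu, sigma) while recording
// the exact pdf value at each draw. Not Squiggle's API.
type WeightedSample = { x: number; pdf: number };

function normalPdf(x: number, mu: number, sigma: number): number {
  const z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

function sampleNormalWithDensity(mu: number, sigma: number, n: number): WeightedSample[] {
  const out: WeightedSample[] = [];
  for (let i = 0; i < n; i++) {
    // Box-Muller transform for a standard normal draw
    const u1 = 1 - Math.random(); // avoid log(0)
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    const x = mu + sigma * z;
    out.push({ x, pdf: normalPdf(x, mu, sigma) });
  }
  return out;
}
```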
Yes, a KDE with Gaussian kernels, for example, can only ever fall off like exp(-x²), which obviously doesn't work for some other tail like 1/x². I think this information is simply lost by the time we take samples, and a different approach would only make different assumptions about the tail. However, KDE in log-space is pretty interesting. The case I've thought about is doing the interpolation in log-space: after all, log(exp(-x²)) is just -x², which can be interpolated perfectly! For a sum of Gaussians the second derivative is no longer constant, but it stays bounded in regions where the probability density is not small (close to some sample) and only misbehaves in large gaps between samples, where we don't have good information anyway. However, the same absolute error in log-space corresponds to a much larger error in the density where the density is high, so I'm not sure whether this ends up working nicely in the end.
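As a minimal sketch of the mechanics (the function name and signature are made up), here is interpolation of a tabulated density done in log-space. I'm using piecewise-linear interpolation of log p just for brevity; the perfect-reconstruction point above is about fitting at least a quadratic or cubic to the log-density. A side benefit visible here is that the interpolated density can never go negative.

```typescript
// Hypothetical sketch: interpolate log(p) between grid points and exponentiate.
// Assumes xs is sorted ascending and all ps are strictly positive.
function logSpaceInterpolate(xs: number[], ps: number[], x: number): number {
  if (x <= xs[0]) return ps[0];                         // clamp outside the grid;
  if (x >= xs[xs.length - 1]) return ps[ps.length - 1]; // a real version would model the tail
  let i = 0;
  while (xs[i + 1] < x) i++;                            // linear scan; use binary search in practice
  const t = (x - xs[i]) / (xs[i + 1] - xs[i]);
  const logP = (1 - t) * Math.log(ps[i]) + t * Math.log(ps[i + 1]);
  return Math.exp(logP);                                // exp keeps the result positive by construction
}
```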
It certainly tries to do this here, and it seems to me this should be consistent with your normalization.
Agreed. I do think the work we're going to see in other languages is more along the lines of how to compute a KDE rather than how to use one.
-
Prior context on tails and interpolation:
Interpolation of point sets seems to me like it ought to be a significant problem. Or maybe an opportunity, if you're currently covering for bad interpolation by taking lots of points. I'd be very concerned about cubics, and to a lesser extent Bézier curves, because ordinary shapes could give you negative probabilities. That goes away if you interpolate log-probability, provided the computational cost can be handled. Using a CDF-like transformation when possible should also make polynomial interpolation more effective, by eliminating exponential dropoffs; if it's common for there to be strange shapes in the interior of the curve, this still wouldn't be enough. Another strategy is to make the point density adaptive to how well the curve interpolates in that region; it seems like this should be possible for many analytical curves.

More on tails: a symbolic limit approximation is nice because it should be possible to do a lot of operations with it. For example, if we have PDFs p(x) and q(x) with corresponding CDFs P and Q, the PDF r of the sum of the distributions satisfies r(x) <= P(x/2)q(x/2) + Q(x/2)p(x/2), provided p and q are decreasing above x/2, since shifting all the probability mass in q up to at least x/2 can only increase the value of the convolution, and the same for p. More generally, splitting the convolution at an arbitrary point t instead of x/2 gives r(x) <= P(t)q(x-t) + Q(x-t)p(t), provided p is decreasing above t and q is decreasing above x-t. There ought to be a related lower bound too. I think this ends up meaning that the tail of r approaches the heavier of the two tails, but I'd have to do a bit more work to prove that.
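For concreteness, here is the splitting argument behind the x/2 bound written out (assuming the two distributions are independent with densities p, q; the inequality uses that q is decreasing above x/2 in the first integral and that p is decreasing above x/2 in the second):

```math
\begin{aligned}
r(x) &= \int_{-\infty}^{\infty} p(t)\,q(x-t)\,dt
      = \int_{-\infty}^{x/2} p(t)\,q(x-t)\,dt + \int_{x/2}^{\infty} p(t)\,q(x-t)\,dt \\
     &\le q(x/2)\int_{-\infty}^{x/2} p(t)\,dt + p(x/2)\int_{x/2}^{\infty} q(x-t)\,dt
      = q(x/2)\,P(x/2) + p(x/2)\,Q(x/2).
\end{aligned}
```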
-
Considerations regarding the way Squiggle converts sample lists to point sets with Kernel Density Estimation, copied over from Slack:
The current benchmark doesn't vary the sample size at all: it always converts 10,000 samples to 1,000 points. I don't know how realistic this is; better information about likely use cases would surely be useful.
It appears the KDE library we're using is https://github.com/gyosh/pdfast. It's very simple, only a few hundred lines, which is explained by its use of a triangular kernel, so that aggregating contributions from the samples is easy. I had been expecting Gaussians and FFT-based convolution. Given how simple it is, I find it surprising that it takes longer than sorting (30% of benchmark time vs. 20%). I think its use of {x, y} objects slows it down relative to flat arrays, and it doesn't seem to take advantage of the input being sorted. It's also doing a fair amount of work to ensure that the total probability mass contributed by each input point (sample) is equal after truncation at the edges of the range; that may not be needed, but it could also be drastically sped up.
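For reference, here is a minimal sketch (not pdfast's code, and it skips the edge-truncation bookkeeping entirely) of the kind of implementation I'd expect to be fast: triangular kernel, flat Float64Arrays instead of {x, y} objects, and a sweep that exploits the samples already being sorted.

```typescript
// Hypothetical sketch: triangular-kernel KDE onto an evenly spaced grid.
// Assumes samples are sorted ascending and pointCount >= 2.
function triangularKde(
  samples: number[],
  pointCount: number,
  bandwidth: number // half-width of the triangular kernel
): { xs: Float64Array; ys: Float64Array } {
  const lo = samples[0] - bandwidth;
  const hi = samples[samples.length - 1] + bandwidth;
  const dx = (hi - lo) / (pointCount - 1);
  const xs = new Float64Array(pointCount);
  for (let i = 0; i < pointCount; i++) xs[i] = lo + i * dx;

  const ys = new Float64Array(pointCount);
  const scale = 1 / (samples.length * bandwidth); // each kernel integrates to 1
  let start = 0; // advances monotonically because the samples are sorted
  for (const s of samples) {
    while (start < pointCount && xs[start] < s - bandwidth) start++;
    for (let j = start; j < pointCount && xs[j] <= s + bandwidth; j++) {
      const u = Math.abs(xs[j] - s) / bandwidth;
      ys[j] += (1 - u) * scale; // triangular kernel K(u) = 1 - |u| on [-1, 1]
    }
  }
  return { xs, ys };
}
```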
There's no reason except maybe rounding error for the output to contain negative values; I don't get this at all. And the library does normalize so that the sum of all y values is 1, which should amount to the same thing as trapezoid-rule normalization (up to the constant spacing factor and the endpoint half-weights), since the x values are evenly spaced?
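To spell out the comparison: on an evenly spaced grid with spacing dx, the trapezoid-rule integral is dx · (sum of the ys minus half the two endpoint values), so rescaling to make that equal 1 differs from rescaling so the plain sum of ys is 1 only by the factor dx and the endpoint half-weights. A hypothetical helper, not the library's code:

```typescript
// Trapezoid-rule normalization on an evenly spaced grid (assumes ys.length >= 2).
function normalizeTrapezoid(ys: Float64Array, dx: number): Float64Array {
  let sum = 0;
  for (const y of ys) sum += y;
  const integral = dx * (sum - (ys[0] + ys[ys.length - 1]) / 2);
  return ys.map((y) => y / integral);
}
```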
On the other hand, KDE with a Gaussian kernel should produce better output and would be worth considering if there's a fast enough implementation. It's a good fit for polynomial interpolation: the accuracy of a cubic spline is based on the function's second derivative, and the Gaussian's second derivative is well behaved. The first derivative can also be obtained from a KDE whose kernel is the derivative of a Gaussian. Since the derivatives fall off very quickly beyond a fixed distance from a sample, it's possible to get global error bounds on the spline relative to a theoretical perfect KDE. These could be improved a lot by using adaptive x values that get denser where sample density is higher, although that makes later operations more difficult.
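A sketch of getting the first derivative along with the density (hypothetical function; direct O(n·m) evaluation for clarity, whereas a real implementation would cut the kernel off after a few bandwidths or use an FFT):

```typescript
// Hypothetical sketch: Gaussian KDE that also returns the analytic first
// derivative at each grid point, which could feed a cubic Hermite spline.
function gaussianKdeWithDerivative(
  samples: number[],
  xs: number[],
  h: number // bandwidth (standard deviation of the kernel)
): { ys: number[]; dys: number[] } {
  const norm = 1 / (samples.length * h * Math.sqrt(2 * Math.PI));
  const ys = xs.map(() => 0);
  const dys = xs.map(() => 0);
  xs.forEach((x, i) => {
    for (const s of samples) {
      const u = (x - s) / h;
      const k = norm * Math.exp(-0.5 * u * u);
      ys[i] += k;
      dys[i] += (-u / h) * k; // d/dx of the Gaussian kernel term
    }
  });
  return { ys, dys };
}
```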
Having more samples than output points means reducing the number of samples would be especially valuable. With inverse-CDF sampling, I'd be interested in quasi-random rather than purely random sampling, so that the probability of any given point being chosen doesn't change but the distance between adjacent points is more evenly distributed. Quasi-Monte Carlo methods seem to be an established technique for this.
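A sketch of what quasi-random inverse-CDF sampling could look like (hypothetical helper; a randomly shifted golden-ratio sequence is used here, but van der Corput or Sobol points would serve the same purpose). Each u is still uniformly distributed marginally, so the probability of any region being sampled is unchanged, but consecutive u values are spread far more evenly than i.i.d. uniforms.

```typescript
// Hypothetical sketch: low-discrepancy points in [0, 1) pushed through the
// distribution's inverse CDF, giving more evenly spaced samples.
function quasiRandomSamples(inverseCdf: (u: number) => number, n: number): number[] {
  const phi = (Math.sqrt(5) - 1) / 2; // golden-ratio increment
  const offset = Math.random();       // random shift keeps the marginals uniform
  const out: number[] = [];
  for (let i = 0; i < n; i++) {
    const u = (offset + i * phi) % 1;
    out.push(inverseCdf(u));
  }
  return out;
}
```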
Gaussian KDE plus cubics does have the problem of negative values, but I think this is only possible far away from samples where the probability is low. With adaptive x values it should be possible to handle the gaps specially and make sure there's a known minimum.