KDE implementation #1254
-
Responses to some of Ozzie's comments (out of original order):
I can't speak to whether other statisticians have approached similar use cases, but this does make a difference. For example, we know the exact probability density at each of our samples when we convert from a symbolic distribution, but there's no obvious way to carry this information forward through operations on distributions. Searching for properties that are easier to maintain sounds like a promising direction to me.
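To make the point concrete, here is a minimal sketch (the names `WeightedSample`, `normalPdf`, and `sampleNormalWithDensity` are made up for illustration, not Squiggle's API) of sampling from a symbolic normal while recording the exact pdf value at each draw; the open question above is what operations on distributions should do with those recorded values afterwards.

```typescript
// Hypothetical sketch: sample a symbolic Normal(mu, sigma) while recording
// the exact pdf value at each draw. Not Squiggle's API.
type WeightedSample = { x: number; pdf: number };

function normalPdf(x: number, mu: number, sigma: number): number {
  const z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

function sampleNormalWithDensity(mu: number, sigma: number, n: number): WeightedSample[] {
  const out: WeightedSample[] = [];
  for (let i = 0; i < n; i++) {
    // Box-Muller transform for a standard normal draw
    const u1 = 1 - Math.random(); // avoid log(0)
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    const x = mu + sigma * z;
    out.push({ x, pdf: normalPdf(x, mu, sigma) });
  }
  return out;
}
```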
Yes, a KDE with Gaussian kernels, for example, can only ever fall off like exp(-x²), which obviously doesn't work for some other tail like 1/x². I think this information is simply lost by the time we take samples, and a different approach would only make different assumptions about the tail. However, KDE in log-space is pretty interesting. The case I've thought about is doing the interpolation in log-space: after all, log(exp(-x²)) is just -x², which can be interpolated perfectly! For a sum of Gaussians the second derivative is no longer constant, but it stays bounded in regions where the probability density is not small (close to some sample) and only misbehaves in large gaps between samples, where we don't have good information anyway. However, the same absolute error in log-space corresponds to a much larger error in the density where the density is high, so I'm not sure whether this ends up working nicely in the end.
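As a minimal sketch of the mechanics (the function name and signature are made up), here is interpolation of a tabulated density done in log-space. I'm using piecewise-linear interpolation of log p just for brevity; the perfect-reconstruction point above is about fitting at least a quadratic or cubic to the log-density. A side benefit visible here is that the interpolated density can never go negative.

```typescript
// Hypothetical sketch: interpolate log(p) between grid points and exponentiate.
// Assumes xs is sorted ascending and all ps are strictly positive.
function logSpaceInterpolate(xs: number[], ps: number[], x: number): number {
  if (x <= xs[0]) return ps[0];                         // clamp outside the grid;
  if (x >= xs[xs.length - 1]) return ps[ps.length - 1]; // a real version would model the tail
  let i = 0;
  while (xs[i + 1] < x) i++;                            // linear scan; use binary search in practice
  const t = (x - xs[i]) / (xs[i + 1] - xs[i]);
  const logP = (1 - t) * Math.log(ps[i]) + t * Math.log(ps[i + 1]);
  return Math.exp(logP);                                // exp keeps the result positive by construction
}
```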
It certainly tries to do this here, and it seems to me this should be consistent with your normalization.
Agreed. I do think the work we're going to see in other languages is more along the lines of how to compute a KDE rather than how to use one.
-
Prior context on tails and interpolation:
Interpolation of point sets seems to me like it ought to be a significant problem. Or maybe an opportunity, if you're currently covering for bad interpolation by taking lots of points. I'd be very concerned about cubics, and to a lesser extent Bézier curves, because ordinary shapes could give you negative probabilities. That goes away if you interpolate log-probability, provided the computational cost can be handled. Using a CDF-like transformation when possible should also make polynomial interpolation more effective, by eliminating exponential dropoffs; if it's common for there to be strange shapes in the interior of the curve, this still wouldn't be enough. Another strategy is to make the point density adaptive to how well the curve interpolates in that region; it seems like this should be possible for many analytical curves.

More on tails: a symbolic limit approximation is nice because it should be possible to do a lot of operations with it. For example, if we have PDFs p(x) and q(x) with corresponding CDFs P and Q, the PDF r of the sum of the distributions satisfies r(x) <= P(x/2)q(x/2) + Q(x/2)p(x/2), provided p and q are decreasing above x/2, since shifting all the probability mass in q up to at least x/2 can only increase the value of the convolution, and the same for p. More generally, splitting the convolution at an arbitrary point t instead of x/2 gives r(x) <= P(t)q(x-t) + Q(x-t)p(t), provided p is decreasing above t and q is decreasing above x-t. There ought to be a related lower bound too. I think this ends up meaning that the tail of r approaches the heavier of the two tails, but I'd have to do a bit more work to prove that.
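For concreteness, here is the splitting argument behind the x/2 bound written out (assuming the two distributions are independent with densities p, q; the inequality uses that q is decreasing above x/2 in the first integral and that p is decreasing above x/2 in the second):

```math
\begin{aligned}
r(x) &= \int_{-\infty}^{\infty} p(t)\,q(x-t)\,dt
      = \int_{-\infty}^{x/2} p(t)\,q(x-t)\,dt + \int_{x/2}^{\infty} p(t)\,q(x-t)\,dt \\
     &\le q(x/2)\int_{-\infty}^{x/2} p(t)\,dt + p(x/2)\int_{x/2}^{\infty} q(x-t)\,dt
      = q(x/2)\,P(x/2) + p(x/2)\,Q(x/2).
\end{aligned}
```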
-
Considerations regarding the way Squiggle converts sample lists to point sets with Kernel Density Estimation, copied over from Slack:
The current benchmark doesn't vary the sample size at all: it always converts 10,000 samples to 1,000 points. I don't know how realistic this is; better information about likely use cases would surely be useful.
It appears the KDE library we're using is https://github.com/gyosh/pdfast. It's very simple, only a few hundred lines, which is explained by its use of a triangular kernel, so that aggregating contributions from the samples is easy. I had been expecting Gaussians and FFT-based convolution. Given how simple it is, I find it surprising that it takes longer than sorting (30% of benchmark time vs. 20%). I think its use of {x, y} objects slows it down relative to flat arrays, and it doesn't seem to take advantage of the input being sorted. It's also doing a fair amount of work to ensure that the total probability mass contributed by each input point (sample) is equal after truncation at the edges of the range; that may not be needed, but it could also be drastically sped up.
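For reference, here is a minimal sketch (not pdfast's code, and it skips the edge-truncation bookkeeping entirely) of the kind of implementation I'd expect to be fast: triangular kernel, flat Float64Arrays instead of {x, y} objects, and a sweep that exploits the samples already being sorted.

```typescript
// Hypothetical sketch: triangular-kernel KDE onto an evenly spaced grid.
// Assumes samples are sorted ascending and pointCount >= 2.
function triangularKde(
  samples: number[],
  pointCount: number,
  bandwidth: number // half-width of the triangular kernel
): { xs: Float64Array; ys: Float64Array } {
  const lo = samples[0] - bandwidth;
  const hi = samples[samples.length - 1] + bandwidth;
  const dx = (hi - lo) / (pointCount - 1);
  const xs = new Float64Array(pointCount);
  for (let i = 0; i < pointCount; i++) xs[i] = lo + i * dx;

  const ys = new Float64Array(pointCount);
  const scale = 1 / (samples.length * bandwidth); // each kernel integrates to 1
  let start = 0; // advances monotonically because the samples are sorted
  for (const s of samples) {
    while (start < pointCount && xs[start] < s - bandwidth) start++;
    for (let j = start; j < pointCount && xs[j] <= s + bandwidth; j++) {
      const u = Math.abs(xs[j] - s) / bandwidth;
      ys[j] += (1 - u) * scale; // triangular kernel K(u) = 1 - |u| on [-1, 1]
    }
  }
  return { xs, ys };
}
```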
There's no reason except maybe rounding error for the output to contain negative values; I don't get this at all. And the library does normalize so that the sum of all y values is 1, which should amount to the same thing as trapezoid-rule normalization (up to the constant spacing factor and the endpoint half-weights), since the x values are evenly spaced?
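To spell out the comparison: on an evenly spaced grid with spacing dx, the trapezoid-rule integral is dx · (sum of the ys minus half the two endpoint values), so rescaling to make that equal 1 differs from rescaling so the plain sum of ys is 1 only by the factor dx and the endpoint half-weights. A hypothetical helper, not the library's code:

```typescript
// Trapezoid-rule normalization on an evenly spaced grid (assumes ys.length >= 2).
function normalizeTrapezoid(ys: Float64Array, dx: number): Float64Array {
  let sum = 0;
  for (const y of ys) sum += y;
  const integral = dx * (sum - (ys[0] + ys[ys.length - 1]) / 2);
  return ys.map((y) => y / integral);
}
```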
On the other hand, KDE with a Gaussian kernel should produce better output and would be worth considering if there's a fast enough implementation. It's a good fit for polynomial interpolation: the accuracy of a cubic spline is based on the function's second derivative, and the Gaussian's second derivative is well behaved. The first derivative can also be obtained from a KDE whose kernel is the derivative of a Gaussian. Since the derivatives fall off very quickly beyond a fixed distance from a sample, it's possible to get global error bounds on the spline relative to a theoretical perfect KDE. These could be improved a lot by using adaptive x values that get denser where sample density is higher, although that makes later operations more difficult.
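A sketch of getting the first derivative along with the density (hypothetical function; direct O(n·m) evaluation for clarity, whereas a real implementation would cut the kernel off after a few bandwidths or use an FFT):

```typescript
// Hypothetical sketch: Gaussian KDE that also returns the analytic first
// derivative at each grid point, which could feed a cubic Hermite spline.
function gaussianKdeWithDerivative(
  samples: number[],
  xs: number[],
  h: number // bandwidth (standard deviation of the kernel)
): { ys: number[]; dys: number[] } {
  const norm = 1 / (samples.length * h * Math.sqrt(2 * Math.PI));
  const ys = xs.map(() => 0);
  const dys = xs.map(() => 0);
  xs.forEach((x, i) => {
    for (const s of samples) {
      const u = (x - s) / h;
      const k = norm * Math.exp(-0.5 * u * u);
      ys[i] += k;
      dys[i] += (-u / h) * k; // d/dx of the Gaussian kernel term
    }
  });
  return { ys, dys };
}
```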
Having more samples than output points means reducing the number of samples would be especially valuable. With inverse-CDF sampling, I'd be interested in quasi-random rather than purely random sampling, so that the probability of any given point being chosen doesn't change but the distance between adjacent points is more evenly distributed. Quasi-Monte Carlo methods seem to be an established technique for this.
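A sketch of what quasi-random inverse-CDF sampling could look like (hypothetical helper; a randomly shifted golden-ratio sequence is used here, but van der Corput or Sobol points would serve the same purpose). Each u is still uniformly distributed marginally, so the probability of any region being sampled is unchanged, but consecutive u values are spread far more evenly than i.i.d. uniforms.

```typescript
// Hypothetical sketch: low-discrepancy points in [0, 1) pushed through the
// distribution's inverse CDF, giving more evenly spaced samples.
function quasiRandomSamples(inverseCdf: (u: number) => number, n: number): number[] {
  const phi = (Math.sqrt(5) - 1) / 2; // golden-ratio increment
  const offset = Math.random();       // random shift keeps the marginals uniform
  const out: number[] = [];
  for (let i = 0; i < n; i++) {
    const u = (offset + i * phi) % 1;
    out.push(inverseCdf(u));
  }
  return out;
}
```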
Gaussian KDE plus cubics does have the problem of negative values, but I think this is only possible far away from samples where the probability is low. With adaptive x values it should be possible to handle the gaps specially and make sure there's a known minimum.