Draw random samples from arbitrary distributions #3747

Open
cdeil opened this Issue Jun 19, 2014 · 15 comments

@cdeil
Contributor

cdeil commented Jun 19, 2014

This is a feature request asking for a simple-to-use method to draw random samples from arbitrary n-dimensional distribution functions (given either by an analytical function or by some density estimate).

One possibility would be to implement something like what is available in ROOT (see Python wrapper function or implementation), which, if I understand correctly, basically does this:

  1. Create an array where the bin value represents the integral of the density distribution over the bin.
  2. Use inverse transform sampling to generate random bin indices using the cumulative flattened array.
  3. Re-distribute events in each bin to get a smooth distribution.
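The three steps above can be sketched in Python (a hypothetical `sample_binned` helper, not ROOT's actual implementation; the midpoint rule stands in for a proper per-bin integral):

```python
import numpy as np

def sample_binned(pdf, edges, size, rng=None):
    """Hypothetical sketch of the ROOT-style binned sampling scheme."""
    rng = np.random.default_rng() if rng is None else rng
    centers = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    # 1. bin value approximates the integral of the density over the bin
    bin_mass = pdf(centers) * widths
    # 2. inverse transform sampling on the cumulative (flattened) array
    cum = np.cumsum(bin_mass)
    cum /= cum[-1]
    idx = np.searchsorted(cum, rng.random(size))
    # 3. redistribute events uniformly within each bin for a smooth result
    return edges[idx] + widths[idx] * rng.random(size)

# example: an (unnormalized) standard normal density on a finite grid
samples = sample_binned(lambda x: np.exp(-x**2 / 2), np.linspace(-5, 5, 201), 10_000)
```

For an n-dimensional density the same trick applies after flattening the n-dimensional bin array, which is why the cumulative array above is described as "flattened".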

Would something like this be welcome as an addition in scipy?
(I need this in a Python package where I don't want ROOT as a dependency.)

Or maybe someone has the skills / time to implement better methods such as the ones in UNURAN?

cc @josef-pkt @ndawe

@rgommers

Member

rgommers commented Aug 26, 2014

For (1) to work well, you need a standard way for the user to provide those n-dimensional distribution functions. In stats there are only a couple of those, and I think you want to sample other distributions, right? What do they look like?

@cdeil

Contributor

cdeil commented Sep 2, 2016

Just a note in case I ever find time to work on this.
This looks very good, and it's always easier to translate working code / tests to another language than to start from the paper:
https://github.com/ApproxFun/ApproxFun.jl#sampling
https://arxiv.org/abs/1307.1223

@peterewills

peterewills commented Jun 9, 2018

Thinking of getting started on an implementation of the Olver & Townsend paper. Has anything changed here that I should know about before jumping in?

@peterewills

peterewills commented Jun 10, 2018

Tried a naive implementation of inverse transform sampling (ITS) using Chebyshev approximations. It's available in a repo here. Note that this library also implements ITS using a CDF built from numerical quadrature.

The results are disappointing - it's not very fast. I can't find the code in the MATLAB library chebfun to compare against - maybe I should start learning Julia so I can dig through the ApproxFun code. More comments on the speed can be found in the notebook comparing the two approaches. Would certainly appreciate feedback and insight, if you guys have any to spare.

@josef-pkt

Member

josef-pkt commented Jun 10, 2018

It looks like I didn't pay attention here.

As an approximation, if evaluating a large number of pdf values is cheap, you can just use linear interpolation:
http://jpktd.blogspot.com/2012/12/visual-inspection-of-random-numbers.html
I had also tried pchip interpolation, but it was much slower.

I interpolated the cdf directly, IIRC, just to see how accurate linear interpolation is.
If the cdf is not available analytically, then a numerical integration is still needed to get the cdf from the pdf at the grid points. (IIRC I used trapz for fast integration on a grid.)

That type of approach, linear interpolation of the ppf, is in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_histogram.rvs.html
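The linear-interpolation approach can be sketched as follows (illustrative code, not the blog post's actual implementation): tabulate the cdf with the trapezoid rule, then the ppf is just np.interp with the axes swapped, so no root finding is needed per sample.

```python
import numpy as np

# tabulate the cdf on a grid via the trapezoid rule on pdf values
grid = np.linspace(-6, 6, 1001)
pdf_vals = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # example: standard normal
cdf_vals = np.concatenate(
    ([0.0], np.cumsum(np.diff(grid) * (pdf_vals[:-1] + pdf_vals[1:]) / 2))
)
cdf_vals /= cdf_vals[-1]

rng = np.random.default_rng(0)
u = rng.random(10_000)
# explicit ppf: linear interpolation of the cdf with the axes swapped
samples = np.interp(u, cdf_vals, grid)
```

The key point is that the inverse is available in closed form for a piecewise-linear cdf, so generating each sample is a single vectorized interpolation.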

@mdhaber

Contributor

mdhaber commented Jun 10, 2018

@andyfaff

Contributor

andyfaff commented Jun 10, 2018

@peterewills, @cdeil I didn't know about the paper that was referenced here, it's good to learn about.

rv_histogram was added to stats a while back, which handles the histogram version of experimentally acquired data. I have two stalled PRs which are relevant to this, #6466 and #7257.

#7257 is an attempt to add a distribution for x/y sampled data (rv_scatter), almost identical to rv_histogram, but for x/y data. It builds a piecewise linear interpolating spline from the supplied data to act as the PDF. It's straightforward to calculate the CDF from that because the splines have an integral method. The PPF is straightforward as well; you don't want to use root finding there because it's a lot slower. The PPF is analytically calculable for a linear interpolator: the integrated function between two points is a quadratic.
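The analytic PPF for a piecewise-linear PDF can be sketched as follows (a hypothetical helper for illustration, not the code from the PR): within each segment the cdf is a quadratic in the offset from the left knot, so inverting it is just the quadratic formula.

```python
import numpy as np

def ppf_piecewise_linear(q, x, p):
    """Hypothetical sketch: analytic ppf for a pdf given by linear
    interpolation of (x, p) knots."""
    # segment masses from trapezoid areas; normalize the pdf
    seg = np.diff(x) * (p[:-1] + p[1:]) / 2
    total = seg.sum()
    p = p / total
    seg = seg / total
    cdf = np.concatenate(([0.0], np.cumsum(seg)))
    i = np.clip(np.searchsorted(cdf, q, side="right") - 1, 0, len(seg) - 1)
    # within segment i: cdf(x[i] + t) - cdf[i] = p[i]*t + 0.5*slope*t**2
    slope = (p[i + 1] - p[i]) / (x[i + 1] - x[i])
    a, b, c = 0.5 * slope, p[i], cdf[i] - q
    with np.errstate(divide="ignore", invalid="ignore"):
        # quadratic formula, falling back to the linear solution when a ~ 0
        t = np.where(np.abs(a) > 1e-12,
                     (-b + np.sqrt(b**2 - 4 * a * c)) / (2 * a),
                     -c / b)
    return x[i] + t

# triangular pdf on [0, 2] with peak at 1: the median is exactly 1
x_knots = np.array([0.0, 1.0, 2.0])
p_knots = np.array([0.0, 1.0, 0.0])
median = ppf_piecewise_linear(np.array([0.5]), x_knots, p_knots)  # → array([1.0])
```

Since the pdf is non-negative, the per-segment quadratic is monotone on the segment, so the `-b + sqrt(...)` root is always the right branch.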

#6466 is an attempt to add a distribution for arbitrary functions. rv_arbitrary would've accepted callables for PDF/CDF/PPF. The two PRs are related: if rv_scatter didn't exist, then rv_arbitrary could be used instead by constructing the piecewise linear interpolator yourself and supplying that as the PDF to rv_arbitrary.

My ultimate aim was to randomly sample from the spectrum of a reactor moderator for Monte Carlo purposes. Perhaps my PRs can be resurrected somehow?

@peterewills

peterewills commented Jun 11, 2018

@mdhaber - the code is profiled in the notebook, but since the functions I run are just wrappers around other root-finding code, it's not informative

@josef-pkt @andyfaff - rv_histogram looks like a similar problem. I think the carryover is that we could form some kind of spline/linear interpolation estimate of the CDF and use this in the root-finding. I have an idea that using Chebyshev approximation will give more accuracy than linear approximation on an equispaced grid, but perhaps I'm mistaken about this. I suppose what we'd really want is an adaptive grid that includes more points in places where the CDF has a high derivative, but I haven't looked into that problem yet. Anyways, I agree that function calls to a linear interpolant will be much faster, so maybe this is the way to go, although it might sacrifice some accuracy.

EDIT: Changed to make it clear that I don't actually know if linear interpolation is less accurate than Chebyshev, I just have a guess that it will be.
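The Chebyshev variant being discussed can be sketched as follows (assumed setup, not the notebook's actual code): approximate the pdf with a Chebyshev series, integrate it term-by-term to get a cdf that is also a Chebyshev series, then invert by root finding; the per-sample brentq call is exactly the slow part.

```python
import numpy as np
from numpy.polynomial import Chebyshev
from scipy.optimize import brentq

a, b = -6.0, 6.0
pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # example pdf

cheb_pdf = Chebyshev.interpolate(pdf, deg=60, domain=[a, b])
cheb_cdf = cheb_pdf.integ(lbnd=a)   # antiderivative is also a Chebyshev series
cheb_cdf /= cheb_cdf(b)             # normalize so cdf(b) == 1

rng = np.random.default_rng(0)
u = rng.random(1000)
# invert the cdf by root finding -- accurate but one brentq call per sample
samples = np.array([brentq(lambda x, ui=ui: cheb_cdf(x) - ui, a, b) for ui in u])
```

Evaluating the Chebyshev cdf is cheap; it is the sample-by-sample inversion loop that dominates, which is why an explicit ppf approximation (as with linear interpolation) wins on speed.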

@josef-pkt

Member

josef-pkt commented Jun 11, 2018

@peterewills Do you know whether the Chebyshev approximation to the pdf is non-negative, i.e. whether the cdf is monotonic?

AFAIK, that is not the case with polynomial approximations; I always found cases with a negative pdf.

I think the main speed advantage of linear interpolation compared to Olver-Townsend is that we have an explicit ppf approximation. For generating a large number of random variables, using root finding to invert the cdf is not efficient, I think.

One reason that I liked the histogram distribution is that all methods are consistent with the linear cdf/ppf interpolation, i.e. the pdf is piecewise constant if the cdf is piecewise linear.

I didn't find any polynomial basis besides linear interpolation where we can approximate both the cdf and the ppf at the same time.

Another possible problem with polynomial basis functions like Chebyshev is the possibility of overshooting. If the approximation uses a high-order polynomial, then there might be excess fluctuations and non-monotonicity, and the approximation between grid or sample points can be worse than linear. (I don't remember what I tried specifically for Chebyshev.)

Nevertheless, I think polynomial approximation to distributions can be useful in a more general way, and not only with the main focus on fast random number generation. That is, it would provide a smooth version for nonparametric distribution approximation. And the Olver-Townsend approach has the pieces where all methods are consistent with each other.
(Personal: I like the part about the integral also being a polynomial derived from the underlying pdf. AFAIR, that was one piece missing in my experiments many years ago.)

@peterewills

peterewills commented Jun 11, 2018

@josef-pkt I don't imagine that a slightly negative PDF would generate big problems computationally, assuming we use a sufficiently high-degree polynomial approximation so that it is not significantly negative. However, I haven't done any experiments on this - you may be right that it could be a problem.

I like the idea of using a piecewise constant PDF / piecewise linear CDF for the approximation, as we can then analytically generate the (piecewise linear) PPF as well. The question is where to sample the PDF, but we could just use an equispaced grid as a default and allow the user to define a custom grid if they like.

I'm not sure how we'd apply this to non-finite support; however, I'm thinking we can just restrict ourselves to PDFs with finite support, because e.g. for a Gaussian you could just sample within ±10 standard deviations of the mean and get a numerically indistinguishable result.

@andyfaff looking at your PRs, they look very close to what I'm thinking about, although I would go even simpler with a piecewise constant (rather than linear) PDF. Why did those stall out? It wasn't clear from reading them.

@andyfaff

Contributor

andyfaff commented Jun 11, 2018

A piecewise constant PDF between points is akin to stats.rv_histogram; you would just need to add an extra bin at the end.
I wrote the two PRs with a view to two main use cases:

  1. Sampling from an experimentally measured distribution (in my case, an energy spectrum emanating from a reactor moderator).
  2. Creating arbitrary stats distributions from a function. For example, you could create a quadratic distribution from lambda x: x**2. This option could also be used for use case 1 if the function is, for example, a wrapped interpolating spline.
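Use case 2 can already be approximated with scipy's generic distribution framework by subclassing rv_continuous; a minimal sketch of the quadratic example (class name is hypothetical, and the generic machinery is slow):

```python
import numpy as np
from scipy import stats

class Quadratic(stats.rv_continuous):
    """Hypothetical example: pdf proportional to x**2 on [0, 1]."""
    def _pdf(self, x):
        # normalizing constant: integral of x**2 on [0, 1] is 1/3
        return 3 * x**2

dist = Quadratic(a=0, b=1, name="quadratic")
# rvs falls back to generic numerical cdf inversion, so this is slow
samples = dist.rvs(size=5, random_state=0)
```

This gives a fully consistent distribution object (pdf, cdf, ppf, rvs) for free, but every ppf evaluation does numerical integration plus root finding, which is the performance problem the PRs tried to address.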

The polynomial approximation one uses is highly dependent on the system. Piecewise constant or piecewise linear are useful for use case 1. Higher-order interpolation may not be suitable because it can introduce ringing effects or add spurious detail. Non-piecewise polynomial basis sets may also be unable to capture sharp features in the PDF.

From my point of view case 2 (the first PR) is the most flexible option. Technically it could be used to generate any of the distributions in stats. It can also be used for experimentally derived distributions by using piecewise interpolators, and it could handle any of the use cases you describe. Besides supplying a PDF function, the PR was designed so that you could supply a CDF and PPF function.

The PRs stalled for a mixture of reasons. I was keen to get something included in scipy, but they seemed to receive lukewarm support.

@peterewills

peterewills commented Jun 12, 2018

@andyfaff okay, sorry, I was being thick before. So I guess what I'm really wanting here is something akin to #6466, but with a .rvs() method available once the PDF (or CDF) is supplied, which would require generating an approximate PPF. So we'd want to be able to do something like

pdf = lambda x: x
dist = rv_arbitrary(pdf=pdf, support=[0, 1])  # can specify the support, or leave it empty for infinite support
data = dist.rvs(size=10000)

I suppose what I'm suggesting now is that we just sample the provided PDF to generate a histogram, then use the same technique for sampling a histogram that is used in rv_histogram.rvs. This approach is pretty crude, but it can generate a PPF that is very quick to evaluate, unlike numerical quadrature (or polynomial approximation) of the provided PDF.
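That proposal can be sketched in a few lines using the existing rv_histogram (illustrative grid and pdf; rv_histogram then supplies the fast piecewise-linear cdf/ppf machinery):

```python
import numpy as np
from scipy import stats

# sample the user-provided pdf on a grid to build a histogram ...
pdf = lambda x: np.exp(-x**2 / 2)          # unnormalized is fine
edges = np.linspace(-6, 6, 501)
centers = 0.5 * (edges[:-1] + edges[1:])

# ... then reuse rv_histogram, which normalizes and builds a
# piecewise-linear cdf with an explicit (fast) ppf
dist = stats.rv_histogram((pdf(centers), edges))
data = dist.rvs(size=10_000, random_state=0)
```

The approximation is crude (piecewise constant pdf), but after the one-time setup every call to dist.ppf or dist.rvs is a cheap interpolation rather than quadrature plus root finding.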

Not sure if this is something that would be useful to have included in scipy, given that it's such a rough approximate approach. I'm new to the whole world of OSS, so I'm curious to hear how experienced folks think about this.

@josef-pkt

Member

josef-pkt commented Jun 12, 2018

(I wrote this this morning but wasn't sure if I should post it)

I like the idea of using piecewise constant PDF piecewise linear CDF for the approximation, as we can then analytically generate the (piecewise linear) PPF as well.

that's what the histogram distribution already does, IIRC

I mainly did density estimation with orthogonal polynomials, so I don't know a lot about getting cdf and similar in an efficient way.
For distributions with unbounded support I used Hermite polynomials. But that's about 6 years ago.
I didn't even remember I did this (and I don't know where that code is):
http://jpktd.blogspot.com/2012/01/monotone-hermite-polynomials.html
e.g. still in the statsmodels sandbox is a class that has mainly fit and pdf-evaluation methods
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/nonparametric/densityorthopoly.py#L278


I like the idea of some nonparametric arbitrary distribution, but I'm not sure what the requirements are.
rv_histogram is the simplest full-distribution case that can be used as an approximation to any distribution.

Some polynomial basis like Chebyshev would be good to have as a full distribution.
For the other cases it sounds more like we need helper functions to create histogram distributions, e.g. to pick the grid that gives a good linear approximation to a user-provided function.

@andyfaff AFAIR, the main reason the arbitrary rv stalled was that it didn't add much to what the generic distribution framework provides, i.e. users can already create a new distribution just by specifying the pdf, but the rest will be slow generic integration and root finding.
If there is a clean, consistent low-order (>1) spline or polynomial version, then it sounds good to include. But it looks like we haven't found an alternative to linear interpolation and rv_histogram if the objective is fast random number generation.

@josef-pkt

Member

josef-pkt commented Jun 12, 2018

One thing I wanted to add on the non-negativity constraint:

In some nonparametric or semiparametric approaches in econometrics they use the exponential to impose non-negativity, i.e.

pdf = exp(polynomial_series_expansion)
i.e. the log-pdf is the orthogonal polynomial expansion.

The disadvantage is that it loses the nice features of using orthogonal polynomials, but the main point for me was that it was more difficult to fit to data, and AFAIR I never checked the implications for the other methods of a distribution.
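A minimal sketch of the exp-of-polynomial idea (coefficients here are purely illustrative; the normalizing constant must be computed numerically, which is part of the cost of giving up orthogonality):

```python
import numpy as np
from scipy.integrate import quad

# log-pdf is a polynomial, so the pdf is non-negative by construction;
# the leading coefficient must be negative so the density is integrable
log_pdf_poly = np.polynomial.Polynomial([0.0, 0.3, -0.5])  # c0 + c1*x + c2*x**2

unnorm = lambda x: np.exp(log_pdf_poly(x))
norm, _ = quad(unnorm, -10, 10)   # numerical normalizing constant
pdf = lambda x: unnorm(x) / norm
```

Unlike a direct orthogonal-series density, the normalizing constant is no longer a simple function of the coefficients, so it (and the cdf) needs quadrature.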

@chrisb83

Member

chrisb83 commented Sep 2, 2018

Cross-ref #8293, which introduces a method for the 1d case.
