Draw random samples from arbitrary distributions #3747
Comments
For (1) to work well, you need a standard way for the user to provide those n-dimensional distribution functions. In
Just a note in case I ever find time to work on this.
Thinking of getting started on an implementation of the Olver & Townsend paper. Has anything changed here that I should know about before jumping in?
Tried a naive implementation of ITS using Chebyshev approximations. It's available in a repo here: https://github.com/peterewills/itsample. Note that this library also implements ITS using a CDF built from numerical quadrature. The results are disappointing - it's not very fast. Can't find the code in the Matlab library chebfun to compare against - maybe should start learning Julia so I can dig through that code. More comments on the speed can be found in the notebook comparing the two approaches: https://github.com/peterewills/itsample/blob/master/testing-chebyshev.ipynb. Would certainly appreciate feedback and insight, if you guys have any to spare.
It looks like I didn't pay attention here. One approximation, if evaluating a large number of pdf values is cheap, is to just use linear interpolation. I interpolated the cdf directly, IIRC, just to see how accurate linear interpolation is. That type of approach, linear interpolation of the ppf, is in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_histogram.rvs.html
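As a concrete illustration of that linear-interpolation approach (a sketch only, not scipy's actual implementation; the grid and target PDF below are arbitrary choices): tabulate the CDF on a grid and invert it with np.interp.

```python
import numpy as np

# Hypothetical target: a standard normal PDF tabulated on a finite grid.
grid = np.linspace(-8.0, 8.0, 2001)
pdf_vals = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)

# Build a CDF table by trapezoidal accumulation, then normalize.
cdf_vals = np.concatenate(([0.0], np.cumsum(
    0.5 * (pdf_vals[:-1] + pdf_vals[1:]) * np.diff(grid))))
cdf_vals /= cdf_vals[-1]

# Inverse transform sampling: a piecewise-linear ppf is just np.interp
# with the roles of the cdf table and the grid swapped.
rng = np.random.default_rng(0)
u = rng.random(100_000)
samples = np.interp(u, cdf_vals, grid)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

No root finding is involved; each draw is one binary search plus one linear interpolation.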
Haven't been able to try it yet, but thought I'd mention #8357 for the root finding. Does the notebook profile the code and reveal what parts are taking so much time?
Matt
@peterewills, @cdeil I didn't know about the paper that was referenced here, it's good to learn about.
#7257 is an attempt to add a distribution for x/y sampled data (rv_scatter), almost identical to rv_histogram, but for x/y data. It builds a piecewise linear interpolation spline from the supplied data to act as the PDF. Calculating the CDF from that is straightforward because the splines have an integral method. The PPF is straightforward as well; you don't want to use root finding because it's a lot slower. The PPF is analytically calculable for a linear interpolator: the integrated function between two points is a quadratic.

#6466 is an attempt to add a distribution for arbitrary functions. My ultimate aim was to randomly sample from the spectrum of a reactor moderator for Monte Carlo purposes. Perhaps my PRs can be resurrected somehow?
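A rough sketch of that analytic inversion (make_ppf is a hypothetical helper written for this comment, not code from either PR): on each segment of a piecewise-linear PDF the CDF is a quadratic in x, so the PPF only needs the quadratic formula.

```python
import numpy as np

def make_ppf(x, y):
    # Hypothetical helper: analytic PPF for a piecewise-linear PDF through
    # the points (x, y). On each segment the CDF is quadratic, so inverting
    # it only needs the quadratic formula -- no root finding.
    # Assumes x strictly increasing and y >= 0.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    areas = 0.5 * (y[:-1] + y[1:]) * np.diff(x)   # trapezoid masses
    total = areas.sum()
    y = y / total                                  # normalize the PDF
    cdf = np.concatenate(([0.0], np.cumsum(areas / total)))  # CDF at knots

    def ppf(q):
        q = np.asarray(q, dtype=float)
        i = np.clip(np.searchsorted(cdf, q, side="right") - 1, 0, len(x) - 2)
        y0 = y[i]
        slope = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
        r = q - cdf[i]                             # mass left inside segment
        # Solve 0.5*slope*t**2 + y0*t - r = 0 for t = x - x[i].
        with np.errstate(divide="ignore", invalid="ignore"):
            t = np.where(
                np.abs(slope) > 1e-14,
                (np.sqrt(y0**2 + 2.0 * slope * r) - y0) / slope,
                r / np.where(y0 > 0, y0, 1.0),     # flat segment: linear
            )
        return x[i] + t

    return ppf

# Usage: a triangular PDF on [0, 2] peaking at x = 1.
ppf = make_ppf([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
u = np.random.default_rng(1).random(10_000)
samples = ppf(u)
```

Every draw is vectorized and closed-form, which is the speed argument against CDF root finding made in this thread.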
@mdhaber - the code is profiled in the notebook, but since the functions I run are just wrappers on other root-finding code, it's not very informative. @josef-pkt @andyfaff - EDIT: Changed to make it clear that I don't actually know if linear interpolation is less accurate than Chebyshev; I just have a guess that it will be.
@peterewills Do you know if the Chebyshev approximation to the pdf is non-negative, i.e. is the cdf monotonic? AFAIK, that is not the case with polynomial approximations; I always found cases with a negative pdf.

I think the main speed advantage of linear interpolation compared to Olver & Townsend is that we have an explicit ppf approximation. For generating a large number of random variables, using root finding to invert the cdf is not efficient. One reason that I liked the histogram distribution is that all methods are consistent with the linear cdf/ppf interpolation, i.e. the pdf is piecewise constant if the cdf is piecewise linear. I didn't find any polynomial basis besides linear interpolation where we can approximate both the cdf and the ppf at the same time.

Another possible problem with polynomial basis functions like Chebyshev is overshooting. If the approximation uses a high-order polynomial, then there might be excess fluctuations and non-monotonicity, and the approximation between grid or sample points can be worse than linear. (I don't remember what I tried specifically for Chebyshev.)

Nevertheless, I think polynomial approximations to distributions can be useful in a more general way, not with the main focus on fast random number generation: they would provide a smooth version of a nonparametric distribution approximation. And the Olver & Townsend distribution has the pieces where all methods are consistent with each other.
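A quick way to see the negativity problem (an illustrative check written for this discussion, not code from the thread; the sharply peaked density and degree are arbitrary choices): interpolate a peaked density with a moderate-degree Chebyshev polynomial and look at its minimum.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Sharply peaked (unnormalized) density on [-1, 1] -- illustrative only.
def pdf(x):
    return np.exp(-50.0 * x**2)

coef = C.chebinterpolate(pdf, 10)      # degree-10 Chebyshev interpolant
xs = np.linspace(-1.0, 1.0, 2001)
vals = C.chebval(xs, coef)

# The interpolant oscillates around zero in the tails, so the "pdf" goes
# negative there and the integrated "cdf" is not monotonic.
cdf_vals = C.chebval(xs, C.chebint(coef))
print("min pdf value:", vals.min())
print("cdf monotonic?", (np.diff(cdf_vals) >= 0).all())
```

Raising the degree shrinks the negative lobes but, for a truncated polynomial, generally does not eliminate them.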
@josef-pkt I don't imagine that a slightly negative PDF would generate big problems computationally, assuming we use a sufficiently high-degree polynomial approximation so that it is not significantly negative. However, I haven't done any experiments on this - you may be right that it could be a problem.

I like the idea of using a piecewise constant PDF / piecewise linear CDF for the approximation, as we can then analytically generate the (piecewise linear) PPF as well. The question here is how to sample, but we could just use an equispaced grid as a default, and allow the user to define a custom grid if they like. I'm not sure how we'd apply this to non-finite support; however, I'm thinking we can just restrict ourselves to PDFs with finite support, because e.g. for a Gaussian you could just sample +/- 10 deviations from the mean and get a numerically indistinguishable result.

@andyfaff looking at your PRs, they look very close to what I'm thinking about, although I would go even simpler with a piecewise constant (rather than linear) PDF. Why did those stall out? It wasn't clear from reading them.
A piecewise constant PDF between points is akin to a histogram.
The polynomial approximation one uses is highly dependent on the system. Piecewise constant or piecewise linear is useful for use case 1. Higher-order interpolation may not be suitable because it could have ringing effects, or add too much detail. Non-piecewise polynomial basis sets may conversely not be able to add enough detail when there are sharp features in the PDF.

From my point of view, case 2 (the first PR) is the most flexible option. Technically it could be used to generate any of the distributions in stats. It can also be used for experimentally derived distributions by using piecewise interpolators, and it could handle any of the use cases you describe. Besides supplying a PDF function, the PR was designed so that you could also supply a CDF and PPF function.

The PRs stalled for a mixture of reasons. I was keen to get something included in scipy, but they seemed to receive lukewarm support.
@andyfaff okay, sorry I was being thick before. So I guess what I'm really wanting here is something akin to #6466 but which had a
I suppose what I'm suggesting now is that we just sample the provided PDF to generate a histogram, then use the same technique for sampling a histogram that is used in rv_histogram. Not sure if this is something that would be useful to have included in scipy, given that it's such a rough, approximate approach. I'm new to the whole world of OSS, so curious to hear more about how experienced folks think about this.
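For what it's worth, that histogram route can be sketched in a few lines with the existing scipy.stats.rv_histogram (the target PDF and grid below are arbitrary placeholders):

```python
import numpy as np
from scipy import stats

# Hypothetical target: an unnormalized Gaussian shape (rv_histogram
# renormalizes internally, so the constant doesn't matter).
def pdf(x):
    return np.exp(-0.5 * x**2)

# Sample the PDF on an equispaced grid and treat the values as bin heights.
edges = np.linspace(-8.0, 8.0, 513)
centers = 0.5 * (edges[:-1] + edges[1:])
dist = stats.rv_histogram((pdf(centers), edges))

samples = dist.rvs(size=100_000, random_state=42)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

Equal-width bins keep the normalization unambiguous; the approximation error is just the bin-width smearing of the piecewise constant PDF.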
(I wrote this this morning but wasn't sure if I should post it)
That's what the histogram distribution already does, IIRC. I mainly did density estimation with orthogonal polynomials, so I don't know a lot about getting the cdf and similar in an efficient way. I like the idea of some nonparametric arbitrary distribution, but I'm not sure what the requirements are. Something polynomial like Chebyshev would be good to have as a full distribution.

@andyfaff AFAIR, the main reason the arbitrary rv stalled was that it didn't add much to what the generic distribution framework provides, i.e. users can create a new distribution just by specifying the pdf, but the rest will be slow generic integration and root finding.
One thing I had wanted to add on the non-negativity constraints: in some nonparametric or semiparametric approaches in econometrics they use the exponential to impose nonnegativity, i.e. pdf = exp(polynomial_series_expansion). The disadvantage is that it loses the nice features of using orthogonal polynomials, but the main point for me was that it was more difficult to
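A minimal sketch of that exp-series idea (the coefficient vector is made up purely for illustration); the price of guaranteed positivity is that the normalizing constant must be computed numerically:

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from scipy.integrate import quad

# Hypothetical log-density: an arbitrary low-order Chebyshev series for
# log pdf(x) on [-1, 1] (coefficients invented for this example).
log_coef = np.array([-0.5, 0.3, -1.2, 0.1])

def unnorm(x):
    return np.exp(C.chebval(x, log_coef))   # positive by construction

Z, _ = quad(unnorm, -1.0, 1.0)              # normalizing constant, numerically

def pdf(x):
    return unnorm(x) / Z

xs = np.linspace(-1.0, 1.0, 1001)
print("all positive?", (pdf(xs) > 0.0).all())
```

Unlike a plain polynomial series, this pdf can never go negative, but the cdf and ppf no longer have closed forms and need quadrature or root finding.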
Cross-ref #8293, which introduces a method for the 1d case.
Linking to #13343. It was decided that this should go in |
This is a feature request asking for a simple-to-use method to draw random samples from arbitrary n-dimensional distribution functions (could be given by an analytical function or some density estimation).
One possibility would be to implement something like what is available in ROOT (see the Python wrapper function or the implementation), which, if I understand correctly, basically does this:
Would something like this be welcome as an addition in scipy? (I need this in a Python package where I don't want ROOT as a dependency.)
Or maybe someone has the skills / time to implement better methods such as the ones in UNURAN?
cc @josef-pkt @ndawe