[Question] Desire for qMC library in scipy? #9695
Yes!
Plus there is already Sobol' QMC in
There definitely is enough interest; a PR would be very welcome, @Balandat.
A separate top-level module does not seem warranted. This seems to fit well in
Sorry, I have been swamped with stuff; hopefully I can get to this soon. FWIW, we upstreamed a modified implementation into PyTorch: https://pytorch.org/docs/stable/torch.html#quasi-random-sampling
@rgommers Is there any interest in other QMC methods like Halton (which is slightly better in dim <~ 10)? I put a version in statsmodels. I could also propose some LHS.
In principle yes, I think, but we try hard not to duplicate things between If we're going to have multiple, either in SciPy or in statsmodels, it would be good if they have a uniform API as much as possible. The I think a self-contained module would be better than putting them into a "design of experiments" (sub)module. Other than that, no strong preference between SciPy and statsmodels. Cc @josef-pkt, @bashtage
IMO both of these are better fits for SciPy than statsmodels. I think those functions can be treated as private, and so could be quietly and quickly deprecated (with a pointer to scipy.MOD) if SciPy goes forward with these.
FWIW, there is a Sobol sequence generator wrapped in Python, with support for 21201 dimensions, in pyscenarios. It uses Numba and so is fast.
As for a design, it seems like it would make sense to split the transformations from the generation. Transformation is a generic problem that can take any form of quasi-random-like generators.
@bashtage I agree with you that these methods are more general than "just" statistically relevant. For lots of optimization/integration problems we want to have better number generators. But I am not sure about keeping these private. I would want to have an alternative to
@tupui The "private" was only relevant to what is already in statsmodels, not to what may go into SciPy. By private I mean it doesn't go into the docs; then, if this module goes into SciPy, the version in statsmodels can be removed and users redirected to SciPy (which statsmodels knows the user already uses).
I agree with this. What do you think @rgommers about moving the PR and the existing halton and lhs here? |
I think that makes sense. Putting it all together in one PR so it can be reviewed together may make sense.
Agreed. Inspired by / borrowed from the new
@rgommers super. As for the infrastructure, seems
Sounds good! It's fine to keep things in Python for now. We've got to get the API right first; improving the performance with Cython can come later (or even be kept as a future todo after merging).
@rgommers Hum, not sure about the naming. Maybe Anyway, I created a first PR and propose to move the discussion there 😃
IIRC, only transformation methods work reasonably for deriving non-uniform variates from QMC uniforms. Most of the distribution algorithms in
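To make the point above concrete, here is a minimal sketch (plain NumPy/SciPy, not any API proposed in this thread) of inverse-transform sampling on a low-discrepancy 1-D point set: push evenly spread uniforms through the normal inverse CDF and use the result for a QMC estimate.

```python
import numpy as np
from scipy.stats import norm

n = 1000
u = (np.arange(n) + 0.5) / n   # evenly spread uniforms in (0, 1)
z = norm.ppf(u)                # inverse transform -> standard normal variates
est = np.mean(z ** 2)          # QMC estimate of E[Z^2] = 1
```

Because the inverse CDF is monotone, the low-discrepancy structure of the uniforms carries over to the transformed sample, which is why this transformation plays well with QMC.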
I was thinking that it could be useful for a transformation to accept a BitGenerator. For example, if the transformations included a multivariate Student's t copula, then one could generate MV Student's RVs using either a SobolGenerator or a BitGenerator.
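As an illustration of that idea (the function name here is made up, and this is not any existing scipy API), a transformation can be written against a generic source of uniforms, so a qMC engine or a NumPy `Generator` can be swapped in freely:

```python
import numpy as np
from scipy.special import ndtri  # inverse of the standard normal CDF

def mv_normal_from_uniforms(u, mean, cov):
    """Map uniforms in (0, 1)^d to correlated normal draws.

    Works for any uniform source: pseudo-random or quasi-random.
    """
    z = ndtri(u)                    # independent standard normals
    chol = np.linalg.cholesky(cov)  # correlate via the Cholesky factor
    return mean + z @ chol.T

# Here the uniform source is a PRNG; a qMC engine producing an
# (n, d) array of uniforms could be used identically.
rng = np.random.default_rng(0)
u = 0.999 * rng.random((2000, 2)) + 5e-4  # keep strictly inside (0, 1)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
x = mv_normal_from_uniforms(u, np.zeros(2), cov)
```

The same separation would work for a Student's t copula: only the `ndtri` step changes, while the uniform source stays pluggable.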
@tupui I prefer to keep the discussion here (on the high-level goals and design at least) to not split it over two places.
Design of experiments as a topic we'd definitely want to leave with statsmodels. Here we want the qMC samplers, not the DOE-specific parts. Latin hypercube sampling could perhaps be included; it's "near-random" rather than quasi-random, but it is also widely used in Monte Carlo sampling and is not specific to DOE. Also note that
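For reference, LHS itself is only a few lines; a minimal NumPy sketch (illustrative only, not the implementation discussed in this thread):

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    """Draw n points in [0, 1)^d with exactly one point per axis stratum."""
    rng = np.random.default_rng(seed)
    # One jittered point inside each of the n equal-width strata ...
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # ... then decouple the dimensions by permuting each column independently.
    for j in range(d):
        u[:, j] = rng.permutation(u[:, j])
    return u

sample = latin_hypercube(5, 2, seed=123)
```

Projecting the sample onto any single axis gives exactly one point per stratum, which is the "near-random" space-filling property mentioned above.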
No problem.
I guess it's a matter of terminology. I come from the Uncertainty Quantification field, where all these methods fall under DoE. I will change it to
Fully agree on that.
gh-10844 proposes five new functions:
I'm not sure about the Does anyone have thoughts on whether the above functions are the right scope?
The starting point of this was the Cython Sobol implementation I have - maybe we should add that to
Yes, definitely!
Thanks @tupui, that's good to know. Would you in principle be happy to help maintain it after you're back? (I know plans can change; that's why "in principle".) The reason I'm asking is that it's a lot easier to add significant new functionality, a whole submodule in this case, if the original authors or other domain experts can be asked in case of bugs or design decisions. For people who already understand the algorithms, that can often be a small effort, while if one of the SciPy maintainers needs to do it, they may first have to study the algorithmic details. Same question to @Balandat. I'm happy to help review and make the code fit nicely into the SciPy API and code base, but I likely won't become a content expert here ....
@rgommers Yes, of course I would be happy to help 😄 I am working as a Python dev now, and for good hopefully ;)
I'm happy to be the PoC for the Sobol generator. I can add this as a commit to #10844 soonish - I might need some help with the Cython setup though; not sure if SciPy does something particular there.
May be useful to put it all together indeed. @tupui perhaps you can add @Balandat as a collaborator on your fork of SciPy, so he can push commits to gh-10844?
The build system should autodetect
Great, I hope to get to this later this week.
@Balandat Hey! Did you get some time to work on this?
Unfortunately not. I am pretty swamped for the next couple of weeks, but after that I should be able to find some time.
No worries, just wanted to give you a heads-up 😃
Actually, I banged this out just now: 65b8552. Still a little rough around the edges and needing some cleanup, but most functionality is there and it tests fine. Note that the setup uses a
Nice, you rock! 👍
I will have a look.
Now that we have all the methods in here 🎊, I guess we should fix a common API. Currently, you either call a function like Personally, I tend more toward the first option as it's more immediate from a user perspective. But I understand the control that one could want with the class approach. I propose we have both a function and a class for this; the class would be more private. Then, the question of the bounds: shall all methods just work in the [0, 1) space as with Sobol'? Also, what about a helper function like we have in
So conceptually there is an important distinction with qMC methods between the functional API and the object-based API: for iid random draws, you can just continue to call the functional API and append the samples, and you get a set of iid samples. For the space-filling sequences this is not the case, so only having a functional API wouldn't allow you to start mid-sequence; you'd always have to draw from the start. We can provide that API, and it would be useful, but we should think about the "resumability" of the sequence for the qMC case, and whether we want to expose that to users (we probably should, in some way).
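A toy numeric illustration of that distinction, using the 1-D additive-recurrence (Kronecker) sequence because it fits in one line; the `start` argument plays the role of the engine's internal position (all names here are made up for the example):

```python
import numpy as np

GOLDEN = (np.sqrt(5) - 1) / 2  # a common irrational for additive recurrence

def kronecker(n, start=0, alpha=GOLDEN):
    """Points start+1 .. start+n of the sequence {i * alpha mod 1}."""
    i = np.arange(start + 1, start + n + 1)
    return (i * alpha) % 1.0

a = kronecker(4)                 # functional call: points 1..4
b = kronecker(4)                 # calling again restarts: points 1..4 again
resumed = kronecker(4, start=4)  # engine-style resume: points 5..8
```

With iid draws, concatenating `a` and `b` would be fine; here it duplicates points, whereas concatenating `a` and `resumed` reproduces the first 8 points of the sequence. An object-based engine hides `start` as internal state.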
@Balandat OK, then I will also use an object-based API. This way we would have:
I would remove the "engine" in the name if you don't mind. Then we could have a base class which would require implementing:
I would also move NormalQMCEngine and MultivariateNormalQMCEngine into the same qmc file. And I would add an option to select another base engine. For the bounds, what is your opinion? Shall we just leave it to the user? I am neutral here. Last but not least, I would add a final class which would serve as a higher-level interface:
Let me know what you think about these, and I can work on the modifications 😃.
@Balandat If the following is fine for you, I will do this for all methods. For now, I left scale in the base class; not sure about this. Maybe outside, like discrepancy, would be better. Or do we also put discrepancy inside?

import numpy as np
from abc import ABC, abstractmethod
from scipy._lib._util import check_random_state


class QMCEngine(ABC):
    """Quasi-Monte Carlo engine sampler.

    Samples are distributed over the half-open interval [0, 1).

    Examples
    --------
    Generate samples from a low-discrepancy sequence of Halton.

    >>> import numpy as np
    >>> from scipy.stats import qmc
    >>> sampler = qmc.Halton(k_vars=2)
    >>> sample = sampler.random(n_samples=5)

    Compute the quality of the sample using the discrepancy criterion.

    >>> uniformity = qmc.discrepancy(sample)

    If someone wants to continue an existing design, extra points can be
    obtained.

    >>> sampler.fast_forward(5)
    >>> sample_continued = sampler.random(n_samples=5)

    Finally, samples can be scaled to bounds.

    >>> bounds = np.array([[0, 2], [10, 5]])
    >>> sample_continued_scaled = sampler.scale(sample_continued, bounds)

    """

    @abstractmethod
    def __init__(self, k_vars, seed=None):
        """
        Parameters
        ----------
        k_vars : int
            Dimension of the parameter space.
        seed : int or `np.random.RandomState`, optional
            If `seed` is not specified, the `np.random.RandomState` singleton
            is used. If `seed` is an int, a new `np.random.RandomState`
            instance is used, seeded with `seed`. If `seed` is already a
            `np.random.RandomState` instance, then that instance is used.
            Specify `seed` for repeatable sampling.

        """
        self.k_vars = k_vars
        self.rng = check_random_state(seed)

    @abstractmethod
    def random(self, n_samples):
        """Draw `n_samples` points in the half-open interval [0, 1).

        Parameters
        ----------
        n_samples : int
            Number of samples to generate in the parameter space.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            QMC sample.

        """

    @abstractmethod
    def fast_forward(self, n):
        """Fast-forward the sequence by `n` positions.

        Parameters
        ----------
        n : int
            Number of points to skip in the sequence.

        """

    @staticmethod
    def scale(sample, bounds):
        """Scale a sample from the unit hypercube to the given bounds.

        Parameters
        ----------
        sample : array_like (n_samples, k_vars)
            QMC sample.
        bounds : tuple or array_like ([min, k_vars], [max, k_vars])
            Desired range of the transformed data. The transformation applies
            the bounds to the sample, not to the theoretical space (unit
            cube); thus the min and max values of the sample will coincide
            with the bounds.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            Scaled QMC sample.

        """
        bounds = np.asarray(bounds)
        min_ = np.min(bounds, axis=0)
        max_ = np.max(bounds, axis=0)
        return sample * (max_ - min_) + min_


class Halton(QMCEngine):

    def __init__(self, k_vars, seed=None):
        super().__init__(k_vars=k_vars, seed=seed)
        self.base = n_primes(k_vars)
        self.n_fast_forward = 0

    def random(self, n_samples):
        sample = [van_der_corput(n_samples + 1, bdim, self.n_fast_forward)
                  for bdim in self.base]
        return np.array(sample).T[1:]

    def fast_forward(self, n):
        self.n_fast_forward = n
Sorry about the delay here. I like this general interface, this is quite nice.
Sure, makes sense.
I think we can have this be case-by-case. Note that e.g. the MultivariateNormalQMC doesn't sample from a compact space but generates samples in R^n, so bounds don't really make sense there. Typically the convention is for most of these methods to generate samples in the unit cube; I would just use that convention. Scaling can then happen outside (i.e. I think you can move the
In general, I think it makes sense to start off as minimal as possible; we can always add more features / wrappers around the basics, but the other direction is much harder....
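A sketch of what "scaling outside" the engine could look like: engines keep generating in the unit cube, and a free function maps samples to the target box (an illustrative helper, not a committed API):

```python
import numpy as np

def scale(sample, l_bounds, u_bounds):
    """Affinely map points from the unit cube [0, 1)^d to the given box."""
    l_bounds = np.asarray(l_bounds, dtype=float)
    u_bounds = np.asarray(u_bounds, dtype=float)
    return sample * (u_bounds - l_bounds) + l_bounds

unit = np.array([[0.0, 0.5],
                 [0.25, 0.75]])
scaled = scale(unit, l_bounds=[0, 10], u_bounds=[2, 20])
```

Keeping this as a separate function means methods like MultivariateNormalQMC, which do not live in a compact space, simply never touch it.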
No problem! Perfect, I will finish the conversion of everything then.
OK for me.
It seems like the main follow-up here is performance-related? Maybe worth opening a new issue for moving some of the critical parts to either Cython or Pythran?
OK, I will create an issue for that. I also have two follow-up PRs: one to use this in optimize, and the other to add
I have a fast Cython implementation of a Sobol low-discrepancy quasi-random number generator using Owen scrambling. On top of that I have some transformations (inverse transform, Box-Muller) that make it easy to draw quasi-MC samples from, say, a multivariate normal distribution. This code has been used successfully in an industry setting.
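Of the two transformations mentioned, Box-Muller is self-contained enough to sketch here (plain NumPy, not the Cython code being offered): it maps pairs of uniforms in (0, 1) to pairs of independent standard normals, so it applies unchanged to consecutive dimensions of a quasi-random uniform sample.

```python
import numpy as np

def box_muller(u1, u2):
    """Transform two arrays of uniforms in (0, 1) into two arrays of
    independent standard normal variates."""
    r = np.sqrt(-2.0 * np.log(u1))   # radius from the first uniform
    theta = 2.0 * np.pi * u2         # angle from the second uniform
    return r * np.cos(theta), r * np.sin(theta)

# A PRNG source is used here for brevity; a Sobol sample of shape (n, 2)
# could be fed in the same way.
rng = np.random.default_rng(0)
u = 0.999 * rng.random((20000, 2)) + 5e-4  # keep strictly inside (0, 1)
z1, z2 = box_muller(u[:, 0], u[:, 1])
```

For scrambled QMC points the inverse-CDF transform is often preferred because it is monotone per dimension, but Box-Muller remains a common, cheap alternative.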
Before I go through the effort of creating a PR, I wanted to see whether this is something that people would like to see in scipy.
The API could look something like the following: