
[Question] Desire for qMC library in scipy? #9695

Closed
Balandat opened this issue Jan 17, 2019 · 43 comments · Fixed by #10844
Labels
enhancement, scipy.stats

Comments

@Balandat
Contributor

Balandat commented Jan 17, 2019

I have a fast Cython implementation of a Sobol low-discrepancy quasi-random number generator using Owen scrambling. On top of that I have some transformations (inverse transform, Box-Muller) that make it easy to draw quasi-MC samples from, say, a multivariate normal distribution. This code has been used successfully in an industry setting.
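For context, the Box-Muller transform mentioned above maps pairs of uniforms in (0, 1) to pairs of independent standard normals. A minimal NumPy sketch (an illustration only, not the Cython implementation in question):

```python
import numpy as np

def box_muller(u1, u2):
    """Map two arrays of uniforms in (0, 1) to two arrays of independent
    standard normal variates."""
    r = np.sqrt(-2.0 * np.log(u1))   # radius from the first uniform
    theta = 2.0 * np.pi * u2         # angle from the second uniform
    return r * np.cos(theta), r * np.sin(theta)

# a plain PRNG stands in here for the scrambled Sobol uniforms
u = np.random.default_rng(1234).random((2, 1000))
u = np.clip(u, 1e-12, 1.0 - 1e-12)   # keep strictly inside (0, 1)
z0, z1 = box_muller(u[0], u[1])
```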

Before I go through the effort of creating a PR, I wanted to see whether this is something that people would like to see in scipy.

The API could look something like the following:

from scipy.qmc import Sobol

sobol = Sobol(dimen=d, seed=1234)
samples = sobol.rvs(n)
from scipy.qmc import MultivariateNormalQMC

mvn = MultivariateNormalQMC(mean=np.zeros(2), cov=np.array([[1, 0.1], [0.1, 1]]))
samples = mvn.rvs(n)
@Balandat Balandat changed the title [Question] Desire for qMC library in scipy [Question] Desire for qMC library in scipy? Jan 17, 2019
@Balandat Balandat reopened this Jan 17, 2019
@rkern
Member

rkern commented Jan 18, 2019

Yes!

@tupui
Member

tupui commented Feb 27, 2019

Plus there is already a Sobol' QMC generator in SciPy (from scipy.optimize._shgo_lib import sobol_seq). There is some current work in statsmodels to implement some sampling methods: statsmodels/statsmodels#4104 and WIP: statsmodels/statsmodels#4113

@rgommers added the enhancement label May 1, 2019
@rgommers
Member

rgommers commented May 1, 2019

There definitely is enough interest, a PR would be very welcome @Balandat.

from scipy.qmc import MultivariateNormalQMC

A separate top-level module does not seem warranted. This seems to fit well in scipy.stats. Depending on how many functions you want to add (sounds like quite a few?), we could create scipy.stats.qmc.

@Balandat
Contributor Author

Balandat commented May 1, 2019

Sorry I have been swamped with stuff, hopefully I can get to this soon.

FWIW, we upstreamed a modified implementation into pytorch: https://pytorch.org/docs/stable/torch.html#quasi-random-sampling

@tupui
Member

tupui commented Sep 17, 2019

@rgommers Is there any interest for other QMC like Halton (which is slightly better in dim <~ 10)? I put a version in statsmodels. I could also propose some LHS.

@rgommers
Member

@rgommers Is there any interest for other QMC like Halton (which is slightly better in dim <~ 10)? I put a version in statsmodels. I could also propose some LHS.

In principle yes I think, but we try hard to not duplicate things between statsmodels and scipy.stats. The Halton function and statsmodels.tools.sequences seem to not have made it into the Statsmodels docs though (and your WIP PR actually moves it to doe.sequences), so it's unclear to me if that's public or not - my impression is it's private still.

If we're going to have multiple, either in SciPy or in Statsmodels, it would be good if they have a uniform API as much as possible. The halton, van_der_corput and latin_hypercube signatures are all slightly different and that doesn't look necessary.

I think a self-contained module would be better rather than putting them into a "design of experiments" (sub)module. Other than that, no strong preference between SciPy and Statsmodels.

Cc @josef-pkt, @bashtage

@bashtage
Contributor

IMO both of these are better fits for scipy than statsmodels. I think those functions can be treated as private, and so could be quietly and quickly deprecated (with a pointer to scipy.MOD) if SciPy goes forward with these.

@bashtage
Contributor

FWIW there is a Sobol sequence generator wrapped in Python with support for 21201 dimensions in pyscenarios. It uses Numba and so is fast.

@bashtage
Contributor

As for a design, it seems like it would make sense to split the transformations from the generation. Transformation is a generic problem that can take any form of quasi-random-like generator.

@tupui
Member

tupui commented Sep 18, 2019

@bashtage I agree with you that these methods are more general than "just" statistically relevant. For lots of optimization/integration problems we want to have better number generators. But I am not sure about keeping these private. I would want to have an alternative to np.random.random() which is accessible.

@bashtage
Contributor

@tupui The private was only relevant to what is already in statsmodels, not what may go into SciPy. By private I mean it doesn't go into the docs, and then if this mod goes into SciPy, the version in statsmodels can be removed and users redirected to scipy (which sm knows that user already uses).

@tupui
Member

tupui commented Sep 18, 2019

@tupui The private was only relevant to what is already in statsmodels, not what may go into SciPy. By private I mean it doesn't go into the docs, and then if this mod goes into SciPy, the version in statsmodels can be removed and users redirected to scipy (which sm knows that user already uses).

I agree with this. What do you think @rgommers about moving the PR and the existing halton and lhs here?

@rgommers
Member

I agree with this. What do you think @rgommers about moving the PR and the existing halton and lhs here?

I think that makes sense. Putting it all together in one PR so it can be reviewed together may make sense.

As for a design, it seems like it would make sense to split the transformations from the generation. Transformation is a generic problem that can take any form of quasi-random-like generator.

Agreed. Inspired by / borrowed from the new numpy.random generator infrastructure I guess :) Are there any parts of that you would recommend building on, or is QMC too different?

@tupui
Member

tupui commented Sep 19, 2019

@rgommers super. As for the infrastructure, numpy.random seems right, but I have never done any Cython so I am not sure how to proceed here. I propose to first put the code in scipy.stats.qmc as you suggested.

@rgommers
Member

Sounds good! It's fine to keep things in Python for now. We've got to get the API right first; improving the performance with Cython can come later (even kept as a future todo after merging).

@tupui
Member

tupui commented Sep 19, 2019

@rgommers Hum, not sure about the naming. Maybe scipy.stats.doe or scipy.stats.sampling would be better because of LHS which is not a QMC.

Anyway, I created a first PR and propose to move the discussion there 😃

@rkern
Member

rkern commented Sep 19, 2019

IIRC, only transformation methods work reasonably for deriving non-uniform variates from QMC uniforms. Most of the distribution algorithms in Generator would not work with QMC uniforms.

@bashtage
Contributor

IIRC, only transformation methods work reasonably for deriving non-uniform variates from QMC uniforms. Most of the distribution algorithms in Generator would not work with QMC uniforms.

I was thinking that it could be useful for a transformation to accept a BitGenerator. For example, if the transformations included a multivariate Student's t copula, then one could generate MV Student's RVs using either a SobolGenerator or a BitGenerator.
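A minimal sketch of that decoupling (the function name is hypothetical, and a Gaussian copula stands in for the Student's t example; the point is that any uniform source, PRNG or QMC, can feed the same transformation):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_transform(uniforms, cov):
    """Map uniforms in (0, 1)^d to correlated normal variates.

    The transformation is agnostic to where the uniforms came from,
    which is the point of splitting it from the generator.
    """
    z = norm.ppf(uniforms)           # inverse CDF: uniform -> standard normal
    chol = np.linalg.cholesky(cov)   # induce the desired covariance
    return z @ chol.T

# a plain PRNG stands in here for a Sobol engine
u = np.random.default_rng(0).random((1000, 2))
u = np.clip(u, 1e-12, 1.0 - 1e-12)   # keep strictly inside (0, 1)
samples = gaussian_copula_transform(u, np.array([[1.0, 0.1], [0.1, 1.0]]))
```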

@rgommers
Member

@tupui I prefer to keep the discussion here (on the high-level goals and design at least) to not split it over two places.

Maybe scipy.stats.doe or scipy.stats.sampling would be better because of LHS which is not a QMC.

Design of experiments as a topic we'd definitely want to leave with Statsmodels. Here we want the qMC samplers, not the DOE-specific parts. Latin hypercube sampling could be included perhaps, it's "near-random" rather than quasi-random but also widely used in Monte Carlo sampling and not specific to DOE. Also note that optimize.shgo and optimize.differential_evolution both contain latin hypercube sampling. Having a single efficient method for it standalone that those functions could rely on could make sense.
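For reference, a minimal sketch of what such a standalone latin hypercube sampler does (a hypothetical helper, not the code currently in shgo or differential_evolution): each dimension is split into n equal strata, one point is drawn inside each stratum, and the strata are shuffled independently per dimension.

```python
import numpy as np

def latin_hypercube(n_samples, k_vars, seed=None):
    """Latin hypercube sample in [0, 1)^k_vars: exactly one point falls
    in each of the n_samples strata of every dimension."""
    rng = np.random.default_rng(seed)
    # one uniform draw inside each stratum, per dimension
    strata = np.arange(n_samples)[:, None]
    samples = (rng.random((n_samples, k_vars)) + strata) / n_samples
    # decorrelate dimensions by shuffling the strata independently
    for j in range(k_vars):
        rng.shuffle(samples[:, j])
    return samples
```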

@tupui
Member

tupui commented Sep 20, 2019

@tupui I prefer to keep the discussion here (on the high-level goals and design at least) to not split it over two places.

No problem.

Design of experiments as a topic we'd definitely want to leave with Statsmodels. Here we want the qMC samplers, not the DOE-specific parts. Latin hypercube sampling could be included perhaps, it's "near-random" rather than quasi-random but also widely used in Monte Carlo sampling and not specific to DOE.

I guess it's a matter of terminology. I am coming from the Uncertainty Quantification field and all these methods are under: DoE. I will change it to scipy.stats.qmc if you prefer.

Also note that optimize.shgo and optimize.differential_evolution both contain latin hypercube sampling. Having a single efficient method for it standalone that those functions could rely on could make sense.

Fully agree on that.

@rgommers
Member

rgommers commented Oct 4, 2019

gh-10844 proposes five new functions:

  • qmc.discrepancy
  • qmc.halton
  • qmc.orthogonal_latin_hypercube
  • qmc.latin_hypercube
  • qmc.optimal_design

I'm not sure about the optimal_design one, that may be Statsmodels territory.

Thoughts on whether the above functions are the right scope anyone?

@Balandat
Contributor Author

Balandat commented Oct 4, 2019

The starting point of this was the Cython Sobol implementation I have - maybe we should add that to qmc.sobol?

@rgommers
Member

rgommers commented Oct 5, 2019

The starting point of this was the Cython Sobol implementation I have - maybe we should add that to qmc.sobol?

yes, definitely!

@tupui
Member

tupui commented Oct 5, 2019

@Balandat @rgommers for info, I just have 2 more weeks to work on this and then I will be AFK for 10 months (world travel without a computer). So feel free afterwards to continue this without me if we do not manage this before 😄

@rgommers
Member

rgommers commented Oct 5, 2019

Thanks @tupui, that's good to know. Would you in principle be happy to help out maintaining it after you're back? (I know plans can change, that's why "in principle")

The reason I'm asking is that it's a lot easier to add significant new functionality, a whole submodule in this case, if the original authors or other domain experts can be asked in case of bugs or design decisions. For people who already understand the algorithms that can often be a small effort, while if one of the SciPy maintainers needs to do it they may first have to study the algorithmic details.

Same question to @Balandat.

I'm happy to help review and make the code fit nicely into the SciPy API and code base, but I likely won't become a content expert here ....

@tupui
Member

tupui commented Oct 5, 2019

@rgommers Yes of course I would be happy to help 😄 I am working as a python dev now and for good hopefully ;)

@Balandat
Contributor Author

Balandat commented Oct 7, 2019

I'm happy to be PoC for the Sobol generator. I can add this as a commit to #10844 soonish - I might need some help with the cython setup though, not sure if scipy does something particular about that.

@rgommers
Member

rgommers commented Oct 8, 2019

Thanks @tupui and @Balandat

I can add this as a commit to #10844 soonish

May be useful to put it all together indeed. @tupui perhaps you can add @Balandat as a collaborator on your fork of SciPy, so he can push commits to gh-10844?

I might need some help with the cython setup though, not sure if scipy does something particular about that.

The build system should autodetect .pyx files and run Cython over them. The only things that need doing are:

  • for each filename.pyx, add the generated filename.c to .gitignore
  • edit stats/setup.py to build the extension. Should be straightforward; you can follow what is done for _stats.pyx

@tupui
Member

tupui commented Oct 8, 2019

May be useful to put it all together indeed. @tupui perhaps you can add @Balandat as a collaborator on your fork of SciPy, so he can push commits to gh-10844?

I have done it already 😉

@Balandat
Contributor Author

Balandat commented Oct 8, 2019

Great, I hope to get to this later this week.

@tupui
Member

tupui commented May 28, 2020

@Balandat hey! Did you get some time to work on this?

@Balandat
Contributor Author

Unfortunately not. I am pretty swamped for the next couple of weeks, but after that I should be able to find some time.

@tupui
Member

tupui commented May 28, 2020

Unfortunately not. I am pretty swamped for the next couple of weeks, but after that I should be able to find some time.

No worries, just wanted to have a heads up 😃

@Balandat
Contributor Author

Actually I banged this out just now: 65b8552

Still a little rough around the edges and needs some cleaning up, but most functionality is there and tests fine.

Note that the setup uses a SobolEngine as the sequence has state. This is somewhat different than the functional interface of the other RNGs. We'll have to discuss how this could fit in.

@tupui
Member

tupui commented May 28, 2020

Actually I banged this out just now: 65b8552

Nice, you rock! 👍

Note that the setup uses a SobolEngine as the sequence has state. This is somewhat different than the functional interface of the other RNGs. We'll have to discuss how this could fit in.

I will have a look.

@tupui
Member

tupui commented May 28, 2020

Now that we have all the methods in here 🎊 , I guess we should fix a sort of common API. Currently, you either call a function like qmc.[method](dim, n_samples) or use a class: engine = qmc.[method](dim) ; engine.draw(n_samples).

Personally, I lean more toward the first option as it's more immediate from a user perspective. But I understand the control that one could want with the class approach. I propose we then have both a function and a class for this. The class would be more private.

Then, the question of the bounds. Shall all methods just work in the [0, 1) space as with Sobol'?
Or shall we deal with bounds as in the other methods?

Also, what about a helper function like we have in optimize? This could be handy.

@Balandat
Contributor Author

So conceptually there is an important distinction with qMC methods between the functional API and the object-based API: for iid random draws you can just continue to call the functional API and append the samples, and you get a set of iid samples. For the space-filling sequences this is not the case, so only having a functional API wouldn't allow you to start from mid-sequence; you'd always have to draw from the start. We can provide that API and it would be useful, but we should think about the "resumability" of the sequence in the qMC case, and whether we want to expose that to users (we probably should in some way).
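To make the resumability point concrete, here is a toy stateful engine based on an additive (Kronecker) sequence; this is purely illustrative, not the proposed Sobol code:

```python
import math

class AdditiveEngine:
    """Toy stateful low-discrepancy engine (additive/Kronecker sequence).

    It must remember how many points it has already emitted, which is
    exactly what a purely functional API cannot do.
    """
    PHI = (math.sqrt(5.0) - 1.0) / 2.0  # fractional part of the golden ratio

    def __init__(self):
        self.i = 0

    def random(self, n):
        out = [((self.i + j + 1) * self.PHI) % 1.0 for j in range(n)]
        self.i += n  # advance so the next draw resumes mid-sequence
        return out

engine = AdditiveEngine()
resumed = engine.random(4) + engine.random(4)  # two draws continue the sequence
assert resumed == AdditiveEngine().random(8)   # same as one draw of 8 points
```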

@tupui
Member

tupui commented Jun 5, 2020

@Balandat OK, then I will use an object-based API as well. This way we would have:

  • qmc.NormalQMCEngine
  • qmc.MultivariateNormalQMCEngine
  • qmc.SobolEngine
  • qmc.HaltonEngine
  • qmc.OrthogonalLatinHypercubeEngine
  • qmc.LatinHypercubeEngine
  • qmc.OptimalDesignEngine
  • qmc.discrepancy

I would remove the engine in the name if you don't mind.

Then we could have a base class which would require to implement:

  • qmc.[].random(args) (to follow numpy.random)
  • qmc.[].fast_forward(args) (or we do this in the init)

I would also move NormalQMCEngine and MultivariateNormalQMCEngine to the same qmc file. And I would add an option to be able to select another base engine.

For the bounds, what is your opinion? Shall we just leave it to the user? I am neutral here.

Last but not least, I would add a last class which would serve as a higher level interface:

  • qmc.QMC

Let me know what you think about these and I can work on the modifications 😃.

cc @rgommers @bashtage

@tupui
Member

tupui commented Jun 18, 2020

@Balandat If the following is fine for you, I will do this for all methods. For now, I left scale in the base class. Not sure about this. Maybe outside, like discrepancy, would be better. Or should we also put discrepancy inside?

cc @rgommers @bashtage

class QMCEngine(ABC):
    """Quasi-Monte Carlo engine sampler.

    Samples are distributed over the half-open interval [0, 1).

    Examples
    --------
    Generate samples from a low discrepancy sequence of Halton.

    >>> from scipy.stats import qmc
    >>> sampler = qmc.Halton(k_vars=2)
    >>> sample = sampler.random(n_samples=5)

    Compute the quality of the sample using the discrepancy criterion.

    >>> uniformity = qmc.discrepancy(sample)

    If one wants to continue an existing design, extra points can be obtained.

    >>> sampler.fast_forward(5)
    >>> sample_continued = sampler.random(n_samples=5)

    Finally, samples can be scaled to bounds.

    >>> bounds = np.array([[0, 2], [10, 5]])
    >>> sample_continued_scaled = sampler.scale(sample_continued, bounds)

    """

    @abstractmethod
    def __init__(self, k_vars, seed=None):
        """
        Parameters
        ----------
        k_vars : int
            Dimension of the parameter space.
        seed : int or `np.random.RandomState`, optional
            If `seed` is not specified the `np.random.RandomState` singleton
            is used.
            If `seed` is an int, a new `np.random.RandomState` instance is
            used, seeded with `seed`.
            If `seed` is already a `np.random.RandomState` instance, then
            that instance is used.
            Specify `seed` for repeatable sampling.
        """
        self.k_vars = k_vars
        self.rng = check_random_state(seed)

    @abstractmethod
    def random(self, n_samples):
        """Draw n_samples in the half-open interval [0, 1).

        Parameters
        ----------
        n_samples : int
            Number of samples to generate in the parameter space.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            QMC sample.
        """
        pass

    @abstractmethod
    def fast_forward(self, n):
        """Fast-forward the sequence by n positions.

        Parameters
        ----------
        n : int
            Number of points to skip in the sequence.
        """
        pass

    @classmethod
    def scale(cls, sample, bounds):
        """Sample scaling from unit hypercube to bounds range.

        Parameters
        ----------
        sample : array_like (n_samples, k_vars)
            QMC sample.
        bounds : tuple or array_like ([min, k_vars], [max, k_vars])
            Desired range of transformed data. The transformation applies the
            bounds to the sample and not to the theoretical space (the unit
            cube). Thus the min and max values of the sample will coincide
            with the bounds.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            QMC scaled-sample.

        """
        bounds = np.asarray(bounds)
        min_ = np.min(bounds, axis=0)
        max_ = np.max(bounds, axis=0)
        return sample * (max_ - min_) + min_


class Halton(QMCEngine):

    def __init__(self, k_vars):
        super().__init__(k_vars=k_vars)
        self.base = n_primes(k_vars)
        self.n_fast_forward = 0

    def random(self, n_samples):
        sample = [van_der_corput(n_samples + 1, bdim, self.n_fast_forward)
                  for bdim in self.base]
        return np.array(sample).T[1:]

    def fast_forward(self, n):
        self.n_fast_forward = n
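The snippet above assumes two helpers, n_primes and van_der_corput; minimal sketches of what they could look like (illustrative versions, not the statsmodels implementations):

```python
def n_primes(k_vars):
    """Return the first k_vars primes, used as Halton bases."""
    primes, candidate = [], 2
    while len(primes) < k_vars:
        if all(candidate % p for p in primes):  # trial division by known primes
            primes.append(candidate)
        candidate += 1
    return primes

def van_der_corput(n_samples, base=2, start_index=0):
    """First n_samples points of the van der Corput sequence in `base`,
    starting at start_index (which is how fast_forward is honored)."""
    sequence = []
    for index in range(start_index, start_index + n_samples):
        x, denominator = 0.0, 1.0
        while index > 0:
            denominator *= base
            index, remainder = divmod(index, base)
            x += remainder / denominator  # radical-inverse digit expansion
        sequence.append(x)
    return sequence
```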

@Balandat
Contributor Author

Sorry about the delay here. I like this general interface, this is quite nice.

I would remove the engine in the name if you don't mind.

Sure, makes sense.

For the bounds, what is your opinion? Shall we just leave it to the user? I am neutral here.

I think we can handle this case by case. Note that e.g. the MultivariateNormalQMC doesn't sample from a compact space but generates samples in R^n, so bounds don't really make sense there. Typically the convention is for most of these methods to generate samples in the unit cube; I would just use that convention. Scaling can then happen outside (i.e. I think you can move the scale method outside the class).

In general I think it makes sense to start off as minimal as possible, we can always add more features / wrappers around the basics, but the other direction is much harder....

@tupui
Member

tupui commented Jun 18, 2020

Sorry about the delay here. I like this general interface, this is quite nice.

No problem! Perfect I will finish the conversion of everything then.

[...] Scaling can then happen outside (i.e. I think you can move the scale method outside the class).

OK for me.

@rgommers
Member

It seems like the main follow-up here is performance-related? Maybe worth opening a new issue for moving some of the critical parts to either Cython or Pythran?

@tupui
Member

tupui commented Jan 30, 2021

It seems like the main follow-up here is performance-related? Maybe worth opening a new issue for moving some of the critical parts to either Cython or Pythran?

Ok, I will create an issue for that. I also have 2 follow-up PRs: one to use this in optimize and the other to add OptimalDesign.
