
[Question] Desire for qMC library in scipy? #9695

Closed
Balandat opened this issue Jan 17, 2019 · 43 comments · Fixed by #10844
Labels
enhancement, scipy.stats

Comments

@Balandat
Contributor

Balandat commented Jan 17, 2019

I have a fast Cython implementation of a Sobol low-discrepancy quasi-random number generator using Owen scrambling. On top of that I have some transformations (inverse transform, Box-Muller) that make it easy to draw quasi-MC samples from, say, a multivariate normal distribution. This code has been used successfully in an industry setting.
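For context, the Box-Muller transform mentioned above maps pairs of uniforms in (0, 1) to pairs of independent standard normals. A minimal NumPy sketch (an illustration only, not the Cython implementation in question):

```python
import numpy as np

def box_muller(u1, u2):
    """Map two arrays of uniforms in (0, 1) to two arrays of independent
    standard normal variates."""
    r = np.sqrt(-2.0 * np.log(u1))   # radius from the first uniform
    theta = 2.0 * np.pi * u2         # angle from the second uniform
    return r * np.cos(theta), r * np.sin(theta)

# a plain PRNG stands in here for the scrambled Sobol uniforms
u = np.random.default_rng(1234).random((2, 1000))
u = np.clip(u, 1e-12, 1.0 - 1e-12)   # keep strictly inside (0, 1)
z0, z1 = box_muller(u[0], u[1])
```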

Before I go through the effort of creating a PR, I wanted to see whether this is something that people would like to see in scipy.

The API could look something like the following:

from scipy.qmc import Sobol

sobol = Sobol(dimen=d, seed=1234)
samples = sobol.rvs(n)
from scipy.qmc import MultivariateNormalQMC

mvn = MultivariateNormalQMC(mean=np.zeros(2), cov=np.array([[1, 0.1], [0.1, 1]]))
samples = mvn.rvs(n)
@Balandat Balandat changed the title [Question] Desire for qMC library in scipy [Question] Desire for qMC library in scipy? Jan 17, 2019
@Balandat Balandat reopened this Jan 17, 2019
@rkern
Member

rkern commented Jan 18, 2019

Yes!

@tupui
Member

tupui commented Feb 27, 2019

Plus there is already a Sobol' QMC generator in SciPy (from scipy.optimize._shgo_lib import sobol_seq). There is some current work in statsmodels to implement some sampling methods: statsmodels/statsmodels#4104 and WIP: statsmodels/statsmodels#4113

@rgommers added the enhancement label May 1, 2019
@rgommers
Member

rgommers commented May 1, 2019

There definitely is enough interest, a PR would be very welcome @Balandat.

from scipy.qmc import MultivariateNormalQMC

A separate top-level module does not seem warranted. This seems to fit well in scipy.stats. Depending on how many functions you want to add (sounds like quite a few?), we could create scipy.stats.qmc.

@Balandat
Contributor Author

Balandat commented May 1, 2019

Sorry I have been swamped with stuff, hopefully I can get to this soon.

FWIW, we upstreamed a modified implementation into pytorch: https://pytorch.org/docs/stable/torch.html#quasi-random-sampling

@tupui
Member

tupui commented Sep 17, 2019

@rgommers Is there any interest for other QMC like Halton (which is slightly better in dim <~ 10)? I put a version in statsmodels. I could also propose some LHS.

@rgommers
Member

@rgommers Is there any interest for other QMC like Halton (which is slightly better in dim <~ 10)? I put a version in statsmodels. I could also propose some LHS.

In principle yes I think, but we try hard to not duplicate things between statsmodels and scipy.stats. The Halton function and statsmodels.tools.sequences seem to not have made it into the Statsmodels docs though (and your WIP PR actually moves it to doe.sequences), so it's unclear to me if that's public or not - my impression is it's private still.

If we're going to have multiple, either in SciPy or in Statsmodels, it would be good if they have a uniform API as much as possible. The halton, van_der_corput and latin_hypercube signatures are all slightly different and that doesn't look necessary.

I think a self-contained module would be better rather than putting them into a "design of experiments" (sub)module. Other than that, no strong preference between SciPy and Statsmodels.

Cc @josef-pkt, @bashtage

@bashtage
Contributor

IMO both of these are better fits for scipy than statsmodels. I think those functions can be treated as private, and so could be quietly and quickly deprecated (with a pointer to scipy.MOD) if SciPy goes forward with these.

@bashtage
Contributor

FWIW there is a Sobol sequence generator wrapped in Python with support for 21201 dimensions in pyscenarios. It uses Numba and so is fast.

@bashtage
Contributor

As for a design, it seems like it would make sense to split the transformations from the generation. Transformation is a generic problem that can take any form of quasi-random-like generator.

@tupui
Member

tupui commented Sep 18, 2019

@bashtage I agree with you that these methods are more general than "just" statistically relevant. For lots of optimization/integration problems we want to have better number generators. But I am not sure about keeping these private. I would want to have an alternative to np.random.random() which is accessible.

@bashtage
Contributor

@tupui The private was only relevant to what is already in statsmodels, not what may go into SciPy. By private I mean it doesn't go into the docs, and then if this mod goes into SciPy, the version in statsmodels can be removed and users redirected to scipy (which sm knows that user already uses).

@tupui
Member

tupui commented Sep 18, 2019

@tupui The private was only relevant to what is already in statsmodels, not what may go into SciPy. By private I mean it doesn't go into the docs, and then if this mod goes into SciPy, the version in statsmodels can be removed and users redirected to scipy (which sm knows that user already uses).

I agree with this. What do you think @rgommers about moving the PR and the existing halton and lhs here?

@rgommers
Member

I agree with this. What do you think @rgommers about moving the PR and the existing halton and lhs here?

I think that makes sense. Putting it all together in one PR so it can be reviewed together may make sense.

As for a design, it seems like it would make sense to split the transformations from the generation. Transformation is a generic problem that can take any form of quasi-random-like generator.

Agreed. Inspired by / borrowed from the new numpy.random generator infrastructure I guess :) Are there any parts of that you would recommend building on, or is QMC too different?

@tupui
Member

tupui commented Sep 19, 2019

@rgommers super. As for the infrastructure, numpy.random seems right, but I have never done any Cython so I am not sure how to proceed here. I propose to first put the code in scipy.stats.qmc as you suggested.

@rgommers
Member

Sounds good! It's fine to keep things in Python for now. We've got to get the API right first; improving the performance with Cython can come later (even kept as a future todo after merging).

@tupui
Member

tupui commented Sep 19, 2019

@rgommers Hum, not sure about the naming. Maybe scipy.stats.doe or scipy.stats.sampling would be better because of LHS which is not a QMC.

Anyway, I created a first PR and propose to move the discussion there 😃

@rkern
Member

rkern commented Sep 19, 2019

IIRC, only transformation methods work reasonably for deriving non-uniform variates from QMC uniforms. Most of the distribution algorithms in Generator would not work with QMC uniforms.

@bashtage
Contributor

IIRC, only transformation methods work reasonably for deriving non-uniform variates from QMC uniforms. Most of the distribution algorithms in Generator would not work with QMC uniforms.

I was thinking that it could be useful for a transformation to accept a BitGenerator. For example, if the transformations included a multivariate Student's t copula, then one could generate MV Student's RVs using either a SobolGenerator or a BitGenerator.
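A minimal sketch of that decoupling (the function name is hypothetical, and a Gaussian copula stands in for the Student's t example; the point is that any uniform source, PRNG or QMC, can feed the same transformation):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_transform(uniforms, cov):
    """Map uniforms in (0, 1)^d to correlated normal variates.

    The transformation is agnostic to where the uniforms came from,
    which is the point of splitting it from the generator.
    """
    z = norm.ppf(uniforms)           # inverse CDF: uniform -> standard normal
    chol = np.linalg.cholesky(cov)   # induce the desired covariance
    return z @ chol.T

# a plain PRNG stands in here for a Sobol engine
u = np.random.default_rng(0).random((1000, 2))
u = np.clip(u, 1e-12, 1.0 - 1e-12)   # keep strictly inside (0, 1)
samples = gaussian_copula_transform(u, np.array([[1.0, 0.1], [0.1, 1.0]]))
```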

@rgommers
Member

@tupui I prefer to keep the discussion here (on the high-level goals and design at least) to not split it over two places.

Maybe scipy.stats.doe or scipy.stats.sampling would be better because of LHS which is not a QMC.

Design of experiments as a topic we'd definitely want to leave with Statsmodels. Here we want the qMC samplers, not the DOE-specific parts. Latin hypercube sampling could be included perhaps, it's "near-random" rather than quasi-random but also widely used in Monte Carlo sampling and not specific to DOE. Also note that optimize.shgo and optimize.differential_evolution both contain latin hypercube sampling. Having a single efficient method for it standalone that those functions could rely on could make sense.
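For reference, a minimal sketch of what such a standalone latin hypercube sampler does (a hypothetical helper, not the code currently in shgo or differential_evolution): each dimension is split into n equal strata, one point is drawn inside each stratum, and the strata are shuffled independently per dimension.

```python
import numpy as np

def latin_hypercube(n_samples, k_vars, seed=None):
    """Latin hypercube sample in [0, 1)^k_vars: exactly one point falls
    in each of the n_samples strata of every dimension."""
    rng = np.random.default_rng(seed)
    # one uniform draw inside each stratum, per dimension
    strata = np.arange(n_samples)[:, None]
    samples = (rng.random((n_samples, k_vars)) + strata) / n_samples
    # decorrelate dimensions by shuffling the strata independently
    for j in range(k_vars):
        rng.shuffle(samples[:, j])
    return samples
```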

@tupui
Member

tupui commented Sep 20, 2019

@tupui I prefer to keep the discussion here (on the high-level goals and design at least) to not split it over two places.

No problem.

Design of experiments as a topic we'd definitely want to leave with Statsmodels. Here we want the qMC samplers, not the DOE-specific parts. Latin hypercube sampling could be included perhaps, it's "near-random" rather than quasi-random but also widely used in Monte Carlo sampling and not specific to DOE.

I guess it's a matter of terminology. I am coming from the Uncertainty Quantification field and all these methods are under: DoE. I will change it to scipy.stats.qmc if you prefer.

Also note that optimize.shgo and optimize.differential_evolution both contain latin hypercube sampling. Having a single efficient method for it standalone that those functions could rely on could make sense.

Fully agree on that.

@rgommers
Member

rgommers commented Oct 4, 2019

gh-10844 proposes five new functions:

  • qmc.discrepancy
  • qmc.halton
  • qmc.orthogonal_latin_hypercube
  • qmc.latin_hypercube
  • qmc.optimal_design

I'm not sure about the optimal_design one, that may be Statsmodels territory.

Thoughts on whether the above functions are the right scope anyone?

@Balandat
Contributor Author

Balandat commented Oct 4, 2019

The starting point of this was the Cython Sobol implementation I have - maybe we should add that to qmc.sobol?

@rgommers
Member

rgommers commented Oct 5, 2019

The starting point of this was the Cython Sobol implementation I have - maybe we should add that to qmc.sobol?

yes, definitely!

@tupui
Member

tupui commented Oct 5, 2019

@Balandat @rgommers for info, I just have 2 more weeks to work on this and then I will be AFK for 10 months (world travel without a computer). So feel free afterwards to continue this without me if we do not manage this before 😄

@rgommers
Member

rgommers commented Oct 5, 2019

Thanks @tupui, that's good to know. Would you in principle be happy to help out maintaining it after you're back? (I know plans can change, that's why "in principle")

The reason I'm asking is that it's a lot easier to add significant new functionality, a whole submodule in this case, if the original authors or other domain experts can be asked in case of bugs or design decisions. For people who already understand the algorithms that can often be a small effort, while if one of the SciPy maintainers needs to do it they may first have to study the algorithmic details.

Same question to @Balandat.

I'm happy to help review and make the code fit nicely into the SciPy API and code base, but I likely won't become a content expert here ....

@tupui
Member

tupui commented Oct 5, 2019

@rgommers Yes of course I would be happy to help 😄 I am working as a python dev now and for good hopefully ;)

@Balandat
Contributor Author

Balandat commented Oct 7, 2019

I'm happy to be PoC for the Sobol generator. I can add this as a commit to #10844 soonish - I might need some help with the cython setup though, not sure if scipy does something particular about that.

@rgommers
Member

rgommers commented Oct 8, 2019

Thanks @tupui and @Balandat

I can add this as a commit to #10844 soonish

May be useful to put it all together indeed. @tupui perhaps you can add @Balandat as a collaborator on your fork of SciPy, so he can push commits to gh-10844?

I might need some help with the cython setup though, not sure if scipy does something particular about that.

The build system should autodetect .pyx files and run Cython over them. The only things that need doing are:

  • for each filename.pyx, add the generated filename.c to .gitignore
  • edit stats/setup.py to build the extension. Should be straightforward; you can follow what is done for _stats.pyx

@tupui
Member

tupui commented Oct 8, 2019

May be useful to put it all together indeed. @tupui perhaps you can add @Balandat as a collaborator on your fork of SciPy, so he can push commits to gh-10844?

I have done it already 😉

@Balandat
Contributor Author

Balandat commented Oct 8, 2019

Great, I hope to get to this later this week.

@tupui
Member

tupui commented May 28, 2020

@Balandat hey! Did you get some time to work on this?

@Balandat
Contributor Author

Unfortunately not. I am pretty swamped for the next couple of weeks, but after that I should be able to find some time.

@tupui
Member

tupui commented May 28, 2020

Unfortunately not. I am pretty swamped for the next couple of weeks, but after that I should be able to find some time.

No worries, just wanted to have a heads up 😃

@Balandat
Contributor Author

Actually I banged this out just now: 65b8552

Still a little rough around the edges and needs some cleaning up, but most functionality is there and tests fine.

Note that the setup uses a SobolEngine as the sequence has state. This is somewhat different than the functional interface of the other RNGs. We'll have to discuss how this could fit in.

@tupui
Member

tupui commented May 28, 2020

Actually I banged this out just now: 65b8552

Nice, you rock! 👍

Note that the setup uses a SobolEngine as the sequence has state. This is somewhat different than the functional interface of the other RNGs. We'll have to discuss how this could fit in.

I will have a look.

@tupui
Member

tupui commented May 28, 2020

Now that we have all the methods in here 🎊 , I guess we should fix a sort of common API. Currently, you either call a function like qmc.[method](dim, n_samples) or use a class: engine = qmc.[method](dim) ; engine.draw(n_samples).

Personally, I lean more toward the first option as it's more immediate from a user perspective. But I understand the control that one could want with the class approach. I propose we then have both a function and a class for this. The class would be more private.

Then, the question of the bounds. Shall all methods just work in the [0, 1) space as with Sobol'?
Or shall we deal with bounds as in the other methods?

Also, what about a helper function like we have in optimize? This could be handy.

@Balandat
Contributor Author

So conceptually there is an important distinction with qMC methods between the functional API and the object-based API: for iid random draws you can just continue to call the functional API and append the samples, and you get a set of iid samples. For the space-filling sequences this is not the case, so only having a functional API wouldn't allow you to start from mid-sequence; you'd always have to draw from the start. We can provide that API and it would be useful, but we should think about the "resumability" of the sequence in the qMC case, and whether we want to expose that to users (we probably should in some way).
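To make the resumability point concrete, here is a toy stateful engine based on an additive (Kronecker) sequence; this is purely illustrative, not the proposed Sobol code:

```python
import math

class AdditiveEngine:
    """Toy stateful low-discrepancy engine (additive/Kronecker sequence).

    It must remember how many points it has already emitted, which is
    exactly what a purely functional API cannot do.
    """
    PHI = (math.sqrt(5.0) - 1.0) / 2.0  # fractional part of the golden ratio

    def __init__(self):
        self.i = 0

    def random(self, n):
        out = [((self.i + j + 1) * self.PHI) % 1.0 for j in range(n)]
        self.i += n  # advance so the next draw resumes mid-sequence
        return out

engine = AdditiveEngine()
resumed = engine.random(4) + engine.random(4)  # two draws continue the sequence
assert resumed == AdditiveEngine().random(8)   # same as one draw of 8 points
```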

@tupui
Member

tupui commented Jun 5, 2020

@Balandat OK, then I will use an object-based API as well. This way we would have:

  • qmc.NormalQMCEngine
  • qmc.MultivariateNormalQMCEngine
  • qmc.SobolEngine
  • qmc.HaltonEngine
  • qmc.OrthogonalLatinHypercubeEngine
  • qmc.LatinHypercubeEngine
  • qmc.OptimalDesignEngine
  • qmc.discrepancy

I would remove the engine in the name if you don't mind.

Then we could have a base class which would require to implement:

  • qmc.[].random(args) (to follow numpy.random)
  • qmc.[].fast_forward(args) (or we do this in the init)

I would also move NormalQMCEngine and MultivariateNormalQMCEngine to the same qmc file. And I would add an option to be able to select another base engine.

For the bounds, what is your opinion? Shall we just leave it to the user? I am neutral here.

Last but not least, I would add a last class which would serve as a higher level interface:

  • qmc.QMC

Let me know what you think about these and I can work on the modifications 😃.

cc @rgommers @bashtage

@tupui
Member

tupui commented Jun 18, 2020

@Balandat If the following is fine for you, I will do this for all methods. For now, I left scale in the base class. Not sure about this. Maybe outside, like discrepancy, would be better. Or should we also put discrepancy inside?

cc @rgommers @bashtage

class QMCEngine(ABC):
    """Quasi-Monte Carlo engine sampler.

    Samples are distributed over the half-open interval [0, 1).

    Examples
    --------
    Generate samples from a low discrepancy sequence of Halton.

    >>> from scipy.stats import qmc
    >>> sampler = qmc.Halton(k_vars=2)
    >>> sample = sampler.random(n_samples=5)

    Compute the quality of the sample using the discrepancy criterion.

    >>> uniformity = qmc.discrepancy(sample)

    If one wants to continue an existing design, extra points can be obtained.

    >>> sampler.fast_forward(5)
    >>> sample_continued = sampler.random(n_samples=5)

    Finally, samples can be scaled to bounds.

    >>> bounds = np.array([[0, 2], [10, 5]])
    >>> sample_continued_scaled = sampler.scale(sample_continued, bounds)

    """

    @abstractmethod
    def __init__(self, k_vars, seed=None):
        """
        Parameters
        ----------
        k_vars : int
            Dimension of the parameter space.
        seed : int or `np.random.RandomState`, optional
            If `seed` is not specified the `np.random.RandomState` singleton
            is used.
            If `seed` is an int, a new `np.random.RandomState` instance is
            used, seeded with `seed`.
            If `seed` is already a `np.random.RandomState` instance, then
            that instance is used.
            Specify `seed` for repeatable sampling.
        """
        self.k_vars = k_vars
        self.rng = check_random_state(seed)

    @abstractmethod
    def random(self, n_samples):
        """Draw n_samples in the half-open interval [0, 1).

        Parameters
        ----------
        n_samples : int
            Number of samples to generate in the parameter space.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            QMC sample.
        """
        pass

    @abstractmethod
    def fast_forward(self, n):
        """Fast-forward the sequence by n positions.

        Parameters
        ----------
        n : int
            Number of points to skip in the sequence.
        """
        pass

    @classmethod
    def scale(cls, sample, bounds):
        """Sample scaling from unit hypercube to bounds range.

        Parameters
        ----------
        sample : array_like (n_samples, k_vars)
            QMC sample.
        bounds : tuple or array_like ([min, k_vars], [max, k_vars])
            Desired range of transformed data. The transformation applies the
            bounds to the sample and not to the theoretical space (the unit
            cube). Thus the min and max values of the sample will coincide
            with the bounds.

        Returns
        -------
        sample : array_like (n_samples, k_vars)
            QMC scaled-sample.

        """
        bounds = np.asarray(bounds)
        min_ = np.min(bounds, axis=0)
        max_ = np.max(bounds, axis=0)
        return sample * (max_ - min_) + min_


class Halton(QMCEngine):

    def __init__(self, k_vars):
        super().__init__(k_vars=k_vars)
        self.base = n_primes(k_vars)
        self.n_fast_forward = 0

    def random(self, n_samples):
        sample = [van_der_corput(n_samples + 1, bdim, self.n_fast_forward)
                  for bdim in self.base]
        return np.array(sample).T[1:]

    def fast_forward(self, n):
        self.n_fast_forward = n
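The snippet above assumes two helpers, n_primes and van_der_corput; minimal sketches of what they could look like (illustrative versions, not the statsmodels implementations):

```python
def n_primes(k_vars):
    """Return the first k_vars primes, used as Halton bases."""
    primes, candidate = [], 2
    while len(primes) < k_vars:
        if all(candidate % p for p in primes):  # trial division by known primes
            primes.append(candidate)
        candidate += 1
    return primes

def van_der_corput(n_samples, base=2, start_index=0):
    """First n_samples points of the van der Corput sequence in `base`,
    starting at start_index (which is how fast_forward is honored)."""
    sequence = []
    for index in range(start_index, start_index + n_samples):
        x, denominator = 0.0, 1.0
        while index > 0:
            denominator *= base
            index, remainder = divmod(index, base)
            x += remainder / denominator  # radical-inverse digit expansion
        sequence.append(x)
    return sequence
```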

@Balandat
Contributor Author

Sorry about the delay here. I like this general interface, this is quite nice.

I would remove the engine in the name if you don't mind.

Sure, makes sense.

For the bounds, what is your opinion? Shall we just leave it to the user? I am neutral here.

I think we can handle this case by case. Note that e.g. the MultivariateNormalQMC doesn't sample from a compact space but generates samples in R^n, so bounds don't really make sense there. Typically the convention is for most of these methods to generate samples in the unit cube; I would just use that convention. Scaling can then happen outside (i.e. I think you can move the scale method outside the class).

In general I think it makes sense to start off as minimal as possible, we can always add more features / wrappers around the basics, but the other direction is much harder....

@tupui
Member

tupui commented Jun 18, 2020

Sorry about the delay here. I like this general interface, this is quite nice.

No problem! Perfect I will finish the conversion of everything then.

[...] Scaling can then happen outside (i.e. I think you can move the scale method outside the class).

OK for me.

@rgommers
Member

It seems like the main follow-up here is performance-related? Maybe worth opening a new issue for moving some of the critical parts to either Cython or Pythran?

@tupui
Member

tupui commented Jan 30, 2021

It seems like the main follow-up here is performance-related? Maybe worth opening a new issue for moving some of the critical parts to either Cython or Pythran?

Ok, I will create an issue for that. I also have 2 follow-up PRs: one to use this in optimize and the other to add OptimalDesign.
