ENH: random_state using scipy.stats.qmc engines #13368

Closed
tupui opened this issue Jan 8, 2021 · 24 comments

@tupui (Member) commented Jan 8, 2021

This is a follow up of discussions in #10844.

Describe the solution you'd like
All sampling methods in scipy.stats use random_state, which is a np.random.Generator (for new code). But this NumPy generator is not aware of dimensions. Meanwhile, the new scipy.stats.qmc allows generating samples efficiently in n dimensions. It would be nice to bridge the gap between the two.

Currently there is a qmc.MultivariateNormalQMC. Instead of duplicating this for other distributions, a solution could be to make the QMC engines inherit from np.random.BitGenerator. We could then use the new QMC engines with all the existing distributions.

Describe alternatives you've considered

  1. Solution using a BitGenerator: seemingly, this must be done in Cython. Alternatively, there is a wrapper around BitGenerator, though it is not available in numpy itself: https://bashtage.github.io/randomgen/bit_generators/userbitgenerator.html from @bashtage. This works, but the underlying numpy code is not aware of dimensions (https://github.com/numpy/numpy/blob/e4feb7027e397925d220a10dd58b581b87ca1fec/numpy/random/_generator.pyx#L3562-L3568).
from numpy.random import Generator
from randomgen import UserBitGenerator
from scipy.stats import qmc


class SobolNP:
    def __init__(self, state):
        self._next_64 = None
        self.engine = qmc.Sobol(d=1, scramble=False, seed=state)

    def random_raw(self):
        """Generate the next "raw" value, which is 64 bits."""
        return int(self.engine.random() * 2**64)  # although Sobol uses 30 bits...

    @property
    def next_64(self):
        def _next_64(void_p):
            return self.random_raw()

        self._next_64 = _next_64
        return _next_64


# example
prng = SobolNP(1234)
sobol_bit_generator = UserBitGenerator(prng.next_64, 64)
gen = Generator(sobol_bit_generator)
gen.random(8)
  2. Another solution would be to mock np.random.Generator: I use __getattr__ to forward calls to the distributions, so calls like random_state.uniform(...) work. This seems to work, and n dimensions are handled correctly too.
from functools import partial

from numpy.random import Generator
import scipy.stats as stats


class ScipyGenerator:

    @property
    def __class__(self):
        # Type-punning so isinstance(obj, Generator) passes check_random_state.
        return Generator

    def __init__(self, d):
        self.qrng = stats.qmc.Sobol(d=d, scramble=False)

    def rvs(self, *args, dist, **kwargs):
        args = list(args)
        size = kwargs.pop('size', None)
        if size is None:
            size = args.pop(0)
        sample = self.qrng.random(size)  # uniform QMC sample, shape (size, d)
        return dist(*args, **kwargs).ppf(sample)  # inverse-CDF transform

    def __getattr__(self, attr):
        # Forward e.g. random_state.gamma(...) to stats.gamma via rvs().
        dist = getattr(stats, attr)
        return partial(self.rvs, dist=dist)


# examples
random_state = ScipyGenerator(d=2)
isinstance(random_state, Generator)  # True
random_state.uniform(8)
random_state.gamma(2, size=8)
@rkern (Member) commented Jan 8, 2021

I don't believe that implementing a QMC-based BitGenerator will work properly. Most of the non-uniform distribution algorithms in Generator really require a PRNG, not a QRNG, underneath. Inversion algorithms (e.g. exponential) work fine, but most are not of that form. Similar algorithms do exist for QRNGs, but they are different and would need to be derived separately.

Implementing a class that has (at least) a subset of Generator's methods with appropriate algorithms in each might be useful, but I wouldn't do the __class__ type-punning. I get that you want it to survive a check_random_state(), but it's usually not a good idea to fake the type hierarchy like that. Note that the names of the Generator methods are often not the same as the scipy.stats distributions, so I recommend just writing explicit methods.
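To illustrate rkern's point about inversion algorithms working fine with QMC (a sketch using only scipy.stats.qmc, which the thread discusses): each exponential variate consumes exactly one uniform per dimension, so the low-discrepancy structure of the stream survives the transform.

```python
import numpy as np
from scipy.stats import qmc

# Inversion: each output consumes exactly one uniform per dimension,
# so the low-discrepancy structure of the QMC stream is preserved.
engine = qmc.Sobol(d=1, scramble=True, seed=12345)
u = engine.random(256)            # shape (256, 1), values in [0, 1)
lam = 2.0
x = -np.log1p(-u) / lam           # inverse CDF of Exponential(rate=lam)

# The QMC estimate of the mean converges faster than an i.i.d. estimate;
# the true mean is 1/lam = 0.5.
print(abs(x.mean() - 0.5) < 0.05)
```

Most of the non-inversion algorithms in Generator do not have this one-uniform-per-output structure, which is where the trouble starts.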

@tupui (Member, Author) commented Jan 8, 2021

> I don't believe that implementing a QMC-based BitGenerator will work properly. Most of the non-uniform distribution algorithms in Generator really require a PRNG, not a QRNG, underneath. Inversion algorithms (e.g. exponential) work fine, but most are not of that form. Similar algorithms do exist for QRNGs, but they are different and would need to be derived separately.

I am not sure I understand why we could not sample any distribution using QMC sampling.

> Implementing a class that has (at least) a subset of Generator's methods with appropriate algorithms in each might be useful, but I wouldn't do the __class__ type-punning. I get that you want it to survive a check_random_state(), but it's usually not a good idea to fake the type hierarchy like that. Note that the names of the Generator methods are often not the same as the scipy.stats distributions, so I recommend just writing explicit methods.

Indeed, I did this for check_random_state, but in a PR I would propose adding this class to the acceptable options of check_random_state. As for the names of the methods, the rationale was to avoid declaring all the distributions, which would duplicate a lot of code.

If you feel that what I propose here is ok, I could prepare a PR (well, we still need to merge qmc first... so right after).

@bashtage (Contributor) commented Jan 8, 2021

> I am not sure I understand why we could not sample any distribution using QMC sampling.

You can, but not using rejection samplers, which are the dominant method used in Generator.
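A sketch of why (the half-normal sampler below is a textbook rejection algorithm, not scipy code): the number of uniforms a rejection sampler consumes per accepted draw is random, so successive draws no longer correspond to successive points of the low-discrepancy sequence.

```python
import numpy as np
from scipy.stats import qmc

# Textbook rejection sampler for |N(0, 1)| using an Exp(1) envelope,
# fed naively from a QMC stream.
engine = qmc.Sobol(d=1, scramble=True, seed=7)
stream = iter(engine.random(1024).ravel())

def halfnormal_rejection():
    """Return one sample and the number of uniforms it consumed."""
    used = 0
    while True:
        u1, u2 = next(stream), next(stream)
        used += 2
        x = -np.log(1.0 - u1)                    # Exp(1) proposal
        if u2 <= np.exp(-0.5 * (x - 1.0) ** 2):  # accept w.p. f(x)/(M g(x))
            return x, used

counts = [halfnormal_rejection()[1] for _ in range(100)]
# Per-draw consumption varies across draws, so draw k is no longer a
# function of point k of the sequence -- the QMC structure is destroyed.
print(sorted(set(counts))[:3])
```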

@tupui (Member, Author) commented Jan 8, 2021

> I am not sure I understand why we could not sample any distribution using QMC sampling.
>
> You can, but not using rejection samplers, which are the dominant method used in Generator.

Ah ok, thanks. I thought only the inverse CDF was used.

@rkern (Member) commented Jan 8, 2021

> I am not sure I understand why we could not sample any distribution using QMC sampling.

There are algorithms that work with QMC sampling; they just aren't the algorithms implemented in Generator, for the most part. The paper I linked has the mathematical details about why QMC doesn't work for those algorithms out of the box and how they can be modified to work.

I would not modify check_random_state() per se. There are two code contributions coming soon that really rely on a real Generator with a real BitGenerator underneath.

> As for the names of the methods, the rationale was to avoid declaring all the distributions, which would duplicate a lot of code.

Sure, but that doesn't get you the Generator interface. Not only are there differences in names, the arguments often differ. If the idea is simply to provide easy access to all of the scipy.stats distributions with the QMC engines, then I'd just use the rvs() method as you've written it and pass in the distributions explicitly. You could even move this method to Sobol and friends.

@tupui (Member, Author) commented Jan 8, 2021

> As for the names of the methods, the rationale was to avoid declaring all the distributions, which would duplicate a lot of code.
>
> Sure, but that doesn't get you the Generator interface. Not only are there differences in names, the arguments often differ. If the idea is simply to provide easy access to all of the scipy.stats distributions with the QMC engines, then I'd just use the rvs() method as you've written it and pass in the distributions explicitly. You could even move this method to Sobol and friends.

True for the arguments; that's why I added the small args/kwargs handling logic.

The idea was to give users a simple way to use the QMC engines everywhere by just substituting the seed in their code.

I agree that, regardless of this, rvs could be present in QMCEngine. But then this function would be very general, converting a uniform sample to another distribution. This actually relates to what I propose in #13343

@rkern (Member) commented Jan 8, 2021

> True for the arguments; that's why I added the small args/kwargs handling logic.

That's not the issue I'm pointing out. Compare the arguments to Generator.triangular and scipy.stats.triang, for instance.

> The idea was to give users a simple way to use the QMC engines everywhere by just substituting the seed in their code.

It's a nice thought, but I don't think it ultimately works often enough to allow check_random_state() to pass through one of these. For instance scipy.stats.geninvgauss.rvs() implements a rejection algorithm. This is not one of the Generator methods, so people writing QMC-agnostic code would be calling geninvgauss.rvs(..., random_state=something) not something.geninvgauss(...).

@tupui (Member, Author) commented Apr 23, 2021

@rkern do you, somehow, see a future for this or shall I close it?

@rkern (Member) commented Apr 23, 2021

I don't think this strategy is profitable to explore. Code written for PRNGs assumes things about the BitGenerator that don't hold for QRNGs.

I think there's significant room to write some tools that help people sample from scipy.stats distributions using the QMC algorithms, but I don't think it would take the form of threading the QMC algorithms through as BitGenerators.

@tupui (Member, Author) commented Apr 24, 2021

Thanks for the discussions here. I will close it then.

> I think there's significant room to write some tools that help people sample from scipy.stats distributions using the QMC algorithms, but I don't think it would take the form of threading the QMC algorithms through as BitGenerators.

Do you have other ideas I should explore instead?

tupui closed this as completed on Apr 24, 2021
@rkern (Member) commented Apr 24, 2021

The first, most obvious thing would be to provide a function that takes a QMCEngine and a scipy.stats distribution object and applies the inversion method to sample from that distribution.

There also are variants of the acceptance-rejection method for QMC that you can implement with the same kind of interface. Some of the algorithms in UNU.RAN for automatically creating efficient acceptance-rejection algorithms for PRNGs will be useful for QMC, too (such as automatically deriving good proposal distributions, which are used in both the PRNG and QMC forms of AR). As the effort ramps up to integrate the UNU.RAN functionality, you might want to take a look and make sure that the algorithms are exposed in such a way that they will be useful both for PRNG sampling and QMC sampling.

@tupui (Member, Author) commented Apr 25, 2021

> The first, most obvious thing would be to provide a function that takes a QMCEngine and a scipy.stats distribution object and applies the inversion method to sample from that distribution.

OK, I was afraid this was too simple and would not be accepted. I could make a PR for that. I'm not sure about the interface and names here, but is the following what you mean?

class SampleDist:

    def __init__(self, qrng, dist):
        self.qrng = qrng
        self.dist = dist

    def rvs(self, size):
        sample = self.qrng.random(size)
        return self.dist.ppf(sample)
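A runnable version of this sketch, with size passed to rvs explicitly, plus a quick usage example (SampleDist is only the name proposed above, not an existing scipy class):

```python
import scipy.stats as stats
from scipy.stats import qmc

class SampleDist:
    """Pair a QMCEngine with a frozen distribution; sample by inversion."""

    def __init__(self, qrng, dist):
        self.qrng = qrng
        self.dist = dist

    def rvs(self, size):
        sample = self.qrng.random(size)   # uniform QMC sample in [0, 1)
        return self.dist.ppf(sample)      # inverse-CDF transform

sampler = SampleDist(qmc.Halton(d=1, seed=4), stats.expon())
x = sampler.rvs(16)
print(x.shape)  # (16, 1)
```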

> There also are variants of the acceptance-rejection method for QMC that you can implement with the same kind of interface. Some of the algorithms in UNU.RAN for automatically creating efficient acceptance-rejection algorithms for PRNGs will be useful for QMC, too (such as automatically deriving good proposal distributions, which are used in both the PRNG and QMC forms of AR). As the effort ramps up to integrate the UNU.RAN functionality, you might want to take a look and make sure that the algorithms are exposed in such a way that they will be useful both for PRNG sampling and QMC sampling.

I will keep an eye open during this integration work, although MCMC and such are more the territory of PyMC for now.

@rkern (Member) commented Apr 25, 2021

I was thinking more of a function rather than an object, but yes. It's possible that it falls under the category of "not every one-liner needs to be a function", but then again, not everyone knows that trick for getting QMC variates from non-uniform distributions.

UNU.RAN is not an MCMC library. The stuff for MCMC in UNU.RAN is just applying the main algorithms of UNU.RAN to MCMC problems because automatically deriving fast samplers for new continuous distributions solves a particular problem with Gibbs sampling MCMC algorithms. That's not the main focus of the library or the planned integration.
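The function form rkern has in mind could be as small as this (the name `qmc_inversion_rvs` is made up, not a scipy API):

```python
import scipy.stats as stats
from scipy.stats import qmc

def qmc_inversion_rvs(dist, engine, n):
    """Sample n variates from a frozen 1-D scipy.stats distribution by
    pushing a QMC uniform sample through its inverse CDF (ppf)."""
    u = engine.random(n)          # shape (n, d); d == 1 here
    return dist.ppf(u).ravel()

sample = qmc_inversion_rvs(stats.norm(loc=3, scale=2),
                           qmc.Sobol(d=1, scramble=True, seed=0), 512)
```

It is nearly a one-liner, as rkern says, but it spares users from having to know the ppf trick themselves.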

@mdhaber (Contributor) commented Apr 25, 2021

I've read the discussion above. For univariate distributions (rv_continuous, rv_discrete), at least, could we not still pass a QMCEngine as the random_state of the rvs method? It does not need to inherit from BitGenerator for that. The rvs method could check the type of random_state and do inverse transform sampling if it gets a QMCEngine.

@rkern (Member) commented Apr 25, 2021

rvs() could grow a different keyword argument to allow you to pass in a QMCEngine explicitly, but I would not repurpose random_state like that.

@rkern (Member) commented Apr 25, 2021

But having explicit functions is more flexible, especially as the UNU.RAN algorithms become available. Inverting using the ppf() is probably the last thing you'll want to do if any other method is available (even just spline-interpolated approximation of inversion).
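To illustrate what "spline-interpolated approximation of inversion" buys, here is a hand-rolled sketch of the idea behind tools like NumericalInverseHermite (not their actual implementation): tabulate the exact ppf on a grid once, then evaluate a cheap Hermite interpolant instead of the exact ppf for every point.

```python
import numpy as np
import scipy.stats as stats
from scipy.interpolate import CubicHermiteSpline

dist = stats.norm()
u_grid = np.linspace(1e-6, 1 - 1e-6, 1024)
x_grid = dist.ppf(u_grid)         # exact (slow) inverse, grid points only
# The inverse CDF satisfies d/du ppf(u) = 1 / pdf(ppf(u)), which supplies
# the derivative data a Hermite interpolant needs.
slopes = 1.0 / dist.pdf(x_grid)
approx_ppf = CubicHermiteSpline(u_grid, x_grid, slopes)

# Away from the extreme tails the interpolant is accurate to roughly 1e-6
# here, while being far cheaper per evaluation than the exact ppf.
u = np.random.default_rng(0).uniform(0.01, 0.99, 1000)
err = np.max(np.abs(approx_ppf(u) - dist.ppf(u)))
print(err < 1e-4)
```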

@mdhaber (Contributor) commented Apr 25, 2021

> even just spline-interpolated approximation of inversion

Speaking of which, @rkern, should I keep gh-13319 open? Do you think the interface/testing might be of value even if all the internals get replaced?

@rkern (Member) commented Apr 25, 2021

I doubt we'll bother replacing the internals with the UNU.RAN implementation, unless you think that's the best way to address the remaining todos.

Want to add a qmc_engine=None keyword argument to FastNumericalInverse.rvs() and see how that feels?

@mdhaber (Contributor) commented Apr 25, 2021

Done! Let's continue this discussion in gh-13319.

@tupui (Member, Author) commented Aug 19, 2021

Now that we have UNU.RAN and NumericalInverseHermite, I think we are missing a nice wrapper or some integration with our distributions.

Right now, a user would need to be aware of these details in order to find/use them. IMHO this is not a trivial assumption to make. Following what you said @mdhaber, what about adding a method parameter to rv_continuous/rv_discrete? It could use the distribution's own sampling method by default, and otherwise fall back to UNU.RAN or NumericalInverseHermite.
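A rough sketch of what that dispatch could look like, as a free function standing in for the proposed rvs behavior (all names here are hypothetical, not existing scipy API):

```python
import scipy.stats as stats
from scipy.stats import qmc

def rvs_with_method(dist, size, method="default", qmc_engine=None, seed=0):
    """Dispatch between the distribution's default sampler and a
    QMC-inversion sampler, mimicking a hypothetical `method` kwarg."""
    if method == "default":
        return dist.rvs(size=size, random_state=seed)
    if method == "inversion-qmc":
        engine = qmc_engine or qmc.Sobol(d=1, scramble=True, seed=seed)
        return dist.ppf(engine.random(size)).ravel()
    raise ValueError(f"unknown method {method!r}")

a = rvs_with_method(stats.gamma(2), 64)
b = rvs_with_method(stats.gamma(2), 64, method="inversion-qmc")
```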

@mdhaber (Contributor) commented Aug 27, 2021

I'm not sure now. I see three possibilities for the design:

  1. the API is simple, but the user has little control other than the name of the method used to generate the random variates (e.g. "NumericalInverseHermite")
  2. we give the user convenient control of not only the method but several of its important options through the rv_continuous/rv_discrete interface, but the interface of rv_continuous/rv_discrete becomes considerably more complex.
  3. we do a hybrid thing where we allow the user to pass in something like a partial class... everything needed to instantiate the class that generates random variates (e.g. NumericalInverseHermite) except for the distribution object. But then this is no more convenient, IMO, than just having the user pass the distribution object into the random variate class initializer.

> Right now, a user would need to be aware of these details in order to find/use them.

As a start, how about we just make them aware? We could include a note in the rvs method about the availability of these other sampling methods, and perhaps we could add an example to the documentation (which would show up for each distribution) of how to use one of these classes to get random variates. If users ask for it to be exposed through rvs directly, we could add that. Or we could ask the mailing list whether they think the extra documentation is enough.

@tupui (Member, Author) commented Aug 27, 2021

It makes me think about the discussion we had on the mailing list about rv_continuous/discrete being suboptimal. Maybe we can actually start to think about this topic; I am not sure where, though. I tried the SciPEP flag once, but I am not sure that process is working (meaning a proposal should not hang there forever; it's either a go or a no-go) 😅 But we should make this work, since we don't have Slack or Discussions (unless we activate them), and email is just not suited for these things.

In the meantime, I agree that a note in the docs seems sensible. We could write it once and make it available to all rvs docstrings, as they share some common doc. With UNU.RAN there is a better tutorial for sampling but, as always, you have to find it.

@mdhaber (Contributor) commented Aug 28, 2021

It seems that we have two big ideas for stats maintenance these days:

  1. Overhaul of the distribution superclasses
  2. Overhaul of (almost) everything else (other than the major new enhancements: QMC and fast RVS)

After a year of pretty intense work on scipy.stats, I've become interested in both, but I've prioritized #2. The resampling method PRs (bootstrap, permutation test, and Monte Carlo tests) and gh-13312 have actually all been part of that effort. Besides applying the _axis_nan_policy decorator to existing functions and providing the resampling methods to expand the functionality of the statistics and tests we already have, my plan has been to create a class that makes writing a typical hypothesis test as simple as implementing a 1D statistic (and, if the null hypothesis is not that of independence, possibly defining the distribution of the statistic under the null hypothesis). All the other functionality (nan_policy, vectorization, one- and two-sided p-values, confidence intervals, etc.) can be inherited, or overridden if desired. The goal is for the interfaces and capabilities of most tests to be pretty consistent without requiring contributors to re-invent the wheel in every PR.

rv_continuous and rv_discrete are not perfect, but think what it would be like if everyone had to implement distributions from scratch! A lot of distributions would be missing a lot of methods, there would be different names for the same methods and parameters, there would be varying behaviors for invalid inputs (maybe output NaN, maybe raise this or that error), and who knows what support for vectorization would look like. But this is exactly what we have in the case of the hypothesis tests and correlation functions now! I think that's why I have prioritized it.

There has also been some effort toward #1. @tirthasheshpatel and I have been talking in recent PRs about overhauling the test suite of distributions so that we can find all the obvious bugs where distribution methods are not living up to their public signatures.

I don't know that we should work on all aspects of these projects simultaneously. So, in the case of working on overhaul of rv_continuous/discrete, I think it would be better if this were to wait a bit. I really would like to be a part of it, but I don't think I can do it in parallel with all the rest. Also, I think it would help to have a more thorough test suite to help us characterize the shortcomings of rv_continuous and rv_discrete before we rewrite the distribution classes (or factory functions). And personally, I'd prefer to get a little further along toward the other stuff - item 2 - before digging deeply into those test suite improvements.

So maybe I'd suggest this order of high-level maintenance operations:
1a. Get gh-14651 rolling. (It will take a long time to complete but once it gets rolling, each new PR will be pretty quick and easy, as has been the case with the alternative effort in gh-12506. I think we'll know when we get there, and at that point, we can move on.)
1b. A base class (or factory function) for hypothesis tests, with the first example being the z-test (gh-13662).
2a. Overhaul the test suite of the distributions
2b. Look into what comes after rv_continuous and rv_discrete

I like some variety, but I'm not really efficient when I'm bouncing between dozens of PRs, waiting a few months at a time between updates, and having to re-learn everything when I come back. I imagine that if a few of us were able to tag-team as authors and reviewers toward a common goal, we could get a lot done pretty quickly. What do you think?

@tupui (Member, Author) commented Aug 28, 2021

This action plan sounds reasonable 👍 The overhaul of the distributions is a massive undertaking for sure, and we would need a few maintainers on board, first to avoid late objections and second to do the hard work. This should not be a one- or two-person project.

As usual, feel free to ping me if you feel I could help 😃
