Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats.unuran: add a higher level API #8

Closed
wants to merge 30 commits into from

Conversation

tirthasheshpatel
Copy link
Owner

@tirthasheshpatel tirthasheshpatel commented Jun 11, 2021

This adds a higher-level class called GenericGenerator which can be used to
create generators using UNU.RAN's String API. Currently, only strings are allowed
but it can be extended to accept a object containing a UNU.RAN distribution.

One can simply pass a string with the distribution to sample from and create a
generator:

>>> rng = GenericGenerator("normal")
>>> rvs = rng.rvs(100_000)

It is also possible to pass the method and its parameters as keyword arguments.

>>> rng = GenericGenerator("normal", method="tdr", domain=(-5, 5), c=0.,
...                        cpoints=(-4.5, -3.5, -2.5, 0.0, 2.5, 3.5, 4.5))
>>> rvs = rng.rvs(100_000)

It is also possible to change the parameters of the distribution. The below example
creates a samplers for the normal distribution with mean 6 and standard
deviation 4.25:

>>> rng = GenericGenerator("normal(6, 4.25)")

Advanced use case includes passing a custom distribution with methods coded as strings.

>>> rng = GenericGenerator("distr = cont; pdf = 'exp(-0.5 * x^2)'")
>>> rvs = rng.rvs(100_000)

@chrisb83 As you suggested, I have attempted to write a high-level API for UNU.RAN. Does this align with your ideas? In not, feel free to add suggestions or ask for changes. Better name to be decided :)
Also pinging @mckib2 in case you are interested in reviewing this.

This commit adds UNU.RAN source tree to scipy.stats with Makefiles and configuration
files removed. The "uniform" directory has also been removed as it contains third-party
uniform random number generator which we aren't allowed to use. Other files have been
kept intact.
This commit removes all the deprecated files from UNU.RAN source tree.
Declarations of the deprecated functions have been removed from `unuran.h`.
This commit replaces all the `.ch` files with `.h` files as SciPy's build system
doesn't understand `.ch` files. The C files including these `.ch` have been edited
to now include the coressponding `.h` file.
…as default

This commit refactors functions `unur_get_default_urng` and `unur_get_default_urng_aux` to
allow NumPy BitGenerator as the default URNG. Previously, UNU.RAN defined a macro to get its
own URNG from the `uniform` directory but as it is now removed, that macro will fail with an
error. Hence, the macro (present in `unuran_config.h`) has been changed to successfully obtain
a default URNG (which is set during `scipy.stats` import).
This commit adds a Cython wrapper wrapping methods Transformed Density Rejection
and Discrete Alias-Urn from UNU.RAN. A few tests and docs have been added. I have
written a small C API for aquiring and releasing Python callbacks during calls to
UNU.RAN C functions.
@chrisb83
Copy link

Hi Tirth,

as we discussed in our meeting yesterday, we do not need to make the UNU.RAN distributions (the ones listed here: http://statmath.wu.ac.at/unuran/doc/unuran.html#Stddist_005fCONT) available in SciPy.

The suggestion would be to allow to access the different generation methods easily, e.g. GenericGenerator(dist, method='tdr', ...). dist is a distribution object that needs to have the required methods like pdf, dpdf etc. ... are all other keyword arguments, maybe there are a few common ones like domain, otherwise we can just put **kwargs.
We can then internally refer to the implemented approaches like TransformedDensityRejection. If we change the API of the sampling methods like TransformedDensityRejection to also accept dist as above, this should be pretty easy (I hope :)

method : str, optional
Method to use. See [3]_ for available methods. Default is 'auto'.
domain : str, optional
Domain o the distribution. Can be used to truncate the distributions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: o -> of

@chrisb83
Copy link

maybe there are a few common ones like domain, otherwise we can just put **kwargs.

I just saw that it exactly what you did :)

@chrisb83
Copy link

The more I think about it, maybe we could even keep the access to the UNU.RAN distributions since you already wrote the code. In the end, it is a feature that is easy to use and maybe some users would like to rely on a different implementation, e.g. of the Gamma distribution. I could imagine that the UNU.RAN methods are quite a bit faster than the SciPy ones for large samples. Maybe this is a question we can ask on the mailing list?

…thod parameters.

This refactors the public API of the methods TransformedDensityRejection and DiscreteAliasUrn
method to accept a single `dist` object that contains all the required methods. This closely
follows the design of scipygh-13319. So, for example, previously one had to write required functions
seperately and pass them to the method for setup:

    >>> from scipy.stats import TransformedDensityRejection
    >>>
    >>> pdf = lambda x: 1-x*x
    >>> dpdf = lambda x: -2*x
    >>>
    >>> rng = TransformedDensityRejection(pdf, dpdf, domain=(-1,1), seed=123)

But, with the new API, one can just pass a `dist` object with all those methods bundled in it:

    >>> class MyDist:
    ...     def pdf(self, x):
    ...         return 1-x*x
    ...     def dpdf(self, x):
    ...         return -2*x
    ...
    >>> dist = MyDist()
    >>> rng = TransformedDensityRejection(dist, domain=(-1,1), seed=123)

Tests have been alterned to use the new API. New tests for the API itself need to be added.
The tests suite has been refactored to be more readable. Some common tests and data
has been centralized. Justifications for a few tests in the form of comments have
been added. Some of the validations in the API have also been centralized.
This implements some suggestions from the code review:
  * use UNU.RAN manual references in our doc
  * update documentation for RVS method
  * document default value of center as 0 and c as -0.5
  * remove underscores from rvs numbers
  * mention that the variants have been explained in notes
  * "w.r.t the variate" --> "w.r.t x (i.e. the variate)"
* split arguments in separate lines for readability
* remove the file * lines from ukraine.pxd
* remove unused/unvalidated arguments from _validate_args for readability
Relax the chi-squared tests to pass if p-value is > 0.1 instead of trying
to assert if the p-value is 0.999 as the latter case occurs only 0.1% of the
times which is very rare.
UNU.RAN was unable to recognize inf, nan and all zeros in the PV and threw unhelpful
"unknown error". Hence, the DAU method has been refactored to evaluate the PMF during
argument validation and check the values in the PV to report helpful errors. This is
possible to do as DAU only works for distribution with finite domain.

The tests have also been changed to incorporate this refactor.

Docs of the DAU method have also been updated.
Two tests failed:
  1. TestDiscreteAliasUrn::test_basic
  2. Both `test_seed` tests

The former was because a non-string distribution name was present in `distdiscrete` which led
to problems with getattr. The straight-forward fix for this was to skip for non-string `distname`s

The latter was due to a more subtle reason. I used global variable to store the NumPy RNG for sampling
when `np.__version__ < 1.19`. Because of this, when a new NumPy RNG was set, all the methods would start
using that RNG and seeding would immediately break. This has been fixed by using a seperate function in the
base-class `Method` and a global function for default RNG.
UNU.RAN seg faults when nans are present in the data. The behaviour for quantiles
less than 0 and greater than 1 is also different than that of SciPy's. So, code
and tests have been added to match the SciPy's behaviour and handle nans properly.
…g.py

Stubs were not getting exported as no __init__.pyi file was found by mypy.
This file has been added and other errors (false positives) have been ignored.
This commit adds a tutorial on Universal Non-Uniform Random Number Generators
in SciPy.
'randint' was failing with an "unknown error" on 32-bit Ubuntu with NumPy 1.16.5.
The error is occurring on line 723 in file src/methods/dau.c. Looks like it might
be due to floating-point errors. As this is not inside SciPy (and also only exists
on a very specific platform and numpy version), I have skipped this test case.

Also, previously, a test was skipped if the distribution name wasn't string. This
has been corrected to run the test on that test case.
UNU.RAN tests TDR with some special custom distributions. Those tests have
been added to SciPy. UNU.RAN's tests DAU with a few random and geometric
PVs. Our test suite tests for all the discrete distributions and is stronger
than UNU.RAN's. So, no tests for DAU have been added
This adds a higher-level class called `GenericGenerator` which can be used to
create generators using UNU.RAN's String API. Currently, only strings are allowed
but it can be extended to accept a object containing a UNU.RAN distribution.

One can simply pass a string with the distribution to sample from and create a
generator:

    >>> rng = GenericGenerator("normal")
    >>> rvs = rng.rvs(100_000)

It is also possible to pass the method and its parameters as keyword arguments.

    >>> rng = GenericGenerator("normal", method="tdr", domain=(-5, 5), c=0.,
    ...                        cpoints=(-4.5, -3.5, -2.5, 0.0, 2.5, 3.5, 4.5))
    >>> rvs = rng.rvs(100_000)

It is also possible to change the parameters of the distribution. The below example
creates a samplers for the normal distribution with mean 6 and standard
deviation 4.25:

    >>> rng = GenericGenerator("normal(6, 4.25)")

Advanced use case includes passing a custom distribution with methods coded as strings.

    >>> rng = GenericGenerator("distr = cont; pdf = 'exp(-0.5 * x^2)'")
    >>> rvs = rng.rvs(100_000)
@chrisb83
Copy link

Hi, this is definitely a nice simple interface that allows users to easily switch between the sampling methods. Let's briefly discuss on Saturday once more if we do not want to keep the code that allows to access the unuran distributions.

@tirthasheshpatel
Copy link
Owner Author

That sounds nice! I have been experimenting with a couple of things. So, it would be nice to discuss this once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants