Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add DiscreteGuideTable wrapper #14828

Merged
merged 47 commits into from
Nov 8, 2021
Merged
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
386f8d3
ENH: Add minimal DGT wrapper
Kai-Striega Oct 9, 2021
c67bbc3
TST: Import DGT into test suite
Kai-Striega Oct 9, 2021
577af42
DOC: Add basic docstring for DGT method
Kai-Striega Oct 10, 2021
7505f1b
DOC: Document basic DGT properties
Kai-Striega Oct 10, 2021
96b932e
ENH: Add warnings for invalid DGT guide_factor values
Kai-Striega Oct 13, 2021
891bfba
TST: Test basic use of DGT with doman + PMF
Kai-Striega Oct 13, 2021
a8bffb0
TST: Add test for negative `guide_factor`
Kai-Striega Oct 13, 2021
199881f
DOC: Document random_state in UNU.RAN DGT
Kai-Striega Oct 13, 2021
0190ad1
STY: Minor style fixes
Kai-Striega Oct 13, 2021
a2ae994
ENH: Add type stubs for UNU.RAN DGT
Kai-Striega Oct 13, 2021
9ec10f9
DOC: Describe UNU.RAN further in Notes section
Kai-Striega Oct 13, 2021
32f256f
DOC: Reference UNU.RAN user manual
Kai-Striega Oct 13, 2021
115383a
Merge branch 'master' of github.com:scipy/scipy into discrete_guide_t…
tirthasheshpatel Oct 14, 2021
ad71835
DOC: Fix indentation error
Kai-Striega Oct 16, 2021
d2a9f5b
DOC: Write `Examples` section
Kai-Striega Oct 18, 2021
730d137
STY: Apply suggestions from code review
Kai-Striega Oct 19, 2021
90b88bf
TST: Add tests for invalid and infinite domains
Kai-Striega Oct 19, 2021
a6a0427
TST: Add tests for infinite domains
Kai-Striega Oct 19, 2021
59007bf
STY: Fix whitespace for added tests
Kai-Striega Oct 19, 2021
4f4f2d4
ENH: Warn users if variant use is not optimal
Kai-Striega Oct 19, 2021
795a113
TST: Fix test errors from automerge
Kai-Striega Oct 19, 2021
cf872d3
DOC: Add tutorial for Discrete Guide Table
Kai-Striega Oct 19, 2021
c330ad7
BENCH: Copy benchmarks from DAU
Kai-Striega Oct 19, 2021
a045188
BENCH: Copy benchmarks from DAU
Kai-Striega Oct 19, 2021
2494aa7
ENH: Default to variant `None`in DGT
Kai-Striega Oct 19, 2021
c38af0a
STY: Remove duplicate code
Kai-Striega Oct 20, 2021
848a4d8
DOC: Various review fixes
Kai-Striega Oct 22, 2021
259a59b
ENH: Simplify DGT API
Kai-Striega Oct 23, 2021
c98f3cd
DOC: Add canonical DGT reference
Kai-Striega Oct 23, 2021
c06c525
ENH: Remove two missed references to `variant`
Kai-Striega Oct 28, 2021
3b198d9
DOC: Add missing colon in dgt guide
Kai-Striega Oct 29, 2021
d43e52d
ENH: Add ppf method for DGT
Kai-Striega Oct 31, 2021
8f6de46
TST: Remove custom binomial class
Kai-Striega Oct 31, 2021
d7e2e70
ENH: Fix difference between UNU.RAN and SciPy default
Kai-Striega Oct 31, 2021
aae4769
DOC: Minor code style fixes
Kai-Striega Nov 1, 2021
3e16412
TYPE: Add type hint for DGT.ppf
Kai-Striega Nov 1, 2021
aa0a009
TST: PPF returns an exact distribution so test if equal
Kai-Striega Nov 2, 2021
b8b20c5
DOC: Document PPF example for DGT wrapper
Kai-Striega Nov 7, 2021
532003e
TYPE: Add type overload for PPF method
Kai-Striega Nov 7, 2021
2613a00
ENH: Use domain for PPF adjustment
Kai-Striega Nov 7, 2021
07798da
Merge branch 'master' of github.com:scipy/scipy into discrete_guide_t…
tirthasheshpatel Nov 7, 2021
9b86020
Update scipy/stats/_unuran/unuran_wrapper.pyx.templ
Kai-Striega Nov 7, 2021
c76adf9
TYPE: Support typehints for older python versions
Kai-Striega Nov 7, 2021
d04cc2c
Update scipy/stats/_unuran/unuran_wrapper.pyx.templ
tirthasheshpatel Nov 7, 2021
9b84396
ENH: Use stats.binom in docstring examples
Kai-Striega Nov 8, 2021
10bef81
Update scipy/stats/_unuran/unuran_wrapper.pyi
tirthasheshpatel Nov 8, 2021
b7530e2
DOC: stats: remove statement about CDF and rephrase
tirthasheshpatel Nov 8, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions benchmarks/benchmarks/stats_sampling.py
Original file line number Diff line number Diff line change
Expand Up @@ -231,3 +231,31 @@ def time_dau_setup(self, distribution):

def time_dau_rvs(self, distribution):
self.rng.rvs(100000)


class DiscreteGuideTable(Benchmark):

param_names = ['distribution']

params = [
# a subset of discrete distributions with finite domain.
[['nhypergeom', (20, 7, 1)],
['hypergeom', (30, 12, 6)],
['nchypergeom_wallenius', (140, 80, 60, 0.5)],
['binom', (5, 0.4)]]
]

def setup(self, distribution):
distname, params = distribution
dist = getattr(stats, distname)
domain = dist.support(*params)
self.urng = np.random.default_rng(0x2fc9eb71cd5120352fa31b7a048aa867)
x = np.arange(domain[0], domain[1] + 1)
self.pv = dist.pmf(x, *params)
self.rng = stats.DiscreteGuideTable(self.pv, random_state=self.urng)

def time_dgt_setup(self, distribution):
stats.DiscreteGuideTable(self.pv, random_state=self.urng)

def time_dgt_rvs(self, distribution):
self.rng.rvs(100000)
2 changes: 2 additions & 0 deletions doc/source/tutorial/stats/sampling.rst
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,7 @@ where
Methods for discrete distributions Required Inputs Optional Inputs Setup Speed Sampling Speed
===================================== =============== =============== =========== ==============
:class:`~DiscreteAliasUrn` pv pmf slow very fast
:class:`~DiscreteGuideTable` pv pmf slow very fast
chrisb83 marked this conversation as resolved.
Show resolved Hide resolved
===================================== =============== =============== =========== ==============

where
Expand Down Expand Up @@ -287,6 +288,7 @@ Generators in :mod:`scipy.stats`
sampling_tdr
sampling_dau
sampling_pinv
sampling_dgt
sampling_nrou


Expand Down
127 changes: 127 additions & 0 deletions doc/source/tutorial/stats/sampling_dgt.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@

.. _sampling-dgt:

Discrete Guide Table (DGT)

.. currentmodule:: scipy.stats

* Required: probability vector (PV) or the PMF along with a finite domain
* Speed:

* Set-up: slow (linear with the vector-length)
* Sampling: very fast


DGT samples from arbitrary but finite probability vectors. Random numbers
are generated by the inversion method, i.e.

1. Generate a random number U ~ U(0,1).
2. Find smallest integer I such that F(I) = P(X<=I) >= U.

Step (2) is the crucial step. Using sequential search requires O(E(X))
comparisons, where E(X) is the expectation of the distribution. Indexed
search, however, uses a guide table to jump to some I' <= I near I to find
X in constant time. Indeed the expected number of comparisons is reduced to
2, when the guide table has the same size as the probability vector
(this is the default). For larger guide tables this number becomes smaller
(but is always larger than 1), for smaller tables it becomes larger.
chrisb83 marked this conversation as resolved.
Show resolved Hide resolved

On the other hand the setup time for guide table is O(N), where N denotes
the length of the probability vector (for size 1 no preprocessing is
required). Moreover, for very large guide tables memory effects might even
reduce the speed of the algorithm. So we do not recommend to use guide
tables that are more than three times larger than the given probability
vector. If only a few random numbers have to be generated, (much) smaller
table sizes are better. The size of the guide table relative to the length
of the given probability vector can be set by the guide_factor parameter:

>>> import numpy as np
>>> from scipy.stats import DiscreteGuideTable
>>>
>>> pv = [0.18, 0.02, 0.8]
>>> urng = np.random.default_rng()
>>> rng = DiscreteGuideTable(pv, random_state=urng)
>>> rng.rvs()
2

By default, the probability vector is indexed starting at 0. However, this
can be changed by passing a ``domain`` parameter. When ``domain`` is given
in combination with the PV, it has the effect of relocating the
distribution from ``(0, len(pv))`` to ``(domain[0], domain[0] + len(pv))``.
``domain[1]`` is ignored in this case.

>>> rng = DiscreteGuideTable(pv, random_state=urng, domain=(10, 13))
>>> rng.rvs()
10

The method also works when no probability vector but a PMF is given.
In that case, a bounded (finite) domain must also be given either by
passing the ``domain`` parameter explicitly or by providing a ``support``
method in the distribution object:

>>> class Distribution:
... def __init__(self, c):
... self.c = c
... def pmf(self, x):
... return x ** self.c
... def support(self):
... return 0, 10
...
>>> dist = Distribution(2)
>>> rng = DiscreteGuideTable(dist, random_state=urng)
>>> rng.rvs()
9

.. note:: As :class:`~DiscreteGuideTable` expects PMF with signature
``def pmf(self, x: float) -> float``, it first vectorizes the
PMF using ``np.vectorize`` and then evaluates it over all the
points in the domain. But if the PMF is already vectorized,
it is much faster to just evaluate it at each point in the domain
and pass the obtained PV instead along with the domain.
For example, ``pmf`` methods of SciPy's discrete distributions
are vectorized and a PV can be obtained by doing:

>>> from scipy.stats import binom, DiscreteGuideTable
>>> dist = binom(10, 0.2) # distribution object
>>> domain = dist.support() # the domain of your distribution
>>> x = np.arange(domain[0], domain[1] + 1)
>>> pv = dist.pmf(x)
>>> rng = DiscreteGuideTable(pv, domain=domain)

Domain is required here to relocate the distribution

The size of the guide table relative to the probability vector may be set using
the ``guide_factor`` parameter. Larger guide tables result in faster generation
time but require a more expensive setup.

>>> guide_factor = 2
>>> rng = DiscreteGuideTable(pv, random_state=urng, guide_factor=guide_factor)
>>> rng.rvs()
2

Unfortunately, the PPF is rarely available in closed form or too slow when
available. For many distributions, the CDF is also not easy to obtain. This
method addresses both the shortcomings. The user only has to provide the
tirthasheshpatel marked this conversation as resolved.
Show resolved Hide resolved
probability vector and the PPF (inverse CDF) may be generated using ``ppf``
method. This method calculates the PPF exactly.
tirthasheshpatel marked this conversation as resolved.
Show resolved Hide resolved

For example to calculate the PPF of a binomial distribution with :math:`n=4` and
:math:`p=0.1`: we can set up a guide table as follows:

>>> n, p = 4, 0.1
>>> dist = stats.binom(n, p)
>>> rng = DiscreteGuideTable(dist, random_state=42)
>>> rng.ppf(0.5)
0.0

Please see [1]_ and [2]_ for more details on this method.

References
----------

.. [1] UNU.RAN reference manual, Section 5.8.4,
chrisb83 marked this conversation as resolved.
Show resolved Hide resolved
"DGT - (Discrete) Guide Table method (indexed search)"
https://statmath.wu.ac.at/unuran/doc/unuran.html#DGT

.. [2] H.C. Chen and Y. Asau (1974). On generating random variates from an
empirical distribution, AIIE Trans. 6, pp. 163-166.
1 change: 1 addition & 0 deletions scipy/stats/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -382,6 +382,7 @@
NumericalInversePolynomial
TransformedDensityRejection
DiscreteAliasUrn
DiscreteGuideTable

Circular statistical functions
------------------------------
Expand Down
1 change: 1 addition & 0 deletions scipy/stats/_unuran/__init__.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from .unuran_wrapper import ( # noqa: F401
TransformedDensityRejection,
DiscreteAliasUrn,
DiscreteGuideTable,
NumericalInversePolynomial,
NaiveRatioUniforms,
UNURANError
Expand Down
17 changes: 17 additions & 0 deletions scipy/stats/_unuran/unuran_wrapper.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -125,3 +125,20 @@ class DiscreteAliasUrn(Method):
domain: None | Tuple[float, float] = ...,
urn_factor: float = ...,
random_state: SeedType = ...) -> None: ...


class DGTDist(Protocol):
@property
def pmf(self) -> Callable[..., float]: ...
@property
def support(self) -> Tuple[float, float]: ...

class DiscreteGuideTable(Method):
tirthasheshpatel marked this conversation as resolved.
Show resolved Hide resolved
def __init__(self,
dist: npt.ArrayLike | DGTDist,
*,
domain: None | Tuple[float, float] = ...,
guide_factor: float = ...,
random_state: SeedType = ...) -> None: ...
@overload
def ppf(self, u: npt.ArrayLike) -> np.ndarray: ...
Kai-Striega marked this conversation as resolved.
Show resolved Hide resolved
Loading