From 2f3f22ca2c26ce99139100c65711d8b573a0d55b Mon Sep 17 00:00:00 2001 From: Ricardo Date: Fri, 18 Jun 2021 10:18:23 +0200 Subject: [PATCH 1/7] Add V4 distribution implementation developer guide --- docs/source/distribution_implementation.md | 452 +++++++++++++++++++++ 1 file changed, 452 insertions(+) create mode 100644 docs/source/distribution_implementation.md diff --git a/docs/source/distribution_implementation.md b/docs/source/distribution_implementation.md new file mode 100644 index 000000000..a2ef39ffc --- /dev/null +++ b/docs/source/distribution_implementation.md @@ -0,0 +1,452 @@ +# Implementing a Distribution (devs) + +This guide provides an overview on how to implement a distribution for version 4 +of PyMC3. It is designed for developers who wish to add a new distribution to +the library. Users will not be aware of all this complexity and should instead +make use of helper methods such as (TODO?) + +PyMC3 distributions build on top of Aesara's random variables (TODO: add link), +and implement `logp` and `logcdf` methods as well as other initialization and +validation helpers, most notably `shape/dims`, alternative parametrizations, and +default `transforms`. + +Here is a summary check-list of the steps needed to implement a new distribution. +Each section will be expanded below: + +1. Creating a new `RandomVariable` `op` +1. Implementing the corresponding `Distribution` class +1. Adding tests for the new `RandomVariable` +1. Adding tests for the `logp` / `logcdf` methods +1. Documenting the new `Distribution`. + +This guide does not attempt to explain the rationale behind the `Distributions` +current implementation, and details are provided only insofar as they help +implementing new "standard" distributions. + +## 1. Creating a new `RandomVariable` `op` + +`RandomVariables` are responsible for implementing the random sampling methods, +which in version 3 of PyMC3 used to be one of the standard `Distribution` +methods, alongside `logp` and `logcdf`. 
The `RandomVariable` is also responsible
for parameter broadcasting and shape inference.

Before creating a new `RandomVariable` make sure that it is not offered in the
[Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions)
If it is, it should be added to the [Aesara library](https://github.com/aesara-devs/aesara)
first and then imported into the PyMC3 library.

In addition, it might not always be necessary to implement a new `RandomVariable`,
for example when the new `Distribution` is just a special parametrization of an
existing `Distribution`. This is the case of the `OrderedLogistic` and
`OrderedProbit`, which are just special parametrizations of the `Categorical`
distribution.

The following snippet illustrates how to create a new `RandomVariable`:

```python

from typing import List, Tuple

import numpy as np
import scipy.stats

from aesara.tensor.var import TensorVariable
from aesara.tensor.random.op import RandomVariable

# Create your own `RandomVariable`...
class BlahRV(RandomVariable):
    name: str = "blah"

    # Provide the number of (output) dimensions for this RV
    # (e.g. `0` for a scalar, `1` for a vector, etc.)
    ndim_supp: int = 0

    # Provide the number of (input) dimensions for each parameter of the RV
    # (e.g. if there's only one vector parameter, `[1]`; for two parameters,
    # one a matrix and the other a scalar, `[2, 0]`; etc.)
    ndims_params: List[int] = [0, 0]

    # The NumPy/Aesara dtype for this RV (e.g. `"int32"`, `"int64"`).
    # The standard in the library is `"int64"` for discrete variables
    # and `"floatX"` for continuous variables
    dtype: str = "floatX"

    # A pretty text and LaTeX representation for the RV
    _print_name: Tuple[str, str] = ("blah", "\\operatorname{blah}")

    # If you want to add a custom signature and default values for the
    # parameters, do it like this. Otherwise this can be left out.
+ def __call__(self, loc=0.0, scale=1.0, size=None, **kwargs) -> TensorVariable: + return super().__call__(loc, scale, size=size, **kwargs) + + # This is the Python code that produces samples. Its signature will always + # start with a NumPy `RandomState` object, then the distribution + # parameters, and, finally, the size. + # + # This is effectively the v4 replacement for `Distribution.random`. + @classmethod + def rng_fn( + cls, + rng: np.random.RandomState, + loc: np.ndarray, + scale: np.ndarray, + size: Tuple[int, ...], + ) -> np.ndarray: + return scipy.stats.blah.rvs(loc, scale, random_state=rng, size=size) + +# Create an instance of the `RandomVariable` `Op`... +blah = BlahRV() + +``` + +Some important things to keep in mind: + +1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) + and should not make use of other `Aesara` symbolic ops. The random method + should make use of the `rng` which is a Numpy `Random State / Generator` ??, + so that samples are reproducible. +1. The `size` argument (together with the inputs shape) is the only way for the + user to specify the (shape?, number of draws?) of the `RandomVariable`. + The `rng_fn` will have to take this into consideration for correct output. + `size` is the specification used by `Numpy` and `Scipy` and works like + PyMC3 `shape` for univariate distributions, but is different for multivariate + distributions. Unfortunately there is no general reference documenting how + `size` ought to work for multivariate distributions. + This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful + to get more context. +1. `Aesara` tries to infer the output shape of the `RandomVariable` + (given a user-specified size) by introspection of the `ndim_supp` and + `ndim_params` attributes. However, the default method may not work for more + complex distributions. In that case, custom `_shape_from_params` (and less + probably (?) 
`_infer_shape`) should also be implemented in the new + `RandomVariable` class. One simple example is seen in the [MultinomialRV](https://github.com/aesara-devs/aesara/blob/3a3adaee3b8ceccff6d0208996f117fcd5e5eb6f/aesara/tensor/random/basic.py#L518-L521) + where it was necessary to specify the `rep_param_idx` so that the + `default_shape_from_params` helper method could do its job. In more complex + cases, it may not be possible to make use of the default helper, + but those have not been found yet! +1. It's okay to use the `rng_fn` `classmethods` of other Aesara and PyMC3 + `RandomVariables` inside the new `rng_fn`. For example if you are + implementing a negative HalfNormal `RandomVariable`, your `rng_fn` can simply + return `- halfnormal.rng_fn(rng, scale, size)`. + + +For a quick test that your new `RandomVariable` `Op` is working, you can just +call `eval` on a new instance: + +```python + +# blah = aesara.tensor.random.uniform in this example +blah([0, 0], [1, 2], size=(10, 2)).eval() + +# array([[0.83674527, 0.76593773], +# [0.00958496, 1.85742402], +# [0.74001876, 0.6515534 ], +# [0.95134629, 1.23564938], +# [0.41460156, 0.33241175], +# [0.66707807, 1.62134924], +# [0.20748312, 0.45307477], +# [0.65506507, 0.47713784], +# [0.61284429, 0.49720329], +# [0.69325978, 0.96272673]]) + +``` + +## 2. Inheriting from a PyMC3 base `Distribution` class + +After implementing the new `RandomVariable` `op`, it's time to make use of it +in a new PyMC3 `distribution`. PyMC3 works in a very functional way, and the +`distribution` classes are there mostly to facilitate porting the `v3` code to +the new `v4` version, and to keep related methods organized together. In +practice, they take care of: + +1. Linking (dispatching) a rv_op class with the corresponding logp and logcdf + methods. +1. 
Defining a standard transformation (for continuous distributions) that
   converts a bounded variable domain (e.g., positive line) to an unbounded
   domain (i.e., the real line), which many samplers prefer.
1. Validating the parametrization of a distribution and converting non-symbolic
   inputs (i.e., numeric literals or numpy arrays) to symbolic variables.
1. Converting multiple alternative parametrizations to the standard
   parametrization that the `RandomVariable` is defined in terms of.

Here is how the example continues:

```python

import aesara.tensor as at

from pymc3.aesaraf import floatX, intX
from pymc3.distributions.continuous import PositiveContinuous
from pymc3.distributions.dist_math import bound

# Subclassing `PositiveContinuous` will dispatch a default `log` transformation
class Blah(PositiveContinuous):

    # This will be used by the metaclass `DistributionMeta` to dispatch the
    # class `logp` and `logcdf` methods to the `blah` `op`
    rv_op = blah

    # dist() is responsible for returning an instance of the rv_op. We pass
    # the standard parametrizations to super().dist
    @classmethod
    def dist(cls, param1, param2=None, alt_param2=None, **kwargs):
        param1 = at.as_tensor_variable(intX(param1))
        if param2 is not None and alt_param2 is not None:
            raise ValueError('Only one of param2 and alt_param2 is allowed')
        if alt_param2 is not None:
            param2 = 1 / alt_param2
        param2 = at.as_tensor_variable(floatX(param2))

        # The first value-only argument should be a list of the parameters that
        # the rv_op needs in order to be instantiated
        return super().dist([param1, param2], **kwargs)

    # Logp returns a symbolic expression for the logp evaluation of the variable
    # given the `value` of the variable and the parameters `param1` ... `paramN`
    def logp(value, param1, param2):
        logp_expression = value * (param1 + at.log(param2))

        # We use `bound` for parameter validation. After the default expression,
        # multiple comma-separated symbolic conditions can be added. Whenever
        # a bound is invalidated, the returned expression evaluates to `-np.inf`
        return bound(
            logp_expression,
            value >= 0,
            param2 >= 0,
            # There is one sneaky optional keyword argument that converts an
            # otherwise elemwise `bound` to a reduced scalar bound. This is usually
            # needed for multivariate distributions where the dimensionality
            # of the bound conditions does not match that of the "value" / "logp".
            # By default it is set to `True`.
            broadcast_conditions=True,
        )

    # logcdf works the same way as logp. For bounded variables, it is expected
    # to return `-inf` for values below the domain start and `0` for values
    # above the domain end, but only when the parameters are valid.
    def logcdf(value, param1, param2):
        ...

```

Some notes:

1. A distribution should at the very least inherit from `Discrete` or
   `Continuous`. For the latter, more specific subclasses exist:
   `PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`,
   `CircularContinuous`, which specify default transformations for the variables.
   If you need to specify a one-time custom transform you can also override the
   `__new__` method, as is done for the [`Dirichlet`](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/distributions/multivariate.py#L399-L401).
1. If a distribution does not have a corresponding `random` implementation, a
   `RandomVariable` should still be created that raises a `NotImplementedError`.
   This is the case for the [`Flat` distribution](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/distributions/continuous.py#L312-L321).
   In this case it will be necessary to provide a standard `initval` by
   overriding `__new__`.
1. 
As mentioned above, `v4` works in a very functional way, and all the + information that is needed in the `logp` and `logcdf` methods is expected to + be "carried" via the `RandomVariable` inputs. You may pass numerical + arguments that are not strictly needed for the `random` method but are used + in the `logp` and `logcdf` methods. Just keep in mind whether this affects + the correct shape inference behavior of the `RandomVariable`. If specialized + non-numeric information is needed you might need to define your custom + `_logp` and `_logcdf` dispatch functions, but this should be done as a last + resort (or never?) +1. The `logcdf` method is not a requirement, but it's a nice plus! + +For a quick check that things are working you can try to create the new +distribution in a model context: + +```python + +import pymc3 as pm +from pymc3.distributions.logp import logpt + +with pm.Model() as model: + # pm.blah = pm.Uniform in this example + blah = pm.Blah('blah', [0, 0], [1, 2]) + +# Test that the returned blah_op is still working fine +blah.eval() +# array([0.62778803, 1.95165513]) + +# logpt will replace the blah RandomVariable with the corresponding logp +# expression, which takes as input the `blah_value` `TensorVariable` and +# `blah.owner.inputs`. We pass `transformed=False` to return the original +# `logp` expression just for simplicity of the example. +blah_value = model.rvs_to_values[blah] +blah_logp = logpt(blah, {blah: blah_value}, transformed=False) +blah_logp.eval({blah_value: [1.5, 1.5]}) +# array([ -inf, -0.69314718]) + +# It will instead introduce the `logcdf` expression if we pass `cdf=True` +blah_logcdf = logpt(blah, {blah: blah_value}, cdf=True) +blah_logcdf.eval({blah_value: [1.5, 1.5]}) +# array([ 0. , -0.28768207]) +``` + +## 3. Adding tests for the new `RandomVariable` + +Tests for new `RandomVariables` are mostly located in +`pymc3/tests/test_distributions_random.py`. 
Most tests can be accommodated by the
default `BaseTestDistribution` class, which provides default tests for checking:
1. Expected inputs are passed to the `rv_op` by the `dist` `classmethod`, via
   `check_pymc_params_match_rv_op`
1. Expected (exact) draws are being returned, via
   `check_pymc_draws_match_reference`
1. Shape variable inference is correct, via `check_rv_size`

```python

class TestBlah(BaseTestDistribution):

    pymc_dist = pm.Blah
    # Parameters with which to test the blah pymc3 Distribution
    pymc_dist_params = {"param1": 0.25, "param2": 2.0}
    # Parameters that are expected to have been passed as inputs to the RandomVariable op
    expected_rv_op_params = {"param1": 0.25, "param2": 2.0}
    # If the new `RandomVariable` is simply calling a `numpy`/`scipy` method,
    # we can make use of `seeded_[scipy|numpy]_distribution_builder` which
    # will prepare a seeded reference distribution for us.
    reference_dist_params = {"mu": 0.25, "loc": 2.0}
    reference_dist = seeded_scipy_distribution_builder("blah")
    tests_to_run = [
        "check_pymc_params_match_rv_op",
        "check_pymc_draws_match_reference",
        "check_rv_size",
    ]
```

Additional tests should be added for each optional parametrization of the
distribution. In this case it's enough to include the test
`check_pymc_params_match_rv_op` since only this differs.

Make sure the tested alternative parameter value would lead to a different value
for the associated default parameter. For instance, if it's just the inverse,
testing with `1.0` is not very informative, since the conversion would return
`1.0` as well, and we can't be as sure that the conversion is working correctly.
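To see the pitfall in plain numbers, here is a toy sketch (the `to_param2`
helper is hypothetical, standing in for the `param2 = 1 / alt_param2`
conversion from the `dist` example above):

```python
# Hypothetical stand-in for the parameter conversion done in `dist()`:
# param2 = 1 / alt_param2
def to_param2(alt_param2):
    return 1 / alt_param2

# A broken conversion that accidentally returned its input unchanged would
# still "pass" when tested with 1.0, since 1 / 1.0 == 1.0...
assert to_param2(1.0) == 1.0
# ...whereas 4.0 converts to a clearly distinct value.
assert to_param2(4.0) == 0.25
```

Hence the alternative parametrization is best exercised with a value such as
`alt_param2=4.0` rather than `1.0`.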

```python

class TestBlahAltParam2(BaseTestDistribution):

    pymc_dist = pm.Blah
    # param2 is equivalent to 1 / alt_param2
    pymc_dist_params = {"param1": 0.25, "alt_param2": 4.0}
    expected_rv_op_params = {"param1": 0.25, "param2": 0.25}
    tests_to_run = ["check_pymc_params_match_rv_op"]

```

Custom tests can also be added to the class as is done for the [`FlatRV`](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/tests/test_distributions_random.py#L378-L390)

### Notes on `check_pymc_draws_match_reference` test

The `check_pymc_draws_match_reference` is a very simple test for the equality
of draws from the `RandomVariable` and the exact same Python function, given the
same inputs and random seed. A small number of draws (`size=15`) is checked. This
is not supposed to be a test for the correctness of the random generator. The
latter kind of test (if warranted) can be performed with the aid of
`pymc3_random` and `pymc3_random_discrete` methods in the same test file, which
will perform an expensive statistical comparison between the RandomVariable
`rng_fn` and a reference Python function. This kind of test only makes sense if
there is a good independent generator reference (i.e., not just the same
composition of numpy / scipy calls that is done inside `rng_fn`).

Finally, when your `rng_fn` is doing something more than just calling a `numpy`
or `scipy` method, you will need to set up an equivalent seeded function with
which to compare for the exact draws (instead of relying on
`seeded_[scipy|numpy]_distribution_builder`). You can find an example in the
[`TestWeibull`](https://github.com/ricardoV94/pymc3/blob/d52ae504c19fdb31e2103481b47fcc957830eb58/pymc3/tests/test_distributions_random.py#L867-L886),
whose `rng_fn` returns `beta * np.random.weibull(alpha, size=size)`.


## 4. 
Adding tests for the `logp` / `logcdf` methods

Tests for the `logp` and `logcdf` methods are contained in
`pymc3/tests/test_distributions.py`, and most make use of the `TestMatchesScipy`
class, which provides `check_logp`, `check_logcdf`, and
`check_selfconsistency_discrete_logcdf` standard methods. These will suffice for
most distributions.

```python

from pymc3.tests.helpers import select_by_precision

R = Domain([-np.inf, -2.1, -1, -0.01, 0.0, 0.01, 1, 2.1, np.inf])
Rplus = Domain([0, 0.01, 0.1, 0.9, 0.99, 1, 1.5, 2, 100, np.inf])

...

def test_blah(self):

    self.check_logp(
        pymc3_dist=pm.Blah,
        # Domain of the distribution values
        domain=R,
        # Domains of the distribution parameters
        paramdomains={"mu": R, "sigma": Rplus},
        # Reference scipy (or other) logp function
        scipy_logp=lambda value, mu, sigma: sp.norm.logpdf(value, mu, sigma),
        # Number of decimal points expected to match between the pymc3 and reference functions
        decimal=select_by_precision(float64=6, float32=3),
        # Maximum number of combinations of domain * paramdomains to test
        n_samples=100,
    )

    self.check_logcdf(
        pymc3_dist=pm.Blah,
        domain=R,
        paramdomains={"mu": R, "sigma": Rplus},
        scipy_logcdf=lambda value, mu, sigma: sp.norm.logcdf(value, mu, sigma),
        decimal=select_by_precision(float64=6, float32=1),
        n_samples=-1,
    )

```

These methods will perform a grid evaluation on the combinations of domain and
paramdomains values, and check that the pymc3 methods and the reference
functions match. There are a couple of details worth keeping in mind:

1. By default the first and last values (edges) of the `Domain` are not compared
   (they are used for other things). If it is important to test the edge of the
   `Domain`, the edge values can be repeated, as is done for the `Bool` domain:
   `Bool = Domain([0, 0, 1, 1], "int64")`
1. There are some default domains (such as `R` and `Rplus`) that you can use for
   testing your new distribution, but it's also perfectly fine to create your
   own domains inside the test function if there is a good reason for it (e.g.,
   when the default values lead to too many extreme, unlikely combinations that
   are not very informative about the correctness of the implementation).
1. By default, a random subset of 100 `param` * `paramdomain` combinations is
   tested, in order to keep the test runtime under control. When testing your
   shiny new distribution, you can temporarily set `n_samples=-1` to force all
   combinations to be tested. This is important to avoid your `PR` leading
   to surprising failures in future runs whenever some bad combinations of
   parameters are randomly tested.
1. On GitHub all tests are run twice, under the `aesara.config.floatX` flags of
   `"float64"` and `"float32"`. However, the reference Python functions will
   run in a pure "float64" environment, which means the reference and the `Pymc3`
   results can diverge quite a lot (e.g., underflowing to `-np.inf` for extreme
   parameters). You should therefore make sure you test locally in both regimes.
   A quick and dirty way of doing this is to temporarily add
   `aesara.config.floatX = "float32"` at the very top of the file, immediately
   after `import aesara`. Remember to set `n_samples=-1` as well to test all
   combinations. The test output will show what exact parameter values lead to
   a failure. If you are confident that your implementation is correct, you may
   opt to tweak the decimal precision with `select_by_precision`, or adjust the
   tested `Domain` values. In extreme cases, you can mark the test with a
   conditional `xfail` (if only one of the sub-methods is failing, they should
   be separated, so that the `xfail` is as narrow as possible):

```python

test_blah_logp(self):
    ...
+ +@pytest.mark.xfail( + condition=(aesara.config.floatX == "float32"), + reason="Fails on float32 due to numerical issues", +) +test_blah_logcdf(self): + ... + + +``` + + +## 5. Documenting the new `Distribution` + +TODO: What needs to be done? Adding a new entry to the .rst? From 04ab7d45989c4569892e8629b0738cdba3e1a244 Mon Sep 17 00:00:00 2001 From: Ricardo Date: Mon, 21 Jun 2021 10:28:07 +0200 Subject: [PATCH 2/7] Add ref links and documentation section --- docs/source/distribution_implementation.md | 167 ++++++++++++++++----- 1 file changed, 129 insertions(+), 38 deletions(-) diff --git a/docs/source/distribution_implementation.md b/docs/source/distribution_implementation.md index a2ef39ffc..0e88679d4 100644 --- a/docs/source/distribution_implementation.md +++ b/docs/source/distribution_implementation.md @@ -1,12 +1,13 @@ -# Implementing a Distribution (devs) +# Implementing a Distribution (developer guide) This guide provides an overview on how to implement a distribution for version 4 of PyMC3. It is designed for developers who wish to add a new distribution to the library. Users will not be aware of all this complexity and should instead -make use of helper methods such as (TODO?) +make use of helper methods such as (TODO). -PyMC3 distributions build on top of Aesara's random variables (TODO: add link), -and implement `logp` and `logcdf` methods as well as other initialization and +PyMC3 {`Distributions`}`pymc3.distributions.Distribution` build on top of +Aesara's {`RandomVariables`}`aesara.tensor.random.op.RandomVariable`, and +implement `logp` and `logcdf` methods as well as other initialization and validation helpers, most notably `shape/dims`, alternative parametrizations, and default `transforms`. @@ -21,14 +22,15 @@ Each section will be expanded below: This guide does not attempt to explain the rationale behind the `Distributions` current implementation, and details are provided only insofar as they help -implementing new "standard" distributions. 
+to implement new "standard" distributions. ## 1. Creating a new `RandomVariable` `op` -`RandomVariables` are responsible for implementing the random sampling methods, -which in version 3 of PyMC3 used to be one of the standard `Distribution` -methods, alongside `logp` and `logcdf`. The `RandomVariable` is also responsible -for parameter broadcasting and shape inference. +{`RandomVariables`}`aesara.tensor.random.op.RandomVariable` are responsible for +implementing the random sampling methods, which in version 3 of PyMC3 used to be +one of the standard `Distribution` methods, alongside `logp` and `logcdf`. The +`RandomVariable` is also responsible for parameter broadcasting and shape +inference. Before creating a new `RandomVariable` make sure that it is not offered in the [Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions) @@ -98,10 +100,11 @@ Some important things to keep in mind: 1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) and should not make use of other `Aesara` symbolic ops. The random method - should make use of the `rng` which is a Numpy `Random State / Generator` ??, - so that samples are reproducible. + should make use of the `rng` which is a Numpy + {`Random State`}`numpy.random.RandomState`, so that samples are reproducible. 1. The `size` argument (together with the inputs shape) is the only way for the - user to specify the (shape?, number of draws?) of the `RandomVariable`. + user to specify the `batch_shape` (~number of repetitions) of the + `RandomVariable`. The `rng_fn` will have to take this into consideration for correct output. `size` is the specification used by `Numpy` and `Scipy` and works like PyMC3 `shape` for univariate distributions, but is different for multivariate @@ -113,8 +116,9 @@ Some important things to keep in mind: (given a user-specified size) by introspection of the `ndim_supp` and `ndim_params` attributes. 
However, the default method may not work for more
complex distributions. In that case, custom `_shape_from_params` (and less
-probably (?) `_infer_shape`) should also be implemented in the new
-`RandomVariable` class. One simple example is seen in the [MultinomialRV](https://github.com/aesara-devs/aesara/blob/3a3adaee3b8ceccff6d0208996f117fcd5e5eb6f/aesara/tensor/random/basic.py#L518-L521)
+probably, `_infer_shape`) should also be implemented in the new
+`RandomVariable` class. One simple example is seen in the
+{`DirichletMultinomialRV`}`pymc3.distributions.multivariate.DirichletMultinomialRV`
where it was necessary to specify the `rep_param_idx` so that the
`default_shape_from_params` helper method could do its job. In more complex
cases, it may not be possible to make use of the default helper,
but those have not been found yet!
@@ -149,13 +153,14 @@ blah([0, 0], [1, 2], size=(10, 2)).eval()

## 2. Inheriting from a PyMC3 base `Distribution` class

After implementing the new `RandomVariable` `op`, it's time to make use of it
-in a new PyMC3 `distribution`. PyMC3 works in a very functional way, and the
-`distribution` classes are there mostly to facilitate porting the `v3` code to
-the new `v4` version, and to keep related methods organized together. In
-practice, they take care of:
-
-1. Linking (dispatching) a rv_op class with the corresponding logp and logcdf
-   methods.
+in a new PyMC3 {`Distribution`}`pymc3.distributions.Distribution`. PyMC3 works
+in a very {functional}`Functional_Programming` way, and the `distribution`
+classes are there mostly to facilitate porting the `v3` code to the new `v4`
+version, and to keep related methods organized together. In practice, they take
+care of:
+
+1. Linking ({dispatching}`Dispatching`) a rv_op class with the corresponding
+   logp and logcdf methods.
1. Defining a standard transformation (for continuous distributions) that
   converts a bounded variable domain (e.g., positive line) to an unbounded
   domain (i.e., the real line), which many samplers prefer. 
@@ -229,21 +234,22 @@ Some notes: `PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, `CircularContinuous`, which specify default transformations for the variables. If you need to specify a one-time custom transform you can also override the - `__new__` method, as is done for the [`Dirichlet`](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/distributions/multivariate.py#L399-L401). + `__new__` method, as is done for the + {`Dirichlet`}`pymc3.distributions.multivariate.Dirichlet`. 1. If a distribution does not have a corresponding `random` implementation, a `RandomVariable` should still be created that raises a `NotImplementedError`. - This is the case for the [`Flat` distribution](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/distributions/continuous.py#L312-L321). + This is the case for the {`Flat`}`pymc3.distributions.continuous.FlatRV`. In this case it will be necessary to provide a standard `initval` by overriding `__new__`. -1. As mentioned above, `v4` works in a very functional way, and all the - information that is needed in the `logp` and `logcdf` methods is expected to - be "carried" via the `RandomVariable` inputs. You may pass numerical - arguments that are not strictly needed for the `random` method but are used - in the `logp` and `logcdf` methods. Just keep in mind whether this affects - the correct shape inference behavior of the `RandomVariable`. If specialized - non-numeric information is needed you might need to define your custom - `_logp` and `_logcdf` dispatch functions, but this should be done as a last - resort (or never?) +1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` + way, and all the information that is needed in the `logp` and `logcdf` + methods is expected to be "carried" via the `RandomVariable` inputs. 
You may
+   pass numerical arguments that are not strictly needed for the `random` method
+   but are used in the `logp` and `logcdf` methods. Just keep in mind whether
+   this affects the correct shape inference behavior of the `RandomVariable`.
+   If specialized non-numeric information is needed you might need to define
+   your custom `_logp` and `_logcdf` {dispatching}`Dispatching` functions, but this
+   should be done as a last resort.
1. The `logcdf` method is not a requirement, but it's a nice plus!

For a quick check that things are working you can try to create the new
@@ -330,7 +336,23 @@ class TestBlahAltParam2(BaseTestDistribution):

```

-Custom tests can also be added to the class as is done for the [`FlatRV`](https://github.com/pymc-devs/pymc3/blob/6eaf9a0d7bc5b9168bedc9337808391ccb86a5ea/pymc3/tests/test_distributions_random.py#L378-L390)
+Custom tests can also be added to the class as is done for the
+{`FlatRV`}`pymc3.tests.test_distributions_random.TestFlat`.
+
+### Note on the `check_rv_size` test
+
+Custom input sizes (and expected output shapes) can be defined for the
+`check_rv_size` test, by adding the optional class attributes `sizes_to_check`
+and `sizes_expected`:
+
+```python
+sizes_to_check = [None, (1,), (2, 3)]
+sizes_expected = [(3,), (1, 3), (2, 3, 3)]
+tests_to_run = ["check_rv_size"]
+```
+
+This is usually needed for multivariate distributions. You can see an example in
+the {`Dirichlet` test}`pymc3.tests.test_distributions_random.TestDirichlet`.

### Notes on `check_pymc_draws_match_reference` test

@@ -349,8 +371,8 @@ Finally, when your `rng_fn` is doing something more than just calling a `numpy`
or `scipy` method, you will need to set up an equivalent seeded function with
which to compare for the exact draws (instead of relying on
`seeded_[scipy|numpy]_distribution_builder`). 
You can find an example in the -[`TestWeibull`](https://github.com/ricardoV94/pymc3/blob/d52ae504c19fdb31e2103481b47fcc957830eb58/pymc3/tests/test_distributions_random.py#L867-L886), -whose `rng_fn` returns `beta * np.random.weibull(alpha, size=size)`. +{`TestWeibull`}`pymc3.tests.test_distributions_random.TestWeibull`, whose +`rng_fn` returns `beta * np.random.weibull(alpha, size=size)`. ## 4. Adding tests for the `logp` / `logcdf` methods @@ -433,14 +455,15 @@ functions match. There are a couple of details worth keeping in mind: ```python -test_blah_logp(self): +def test_blah_logp(self): ... + @pytest.mark.xfail( condition=(aesara.config.floatX == "float32"), reason="Fails on float32 due to numerical issues", ) -test_blah_logcdf(self): +def test_blah_logcdf(self): ... @@ -449,4 +472,72 @@ test_blah_logcdf(self): ## 5. Documenting the new `Distribution` -TODO: What needs to be done? Adding a new entry to the .rst? +New distributions should have a rich docstring, following the same format as +that of previously implemented distributions. It generally looks something like +this: + +```python + r""" + Univariate blah distribution. + + The pdf of this distribution is + + .. math:: + + f(x \mid \param1, \param2) = \exp{x * (param1 + \log{param2})} + + .. plot:: + + import matplotlib.pyplot as plt + import numpy as np + import scipy.stats as st + import arviz as az + plt.style.use('arviz-darkgrid') + x = np.linspace(-5, 5, 1000) + params1 = [0., 0., 0., -2.] 
+ params2 = [0.4, 1., 2., 0.4] + for param1, param2 in zip(params1, params2): + pdf = st.blah.pdf(x, param1, param2) + plt.plot(x, pdf, label=r'$\param1$ = {}, $\param2$ = {}'.format(param1, param2)) + plt.xlabel('x', fontsize=12) + plt.ylabel('f(x)', fontsize=12) + plt.legend(loc=1) + plt.show() + + ======== ========================================== + Support :math:`x \in [0, \infty)` + ======== ========================================== + + Blah distribution can be parameterized either in terms of param2 or + alt_param2. The link between the two parametrizations is + given by + + .. math:: + + \param2 = \dfrac{1}{\alt_param2} + + + Parameters + ---------- + param1: float + Interpretation of param1. + param2: float + Interpretation of param2 (param2 > 0). + alt_param2: float + Interpretation of alt_param2 (alt_param2 > 0) (alternative to param2). + + Examples + -------- + .. code-block:: python + + with pm.Model(): + x = pm.Blah('x', param1=0, param2=10) + """ +``` + +The new distribution should be referenced in the respective API page in the docs +module (e.g., `pymc3/docs/api/distributions.continuous.rst`). If appropriate, +a new notebook example should be added to +[pymc-examples](https://github.com/pymc-devs/pymc-examples/) illustrating how +this distribution can be used and how it relates (and/or differs) from other +distributions that users are more likely to be familiar with. 
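For reference, the API page entry itself is usually a one-line addition to the
relevant autosummary list. A sketch only — the exact directives and file layout
in the docs module may differ:

```rst
.. currentmodule:: pymc3.distributions.continuous

.. autosummary::
   :toctree: generated/

   Blah
```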
From 29ffa545661f6483228cbe585ce44c2a50cdaa10 Mon Sep 17 00:00:00 2001 From: Ricardo Vieira <28983449+ricardoV94@users.noreply.github.com> Date: Wed, 23 Jun 2021 16:24:42 +0200 Subject: [PATCH 3/7] Apply suggestions from code review Co-authored-by: Michael Osthege --- docs/source/distribution_implementation.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/source/distribution_implementation.md b/docs/source/distribution_implementation.md index 0e88679d4..b60251266 100644 --- a/docs/source/distribution_implementation.md +++ b/docs/source/distribution_implementation.md @@ -14,7 +14,7 @@ default `transforms`. Here is a summary check-list of the steps needed to implement a new distribution. Each section will be expanded below: -1. Creating a new `RandomVariable` `op` +1. Creating a new `RandomVariable` `Op` 1. Implementing the corresponding `Distribution` class 1. Adding tests for the new `RandomVariable` 1. Adding tests for the `logp` / `logcdf` methods @@ -24,7 +24,7 @@ This guide does not attempt to explain the rationale behind the `Distributions` current implementation, and details are provided only insofar as they help to implement new "standard" distributions. -## 1. Creating a new `RandomVariable` `op` +## 1. Creating a new `RandomVariable` `Op` {`RandomVariables`}`aesara.tensor.random.op.RandomVariable` are responsible for implementing the random sampling methods, which in version 3 of PyMC3 used to be @@ -33,7 +33,7 @@ one of the standard `Distribution` methods, alongside `logp` and `logcdf`. The inference. Before creating a new `RandomVariable` make sure that it is not offered in the -[Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions) +[Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions). If it is, it should be added to the [Aesara library](https://github.com/aesara-devs/aesara) first and then imported into the PyMC3 library. 
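As context for the `ndim_supp` / `ndims_params` attributes touched below: parameter broadcasting in a `RandomVariable` follows NumPy's broadcasting rules. A pure-NumPy sketch (the parameter values are made up for illustration):

```python
import numpy as np

# For a scalar-support RV (ndim_supp=0) with two scalar parameters
# (ndims_params=[0, 0]), the batch shape is obtained by broadcasting
# the parameter shapes against each other (and against any user `size`).
loc = np.array([0.0, 1.0, 2.0])   # shape (3,)
scale = np.array([[1.0], [2.0]])  # shape (2, 1)

batch_shape = np.broadcast_shapes(loc.shape, scale.shape)
print(batch_shape)  # (2, 3)
```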
@@ -54,7 +54,7 @@ from aesara.tensor.random.op import RandomVariable class BlahRV(RandomVariable): name: str = "blah" - # Provide the number of (output) dimensions for this RV + # Provide the minimum number of (output) dimensions for this RV # (e.g. `0` for a scalar, `1` for a vector, etc.) ndim_supp: int = 0 @@ -152,11 +152,12 @@ blah([0, 0], [1, 2], size=(10, 2)).eval() ## 2. Inheriting from a PyMC3 base `Distribution` class -After implementing the new `RandomVariable` `op`, it's time to make use of it +After implementing the new `RandomVariable` `Op`, it's time to make use of it in a new PyMC3 {`Distribution`}`pymc3.distributions.Distribution`. PyMC3 works in a very {functional}`Functional_Programming` way, and the `distribution` classes are there mostly to facilitate porting the `v3` code to the new `v4` -version, and to keep related methods organized together. In practice, they take +version, add PyMC3 API features and keep related methods organized together. +In practice, they take care of: 1. Linking ({dispatching}`Dispatching`) a rv_op class with the corresponding @@ -242,9 +243,9 @@ Some notes: In this case it will be necessary to provide a standard `initval` by overriding `__new__`. 1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` - way, and all the information that is needed in the `logp` and `logcdf` + way, and all the information that is needed in the `logp` and `logcdf` methods is expected to be "carried" via the `RandomVariable` inputs. You may - pass numerical arguments that are not strictly needed for the `random` method + pass numerical arguments that are not strictly needed for the `rng_fn` method but are used in the `logp` and `logcdf` methods. Just keep in mind whether this affects the correct shape inference behavior of the `RandomVariable`. If specialized non-numeric information is needed you might need to define @@ -440,7 +441,7 @@ functions match. 
There are a couple of details worth keeping in mind: parameters are randomly tested. 1. On Github all tests are run twice, under the `aesara.config.floatX` flags of `"float64"` and `"float32"`. However, the reference Python functions will - run in a pure "float64" environment, which means the reference and the `Pymc3` + run in a pure "float64" environment, which means the reference and the `PyMC3` results can diverge quite a lot (e.g., underflowing to `-np.inf` for extreme parameters). You should therefore make sure you test locally in both regimes. A quick and dirty way of doing this is to temporariliy add @@ -492,7 +493,6 @@ this: import numpy as np import scipy.stats as st import arviz as az - plt.style.use('arviz-darkgrid') x = np.linspace(-5, 5, 1000) params1 = [0., 0., 0., -2.] params2 = [0.4, 1., 2., 0.4] From 585bc98c3a20857a9f0dfd81f0c7b5cd10241bb6 Mon Sep 17 00:00:00 2001 From: Ricardo Date: Wed, 23 Jun 2021 16:35:28 +0200 Subject: [PATCH 4/7] Remove line breaks --- docs/source/distribution_implementation.md | 245 ++++++--------------- 1 file changed, 63 insertions(+), 182 deletions(-) diff --git a/docs/source/distribution_implementation.md b/docs/source/distribution_implementation.md index b60251266..4bf1d6467 100644 --- a/docs/source/distribution_implementation.md +++ b/docs/source/distribution_implementation.md @@ -1,15 +1,10 @@ # Implementing a Distribution (developer guide) -This guide provides an overview on how to implement a distribution for version 4 -of PyMC3. It is designed for developers who wish to add a new distribution to -the library. Users will not be aware of all this complexity and should instead -make use of helper methods such as (TODO). +This guide provides an overview on how to implement a distribution for version 4 of PyMC3. +It is designed for developers who wish to add a new distribution to the library. +Users will not be aware of all this complexity and should instead make use of helper methods such as (TODO). 
-PyMC3 {`Distributions`}`pymc3.distributions.Distribution` build on top of -Aesara's {`RandomVariables`}`aesara.tensor.random.op.RandomVariable`, and -implement `logp` and `logcdf` methods as well as other initialization and -validation helpers, most notably `shape/dims`, alternative parametrizations, and -default `transforms`. +PyMC3 {`Distributions`}`pymc3.distributions.Distribution` build on top of Aesara's {`RandomVariables`}`aesara.tensor.random.op.RandomVariable`, and implement `logp` and `logcdf` methods as well as other initialization and validation helpers, most notably `shape/dims`, alternative parametrizations, and default `transforms`. Here is a summary check-list of the steps needed to implement a new distribution. Each section will be expanded below: @@ -20,28 +15,19 @@ Each section will be expanded below: 1. Adding tests for the `logp` / `logcdf` methods 1. Documenting the new `Distribution`. -This guide does not attempt to explain the rationale behind the `Distributions` -current implementation, and details are provided only insofar as they help -to implement new "standard" distributions. +This guide does not attempt to explain the rationale behind the `Distributions` current implementation, and details are provided only insofar as they help to implement new "standard" distributions. ## 1. Creating a new `RandomVariable` `Op` -{`RandomVariables`}`aesara.tensor.random.op.RandomVariable` are responsible for -implementing the random sampling methods, which in version 3 of PyMC3 used to be -one of the standard `Distribution` methods, alongside `logp` and `logcdf`. The -`RandomVariable` is also responsible for parameter broadcasting and shape -inference. +{`RandomVariables`}`aesara.tensor.random.op.RandomVariable` are responsible for implementing the random sampling methods, which in version 3 of PyMC3 used to be one of the standard `Distribution` methods, alongside `logp` and `logcdf`. 
+The `RandomVariable` is also responsible for parameter broadcasting and shape inference.
 
-Before creating a new `RandomVariable` make sure that it is not offered in the
-[Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions).
-If it is, it should be added to the [Aesara library](https://github.com/aesara-devs/aesara)
-first and then imported into the PyMC3 library.
+Before creating a new `RandomVariable` make sure that it is not offered in the [Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions).
+If it is, it should be added to the [Aesara library](https://github.com/aesara-devs/aesara) first and then imported into the PyMC3 library.
 
 In addition, it might not always be necessary to implement a new `RandomVariable`.
-For example if the new `Distribution` is just a special parametrization of an
-existing `Distribution`. This is the case of the `OrderedLogistic` and
-`OrderedProbit`, which are just special parametrizations of the `Categorial`
-distribution.
+For example, if the new `Distribution` is just a special parametrization of an existing `Distribution`.
+This is the case of the `OrderedLogistic` and `OrderedProbit`, which are just special parametrizations of the `Categorical` distribution.
 
 The following snippet illustrates how to create a new `RandomVariable`:
 
@@ -98,39 +84,13 @@ blah = BlahRV()
 
 Some important things to keep in mind:
 
-1. Everything inside the `rng_fn` method is pure Python code (as are the inputs)
- and should not make use of other `Aesara` symbolic ops. The random method
- should make use of the `rng` which is a Numpy
- {`Random State`}`numpy.random.RandomState`, so that samples are reproducible.
-1. The `size` argument (together with the inputs shape) is the only way for the
- user to specify the `batch_shape` (~number of repetitions) of the
- `RandomVariable`.
- The `rng_fn` will have to take this into consideration for correct output.
- `size` is the specification used by `Numpy` and `Scipy` and works like - PyMC3 `shape` for univariate distributions, but is different for multivariate - distributions. Unfortunately there is no general reference documenting how - `size` ought to work for multivariate distributions. - This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful - to get more context. -1. `Aesara` tries to infer the output shape of the `RandomVariable` - (given a user-specified size) by introspection of the `ndim_supp` and - `ndim_params` attributes. However, the default method may not work for more - complex distributions. In that case, custom `_shape_from_params` (and less - probably, `_infer_shape`) should also be implemented in the new - `RandomVariable` class. One simple example is seen in the - {MultinomialRV}`pymc3.distributions.multivariate.DirichletMultinomialRV` - where it was necessary to specify the `rep_param_idx` so that the - `default_shape_from_params` helper method could do its job. In more complex - cases, it may not be possible to make use of the default helper, - but those have not been found yet! -1. It's okay to use the `rng_fn` `classmethods` of other Aesara and PyMC3 - `RandomVariables` inside the new `rng_fn`. For example if you are - implementing a negative HalfNormal `RandomVariable`, your `rng_fn` can simply - return `- halfnormal.rng_fn(rng, scale, size)`. - - -For a quick test that your new `RandomVariable` `Op` is working, you can just -call `eval` on a new instance: +1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) and should not make use of other `Aesara` symbolic ops. The random method should make use of the `rng` which is a Numpy {`Random State`}`numpy.random.RandomState`, so that samples are reproducible. +1. The `size` argument (together with the inputs shape) is the only way for the user to specify the `batch_shape` (~number of repetitions) of the `RandomVariable`. 
The `rng_fn` will have to take this into consideration for correct output. `size` is the specification used by `Numpy` and `Scipy` and works like PyMC3 `shape` for univariate distributions, but is different for multivariate distributions. Unfortunately there is no general reference documenting how `size` ought to work for multivariate distributions. This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful to get more context. +1. `Aesara` tries to infer the output shape of the `RandomVariable` (given a user-specified size) by introspection of the `ndim_supp` and `ndim_params` attributes. However, the default method may not work for more complex distributions. In that case, custom `_shape_from_params` (and less probably, `_infer_shape`) should also be implemented in the new `RandomVariable` class. One simple example is seen in the {MultinomialRV}`pymc3.distributions.multivariate.DirichletMultinomialRV` where it was necessary to specify the `rep_param_idx` so that the `default_shape_from_params` helper method could do its job. In more complex cases, it may not be possible to make use of the default helper, but those have not been found yet! +1. It's okay to use the `rng_fn` `classmethods` of other Aesara and PyMC3 `RandomVariables` inside the new `rng_fn`. For example if you are implementing a negative HalfNormal `RandomVariable`, your `rng_fn` can simply return `- halfnormal.rng_fn(rng, scale, size)`. + + +For a quick test that your new `RandomVariable` `Op` is working, you can just call `eval` on a new instance: ```python @@ -152,23 +112,14 @@ blah([0, 0], [1, 2], size=(10, 2)).eval() ## 2. Inheriting from a PyMC3 base `Distribution` class -After implementing the new `RandomVariable` `Op`, it's time to make use of it -in a new PyMC3 {`Distribution`}`pymc3.distributions.Distribution`. 
PyMC3 works -in a very {functional}`Functional_Programming` way, and the `distribution` -classes are there mostly to facilitate porting the `v3` code to the new `v4` -version, add PyMC3 API features and keep related methods organized together. -In practice, they take -care of: - -1. Linking ({dispatching}`Dispatching`) a rv_op class with the corresponding - logp and logcdf methods. -1. Defining a standard transformation (for continuous distributions) that - converts a bounded variable domain (e.g., positive line) to an unbounded - domain (i.e., the real line), which many samplers prefer. -1. Validating the parametrization of a distribution and converting non-symbolic - inputs (i.e., numeric literals or numpy arrays) to symbolic variables. -1. Converting multiple alternative parametrizations to the standard - parametrization that the `RandomVariable` is defined in terms of. +After implementing the new `RandomVariable` `Op`, it's time to make use of it in a new PyMC3 {`Distribution`}`pymc3.distributions.Distribution`. +PyMC3 works in a very {functional}`Functional_Programming` way, and the `distribution` classes are there mostly to facilitate porting the `v3` code to the new `v4` version, add PyMC3 API features and keep related methods organized together. +In practice, they take care of: + +1. Linking ({dispatching}`Dispatching`) a rv_op class with the corresponding logp and logcdf methods. +1. Defining a standard transformation (for continuous distributions) that converts a bounded variable domain (e.g., positive line) to an unbounded domain (i.e., the real line), which many samplers prefer. +1. Validating the parametrization of a distribution and converting non-symbolic inputs (i.e., numeric literals or numpy arrays) to symbolic variables. +1. Converting multiple alternative parametrizations to the standard parametrization that the `RandomVariable` is defined in terms of. 
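The last point of the list above can be sketched in plain Python (the parameter names `param2` / `alt_param2` are hypothetical, and in a real `dist` classmethod the conversion happens symbolically on Aesara tensors):

```python
import numpy as np

def blah_dist(param2=None, alt_param2=None):
    # Hypothetical sketch: accept exactly one of the two parametrizations
    # and convert the alternative one to the standard `param2`.
    if (param2 is None) == (alt_param2 is None):
        raise ValueError("Specify exactly one of param2 and alt_param2")
    if param2 is None:
        param2 = 1.0 / np.asarray(alt_param2)  # link: param2 = 1 / alt_param2
    return param2

print(blah_dist(alt_param2=4.0))  # 0.25
```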
Here is how the example continues: @@ -230,31 +181,13 @@ class Blah(PositiveContinuous): Some notes: -1. A distribution should at the very least inherit from `Discrete` or - `Continuous`. For the latter, more specific subclasses exist: - `PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, - `CircularContinuous`, which specify default transformations for the variables. - If you need to specify a one-time custom transform you can also override the - `__new__` method, as is done for the - {`Dirichlet`}`pymc3.distributions.multivariate.Dirichlet`. -1. If a distribution does not have a corresponding `random` implementation, a - `RandomVariable` should still be created that raises a `NotImplementedError`. - This is the case for the {`Flat`}`pymc3.distributions.continuous.FlatRV`. - In this case it will be necessary to provide a standard `initval` by +1. A distribution should at the very least inherit from `Discrete` or `Continuous`. For the latter, more specific subclasses exist:`PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, `CircularContinuous`, which specify default transformations for the variables. If you need to specify a one-time custom transform you can also override the `__new__` method, as is done for the {`Dirichlet`}`pymc3.distributions.multivariate.Dirichlet`. +1. If a distribution does not have a corresponding `random` implementation, a `RandomVariable` should still be created that raises a `NotImplementedError`. This is the case for the {`Flat`}`pymc3.distributions.continuous.FlatRV`. In this case it will be necessary to provide a standard `initval` by overriding `__new__`. -1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` - way, and all the information that is needed in the `logp` and `logcdf` - methods is expected to be "carried" via the `RandomVariable` inputs. You may - pass numerical arguments that are not strictly needed for the `rng_fn` method - but are used in the `logp` and `logcdf` methods. 
Just keep in mind whether
- this affects the correct shape inference behavior of the `RandomVariable`.
- If specialized non-numeric information is needed you might need to define
- your custom`_logp` and `_logcdf` {dispatc}`Dispatching` functions, but this
- should be done as a last resort.
+1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` way, and all the information that is needed in the `logp` and `logcdf` methods is expected to be "carried" via the `RandomVariable` inputs. You may pass numerical arguments that are not strictly needed for the `rng_fn` method but are used in the `logp` and `logcdf` methods. Just keep in mind whether this affects the correct shape inference behavior of the `RandomVariable`. If specialized non-numeric information is needed you might need to define your custom `_logp` and `_logcdf` {dispatch}`Dispatching` functions, but this should be done as a last resort.
 1. The `logcdf` method is not a requirement, but it's a nice plus!
 
-For a quick check that things are working you can try to create the new
-distribution in a model context:
+For a quick check that things are working you can try to create the new distribution in a model context:
 
 ```python
@@ -286,13 +219,10 @@ blah_logcdf.eval({blah_value: [1.5, 1.5]})
 
 ## 3. Adding tests for the new `RandomVariable`
 
-Tests for new `RandomVariables` are mostly located in
-`pymc3/tests/test_distributions_random.py`. Most tests can be accomodated by the
-default `BaseTestDistribution` class, which provides default tests for checking:
-1. Expected inputs are passed to the `rv_op` by the `dist` `classmethod`, via
- `check_pymc_params_match_rv_op`
-1. Expected (exact) draws are being returned, via
- `check_pymc_draws_match_reference`
+Tests for new `RandomVariables` are mostly located in `pymc3/tests/test_distributions_random.py`.
+Most tests can be accommodated by the default `BaseTestDistribution` class, which provides default tests for checking:
+1. 
Expected inputs are passed to the `rv_op` by the `dist` `classmethod`, via `check_pymc_params_match_rv_op` +1. Expected (exact) draws are being returned, via `check_pymc_draws_match_reference` 1. Shape variable inference is correct, via `check_rv_size` ```python @@ -316,14 +246,11 @@ class TestBlah(BaseTestDistribution): ] ``` -Additional tests should be added for each optional parametrization of the -distribution. In this case it's enough to include the test -`check_pymc_params_match_rv_op` since only this differs. +Additional tests should be added for each optional parametrization of the distribution. +In this case it's enough to include the test `check_pymc_params_match_rv_op` since only this differs. -Make sure the tested alternative parameter value would lead to a different value -for the associated default parameter. For instance if it's just the inverse, -testing with `1.0` is not very informative, since the conversion would return -`1.0` as well, and we can't be (as) sure that is working correctly. +Make sure the tested alternative parameter value would lead to a different value for the associated default parameter. +For instance, if it's just the inverse, testing with `1.0` is not very informative, since the conversion would return `1.0` as well, and we can't be (as) sure that is working correctly. ```python @@ -337,14 +264,11 @@ class TestBlahAltParam2(BaseTestDistribution): ``` -Custom tests can also be added to the class as is done for the -{`FlatRV`}`pymc3.tests.test_random.TestFlat`. +Custom tests can also be added to the class as is done for the {`FlatRV`}`pymc3.tests.test_random.TestFlat`. 
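The warning above about probe values can be made concrete. For an inverse link (a hypothetical helper mirroring a `param2 = 1 / alt_param2` relation), `1.0` is a fixed point of the conversion and therefore proves nothing:

```python
def to_param2(alt_param2):
    # Hypothetical inverse link between the alternative and default parameter
    return 1.0 / alt_param2

# 1.0 maps to itself, so a test probing with it could pass even if the
# conversion were silently dropped; 4.0 actually exercises the conversion.
print(to_param2(1.0))  # 1.0
print(to_param2(4.0))  # 0.25
```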
### Note on `check_rv_size` test: -Custom input sizes (and expected output shapes) can be defined for the -`check_rv_size` test, by adding the optional class attributes `sizes_to_check` -and `sizes_expected`: +Custom input sizes (and expected output shapes) can be defined for the `check_rv_size` test, by adding the optional class attributes `sizes_to_check` and `sizes_expected`: ```python sizes_to_check = [None, (1), (2, 3)] @@ -352,37 +276,25 @@ sizes_expected = [(3,), (1, 3), (2, 3, 3)] tests_to_run = ["check_rv_size"] ``` -This is usually needed for Multivariate distributions. You can see an example in -the {`Dirichlet` test}`pymc3.test.test_random.TestDirichlet`. +This is usually needed for Multivariate distributions. +You can see an example in the {`Dirichlet` test}`pymc3.test.test_random.TestDirichlet`. ### Notes on `check_pymcs_draws_match_reference` test -The `check_pymcs_draws_match_reference` is a very simple test for the equality -of draws from the `RandomVariable` and the exact same python function, given the -same inputs and random seed. A small number (`size=15`) is checked. This is -not supposed to be a test for the correctness of the random generator. The -latter kind of test (if warranted) can be performed with the aid of -`pymc3_random` and `pymc3_random_discrete` methods in the same test file, which -will perform an expensive statistical comparison between the RandomVariable -`rng_fn` and a reference Python function. This kind of test only makes sense if -there is a good independent generator reference (i.e., not just the same -composition of numpy / scipy python calls that is done inside `rng_fn`). - -Finally, when your `rng_fn` is doing something more than just calling a `numpy` -or `scipy` method, you will need to setup an equivalent seeded function with -which to compare for the exact draws (instead of relying on -`seeded_[scipy|numpy]_distribution_builder`). 
You can find an example in the -{`TestWeibull`}`pymc3.tests.test_distributions_random.TestWeibull`, whose -`rng_fn` returns `beta * np.random.weibull(alpha, size=size)`. +The `check_pymcs_draws_match_reference` is a very simple test for the equality of draws from the `RandomVariable` and the exact same python function, given the same inputs and random seed. +A small number (`size=15`) is checked. This is not supposed to be a test for the correctness of the random generator. +The latter kind of test (if warranted) can be performed with the aid of `pymc3_random` and `pymc3_random_discrete` methods in the same test file, which will perform an expensive statistical comparison between the RandomVariable `rng_fn` and a reference Python function. +This kind of test only makes sense if there is a good independent generator reference (i.e., not just the same composition of numpy / scipy python calls that is done inside `rng_fn`). + +Finally, when your `rng_fn` is doing something more than just calling a `numpy` or `scipy` method, you will need to setup an equivalent seeded function with which to compare for the exact draws (instead of relying on `seeded_[scipy|numpy]_distribution_builder`). +You can find an example in the {`TestWeibull`}`pymc3.tests.test_distributions_random.TestWeibull`, whose `rng_fn` returns `beta * np.random.weibull(alpha, size=size)`. ## 4. Adding tests for the `logp` / `logcdf` methods -Tests for the `logp` and `logcdf` methods are contained in -`pymc3/tests/test_distributions.py`, and most make use of the `TestMatchesScipy` -class, which provides `check_logp`, `check_logcdf`, and -`check_selfconsistency_discrete_logcdf` standard methods. These will suffice for -most distributions. +Tests for the `logp` and `logcdf` methods are contained in `pymc3/tests/test_distributions.py`, and most make use of the `TestMatchesScipy` class, which provides `check_logp`, `check_logcdf`, and +`check_selfconsistency_discrete_logcdf` standard methods. 
+These will suffice for most distributions. ```python @@ -420,39 +332,13 @@ def test_blah(self): ``` -These methods will perform a grid evaluation on the combinations of domain and -paramdomains values, and check that the pymc3 methods and the reference -functions match. There are a couple of details worth keeping in mind: - -1. By default the first and last values (edges) of the `Domain` are not compared - (they are used for other things). If it is important to test the edge of the - `Domain`, the edge values can be repeated. This is done by the `Bool`: - `Bool = Domain([0, 0, 1, 1], "int64")` -1. There are some default domains (such as `R` and `Rplus`) that you can use for - testing your new distribution, but it's also perfectly fine to create your - own domains inside the test function if there is a good reason for it (e.g., - when the default values lead too many extreme unlikely combinations that are - not very informative about the correctness of the implementation). -1. By default, a random subset of 100 `param` * `paramdomain` combinations is - tested, in order to keep the test runtime under control. When testing your - shiny new distribution, you can temporarily set `n_samples=-1` to force all - combinations to be tested. This is important to avoid the your `PR` leading - to surprising failures in future runs whenever some bad combinations of - parameters are randomly tested. -1. On Github all tests are run twice, under the `aesara.config.floatX` flags of - `"float64"` and `"float32"`. However, the reference Python functions will - run in a pure "float64" environment, which means the reference and the `PyMC3` - results can diverge quite a lot (e.g., underflowing to `-np.inf` for extreme - parameters). You should therefore make sure you test locally in both regimes. - A quick and dirty way of doing this is to temporariliy add - `aesara.config.floatX = "float32"` at the very top of file, immediately after - `import aesara`. 
Remember to set `n_samples=-1` as well to test all
- combinations. The test output will show what exact parameter values lead to
- a failure. If you are confident that your implementation is correct, you may
- opt to tweak the decimal precision with `select_by_precision`, or adjust the
- tested `Domain` values. In extreme cases, you can mark the test with a
- conditional `xfail` (if only one of the sub-methods is failing, they should
- be separated, so that the `xfail` is as narrow as possible):
+These methods will perform a grid evaluation on the combinations of domain and paramdomains values, and check that the pymc3 methods and the reference functions match.
+There are a couple of details worth keeping in mind:
+
+1. By default the first and last values (edges) of the `Domain` are not compared (they are used for other things). If it is important to test the edge of the `Domain`, the edge values can be repeated. This is done by the `Bool`: `Bool = Domain([0, 0, 1, 1], "int64")`
+1. There are some default domains (such as `R` and `Rplus`) that you can use for testing your new distribution, but it's also perfectly fine to create your own domains inside the test function if there is a good reason for it (e.g., when the default values lead to too many extreme unlikely combinations that are not very informative about the correctness of the implementation).
+1. By default, a random subset of 100 `param` * `paramdomain` combinations is tested, in order to keep the test runtime under control. When testing your shiny new distribution, you can temporarily set `n_samples=-1` to force all combinations to be tested. This is important to avoid your `PR` leading to surprising failures in future runs whenever some bad combinations of parameters are randomly tested.
+1. On Github all tests are run twice, under the `aesara.config.floatX` flags of `"float64"` and `"float32"`. 
However, the reference Python functions will run in a pure "float64" environment, which means the reference and the `PyMC3` results can diverge quite a lot (e.g., underflowing to `-np.inf` for extreme parameters). You should therefore make sure you test locally in both regimes. A quick and dirty way of doing this is to temporarily add `aesara.config.floatX = "float32"` at the very top of the file, immediately after `import aesara`. Remember to set `n_samples=-1` as well to test all combinations. The test output will show what exact parameter values lead to a failure. If you are confident that your implementation is correct, you may opt to tweak the decimal precision with `select_by_precision`, or adjust the tested `Domain` values. In extreme cases, you can mark the test with a conditional `xfail` (if only one of the sub-methods is failing, they should be separated, so that the `xfail` is as narrow as possible):
 
 ```python
@@ -473,9 +359,8 @@ def test_blah_logcdf(self):
 
 ## 5. Documenting the new `Distribution`
 
-New distributions should have a rich docstring, following the same format as
-that of previously implemented distributions. It generally looks something like
-this:
+New distributions should have a rich docstring, following the same format as that of previously implemented distributions.
+It generally looks something like this:
 
 ```python
   r"""
@@ -535,9 +420,5 @@ this:
   """
 ```
 
-The new distribution should be referenced in the respective API page in the docs
-module (e.g., `pymc3/docs/api/distributions.continuous.rst`). If appropriate,
-a new notebook example should be added to
-[pymc-examples](https://github.com/pymc-devs/pymc-examples/) illustrating how
-this distribution can be used and how it relates (and/or differs) from other
-distributions that users are more likely to be familiar with.
+The new distribution should be referenced in the respective API page in the `docs` module (e.g., `pymc3/docs/api/distributions.continuous.rst`). 
+If appropriate, a new notebook example should be added to [pymc-examples](https://github.com/pymc-devs/pymc-examples/) illustrating how this distribution can be used and how it relates (and/or differs) from other distributions that users are more likely to be familiar with.

From 1a74edde45accb84906f99030a14ef0ef3ab29c1 Mon Sep 17 00:00:00 2001
From: Ricardo
Date: Wed, 23 Jun 2021 16:50:56 +0200
Subject: [PATCH 5/7] Fix cross-refs and integrate more suggestions

---
 docs/source/distribution_implementation.md | 23 +++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/docs/source/distribution_implementation.md b/docs/source/distribution_implementation.md
index 4bf1d6467..c0e31b8ae 100644
--- a/docs/source/distribution_implementation.md
+++ b/docs/source/distribution_implementation.md
@@ -4,7 +4,7 @@ This guide provides an overview on how to implement a distribution for version 4
 It is designed for developers who wish to add a new distribution to the library.
 Users will not be aware of all this complexity and should instead make use of helper methods such as (TODO).

-PyMC3 {`Distributions`}`pymc3.distributions.Distribution` build on top of Aesara's {`RandomVariables`}`aesara.tensor.random.op.RandomVariable`, and implement `logp` and `logcdf` methods as well as other initialization and validation helpers, most notably `shape/dims`, alternative parametrizations, and default `transforms`.
+PyMC3 {class}`~pymc3.distributions.Distribution` classes build on top of Aesara's {class}`~aesara.tensor.random.op.RandomVariable`, and implement `logp` and `logcdf` methods as well as other initialization and validation helpers, most notably `shape/dims`, alternative parametrizations, and default `transforms`.

 Here is a summary check-list of the steps needed to implement a new distribution.
 Each section will be expanded below:
@@ -19,7 +19,7 @@ This guide does not attempt to explain the rationale behind the `Distributions`
 ## 1. Creating a new `RandomVariable` `Op`

-{`RandomVariables`}`aesara.tensor.random.op.RandomVariable` are responsible for implementing the random sampling methods, which in version 3 of PyMC3 used to be one of the standard `Distribution` methods, alongside `logp` and `logcdf`.
+{class}`~aesara.tensor.random.op.RandomVariable` subclasses are responsible for implementing the random sampling methods, which in version 3 of PyMC3 used to be one of the standard `Distribution` methods, alongside `logp` and `logcdf`.
 The `RandomVariable` is also responsible for parameter broadcasting and shape inference.

 Before creating a new `RandomVariable` make sure that it is not offered in the [Numpy library](https://numpy.org/doc/stable/reference/random/generator.html#distributions).
@@ -77,20 +77,21 @@ class BlahRV(RandomVariable):
     ) -> np.ndarray:
         return scipy.stats.blah.rvs(loc, scale, random_state=rng, size=size)

-# Create an instance of the `RandomVariable` `Op`...
+# Create the actual `RandomVariable` `Op`...
 blah = BlahRV()
 ```

 Some important things to keep in mind:

-1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) and should not make use of other `Aesara` symbolic ops. The random method should make use of the `rng` which is a Numpy {`Random State`}`numpy.random.RandomState`, so that samples are reproducible.
-1. The `size` argument (together with the inputs shape) is the only way for the user to specify the `batch_shape` (~number of repetitions) of the `RandomVariable`. The `rng_fn` will have to take this into consideration for correct output. `size` is the specification used by `Numpy` and `Scipy` and works like PyMC3 `shape` for univariate distributions, but is different for multivariate distributions. Unfortunately there is no general reference documenting how `size` ought to work for multivariate distributions. This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful to get more context.
+1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) and should not make use of other `Aesara` symbolic ops. The random method should make use of the `rng` which is a Numpy {class}`~numpy.random.RandomState`, so that samples are reproducible.
+1. The `size` argument (together with the inputs shape) is the only way for the user to specify non-default `RandomVariable` dimensions. The `rng_fn` will have to take this into consideration for correct output. `size` is the specification used by `Numpy` and `Scipy` and works like PyMC3 `shape` for univariate distributions, but is different for multivariate distributions. Unfortunately there is no general reference documenting how `size` ought to work for multivariate distributions. This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful to get more context.
 1. `Aesara` tries to infer the output shape of the `RandomVariable` (given a user-specified size) by introspection of the `ndim_supp` and `ndim_params` attributes. However, the default method may not work for more complex distributions. In that case, custom `_shape_from_params` (and less probably, `_infer_shape`) should also be implemented in the new `RandomVariable` class. One simple example is seen in the {MultinomialRV}`pymc3.distributions.multivariate.DirichletMultinomialRV` where it was necessary to specify the `rep_param_idx` so that the `default_shape_from_params` helper method could do its job. In more complex cases, it may not be possible to make use of the default helper, but those have not been found yet!
 1. It's okay to use the `rng_fn` `classmethods` of other Aesara and PyMC3 `RandomVariables` inside the new `rng_fn`. For example if you are implementing a negative HalfNormal `RandomVariable`, your `rng_fn` can simply return `- halfnormal.rng_fn(rng, scale, size)`.
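The composition trick in the last point can be sketched in plain `numpy`, independently of Aesara. The helper name `neg_halfnormal_rng_fn` below is hypothetical, not an actual PyMC3 function; the point is only that `rng` and `size` must be threaded through so draws stay reproducible and correctly shaped:

```python
import numpy as np

def neg_halfnormal_rng_fn(rng, scale, size=None):
    # Hypothetical sketch: a negative half-normal draw built by negating the
    # absolute value of normal draws. Using the provided `rng` keeps samples
    # reproducible; forwarding `size` keeps the output shape correct.
    return -np.abs(rng.normal(0.0, scale, size=size))

rng = np.random.default_rng(123)
draws = neg_halfnormal_rng_fn(rng, scale=2.0, size=(3, 4))
assert draws.shape == (3, 4)  # `size` determines the output shape
assert (draws <= 0).all()     # all mass on the negative half-line
```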
+*Note: In addition to `size`, the `PyMC3` API also provides `shape` and `dims` as alternatives to define a distribution dimensionality, but this is taken care of by {class}`~pymc3.distributions.Distribution`, and should not require any extra changes.*

-For a quick test that your new `RandomVariable` `Op` is working, you can just call `eval` on a new instance:
+For a quick test that your new `RandomVariable` `Op` is working, you can call the `Op` with the necessary parameters and then call `eval()` on the returned object:

 ```python
@@ -181,8 +182,8 @@ class Blah(PositiveContinuous):
 Some notes:

-1. A distribution should at the very least inherit from `Discrete` or `Continuous`. For the latter, more specific subclasses exist:`PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, `CircularContinuous`, which specify default transformations for the variables. If you need to specify a one-time custom transform you can also override the `__new__` method, as is done for the {`Dirichlet`}`pymc3.distributions.multivariate.Dirichlet`.
-1. If a distribution does not have a corresponding `random` implementation, a `RandomVariable` should still be created that raises a `NotImplementedError`. This is the case for the {`Flat`}`pymc3.distributions.continuous.FlatRV`. In this case it will be necessary to provide a standard `initval` by
+1. A distribution should at the very least inherit from `Discrete` or `Continuous`. For the latter, more specific subclasses exist: `PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, `CircularContinuous`, which specify default transformations for the variables. If you need to specify a one-time custom transform you can also override the `__new__` method, as is done for the {class}`~pymc3.distributions.multivariate.Dirichlet`.
+1. If a distribution does not have a corresponding `random` implementation, a `RandomVariable` should still be created that raises a `NotImplementedError`. This is the case for the {class}`~pymc3.distributions.continuous.FlatRV`. In this case it will be necessary to provide a standard `initval` by
 overriding `__new__`.
 1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` way, and all the information that is needed in the `logp` and `logcdf` methods is expected to be "carried" via the `RandomVariable` inputs. You may pass numerical arguments that are not strictly needed for the `rng_fn` method but are used in the `logp` and `logcdf` methods. Just keep in mind whether this affects the correct shape inference behavior of the `RandomVariable`. If specialized non-numeric information is needed you might need to define your custom`_logp` and `_logcdf` {dispatc}`Dispatching` functions, but this should be done as a last resort.
 1. The `logcdf` method is not a requirement, but it's a nice plus!
@@ -264,7 +265,7 @@ class TestBlahAltParam2(BaseTestDistribution):
 ```

-Custom tests can also be added to the class as is done for the {`FlatRV`}`pymc3.tests.test_random.TestFlat`.
+Custom tests can also be added to the class as is done for the {class}`~pymc3.tests.test_random.TestFlat`.

 ### Note on `check_rv_size` test:
@@ -277,7 +278,7 @@ tests_to_run = ["check_rv_size"]
 ```

 This is usually needed for Multivariate distributions.
-You can see an example in the {`Dirichlet` test}`pymc3.test.test_random.TestDirichlet`.
+You can see an example in {class}`~pymc3.test.test_random.TestDirichlet`.

 ### Notes on `check_pymc_draws_match_reference` test
@@ -287,7 +288,7 @@ The latter kind of test (if warranted) can be performed with the aid of `pymc3_r
 This kind of test only makes sense if there is a good independent generator reference (i.e., not just the same composition of numpy / scipy python calls that is done inside `rng_fn`).
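As a plain `numpy` illustration of the "exact draws" comparison described above (the function name and parameter values below are hypothetical, loosely mirroring the Weibull example in this guide), a seeded reference generator must return identical draws for identical seeds:

```python
import numpy as np

def seeded_weibull_reference(alpha, beta, size, seed):
    # Hypothetical reference generator: the same numpy call a Weibull-style
    # rng_fn would make, seeded explicitly so draws can be compared exactly.
    rng = np.random.default_rng(seed)
    return beta * rng.weibull(alpha, size=size)

# The same seed must reproduce the exact same draws...
a = seeded_weibull_reference(alpha=1.5, beta=2.0, size=15, seed=42)
b = seeded_weibull_reference(alpha=1.5, beta=2.0, size=15, seed=42)
assert np.array_equal(a, b)
# ...while a different seed should not.
c = seeded_weibull_reference(alpha=1.5, beta=2.0, size=15, seed=43)
assert not np.array_equal(a, c)
```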
 Finally, when your `rng_fn` is doing something more than just calling a `numpy` or `scipy` method, you will need to set up an equivalent seeded function with which to compare for the exact draws (instead of relying on `seeded_[scipy|numpy]_distribution_builder`).
-You can find an example in the {`TestWeibull`}`pymc3.tests.test_distributions_random.TestWeibull`, whose `rng_fn` returns `beta * np.random.weibull(alpha, size=size)`.
+You can find an example in {class}`~pymc3.tests.test_distributions_random.TestWeibull`, whose `rng_fn` returns `beta * np.random.weibull(alpha, size=size)`.

 ## 4. Adding tests for the `logp` / `logcdf` methods

From 033f7b647ccfc947714ad21304c99722f965c3a2 Mon Sep 17 00:00:00 2001
From: Ricardo
Date: Wed, 23 Jun 2021 17:01:53 +0200
Subject: [PATCH 6/7] Add docstrings to `BaseTestDistribution`

---
 pymc3/tests/test_distributions_random.py | 58 ++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/pymc3/tests/test_distributions_random.py b/pymc3/tests/test_distributions_random.py
index 0d4fb6248..304f04456 100644
--- a/pymc3/tests/test_distributions_random.py
+++ b/pymc3/tests/test_distributions_random.py
@@ -283,6 +283,64 @@ class TestZeroInflatedBinomial(BaseTestCases.BaseTestCase):
 class BaseTestDistribution(SeededTest):
+    """
+    This class provides a base for testing that new RandomVariables are
+    correctly implemented, and that the mapping of parameters between the
+    PyMC3 Distribution and the respective RandomVariable is correct.
+
+    Three default tests are provided which check:
+    1. Expected inputs are passed to the `rv_op` by the `dist` `classmethod`,
+       via `check_pymc_params_match_rv_op`
+    2. Expected (exact) draws are being returned, via
+       `check_pymc_draws_match_reference`
+    3. Shape variable inference is correct, via `check_rv_size`
+
+    Each desired test must be referenced by name in `tests_to_run`, when
+    subclassing this class. Custom tests can be added to each class as
+    well. See `TestFlat` for an example.
+
+    Additional tests should be added for each optional parametrization of the
+    distribution. In this case it's enough to include the test
+    `check_pymc_params_match_rv_op` since only this differs.
+
+    Note on `check_rv_size` test:
+    Custom input sizes (and expected output shapes) can be defined for the
+    `check_rv_size` test, by adding the optional class attributes
+    `sizes_to_check` and `sizes_expected`:
+
+    ```python
+    sizes_to_check = [None, (1), (2, 3)]
+    sizes_expected = [(3,), (1, 3), (2, 3, 3)]
+    tests_to_run = ["check_rv_size"]
+    ```
+
+    This is usually needed for Multivariate distributions. You can see an
+    example in `TestDirichlet`.
+
+    Notes on `check_pymc_draws_match_reference` test:
+
+    The `check_pymc_draws_match_reference` test is a very simple test for the
+    equality of draws from the `RandomVariable` and the exact same python
+    function, given the same inputs and random seed. A small number of draws
+    (`size=15`) is checked. This is not supposed to be a test for the
+    correctness of the random generator. The latter kind of test
+    (if warranted) can be performed with the aid of `pymc3_random` and
+    `pymc3_random_discrete` methods in this file, which will perform an
+    expensive statistical comparison between the RandomVariable `rng_fn`
+    and a reference Python function. This kind of test only makes sense if
+    there is a good independent generator reference (i.e., not just the same
+    composition of numpy / scipy python calls that is done inside `rng_fn`).
+
+    Finally, when your `rng_fn` is doing something more than just calling a
+    `numpy` or `scipy` method, you will need to set up an equivalent seeded
+    function with which to compare for the exact draws (instead of relying on
+    `seeded_[scipy|numpy]_distribution_builder`). You can find an example
+    in `TestWeibull`, whose `rng_fn` returns
+    `beta * np.random.weibull(alpha, size=size)`.
+
+    """
+
     pymc_dist: Optional[Callable] = None
     pymc_dist_params = dict()
     reference_dist: Optional[Callable] = None

From d7478027b520508209d959c54c005b235fa1175e Mon Sep 17 00:00:00 2001
From: Ricardo
Date: Wed, 30 Jun 2021 09:08:35 +0200
Subject: [PATCH 7/7] Small tweaks

---
 ...on.md => developer_guide_implementing_distribution.md} | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
 rename docs/source/{distribution_implementation.md => developer_guide_implementing_distribution.md} (97%)

diff --git a/docs/source/distribution_implementation.md b/docs/source/developer_guide_implementing_distribution.md
similarity index 97%
rename from docs/source/distribution_implementation.md
rename to docs/source/developer_guide_implementing_distribution.md
index c0e31b8ae..4d5e2abd4 100644
--- a/docs/source/distribution_implementation.md
+++ b/docs/source/developer_guide_implementing_distribution.md
@@ -86,7 +86,7 @@ Some important things to keep in mind:
 1. Everything inside the `rng_fn` method is pure Python code (as are the inputs) and should not make use of other `Aesara` symbolic ops. The random method should make use of the `rng` which is a Numpy {class}`~numpy.random.RandomState`, so that samples are reproducible.
 1. The `size` argument (together with the inputs shape) is the only way for the user to specify non-default `RandomVariable` dimensions. The `rng_fn` will have to take this into consideration for correct output. `size` is the specification used by `Numpy` and `Scipy` and works like PyMC3 `shape` for univariate distributions, but is different for multivariate distributions. Unfortunately there is no general reference documenting how `size` ought to work for multivariate distributions. This [discussion](https://github.com/numpy/numpy/issues/17669) may be helpful to get more context.
-1. `Aesara` tries to infer the output shape of the `RandomVariable` (given a user-specified size) by introspection of the `ndim_supp` and `ndim_params` attributes. However, the default method may not work for more complex distributions. In that case, custom `_shape_from_params` (and less probably, `_infer_shape`) should also be implemented in the new `RandomVariable` class. One simple example is seen in the {MultinomialRV}`pymc3.distributions.multivariate.DirichletMultinomialRV` where it was necessary to specify the `rep_param_idx` so that the `default_shape_from_params` helper method could do its job. In more complex cases, it may not be possible to make use of the default helper, but those have not been found yet!
+1. `Aesara` tries to infer the output shape of the `RandomVariable` (given a user-specified size) by introspection of the `ndim_supp` and `ndim_params` attributes. However, the default method may not work for more complex distributions. In that case, custom `_shape_from_params` (and less probably, `_infer_shape`) should also be implemented in the new `RandomVariable` class. One simple example is seen in the {class}`~pymc3.distributions.multivariate.DirichletMultinomialRV` where it was necessary to specify the `rep_param_idx` so that the `default_shape_from_params` helper method could do its job. In more complex cases, it may not be possible to make use of the default helper, but those have not been found yet!
 1. It's okay to use the `rng_fn` `classmethods` of other Aesara and PyMC3 `RandomVariables` inside the new `rng_fn`. For example if you are implementing a negative HalfNormal `RandomVariable`, your `rng_fn` can simply return `- halfnormal.rng_fn(rng, scale, size)`.
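For intuition, the default batch-shape logic that introspection of `ndim_supp` and `ndim_params` enables can be roughly sketched in plain `numpy` for a scalar-support variable. This is a simplified sketch of the idea, not the actual Aesara helper, and `default_batch_shape` is a hypothetical name:

```python
import numpy as np

def default_batch_shape(param_shapes, size=None):
    # Rough sketch for an RV with ndim_supp=0: if the user passes `size`,
    # it determines the output dimensions; otherwise the batch shape is the
    # broadcast of the parameter shapes.
    if size is not None:
        return tuple(size)
    return np.broadcast_shapes(*param_shapes)

# Scalar-support RV with a matrix `loc` and a vector `scale`:
assert default_batch_shape([(2, 3), (3,)]) == (2, 3)
# A user-specified size overrides the inferred batch shape:
assert default_batch_shape([(2, 3), (3,)], size=(5, 2, 3)) == (5, 2, 3)
```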
 *Note: In addition to `size`, the `PyMC3` API also provides `shape` and `dims` as alternatives to define a distribution dimensionality, but this is taken care of by {class}`~pymc3.distributions.Distribution`, and should not require any extra changes.*
@@ -117,7 +117,7 @@ After implementing the new `RandomVariable` `Op`, it's time to make use of it in
 PyMC3 works in a very {functional}`Functional_Programming` way, and the `distribution` classes are there mostly to facilitate porting the `v3` code to the new `v4` version, add PyMC3 API features and keep related methods organized together. In practice, they take care of:

-1. Linking ({dispatching}`Dispatching`) a rv_op class with the corresponding logp and logcdf methods.
+1. Linking ({dispatch}`Dispatching`) a rv_op class with the corresponding logp and logcdf methods.
 1. Defining a standard transformation (for continuous distributions) that converts a bounded variable domain (e.g., positive line) to an unbounded domain (i.e., the real line), which many samplers prefer.
 1. Validating the parametrization of a distribution and converting non-symbolic inputs (i.e., numeric literals or numpy arrays) to symbolic variables.
 1. Converting multiple alternative parametrizations to the standard parametrization that the `RandomVariable` is defined in terms of.
@@ -185,7 +185,7 @@ Some notes:
 1. A distribution should at the very least inherit from `Discrete` or `Continuous`. For the latter, more specific subclasses exist: `PositiveContinuous`, `UnitContinuous`, `BoundedContinuous`, `CircularContinuous`, which specify default transformations for the variables. If you need to specify a one-time custom transform you can also override the `__new__` method, as is done for the {class}`~pymc3.distributions.multivariate.Dirichlet`.
 1. If a distribution does not have a corresponding `random` implementation, a `RandomVariable` should still be created that raises a `NotImplementedError`. This is the case for the {class}`~pymc3.distributions.continuous.FlatRV`. In this case it will be necessary to provide a standard `initval` by overriding `__new__`.
-1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` way, and all the information that is needed in the `logp` and `logcdf` methods is expected to be "carried" via the `RandomVariable` inputs. You may pass numerical arguments that are not strictly needed for the `rng_fn` method but are used in the `logp` and `logcdf` methods. Just keep in mind whether this affects the correct shape inference behavior of the `RandomVariable`. If specialized non-numeric information is needed you might need to define your custom`_logp` and `_logcdf` {dispatc}`Dispatching` functions, but this should be done as a last resort.
+1. As mentioned above, `v4` works in a very {functional}`Functional_Programming` way, and all the information that is needed in the `logp` and `logcdf` methods is expected to be "carried" via the `RandomVariable` inputs. You may pass numerical arguments that are not strictly needed for the `rng_fn` method but are used in the `logp` and `logcdf` methods. Just keep in mind whether this affects the correct shape inference behavior of the `RandomVariable`. If specialized non-numeric information is needed you might need to define your custom `_logp` and `_logcdf` {dispatch}`Dispatching` functions, but this should be done as a last resort.
 1. The `logcdf` method is not a requirement, but it's a nice plus!

 For a quick check that things are working you can try to create the new distribution in a model context:
@@ -338,7 +338,7 @@ There are a couple of details worth keeping in mind:
 1. By default the first and last values (edges) of the `Domain` are not compared (they are used for other things). If it is important to test the edge of the `Domain`, the edge values can be repeated. This is done by the `Bool`: `Bool = Domain([0, 0, 1, 1], "int64")`
 1. There are some default domains (such as `R` and `Rplus`) that you can use for testing your new distribution, but it's also perfectly fine to create your own domains inside the test function if there is a good reason for it (e.g., when the default values lead to too many extreme unlikely combinations that are not very informative about the correctness of the implementation).
-1. By default, a random subset of 100 `param` * `paramdomain` combinations is tested, in order to keep the test runtime under control. When testing your shiny new distribution, you can temporarily set `n_samples=-1` to force all combinations to be tested. This is important to avoid the your `PR` leading to surprising failures in future runs whenever some bad combinations of parameters are randomly tested.
+1. By default, a random subset of 100 `param` x `paramdomain` combinations is tested, in order to keep the test runtime under control. When testing your shiny new distribution, you can temporarily set `n_samples=-1` to force all combinations to be tested. This is important to avoid your `PR` leading to surprising failures in future runs whenever some bad combinations of parameters are randomly tested.
 1. On Github all tests are run twice, under the `aesara.config.floatX` flags of `"float64"` and `"float32"`. However, the reference Python functions will run in a pure "float64" environment, which means the reference and the `PyMC3` results can diverge quite a lot (e.g., underflowing to `-np.inf` for extreme parameters). You should therefore make sure you test locally in both regimes. A quick and dirty way of doing this is to temporarily add `aesara.config.floatX = "float32"` at the very top of the file, immediately after `import aesara`. Remember to set `n_samples=-1` as well to test all combinations. The test output will show what exact parameter values lead to a failure. If you are confident that your implementation is correct, you may opt to tweak the decimal precision with `select_by_precision`, or adjust the tested `Domain` values. In extreme cases, you can mark the test with a conditional `xfail` (if only one of the sub-methods is failing, they should be separated, so that the `xfail` is as narrow as possible):
 ```python