
Poisson Binomial Distribution Support #6000

Open
hkothari opened this issue Mar 25, 2016 · 20 comments
Labels
enhancement (A new feature or improvement), scipy.stats


@hkothari

I'd like to add support for the Poisson binomial distribution (https://en.wikipedia.org/wiki/Poisson_binomial_distribution) to the scipy.stats module. It could be placed under "scipy.stats.pbinom". Is there a reason why support for this doesn't already exist, and if not, would it be a welcome addition to the scipy repo?

If so, I'll go ahead and implement it and submit a PR.

@josef-pkt
Member

In many cases if a distribution is missing, then it's because nobody implemented it.

Based on a brief look at the Wikipedia page, I think the main question is whether this will be fast enough in general to get a usable implementation. The sums and products in the pmf and cdf look computation intensive.
(Even if it's slow, it wouldn't be the only slow distribution in scipy.stats.)
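To make the cost trade-off concrete: the pmf can also be computed exactly in O(n²), with no transform machinery, by convolving the two-point Bernoulli pmfs one at a time. A sketch (illustrative code, not from the thread; the function name is made up):

```python
import numpy as np

def pbinom_pmf_conv(probs):
    """Poisson binomial pmf by successive convolution, O(n^2).

    probs: success probabilities of the n independent Bernoulli trials.
    Returns an array pmf with pmf[k] = P(K = k) for k = 0..n.
    """
    pmf = np.array([1.0])                  # empty sum: P(K = 0) = 1
    for p in probs:
        # convolve with the Bernoulli(p) pmf [P(0), P(1)] = [1 - p, p]
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

# Two fair coins reduce to Binomial(2, 0.5):
pbinom_pmf_conv([0.5, 0.5])                # → [0.25, 0.5, 0.25]
```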

@hkothari
Author

Yeah, I assume it will be fairly computationally intensive, but as you said it wouldn't be the only slow one. Are there any performance standards we try to adhere to? I'm going to implement this anyway, so I'm happy to test performance against any standards that need to be met.

If not, we can at least document it to be clear that it's expected to be computationally intensive.

@josef-pkt
Member

I haven't seen any performance standards for distributions. Some of them are marked "slow" in the unit test because some methods that use generic calculations can be very slow (rvs needs ppf needs cdf needs pmf for each point).
In some cases caching intermediate results would help but that is not possible in the main class, and I'm not sure how well it is supported in frozen distributions.

@hkothari
Author

Using Fourier transforms this is actually pretty fast. One thing I am seeing in my Fourier-transform pmf implementation is that the sum of the probabilities is off from 1 by around 1e-14, so there's some precision loss.

I'm currently looking through my implementation to see if I can avoid the precision loss, but are there any standards regarding precision that I should be meeting?
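For illustration, the Fourier-transform computation described here (the DFT of the characteristic function, as on the Wikipedia page) can be sketched as follows. This is a hypothetical helper, not hkothari's actual code; the ~1e-14 drift in the total probability comes from round-off in the complex products and the FFT:

```python
import numpy as np

def pbinom_pmf_fft(probs):
    """Poisson binomial pmf via the characteristic function and an FFT.

    Samples phi(2*pi*l/(n+1)) = prod_j (1 - p_j + p_j * exp(2j*pi*l/(n+1)))
    for l = 0..n, then inverts with a length-(n+1) DFT.
    """
    probs = np.asarray(probs, dtype=float)
    n = len(probs)
    omega = np.exp(2j * np.pi * np.arange(n + 1) / (n + 1))
    # characteristic function at each sample point: product over the trials
    phi = np.prod(1.0 - probs + np.outer(omega, probs), axis=1)
    # pmf[k] = (1/(n+1)) * sum_l phi[l] * exp(-2j*pi*k*l/(n+1))
    return np.fft.fft(phi).real / (n + 1)

pmf = pbinom_pmf_fft([0.1, 0.2, 0.7])   # pmf.sum() is 1 up to round-off
```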

@ev-br added the scipy.stats and enhancement labels on Mar 26, 2016
@ev-br
Member

ev-br commented Mar 28, 2016

Re accuracy: hard to tell in general, it depends on the quantity. Offhand, 1e-14 sounds ok.

@hkothari
Author

hkothari commented Apr 1, 2016

I've got a fairly working implementation of this outside of scipy, but when trying to get it working within scipy I'm running into some issues regarding the way shape parameters are handled. Because Poisson Binomial requires a bunch of individual probabilities I'm trying to take in an array of probabilities as the shape argument of my distribution. As far as I can tell, the rv_generic/rv_discrete infrastructure that exists currently doesn't really work with array parameters for shape variables.

Is there any way to take in an array of probabilities as a shape parameter with the current infrastructure? I could probably also do this with varargs but at a cursory glance it looks like that's prohibited and if it's allowed it's not really clear how.

@josef-pkt
Member

No, the setting/framework for distributions doesn't allow for multivariate shape parameters; they have to be provided individually as arguments.
The main reason, besides the fact that the current distributions have 0 to 4 (?) shape parameters, is that an array for a parameter is interpreted as a vectorized calculation and internally broadcast to compute all cases at the same time.
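To make the broadcasting point concrete (this is standard scipy.stats behavior, not code from the thread): passing an array for a shape parameter evaluates several separate distributions, rather than parameterizing one distribution with the whole array.

```python
import numpy as np
from scipy import stats

# Three *separate* Binomial(5, p) distributions, each evaluated at k = 2,
# not one distribution parameterized by the array of p's:
p = np.array([0.1, 0.5, 0.9])
pmf = stats.binom.pmf(2, 5, p)
pmf.shape    # (3,) — one result per p, via broadcasting
```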

@hkothari
Author

hkothari commented Apr 1, 2016

Gotcha. So in that case, do you see any way to specify the parameters of a Poisson binomial with the current infrastructure? To specify one of these distributions you must provide a variable number of individual probabilities. Without array parameters or varargs I'm not really sure it's possible, but there might be something I'm missing in the code.

If there's documentation about what's supported, I'm happy to look there. Feel free to point me in any direction.

@josef-pkt
Member

BTW: In statsmodels we are just struggling with the Tweedie distribution which is a compound Poisson-Gamma distribution with an infinite sum in the pdf. I hope to avoid having to calculate it for the fit.
But it would be good to have a proper implementation of the distribution (as R does).

@josef-pkt
Member

There is also the discussion about the histogram distribution on the mailing list, which would have the same problem.
Actually (I just remembered from the analogy), we have one case in the discrete distributions with a vector argument, but it's provided in the __init__ as values, not as a shape parameter in the methods.

see https://github.com/scipy/scipy/pull/5672/files, which I have been neglecting to review for some time.
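For reference, the existing vector-argument case works roughly like this: rv_discrete accepts the support and probabilities through values in __init__, so the vector never appears as a shape parameter of the methods.

```python
import numpy as np
from scipy import stats

# A finite discrete distribution defined by its support and probabilities,
# passed once at construction time rather than to every method call:
xk = np.arange(4)
pk = [0.1, 0.2, 0.3, 0.4]
custom = stats.rv_discrete(name='custom', values=(xk, pk))

custom.pmf(2)    # → 0.3
custom.cdf(1)    # → 0.1 + 0.2 = 0.3
```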

@josef-pkt
Member

@ev-br using __new__ is a good trick to create a new instance for a distribution with new "frozen" parameters.

@ev-br
Member

ev-br commented Apr 1, 2016

ISTM (superficially, without looking at any code) that this pbinom would fit better among the multivariate distributions. They live in _multivariate.py; a simple example is in #5410.

@josef-pkt
Member

The distribution itself is univariate.
The implementation of multivariate distributions is less restrictive (but also provides less).

@ev-br
Member

ev-br commented Apr 1, 2016

@josef-pkt Indeed. Then it might be easier to register it as having no shape parameters and set the probabilities in the constructor.

@ev-br
Member

ev-br commented Apr 6, 2016

Something like this should work:

class pbinom(rv_discrete):
    def __init__(self, prob, seed=None):
        self.prob = np.asarray(prob)    # validate etc.
        super(pbinom, self).__init__(seed=seed, a=0, b=len(prob))

    def _pmf(self, x):
        ... can use self.prob here ...

Then a user would do p = pbinom(prob=[0.1, 0.2, 0.7]) to get a usable instance.
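A filled-in version of that outline might look like the following. This is a sketch, not ev-br's code: the _pmf body (the DFT of the characteristic function, per the Wikipedia page) and the class name pbinom_gen are assumptions.

```python
import numpy as np
from scipy.stats import rv_discrete

class pbinom_gen(rv_discrete):
    """Poisson binomial with the probabilities frozen in the constructor."""

    def __init__(self, prob, seed=None):
        self.prob = np.asarray(prob, dtype=float)   # validate etc.
        super(pbinom_gen, self).__init__(a=0, b=len(self.prob),
                                         name='pbinom', seed=seed)

    def _pmf(self, x):
        # DFT of the characteristic function (see the Wikipedia page)
        n = len(self.prob)
        omega = np.exp(2j * np.pi * np.arange(n + 1) / (n + 1))
        phi = np.prod(1.0 - self.prob + np.outer(omega, self.prob), axis=1)
        pmf = np.fft.fft(phi).real / (n + 1)
        return pmf[np.asarray(x, dtype=int)]

p = pbinom_gen([0.1, 0.2, 0.7])    # a usable instance; p.pmf, p.cdf, ... work
```

Note the probabilities are passed positionally here: rv_discrete's own __new__ does not accept arbitrary keywords, so a prob= keyword would need extra handling.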

@hkothari
Author

I recognize this is a pretty old issue but I was never able to get it cleanly integrated into scipy. Someone just asked me about the code so I'm including it below and if anyone wants to use it, or carry it over the finish line that would be great:

diff --git a/scipy/stats/_discrete_distns.py b/scipy/stats/_discrete_distns.py
index c39486783..fbca4ac4d 100644
--- a/scipy/stats/_discrete_distns.py
+++ b/scipy/stats/_discrete_distns.py
@@ -798,9 +798,61 @@ class skellam_gen(rv_discrete):
         return mean, var, g1, g2
 skellam = skellam_gen(a=-np.inf, name="skellam", longname='A Skellam')
 
+class pbinom(rv_discrete):
+    """A Poisson binomial discrete random variable.
+
+    %(before_notes)s
+
+    Notes
+    -----
+    Probability distribution of a series of independent Bernoulli
+    random variables that are not necessarily identically distributed.
+
+    For details see: http://en.wikipedia.org/wiki/Poisson_binomial_distribution
+
+    `pbinom` takes ``probabilities``, an array of success probabilities for the
+    Bernoulli variables, as parameter.
+
+    %(after_notes)s
+
+    %(example)s
+
+    """
+    def __init__(self, probabilities, seed=None):
+        self.probabilities = np.asarray(probabilities)
+        if not ((self.probabilities <= 1).all() and (self.probabilities >= 0).all()):
+            raise ValueError("All probabilities must be between 0 and 1")
+        super(pbinom, self).__init__(seed=seed, a=0, b=len(self.probabilities))
+
+    def _rvs(self):
+        # sum of independent Bernoulli draws, self._size variates at a time
+        return sum(self._random_state.binomial(1, p, self._size) for p in self.probabilities)
+
+    def _pmf(self, x):
+        n = len(self.probabilities)
+        C = np.exp(2j*np.pi/(n+1))
+        s = 0
+        for l in range(0, n+1):
+            product = 1
+            for p in self.probabilities:
+                product *= 1 + (C**l - 1) * p
+            s += C**(-l*x) * product
+        return 1/(n+1) * s.real
+
+    def _cdf(self, x):
+        k = int(floor(x))
+        return sum(self._pmf(i) for i in range(0, k + 1))
+
+    def _stats(self):
+        mean = sum(self.probabilities)
+        var = sum((p * (1-p) for p in self.probabilities))
+        sigma = sqrt(var)
+        g1 = sum((1 - 2*p)*(1 - p)*p for p in self.probabilities)/(sigma**3)
+        g2 = sum((1 - 6*(1 - p)*p)*(1 - p)*p for p in self.probabilities)/(sigma**4)
+        return mean, var, g1, g2
 
 # Collect names of classes and objects in this module.
 pairs = list(globals().items())
 _distn_names, _distn_gen_names = get_distribution_names(pairs, rv_discrete)
 
-__all__ = _distn_names + _distn_gen_names
+__all__ = _distn_names + _distn_gen_names + ["pbinom"]
diff --git a/scipy/stats/tests/test_distributions.py b/scipy/stats/tests/test_distributions.py
index a38bbcb9a..0f5f33edd 100644
--- a/scipy/stats/tests/test_distributions.py
+++ b/scipy/stats/tests/test_distributions.py
@@ -979,6 +979,50 @@ class TestSkellam(TestCase):
 
         assert_almost_equal(stats.skellam.cdf(k, mu1, mu2), skcdfR, decimal=5)
 
+class TestPBinom(TestCase):
+    def test_rvs(self):
+        vals = stats.pbinom.rvs(probabilities=[0.5, 0.25, 0.3, 0.2], size=(2, 50))
+        assert_(numpy.all(vals >= 0) & numpy.all(vals <= 4))
+        assert_(numpy.shape(vals) == (2, 50))
+        assert_(vals.dtype.char in typecodes['AllInteger'])
+        val = stats.pbinom.rvs(probabilities=[0.5, 0.25])
+        assert_(isinstance(val, int))
+        val = stats.pbinom(probabilities=[0.5, 0.25]).rvs(3)
+        assert_(isinstance(val, numpy.ndarray))
+        assert_(val.dtype.char in typecodes['AllInteger'])
+
+    def test_pmf(self):
+        vals1 = stats.pbinom.pmf(1, probabilities=[0.5])
+        assert_allclose(vals1, 0.5, rtol=1e-15, atol=0)
+
+        probs = [0.5, 0.5]
+        vals2 = stats.pbinom.pmf(2, probabilities=probs)
+        vals3 = stats.pbinom.pmf(1, probabilities=probs)
+        vals4 = stats.pbinom.pmf(0, probabilities=probs)
+        assert_allclose(vals2, 0.25, rtol=1e-15, atol=0)
+        assert_allclose(vals3, 0.5, rtol=1e-15, atol=0)
+        assert_allclose(vals4, 0.25, rtol=1e-15, atol=0)
+
+        probs = [0.3, 0.5]
+        vals5 = stats.pbinom.pmf(2, probabilities=probs)
+        vals6 = stats.pbinom.pmf(1, probabilities=probs)
+        vals7 = stats.pbinom.pmf(0, probabilities=probs)
+        assert_allclose(vals5, 0.15, rtol=1e-15, atol=0)
+        assert_allclose(vals6, 0.5, rtol=1e-15, atol=0)
+        assert_allclose(vals7, 0.35, rtol=1e-15, atol=0)
+
+    def test_binomial(self):
+        """
+        The case when probabilities are IID should yield the
+        exact same distribution as a binomial distribution.
+        """
+        binom_dist = stats.binom(10, 0.4)
+        pbinom_dist = stats.pbinom(probabilities=([0.4]*10))
+
+        assert_allclose(binom_dist.pmf(2), pbinom_dist.pmf(2), rtol=1e-15)
+        assert_allclose(binom_dist.pmf(7), pbinom_dist.pmf(7), rtol=1e-15)
+
+        assert_allclose(binom_dist.cdf(8), pbinom_dist.cdf(8), rtol=1e-15)
 
 class TestLognorm(TestCase):
     def test_pdf(self):

@reynoldscem

Can you comment as to why it was never integrated?

@bbudescu

bbudescu commented Mar 6, 2019

There is another implementation using FFT here. The author says that his implementation is based on this paper, which is newer (2013) than the paper cited by Wikipedia (2010). The newer paper is available as full text here. Note that the author of the paper also published an R package implementing his computation methods (source), which may be used for verifying the results.

@DavidWalz

@hkothari thanks for your implementation!

@ev-br as far as I understand the remaining issue is the lack of broadcasting support for evaluating pmf and cdf on tensor-shaped x, right?

@mdhaber
Contributor

mdhaber commented Sep 19, 2022

The problem is that this distribution doesn't fit well into the existing distribution infrastructure because it has a many-valued shape parameter (or a variable number of scalar shape parameters). We'll make sure that there is support for this sort of thing in the new infrastructure (gh-15928).
