
Poisson Binomial Distribution Support #6000

Open
hkothari opened this issue Mar 25, 2016 · 20 comments
Labels
enhancement (A new feature or improvement), scipy.stats


@hkothari

I'd like to add support for the Poisson binomial distribution (https://en.wikipedia.org/wiki/Poisson_binomial_distribution) to the scipy.stats module. It could be placed under "scipy.stats.pbinom". Is there a reason why support for this doesn't already exist, and if not, would it be a welcome addition to the scipy repo?

If so, I'll go ahead and implement it and submit a PR.

@josef-pkt
Member

In many cases if a distribution is missing, then it's because nobody implemented it.

Based on a brief look at the Wikipedia page, I think the main question is whether this will be fast enough in general to get a usable implementation. The sums and products in the pmf and cdf look computation intensive.
(Even if it's slow, it wouldn't be the only slow distribution in scipy.stats.)
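To make the cost trade-off concrete: the pmf can also be computed exactly in O(n²), with no transform machinery, by convolving the two-point Bernoulli pmfs one at a time. A sketch (illustrative code, not from the thread; the function name is made up):

```python
import numpy as np

def pbinom_pmf_conv(probs):
    """Poisson binomial pmf by successive convolution, O(n^2).

    probs: success probabilities of the n independent Bernoulli trials.
    Returns an array pmf with pmf[k] = P(K = k) for k = 0..n.
    """
    pmf = np.array([1.0])                  # empty sum: P(K = 0) = 1
    for p in probs:
        # convolve with the Bernoulli(p) pmf [P(0), P(1)] = [1 - p, p]
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

# Two fair coins reduce to Binomial(2, 0.5):
pbinom_pmf_conv([0.5, 0.5])                # → [0.25, 0.5, 0.25]
```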

@hkothari
Author

Yeah, I assume it will be fairly computationally intensive, but as you said it wouldn't be the only slow one. Are there any performance standards we try to adhere to? I'm going to implement this anyway, so I'm happy to test performance against any standards that need to be met.

If not, we can at least document it to be clear that it's expected to be computationally intensive.

@josef-pkt
Member

I haven't seen any performance standards for distributions. Some of them are marked "slow" in the unit test because some methods that use generic calculations can be very slow (rvs needs ppf needs cdf needs pmf for each point).
In some cases caching intermediate results would help but that is not possible in the main class, and I'm not sure how well it is supported in frozen distributions.

@hkothari
Author

Using Fourier transforms this is actually pretty fast. One thing I am seeing in my Fourier-transform pmf implementation is that the sum of the probabilities is off from 1 by around 1e-14, so there's some precision loss.

I'm currently looking through my implementation to see if I can avoid the precision loss, but are there any standards regarding precision that I should be meeting?
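For illustration, the Fourier-transform computation described here (the DFT of the characteristic function, as on the Wikipedia page) can be sketched as follows. This is a hypothetical helper, not hkothari's actual code; the ~1e-14 drift in the total probability comes from round-off in the complex products and the FFT:

```python
import numpy as np

def pbinom_pmf_fft(probs):
    """Poisson binomial pmf via the characteristic function and an FFT.

    Samples phi(2*pi*l/(n+1)) = prod_j (1 - p_j + p_j * exp(2j*pi*l/(n+1)))
    for l = 0..n, then inverts with a length-(n+1) DFT.
    """
    probs = np.asarray(probs, dtype=float)
    n = len(probs)
    omega = np.exp(2j * np.pi * np.arange(n + 1) / (n + 1))
    # characteristic function at each sample point: product over the trials
    phi = np.prod(1.0 - probs + np.outer(omega, probs), axis=1)
    # pmf[k] = (1/(n+1)) * sum_l phi[l] * exp(-2j*pi*k*l/(n+1))
    return np.fft.fft(phi).real / (n + 1)

pmf = pbinom_pmf_fft([0.1, 0.2, 0.7])   # pmf.sum() is 1 up to round-off
```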

@ev-br added the scipy.stats and enhancement labels on Mar 26, 2016
@ev-br
Member

ev-br commented Mar 28, 2016

Re accuracy: hard to tell in general, it depends on the quantity. Offhand, 1e-14 sounds ok.

@hkothari
Author

hkothari commented Apr 1, 2016

I've got a fairly working implementation of this outside of scipy, but when trying to get it working within scipy I'm running into some issues regarding the way shape parameters are handled. Because Poisson Binomial requires a bunch of individual probabilities I'm trying to take in an array of probabilities as the shape argument of my distribution. As far as I can tell, the rv_generic/rv_discrete infrastructure that exists currently doesn't really work with array parameters for shape variables.

Is there any way to take in an array of probabilities as a shape parameter with the current infrastructure? I could probably also do this with varargs but at a cursory glance it looks like that's prohibited and if it's allowed it's not really clear how.

@josef-pkt
Member

No, the setting/framework for distributions doesn't allow for multivariate shape parameters; they have to be provided individually as arguments.
The main reason, besides the fact that the current distributions have 0 to 4 (?) shape parameters, is that an array for a parameter is interpreted as a vectorized calculation and internally broadcast to compute all cases at the same time.
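To make the broadcasting point concrete (this is standard scipy.stats behavior, not code from the thread): passing an array for a shape parameter evaluates several separate distributions, rather than parameterizing one distribution with the whole array.

```python
import numpy as np
from scipy import stats

# Three *separate* Binomial(5, p) distributions, each evaluated at k = 2,
# not one distribution parameterized by the array of p's:
p = np.array([0.1, 0.5, 0.9])
pmf = stats.binom.pmf(2, 5, p)
pmf.shape    # (3,) — one result per p, via broadcasting
```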

@hkothari
Author

hkothari commented Apr 1, 2016

Gotcha. So in that case, do you see any way to specify the parameters of a Poisson binomial with the current infrastructure? To specify one of these distributions you must provide a variable number of individual probabilities. Without array parameters or varargs I'm not really sure it's possible, but there might be something I'm missing in the code.

If there's documentation about what's supported, I'm happy to look there. Feel free to point me in any direction.

@josef-pkt
Member

BTW: In statsmodels we are just struggling with the Tweedie distribution which is a compound Poisson-Gamma distribution with an infinite sum in the pdf. I hope to avoid having to calculate it for the fit.
But it would be good to have a proper implementation of the distribution (as R does).

@josef-pkt
Member

There is also the discussion about the histogram distribution on the mailing list, which would have the same problem.
Actually (I just remembered from the analogy), we have one case in the discrete distributions with a vector argument, but it's provided in the __init__ as values, not as a shape parameter in the methods.

see https://github.com/scipy/scipy/pull/5672/files, which I have been neglecting to review for some time.
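For reference, the existing vector-argument case works roughly like this: rv_discrete accepts the support and probabilities through values in __init__, so the vector never appears as a shape parameter of the methods.

```python
import numpy as np
from scipy import stats

# A finite discrete distribution defined by its support and probabilities,
# passed once at construction time rather than to every method call:
xk = np.arange(4)
pk = [0.1, 0.2, 0.3, 0.4]
custom = stats.rv_discrete(name='custom', values=(xk, pk))

custom.pmf(2)    # → 0.3
custom.cdf(1)    # → 0.1 + 0.2 = 0.3
```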

@josef-pkt
Member

@ev-br using __new__ is a good trick to create a new instance for a distribution with new "frozen" parameters.

@ev-br
Member

ev-br commented Apr 1, 2016

ISTM (superficially, without looking at any code) that this pbinom would fit better among the multivariate distributions. They live in _multivariate.py; a simple example is in #5410.

@josef-pkt
Member

The distribution itself is univariate.
The implementation of multivariate distributions is less restrictive (but also provides less).

@ev-br
Member

ev-br commented Apr 1, 2016

@josef-pkt Indeed. Then it might be easier to register it as having no shape parameters and set the probabilities in the constructor.

@ev-br
Member

ev-br commented Apr 6, 2016

Something like this should work:

class pbinom(rv_discrete):
    def __init__(self, prob, seed=None):
        self.prob = np.asarray(prob)    # validate etc.
        super(pbinom, self).__init__(seed=seed, a=0, b=len(prob))

    def _pmf(self, x):
        ... can use self.prob here ...

Then a user would do p = pbinom(prob=[0.1, 0.2, 0.7]) to get a usable instance.
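A filled-in version of that outline might look like the following. This is a sketch, not ev-br's code: the _pmf body (the DFT of the characteristic function, per the Wikipedia page) and the class name pbinom_gen are assumptions.

```python
import numpy as np
from scipy.stats import rv_discrete

class pbinom_gen(rv_discrete):
    """Poisson binomial with the probabilities frozen in the constructor."""

    def __init__(self, prob, seed=None):
        self.prob = np.asarray(prob, dtype=float)   # validate etc.
        super(pbinom_gen, self).__init__(a=0, b=len(self.prob),
                                         name='pbinom', seed=seed)

    def _pmf(self, x):
        # DFT of the characteristic function (see the Wikipedia page)
        n = len(self.prob)
        omega = np.exp(2j * np.pi * np.arange(n + 1) / (n + 1))
        phi = np.prod(1.0 - self.prob + np.outer(omega, self.prob), axis=1)
        pmf = np.fft.fft(phi).real / (n + 1)
        return pmf[np.asarray(x, dtype=int)]

p = pbinom_gen([0.1, 0.2, 0.7])    # a usable instance; p.pmf, p.cdf, ... work
```

Note the probabilities are passed positionally here: rv_discrete's own __new__ does not accept arbitrary keywords, so a prob= keyword would need extra handling.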

@hkothari
Author

I recognize this is a pretty old issue but I was never able to get it cleanly integrated into scipy. Someone just asked me about the code so I'm including it below and if anyone wants to use it, or carry it over the finish line that would be great:

diff --git a/scipy/stats/_discrete_distns.py b/scipy/stats/_discrete_distns.py
index c39486783..fbca4ac4d 100644
--- a/scipy/stats/_discrete_distns.py
+++ b/scipy/stats/_discrete_distns.py
@@ -798,9 +798,61 @@ class skellam_gen(rv_discrete):
         return mean, var, g1, g2
 skellam = skellam_gen(a=-np.inf, name="skellam", longname='A Skellam')
 
+class pbinom(rv_discrete):
+    """A Poisson binomial discrete random variable.
+
+    %(before_notes)s
+
+    Notes
+    -----
+    Probability distribution of a series of independent Bernoulli
+    random variables that are not necessarily identically distributed.
+
+    For details see: http://en.wikipedia.org/wiki/Poisson_binomial_distribution
+
+    `pbinom` takes ``probabilities``, an array of success probabilities for the
+    Bernoulli variables, as parameter.
+
+    %(after_notes)s
+
+    %(example)s
+
+    """
+    def __init__(self, probabilities, seed=None):
+        self.probabilities = np.asarray(probabilities)
+        if not ((self.probabilities <= 1).all() and (self.probabilities >= 0).all()):
+            raise ValueError("All probabilities must be between 0 and 1")
+        super(pbinom, self).__init__(seed=seed, a=0, b=len(self.probabilities))
+
+    def _rvs(self):
+        # sum of independent Bernoulli draws, self._size variates at a time
+        return sum(self._random_state.binomial(1, p, self._size) for p in self.probabilities)
+
+    def _pmf(self, x):
+        n = len(self.probabilities)
+        C = np.exp(2j*np.pi/(n+1))
+        s = 0
+        for l in range(0, n+1):
+            product = 1
+            for p in self.probabilities:
+                product *= 1 + (C**l - 1) * p
+            s += C**(-l*x) * product
+        return 1/(n+1) * s.real
+
+    def _cdf(self, x):
+        k = int(floor(x))
+        return sum(self._pmf(i) for i in range(0, k + 1))
+
+    def _stats(self):
+        mean = sum(self.probabilities)
+        var = sum((p * (1-p) for p in self.probabilities))
+        sigma = sqrt(var)
+        g1 = sum((1 - 2*p)*(1 - p)*p for p in self.probabilities)/(sigma**3)
+        g2 = sum((1 - 6*(1 - p)*p)*(1 - p)*p for p in self.probabilities)/(sigma**4)
+        return mean, var, g1, g2
 
 # Collect names of classes and objects in this module.
 pairs = list(globals().items())
 _distn_names, _distn_gen_names = get_distribution_names(pairs, rv_discrete)
 
-__all__ = _distn_names + _distn_gen_names
+__all__ = _distn_names + _distn_gen_names + ["pbinom"]
diff --git a/scipy/stats/tests/test_distributions.py b/scipy/stats/tests/test_distributions.py
index a38bbcb9a..0f5f33edd 100644
--- a/scipy/stats/tests/test_distributions.py
+++ b/scipy/stats/tests/test_distributions.py
@@ -979,6 +979,50 @@ class TestSkellam(TestCase):
 
         assert_almost_equal(stats.skellam.cdf(k, mu1, mu2), skcdfR, decimal=5)
 
+class TestPBinom(TestCase):
+    def test_rvs(self):
+        vals = stats.pbinom.rvs(probabilities=[0.5, 0.25, 0.3, 0.2], size=(2, 50))
+        assert_(numpy.all(vals >= 0) & numpy.all(vals <= 4))
+        assert_(numpy.shape(vals) == (2, 50))
+        assert_(vals.dtype.char in typecodes['AllInteger'])
+        val = stats.pbinom.rvs(probabilities=[0.5, 0.25])
+        assert_(isinstance(val, int))
+        val = stats.pbinom(probabilities=[0.5, 0.25]).rvs(3)
+        assert_(isinstance(val, numpy.ndarray))
+        assert_(val.dtype.char in typecodes['AllInteger'])
+
+    def test_pmf(self):
+        vals1 = stats.pbinom.pmf(1, probabilities=[0.5])
+        assert_allclose(vals1, 0.5, rtol=1e-15, atol=0)
+
+        probs = [0.5, 0.5]
+        vals2 = stats.pbinom.pmf(2, probabilities=probs)
+        vals3 = stats.pbinom.pmf(1, probabilities=probs)
+        vals4 = stats.pbinom.pmf(0, probabilities=probs)
+        assert_allclose(vals2, 0.25, rtol=1e-15, atol=0)
+        assert_allclose(vals3, 0.5, rtol=1e-15, atol=0)
+        assert_allclose(vals4, 0.25, rtol=1e-15, atol=0)
+
+        probs = [0.3, 0.5]
+        vals5 = stats.pbinom.pmf(2, probabilities=probs)
+        vals6 = stats.pbinom.pmf(1, probabilities=probs)
+        vals7 = stats.pbinom.pmf(0, probabilities=probs)
+        assert_allclose(vals5, 0.15, rtol=1e-15, atol=0)
+        assert_allclose(vals6, 0.5, rtol=1e-15, atol=0)
+        assert_allclose(vals7, 0.35, rtol=1e-15, atol=0)
+
+    def test_binomial(self):
+        """
+        The case when probabilities are IID should yield the
+        exact same distribution as a binomial distribution.
+        """
+        binom_dist = stats.binom(10, 0.4)
+        pbinom_dist = stats.pbinom(probabilities=([0.4]*10))
+
+        assert_allclose(binom_dist.pmf(2), pbinom_dist.pmf(2), rtol=1e-15)
+        assert_allclose(binom_dist.pmf(7), pbinom_dist.pmf(7), rtol=1e-15)
+
+        assert_allclose(binom_dist.cdf(8), pbinom_dist.cdf(8), rtol=1e-15)
 
 class TestLognorm(TestCase):
     def test_pdf(self):

@reynoldscem

Can you comment as to why it was never integrated?

@bbudescu

bbudescu commented Mar 6, 2019

There is another implementation using FFT here. The author says that his implementation is based on this paper, which is newer (2013) than the paper cited by Wikipedia (2010). The newer paper is available as full text here. Note that the author of the paper also published an R package implementing his computation methods (source), which may be used for verifying the results.

@DavidWalz

@hkothari thanks for your implementation!

@ev-br as far as I understand the remaining issue is the lack of broadcasting support for evaluating pmf and cdf on tensor-shaped x, right?

@mdhaber
Contributor

mdhaber commented Sep 19, 2022

The problem is that this distribution doesn't fit well into the existing distribution infrastructure because it has a many-valued shape parameter (or a variable number of scalar shape parameters). We'll make sure that there is support for this sort of thing in the new infrastructure (gh-15928).
