Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

reidarkind · 2019-11-27T09:18:59Z

The method rv_discrete.ppf returns "strange" values for percentile 0 and 1 (where percentile is ranges from 0 to 1).

Given a custom discrete method created with rv_discrete, with values = (xk, xp),

should the ppf return quantile min(xk)-1 for percentile 0? Right should be some number <(min(xk), expected nan or min(xk)
should the ppf retrun quantile max(xk) for percentile 1, even though loc > 0? Expected max(xk) + loc

Seems like rv_discrete._ppf returns more correct results.

See code and comments in code:

Reproducing code example:

import numpy as np
from scipy.stats import rv_discrete

# Custom discrete distribution with values:
xk = [1, 2, 3, 4]
xp = [0.25] * 4
custom_dist = rv_discrete(name="custom", values=(xk, xp))

# Percentiles evenly distributed from 0 to 1:
p = np.linspace(0, 1, 11)

# Percent point function; what is the quantile corresponding to each percentile
print(f"Percentiles: {p}")
# >> Percentiles: [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
print(f"PPF with loc 0: {custom_dist.ppf(p, loc=0)}")
# >> PPF with loc 0: [0. 1. 1. 2. 2. 2. 3. 3. 4. 4. 4.]
# A bit puzzeled about percentiles 0 and 1;
# Percentile 0 is min(xk) - 1, why -1?, shouldn't it be < min(xk)
# Percentile 1 is max(xk), shouldn't it be > max(xk)?

# Changing loc of distribution
print(f"PPF with loc 1: {custom_dist.ppf(p, loc=1)}")
# >> PPF with loc 1: [0. 2. 2. 3. 3. 3. 4. 4. 5. 5. 4.]
# Even more puzzeld of percentiles 0 and 1;
# percentile 1 corresponding to quantile 4, but 0.9 percentile corresponding to quantile 5!
# percentile 0 corresponding to 0 (same as when loc=0), should have been < min(xk)?

# Try with _ppf in stead:
print(f"Using _ppf, no loc: {custom_dist._ppf(p)}")
# >> Using _ppf, no loc: [1 1 1 2 2 2 3 3 4 4 4]
# Seems more correct.

# Shanging minimum value of xk to see that 0 percentile acctually corresponds to min(xk) - 1
xk_alt = [0.1, 2, 3, 4]
alt_custom_dist = rv_discrete(name="alt_custom", values=(xk_alt, xp))
print(f"PPF with loc 1: {alt_custom_dist.ppf(p, loc=0)}")
# >> PPF with loc 1: [-0.9  0.1  0.1  2.   2.   2.   3.   3.   4.   4.   4. ]
# Still 0 percentile is min(xk) - 1

# Seems to me that the _ppf method gives a better result
print(f"Using _ppf, no loc: {alt_custom_dist._ppf(p)}")
# >> Using _ppf, no loc: [0.1 0.1 0.1 2.  2.  2.  3.  3.  4.  4.  4. ]

Error message:

No error messages.

Scipy/Numpy/Python version information:

import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)
1.3.2 1.17.4 sys.version_info(major=3, minor=7, micro=0, releaselevel='final', serial=0)

The text was updated successfully, but these errors were encountered:

reidarkind · 2019-11-29T09:15:05Z

An update:
I've done a small change in one line in the rv_discrete.ppf method. I think the method now gives more correct results for percentile 0 and 1 (100):

Line 3384 in _distn_infrastructure.py could be changed to:
cond1 = (q >= 0) & (q <= 1)
instead of
cond1 = (q > 0) & (q < 1)

See full working example below:

import numpy as np
from numpy import asarray, shape, place
from scipy.stats import rv_discrete
from scipy.stats._distn_infrastructure import argsreduce
from scipy._lib._util import _valarray as valarray


def ppf(self, q, *args, **kwds):
    """
    Percent point function (inverse of `cdf`) at q of the given RV.
    Parameters
    ----------
    q : array_like
        Lower tail probability.
    arg1, arg2, arg3,... : array_like
        The shape parameter(s) for the distribution (see docstring of the
        instance object for more information).
    loc : array_like, optional
        Location parameter (default=0).
    Returns
    -------
    k : array_like
        Quantile corresponding to the lower tail probability, q.
    """
    args, loc, _ = self._parse_args(*args, **kwds)
    _a, _b = self._get_support(*args)
    q, loc = map(asarray, (q, loc))
    args = tuple(map(asarray, args))
    cond0 = self._argcheck(*args) & (loc == loc)
    cond1 = (q >= 0) & (q <= 1)  # this line is change; original was: cond1 = (q > 0) & (q < 1)
    cond2 = (q == 1) & cond0
    cond = cond0 & cond1
    output = valarray(shape(cond), value=self.badvalue, typecode="d")
    # output type 'd' to handle nin and inf
    place(output, (q == 0) * (cond == cond), _a - 1)
    place(output, cond2, _b)
    if np.any(cond):
        goodargs = argsreduce(cond, *((q,) + args + (loc,)))
        loc, goodargs = goodargs[-1], goodargs[:-1]
        place(output, cond, self._ppf(*goodargs) + loc)

    if output.ndim == 0:
        return output[()]
    return output


xk = [0, 1, 2, 3]
xp = [1 / 4] * 4
p = np.linspace(0, 1, 21)


custom_as_is = rv_discrete(name="custom_as_is", values=(xk, xp))
print(f"Original ppf, loc = 0: \n{custom_as_is.ppf(p, loc=0)}")
# >> Original ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Original ppf, loc = 1: \n{custom_as_is.ppf(p, loc=1)}")
# >> Original ppf, loc = 1:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.  3.  3.  3.  4.  4.
#   4.  4.  3.]

# Overwrite existing ppf method with changed ppf method
rv_discrete.ppf = ppf
custom_changed = rv_discrete(name="custom_changed", values=(xk, xp))
print(f"Proposed new ppf, loc = 0: \n{custom_changed.ppf(p, loc=0)}")
# >> Proposed new ppf, loc = 0:
# [0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3.]
print(f"Proposed new ppf, loc = 1: \n{custom_changed.ppf(p, loc=1)}")
# >> Proposed new ppf, loc = 1:
# [1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4.]

reidarkind · 2019-11-29T10:53:03Z

Another option could be to ad loc to percentile 0 and 1. See changes in lines:

place(output, (q == 0) * (cond == cond), _a - 1 + loc)  # added loc
place(output, cond2, _b + loc)  # added loc```

Though, still a bit puzzled about what value is the correct one for percentile 0. How could ppf(0) be _a -1 if the distribution only is defined from _a to _b?

Complete code example below:

import numpy as np
from numpy import asarray, shape, place
from scipy.stats import rv_discrete
from scipy.stats._distn_infrastructure import argsreduce
from scipy._lib._util import _valarray as valarray


def ppf(self, q, *args, **kwds):
    """
    Percent point function (inverse of `cdf`) at q of the given RV.
    Parameters
    ----------
    q : array_like
        Lower tail probability.
    arg1, arg2, arg3,... : array_like
        The shape parameter(s) for the distribution (see docstring of the
        instance object for more information).
    loc : array_like, optional
        Location parameter (default=0).
    Returns
    -------
    k : array_like
        Quantile corresponding to the lower tail probability, q.
    """
    args, loc, _ = self._parse_args(*args, **kwds)
    _a, _b = self._get_support(*args)
    q, loc = map(asarray, (q, loc))
    args = tuple(map(asarray, args))
    cond0 = self._argcheck(*args) & (loc == loc)
    cond1 = (q > 0) & (q < 1)
    cond2 = (q == 1) & cond0
    cond = cond0 & cond1
    output = valarray(shape(cond), value=self.badvalue, typecode="d")
    # output type 'd' to handle nin and inf
    place(output, (q == 0) * (cond == cond), _a - 1 + loc)  # added loc
    place(output, cond2, _b + loc)  # added loc
    if np.any(cond):
        goodargs = argsreduce(cond, *((q,) + args + (loc,)))
        loc, goodargs = goodargs[-1], goodargs[:-1]
        place(output, cond, self._ppf(*goodargs) + loc)

    if output.ndim == 0:
        return output[()]
    return output


xk = [0, 1, 2, 3]
xp = [1 / 4] * 4
p = np.linspace(0, 1, 21)


custom_as_is = rv_discrete(name="custom_as_is", values=(xk, xp))
print(f"Original ppf, loc = 0: \n{custom_as_is.ppf(p, loc=0)}")
# >> Original ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Original ppf, loc = 1: \n{custom_as_is.ppf(p, loc=1)}")
# >> Original ppf, loc = 1:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.  3.  3.  3.  4.  4.
#   4.  4.  3.]

# Overwrite existing ppf method with changed ppf method
rv_discrete.ppf = ppf
custom_changed = rv_discrete(name="custom_changed", values=(xk, xp))
print(f"Proposed new ppf, loc = 0: \n{custom_changed.ppf(p, loc=0)}")
# >> Proposed new ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Proposed new ppf, loc = 1: \n{custom_changed.ppf(p, loc=1)}")
# >> Proposed new ppf, loc = 1:
# [0. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4.]

josef-pkt · 2019-11-29T16:57:31Z

ppf(0) = min(xk) - 1
is by design. It needs to be strictly smaller than the lower support point, and because discrete distributions are defined on integers reducing it by 1 looks appropriate.

I'm not sure about loc with custom distribution, those defined by values=(xk, xp). I don't remember ever seeing a case for it nor checking it.
If it doesn't have unit tests, then loc might not be correctly supported in that case.
There are not many use cases for loc shifted discrete distribution and this might not be verified enough for edge cases.

my guess is that the boundary points are not handled correctly.

AFAICS
on line 3239f, _a and _b are defined without loc shift
i.e. those need to be replaced by _a + loc and _b + loc

https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L3239
place(output, (q == 0)*(cond == cond), _a-1)
place(output, cond2, _b)

compare with continuous which adjust for loc and scale
https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L2003

reidarkind · 2019-12-02T07:14:13Z

Thanks for your reply, @josef-pkt !

I agree, ppf(0) = min(xk) - 1 is by design. And when xk is integers it is fine. Looking at the tutorial it looks like everything around discrete distributions is about integers. That said, I find the result a bit strange if p(xk[0]) = 0, as shown in code below. But I guess this is a special case.

from scipy.stats import rv_discrete
import numpy as np

xk = np.array([0, 1, 2, 3, 4])
xp = np.array([0] + [0.25] * 4)
custom = rv_discrete(name="custom", values=(xk, xp))
p = np.linspace(0, 1, 21)
print(f"percentiles: \n{p}")
print(f"ppf at each percentile:\n{custom.ppf(p)}")
# percentiles:
# [0.   0.05 0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45
#  0.7  0.75 0.8  0.85 0.9  0.95 1.  ]
# ppf at each percentile:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3
#   4.  4.  4.]

When it comes to loc it must be a bug. Looks like you have come to the same conclusion that I did in my comment I'll create a PR for that one.

Updated rv_discrete to support shifting loc of distribution, ref Issue scipy#11132

Updated rv_discrete to support shifting loc of distribution, ref Issue gh-11132

rgommers · 2020-01-02T19:26:21Z

fixed in gh-11248, closing. thanks @reidarkind and @josef-pkt!

reidarkind mentioned this issue Nov 29, 2019

Wrong confidence interval for binomial distribution with p=0 #11134

Closed

reidarkind added a commit to reidarkind/scipy-1 that referenced this issue Dec 2, 2019

rv_discrete.ppf() to handle loc

1c5c896

Updated rv_discrete to support shifting loc of distribution, ref Issue scipy#11132

reidarkind mentioned this issue Dec 2, 2019

rv_discrete.ppf() to handle loc reidarkind/scipy-1#1

Merged

reidarkind mentioned this issue Dec 19, 2019

BUG: rv_discrete.ppf() to handle loc #11248

Merged

tylerjereddy added scipy.stats defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Dec 19, 2019

rgommers pushed a commit that referenced this issue Jan 2, 2020

BUG: rv_discrete.ppf() to handle loc (#11248)

d6483ae

Updated rv_discrete to support shifting loc of distribution, ref Issue gh-11132

rgommers added this to the 1.5.0 milestone Jan 2, 2020

rgommers closed this as completed Jan 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

reidarkind commented Nov 27, 2019

reidarkind commented Nov 29, 2019

reidarkind commented Nov 29, 2019

josef-pkt commented Nov 29, 2019

reidarkind commented Dec 2, 2019

rgommers commented Jan 2, 2020

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

Comments

reidarkind commented Nov 27, 2019

Reproducing code example:

Error message:

Scipy/Numpy/Python version information:

reidarkind commented Nov 29, 2019

reidarkind commented Nov 29, 2019

josef-pkt commented Nov 29, 2019

reidarkind commented Dec 2, 2019

rgommers commented Jan 2, 2020