Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

Closed
reidarkind opened this issue Nov 27, 2019 · 5 comments
Closed

Possible bug: rv_discret.ppf for percentiles 0 and 100 and loc > 1 #11132

reidarkind opened this issue Nov 27, 2019 · 5 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Milestone

Comments

@reidarkind
Copy link
Contributor

The method rv_discrete.ppf returns "strange" values for percentile 0 and 1 (where percentile is ranges from 0 to 1).

Given a custom discrete method created with rv_discrete, with values = (xk, xp),

  • should the ppf return quantile min(xk)-1 for percentile 0? Right should be some number <(min(xk), expected nan or min(xk)
  • should the ppf retrun quantile max(xk) for percentile 1, even though loc > 0? Expected max(xk) + loc

Seems like rv_discrete._ppf returns more correct results.

See code and comments in code:

Reproducing code example:

import numpy as np
from scipy.stats import rv_discrete

# Custom discrete distribution with values:
xk = [1, 2, 3, 4]
xp = [0.25] * 4
custom_dist = rv_discrete(name="custom", values=(xk, xp))

# Percentiles evenly distributed from 0 to 1:
p = np.linspace(0, 1, 11)

# Percent point function; what is the quantile corresponding to each percentile
print(f"Percentiles: {p}")
# >> Percentiles: [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
print(f"PPF with loc 0: {custom_dist.ppf(p, loc=0)}")
# >> PPF with loc 0: [0. 1. 1. 2. 2. 2. 3. 3. 4. 4. 4.]
# A bit puzzeled about percentiles 0 and 1;
# Percentile 0 is min(xk) - 1, why -1?, shouldn't it be < min(xk)
# Percentile 1 is max(xk), shouldn't it be > max(xk)?

# Changing loc of distribution
print(f"PPF with loc 1: {custom_dist.ppf(p, loc=1)}")
# >> PPF with loc 1: [0. 2. 2. 3. 3. 3. 4. 4. 5. 5. 4.]
# Even more puzzeld of percentiles 0 and 1;
# percentile 1 corresponding to quantile 4, but 0.9 percentile corresponding to quantile 5!
# percentile 0 corresponding to 0 (same as when loc=0), should have been < min(xk)?

# Try with _ppf in stead:
print(f"Using _ppf, no loc: {custom_dist._ppf(p)}")
# >> Using _ppf, no loc: [1 1 1 2 2 2 3 3 4 4 4]
# Seems more correct.

# Shanging minimum value of xk to see that 0 percentile acctually corresponds to min(xk) - 1
xk_alt = [0.1, 2, 3, 4]
alt_custom_dist = rv_discrete(name="alt_custom", values=(xk_alt, xp))
print(f"PPF with loc 1: {alt_custom_dist.ppf(p, loc=0)}")
# >> PPF with loc 1: [-0.9  0.1  0.1  2.   2.   2.   3.   3.   4.   4.   4. ]
# Still 0 percentile is min(xk) - 1

# Seems to me that the _ppf method gives a better result
print(f"Using _ppf, no loc: {alt_custom_dist._ppf(p)}")
# >> Using _ppf, no loc: [0.1 0.1 0.1 2.  2.  2.  3.  3.  4.  4.  4. ]

Error message:

No error messages.

Scipy/Numpy/Python version information:

import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)
1.3.2 1.17.4 sys.version_info(major=3, minor=7, micro=0, releaselevel='final', serial=0)
@reidarkind
Copy link
Contributor Author

An update:
I've done a small change in one line in the rv_discrete.ppf method. I think the method now gives more correct results for percentile 0 and 1 (100):

Line 3384 in _distn_infrastructure.py could be changed to:
cond1 = (q >= 0) & (q <= 1)
instead of
cond1 = (q > 0) & (q < 1)

See full working example below:

import numpy as np
from numpy import asarray, shape, place
from scipy.stats import rv_discrete
from scipy.stats._distn_infrastructure import argsreduce
from scipy._lib._util import _valarray as valarray


def ppf(self, q, *args, **kwds):
    """
    Percent point function (inverse of `cdf`) at q of the given RV.
    Parameters
    ----------
    q : array_like
        Lower tail probability.
    arg1, arg2, arg3,... : array_like
        The shape parameter(s) for the distribution (see docstring of the
        instance object for more information).
    loc : array_like, optional
        Location parameter (default=0).
    Returns
    -------
    k : array_like
        Quantile corresponding to the lower tail probability, q.
    """
    args, loc, _ = self._parse_args(*args, **kwds)
    _a, _b = self._get_support(*args)
    q, loc = map(asarray, (q, loc))
    args = tuple(map(asarray, args))
    cond0 = self._argcheck(*args) & (loc == loc)
    cond1 = (q >= 0) & (q <= 1)  # this line is change; original was: cond1 = (q > 0) & (q < 1)
    cond2 = (q == 1) & cond0
    cond = cond0 & cond1
    output = valarray(shape(cond), value=self.badvalue, typecode="d")
    # output type 'd' to handle nin and inf
    place(output, (q == 0) * (cond == cond), _a - 1)
    place(output, cond2, _b)
    if np.any(cond):
        goodargs = argsreduce(cond, *((q,) + args + (loc,)))
        loc, goodargs = goodargs[-1], goodargs[:-1]
        place(output, cond, self._ppf(*goodargs) + loc)

    if output.ndim == 0:
        return output[()]
    return output


xk = [0, 1, 2, 3]
xp = [1 / 4] * 4
p = np.linspace(0, 1, 21)


custom_as_is = rv_discrete(name="custom_as_is", values=(xk, xp))
print(f"Original ppf, loc = 0: \n{custom_as_is.ppf(p, loc=0)}")
# >> Original ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Original ppf, loc = 1: \n{custom_as_is.ppf(p, loc=1)}")
# >> Original ppf, loc = 1:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.  3.  3.  3.  4.  4.
#   4.  4.  3.]

# Overwrite existing ppf method with changed ppf method
rv_discrete.ppf = ppf
custom_changed = rv_discrete(name="custom_changed", values=(xk, xp))
print(f"Proposed new ppf, loc = 0: \n{custom_changed.ppf(p, loc=0)}")
# >> Proposed new ppf, loc = 0:
# [0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3.]
print(f"Proposed new ppf, loc = 1: \n{custom_changed.ppf(p, loc=1)}")
# >> Proposed new ppf, loc = 1:
# [1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4.]

@reidarkind
Copy link
Contributor Author

Another option could be to ad loc to percentile 0 and 1. See changes in lines:

place(output, (q == 0) * (cond == cond), _a - 1 + loc)  # added loc
place(output, cond2, _b + loc)  # added loc```

Though, still a bit puzzled about what value is the correct one for percentile 0. How could ppf(0) be _a -1 if the distribution only is defined from _a to _b?

Complete code example below:

import numpy as np
from numpy import asarray, shape, place
from scipy.stats import rv_discrete
from scipy.stats._distn_infrastructure import argsreduce
from scipy._lib._util import _valarray as valarray


def ppf(self, q, *args, **kwds):
    """
    Percent point function (inverse of `cdf`) at q of the given RV.
    Parameters
    ----------
    q : array_like
        Lower tail probability.
    arg1, arg2, arg3,... : array_like
        The shape parameter(s) for the distribution (see docstring of the
        instance object for more information).
    loc : array_like, optional
        Location parameter (default=0).
    Returns
    -------
    k : array_like
        Quantile corresponding to the lower tail probability, q.
    """
    args, loc, _ = self._parse_args(*args, **kwds)
    _a, _b = self._get_support(*args)
    q, loc = map(asarray, (q, loc))
    args = tuple(map(asarray, args))
    cond0 = self._argcheck(*args) & (loc == loc)
    cond1 = (q > 0) & (q < 1)
    cond2 = (q == 1) & cond0
    cond = cond0 & cond1
    output = valarray(shape(cond), value=self.badvalue, typecode="d")
    # output type 'd' to handle nin and inf
    place(output, (q == 0) * (cond == cond), _a - 1 + loc)  # added loc
    place(output, cond2, _b + loc)  # added loc
    if np.any(cond):
        goodargs = argsreduce(cond, *((q,) + args + (loc,)))
        loc, goodargs = goodargs[-1], goodargs[:-1]
        place(output, cond, self._ppf(*goodargs) + loc)

    if output.ndim == 0:
        return output[()]
    return output


xk = [0, 1, 2, 3]
xp = [1 / 4] * 4
p = np.linspace(0, 1, 21)


custom_as_is = rv_discrete(name="custom_as_is", values=(xk, xp))
print(f"Original ppf, loc = 0: \n{custom_as_is.ppf(p, loc=0)}")
# >> Original ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Original ppf, loc = 1: \n{custom_as_is.ppf(p, loc=1)}")
# >> Original ppf, loc = 1:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.  3.  3.  3.  4.  4.
#   4.  4.  3.]

# Overwrite existing ppf method with changed ppf method
rv_discrete.ppf = ppf
custom_changed = rv_discrete(name="custom_changed", values=(xk, xp))
print(f"Proposed new ppf, loc = 0: \n{custom_changed.ppf(p, loc=0)}")
# >> Proposed new ppf, loc = 0:
# [-1.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3.
#   3.  3.  3.]
print(f"Proposed new ppf, loc = 1: \n{custom_changed.ppf(p, loc=1)}")
# >> Proposed new ppf, loc = 1:
# [0. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 3. 3. 3. 3. 3. 4. 4. 4. 4. 4.]

@josef-pkt
Copy link
Member

ppf(0) = min(xk) - 1
is by design. It needs to be strictly smaller than the lower support point, and because discrete distributions are defined on integers reducing it by 1 looks appropriate.

I'm not sure about loc with custom distribution, those defined by values=(xk, xp). I don't remember ever seeing a case for it nor checking it.
If it doesn't have unit tests, then loc might not be correctly supported in that case.
There are not many use cases for loc shifted discrete distribution and this might not be verified enough for edge cases.

my guess is that the boundary points are not handled correctly.

AFAICS
on line 3239f, _a and _b are defined without loc shift
i.e. those need to be replaced by _a + loc and _b + loc

https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L3239
place(output, (q == 0)*(cond == cond), _a-1)
place(output, cond2, _b)

compare with continuous which adjust for loc and scale
https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L2003

@reidarkind
Copy link
Contributor Author

Thanks for your reply, @josef-pkt !

I agree, ppf(0) = min(xk) - 1 is by design. And when xk is integers it is fine. Looking at the tutorial it looks like everything around discrete distributions is about integers. That said, I find the result a bit strange if p(xk[0]) = 0, as shown in code below. But I guess this is a special case.

from scipy.stats import rv_discrete
import numpy as np

xk = np.array([0, 1, 2, 3, 4])
xp = np.array([0] + [0.25] * 4)
custom = rv_discrete(name="custom", values=(xk, xp))
p = np.linspace(0, 1, 21)
print(f"percentiles: \n{p}")
print(f"ppf at each percentile:\n{custom.ppf(p)}")
# percentiles:
# [0.   0.05 0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45
#  0.7  0.75 0.8  0.85 0.9  0.95 1.  ]
# ppf at each percentile:
# [-1.  1.  1.  1.  1.  1.  2.  2.  2.  2.  2.  3.  3
#   4.  4.  4.]

When it comes to loc it must be a bug. Looks like you have come to the same conclusion that I did in my comment I'll create a PR for that one.

reidarkind added a commit to reidarkind/scipy-1 that referenced this issue Dec 2, 2019
Updated  rv_discrete to support shifting loc of distribution, ref Issue scipy#11132
@tylerjereddy tylerjereddy added scipy.stats defect A clear bug or issue that prevents SciPy from being installed or used as expected labels Dec 19, 2019
rgommers pushed a commit that referenced this issue Jan 2, 2020
Updated  rv_discrete to support shifting loc of distribution, ref Issue gh-11132
@rgommers rgommers added this to the 1.5.0 milestone Jan 2, 2020
@rgommers
Copy link
Member

rgommers commented Jan 2, 2020

fixed in gh-11248, closing. thanks @reidarkind and @josef-pkt!

@rgommers rgommers closed this as completed Jan 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
Development

No branches or pull requests

4 participants