Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify handling of infinity and nan in random.choices #85939

Closed
cool-RR mannequin opened this issue Sep 12, 2020 · 13 comments
Closed

Clarify handling of infinity and nan in random.choices #85939

cool-RR mannequin opened this issue Sep 12, 2020 · 13 comments
Labels
3.10 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@cool-RR
Copy link
Mannequin

cool-RR mannequin commented Sep 12, 2020

BPO 41773
Nosy @rhettinger, @cool-RR
PRs
  • bpo-41773: random.choices doesn't support non-finite weights #22441
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-09-29.01:33:05.699>
    created_at = <Date 2020-09-12.15:31:17.870>
    labels = ['type-feature', 'library', '3.10']
    title = 'Clarify handling of infinity and nan in random.choices'
    updated_at = <Date 2020-09-29.01:33:05.698>
    user = 'https://github.com/cool-RR'

    bugs.python.org fields:

    activity = <Date 2020-09-29.01:33:05.698>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-09-29.01:33:05.699>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2020-09-12.15:31:17.870>
    creator = 'cool-RR'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41773
    keywords = ['patch']
    message_count = 13.0
    messages = ['376802', '376812', '376816', '376821', '376827', '376828', '376829', '376831', '376832', '377618', '377619', '377650', '377651']
    nosy_count = 2.0
    nosy_names = ['rhettinger', 'cool-RR']
    pr_nums = ['22441']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue41773'
    versions = ['Python 3.10']

    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 12, 2020

    I was recently tripped up by a bug caused by passing infinite weights to random.choices. I toyed around with that function, and it seemed that when it's given weights that include infinity or NaN, it selects a specific element, always without being random. That's regardless of whether multiple items have infinity weights. It chooses a different specific item if the infinity is negative. The specific item isn't always the one that has the infinite weight.

    I don't know whether that's the intended behavior for some reason, or whether it's even possible to define a logical behavior for infinite weights. If it's not possible, then maybe this function should raise an errors when given weights that include infinity or nan.

    I see that the documentation states that the weights must be non-negative; maybe we should have a check for that too.

    @cool-RR cool-RR mannequin added type-bug An unexpected behavior, bug, or error 3.10 only security fixes stdlib Python modules in the Lib dir labels Sep 12, 2020
    @rhettinger
    Copy link
    Contributor

    Sorry, I don't see the point in cluttering the documentation or over-specifying the function for these non-sensensical toy experiments. AFAICT, no actual problem exists in practice.

    Also, we don't want to slow down the function by running scans for infinities or NaNs. That would make the function worse for all users and better for almost no one.

    @rhettinger
    Copy link
    Contributor

    In general NaNs wreak havoc on any function that uses comparisons:

        >>> max(10, nan)
        10
        >>> max(nan, 10)
        nan

    It isn't the responsibility of the functions to test for NaNs. Likewise, it would not make sense to document the NaN behavior in every function that uses a comparison. That would clutter the docs and it puts the responsibility in the wrong place.

    It is the NaN itself that is responsible for the behavior you're seeing. It is up to the NaN documentation to discuss its effects on downstream code. For example, sorting isn't consistent when NaNs are present:

        >>> sorted([10, nan, 5])
        [10, nan, 5]
        >>> sorted([5, nan, 10])
    [   5, nan, 10]

    Also, a NaN is just one of many possible objects that create weird downstream effects. For examples, sets have a partial ordering and would also create odd results with sorted(), bisect(), max(), min(), etc.

    The situation with Infinity is similar. It is a special object that has the unusual property that inf+1==inf, so bisection of cumulative sums will give the rightmost infinity.

    Consider a population 'ABC' and weights of [5, 7, 2]. We have

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  \--------------------------
      A         5      5                  0.0  <= X < 5.0
      B         7      12                 5.0  <= X < 12.0
      C         2      14                 12.0 <= X < 14.0
              \------
               14
    

    The selector X comes from: random() * 14.0 which gives a range of: 0.0 <= X < 14.0.

    Now consider a population 'ABC' and weights of [5, 0, 2]. We have

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  \--------------------------
      A         5      5                  0.0  <= X < 5.0
      B         0      5                  5.0  <= X < 5.0
      C         2      7                  5.0  <= X < 7.0
              \------
                7
    

    If X is 5.0, we have to choose C because B has a zero selection probabliity. So have to pick the rightmost (bottommost) range that has 5.0, giving the C.

    Now, replace B's weight with float('Inf'):

    Element   Weight   Cumulative Weight  Range (half-open interval)
    -------   ------   -----------------  \--------------------------
      A         5      5                  0.0  <= X < 5.0
      B        Inf     Inf                5.0  <= X < Inf
      C         2      Inf                Inf  <= X < Inf
              \------
               Inf
    

    Since Inf+2 is Inf and Inf==Inf, the latter two ranges are undifferentiated. The selector itself in always Inf because "X = random() * inf" always gives inf. Using the previous rule, we have to choose the rightmost Inf which is C. This is in fact what choices() does:

        >>> choices('ABC', [5, float('Inf'), 2])
        ['C']

    It isn't an error. That is the same behavior that bisect() has when searching for a infinite value (or any other object that universally compares larger than anything except itself):

        >>> bisect([10, 20, inf], inf)
        3
        >>> bisect([10, 20, 30], inf)
        3

    When searching for infinity, you always get the rightmost insertion point even if the cuts points are finite. Accordingly, it makes sense that if one of the weights is infinite, then the total infinite, and the selector is infinite, so you always get the rightmost value.

    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 13, 2020

    I disagree, especially the bit about slowing down the function, since you're checking the total, not every element. Like the check for negative numbers that you do on line 493. But it's your call.

    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 13, 2020

    Speaking of that check, the error message is 'Total of weights must be greater than zero', which is a bit misleading. 'All weights must be positive or zero' would be more accurate. Would you like a PR for that?

    @rhettinger
    Copy link
    Contributor

    'All weights must be positive or zero' would be more accurate.

    That's not what the test checks. It literally checks for "total <= 0.0".

    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 13, 2020

    Yes, but the docs say that the weights are assumed to be nonnegative. I'm assuming the check is done on total because it'd be expensive to do it on each item. A user reading that error message could understand that it's okay for weights to be negative as long as the total isn't, which as far as I understand isn't true.

    @rhettinger
    Copy link
    Contributor

    A user reading that error message could understand that it's
    okay for weights to be negative as long as the total isn't,
    which as far as I understand isn't true.

    This seems like another made up issue. One could also argue the opposite case that if the error message were changed, a user could reasonably assume that the function was in fact checking all the weights individually (which it isn't).

    The docs already state that the function assumes non-negative inputs, which is in fact what it does. The function already has reasonable error reporting for common missteps. Also, it behaves gracefully if someone is nudging weights up and down in a way that goes slightly negative due to rounding. Beyond that, we're hitting the limits of what it can or should do when fed garbage inputs for ill-posed problems.

    It's time for this one die now. It's already eaten an hour of my development time explaining how infinities and nans flow through functions and explaining what the Python norms are for letting them flow through versus treating them as errors.

    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 13, 2020

    Thanks for your time, and I accept your decision here. I'd appreciate it though if you wouldn't use the term "made up issue" on something that happened to me a couple of days ago. Our opposing positions each have their pros and cons, and it makes sense to me that your approach has more pros than mine. Ease up on the "It's time for this one die now" though.

    @rhettinger
    Copy link
    Contributor

    I've discussed this with other developers and now agree that a test should be added. While the current code's handing of Inf or NaN is correct in a technical sense, it is neither intuitive or useful.

    Even though it will have a small time cost for the common case of two weights, we should add a test just after the check for the total being greater than zero:

        if not _isfinite(total):
            raise ValueError('Total of weights must be finite')

    Also edit the docs to say:

    Weights are assumed to be non-negative and finite.

    Ram, since this was your finding, do you want to make a PR for it (with a test, news entry, and doc edit)?

    @rhettinger rhettinger reopened this Sep 28, 2020
    @rhettinger rhettinger added type-feature A feature request or enhancement and removed invalid type-bug An unexpected behavior, bug, or error labels Sep 28, 2020
    @rhettinger rhettinger reopened this Sep 28, 2020
    @rhettinger rhettinger added type-feature A feature request or enhancement and removed invalid type-bug An unexpected behavior, bug, or error labels Sep 28, 2020
    @cool-RR
    Copy link
    Mannequin Author

    cool-RR mannequin commented Sep 28, 2020

    Sounds good, I'm on it.

    @rhettinger
    Copy link
    Contributor

    New changeset b0dfc75 by Ram Rachum in branch 'master':
    bpo-41773: Raise exception for non-finite weights in random.choices(). (GH-22441)
    b0dfc75

    @rhettinger
    Copy link
    Contributor

    Thanks for the PR and for filing the issue.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant