Clarify handling of infinity and nan in random.choices #85939
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
assignee = None closed_at = <Date 2020-09-29.01:33:05.699> created_at = <Date 2020-09-12.15:31:17.870> labels = ['type-feature', 'library', '3.10'] title = 'Clarify handling of infinity and nan in random.choices' updated_at = <Date 2020-09-29.01:33:05.698> user = 'https://github.com/cool-RR'
activity = <Date 2020-09-29.01:33:05.698> actor = 'rhettinger' assignee = 'none' closed = True closed_date = <Date 2020-09-29.01:33:05.699> closer = 'rhettinger' components = ['Library (Lib)'] creation = <Date 2020-09-12.15:31:17.870> creator = 'cool-RR' dependencies =  files =  hgrepos =  issue_num = 41773 keywords = ['patch'] message_count = 13.0 messages = ['376802', '376812', '376816', '376821', '376827', '376828', '376829', '376831', '376832', '377618', '377619', '377650', '377651'] nosy_count = 2.0 nosy_names = ['rhettinger', 'cool-RR'] pr_nums = ['22441'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue41773' versions = ['Python 3.10']
The text was updated successfully, but these errors were encountered:
I was recently tripped up by a bug caused by passing infinite weights to random.choices. I toyed around with that function, and it seemed that when it's given weights that include infinity or NaN, it selects a specific element, always without being random. That's regardless of whether multiple items have infinity weights. It chooses a different specific item if the infinity is negative. The specific item isn't always the one that has the infinite weight.
I don't know whether that's the intended behavior for some reason, or whether it's even possible to define a logical behavior for infinite weights. If it's not possible, then maybe this function should raise an errors when given weights that include infinity or nan.
I see that the documentation states that the weights must be non-negative; maybe we should have a check for that too.
Sorry, I don't see the point in cluttering the documentation or over-specifying the function for these non-sensensical toy experiments. AFAICT, no actual problem exists in practice.
Also, we don't want to slow down the function by running scans for infinities or NaNs. That would make the function worse for all users and better for almost no one.
In general NaNs wreak havoc on any function that uses comparisons:
>>> max(10, nan) 10 >>> max(nan, 10) nan
It isn't the responsibility of the functions to test for NaNs. Likewise, it would not make sense to document the NaN behavior in every function that uses a comparison. That would clutter the docs and it puts the responsibility in the wrong place.
It is the NaN itself that is responsible for the behavior you're seeing. It is up to the NaN documentation to discuss its effects on downstream code. For example, sorting isn't consistent when NaNs are present:
>>> sorted([10, nan, 5]) [10, nan, 5] >>> sorted([5, nan, 10]) [ 5, nan, 10]
Also, a NaN is just one of many possible objects that create weird downstream effects. For examples, sets have a partial ordering and would also create odd results with sorted(), bisect(), max(), min(), etc.
The situation with Infinity is similar. It is a special object that has the unusual property that inf+1==inf, so bisection of cumulative sums will give the rightmost infinity.
Consider a population 'ABC' and weights of [5, 7, 2]. We have
The selector X comes from: random() * 14.0 which gives a range of: 0.0 <= X < 14.0.
Now consider a population 'ABC' and weights of [5, 0, 2]. We have
If X is 5.0, we have to choose C because B has a zero selection probabliity. So have to pick the rightmost (bottommost) range that has 5.0, giving the C.
Now, replace B's weight with float('Inf'):
Since Inf+2 is Inf and Inf==Inf, the latter two ranges are undifferentiated. The selector itself in always Inf because "X = random() * inf" always gives inf. Using the previous rule, we have to choose the rightmost Inf which is C. This is in fact what choices() does:
>>> choices('ABC', [5, float('Inf'), 2]) ['C']
It isn't an error. That is the same behavior that bisect() has when searching for a infinite value (or any other object that universally compares larger than anything except itself):
>>> bisect([10, 20, inf], inf) 3 >>> bisect([10, 20, 30], inf) 3
When searching for infinity, you always get the rightmost insertion point even if the cuts points are finite. Accordingly, it makes sense that if one of the weights is infinite, then the total infinite, and the selector is infinite, so you always get the rightmost value.
Yes, but the docs say that the weights are assumed to be nonnegative. I'm assuming the check is done on total because it'd be expensive to do it on each item. A user reading that error message could understand that it's okay for weights to be negative as long as the total isn't, which as far as I understand isn't true.
This seems like another made up issue. One could also argue the opposite case that if the error message were changed, a user could reasonably assume that the function was in fact checking all the weights individually (which it isn't).
The docs already state that the function assumes non-negative inputs, which is in fact what it does. The function already has reasonable error reporting for common missteps. Also, it behaves gracefully if someone is nudging weights up and down in a way that goes slightly negative due to rounding. Beyond that, we're hitting the limits of what it can or should do when fed garbage inputs for ill-posed problems.
It's time for this one die now. It's already eaten an hour of my development time explaining how infinities and nans flow through functions and explaining what the Python norms are for letting them flow through versus treating them as errors.
Thanks for your time, and I accept your decision here. I'd appreciate it though if you wouldn't use the term "made up issue" on something that happened to me a couple of days ago. Our opposing positions each have their pros and cons, and it makes sense to me that your approach has more pros than mine. Ease up on the "It's time for this one die now" though.
I've discussed this with other developers and now agree that a test should be added. While the current code's handing of Inf or NaN is correct in a technical sense, it is neither intuitive or useful.
Even though it will have a small time cost for the common case of two weights, we should add a test just after the check for the total being greater than zero:
if not _isfinite(total): raise ValueError('Total of weights must be finite')
Also edit the docs to say:
Weights are assumed to be non-negative and finite.
Ram, since this was your finding, do you want to make a PR for it (with a test, news entry, and doc edit)?