Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

random_combination_with_replacement recipe has misleading docstring #102653

Closed
pochmann opened this issue Mar 13, 2023 · 7 comments
Closed

random_combination_with_replacement recipe has misleading docstring #102653

pochmann opened this issue Mar 13, 2023 · 7 comments
Assignees
Labels
3.11 only security fixes 3.12 bugs and security fixes docs Documentation in the Doc dir

Comments

@pochmann
Copy link
Contributor

pochmann commented Mar 13, 2023

Documentation

The random module has four recipes that are supposed to "efficiently make random selections from the combinatoric iterators in the itertools module". And their docstrings all say "Random selection from [iterator]". Both suggest they're equivalent to random.choice(list(iterator)), just efficiently.

For example, itertools.combinations_with_replacement([0, 1], r=4) produces these five combinations:

(0, 0, 0, 0)
(0, 0, 0, 1)
(0, 0, 1, 1)
(0, 1, 1, 1)
(1, 1, 1, 1)

So random.choice(list(iterator)) would return one of those five with 20% probability each.

But the random_combination_with_replacement recipe instead produces these probabilities:

(0, 0, 0, 0)  6.25%
(0, 0, 0, 1) 25.00%
(0, 0, 1, 1) 37.50%
(0, 1, 1, 1) 25.00%
(1, 1, 1, 1)  6.25%

Here's an implementation that is equivalent to random.choice(list(iterator)):

def random_combination_with_replacement(iterable, r):
    "Random selection from itertools.combinations_with_replacement(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n+r-1), k=r))
    return tuple(pool[i-j] for j, i in enumerate(indices))

One can view the combinations as the result of actually simulating r random draws with replacement, where the multiset {0,0,1,1} indeed occurs more often, namely as 0011, 0101, 0110, etc. But that is not the only valid view and isn't the view suggested by the documentation (as my first paragraph argued). Though if that view and the bias is the intention, then I suggest its documentation should mention the bias.

Test code

Attempt This Online!

import random
import itertools
from collections import Counter

iterable = [0, 1]
r = 4


#-- itertools ----------------------

print('itertools')
for comb in itertools.combinations_with_replacement(iterable, r):
    print(comb)


#-- from iterator ------------------

def random_combination_with_replacement_from_iterator(iterable, r):
    "Random selection from itertools.combinations_with_replacement(iterable, r)"
    iterator = itertools.combinations_with_replacement(iterable, r)
    return random.choice(list(iterator))


#-- current random recipe ----------

def random_combination_with_replacement(iterable, r):
    "Random selection from itertools.combinations_with_replacement(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.choices(range(n), k=r))
    return tuple(pool[i] for i in indices)


#-- proposed random recipe ---------

def random_combination_with_replacement_proposal(iterable, r):
    "Random selection from itertools.combinations_with_replacement(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n+r-1), k=r))
    return tuple(pool[i-j] for j, i in enumerate(indices))


#-- Comparison

for func in random_combination_with_replacement_from_iterator, random_combination_with_replacement, random_combination_with_replacement_proposal:
    print()
    print(func.__name__)
    N = 100000
    ctr = Counter(func(iterable, r) for _ in range(N))
    for comb, freq in sorted(ctr.items()):
        print(comb, f'{freq/N:6.2%}')
Test results
itertools
(0, 0, 0, 0)
(0, 0, 0, 1)
(0, 0, 1, 1)
(0, 1, 1, 1)
(1, 1, 1, 1)

random_combination_with_replacement_from_iterator
(0, 0, 0, 0) 19.89%
(0, 0, 0, 1) 20.08%
(0, 0, 1, 1) 20.01%
(0, 1, 1, 1) 19.88%
(1, 1, 1, 1) 20.14%

random_combination_with_replacement
(0, 0, 0, 0)  6.14%
(0, 0, 0, 1) 24.98%
(0, 0, 1, 1) 37.71%
(0, 1, 1, 1) 25.04%
(1, 1, 1, 1)  6.13%

random_combination_with_replacement_proposal
(0, 0, 0, 0) 20.17%
(0, 0, 0, 1) 19.82%
(0, 0, 1, 1) 20.18%
(0, 1, 1, 1) 19.88%
(1, 1, 1, 1) 19.95%

Linked PRs

@pochmann pochmann added the docs Documentation in the Doc dir label Mar 13, 2023
@terryjreedy
Copy link
Member

@rhettinger

@stevendaprano
Copy link
Member

Both suggest they're equivalent to random.choice(list(iterator)), just efficiently.

I cannot see anything in the documentation that says or implies that. The text immediately under the heading "Recipes" says:

"These recipes show how to efficiently make random selections from the combinatoric iterators in the itertools module"

But it doesn't say anything about being equivalent to random.choice(list(iterator)). Where did you see that?

@rhettinger rhettinger self-assigned this Mar 13, 2023
@rhettinger
Copy link
Contributor

@pochmann Please assign these to me as you produce one issue after another on the various recipes. In every case, the author of the code should be looped in on the conversation.

@pochmann
Copy link
Contributor Author

@stevendaprano I didn't see that code, I'm saying that's what "random selection from [an iterator]" sounds like. How else do you interpret that?

@rhettinger Ok, next time.

@rhettinger
Copy link
Contributor

I agree with @pochmann that there is an issue here. The docs do imply (falsely) that the output of the itertool is what is being sampled.

The actual intent of the recipe was to model selection with replacement and then subsequently disregarding order. That happens to not be the same as making equiprobable selections from a deduped result space.

I'll spend some more time thinking about this. For the moment, I'm inclined to just update the docstring to make clear what random process is being modeled.

@rhettinger
Copy link
Contributor

Here's a possible new docstring:

def random_combination_with_replacement(iterable, r):  # baseline
    """Choose r elements with replacement.  Order the result to match the iterable.

    When the input iterable is already sorted, this is equivalent to:

        sorted(random.choice(list(itertools.product(iterable, repeat=r))))

    And because the result is sorted, it would be contained in:
    
        set(itertools.combinations_with_replacement(iterable, r))
    
    """
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.choices(range(n), k=r))
    return tuple(pool[i] for i in indices)

This is likely overkill and perhaps only the first line is needed. This recipe has been around for a while there haven't previously been any misunderstandings. This is likely because it matches what people usually want and because the four line recipe is clear about what it does.

@rhettinger rhettinger changed the title random_combination_with_replacement recipe misbehaves random_combination_with_replacement recipe has misleading docstring Mar 14, 2023
@rhettinger rhettinger added 3.11 only security fixes 3.12 bugs and security fixes labels Mar 14, 2023
@pochmann
Copy link
Contributor Author

pochmann commented Mar 14, 2023

Sounds alright. I'd probably remove the middle paragraph, I don't think it really helps and might even be distracting. Then the third paragraph could shrink to just extend the first:

def random_combination_with_replacement(iterable, r):  # baseline
    """Choose r elements with replacement.  Order the result to match the iterable,
    so it would be contained in:
    
        set(itertools.combinations_with_replacement(iterable, r))
    
    """

Or maybe even without the set:

def random_combination_with_replacement(iterable, r):  # baseline
    """Choose r elements with replacement.  Order the result to match the iterable,
    so it would occur in:
    
        itertools.combinations_with_replacement(iterable, r)
    
    """

This recipe has been around for a while there haven't previously been any misunderstandings. This is likely because it matches what people usually want and because the four line recipe is clear about what it does.

Might also be because it's rarely used. A GitHub search found 434 occurrences, most of which just defining the function, I had to go to page 5 of the results to find someone using it, and that was for test data in a benchmark about container types and their membership test speeds, where I doubt they cared about the distribution.

rhettinger added a commit to rhettinger/cpython that referenced this issue Mar 15, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 16, 2023
…ythonGH-102742)

(cherry picked from commit b0ec625)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
miss-islington added a commit that referenced this issue Mar 16, 2023
…2742)

(cherry picked from commit b0ec625)

Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>
carljm added a commit to carljm/cpython that referenced this issue Mar 17, 2023
* main: (34 commits)
  pythongh-102701: Fix overflow in dictobject.c (pythonGH-102750)
  pythonGH-78530: add support for generators in `asyncio.wait` (python#102761)
  Increase stack reserve size for Windows debug builds to avoid test crashes (pythonGH-102764)
  pythongh-102755: Add PyErr_DisplayException(exc) (python#102756)
  Fix outdated note about 'int' rounding or truncating (python#102736)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives (python#102760)
  pythongh-99726: Improves correctness of stat results for Windows, and uses faster API when available (pythonGH-102149)
  pythongh-102192: remove redundant exception fields from ssl module socket (python#102466)
  pythongh-102192: Replace PyErr_Fetch/Restore etc by more efficient alternatives (python#102743)
  pythongh-102737: Un-ignore ceval.c in the CI globals check (pythongh-102745)
  pythonGH-102748: remove legacy support for generator based coroutines from `asyncio.iscoroutine` (python#102749)
  pythongh-102721: Improve coverage of `_collections_abc._CallableGenericAlias` (python#102722)
  pythonGH-102653: Make recipe docstring show the correct distribution (python#102742)
  Add comments to `{typing,_collections_abc}._type_repr` about each other (python#102752)
  pythongh-102594: PyErr_SetObject adds note to exception raised on normalization error (python#102675)
  pythongh-94440: Fix issue of ProcessPoolExecutor shutdown hanging (python#94468)
  pythonGH-100112:  avoid using iterable coroutines in asyncio internally (python#100128)
  pythongh-102690: Use Edge as fallback in webbrowser instead of IE (python#102691)
  pythongh-102660: Fix Refleaks in import.c (python#102744)
  pythongh-102738: remove from cases generator the code related to register instructions (python#102739)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes docs Documentation in the Doc dir
Projects
None yet
Development

No branches or pull requests

4 participants