
Add a sample selection method to random.py #37377

Closed
rhettinger opened this issue Oct 28, 2002 · 33 comments
Labels: stdlib Python modules in the Lib dir

@rhettinger (Contributor)

BPO 629637
Nosy @gvanrossum, @tim-one, @loewis, @rhettinger
Files
  • rand5.diff: sample(n,k) with tests and docs
  • rand6.diff: Cleaned-up sample(population, k)
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2002-11-12.17:52:46.000>
    created_at = <Date 2002-10-28.02:03:10.000>
    labels = ['library']
    title = 'Add a sample selection method to random.py'
    updated_at = <Date 2002-11-12.17:52:46.000>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2002-11-12.17:52:46.000>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2002-10-28.02:03:10.000>
    creator = 'rhettinger'
    dependencies = []
    files = ['4663', '4664']
    hgrepos = []
    issue_num = 629637
    keywords = ['patch']
    message_count = 33.0
    messages = ['41469', '41470', '41471', '41472', '41473', '41474', '41475', '41476', '41477', '41478', '41479', '41480', '41481', '41482', '41483', '41484', '41485', '41486', '41487', '41488', '41489', '41490', '41491', '41492', '41493', '41494', '41495', '41496', '41497', '41498', '41499', '41500', '41501']
    nosy_count = 4.0
    nosy_names = ['gvanrossum', 'tim.peters', 'loewis', 'rhettinger']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue629637'
    versions = ['Python 2.3']

    @rhettinger (Contributor, Author)

    random.randset(n, k) returns a k-length list of unique integers in the range [0, n).

    Improves on a Cookbook submission by using the
    parameters to select between a shuffle algorithm and
    a dictionary algorithm.

    I want to add this to the library because it is a simple,
    robust solution to a general selection problem and
    because it isn't obvious that two different algorithms
    are needed to balance speed/space trade-offs.

    If approved, will add docs and a news item.
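
A minimal sketch of the two-algorithm idea described in this comment (illustrative only; the function name, the factor-of-six threshold, and the details are taken from this discussion, not from the actual patch):

```python
import random

def randset_sketch(n, k):
    """Pick k unique integers from range(n), trading space against speed."""
    if not 0 <= k <= n:
        raise ValueError("sample size must be between 0 and the population size")
    if n <= 6 * k:
        # Small population relative to k: partial Fisher-Yates shuffle of a full list.
        pool = list(range(n))
        for i in range(k):
            j = random.randrange(i, n)
            pool[i], pool[j] = pool[j], pool[i]
        return pool[:k]
    # Large population: track selections in a dictionary, resampling on collisions.
    selected = {}
    result = []
    while len(result) < k:
        j = random.randrange(n)
        if j not in selected:
            selected[j] = True
            result.append(j)
    return result
```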

    @rhettinger self-assigned this Oct 28, 2002
    @rhettinger added the stdlib Python modules in the Lib dir label Oct 28, 2002
    @rhettinger (Contributor, Author)

    Added full patch with news item and docs.

    @rhettinger (Contributor, Author)

    Renamed to random.sample(n,k) to show that it is used
    for sampling without replacement.

    @rhettinger (Contributor, Author)

    Added new version with local variable optimization and
    with the dictionary results returned in selection order.

    @rhettinger (Contributor, Author)

    Martin, do you have time to give this patch a second
    review?

    @loewis (Mannequin) commented Nov 5, 2002

    Can you explain why this needs to be in the standard
    library? I.e. what typical application would use it?

    @rhettinger (Contributor, Author)

    Like shuffle() and choice(), random sampling without replacement is one of the principal use cases for random numbers.

    Acceptance testing often requires a fixed number of non-overlapping samples, e.g. selecting 60 transactions out of 1000 and finding zero errors yields 95% confidence that the population has less than a 5% error rate.
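
A rough check of that confidence figure (a binomial back-of-the-envelope; the exact hypergeometric probability for sampling without replacement is slightly smaller):

```python
# If the true error rate were 5%, the chance of drawing 60 transactions
# and seeing zero errors is roughly (1 - 0.05) ** 60.
print(0.95 ** 60)   # ~0.046, i.e. under 5%, hence ~95% confidence
```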

    Some simulations also need groups of non-overlapping samples, e.g. a lottery result of six unique numbers selected from a range of 1 to 57. An electronic raffle picks consecutive winners without allowing previous winners to be reselected.

    While sampling with replacement is trivial to implement with a list comprehension, sampling without replacement has a number of implementation nuances that make it worthwhile to have a robust solution already implemented in the random library.
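
To make the contrast concrete (illustrative code, not from the patch): sampling with replacement really is a one-liner, while even a naive rejection loop for sampling without replacement shows the nuances (duplicate handling, and running time that degrades as k approaches the population size):

```python
import random

population = list(range(1000))
k = 60

# With replacement: trivial with a list comprehension.
with_replacement = [random.choice(population) for _ in range(k)]

# Without replacement: a naive loop must reject duplicates.
chosen, seen = [], set()
while len(chosen) < k:
    x = random.choice(population)
    if x not in seen:
        seen.add(x)
        chosen.append(x)
```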

    @loewis (Mannequin) commented Nov 5, 2002

    Thanks for the explanation. On to the implementation: How
    did you arrive at the factor of 6 between a dictionary and a
    list?

    The documentation should mention the random optional argument.

    @tim-one (Member) commented Nov 5, 2002

    I agree this is useful, but would rather see Python grow libraries for combinatorial objects. There are many things beyond this that are also useful. For example, the examples you gave here were of selections from collections that aren't range(n), and it would be more useful to more people to have a way to choose k elements from an arbitrary n-element collection directly (like a collection of transactions, or a set of cards, whatever).

    Note that I posted a module to Python-Dev not long ago that
    implements such stuff (CombGen.py), along with other useful
    functions on combinations.

    Note that when k > n/2, "the usual trick" isn't to shuffle a list,
    but to generate a complement selection. For example, if you
    want a random sample of 9999 out of 10000, it's a lot more
    efficient to pick the single element that's *not* in the result.
    See CombGen for code to do this.
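
A sketch of the complement trick Tim describes (illustrative only, not CombGen's actual code): when k is close to n it is cheaper to choose the n - k elements to leave out; note the result then comes back in ascending rather than random order.

```python
import random

def sample_by_complement(n, k):
    """Pick k of range(n) by randomly excluding the other n - k."""
    excluded = set()
    while len(excluded) < n - k:
        excluded.add(random.randrange(n))
    return [i for i in range(n) if i not in excluded]

# Picking 9999 out of 10000 only has to draw the single excluded element.
winners = sample_by_complement(10000, 9999)
```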

    @rhettinger (Contributor, Author)

    Thanks for the quick follow-ups.

    The switchover ratio of six came from counting pointers and longs. Shuffling uses an n-length list, one pointer per element. The dictionary approach stores k entries, each with a hash code, a key pointer, and a value pointer; that is three slots, multiplied by 1.5 and rounded up to five because dict loading is kept under 2/3, plus one pointer in the 'inorder' return list, for a total of six. I also liked six because it keeps resampling in the dictionary approach under 20%.
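
The accounting in the previous paragraph, restated as arithmetic (a sketch of the reasoning only):

```python
import math

dict_slots = 3                            # hash code + key pointer + value pointer
per_entry = math.ceil(dict_slots * 1.5)   # 4.5 rounded up to 5 (dict loading kept under 2/3)
per_sample = per_entry + 1                # plus one pointer in the 'inorder' return list -> 6

# The shuffle approach needs roughly one pointer per population element (n in total),
# so the dictionary approach wins on space roughly when 6 * k < n.
```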

    As requested, I'll add the random argument to the
    documentation.

    Originally, I was going to have sample() select from an arbitrary collection (like choice() does) but, in the end, preferred the current approach of choosing integers. This approach allows sample(1000000, 60) without building a giant list first. Also, converting from indices to elements is trivial: [colorlist[i] for i in random.sample(len(colorlist), 5)].

    I avoided the n/2 complement selection technique
    because of use case rarity and to allow the sample itself to
    be in random order (oxymoron?). If you guys think it's
    necessary, I'll add a complement selection branch
    followed by a call to random.shuffle(). Still, as it stands,
    the code is robust, uses space no larger than a k sized
    dictionary, and runs with no more than 1.2*k calls to
    random().

    I don't know why CombGen.py never made it to
    Tools/scripts. Even if it does, I think a random sampling
    function belongs in the random module where people can
    find it -- it is a very common use of random numbers.

    @rhettinger (Contributor, Author)

    Assigned to GvR for pronouncement on
    a) whether he agrees that a sampling function is useful.
    b) whether to implement it as is or with sequence
    arguments
    c) whether to leave it in random or put in another module.

    The current form returns a list of integers that can be used
    directly or as indices into a sequence. The advantages are
    flexibility in use and the ability to pick a hundred elements
    out of ten million without building a long list first. The
    approach is essentially a uniquified list of calls to
    randrange(). Tim prefers an approach that parallels
    random.choice() where the call looks like:
    random.sample([a,b,c,d,e], 2) # picks 2 of the 5 objects

    I think the function belongs in the random module since it
    is a primary use of random numbers (just like shuffle()
    and choice()). Tim prefers to have a separate library
    module that has a whole grab bag of combinatorics.

    @gvanrossum (Member)

    I'm not even looking at this, I'm delegating this to Tim. He
    knows infinitely more about random and permutations than I
    do, and he's actually used this stuff.

    @rhettinger (Contributor, Author)

    I'm re-learning to hate the patch process. This was a straightforward, thoroughly tested, useful patch. Getting it accepted wasn't supposed to be hard.

    What is the next step: take it as is, convert the n argument to a choice()-style population list, or withdraw the patch?

    @loewis (Mannequin) commented Nov 8, 2002

    Well, I agree that the patch is correct in the sense of
    doing what it says it does. What I cannot judge is whether
    the feature is useful; it looks like bloat to me.

    I could be convinced if you find a user of this function (or the Cookbook recipe) who says "I use it for this and that, and I would prefer to see it in the library for that reason, instead of copying it from the Cookbook."

    I have the feeling that anybody who would use such a
    function would also use ten other "standard" functions which
    are not included in the library at the moment. So that
    person would not be helped with getting the single function;
    he would need an entirely new library of such things.

    So I would propose that you withdraw the patch.

    @tim-one (Member) commented Nov 8, 2002

    Well, you're in murky waters because it's a "new feature" patch rather than a bugfix, and wasn't vetted on Python-Dev or c.l.py or via PEP first; nor is it a function in wide use already, nor one that people have asked for in the "small feature requests" PEP. It appeared out of the blue, and "unsolicited"/undiscussed new features are *usually* hard sells. The alternative is boundless bloat.

    Python went for years without random.shuffle(), and that
    got added because (a) at any given moment, you were
    likely to find a c.l.py discussion about someone's incorrect
    Python code for shuffling; and, (b) how to shuffle was a very
    popular FAQ on the Tutor list. So the demand, and the
    difficulty of rolling your own, were compellingly clear at the
    time.

    In contrast, people asking how to get a random k-combination are almost conspicuous by their absence, which makes the "very common use" claim hard to buy when viewed against the Python community as a whole. The
    handful (an exaggeration -- I only wish there were 5 <wink>)
    who egged me into writing CombGen.py at the time wanted
    much more than *just* that, and CombGen tried to meet all
    expressed desires at the time. I have to agree with Martin
    that people who would use this also want a lot of related
    stuff (I'm one of them).

    Some of the design decisions here remain unclear. Where
    CombGen went out of its way to guarantee that
    combinations are always delivered in "ascending" order, you
    seem to want to guarantee that they appear in a random
    order. Why? Especially since you view these as index
    vectors, ascending order gives the best shot at locality of
    reference when the user does the indirect indexing bit.

    People who intend to use the result as a random starting
    point into the lexicographic or Gray code ordering of k-
    combinations also need ascending order.

    CombGen never went into the std library because I never
    made an attempt to put it there: CombGen never
    attracted a significant audience, and I'm not keen to push
    things into the library that, as far as I can tell, only a few
    people use.

    Since that's the std I hold myself to, it's also the std I'm
    inclined to hold others to.

    In the absence of being able to point to potential users from
    c.l.py threads, let me ask why *you* wrote it. Did you have
    an actual app that needed this function (and if so, what was
    it), or was it more of an interesting programming exercise?

    @rhettinger (Contributor, Author)

    I use the routine for transaction testing in audit work.

    The random order is useful so that subslices of the result are also valid random samples. I run a sample of 60 and test the first 25; if an error is found, the sample expands to the full 60, and if more errors are found, the transaction set does not pass the audit.

    The Cookbook poster also needed the routine in his work and wanted it badly enough to make an excruciating translation of old Fortran code from a textbook. To save bungled re-inventions of the wheel, I crafted a cleaner solution than either my quick-and-dirty version or his translated one.

    @gvanrossum (Member)

    Still, the question remains, why are all these functions so disconnected in their interface. Why does shuffle() take an optional random() function as argument? Why doesn't sample() take a list from which it returns a sample? Why isn't sample() a generator? Etc. These aren't necessarily good questions, but without trying to use these functions, I can't tell. The APIs look pretty random.

    Maybe the random() module is destined to be a random collection of useful statistical hacks? It already looks like that to me now. If that's the case, I'm not against adding some more, but I wish that Raymond would look at Tim's code and suggestions (e.g. complement selection for k > n/2). It does seem to me that a *random* sample falls in the same category as Tim's "generate all samples" code though, so arguably Raymond's sample() would belong in random.py even if CombGen.py were in the standard library.

    Also consider that many uses of random() are inspired by education -- for some reason, teachers like to teach programming using the random() function and its derivatives to write simple games (number guessing), visual effects (brownian motion) and more. random.sample() might well fit in that category. Another potential use category could be simple applied statistics, like Raymond's transaction testing. It seems that such things fill some kind of need (otherwise there wouldn't be two cookbook recipes for it).

    @rhettinger (Contributor, Author)

    FWIW, I did try out the complement selection method for
    k>n/2 but found that it improved performance in some
    cases and worsened it in others. More importantly, it
    interfered with the goal of returning the selections in
    random order. Select 10 raffle winners, give a grand prize,
    2 second prizes, 3 third prizes, and 4 fourth prizes -- the
    results must be in random order so that the grand prize is
    not biased by a non-random ordering.

    If everyone prefers sample(sequence, k) to sample(n,k), I
    will be happy to change it.

    If Tim wants to send me some code to study, that's cool. I
    always learn something from reading his code.

    @gvanrossum (Member)

    Tim's code is at

    http://mail.python.org/pipermail/python-dev/2002-August/028399.html

    If you really need the selection in random order, wouldn't
    it make more sense to apply shuffle() to the resulting list?
    (Applying sort() to the list if you don't want it randomized
    seems backwards.)

    I do find returning a list of indices less intuitive than a
    list of elements.
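
The two orderings being debated, spelled out with the sample(population, k) form that eventually shipped (whether sample() itself should promise random order is exactly the open question here):

```python
import random

picks = random.sample(range(1000), 60)   # selection order

ascending = sorted(picks)                # ascending order, e.g. for locality when indexing
random.shuffle(picks)                    # explicitly randomized order, per Guido's suggestion
```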

    @tim-one (Member) commented Nov 8, 2002

    Guido, you may recall that you used combgen in the
    Mankato project (to generate random, non-overlapping 5(?)-
    word "fingerprints" from email msgs). There are certainly
    valid uses for this stuff, and good algorithms aren't easy.

    combgen resolved the range(n) vs sequence "dilemma" by
    providing both, where the former was primarily for speed
    freaks, and the latter was implemented via has-a of the
    former. Both are useful, and the former is *essential* in
    some cases (e.g., picking 3 out of a billion -- as Raymond
    says, you can't well materialize an explicit list of a billion
    elements first). So as a basic building block, range(n) is
    more useful. OTOH, users often don't see how to build
    what they want out of basic blocks.

    About random vs sorted, Raymond provided a plausible
    use case. Nobody brought that up when I was doing
    combgen, but it's another thing different apps may want
    done differently. Purely from an efficiency view, it's quicker
    not to guarantee ascending order (combgen sorts under
    the covers), so in that way Raymond's range(n) gimmick is
    even more of a speed-freak basic building block than
    combgen's CombGenBasic class.

    It's always a puzzle figuring out where things belong.
    combgen didn't start life doing random combinations -- it
    started because merely computing the number of k-
    combinations (of n things) *is* a frequent question (how
    many poker hands are there? bridge hands?), and an
    efficient algorithm for computing that isn't obvious either.
    Start from there, and it's soon apparent that there are
    many algorithms involving combinations, so much so that if
    you're working in this area, a class capturing the concept is
    very useful.
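
The counting question Tim mentions can be answered without huge intermediate factorials by the multiplicative formula (a sketch; present-day Python exposes this as math.comb):

```python
def comb(n, k):
    """Number of k-combinations of n things."""
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) // i   # each division is exact
    return result

print(comb(52, 5))    # 2598960 poker hands
print(comb(52, 13))   # 635013559600 bridge hands
```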

    Ideally, Python would have a package for combinatorial
    objects, and modules therein would tackle combinations,
    permutations, partitions, and possibly basic graph
    algorithms. combgen was meant to be a start at that, but
    it ended there too.

    So that's a mild dilemma: if we put one of these in, a small
    but probably growing user base will want "more of the
    same", and random.py isn't even arguably the right place to
    put any of the rest.

    As to how straightforward even this is, I expect this is the
    only patch in Python history to have 10 versions attached
    <wink>.

    @rhettinger (Contributor, Author)

    As requested, revised patch to accept a population
    sequence instead of an index range.

    Now that xrange() is fixed (a separate issue), this patch will also serve to choose from large integer sequences without building the whole sequence first: sample(xrange(10000000), 60).
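
The idiom in question, as it works with the API that shipped (xrange in Python 2; in Python 3, range itself is lazy):

```python
import random

# Choose 60 distinct integers from a population of ten million
# without materializing a ten-million-element list first.
winners = random.sample(range(10000000), 60)      # Python 3
# winners = random.sample(xrange(10000000), 60)   # Python 2
```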

    @rhettinger (Contributor, Author)

    P.S. The code continues to use the index list internally.
    This leaves the original pool unmolested and allows the
    use of xrange(n) as an argument.

    By not using the population elements as dictionary keys,
    no assumptions need to be made about the uniqueness of
    the population list. A weighted population is valid:
    sample('red red red blue blue'.split(), 3)
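
That property carried through to the function as shipped: because the selection is done on indices internally, duplicate (weighted) population elements are handled correctly.

```python
import random

population = 'red red red blue blue'.split()   # 'red' weighted 3:2 over 'blue'
draw = random.sample(population, 3)            # e.g. ['red', 'blue', 'red']

# Every underlying position is equally likely to be picked, so 'red'
# shows up in the result more often than 'blue' on average.
```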

    @tim-one (Member) commented Nov 8, 2002

    I'd rather you went back to the original scheme -- as
    a "speed-freak basic building block", sticking to implicit
    range(n) was clear, and nobody who wants that behavior is
    going to guess that passing xrange(n) might work in the
    new scheme.

    If random order is a promise of this method, then that must
    be documented. As is, the docs are silent about order, so
    any order meets the spec. If it's important that it be
    random, then the docs have to constrain implementations;
    if it's not important, you can't use it as an argument <wink>.

    The return type isn't documented and should be, esp. if you
    want to stick to the new scheme. That it always returns a
    list will be surprising (if I pass, e.g., a string, I *expect* a
    string of length k to come back; or if a tuple, a tuple of
    length k, etc. -- this became clear from combgen's users,
    and is another reason sticking to the basic building block
    function is better -- we put this in, and next thing is a
    feature request to return a sequence of the same type as
    the input).
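
For reference, the behavior that shipped is the one Tim predicts will surprise people: the result is always a list, whatever the input sequence type.

```python
import random

random.sample('abcdef', 2)        # e.g. ['c', 'a']  (a list, not a 2-character string)
random.sample((10, 20, 30), 2)    # e.g. [30, 10]    (a list, not a tuple)
```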

    Comments about use case subtleties, and algorithm
    obscurities, belong in the docs and in code comments more
    than in patch comments.

    You surely don't want to hear this next one <wink>, but the
    patch appears to be missing test cases.

    @rhettinger (Contributor, Author)

    Done. Added revised patches for sample(population,k)
    and for sample(n,k). Take your pick.

    FYI, to interpret the generator test, the expected standard
    deviation for a uniform distribution is sqrt(((n**2)-1) / 12).
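
A quick empirical check of that formula, which is the standard deviation of a discrete uniform distribution over 0..n-1:

```python
import math
import random
import statistics

n = 10000
draw = random.sample(range(n), 5000)

print(statistics.pstdev(draw))        # close to the theoretical value below
print(math.sqrt((n ** 2 - 1) / 12))   # ~2886.75 for n = 10000
```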

    @rhettinger (Contributor, Author)

    Neatened up the patch for random.sample(population, k). Sped up the tests, eliminated the final map, and clarified the docs. Using xrange(n) as an argument is shown in both the docs and the docstring so that people won't have to be clever or original.

    I think this one is ready for prime time and would be a
    happy fellow if it got blessed.

    @gvanrossum (Member)

    But I thought Tim recommends sample(n, k)?

    @rhettinger (Contributor, Author)

    Tim, which do you prefer?

    Rand6.diff is on the launch pad, ready to go. random.sample(population, k) is now as lean and mean as sample(n, k); the xrange() idiom is thoroughly documented and tested; and the sample(population, k) approach is now my favorite.

    Still, rand5.diff is also ready to go. It documents how to
    convert from indices to elements.

    @rhettinger (Contributor, Author)

    Can I commit rand6.diff and be done with this one?

    At one time, sample(n, k) looked better because the code was simpler, faster, and the use of xrange(n) in sample(population, k) wasn't obvious.

    As of rand6.diff, sample(population,k) is equally fast and
    simple. The use of xrange(n) is thoroughly documented
    and has no performance penalty.

    It's now faster and easier to express sample(n, k) in terms of sample(population, k) than vice versa. Also, sample(population, k) has the friendlier interface. So it is the one I recommend.

    @gvanrossum (Member)

    Tim, are you still hesitant? I think this is fine.

    @tim-one (Member) commented Nov 11, 2002

    Sorry, I have a lot on my plate, and this one overshot its
    budget by an hour already. I'll get back to it later today.

    @tim-one (Member) commented Nov 11, 2002

    Accepted, and back to Raymond <whew!>. rand6.diff it is.

    Comments:

    + I doubt the xrange trick will work in Python 3, but time
    enough for that in coming years.

    + It's awfully obscure why "return pool[-k:]" can't do a wrong thing when k is 0 (see the slicing sketch below).

    + Vertical space isn't at a premium here -- no need to
    squash if and if-controlled code onto the same line.

    + The test code will never be run (nobody runs
    random.py). That's not your fault. We should think about
    making a real test out of random.py's quarter-attempt at
    testing itself.

    + It looks to be a pleasant new facility. Good job!
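
The slicing subtlety behind the pool[-k:] bullet above, spelled out (this is a general Python slicing fact, not code from the patch):

```python
pool = [10, 20, 30, 40]
k = 0

pool[-k:]   # == pool[-0:] == pool[0:] == [10, 20, 30, 40], the whole list!
pool[:k]    # == pool[:0]  == [], which is what a 0-element sample should be
```

Raymond's reply below notes that the branch in question is never reached when k is 0, so the trap is avoided.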

    @rhettinger (Contributor, Author)

    Committed as random.py 1.36 and librandom.tex 1.31.

    Thanks Tim, Martin, and GvR for the reviews.

    Tim's comments:

    + If xrange disappears in Py3, objects defining __getitem__
    will live on.

    + When k is zero, the block with "return pool[:k]" is never
    executed; the dictionary block runs instead.

    + Expanded single line ifs into multi-line

    + The tests in random.py get run only when we are
    maintaining the module. When I put in the Mersenne
    Twister, will make separate unittests that get run all the
    time and that test the module thoroughly.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022