Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow weights in random.choice #63044

Closed
aisaac mannequin opened this issue Aug 26, 2013 · 69 comments
Closed

allow weights in random.choice #63044

aisaac mannequin opened this issue Aug 26, 2013 · 69 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@aisaac
Copy link
Mannequin

aisaac mannequin commented Aug 26, 2013

BPO 18844
Nosy @tim-one, @rhettinger, @mdickinson, @pitrou, @serhiy-storchaka, @NeilGirdhar, @applio
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • weighted_choice.diff: Preliminary implementation of random.choice optional arg "weights"
  • weighted_choice_v2.diff: Move cumulative distribution calculation to separate function that returns an index generator
  • weighted_choice_generator.patch
  • wcg_bench.py: Benchmarking different methods
  • weighted_choice_generator_2.patch
  • weighted_choice_v3.diff: a new implementation of weighted choice
  • weighted_choice_v3.patch: weighted choice function
  • weighted_choice_v4.patch
  • weighted_choice_v5.patch: weighted choice function v5
  • weighted_choice.diff
  • weighted_choice2.diff: Add docs and test
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2016-09-07.00:17:01.460>
    created_at = <Date 2013-08-26.17:23:45.101>
    labels = ['type-feature', 'library']
    title = 'allow weights in random.choice'
    updated_at = <Date 2017-03-31.16:36:29.131>
    user = 'https://bugs.python.org/aisaac'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:29.131>
    actor = 'dstufft'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2016-09-07.00:17:01.460>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2013-08-26.17:23:45.101>
    creator = 'aisaac'
    dependencies = []
    files = ['31479', '31547', '31732', '31734', '36331', '42322', '42323', '42386', '42393', '44394', '44407']
    hgrepos = []
    issue_num = 18844
    keywords = ['patch', 'needs review']
    message_count = 69.0
    messages = ['196229', '196234', '196235', '196252', '196551', '196567', '196709', '196711', '196716', '196721', '196728', '196731', '196741', '196750', '196761', '196767', '197507', '197512', '197540', '197862', '197865', '197866', '198367', '198372', '223750', '224947', '224949', '224953', '224954', '224957', '225128', '225133', '225137', '225140', '225148', '226891', '262625', '262626', '262642', '262649', '262652', '262656', '262678', '262744', '262967', '262970', '262971', '262981', '262982', '262983', '262994', '262995', '267782', '272767', '272785', '274538', '274677', '274684', '274686', '274760', '274907', '274964', '277485', '277486', '277487', '278516', '278633', '279701', '279702']
    nosy_count = 14.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'pitrou', 'aisaac', 'westley.martinez', 'python-dev', 'serhiy.storchaka', 'NeilGirdhar', 'madison.may', 'dkorchem', 'Christian.Kleineidam', 'davin', 'xksteven']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18844'
    versions = ['Python 3.6']

    @aisaac
    Copy link
    Mannequin Author

    aisaac mannequin commented Aug 26, 2013

    The need for weighted random choices is so common that it is addressed as a "common task" in the docs:
    http://docs.python.org/dev/library/random.html

    This enhancement request is to add an optional argument to random.choice, which must be a sequence of non-negative numbers (the weights) having the same length as the main argument.

    @aisaac aisaac mannequin added the type-feature A feature request or enhancement label Aug 26, 2013
    @serhiy-storchaka serhiy-storchaka added the stdlib Python modules in the Lib dir label Aug 26, 2013
    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Aug 26, 2013

    +1. I've found myself in need of this feature often enough to wonder why it's not part of the stdlib.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 26, 2013

    Agreed with the feature request. The itertools dance won't be easy to understand, for many people.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Aug 26, 2013

    I realize its probably quite early to begin putting a patch together, but here's some preliminary code for anyone interested. It builds off of the "common task" example in the docs and adds in validation for the weights list.

    There are a few design decisions I'd like to hash out.
    In particular:

    • Should negative weights cause a ValueError to be raised, or should they be converted to 0s?
    • Should passing a list full of zeros as the weights arg raise a ValueError or be treated as if no weights arg was passed?

    @mdickinson
    Copy link
    Member

    mdickinson commented Aug 30, 2013

    [Madison May]

    • Should negative weights cause a ValueError to be raised, or should they be converted to 0s?
    • Should passing a list full of zeros as the weights arg raise a ValueError or be treated as if no weights arg was passed?

    Both those seem like clear error conditions to me, though I think it would be fine if the second condition produced a ZeroDivisionError rather than a ValueError.

    I'm not 100% sold on the feature request. For one thing, the direct implementation is going to be inefficient for repeated sampling, building the table of cumulative sums each time random.choice is called. A more efficient approach for many use-cases would do the precomputation once, returning some kind of 'distribution' object from which samples can be generated. (Walker's aliasing method is one route for doing this efficiently, though there are others.) I agree that this is a commonly needed and commonly requested operation; I'm just not convinced either that an efficient implementation fits well into the random module, or that it makes sense to add an inefficient implementation.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Aug 30, 2013

    [Mark Dickinson]

    Both those seem like clear error conditions to me, though I think it would be fine if the second condition produced a ZeroDivisionError rather than a ValueError.

    Yeah, in hindsight it makes sense that both of those conditions should raise errors. After all: "Explicit is better than implicit".

    As far as optimization goes, could we potentially use functools.lru_cache to cache the cumulative distribution produced by the weights argument and optimize repeated sampling?

    Without @lru_cache:
    >>> timeit.timeit("x = choice(list(range(100)), list(range(100)))", setup="from random import choice", number=100000)
    36.7109281539997
    
    With @lru_cache(max=128):
    >>> timeit.timeit("x = choice(list(range(100)), list(range(100)))", setup="from random import choice", number=100000)
    6.6788657720007905

    Of course it's a contrived example, but you get the idea.

    Walker's aliasing method looks intriguing. I'll have to give it a closer look.

    I agree that an efficient implementation would be preferable but would feel out of place in random because of the return type. I still believe a relatively inefficient addition to random.choice would be valuable, though.

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 1, 2013

    +1 for the overall idea. I'll take a detailed look at the patch when I get a chance.

    @rhettinger rhettinger self-assigned this Sep 1, 2013
    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 1, 2013

    The sticking point is going to be that we don't want to recompute the cumulative weights for every call to weighted_choice.

    So there should probably be two functions:

      cw = make_cumulate_weights(weight_list) 
      x = choice(choice_list, cw)

    This is similar to what was done with string.maketrans() and str.translate().

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 1, 2013

    A more efficient approach for many use-cases would do the precomputation once, returning some kind of 'distribution' object from which samples can be generated.

    I like the idea about adding a family of distribution generators. They should check input parameters and make a precomputation and then generate infinite sequence of specially distributed random numbers.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 1, 2013

    [Raymond Hettinger]

    The sticking point is going to be that we don't want to recompute the
    cumulative weights for every call to weighted_choice.

    So there should probably be two functions:

    cw = make_cumulate_weights(weight_list)
    x = choice(choice_list, cw)

    That's pretty much how I broke things up when I decided to test out optimization with lru_cache. That version of the patch is now attached.

    [Serhiy Storchaka]

    I like the idea about adding a family of distribution generators.
    They should check input parameters and make a precomputation and then > generate infinite sequence of specially distributed random numbers.

    Would these distribution generators be implemented internally (see attached patch) or publicly exposed?

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 1, 2013

    Would these distribution generators be implemented internally (see attached patch) or publicly exposed?

    See bpo-18900. Even if this proposition will be rejected I think we should publicly expose weighted choice_generator(). A generator or a builder which returns function are only ways how efficiently implement this feature. Use lru_cache isn't good because several choice generators can be used in a program and because it left large data in a cache long time after it was used.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 1, 2013

    Use lru_cache isn't good because several choice generators can be used in a program and because it left large data in a cache long time after it was used.

    Yeah, I just did a quick search of the stdlib and only found one instance of lru_cache in use -- another sign that lru_cache is a bad choice.

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 1, 2013

    I like the idea about adding a family of distribution generators

    Let's stay focused on the OP's feature request for a weighted version of choice().

    For the most part, it's not a good idea to "just add" a family of anything to the standard library. We wait for user requests and use cases to guide the design and error on the side of less, rather than more. This helps avoid bloat. Also, it would be a good idea to start something like this as a third-party to module to let it iterate and mature before deciding whether there was sufficient user uptake to warrant inclusion in the standard library.

    For the current request, we should also do some research on existing solutions in other languages. This isn't new territory. What do R, SciPy, Fortran, Matlab or other statistical packages already do? Their experiences can be used to inform our design. Alan Kay's big criticism of Python developers is that they have a strong propensity invent from scratch rather than taking advantage of the mountain of work done by the developers who came before them.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 1, 2013

    What do R, SciPy, Fortran, Matlab or other statistical packages already do?

    Numpy avoids recalculating the cumulative distribution by introducing a 'size' argument to numpy.random.choice(). The cumulative distribution is calculated once, then 'size' random choices are generated and returned.

    Their overall implementation is quite similar to the method suggested in the python docs.

    >> choices, weights = zip(*weighted_choices)
    >> cumdist = list(itertools.accumulate(weights))
    >> x = random.random() * cumdist[-1]
    >> choices[bisect.bisect(cumdist, x)]

    The addition of a 'size' argument to random.choice() has already been discussed (and rejected) in bpo-18414, but this was on the grounds that the standard idiom for generating a list of random choices ([random.choice(seq) for i in range(k)]) is obvious and efficient.

    @westleymartinez
    Copy link
    Mannequin

    westleymartinez mannequin commented Sep 2, 2013

    Honestly, I think adding weights to any of the random functions are trivial enough to implement as is. Just because something becomes a common task does not mean it ought to be added to the stdlib.

    Anyway, from a user point of view, I think it'd be useful to be able to send a sequence to a function that'll weight the sequence for use by random.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 2, 2013

    Just ran across a great blog post on the topic of weighted random generation from Eli Bendersky for anyone interested:
    http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 11, 2013

    The proposed patch add two methods to the Random class and two module level functions: weighted_choice() and weighted_choice_generator().

    weighted_choice(data) accepts either mapping or sequence and returns a key or index x with probability which is proportional to data[x].

    If you need several elements with same distribution, use weighted_choice_generator(data) which returns an iterator which produces random keys or indices of the data. It is more faster than calling weighted_choice(data) repeatedly and is more flexible than generating a list of random values at specified size (as in NumPy).

    @NeilGirdhar
    Copy link
    Mannequin

    NeilGirdhar mannequin commented Sep 12, 2013

    Should this really be implemented using the cumulative distribution and binary search algorithm? Vose's Alias Method has the same initialization and memory usage cost (O(n)), but is constant time to generate each sample.

    An excellent tutorial is here: http://www.keithschwarz.com/darts-dice-coins/

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 12, 2013

    Thank you Neil. It is interesting.

    Vose's alias method has followed disadvantages (in comparison with the roulette wheel selection proposed above):

    1. It operates with probabilities and uses floats, therefore it can be a little less accurate.

    2. It consumes two random number (an integer and a float) for generating one sample. It can be fixed however (in the cost of additional precision lost).

    3. While it has same time and memory O(n) cost for initialization, it has larger multiplication, Vose's alias method requires several times larger time and memory for initialization.

    4. It requires more memory in process of generating samples.

    However it has an advantage. It really has constant time cost to generate each sample.

    Here are some benchmark results. "Roulette Wheel" is proposed above implementation. "Roulette Wheel 2" is its modification with normalized cumulative sums. It has twice more initialization time, but 1.5-2x faster generates each sample. "Vose's Alias" is an implementation of Vose's alias method directly translated from Java. "Vose's Alias 2" is optimized implementation which uses Python specific.

    Second column is a size of distribution, third column is initialization time (in milliseconds), fourth column is time to generate each sample (in microseconds), fifth column is a number of generated samples after which this method will overtake "Roulette Wheel" (including initialization time).

    Roulette Wheel 10 0.059 7.165 0
    Roulette Wheel 2 10 0.076 4.105 5
    Vose's Alias 10 0.129 13.206 -
    Vose's Alias 2 10 0.105 6.501 69
    Roulette Wheel 100 0.128 8.651 0
    Roulette Wheel 2 100 0.198 4.630 17
    Vose's Alias 100 0.691 12.839 -
    Vose's Alias 2 100 0.441 6.547 148
    Roulette Wheel 1000 0.719 10.949 0
    Roulette Wheel 2 1000 1.458 5.177 128
    Vose's Alias 1000 6.614 13.052 -
    Vose's Alias 2 1000 3.704 6.531 675
    Roulette Wheel 10000 7.495 13.249 0
    Roulette Wheel 2 10000 14.961 6.051 1037
    Vose's Alias 10000 69.937 13.830 -
    Vose's Alias 2 10000 37.017 6.746 4539
    Roulette Wheel 100000 73.988 16.180 0
    Roulette Wheel 2 100000 148.176 8.182 9275
    Vose's Alias 100000 690.099 13.808 259716
    Vose's Alias 2 100000 391.367 7.095 34932
    Roulette Wheel 1000000 743.415 19.493 0
    Roulette Wheel 2 1000000 1505.409 8.930 72138
    Vose's Alias 1000000 7017.669 13.798 1101673
    Vose's Alias 2 1000000 4044.746 7.152 267507

    As you can see Vose's alias method has very large initialization time. Non-optimized version will never overtake "Roulette Wheel" with small distributions (<100000), and even optimized version will never overtake "Roulette Wheel" with small distributions (<100000). Only with very large distributions Vose's alias method has an advantage (when you needs very larger number of samples).

    Because for generating only one sample we need a method with fastest initialization we need "Roulette Wheel" implementation. And because large distributions are rare, I think there is no need in alternative implementation. In worst case for generating 1000000 samples from 1000000-elements distribution the difference between "Roulette Wheel" and "Vose's Alias 2" is a difference between 20 and 11 seconds.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 16, 2013

    Serhiy, from a technical standpoint, your latest patch looks like a solid solution. From an module design standpoint we still have a few options to think through, though. What if random.weighted_choice_generator was moved to random.choice_generator and refactored to take an array of weights as an optional argument? Likewise, random.weighted_choice could still be implemented with an optional arg to random.choice. Here's the pros and cons of each implementation as I see them.

    Implementation: weighted_choice_generator + weighted_choice
    Pros:
    Distinct functions help indicate that weighted_choice should be used in a different manner than choice -- [weighted_choice(x) for _ in range(n)] isn't efficient.
    Can take Mapping or Sequence as argument.
    Has a single parameter
    Cons:
    Key, not value, is returned
    Requires two new functions
    Dissimilar to random.choice
    Long function name (weighted_choice_generator)

    Implementation: choice_generator + optional arg to choice
    Pros:
    Builds on existing code layout
    Value returned directly
    Only a single new function required
    More compact function name

    Cons:
    Difficult to support Mappings
    Two args required for choice_generator and random.choice
    Users may use [choice(x, weights) for _ in range(n)] expecting efficient results

    @westleymartinez
    Copy link
    Mannequin

    westleymartinez mannequin commented Sep 16, 2013

    I think Storchaka's solution is more transparent and I agree with him on the point that the choice generator should be exposed.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 16, 2013

    I think Storchaka's solution is more transparent and I agree with him on the point that the choice generator should be exposed.

    Valid point -- transparency should be priority #1

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 24, 2013

    Most existing implementation produce just index. That is why weighted_choice() accepts singular weights list and returns index. On the other hand, I think working with mapping will be wished feature too (especially because Counter is in stdlib). Indexable sequences and mappings are similar. In both cases weighted_choice() returns value which can be used as index/key of input argument.

    If you need choice an element from some sequence, just use seq[weighted_choice(weights)]. Actually weighted_choice() has no common code with choice() and has too different use cases. They should be dissimilar as far as possible. Perhaps we even should avoid the "choice" part in function names (are there any ideas?) to accent this.

    @MadisonMay
    Copy link
    Mannequin

    MadisonMay mannequin commented Sep 24, 2013

    You have me convinced, Serhiy. I see the value in making the two functions distinct.

    For naming purposes, perhaps weighted_index() would be more descriptive.

    @mdickinson
    Copy link
    Member

    mdickinson commented Jul 23, 2014

    Closed bpo-22048 as a duplicate of this one.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Aug 6, 2014

    Raymond, what is your opinion?

    @pitrou
    Copy link
    Member

    pitrou commented Aug 6, 2014

    I don't want to speak for Raymond, but the proposed API looks good, and it seems "Roulette Wheel 2" should be the implementation choice given its characteristics (simple, reasonably good and balanced performance).

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Mar 30, 2016

    I disagree.

    My patch adds two functions because they serve two different purposes. weighted_choice() returns one random value as other functions in the random module. weighted_choice_generator() provides more efficient way to generate random values, since startup cost is more significant than for other random value generators. Generators are widely used in Python, especially in Python 3. If they considered confusing, we should deprecate builtins map(), filter(), zip() and the itertools module at first place.

    Your function, Steven, returns a list containing one random value by default. It does not match the interface of other functions in the random module. It matches the interface of NumPy random module. In Python you need two separate functions, one that returns single value, and other that returns a list of values. But returning iterator and generating values by demand is more preferable in Python 3. Generatorsa are more flexible. With weighted_choice_generator() it is easy to get the result of your function: list(islice(weighted_choice_generator(data), amount)). But generating dynamic amount of values with your interface is impossible.

    Raymond, if you have now free time, could you please make a review of weighted_choice_generator_2.patch?

    @xksteven
    Copy link
    Mannequin

    xksteven mannequin commented Mar 30, 2016

    Hey serhiy.storchaka

    I can edit the code to output just one value if called with simply a list and then return a list of values if called with the optional amount parameter. My code also needs to check that amount >= 1.

    My code was mostly just to restart this discussion as I personally like the idea of the function for weighted choice and would like it to be standard in the random library.

    I have no qualms with adding both weighted_choice and weighted_choice_generator but my concern is mostly that you are asking too much and it won't go through by trying to add two functions at the same time. The other thing is that I believe that weighted_choice could suffice with just one function call.

    I just think my last concern is that generators are different from the other functions in random.py. Whereas they are more intuitive and accepted in the builtins like map and zip etc. There isn't any other functions in the random library that return that type of object when called. They instead return a numerical result.

    Those are my concerns and hence why I rewrote the code.

    @ChristianKleineidam
    Copy link
    Mannequin

    ChristianKleineidam mannequin commented Apr 1, 2016

    A user can use map(), filter(), zip() without knowing anything about generators. In most cases those function will do their magic and provide a finite number of outputs.

    The weighted_choice_generator on the other hand isn't as easy to use. If the user wants 5 values from it, they need to know about take() from itertools or call next().

    @westleymartinez
    Copy link
    Mannequin

    westleymartinez mannequin commented Apr 6, 2016

    I still like Serhiy's implementation more. A function that returns a list instead of the item is unnatural and doesn't fit with the rest of the module.

    I think there's need to be some discussion about use cases. What do users actually want? Maybe post this on the ideas list.

    @xksteven
    Copy link
    Mannequin

    xksteven mannequin commented Apr 6, 2016

    Okay so I added a few lines of code. One to make it return a single number if amount == 1 and the other to check that the amount > 1.

    The main difference I've noticed between this implementation and previous versions compared to say R is that in R they provide a boolean flag to ask if sampling with replacement.

    Here's there documentation and source code:
    https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/library/base/man/sample.Rd

    https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/library/base/R/sample.R

    Maybe someone else can comment more on the use cases. I can only say for myself that I've needed this function plenty of times when working with samples that have a non uniform distribution.

    @xksteven
    Copy link
    Mannequin

    xksteven mannequin commented Apr 6, 2016

    I reuploaded the file. The spacing on the if amount < 1 was off. Hopefully its fixed now.

    @mdickinson
    Copy link
    Member

    mdickinson commented Apr 7, 2016

    One to make it return a single number if amount == 1 and the other to check that the amount > 1.

    I think that's a dangerous API. Any code making a call to "weighted_choice(..., amount=n)" for variable n now has to be prepared to deal with two possible result types. It would be easy to introduce buggy code that fails in the corner case n = 1.

    @mdickinson
    Copy link
    Member

    mdickinson commented Apr 7, 2016

    One to make it return a single number if amount == 1 and the other to check that the amount > 1.

    Suggestion: if you want to go that way, return a single number if amount is not provided (so make the default value for amount None rather than 1). If amount=1 is explicitly given, a list containing one item should be returned.

    I also think there's no reason to raise an exception when amount = 0: just return an empty list.

    For comparison, here's NumPy's "uniform" generator, which generates a scalar if the "size" parameter is not given, and an array if "size" is given, even if it's 1.

    >>> np.random.uniform()
    0.4964992470265117
    >>> np.random.uniform(size=1)
    array([ 0.64817717])
    >>> np.random.uniform(size=0)
    array([], dtype=float64)

    @pitrou
    Copy link
    Member

    pitrou commented Apr 7, 2016

    Suggestion: if you want to go that way, return a single number if amount is not provided (so make the default value for amount None rather than 1). If amount=1 is explicitly given, a list containing one item should be returned.

    +1

    @xksteven
    Copy link
    Mannequin

    xksteven mannequin commented Apr 7, 2016

    Re-implemented with suggested improvements taken into account. Thanks @mark.dickinson and @pitrou for the suggestions.

    I also removed the redundant "fast path" portion for this code since it doesn't deal with generators anyways.

    Let me know additional thoughts about it.

    @xksteven
    Copy link
    Mannequin

    xksteven mannequin commented Apr 7, 2016

    Left in a line of code that was supposed to be removed. Fixed.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jun 8, 2016

    Raymond, do you have a time for this issue?

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Aug 15, 2016

    Raymond, any chance to get weighted random choices generator in 3.6? Less than month is left to feature code freeze.

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Aug 15, 2016

    FWIW, I have four full days set aside for the upcoming pre-feature release sprint which is dedicated to taking time to thoughtfully evaluate pending feature requests. In the meantime, I'm contacting Alan Downey for a consultation for the best API for this. As mentioned previously, the generator version isn't compatible with the design of the rest of the module that allows streams to have their state saved and restored at arbitrary points in the sequence. One API would be to create a list all at once (like random.sample does). Another would be to have two steps (like str.maketrans and str.translate). Ideally, the API should integrate neatly with collections.Counter as a possible input for the weighting. Hopefully, Alan can also comment on the relative frequency of small integer weightings versus the general case (the former benefits from a design using random.choice() applied to Counter.elements() and the latter benefits from a design with accumulate() and bisect()). Note, this is a low priority feature (no real demonstrated need, there is already a recipe for it in the docs, and once the best API have been determined, the code is so simple that any of us could implement it in only a few minutes).

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 6, 2016

    Latest draft patch attached (w/o tests or docs).
    Incorporates consultation from Alan Downey and Jake Vanderplas.

    • Population and weights are separate arguments (like numpy.random.choice() and sample() in R). Matches the way data would arrive in Pandas. Easily extracted from a Counter or dict using keys() and values(). Suitable for applications that sample the population multiple times but using different weights. See https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html

    • Excludes a replacement=False option. That use case necessarily has integer weights and may be better suited to the existing random.sample() rather than trying to recompute a CDF on every iteration as we would have to in this function.

    • Allows cumulative_weights to be submitted instead of individual weights. This supports uses cases where the CDF already exists (as in the ThinkBayes examples) and where we want to periodically reuse the same CDF for repeated samples of the same population -- this occurs in resampling applications, Gibbs sampling, and Monte Carlo Markov Chain applications. Per Jake, "MCMC/Gibbs Sampling approaches generally boil down to a simple weighted coin toss at each step" and "It's definitely common to do aggregation of multiple samples, e.g. to compute sample statistics"

    • The API allows the weights to be integers, fractions, decimals, or floats. Likewise, the population and weights can be any Sequence. Population elements need not be hashable.

    • Returning a list means that the we don't have to save state in mid-stream (that is why we can't use a generator). A list feeds nicely into Counters, mean, median, stdev, etc for summary statistics. Returning a list parallels what random.sample() does, keeping the module internally consistent.

    • Default uniform weighting falls back to random.choice() which would be more efficient than bisecting.

    • Bisecting tends to beat other approaches in the general case. See http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python

    • Incorporates error checks for len(population)==len(cum_weights) and for conflicting specification of both weights and cumulative weights.

    There API is not perfect and there are some aspects that give me heartburn. 1) Not saving the computed CDF is waste and forces the user to pre-build the CDF if they want to save it for later use (the API could return both the selections and the CDF but that would be awkward and atypical). 2) For the common case of having small integer weights on a small population, the bisecting approach is slower than using random.choice on a population expanded to include the selections multiple times in proportion to their weights (that said, short of passing in a flag, there is no cheap easy way for this function to detect that case and give it a fast path). 3) Outputting a list is inefficient if all you're doing with result is summarizing it with a Counter, histogram tool, mean, median, or stdev. 4) There is no cheap way to check to see if the user supplied cum_weights is sorted or if the weights contain negative values.

    @applio
    Copy link
    Member

    applio commented Sep 7, 2016

    I've gone through the patch -- looks good to me.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 7, 2016

    New changeset a5856153d942 by Raymond Hettinger in branch 'default':
    Issue bpo-18844: Add random.weighted_choices()
    https://hg.python.org/cpython/rev/a5856153d942

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 7, 2016

    Thanks Davin.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 7, 2016

    1. Returning a list instead of an iterator looks unpythonic to me. Values generated sequentially, there are no advantages of returning a list.

    2. An implementation lacks optimizations used in my patch.

    3. The documentation still contains a receipt for weighted choice. It is incompatible with new function.

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 7, 2016

    There isn't really an option to return a generator because it conflicts the rest of the module that uses lists elsewhere and that allows state to be saved and restored before and after any function call. One of the design reviewers also said that the generator form would harder for students to use.

    I left the text in the examples section unchanged because it is still valid (showing how to make a cumulative distribution and how to build a fast alternative for the special case of small integer weights). Before the 3.6 release, I expect to expand this section to provide recipes for a MCMC application (using choices() with a passed-in CDF) and some other examples suggested by the design reviewers.

    The optimization hacks in the other patch don't seem worth it. The current code is readable and runs fast (the principal steps are all C functions).

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Sep 8, 2016

    Using a generator doesn't prevents state to be saved and restored.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 27, 2016

    New changeset 39a4be5e003d by Raymond Hettinger in branch '3.6':
    Issue bpo-18844: Make the number of selections a keyword-only argument for random.choices().
    https://hg.python.org/cpython/rev/39a4be5e003d

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 27, 2016

    Equidistributed examples:

    choices(c.execute('SELECT name FROM Employees').fetchall(), k=20)
    choices(['hearts', 'diamonds', 'spades', 'clubs'], k=5)
    choices(list(product(card_facevalues, suits)), k=5)
    

    Weighted selection examples:

      Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=3800))   # american roulette
      Counter(choices(['hit', 'miss'], [5, 1], k=600))                   # russian roulette
      choices(fetch('employees'), fetch('years_of_service'), k=100)      # tenure weighted
      choices(cohort, map(cancer_risk, map(risk_factors, cohort)), k=50) # risk weighted

    Star unpacking example:

       transpose = lambda s: zip(*s)
       craps = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 5), (9, 4), (10, 3), (11, 2), (12, 1)]
       print(choices(*transpose(craps), k=10))

    Comparative APIs from other languages:

    http://www.mathworks.com/help/stats/randsample.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
    https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
    https://reference.wolfram.com/language/ref/RandomChoice.html
    

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Sep 27, 2016

    ###################################################################
    # Flipping a biased coin

    from collections import Counter
    from random import choices

    print(Counter(choices(range(2), [0.9, 0.1], k=1000)))

    ###################################################################
    # Bootstrapping

    'From a small statistical sample infer a 90% confidence interval for the mean'
    # http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm

    from statistics import mean
    from random import choices
    
    data = 1, 2, 4, 4, 10
    means = sorted(mean(choices(data, k=5)) for i in range(20))
    print('The sample mean of {:.1f} has a 90% confidence interval from {:.1f} to {:.1f}'.format(
      mean(data), means[1], means[-2]))

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 12, 2016

    New changeset 433cff92d565 by Raymond Hettinger in branch '3.6':
    Issue bpo-18844: Fix-up examples for random.choices(). Remove over-specified test.
    https://hg.python.org/cpython/rev/433cff92d565

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 14, 2016

    New changeset d4e715e725ef by Raymond Hettinger in branch '3.6':
    Issue bpo-18844: Add more tests
    https://hg.python.org/cpython/rev/d4e715e725ef

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 29, 2016

    New changeset 32bfc81111b6 by Raymond Hettinger in branch '3.6':
    Issue bpo-18844: Make the various ways for specifing weights produce the same results.
    https://hg.python.org/cpython/rev/32bfc81111b6

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 30, 2016

    New changeset 09a87b16d5e5 by Raymond Hettinger in branch '3.6':
    Issue bpo-18844: Strengthen tests to include a case with unequal weighting
    https://hg.python.org/cpython/rev/09a87b16d5e5

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants