Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick does not choose randomly when choosing all elements #355

Closed
ssanj opened this issue Aug 26, 2017 · 9 comments
Closed

Pick does not choose randomly when choosing all elements #355

ssanj opened this issue Aug 26, 2017 · 9 comments

Comments

@ssanj
Copy link
Contributor

ssanj commented Aug 26, 2017

Pick is defined as:

def pick[T](n: Int, l: Iterable[T]): Gen[Seq[T]]

I use pick to randomly choose 3 values from a range of 1 to 5 with the following:

scala> List.fill(10)(Gen.pick(3, (1 to 5)).sample.get).mkString("\n")
res142: String =
ArrayBuffer(1, 2, 5)
ArrayBuffer(1, 5, 3)
ArrayBuffer(4, 2, 5)
ArrayBuffer(1, 5, 4)
ArrayBuffer(1, 2, 5)
ArrayBuffer(5, 2, 4)
ArrayBuffer(4, 2, 5)
ArrayBuffer(1, 4, 5)
ArrayBuffer(5, 2, 3)
ArrayBuffer(1, 5, 4)

We see that the 3 values are chosen randomly.

If I ask for all 5 values to be chosen randomly:

scala> List.fill(10)(Gen.pick(5, (1 to 5)).sample.get).mkString("\n")
res143: String =
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)
ArrayBuffer(1, 2, 3, 4, 5)

I keep getting the same values.

Is this the expected behaviour? It seems like unless the n supplied to pick is less than the size of the iterable i, it does not return random values.

@ashawley
Copy link
Contributor

It isn't released, yet, but it was fixed in #333.

@ssanj
Copy link
Contributor Author

ssanj commented Sep 16, 2017

I built this locally with fix #333 and it doesn't seem to have solved the issue. The problem is in the if condition, where it returns the exact content of the iterable l when count <= n where, count is the number of elements in l and n is the number of elements to pick:

      while (it.hasNext) {
        val t = it.next
        count += 1
        if (count <= n) {
          buf += t //adds all elements of l to buf while count <= n; which is what we requested - all elements)
        } else {
          val (x, s) = seed.long
          val i = (x & 0x7fffffff).toInt % count
          if (i < n) buf(i) = t
          seed = s
        }
      }
      r(Some(buf), seed) //buf (which is equal to l) is returned here

@ashawley
Copy link
Contributor

Sorry, I didn't read your first post closely. You're right. It doesn't shuffle when pick is called when n equals count. I don't know enough to really tell, but it looks like reservoir sampling was implemented without a shuffle.

https://en.wikipedia.org/wiki/Reservoir_sampling

I suppose it would be ideal to have it shuffle when n equals count, but there may be a tradeoff or some difficulty in doing so. Again, I can't tell.

@pford19
Copy link

pford19 commented Aug 10, 2018

As noted: Gen.pick(4, 1 to 4) is constant: (1, 2, 3, 4). Surprising.

Unfortunately pick does not do very well when picking most, but not all, elements, and even when picking just a few.

Gen.pick(3, 1 to 4) generates only 4 distinct values out of a possible 24 = 4!.

This code demonstrates

def stream[T](g: Gen[T]): Stream[T] = Stream.cons(g.sample.get, stream(g))
val g = Gen.pick(3, 1 to 4)
val picks = stream(g).take(1000).toSet
val npicks = picks.size
val pickpcnt = (npicks/24.0*100).toInt
println(s"$npicks distinct picks of 3 of 1 to 4, ${(pickpcnt}% of the 24 possible picks. Here they are:")
picks.toList.sortBy(_.toString).foreach(println)

with the following output

4 distinct picks of 3 of 1 to 4., 16% of the 24 possible picks. Here they are:
ArrayBuffer(1, 2, 3)
ArrayBuffer(1, 2, 4)
ArrayBuffer(1, 4, 3)
ArrayBuffer(4, 2, 3)

Not very random.

Gen.pick(2, 1 to 10) does a bit better, it generates 81% (73 of 90) of the possible 2-picks (over 1000 samples).

However the ones it omits are distinctly non-random. It omits exactly every pair of the form (_, 1) and every pair of the form (2, _) while generating all the rest.

(All above based on version 1.14.0.)

@ashawley
Copy link
Contributor

True, the lack of shuffling is generally a problem, regardless of length.

This function is called "pick", it isn't called "permute". Since modifying pick would have consequences on other methods, this doesn't seem like an easy fix.

For now, maybe it should be documented that pick doesn't in any way find the k permutations of n? Maybe the user guide could point this out, as well: Where it could then show users how to shuffle generated lists while it was at it.

@pford19
Copy link

pford19 commented Aug 10, 2018

Better docs of pick describing more precisely what it does and doesn't do would definitely help.

Fully understand the risks associated with "fixing" existing behavior that is arguably more "surprising" than "wrong".

But pick seems not quite fish and not quite fowl. It doesn't reliably permute its results, but it doesn't reliably preserve input order either. Seems that existing clients that depend on this kind of maybe-sort-of-sometimes behavior are doing so accidentally.

Aside: careful software engineers shouldn't infer semantics from a name alone: "pick" means whatever the implementor intended (documented or not) and what the implementation actually does (of course, we hope they are nearly the same). Good names help us remember the semantics, but precise documentation is always needed

As someone just digging into Scalacheck (I love it!) and trying to get a good understanding of generators, and trying to write some new ones (including a permuter and a partitioner) I've been suffering a bit. I've had to write more experimental tests than I would like in order to understand the real semantics of some of the built-in generators, such as our friend pick. The documentation is skimpy, and the source code is subtle (e.g. pick's implementation does not transparently expose its semantics). I ran into this surprising pick behavior while trying to implement a permutation generator.

The following would be useful information in the docs for each generator:

  • is the generated value stream uniformly distributed or not. If not, how so? (e.g. favors values like 0, -1, 1, MaxValue, MinValue, etc.)
  • for generators that select values from collections
    -- is the selection with or without replacement (for pick the answer is without)
    -- does the selection preserve order of shuffle (for pick the answer is yes, but ..)
    -- if it shuffles does it does so uniformly/randomly (for pick the answer is no)

As currently implemented pick's behavior seems a little tricky to specify: It may or may not shuffle its input, but not uniformly and not randomly and some possible results are never generated.

The current doc is pretty light:

 /** A generator that picks a given number of elements from a list, randomly */

That leaves open questions. Does it pick with or without replacement? What exactly does randomly mean? Does it preserve input order or not?

/** A generator that picks a given number of elements from a list. 
 * 
 * The elements are picked without replacement. 
 * 
 * pick is random in the sense that the result stream, mapped to sets, 
 * generates all possible subsets of length n of the input list and the distribution 
 * of the subsets is uniform. [Note: all possible subsets has been verified for 
 * various pickers from (1 to 10), the uniformity has not.]
 * 
 * But the result stream can not be relied on to generate all possible 
 * permutations of n elements.
 * 
 * For example, pick(2, (1 to 10)) will generate 45 = 10*9/2 distinct 2-element subsets 
 * of (1 to 10).
 * However, only 73 of the 90 possible 2-element permutations are generated. 

 * As another example, pick(n, 1 to n) generates only one value, (1 to n).
 * 
 * @throws IllegalArgumentException if n < 0 or n > l.size
 */

I could well imagine a different permutation generator with this spec.

/** A generator that selects a given number of elements from a list, ensuring
 * that the results are randomly shuffled relative to the input's order. 
 * 
 * Elements are selected without replacement.
 * 
 * The result stream is random in the sense  that it will generate all possible
 * permutations (shuffling) of n elements, and the distribution of the
 * permutations is uniform.
 * 
 * @throws IllegalArgumentException if n < 0 or n > input.size
 */
def pickAndShuffle(n: Int, input: Iterable[T]): Gen[Seq[T]] = { ... }

Or a picker that preserves order:

/** A generator that selects a given number of elements from a list 
 * preserving original order.
 * 
 * Elements are selected without replacement.
 * 
 * The result stream is random in the sense  that it will generate all 
 * possible input subsets of length n, while always preserving input order.
 * 
 * @throws IllegalArgumentException if n < 0 or n > input.size
 */
def pickInOrder(n: Int, input: Iterable[T]): Gen[Seq[T]] = { ... }

@pford19
Copy link

pford19 commented Aug 10, 2018

OK. I went back and read source code a little more carefully. Now I understand why pick is a little bit deterministic and a little bit random.

pick(m, collection-of-size-n starts by copying the first m elements of the input to the result. That's the deterministic part.

It then continues to iterate over the remaining n-m elements of the input, and randomly decides (1) whether to copy that value to the result, and (2) which result index to overwrite. That's the random part.

So if any of the first m elements of the input are in the result they are at a predetermined index -- their index in the input. If any of the remaining n-m elements are in the result they are at random indices and they overwrite the original value at that index.

This can be captured as a property

    def pickOrderProperty(m: Int, n: Int): Prop = {
      require(0 <= m && m < n)
      forAll(Gen.pick(m, 0 to n - 1)) { picks => 
        // above ordering semantics
        (0 to m - 1).forall(i => picks(i) == i || picks(i) >= m ) &&  
        // selection without replacement
        picks.size == picks.toSet.size  
      }
    }

The semantics are a little bit complicated, but reasonably concise once the smoke has been blown away.

@pford19
Copy link

pford19 commented Aug 13, 2018

I experimented with variant orderedPick which preserves input order. Performance degradation seems tolerable based on limited experiments. Performance hit increases with size of the input (see below). Statistical distribution of results is identical to current pick.

  /** A generator that picks a given number of elements from a list, randomly, 
   * preserving input order. 
   * 
   * See [[https://en.wikipedia.org/wiki/Reservoir_sampling]] for general
   *  explanation of the reservoir selection algorithm. This implementation 
   * ensures that the result preserve input ordering.
   */
  def orderedPick[T](n: Int, l: Iterable[T]): Gen[Seq[T]] = {
    if (n > l.size || n < 0) throw new IllegalArgumentException(s"invalid choice: $n")
    else if (n == 0) Gen.const(Nil)
    else gen { (p, seed0) =>
      val buf = ArrayBuffer.empty[T]
      val it = l.iterator
      var seed = seed0
      var count = 0
      while (it.hasNext) {
        val t = it.next
        count += 1
        if (count <= n) {
          buf += t
          //assert(buf.size <= n)
        } else {
          val (x, s) = seed.long
          val i = (x & 0x7fffffff).toInt % count
          if (i < n) {
            if (i == n - 1)
             // save one buff operation when slot to overwrite is also the last
              buf(i) = t 
            else {
              // remove random element and append new element, preserving input order
              buf.remove(i) // shrink buff to size n-1
              buf += t // return buff to size n
            }
            //assert(buf.size == n)
          }
          seed = s
        }
      }
      r(Some(buf), seed)
    }
  }

As noted in the source comments, the algorithmic change from standard pick is to remove a randomly selected element from the reservoir buff, and then to append its replacement at the end of buff, rather than simply overwriting the removed element. This change incurs the added cost of shifting the contents of buff for each added element. The average shift is n/2. The probability of adding the element at index j of the input (j > n) to the result is n/j.

Here is an outline of performance comparison tests:

val inputs = List(10, 20, 50, 100, 500, 1000, 2000, 5000, 10000)
for (inputSize <- inputs) {
      val timings = for (pickSize <- List(2, 5, 10, 20, 50, 100); 
        if pickSize < inputSize
      ) yield comparePickTiming(pickSize, inputSize, samplesize = 1000, print = false)
      report(timings)
    }
}

With these results

A run consists 1000 picks per pick size, for multiple pick sizes.
Ratio is in-order-pick-time/existing-pick-time for a run.
Average and stddev (standard deviation) are stats for run ratios
for a fixed input size and multiple picks sizes.

input size=10,    runs= 8, total picks= 8000, average ratio=0.9414, stddev=0.1324
input size=20,    runs=12, total picks=12000, average ratio=1.0000, stddev=0.0766
input size=50,    runs=16, total picks=16000, average ratio=1.0778, stddev=0.6826
input size=100,   runs=20, total picks=20000, average ratio=1.2142, stddev=0.9403
input size=500,   runs=24, total picks=24000, average ratio=1.1406, stddev=0.2415
input size=1000,  runs=24, total picks=24000, average ratio=1.7707, stddev=2.1308
input size=2000,  runs=24, total picks=24000, average ratio=1.4832, stddev=1.2891
input size=5000,  runs=24, total picks=24000, average ratio=1.7545, stddev=0.6217
input size=10000, runs=24, total picks=24000, average ratio=1.5178, stddev=1.0655

Does the concept of preserving input order have any merit? Either as

  1. A change to existing pick
  2. Or, a new generator with guaranteed order preserving semantics?

@ashawley
Copy link
Contributor

ashawley commented Apr 8, 2021

The documentation improvements to explain the order issues with Gen.pick have been out for a few years, now.

Closing this issue, but we can be reopen if it's deemed necessary.

@ashawley ashawley closed this as completed Apr 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants