overhaul api for getting random integers #1578

Merged
merged 2 commits into master from rand on Sep 27, 2018

Conversation

@thejoshwolfe
Member

thejoshwolfe commented Sep 22, 2018

What I like about this PR

  • The existing Random.scalar() method seemed silly, because it could return either an int or a boolean. So I made Random.int() and Random.boolean() instead. Seems more reasonable.
  • Random.range() has a worst-case runtime of infinity for the sake of a perfectly even distribution. This doesn't seem in line with the Zen of Zig, so I renamed it to include *RetryForever* in the name to make it clear that there's a loop in it. The default-named range functions like Random.intRangeLessThan() take a retry-limit parameter. For many applications (e.g. real-time video games), retrying twice or so is good enough, and now there's an api for that.
  • Added support for weird integer types like u0 and u33.
  • Drastically simplified the signed integer range function with the power of two's complement signed arithmetic: pretty much just enable wrap-around arithmetic and bitcast everything (see the sketch after this list).
  • Added *AtMost* variants in addition to the *LessThan* functions with a runtime upper bound. There are use cases for each:
    • LessThan: selecting a random element from an array.
    • AtMost: when you want to be able to get the maximum integer value out of the range. Maybe you've partitioned the 32-bit number space into n groups, and you want to pick a random number from the top partition. The upper bound needs to be @maxValue(u32), so a LessThan function is not suitable. The AtMost variants have the unique ability to cover the entire number space of an integer type if the runtime-known bounds call for it, e.g. random.intRangeAtMost(u64, 0, @maxValue(u64), 0).
  • Added DegenerateConstantPrng and DegenerateSequentialPrng to illustrate the unsafe nature of the RetryForever variants and to test actual values from the Random utility functions in a simple deterministic way.
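
Roughly, the two's complement trick looks like this (a minimal sketch in present-day Zig syntax, with illustrative names rather than the PR's exact code):

const std = @import("std");

fn signedRangeSketch(random: std.Random, at_least: i32, less_than: i32) i32 {
    std.debug.assert(at_least < less_than);
    // Reinterpret the signed bounds as unsigned. In two's complement,
    // hi -% lo is the width of the range even when the bounds straddle
    // zero, so we can pick in unsigned space and wrap back.
    const lo: u32 = @bitCast(at_least);
    const hi: u32 = @bitCast(less_than);
    const result = lo +% random.uintLessThan(u32, hi -% lo);
    return @bitCast(result);
}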

What's still a little questionable about this PR:

  • Lots of copy-pasted documentation. I didn't thoroughly double-check it all before committing.
  • The AtMost tests are mostly copy-pasted LessThan tests. Perhaps those could be unified?
  • The api names are awkward. When the name doesn't mention that one of the parameters is a retry limit, it looks like you're giving a lower and upper bound on the range: random.uintLessThan(u32, 100, 200) reads like "an integer between 100 and 200" when 200 is actually the retry limit.
  • The AtMost functions have two different runtime early returns for special upper bound values. Seems like we should be able to do that in just one check.
  • Haven't updated all the uses of scalar() and range() throughout the codebase yet. For now, I marked the functions as TODO: deprecated.
@andrewrk

Member

andrewrk commented Sep 22, 2018

I'll take a closer look in a bit, but at first glance I'm not convinced there is a real use case for this "retry" concept. First of all, we're talking about pseudo-random number generators here, which do have a period, and are uniformly distributed within that period. So that limits the big O.

Second, the range of values for which the algorithm must "re-roll" to get new random bits is tiny, and therefore unlikely to be hit. But importantly, that small probability gets multiplied by itself at every re-roll, so we're doing something like pow(1 / 4294967296, N) where N is the number of re-rolls. This quickly approaches a scenario (at N = 7 or something like that) where it would be more likely for a cosmic ray to flip a random bit than for the algorithm to have to re-roll that many times. So I don't think it's really O(infinity).

This was not a very mathematically accurate breakdown of the problem, just my first reaction. I'll take a closer look later.

@thejoshwolfe

Member

thejoshwolfe commented Sep 22, 2018

If your range is 0 <= i <= 0x80000001, then you're re-rolling about 50% of the time.

@andrewrk

Member

andrewrk commented Sep 23, 2018

Let's take 50% then. According to a study by IBM about cosmic rays, the probability is 1.4E-15 per byte per second. Taking log base 2 of the inverse of that number gives us how many re-rolls have the same probability, and it's 50. So realistically this is O(1).
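
Spelled out, assuming the 50% worst case above, so that N consecutive re-rolls happen with probability $2^{-N}$:

$$2^{-N} = 1.4 \times 10^{-15} \quad \Rightarrow \quad N = \log_2\left(\frac{1}{1.4 \times 10^{-15}}\right) \approx 49.3 \approx 50$$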

@mschwaig

mschwaig commented Sep 24, 2018

I think that because of the low cost of a perfectly even distribution, it is unfair to label that method with RetryForever, since the probability of it actually taking forever is vanishingly small.

Even distribution is not something people should give up lightly.

@thejoshwolfe

Member

thejoshwolfe commented Sep 24, 2018

You're assuming that the backing prng provides evenly distributed values. I agree that this isn't a realistic problem for the prngs provided and recommended by the std lib.

My concern is that the Random class/struct/interface is providing general purpose utility functions. The correctness of the utilities depends on the behavior of the underlying prng, which is not surprising. But what is surprising is that the halting correctness depends on the values provided by the underlying prng. This seems especially hostile to anyone trying to implement or debug a backend, when the observed behavior is that your program freezes while burning 100% CPU.

I think limiting the retries to 50 is a reasonable compromise.

@mschwaig

mschwaig commented Sep 24, 2018

Should it fail after 50 retries or give up?

EDIT: If this happens in practice, there is probably something wrong with the PRNG.

@andrewrk

Member

andrewrk commented Sep 24, 2018

You're assuming that the backing prng provides evenly distributed values.

I definitely think the Random interface should assume the backing prng provides evenly distributed values. I'd like to see a real-life use case for supporting non-evenly-distributed backing prngs before adding code to deal with them, and even if we were going to support that use case, it should probably go under a different name than std.rand.Random.

@mschwaig

mschwaig commented Sep 24, 2018

Josh is probably talking more about a PRNG with low-quality output than a PRNG which deliberately provides a different distribution. From what I've seen, people usually build support for different distributions on top of a PRNG providing uniformly distributed values.

@andrewrk

I'm not convinced the use case for the retry concept is valid.

@thejoshwolfe

Member

thejoshwolfe commented Sep 24, 2018

I propose that after 50 retries it just biases the result by doing value % range. If the backing prng were returning the same value every time, causing this behavior, then the consistency would be observable outside the Random interface as well.
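
A minimal sketch of that proposal (hypothetical names, present-day Zig syntax; this is an illustration, not what was ultimately merged):

const std = @import("std");

fn uintLessThanWithFallback(random: std.Random, comptime T: type, len: T) T {
    std.debug.assert(len > 0);
    // t = 2^bits mod len; rejecting x < t leaves a span [t, 2^bits)
    // whose length is a multiple of len, so x % len is unbiased.
    const t: T = (std.math.maxInt(T) % len + 1) % len;
    var tries: usize = 0;
    while (tries < 50) : (tries += 1) {
        const x = random.int(T);
        if (x >= t) return x % len;
    }
    // After 50 rejections, give up on perfect uniformity and accept
    // the (tiny) modulo bias.
    return random.int(T) % len;
}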

It seems strange to me to have software rely on the assumption that random values will behave predictably. Granted, the constraints on the random values are very generous in this case, but it still seems far from Zig's "perfect software" philosophy to leave the correctness of an algorithm up to chance. Perhaps this is just a philosophical objection, but looping forever waiting for a random event to occur doesn't seem like correct software to me. Aside from the likelihoods of the different outcomes, this is analogous to bogosort.

You could argue that perfect correctness in this case actually means a perfectly even distribution, which does require looping forever. So then we have to compromise between imperfect runtime or imperfect distribution. The degree of imperfection on either side is very small.

I argue that a runtime guarantee is more important than a guarantee of perfectly even distribution. When it matters (like if your prng has a bug), looping forever is definitely worse behavior than returning a biased result. Keep in mind that whether it loops forever or not depends on several factors; it's not like it will reliably loop forever, enabling you to find the bug.

tldr: let me update the PR to remove the retry_limit parameters and *RetryForever* variants and instead always have a retry limit of 50. I like @andrewrk's analysis of cosmic interference, and that will look nice in the docs.

@andrewrk

Member

andrewrk commented Sep 24, 2018

So then we have to compromise between imperfect runtime or imperfect distribution.

How about allowing the user to specify in which direction they wish to compromise?

/// The time complexity of this function is O(1), but it has a variable number of branches.
/// Returns a uniformly distributed random integer `start <= x < end`.
/// Assumes the underlying prng is evenly distributed. If this assumption is violated,
/// then time complexity of this function becomes undefined, and the randomness of
/// the return value will no longer be guaranteed to be uniform.
///
/// A more detailed account of the time complexity is as follows:
/// Worst case time complexity is O(P/2) where P is the period of the underlying prng. This is
/// a constant value, but a large constant value such as pow(2, 128) for xoroshiro128+.
/// However, the chance of needing more than 50 re-rolls is smaller than the chance of a
/// cosmic ray flipping a random bit in memory within the next second.
/// 
/// API users who want to save a few CPU branches at the expense of perfectly uniform
/// random numbers can use the function `intRangeNonUniform`.
pub fn intRange(r: *Random, comptime T: type, start: T, end: T) T {
    // ...
}

/// The time complexity of this function is O(1), and has a constant number of branches.
/// Returns a random integer `start <= x < end`, not guaranteed to be evenly distributed.
/// For an evenly distributed range, see `intRange`.
pub fn intRangeNonUniform(r: *Random, comptime T: type, start: T, end: T) T {
    // ...
}

Note that neither of these has the ability to specify a "retry count", because we are in agreement that there is no use case for a retry count other than 0 or infinity.

I argue that a runtime guarantee is more important than a guarantee of perfectly even distribution. When it matters (like if your prng has a bug), looping forever is definitely worse behavior than returning a biased result. Keep in mind that whether it loops forever or not depends on several factors; it's not like it will reliably loop forever, enabling you to find the bug.

I disagree with this synopsis, but maybe we don't even have to discuss it, because I agree there are use cases for non-uniformly distributed ranges (video games, art) and for uniformly distributed ranges (cryptography, video games, art). Both of these use cases are covered by the above function prototype examples. What we don't have is a use case for the concept of a retry count (even if that retry count is an implementation detail not exposed to the API).

thejoshwolfe added some commits Sep 22, 2018

address review feedback
* removed retry limits. instead documented a recommendation
  to call int(T) % len directly.
* removed DegenerateConstantPrng because it violates the
  assumptions of the Random functions.
@thejoshwolfe

Member

thejoshwolfe commented Sep 27, 2018

I think I'm convinced that there's no use case for a retry limit other than 0 or infinity. In very strict runtime situations like cryptography or safety-critical realtime embedded systems, you should retry 0 times. In all other cases, an exponentially distributed blip in performance is acceptable.

What do people think about having separate functions for LessThan and AtMost? I'm not comfortable with either of them being the default, since there seems to be disagreement about what the default should be. In Python, the upper bound is inclusive; in .NET, it's exclusive. (Not even the parameter names in those function signatures (b and maxValue respectively) communicate whether the upper bound is inclusive or exclusive; worse, maxValue misleadingly suggests that it is the maximum value returnable from the function. ugh.)
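
For reference, here's how the two unambiguous spellings read at a call site (a sketch using this PR's names in present-day syntax, with items and random assumed in scope):

// start <= x < end: the natural spelling for indexing.
const index = random.uintLessThan(usize, items.len);

// start <= x <= end: can express the full integer range.
const any = random.intRangeAtMost(u32, 0, std.math.maxInt(u32));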

@andrewrk

Member

andrewrk commented Sep 27, 2018

What do people think about having separate functions for LessThan and AtMost?

Zig has an established convention in slices that it's start <= x < end, and I do personally think this convention results in code that is easier to reason about. So I wouldn't mind if the API encouraged this convention. On the other hand, I'm also fine with an API that was neutral and had unambiguous names for either convention, which is what you've done in this PR.
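
(For reference, the slice convention in question, with hypothetical names: const window = items[start..end]; includes items[start] and excludes items[end], i.e. start <= i < end.)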

andrewrk merged commit e7d9d00 into master Sep 27, 2018

4 checks passed

continuous-integration/appveyor/branch: AppVeyor build succeeded
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

andrewrk deleted the rand branch Sep 27, 2018

emekoi added a commit to emekoi/zig that referenced this pull request Sep 27, 2018

overhaul api for getting random integers (ziglang#1578)
* rand api overhaul
* no retry limits. instead documented a recommendation
  to call int(T) % len directly.

@tommyettinger

tommyettinger commented Nov 13, 2018

I got here too late, but it might be worthwhile to translate the code M.E. O'Neill gave here and figure out which of the techniques apply best to Zig. Unless modulus is handled better in Zig than in C, I'd expect the solutions that avoid modulus to perform much better (especially if they don't use rejection, but that makes them biased). The code for "Bitmask with Rejection (Unbiased)" is an unusual case and I expect it will optimize very differently in Zig than it did when I implemented it in Java. "Debiased Integer Multiplication (Unbiased)" generally seems to be the winner, and I think D. Lemire has produced more than one article/paper describing it. You're all doing good work!
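
For reference, a minimal sketch of the "Debiased Integer Multiplication (Unbiased)" technique (my own translation into present-day Zig syntax; illustrative, not the merged std.rand code):

const std = @import("std");

fn lemireBounded(random: std.Random, range: u32) u32 {
    std.debug.assert(range > 0);
    // The high 32 bits of the 64-bit product x * range are uniform over
    // [0, range) once the biased low-bit cases are rejected.
    var m: u64 = @as(u64, random.int(u32)) * range;
    var low: u32 = @truncate(m);
    if (low < range) {
        // threshold = 2^32 mod range, via the wrapping-negation trick.
        const threshold = -%range % range;
        while (low < threshold) {
            m = @as(u64, random.int(u32)) * range;
            low = @truncate(m);
        }
    }
    return @intCast(m >> 32);
}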

@thejoshwolfe

Member

thejoshwolfe commented Nov 14, 2018

@tommyettinger thanks for the research! That's definitely something we want to take seriously. I've created #1721 for it.
