Optimise `IteratorRandom::choose` for `size_hint` or `ExactSizeIterator` #511

dhardy · 2018-06-15T16:57:49Z

IteratorRandom::choose uses one random number per item since it has no way of knowing how many items will follow.
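A minimal sketch of the one-draw-per-item behaviour described above, written as single-element reservoir sampling. This is an illustration, not the crate's actual code: `choose_reservoir` is a hypothetical name, and the `below` closure is an assumed stand-in for an RNG call returning a uniform value in `0..n`.

```rust
/// Single-element reservoir sampling over an iterator of unknown length.
///
/// `below(n)` is a hypothetical stand-in for an RNG call that returns a
/// uniform value in `0..n`; it is not part of any real crate's API.
fn choose_reservoir<I, F>(iter: I, mut below: F) -> Option<I::Item>
where
    I: Iterator,
    F: FnMut(usize) -> usize,
{
    let mut chosen = None;
    for (seen, item) in iter.enumerate() {
        // Replace the current pick with probability 1 / (seen + 1).
        // This costs one random draw for every item the iterator yields.
        if below(seen + 1) == 0 {
            chosen = Some(item);
        }
    }
    chosen
}
```

Because the length is never consulted, the number of draws always equals the number of items, which is the cost this issue asks to reduce.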

We do not use size_hint since it is not guaranteed that the supplied bounds are correct; though since incorrect implementations are considered buggy we could try.

Alternatively ExactSizeIterator could be used, but probably not without specialisation.


sicking · 2018-07-23T07:26:40Z

Opinions on what we should do if size_hint is buggy? Should we reliably panic? Is returning a "wrong" value OK?

So for example, a simple implementation strategy would be to check if size_hint returns (N, Some(N)), and if so simply return self.nth(rng.gen_range(0, N)). However that would return the "wrong" value rather than panic if the size_hint implementation was buggy.
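The strategy in the previous paragraph could look roughly like the sketch below. It is an assumption-laden illustration rather than a proposed patch: `choose_fast` is a hypothetical name, `below(n)` stands in for `rng.gen_range(0, N)`, and the fallback branch is the one-draw-per-item approach.

```rust
/// Uses an exact size hint to pick with a single draw, falling back to
/// one draw per item when the hint is not exact.
///
/// `below(n)` is a hypothetical stand-in for `rng.gen_range(0, n)`.
fn choose_fast<I, F>(mut iter: I, mut below: F) -> Option<I::Item>
where
    I: Iterator,
    F: FnMut(usize) -> usize,
{
    match iter.size_hint() {
        // Exact hint of zero: nothing to choose from.
        (0, Some(0)) => None,
        // Exact hint (N, Some(N)): one draw, then skip to that item.
        // A buggy hint is trusted here, so the result may be "wrong".
        (lower, Some(upper)) if lower == upper => iter.nth(below(lower)),
        // Inexact hint: one draw per item, as before.
        _ => {
            let mut chosen = None;
            for (seen, item) in iter.enumerate() {
                if below(seen + 1) == 0 {
                    chosen = Some(item);
                }
            }
            chosen
        }
    }
}
```

Nothing in the fast path can detect a wrong hint before it is used, which is why the question of panicking versus returning a "wrong" value comes up.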

dhardy · 2018-07-24T12:43:13Z

To quote the doc:

An incorrect implementation of size_hint() should not lead to memory safety violations.

Beyond that there's not much guidance; i.e. I think it should be acceptable to return "incorrect" results if size_hint is incorrect so long as the return value is either None or Some(x) where x is a result of the iterator.

nagisa · 2018-07-24T12:54:30Z

There’s an exception for ExactSizeIterator, but otherwise, yes, the size_hint may return whatever :)

dhardy · 2018-07-24T13:06:35Z

Yes, but does "may return whatever" mean:

1. size_hint may never affect the result in any way (including bias); i.e. it can only be used to optimise memory allocations.
2. If size_hint's maximum is too low, the result may ignore all values beyond the maximum, but must otherwise be correct and must be memory safe.
3. As in the previous option, but also allow bias or an incorrect None if size_hint's maximum is too high.
4. If size_hint is incorrect, users may do the wrong thing so long as they are memory-safe.
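To make the "too low" and "too high" cases concrete, here is a small sketch. `Misreport` and `choose_trusting` are hypothetical names invented for this illustration, and `below(n)` is an assumed stand-in for an RNG call returning a value in `0..n`.

```rust
/// Wraps an iterator and reports a fixed, possibly wrong, exact size hint.
/// (Hypothetical helper, only for illustration.)
struct Misreport<I> {
    inner: I,
    claimed: usize,
}

impl<I: Iterator> Iterator for Misreport<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.claimed, Some(self.claimed))
    }
}

/// Trusts the lower bound of the hint completely and uses a single draw.
fn choose_trusting<I: Iterator>(
    mut iter: I,
    mut below: impl FnMut(usize) -> usize,
) -> Option<I::Item> {
    let (claimed, _) = iter.size_hint();
    if claimed == 0 {
        return None;
    }
    // Hint too low: items at index >= claimed can never be selected.
    // Hint too high: `nth` may run off the end and return `None`.
    iter.nth(below(claimed))
}
```

Both failure modes stay memory safe and only ever return `None` or an item the iterator actually produced; what is lost is uniformity.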

nagisa · 2018-07-24T13:27:59Z

There are no restrictions to size_hint. Even if the size_hint is "incorrect", or rather… inaccurate, the algorithm should work correctly as observed at the call boundary. Consider for example collect() on iterators. Using size_hint, the collection may attempt to pre-allocate some amount of storage, but if the hint (and therefore allocation) turned out to be too small, it will have to re-allocate more memory later on.

So while there is a side-effect to using size_hint, it is not reasonably visible to the end user.
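The collect() example can be sketched as follows. This is not how the standard library implements `collect`; `collect_with_hint` and `Misreport` are hypothetical names used only to show that a wrong hint changes allocation behaviour, not the result.

```rust
/// Wraps an iterator and reports a fixed, possibly wrong, exact size hint.
/// (Hypothetical helper, only for illustration.)
struct Misreport<I> {
    inner: I,
    claimed: usize,
}

impl<I: Iterator> Iterator for Misreport<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.claimed, Some(self.claimed))
    }
}

/// Collects into a `Vec`, using the hint only to pre-allocate.
fn collect_with_hint<I: Iterator>(iter: I) -> Vec<I::Item> {
    let (lower, _) = iter.size_hint();
    // The hint only sizes the first allocation. If it is too small,
    // `push` reallocates; if it is too large, memory is wasted.
    let mut out = Vec::with_capacity(lower);
    for item in iter {
        out.push(item);
    }
    out
}
```

The returned vector is the same whatever the hint claims; only the number of reallocations and the spare capacity differ (an absurdly large claimed lower bound could still make the allocation itself fail).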

For IteratorRandom::choose probably a sufficient rule would be this:

Given a deterministic backing random number generator (r) initialized with some well-known seed, and two different implementations of Iterator, I1 and I2 (that differ only in their implementation of size_hint), a call to <I1 as IteratorRandom>::choose(r) and <I2 as IteratorRandom>::choose(r) should:

Advance the I1 and I2 iterators the same amount of times (that is, call the next method the same number of times);
Advance the state of RNG to the same new state (?);
Return the same element.
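The three conditions above can be checked mechanically. The sketch below does so for a hint-agnostic, one-draw-per-item implementation, which satisfies the rule trivially. Everything here is an assumption for illustration: `SplitMix64` is a small hand-written generator standing in for a seeded RNG, and `Probe`, `choose_reservoir` and `observe` are hypothetical names.

```rust
use std::cell::Cell;

/// Small deterministic generator, a stand-in for a seeded RNG.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    /// Value in `0..n` (modulo reduction, slightly biased; fine for a sketch).
    fn below(&mut self, n: usize) -> usize {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        ((z ^ (z >> 31)) % n as u64) as usize
    }
}

/// Iterator wrapper with a configurable hint that counts `next` calls.
struct Probe<'a, I> {
    inner: I,
    hint: (usize, Option<usize>),
    calls: &'a Cell<usize>,
}

impl<'a, I: Iterator> Iterator for Probe<'a, I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.calls.set(self.calls.get() + 1);
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.hint
    }
}

/// Hint-agnostic selection: one draw per item.
fn choose_reservoir<I: Iterator>(iter: I, rng: &mut SplitMix64) -> Option<I::Item> {
    let mut chosen = None;
    for (seen, item) in iter.enumerate() {
        if rng.below(seen + 1) == 0 {
            chosen = Some(item);
        }
    }
    chosen
}

/// Returns (chosen element, number of `next` calls, RNG state afterwards).
fn observe(hint: (usize, Option<usize>), seed: u64) -> (Option<u32>, usize, u64) {
    let calls = Cell::new(0);
    let mut rng = SplitMix64 { state: seed };
    let probe = Probe { inner: 0..100u32, hint, calls: &calls };
    let chosen = choose_reservoir(probe, &mut rng);
    (chosen, calls.get(), rng.state)
}
```

Comparing `observe((100, Some(100)), seed)` with `observe((0, None), seed)` for the same seed gives identical triples here; an implementation that takes a shortcut based on the hint would fail the same comparison.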

nagisa · 2018-07-24T13:29:33Z

Ultimately, however, it is up to you to decide what the semantics are when size_hint returns different values. You could just as well specify in the documentation that size_hint may affect the outcome of the function, which is fine.

sicking · 2018-07-24T16:56:42Z

> Advance the I1 and I2 iterators the same amount of times (that is, call the next method the same number of times);
>
> Advance the state of RNG to the same new state (?);
>
> Return the same element.

If we have the above requirement I don't see that we could take advantage of size_hint to improve performance. I'll defer to @dhardy and @pitdicker.

But we can always wait until specialization stabilizes and use ExactSizeIterator.

dhardy · 2018-07-24T17:24:50Z

I suppose those constraints are necessary to avoid a change in size_hint implementation from affecting results in a supposedly reproducible application. Currently, reproducibility requires pinning the version of the Rand crate; optimising via size_hint would essentially mean the version of other crates may need to be pinned too. I'm not really sure we can get away from that anyway though.

The alternative is just to point out in the documentation that the implementation of size_hint may affect the result of this function, as @nagisa suggests. To me this seems the more attractive option since the performance impact could be quite significant, reproducibility is less often required (especially when sampling from an iterator, I think), and in any case maintaining reproducibility is a subtle thing.

sicking · 2018-07-25T05:23:30Z

I agree that pointing out that size_hint will affect reproducibility feels more attractive. Though we need to make it clear that that's the case even when size_hint changes from one valid implementation to another.
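The reproducibility point can be illustrated with two hints that are both valid for the same 100-item iterator. The sketch below is an assumption-based illustration: `CountingRng` is a toy linear congruential generator that counts its draws, and `WithHint`, `choose_fast` and `draws_for` are hypothetical names.

```rust
/// Toy seeded generator that counts how many values were drawn.
struct CountingRng {
    state: u64,
    draws: usize,
}

impl CountingRng {
    /// Value in `0..n` from a simple LCG step (sketch quality only).
    fn below(&mut self, n: usize) -> usize {
        self.draws += 1;
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.state >> 33) % n as u64) as usize
    }
}

/// Iterator wrapper that reports whatever hint it is given.
struct WithHint<I> {
    inner: I,
    hint: (usize, Option<usize>),
}

impl<I: Iterator> Iterator for WithHint<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.hint
    }
}

/// Hint-aware selection: one draw for an exact hint, one per item otherwise.
fn choose_fast<I: Iterator>(mut iter: I, rng: &mut CountingRng) -> Option<I::Item> {
    match iter.size_hint() {
        (0, Some(0)) => None,
        (lower, Some(upper)) if lower == upper => iter.nth(rng.below(lower)),
        _ => {
            let mut chosen = None;
            for (seen, item) in iter.enumerate() {
                if rng.below(seen + 1) == 0 {
                    chosen = Some(item);
                }
            }
            chosen
        }
    }
}

/// Number of RNG draws consumed for a 100-item iterator with this hint.
fn draws_for(hint: (usize, Option<usize>), seed: u64) -> usize {
    let mut rng = CountingRng { state: seed, draws: 0 };
    let _ = choose_fast(WithHint { inner: 0..100, hint }, &mut rng);
    rng.draws
}
```

With the same seed, the exact hint `(100, Some(100))` consumes one draw while the equally valid `(0, None)` consumes one hundred, so the RNG is left in a different state and later results in the same program diverge.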

Likewise, we should point out that choose depends on size_hint, so if size_hint has a buggy implementation, the return values from choose will not be dependable and it could even panic, though it will never result in UB.

If there's agreement that that's ok, then I'm happy to write up a PR?

dhardy · 2018-07-25T11:01:52Z

I don't think it needs to panic, but I think it's acceptable to introduce bias in the output (i.e. select some element as a fallback in case the iterator is not as long as size_hint says it is), or perhaps even return None in this case.
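One way the fallback idea could be realised is sketched below: if the iterator ends before the chosen index, the last item seen is returned instead of panicking. This is only an illustration under assumptions; `choose_with_fallback` and `Misreport` are hypothetical names, and `below(n)` stands in for an RNG call returning a value in `0..n`.

```rust
/// Wraps an iterator and reports a fixed, possibly wrong, exact size hint.
/// (Hypothetical helper, only for illustration.)
struct Misreport<I> {
    inner: I,
    claimed: usize,
}

impl<I: Iterator> Iterator for Misreport<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.claimed, Some(self.claimed))
    }
}

/// Single draw for an exact, non-zero hint; never panics on a wrong hint.
fn choose_with_fallback<I, F>(iter: I, mut below: F) -> Option<I::Item>
where
    I: Iterator,
    F: FnMut(usize) -> usize,
{
    let (lower, upper) = iter.size_hint();
    if upper == Some(lower) && lower > 0 {
        let target = below(lower);
        let mut last = None;
        for (index, item) in iter.enumerate() {
            last = Some(item);
            if index == target {
                break;
            }
        }
        // If the iterator was shorter than claimed, `last` is its final
        // item (a biased fallback), or `None` if it yielded nothing.
        return last;
    }
    // Inexact or zero hint: one draw per item.
    let mut chosen = None;
    for (seen, item) in iter.enumerate() {
        if below(seen + 1) == 0 {
            chosen = Some(item);
        }
    }
    chosen
}
```

The fallback biases the result towards the final element when the hint is too high, which matches the "acceptable bias" trade-off described above; returning `None` in that case would be the other option mentioned.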

Perhaps though we should bring this up elsewhere to get more opinions. I don't really know where people might write incorrect size_hint implementations.

dhardy · 2018-07-25T12:17:30Z

See: https://internals.rust-lang.org/t/size-hint-correctness-reproducibility-and-documentation/8058

dhardy · 2018-07-31T09:36:09Z

@sicking the answer seems to be that we're allowed to break things if size_hint is incorrect: https://internals.rust-lang.org/t/size-hint-correctness-reproducibility-and-documentation/8058/6?u=dhardy

dhardy added the T-sequences (Topic: sequences), C-optimisation and E-easy (Participation: easy job) labels Jun 15, 2018

dhardy mentioned this issue Jun 27, 2018

Tracker: 0.6 release #520

Closed

28 tasks

sicking mentioned this issue Aug 21, 2018

Use Iterator::size_hint() to speed up IteratorRandom::choose #593

Merged

dhardy added this to the 0.6 release milestone Aug 23, 2018

dhardy closed this as completed in #593 Aug 24, 2018
