Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise IteratorRandom::choose for size_hint or ExactSizeIterator #511

Closed
dhardy opened this issue Jun 15, 2018 · 12 comments · Fixed by #593
Closed

Optimise IteratorRandom::choose for size_hint or ExactSizeIterator #511

dhardy opened this issue Jun 15, 2018 · 12 comments · Fixed by #593
Labels
C-optimisation E-easy Participation: easy job T-sequences Topic: sequences
Milestone

Comments

@dhardy
Copy link
Member

dhardy commented Jun 15, 2018

IteratorRandom::choose uses one random number per item since it has no way of knowing how many items will follow.

We do not use size_hint since it is not guaranteed that the supplied bounds are correct; though since incorrect implementations are considered buggy we could try.

Alternatively ExactSizeIterator could be used, but probably not without specialisation.

@dhardy dhardy added T-sequences Topic: sequences C-optimisation E-easy Participation: easy job labels Jun 15, 2018
@dhardy dhardy mentioned this issue Jun 27, 2018
28 tasks
@sicking
Copy link
Contributor

sicking commented Jul 23, 2018

Opinions on what we should do if size_hint is buggy? Should we reliably panic? Is return a "wrong" value ok?

So for example, a simple implementation strategy would be to check if size_hint returns (N, Some(N)), and if so simply return self.nth(rng.gen_range(0, N)). However that would return the "wrong" value rather than panic if the size_hint implementation was buggy.

@dhardy
Copy link
Member Author

dhardy commented Jul 24, 2018

To quote the doc:

An incorrect implementation of size_hint() should not lead to memory safety violations.

Beyond that there's not much guidance; i.e. I think it should be acceptable to return "incorrect" results if size_hint is incorrect so long as the return value is either None or Some(x) where x is a result of the iterator.

@nagisa
Copy link
Contributor

nagisa commented Jul 24, 2018

There’s an exception for ExactSizeIterator, but otherwise, yes, the size_hint may return whatever :)

@dhardy
Copy link
Member Author

dhardy commented Jul 24, 2018

Yes, but does "may return whatever" mean:

  1. size_hint may never affect the result in any way (including bias); i.e. can only be used to optimise memory allocations
  2. If size_hint's maximum is too low, the result may ignore all values beyond the maximum, but must otherwise be correct and must be memory safe
  3. Above except also allow bias or incorrect None if size_hint's maximum is too high
  4. If size_hint is incorrect, users may do the wrong thing so long as they are memory-safe

@nagisa
Copy link
Contributor

nagisa commented Jul 24, 2018

There are no restrictions to size_hint. Even if the size_hint is "incorrect", or rather… inaccurate, the algorithm should work correctly as observed at the call boundary. Consider for example collect() on iterators. Using size_hint, the collection may attempt to pre-allocate some amount of storage, but if the hint (and therefore allocation) turned out to be too small, it will have to re-allocate more memory later on.

So, while, there is a side-effect to using size_hint, but it is not reasonably visible to the end user.

For IteratorRandom::choose probably a sufficient rule would be this:

Given a deterministic backing random number generator (r) initialized with with some well known seed, and two different implementations of Iterator for I1 and I2 (that differ only in their implementation of size_hint), a call to <I1 as IteratorRandom>::choose(r) and <I2 as IteratorRandom>::choose(r) should:

  • Advance the I1 and I2 iterators the same amount of times (that is, call the next method the same number of times);
  • Advance the state of RNG to the same new state (?);
  • Return the same element.

@nagisa
Copy link
Contributor

nagisa commented Jul 24, 2018

Ultimately, however, it is up to you to decide what the semantics are when size_hint returns different values. You could as well just specify (in the documentation), that size_hint may affect the outcome of the function, which is fine.

@sicking
Copy link
Contributor

sicking commented Jul 24, 2018

  • Advance the I1 and I2 iterators the same amount of times (that is, call the next method the same number of times);
  • Advance the state of RNG to the same new state (?);
  • Return the same element.

If we have the above requirement I don't see that we could take advantage of size_hint to improve performance. I'll defer to @dhardy and @pitdicker.

But we can always wait until specialization stabilizes and use ExactSizeIterator.

@dhardy
Copy link
Member Author

dhardy commented Jul 24, 2018

I suppose those constraints are necessary to avoid a change in size_hint implementation from affecting results in a supposedly reproducible application. Currently, reproducibility requires pinning the version of the Rand crate; optimising via size_hint would essentially mean the version of other crates may need to be pinned too. I'm not really sure we can get away from that anyway though.

The alternative is just to point out that the implementation of size_hint may affect the effect of this function in documentation, as @nagisa suggests. To me this seems the more attractive option since the performance impact could be quite significant, reproducibility is less often required (especially when sampling from an iterator I think), and in any case maintaining reproducibility is a subtle thing.

@sicking
Copy link
Contributor

sicking commented Jul 25, 2018

I agree that pointing out that size_hint will affect reproducibility feels more attractive. Though we need to make it clear that that's the case even when size_hint changes from one valid implementation to another.

Likewise to point out that choose depends on size_hint, and so if if size_hint has a buggy implementation, then the return values from choose will not be dependable and could even panic. Though it will never result in UB.

If there's agreement that that's ok, then I'm happy to write up a PR?

@dhardy
Copy link
Member Author

dhardy commented Jul 25, 2018

I don't think it needs to panic, but I think it's acceptable to introduce bias in the output (i.e. select some element as a fallback in case the iterator is not as long as size_hint says it is), or perhaps even return None in this case.

Perhaps though we should bring this up elsewhere to get more opinions. I don't really know where people might write incorrect size_hint implementations.

@dhardy
Copy link
Member Author

dhardy commented Jul 25, 2018

@dhardy
Copy link
Member Author

dhardy commented Jul 31, 2018

@sicking the answer seems to be that we're allowed to break things if size_hint is incorrect: https://internals.rust-lang.org/t/size-hint-correctness-reproducibility-and-documentation/8058/6?u=dhardy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-optimisation E-easy Participation: easy job T-sequences Topic: sequences
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants