Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Powerset iterator adaptor #335

Merged
merged 4 commits into from
Dec 10, 2020
Merged

Powerset iterator adaptor #335

merged 4 commits into from
Dec 10, 2020

Conversation

willcrozi
Copy link
Contributor

Implements a powerset iterator adaptor that iterates over all subsets of the input iterator's elements. Returns vectors representing these subsets. Internally uses Combinations of increasing length.

I've taken the strategy of using a 'position' field that acts both as a means to to detect the special case of the first element and also allows optimal size_hint() implementation.

Additionally there is a commit to improve performance that alters Combinations implementation slightly. I've added Combinations benchmark as a stand-alone commit to allow checking for performance regressions. Powerset performance after this commit improves some cases (with small sizes of n) by 10-30%

This is my first attempt at a Rust contribution, happy to put in whatever discussion/work to get this merged. Cheers!

Copy link
Member

@jswrenn jswrenn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to get this into the next Itertools release! Sorry for the long delay in review!

.travis.yml Outdated Show resolved Hide resolved
@willcrozi willcrozi force-pushed the powerset branch 3 times, most recently from fbd8b74 to d10c03e Compare September 23, 2020 09:32
@willcrozi
Copy link
Contributor Author

I want to get this into the next Itertools release! Sorry for the long delay in review!

Great to hear, thanks!

I've rewritten this feature since the original pull request. I think the code is ready and was intending on writing some explanatory notes, which I can post here later today.

It seems the CI build for the Powerset benchmark times out (it completes successfully on my fork). I'll take a look into this.

@willcrozi
Copy link
Contributor Author

Travis CI jobs 2 and 3 are both getting stuck and terminated after a 10 minute limit. I've reduced the length of the powerset benchmarks (needed doing anyway TBH) but this seems related to caching.

Setting up build cache

adding /home/travis/.cargo to cache
creating directory /home/travis/.cargo
adding /home/travis/build/rust-itertools/itertools/target to cache
creating directory /home/travis/build/rust-itertools/itertools/target
adding /home/travis/.rustup to cache
creating directory /home/travis/.rustup
adding /home/travis/.cache/sccache to cache
creating directory /home/travis/.cache/sccache

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

The build has been terminated```

@willcrozi
Copy link
Contributor Author

My own use case for this adaptor relates to generating passwords to feed into tools such as hashcat. Since a powerset is essentially a chain of combinations of increasing length over the input set, I've made use of the Combinations iterator.

The additions to Combinations and LazyBuffer are to allow a Powerset to efficiently expand and reuse a Combinations rather than repeatedly creating new ones (with the associated cost of cloning of the source buffer). This makes a difference in performance for my use case (lots of small powersets within large iterator chains), but for most uses this is likely amortized for larger powersets.

Powerset's need to detect the special case of its first element, combined with the usefulness of a functional size hint for large powersets, seems to me to warrant Powerset's pos field.

size_hint::two_exp() was added for Powerset's size hint but is quite specific and could be folded into Powerset's size_hint method instead.

The benchmarks for Combinations was added to check performance regression due to the current modifications. The Powerset benchmark was added for completeness.

Let me know what you think.

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! A powerset adaptor probably comes in very handy.

I sympathize with the idea of reusing Combinations, but I think it would be easier to avoid all the inner special cases.

src/powerset.rs Outdated Show resolved Hide resolved
src/powerset.rs Outdated Show resolved Hide resolved
src/size_hint.rs Outdated Show resolved Hide resolved
src/size_hint.rs Outdated Show resolved Hide resolved
Comment on lines 51 to 58
#[inline]
pub fn k(&self) -> usize { self.indices.len() }

/// Returns the (current) length of the pool from which combination elements are
/// selected. This value can change between invocations of `next()` and `init()`.
#[inline]
pub(crate) fn n(&self) -> usize { self.pool.len() }

/// Returns a reference to the source iterator.
#[inline]
pub(crate) fn src(&self) -> &I { &self.pool.it }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need (all of) these? E.g. at some point, we stored k explicitly, and we threw it out because we then always had multiple ways (namely k vs. indices.len to compute the very same thing), which just bloated the iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are mainly a consequence of handing most of our information over to Combinations. We could remove src() at the expense of losing a decent size_hint() but I think k() and n() have more justification:

  • k() and n() are used within Powerset to detect iterator completion. The alternatives I could see were a Powerset::done field (bloat) or just allowing a completed Powerset to keep increasing the size of k for its Combinations on every call to next() (wasteful, could allocate).
  • A combination's k and n are also part of the formal notation for combinations in set-theory, so I thought it might make sense to have them visible.
  • k() and n() are also necessary for the size_hint() impl.
  • Could potentially make code more readable e.g. self.indices.len() vs self.k()

I understand that we shouldn't jump through too many hoops for size_hint (and there are quite a few hoops here!), but since powersets can become large very quickly as input size increases, having one might be important for users.

src/combinations.rs Outdated Show resolved Hide resolved
src/combinations.rs Outdated Show resolved Hide resolved
@willcrozi
Copy link
Contributor Author

Hi, thanks for the detailed review!

I sympathize with the idea of reusing Combinations, but I think it would be easier to avoid all the inner special cases.

Agreed, please see my replies to your specific comments.

It seems there's a few issues that weave together here.

What I'm thinking:

  • Scrap the Inner enum and move to using increasing length Combinations from the outset, simplifying Powerset's next() impl as you suggest.
  • Decide between the following:
    • Keep size_hint impl (and therefore k(), n() and src() methods on Combinations
    • Scrap size_hint (removing Powerset::pos) and decide between:
      • Keeping k() and n() on Combinations
      • Add done field to Powerset
      • Some other way for Combinations to indicate a completed state?

Any idea on how to fix the the CI failures on jobs 2 and 3? (It gets stuck setting up sccache by the looks of it).

@willcrozi
Copy link
Contributor Author

I've simplified the Powerset implementation as per suggestions (withk() and n() still included for now).

The first additional commit excludes impl of size_hint and the second includes it for comparison.

I'll take some benchmarks when I get the chance to compare to the original approach.

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi there! Thanks for addressing the points - imho this improved the PR quite a bit!

Decide between the following:

* Keep `size_hint` impl (and therefore `k()`, `n()` and `src()` methods on `Combinations`

* Scrap `size_hint` (removing `Powerset::pos`) and decide between:
  
  * Keeping `k()` and `n()` on `Combinations`
  * Add `done` field to `Powerset`
  * Some other way for `Combinations` to indicate a completed state?

I think we could start out without size_hint - and if we really need it, possibly implement it for Combinations and compute Powerset's size_hint in terms of Combinations' size_hint. However, let's wait for @jswrenn's opinion.

As a side note: Could you rebase your commits, so that they are easier to review? It would make subsequent reviews much easier. Maybe separate your work along the lines of:

  • Pure formatting changes (if necessary)
  • Typos
  • Actual implementation
  • Benchmarks

@jswrenn We are increasingly confronted with huge PRs that could be easily split into smaller ones. Should we encourage this somewhere in our guidelines? (Do we have such guidelines?)

@@ -44,6 +43,25 @@ where
}
}
}

pub fn prefill(&mut self, len: usize) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return value is not used anywhere, so we should possibly omit it for simplicity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly for another PR: Could we unify prefill and get_next?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return value is not used anywhere, so we should possibly omit it for simplicity.

Yes, I've removed this.

Possibly for another PR: Could we unify prefill and get_next?

Yes would be good. Maybe implementing one in terms of the other is the way to go?

src/lazy_buffer.rs Outdated Show resolved Hide resolved
src/powerset.rs Outdated Show resolved Hide resolved
src/combinations.rs Outdated Show resolved Hide resolved
src/lazy_buffer.rs Outdated Show resolved Hide resolved
src/powerset.rs Outdated Show resolved Hide resolved
@willcrozi
Copy link
Contributor Author

willcrozi commented Sep 27, 2020

I think we could start out without size_hint - and if we really need it, possibly implement it for Combinations and compute Powerset's size_hint in terms of Combinations' size_hint. However, let's wait for @jswrenn's opinion.

Fair enough, though I'm not sure it's practical to compute a size_hint for Combinations since the formula involves three factorials: (n! / k!(n - k)!). A powerset's length is much easier to compute directly: 2^n.

As a side note: Could you rebase your commits, so that they are easier to review? It would make subsequent reviews much easier. Maybe separate your work along the lines of:

* Pure formatting changes (if necessary)
* Typos
* Actual implementation
* Benchmarks

No problem! I've split into two commits (impl and benchmarks). I've also made the doc comments more in line with the rest of the library.

src/combinations.rs Outdated Show resolved Hide resolved
src/combinations.rs Outdated Show resolved Hide resolved
src/combinations.rs Show resolved Hide resolved
Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted some minor things to bring powerset in line with other methods, etc.

src/powerset.rs Outdated Show resolved Hide resolved
src/lazy_buffer.rs Outdated Show resolved Hide resolved

/// Create a new `Powerset` from a clonable iterator.
pub fn powerset<I>(src: I) -> Powerset<I>
where I: Iterator,
Copy link
Member

@phimuemue phimuemue Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be IntoIterator instead of Iterator? (Afaik the other methods use IntoIterator.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combinations and all the others I've looked at so far have I: Iterator in the where clause

src/lazy_buffer.rs Outdated Show resolved Hide resolved
src/lazy_buffer.rs Outdated Show resolved Hide resolved
@willcrozi
Copy link
Contributor Author

Just to check what's the preferred approach when I've got changes based on PR review feedback:

  • Add a new commit specifically addressing the feedback?
    or
  • Rebase then force-push, splitting review changes across commits according to category (typos, implementation, benchmarks etc)?

Renames tuple_combinations benchmark functions to tuple_comb_* for clarity in
test results.
An iterator to iterate through the powerset of the elements from an iterator.
@willcrozi
Copy link
Contributor Author

I've pushed changes addressing the resolved feedback so far.

I've split the benchmarks into two commits and ordered them so that it's easier to check any effect from the changes to introduced to Combinations. I've also tweaked the doc comment for Combinations::reset() to state "If k is larger than the current length of the data pool an attempt is made to prefill...", which is more correct.

As far as I can see the outstanding issues are:

  • Whether we perform the extra check for the special case of k == 0 (see discussion here)
  • Whether we want a size_hint() implementation (see discussion here and here)

Let me know what you think and if there are any more!

@jswrenn
Copy link
Member

jswrenn commented Oct 1, 2020

Fair enough, though I'm not sure it's practical to compute a size_hint for Combinations since the formula involves three factorials: (n! / k!(n - k)!). A powerset's length is much easier to compute directly: 2^n.

I'm basically convinced by this that we should have a size_hint method. I don't see any harm in it, either.

@jswrenn
Copy link
Member

jswrenn commented Dec 7, 2020

It's about time this gets merged. Worst case scenario, we can always fix issues in subsequent releases! Thanks for contributing this.

bors r+

bors bot added a commit that referenced this pull request Dec 7, 2020
335: Powerset iterator adaptor r=jswrenn a=willcrozi

Implements a [powerset](https://en.wikipedia.org/wiki/Power_set) iterator adaptor that iterates over all subsets of the input iterator's elements. Returns vectors representing these subsets. Internally uses `Combinations` of increasing length.

I've taken the strategy of using a 'position' field that acts both as a means to to detect the special case of the first element and also allows optimal `size_hint()` implementation.

Additionally there is a commit to improve performance that alters `Combinations` implementation slightly. I've added Combinations benchmark as a stand-alone commit to allow checking for performance regressions. `Powerset` performance after this commit improves some cases (with small sizes of `n`) by 10-30%

This is my first attempt at a Rust contribution, happy to put in whatever discussion/work to get this merged. Cheers!

Co-authored-by: Will Crozier <willcrozi@gmail.com>
@bors
Copy link
Contributor

bors bot commented Dec 7, 2020

Timed out.

@jswrenn
Copy link
Member

jswrenn commented Dec 8, 2020

bors retry

bors bot added a commit that referenced this pull request Dec 8, 2020
335: Powerset iterator adaptor r=jswrenn a=willcrozi

Implements a [powerset](https://en.wikipedia.org/wiki/Power_set) iterator adaptor that iterates over all subsets of the input iterator's elements. Returns vectors representing these subsets. Internally uses `Combinations` of increasing length.

I've taken the strategy of using a 'position' field that acts both as a means to to detect the special case of the first element and also allows optimal `size_hint()` implementation.

Additionally there is a commit to improve performance that alters `Combinations` implementation slightly. I've added Combinations benchmark as a stand-alone commit to allow checking for performance regressions. `Powerset` performance after this commit improves some cases (with small sizes of `n`) by 10-30%

This is my first attempt at a Rust contribution, happy to put in whatever discussion/work to get this merged. Cheers!

Co-authored-by: Will Crozier <willcrozi@gmail.com>
@jswrenn
Copy link
Member

jswrenn commented Dec 8, 2020

My attempt at making bors happy with the new CI isn't going well, so bear with me here...

@jswrenn
Copy link
Member

jswrenn commented Dec 8, 2020

bors r-

@bors
Copy link
Contributor

bors bot commented Dec 8, 2020

Canceled.

@jswrenn
Copy link
Member

jswrenn commented Dec 8, 2020

bors r+

bors bot added a commit that referenced this pull request Dec 8, 2020
335: Powerset iterator adaptor r=jswrenn a=willcrozi

Implements a [powerset](https://en.wikipedia.org/wiki/Power_set) iterator adaptor that iterates over all subsets of the input iterator's elements. Returns vectors representing these subsets. Internally uses `Combinations` of increasing length.

I've taken the strategy of using a 'position' field that acts both as a means to to detect the special case of the first element and also allows optimal `size_hint()` implementation.

Additionally there is a commit to improve performance that alters `Combinations` implementation slightly. I've added Combinations benchmark as a stand-alone commit to allow checking for performance regressions. `Powerset` performance after this commit improves some cases (with small sizes of `n`) by 10-30%

This is my first attempt at a Rust contribution, happy to put in whatever discussion/work to get this merged. Cheers!

Co-authored-by: Will Crozier <willcrozi@gmail.com>
@bors
Copy link
Contributor

bors bot commented Dec 8, 2020

Timed out.

@jswrenn
Copy link
Member

jswrenn commented Dec 10, 2020

bors r+

@bors
Copy link
Contributor

bors bot commented Dec 10, 2020

Build succeeded:

@bors bors bot merged commit 130ffd3 into rust-itertools:master Dec 10, 2020
@willcrozi
Copy link
Contributor Author

It's about time this gets merged. Worst case scenario, we can always fix issues in subsequent releases! Thanks for contributing this.

No problem, great to see it finally merged!

Thanks @jswrenn and @phimuemue for the all the feedback and bearing with me. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants