Conversation

@MitalAshok
Contributor

This function is like toolz.itertoolz.peek, but for multiple items.

@eriknw
Member

eriknw commented Jun 22, 2019

Hi @MitalAshok, thanks for the PR, and I'm sorry for my excessive delay.

I like the name peekn and its functionality. However, it seems to me that this would be more natural if it returned a tuple instead of a lazy-ish iterator (I call it "lazy-ish" because it uses tee, which I generally prefer to avoid because it can copy and hold data behind the scenes).
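
A minimal sketch of that behind-the-scenes buffering (a CPython detail; range here is just a stand-in source):

import itertools

a, b = itertools.tee(range(1000000))
for _ in a:  # exhaust `a` while `b` is untouched
    pass
# Every item `a` consumed is still cached inside the tee, because `b`
# may yet need it: roughly a million items are now held in memory.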

I'll give others a chance to chime in too. @groutr, what do you think?

@MitalAshok
Contributor Author

No worries @eriknw. That makes sense, so I added a lazy kwarg; you can still use iterators if appropriate.

@groutr
Contributor

groutr commented Jun 24, 2019

I agree with @eriknw; tee can sometimes be problematic. In this case, I think using tee is unnecessary.

Compare with the following implementation, which is faster and uses noticeably less memory for large n (benchmarked with seq = range(1000000)):

import itertools
from toolz import take

def peekn(n, seq):
    seq = iter(seq)
    # Materialize the first n items, then chain them back in front
    peeked = tuple(take(n, seq))
    return peeked, itertools.chain(peeked, seq)
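
A quick interactive check of the intended semantics (hypothetical session):

>>> peeked, rest = peekn(3, 'abcdefg')
>>> peeked
('a', 'b', 'c')
>>> ''.join(rest)
'abcdefg'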
In [4]: %timeit peekn(3, seq)
812 ns ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit peekntee(3, seq)
1.17 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
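
For context, the tee-based variant being timed, peekntee, is the version from the PR; reconstructed here from the memory profile below:

import itertools
from toolz import take

def peekntee(n, seq, lazy=False):
    iterator = iter(seq)
    peeked, orig = itertools.tee(take(n, iterator))
    if not lazy:
        peeked = tuple(peeked)
    return peeked, itertools.chain(orig, iterator)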
# Profiled calls:
# peekn(500000, seq)
# peekntee(500000, seq)
(py37) ❯ python -m memory_profiler peekn.py
Filename: peekn.py

Line #    Mem usage    Increment   Line Contents
================================================
     8     48.3 MiB     48.3 MiB   @profile
     9                             def peekn(n, seq):
    10     48.3 MiB      0.0 MiB       seq = iter(seq)
    11     68.4 MiB     20.1 MiB       peeked = tuple(take(n, seq))
    12     68.4 MiB      0.0 MiB       return peeked, itertools.chain(peeked, seq)


Filename: peekn.py

Line #    Mem usage    Increment   Line Contents
================================================
    14     49.3 MiB     49.3 MiB   @profile
    15                             def peekntee(n, seq, lazy=False):
    16     49.3 MiB      0.0 MiB       iterator = iter(seq)
    17     49.3 MiB      0.0 MiB       peeked, orig = itertools.tee(take(n, iterator))
    18     49.3 MiB      0.0 MiB       if not lazy:
    19     75.0 MiB     25.7 MiB           peeked = tuple(peeked)
    20     75.0 MiB      0.0 MiB       return peeked, itertools.chain(orig, iterator)

I would prefer the lazy keyword be removed entirely. I don't think it is useful enough to justify the added API complexity. What would be the use case for a lazy iterator over the peeked elements?

EDIT: Overall, I'm in favor of having a generalized peek function. 👍 Thanks for the PR!

@MitalAshok
Contributor Author

@groutr The main problem I had with chaining a tuple was that, even after the original chained tuple had fallen out of scope, the chain still holds a reference to an iterator, which holds a reference to the tuple, so every element is kept alive:

Line #    Mem usage    Increment   Line Contents
================================================
    29   11.766 MiB   11.766 MiB   @profile
    30                             def test_peekn():
    31   11.766 MiB    0.000 MiB       seq = range(1000000)
    32   31.168 MiB   19.402 MiB       peeked, it = peekn(500000, seq)
    33                                 # Used the peek, it's fallen out of scope and we continue
    34   31.168 MiB    0.000 MiB       peeked = None
    35   31.172 MiB    0.004 MiB       nth(300000, it)
    36                                 # Process the next few elements based on the peek
    37   31.172 MiB    0.000 MiB       nth(200000 - 2, it)
    38   31.172 MiB    0.000 MiB       next(it)
    39   31.172 MiB    0.000 MiB       next(it)
    40   12.066 MiB    0.000 MiB       it = None

Line #    Mem usage    Increment   Line Contents
================================================
    43   12.066 MiB   12.066 MiB   @profile
    44                             def test_peekntee():
    45   12.066 MiB    0.000 MiB       seq = range(1000000)
    46   36.133 MiB   24.066 MiB       peeked, it = peekntee(500000, seq)
    47   36.133 MiB    0.000 MiB       peeked = None
    48   27.133 MiB    0.000 MiB       nth(300000, it)
    49   20.883 MiB    0.000 MiB       nth(200000 - 2, it)
    50   20.883 MiB    0.000 MiB       next(it)
    51   20.883 MiB    0.000 MiB       next(it)
    52   12.078 MiB    0.000 MiB       it = None

The tee deallocates elements as it is iterated over (which is another reason it is slower). However, the non-tee method doesn't free them at all. This can be fixed like so:

import itertools
from toolz import take

def peekn(n, seq):
    seq = iter(seq)
    peeked = tuple(take(n, seq))
    # iter(peeked): the exhausted tuple iterator releases the tuple early
    return peeked, itertools.chain(iter(peeked), seq)
Line #    Mem usage    Increment   Line Contents
================================================
    29   11.766 MiB   11.766 MiB   @profile
    30                             def test_peekn():
    31   11.766 MiB    0.000 MiB       seq = range(1000000)
    32   31.168 MiB   19.402 MiB       peeked, it = peekn(500000, seq)
    33   31.168 MiB    0.000 MiB       peeked = None
    34   31.172 MiB    0.004 MiB       nth(300000, it)
    35   31.172 MiB    0.000 MiB       nth(200000 - 2, it)
    36   12.066 MiB    0.000 MiB       next(it)
    37   12.066 MiB    0.000 MiB       next(it)
    38   12.066 MiB    0.000 MiB       it = None

(No changes have been made to peekntee)

You can see the problem: no memory is released until the peeked section has been completely iterated over.
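
The underlying CPython detail, sketched: an exhausted tuple iterator drops its reference to its tuple, which is why wrapping the tuple in iter() lets the peeked items be freed.

import sys

t = tuple(range(1000))
it = iter(t)
before = sys.getrefcount(t)  # t itself, `it`, and getrefcount's argument
for _ in it:                 # exhaust the tuple iterator
    pass
after = sys.getrefcount(t)   # one less: the exhausted iterator let go of t
assert after == before - 1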

Where does that ~8 MiB go in the tee case? Apparently the chain keeps the itertools._tee object alive, and it retains a small cache (though I can only find 499947 to 499999 being cached) even though both sides have been exhausted. So that's another solid reason not to use tee.

The lazy keyword existed only because tee made it possible; with the chained-tuple approach it doesn't make sense.

@groutr
Contributor

groutr commented Jun 24, 2019

@MitalAshok great job on the memory analysis! I was unaware of this behavior of itertools.chain. I wonder if this behavior is popping up in other places where chain is used in toolz. The iter trick looks good to me.

@eriknw LGTM

eriknw merged commit 020161b into pytoolz:master on Jul 9, 2019
@eriknw
Member

eriknw commented Jul 9, 2019

This is in! Everything about this PR is excellent. Thanks @MitalAshok!
