Conversation

@MitalAshok
Contributor

This function is like toolz.itertoolz.peek, but for multiple items.

@eriknw
Member

eriknw commented Jun 22, 2019

Hi @MitalAshok, thanks for the PR, and I'm sorry for my excessive delay.

I like the name peekn and its functionality. However, it seems to me that this would be more natural if it returned a tuple instead of a lazy-ish iterator (I call it "lazy-ish" because it uses tee, which I generally prefer to avoid because it can copy and hold data behind the scenes).
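
A minimal sketch of that behind-the-scenes buffering (a CPython detail; range here is just a stand-in source):

import itertools

a, b = itertools.tee(range(1000000))
for _ in a:  # exhaust `a` while `b` is untouched
    pass
# Every item `a` consumed is still cached inside the tee, because `b`
# may yet need it: roughly a million items are now held in memory.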

I'll give others a chance to chime in too. @groutr, what do you think?

@MitalAshok
Contributor Author

No worries @eriknw. That makes sense, so I added a lazy kwarg; you can still use iterators if appropriate.

@groutr
Contributor

groutr commented Jun 24, 2019

I agree with @eriknw; tee can sometimes be problematic. In this case, I think using tee is unnecessary.

Compare with the following implementation, which is faster and uses noticeably less memory for large n (benchmarked with seq = range(1000000)):

import itertools
from toolz import take

def peekn(n, seq):
    seq = iter(seq)
    # Materialize the first n items, then chain them back in front
    peeked = tuple(take(n, seq))
    return peeked, itertools.chain(peeked, seq)
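
A quick interactive check of the intended semantics (hypothetical session):

>>> peeked, rest = peekn(3, 'abcdefg')
>>> peeked
('a', 'b', 'c')
>>> ''.join(rest)
'abcdefg'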
In [4]: %timeit peekn(3, seq)
812 ns ± 36.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit peekntee(3, seq)
1.17 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
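
For context, the tee-based variant being timed, peekntee, is the version from the PR; reconstructed here from the memory profile below:

import itertools
from toolz import take

def peekntee(n, seq, lazy=False):
    iterator = iter(seq)
    peeked, orig = itertools.tee(take(n, iterator))
    if not lazy:
        peeked = tuple(peeked)
    return peeked, itertools.chain(orig, iterator)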
# Profiled calls:
# peekn(500000, seq)
# peekntee(500000, seq)
(py37) ❯ python -m memory_profiler peekn.py
Filename: peekn.py

Line #    Mem usage    Increment   Line Contents
================================================
     8     48.3 MiB     48.3 MiB   @profile
     9                             def peekn(n, seq):
    10     48.3 MiB      0.0 MiB       seq = iter(seq)
    11     68.4 MiB     20.1 MiB       peeked = tuple(take(n, seq))
    12     68.4 MiB      0.0 MiB       return peeked, itertools.chain(peeked, seq)


Filename: peekn.py

Line #    Mem usage    Increment   Line Contents
================================================
    14     49.3 MiB     49.3 MiB   @profile
    15                             def peekntee(n, seq, lazy=False):
    16     49.3 MiB      0.0 MiB       iterator = iter(seq)
    17     49.3 MiB      0.0 MiB       peeked, orig = itertools.tee(take(n, iterator))
    18     49.3 MiB      0.0 MiB       if not lazy:
    19     75.0 MiB     25.7 MiB           peeked = tuple(peeked)
    20     75.0 MiB      0.0 MiB       return peeked, itertools.chain(orig, iterator)

I would prefer the lazy keyword be removed entirely. I don't think it is useful enough to justify the added API complexity. What would be the use case for a lazy iterator over the peeked elements?

EDIT: Overall, I'm in favor of having a generalized peek function. 👍 Thanks for the PR!

@MitalAshok
Contributor Author

@groutr The main problem I had with chaining a tuple was that, even after the original chained tuple had fallen out of scope, the chain still holds a reference to an iterator, which holds a reference to the tuple, so every element is kept alive:

Line #    Mem usage    Increment   Line Contents
================================================
    29   11.766 MiB   11.766 MiB   @profile
    30                             def test_peekn():
    31   11.766 MiB    0.000 MiB       seq = range(1000000)
    32   31.168 MiB   19.402 MiB       peeked, it = peekn(500000, seq)
    33                                 # Used the peek, it's fallen out of scope and we continue
    34   31.168 MiB    0.000 MiB       peeked = None
    35   31.172 MiB    0.004 MiB       nth(300000, it)
    36                                 # Process the next few elements based on the peek
    37   31.172 MiB    0.000 MiB       nth(200000 - 2, it)
    38   31.172 MiB    0.000 MiB       next(it)
    39   31.172 MiB    0.000 MiB       next(it)
    40   12.066 MiB    0.000 MiB       it = None

Line #    Mem usage    Increment   Line Contents
================================================
    43   12.066 MiB   12.066 MiB   @profile
    44                             def test_peekntee():
    45   12.066 MiB    0.000 MiB       seq = range(1000000)
    46   36.133 MiB   24.066 MiB       peeked, it = peekntee(500000, seq)
    47   36.133 MiB    0.000 MiB       peeked = None
    48   27.133 MiB    0.000 MiB       nth(300000, it)
    49   20.883 MiB    0.000 MiB       nth(200000 - 2, it)
    50   20.883 MiB    0.000 MiB       next(it)
    51   20.883 MiB    0.000 MiB       next(it)
    52   12.078 MiB    0.000 MiB       it = None

The tee deallocates elements as it is iterated over (which is another reason it is slower). However, the non-tee method doesn't free them at all. This can be fixed like so:

import itertools
from toolz import take

def peekn(n, seq):
    seq = iter(seq)
    peeked = tuple(take(n, seq))
    # iter(peeked): the exhausted tuple iterator releases the tuple early
    return peeked, itertools.chain(iter(peeked), seq)
Line #    Mem usage    Increment   Line Contents
================================================
    29   11.766 MiB   11.766 MiB   @profile
    30                             def test_peekn():
    31   11.766 MiB    0.000 MiB       seq = range(1000000)
    32   31.168 MiB   19.402 MiB       peeked, it = peekn(500000, seq)
    33   31.168 MiB    0.000 MiB       peeked = None
    34   31.172 MiB    0.004 MiB       nth(300000, it)
    35   31.172 MiB    0.000 MiB       nth(200000 - 2, it)
    36   12.066 MiB    0.000 MiB       next(it)
    37   12.066 MiB    0.000 MiB       next(it)
    38   12.066 MiB    0.000 MiB       it = None

(No changes have been made to peekntee)

You can see the problem: no memory is released until the peeked section has been completely iterated over.
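
The underlying CPython detail, sketched: an exhausted tuple iterator drops its reference to its tuple, which is why wrapping the tuple in iter() lets the peeked items be freed.

import sys

t = tuple(range(1000))
it = iter(t)
before = sys.getrefcount(t)  # t itself, `it`, and getrefcount's argument
for _ in it:                 # exhaust the tuple iterator
    pass
after = sys.getrefcount(t)   # one less: the exhausted iterator let go of t
assert after == before - 1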

Where does that ~8 MiB go in the tee case? Apparently the chain keeps the itertools._tee object alive, and it retains a small cache (though I can only find 499947 to 499999 being cached) even though both sides have been exhausted. So that's another solid reason not to use tee.

The lazy keyword existed only because tee made it possible; with the chained-tuple approach it doesn't make sense.

@groutr
Contributor

groutr commented Jun 24, 2019

@MitalAshok great job on the memory analysis! I was unaware of this behavior of itertools.chain. I wonder if this behavior is popping up in other places where chain is used in toolz. The iter trick looks good to me.

@eriknw LGTM

eriknw merged commit 020161b into pytoolz:master on Jul 9, 2019
@eriknw
Member

eriknw commented Jul 9, 2019

This is in! Everything about this PR is excellent. Thanks @MitalAshok!
