
# Iterables, iterators and generators

Iteration is fundamental to data processing. And when scanning datasets that don’t fit in memory, we need a way to fetch the items lazily, that is, one at a time and on demand. This is what the Iterator pattern is about.

Python does not have macros like Lisp, so abstracting away the Iterator pattern required changing the language: the yield keyword was added
in Python 2.2 (2001). The yield keyword allows the construction of generators, which work as iterators.

Python 3 uses generators in many places. Even the range() built-in now returns a lazy, generator-like range object instead of a full list, as it did in Python 2. If you must build a list from range, you have to be explicit, e.g. list(range(100)).
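
To make the point about range() concrete, here is a minimal sketch. The variable names are illustrative only, and the size of one billion is an arbitrary choice for the demonstration.

```python
# A range object keeps only start, stop, and step;
# no list of a billion integers is ever built here.
numbers = range(10**9)

print(len(numbers))        # 1000000000, computed rather than counted
print(numbers[-1])         # 999999999, indexing is computed too
print(999_999 in numbers)  # True, membership is checked arithmetically

# A real list is built only when we ask for one explicitly.
first_five = list(range(5))
print(first_five)          # [0, 1, 2, 3, 4]
```

Unlike a generator, a range object can be iterated over as many times as needed, which is why it is described as generator-like and not as a generator.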

Every collection in Python is iterable, and iterators are used internally to support:

- for loops;
- collection types construction and extension;
- looping over text files line by line;
- list, dict and set comprehensions;
- tuple unpacking;
- unpacking actual parameters with * in function calls.
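
The contexts listed above can be sketched with a single small generator. The names `letters` and `join3` are hypothetical helpers introduced only for this illustration.

```python
def letters():
    """Hypothetical generator used only for this illustration."""
    yield "a"
    yield "b"
    yield "c"

def join3(x, y, z):
    """Hypothetical helper that concatenates three strings."""
    return x + y + z

as_list = list(letters())                   # collection construction
extended = ["z"]
extended.extend(letters())                  # collection extension
as_set = {ch.upper() for ch in letters()}   # set comprehension
first, second, third = letters()            # tuple unpacking
joined = join3(*letters())                  # unpacking arguments with *

print(as_list, extended, as_set, (first, second, third), joined)
```

In every case Python asks the object for an iterator behind the scenes and consumes it item by item.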

## Sentence take #1: a sequence of words

We’ll start our exploration of iterables by implementing a Sentence class: you give its constructor a string with some text, and then you can iterate word by word.

In [1]:
import re
import reprlib

In [None]:
RE_WORD = re.compile(r"\w+")  # raw string avoids an invalid escape sequence warning

class Sentence:

  def __init__(self, text):
    self.text = text
    # returns a list with all non-overlapping matches of the regular expression, as a list of strings.
    self.words = RE_WORD.findall(text)

  def __getitem__(self, index):
    # self.words holds the result of .findall, so we simply return the word at the given index.
    return self.words[index]

  # To complete the sequence protocol, we implement __len__ — but it is not needed to make an iterable object.
  def __len__(self):
    return len(self.words)

  def __repr__(self):
    # generate abbreviated string representations of data structures that can be very large
    return "Sentence(%s)" % reprlib.repr(self.text)

By default, reprlib.repr limits the generated string to 30 characters.

In [None]:
s = Sentence('"The time has come," the Walrus said,')
s

Sentence('"The time ha... Walrus said,')

In [None]:
# Sentence instances are iterable
for word in s:
  print(word)

The
time
has
come
the
Walrus
said


In [None]:
# Being iterable, Sentence objects can be used as input to build lists and other iterable types.
list(s)

['The', 'time', 'has', 'come', 'the', 'Walrus', 'said']

In [None]:
# Sentence is also a sequence, so you can get words by index.
s[0]

'The'

In [None]:
s[5]

'Walrus'

In [None]:
s[-1]

'said'

In [None]:
s[-2]

'Walrus'

### Why sequences are iterable: the iter function

Every Python programmer knows that sequences are iterable. Now we’ll see precisely why.

Whenever the interpreter needs to iterate over an object x, it automatically calls iter(x).

The iter built-in function:

- Checks whether the object implements `__iter__`, and calls it to obtain an iterator;
- If `__iter__` is not implemented but `__getitem__` is, Python creates an iterator that attempts to fetch items in order, starting from index 0 (zero);
- If that fails, Python raises TypeError, usually saying "'C' object is not iterable", where C is the class of the target object.
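
The three steps above can be sketched with three tiny classes. The class names are hypothetical and exist only to trigger each branch of iter().

```python
class WithIter:
    """Implements __iter__, so iter() calls it directly."""
    def __iter__(self):
        return iter([1, 2, 3])

class WithGetitem:
    """No __iter__; iter() falls back to __getitem__ with 0, 1, 2, ..."""
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)  # tells the fallback iterator to stop
        return index * 10

class Neither:
    """Implements neither method, so iter() raises TypeError."""

print(list(WithIter()))     # [1, 2, 3]
print(list(WithGetitem()))  # [0, 10, 20]

try:
    iter(Neither())
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that the fallback iterator relies on `__getitem__` raising IndexError to know when to stop.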

That is why any Python sequence is iterable: they all implement `__getitem__`. In fact, the standard sequences also implement `__iter__`, and yours should too, because the special handling of `__getitem__` exists for backward compatibility reasons and may be gone in the future.

This is an extreme form of duck typing: an object is considered iterable not only when it implements the special method `__iter__`, but also when it implements `__getitem__`, as long as `__getitem__` accepts
int keys starting from 0.

In the goose-typing approach, the definition for an iterable is simpler but not as flexible: an object is considered iterable if it implements the `__iter__` method. No subclassing or registration is required, because abc.Iterable implements the `__subclasshook__`.

In [None]:
class Foo:
  def __iter__(self):
    pass

In [None]:
from collections import abc

In [None]:
issubclass(Foo, abc.Iterable)

True

In [None]:
f = Foo()
isinstance(f, abc.Iterable)

True

However, note that our initial Sentence class does not pass the issubclass(Sentence, abc.Iterable) test, even though it is iterable in practice.
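
Because of this gap, a more reliable check is simply to call iter() and handle TypeError. The sketch below uses a hypothetical `IndexedWords` class as a stand-in for the first Sentence, plus a hypothetical `is_iterable` helper.

```python
from collections import abc

class IndexedWords:
    """Hypothetical stand-in for the first Sentence: only __getitem__."""
    def __init__(self, text):
        self.words = text.split()

    def __getitem__(self, index):
        return self.words[index]

def is_iterable(obj):
    """Return True if iter(obj) succeeds, which also covers __getitem__."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True

w = IndexedWords("Pig and Pepper")
print(issubclass(IndexedWords, abc.Iterable))  # False: no __iter__ method
print(is_iterable(w))                          # True: iter() uses __getitem__
print(is_iterable(42))                         # False: int is not iterable
```

If the object is going to be iterated right away, an explicit check is often unnecessary: attempting the iteration and handling TypeError achieves the same result.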

### Iterables versus iterators

It’s important to be clear about the relationship between iterables and iterators: Python obtains iterators from iterables.

Here is a simple for loop iterating over a str. The str 'ABC' is the iterable here. You don’t see it, but there is an iterator behind the curtain:

In [None]:
s = 'ABC'
for char in s:
  print(char)

A
B
C


If there were no for statement and we had to emulate the for machinery by hand with a while loop, this is what we’d have to write:

In [None]:
s = 'ABC'
# Build an iterator it from the iterable.
it = iter(s)
while True:
  try:
    # Repeatedly call next on the iterator to obtain the next item.
    print(next(it))
  except StopIteration:  # The iterator raises StopIteration when there are no further items.
    # Release reference to it — the iterator object is discarded.
    del it
    break

A
B
C


StopIteration signals that the iterator is exhausted. This exception is handled internally in for loops and other iteration contexts like list comprehensions, tuple unpacking etc.

The standard interface for an iterator has two methods:

- `__next__`: Returns the next available item, raising StopIteration when there are no more items.
- `__iter__`: Returns self; this allows iterators to be used where an iterable is expected, for example, in a for loop.

In [None]:
s3 = Sentence("Pig and Pepper")

# Obtain an iterator from s3.
it = iter(s3)

In [None]:
# next(it) fetches the next word.
next(it)

'Pig'

In [None]:
next(it)

'and'

In [None]:
next(it)

'Pepper'

In [None]:
# There are no more words, so the iterator raises a StopIteration exception.
next(it)

StopIteration: ignored

In [None]:
# Once exhausted, an iterator becomes useless.
list(it)

[]

In [None]:
# To go over the sentence again, a new iterator must be built.
list(iter(s3))

['Pig', 'and', 'Pepper']

Since the only methods required of an iterator are `__next__` and `__iter__`, there is no way to check whether there are remaining items, other than to call next() and catch StopIteration.

Also, it’s not possible to “reset” an iterator. If you need to start over, you need to call iter(…) on the iterable that built the iterator in the first place. 

Calling `iter(…)` on the iterator itself won’t help because, as mentioned, `Iterator.__iter__` is implemented by returning self, so this will not reset a depleted iterator.
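
A short sketch of this behavior, using a plain list as the iterable to keep the example small:

```python
words = ["Pig", "and", "Pepper"]
it = iter(words)

same = iter(it) is it      # an iterator's __iter__ returns the iterator itself
drained = list(it)         # consumes every remaining item
leftover = list(iter(it))  # still the same, exhausted iterator
fresh = list(iter(words))  # a new iterator built from the original iterable

print(same)      # True
print(drained)   # ['Pig', 'and', 'Pepper']
print(leftover)  # []
print(fresh)     # ['Pig', 'and', 'Pepper']
```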

## Sentence take #2: a classic iterator

The next implementation of Sentence is iterable because it implements
the `__iter__` special method, which builds and returns a SentenceIterator. This
is how the Iterator design pattern is described in the original Design Patterns book.

We are doing it this way here just to make clear the crucial distinction between an iterable
and an iterator and how they are connected.

In [3]:
RE_WORD = re.compile(r"\w+")  # raw string avoids an invalid escape sequence warning

class Sentence:

  def __init__(self, text):
    self.text = text
    # returns a list with all non-overlapping matches of the regular expression, as a list of strings.
    self.words = RE_WORD.findall(text)

  def __repr__(self):
    # generate abbreviated string representations of data structures that can be very large
    return "Sentence(%s)" % reprlib.repr(self.text)

  """
  The __iter__ method is the only addition to the previous Sentence
  implementation. This version has no __getitem__, to make it clear that the class
  is iterable because it implements __iter__.
  """
  def __iter__(self):
    # __iter__ fulfills the iterable protocol by instantiating and returning an iterator.
    return SentenceIterator(self.words)


class SentenceIterator:

  def __init__(self, words):
    # SentenceIterator holds a reference to the list of words.
    self.words = words
    # self.index is used to determine the next word to fetch.
    self.index = 0

  def __next__(self):
    try:
      word = self.words[self.index]
    except IndexError:
      # If there is no word at self.index, raise StopIteration.
      raise StopIteration()
    self.index += 1  # Increment self.index.
    return word

  def __iter__(self):
    return self

Note that implementing `__iter__` in SentenceIterator is not actually needed for this example to work, but it’s the right thing to do: iterators are supposed to implement both `__next__` and `__iter__`, and doing so makes our iterator pass the issubclass(SentenceIterator, abc.Iterator) test. If we had subclassed SentenceIterator from abc.Iterator, we’d inherit the concrete `abc.Iterator.__iter__` method.
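
As a sketch of that last alternative, here is a hypothetical `WordIterator` that subclasses abc.Iterator and defines only `__next__`. It is a variation for illustration, not the version used in the rest of this notebook.

```python
from collections import abc

class WordIterator(abc.Iterator):
    """Hypothetical variant of SentenceIterator that subclasses abc.Iterator."""

    def __init__(self, words):
        self.words = words
        self.index = 0

    def __next__(self):
        try:
            word = self.words[self.index]
        except IndexError:
            raise StopIteration from None
        self.index += 1
        return word

    # No __iter__ here: the concrete abc.Iterator.__iter__ returns self.

it = WordIterator(["Pig", "and", "Pepper"])
print(iter(it) is it)  # True
print(list(it))        # ['Pig', 'and', 'Pepper']
```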

### Making Sentence an iterator: bad idea

A common cause of errors in building iterables and iterators is to confuse the two. To be clear: iterables have an `__iter__` method that instantiates a new iterator every time. Iterators implement a `__next__` method that returns individual items, and an `__iter__` method that returns self.

Therefore, iterators are also iterable, but iterables are not iterators.

To “support multiple traversals” it must be possible to obtain multiple independent iterators from the same iterable instance, and each iterator must keep its own internal state, so a proper implementation of the pattern requires each call to iter(my_iterable) to create a new, independent, iterator.
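
To see why this matters, consider a hypothetical anti-example, `SelfIterating`, where the iterable is also its own iterator. Nested loops over the same object then share one position, which a regular list does not suffer from.

```python
class SelfIterating:
    """Hypothetical anti-example: the iterable is also its own iterator."""

    def __init__(self, words):
        self.words = words
        self.index = 0

    def __iter__(self):
        return self  # no new iterator is created, so state is shared

    def __next__(self):
        if self.index >= len(self.words):
            raise StopIteration
        word = self.words[self.index]
        self.index += 1
        return word

bad = SelfIterating(["Pig", "and", "Pepper"])
pairs_bad = [(x, y) for x in bad for y in bad]

good = ["Pig", "and", "Pepper"]  # a list hands out a fresh iterator each time
pairs_good = [(x, y) for x in good for y in good]

print(pairs_bad)        # [('Pig', 'and'), ('Pig', 'Pepper')]
print(len(pairs_good))  # 9
```

The inner loop drains the shared state, so the outer loop ends after its first item and most pairs are silently lost.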

## Sentence take #3: a generator function

A Pythonic implementation of the same functionality uses a generator function to replace the SentenceIterator class.

In [4]:
RE_WORD = re.compile(r"\w+")  # raw string avoids an invalid escape sequence warning

class Sentence:

  def __init__(self, text):
    self.text = text
    # returns a list with all non-overlapping matches of the regular expression, as a list of strings.
    self.words = RE_WORD.findall(text)

  def __repr__(self):
    # generate abbreviated string representations of data structures that can be very large
    return "Sentence(%s)" % reprlib.repr(self.text)

  """
  Unlike the previous version, there is no separate SentenceIterator class:
  __iter__ is a generator function, so calling it returns a generator object
  that serves as the iterator.
  """
  def __iter__(self):
    # Iterate over self.words.
    for word in self.words:
      yield word   # Yield the current word.
    # No explicit return is needed: when the loop ends, the function falls through and the generator stops.

Now the iterator is in fact a generator object, built automatically when the `__iter__` method is called, because `__iter__` here is a generator function.
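
This can be checked directly. The sketch below uses a hypothetical minimal `Words` class in place of Sentence, and inspect.isgenerator from the standard library to identify the object returned by iter().

```python
import inspect

class Words:
    """Hypothetical minimal class whose __iter__ is a generator function."""

    def __init__(self, text):
        self.words = text.split()

    def __iter__(self):
        for word in self.words:
            yield word

w = Words("Pig and Pepper")
it = iter(w)

print(inspect.isgenerator(it))  # True: the iterator is a generator object
print(iter(it) is it)           # True: generators return self from __iter__
print(iter(w) is it)            # False: each call builds a new generator
print(list(it))                 # ['Pig', 'and', 'Pepper']
```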

### How a generator function works

Any Python function that has the yield keyword in its body is a generator function: a function which, when called, returns a generator object. In other words, a generator function is a generator factory.

Here is the simplest function useful to demonstrate the behavior of a generator:

In [5]:
def gen_123():
  yield 1
  yield 2
  yield 3

In [6]:
gen_123

<function __main__.gen_123>

In [7]:
gen_123()

<generator object gen_123 at 0x7f2e275997d8>

In [8]:
for i in gen_123():
  print(i)

1
2
3


In [9]:
g = gen_123()

In [10]:
next(g)

1

In [11]:
next(g)

2

In [12]:
next(g)

3

In [13]:
# When the body of the function completes, the generator object raises a StopIteration.
next(g)

StopIteration: ignored

A generator function builds a generator object which wraps the body of the function. When we invoke next(…) on the generator object, execution advances to the next yield in the function body, and the next(…) call evaluates to the value yielded when the function body is suspended. Finally, when the function body returns, the enclosing generator object raises StopIteration, in accordance with the Iterator protocol.

> Calling a generator function returns a generator.
A generator yields or produces values. A generator doesn’t
“return” values in the usual way: the return statement in the body
of a generator function causes StopIteration to be raised by the
generator object.
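
A minimal sketch of that behavior, using a hypothetical `countdown` generator that ends early through a bare return:

```python
def countdown(start, stop_at):
    """Hypothetical generator that stops early via a bare return."""
    n = start
    while True:
        if n < stop_at:
            return  # ends the generator; the next next() raises StopIteration
        yield n
        n -= 1

values = list(countdown(3, 1))
print(values)  # [3, 2, 1]

g = countdown(1, 1)
first = next(g)  # 1
try:
    next(g)
    stopped = False
except StopIteration:
    stopped = True
print(first, stopped)  # 1 True
```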

Here is a generator function that prints messages when it runs:

In [14]:
def gen_AB():
  print("start")
  yield "A"
  print("continue")
  yield "B"
  print("end.")

In [15]:
for c in gen_AB():
  print("-->", c)

start
--> A
continue
--> B
end.


Now hopefully it’s clear how `Sentence.__iter__` works: `__iter__` is a
generator function which, when called, builds a generator object that implements the iterator interface, so the SentenceIterator class is no longer needed.

This generator-based version of Sentence is much shorter than the classic iterator version, but it’s not as lazy as it could be.

Nowadays, laziness is considered a good trait, at least in programming languages
and APIs. A lazy implementation postpones producing values to the last possible
moment. This saves memory and may avoid useless processing as well.

## Sentence take #4: a lazy implementation

The Iterator interface is designed to be lazy: next(my_iterator) produces one item at a time. The opposite of lazy is eager: lazy evaluation and eager evaluation are actual technical terms in programming language theory.

Our Sentence implementations so far have not been lazy because `__init__` eagerly builds a list of all words in the text, binding it to the self.words attribute. This entails processing the entire text, and the list may use as much memory as the text itself (probably more; it depends on how many non-word characters are in the text). Most of this work will be in vain if the user only iterates over the first couple of words.

The re.finditer function is a lazy version of re.findall which, instead of a list, returns an iterator producing re.Match instances on demand. If there are many matches, re.finditer saves a lot of memory. Using it, this fourth version of Sentence is lazy: it only produces the next word when it is needed.
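
The on-demand behavior can be observed with itertools.islice, which takes only as many matches as requested. This is a small sketch with a sample text chosen for illustration.

```python
import re
from itertools import islice

RE_WORD = re.compile(r"\w+")
text = "The time has come, the Walrus said"

matches = RE_WORD.finditer(text)  # an iterator; nothing has been consumed yet

# Take only the first two matches; the rest of the text is not scanned here.
first_two = [m.group() for m in islice(matches, 2)]

# The same iterator continues where islice stopped.
rest = [m.group() for m in matches]

print(first_two)  # ['The', 'time']
print(rest)       # ['has', 'come', 'the', 'Walrus', 'said']
```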

In [None]:
RE_WORD = re.compile(r"\w+")  # raw string avoids an invalid escape sequence warning

class Sentence:

  def __init__(self, text):
    self.text = text

  def __repr__(self):
    return "Sentence(%s)" % reprlib.repr(self.text)

  """
  finditer builds an iterator over the matches of RE_WORD on self.text, yielding re.Match instances.
  """
  def __iter__(self):
    # match.group() extracts the actual matched text from the re.Match instance.
    for match in RE_WORD.finditer(self.text):
      yield match.group()   # Yield the current word.
    # No explicit return is needed: the generator stops when the loop ends.

## Sentence take #5: a generator expression
