<img src="../../images/banners/python-advanced.png" width="600"/>

# <img src="../../images/logos/python.png" width="23"/> Generators 


## Table of Contents


* [Using Generators](#using_generators)
    * [Example 1: Reading Large Files](#example_1:_reading_large_files)
    * [Example 2: Generating an Infinite Sequence](#example_2:_generating_an_infinite_sequence)
* [Understanding Generators](#understanding_generators)
* [Building Generators With Generator Expressions](#building_generators_with_generator_expressions)
* [Profiling Generator Performance](#profiling_generator_performance)
* [Understanding the Python Yield Statement](#understanding_the_python_yield_statement)

---

Have you ever had to work with a dataset so large that it overwhelmed your machine’s memory? Or maybe you have a complex function that needs to maintain an internal state every time it’s called, but the function is too small to justify creating its own class. In these cases and more, generators and the Python yield statement are here to help.

If you’re a beginner or intermediate Pythonista and you’re interested in learning how to work with large datasets in a more Pythonic fashion, then this is the tutorial for you.

<a class="anchor" id="using_generators"></a>
## Using Generators

Introduced with [PEP 255](https://www.python.org/dev/peps/pep-0255), generator functions are a special kind of function that returns a lazy iterator. A lazy iterator is an object that you can loop over like a list. However, unlike lists, lazy iterators do not store their contents in memory. For an overview of iterators in Python, take a look at Python “for” Loops (Definite Iteration).

Now that you have a rough idea of what a generator does, you might wonder what they look like in action. Let’s take a look at two examples. In the first, you’ll see how generators work from a bird’s eye view. Then, you’ll zoom in and examine each example more thoroughly.

<a class="anchor" id="example_1:_reading_large_files"></a>
### Example 1: Reading Large Files

A common use case of generators is to work with data streams or large files, such as CSV files, which separate data into columns by using commas and are a common way to share data. The same approach works for any line-oriented text file.

Now, what if you want to count the number of rows in a text file? The code block below shows one way of counting those rows:

In [1]:
from tqdm import tqdm

In [2]:
def text_file_reader(file_path):
    return open(file_path)

In [3]:
file_path = "../files/some_large_file.txt"
with open(file_path, "w") as f:
    # 100 million lines, so that the file is genuinely large
    for i in range(100_000_000):
        f.write(f"this is line {i+1}\n")

In [4]:
text_gen = text_file_reader(file_path)
row_count = 0

for row in tqdm(text_gen):
    row_count += 1

print(f"Row count is {row_count}")

100000000it [00:24, 4065950.25it/s]

Row count is 100000000





Looking at this example, you might expect `text_gen` to be a list. To populate this list, `open()` opens a file and loads its contents into `text_gen`. Then, the program iterates over the list and increments `row_count` for each row.

This is a reasonable explanation, but would this design still work if the file is very large? What if the file is larger than the memory you have available? To answer this question, let’s assume that `text_file_reader()` just opens the file and reads it into a list:

In [5]:
def text_file_reader(file_path):
    # Read the whole file into memory at once, then split it into lines
    with open(file_path) as file:
        result = file.read().split("\n")
    return result

This function opens a given file and uses `file.read()` along with `.split()` to add each line as a separate element to a list. If you were to use this version of `text_file_reader()` in the row-counting code block you saw further up, then a large enough file could exhaust your memory and raise a `MemoryError`:

In [6]:
lines = text_file_reader(file_path)

In the earlier version, `open()` returns a file object, which is a lazy iterator that you can step through line by line. However, `file.read().split()` loads everything into memory at once, which is what can cause the `MemoryError`.
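As a quick check of how `open()` behaves, the sketch below writes a tiny throwaway file and reads it back one line at a time. The file contents and the names `demo_path`, `first_line`, and `remaining` are made up for this illustration.

```python
import os
import tempfile

# Hypothetical three-line file, created only for this demonstration
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    demo_path = tmp.name

with open(demo_path) as f:
    is_own_iterator = iter(f) is f  # a file object is its own iterator
    first_line = next(f)            # reads just one line
    remaining = list(f)             # continues from where next() stopped

os.remove(demo_path)

print(is_own_iterator)  # True
print(repr(first_line))  # 'first\n'
print(remaining)  # ['second\n', 'third\n']
```

Only the lines you actually request are read, which is why iterating over the file object scales to files that would not fit in memory.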

Before that happens, you’ll probably notice your computer slow to a crawl. You might even need to kill the program with a `KeyboardInterrupt`. So, how can you handle these huge data files? Take a look at a new definition of `text_file_reader()`:

In [39]:
def text_file_reader(file_path):
    # The with block closes the file once iteration finishes
    with open(file_path, "r") as file:
        for row in file:
            yield row

In this version, you `open` the file, iterate through it, and `yield` a row. This code should produce the following output, with no memory errors:

In [40]:
text_file_reader(file_path)

<generator object text_file_reader at 0x7f3d4ad86bd0>

What’s happening here? Well, you’ve essentially turned `text_file_reader()` into a generator function. This version opens a file, loops through each line, and yields each row, instead of returning it.
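To see the generator function do real work, here is a minimal sketch that counts rows in a small temporary file. The 1,000-line file and the name `demo_path` are assumptions chosen to keep the example quick; the same loop would work on a much larger file.

```python
import os
import tempfile

def text_file_reader(file_path):
    # Yield one line at a time instead of loading every line at once
    with open(file_path) as f:
        for row in f:
            yield row

# Hypothetical small file standing in for a large one
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(1000):
        tmp.write(f"this is line {i + 1}\n")
    demo_path = tmp.name

# Only one row is held in memory at any moment while counting
row_count = sum(1 for _ in text_file_reader(demo_path))
os.remove(demo_path)

print(f"Row count is {row_count}")  # Row count is 1000
```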

You can also define a generator expression (also called a generator comprehension), which has a very similar syntax to list comprehensions. In this way, you can use the generator without calling a function:

In [41]:
text_gen = (row for row in open(file_path))

In [42]:
text_gen

<generator object <genexpr> at 0x7f3d4ad86150>

This is a more succinct way to create the generator `text_gen`. You’ll learn more about the Python `yield` statement soon. For now, just remember this key difference:

- Using `yield` will result in a generator object.
- Using `return` in its place will hand back only the first line of the file and end the function.
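The difference can be sketched side by side. In this example, `io.StringIO` stands in for an open text file, and the function names `first_line_only()` and `every_line()` are hypothetical.

```python
import io

def first_line_only(lines):
    for row in lines:
        return row  # exits the function on the very first iteration

def every_line(lines):
    for row in lines:
        yield row  # pauses here and resumes on the next request

text = "line 1\nline 2\nline 3\n"

returned = first_line_only(io.StringIO(text))
yielded = list(every_line(io.StringIO(text)))

print(repr(returned))  # 'line 1\n'
print(yielded)  # ['line 1\n', 'line 2\n', 'line 3\n']
```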

<a class="anchor" id="example_2:_generating_an_infinite_sequence"></a>
### Example 2: Generating an Infinite Sequence

Let’s switch gears and look at infinite sequence generation. In Python, to get a finite sequence, you call `range()` and evaluate it in a list context:

In [45]:
a = range(5)
list(a)

[0, 1, 2, 3, 4]

In [9]:
# zip() also returns a lazy iterator rather than a list
gen = zip([1, 2, 3], [4, 5, 6])

In [81]:
def my_gen(mylist):
    # Yield the square of each item, one at a time
    for item in mylist:
        yield item**2

In [82]:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gen = my_gen(x)

In [83]:
next(gen)

1

Generating an infinite sequence, however, will require the use of a generator, since your computer memory is finite:

In [133]:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

This code block is short and sweet. First, you initialize the variable `num` and start an infinite loop. Then, you immediately `yield num` so that you can capture the initial state. This mimics the action of `range()`.

After `yield`, you increment `num` by 1. If you try this with a for loop, then you’ll see that it really does seem infinite:

In [1]:
for i in infinite_sequence():
    print(i, end=" ")

The program will continue to execute until you stop it manually.

Instead of using a for loop, you can also call `next()` on the generator object directly. This is especially useful for testing a generator in the console:

In [48]:
gen = infinite_sequence()
next(gen)

0

In [49]:
next(gen)

1

In [50]:
next(gen)

2

Here, you have a generator called `gen`, which you manually iterate over by repeatedly calling `next()`. This works as a great sanity check to make sure your generators are producing the output you expect.
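If you want more than a couple of values without looping forever, one option is `itertools.islice()` from the standard library. It is not part of the example above, so treat this as a sketch of one possible approach:

```python
from itertools import islice

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

# Take only the first five values; the generator is never run to completion
first_five = list(islice(infinite_sequence(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```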

**Note:** When you use `next()`, Python calls `.__next__()` on the iterator you pass in as an argument. `next()` also accepts an optional second argument that changes what happens when the iterator runs out, but that goes beyond the scope of this article. Experiment with changing the arguments you pass to `next()` and see what happens!
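One documented behavior worth trying is the optional second argument of the built-in `next()`: a default value that is returned instead of raising `StopIteration` once the iterator is exhausted. The two-item list and the `"done"` sentinel below are arbitrary choices for illustration.

```python
gen = iter([10, 20])

a = next(gen, "done")  # 10
b = next(gen, "done")  # 20
c = next(gen, "done")  # "done", because nothing is left to yield

print(a, b, c)  # 10 20 done
```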

<a class="anchor" id="understanding_generators"></a>
## Understanding Generators

So far, you’ve learned about the two primary ways of creating generators: by using generator functions and generator expressions. You might even have an intuitive understanding of how generators work. Let’s take a moment to make that knowledge a little more explicit.

Generator functions look and act just like regular functions, but with one defining characteristic. Generator functions use the Python `yield` keyword instead of `return`. Recall the generator function you wrote earlier:

In [1]:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

This looks like a typical function definition, except for the Python `yield` statement and the code that follows it. `yield` indicates where a value is sent back to the caller, but unlike `return`, you don’t exit the function afterward.

Instead, the **state** of the function is remembered. That way, when `next()` is called on a generator object (either explicitly or implicitly within a `for` loop), the previously yielded variable `num` is incremented, and then yielded again. Since generator functions look like other functions and act very similarly to them, you can assume that generator expressions are very similar to other comprehensions available in Python.
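You can peek at this remembered state with the standard library `inspect` module. This is a small sketch for exploration rather than something you would normally need in production code:

```python
import inspect

def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

gen = infinite_sequence()
state_before = inspect.getgeneratorstate(gen)  # no code has run yet

next(gen)  # yields 0
next(gen)  # increments num, then yields 1

state_after = inspect.getgeneratorstate(gen)
saved_locals = inspect.getgeneratorlocals(gen)

print(state_before)  # GEN_CREATED
print(state_after)  # GEN_SUSPENDED
print(saved_locals)  # {'num': 1}
```

Between calls, the generator sits suspended at `yield` with `num` still bound, which is exactly the state that the next call picks up from.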

<a class="anchor" id="building_generators_with_generator_expressions"></a>
## Building Generators With Generator Expressions

Like list comprehensions, generator expressions allow you to quickly create a generator object in just a few lines of code. They’re also useful in the same cases where list comprehensions are used, with an added benefit: you can create them without building and holding the entire object in memory before iteration. In other words, you’ll have no memory penalty when you use generator expressions. Take this example of squaring some numbers:

In [3]:
nums_squared_lc = [num**2 for num in range(5)]
nums_squared_gc = (num**2 for num in range(5))

Both `nums_squared_lc` and `nums_squared_gc` look basically the same, but there’s one key difference. Can you spot it? Take a look at what happens when you inspect each of these objects:

In [4]:
nums_squared_lc

[0, 1, 4, 9, 16]

In [5]:
nums_squared_gc

<generator object <genexpr> at 0x7f9143561f50>

The first object used brackets to build a list, while the second created a generator expression by using parentheses. The output confirms that you’ve created a generator object and that it is distinct from a list.
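A generator expression produces its values only when something consumes it. The sketch below passes one to `sum()` and then shows that the same generator has nothing left afterward. When a generator expression is the only argument to a function, the extra pair of parentheses can be dropped.

```python
nums_squared_gc = (num**2 for num in range(5))

total = sum(nums_squared_gc)  # consumes 0, 1, 4, 9, 16
leftover = list(nums_squared_gc)  # the generator is now exhausted

# Sole argument to a call: no second pair of parentheses needed
inline_total = sum(num**2 for num in range(5))

print(total)  # 30
print(leftover)  # []
print(inline_total)  # 30
```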

<a class="anchor" id="profiling_generator_performance"></a>
## Profiling Generator Performance

You learned earlier that generators are a great way to optimize memory. While an infinite sequence generator is an extreme example of this optimization, let’s amp up the number squaring examples you just saw and inspect the size of the resulting objects. You can do this with a call to `sys.getsizeof()`:

In [6]:
import sys
nums_squared_lc = [i ** 2 for i in range(10000)]
sys.getsizeof(nums_squared_lc)

87632

In [7]:
nums_squared_gc = (i ** 2 for i in range(10000))
sys.getsizeof(nums_squared_gc)

128

In this case, the list you get from the list comprehension is 87,632 bytes, while the generator object is only 128. This means that the list is more than 680 times larger than the generator object!
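To see how the two scale, the following sketch measures both objects for a few input sizes. The helper names `squares_list()` and `squares_gen()` are hypothetical, and the exact byte counts depend on your Python version, so none are shown here. Note that `sys.getsizeof()` reports only the size of the container itself, not the integers it refers to.

```python
import sys

def squares_list(n):
    return [i ** 2 for i in range(n)]

def squares_gen(n):
    return (i ** 2 for i in range(n))

list_sizes = []
gen_sizes = []
for n in (10, 1_000, 100_000):
    list_sizes.append(sys.getsizeof(squares_list(n)))
    gen_sizes.append(sys.getsizeof(squares_gen(n)))
    print(n, list_sizes[-1], gen_sizes[-1])
```

The list grows with `n`, while the generator object should stay the same size no matter how many values it will eventually produce.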

There is one thing to keep in mind, though. If the list is smaller than the running machine’s available memory, then list comprehensions can be faster to evaluate than the equivalent generator expression because of the overhead of function calls. To explore this, let’s sum across the results from the two comprehensions above. You can generate a readout with `cProfile.run()`:

In [8]:
import cProfile
cProfile.run('sum([i * 2 for i in range(10000)])')

         5 function calls in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<listcomp>)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




In [9]:
cProfile.run('sum((i * 2 for i in range(10000)))')

         10005 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10001    0.001    0.000    0.001    0.000 <string>:1(<genexpr>)
        1    0.000    0.000    0.002    0.002 <string>:1(<module>)
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        1    0.001    0.001    0.002    0.002 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




Here, you can see that summing across all values in the list comprehension took about half as long as summing across the generator (0.001 versus 0.002 seconds in this run). If speed is an issue and memory isn’t, then a list comprehension is likely a better tool for the job.
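Profiler output at this scale is coarse, so another way to compare the two is `timeit`, which repeats each statement many times. The repeat count of 200 is an arbitrary assumption, and the timings will differ from machine to machine, so no expected numbers are shown.

```python
import timeit

# Each statement is executed 200 times; the result is the total in seconds
list_time = timeit.timeit("sum([i * 2 for i in range(10000)])", number=200)
gen_time = timeit.timeit("sum(i * 2 for i in range(10000))", number=200)

# Both approaches compute the same value
list_total = sum([i * 2 for i in range(10000)])
gen_total = sum(i * 2 for i in range(10000))

print(f"list comprehension: {list_time:.4f}s")
print(f"generator expression: {gen_time:.4f}s")
print(list_total == gen_total)  # True
```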

Remember, list comprehensions return full lists, while generator expressions return generators. Generators work the same whether they’re built from a function or an expression. Using an expression just allows you to define simple generators in a single line, with an assumed `yield` at the end of each inner iteration.

The Python `yield` statement is certainly the linchpin on which all of the functionality of generators rests, so let’s dive into how `yield` works in Python.

<a class="anchor" id="understanding_the_python_yield_statement"></a>
## Understanding the Python Yield Statement

On the whole, `yield` is a fairly simple statement. Its primary job is to control the flow of a generator function in a way that’s similar to `return` statements. As briefly mentioned above, though, the Python `yield` statement has a few tricks up its sleeve.

When you call a generator function or use a generator expression, you get back a special iterator called a generator. You can assign this generator to a variable in order to use it. When you advance the generator, for example by passing it to `next()`, the code within the function is executed up to `yield`.

When the Python `yield` statement is hit, the program suspends function execution and returns the yielded value to the caller. (In contrast, `return` stops function execution completely.) When a function is suspended, the state of that function is saved. This includes any variable bindings local to the generator, the instruction pointer, the internal stack, and any exception handling.

This allows you to resume function execution whenever you call one of the generator’s methods. In this way, all function evaluation picks back up right after `yield`. You can see this in action by using multiple Python yield statements:

In [14]:
def multi_yield():
    yield_str = "This will print the first string"
    yield yield_str
    yield_str = "This will print the second string"
    yield yield_str

In [15]:
multi_obj = multi_yield()
next(multi_obj)

'This will print the first string'

In [16]:
next(multi_obj)

'This will print the second string'

In [17]:
next(multi_obj)

StopIteration: 

Take a closer look at that last call to `next()`. You can see that execution has blown up with a traceback. This is because generators, like all iterators, can be exhausted. Unless your generator is infinite, you can iterate through it one time only. Once all values have been evaluated, iteration will stop and the for loop will exit. If you used `next()`, then instead you’ll get an explicit `StopIteration` exception.
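The two behaviors can be compared directly. This sketch uses a shortened, hypothetical version of `multi_yield()` and catches the exception so that the script keeps running:

```python
def multi_yield():
    yield "first"
    yield "second"

gen = multi_yield()

# A for loop (or comprehension) handles the end of iteration quietly
seen = [value for value in gen]

# Calling next() on the exhausted generator raises StopIteration
try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True

# Calling the generator function again creates a fresh generator
fresh = list(multi_yield())

print(seen)  # ['first', 'second']
print(exhausted)  # True
print(fresh)  # ['first', 'second']
```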

`yield` can be used in many ways to control your generator’s execution flow. The use of multiple Python `yield` statements can be leveraged as far as your creativity allows.