Item 31: Be Defensive When Iterating Over Arguments

Things to Remember
- Beware of functions and methods that iterate over input arguments multiple times. If these arguments are iterators, you may see strange behavior and missing values.
- Python's iterator protocol defines how containers and iterators interact with the iter and next built-in functions, for loops, and related expressions.
- You can easily define your own iterable container type by implementing the __iter__ method as a generator.
- You can detect that a value is an iterator (instead of a container) if calling iter on it produces the same value as what you passed in. Alternatively, you can use the isinstance built-in function along with the collections.abc.Iterator class.  


In [None]:
# determine each city's contribution to the total visits
def normalize(numbers):
    total = sum(numbers) # this will exhaust an iterator
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

visits = [15, 35, 80]
percentages = normalize(visits)
print(percentages)
assert sum(percentages) == 100.0

In [None]:
# scale up to larger datasets by reading the visits from a file with a generator
def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)

it = read_visits('numbers.txt')
percentages = normalize(it)
print(percentages) # [] 

Why does it return an empty list?
- an iterator produces its results only once
- if you iterate over an iterator or a generator that has already raised a StopIteration exception, you won't get any results the second time around
- you also won't get an error when you iterate over an already exhausted iterator
- for loops, the list constructor, and many other functions throughout the Python standard library expect the StopIteration exception to be raised during normal operation; they can't tell the difference between an iterator that has no output and an iterator that had output and is now exhausted.
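
The points above can be checked without a data file. The following is a minimal sketch; `count_up_to` is a hypothetical in-memory generator used only for illustration, standing in for a file-backed one.

```python
# Hypothetical stand-in for a generator that reads from a file.
def count_up_to(limit):
    for i in range(1, limit + 1):
        yield i

it = count_up_to(3)
first_pass = list(it)   # consumes every value
second_pass = list(it)  # already exhausted: empty result, no error

# Calling next() directly exposes the StopIteration that list() and
# for loops treat as the normal end of iteration.
try:
    next(it)
    raised = False
except StopIteration:
    raised = True

print(first_pass, second_pass, raised)  # [1, 2, 3] [] True
```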

In [None]:
it = read_visits('numbers.txt')
print(list(it))
print(list(it)) # already exhausted

In [None]:
# solution
# - exhaust an input iterator and keep a copy of its entire contents
#   in a list
def normalize_copy(numbers):
    numbers_copy = list(numbers) # copy the iterator's contents into a list
    total = sum(numbers_copy)
    result = []
    for value in numbers_copy:
        percent = 100 * value / total
        result.append(percent)
    return result

it = read_visits('numbers.txt')
percentages = normalize_copy(it)
assert sum(percentages) == 100.0


Problem with the above approach
- the copy of the input iterator's contents could be extremely large
- copying the iterator could therefore cause the program to run out of memory and crash
- this defeats the purpose of defining a generator in the first place

In [None]:
# solution 
# - accept a function that returns a new iterator each time it's called

def normalize_func(get_iter):
    total = sum(get_iter()) # new iterator
    result = []
    for value in get_iter(): # new iterator
        percent = 100 * value / total
        result.append(percent)
    return result

path = 'numbers.txt'
percentages = normalize_func(lambda: read_visits(path)) # the caller must pass a function, here a lambda
print(percentages)
assert sum(percentages) == 100.0


The iterator protocol
- the iterator protocol is how Python for loops and related expressions traverse the contents of a container type
- when Python sees a statement like for x in foo, it calls iter(foo), and iter in turn calls foo.__iter__()
- the __iter__ method must return an iterator object, which itself implements the __next__ method
- the for loop repeatedly calls the next built-in function on the iterator object until it's exhausted  
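
As a rough illustration of the steps above, a for loop can be written out by hand with iter and next. This is a sketch of the idea, not the interpreter's actual implementation; `sample_visits` and `seen` are names chosen for this example.

```python
sample_visits = [15, 35, 80]

# Roughly what `for value in sample_visits: seen.append(value)` does.
seen = []
iterator = iter(sample_visits)      # calls sample_visits.__iter__()
while True:
    try:
        value = next(iterator)      # calls iterator.__next__()
    except StopIteration:
        break                       # exhaustion ends the loop silently
    seen.append(value)

print(seen)  # [15, 35, 80]
```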

In [None]:
# define a container class that implements the iterator protocol
class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path
    def __iter__(self):
        with open(self.data_path) as f:
            for line in f:
                yield int(line)

visits = ReadVisits(path)
percentages = normalize(visits)
print(percentages)
assert sum(percentages) == 100.0


How the above approach works
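- the sum built-in function and the for loop in the normalize function each call ReadVisits.__iter__ and receive a different iterator object
- each iterator is advanced and exhausted independently, ensuring that each iteration sees all of the input data values
- the only downside is that the input data is read multiple times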
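
One way to see this is to count how often iteration starts. The sketch below uses `CountingVisits`, a hypothetical in-memory stand-in for ReadVisits, so it runs without a data file; the two passes mirror what normalize does.

```python
class CountingVisits:
    """Hypothetical container that counts how many iterations were started."""
    def __init__(self, values):
        self.values = values
        self.iter_starts = 0

    def __iter__(self):
        # The body of a generator runs on the first next() call,
        # so this counts iterations that actually began.
        self.iter_starts += 1
        for value in self.values:
            yield value

counting = CountingVisits([15, 35, 80])
total = sum(counting)                                # first iterator
shares = [100 * value / total for value in counting] # second iterator

print(counting.iter_starts)  # 2
print(len(shares))           # 3
```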


More on the iterator protocol
- when an iterator is passed to the iter built-in function, iter returns the iterator itself
- when a container type is passed to iter, a new iterator object is returned each time    
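
Both behaviors can be observed directly with identity checks. This is a small sketch using a list and a generator expression; the variable names are chosen only for this example.

```python
container = [15, 35, 80]
list_iterator = iter(container)
generator = (value for value in container)

# Iterators and generators return themselves from iter().
print(iter(list_iterator) is list_iterator)  # True
print(iter(generator) is generator)          # True

# A container hands out a new iterator object on every call.
first = iter(container)
second = iter(container)
print(first is second)                       # False
print(iter(container) is container)          # False
```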

In [None]:
# a defensive approach
# - reject arguments that can't be repeatedly iterated over

def normalize_defensive(numbers):
    # iter returns an iterator unchanged, so identity means numbers is an iterator
    if iter(numbers) is numbers:
        raise TypeError('Must supply a container')
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


In [None]:
# - alternatively you can use isinstance test and
#   collections.abc.Iterator class to reject arguments
#   that can't be repeatedly iterated over

from collections.abc import Iterator

def normalize_defensive(numbers):
    if isinstance(numbers, Iterator): # it's an iterator
        raise TypeError('Must supply a container')
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

visits = [15, 35, 80] # lists are iterable containers
percentages = normalize_defensive(visits)
assert sum(percentages) == 100.0

visits = ReadVisits(path)
percentages = normalize_defensive(visits)
assert sum(percentages) == 100.0

In [None]:
visits = [15, 35, 80]
it = iter(visits)
percentages = normalize_defensive(it) # raises TypeError: Must supply a container