Session 7
===

Today we'll dig deeper into Python's iteration protocol, which offers a very powerful mechanism for traversing collections of data. In general, there are lots and lots of cases where we need code that does calculations on collections of data, and this is especially true when we're developing code for data analysis.

In our analytical work, it's extremely common for us to encounter things like
 * Tabular data with many rows and columns
 * Numerous files, each one containing the results of a single analysis
 * Streams of data, either arriving as text or from instrumentation
In all these cases, we need to *iterate* over the data sets.


What we've seen so far...
---
Up to now, we've taken a very simplistic approach to iteration, using simple looping structures. Today, we'll review those and then talk about more-advanced mechanisms that Python uses for iteration. This yields some powerful constructs that can make your code faster, smaller, less error-prone, and -- most importantly -- more expressive and easier to read.

A recurring theme
---
A recurring theme we have seen is making a new collection based on data in an input collection.

In [None]:
# Here's some input data
input_list = [-9, -4, 8, 0, 22, -1, 7, -4, 9, 0,]
label_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
world_series_winners = {
              2014: 'Giants',
              2015: 'Royals',
              2002: 'Angels',
              2016: 'Cubs',
              2009: 'Yankees',
              2017: 'Astros',
              2011: 'Cardinals',
              2018: 'Red Sox',
              2019: 'Nationals',
              1988: 'Dodgers',
              2001: 'Diamondbacks',
              }

In [None]:
# Make a list of the square of the values in `input_list`
output_list = []
for i in input_list:
    output_list.append(i * i)

print('The squares are', output_list)

In [None]:
# We have also traversed dictionaries by their keys
for year in world_series_winners:
    print(year, world_series_winners[year])

In [None]:
# It's also possible to loop over dictionary values
for team in world_series_winners.values():
    print(team)

In [None]:
# We also have seen looping over key-value pairs using items()
for year, team in world_series_winners.items():
    print(year, team)

List Comprehensions
---
Let's look at these examples using a new type of expression, a *list comprehension*.

In [None]:
squares = [i * i for i in input_list]

In [None]:
sum_of_squares = sum([i * i for i in input_list])

The `zip` function
---

`zip` weaves two (or more) sequences into a single sequence of tuples, pairing up the entries position by position.


In [None]:
for tu in zip(label_list, input_list):
    print(tu)
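
Two details about `zip` are worth a quick sketch: in Python 3 it returns a lazy iterator rather than a list, and it stops as soon as the shortest input runs out. The short lists here are made up for illustration.

```python
# Hypothetical sample data, just for illustration
labels = ['A', 'B', 'C']
values = [10, 20, 30, 40]   # one more entry than labels

# zip() is lazy, so wrap it in list() to see the pairs
pairs = list(zip(labels, values))
print(pairs)   # the extra 40 has no partner, so it is dropped

# Tuple unpacking in the for statement avoids indexing into each pair
for label, value in zip(labels, values):
    print(label, value)
```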

Dictionary comprehension
---
A dictionary comprehension builds a dictionary from a sequence of key-value pairs.

In [None]:
my_dict = { k: v for k, v in zip(label_list, input_list) }
my_dict
    

Another recurring theme
---
Another pattern we've seen a lot so far is accumulating values from an iterable sequence.

In [None]:
# Calculate the sum of squares
sum_of_squares = 0
for i in input_list:
    sum_of_squares += i * i
print('The sum of squares is', sum_of_squares)

Where we're heading: A generic approach to looping over data
---
Python's iterator protocol provides a very generic approach to looping over data. One interesting feature is that the approach lets us traverse large data sets efficiently, even occasionally allowing us to do so without ever loading the entire dataset into memory!
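
To make the protocol concrete, here's a rough sketch of what a `for` loop does behind the scenes, using the built-in `iter()` and `next()` functions. The list `colors` is made-up sample data.

```python
colors = ['red', 'green', 'blue']   # hypothetical sample data

# Roughly what `for color in colors:` does for us
iterator = iter(colors)          # ask the collection for an iterator
seen = []
while True:
    try:
        color = next(iterator)   # fetch the next value
    except StopIteration:        # raised when the values run out
        break
    seen.append(color)

print(seen)
```

Anything that can hand back an iterator in this way (lists, dictionaries, files, and so on) can be used in a `for` loop.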

Iteration archetype: the `range` function
---
Up to now, all our looping structures have traversed collections that are already populated with data. What if we want to loop over a sequence of integers from 0 to 9? We do this with the `range` function. 

In [None]:
# range() follows the same convention as slicing text or lists:
# it starts at 0 by default and excludes the stop value
for i in range(10):
    print(i)

In [None]:
# We can also make a range from one value to another
for i in range(-5, 5):
    print(i)

In [None]:
# We can also skip entries...
for i in range(-5, 5, 2):
    print(i)

But let's take a closer look. Let's create a variable and assign a range of entries to it. Note: this code may not do what you expect!

In [None]:
# Make a range from 0 to 3
my_range = range(4)
print(my_range)

You might have expected to see a list there! It's not. It's a `range` object...

In [None]:
type(my_range)

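A `range` object is *lazy*: it computes each value on demand instead of storing them all. In this respect it behaves much like a *generator*, a special kind of object that supports iteration and that we'll build ourselves shortly. (Strictly speaking, a `range` is not a generator, because it also supports indexing and can be traversed repeatedly, but the core idea is the same.) Each time we go through a `for` loop, the next value in the sequence is produced.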

In [None]:
for i in my_range:
    print(i)
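
A few quick experiments show what a `range` object can do without ever building a list. This is a sketch using only built-ins; `small_range` is just an illustrative name.

```python
small_range = range(4)

print(len(small_range))      # ranges know their length
print(small_range[2])        # ...and support indexing
print(2 in small_range)      # ...and membership tests

# A range can be traversed as many times as we like
first_pass = list(small_range)
second_pass = list(small_range)
print(first_pass, second_pass)
```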

How it works, and how we can take advantage
---
What's special is that the `range` object never actually creates a list of the values in the range. So a range over ten million integers takes the same amount of memory as a range of four integers. This is a powerful idea, and Python lets us write our own generators using the `yield` keyword. Any function that contains `yield` is automatically a generator function. Each `yield` hands a value to the `for` loop and pauses the function; when the loop asks for the next value, execution resumes at the statement after the `yield`.
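
We can check the memory claim with `sys.getsizeof`, which reports the size of an object itself. This sketch assumes CPython, where a `range` object stores only a few numbers (such as its start, stop, and step); exact byte counts vary by version, so none are shown here.

```python
import sys

small = range(4)
huge = range(10_000_000)

# In CPython both range objects report the same size
print(sys.getsizeof(small), sys.getsizeof(huge))

# A fully populated list grows with the number of entries
print(sys.getsizeof(list(range(4))), sys.getsizeof(list(range(100_000))))
```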

Let's use the `while` loop to make our own `range` function:

In [None]:
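def range_generator(end_value):
    '''
    A simple version of range(), written as our own generator
    '''
    current_value = 0
    # Test before each yield so that range_generator(0) yields nothing
    # (testing after the increment would never stop for end_value <= 0)
    while current_value < end_value:
        yield current_value
        current_value += 1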

In [None]:
for i in range_generator(4):
    print(i)
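
To watch a generator pause and resume, we can step through one by hand with `next()`. This is a small sketch; `two_steps` and the `log` list are hypothetical names used only to record the order of events.

```python
log = []   # records the order in which things happen

def two_steps():
    '''Hypothetical generator that notes each time it runs'''
    log.append('started')
    yield 1
    log.append('resumed after first yield')
    yield 2
    log.append('finished')

gen = two_steps()        # nothing inside the function runs yet
log.append('created')
first = next(gen)        # runs up to the first yield
second = next(gen)       # resumes, then runs up to the second yield
leftover = list(gen)     # runs to the end; there are no more values

print(log)
```

Notice that `'created'` is logged before `'started'`: calling a generator function only builds the generator, and its body doesn't begin until the first value is requested.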

Isn't that neat?
---
I'll accept that you may not find that very interesting... Let's create a more useful generator and have a look at Moby-Dick. We're now going to write a generator that processes the file, yielding up each of the words in the book, in sequence.

In [None]:
# Process "Moby-Dick"... removing the Chapter and Epilogue headings along the way...
# also force lower case and remove punctuation

with open('moby-dick.txt', 'r', encoding='UTF-8') as input_file:
    clean_text = []
    for line in input_file:
        line = ( line.strip()
                     .lower()
                     .replace('.', '')
                     .replace('-', ' ')
                     .replace(',', '')
                     .replace("'", '')
                     .replace('"', '')
                     .replace('“', '')
                     .replace('_', '')
                     .replace('?', '')
                     .replace('!', '')
                     .replace(':', '')
                     .replace(';', '')
               )
        
        if line.startswith('chapter') or line.startswith('epilogue'):
            # 'continue' finishes this pass through the loop
            continue
            
        if len(line) == 0:
            continue
            
        clean_text.append(line)
        
clean_text

Let's convert it into a generator...

In [None]:
# Here's a generator for processing ANY book, one word at a time...

def words(book):
    '''
    Generator that yields each of the words in the book provided
    '''
    with open(book, 'r', encoding='UTF-8') as input_file:
        for line in input_file:
            line = ( line.strip()
                         .lower()
                         .replace('.', '')
                         .replace('-', ' ')
                         .replace(',', '')
                         .replace("'", '')
                         .replace('"', '')
                         .replace('“', '')
                         .replace('_', '')
                         .replace('?', '')
                         .replace('!', '')
                         .replace(':', '')
                         .replace(';', '')
                         .replace('&', '')
                   )

            if line.startswith('chapter') or line.startswith('epilogue'):
                # 'continue' finishes this pass through the loop
                continue

            if len(line) == 0:
                continue

            for word in line.split():
                yield word

In [None]:
for word in words('moby-dick.txt'):
    print(word)
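
Because a generator produces values on demand, we can stop early without processing the rest of the input. Here's a sketch using `itertools.islice`; to keep it self-contained it reads from an in-memory `io.StringIO` instead of the book file, and `simple_words` is a simplified, hypothetical stand-in for `words`.

```python
import io
from itertools import islice

def simple_words(lines):
    '''Hypothetical, simplified generator: yields lower-case words'''
    for line in lines:
        for word in line.lower().split():
            yield word

# Stand-in for an open file; any iterable of lines would work
fake_file = io.StringIO('Call me Ishmael\nSome years ago\nnever mind how long\n')

# Take only the first four words; the generator stops there
first_four = list(islice(simple_words(fake_file), 4))
print(first_four)
```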

Printing every word isn't very interesting by itself... But now let's look at how we can use the generator to do the analyses from Homework 2.

In [None]:
# Count all the words... we'll do this in a better way shortly
count = 0
for word in words('moby-dick.txt'):
    count += 1
print(count)

In [None]:
# Count the `the` entries
the = 0
for word in words('moby-dick.txt'):
    if word == 'the':
        the += 1
print(the)

In [None]:
# Number of unique entries
uniques = set(words('moby-dick.txt'))
len(uniques)

In [None]:
# The longest word
longest = ''
for word in words('moby-dick.txt'):
    if len(word) > len(longest):
        longest = word
print(longest)
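
Some of these loops can be replaced by built-ins that accept any iterable, generators included. As a sketch, `max` with `key=len` finds the longest word, and `collections.Counter` tallies every word in one pass. The `sample_words` generator is a hypothetical stand-in for `words('moby-dick.txt')`.

```python
from collections import Counter

def sample_words():
    '''Hypothetical stand-in for words('moby-dick.txt')'''
    for word in ['the', 'whale', 'and', 'the', 'harpooneer', 'and', 'the', 'sea']:
        yield word

# Longest word: max() compares the words by their length
longest_word = max(sample_words(), key=len)
print(longest_word)

# Tally every word in a single pass over the generator
tally = Counter(sample_words())
print(tally['the'], len(tally))
```

Like the loop above, `max` keeps the first word it finds when several share the maximum length.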

Let's re-load the 2017 weather data set and have a look at it using *list comprehensions*.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
from datetime import datetime

def scan_precip(s):
    '''
    Scans a textual precipitation entry, interpreting a trace ("T") of rain
    as 0.005 inches 
    '''
    if s == 'T':
        return 0.005
    return float(s)


# Initialize the output dictionaries

low_t = {}
avg_t = {}
high_t = {}
precip = {}

with open('2017-weather.csv', 'r', encoding='UTF-8') as input_file:
    for record in csv.DictReader(input_file):
        ts = datetime.strptime(record['Date'], '%m/%d/%y')
        key = ts.strftime('%b')
        if key not in low_t:
            low_t[key] = []
            avg_t[key] = []
            high_t[key] = []
            precip[key] = []
        low_t[key].append(float(record['LowT']))
        avg_t[key].append(float(record['AvgT']))
        high_t[key].append(float(record['HighT']))
        precip[key].append(scan_precip(record['Precip']))
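
The `if key not in ...` bookkeeping can be avoided with `collections.defaultdict`, which creates an empty list the first time a key is used. This is an optional sketch, not the approach used above, and the records are made-up values rather than the real weather file.

```python
from collections import defaultdict

# Made-up (month, low temperature) records for illustration
records = [('Jan', 28.0), ('Jan', 31.0), ('Feb', 35.0), ('Jan', 25.0)]

low_by_month = defaultdict(list)   # missing keys start out as []
for month, low in records:
    low_by_month[month].append(low)

print(dict(low_by_month))
```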
        

In [None]:
# Use a loop to make a list of month names
month_names = []
for month in avg_t:
    month_names.append(month)
print(month_names)

In [None]:
# Here's a list comprehension version
month_names = [ month for month in avg_t ]
print(month_names)

In [None]:
# Here's a list of monthly low temperatures
lows = [ min(low_t[month]) for month in low_t ]
print(lows)

In [None]:
# That's a little shorter using the values() method
lows = [ min(temps) for temps in low_t.values() ]
print(lows)

In [None]:
# We can also write a dictionary comprehension using the items() method
low_dict = { month: min(temps) for month, temps in low_t.items() }
print(low_dict)

In [None]:
# We can use a filtering expression... Which months never fell below freezing?
no_freeze = [ month for month in low_t if min(low_t[month]) > 32 ]
print(no_freeze)

Generator expressions
---
A generator expression looks like a list comprehension, but it never populates a list. It just yields up results one at a time, as they're requested.

Let's go back to Moby-Dick...

In [None]:
# Count the words (parentheses mark a generator expression)
sum( ( 1 for w in words('moby-dick.txt') ) )

In [None]:
# Shorthand way in a function call
sum(1 for w in words('moby-dick.txt'))

In [None]:
# Count the 'the's
sum(1 for w in words('moby-dick.txt') if w == 'the')
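
Two properties of generator expressions are worth seeing in action: they produce values only as they're consumed, and they can be consumed only once. Here's a short sketch, with `input_list` re-declared so that it runs on its own.

```python
input_list = [-9, -4, 8, 0, 22, -1, 7, -4, 9, 0]

squares_list = [i * i for i in input_list]   # builds the whole list now
squares_gen = (i * i for i in input_list)    # builds nothing yet

first_total = sum(squares_gen)    # consumes the generator
second_total = sum(squares_gen)   # already exhausted, so nothing to add

print(first_total, second_total)
print(sum(squares_list), sum(squares_list))   # a list can be reused
```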

Sorting sequences
---

Let's look at some sorting applications and the `sorted` function.
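
As a first taste, here's a minimal sketch of `sorted`, which accepts any iterable and returns a new list, leaving the input untouched. The optional `key` and `reverse` arguments control the ordering. `input_list` is re-declared so the cell runs on its own.

```python
input_list = [-9, -4, 8, 0, 22, -1, 7, -4, 9, 0]

ascending = sorted(input_list)
descending = sorted(input_list, reverse=True)
by_magnitude = sorted(input_list, key=abs)   # order by absolute value

print(ascending)
print(descending)
print(by_magnitude)
```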