In [123]:
import numpy as np
import collections
import requests
import itertools
import random
import timeit

# Lab 8: Standard Library and Machine Learning

## Overview

**Wow**. Just. Wow. This is our last lab, so I can't help but be a little sentimental. Just think about how far we've come together, folks. You've learned so much about the Python programming language by this point. You've learned about its philosophy and decisions. Hopefully, you've learned why those decisions make it such a popular, amazing language. You've hopefully also learned about Python's drawbacks. 

You've learned how Python's core philosophy integrates into every feature of the language. We've talked about Python's data structures, functional programming in Python, object-oriented Python, error handling, decorators, iterators, generators, third-party and standard library tools. We've studied machine learning and learned about Python in data science.

And, you put all of that together to triangulate the location of a freaking unicorn.

Just think about that.

Like, that's really advanced stuff.

If you took the `triangulate.py` route, you eliminated noise from data to triangulate latitude/longitude coordinates of an actual stuffed unicorn, hidden on campus. And it's not like we controlled the noise at all: the data we generated was **really** noisy. Every data point was generated with a variance of several kilometers.

If you walked through the Row of Puzzles, you've used a truly dazzling array of Python language features. You wrote a decorator to inspect a strange function, you used `requests` and `numpy` to piece together *audio* files! That's freaking amazing.

Take a pause, breathe, and pat yourself on the back.

In Part 1 of this lab, we'll start with a variation on a problem that we promised you on the first day of class: the Netflix Recommendation Algorithm. Then, we hope you'll take some time to become more familiar with the tools provided by Python's standard library. We want you to gain practice with the most common utilities of the standard library and also to be aware of the rest of the tools in case you ever need them.

## Starter Code
Download the starter code for this lab [here](https://stanfordpython.com/res/starter-code/lab-8.zip). Unzip those files and place them into the same directory as this lab file.

## Netflix Recommendations
A big part of Netflix's business strategy is being able to predict what kind of movies a user will enjoy watching, based on how they rate previous movies they've seen. This is such a big deal to the company that they offered [one million dollars as a prize](https://en.wikipedia.org/wiki/Netflix_Prize) to anyone who could beat their algorithm.

Let's put this problem in formal terms. Suppose you have a matrix, where each row represents a user and each column represents a movie, like this one:
![A matrix with rows representing users, columns representing movies, and the entries representing a user's rating for a specific movie. This matrix contains ratings by Parth, Michael, Joy, and Unicornelius for the movies Harry Potter and the Goblet of Fire, Unicorn Killer, Inside Out, and Frozen 2](https://raw.githubusercontent.com/stanfordpython/python-labs/master/notebooks/lab-8/movie_matrix.png)

One version of the Netflix problem (which is the one that the prize money was for) is to *complete* the matrix. You'll notice that some of the boxes above are marked with `?`. That indicates that we don't have a rating from that user for that movie. Netflix offered a million dollars to any algorithm that could predict the values of those question marks better than their algorithm.

That's a super interesting problem, and if you have some linear algebra background, I'd highly recommend that you read about it. I really like [Carlos Fernandez-Granda's notes on low-rank matrix completion](https://cims.nyu.edu/~cfgranda/pages/OBDA_spring16/material/low_rank_models.pdf).

Today, we're going to implement a different, but related algorithm. We're going to write an algorithm that, **given a new user's preferences about movies, can suggest which movies that user is likely to enjoy**.

Let's start with an observation: In the above matrix, look at the ratings for Inside Out and Frozen 2:
```
              Inside Out   Frozen 2
Parth             5           4
Michael           ?           1
Joy               2           2
Unicornelius      5           5
```

Notice that every user that rated both movies rated them pretty similarly (i.e., the values in the two columns are very close to each other). Based on that, we can conclude that Inside Out is pretty similar to Frozen 2, and if you like one movie, you'll probably like the other. Similarly, if you hate one movie, you'll probably hate the other.

We'll formalize and compute the "closeness" of movies using **cosine similarity**. But first, let's load our data. <br />
*This data comes from CS 124: From Languages to Information*

### `load_data()`
We've stored the data for this problem in two files: `movies.txt`, which has information about each of the movies that we've recorded and `ratings.txt`, which has information about how users rated each of the movies.

In `load_data`, you should open the `ratings.txt` file, and extract the data into a matrix of the form that we depicted above, with users on the `0` axis and movies on the `1` axis. If a user hasn't rated a movie, leave that value as 0 in the matrix.

#### `movies.txt`
The `movies.txt` file looks like the following:
```
0%Toy Story (1995)%Adventure|Animation|Children|Comedy|Fantasy
1%Jumanji (1995)%Adventure|Children|Fantasy
2%Grumpier Old Men (1995)%Comedy|Romance
3%Waiting to Exhale (1995)%Comedy|Drama|Romance
4%Father of the Bride Part II (1995)%Comedy
...
```

Each line is formatted like
```
id%name%categories
```

We're only interested in the `id` and the `name`. You can ignore the third field. **There are a total of 9125 movies, numbered from 0 to 9124.** Note that you may not need to open this file to load the data.

#### `ratings.txt`
The `ratings.txt` file looks like this:
```
0%30%2.500000
...
0%1962%2.500000
0%2380%1.000000
0%2925%3.000000
1%9%4.000000
1%16%5.000000
1%37%5.000000
...
```

Each line has the format:
```
user_id%movie_id%rating
```

That is, user number `user_id` rated movie number `movie_id` as `rating` between 1 and 5. **There are a total of 671 users, numbered 0 to 670**.

In [9]:
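def load_data():
    # Each line of ratings.txt looks like "user_id%movie_id%rating".
    ratings_file = 'ratings.txt'
    with open(ratings_file, 'r') as f:
        ratings = np.array([line.strip().split('%') for line in f])
    # Each line of movies.txt looks like "id%name%categories"; keep the id and name.
    movies_file = 'movies.txt'
    with open(movies_file, 'r') as f:
        movies = np.array([line.split('%')[:2] for line in f])
    num_users = len(set(ratings[:, 0]))
    num_movies = len(set(movies[:, 0]))
    # Users on axis 0, movies on axis 1; unrated entries stay 0.
    matrix = np.zeros((num_users, num_movies))
    matrix[ratings[:, 0].astype(int), ratings[:, 1].astype(int)] = ratings[:, 2].astype(float)
    return matrix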
    
ratings = load_data()

# As a sanity check, we've pre-computed the expected value of this next line:
np.mean(np.sum(ratings, axis=1)) # => 528.1296572280179

528.1296572280179

### `clean_data(ratings)`
Great! We've got our data loaded! Now, let's clean it. For cosine similarity, we need each column to have norm 1. That is, its length, as a 671-dimensional vector (one entry per user), should be 1. Recall that the length of a vector is the square root of the sum of the squares of its entries (this is the Euclidean norm, which generalizes the Pythagorean Theorem). For example, if $x = (x_1, x_2, \dots, x_n)$, then
$$\lVert x \rVert = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

You can compute the norm of a vector using `np.linalg.norm`. That function also supports an `axis` keyword argument, which allows you to compute the norm "along a given axis," to use Michael's terminology. **Be careful:** some movies don't have ratings, so their norm will be 0. To avoid a divide-by-zero issue, leave those columns untouched. It might help to treat their norms as though they're 1, so you don't modify their values when renormalizing.

For example, if we normalize the Inside Out and Frozen 2 data from above, we'll get:
```
              Inside Out   Frozen 2
Parth           0.680       0.589  
Michael         0           0.147  
Joy             0.272       0.294  
Unicornelius    0.680       0.737  
```
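As a rough check of the numbers above, here's a minimal sketch that normalizes the two toy columns with `np.linalg.norm`. The array `toy` is just the Inside Out / Frozen 2 table typed in by hand (with Michael's missing rating stored as 0); it isn't part of the starter code.

```python
import numpy as np

# Rows: Parth, Michael, Joy, Unicornelius. Columns: Inside Out, Frozen 2.
# Michael's missing Inside Out rating is stored as 0.
toy = np.array([
    [5.0, 4.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [5.0, 5.0],
])

# One norm per column (movie): sqrt(54) and sqrt(46).
col_norms = np.linalg.norm(toy, axis=0)

# Broadcasting divides each row element-wise by the vector of column norms.
toy_normalized = toy / col_norms

print(np.round(toy_normalized, 3))
```

The printed values should agree with the table above up to rounding in the last digit. Neither toy column is all zeros, so this sketch skips the divide-by-zero handling that the full data set needs.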

Write a function, `clean_data(ratings)` that will take in the ratings matrix, and return a new matrix, where each column has norm 1.

*Challenge: Try to implement this function without using any loops, using `numpy` broadcasting.*

In [22]:
def clean_data(ratings):
    norm = np.linalg.norm(ratings, axis=0)
    # Treat zero norms as 1 so that unrated movies (all-zero columns) are left untouched.
    # (np.divide with `where=` but no `out=` leaves the masked entries uninitialized.)
    norm[norm == 0] = 1
    # Broadcasting divides every row by the vector of column norms.
    return ratings / norm
    
normalized_ratings = clean_data(ratings)

# As a sanity check, we've pre-computed the expected value of this next line:
np.mean(np.sum(normalized_ratings, axis=1)) # => 32.85379822810301

32.85379822810301

### `suggest_movies(user_ratings, normalized_ratings, n=5)`
Given a user's rating of all of the movies (i.e., a 9125-dimensional vector with entries between 0 and 5, where 0s represent un-rated movies), this function will return the indices of the **top `n` movies** that match with that user. We'll do this using cosine similarity.

First, we'll compute a `movie_profile` for each user, which will be a 671-dimensional vector that combines the ratings we received as input with the ratings from the rest of the users. We do this by scaling each column of the matrix by the user's rating of that movie and then adding together all of the columns. For example, if the user rated Inside Out as a 4, Frozen 2 as a 3, and didn't rate any other movies, their profile would be:

```
              Inside Out      Frozen 2
Parth          0.680 * 4  +  0.589 * 3 = 4.487
Michael        0     * 4  +  0.147 * 3 = 0.441
Joy            0.272 * 4  +  0.294 * 3 = 1.97
Unicornelius   0.680 * 4  +  0.737 * 3 = 4.931
```

Then, we'll normalize that vector by dividing it by its norm. In the above example, the norm of the vector is $6.971$, so the new normalized vector is:
```
              Inside Out      Frozen 2             movie_profile
Parth          0.680 * 4  +  0.589 * 3 = 4.487 ->      0.644 
Michael        0     * 4  +  0.147 * 3 = 0.441 ->      0.063 
Joy            0.272 * 4  +  0.294 * 3 = 1.97  ->      0.283
Unicornelius   0.680 * 4  +  0.737 * 3 = 4.931 ->      0.707 
```

Notice that this vector is the same size as each of the movie vectors (it'll have 671 entries)... That hints towards the significance of the vector: we can think of it as a vector which represents the *perfect movie* for this user.

The cosine similarity between two vectors $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ (which both have norm 1) is defined as their dot product, or the sum of element-wise products of their entries: $x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$. Because none of our entries are negative, this will be a number between 0 and 1, with higher values representing more similar vectors. (For general unit vectors, it can be as low as -1.) You can think of the cosine similarity as an estimation of the "closeness" between the two vectors.

Find the movies that are closest to our `movie_profile`: compute the cosine similarity between the `movie_profile` and each of the columns in our matrix and return the indices of the top `n` movies, in order from most similar to least similar.
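Here's one way the steps above could be sketched on the four-user toy data. The names `toy`, `toy_normalized`, `new_user`, and `suggest_movies_sketch` are made up for this illustration, and the sketch assumes the user has rated at least one movie that somebody else has rated (otherwise the profile's norm is 0).

```python
import numpy as np

# Hypothetical toy data: 4 users (rows) by 2 movies (columns).
toy = np.array([[5.0, 4.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
toy_normalized = toy / np.linalg.norm(toy, axis=0)  # every column has norm 1

def suggest_movies_sketch(user_ratings, normalized_ratings, n=5):
    # Weighted sum of the movie columns: one entry per existing user.
    movie_profile = normalized_ratings @ user_ratings
    # Scale the profile to norm 1 so the dot products below are cosines.
    movie_profile = movie_profile / np.linalg.norm(movie_profile)
    # Cosine similarity between the profile and every movie column.
    similarities = movie_profile @ normalized_ratings
    # argsort is ascending, so reverse it and keep the first n indices.
    return np.argsort(similarities)[::-1][:n]

new_user = np.array([4.0, 3.0])  # Inside Out: 4, Frozen 2: 3
print(suggest_movies_sketch(new_user, toy_normalized, n=2))
```

On this toy input, Inside Out (index 0) comes out ahead of Frozen 2 (index 1), since it carries more weight in the profile. The two matrix products stand in for the loops, which is one possible answer to the challenge below.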

*Challenge: Try to implement this function without using any loops, using `numpy` broadcasting.*

In [23]:
ratings[:10, :10]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 4.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 4.],
       [0., 0., 4., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [3., 0., 0., 0., 0., 0., 0., 0., 0., 3.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [4., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [None]:
def suggest_movies(user_ratings, normalized_ratings, n=5):
    pass

parth_ratings = {
    8911: 5, # Inside Out
    8460: 4, # Frozen 1
    6294: 5  # Harry Potter and the Goblet of Fire
}

full_parth_ratings = np.array([parth_ratings.get(i, 0) for i in range(ratings.shape[1])])

# As a sanity check, we've computed the value we got for the next line:
suggest_movies(full_parth_ratings, normalized_ratings) # => array([8911, 6294, 8460, 5399, 8434])

For my ratings, that corresponds to:
```
Inside Out
Harry Potter and the Goblet of Fire
Frozen 1
Harry Potter and the Prisoner of Azkaban
Ender's Game
```

Not bad! I haven't seen Ender's Game, so I guess that's on my list.

Take a look at [movies.txt](movies.txt) and add in your own ratings!

### So how does this work?
Here's a fairly math-heavy explanation of how this is working. We're taking each movie and mapping it into 671-dimensional space, where each axis represents a different user's rating of that movie. We're assuming that each of those axes is orthogonal to the others (which, in reality, might not be a good assumption).

Then, based on the inputted preferences, we're creating a new vector as a linear combination of the movie vectors that the new user has rated. Then, we find which movies (vectors) are closest to that vector.

This is called *cosine similarity* because of the standard formulation of the dot product. If $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, then:
$$x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = x \cdot y = \lVert x \rVert \lVert y \rVert \cos(\theta)$$

Where $\theta$ is the angle between the two vectors. Since our vectors have norm 1, this simplifies:
$$x_1 y_1 + x_2 y_2 + \cdots + x_n y_n = \cos(\theta)$$

Since $x$ and $y$ both have norm 1, $\cos(\theta)$ is the length of the projection of $x$ onto $y$: it's 1 when the vectors point in the same direction and 0 when they're perpendicular. That's why we use it as a measure of similarity between $x$ and $y$.
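To see the identity numerically, here's a tiny sketch with two hand-picked unit vectors in the plane, 60 degrees apart. The vectors and the angle are arbitrary choices for illustration.

```python
import numpy as np

theta = np.pi / 3  # 60 degrees
x = np.array([1.0, 0.0])                      # norm 1
y = np.array([np.cos(theta), np.sin(theta)])  # also norm 1

# For unit vectors, the dot product is the cosine of the angle between them.
similarity = x @ y
print(similarity, np.cos(theta))
```

Both printed numbers should be 0.5, up to floating-point error.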

### OMG this is so cool... I want to do more!
Great! Here are some ideas for extensions and, potentially, a final project:
1. Implement the matrix completion algorithm using [Carlos Fernandez-Granda's notes on low-rank matrix completion](https://cims.nyu.edu/~cfgranda/pages/OBDA_spring16/material/low_rank_models.pdf).
2. Perform this analysis on a more complex data set from [Kaggle](https://www.kaggle.com/) or another dataset website.
3. Take into account the fact that different users have similar preferences, and don't treat the axes as orthogonal.
4. Perform unsupervised learning on this data set and cluster similar movies together...
5. ...using that data, develop a Buzzfeed-style quiz where each question is of the form "Which movie do you prefer more?" and, based on the results of that quiz, determine one cluster of movies that the quiz-taker prefers, and recommend all of the movies from that cluster to the user.

## Read the Standard Library

We get it. At first, reading documentation doesn't sound like a fun way to spend an afternoon. However, this is one of a rare few times when you will have dedicated class time to take a deep dive into a library tool. Python's standard library is huge, and although your interests may not span the whole library, we're willing to bet that you can find something you enjoy in the library.

Remember that you can follow along with the documentation's examples in the interactive interpreter - we recommend this approach, so that you're both reading about and practicing with the modules you like.

Several of the documentation pages have links to the module's source code - if you're interested in seeing examples of well-crafted Python modules, there's no better place to look than the standard library!

Above all, explore and ask questions!

If you don't know which modules to look at, we have a list of some of our favorite modules that *weren't* covered in lecture, based on common general interests. Ask us about what you'd like to learn more about, and we'll point you in the right general direction.

The top-level categories of tools in the standard library are:

- Built-in [Functions](https://docs.python.org/3/library/functions.html), [Constants](https://docs.python.org/3/library/constants.html), [Types](https://docs.python.org/3/library/stdtypes.html), and [Exceptions](https://docs.python.org/3/library/exceptions.html)
- [Text Processing Services](https://docs.python.org/3/library/text.html)
- [Binary Data Services](https://docs.python.org/3/library/binary.html)
- [Data Types](https://docs.python.org/3/library/datatypes.html)
- [Numeric and Mathematical Modules](https://docs.python.org/3/library/numeric.html)
- [Functional Programming Modules](https://docs.python.org/3/library/functional.html)
- [File and Directory Access](https://docs.python.org/3/library/filesys.html)
- [Data Persistence](https://docs.python.org/3/library/persistence.html)
- [Data Compression and Archiving](https://docs.python.org/3/library/archiving.html)
- [File Formats](https://docs.python.org/3/library/fileformats.html)
- [Cryptographic Services](https://docs.python.org/3/library/crypto.html)
- [Generic Operating System Services](https://docs.python.org/3/library/allos.html)
- [Concurrent Execution](https://docs.python.org/3/library/concurrency.html)
- [Context Variables](https://docs.python.org/3/library/contextvars.html)
- [Networking and Interprocess Communication](https://docs.python.org/3/library/ipc.html)
- [Internet Data Handling](https://docs.python.org/3/library/netdata.html)
- [Structured Markup Processing Tools](https://docs.python.org/3/library/markup.html)
- [Internet Protocols and Support](https://docs.python.org/3/library/internet.html)
- [Multimedia Services](https://docs.python.org/3/library/mm.html)
- [Internationalization](https://docs.python.org/3/library/i18n.html)
- [Program Frameworks](https://docs.python.org/3/library/frameworks.html)
- [Graphical User Interfaces with Tk](https://docs.python.org/3/library/tk.html)
- [Development Tools](https://docs.python.org/3/library/development.html)
- [Debugging and Profiling](https://docs.python.org/3/library/debug.html)
- [Software Packaging and Distribution](https://docs.python.org/3/library/distribution.html)
- [Python Runtime Services](https://docs.python.org/3/library/python.html)
- [Custom Python Interpreters](https://docs.python.org/3/library/custominterp.html)
- [Importing Modules](https://docs.python.org/3/library/modules.html)
- [Python Language Services](https://docs.python.org/3/library/language.html)

### [Take Me To The Standard Library (Click Me!)](https://docs.python.org/3/library/)

## Write

In this section, you'll gain practice with some of the common modules in the Python standard library.

### Manipulating `collections`

**Before continuing, read the [`collections` documentation](https://docs.python.org/3/library/collections.html) at least through the section on `namedtuple()`.**

#### Working with `collections.namedtuple`

In this section, we modify code that prints out a message about each of a bunch of animals.

Rewrite the following code to be more Pythonic by using `collections.namedtuple` to add readable attribute references. The attributes for these animals are `'name'`, `'species'`, `'color'`, and `'age'`.

In [59]:
Animal = collections.namedtuple('Animal', ['name', 'species', 'color', 'age'])

lassie = Animal('Lassie', 'dog', 'black', 12)
buddy = Animal('Buddy', 'pupper', 'red', 0.5)
astro = Animal('Astro', 'doggo', 'grey', 15)
mrpb = Animal('Mr. Peanutbutter', 'dog', 'golden', 35)
bojack = Animal('BoJack Horseman', 'horse', 'brown', 52)
pc = Animal('Princess Carolyn', 'cat', 'pink', 34)
tinkles = Animal('Mr. Tinkles', 'cat', 'white', 7)
pupper = Animal('Bella', 'pupper', 'brown', 0.5)
doggo = Animal('Max', 'doggo', 'brown', 5)
seuss = Animal('The Cat in the Hat', 'cat', 'stripey', 27)
pluto = Animal('Pluto (Disney)', 'dog', 'orange', 3)
plu2o = Animal('Pluto (space)', 'planet', 'brownish', 4500000000)
yertle = Animal('Yertle', 'turtle', 'green', 130)
horton = Animal('Horton', 'elephant', 'blue', 79)

animals = [lassie, buddy, astro, mrpb, bojack, pc, tinkles, pupper, doggo, seuss, pluto, plu2o, yertle, horton]

for animal in animals:
    name, species, color, age = animal.name, animal.species, animal.color, animal.age
    if species in ('dog', 'doggo', 'pupper'):
        if age > 5:
            age_descriptor = 'an old'
        else:
            age_descriptor = 'a young'
        print('{} is {} {} {} who is {} years old.'.format(name, age_descriptor, color, species, age))
    else:
        print('{} is a {}-year-old non-canine {} {}.'.format(name, age, color, species))

Lassie is an old black dog who is 12 years old.
Buddy is a young red pupper who is 0.5 years old.
Astro is an old grey doggo who is 15 years old.
Mr. Peanutbutter is an old golden dog who is 35 years old.
BoJack Horseman is a 52-year-old non-canine brown horse.
Princess Carolyn is a 34-year-old non-canine pink cat.
Mr. Tinkles is a 7-year-old non-canine white cat.
Bella is a young brown pupper who is 0.5 years old.
Max is a young brown doggo who is 5 years old.
The Cat in the Hat is a 27-year-old non-canine stripey cat.
Pluto (Disney) is a young orange dog who is 3 years old.
Pluto (space) is a 4500000000-year-old non-canine brownish planet.
Yertle is a 130-year-old non-canine green turtle.
Horton is a 79-year-old non-canine blue elephant.


In [None]:
# Rewrite me to be more Pythonic!
lassie = ('Lassie', 'dog', 'black', 12)
buddy = ('Buddy', 'pupper', 'red', 0.5)
astro = ('Astro', 'doggo', 'grey', 15)
mrpb = ('Mr. Peanutbutter', 'dog', 'golden', 35)
bojack = ('BoJack Horseman', 'horse', 'brown', 52)
pc = ('Princess Carolyn', 'cat', 'pink', 34)
tinkles = ('Mr. Tinkles', 'cat', 'white', 7)
pupper = ('Bella', 'pupper', 'brown', 0.5)
doggo = ('Max', 'doggo', 'brown', 5)
seuss = ('The Cat in the Hat', 'cat', 'stripey', 27)
pluto = ('Pluto (Disney)', 'dog', 'orange', 3)
plu2o = ('Pluto (space)', 'planet', 'brownish', 4500000000)
yertle = ('Yertle', 'turtle', 'green', 130)
horton = ('Horton', 'elephant', 'blue', 79)

for animal in [lassie, buddy, astro, mrpb, bojack, pc, tinkles, pupper, doggo, seuss, pluto, plu2o, yertle, horton]:
    if animal[1] == 'dog' or animal[1] == 'doggo' or animal[1] == 'pupper':
        if animal[3] > 5:
            print(animal[0] + ' is an old ' + animal[2] + ' ' + animal[1] + ' who is ' + str(animal[3]) + ' years old.')
        else:
            print(animal[0] + ' is a young ' + animal[2] + ' ' + animal[1] + ' who is ' + str(animal[3]) + ' years old.')
    else:
        print(animal[0] + ' is a ' + str(animal[3]) + '-year-old non-canine ' + animal[2] + ' ' + animal[1] + '.')
        
# Prints out:
# Lassie is an old black dog who is 12 years old.
# Buddy is a young red pupper who is 0.5 years old.
# Astro is an old grey doggo who is 15 years old.
# Mr. Peanutbutter is an old golden dog who is 35 years old.
# BoJack Horseman is a 52-year-old non-canine brown horse.
# Princess Carolyn is a 34-year-old non-canine pink cat.
# Mr. Tinkles is a 7-year-old non-canine white cat.
# Bella is a young brown pupper who is 0.5 years old.
# Max is a young brown doggo who is 5 years old.
# The Cat in the Hat is a 27-year-old non-canine stripey cat.
# Pluto (Disney) is a young orange dog who is 3 years old.
# Pluto (space) is a 4500000000-year-old non-canine brownish planet.
# Yertle is a 130-year-old non-canine green turtle.
# Horton is a 79-year-old non-canine blue elephant.

#### Using `collections.defaultdict` and `collections.Counter`

Using `/usr/share/dict/words` (alternatively, `https://stanfordpython.com/res/misc/words` if you are on Windows) as a data source, what are the three most common word lengths in the English language? Remember to strip off trailing whitespace.

**Bytes vs string in Python 3:**
- `bytes` objects hold raw byte values, so they can be written to disk directly. `str` objects hold Unicode text, which must be encoded (for example, as UTF-8) before it can be stored on disk. See [this Stack Overflow discussion](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string).
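As a small illustration of the difference, the sketch below encodes a string to bytes and back. The sample text is arbitrary. This is also why the download cell below opens its file with `'wb'`: the `content` attribute of a `requests` response is a `bytes` object.

```python
text = 'caf\u00e9'           # a str: four Unicode characters
data = text.encode('utf-8')  # a bytes object: the accented e takes two bytes

print(type(text), len(text))
print(type(data), len(data))

# Decoding with the same encoding recovers the original text.
round_trip = data.decode('utf-8')
print(round_trip == text)
```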

In [42]:
# Change me to another file location if you've downloaded a copy of the word list.
# Recall that this file has one word per line.
# FILENAME = '/usr/share/dict/words'
URL = 'https://stanfordpython.com/res/misc/words'
write_file = 'stanford_dict.txt'
stanford_dict_response = requests.get(URL)
if stanford_dict_response.ok:
    raw_data = stanford_dict_response.content
    with open(write_file, 'wb') as f:
        f.write(raw_data)

# TODO(you): Print the three most common word lengths in the English language.

In [43]:
with open(write_file, 'r') as f:
    # One word per line; strip trailing whitespace before measuring each word.
    word_len_counter = collections.Counter(len(line.strip()) for line in f)

MOST_COMMON_N = 3
word_len_counter.most_common(MOST_COMMON_N)

[(9, 32403), (10, 30878), (8, 29989)]

##### Evil Hangman Redux (optional)

Feel free to skip this section if you aren't familiar with Keith Schwarz's CS106B/L assignment: "Evil Hangman," in which a user plays the classic game of Hangman against a deceitful AI that will do everything that it can to win.

Suppose that you have a function `mask(word, letter)` which replaces each character in `word` with a dash if that character is different than `letter` - for example, `mask('banana', 'a')  # => '-a-a-a'`. We've provided a sample implementation below.

Your task is to write a function `largest_families(words, letter, num_families=3)` that returns the top `num_families` largest collections of words which share a mask, given a source collection of words and a chosen letter. In more detail, given a letter, a resulting family of words is one in which every word yields the same mask when masked using that letter. For example, suppose that `words = ['sees', 'says', 'sass']` and `letter = 's'`. Then there are two families of words: the mask `'s--s'` is a 2-word family containing `'sees'` and `'says'`, and the mask `'s-ss'` is a 1-word family containing just the word `'sass'`.
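One possible approach, sketched under the assumption that ties between equally large families can be broken arbitrarily: group the words by mask with a `collections.defaultdict`, then sort the groups by size. The name `largest_families_sketch` is made up so that it doesn't clash with the function you're asked to write.

```python
import collections

def mask(word, letter):
    return ''.join(ch if ch == letter else '-' for ch in word)

def largest_families_sketch(words, letter, num_families=3):
    # Group words by their mask; missing keys start out as empty lists.
    families = collections.defaultdict(list)
    for word in words:
        families[mask(word, letter)].append(word)
    # Sort the families from largest to smallest and keep the first few.
    ranked = sorted(families.values(), key=len, reverse=True)
    return ranked[:num_families]

print(largest_families_sketch(['sees', 'says', 'sass'], 's', num_families=1)[0])
```

On the three-word example above, this prints `['sees', 'says']`.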

In [None]:
import collections

def mask(word, letter):
    return ''.join('-' if letter != ch else letter for ch in word)


def largest_families(words, letter, num_families=3):
    pass


# Quick test
words = ['sees', 'says', 'sass']
print(largest_families(words, 's', num_families=1)[0])  # => Should print ['sees', 'says']

#### Working Together

Use tools from the `collections` module to implement an `Employee` database, which maintains organizational relationships among employees. Suppose that your data is provided in a tab-separated file:

```
employee_name    employee_manager    salary    department    title
employee_name    employee_manager    salary    department    title
...
employee_name    employee_manager    salary    department    title
```

If you'd like sample data to work with, you can use the following
```
psarin    poohbear  0      CS   Instructor
poohbear  sahami    500    CS   Lecturer
tigger    poohbear  100    CS   Tiger
htiek     sahami    500    CS   Lecturer
sahami    mtl       5000   CS   Professor
guido     guido     50000  PSF  BDFL
```
Save the above text to a file, making sure that your text editor doesn't automatically replace all of the tabs with spaces!

After writing code to load this information from a file, implement the following functions.

```Python
def directly_reports_to(employee, manager):
    """Return whether or not employee directly reports to manager"""
    pass

def indirectly_reports_to(employee, manager):
    """Return whether or not employee indirectly reports to manager"""
    pass
    
def in_department(dept):
    """Return a collection of all employees of a given department"""
    pass
    
def cost_of(dept):
    """Return the sum total of salaries for all employees of a given department"""
    pass
```

The primary portion of this section is parsing the file and storing the employees in a data structure of your choice, keyed by some of the employees' information.

In [56]:
import collections

# Replace me with the name of a file containing employment data.
FILENAME = 'employee.txt'

# TODO(you): Read the data file and store the data in a data structure.
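Employee = collections.namedtuple('Employee', ['employee_name', 'employee_manager', 'salary', 'department', 'title'])
# A plain dict is enough here: defaultdict(Employee) would raise a TypeError
# on a missing key, because Employee() can't be called without arguments.
employees = {}
with open(FILENAME, 'r') as f:
    for row in f:
        # Drop the trailing newline so it doesn't end up in the title field.
        employee_name, employee_manager, salary, department, title = row.rstrip('\n').split('\t')
        employees[employee_name] = Employee(employee_name, employee_manager, salary, department, title)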

def directly_reports_to(employee, manager):
    """Return whether or not employee directly reports to manager"""
    if employee in employees:
        return employees[employee].employee_manager == manager
    print('Employee {} not found'.format(employee))
    return False


def indirectly_reports_to(employee, manager):
    """Return whether or not employee indirectly reports to manager"""
    while True:
        if employee not in employees:
            print('Employee {} not found'.format(employee))
            return False
        direct_manager = employees[employee].employee_manager
        if direct_manager == manager:
            return True
        elif employee == direct_manager: # only top level employee can have employee == direct_manager
            return False
        else:
            employee = direct_manager


def in_department(dept):
    """Return a collection of all employees of a given department"""
    return [name for name, employee in employees.items() if employee.department == dept]


def cost_of(dept):
    """Return the sum total of salaries for all employees of a given department"""
    return sum(int(employee.salary) for employee in employees.values() if employee.department == dept)

In [57]:
# test functions
print(directly_reports_to('psarin', 'poohbear')) # True
print(directly_reports_to('poohbear', 'psarin')) # False
print(directly_reports_to('psarinssss', 'poohbear')) # False
print(directly_reports_to('psarin', 'sahami')) # False
print(indirectly_reports_to('psarin', 'sahami')) # True
print(indirectly_reports_to('psarin', 'htiek')) # False
print(in_department('CS')) # psarin, poohbear, tigger, htiek, sahami
print(in_department('PE')) # list()
print(cost_of('PSF')) # 50000
print(cost_of('PE')) # 0

True
False
Employee psarinssss not found
False
False
True
Employee mtl not found
False
['psarin', 'poohbear', 'tigger', 'htiek', 'sahami']
[]
50000
0


### Extracting data with `re`

If you're fairly new to regular expressions, we recommend you read through [the official Python HOWTO](https://docs.python.org/3/howto/regex.html) and walk through those examples instead of solving this portion of the lab.

Otherwise, **read through the official [`re` documentation](https://docs.python.org/3/library/re.html) through "Match Objects"** (although the next section provides some neat examples).

#### Wordplay

Using the list of words found at `/usr/share/dict/words` (or alternatively, `http://stanfordpython.com/res/misc/words`), determine all words that have all five vowels in order. That is, words that contain an `'a'`, `'e'`, `'i'`, `'o'`, and `'u'` in order, with any number (including 0) of non-vowel word characters before the 'a', between the vowels, and after the 'u'.

For example, your list should contain both `"abstemious"` and `"facetious"`. We found a total of 14 matches.
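As a starting point, here's a sketch of one pattern that fits the description, tried on a short hand-written word list rather than the full dictionary. It assumes that "non-vowel" means any character other than a lowercase `a`, `e`, `i`, `o`, or `u`, and it lowercases each word first; you may want a stricter character class.

```python
import re

# Zero or more non-vowel characters may surround each required vowel.
NON_VOWELS = r'[^aeiou]*'
vowels_in_order = re.compile(NON_VOWELS + NON_VOWELS.join('aeiou') + NON_VOWELS)

sample_words = ['abstemious', 'facetious', 'education', 'sequoia', 'banana']

# fullmatch requires the pattern to cover the entire word.
matches = [w for w in sample_words if vowels_in_order.fullmatch(w.lower())]
print(matches)
```

Only `'abstemious'` and `'facetious'` survive here: the other sample words either start with the wrong vowel or repeat one.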

In [None]:
import re

# Change me to another file location if you've downloaded a copy of the word list.
# Recall that this file has one word per line.
WORD_FILE = '/usr/share/dict/words'
pattern = re.compile('your-regular-expression-here')

# TODO(you): Print out any words that have five vowels in order.

#### License Plates
I love crosswords. Seriously. I do the NYTimes Mini every morning, pretty much as soon as I wake up. I also love thinking about crosswords and playing fun crossword-related games. One game that's well-known among cruciverbalists (folks who make crosswords) is the **license plate game**.

Here's how you play the game: pick a license plate and ignore the numbers, filtering down to the alphabetic characters. Then, think of a word or phrase that has those characters, in that order. When you ignore numbers, my license plate is `"btp"`, so `"breastplate"` and `"subtype"` would be valid words.

Using the list of words found at `/usr/share/dict/words` (or alternatively, `http://stanfordpython.com/res/misc/words`), write a function, `license_plate_words(letters)` which returns a list of words that contain `letters` in the given order.
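
One possible way in, sketched under the assumption that a regular expression is the tool you want here: join the letters with `.*` so that any run of characters may sit between them. The helper name `letters_to_pattern` and the five-word sample list are made up for illustration; they are not part of the lab's starter code.

```python
import re


def letters_to_pattern(letters):
    """Compile a regex for words containing `letters` in order (hypothetical helper)."""
    # re.escape guards against characters with special regex meaning;
    # '.*' allows any run of characters (including none) between letters.
    return re.compile('.*'.join(re.escape(ch) for ch in letters), re.IGNORECASE)


# A tiny stand-in for the real word list, for illustration only.
sample_words = ['breastplate', 'subtype', 'python', 'bootstrap', 'tabletop']
pattern = letters_to_pattern('btp')
matches = [word for word in sample_words if pattern.search(word)]
print(matches)
```

`pattern.search` is used rather than `pattern.fullmatch` because the plate letters may appear anywhere inside the word.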

In [None]:
WORD_FILE = '/usr/share/dict/words'

def license_plate_words(letters):
    pass

print(license_plate_words('btp')[:20])
print(license_plate_words('aeiou')) # this should include every match from the previous problem!

#### Regex Crossword Checker

Take a moment to play one round of [Regex Crossword](https://regexcrossword.com/) (a highly entertaining site, if you've got hours to spare).

In the spirit of Regex Crossword, we will write a function that checks arbitrary regex crosswords. Your function should take in two lists, one representing horizontal clues and one representing vertical clues, as well as the potential solution to the crossword in the form of a list-of-lists in row-major order (i.e. the elements are lists representing rows of the crossword). You should return whether or not the potential solution is in fact valid.

```Python
def regex_crossword_check(horizontal_patterns, vertical_patterns, candidate):
    pass  # Your implementation here
```

For example, the call corresponding to the first "Beginner" puzzle (it's called "Beatles") would look like:

```Python
horiz = [r'HE|LL|O+', r'[PLEASE]+']
vert = [r'[^SPEAK]+', r'EP|IP|EF']
candidate = [
    ['H', 'E'],
    ['L', 'P']
]
regex_crossword_check(horiz, vert, candidate)  # => True
```

and the call corresponding to the second "Experienced" puzzle (it's called "Royal Dinner") would look like:

```Python
horiz = [r'(Y|F)(.)\2[DAF]\1', r'(U|O|I)*T[FRO]+', r'[KANE]*[GIN]*']
vert = [r'(FI|A)+', r'(YE|OT)K', r'(.)[IF]+', r'[NODE]+', r'(FY|F|RG)+']
candidate = [
    ['F', 'O', 'O', 'D', 'F'],
    ['I', 'T', 'F', 'O', 'R'],
    ['A', 'K', 'I', 'N', 'G']
]
regex_crossword_check(horiz, vert, candidate)  # => True
```

Some implementation notes:

* You may want to use `re.fullmatch` instead of `re.match` or `re.search`. The former matches a pattern against an entire string, whereas the latter two check whether any prefix or any substring, respectively, matches the pattern.
* You can get the width and height of the crossword from the length of the vertical and horizontal clue lists, respectively.
* Remember your friend, `zip`!
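
To make those notes concrete, here is a small sketch of how the three matching functions differ, plus the `zip` trick for reading columns out of a row-major grid. It uses the first "Beatles" clue and candidate from above; it is an illustration of the building blocks, not a full checker.

```python
import re

pattern = re.compile(r'HE|LL|O+')

# fullmatch: the whole string must match the pattern.
print(bool(pattern.fullmatch('HE')))   # True
print(bool(pattern.fullmatch('HEX')))  # False
# match: only a prefix has to match.
print(bool(pattern.match('HEX')))      # True
print(bool(pattern.match('XHE')))      # False
# search: any substring may match.
print(bool(pattern.search('XHE')))     # True

# zip(*rows) transposes a list of rows into columns.
candidate = [['H', 'E'], ['L', 'P']]
columns = [''.join(col) for col in zip(*candidate)]
print(columns)                         # ['HL', 'EP']
```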

In [None]:
import re
import string


def regex_crossword_check(horizontal_patterns, vertical_patterns, candidate):
    pass  # Your implementation here


# Quick tests.
horiz = [r'HE|LL|O+', r'[PLEASE]+']
vert = [r'[^SPEAK]+', r'EP|IP|EF']
candidate = [
    ['H', 'E'],
    ['L', 'P']
]
print(regex_crossword_check(horiz, vert, candidate))  # => True


horiz = [r'(Y|F)(.)\2[DAF]\1', r'(U|O|I)*T[FRO]+', r'[KANE]*[GIN]*']
vert = [r'(FI|A)+', r'(YE|OT)K', r'(.)[IF]+', r'[NODE]+', r'(FY|F|RG)+']
candidate = [
    ['F', 'O', 'O', 'D', 'F'],
    ['I', 'T', 'F', 'O', 'R'],
    ['A', 'K', 'I', 'N', 'G']
]
print(regex_crossword_check(horiz, vert, candidate))  # => True

#### Regex Crossword Solver (challenge)

This problem is hard - skip it unless you're feeling up for an algorithmic challenge.

Write a function to solve arbitrary regular expression crosswords.

Your function should take in two lists, one representing horizontal clues and one representing vertical clues, as well as a keyword argument representing the possible alphabet. Return (or lazily generate) a list of all answers consistent with the constraints, where an answer is formed by joining the characters in row-major order (consistent with their website).

```Python
import re
import string
def regex_crossword_solve(horizontal_patterns, vertical_patterns, alphabet=string.ascii_uppercase):
    pass
```

For example, the call corresponding to the first "Beginner" puzzle (it's called "Beatles") would look like:

```Python
horiz = [r'HE|LL|O+', r'[PLEASE]+']
vert = [r'[^SPEAK]+', r'EP|IP|EF']
regex_crossword_solve(horiz, vert)
```

and would return the final answer `['HELP']` derived from the (unique, in this case) solution `[['H', 'E'], ['L', 'P']]`. If there are multiple answers, return them all.
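
As a starting point, here is a brute-force sketch, assuming small grids: enumerate every string each row clue accepts, then keep the row combinations whose columns also match. The name `brute_force_solve` is hypothetical, and this is deliberately the naive approach rather than a clever one. It tries `len(alphabet) ** width` strings per row, so it only scales to tiny puzzles.

```python
import itertools
import re
import string


def brute_force_solve(horizontal_patterns, vertical_patterns, alphabet=string.ascii_uppercase):
    """Yield each solution, flattened row-major into one string (naive sketch)."""
    width = len(vertical_patterns)
    row_regexes = [re.compile(p) for p in horizontal_patterns]
    col_regexes = [re.compile(p) for p in vertical_patterns]

    # Every possible row string of the right width.
    words = [''.join(chars) for chars in itertools.product(alphabet, repeat=width)]
    # For each row clue, the row strings that clue accepts.
    row_options = [[w for w in words if regex.fullmatch(w)] for regex in row_regexes]

    # Try each combination of accepted rows; keep those whose columns also match.
    for rows in itertools.product(*row_options):
        columns = (''.join(col) for col in zip(*rows))
        if all(regex.fullmatch(col) for regex, col in zip(col_regexes, columns)):
            yield ''.join(rows)


horiz = [r'HE|LL|O+', r'[PLEASE]+']
vert = [r'[^SPEAK]+', r'EP|IP|EF']
print(list(brute_force_solve(horiz, vert)))  # ['HELP']
```

Filtering rows first keeps the search far smaller than enumerating every full grid, but a serious solver would prune partial grids as it goes.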

In [None]:
import re
import string


def regex_crossword_solve(horizontal_patterns, vertical_patterns, alphabet=string.ascii_uppercase):
    pass


# Quick test.
horiz = [r'HE|LL|O+', r'[PLEASE]+']
vert = [r'[^SPEAK]+', r'EP|IP|EF']
print(regex_crossword_solve(horiz, vert))

#### Multidirectional (super challenge)

If you look through the Regex Crossword site linked above, you'll see that some puzzles (starting from "Double Cross" onwards) support multiple directions. Update your function above to work first with bidirectional clues (as in "Double Cross", "Cities", "Volapük", and "Hamlet"). If you finish that, see if you can solve the types of puzzles shown in "Hexagonal."

In [None]:
import re
import string


def regex_crossword_solve_multidimensional(horizontal_patterns_lr, vertical_patterns_tb, horizontal_patterns_rl, vertical_patterns_bt, alphabet=string.ascii_uppercase):
    pass

#### Minimal Regex (super challenge)

Given a finite set of positive samples and a finite set of negative examples, can we build a regular expression that matches the positives but rejects the negatives? Of course! We could just explicitly include the positives and explicitly reject the negatives. However, this approach leads to regexes that are quite long. For this part, write an algorithm that approximately generates the smallest regular expression that matches a list of positive samples and rejects a list of negative samples. Our metric for smallest will default to shortest, but feel free to come up with your own metric.

*Note: this problem is NP-hard, and is tied to some deep results in complexity theory. For more information, check out [this CSTheory.SE post](http://cstheory.stackexchange.com/questions/1854/is-finding-the-minimum-regular-expression-an-np-complete-problem)*

In [None]:
import re
import string

# This is a super challenging problem!
def minimal_regex(positives, negatives):
    pass

### Working with `itertools`

**Before continuing, make sure you read all of the [`itertools` documentation](https://docs.python.org/3/library/itertools.html).**

#### Tabulation

Write a `tabulate` function to generate a computation lookup table. `tabulate` should take in three arguments: a function, a start number (default 0), and a step size (default 1).

```Python
def tabulate(f, start=0, step=1):
    pass
```

This function can be used as follows:

```Python
sqgen = tabulate(lambda x: x ** 2)
next(sqgen)  # => 0 (which is equal to f(0))
next(sqgen)  # => 1 (which is equal to f(1))
next(sqgen)  # => 4 (which is equal to f(2))
next(sqgen)  # => 9 (which is equal to f(3))
```

For reference, our implementation is one line and 43 characters.

Hint: take a look at the `itertools.count` function!

In [64]:
def tabulate(f, start=0, step=1):
    return map(f, itertools.count(start, step))


sqgen = tabulate(lambda x: x ** 2)
print(next(sqgen))  # => 0 (which is equal to f(0))
print(next(sqgen))  # => 1 (which is equal to f(1))
print(next(sqgen))  # => 4 (which is equal to f(2))
print(next(sqgen))  # => 9 (which is equal to f(3))

0
1
4
9
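
Because `tabulate` returns an infinite iterator, you can't call `list` on it directly. A small sketch of one way to take a finite slice, using `itertools.islice` (the `tabulate` definition is repeated so the snippet runs on its own):

```python
import itertools


def tabulate(f, start=0, step=1):
    return map(f, itertools.count(start, step))


# islice takes the first n items lazily, so the infinite iterator is never exhausted.
first_five_squares = list(itertools.islice(tabulate(lambda x: x ** 2), 5))
print(first_five_squares)  # [0, 1, 4, 9, 16]

# itertools.count also accepts non-integer start and step values.
doubled_halves = list(itertools.islice(tabulate(lambda x: x * 2, start=0.5, step=0.5), 3))
print(doubled_halves)  # [1.0, 2.0, 3.0]
```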


### `random`

**Before continuing, make sure you read the [`random` documentation](https://docs.python.org/3/library/random.html) through "Functions for Sequences."**

There's no code in this section - just read the documentation! It's rather short.

### Using `sys` for command-line tools

#### Addition

Write a Python script `add.py` (a new file) that can be run on the command line with any number of additional arguments representing numbers that you want to add up. Your script should print the sum of numeric arguments. If there are arguments that can't be converted to floats, ignore them. You can use what we learned about exceptional control flow to determine if a number is convertible to a float. If there are no additional arguments to your script, you should print an error message and exit.

Recall you can use `sys.argv` to access the command-line arguments.

You should be able to invoke your script from the command line as follows:

```
(cs41-env)$ python add.py 4 1
5.0
(cs41-env)$ python add.py 17 38 "Hey wassup" "hello"
55.0
(cs41-env)$ python add.py 8 6 7 5 3 0 9
38.0
(cs41-env)$ python add.py
Usage: python add.py <nums>
    
    Add some numbers together
```
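
A minimal sketch of the core logic, assuming you split the script into a summing helper and a `main` function that receives the argument list. The names `sum_numeric` and `main` are illustrative choices, not requirements of the lab.

```python
def sum_numeric(args):
    """Sum the arguments that parse as floats, skipping the rest."""
    total = 0.0
    for arg in args:
        try:
            total += float(arg)
        except ValueError:
            # Not a number (e.g. "hello"): ignore it, as the problem asks.
            continue
    return total


def main(argv):
    # argv[0] is the script name; everything after it is user input.
    if len(argv) < 2:
        print('Usage: python add.py <nums>\n\n    Add some numbers together')
        return 1
    print(sum_numeric(argv[1:]))
    return 0


# Simulate `python add.py 17 38 "Hey wassup" "hello"` without touching sys.argv.
main(['add.py', '17', '38', 'Hey wassup', 'hello'])  # prints 55.0
```

In an actual `add.py`, you would `import sys` and end the file with `sys.exit(main(sys.argv))` so that the return value becomes the exit status.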

##### Argument Parsing with `argparse`

Python's [`argparse` module](https://docs.python.org/3/library/argparse.html) provides a nicer way to define scripts that accept command-line arguments. Read through the `argparse` documentation and then rewrite the above program using the tools provided by `argparse`.
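
One way this might look, as a sketch rather than a reference solution. It assumes the values are kept as strings so that non-numeric ones can be skipped instead of rejected; declaring `type=float` would make `argparse` exit with an error on `"hello"`. The names `build_parser` and `add_values` are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='add.py', description='Add some numbers together')
    # nargs='+' requires at least one value; otherwise argparse prints usage and exits.
    parser.add_argument('nums', nargs='+', help='values to add; non-numeric values are ignored')
    return parser


def add_values(values):
    total = 0.0
    for value in values:
        try:
            total += float(value)
        except ValueError:
            pass  # ignore anything that is not a number
    return total


# parse_args accepts an explicit list, which is handy for testing;
# called with no argument, it reads sys.argv[1:].
args = build_parser().parse_args(['8', '6', '7', '5', '3', '0', '9'])
print(add_values(args.nums))  # 38.0
```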

#### `tree` (challenge)

Write a program that emulates the command-line utility `tree`, which pretty-prints the directory structure rooted by an argument name. If there is no argument, use the current working directory. For example,

```
$ python3 tree.py python-labs/
python-labs/
├── LICENSE
├── NOTES.md
├── README.md
├── markdown
│   ├── lab1-warmup.md
│   ├── lab2-datastructures.md
│   ├── lab3-functions.md
│   ├── lab4-fp.md
│   ├── lab5-oop.md
│   ├── lab6-standardlibrary.md
│   ├── lab7-thirdparty.md
│   └── lab8-pythonecosystem.md
└── notebooks
    ├── lab1-warmup-notebook.ipynb
    ├── lab2-datastructures-notebook.ipynb
    ├── lab3-functions-notebook.ipynb
    ├── lab4-fp-notebook.ipynb
    ├── lab5-oop-notebook.ipynb
    ├── lab6-standardibrary-notebook.ipynb
    ├── lab7-thirdparty-notebook.ipynb
    └── lab8-pythonecosystem-notebook.ipynb
```

The above is just an example - don't worry if your actual `python-labs/` directory doesn't look like this.

Use the [`pathlib` library](https://docs.python.org/3/library/pathlib.html) for filesystem navigation. For implementation details, check out `tree`'s [man page](http://linux.die.net/man/1/tree) or this [more helpful description](http://www.computerhope.com/unix/tree.htm). You don't need to implement any of the command-line flags for this part - just focus on navigating the file system.
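
A rough sketch of the recursive traversal, assuming entries are sorted by name and that hidden files are listed like any others (the real `tree` has its own ordering and filtering rules). The names `tree_lines` and `print_tree` are illustrative.

```python
import pathlib


def tree_lines(root, prefix=''):
    """Yield the lines of a `tree`-style listing for the contents of `root`."""
    entries = sorted(pathlib.Path(root).iterdir(), key=lambda p: p.name)
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        yield prefix + ('└── ' if is_last else '├── ') + entry.name
        if entry.is_dir():
            # Children of the last entry are indented with spaces, others with a bar.
            yield from tree_lines(entry, prefix + ('    ' if is_last else '│   '))


def print_tree(root='.'):
    print(root)
    for line in tree_lines(root):
        print(line)
```

Calling `print_tree('python-labs/')` would print a listing shaped like the one above. Note that `is_dir()` follows symbolic links, so this sketch can loop forever on a link that points back up the tree.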

#### Improving `tree` (super challenge)

Update your `tree` program to handle more advanced use cases, listed in the man page above. Can you handle symbolic links, maximum depth recursion, or pattern matching?

You can make this tool as powerful as you'd like.

### All Together Now

This final problem will incorporate all of the modules we've seen so far. We'll build a tool to determine the shortest airport journey between any two airports.

#### Airport Data
First, let's look at our data. OpenFlights publishes the following data files:

* [Airlines](https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat)
* [Airports](https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat)
* [Routes](https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat)

For information about the data itself, [DataHub](https://datahub.io/dataset/open-flights) has a good writeup on the schema.

The information [by OpenFlights itself](https://openflights.org/data.html) is also quite good for getting an overview of the data.

You will write a script that, when given two airport codes (like SFO and JFK) and a maximum segment count, prints all possible ways to get from the source airport to the destination airport in at most that many segments:

```
$ python3 flights.py SFO JFK 2
SFO -> JFK
SFO -> LAX -> JFK
SFO -> ORD -> JFK
SFO -> DFW -> JFK
...
SFO -> PDX -> JFK
```

How powerful can you make this script? Consider adding extra features that utilize all of the standard library modules we've seen here.

In [68]:
# store data
flights_data_loc = {'airlines': 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat', 
                    'airports': 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat', 
                    'routes': 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat'}
file_ext = '.dat'

def store_data_from_url(url, filename): 
    response = requests.get(url)
    if response.ok:
        with open(filename, 'wb') as f:
            f.write(response.content)
            print('Successfully wrote data from {} to {}'.format(url, filename))
    return

for k, v in flights_data_loc.items():
    store_data_from_url(v, k + file_ext)

Successfully wrote data from https://raw.githubusercontent.com/jpatokal/openflights/master/data/airlines.dat to airlines.dat
Successfully wrote data from https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat to airports.dat
Successfully wrote data from https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat to routes.dat


In [89]:
start, end, max_flights = 'SFO', 'JFK', 2

Route = collections.namedtuple('Route', ['airline', 'airline_id', 'start_airport', 'end_airport', 'stops'])
valid_routes = []

with open('routes.dat', 'r') as f:
    for line in f:
        data = line.split(',')
#         if data[2] == 'SFO' and data[4] == 'JFK':
        valid_routes.append(Route(data[0], data[1], data[2], data[4], int(data[7])))

In [96]:
# Only 11 routes have stops > 0 (i.e. are not direct flights). That is obviously too low to cover every connecting itinerary.
print(list(filter(lambda x: x.stops > 0, valid_routes)))

# Remove these routes for now. It seems like the legs of each stops = 1 route also appear as separate stops = 0 flights, so transfer itineraries can be rebuilt from the stops = 0 flights.
print(list(filter(lambda x: x.airline == 'WN' and x.start_airport == 'MCO', valid_routes)))
print(list(filter(lambda x: x.airline == 'WN' and x.end_airport == 'CAK', valid_routes)))

[Route(airline='5T', airline_id='1623', start_airport='YRT', end_airport='YEK', stops=1), Route(airline='AC', airline_id='330', start_airport='ABJ', end_airport='BRU', stops=1), Route(airline='AC', airline_id='330', start_airport='YVR', end_airport='YBL', stops=1), Route(airline='CU', airline_id='1936', start_airport='FCO', end_airport='HAV', stops=1), Route(airline='FL', airline_id='1316', start_airport='HOU', end_airport='SAT', stops=1), Route(airline='FL', airline_id='1316', start_airport='MCO', end_airport='HOU', stops=1), Route(airline='FL', airline_id='1316', start_airport='MCO', end_airport='ORF', stops=1), Route(airline='SK', airline_id='4319', start_airport='ARN', end_airport='GEV', stops=1), Route(airline='WN', airline_id='4547', start_airport='BOS', end_airport='MCO', stops=1), Route(airline='WN', airline_id='4547', start_airport='MCO', end_airport='BOS', stops=1), Route(airline='WN', airline_id='4547', start_airport='MCO', end_airport='CAK', stops=1)]
[Route(airline='WN', a

In [97]:
routes = []
Route = collections.namedtuple('Route', ['airline', 'airline_id', 'start_airport', 'end_airport', 'stops'])

with open('routes.dat', 'r') as f:
    for line in f:
        data = line.split(',')
        routes.append(Route(data[0], data[1], data[2], data[4], int(data[7])))

In [121]:
random.seed(0)
random_routes = random.choices(routes, k = len(routes)//10)

In [116]:
def get_valid_routes(start_airport, end_airport, max_flights, route_map):
    
    valid_routes = []
    memoize_routes = {} # key = (start, end, max_flights), value = set(routes from start to end within max_flights Route(airline, airline_id, flight_cnt))
    
    def get_valid_routes_recur(start, end, max_flights, route_map, curr_flights):
        curr_start, curr_end, curr_flight_cnt = curr_flights[0].start_airport, curr_flights[-1].end_airport, len(curr_flights)
        if curr_flight_cnt > max_flights: 
            return
        if curr_start == start and curr_end == end:
            valid_routes.append(curr_flights)
            return
        for route in route_map:
            if route.start_airport == curr_end:
                get_valid_routes_recur(start, end, max_flights, route_map, curr_flights + [route])
        return
    
    for route in route_map:
        if route.start_airport == start_airport:
            get_valid_routes_recur(start_airport, end_airport, max_flights, route_map, [route])
    return valid_routes

In [122]:
t = get_valid_routes('SFO', 'JFK', 2, random_routes)
print(len(t))
t[0]

14


[Route(airline='VX', airline_id='5331', start_airport='SFO', end_airport='ORD', stops=0),
 Route(airline='DL', airline_id='2009', start_airport='ORD', end_airport='JFK', stops=0)]

In [129]:
setup_code = '''
import collections
import itertools
import random
import timeit

routes = []
Route = collections.namedtuple('Route', ['airline', 'airline_id', 'start_airport', 'end_airport', 'stops'])

with open('routes.dat', 'r') as f:
    for line in f:
        data = line.split(',')
        routes.append(Route(data[0], data[1], data[2], data[4], int(data[7])))

random.seed(0)
random_routes = random.choices(routes, k = len(routes)//10)

def get_valid_routes(start_airport, end_airport, max_flights, route_map):
    
    valid_routes = []
    memoize_routes = {} # key = (start, end, max_flights), value = set(routes from start to end within max_flights Route(airline, airline_id, flight_cnt))
    
    def get_valid_routes_recur(start, end, max_flights, route_map, curr_flights):
        curr_start, curr_end, curr_flight_cnt = curr_flights[0].start_airport, curr_flights[-1].end_airport, len(curr_flights)
        if curr_flight_cnt > max_flights: 
            return
        if curr_start == start and curr_end == end:
            valid_routes.append(curr_flights)
            return
        for route in route_map:
            if route.start_airport == curr_end:
                get_valid_routes_recur(start, end, max_flights, route_map, curr_flights + [route])
        return
    
    for route in route_map:
        if route.start_airport == start_airport:
            get_valid_routes_recur(start_airport, end_airport, max_flights, route_map, [route])
    return valid_routes
'''
test_code = '''get_valid_routes('SFO', 'JFK', 3, random_routes)'''
test_number = 10
test_time = timeit.timeit(stmt = test_code, setup = setup_code, number = test_number)
test_time / test_number

8.730555223100236

In [None]:
# with memoization
# memoization is hard to implement because
#   1) update rules can be very time-inefficient,
#   2) you don't want to recommend flights that go over max_flights,
#   3) you don't want to recommend itineraries that return a user to a previously
#      visited airport (this one is a concern depending on implementation).
# Keying the cache on (airport, flights_left) handles 2). Like the version above,
# this one does not attempt to handle 3).

def get_valid_routes(start_airport, end_airport, max_flights, route_map):

    # Index routes by departure airport so each lookup avoids scanning route_map.
    routes_from = collections.defaultdict(list)
    for route in route_map:
        routes_from[route.start_airport].append(route)

    # key = (airport, flights_left), value = list of itineraries (lists of Route)
    # from airport to end_airport that use at most flights_left flights
    memoize_routes = {}

    def get_valid_routes_recur(airport, flights_left):
        if flights_left == 0:
            return []
        key = (airport, flights_left)
        if key not in memoize_routes:
            found = []
            for route in routes_from[airport]:
                if route.end_airport == end_airport:
                    # Arrived: stop extending this itinerary.
                    found.append([route])
                else:
                    tails = get_valid_routes_recur(route.end_airport, flights_left - 1)
                    found.extend([route] + tail for tail in tails)
            memoize_routes[key] = found
        return memoize_routes[key]

    return get_valid_routes_recur(start_airport, max_flights)

In [133]:
# solution's version

import csv

Airport = collections.namedtuple('Airport', ['id', 'name', 'city', 'country', 'faa_iata', 'icao', 'lat', 'long', 'alt', 'utc_offset', 'dst', 'tz', 'type', 'source'])
Airline = collections.namedtuple('Airline', ['id', 'name', 'alias', 'iata', 'icao', 'callsign', 'country', 'active'])
Route = collections.namedtuple('Route', ['airline', 'airline_id', 'source_airport', 'source_airport_id', 'dest_airport', 'dest_airport_id', 'codeshare', 'stops', 'equipment'])

def load_data():
    with open('airports.dat') as f:
        airports = {}
        for line in csv.reader(f):
            airport = Airport._make(line)
            airports[airport.id] = airport

    with open('airlines.dat') as f:
        airlines = {}
        for line in csv.reader(f):
            airline = Airline._make(line)
            airlines[airline.id] = airline

    with open('routes.dat') as f:
        # top-level keyed by source airport code, next level keyed by destination airport code
        routes = collections.defaultdict(lambda: collections.defaultdict(list))
        for line in csv.reader(f):
            route = Route._make(line)
            routes[route.source_airport][route.dest_airport].append(route)

    return airports, airlines, routes

def find_flights(routes, source_airport, destination_airport, max_segments):
    # We implement a basic BFS algorithm for following the routes
    # Taken from http://eddmann.com/posts/depth-first-search-and-breadth-first-search-in-python/
    # A deque pops from the left in O(1); list.pop(0) is O(n).
    queue = collections.deque([(source_airport, [source_airport])])
    while queue:
        airport, path = queue.popleft()
        # path includes the source airport, so it covers len(path) - 1 segments.
        # BFS dequeues paths in order of length, so we can stop at the first one that is too long.
        if len(path) > max_segments:
            return
        for next_airport in set(routes[airport].keys()) - set(path):
            if next_airport == destination_airport:
                yield path + [next_airport]
            else:
                queue.append((next_airport, path + [next_airport]))

solution_random_routes = collections.defaultdict(lambda: collections.defaultdict(list))
for r in random_routes:
    solution_random_routes[r.start_airport][r.end_airport].append(r)
print(list(find_flights(solution_random_routes, 'SFO', 'JFK', 1)))
print(list(find_flights(solution_random_routes, 'SFO', 'JFK', 2)))
print(list(find_flights(solution_random_routes, 'SFO', 'JFK', 3)))

[]
[['SFO', 'ATL', 'JFK'], ['SFO', 'ORD', 'JFK'], ['SFO', 'SAL', 'JFK'], ['SFO', 'LAS', 'JFK'], ['SFO', 'YYZ', 'JFK'], ['SFO', 'PEK', 'JFK'], ['SFO', 'FRA', 'JFK'], ['SFO', 'SAN', 'JFK']]
[['SFO', 'ATL', 'JFK'], ['SFO', 'ORD', 'JFK'], ['SFO', 'SAL', 'JFK'], ['SFO', 'LAS', 'JFK'], ['SFO', 'YYZ', 'JFK'], ['SFO', 'PEK', 'JFK'], ['SFO', 'FRA', 'JFK'], ['SFO', 'SAN', 'JFK'], ['SFO', 'CLT', 'ATL', 'JFK'], ['SFO', 'CLT', 'SJO', 'JFK'], ['SFO', 'CLT', 'SAV', 'JFK'], ['SFO', 'CLT', 'BNA', 'JFK'], ['SFO', 'CLT', 'MSY', 'JFK'], ['SFO', 'CLT', 'TPA', 'JFK'], ['SFO', 'ATL', 'MEX', 'JFK'], ['SFO', 'ATL', 'CHS', 'JFK'], ['SFO', 'ATL', 'FRA', 'JFK'], ['SFO', 'ATL', 'BNA', 'JFK'], ['SFO', 'ATL', 'SAN', 'JFK'], ['SFO', 'ATL', 'ORD', 'JFK'], ['SFO', 'ATL', 'IND', 'JFK'], ['SFO', 'ATL', 'SAL', 'JFK'], ['SFO', 'ATL', 'SAT', 'JFK'], ['SFO', 'ATL', 'CMH', 'JFK'], ['SFO', 'ORD', 'ATL', 'JFK'], ['SFO', 'ORD', 'MEX', 'JFK'], ['SFO', 'ORD', 'MSY', 'JFK'], ['SFO', 'ORD', 'SAT', 'JFK'], ['SFO', 'ORD', 'TPA', 'JFK'

### Cute Modules

#### `turtle` - Turtle graphics

Run the following code. A graphical window should appear that shows your new turtle friend! What other interesting shapes can you make?

In [None]:
import turtle

turtle.left(180)
turtle.forward(200)
turtle.left(180)

turtle.color('red', 'yellow')
turtle.begin_fill()

for _ in range(36):
    turtle.forward(400)
    turtle.left(170)
    if abs(turtle.pos()) < 1:
        break

turtle.end_fill()
turtle.done()

#### `unicodedata` - Unicode Database

Think about your favorite emoji. Can you guess its official name?

In [None]:
import unicodedata

print(unicodedata.lookup('SLICE OF PIZZA'))  # 🍕

print(unicodedata.name('🦄'))  # UNICORN FACE

#### `this` and `antigravity`

Just run the following lines of code.

In [None]:
import this

In [None]:
import antigravity

## Import Semantics

If you've made it this far, congratulations! This was a long lab. If you're interested in the nitty-gritty details of Python's import mechanics, you can read through the [specification of the import system in the official language reference](https://docs.python.org/3/reference/import.html). It's a fairly long read, but it can precisely answer any lingering questions you might have about exactly how Python imports modules and packages.

## Credit
Credit to Sam Redmond (@sredmond) who designed many of the Standard Library/Third-Party Library problems in this lab. Credit also goes to some video/person that Parth watched/talked to about crosswords which he can't remember any more... &#128542;

> With &#129412;s by @psarin and @coopermj