<span class='note'>*Make me look good.* Click on the cell below and press <kbd>Ctrl</kbd>-<kbd>Enter</kbd>.</span>

In [None]:
from IPython.display import HTML
def css_styling():
    # Read the custom stylesheet and return it as HTML for the notebook to render
    with open('css/custom.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()

<h5 class='prehead'>SA367 &middot; Mathematical Models for Decision Making &middot; Spring 2018 &middot; Uhan</h5>

<h5 class='lesson'>Lesson 5.</h5>

<h1 class='lesson_title'>The Mileage Running Problem</h1>

## The problem

Professor May B. Wright needs to fly from Baltimore (BWI) to Los Angeles (LAX) to attend a conference.
She thinks this would be the perfect opportunity to accumulate some frequent flyer miles on American Airlines (AA), where she already has Platinum status.

Looking into flights on AA, she sees that every itinerary from BWI to LAX costs roughly the same.
She has a full day to spare for travel, so she wants to know: which sequence of AA domestic flights starting at BWI and ending at LAX over the course of one day will allow her to accumulate the most miles?

* Yes, people actually do this. This is known as __mileage running__. 
    - Apparently, this has become harder to do in recent years.
    - [A 2014 article from the New York Times](https://www.nytimes.com/2014/09/14/upshot/the-fadeout-of-the-mileage-run.html).
    - [An older article from Wired](https://www.wired.com/2007/07/mileage-runner/).

## Modeling the problem

* Suppose we have a database of every AA domestic flight on a given day.

* In particular, for each flight, we have:
    - the flight number
    - the origin airport
    - the destination airport
    - the departure time at the origin airport
    - the arrival time at the destination airport
    - the distance traveled in miles

* How can we formulate Professor Wright's problem as a shortest path problem?

## pandas (the package, not the animals)

* In the same folder as this notebook, there is a file called `aa_domestic_flights.csv` with the database described above.

* `.csv` stands for **comma-separated values**.

* We can view `.csv` files in Excel - let's see what's in this file. _Cut to Excel..._

* How can we use this data in Python? With __pandas__.

* pandas is a Python package for data analysis. 
    - It's especially useful for cleaning and manipulating datasets.

* pandas does a lot of stuff &mdash; here are a few resources:
    - [Here is the official documentation for pandas](http://pandas.pydata.org/pandas-docs/stable/index.html).
    - [Chris Albon's notes](http://chrisalbon.com) are also a good resource on how to get things done with pandas (look in the _Data Wrangling_ section). 
    
* In this lesson, we'll use pandas in a very basic way to help us set up the shortest path problem we formulated above.

* To install pandas, open a <span class="rred">WinPython Command Prompt</span> and type

```
pip install pandas
```

* `pip` might tell you that pandas is already installed. If not, it should go ahead and install it for you.

* To use pandas, we first need to import it, like this:

In [None]:
import pandas as pd

* A `pandas` __DataFrame__ is just a two-dimensional table, with rows and columns.

* We can use the __`read_csv()`__ function in pandas to read `aa_domestic_flights.csv` into a DataFrame called `df`, like this:

In [None]:
# Read csv file into a DataFrame
# Designate departure and arrival time columns as dates
df = pd.read_csv('aa_domestic_flights.csv', parse_dates=['DEP_TIME', 'ARR_TIME'])

* By default, `read_csv()` assumes the first row of the csv file contains the names of each column.

* The `parse_dates` argument tells `read_csv()` which columns correspond to dates, so that we can perform date-specific calculations on these columns later.

* [Here is the official documentation for `read_csv()`.](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

* It's a good idea to take a quick look at the DataFrame `read_csv()` creates, just in case something went wrong.

* To examine the first 5 rows of a DataFrame, we can use the `.head()` method:

In [None]:
# Print the first 5 rows of df
df.head()
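
* To see what `parse_dates` buys us without opening the real file, here is a minimal sketch on a tiny made-up table. The column names match the ones used in this lesson, but the rows, flight identifiers, and timestamp format are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up rows using the lesson's column names (values are hypothetical)
csv_text = """FLIGHT,ORIGIN,DEST,DEP_TIME,ARR_TIME,DISTANCE
9001-BWI-ORD,BWI,ORD,2018-02-01 06:00,2018-02-01 07:15,621
9002-ORD-LAX,ORD,LAX,2018-02-01 08:30,2018-02-01 10:55,1744
"""

# read_csv() accepts any file-like object, so we wrap the text in StringIO
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["DEP_TIME", "ARR_TIME"])

# The two time columns are now datetimes rather than plain strings
print(toy.dtypes)

# Because they are datetimes, we can subtract them to get durations
# (this toy example ignores time zones)
duration = toy["ARR_TIME"] - toy["DEP_TIME"]
print(duration)
```

* Without `parse_dates`, the two time columns would be read as ordinary strings, and the subtraction above would fail.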

* Another useful method is `.describe()`. 

* By default, `.describe()` only provides summary statistics for the columns with numeric data. 

* To get summary statistics for all the columns, include the argument `include="all"`, like this:

In [None]:
# Get summary statistics for all columns in df
df.describe(include="all")

* A column by itself is called a **Series**.

* You can select the Series `DEST` of the DataFrame `df` like this:

```
df["DEST"]
```

* So, to print the Series `DEST`, we could write:

In [None]:
# Print the DEST column
print(df["DEST"])

## Setting up the shortest path problem in networkx

* Now that we can access the flight database in Python, we can use its contents to set up the shortest path problem we formulated above.

* First, let's import `networkx` and `bellmanford` so we can use them:

In [None]:
import networkx as nx
import bellmanford as bf

### Creating a list of flights

* It will be useful to create a variable `flights` containing a list of all the flights.

* What part of the dataset contains this information?

_Write your notes here. Double-click to edit._

* From the `.describe()` output above, we see that the flights in `df["FLIGHT"]` are unique.

* We can convert the Series `df["FLIGHT"]` to a list with the function `list()`. 
    - Then we can use the list methods we learned about earlier, such as `.append()`, if necessary.

In [None]:
# Take the FLIGHT column from df, convert it to a list
flights = list(df["FLIGHT"])

* It's a good idea to make sure nothing funny happened &mdash; let's inspect the variable `flights` we just created:

In [None]:
# Print flights
print("Flights: {0}".format(flights))

* You might want to click on the left of the output above &mdash; this will collapse the output so it doesn't take over your browser window.

* Let's also make sure we have the right number of flights in the variable `flights`:

In [None]:
print("Number of flights: {0}".format(len(flights)))
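
* Instead of reading uniqueness off the `.describe()` output, we can also ask pandas directly. The sketch below uses the `.is_unique` attribute and the `.nunique()` method of a Series on two small made-up Series; the flight identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical flight identifiers, made up for illustration
unique_ids = pd.Series(["9001-BWI-ORD", "9002-ORD-LAX", "9003-BWI-LAX"])
repeated_ids = pd.Series(["9001-BWI-ORD", "9001-BWI-ORD", "9002-ORD-LAX"])

# .is_unique is True only when no value appears more than once
print(unique_ids.is_unique)    # True
print(repeated_ids.is_unique)  # False

# Comparing the length with the number of distinct values tells the same story
print(len(repeated_ids), repeated_ids.nunique())  # 3 2
```

* The same check on the real data would be `df["FLIGHT"].is_unique`.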

### Creating a list of airports

* It will also be useful to have a variable `airports` containing a list of all the airports.

* What part of the dataset contains this information?

_Write your notes here. Double-click to edit._

* We can create the list `airports` like this:

In [None]:
# Convert the ORIGIN and DEST columns from df into sets, 
# take their union, convert to a list
airports = list(set(df["ORIGIN"]) | set(df["DEST"]))

* Um... what does this do??

* Let's try a smaller example and look at what's going on step-by-step.

* Pretend that `A` and `B` defined below are the `ORIGIN` and `DEST` columns from `df`.

In [None]:
# Pretend that A and B are the ORIGIN and DEST columns from df
A = ['BWI', 'BWI', 'ORD', 'ORD']
B = ['LAX', 'ORD', 'SFO', 'LAX']

* In Python, a __set__ is an unordered collection of unique elements, just like the usual mathematical definition.

* `set(A)` takes all the entries of `A` and converts them into a set. This eliminates any duplicates within `A`.

* Same goes for `set(B)`.

In [None]:
# Convert A and B into sets, print them out
print(set(A))
print(set(B))

* The `|` operator takes the __union__ of the sets, like this:

In [None]:
# Take the union of set(A) and set(B), print it out
print(set(A) | set(B))

* This is almost what we want: we have every airport exactly once, but they are in a set, not a list.

* Sets are similar to lists, but have their own methods. We can turn the set into a list with the function `list()`, like this:

In [None]:
# Print out the union of set(A) and set(B), converted to a list
print(list(set(A) | set(B)))

* See how that works? That's why `airports` defined above contains a list of all the airports in our dataset.

* Let's make sure everything looks OK with `airports`:

In [None]:
# Print list of airports
print('Airports: {0}'.format(airports))

# Print number of airports
print('Number of airports: {0}'.format(len(airports)))
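
* Because sets are unordered, the order of the airports in the list can differ from one run to the next. If a predictable order is useful, one option is to sort the union, as in this small sketch that reuses the toy lists `A` and `B` from above:

```python
# Same toy lists as above
A = ['BWI', 'BWI', 'ORD', 'ORD']
B = ['LAX', 'ORD', 'SFO', 'LAX']

# sorted() accepts any collection and always returns a list,
# so it removes the need for a separate call to list()
airports_sorted = sorted(set(A) | set(B))

print(airports_sorted)       # ['BWI', 'LAX', 'ORD', 'SFO']
print(len(airports_sorted))  # 4
```

* The order of `airports` does not matter for the shortest path problem, so this is purely a convenience when inspecting the output.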

### Adding nodes with attributes

* Now we're ready to build the shortest path graph. Let's start with an empty directed graph:

In [None]:
# Create empty NetworkX digraph
G = nx.DiGraph()

* Next, let's create a "start" and "end" node.

In [None]:
# Create start and end nodes
G.add_node("start")
G.add_node("end")

* Now, we need to add a node for each flight, or each row of our database.

* We can quickly iterate through the rows of a DataFrame using the `.itertuples()` method:

```python
for row in df.itertuples():
    # Put some code here
    # row.COLUMN_NAME = value of column COLUMN_NAME in the current row
```

* So we can add a node for each flight like this:

In [None]:
# Add a node for each flight
for row in df.itertuples():
    G.add_node(row.FLIGHT, origin=row.ORIGIN, dest=row.DEST, dep_time=row.DEP_TIME, arr_time=row.ARR_TIME, distance=row.DISTANCE)

* Wait &mdash;
```python
G.add_node(row.FLIGHT)
```
adds a node whose name is the value of `row.FLIGHT`. What is all the other stuff?

* Remember in the last lesson when we added the "length" attribute to each edge? Like this?

```python
G.add_edge(1, 2, length=9)
```
* We can add attributes to nodes as well. 

* The code above adds attributes called `origin`, `dest`, `dep_time`, `arr_time`, and `distance` to each node.
    - This will be handy later.

* To access a particular attribute of a node, we write something like this:

In [None]:
# Print the departure time of flight "1-BOS-JFK"
# (G.nodes works in NetworkX 2.0 and later; the older G.node was removed in 2.4)
print(G.nodes["1-BOS-JFK"]["dep_time"])

* The `.number_of_nodes()` method applied to a `networkx` graph &mdash; well, you can guess what it does. Or, you can just try it out:

In [None]:
# Print number of nodes in G
print(G.number_of_nodes())
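
* Here is a small sketch of how node attributes behave, on a separate toy graph `H` so that `G` is left untouched. The flight name and attribute values are made up. The sketch uses `H.nodes[...]`, which works in NetworkX 2.0 and later.

```python
import networkx as nx

# A toy graph with one made-up flight node
H = nx.DiGraph()
H.add_node("9001-BWI-ORD", origin="BWI", dest="ORD", distance=621)

# All attributes of a node are stored together in a dictionary
attrs = H.nodes["9001-BWI-ORD"]
print(attrs)

# A single attribute is looked up by its name
print(H.nodes["9001-BWI-ORD"]["dest"])  # ORD

# nodes(data=True) gives (name, attributes) pairs for every node
for name, node_attrs in H.nodes(data=True):
    print(name, node_attrs["distance"])
```

* Nodes added without attributes, like "start" and "end", simply have an empty attribute dictionary.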

### Adding edges

* Now we can go through every pair of flight nodes and check whether we need to add an edge between them.
    - Remember that the length of each edge is the <span class="rred">negative</span> of the distance of the first flight.

* To add or subtract times, we need to use `pd.to_timedelta()` &mdash; [here is the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html).
    - For example, to subtract 30 minutes, we would write 
    ```python
    some_time_variable - pd.to_timedelta(30, unit="m")
    ```

* This might seem awkward, but if you think about it, working with dates and time _is_ awkward &mdash; you need to keep track of different (non-base-10) units.
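
* As a quick illustration of this kind of time arithmetic, here is a sketch with two made-up timestamps. Notice that the comparison is strict: a layover of exactly 45 minutes does not pass the test.

```python
import pandas as pd

# Made-up arrival and departure times, exactly 45 minutes apart
arrival = pd.Timestamp("2018-02-01 07:15")
departure = pd.Timestamp("2018-02-01 08:00")

# Adding a timedelta to a timestamp gives another timestamp
earliest = arrival + pd.to_timedelta(45, unit="m")
print(earliest)  # 2018-02-01 08:00:00

# Timestamps can be compared directly
print(earliest < departure)   # False: exactly 45 minutes is not enough
print(earliest <= departure)  # True
```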

In [None]:
# Iterate through every pair of flight nodes
for first in flights:
    for second in flights:
        
        # If the first flight arrives where the second flight departs...
        if G.nodes[first]["dest"] == G.nodes[second]["origin"]:
            
            # And if the first flight arrives more than 45 minutes before the second flight leaves,
            # add an edge from the first flight to the second
            if G.nodes[first]["arr_time"] + pd.to_timedelta(45, unit="m") < G.nodes[second]["dep_time"]:
                G.add_edge(first, second, length=-G.nodes[first]["distance"])

* Finally, we need to add edges:
    - from the start node to all flights departing from BWI, and
    - from all flights arriving at LAX to the end node.

In [None]:
# Iterate through all flights
for flight in flights:

    # If the flight departs from BWI, 
    # add an edge from start to this flight
    if G.nodes[flight]["origin"] == "BWI":
        G.add_edge("start", flight, length=0)
        
    # If the flight arrives at LAX, 
    # add an edge from this flight to end
    # (its length counts the miles of this last flight)
    if G.nodes[flight]["dest"] == "LAX":
        G.add_edge(flight, "end", length=-G.nodes[flight]["distance"])

* Similar to `G.number_of_nodes()`, we can use `G.number_of_edges()` as a sanity check on our work.

In [None]:
# Print the number of edges in G
print(G.number_of_edges())
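
* To check our understanding of the connection rule, here is a sketch that applies it to three made-up flights on a separate toy graph `T`. The flight names, times, and distances are hypothetical. Only one of the two possible connections at ORD leaves enough time.

```python
import networkx as nx
import pandas as pd

# Three made-up flights: X2 leaves ORD too soon after X1 arrives, X3 does not
toy_flights = {
    "X1": {"origin": "BWI", "dest": "ORD", "distance": 621,
           "dep_time": pd.Timestamp("2018-02-01 06:00"),
           "arr_time": pd.Timestamp("2018-02-01 07:15")},
    "X2": {"origin": "ORD", "dest": "LAX", "distance": 1744,
           "dep_time": pd.Timestamp("2018-02-01 07:45"),
           "arr_time": pd.Timestamp("2018-02-01 10:10")},
    "X3": {"origin": "ORD", "dest": "LAX", "distance": 1744,
           "dep_time": pd.Timestamp("2018-02-01 09:00"),
           "arr_time": pd.Timestamp("2018-02-01 11:25")},
}

# Add one node per flight, with the dictionary entries as node attributes
T = nx.DiGraph()
for name, attrs in toy_flights.items():
    T.add_node(name, **attrs)

# Same rule as above: same airport, and more than 45 minutes to connect
layover = pd.to_timedelta(45, unit="m")
for first in toy_flights:
    for second in toy_flights:
        same_airport = T.nodes[first]["dest"] == T.nodes[second]["origin"]
        enough_time = T.nodes[first]["arr_time"] + layover < T.nodes[second]["dep_time"]
        if same_airport and enough_time:
            T.add_edge(first, second, length=-T.nodes[first]["distance"])

# List the edges together with their lengths
toy_edges = sorted(T.edges(data="length"))
print(toy_edges)  # [('X1', 'X3', -621)]
```

* Counting edges by hand on a tiny example like this is a good way to convince ourselves that the number printed for the full graph is plausible.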

## Solving the shortest path problem, interpreting the output

* Now that we have our directed graph set up, we can solve for the shortest path from the start node to the end node just like we did in the last lesson:

In [None]:
# Solve the shortest path problem using Bellman-Ford
length, nodes, negative_cycle = bf.bellman_ford(G, source="start", target="end", weight="length")

# Print output from Bellman-Ford
print("Negative cycle? {0}".format(negative_cycle))
print("Shortest path length: {0}".format(length))
print("Shortest path: {0}".format(nodes))
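
* To practice reading this kind of output, here is a sketch on a tiny hand-built graph. The flight names and distances are made up. Note that it uses NetworkX's built-in `nx.bellman_ford_path()` and `nx.bellman_ford_path_length()` functions instead of the `bellmanford` package used in this lesson, so the sketch only needs `networkx`.

```python
import networkx as nx

# Toy graph: one direct flight "A" (2329 miles) versus
# a connection "B" then "C" (621 + 1744 miles); all values are made up
toy = nx.DiGraph()
toy.add_edge("start", "A", length=0)
toy.add_edge("start", "B", length=0)
toy.add_edge("B", "C", length=-621)    # negative distance of flight B
toy.add_edge("A", "end", length=-2329) # negative distance of flight A
toy.add_edge("C", "end", length=-1744) # negative distance of flight C

# Shortest path and its length, using NetworkX's own Bellman-Ford routines
toy_path = nx.bellman_ford_path(toy, "start", "end", weight="length")
toy_length = nx.bellman_ford_path_length(toy, "start", "end", weight="length")

print(toy_path)    # ['start', 'B', 'C', 'end']
print(toy_length)  # -2365

# The itinerary is the path without start and end,
# and the miles earned are the negative of the path length
toy_itinerary = toy_path[1:-1]
toy_miles = -toy_length
print(toy_itinerary, toy_miles)  # ['B', 'C'] 2365
```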

* What does the output tell us about how to solve Professor Wright's problem?

_Write your notes here. Double-click to edit._

## On your own...

Suppose Professor Wright wants to find the itinerary from IAD (Washington DC - Dulles) to SAN (San Diego) that accumulates the most miles instead.

In the cell below, write the code that sets up and solves the shortest path formulation for her problem from start to finish.

In the cell after, describe in words what the output from the Bellman-Ford algorithm tells you about how to solve Professor Wright's problem.

In [None]:
# Import packages
import pandas as pd
import networkx as nx
import bellmanford as bf

# Read csv file into a DataFrame
# Designate departure and arrival time columns as dates
df = pd.read_csv('aa_domestic_flights.csv', parse_dates=['DEP_TIME', 'ARR_TIME'])

# Create empty NetworkX digraph
G = nx.DiGraph()

# Create start and end nodes
G.add_node("start")
G.add_node("end")

# Add a node for each flight
for row in df.itertuples():
    G.add_node(row.FLIGHT, origin=row.ORIGIN, dest=row.DEST, dep_time=row.DEP_TIME, arr_time=row.ARR_TIME, distance=row.DISTANCE)

# Create a list of all the flights
flights = list(df["FLIGHT"])

# Iterate through every pair of flight nodes
for first in flights:
    for second in flights:
        
        # If the first flight arrives where the second flight departs...
        if G.nodes[first]["dest"] == G.nodes[second]["origin"]:
            
            # And if the first flight arrives more than 45 minutes before the second flight leaves,
            # add an edge from the first flight to the second
            if G.nodes[first]["arr_time"] + pd.to_timedelta(45, unit="m") < G.nodes[second]["dep_time"]:
                G.add_edge(first, second, length=-G.nodes[first]["distance"])

# Iterate through all flights
for flight in flights:

    # If the flight departs from IAD, 
    # add an edge from start to this flight
    if G.nodes[flight]["origin"] == "IAD":
        G.add_edge("start", flight, length=0)
        
    # If the flight arrives at SAN, 
    # add an edge from this flight to end
    # (its length counts the miles of this last flight)
    if G.nodes[flight]["dest"] == "SAN":
        G.add_edge(flight, "end", length=-G.nodes[flight]["distance"])
        
# Solve the shortest path problem using Bellman-Ford
length, nodes, negative_cycle = bf.bellman_ford(G, source="start", target="end", weight="length")

# Print output from Bellman-Ford
print("Negative cycle? {0}".format(negative_cycle))
print("Shortest path length: {0}".format(length))
print("Shortest path: {0}".format(nodes))

_Write your notes here. Double-click to edit._