The code below aligns the entire routes table to the left.

In [2]:
%%html
<style>
table {float:left}
</style>

## Introduction
In this tutorial, we will analyze flight route data using the PageRank algorithm to rank the popularity of airports. The flight route data comes from the Flight Route Database dataset on Kaggle. This dataset contains 67663 flights among 3425 airports.

The dataset is included in the tutorial as the `routes.csv` file. Below are the columns of the dataset with brief descriptions.

| **routes** column | description |
|:---|:---|
| airline | airline code |
| airline_id | airline id |
| source_airport | airport code of source airport |
| source_airport_id | airport id of source airport |
| destination_airport | airport code of destination airport |
| destination_airport_id | airport id of destination airport | 
| codeshare | whether the flight is codeshared with other airlines |
| stops | number of stops made by the flight | 
| equipment | plane type information |

## Table of contents
- [Data pre-processing and cleaning](#Data-pre-processing-and-cleaning)
- [Generate auxiliary data for PageRank algorithm](#Generate-auxiliary-data-for-PageRank-algorithm)
- [Create adjacency matrix from source and destination airport codes](#Create-adjacency-matrix-from-source-and-destination-airport-codes)
- [Rank popularity of airports using PageRank algorithm](#Rank-popularity-of-airports-using-PageRank-algorithm)
- [Use IATACodes API to translate airport code to name](#Use-IATACodes-API-to-translate-airport-code-to-name)
- [References](#References)

## Data pre-processing and cleaning
The dataset `routes.csv` is first read into a pandas DataFrame, then pre-processed and cleaned via the steps below:

1. Replace column headers in the csv file with column headers defined in the code
2. Replace missing ids (denoted as `\N` in csv file) with 0
3. Encode codeshare column to bool values

Note that only the airport codes are used later in the PageRank algorithm, so replacing missing airline ids and airport ids with 0 does not affect the result. <br/> The replacement only serves to unify the data type of these columns to an integer type.
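
The three cleaning steps can be tried on a tiny in-memory sample before running them on the full file. The sketch below is only an illustration under stated assumptions: the first sample row mirrors the first row of the output shown further down (with an empty codeshare field assumed), the second row is invented to exercise the `\N` and codeshare cases, and `pd.to_numeric` with `errors='coerce'` is used as an alternative to the string replacement in this tutorial.

```python
import io
import pandas as pd

col_names = ['airline', 'airline_id', 'source_airport', 'source_airport_id',
             'destination_airport', 'destination_airport_id', 'codeshare',
             'stops', 'equipment']

# Sample data (partly invented): row 2 uses made-up codes XX, AAA and BBB
sample_csv = io.StringIO(
    "2B,410,AER,2965,KZN,2990,,0,CR2\n"
    "XX,\\N,AAA,\\N,BBB,\\N,Y,0,CR2\n"
)

# step 1: column headers come from the code, not from the file
sample = pd.read_csv(sample_csv, header=None, names=col_names)

# step 2: non-numeric ids such as '\N' become NaN, then 0
for column in ['airline_id', 'source_airport_id', 'destination_airport_id']:
    sample[column] = pd.to_numeric(sample[column], errors='coerce').fillna(0).astype('int32')

# step 3: 'Y' -> True, anything else (including missing) -> False
sample['codeshare'] = sample['codeshare'] == 'Y'

print(sample.dtypes)
```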

In [3]:
import pandas as pd
import numpy as np

In [6]:
col_names = ['airline', 'airline_id', 'source_airport', 'source_airport_id', 'destination_airport', 
             'destination_airport_id', 'codeshare', 'stops', 'equipment']

routes = pd.read_csv('routes.csv', header=None, names=col_names, skiprows=1)
routes['airline_id'] = routes['airline_id'].apply(lambda x: x.replace('\\N', '0')).astype('int32')
routes['source_airport_id'] = routes['source_airport_id'].apply(lambda x: x.replace('\\N', '0')).astype('int32')
routes['destination_airport_id'] = routes['destination_airport_id'].apply(lambda x: x.replace('\\N', '0')).astype('int32')
routes['codeshare'] = routes['codeshare'] == 'Y'
routes['stops'] = routes['stops'].astype('int32')

print('DataFrame schema\n')
print(routes.dtypes)
print('\nFirst 5 rows of DataFrame\n')
print(routes.head())
print('\nTotal %d entries' % routes.shape[0])

DataFrame schema

airline                   object
airline_id                 int32
source_airport            object
source_airport_id          int32
destination_airport       object
destination_airport_id     int32
codeshare                   bool
stops                      int32
equipment                 object
dtype: object

First 5 rows of DataFrame

  airline  airline_id source_airport  source_airport_id destination_airport  \
0      2B         410            AER               2965                 KZN   
1      2B         410            ASF               2966                 KZN   
2      2B         410            ASF               2966                 MRV   
3      2B         410            CEK               2968                 KZN   
4      2B         410            CEK               2968                 OVB   

   destination_airport_id  codeshare  stops equipment  
0                    2990      False      0       CR2  
1                    2990      False      0       CR2  


## Generate auxiliary data for PageRank algorithm
Since we will use airport codes to generate the adjacency matrix, we need to find all the distinct airport codes and store them in a set.

In [7]:
sourceAirportCodes = list(routes['source_airport'])
destinationAirportCodes = list(routes['destination_airport'])
allAirportCodes = sourceAirportCodes + destinationAirportCodes

airportCodeSet = set(allAirportCodes)
print('Total number of distinct airport codes: %d' % (len(airportCodeSet)))

Total number of distinct airport codes: 3425


Based on the airport codes, we will create 2 lookup dictionaries that
* map airport code to matrix index
* map matrix index to airport code

The first dictionary is used when creating the adjacency matrix.

The second dictionary is used to translate the PageRank result (in matrix indices) back to airport codes.

In [8]:
airportCodeToMatrixIndexLookupDict = {}
matrixIndexToAirportCodeLookupDict = {}

for index, code in enumerate(airportCodeSet):
    airportCodeToMatrixIndexLookupDict[code] = index
    matrixIndexToAirportCodeLookupDict[index] = code
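
The same two lookups can also be written as dictionary comprehensions. The sketch below uses a hypothetical three-code set and sorts the codes first, which is an assumption added here: the iteration order of a set of strings can differ between Python runs, so sorting makes the index assignment reproducible. The ranking itself does not depend on which index an airport receives.

```python
# Hypothetical three-code example; the tutorial uses airportCodeSet instead.
codes = {'JFK', 'ATL', 'LAX'}

# sorted() fixes the order, so the same code always gets the same index
code_to_index = {code: index for index, code in enumerate(sorted(codes))}
index_to_code = {index: code for code, index in code_to_index.items()}

print(code_to_index)  # {'ATL': 0, 'JFK': 1, 'LAX': 2}
print(index_to_code[code_to_index['JFK']])  # JFK
```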

In addition, we need to aggregate the number of flights from source to destination airports in a dictionary.
- The key is a tuple of source airport code and destination airport code.
- The value is the number of flights for that airport combination, which is used as the weight in the adjacency matrix.

In [9]:
flightCounts = {}
for index, entry in routes.loc[:, ['source_airport', 'destination_airport']].iterrows():
    # index by column label; positional indexing on a labelled Series is deprecated in recent pandas
    sourceAirport = entry['source_airport']
    destinationAirport = entry['destination_airport']
    # count one more flight for this (source, destination) pair
    key = (sourceAirport, destinationAirport)
    flightCounts[key] = flightCounts.get(key, 0) + 1
print('Total number of entries in flightCounts: %d' % len(flightCounts))

Total number of entries in flightCounts: 37595
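
The aggregation described above is a plain frequency count over (source, destination) tuples, so `collections.Counter` can express it in one line. The sketch below uses a short hypothetical route list in place of the two DataFrame columns.

```python
from collections import Counter

# Hypothetical route list standing in for the two DataFrame columns
sources = ['AER', 'ASF', 'ASF', 'AER']
destinations = ['KZN', 'KZN', 'MRV', 'KZN']

# each key is a (source, destination) tuple, each value its number of flights
flight_counts = Counter(zip(sources, destinations))

print(flight_counts[('AER', 'KZN')])  # 2
print(len(flight_counts))             # 3
```

With the real data, `Counter(zip(routes['source_airport'], routes['destination_airport']))` should produce the same counts as the loop.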


## Create adjacency matrix from source and destination airport codes

There are only 67663 routes (37595 distinct source-destination pairs) to be stored in a 3425 * 3425 matrix, which means fewer than 0.6% of the entries hold a non-zero value. The rest of the entries store value 0.

This calls for a sparse matrix data structure, where only non-zero values are stored. This significantly reduces the time and space required to manipulate the matrix.

In the sparse matrix, the value in the i-th row and j-th column denotes the number of flights from airport j to airport i.
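
The sparsity estimate can be checked with a few lines of arithmetic. The figures are the ones reported in this tutorial; only the variable names are introduced here.

```python
airports = 3425
routes_total = 67663      # rows in routes.csv
distinct_pairs = 37595    # entries in flightCounts

cells = airports * airports
print(cells)  # 11730625

# share of matrix cells that could be non-zero
print('%.2f%%' % (100 * routes_total / cells))    # 0.58%
print('%.2f%%' % (100 * distinct_pairs / cells))  # 0.32%
```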

In [10]:
import scipy.sparse as sp

dimension = len(airportCodeSet)
data = []
row = []
col = []    

for index, entry in routes.loc[:, ['source_airport', 'destination_airport']].iterrows():
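    # index by column label; positional indexing on a labelled Series is deprecated in recent pandas
    sourceAirport = entry['source_airport']
    destinationAirport = entry['destination_airport']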
    data.append(flightCounts[(sourceAirport, destinationAirport)])
    row.append(airportCodeToMatrixIndexLookupDict[destinationAirport])
    col.append(airportCodeToMatrixIndexLookupDict[sourceAirport])

matrix = sp.coo_matrix((data, (row, col)), shape = (dimension, dimension))
print('Created adjacency matrix of dimension %d * %d' % (dimension, dimension))

Created adjacency matrix of dimension 3425 * 3425
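
One property of `coo_matrix` is worth keeping in mind here: when the same (row, col) position is listed more than once, the values are summed on conversion or multiplication. The sketch below shows this on a hypothetical 2 * 2 example, followed by a variant that stores each distinct pair once by iterating over a count dictionary. The variant is an alternative under the assumption that each entry should equal the plain flight count; it is not the construction used in the cell above, so its results may differ.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 2-airport example: position (row 1, col 0) is listed twice
data = [2, 2]
row = [1, 1]
col = [0, 0]

m = sp.coo_matrix((data, (row, col)), shape=(2, 2))
print(m.toarray())  # duplicates are summed, so entry [1, 0] is 4

# Alternative: one entry per distinct (source index, destination index) pair
flight_counts = {(0, 1): 2}
pairs = list(flight_counts.items())
m_once = sp.coo_matrix(
    ([count for _, count in pairs],
     ([dst for (_, dst), _ in pairs],    # row = destination
      [src for (src, _), _ in pairs])),  # col = source
    shape=(2, 2))
print(m_once.toarray())  # entry [1, 0] is 2
```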


## Rank popularity of airports using PageRank algorithm
We will not go into the details of implementing the PageRank algorithm, which has been thoroughly covered in the lecture and homework.

The basic idea is that the more flights there are to an airport A, the higher the PageRank score of airport A will be.

Another point to note is that not all flights carry the same weight. Suppose John F Kennedy International (JFK) airport has a very high score (unsurprisingly). Then flights from JFK to an airport A carry a higher weight, i.e. contribute more to the PageRank score of airport A, than flights from a lesser-known airport.
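
For reference, the update applied in each iteration of the `PageRank` function below can be written as a single formula. The notation is introduced here for illustration only: $n$ is the number of airports, $M$ the column-normalized adjacency matrix, $D$ the set of airports with no outgoing flights, $d$ the damping factor, and $\mathbf{1}$ the all-ones vector.

$$
x^{(t+1)} = d\,M\,x^{(t)} + \frac{d}{n}\Big(\sum_{j \in D} x^{(t)}_j\Big)\mathbf{1} + \frac{1-d}{n}\Big(\sum_{j=1}^{n} x^{(t)}_j\Big)\mathbf{1}
$$

The three terms correspond to `x_A`, `x_B` and `x_C` in the code.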

In [11]:
def PageRank(matrix):
    
    d = 0.85    # damping factor
    iters = 100 # number of iterations

    x = np.ones(dimension)/dimension

    # Normalize each column of the adjacency matrix to sum to 1, then multiply by d
    matrix_col_sum = np.asarray(matrix.sum(0))[0]
    matrix.data = (matrix.data/matrix_col_sum[matrix.col])
    matrix = matrix * d

    # B handles all-zero columns (airports with no outgoing flights),
    # which are treated as if they were changed to all-ones
    B = np.array([int(not bool(item)) for item in matrix_col_sum])*d/dimension

    for _ in range(iters):
        x_A = matrix @ x                                       # score passed along flights
        x_B = np.ones(dimension) * sum(B*x)                    # score from all-zero columns
        x_C = (1 - d)/dimension * np.ones(dimension) * sum(x)  # random jump term
        x = x_A + x_B + x_C
    
    return x

result = PageRank(matrix)
ranksUnsorted = {}
for index, score in enumerate(result):
    code = matrixIndexToAirportCodeLookupDict[index]
    ranksUnsorted[code] = score
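
To see the effect of the iteration on something small enough to check by hand, the sketch below runs the same kind of update on a hypothetical three-airport network (A -> C, B -> C, C -> A) using a dense NumPy matrix. The function name `pagerank_dense` and the toy network are made up for this illustration; it is a minimal version, not the sparse implementation above.

```python
import numpy as np

def pagerank_dense(adjacency, d=0.85, iters=100):
    """Power iteration on a small dense matrix; adjacency[i, j] = flights from j to i."""
    n = adjacency.shape[0]
    col_sum = adjacency.sum(axis=0)
    dangling = (col_sum == 0)                     # airports with no outgoing flights
    safe_sum = np.where(dangling, 1.0, col_sum)   # avoid dividing by zero
    M = adjacency / safe_sum                      # each non-empty column sums to 1
    x = np.ones(n) / n
    for _ in range(iters):
        x = d * (M @ x) + d * x[dangling].sum() / n + (1 - d) * x.sum() / n
    return x

# Hypothetical network with indices A=0, B=1, C=2: A -> C, B -> C, C -> A
toy = np.array([[0, 0, 1],
                [0, 0, 0],
                [1, 1, 0]], dtype=float)
scores = pagerank_dense(toy)
print(scores)
```

C receives flights from both other airports and ends up with the highest score. B receives none, so it stays at the baseline (1 - d)/n = 0.05. A ranks in between because its only incoming flight comes from the high-scoring C.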

Use the `sorted` function to sort all airports by their PageRank score. <br/> Notice the trick in the lambda expression: the negative of the score is used as the sort key, so the airports are sorted in descending order of score.

In [14]:
ranksSorted = sorted(list(ranksUnsorted.items()), key=lambda x: -x[1])
print('Top 20 airports\n')
for index, entry in enumerate(ranksSorted[:20]):
    rank = index + 1
    print('No. %d %s: Score %f' % (rank, entry[0], entry[1]))

Top 20 airports

No. 1 ATL: Score 0.018729
No. 2 LAX: Score 0.009450
No. 3 LHR: Score 0.008350
No. 4 ORD: Score 0.008296
No. 5 SIN: Score 0.008077
No. 6 JFK: Score 0.007410
No. 7 CDG: Score 0.006026
No. 8 DFW: Score 0.006007
No. 9 BKK: Score 0.005549
No. 10 MIA: Score 0.005478
No. 11 SYD: Score 0.005343
No. 12 DEN: Score 0.005267
No. 13 PEK: Score 0.005248
No. 14 FRA: Score 0.005061
No. 15 ICN: Score 0.004869
No. 16 DME: Score 0.004656
No. 17 BCN: Score 0.004564
No. 18 HKG: Score 0.004469
No. 19 PVG: Score 0.004442
No. 20 BNE: Score 0.004406
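
The negated sort key is one way to get descending order; passing `reverse=True` to `sorted` is another. The sketch below compares the two on three scores taken from the output above.

```python
# Three scores copied from the output above, deliberately out of order
scores = {'LHR': 0.008350, 'ATL': 0.018729, 'LAX': 0.009450}

by_negated_key = sorted(scores.items(), key=lambda x: -x[1])
by_reverse = sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(by_negated_key == by_reverse)  # True
print(by_reverse[0][0])              # ATL
```

The `reverse=True` form also works for keys that cannot be negated, such as strings.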


## Use IATACodes API to translate airport code to name
We use the IATACodes API to look up airport names based on airport codes.

Below is the code skeleton that calls the IATACodes API to get all airport codes and names.

The API key is intentionally omitted to keep my own API key private. Follow the instructions below to set up an API key before running the code block below.

1. Go to http://iatacodes.org, click on **FREE Access**, and register for access. The API key will be displayed on the webpage and also sent to you via email
2. Execute the command below from the directory where this Jupyter notebook is located

```
echo YOUR_API_KEY > api_key.txt
```

**Points to note**
* The API imposes a cap of 250 requests/minute and 2500 requests/hour, so avoid rerunning the code blocks below more often than necessary.
* Due to issues with the IATACodes website's certificate, sending an HTTPS request to their API using Python's `requests` package fails with a `CERTIFICATE_VERIFY_FAILED` error. <br/> Therefore I have disabled SSL certificate verification by setting the `verify` parameter to `False`, and disabled unverified HTTPS request warnings.
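
One way to stay well under the request cap is to save the API response to a local file and reuse it on later runs. The sketch below is a minimal illustration: `load_or_fetch`, the cache file name and `fake_fetch` are all hypothetical names introduced here, and the fake fetch function stands in for the real API call so that the example runs without a key.

```python
import json
import os
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return cached JSON if cache_path exists, otherwise call fetch() and cache its result."""
    if os.path.exists(cache_path):
        with open(cache_path, 'r') as f:
            return json.load(f)
    result = fetch()
    with open(cache_path, 'w') as f:
        json.dump(result, f)
    return result

calls = []

def fake_fetch():
    # stand-in for the real request; records how often it is called
    calls.append(1)
    return {'response': [{'code': 'ATL', 'name': 'Hartsfield-jackson Atlanta International'}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'airports.json')
    first = load_or_fetch(path, fake_fetch)
    second = load_or_fetch(path, fake_fetch)  # served from the cache file

print(len(calls))  # 1
```

In the notebook, the second argument could be a lambda wrapping the `requests.get(...).json()` call from the next cell.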

In [15]:
import requests
import urllib3

# disable unverified HTTPS request warnings
urllib3.disable_warnings()

url = 'https://iatacodes.org/api/v6/airports'

with open('api_key.txt', 'r') as f:
    api_key = f.read().strip()  # drop the trailing newline (and any stray whitespace) written by echo

params = {
    'api_key': api_key
}

response = requests.get(url, params=params, verify=False).json()

Generate airport name lookup dictionary

In [16]:
airportNameLookupDict = {}
for airport in response['response']:
    code = airport['code']
    name = airport['name']
    airportNameLookupDict[code] = name

print('Total number of airports retrieved from IATACodes: %d' % (len(airportNameLookupDict)))

Total number of airports retrieved from IATACodes: 10051


Use the lookup dictionary to translate the top airport codes to airport names and print them out.
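
A plain `dict[key]` lookup raises `KeyError` if a ranked code happens to be missing from the names returned by the API. The sketch below shows a fallback with `dict.get` that prints the raw code instead. The lookup and the `XXX` entry are hypothetical and only serve to trigger the fallback.

```python
# Hypothetical lookup that is missing one of the ranked codes
names = {'ATL': 'Hartsfield-jackson Atlanta International'}
ranked = [('ATL', 0.018729), ('XXX', 0.001)]  # 'XXX' and its score are made up

# fall back to the raw code when no name is known
labels = [names.get(code, code) for code, _ in ranked]

for rank, label in enumerate(labels, start=1):
    print('No. %d: %s' % (rank, label))
```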

In [17]:
topAirportList = {}
for index, entry in enumerate(ranksSorted[:20]):
    rank = index + 1
    name = airportNameLookupDict[entry[0]]
    topAirportList[rank] = name

print('Top 20 airport names\n')
for rank, name in topAirportList.items():
    print('No. %d: %s' % (rank, name))

Top 20 airport names

No. 1: Hartsfield-jackson Atlanta International
No. 2: Los Angeles International
No. 3: Heathrow
No. 4: Chicago O'hare International
No. 5: Singapore Changi
No. 6: John F Kennedy International
No. 7: Charles De Gaulle
No. 8: Dallas/Fort Worth International
No. 9: Suvarnabhumi International
No. 10: Miami International Airport
No. 11: Kingsford Smith
No. 12: Denver International
No. 13: Beijing Capital International
No. 14: Frankfurt International Airport
No. 15: Seoul (Incheon)
No. 16: Domodedovo
No. 17: El Prat De Llobregat
No. 18: Hong Kong International
No. 19: Shanghai Pudong International
No. 20: Brisbane International


The output above lists the top 20 most popular airports generated by our program. You should get the same result by running the above code block.

## References

1. Flight Route Database: https://www.kaggle.com/open-flights/flight-route-database
2. IATACodes API: http://iatacodes.org
3. PageRank Algorithm: https://en.wikipedia.org/wiki/PageRank