## PAGERANK 

As we have seen, there is a clear positive skew in how busy airports are across the country: **ORD** and **ATL** are two of the busiest airports, while hundreds of other airports operate on a much smaller scale.

##### Goal:
Use the NetworkX graph library to analyze the underlying flow of traffic across all US airports via PageRank.

##### Hypothesis:
By calculating the **PageRank** (popularity) of an airport, we can capture the network flow that exists among US airports. This score reflects the probability that a traveller moving randomly through the network ends up at a given airport. We hope to see a meaningful correlation, positive or negative, between PageRank and `DEP_DEL15`. When we reach the modelling stage, we will experiment with both `ORIGIN_PAGERANK` and `DEST_PAGERANK` as features.
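To make the "random traveller" intuition concrete, here is a minimal pure-Python sketch of weighted PageRank by power iteration. It is not NetworkX's implementation; the function name `pagerank_sketch`, the toy airports, and the flight counts are all hypothetical, and the sketch assumes every airport has at least one outgoing route.

```python
def pagerank_sketch(edges, alpha=0.85, tol=1e-10, max_iter=1000):
    """Weighted PageRank by power iteration.

    edges: dict mapping (origin, dest) -> number of flights (edge weight).
    Assumes every node has at least one outgoing edge (no dangling nodes).
    """
    nodes = sorted({n for pair in edges for n in pair})
    n = len(nodes)
    # Total outgoing weight per node, used to turn counts into probabilities.
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Teleportation: with probability (1 - alpha) jump to a random airport.
        new_rank = {node: (1.0 - alpha) / n for node in nodes}
        # Otherwise follow an outgoing route, chosen in proportion to its weight.
        for (src, dst), w in edges.items():
            new_rank[dst] += alpha * rank[src] * w / out_weight[src]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: HUB exchanges many flights with A and B,
# while A and B have only a handful of direct flights between them.
toy_edges = {
    ("HUB", "A"): 100, ("A", "HUB"): 100,
    ("HUB", "B"): 100, ("B", "HUB"): 100,
    ("A", "B"): 5, ("B", "A"): 5,
}
toy_rank = pagerank_sketch(toy_edges)
print(toy_rank)
```

In this toy network the scores sum to 1 and `HUB` receives the largest share, which is the behavior we expect from airports such as **ORD** or **ATL** in the real graph.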

In [0]:
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import collections
import math


import pyspark.sql.functions as f
from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

### Pull in cleaned flights data

In [0]:
airlines = spark.read.option("header", "true").parquet(f"dbfs:/mnt/mids-w261/team20SSDK/cleaned_data/airlines/airlines_latest_utc/part-00*.parquet")

In [0]:
# A BETWEEN-style date filter may have edge cases, so extract the UTC years explicitly
air_w_year = airlines.select("ORIGIN", "ORIGIN_UTC", "DEST", "DEST_UTC",  
                              f.year("ORIGIN_UTC").alias('utc_org_year'), 
                              f.year("DEST_UTC").alias('utc_dest_year'),                              
                             )
print(air_w_year.count())
display(air_w_year)

ORIGIN,ORIGIN_UTC,DEST,DEST_UTC,utc_org_year,utc_dest_year
BNA,2019-07-31T15:45:00.000+0000,MSP,2019-07-31T17:55:00.000+0000,2019,2019
BNA,2019-07-31T23:15:00.000+0000,MSP,2019-08-01T01:20:00.000+0000,2019,2019
BNA,2019-08-01T00:45:00.000+0000,MSY,2019-08-01T02:05:00.000+0000,2019,2019
BNA,2019-07-31T18:30:00.000+0000,MSY,2019-07-31T20:05:00.000+0000,2019,2019
BNA,2019-07-31T12:45:00.000+0000,MSY,2019-07-31T14:15:00.000+0000,2019,2019
BNA,2019-07-31T21:45:00.000+0000,MSY,2019-07-31T23:15:00.000+0000,2019,2019
BNA,2019-07-31T12:50:00.000+0000,OAK,2019-07-31T17:25:00.000+0000,2019,2019
BNA,2019-08-01T00:10:00.000+0000,PHL,2019-08-01T02:00:00.000+0000,2019,2019
BNA,2019-07-31T15:25:00.000+0000,PHL,2019-07-31T17:30:00.000+0000,2019,2019
BNA,2019-08-01T03:05:00.000+0000,PHL,2019-08-01T05:00:00.000+0000,2019,2019


### Build this metric from training data only

In [0]:
airlines = air_w_year.where(f.col("utc_org_year") < 2019)
airlines.select("utc_org_year").distinct().collect()

### Convert the sample to Pandas
* To use NetworkX
* Ease of visual building

In [0]:

airlines_pd = airlines.toPandas()
airlines_pd.head()

Unnamed: 0,ORIGIN,ORIGIN_UTC,DEST,DEST_UTC,utc_org_year,utc_dest_year
0,PHL,2017-08-24 11:05:00,MCO,2017-08-24 13:34:00,2017,2017
1,PHL,2017-08-25 11:05:00,MCO,2017-08-25 13:34:00,2017,2017
2,PHL,2017-08-26 11:05:00,MCO,2017-08-26 13:34:00,2017,2017
3,PHL,2017-08-27 11:05:00,MCO,2017-08-27 13:34:00,2017,2017
4,PHL,2017-08-28 11:05:00,MCO,2017-08-28 13:34:00,2017,2017


In [0]:
airlines_pd.shape

### Get all unique airports in our flights data

In [0]:
total_unique_airports = list(set(airlines_pd["ORIGIN"]).union(set(airlines_pd["DEST"])))
len(total_unique_airports)

### Initialize NetworkX Directed Graph, and add all the airports as nodes

In [0]:
#Init A Directed Graph
G = nx.DiGraph()

#Add all airports as nodes
G.add_nodes_from(total_unique_airports)

### Group all `ORIGIN -> DEST` flight paths and get the counts

We will use these counts as the weights on this directed graph: the weight of an edge from node A to node B is the number of flights that made this trip in the 2015-2018 time range.

In [0]:
grouped_by_trip = airlines_pd.groupby(['ORIGIN','DEST']).size().reset_index()
grouped_by_trip = grouped_by_trip.rename(columns={0: "count"})
grouped_by_trip.head()

Unnamed: 0,ORIGIN,DEST,count
0,ABE,ATL,3790
1,ABE,CLT,911
2,ABE,DTW,3201
3,ABE,EWR,1
4,ABE,FLL,72


#### Add all the edges of graph with weight as the total number of times a given trip has occurred.

In [0]:
# For every origin and destination, add an edge whose weight is the total number of trips in the training data (2015-2018)
for row in grouped_by_trip.iterrows():
  row_obj = row[1]
  G.add_edge(row_obj["ORIGIN"], row_obj["DEST"], weight = row_obj["count"])
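The grouping and weighting steps above can be illustrated on a handful of made-up flights, without pandas or NetworkX. This is only a sketch: the flight records are hypothetical, and `collections.Counter` stands in for the `groupby(...).size()` call.

```python
from collections import Counter

# Hypothetical flight records as (origin, destination) pairs.
flights = [
    ("ABE", "ATL"), ("ABE", "ATL"), ("ABE", "ATL"),
    ("ATL", "ABE"), ("ATL", "ABE"),
    ("ABE", "DTW"),
]

# One directed edge per distinct route; the weight is how often it was flown.
edge_weights = Counter(flights)

# The weights must add back up to the number of flight records.
assert sum(edge_weights.values()) == len(flights)

for (origin, dest), weight in sorted(edge_weights.items()):
    print(f"{origin} -> {dest}: weight={weight}")
```

Because the graph is directed, `ABE -> ATL` and `ATL -> ABE` are separate edges with their own weights.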

#### Compute the PageRank of the airports

In [0]:
%%time

airport_pr = nx.pagerank(G, alpha=0.85, personalization=None, max_iter=1000, tol=1e-06, nstart=None, weight='weight', dangling=None)
airport_pr_sorted = {k: v for k, v in sorted(airport_pr.items(), key=lambda item: -item[1])}

In [0]:
airport_pr_sorted

#### Some graph metrics

In [0]:
#2015-2018
print("Number of total edges: ", G.number_of_edges())

print("Number of total trips : ", G.size(weight='weight'))

print("Sanity check with grouped trips", sum(grouped_by_trip["count"]))

### Airport Edge Degree

We want to understand how the **In-Degree** and **Out-Degree** are distributed across all the airports we are analyzing.
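As a small illustration of what these two quantities measure, the sketch below counts degrees on a hypothetical set of routes using only the standard library. As with `G.in_degree` and `G.out_degree` called without a weight, the degree here counts distinct routes, not the number of flights.

```python
from collections import Counter

# Hypothetical directed routes (one entry per distinct origin -> destination pair).
routes = [("HUB", "A"), ("HUB", "B"), ("HUB", "C"),
          ("A", "HUB"), ("B", "HUB"), ("C", "A")]

airports = {airport for route in routes for airport in route}

# Out-degree: distinct destinations served; in-degree: distinct origins received.
out_degree = Counter(origin for origin, _dest in routes)
in_degree = Counter(dest for _origin, dest in routes)

# Distribution: how many airports share each degree value (0 included).
in_degree_hist = Counter(in_degree[a] for a in airports)
out_degree_hist = Counter(out_degree[a] for a in airports)
print(dict(in_degree_hist), dict(out_degree_hist))
```

The two histograms are the toy equivalent of the `in_degree_count` and `out_degree_count` tables built in the following cells.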

#### In-Degree Analysis

In [0]:
in_degree_sequence = sorted([d for n, d in G.in_degree], reverse=True)  # degree sequence
in_degree_count = collections.Counter(in_degree_sequence)
in_deg, in_cnt = zip(*in_degree_count.items())

in_degree_count = pd.DataFrame({"in_degree": in_deg, "count":in_cnt})

in_degree_count.describe()

Unnamed: 0,in_degree,count
count,79.0,79.0
mean,52.860759,4.620253
std,43.089571,9.641396
min,0.0,1.0
25%,19.5,1.0
50%,44.0,1.0
75%,75.5,3.0
max,185.0,60.0


In [0]:
fig = px.bar(in_degree_count, x='in_degree', y='count',
             hover_data=['in_degree', 'count'], color='count', height=400,
             title="Airport In-Degree Count")

fig.show()

#### Out-Degree Analysis

In [0]:
out_degree_sequence = sorted([d for n, d in G.out_degree], reverse=True)  # degree sequence
out_degree_count = collections.Counter(out_degree_sequence)
out_degree, out_cnt = zip(*out_degree_count.items())
out_degree_count = pd.DataFrame({"out_degree": out_degree, "count":out_cnt})

out_degree_count.describe()

Unnamed: 0,out_degree,count
count,74.0,74.0
mean,53.22973,4.932432
std,43.777315,9.829131
min,0.0,1.0
25%,18.25,1.0
50%,45.0,2.0
75%,76.25,4.0
max,184.0,59.0


In [0]:
fig = px.bar(out_degree_count, x='out_degree', y='count',
             hover_data=['out_degree', 'count'], color='count', height=400,
             title="Airport Out-Degree Count")

fig.show()

### We have computed a PageRank for every airport

Pull in the `airport_meta` table and add the PageRank score as a new data point for every airport that we have seen in our training data.

In [0]:
airport_meta = spark.read.option("header", "true").parquet(f"dbfs:/mnt/mids-w261/team20SSDK/cleaned_data/station/airport_meta/part-00000*.parquet")
print(airport_meta.count())
display(airport_meta)

ICAO,IATA,usaf,wban,name,country,state,lat,lon,elev,begin,end,STATION,station_tz,pagerank
KBGM,BGM,725150,4725,GREATER BINGHAMTON/E A LINK F,US,NY,42.207,-75.98,486.2,19730101,20190305,72515004725,America/New_York,0.000513908535703126
PADL,DLG,703210,25513,DILLINGHAM AIRPORT,US,AK,59.05,-158.517,26.2,20060101,20190304,70321025513,America/Anchorage,0.0004404265791019735
KINL,INL,727470,14918,FALLS INTERNATIONAL AIRPORT,US,MN,48.561,-93.398,360.6,19730101,20190304,72747014918,America/Chicago,0.0006057517487302925
TJPS,PSE,785203,398,MERCEDITA AIRPORT,RQ,,18.0,-66.55,9.1,19960101,20180831,78520300398,America/Puerto_Rico,0.0005011645446733933
KMSY,MSY,722310,12916,LOUIS ARMSTRONG NEW ORLEANS I,US,LA,29.997,-90.278,1.2,19451001,20190304,72231012916,America/Chicago,0.0061264407747525
NSTU,PPG,917650,61705,TAFUNA/PAGO PAGO INTERNATIONA,AQ,AS,-14.331,-170.714,3.7,19450801,20190303,91765061705,Pacific/Pago_Pago,0.0004284305019084392
KGEG,GEG,727850,24157,SPOKANE INTERNATIONAL AIRPORT,US,WA,47.622,-117.528,717.2,19410811,20190304,72785024157,America/Los_Angeles,0.0018568771858964
KDRT,DRT,722610,22010,DEL RIO INTERNATIONAL AIRPORT,US,TX,29.378,-100.927,304.5,19510501,20190304,72261022010,America/Chicago,0.0004159633739412966
KSNA,SNA,722977,93184,J. WAYNE APT-ORANGE CO APT,US,CA,33.68,-117.866,16.5,19400617,20190304,72297793184,America/Los_Angeles,0.0055382005981556
KBUR,BUR,722880,23152,BURBANK-GLENDALE-PASA ARPT,US,CA,34.201,-118.358,236.2,19430601,20190304,72288023152,America/Los_Angeles,0.0031572126819339


#### Broadcast the PageRank table, and use a UDF to add this feature to all airports.

In [0]:
iata_to_pr = sc.broadcast(airport_pr)

#### Note: airports that we have not seen in training can show up in validation or test.

For these airports we assign an even probability, multiplied by a teleportation factor. Although doing so makes the sum of scores greater than **1.0**, it lets us represent new airports that were presumably built recently, or that flew so little that they did not show up in any of the training data.

$$ pageRankNewAirport = \frac{1}{n} \times telFactor $$
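As a worked instance of this formula, the sketch below plugs in the values hard-coded in the notebook's `map_iata_to_pr` function, $n = 365$ and $telFactor = 0.15$ (which matches $1 - \alpha$ for the `alpha=0.85` passed to `nx.pagerank`). The helper name `fallback_pagerank` and the count of unseen airports are hypothetical.

```python
def fallback_pagerank(n_airports, tel_factor=0.15):
    """Score for an airport absent from the training graph: (1 / n) * telFactor."""
    return (1.0 / n_airports) * tel_factor


# Values used in the notebook's UDF: n = 365 and telFactor = 0.15.
score = fallback_pagerank(365)
print(score)

# Hypothetical illustration: scores over training airports sum to 1.0, so
# assigning the fallback to k unseen airports pushes the total slightly above 1.0.
k_unseen = 3
total = 1.0 + k_unseen * score
print(total)
```

The fallback comes out to roughly 0.00041, the same order of magnitude as the smallest scores visible in the `airport_meta` sample above, so unseen airports are treated like the least-connected airports in the training graph.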

In [0]:
def map_iata_to_pr(iata):
  if iata in iata_to_pr.value:
    return iata_to_pr.value[iata]
  # If the airport was not seen in training, fall back to (1/N) * teleportation factor (N hard-coded as 365, factor 0.15)
  return (1/365)*.15

#convert to a UDF Function by passing in the function and return type of function
udf_map_to_pr = f.udf(map_iata_to_pr, DoubleType())
airport_meta = airport_meta.withColumn("pagerank", udf_map_to_pr("IATA"))
airport_meta.show()

In [0]:
airport_meta.write.mode('overwrite').parquet('dbfs:/mnt/mids-w261/team20SSDK/cleaned_data/station/airport_meta')

### Pagerank Spread

In [0]:
airport_meta_pd = spark.read.option("header", "true").parquet(f"dbfs:/mnt/mids-w261/team20SSDK/cleaned_data/station/airport_meta/part-00*.parquet").toPandas()
airport_meta_pd.head()

Unnamed: 0,ICAO,IATA,usaf,wban,name,country,state,lat,lon,elev,begin,end,STATION,station_tz,pagerank
0,KBGM,BGM,725150,4725,GREATER BINGHAMTON/E A LINK F,US,NY,42.207,-75.98,486.2,19730101,20190305,72515004725,America/New_York,0.000514
1,PADL,DLG,703210,25513,DILLINGHAM AIRPORT,US,AK,59.05,-158.517,26.2,20060101,20190304,70321025513,America/Anchorage,0.00044
2,KINL,INL,727470,14918,FALLS INTERNATIONAL AIRPORT,US,MN,48.561,-93.398,360.6,19730101,20190304,72747014918,America/Chicago,0.000606
3,TJPS,PSE,785203,398,MERCEDITA AIRPORT,RQ,,18.0,-66.55,9.1,19960101,20180831,78520300398,America/Puerto_Rico,0.000501
4,KMSY,MSY,722310,12916,LOUIS ARMSTRONG NEW ORLEANS I,US,LA,29.997,-90.278,1.2,19451001,20190304,72231012916,America/Chicago,0.006126


In [0]:
fig = px.bar(airport_meta_pd.sort_values(by=["pagerank"], ascending=False), x='IATA', y='pagerank',
             hover_data=['IATA', 'pagerank', 'state','station_tz'], color='pagerank',
             title="Station Pagerank Scores", height=400)
fig.show()