# Content Draft

* https://www.kaggle.com/divyanshrai/airport-route-analysis
* https://www.kaggle.com/divyanshrai/graphing-airport-route-analysis
* https://towardsdatascience.com/catching-that-flight-visualizing-social-network-with-networkx-and-basemap-ce4a0d2eaea6
* https://www.kaggle.com/lserafin/simple-exploration-notebook-flight-routes
* https://openflights.org/data.html


(Part I - Intro)
Our Topic
Why we chose this topic + objective
Data source: Openflights 
Technology/library used


(Part II - EDA)
- Describe datasets used (what are the fields / no. of entries / etc.)
- Perform EDA with plots

* Top 10 countries/cities with the most airports

* Top 10 airports with the most (in/out) air routes

* Top 10 airlines with the most flights

* Top 10 plane models used


(Part III - Network Visualization)
Plot flight network with NetworkX graph (no map, only network structure)
- Plot all airports in the world
- Plot airports in one chosen continent & share insights

Plot flight network on map
- show different map region / graph style, share insights

(Part IV - Network Analysis I: Test the six degrees of separation theory)

- Describe how to address the "local airports not connected" issue
Possible approaches:
1. Manually create a link between any two airports in the same city
2. Change the unit of analysis from airport to city
My suggested approach: don't deal with this
(Reasons: this is just a study for us to apply network knowledge; we want to use airports/flights as units, so the initial setup is good enough)
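
Approach 2 could be sketched roughly as follows. This assumes a route table with `Source`/`Dest` columns like the one built later in this notebook; the three toy routes and the `city_of` mapping are made up for illustration and are not OpenFlights data.

```python
import pandas as pd

# Hypothetical toy data: two airports (A1, A2) in city "X", one (B1) in city "Y"
city_of = {"A1": "X", "A2": "X", "B1": "Y"}
routes = pd.DataFrame({"Source": ["A1", "B1", "A2"],
                       "Dest":   ["B1", "A2", "A1"]})

# Replace each airport code with the city it belongs to
city_routes = pd.DataFrame({
    "Source": routes["Source"].map(city_of),
    "Dest": routes["Dest"].map(city_of),
})

# Drop routes that start and end in the same city, then collapse duplicates
city_routes = city_routes[city_routes["Source"] != city_routes["Dest"]]
city_routes = city_routes.drop_duplicates().reset_index(drop=True)

print(city_routes)
```

With cities as nodes, two airports in the same city no longer need an artificial link, at the cost of losing airport-level detail.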

- Explain what strongly and weakly connected graphs are
(For this part's analysis, we need to use the largest strongly connected component of the flight network, as we need to filter out airports that are not connected to the global flight network)
Describe how many airports (nodes) we ignore (the number of nodes not in the largest strongly connected component)
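
A minimal sketch of the difference, on a made-up three-airport network (the node names are placeholders, not real IATA codes):

```python
import networkx as nx

# Hypothetical toy network: A <-> B is a round trip, but C only has an outbound flight to A
g = nx.DiGraph([("A", "B"), ("B", "A"), ("C", "A")])

# Weakly connected: ignore edge direction, so all three airports hang together
weak = [set(c) for c in nx.weakly_connected_components(g)]

# Strongly connected: every airport must reach every other one along directed edges
strong = [set(c) for c in nx.strongly_connected_components(g)]

# Airports outside the largest strongly connected component are the ones we ignore
largest_scc = max(strong, key=len)
ignored = set(g.nodes()) - largest_scc

print(weak)
print(largest_scc, ignored)
```

Here C can fly out but nothing flies back to it, so it is weakly but not strongly connected to the rest and would be dropped.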

- Test the six degrees of separation theory
(The graph's diameter answers this: the theory holds if the diameter is at most 6)
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.distance_measures.diameter.html#networkx.algorithms.distance_measures.diameter

- Find the airports at the ends of the longest shortest route (use the NetworkX periphery function)
Periphery: nodes with eccentricity = diameter
https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.distance_measures.periphery.html?highlight=periphery#networkx.algorithms.distance_measures.periphery
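
To make the definitions concrete, here is a plain-Python sketch of what diameter and periphery measure, using breadth-first search on a made-up four-node chain. It is only an illustration of the idea, not how NetworkX implements it; the helper name `eccentricities` is hypothetical.

```python
from collections import deque

def eccentricities(adj):
    """Return {node: longest shortest-path length starting from node}.

    Runs one breadth-first search per node and assumes every node
    can reach every other node."""
    ecc = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ecc[start] = max(dist.values())
    return ecc

# Hypothetical undirected chain A - B - C - D, stored as adjacency lists
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

ecc = eccentricities(adj)
diameter = max(ecc.values())                                   # largest eccentricity
periphery = sorted(n for n, e in ecc.items() if e == diameter)  # nodes that attain it

print(diameter, periphery)
```

On this chain the two end nodes are three hops apart, so the diameter is 3 and the periphery is the two ends.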


(Part V - Network Analysis II: Measuring Node Importance with Centrality Measures + PageRank)

- Plot the top 10 airports for each centrality measure and for PageRank; try to give insights
 
https://networkx.org/documentation/stable/reference/algorithms/centrality.html

I. Degree centrality
II. Closeness Centrality
Which airports will allow you to reach all other airports with the lowest average number of airports in between? 
III. Betweenness Centrality
Which airports often act as bridges between other pairs of airports? 
IV. PageRank
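
A small sketch of how the first three measures behave, on a made-up network with an obvious hub. An undirected graph is used here only to keep the example short; the node names are placeholders.

```python
import networkx as nx

# Hypothetical toy network: H is a hub for A, B, C and links to X, which serves D
g = nx.Graph([("A", "H"), ("B", "H"), ("C", "H"), ("H", "X"), ("X", "D")])

deg = nx.degree_centrality(g)        # share of other nodes directly connected
clo = nx.closeness_centrality(g)     # inverse of the average distance to all other nodes
btw = nx.betweenness_centrality(g)   # share of shortest paths passing through the node

for name, scores in [("degree", deg), ("closeness", clo), ("betweenness", btw)]:
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

In this toy case all three measures agree on H; on the real flight network they may rank airports differently, which is what makes the comparison interesting.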

(How to evaluate airport importance? My rough idea: compare the results above with airport passenger/flight numbers. The measure that correlates most strongly may be considered the best metric)
https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic#2017_statistics
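
One possible way to do this comparison is a rank (Spearman) correlation between each measure and the passenger numbers. The sketch below uses invented values for five placeholder airports, purely to show the mechanics; none of the numbers are real statistics.

```python
import pandas as pd

# Made-up scores for five placeholder airports (not real statistics)
df = pd.DataFrame({
    "passengers":  [90, 70, 50, 30, 10],
    "degree":      [0.9, 0.7, 0.5, 0.3, 0.1],   # same ordering as passengers
    "betweenness": [0.2, 0.9, 0.1, 0.5, 0.3],   # a different ordering
}, index=["P1", "P2", "P3", "P4", "P5"])

# Spearman correlation is the Pearson correlation of the ranks
ranks = df.rank()
spearman = {m: ranks[m].corr(ranks["passengers"]) for m in ["degree", "betweenness"]}

# The measure whose ranking agrees most with passenger traffic
best = max(spearman, key=spearman.get)
print(spearman, best)
```

A rank correlation is used in this sketch because centrality scores and passenger counts are on very different scales; only the ordering of airports is compared.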

(Part VI - Conclusion)
- Significance of our project
- What we learned, etc

In [None]:
import numpy as np 
import pandas as pd
import collections
import networkx as nx

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

In [None]:
# Read route data
route_cols = ['Airline', 'Airline ID', 'Source', 'Source Airport ID',
              'Dest', 'Dest Airport ID', 'Codeshare', 'Stops', 'equipment']
routes_df = pd.read_csv("../input/flight-route-database/routes.csv", skiprows=1, names = route_cols)
# Non-numeric IDs such as '\N' become NaN
routes_df['Source Airport ID'] = pd.to_numeric(routes_df['Source Airport ID'].astype(str), errors='coerce')
routes_df['Dest Airport ID'] = pd.to_numeric(routes_df['Dest Airport ID'].astype(str), errors='coerce')
    
print(routes_df.shape)
routes_df.head()

In [None]:
# Read airport data
airport_df = pd.read_csv("../input/openflights-airports-database-2017/airports.csv")
print(airport_df.shape)
airport_df.tail()

In [None]:
# Drop airports that don't have an IATA code
airport_df = airport_df[airport_df.IATA != '\\N']
print(airport_df.shape)
airport_df.tail()

In [None]:
# make new route df with route count info
routes_all = pd.DataFrame(routes_df.groupby(['Source', 'Dest']).size().reset_index(name='counts'))

airport_all = airport_df[['Name','City','Country','Latitude', 'Longitude', 'IATA']]
IATA_array = airport_all["IATA"].tolist()

# extract japan airport info
airport_jp = airport_df[(airport_df.Country == "Japan")][['Airport ID','Name','City','Latitude','Longitude','IATA']]
#jp_airport_ix = airport_jp.index.values
routes_jp = routes_df[(routes_df['Source Airport ID'].isin(airport_jp['Airport ID'])) &
                      (routes_df['Dest Airport ID'].isin(airport_jp['Airport ID']))] 

In [None]:
routes_all.head()

In [None]:
# only keep routes whose source and destination airports both have an IATA code
routes_all = routes_all[routes_all['Source'].isin(IATA_array)]
routes_all = routes_all[routes_all['Dest'].isin(IATA_array)]

In [None]:
# add a route between every pair of airports in the same city

# make 2 temporary data frames, one for each end of the route

local_source_ap = airport_all[['City','Country','IATA']].copy()
local_source_ap.rename({'IATA': 'Source'}, axis=1, inplace=True)
local_source_ap.dropna(inplace=True)

local_dest_ap = airport_all[['City','Country','IATA']].copy()
local_dest_ap.rename({'IATA': 'Dest'}, axis=1, inplace=True)
local_dest_ap.dropna(inplace=True)

In [None]:
print(local_source_ap.shape)

In [None]:
# only consider airports that already have routes

# build the set of all airports that appear in at least one route
ap_set1 = set(routes_all["Source"].tolist())
ap_set2 = set(routes_all["Dest"].tolist())
print(len(ap_set1))
print(len(ap_set2))
ap_set1.update(ap_set2)
print(len(ap_set1))

In [None]:
local_source_ap2 = local_source_ap[(local_source_ap['Source'].isin(ap_set1))]
local_dest_ap2 = local_dest_ap[(local_dest_ap['Dest'].isin(ap_set1))]

print(local_source_ap2.shape)
print(local_dest_ap2.shape)

In [None]:
s1 = set(local_source_ap2['Source'].tolist())
s2 = set(local_dest_ap2['Dest'].tolist())
print(s1.difference(s2))

In [None]:
local_route = pd.merge(local_source_ap2, local_dest_ap2, how='inner', on=['City', 'Country'])
local_route = local_route.query("Source != Dest")

print(local_route.shape)
local_route

In [None]:
interset = pd.merge(local_route, routes_all, how='inner', on=['Source', 'Dest'])
interset

In [None]:
print(routes_all.shape)
routes_all.head()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
routes_all_n_local = pd.concat([routes_all, local_route])
print(routes_all_n_local.shape)

In [None]:
routes_all_n_local.drop(['City', 'Country'], axis=1, inplace=True)
routes_all_n_local['counts'] = routes_all_n_local['counts'].fillna(1)
routes_all_n_local.head()

In [None]:
# To find the number of routes into and out of each airport, count the rows
# in which the airport appears in either of the two columns
# (a self-loop is counted once, through the Source column only)
counts = pd.concat([routes_all['Source'],
                    routes_all.loc[routes_all['Source'] != routes_all['Dest'], 'Dest']]).value_counts()

# create a data frame of positions for the airports listed in counts
counts = pd.DataFrame({'IATA': counts.index, 'total_flight': counts})
pos_data = counts.merge(airport_all, on = 'IATA')
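
The counting idea in the cell above can be checked on a tiny made-up route table (the airport names and routes are placeholders):

```python
import pandas as pd

# Hypothetical toy route table: one row per directed route
toy_routes = pd.DataFrame({"Source": ["A", "A", "B"],
                           "Dest":   ["B", "C", "A"]})

# Stack the two columns and count how often each airport appears in either one
toy_counts = pd.concat([toy_routes["Source"], toy_routes["Dest"]]).value_counts()
print(toy_counts)
```

Airport A appears twice as a source and once as a destination, so it is counted three times. The toy table has no self-loops, so the extra filter used in the notebook is not needed here.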

In [None]:
counts.head()

In [None]:
pos_data.head()

In [None]:
routes_100 = routes_all.nlargest(100, 'counts')
routes_100.head()

# EDA

In [None]:
# Plot Top 10 Countries with Most Airports
cnt_srs = airport_df['Country'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Airports', fontsize=12)
plt.show()

In [None]:
cnt_srs

In [None]:
# Plot Top 10 Cities with Most Airports
cnt_srs = airport_df['City'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('City', fontsize=12)
plt.ylabel('Number of Airports', fontsize=12)
plt.show()

In [None]:
cnt_srs

In [None]:
# Plot Top 10 Airlines based on number of flights
cnt_srs = routes_df['Airline'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('Airlines', fontsize=12)
plt.ylabel('Number of Flights', fontsize=12)
plt.show()

In [None]:
cnt_srs

In [None]:
# Plot Top 10 Aircraft Types
cnt_srs = routes_df['equipment'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('Plane Model', fontsize=12)
plt.ylabel('Number used in different routes', fontsize=12)
plt.show()

In [None]:
cnt_srs

In [None]:
# Plot Top 10 Destination Airports
# (the destination column is named 'Dest' in route_cols)
cnt_srs = routes_df['Dest'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('Airport', fontsize=12)
plt.ylabel('Number of Flights', fontsize=12)
plt.show()

In [None]:
cnt_srs

In [None]:
# Plot Top 10 Source Airports
# (the source column is named 'Source' in route_cols)
cnt_srs = routes_df['Source'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values)
plt.xticks(rotation='vertical')
plt.xlabel('Airport', fontsize=12)
plt.ylabel('Number of Flights', fontsize=12)
plt.show()

In [None]:
cnt_srs

# Network Visualization

In [None]:
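# Create NetworkX graph from the scheduled routes only
graph0 = nx.from_pandas_edgelist(routes_all, source = 'Source', target = 'Dest', edge_attr = 'counts',create_using = nx.DiGraph())
# nx.info() was removed in NetworkX 3.0; printing the graph gives a similar summary
print(graph0)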

In [None]:
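# Create NetworkX graph that also includes the added same-city links
graph = nx.from_pandas_edgelist(routes_all_n_local, source = 'Source', target = 'Dest', edge_attr = 'counts',create_using = nx.DiGraph())
# nx.info() was removed in NetworkX 3.0; printing the graph gives a similar summary
print(graph)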

In [None]:
# default graph using Networkx built-in graphing function
plt.figure(figsize = (10,9))
nx.draw_networkx(graph)
plt.savefig("1.png", format = "png", dpi = 300)
plt.show()

In [None]:
# default graph using Networkx built-in graphing function
plt.figure(figsize = (14,14))

options = {
    "node_color": "black",
    "node_size": 1,
    "edge_color": "gray",
    "linewidths": 0,
    "width": 0.3,
    "alpha": 0.3
}
nx.draw_kamada_kawai(graph,with_labels=False,alpha=0.15, edge_color="grey",node_color="black",node_size=0.5, arrows=False)
plt.savefig("g_all_3.png", format = "png", dpi = 100)
plt.show()

In [None]:
# The graph above is too difficult to read.
# Explore only the Japanese airport network instead
# (the route columns are named 'Source' and 'Dest' in route_cols)
graph_jp = nx.from_pandas_edgelist(routes_jp, source = 'Source', target = 'Dest',create_using = nx.DiGraph())
print(graph_jp)

In [None]:
plt.figure(figsize = (10,9))
options = {
}
nx.draw(graph_jp, **options)
#nx.draw_networkx(graph_jp)

plt.show()

In [None]:
plt.figure(figsize = (10,10))
nx.draw_networkx(graph_jp)
plt.show()

In [None]:
plt.figure(figsize = (10,10))
nx.draw_circular(graph_jp,with_labels=True,alpha=1, edge_color="grey",node_color="white",edgecolors="blue",node_size=550)
plt.show()

In [None]:
plt.figure(figsize = (18,10))
nx.draw_kamada_kawai(graph_jp,with_labels=True,alpha=1, edge_color="grey",node_color="white",edgecolors="blue",node_size=700)
plt.show()

In [None]:
# Set up base map
m = Basemap(projection='merc', llcrnrlon=123, llcrnrlat=22, urcrnrlon=148, urcrnrlat=48, lat_ts=0, resolution='l',)

# import long lat as m attribute
mx, my = m(pos_data['Longitude'].values, pos_data['Latitude'].values)
pos = {}
for count, elem in enumerate (pos_data['IATA']):
    pos[elem] = (mx[count], my[count])

In [None]:
plt.figure(figsize = (15,15))
nx.draw_networkx_nodes(G = graph_jp, pos = pos, nodelist = graph_jp.nodes(), node_color = 'r', alpha = 0.8,
                       node_size = 0.1)
# routes_all['counts'] does not line up with graph_jp's edges, so use the default edge width
nx.draw_networkx_edges(G = graph_jp, pos = pos, edge_color='g',
                       alpha=0.2, arrows = False)
#plt.savefig("map_1.png", format = "png", dpi = 300)
plt.show()

In [None]:
plt.figure(figsize = (18,18))
m.shadedrelief()
nx.draw_networkx_nodes(G = graph_jp, pos = pos, nodelist = graph_jp.nodes(), node_color = 'r', alpha = 1.0, node_size = 5)
nx.draw_networkx_edges(G = graph_jp, pos = pos, edge_color='green',alpha=0.2, arrows = False)
plt.show()

In [None]:
# Set up base map
plt.figure(figsize = (60,30))
m = Basemap(projection='merc', resolution='l', suppress_ticks=True)

# import long lat as m attribute
mx, my = m(pos_data['Longitude'].values, pos_data['Latitude'].values)
pos = {}
for count, elem in enumerate (pos_data['IATA']):
    pos[elem] = (mx[count], my[count])
    
m.drawcountries(linewidth = 0.1)
m.drawstates(linewidth = 0.05)
m.drawcoastlines(linewidth=0.1)

nx.draw_networkx_nodes(G = graph, pos = pos, nodelist = graph.nodes(), node_color = 'r', alpha = 0.8,
                       node_size = [counts['total_flight'][s]*0.01 for s in graph.nodes()])
# Take widths from the graph's own 'counts' edge attribute so they match the drawn edges
nx.draw_networkx_edges(G = graph, pos = pos, edge_color='g',
                       width = [c*0.25 for _, _, c in graph.edges(data='counts')],
                       alpha=0.2, arrows = False)
plt.savefig("map_5.png", format = "png", dpi = 300)
plt.show()

In [None]:
# The graph above is too difficult to read.
# Explore only the top 100 routes instead
# (routes_100 uses the column names 'Source' and 'Dest')
graph_100 = nx.from_pandas_edgelist(routes_100, source = 'Source', target = 'Dest', edge_attr = 'counts', create_using = nx.DiGraph())
print(graph_100)

In [None]:
# Set up base map
plt.figure(figsize = (60,30))
m = Basemap(projection='merc', resolution='l', suppress_ticks=True)

# import long lat as m attribute
mx, my = m(pos_data['Longitude'].values, pos_data['Latitude'].values)
pos = {}
for count, elem in enumerate (pos_data['IATA']):
    pos[elem] = (mx[count], my[count])
    
m.drawcountries(linewidth = 0.1)
m.drawstates(linewidth = 0.05)
m.drawcoastlines(linewidth=0.1)
#m.shadedrelief()

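nx.draw_networkx_nodes(G = graph_100, pos = pos, nodelist = graph_100.nodes(), node_color = 'r', alpha = 1,
                       node_size = 0.5)
# Look up each edge's route count so the widths follow the graph's own edge order
edge_counts = dict(zip(zip(routes_100['Source'], routes_100['Dest']), routes_100['counts']))
nx.draw_networkx_edges(G = graph_100, pos = pos, edge_color='g',
                       width = [edge_counts[e]*0.15 for e in graph_100.edges()],
                       alpha=1, arrows = False)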
nx.draw_networkx_labels(G = graph_100, pos = pos, font_size=10, font_color='white', font_family='sans-serif', 
                        font_weight='normal', alpha=1.0)

plt.savefig("map_100.png", format = "png", dpi = 300)
plt.show()

# Network Analysis

Six degrees of separation theory

In [None]:
# Find number of strongly connected components in flight network
print(nx.number_strongly_connected_components(graph))

largest_scc_nodes = max(nx.strongly_connected_components(graph), key=len)
largest_scc = graph.subgraph(largest_scc_nodes)

# Find the number of airports in the largest strongly connected component
print(len(largest_scc.nodes()))

# Find the share of the flight network's airports that belong to this component
print(len(graph.nodes()))
print(len(largest_scc.nodes)/len(graph.nodes()))

In [None]:
print(nx.average_shortest_path_length(largest_scc))

In [None]:
all_len_dict = dict(nx.shortest_path_length(largest_scc))
all_len_dict_value_list = list(all_len_dict.values())

flatten_len_list = []

for i in all_len_dict_value_list: 
    flatten_len_list.extend(i.values())

In [None]:
flatten_len_list = [x for x in flatten_len_list if x != 0]
len(flatten_len_list)

In [None]:
plt.figure(figsize = (10,10))
y = np.array(flatten_len_list)
plt.hist(y, bins=np.arange(1, 14, 1));
plt.xticks(np.arange(1, 14, 1))
plt.ylabel('Occurrences')
plt.xlabel('Shortest Path Length');

In [None]:
ctr = collections.Counter(flatten_len_list)
print("Frequency of the elements in the list : ", sorted(ctr.items()))

In [None]:
sum(flatten_len_list) / len(flatten_len_list) 

In [None]:
diameter=nx.diameter(largest_scc)
print(diameter)

periphery=nx.periphery(largest_scc)
print(periphery)

In [None]:
radius=nx.radius(largest_scc)
radius

In [None]:
# Find the airport farthest (by shortest path length) from 'YPO'
# within the largest strongly connected component

max_len = 0
node = ''

for i in largest_scc.nodes:
    # Use a name other than `len` so the built-in function is not shadowed
    path_len = nx.shortest_path_length(largest_scc, source='YPO', target=i)
    if max_len < path_len:
        max_len = path_len
        node = i

print(max_len)
print(node)

In [None]:
print(nx.shortest_path_length(graph,source='YPO',target='IRP'))
print(nx.shortest_path(graph,source='YPO',target='IRP'))

In [None]:
print(nx.shortest_path_length(graph,source='IRP',target='YPO'))
print(nx.shortest_path(graph,source='IRP',target='YPO'))

Plot diameter on map 

In [None]:
# Set up base map
plt.figure(figsize = (60,30))
m = Basemap(projection='merc', resolution='l', suppress_ticks=True)

In [None]:
pos_data_dia = pos_data.loc[pos_data['IATA'].isin(['YPO', 'YAT', 'ZKE', 'YFA', 'YMO', 'YTS', 'YYZ', 'ADD', 'FIH', 'FKI', 'GOM', 'BNC', 'BUX', 'IRP'])]

In [None]:
g_longest_path = nx.Graph()
g_longest_path.add_edges_from([('YPO', 'YAT'), ('YAT', 'ZKE'),('ZKE', 'YFA'), ('YFA', 'YMO'),
                             ('YMO', 'YTS'), ('YTS', 'YYZ'),('YYZ', 'ADD'), ('ADD', 'FIH'),
                             ('FIH', 'FKI'), ('FKI', 'GOM'),('GOM', 'BNC'), ('BNC', 'BUX'),
                             ('BUX', 'IRP')])

In [None]:
plt.figure(figsize = (30,30))
m = Basemap(projection='cea',llcrnrlat=-90,urcrnrlat=90,\
            llcrnrlon=-180,urcrnrlon=180,resolution='l')

# import long lat as m attribute
mx, my = m(pos_data_dia['Longitude'].values, pos_data_dia['Latitude'].values)
pos = {}
for count, elem in enumerate (pos_data_dia['IATA']):
    pos[elem] = (mx[count], my[count])
    
m.drawcountries(linewidth = 0.1)
m.drawstates(linewidth = 0.05)
m.drawcoastlines(linewidth=0.1)

nx.draw_networkx_nodes(G = g_longest_path, pos = pos, nodelist=g_longest_path.nodes(), node_color = 'r', alpha =1,
                      node_size=0.1)
nx.draw_networkx_edges(G = g_longest_path, pos = pos, edge_color='g', width = 0.5, alpha=1, arrows = False)
plt.savefig("map_6.png", format = "png", dpi = 500)
plt.show()

Centrality

In [None]:
in_deg = nx.in_degree_centrality(graph)
sort = sorted(in_deg.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
out_deg = nx.out_degree_centrality(graph)
sort = sorted(out_deg.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
deg = nx.degree_centrality(graph)
sort = sorted(deg.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
eig_cen = nx.eigenvector_centrality(graph)
sort = sorted(eig_cen.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
clo_cen = nx.closeness_centrality(graph)
sort = sorted(clo_cen.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
btw_cen = nx.betweenness_centrality(graph)
sort = sorted(btw_cen.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])

In [None]:
pagerank = nx.pagerank(graph)
sort = sorted(pagerank.items(), key=lambda x: -x[1])
print(sort[:10])
print(sort[-10:])