# Airline Analysis

## TODO:
- get the distance of the edge based on longitude and latitude ✅
- add all the information to the graph so we have an easy time calculating the measures. ✅
- get list of graph theoretical measures to apply to airline networks:
    - number and strength of hubs 
    - network robustness measures
    - number of / presence of paths
    - diameter of the graph
    - centrality measures
    - spectrum of graph
    - measure of correlation between country label
    - how much in country/out of country ✅
    - number of triangles in graph
    - set weights if we have multiple flights to the same place by the same airline ✅
    - get overlapping nodes functions; similarity of graphs
- get the planes associated with each route, so we can estimate the number of passengers. May do this later, only for the biggest airlines, at the end of the analysis ❓

Resources:

Economics:
http://www.oecd.org/daf/competition/airlinecompetition.htm

Graph theoretical:
https://beta.vu.nl/nl/Images/werkstuk-meer_tcm235-280356.pdf

aircraft traffic data by main airport:
https://datamarket.com/data/set/196g/aircraft-traffic-data-by-main-airport#!ds=196g!nto=6:ntp=b:ntq=3:ntr=1.1g.1u.7.z.a.j.v.1b.t.d.s.1n.12.p.8.b.y.e.19.17.1v.9.i.11.1f.1s.1a.1w.x.14.1l.1p.4.k.1r.g.1x.1c.f.15.q.1j.1t.l.1k.1h:nts=nf.rb&display=line





### Loading the dataset:

In [None]:
from geopy.distance import distance  # calculates distance based on coordinates

import operator
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!bash download_data.sh

In [None]:
airports = pd.read_csv('airports.dat', header=None, names=
                      ["AirportID","Name", "City", "Country", "IATA", "ICAO",
                       "Latitude", "Longitude", "Altitude", "Timezone", "DST", "TzDatabaseTimeZone",
                       "Type", "Source"],
                      na_values='\\N')
airlines = pd.read_csv('airlines.dat', header=None, names=
                       ["AirlineID", "Name", "Alias", "IATA", "ICAO", "Callsign", "Country", "Active"]
                       ,na_values='\\N')
routes = pd.read_csv('routes.dat', header=None, names=
                     ['Airline', 'AirlineID', 'SourceAirport', 'SourceAirportID', 'DestinationAirport',
                      'DestinationAirportID', 'Codeshare', 'Stops', 'Equipment'],
                    na_values='\\N')
planes = pd.read_csv('planes.dat', header=None, names=['Name', 'IATA code', 'ICAO code'])

In [None]:
routes.head()

In [None]:
airlines.head()

Only keep airports that appear in both the routes and airports dataframes:

In [None]:
valid_airports = set(routes.SourceAirport).union(set(routes.DestinationAirport)) 
#set(airports.IATA).intersection(set(routes.SourceAirport).union(set(routes.DestinationAirport)))

In [None]:
set(airports.IATA) - valid_airports

Airports to fill in information for :

In [None]:
len(valid_airports - set(airports.IATA))

In [None]:
routes = routes[routes.SourceAirport.isin(valid_airports) &  routes.DestinationAirport.isin(valid_airports)]
airports = airports[airports.IATA.isin(valid_airports)]

Only keep airlines that appear in both the airlines and routes dataframes:

In [None]:
valid_airlines = set(routes.AirlineID)#set(airlines.AirlineID).intersection(set(routes.AirlineID))

In [None]:
#set(airlines.AirlineID) - valid_airlines

In [None]:
airlines = airlines[airlines.AirlineID.isin(valid_airlines)]
routes = routes[routes.AirlineID.isin(valid_airlines)]

We check that each airline has at most one edge between a given source and destination node. If so, our graph will be unweighted.

In [None]:
routes_by_airline = routes[['SourceAirport', 'DestinationAirport', 'Airline']]
routes_by_airline.drop_duplicates().shape == routes_by_airline.shape
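
If the check above had returned `False`, the repeated routes could be turned into edge weights instead. A minimal sketch on made-up route tuples; `toy_routes` and `route_weights` are illustrative names, not part of the dataset or of this notebook:

```python
from collections import Counter

def route_weights(routes):
    """Count how often each (airline, source, destination) triple occurs."""
    return Counter(routes)

# Hypothetical routes: (airline, source, destination)
toy_routes = [
    ("FR", "DUB", "STN"),
    ("FR", "DUB", "STN"),   # repeated route -> weight 2
    ("FR", "STN", "DUB"),
    ("LH", "FRA", "MUC"),
]

weights = route_weights(toy_routes)
# The graph can stay unweighted only if every triple occurs exactly once
is_unweighted = all(count == 1 for count in weights.values())
print(weights[("FR", "DUB", "STN")], is_unweighted)
```

The counts could then be passed as an edge attribute when building the graph.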

In [None]:
newMatrix = [['Altay Air Base', 'AAT', 47.7498856, 88.0858078, 'Altay', 'China', 'ZWAT'],
             ['Baise Youjiang Airport', 'AEB', 23.7206001, 106.9599991, 'Baise', 'China', 'ZGBS'],
             ['Tasiilaq', 'AGM', 65.6122961, -37.6183355, 'Tasiilaq', 'Greenland', 'BGAM'],
             ['Atmautluak Airport', 'ATT', 60.8666992, -162.272995, 'Atmautluak', 'United States', ''],
             ['Branson Airport', 'BKG', 36.532082, -93.200544, 'Branson', 'United States', 'KBBG'],
             ['Baoshan Yunduan Airport', 'BSD', 25.0533009, 99.1682968, 'Baoshan', 'China', 'ZPBS'],
             ['Laguindingan Airport', 'CGY', 8.612203, 124.456496, 'Cagayan de Oro City', 'Philippines', ''],
             ['Chuathbaluk Airport', 'CHU', 61.579102, -159.216003, 'Chuathbaluk', 'United States', 'PACH'],
             ['Crooked Creek Airport', 'CKD', 61.8679008, -158.1349945, 'Crooked Creek', 'United States', 'CJX'],
             ['Desierto De Atacama Airport', 'CPO', -27.2612, -70.7791977, 'Copiapo', 'Chile', 'SCAT'],
             ['Dandong Airport', 'DDG', 40.0247002, 124.2860031, 'Dandong', 'China', 'ZYDD'],
             ['Hamad International Airport', 'DOH', 25.2620449, 51.6130829, 'Doha', 'Qatar', 'OTHH'],
             ['Dongying Shengli Airport', 'DOY', 37.5085983, 118.788002, 'Dongying', 'China', 'ZSDY'],
             ['Saertu Airport', 'DQA', 46.7463889, 125.1405556, 'Daqing Shi', 'China', 'ZYDQ'],
             ['Førde Airport', 'FDE', 61.3911018, 5.7569399, 'Førde', 'Norway', 'ENBL'],
             ['Mt. Fuji Shizuoka Airport', 'FSZ', 34.7960435, 138.1877518, 'Makinohara', 'Japan', 'RJNS'],
             ['Foshan Shadi Airport', 'FUO', 23.0832996, 113.0699997, 'Foshan', 'China', 'ZGFS'],
             ['Goulimime Airport', 'GLN', 29.0266991, -10.0502996, 'Goulimime', 'Morocco', 'GMAG'],
             ['Gheshm Airport', 'GSM', 26.9487, 56.2687988, 'Gheshm', 'Iran', 'OIKQ']]
             
             
             
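append_df = pd.DataFrame(newMatrix, columns=['Name', 'IATA', 'Latitude', 'Longitude', 'City', 'Country', 'ICAO']) 
# Keep the result: concatenation returns a new frame (DataFrame.append was removed in pandas 2.0)
airports = pd.concat([airports, append_df], sort=False, ignore_index=True)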


In [None]:
airports[airports.Country == 'Japan']


### Merging Routes with Airlines:

We are only interested in currently active airlines:

In [None]:
merged_routes = pd.merge(airlines[airlines.Active == 'Y'], routes, on='AirlineID')

In [None]:
merged_routes.head(1)

Getting whether the flight is international or not:

In [None]:
Airport_to_country = airports.set_index('IATA').Country.to_dict()

In [None]:
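def get_international(x):
    # 1 if source and destination lie in different countries, 0 if domestic,
    # None if either airport is missing from the mapping
    try:
        if Airport_to_country[x.SourceAirport] != Airport_to_country[x.DestinationAirport]:
            return 1
        else:
            return 0
    except KeyError:
        return None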

In [None]:
merged_routes['International'] = merged_routes.apply(
    lambda x: get_international(x), axis=1)
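
A stand-alone sketch of this kind of country comparison on a toy mapping. It assumes the convention that 1 marks a route whose endpoints lie in different countries and `None` marks a route with an unknown airport; `toy_countries` and `is_international` are hypothetical names used only here:

```python
def is_international(source, dest, airport_to_country):
    """Return 1 for cross-border routes, 0 for domestic ones, None if unknown."""
    try:
        return int(airport_to_country[source] != airport_to_country[dest])
    except KeyError:
        # At least one airport code is missing from the mapping
        return None

# Made-up mapping from IATA code to country
toy_countries = {"ZRH": "Switzerland", "GVA": "Switzerland", "OTP": "Romania"}

domestic = is_international("ZRH", "GVA", toy_countries)
cross_border = is_international("OTP", "ZRH", toy_countries)
unknown = is_international("ZRH", "XXX", toy_countries)
print(domestic, cross_border, unknown)
```

Returning `None` for unknown airports means that a later `.mean()` over the column skips those routes rather than counting them as domestic.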

Reset frames to create mappings:

In [None]:
#only keep values we are interested in
airports_filtered = airports[['Name', 'Country', 'Longitude', 'Latitude', 'Timezone', 'IATA', 'City']].copy()

In [None]:
#IATA airport id -> longitude latitude
#airports_filtered.dropna(inplace=True)
airports_filtered.set_index('IATA', inplace=True)

airports_filtered.Longitude.dropna().shape == airports_filtered.Longitude.shape

In [None]:
print(airports.Longitude.shape)
print(airports_filtered.Longitude.shape)

In [None]:
location_mapping = airports_filtered.apply(lambda x: [x.Longitude, x.Latitude], axis=1).to_dict()

In [None]:
#Airline name -> airlineID
airline_name_to_number = merged_routes.Name.drop_duplicates().reset_index(drop=True).to_dict()
airline_name_to_number = {v: k for k, v in airline_name_to_number.items()}

In [None]:
merged_routes['AirlineNbr'] = merged_routes.Name.map(airline_name_to_number)

Fill in NaN values:

In [None]:
merged_routes['Codeshare'] = merged_routes.Codeshare.fillna('N')

### Getting the distance between two airports:

Example of functionality:

In [None]:
coords = airports_filtered.apply(lambda x: (x.Latitude, x.Longitude), axis=1)
element = coords.iloc[0]   # positional access, since the index holds IATA codes
element2 = coords.iloc[1]

In [None]:
distance(element, element2).km
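
`geopy.distance.distance` computes a geodesic distance on an ellipsoidal Earth model. For intuition, a spherical approximation (the haversine formula) fits in a few lines. This is a sketch, not what geopy does internally; the mean Earth radius of 6371 km and the airport coordinates below are approximations chosen for illustration:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; an assumption of the spherical model

def haversine_km(point_a, point_b):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

zurich = (47.4647, 8.5492)   # approximate coordinates
geneva = (46.2381, 6.1089)
print(round(haversine_km(zurich, geneva)))
```

Note the (latitude, longitude) order, which matches what the cells above pass to geopy.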

In [None]:
distance_mapping = airports_filtered.apply(lambda x: (x.Latitude, x.Longitude), axis=1).to_dict()

Adding it to merged_routes:

In [None]:
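def get_distance(source, dest):
    # None when an airport is missing from the mapping or has invalid coordinates
    try:
        dist = distance(distance_mapping[source], distance_mapping[dest]).km
        return dist
    except (KeyError, ValueError):
        return None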

In [None]:
merged_routes['Distance'] = merged_routes.apply(lambda x: get_distance(x.SourceAirport, x.DestinationAirport), axis=1)

In [None]:
relevant_columns = ['Name', 'ICAO', 'Country', 'SourceAirport', 'DestinationAirport', 'Codeshare',
                    'Stops', 'Equipment', 'AirlineNbr', 'International', 'Distance']

In [None]:
merged_routes.head()

In [None]:
merged_routes[relevant_columns].head()

In [None]:
merged_routes = merged_routes[relevant_columns]

## Preliminary analysis of the biggest airlines:

In [None]:
merged_routes.Name.value_counts().head(10)

In [None]:
merged_routes.Name.value_counts().head(120).plot(kind='bar', color='b')
_ = plt.xticks([])

In [None]:
merged_routes.Name.value_counts().plot(kind='hist', log=True, bins=20)

In [None]:
merged_routes.Name.value_counts().describe()

In [None]:
reasonably_big_airlines = merged_routes.Name.value_counts()[merged_routes.Name.value_counts() > 100].index

In [None]:
merged_routes = merged_routes[merged_routes.Name.isin(reasonably_big_airlines)]

We look at a total of 138 airlines:

In [None]:
merged_routes.Name.unique().shape

### Meta-Data analysis

Proportion of international to national flights:

In [None]:
merged_routes.groupby('Name').International.mean().plot(kind='hist', bins=40)

Mean distance of flights:

In [None]:
merged_routes.groupby('Name').Distance.mean().plot(kind='hist', bins=30)

Max distance of flights:

In [None]:
merged_routes.groupby('Name').Distance.max().plot(kind='hist', bins=30)

In [None]:
plt.scatter(merged_routes.groupby('Name').Distance.max(),  merged_routes.groupby('Name').Distance.min())

In [None]:
plt.scatter(merged_routes.groupby('Name').Distance.median(),  merged_routes.groupby('Name').International.mean())

Shortest distance of flights:

In [None]:
merged_routes.groupby('Name').Distance.min().plot(kind='hist', bins=30)

Proportion of codeshare flights:

In [None]:
merged_routes['Codeshare'] = merged_routes.Codeshare.map(lambda x: 1 if x == 'Y' else 0)

In [None]:
merged_routes.groupby('Name').Codeshare.mean().plot(kind='hist', bins=30)

Missing Airports:

In [None]:
len(set(merged_routes.SourceAirport).union(set(merged_routes.DestinationAirport)) - set(airports.IATA))

In [None]:
set(merged_routes.SourceAirport).union(set(merged_routes.DestinationAirport)) - set(airports.IATA)

## Create graph of all airlines:

Create a graph in which each edge carries the airline it belongs to:

In [None]:
biggest = merged_routes.AirlineNbr.value_counts().head(10).index

In [None]:
edge_attributes = ['Country', 'Name', 'AirlineNbr', 'Distance', 'International']

In [None]:
Airline_Graph = nx.from_pandas_edgelist(merged_routes, 
                                        source='SourceAirport', 
                                        target='DestinationAirport', 
                                        edge_attr=edge_attributes)

In [None]:
color_edges = list(nx.get_edge_attributes(Airline_Graph, 'AirlineNbr').values())

In [None]:
nx.set_node_attributes(Airline_Graph, location_mapping, 'Location')

In [None]:
#draw_airline_network(Airline_Graph, 'All airlines')

In [None]:
Airport_to_city = airports.set_index('IATA').City.to_dict()

In [None]:
Airport_to_name = airports.set_index('IATA').Name.to_dict()

In [None]:
e_centrality = nx.eigenvector_centrality(Airline_Graph)

In [None]:
centrality = np.array(list(e_centrality.values()))
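
Eigenvector centrality scores a node highly when its neighbours score highly. The idea can be sketched as a power iteration on a toy graph. This is an illustration under simplifying assumptions (fixed number of iterations, undirected unweighted edges), not NetworkX's implementation, and `toy_edges` is made up. Iterating with A + I rather than A keeps the iteration from oscillating on bipartite graphs such as a pure star:

```python
from math import sqrt

def eigenvector_centrality_sketch(edges, iterations=200):
    """Power iteration with (A + I) on an undirected graph given as an edge list."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # x <- (A + I) x, then normalise to unit Euclidean length
        new = {n: score[n] + sum(score[m] for m in neighbours[n]) for n in neighbours}
        norm = sqrt(sum(value * value for value in new.values()))
        score = {n: value / norm for n, value in new.items()}
    return score

# Hypothetical hub-and-spoke network with FRA as the hub
toy_edges = [("FRA", "MUC"), ("FRA", "HAM"), ("FRA", "TXL"), ("MUC", "HAM")]
scores = eigenvector_centrality_sketch(toy_edges)
print(max(scores, key=scores.get))
```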

Major Airports:

In [None]:
major_airports = {Airport_to_name[k] for  k, v in e_centrality.items() if v > np.quantile(centrality, 0.99) and k in Airport_to_city.keys()}

In [None]:
major_airports

plt.figure(figsize=(20, 10))
nx.draw_networkx(Airline_Graph, 
                 pos=nx.get_node_attributes(Airline_Graph, 'Location'), 
                 edge_color=color_edges, edge_cmap=plt.cm.Set2, node_size=0, labels=dict(), alpha=0.4)

## Looking at individual networks:

Example analysis of one graph:

In [None]:
def create_airline_network(airline):
    df = merged_routes[merged_routes['Name'] == airline]
    Airline_Graph = nx.from_pandas_edgelist(df, 
                                      source='SourceAirport', target='DestinationAirport', edge_attr=['Country'])
    nx.set_node_attributes(Airline_Graph, location_mapping, 'Location')
    return Airline_Graph

In [None]:
Ryanair = create_airline_network('Ryanair')

In [None]:
e, U = np.linalg.eigh(nx.normalized_laplacian_matrix(Ryanair).todense())

In [None]:
plt.plot(e)

In [None]:
Lufthansa = create_airline_network('Lufthansa')

In [None]:
e, U = np.linalg.eigh(nx.normalized_laplacian_matrix(Lufthansa).todense())
plt.plot(e)

In [None]:
merged_routes.Name.value_counts().describe()

In [None]:
merged_routes.Country.value_counts().head(20)

In [None]:
merged_routes.Codeshare.value_counts()

In [None]:
Low_cost = ['Southwest Airlines', 'AirAsia', 'Ryanair','easyJet', 'WestJet']

In [None]:
merged_routes[merged_routes.Name.isin(Low_cost)].Name.value_counts()

In [None]:
merged_routes.head()

In [None]:
def draw_airline_network(Airline_Graph, airline):
    plt.figure(figsize=(10, 10))
    centrality = nx.betweenness_centrality(Airline_Graph)
    size = np.array(list(centrality.values()))*1000
    nx.draw_spring(Airline_Graph, node_size=size, width=0.1)
    plt.title(airline)
    plt.show()
    
def get_spectrum_figures(Airline_Graph):
    e, U = np.linalg.eigh(nx.normalized_laplacian_matrix(Airline_Graph).todense())
    plt.plot(e)
    plt.show()
    plt.plot(nx.laplacian_spectrum(Airline_Graph))
    plt.show()
    plt.boxplot(nx.degree_centrality(Airline_Graph).values())
    plt.show()

In [None]:
for cheap in Low_cost:
    Cheap = create_airline_network(cheap)
    draw_airline_network(Cheap, cheap)

## Adding in external information for selected airlines: case study

In [None]:
Delays_data = pd.read_csv('delays_Data.csv')

In [None]:
Delays_data

In [None]:
def convert_mixed_fractions(x):
    if '%' in x:
        return float(x[:-1])/100
    else:
        return float(x)

In [None]:
Delays_data['On-time (A14)'] = Delays_data['On-time (A14)'].map(convert_mixed_fractions)

In [None]:
Airline_list = pd.read_csv('airlines_Data.csv')

In [None]:
Airline_list.columns

In [None]:
Delays_data.columns

In [None]:
set(Airline_list.name.map(lambda x: x.lower())).intersection(set(merged_routes.Name.map(lambda x: x.lower())))

In [None]:
set(Delays_data['On-time']).intersection(set(airlines.Name))

In [None]:
set(Airline_list.name).intersection(set(airlines.Name))

# Networks Analysis

Clustering the networks based on stats:

In [None]:
from networkx.algorithms.approximation.clique import large_clique_size

In [None]:
from networkx.algorithms.community import greedy_modularity_communities

In [None]:
def get_network_stats(airline):
    Airline_Graph = create_airline_network(airline)
    
    
    components = nx.number_connected_components(Airline_Graph)
    component_ratio = 0
    bridges = len(list(nx.bridges(Airline_Graph)))#/ Airline_Graph.number_of_edges()
    tree = nx.is_forest(Airline_Graph)
    max_clique = large_clique_size(Airline_Graph)
    bipartite = nx.is_bipartite(Airline_Graph)
    density = nx.density(Airline_Graph)
    #try with this 
    international_ratio = merged_routes.groupby('Name').get_group(airline).International.sum() / (Airline_Graph.size()*2)
    
    
    if components > 1 :
        print(airline,'  ' ,components)
        
        c = sorted(nx.connected_components(Airline_Graph), key = len, reverse=True)
        component_ratio = len([i for t in c[1:] for i in t])/ len(c[0])
        print(component_ratio)
        Airline_Graph = Airline_Graph.subgraph(c[0])

    diameter = nx.diameter(Airline_Graph)
    node_connectivity = nx.node_connectivity(Airline_Graph)
    algebraic_connectivity = nx.algebraic_connectivity(Airline_Graph)
    clustering = nx.average_clustering(Airline_Graph)
    
    nb = nx.betweenness_centrality(Airline_Graph, normalized=True)
    betweenness = np.array(list(nb.values()))
    max_betweenness = np.max(betweenness)
    upper_betweenness = np.quantile(betweenness, 0.75)
    median_betweenness = np.quantile(betweenness, 0.5)
    lower_betweenness = np.quantile(betweenness, 0.25)
    
    node_edge_ratio =  Airline_Graph.number_of_nodes() / Airline_Graph.number_of_edges()
    degree_assortativity = nx.degree_assortativity_coefficient(Airline_Graph)
    shortest_path_length = nx.average_shortest_path_length(Airline_Graph)
    
    c = list(greedy_modularity_communities(Airline_Graph))
    nbr_communities = len(c)
    
    
    hist = nx.degree_histogram(Airline_Graph)
    hist_len = len(hist)
    total = sum(hist)
    per_large_degree = sum(hist[5:])/total
    deadend = hist[1]/total
    path = hist[2]/total
    tri = sum(hist[3:])/total

    return np.array([per_large_degree,
                     hist_len,
                     deadend,
                     path,
                     tri,
                     max_clique,
                     tree,
                     bipartite,
                     bridges,
                     diameter,
                     components,
                     density,
                     component_ratio, 
                     node_connectivity, 
                     clustering, 
                     algebraic_connectivity, 
                     max_betweenness,
                     upper_betweenness,
                     median_betweenness,
                     lower_betweenness,
                     degree_assortativity, 
                     shortest_path_length, 
                     international_ratio,
                     ])

In [None]:
get_network_stats('Lufthansa')
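
Several of the features returned by `get_network_stats` are shares of nodes by degree. A stand-alone sketch on a made-up edge list shows what `deadend`, `path` and `per_large_degree` measure; `toy_edges` and `degree_shares` are illustrative names, and the sketch assumes a simple graph with no repeated edges:

```python
from collections import Counter

def degree_shares(edges, large=5):
    """Share of nodes with degree 1, degree 2, and degree >= `large`."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = len(degree)
    deadend = sum(1 for d in degree.values() if d == 1) / total
    path = sum(1 for d in degree.values() if d == 2) / total
    per_large_degree = sum(1 for d in degree.values() if d >= large) / total
    return deadend, path, per_large_degree

# Hypothetical star network: one hub with five spokes, one spoke extended by a leg
toy_edges = [("HUB", spoke) for spoke in "ABCDE"] + [("E", "F")]
deadend, path, per_large_degree = degree_shares(toy_edges)
print(deadend, path, per_large_degree)
```

A hub-and-spoke carrier would be expected to show a high `deadend` share, while a point-to-point carrier spreads its nodes across higher degrees.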

Collecting the data:

In [None]:
Stats = []
airlines = merged_routes.Name.unique()
for name in airlines:
    stat = get_network_stats(name)
    if stat is not None:
        Stats.append(stat)

Put everything into a matrix, do PCA for dimensionality reduction & cluster:

In [None]:
network_stats = np.array(Stats)

In [None]:
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(network_stats)
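
`StandardScaler` centres each feature column and divides it by its population standard deviation, so that features on very different scales (for example `bridges` and `density`) contribute comparably to the PCA. The same transformation for a single column, sketched with the standard library; the `standardize` helper and the sample values are illustrative:

```python
from statistics import fmean, pstdev

def standardize(column):
    """Z-score a list of numbers using the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # Constant feature: only centre it, to avoid dividing by zero
        std = 1.0
    return [(value - mean) / std for value in column]

diameters = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up feature values
scaled = standardize(diameters)
print(scaled)
```

The zero-variance guard matters here because boolean features such as `tree` or `bipartite` may take the same value for every airline.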

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
principalComponents = pca.fit_transform(x)

In [None]:
from mpl_toolkits.mplot3d import Axes3D

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

x = principalComponents[:, 0]
y = principalComponents[:, 1]
z = principalComponents[:, 2]



ax.scatter(x, y, z, c='r', marker='o')

In [None]:
plt.scatter(principalComponents[:, 0], principalComponents[:, 1])

In [None]:
print(pca.explained_variance_ratio_) 

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Ward linkage always uses Euclidean distances, so no distance argument is needed
cluster = AgglomerativeClustering(n_clusters=10, linkage='ward')  
cluster.fit_predict(principalComponents)  

In [None]:
plt.scatter(principalComponents[:,0], principalComponents[:,1], c=cluster.labels_, cmap=plt.cm.tab20)  

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

x = principalComponents[:, 0]
y = principalComponents[:, 1]
z = principalComponents[:, 2]



ax.scatter(x, y, z, c=cluster.labels_, marker='o', cmap=plt.cm.tab20)

In [None]:
from sklearn.mixture import GaussianMixture as GMM
gmm = GMM(n_components=5).fit(principalComponents)
labels = gmm.predict(principalComponents)

In [None]:
plt.scatter(principalComponents[:,0], principalComponents[:,1], c=labels, cmap=plt.cm.tab20)  

In [None]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

x = principalComponents[:, 0]
y = principalComponents[:, 2]
z = principalComponents[:, 1]



ax.scatter(x, y, z, c=labels, cmap=plt.cm.tab20)

In [None]:
plt.figure(figsize=(30, 30))
plt.scatter(principalComponents[:,0], principalComponents[:,1], c= cluster.labels_, cmap=plt.cm.tab20)  
for i, name in enumerate(airlines):
    plt.annotate(name, xy = (principalComponents[i, 0], principalComponents[i, 1]), 
             xytext = (0, 0), textcoords = 'offset points')

In [None]:
most_common = merged_routes[~merged_routes.Name.duplicated()].Country.value_counts().head().index
most_common

In [None]:
countries = list(most_common) + list(merged_routes.Country.unique())
airline_to_countryid = merged_routes.set_index('Name').Country.to_dict()
country_coloring = [countries.index(airline_to_countryid[i]) for i in merged_routes.Name.unique()]

In [None]:
to_keep = [i for i, name in enumerate(airlines) if airline_to_countryid[name] in most_common]

In [None]:
plt.figure(figsize=(20, 10))
plt.scatter(principalComponents[to_keep,0], 
            principalComponents[to_keep,1], 
            c= np.array(country_coloring)[to_keep], 
            cmap=plt.cm.Set1)  
for i in to_keep:
    plt.annotate(airlines[i], xy = (principalComponents[i, 0], principalComponents[i, 1]), 
             xytext = (0, 0), textcoords = 'offset points')

One large cluster of American airlines, with some dispersion, and then a cheap-airline cluster:

In [None]:
American_Airlines = ['Us Airways', 'Delta Airlines', 'American Airlines', 'United Airlines']

Total of 4 clusters for Chinese airlines:

In [None]:
Chinese_Airlines = ['Air China', 'China Eastern Airlines', 'China Southern Airlines', 'Hainan Airlines']

1 clear cluster for United Arab Emirates:

In [None]:
Emirate_Airlines = ['Fly Dubai', 'Emirates', 'Etihad Airways']

German airlines are distributed all over, there is no clear cluster!

In [None]:
Extreme_networks = ['Era Alaska', 'TUIfly', 'Air Arabia']

In [None]:
for name in Extreme_networks:
    Airline_Graph = create_airline_network(name)
    draw_airline_network(Airline_Graph, name)

In [None]:
scatter_data = pd.DataFrame(principalComponents)

scatter_data['name'] = airlines
scatter_data['labels'] = cluster.labels_
scatter_data['country'] = scatter_data.name.map(airline_to_countryid)

In [None]:
scatter_data['ontime'] = scatter_data.name.map(Delays_data.set_index('On-time')['On-time (A14)'].to_dict())
scatter_data['delay'] = scatter_data.name.map(Delays_data.set_index('On-time')['Avg. Delay'].to_dict())
scatter_data['safety'] = scatter_data.name.map(Airline_list.set_index('name').safety_score.to_dict())

In [None]:
scatter = scatter_data[scatter_data.delay.notna()].copy()  # copy so columns can be added safely

In [None]:
scatter['safety'] = scatter.safety.astype(float)

In [None]:
scatter[1][6]

The only result I can see so far is a delay gradient that runs from bottom left to top right:

In [None]:
plt.figure(figsize=(20, 10))
plt.scatter(scatter[0], 
            scatter[1], 
            c= scatter['delay'], 
            cmap=plt.cm.Spectral)  

for i in scatter.index:
    plt.annotate(scatter['name'][i], xy = (scatter[0][i], scatter[1][i]), 
             xytext = (0, 0), textcoords = 'offset points')
plt.colorbar()

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
scatter[-1] = 1

In [None]:
scatter

In [None]:
X = np.array(Stats)[list(scatter.index)]

In [None]:
regr = linear_model.LinearRegression()

# Train the model using the training sets
#regr.fit(scatter[[0, 1, 2, 3]], scatter['delay'])
regr.fit(X, scatter['delay'])

In [None]:
#regr.score(scatter[[0, 1,2, 3]], scatter['delay'])
regr.score(X, scatter['delay'])
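
`LinearRegression.score` returns the coefficient of determination R². Because the model here is fitted and scored on the same airlines, the value describes in-sample fit only; with 23 features and a limited number of airlines with delay data, a high value may partly reflect overfitting. A sketch of the computation on made-up numbers, where `r_squared` is an illustrative helper:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

observed = [10.0, 12.0, 15.0, 19.0]    # made-up average delays
predicted = [11.0, 12.0, 14.0, 19.0]   # made-up model output
print(r_squared(observed, predicted))
```

A model that always predicts the mean scores 0, and a perfect fit scores 1.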

In [None]:
scatter['pred'] = regr.predict(X)#regr.predict(scatter[[0, 1, 2, 3]])

In [None]:
plt.figure(figsize=(20, 10))
plt.scatter(scatter[0], 
            scatter[1], 
            c= scatter['pred'], 
            cmap=plt.cm.Spectral)  

for i in scatter.index:
    plt.annotate(scatter['name'][i], xy = (scatter[0][i], scatter[1][i]), 
             xytext = (0, 0), textcoords = 'offset points')
plt.colorbar()

In [None]:
plt.bar(range(len(regr.coef_)), np.log(abs(regr.coef_)),)
[regr.coef_]

In [None]:
#predict the safety score using linear regression
scatter_safety=scatter[pd.notnull(scatter['safety'])]
X_safety = np.array(Stats)[list(scatter_safety.index)]

In [None]:
regr_safety=linear_model.LinearRegression()
regr_safety.fit(X_safety,scatter_safety['safety'])

In [None]:
regr_safety.score(X_safety,scatter_safety['safety'])

In [None]:
scatter['pred_safety']=regr_safety.predict(X)

In [None]:
plt.bar(range(len(regr_safety.coef_)), np.log(abs(regr_safety.coef_)),)
[regr_safety.coef_]

In [None]:
#predict the on-time score using linear regression
ontime_rig=linear_model.LinearRegression()
ontime_rig.fit(X,scatter['ontime'])

In [None]:
ontime_rig.score(X,scatter['ontime'])

In [None]:
scatter['pred_ontime']=ontime_rig.predict(X)

In [None]:
plt.bar(range(len(ontime_rig.coef_)), np.log(abs(ontime_rig.coef_)),)
[ontime_rig.coef_]

The determining factors in the predictions are the betweenness quantiles: upper_betweenness, median_betweenness and lower_betweenness.

In [None]:
"""per_large_degree,
                     hist_len,
                     deadend,
                     path,
                     tri,
                     max_clique, 
                     tree,       
                     bipartite, 
                     bridges, 
                     diameter,
                     components,
                     density,
                     component_ratio, 
                     node_connectivity, 
                     clustering, 
                     algebraic_connectivity, 
                     max_betweenness,
                     upper_betweenness,
                     median_betweenness,
                     lower_betweenness,
                     degree_assortativity, 
                     shortest_path_length, 
                     """

Getting information summarized for individual networks:

In [None]:
def airlines_network_analysis(airline):
    
    Airline_Graph = create_airline_network(airline)
    
    #Highlights hubs   
    print("10% biggest airports of ", airline)
    print()
    deg = np.array(list(Airline_Graph.degree))
    deg_value = deg[:,1]
    deg_value = deg_value.astype(float)  # np.float was removed from NumPy
    perc = np.percentile(deg_value, q=90)
    biggest_hubs = np.array(np.where(deg_value > perc))

    for i in np.nditer(biggest_hubs):
        print(airports[airports.IATA == deg[i,0]].Name.to_string(index=False), "has degree : ", deg[i,1])
    
    
    #Diameter,robustness
    print("Analysis")
    print("Number of edges : ", Airline_Graph.number_of_edges())
    print("Number of nodes : ", Airline_Graph.number_of_nodes())
    print("Diameter : ", nx.diameter(Airline_Graph))
    print("Average distance:", merged_routes.groupby('Name').get_group(airline).Distance.mean())
    print("International Ratio: ",merged_routes.groupby('Name').get_group(airline).International.sum() / (Airline_Graph.size()*2))
    print("Node connectivity", nx.node_connectivity(Airline_Graph))
    
    eb = nx.edge_betweenness_centrality(Airline_Graph)
    key, value = max(eb.items(), key = lambda p: p[1])
    print("Max edge betweenness: ",value , "from ", Airport_to_city.get(key[0]), "to", Airport_to_city.get(key[1]))
    key, value = min(eb.items(), key = lambda p: p[1])
    print("Min edge betweenness: ",value, "from", Airport_to_city.get(key[0]), "to", Airport_to_city.get(key[1]))

    nb = nx.betweenness_centrality(Airline_Graph)
    key, value = max(nb.items(), key= lambda p:p[1])
    print("Max node betweenness: ", value, "airport", Airport_to_city.get(key))
    key, value = min(nb.items(), key= lambda p:p[1])
    print("Min node betweenness: ", value, "airport", Airport_to_city.get(key))
    print("Algebraic connectivity: ", nx.algebraic_connectivity(Airline_Graph))
    
    #Plot network
    plt.figure(figsize=[7,9])
    plt.subplot(211)
    plt.title('Degree Distribution')
    plt.hist(deg_value, bins=50)
    
    plt.subplot(212)
    plt.title('Distances distribution')
    merged_routes.groupby('Name').get_group(airline).Distance.hist(bins=30)
    
    
    
   
    draw_airline_network(Airline_Graph, airline)
    

In [None]:
airlines_network_analysis('Ryanair')

In [None]:
airlines_network_analysis('American Airlines')

In [None]:
#High cooperation probability based on common bottlenecks 
def find_helper(airline):
    Airline_Graph = create_airline_network(airline)
    #bottlenecks
    eb = nx.edge_betweenness_centrality(Airline_Graph)
    #Decreasing sorting 
    eb_sorted = sorted(eb.items(), key = lambda p: 1-p[1])
    #find helper for the bottleneck
    for i in range(5):
        print('Betweenness value ' , eb_sorted[i][1])
        print('Bottleneck from ', Airport_to_city.get(eb_sorted[i][0][0]), 'to', Airport_to_city.get(eb_sorted[i][0][1]))
        
        helper_routes = merged_routes[(merged_routes.SourceAirport == eb_sorted[i][0][0]) & (merged_routes.DestinationAirport == eb_sorted[i][0][1])]
        if (helper_routes.shape[0] > 1):
            print('Best helpers : ')
            print((helper_routes[helper_routes.Name != airline].Name.to_string(index=False)))
            print()
        else:
            print(airline, ' is the unique airline \n\n')
            

find_helper('American Airlines')


## Competition Analysis

In [None]:
# map() leaves NaN for airports missing from the mapping instead of raising KeyError
merged_routes['SourceCountry'] = merged_routes.SourceAirport.map(Airport_to_country)
merged_routes['DestinationCountry'] = merged_routes.DestinationAirport.map(Airport_to_country)
merged_routes.head(5)

In [None]:
def overlaps(route_a, route_b):
    source_dist = distance(distance_mapping[route_a['SourceAirport']],
                           distance_mapping[route_b['SourceAirport']]).km
    dest_dist = distance(distance_mapping[route_a['DestinationAirport']],
                         distance_mapping[route_b['DestinationAirport']]).km
    return source_dist <= 100 and dest_dist <= 100

"""
Computes the overlap (competition) score between two airlines.
It takes not only nodes into consideration but also the edges.
So, if Airline A goes from Bucharest to Zurich and Airline B
from Geneva to Zurich, based on this fact only, they are not
competitors.
"""
def compute_overlap_score(airline_x, airline_y):
    routes_x = merged_routes[merged_routes.Name == airline_x]
    routes_y = merged_routes[merged_routes.Name == airline_y]
    
    score = 0
    for i, row_x in routes_x.iterrows():
        does_overlap = False
        joined_routes = routes_y[(routes_y.SourceCountry == row_x['SourceCountry']) &
                                 (routes_y.DestinationCountry == row_x['DestinationCountry'])]
        for j, row_y in joined_routes.iterrows():
            if overlaps(row_x, row_y):
                does_overlap = True
        if does_overlap:
            score += 1
    return score / len(routes_x)

compute_overlap_score("Ryanair", "Wizz Air")
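
The score is asymmetric: it is the share of the first airline's routes that the second airline also serves within the 100 km tolerance at both ends. A compact sketch with approximate, hand-entered coordinates follows. It uses a spherical haversine distance in place of geopy and omits the country prefilter; all names in it are illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def km(a, b):
    """Spherical (haversine) distance between (lat, lon) pairs, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def overlap_score(routes_x, routes_y, coords, radius_km=100):
    """Share of routes in routes_x matched by some route in routes_y."""
    def close(p, q):
        return km(coords[p], coords[q]) <= radius_km
    matched = sum(
        any(close(sx, sy) and close(dx, dy) for sy, dy in routes_y)
        for sx, dx in routes_x
    )
    return matched / len(routes_x)

# Approximate coordinates; LTN and STN are both London-area airports
coords = {"LTN": (51.87, -0.37), "STN": (51.89, 0.24),
          "OTP": (44.57, 26.09), "ZRH": (47.46, 8.55), "GVA": (46.24, 6.11)}
airline_a = [("STN", "OTP"), ("GVA", "ZRH")]
airline_b = [("LTN", "OTP")]
score_ab = overlap_score(airline_a, airline_b, coords)
score_ba = overlap_score(airline_b, airline_a, coords)
print(score_ab, score_ba)
```

Only one of the two routes of `airline_a` is contested, while the single route of `airline_b` is fully contested, so the two directions give different scores.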

In [None]:
"""
Computes the overlap (competition) scores between the given airlines.
It takes not only nodes into consideration but also the edges.
So, if Airline A goes from Bucharest to Zurich and Airline B
from Geneva to Zurich, based on this fact only, they are not
competitors.
"""
def compute_overlap_scores(airline_list):
    scores = []
    for airline_a in airline_list:
        for airline_b in airline_list:
            scores.append([airline_a, airline_b, compute_overlap_score(airline_a, airline_b)])
    return pd.DataFrame(scores, columns = ['AirlineA', 'AirlineB', 'Score'])

In [None]:
overlap_scores = compute_overlap_scores(["Ryanair", "Wizz Air"])
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")

In [None]:
overlap_scores = compute_overlap_scores(Low_cost)
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")

In [None]:
# Best_Airlines must be defined before running this cell
overlap_scores = compute_overlap_scores(Best_Airlines)
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")

In [None]:
# Large_Airlines must be defined before running this cell
overlap_scores = compute_overlap_scores(Large_Airlines)
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")

In [None]:
overlap_scores = compute_overlap_scores(Chinese_Airlines)
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")

In [None]:
# `airlines` was rebound to an array of names earlier, so take the names from merged_routes
overlap_scores = compute_overlap_scores(merged_routes.Name.unique())
sns.heatmap(overlap_scores.pivot(index="AirlineA", columns="AirlineB", values="Score"), cmap="YlGnBu")