# Flight Delays and Cancellations

This notebook performs exploratory data analysis (EDA) on the [2015 Flight Delays and Cancellations](https://www.kaggle.com/usdot/flight-delays) dataset from [Kaggle](https://www.kaggle.com/).


In [108]:
import numpy as np
import csv
import operator
import pandas as pd
import matplotlib.pyplot as plt
import pylab
%matplotlib inline
pylab.rcParams['figure.figsize'] = (16,  10)

## Load dataset:

The dataset contains 3 CSV files:

* airlines.csv: contains information about airlines.
    * **IATA_CODE**: unique identifier.
    * **AIRLINE**.

* airports.csv: contains information about airports.
    * **IATA_CODE**: unique identifier.
    * **AIRPORT**.
    * **CITY**.
    * **LATITUDE**.
    * **LONGITUDE**.

* flights.csv: contains 33 columns related to flight information.
               

In [109]:
airlines_data = pd.read_csv('../input/airlines.csv')
airports_data = pd.read_csv('../input/airports.csv')
flights_data = pd.read_csv('../input/flights.csv')
# Resource limits prevent working with the full dataset, so only a subset of about 90,000 rows is kept.
# Results on the full dataset will differ. Note that this slice starts at row 1, so row 0 is skipped.
flights_data=flights_data[1:90000]
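
If memory is the constraint, one option is to let `read_csv` stop early instead of loading the whole file and slicing afterwards. The sketch below is only an illustration: it uses an in-memory CSV with made-up rows as a stand-in for `flights.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for flights.csv; the real file is much larger.
csv_text = "AIRLINE,DEPARTURE_DELAY\nAA,5\nDL,-3\nUA,0\nAA,12\n"

# nrows limits how many data rows are parsed, so the remaining rows
# are never loaded into memory.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(sample)
```

With the real file, the same `nrows` argument would be passed alongside the path.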

In [110]:
airlines_data.head()

In [111]:
airports_data.head()

In [112]:
flights_data.head()

Keep 25 of the 33 columns. We will use them to study flights by delay, cancellation, cancellation reason, time, speed, and diversion (a change of destination).

These properties will then help passengers choose the best airline to travel with.

In [113]:
flights_data.columns.values
flights_data=flights_data[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
       'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY',
       'SCHEDULED_TIME','DISTANCE', 'SCHEDULED_ARRIVAL',
       'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED',
       'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY',
       'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY']]

flights_data does not contain the airline name (only its unique identifier), so we add a new column **AIRLINE_NAME** to flights_data from airlines_data.

In [114]:
flights_data["AIRLINE_NAME"]=flights_data.apply(lambda x: airlines_data.loc[airlines_data['IATA_CODE'] == x["AIRLINE"],"AIRLINE"].values[0],axis=1)

In [115]:
flights_data[["AIRLINE_NAME","AIRLINE","ORIGIN_AIRPORT"]].head()
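
The row-by-row lookup above can be slow on many rows. A possible vectorized alternative, sketched here on toy frames with made-up values that only mimic the column layout, builds the code-to-name mapping once and applies it with `Series.map`:

```python
import pandas as pd

# Toy frames that mimic the column layout described above (values are made up).
airlines = pd.DataFrame({"IATA_CODE": ["AA", "DL"],
                         "AIRLINE": ["American Airlines Inc.", "Delta Air Lines Inc."]})
flights = pd.DataFrame({"AIRLINE": ["DL", "AA", "DL"]})

# Build a code -> name Series once, then map every flight through it.
code_to_name = airlines.set_index("IATA_CODE")["AIRLINE"]
flights["AIRLINE_NAME"] = flights["AIRLINE"].map(code_to_name)
print(flights)
```

One difference to keep in mind: a code that is missing from the lookup table becomes `NaN` here, whereas indexing `.values[0]` on an empty match raises an `IndexError`.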

Add the origin airport names to flights_data from airports_data.

In [116]:
flights_data["ORIGIN_AIRPORT_NAME"]=flights_data.apply(lambda x: airports_data.loc[airports_data['IATA_CODE'] == x["ORIGIN_AIRPORT"],"AIRPORT"].values[0],axis=1)

In [117]:
flights_data[["AIRLINE_NAME","ORIGIN_AIRPORT","ORIGIN_AIRPORT_NAME"]].head()

General Information about the flights data:

In [118]:
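#General Info
# A flight counts as delayed only when it departed later than scheduled (delay > 0);
# early departures (negative values) and missing values are not counted.
number_of_delayed = flights_data["DEPARTURE_DELAY"].apply(lambda s: 1 if s>0 else 0)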
print("Total number of flights: "+str(len(flights_data)))
print("Number of cancelled flights: "+str(sum(flights_data["CANCELLED"])))
print("Number of delayed flights: "+str(sum(number_of_delayed)))
print("Number of diverted flights: "+str(sum(flights_data["DIVERTED"])))


print("Number of not cancelled flights: "+str(len(flights_data)-sum(flights_data["CANCELLED"])))
print("Number of not delayed flights: "+str(len(flights_data)-sum(number_of_delayed)))
# print("The number of missing data: "+str(flights_data['DEPARTURE_TIME'].isnull().sum()));
print("Percentage of cancelled flights: "+str((sum(flights_data["CANCELLED"])*1.0/len(flights_data))*100)+"%")
print("Percentage of delayed flights: "+str((sum(number_of_delayed)*1.0/len(flights_data))*100)+"%")
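
Counts and percentages like these can also be derived from a boolean mask, since the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up delays, assuming "delayed" means a strictly positive departure delay:

```python
import pandas as pd

# Made-up delays in minutes; negative means an early departure, None is missing.
delays = pd.Series([10.0, -4.0, 0.0, 25.0, None])

is_delayed = delays > 0                 # missing values compare as False
n_delayed = int(is_delayed.sum())       # number of True entries
pct_delayed = is_delayed.mean() * 100   # share of True entries, in percent
print(n_delayed, pct_delayed)
```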

Flag the flights that arrived exactly on time (**ON_TIME**), then count the missing (empty) values in each column.

In [119]:
flights_data["ON_TIME"]=flights_data["ARRIVAL_DELAY"].apply(lambda row: 1 if row==0 else 0)
print(len(flights_data["AIRLINE_DELAY"]))
print("ON_TIME: "+str(flights_data["ON_TIME"].sum()))
missing_data_info={};
for column in flights_data.columns:
    missing_data_info[column]=flights_data[column].isnull().sum()
missing_data_info_sorted = sorted(missing_data_info.items(), key=operator.itemgetter(1))
missing_data_info_sorted
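
The same per-column count can be obtained without an explicit loop. The following sketch uses a small made-up frame and assumes only standard pandas calls:

```python
import pandas as pd

# Toy frame: the reason is only filled in for the cancelled flight.
df = pd.DataFrame({"CANCELLED": [0, 1, 0],
                   "CANCELLATION_REASON": [None, "B", None],
                   "ARRIVAL_DELAY": [3.0, None, -2.0]})

# isnull() marks missing cells, sum() counts them per column,
# sort_values() orders columns from fewest to most missing values.
missing_counts = df.isnull().sum().sort_values()
print(missing_counts)
```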

CANCELLATION_REASON has a large number of missing values because it is only filled in for cancelled flights; flights that were delayed or on time have no value.

In [120]:
flights_data[["DEPARTURE_DELAY","ARRIVAL_DELAY"]].plot.box()

The previous plot shows some negative values, meaning those flights took off a few minutes before their scheduled time. We will call those flights ahead_flights and the others delayed_flights.

The **get_airline_information** function looks up a per-airline value in another dataframe so that it can be added as a new column to airlines_data.

In [121]:
def get_airline_information(column_name,airline_dataframe,flight_dataframe):
    return airline_dataframe.apply(lambda x: flight_dataframe.loc[x["IATA_CODE"]==flight_dataframe["AIRLINE"],column_name].values[0] if len(flight_dataframe.loc[x["IATA_CODE"]==flight_dataframe["AIRLINE"],column_name])>0 else 0,axis=1)
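
For comparison, a lookup with the same intent can be written with `Series.map`. The helper name `lookup_by_airline` and the toy data below are hypothetical; this is a sketch of one possible alternative, not a replacement used elsewhere in the notebook.

```python
import pandas as pd

def lookup_by_airline(column_name, airline_df, per_airline_df):
    """Return one value per airline, or 0 when the airline has no matching row."""
    # Keep the first row per airline, mirroring the `.values[0]` choice above.
    values = per_airline_df.drop_duplicates("AIRLINE").set_index("AIRLINE")[column_name]
    return airline_df["IATA_CODE"].map(values).fillna(0)

airlines = pd.DataFrame({"IATA_CODE": ["AA", "DL", "UA"]})
stats = pd.DataFrame({"AIRLINE": ["AA", "DL"], "ON_TIME": [7, 3]})

result = lookup_by_airline("ON_TIME", airlines, stats)
print(result.tolist())
```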

## Calculate the percentage of cancelled and delayed flights for each Airline:

In [122]:
cancelled_flights = flights_data
grouped_cancelled_flights=cancelled_flights[["AIRLINE","AIRLINE_NAME","CANCELLED","ON_TIME"]].groupby(['AIRLINE','AIRLINE_NAME']).sum().reset_index()
grouped_cancelled_flights["FLIGHTS_COUNT"]=cancelled_flights[["AIRLINE","AIRLINE_NAME","ON_TIME"]].groupby(['AIRLINE','AIRLINE_NAME']).count().reset_index()["ON_TIME"]
grouped_cancelled_flights["CANCELLED_PERCENTAGE"]=grouped_cancelled_flights["CANCELLED"]*1.0/grouped_cancelled_flights["FLIGHTS_COUNT"]*100
grouped_cancelled_flights["ON_TIME_PERCENTAGE"]=grouped_cancelled_flights["ON_TIME"]*1.0/grouped_cancelled_flights["FLIGHTS_COUNT"]*100
grouped_cancelled_flights[["AIRLINE","AIRLINE_NAME","FLIGHTS_COUNT","CANCELLED","ON_TIME","CANCELLED_PERCENTAGE","ON_TIME_PERCENTAGE"]].sort_values(by=['CANCELLED_PERCENTAGE'],ascending=[False])
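
The counts, sums and percentages can also be produced in a single `groupby` using named aggregation. A minimal sketch on made-up flights (the column names follow the notebook, the values do not come from the dataset):

```python
import pandas as pd

flights = pd.DataFrame({
    "AIRLINE":   ["AA", "AA", "AA", "DL", "DL"],
    "CANCELLED": [0, 1, 0, 0, 0],
    "ON_TIME":   [1, 0, 0, 1, 0],
})

# One pass over the groups: row count plus the two sums.
summary = flights.groupby("AIRLINE").agg(
    FLIGHTS_COUNT=("CANCELLED", "size"),
    CANCELLED=("CANCELLED", "sum"),
    ON_TIME=("ON_TIME", "sum"),
).reset_index()

summary["CANCELLED_PERCENTAGE"] = summary["CANCELLED"] / summary["FLIGHTS_COUNT"] * 100
summary["ON_TIME_PERCENTAGE"] = summary["ON_TIME"] / summary["FLIGHTS_COUNT"] * 100
print(summary)
```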

* Add a new column **FLIGHTS_COUNT** to airlines_data: the total number of flights.
* Add a new column **ON_TIME** to airlines_data: the number of flights that arrived exactly on time.
* Add a new column **ON_TIME_PERCENTAGE** to airlines_data, which we can use to sort the airlines and decide which is better.


In [123]:
airlines_data["FLIGHTS_COUNT"]=get_airline_information("FLIGHTS_COUNT",airlines_data,grouped_cancelled_flights)
airlines_data["ON_TIME"]=get_airline_information("ON_TIME",airlines_data,grouped_cancelled_flights)
airlines_data["ON_TIME_PERCENTAGE"]=get_airline_information("ON_TIME_PERCENTAGE",airlines_data,grouped_cancelled_flights)
airlines_data.sort_values(by="ON_TIME_PERCENTAGE",ascending=False)

In [124]:
airlines_data["ON_TIME"].plot.pie(labels=airlines_data["AIRLINE"],autopct='%.2f', fontsize=20, figsize=(10, 10),colors=['r','g','b','w','y'])

In [125]:
airlines_data.sort_values(by=["ON_TIME_PERCENTAGE"],ascending=False).plot(x="AIRLINE",y='ON_TIME_PERCENTAGE',kind='bar', figsize=(10, 10),color=['r','g','b','w','y'])

As noted above, the **DEPARTURE_DELAY** column contains some negative values, which mean the flight took off ahead of schedule. We are going to calculate the mean delay of the delayed flights and the mean lead time of the ahead flights for each airline.

In [126]:
#Delay by Airlines
positive_delayed_flight=flights_data
positive_delayed_flight=positive_delayed_flight[positive_delayed_flight['DEPARTURE_DELAY']>=0]
positive_delayed_flight_grouped=positive_delayed_flight[["AIRLINE","AIRLINE_NAME","DEPARTURE_DELAY"]].groupby(["AIRLINE",'AIRLINE_NAME']).mean().reset_index()

In [127]:
airlines_data["MEAN_DEPARTURE_DELAY"]=get_airline_information("DEPARTURE_DELAY",airlines_data,positive_delayed_flight_grouped)
airlines_data[["AIRLINE","ON_TIME_PERCENTAGE","MEAN_DEPARTURE_DELAY"]].sort_values(by="MEAN_DEPARTURE_DELAY",ascending=True).head()

In [128]:
#Mean delay for each airlines
airlines_data.sort_values(by=["MEAN_DEPARTURE_DELAY"],ascending=False).plot(x="AIRLINE",y="MEAN_DEPARTURE_DELAY",kind='bar')

In [129]:
#Ahead flights by Airlines
ahead_flight=flights_data
ahead_flight=ahead_flight[ahead_flight['DEPARTURE_DELAY']<=0]
ahead_flight['DEPARTURE_DELAY']=ahead_flight['DEPARTURE_DELAY'].abs()
ahead_flight_grouped=ahead_flight[["AIRLINE","AIRLINE_NAME","DEPARTURE_DELAY"]].groupby(['AIRLINE','AIRLINE_NAME']).mean().reset_index()
ahead_flight_grouped.sort_values(by=["DEPARTURE_DELAY"],ascending=False)
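
The split into delayed and ahead flights can be illustrated on a few made-up rows. Note one assumption that differs from the cells above: this sketch uses strict inequalities, so flights with a delay of exactly 0 fall into neither group.

```python
import pandas as pd

flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "AA", "DL", "DL"],
    "DEPARTURE_DELAY": [10.0, -5.0, 30.0, -2.0, -6.0],  # minutes, made up
})

late = flights[flights["DEPARTURE_DELAY"] > 0]
early = flights[flights["DEPARTURE_DELAY"] < 0]

# Mean minutes late, and mean minutes early (as a positive number), per airline.
mean_late = late.groupby("AIRLINE")["DEPARTURE_DELAY"].mean()
mean_early = early.groupby("AIRLINE")["DEPARTURE_DELAY"].mean().abs()
print(mean_late.to_dict(), mean_early.to_dict())
```

An airline with no late flights (DL in this toy data) simply has no entry in `mean_late`.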

In [130]:
airlines_data["MEAN_DEPARTURE_AHEAD"]=get_airline_information("DEPARTURE_DELAY",airlines_data,ahead_flight_grouped)
airlines_data[["AIRLINE","ON_TIME_PERCENTAGE","MEAN_DEPARTURE_AHEAD"]].sort_values(by="MEAN_DEPARTURE_AHEAD",ascending=True).head()

In [131]:
airlines_data.sort_values(by=["MEAN_DEPARTURE_AHEAD"],ascending=False).plot(x="AIRLINE",y="MEAN_DEPARTURE_AHEAD",kind='bar')

In [132]:
airlines_data[["AIRLINE","ON_TIME_PERCENTAGE","MEAN_DEPARTURE_DELAY","MEAN_DEPARTURE_AHEAD"]].sort_values(by=["MEAN_DEPARTURE_AHEAD"],ascending=False)

In [133]:
airlines_data["CANCELLED_PERCENTAGE"]=get_airline_information("CANCELLED_PERCENTAGE",airlines_data,grouped_cancelled_flights)
airlines_data.sort_values(by=["CANCELLED_PERCENTAGE"],ascending=False).plot(x="AIRLINE",y="CANCELLED_PERCENTAGE",kind='bar')

In [134]:
airlines_data[["AIRLINE","CANCELLED_PERCENTAGE"]].sort_values(by=["CANCELLED_PERCENTAGE"],ascending=True)

In [135]:
#Number of diverted flights by AIRLINES
diverted_flights = flights_data#.drop(flights_data[flights_data["CANCELLED"] != 1].index)
diverted_flights=diverted_flights[["AIRLINE","AIRLINE_NAME","DIVERTED"]].groupby(['AIRLINE','AIRLINE_NAME']).sum().reset_index()
diverted_flights.sort_values(by=["DIVERTED"],ascending=True).head(3)

In [136]:
airlines_data["DIVERTED_FLIGHTS"]=get_airline_information("DIVERTED",airlines_data,diverted_flights)
airlines_data.sort_values(by=["DIVERTED_FLIGHTS"],ascending=False).plot(x="AIRLINE",y="DIVERTED_FLIGHTS",kind='bar')

### Cancelled Flights

In [137]:
#CANCELLATION_REASON PERCENTAGE
cancellation_reasons_flights = flights_data
cancellation_reasons_flights=cancellation_reasons_flights[["CANCELLATION_REASON","CANCELLED"]].groupby(['CANCELLATION_REASON']).sum().reset_index()
cancellation_reasons_flights["CANCELLATION_REASON_PERCENTAGE"]=cancellation_reasons_flights["CANCELLED"]*100.0/sum(flights_data["CANCELLED"])
print("A - Carrier; B - Weather; C - National Air System; D - Security")
cancellation_reasons_flights
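
The share of each cancellation reason can also be read directly from `value_counts`. The sketch below uses a made-up column; the code-to-label mapping is the one printed above.

```python
import pandas as pd

# Made-up reasons; None stands for a flight that was not cancelled.
reasons = pd.Series(["B", "B", "A", None, "C", "B"], name="CANCELLATION_REASON")
labels = {"A": "Carrier", "B": "Weather", "C": "National Air System", "D": "Security"}

# value_counts skips missing values by default, so only cancelled flights count.
share = reasons.value_counts(normalize=True).rename(index=labels) * 100
print(share)
```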

In [138]:
#CANCELLATION_REASON FOR AIRLINES
cancellation_reasons_flights = flights_data
cancellation_reasons_flights=cancellation_reasons_flights[["CANCELLED","AIRLINE","AIRLINE_NAME","CANCELLATION_REASON"]].groupby(['AIRLINE','AIRLINE_NAME','CANCELLATION_REASON']).sum().reset_index()
print("A - Carrier; B - Weather; C - National Air System; D - Security")
cancellation_reasons_flights.sort_values(by=['CANCELLED'],ascending=[False])

In [139]:
def create_airlines_cancellation_table(reason_code,airlines_dataframe,cancellation_reasons_dataframe):
    tmp_cancellation_reasons=cancellation_reasons_dataframe[cancellation_reasons_dataframe["CANCELLATION_REASON"]==reason_code]
    return airlines_dataframe.apply(lambda x: tmp_cancellation_reasons.loc[x["IATA_CODE"]==tmp_cancellation_reasons["AIRLINE"],"CANCELLED"].values[0] if len(tmp_cancellation_reasons.loc[x["IATA_CODE"]==tmp_cancellation_reasons["AIRLINE"],"CANCELLED"])>0 else 0,axis=1)

    

In [140]:
airlines_cancellation_reasons=airlines_data;
airlines_cancellation_reasons["CARRIER"]=create_airlines_cancellation_table("A",airlines_cancellation_reasons,cancellation_reasons_flights)
airlines_cancellation_reasons["WEATHER"]=create_airlines_cancellation_table("B",airlines_cancellation_reasons,cancellation_reasons_flights)
airlines_cancellation_reasons["AIR_SYS"]=create_airlines_cancellation_table("C",airlines_cancellation_reasons,cancellation_reasons_flights)
airlines_cancellation_reasons["SECURITY"]=create_airlines_cancellation_table("D",airlines_cancellation_reasons,cancellation_reasons_flights)
airlines_cancellation_reasons
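
A table with one row per airline and one column per reason can also be built with `pd.crosstab`. This is a sketch on made-up cancelled flights, not the notebook's own method; the output column names are chosen to match the ones used above.

```python
import pandas as pd

# Made-up cancelled flights: one row per cancellation.
cancelled = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "UA", "UA", "UA"],
    "CANCELLATION_REASON": ["A", "B", "B", "B", "C", "B"],
})

# Count cancellations per (airline, reason) pair.
table = pd.crosstab(cancelled["AIRLINE"], cancelled["CANCELLATION_REASON"])
# Make sure all four reason codes appear, even those with no cancellations.
table = table.reindex(columns=["A", "B", "C", "D"], fill_value=0)
table.columns = ["CARRIER", "WEATHER", "AIR_SYS", "SECURITY"]
print(table)
```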

In [141]:
# Setting the positions and width for the bars
pos = list(range(len(airlines_cancellation_reasons['AIRLINE']))) 
width = 0.25 
# Plotting the bars
fig, ax = plt.subplots(figsize=(35,10))

# One bar series per cancellation reason; each series is shifted right by one bar width.
plt.bar(pos, 
        airlines_cancellation_reasons['CARRIER'], 
        width, 
        alpha=0.5, 
        color='#EE3224', 
        label='CARRIER')

plt.bar([p + width for p in pos], 
        airlines_cancellation_reasons['WEATHER'], 
        width, 
        alpha=0.5, 
        color='#4286f4', 
        label='WEATHER')

plt.bar([p + width*2 for p in pos], 
        airlines_cancellation_reasons['AIR_SYS'], 
        width, 
        alpha=0.5, 
        color='#FFC222', 
        label='AIR_SYS')

plt.bar([p + width*3 for p in pos], 
        airlines_cancellation_reasons['SECURITY'], 
        width, 
        alpha=0.5, 
        color='#80f441', 
        label='SECURITY')

# Set the y axis label
ax.set_ylabel('Count')

# Set the chart's title
ax.set_title('Airlines cancelled reasons counts')
ax.set_xticks([p + 1.5 * width for p in pos])

# Set the labels for the x ticks
ax.set_xticklabels(airlines_cancellation_reasons['AIRLINE'])

plt.xlim(min(pos)-width, max(pos)+width*5)
plt.ylim([0, max(airlines_cancellation_reasons['SECURITY'] + airlines_cancellation_reasons['AIR_SYS'] + airlines_cancellation_reasons['WEATHER']+airlines_cancellation_reasons["CARRIER"])] )

# Adding the legend and showing the plot
plt.legend(['CARRIER',"WEATHER", 'AIR_SYS', 'SECURITY'], loc='upper left')
plt.grid()
plt.show()

## Speed

We have **DEPARTURE_TIME**, **ARRIVAL_TIME**, **YEAR**, **MONTH** and **DAY** in flights_data.
* **DEPARTURE_TIME** and **ARRIVAL_TIME** are represented by 2, 3 or 4 digits. These digits encode the time of the flight as hhmm.
 With only 2 digits, such as 33, the time is 00:33.
 With only 3 digits, such as 133, the time is 01:33.
 We want every value to have 4 digits, so we pad with zeros on the left.
 
* After converting all **DEPARTURE_TIME** and **ARRIVAL_TIME** values to 4 digits, we will combine them with **YEAR**, **MONTH** and **DAY** to create the new columns **DEPARTURE_DATE** and **ARRIVAL_DATE**. Both columns have a datetime type, so we can easily calculate the duration between departure and arrival and use it to calculate the mean speed of the flight.
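
The padding and parsing steps described above can be sketched on a few made-up rows. This version pads with `str.zfill` and builds the timestamp string with vectorized string operations; it is an illustration under those assumptions, not the exact code used in the cells that follow.

```python
import pandas as pd

df = pd.DataFrame({"YEAR": [2015, 2015, 2015], "MONTH": [1, 1, 12], "DAY": [1, 2, 31],
                   "DEPARTURE_TIME": [33, 133, 2354]})

# Left-pad the time to 4 digits: 33 -> "0033", 133 -> "0133".
hhmm = df["DEPARTURE_TIME"].astype(int).astype(str).str.zfill(4)

# Assemble "YYYY-MM-DD-HHMM" strings, padding month and day to 2 digits.
stamp = (df["YEAR"].astype(str) + "-"
         + df["MONTH"].astype(str).str.zfill(2) + "-"
         + df["DAY"].astype(str).str.zfill(2) + "-" + hhmm)

# Values that do not form a valid time become NaT because of errors="coerce".
df["DEPARTURE_DATE"] = pd.to_datetime(stamp, format="%Y-%m-%d-%H%M", errors="coerce")
print(df["DEPARTURE_DATE"])
```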

In [142]:
# pad every departure time to 4 digits
flights_data['DEPARTURE_TIME']=flights_data['DEPARTURE_TIME'].fillna(0)
flights_data['DEPARTURE_TIME']=flights_data['DEPARTURE_TIME'].astype(int)
# zfill pads with zeros on the left and always returns a string: 5 -> "0005", 133 -> "0133"
flights_data['SCHEDULED_DEPARTURE']=flights_data['SCHEDULED_DEPARTURE'].apply(lambda x: str(int(x)).zfill(4))
flights_data['DEPARTURE_TIME']=flights_data['DEPARTURE_TIME'].apply(lambda x: str(int(x)).zfill(4))

# combine the time with the date and format it
flights_data['SCHEDULED_DEPARTURE_DATE']=flights_data[['SCHEDULED_DEPARTURE','YEAR','MONTH','DAY']].apply(lambda x: str(x['YEAR'])+"-"+str(x['MONTH'])+"-"+str(x['DAY'])+"-"+str(x['SCHEDULED_DEPARTURE']),axis=1)
flights_data['SCHEDULED_DEPARTURE_DATE']=pd.to_datetime(flights_data['SCHEDULED_DEPARTURE_DATE'], format='%Y-%m-%d-%H%M', errors='coerce')

flights_data['DEPARTURE_DATE']=flights_data[['DEPARTURE_TIME','YEAR','MONTH','DAY']].apply(lambda x: str(x['YEAR'])+"-"+str(x['MONTH'])+"-"+str(x['DAY'])+"-"+str(x['DEPARTURE_TIME']),axis=1)
flights_data['DEPARTURE_DATE']=pd.to_datetime(flights_data['DEPARTURE_DATE'], format='%Y-%m-%d-%H%M', errors='coerce')

In [143]:
flights_data['DEPARTURE_DATE'].head()

In [144]:
flights_data.head()

In [145]:
# pad every arrival time to 4 digits
flights_data['ARRIVAL_TIME']=flights_data['ARRIVAL_TIME'].fillna(0)
flights_data['ARRIVAL_TIME']=flights_data['ARRIVAL_TIME'].astype(int)
# zfill pads with zeros on the left and always returns a string: 5 -> "0005", 133 -> "0133"
flights_data['SCHEDULED_ARRIVAL']=flights_data['SCHEDULED_ARRIVAL'].apply(lambda x: str(int(x)).zfill(4))
flights_data['ARRIVAL_TIME']=flights_data['ARRIVAL_TIME'].apply(lambda x: str(int(x)).zfill(4))

# combine the time with the date and format it
flights_data['SCHEDULED_ARRIVAL_DATE']=flights_data[['SCHEDULED_ARRIVAL','YEAR','MONTH','DAY']].apply(lambda x: str(x['YEAR'])+"-"+str(x['MONTH'])+"-"+str(x['DAY'])+"-"+str(x['SCHEDULED_ARRIVAL']),axis=1)
flights_data['SCHEDULED_ARRIVAL_DATE']=pd.to_datetime(flights_data['SCHEDULED_ARRIVAL_DATE'], format='%Y-%m-%d-%H%M', errors='coerce')

flights_data['ARRIVAL_DATE']=flights_data[['ARRIVAL_TIME','YEAR','MONTH','DAY']].apply(lambda x: str(x['YEAR'])+"-"+str(x['MONTH'])+"-"+str(x['DAY'])+"-"+str(x['ARRIVAL_TIME']),axis=1)
flights_data['ARRIVAL_DATE']=pd.to_datetime(flights_data['ARRIVAL_DATE'], format='%Y-%m-%d-%H%M', errors='coerce')


In [146]:
flights_data["ARRIVAL_TIME"].head()

Now we are going to calculate the difference between **ARRIVAL_DATE** and **DEPARTURE_DATE** which gives the duration for the flights.

In [147]:
flights_data["FLIGHT_TIME"]=flights_data['ARRIVAL_DATE']-flights_data['DEPARTURE_DATE']

In [148]:
flights_data[["ARRIVAL_DATE","DEPARTURE_DATE",'FLIGHT_TIME']].head()

Add a new column **FLIGHT_TIME_IN_MINUTES**, which represents the flight duration (**FLIGHT_TIME**) in minutes.

In [149]:
flights_data["FLIGHT_TIME_IN_MINUTES"]=flights_data['FLIGHT_TIME'].apply(lambda x: int(x.seconds/60) if x.seconds>0 else 0)
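
Because both timestamps are built from the same calendar date, a flight that lands after midnight produces a negative difference. The `seconds` component of a timedelta is always non-negative, which makes such a difference wrap around to the next day. A small sketch with made-up times shows the effect; it ignores time zones, as the notebook does.

```python
import pandas as pd

# Both timestamps use the departure's calendar date, as in the notebook.
dep = pd.to_datetime(pd.Series(["2015-01-01 23:30", "2015-01-01 08:00"]))
arr = pd.to_datetime(pd.Series(["2015-01-01 00:45", "2015-01-01 09:15"]))

delta = arr - dep   # the first difference is negative (-1 day + 01:15:00)

# .dt.seconds is the non-negative seconds component of each timedelta,
# so the overnight flight yields 75 minutes rather than a negative value.
minutes = delta.dt.seconds // 60
print(minutes.tolist())
```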

Speed = Distance/time

In [150]:
flights_data['SPEED']=flights_data.apply(lambda x: x["DISTANCE"]/x['FLIGHT_TIME_IN_MINUTES'] if x['FLIGHT_TIME_IN_MINUTES']>0 else 0,axis=1)

In [157]:
flights_data[['SPEED','DISTANCE','FLIGHT_TIME_IN_MINUTES','ARRIVAL_DATE','DEPARTURE_DATE']].sort_values(by=["SPEED"],ascending=False).head()

Calculate the *mean speed* for each airline. This speed is approximate (it is not the aircraft's speed in the air) because the flight duration includes the time spent on takeoff and landing.

In [152]:
#Speed by AIRLINES
flights=flights_data[["AIRLINE","SPEED"]].groupby(['AIRLINE']).mean().reset_index()
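
Flights with an unknown duration receive a speed of 0 above, which pulls the per-airline mean down. A possible variation, sketched on made-up rows, marks those flights as missing so the mean skips them. It also converts to miles per hour, assuming **DISTANCE** is given in miles; the column name `SPEED_MPH` is hypothetical.

```python
import numpy as np
import pandas as pd

flights = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL"],
    "DISTANCE": [600, 300, 500],               # assumed to be miles, made-up values
    "FLIGHT_TIME_IN_MINUTES": [120, 0, 100],   # 0 marks an unknown duration
})

# Treat unknown durations as missing so they do not count as speed 0.
minutes = flights["FLIGHT_TIME_IN_MINUTES"].replace(0, np.nan)
flights["SPEED_MPH"] = flights["DISTANCE"] / minutes * 60

# mean() skips missing values by default.
mean_speed = flights.groupby("AIRLINE")["SPEED_MPH"].mean()
print(mean_speed)
```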

In [153]:
airlines_data["MEAN_SPEED"]=get_airline_information("SPEED",airlines_data,flights)
airlines_data[["AIRLINE","MEAN_SPEED"]].sort_values(by=["MEAN_SPEED"],ascending=False).head(3)

In [154]:
#Airlines by Speed
# plot = flights.sort_values(by=["SPEED"],ascending=False).plot(x="AIRLINE_NAME",y="SPEED",kind='bar')
airlines_data.sort_values(by=["MEAN_SPEED"],ascending=False).plot(x="AIRLINE",y="MEAN_SPEED",kind='bar')


Using all the previous metrics, we will calculate a simple ranking of the airlines by awarding each airline points for each metric.

In [155]:
airlines_data["RANKING"]=0
tmp=airlines_data.sort_values(by=["ON_TIME_PERCENTAGE"],ascending=True).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

tmp=tmp.sort_values(by=["MEAN_SPEED"],ascending=True).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

tmp=tmp.sort_values(by=["MEAN_DEPARTURE_DELAY"],ascending=False).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

tmp=tmp.sort_values(by=["MEAN_DEPARTURE_AHEAD"],ascending=False).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

tmp=tmp.sort_values(by=["CANCELLED_PERCENTAGE"],ascending=False).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

tmp=tmp.sort_values(by=["DIVERTED_FLIGHTS"],ascending=False).reset_index(drop=True)
tmp["RANKING"]=tmp.apply(lambda x: (x["RANKING"]+x.name),axis=1)

(tmp.sort_values(by=["RANKING"],ascending=False))[["AIRLINE","RANKING"]]
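
The same point scheme (position in a sorted list, starting at 0 for the worst airline) can be expressed with `Series.rank`. The sketch below uses two made-up metrics only and is meant as an illustration of the idea rather than a reproduction of the full ranking.

```python
import pandas as pd

airlines = pd.DataFrame({
    "AIRLINE": ["A", "B", "C"],
    "ON_TIME_PERCENTAGE": [10.0, 30.0, 20.0],    # higher is better
    "CANCELLED_PERCENTAGE": [5.0, 1.0, 3.0],     # lower is better
})

# rank() starts at 1, so subtract 1 to give the worst airline 0 points.
on_time_points = airlines["ON_TIME_PERCENTAGE"].rank(ascending=True, method="first") - 1
cancel_points = airlines["CANCELLED_PERCENTAGE"].rank(ascending=False, method="first") - 1

airlines["RANKING"] = (on_time_points + cancel_points).astype(int)
print(airlines.sort_values("RANKING", ascending=False)[["AIRLINE", "RANKING"]])
```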

## ORIGIN_AIRPORT

In [156]:
#Percentage by ORIGIN_AIRPORT
cancelled_flights_by_origin_airpot = flights_data
grouped_cancelled_flights_by_origin_airpot=cancelled_flights_by_origin_airpot[["ORIGIN_AIRPORT","CANCELLED","ON_TIME"]].groupby(['ORIGIN_AIRPORT']).sum().reset_index()
grouped_cancelled_flights_by_origin_airpot["FLIGHTS_COUNT"]=cancelled_flights_by_origin_airpot[["ORIGIN_AIRPORT","ON_TIME"]].groupby(['ORIGIN_AIRPORT']).count().reset_index()["ON_TIME"]
grouped_cancelled_flights_by_origin_airpot["CANCELLED_PERCENTAGE"]=grouped_cancelled_flights_by_origin_airpot["CANCELLED"]*1.0/grouped_cancelled_flights_by_origin_airpot["FLIGHTS_COUNT"]*100
grouped_cancelled_flights_by_origin_airpot["ON_TIME_PERCENTAGE"]=grouped_cancelled_flights_by_origin_airpot["ON_TIME"]*1.0/grouped_cancelled_flights_by_origin_airpot["FLIGHTS_COUNT"]*100
grouped_cancelled_flights_by_origin_airpot[["ORIGIN_AIRPORT","FLIGHTS_COUNT","CANCELLED","ON_TIME","CANCELLED_PERCENTAGE","ON_TIME_PERCENTAGE"]].sort_values(by=['ON_TIME_PERCENTAGE'],ascending=[False])
plt.figure();
# print(len(grouped_cancelled_flights_by_origin_airpot["ORIGIN_AIRPORT"]))
plot = grouped_cancelled_flights_by_origin_airpot.sort_values(by=["ON_TIME_PERCENTAGE"],ascending=False).plot(x="ORIGIN_AIRPORT",y="ON_TIME_PERCENTAGE",kind='bar',figsize=(100,30))