<img src="../images/airplane-symbol.jpg" style="float: left; margin: 20px;" width="50" height="50"> 
#  Predicting Flight Delays (<i>a Proof-of-Concept</i>)

Author: Solomon Heng

---

In [1]:
import pandas as pd
import numpy as np

---
### (1) Importing overall flights data

---

In [2]:
df_flights = pd.read_csv('../datasets/flights.csv')
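
On a file of this size, `read_csv` can emit a `DtypeWarning` when a column holds mixed types. A minimal sketch of one way to avoid that by declaring dtypes up front. It assumes the airport code columns are the mixed ones, which the notebook does not confirm; `sample_csv` and `df_sample` are hypothetical names, and the second data row is made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for flights.csv; the second row uses a numeric-looking airport
# code to mimic a column with mixed types (an assumption for illustration).
sample_csv = io.StringIO(
    "YEAR,MONTH,DAY,ORIGIN_AIRPORT,DESTINATION_AIRPORT,ARRIVAL_DELAY\n"
    "2015,1,1,ANC,SEA,-22.0\n"
    "2015,10,1,10397,ATL,5.0\n"
)

# Declaring the text columns as strings stops pandas from guessing their type.
df_sample = pd.read_csv(
    sample_csv,
    dtype={"ORIGIN_AIRPORT": str, "DESTINATION_AIRPORT": str},
)
print(df_sample["ORIGIN_AIRPORT"].tolist())
```

Passing `low_memory=False` to `read_csv` is another option if the exact columns are not known, at the cost of higher memory use while parsing.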



In [3]:
pd.set_option('display.max_columns', 40)
df_flights.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,2354.0,-11.0,21.0,15.0,205.0,194.0,169.0,1448,404.0,4.0,430,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,2.0,-8.0,12.0,14.0,280.0,279.0,263.0,2330,737.0,4.0,750,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,18.0,-2.0,16.0,34.0,286.0,293.0,266.0,2296,800.0,11.0,806,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,15.0,-5.0,15.0,30.0,285.0,281.0,258.0,2342,748.0,8.0,805,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,24.0,-1.0,11.0,35.0,235.0,215.0,199.0,1448,254.0,5.0,320,259.0,-21.0,0,0,,,,,,


---
### (2) Extracting only KATL flights (arrivals)

---

In [4]:
atl_df = df_flights[df_flights['DESTINATION_AIRPORT'] == 'ATL']
atl_df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
9,2015,1,1,4,DL,1173,N826DN,LAS,ATL,30,33.0,3.0,12.0,45.0,221.0,203.0,186.0,1747,651.0,5.0,711,656.0,-15.0,0,0,,,,,,
10,2015,1,1,4,DL,2336,N958DN,DEN,ATL,30,24.0,-6.0,12.0,36.0,173.0,149.0,133.0,1199,449.0,4.0,523,453.0,-30.0,0,0,,,,,,
13,2015,1,1,4,DL,2324,N3751B,SLC,ATL,40,34.0,-6.0,18.0,52.0,215.0,199.0,176.0,1590,548.0,5.0,615,553.0,-22.0,0,0,,,,,,
33,2015,1,1,4,DL,95,N320US,SLC,ATL,140,134.0,-6.0,43.0,217.0,215.0,231.0,182.0,1590,719.0,6.0,715,725.0,10.0,0,0,,,,,,
77,2015,1,1,4,EV,5583,N882AS,VPS,ATL,520,514.0,-6.0,9.0,523.0,66.0,57.0,42.0,250,705.0,6.0,726,711.0,-15.0,0,0,,,,,,


In [6]:
atl_df.shape

(346904, 31)

---
### (3) Exploring data extracted

Exploring the extracted KATL arrival data to see whether we have sufficient data for analysis and modeling

---

**The Federal Aviation Administration (FAA) considers a flight to be delayed when it is 15 minutes or more later than its scheduled time**

**Eventually we will be classifying the delays into 4 groups:**
1. <15 minutes _(group 0)_
2. 15 minutes to 1 hour _(group 1)_
3. 1 hour to 3 hours _(group 2)_
4. above 3 hours _(group 3)_

**Rationale:**
1. _(<15 mins) is the no-delay category, where operations are normal_
2. _(15 mins to 1 hr) is the category in which the airport or airline will perhaps decide whether ground resources need to be redeployed_
3. _(1 hr to 3 hrs) is the category in which the airline or airport will perhaps decide on the actions needed to mitigate the impact of the delay (e.g. rebooking transit passengers onto another flight so that the connecting flight does not depart late)_
4. _(>3 hrs) is the category in which compensation is technically already due (in the EU), and airlines or airports will perhaps decide how to do "damage control"_

_Note: In the EU, passengers must be compensated once a delay exceeds 3 hours. The US has no mandatory compensation for flight delays, so we use the EU rule as the benchmark for the last class (> 3 hours). Also, when delays are long, airlines would likely reschedule the departure to a later time._

As such, we will explore the 4 categories of data to see whether we have sufficient data in each
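
The four groups can also be assigned in one pass. The sketch below uses `pd.cut` on hypothetical delay values rather than the real data. It assumes each boundary (15, 60, 180 minutes) belongs to the higher group, which is a choice the notebook does not state; `delays` and `delay_group` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical arrival delays in minutes, including boundary values and a missing one.
delays = pd.Series([-22.0, 14.0, 15.0, 59.0, 60.0, 180.0, 400.0, np.nan])

# right=False closes each bin on the left:
# [-inf, 15) -> 0, [15, 60) -> 1, [60, 180) -> 2, [180, inf) -> 3
delay_group = pd.cut(
    delays,
    bins=[-np.inf, 15, 60, 180, np.inf],
    labels=[0, 1, 2, 3],
    right=False,
)
print(delay_group.tolist())
```

With this binning every non-missing delay falls into exactly one group, and a missing delay stays missing instead of being silently dropped.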

In [25]:
# No-delay flights (arrival delay under 15 minutes, including early arrivals)

print(atl_df[atl_df['ARRIVAL_DELAY'] < 15].shape)
atl_df[atl_df['ARRIVAL_DELAY'] < 15].head()

(290998, 31)


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
9,2015,1,1,4,DL,1173,N826DN,LAS,ATL,30,33.0,3.0,12.0,45.0,221.0,203.0,186.0,1747,651.0,5.0,711,656.0,-15.0,0,0,,,,,,
10,2015,1,1,4,DL,2336,N958DN,DEN,ATL,30,24.0,-6.0,12.0,36.0,173.0,149.0,133.0,1199,449.0,4.0,523,453.0,-30.0,0,0,,,,,,
13,2015,1,1,4,DL,2324,N3751B,SLC,ATL,40,34.0,-6.0,18.0,52.0,215.0,199.0,176.0,1590,548.0,5.0,615,553.0,-22.0,0,0,,,,,,
33,2015,1,1,4,DL,95,N320US,SLC,ATL,140,134.0,-6.0,43.0,217.0,215.0,231.0,182.0,1590,719.0,6.0,715,725.0,10.0,0,0,,,,,,
77,2015,1,1,4,EV,5583,N882AS,VPS,ATL,520,514.0,-6.0,9.0,523.0,66.0,57.0,42.0,250,705.0,6.0,726,711.0,-15.0,0,0,,,,,,


In [26]:
# Delayed arrivals >15mins & <1hour
# (strict comparisons: a delay of exactly 15 or 60 minutes is not counted here)

print(atl_df[(atl_df['ARRIVAL_DELAY'] > 15) & (atl_df['ARRIVAL_DELAY'] < 60)].shape)
atl_df[(atl_df['ARRIVAL_DELAY'] > 15) & (atl_df['ARRIVAL_DELAY'] < 60)].head()

(32933, 31)


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
400,2015,1,1,4,DL,1456,N3759,RDU,ATL,605,701.0,56.0,11.0,712.0,90.0,73.0,57.0,356,809.0,5.0,735,814.0,39.0,0,0,,0.0,0.0,39.0,0.0,0.0
1109,2015,1,1,4,DL,1446,N3767,PHX,ATL,715,809.0,54.0,17.0,826.0,206.0,193.0,171.0,1587,1317.0,5.0,1241,1322.0,41.0,0,0,,0.0,0.0,0.0,0.0,41.0
1366,2015,1,1,4,WN,742,N245WN,TPA,ATL,740,739.0,-1.0,9.0,748.0,90.0,108.0,69.0,406,857.0,30.0,910,927.0,17.0,0,0,,17.0,0.0,0.0,0.0,0.0
1442,2015,1,1,4,DL,1460,N901DA,MIA,ATL,745,825.0,40.0,19.0,844.0,120.0,107.0,84.0,594,1008.0,4.0,945,1012.0,27.0,0,0,,0.0,0.0,27.0,0.0,0.0
1643,2015,1,1,4,DL,2030,N332NW,MDW,ATL,800,838.0,38.0,12.0,850.0,117.0,98.0,80.0,591,1110.0,6.0,1057,1116.0,19.0,0,0,,0.0,0.0,19.0,0.0,0.0


In [33]:
# Delayed arrivals >1hour & <3hour
# (strict comparisons: a delay of exactly 60 or 180 minutes is not counted here)

print(atl_df[(atl_df['ARRIVAL_DELAY'] > 60) & (atl_df['ARRIVAL_DELAY'] < 180)].shape)
atl_df[(atl_df['ARRIVAL_DELAY'] > 60) & (atl_df['ARRIVAL_DELAY'] < 180)].head()

(13460, 31)


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
262,2015,1,1,4,DL,1205,N612DL,PHX,ATL,600,804.0,124.0,16.0,820.0,207.0,190.0,169.0,1587,1309.0,5.0,1127,1314.0,107.0,0,0,,0.0,0.0,0.0,0.0,107.0
465,2015,1,1,4,WN,1966,N466WN,CMH,ATL,615,831.0,136.0,13.0,844.0,110.0,90.0,65.0,447,949.0,12.0,805,1001.0,116.0,0,0,,0.0,0.0,116.0,0.0,0.0
1284,2015,1,1,4,F9,1070,N227FR,ORD,ATL,730,835.0,65.0,15.0,850.0,110.0,106.0,82.0,606,1112.0,9.0,1020,1121.0,61.0,0,0,,0.0,0.0,61.0,0.0,0.0
1379,2015,1,1,4,DL,240,N803DN,SFO,ATL,740,1031.0,171.0,16.0,1047.0,271.0,270.0,248.0,2139,1755.0,6.0,1511,1801.0,170.0,0,0,,0.0,0.0,170.0,0.0,0.0
1894,2015,1,1,4,OO,6378,N109SY,ORD,ATL,815,955.0,100.0,16.0,1011.0,121.0,109.0,85.0,606,1236.0,8.0,1116,1244.0,88.0,0,0,,0.0,0.0,63.0,25.0,0.0


In [34]:
# Delayed arrivals >3hour
# (strict comparison: a delay of exactly 180 minutes is not counted here)

print(atl_df[atl_df['ARRIVAL_DELAY'] > 180].shape)
atl_df[atl_df['ARRIVAL_DELAY'] > 180].head()

(3353, 31)


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
5787,2015,1,1,4,EV,5163,N709EV,TUL,ATL,1239,1852.0,373.0,34.0,1926.0,106.0,117.0,79.0,674,2145.0,4.0,1525,2149.0,384.0,0,0,,11.0,0.0,373.0,0.0,0.0
14868,2015,1,2,5,DL,1768,N918DE,CAE,ATL,615,1627.0,612.0,9.0,1636.0,68.0,67.0,45.0,192,1721.0,13.0,723,1734.0,611.0,0,0,,0.0,0.0,611.0,0.0,0.0
22654,2015,1,2,5,DL,1743,N994DL,ECP,ATL,1349,454.0,905.0,22.0,516.0,71.0,68.0,37.0,240,653.0,9.0,1600,702.0,902.0,0,0,,0.0,0.0,902.0,0.0,0.0
25471,2015,1,2,5,EV,5604,N754EV,MOB,ATL,1635,1959.0,204.0,19.0,2018.0,75.0,84.0,48.0,302,2206.0,17.0,1850,2223.0,213.0,0,0,,9.0,0.0,0.0,204.0,0.0
30800,2015,1,3,6,EV,5210,N927EV,BTR,ATL,510,1332.0,502.0,8.0,1340.0,84.0,81.0,61.0,448,1541.0,12.0,734,1553.0,499.0,0,0,,0.0,0.0,480.0,19.0,0.0


---
**We seem to have sufficient data for analysis for KATL airport**
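
To put the counts in perspective, the shapes printed above can be turned into class shares. The sketch below only does arithmetic on those printed row counts; the variable names are illustrative. It also shows how many rows none of the four filters matched, which can only be rows with a missing `ARRIVAL_DELAY` or a delay of exactly 15, 60, or 180 minutes.

```python
# Row counts copied from the cell outputs above.
total_rows = 346904
group_counts = {
    "group 0 (<15 min)": 290998,
    "group 1 (15 min to 1 h)": 32933,
    "group 2 (1 h to 3 h)": 13460,
    "group 3 (>3 h)": 3353,
}

# Share of all KATL arrivals that falls in each group.
for name, count in group_counts.items():
    print(f"{name}: {count / total_rows:.1%}")

# Rows matched by none of the four strict filters.
unmatched = total_rows - sum(group_counts.values())
print("unmatched rows:", unmatched)
```

The classes are heavily imbalanced towards group 0, which is worth keeping in mind when choosing evaluation metrics later.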

---
**Points of consideration:**

1. We shall set our objective as classifying the arrival delay of flights that have <u><b>begun their departure</b></u>.
    > For this, we will take Departure Delay _(the difference between the actual time at which the aircraft started pushing back and the scheduled pushback time)_ into consideration when making predictions. This effectively means the model is only usable once an aircraft is confirmed for departure and has commenced pushback.
2. Do note that the dataset contains information that we would not normally have until the event has actually occurred. As such, we need to consider whether to include it in the model later on _(doing so might not be realistic for real-life usage)_.
3. Our initial intention was to include enroute weather data in the model. However, that data is very difficult to obtain. As such, we will assume that the airlines have set the scheduled flight duration properly, with enroute weather taken into consideration.
    > The scheduled time will already account for enroute weather conditions _(e.g. strong enroute tailwind -> shorter flight duration)_, so we do not need to include it in the data. However, the TAF or terminal METAR will still matter, because very poor conditions at the airport will cause diversions or holding delays when aircraft are unable to land.
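
One way to act on points 1 and 2 is to restrict the model inputs to columns that are known once pushback has started. The sketch below is only an illustration: the `KNOWN_AT_PUSHBACK` list and the `split_features_target` helper are assumptions, not choices made in this notebook, and the second demo row is hypothetical.

```python
import pandas as pd

# Assumption: columns taken to be known once pushback has started.
KNOWN_AT_PUSHBACK = [
    "MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "ORIGIN_AIRPORT",
    "SCHEDULED_DEPARTURE", "DEPARTURE_DELAY", "SCHEDULED_TIME",
    "DISTANCE", "SCHEDULED_ARRIVAL",
]
TARGET = "ARRIVAL_DELAY"


def split_features_target(df):
    """Return (X, y) using only the columns assumed to be known at pushback."""
    labelled = df.dropna(subset=[TARGET])  # drop rows with no recorded arrival delay
    return labelled[KNOWN_AT_PUSHBACK], labelled[TARGET]


# First row copied from the notebook output; second row is hypothetical.
demo = pd.DataFrame({
    "MONTH": [1, 1], "DAY": [1, 1], "DAY_OF_WEEK": [4, 4],
    "AIRLINE": ["DL", "DL"], "ORIGIN_AIRPORT": ["RDU", "LAS"],
    "SCHEDULED_DEPARTURE": [605, 30], "DEPARTURE_DELAY": [56.0, 3.0],
    "SCHEDULED_TIME": [90.0, 221.0], "DISTANCE": [356, 1747],
    "SCHEDULED_ARRIVAL": [735, 711],
    "TAXI_OUT": [11.0, 12.0], "AIR_TIME": [57.0, None],
    "ARRIVAL_DELAY": [39.0, None],
})
X, y = split_features_target(demo)
print(X.shape, y.tolist())
```

Columns such as `TAXI_OUT`, `AIR_TIME`, `WHEELS_ON` and the delay-reason fields are left out here because they are only recorded after the event, so including them would leak the outcome into the features.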

---
### (4) Exporting data extracted

---

In [5]:
atl_df.to_csv('../datasets/2015KATLflights.csv', index=False)
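
The outputs above show an `Unnamed: 0` column, which is the name pandas gives a header-less column and often indicates a previously saved index. If that column is not needed downstream, it could be dropped before exporting. A small round-trip sketch on a hypothetical two-row frame (`demo`, `buffer` and `reloaded` are illustrative names, and an in-memory buffer stands in for the file on disk):

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for atl_df, including "Unnamed: 0".
demo = pd.DataFrame({
    "Unnamed: 0": [9, 10],
    "AIRLINE": ["DL", "DL"],
    "ARRIVAL_DELAY": [-15.0, -30.0],
})

buffer = io.StringIO()
# Drop the leftover index column, then write without adding a new index column.
demo.drop(columns=["Unnamed: 0"]).to_csv(buffer, index=False)

# Read the export back to confirm the columns survived as expected.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```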