# Flight Delays: Web App

In this final notebook, we want to prepare our data for use in the web app.

### Business Goals

Storage is cheaper than processing power, so in this notebook we precompute as much data as possible for our app server. This keeps costly operations off the live server: the app only has to look up precomputed values.

Here we want to achieve the following:

1. Prepare data for visualizations by airport, holiday, time of day/week, and date.
2. Save the data out as CSV files.

In [2]:
import pandas as pd
import glob
import os
import requests
import json
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import sqlite3 as db
from pytz import timezone
import pytz

# Only the datetime class is imported; a bare `import datetime` would be shadowed by this name
from datetime import datetime, timedelta

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

import seaborn as sns

## Importing data

First we import the data that was prepared in our first notebook.

In [3]:
# Importing data
carrier_data = pd.read_csv('data/prepared/data_for_graphing.csv')

In [4]:
# Filter for airports relevant for our business case
relevant_airports = ['ATL', 'DFW', 'DEN', 'ORD', 'LAX', 'CLT', 'LAS', 'PHX', 
                     'MCO', 'SEA', 'MIA', 'IAH', 'JFK', 'FLL', 'EWR', 'SFO', 'MSP', 'DTW',
                     'BOS', 'SLC', 'PHL', 'BWI', 'TPA', 'SAN', 'MDW', 'LGA', 'BNA', 'IAD',
                     'DAL', 'DCA', 'PDX', 'AUS', 'HOU', 'HNL', 'STL', 'RSW', 'SMF', 'MSY',
                     'SJU', 'RDU', 'OAK', 'MCI', 'CLE', 'IND', 'SAT', 'SNA', 'PIT', 'CVG',
                     'CMH', 'PBI', 'JAX', 'MKE', 'ONT', 'ANC', 'BDL', 'OGG', 'OMA', 'MEM',
                     'BOI', 'RNO', 'CHS', 'OKC']

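# Keep only flights departing from a relevant airport (exact match on the airport code).
# .copy() gives us an independent DataFrame so adding columns later does not warn.
carrier_data = carrier_data[carrier_data['ORIGIN'].isin(relevant_airports)].copy()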

In [5]:
# Create a key for looking up a connection between two airports.
# Each flight has an origin and a destination, so we pair them to look up values per route.
carrier_data['airport-lookup-key'] = carrier_data['ORIGIN'] + '-' + carrier_data['DEST']
# To cut down on storage, we keep only one row per origin-destination pair
airport_lookup = carrier_data.drop_duplicates(subset=['airport-lookup-key'])

In [6]:
# Our app only needs certain features of each airport pair, such as elevation and the distance between them.
# We select those columns here
airport_lookup = airport_lookup[['ORIGIN', 'DEST','airport-lookup-key', 
                                 'origin-elevation','dest-elevation', 'DISTANCE','dest-lat-long',
                                 'origin-lat-long','origin-tz','dest-tz']]

In [7]:
# Save out the data to a csv file
airport_lookup.to_csv('data/prepared/airport_lookup.csv', index=False)
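
On the app side, a lookup against this file might look like the sketch below. It assumes the column names saved above. The two sample rows, their values, and the helper `lookup_route` are hypothetical and only illustrate key-based access; the in-memory CSV stands in for `data/prepared/airport_lookup.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/prepared/airport_lookup.csv (values are illustrative only)
sample_csv = io.StringIO(
    "ORIGIN,DEST,airport-lookup-key,DISTANCE\n"
    "ATL,JFK,ATL-JFK,760\n"
    "LAX,SEA,LAX-SEA,954\n"
)

# Index by the origin-destination key so each lookup is a single label access
airport_lookup = pd.read_csv(sample_csv).set_index('airport-lookup-key')


def lookup_route(origin, dest):
    """Return the stored features for one origin-destination pair as a dict."""
    return airport_lookup.loc[f'{origin}-{dest}'].to_dict()


route = lookup_route('ATL', 'JFK')
print(route['DISTANCE'])
```

Because duplicates were dropped on `airport-lookup-key`, each key matches exactly one row, so `.loc` returns a single record rather than a sub-table.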

## Grouping Data

We have more data here than a user needs. We'll group it to calculate the percentage of delayed flights (out of total flights) within each grouping.

We want the user to be able to look up the percentage of delayed flights:
* by airport
* by holiday
* by time of day/time of the week
* and finally by day

In [11]:
# First we convert FL_DATE to a plain date to drop the time component, which is too granular for our groupings
carrier_data['FL_DATE'] = pd.to_datetime(carrier_data['FL_DATE']).dt.date

# Next we group our data by date, origin, holiday, day of the week and time of day
# we also group it by whether the flight had a severe delay or not
grouped_data = carrier_data.groupby(['FL_DATE', 'ORIGIN', 'holiday', 'DAY_OF_WEEK',
                                     'takeoff-time-of-day', 'target'], as_index=False).size()
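
To see why dropping the time component matters for grouping, here is a minimal sketch with two hypothetical timestamps (not taken from the dataset). After `.dt.date`, both flights share the same key and would fall into the same daily group.

```python
import pandas as pd

# Hypothetical timestamps: two flights on the same day at different times
stamps = pd.Series(['2021-06-01 08:15:00', '2021-06-01 17:40:00'])

# .dt.date keeps only the calendar date, so both flights share one group key
dates = pd.to_datetime(stamps).dt.date
print(dates.nunique())
```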

Let's preview our data.

In [10]:
grouped_data.head()

| | FL_DATE | ORIGIN | holiday | DAY_OF_WEEK | takeoff-time-of-day | target | size |
|---|---|---|---|---|---|---|---|
| 0 | 2021-06-01 | ANC | Not a Holiday | Tuesday | Early Afternoon | No | 8 |
| 1 | 2021-06-01 | ANC | Not a Holiday | Tuesday | Early Afternoon | Yes | 1 |
| 2 | 2021-06-01 | ANC | Not a Holiday | Tuesday | Early Evening | No | 4 |
| 3 | 2021-06-01 | ANC | Not a Holiday | Tuesday | Early Morning | No | 11 |
| 4 | 2021-06-01 | ANC | Not a Holiday | Tuesday | Late Afternoon | No | 4 |


Next we need to create a pivot of that data so we can calculate the percentage of delayed flights for each grouping.

In [12]:
# Delays by airport
df_by_airport = pd.pivot_table(grouped_data, values='size', index=['FL_DATE', 'ORIGIN'],
                               columns=['target'], aggfunc='sum', fill_value=0)
df_by_airport = df_by_airport.reset_index()

# Divide the number of flights with severe delays by the total number of flights
df_by_airport['percent-delayed'] = df_by_airport['Yes'] / (df_by_airport['No'] + df_by_airport['Yes'])

# Delays by holiday
df_by_holiday = pd.pivot_table(grouped_data, values='size', index=['holiday', 'ORIGIN'],
                               columns=['target'], aggfunc='sum', fill_value=0)
df_by_holiday = df_by_holiday.reset_index()
df_by_holiday['percent-delayed'] = df_by_holiday['Yes'] / (df_by_holiday['No'] + df_by_holiday['Yes'])

# Delays by day of the week and time of day
df_by_timeofday_weekday = pd.pivot_table(grouped_data, values='size',
                                         index=['DAY_OF_WEEK', 'takeoff-time-of-day', 'ORIGIN'],
                                         columns=['target'], aggfunc='sum', fill_value=0)
df_by_timeofday_weekday = df_by_timeofday_weekday.reset_index()
df_by_timeofday_weekday['percent-delayed'] = df_by_timeofday_weekday['Yes'] / (df_by_timeofday_weekday['No'] + df_by_timeofday_weekday['Yes'])
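
The count, pivot, and divide steps can be checked by hand on a tiny example. The sketch below uses a hypothetical six-flight table (not real data) with the same `ORIGIN` and `target` column names as above.

```python
import pandas as pd

# Hypothetical toy data: one row per flight, 'target' marks a severe delay
flights = pd.DataFrame({
    'ORIGIN': ['ATL', 'ATL', 'ATL', 'ATL', 'JFK', 'JFK'],
    'target': ['No', 'No', 'No', 'Yes', 'No', 'Yes'],
})

# Count flights per (airport, target) combination
counts = flights.groupby(['ORIGIN', 'target'], as_index=False).size()

# Spread the 'No'/'Yes' counts into columns; missing combinations become 0
pivot = pd.pivot_table(counts, values='size', index=['ORIGIN'],
                       columns=['target'], aggfunc='sum', fill_value=0).reset_index()

# Share of severely delayed flights out of all flights
pivot['percent-delayed'] = pivot['Yes'] / (pivot['No'] + pivot['Yes'])
print(pivot)
```

ATL has 1 delayed flight out of 4 and JFK has 1 out of 2, so the column holds 0.25 and 0.5. Note that `percent-delayed` is stored as a fraction between 0 and 1; multiply by 100 if the app should display a percentage.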

In [14]:
# Preview grouping by airport
df_by_airport.head()

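| | FL_DATE | ORIGIN | No | Yes | percent-delayed |
|---|---|---|---|---|---|
| 0 | 2021-06-01 | ANC | 57 | 1 | 0.017241 |
| 1 | 2021-06-01 | ATL | 779 | 7 | 0.008906 |
| 2 | 2021-06-01 | AUS | 140 | 7 | 0.047619 |
| 3 | 2021-06-01 | BDL | 51 | 0 | 0.0 |
| 4 | 2021-06-01 | BNA | 201 | 1 | 0.00495 |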


In [15]:
# Preview grouping by holiday
df_by_holiday.head()

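| | holiday | ORIGIN | No | Yes | percent-delayed |
|---|---|---|---|---|---|
| 0 | Christmas | ANC | 41 | 0 | 0.0 |
| 1 | Christmas | ATL | 569 | 20 | 0.033956 |
| 2 | Christmas | AUS | 201 | 10 | 0.047393 |
| 3 | Christmas | BDL | 44 | 7 | 0.137255 |
| 4 | Christmas | BNA | 179 | 4 | 0.021858 |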


In [16]:
# Preview grouping by time of day/week
df_by_timeofday_weekday.head()

| | DAY_OF_WEEK | takeoff-time-of-day | ORIGIN | No | Yes | percent-delayed |
|---|---|---|---|---|---|---|
| 0 | Friday | Early Afternoon | ANC | 507 | 20 | 0.037951 |
| 1 | Friday | Early Afternoon | ATL | 8191 | 511 | 0.058722 |
| 2 | Friday | Early Afternoon | AUS | 2328 | 185 | 0.073617 |
| 3 | Friday | Early Afternoon | BDL | 531 | 75 | 0.123762 |
| 4 | Friday | Early Afternoon | BNA | 1803 | 172 | 0.087089 |


Now let's save these out to files.

In [17]:
df_by_airport.to_csv('data/prepared/delays-by-airport.csv', index=False)
df_by_holiday.to_csv('data/prepared/delays-by-holiday.csv', index=False)
df_by_timeofday_weekday.to_csv('data/prepared/df_by_timeofday_weekday.csv', index=False)
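
When the app reads these files back, one detail is worth keeping in mind: CSV stores dates as text. The sketch below is one possible way to load the per-airport file, assuming the column layout saved above; the in-memory CSV is a hypothetical stand-in whose two rows echo the airport preview shown earlier.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/prepared/delays-by-airport.csv
sample_csv = io.StringIO(
    "FL_DATE,ORIGIN,No,Yes,percent-delayed\n"
    "2021-06-01,ANC,57,1,0.017241\n"
    "2021-06-01,ATL,779,7,0.008906\n"
)

# CSV stores dates as text, so parse FL_DATE back into datetimes on load
delays = pd.read_csv(sample_csv, parse_dates=['FL_DATE'])

# Filter to one airport, as the app would after a user picks it
atl = delays.loc[delays['ORIGIN'] == 'ATL']
print(atl[['FL_DATE', 'percent-delayed']])
```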

## Previewing the types of graphs our user can make
Here are examples of the graphs a user can make simply by filtering or looking up values from the groupings we created above.

### Atlanta Airport Delays throughout the week and day

In [22]:
# Sorting x and y axis orders
order_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
order_times = ['Night', 'Late Evening', 'Early Evening', 'Late Afternoon', 'Early Afternoon', 'Late Morning',
                   'Early Morning']


px.density_heatmap(df_by_timeofday_weekday.loc[df_by_timeofday_weekday['ORIGIN'] == 'ATL'],
                             x="DAY_OF_WEEK", y="takeoff-time-of-day", z='percent-delayed', histfunc="avg",
                             color_continuous_scale='OrRd',
                             category_orders={"DAY_OF_WEEK": order_days, "takeoff-time-of-day": order_times},
                             title='Percent of Flights with Severe Delays Throughout the Week - Atlanta')

### Worst Holiday Delays at New York's JFK Airport

In [24]:
# Filter once for JFK, then refer to columns by name
jfk_holidays = df_by_holiday.loc[df_by_holiday['ORIGIN'] == 'JFK']

px.bar(jfk_holidays,
       x='holiday',
       y='percent-delayed',
       color='percent-delayed',
       color_continuous_scale='OrRd',
       labels={"holiday": "Holiday",
               "percent-delayed": "Severe Delays"},
       title="Percent of Flights Severely Delayed by Holiday at JFK Airport")

### Trend of delays over the last year at Los Angeles International Airport (LAX)

In [26]:
# Filter once for LAX, then refer to columns by name
lax_daily = df_by_airport.loc[df_by_airport['ORIGIN'] == 'LAX']

px.line(lax_daily,
        x='FL_DATE',
        y='percent-delayed',
        labels={"FL_DATE": "Date",
                "percent-delayed": "Percentage of Severe Delays"},
        title="Daily Severe Airport Delays at LAX Airport")