# Predicting Major Flight Delays

Our client, FlightChicken, would like a model that predicts whether a flight will experience a major delay. Delays can cause a major disruption to travel plans, especially if they cause a person to miss their connecting flight. FlightChicken would like to give their users a heads up about potential travel disruptions like this.

This is a major undertaking, as there are hundreds of airlines and thousands of airports in the United States alone. That's why FlightChicken would like to launch with just an MVP to prove out their concept. This MVP should support major US airports and 8 of the most popular airlines.

In [1]:
import pandas as pd
import glob
import os
import requests
import json
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import time

from datetime import datetime, timedelta
from pandas import Timestamp

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)

import airportsdata
from pytz import timezone
import pytz

## Business Understanding

MVP should support:

* [Top 8 US Airlines](https://www.statista.com/statistics/250577/domestic-market-share-of-leading-us-airlines/)
 * American Airlines
 * Delta Air Lines
 * United Airlines
 * Southwest Airlines
 * Alaska Airlines
 * JetBlue Airways
 * Spirit
 * SkyWest
* [Large and medium airport hubs](https://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/media/cy20-commercial-service-enplanements.pdf)
 * "The term hub is used by the FAA to identify very busy commercial service airports. Large hubs are the airports that each account for at least one percent of total U.S. passenger enplanements."
 * In 2020 these accounted for 84% of all enplanements


## Data

To complete this project, we will be using data from several sources.

1. **Bureau of Transportation Statistics: Carrier On-Time Performance Database.** This database contains scheduled and actual departure and arrival times reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
2. **National Oceanic and Atmospheric Administration (NOAA): Daily Weather Summaries.** Data on select weather conditions at airports, collected by weather stations.
3. **Timezone for Each Airport by StackOverflow user hroptatyr.** This data will allow us to convert our time data to UTC and make it easier to work with. [Link](https://raw.githubusercontent.com/hroptatyr/dateutils/tzmaps/iata.tzmap)

Additionally, to link data from NOAA to each airport, I manually looked up the weather station for every airport relevant to this project. This data can be found in FILE PATH. To reproduce, go to [Climate Data Online Search](https://www.ncei.noaa.gov/cdo-web/search) and make the following selections:

1. Select Weather Observation Type/Dataset: Daily Summaries
2. Select Date Range: 2018-01-01 to 2021-12-31
3. Search For: Stations
4. Enter a Search Term: enter the city and state of the airport plus the term 'airport'. e.g. Atlanta, GA airport
5. Hit 'Search'
6. On the results page, find the closest/most relevant weather station. In the example of 'Atlanta, GA airport' you would select 'Atlanta Hartsfield Jackson International Airport'. Hit 'Add to Cart'. **On the results page, make note of the Station ID. This is what will serve as the key for linking BTS data with weather data.**
7. Repeat for every airport.


### Data Preparation & Cleaning
First, we need to load all our data into pandas so that we can work with it.

#### Carrier On-Time Performance Database
This data can only be downloaded by month, which means it is split among many files.

In [2]:
flight_data = glob.glob(os.path.join('data/downloaded/carrier-on-time-performence', "*.csv"))
carrier_data = pd.concat((pd.read_csv(f) for f in flight_data), ignore_index=True)

In [3]:
# We're trying to predict only flight delays not cancellations so we'll remove those from this list
carrier_data = carrier_data.loc[carrier_data['CANCELLED'] == 0]
carrier_data.drop(['CANCELLED'], axis = 1, inplace=True)

In [4]:
# Count missing values per column to spot columns with a lot of missing info
carrier_data.isna().sum()

YEAR                         0
MONTH                        0
DAY_OF_MONTH                 0
DAY_OF_WEEK                  0
FL_DATE                      0
MKT_CARRIER                  0
MKT_CARRIER_FL_NUM           0
OP_CARRIER                   0
TAIL_NUM                     0
OP_CARRIER_FL_NUM            0
ORIGIN                       0
DEST                         0
CRS_DEP_TIME                 0
DEP_DELAY                    0
DEP_DELAY_NEW                0
CRS_ARR_TIME                 0
ARR_DELAY_NEW            19033
CRS_ELAPSED_TIME             0
DISTANCE                     0
CARRIER_DELAY          5864061
WEATHER_DELAY          5864061
NAS_DELAY              5864061
SECURITY_DELAY         5864061
LATE_AIRCRAFT_DELAY    5864061
dtype: int64

In [5]:
cols_to_drop = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']
carrier_data.drop(cols_to_drop, axis = 1, inplace=True)

In [6]:
carrier_data = carrier_data[np.isfinite(carrier_data['ARR_DELAY_NEW'])]

## Engineering Target

In [3]:
# Create a function to engineer this feature
def serious_delay(value):
    """
    Function takes in an int or float (delay in minutes) and returns "Yes" for delays over 60 minutes, else "No"
    """
    if value <= 60:
        return "No"
    else:
        return "Yes"
    
carrier_data['target'] = carrier_data['ARR_DELAY_NEW'].apply(serious_delay)
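
The same 60-minute rule can also be expressed without a row-by-row `apply`. The block below is a minimal sketch on a small made-up sample (the `sample` dataframe is hypothetical, not BTS data). It assumes missing delays have already been removed, as they are above, because `np.where` would label a missing value "No" while `serious_delay` would label it "Yes".

```python
import numpy as np
import pandas as pd

# Hypothetical sample of arrival delays in minutes (not taken from the BTS data)
sample = pd.DataFrame({'ARR_DELAY_NEW': [0.0, 60.0, 61.0, 185.0]})

# Same rule as serious_delay: more than 60 minutes late counts as a major delay
sample['target'] = np.where(sample['ARR_DELAY_NEW'] > 60, 'Yes', 'No')

print(sample)
```

On 7 million rows, a vectorized comparison like this is typically much faster than `apply`, though the result is the same.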

#### Bringing In Additional Data on Our Airports
Next, we'll use the airportsdata package to load in some additional details for each airport (like its location, timezone, and elevation).

In [7]:
airports = airportsdata.load('IATA')
airports_list = []
lat_list = []
lon_list = []
elevation_list = []
tz_list = []

for key in airports.keys():
    airport = airports[key]['iata']
    lat = airports[key]['lat']
    lon = airports[key]['lon']
    elevation = airports[key]['elevation']
    tz = airports[key]['tz']
    airports_list.append(airport)
    lat_list.append(lat)
    lon_list.append(lon)
    elevation_list.append(elevation)
    tz_list.append(tz)

airport_locations = pd.DataFrame(
    {'Airport': airports_list,
     'Latitude': lat_list,
     'Longitude': lon_list,
     'Timezone': tz_list,
     'Elevation': elevation_list,
    })

airport_locations['lat-long'] = airport_locations['Latitude'].astype(str) + ',' + airport_locations['Longitude'].astype(str)
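
As a possible alternative to the loop above, a dictionary of dictionaries can be turned into a dataframe directly. This is only a sketch: `airports_sample` is a hypothetical two-airport excerpt with rounded, illustrative values, shaped like the records the loop reads (`iata`, `lat`, `lon`, `elevation`, `tz`).

```python
import pandas as pd

# Hypothetical excerpt shaped like the airport records used above;
# coordinates and elevations are rounded, illustrative values
airports_sample = {
    'SAN': {'iata': 'SAN', 'lat': 32.73, 'lon': -117.19,
            'elevation': 17.0, 'tz': 'America/Los_Angeles'},
    'DEN': {'iata': 'DEN', 'lat': 39.86, 'lon': -104.67,
            'elevation': 5431.0, 'tz': 'America/Denver'},
}

# One row per airport, keeping only the fields we need and renaming them
lookup = (pd.DataFrame.from_dict(airports_sample, orient='index')
            [['iata', 'lat', 'lon', 'tz', 'elevation']]
            .rename(columns={'iata': 'Airport', 'lat': 'Latitude', 'lon': 'Longitude',
                             'tz': 'Timezone', 'elevation': 'Elevation'})
            .reset_index(drop=True))

# Same "lat,long" string key that the notebook builds
lookup['lat-long'] = lookup['Latitude'].astype(str) + ',' + lookup['Longitude'].astype(str)

print(lookup)
```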

In [8]:
# Now we bring that information to our main dataframe
carrier_data['origin-lat-long'] = carrier_data['ORIGIN'].map(airport_locations.set_index('Airport')['lat-long'])
carrier_data['origin-tz'] = carrier_data['ORIGIN'].map(airport_locations.set_index('Airport')['Timezone'])
carrier_data['origin-elevation'] = carrier_data['ORIGIN'].map(airport_locations.set_index('Airport')['Elevation'])

carrier_data['dest-lat-long'] = carrier_data['DEST'].map(airport_locations.set_index('Airport')['lat-long'])
carrier_data['dest-tz'] = carrier_data['DEST'].map(airport_locations.set_index('Airport')['Timezone'])
carrier_data['dest-elevation'] = carrier_data['DEST'].map(airport_locations.set_index('Airport')['Elevation'])

In [9]:
# Check whether we were able to match all records
print('Missing origin timezones: {}'.format(carrier_data['origin-tz'].isna().sum()))
print('Missing origin elevations: {}'.format(carrier_data['origin-elevation'].isna().sum()))
print('Missing origin locations: {}'.format(carrier_data['origin-lat-long'].isna().sum()))
print('------------------')
print('Missing destination timezones: {}'.format(carrier_data['dest-tz'].isna().sum()))
print('Missing destination  elevations: {}'.format(carrier_data['dest-elevation'].isna().sum()))
print('Missing destination  locations: {}'.format(carrier_data['dest-lat-long'].isna().sum()))

Missing origin timezones: 13825
Missing origin elevations: 13825
Missing origin locations: 13825
------------------
Missing destination timezones: 13810
Missing destination  elevations: 13810
Missing destination  locations: 13810


Looks like about 13,800 records couldn't be matched on each side (origin and destination). That's not bad considering we have over 7,000,000 records in total. We'll simply drop the null values.

In [10]:
carrier_data = carrier_data.dropna(subset=['origin-tz', 'origin-elevation', 'origin-lat-long',
                                           'dest-tz', 'dest-elevation', 'dest-lat-long'])

The time feature we are concerned with is takeoff and landing time.  Our goal is to convert it to a universal UTC time. To do this, we first need to transform it a bit so that it's workable.

All times are expressed as an integer in military time. For example, 940 is 9:40am and 1500 is 3:00pm. We would like to first convert it to a string that can be read as 24H time, then combined with the FL_DATE field so that we can have an exact take-off date and time.

In [11]:
# First we create a helper function to carry out the transformation
def float_to_time(time):
    '''
    Function takes in an integer representation of time (24-hour format)
    and returns a string in proper datetime formatting. Example: 1545 (int) becomes 15:45 (string)
    '''
    time_str = str(time)
    digits = len(time_str)
    if digits < 2:
        return '00:0' + str(time)
    if digits == 2:
        return '00:' + str(time)
    if digits == 3:
        return '0' + time_str[:1] + ':' + time_str[1:]
    if digits == 4:
        return time_str[:2] + ':' + time_str[2:]
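
The padding logic above can also be written with zero-padding. The sketch below uses a hypothetical helper name, `hhmm_to_time_string`, and assumes the input is a whole number between 0 and 2359. It also shows a vectorized version that works on a whole column at once.

```python
import pandas as pd

def hhmm_to_time_string(hhmm):
    """Hypothetical helper: left-pad to four digits, then insert a colon."""
    padded = str(int(hhmm)).zfill(4)   # 940 -> '0940', 5 -> '0005'
    return padded[:2] + ':' + padded[2:]

# Vectorized equivalent for a whole column of made-up schedule times
times = pd.Series([5, 45, 940, 1545])
padded = times.astype(int).astype(str).str.zfill(4)
as_strings = padded.str[:2] + ':' + padded.str[2:]

print(as_strings.tolist())
```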

In [12]:
# First, we apply the function above to transform the CRS_DEP_TIME field
carrier_data['CRS_DEP_TIME'] = carrier_data['CRS_DEP_TIME'].apply(float_to_time)
carrier_data['CRS_ARR_TIME'] = carrier_data['CRS_ARR_TIME'].apply(float_to_time)
# Next, we update the FL_DATE field so that it now contains the proper date AND time of takeoff
carrier_data['FL_DATE'] =  pd.to_datetime(carrier_data['FL_DATE'].astype(str) + ' ' + carrier_data['CRS_DEP_TIME'])

Now that we have the departure expressed in LOCAL time, we will use the timezone data to make it timezone-aware, which pins it to an exact moment that can be converted to UTC or to any other timezone.

In [13]:
# First we need to convert our data from timezone-naive to timezone-aware
carrier_data['FL_DATE'] = carrier_data['FL_DATE'].astype('datetime64[ns]')
carrier_data['FL_DATE'] = carrier_data.apply(lambda x: x['FL_DATE'].replace(tzinfo=timezone(x['origin-tz'])), axis=1)

Next, we want to know when the flight will be landing. The CRS_ELAPSED_TIME field gives the scheduled number of minutes between departure and arrival. We can add that to the timezone-aware FL_DATE and then convert the result to the destination's local time.

In [14]:
carrier_data['flight_duration'] = pd.to_timedelta(carrier_data['CRS_ELAPSED_TIME'],'m')
carrier_data['FL_ARR_DATE_REL_ORIGIN'] = carrier_data['FL_DATE'] + carrier_data['flight_duration']
# And now we convert arrival time and date to a local time
carrier_data['FL_ARR_DATE'] = carrier_data.apply(lambda x: x['FL_ARR_DATE_REL_ORIGIN'].tz_convert(x['dest-tz']), axis=1)
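
To make the timezone steps concrete, here is a small worked sketch on two made-up flights. The `flights` dataframe and the `scheduled_arrival_local` helper are hypothetical. It uses `tz_localize`, which is one documented pandas way to attach a timezone to a naive timestamp, and may not match the `replace(tzinfo=...)` approach above in every edge case.

```python
import pandas as pd

# Two made-up flights: naive local departure, timezones, scheduled minutes in the air
flights = pd.DataFrame({
    'FL_DATE': pd.to_datetime(['2021-07-13 18:25', '2021-07-13 12:24']),
    'origin-tz': ['America/Los_Angeles', 'America/New_York'],
    'dest-tz': ['America/Denver', 'America/New_York'],
    'CRS_ELAPSED_TIME': [135, 88],
})

def scheduled_arrival_local(row):
    """Hypothetical helper: local departure -> local scheduled arrival."""
    # Attach the origin timezone to the naive departure time
    departure = row['FL_DATE'].tz_localize(row['origin-tz'])
    # Add the scheduled flight duration in minutes
    arrival = departure + pd.to_timedelta(row['CRS_ELAPSED_TIME'], unit='m')
    # Express the arrival in the destination's clock, then drop the timezone
    return arrival.tz_convert(row['dest-tz']).tz_localize(None)

flights['FL_ARR_DATE'] = flights.apply(scheduled_arrival_local, axis=1)

print(flights[['FL_DATE', 'FL_ARR_DATE']])
```

In the first row, 18:25 in Los Angeles plus 135 minutes is 20:40 Pacific time, which is 21:40 in Denver.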



### Airport Congestion
One hypothesis is that flight delays can be tied to airport "traffic" (or congestion). It stands to reason that if an airline has 100 flights scheduled to take off at 11am, there is a higher chance of delays than if there are only 10 flights scheduled to take off.

Moreover, because traffic and delays in the morning can propagate throughout the day, congestion before our flight takes off can also play a role.

We'll create a set of features that put a number on this congestion.

First, we will bucket takeoff and arrival times by time of day (an earlier approach, rounding to the nearest hour, is left commented out below). This will make calculations easier. Then, we will create a dataframe that holds information on congestion at every airport, for every time of day throughout the week.

In [14]:
# def takeoff_hour_rounder(time):
#     '''
#     Function takes in a time and returns time rounded to the 
#     nearest hour by adding a timedelta hour if minute >= 30
#     '''
#     return (time.replace(second=0, microsecond=0, minute=0, hour=time.hour)
#                +timedelta(hours=time.minute//30))

In [15]:
# carrier_data['FL_DATE_ROUNDED'] = carrier_data['FL_DATE'].apply(takeoff_hour_rounder)
# carrier_data['FL_ARR_DATE_ROUNDED'] = carrier_data['FL_ARR_DATE'].apply(takeoff_hour_rounder)


In [15]:
# We won't be needing timezone info anymore, so let's remove it
def remove_timezone(dt):
    # HERE `dt` is a python datetime
    # object that used .replace() method
    
    return dt.replace(tzinfo=None)

In [16]:
carrier_data['FL_DATE'] = carrier_data['FL_DATE'].apply(remove_timezone)
#carrier_data['FL_DATE_ROUNDED'] = carrier_data['FL_DATE_ROUNDED'].apply(remove_timezone)
carrier_data['FL_ARR_DATE'] = carrier_data['FL_ARR_DATE'].apply(remove_timezone)
#carrier_data['FL_ARR_DATE_ROUNDED'] = carrier_data['FL_ARR_DATE_ROUNDED'].apply(remove_timezone)

In [17]:
carrier_data['takeoff-hour'] = carrier_data['FL_DATE'].dt.hour.astype(int)
carrier_data['arriving-hour'] = carrier_data['FL_ARR_DATE'].dt.hour.astype(int)

In [18]:
conditions = [
    (carrier_data['takeoff-hour'] >= 5) & (carrier_data['takeoff-hour'] <= 8),
    (carrier_data['takeoff-hour'] > 8) & (carrier_data['takeoff-hour'] <= 12),
    (carrier_data['takeoff-hour'] > 12) & (carrier_data['takeoff-hour'] <= 15),
    (carrier_data['takeoff-hour'] > 15) & (carrier_data['takeoff-hour'] <= 17),
    (carrier_data['takeoff-hour'] > 17) & (carrier_data['takeoff-hour'] <= 19),
    (carrier_data['takeoff-hour'] > 19) & (carrier_data['takeoff-hour'] <= 21),
    (carrier_data['takeoff-hour'] >= 0) & (carrier_data['takeoff-hour'] < 5),
    (carrier_data['takeoff-hour'] > 21)
]
values = ['Early Morning', 'Late Morning', 'Early Afternoon', 'Late Afternoon', 'Early Evening',  'Late Evening', 'Night', 'Night']

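# An explicit string default keeps the result dtype consistent with the string labels
carrier_data['takeoff-time-of-day'] = np.select(conditions, values, default='Unknown')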

In [19]:
arr_conditions = [
    (carrier_data['arriving-hour'] >= 5) & (carrier_data['arriving-hour'] <= 8),
    (carrier_data['arriving-hour'] > 8) & (carrier_data['arriving-hour'] <= 12),
    (carrier_data['arriving-hour'] > 12) & (carrier_data['arriving-hour'] <= 15),
    (carrier_data['arriving-hour'] > 15) & (carrier_data['arriving-hour'] <= 17),
    (carrier_data['arriving-hour'] > 17) & (carrier_data['arriving-hour'] <= 19),
    (carrier_data['arriving-hour'] > 19) & (carrier_data['arriving-hour'] <= 21),
    (carrier_data['arriving-hour'] >= 0) & (carrier_data['arriving-hour'] < 5),
    (carrier_data['arriving-hour'] > 21)
]
arr_values = ['Early Morning', 'Late Morning', 'Early Afternoon', 'Late Afternoon', 'Early Evening',  'Late Evening', 'Night', 'Night']

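# An explicit string default keeps the result dtype consistent with the string labels
carrier_data['arrival-time-of-day'] = np.select(arr_conditions, arr_values, default='Unknown')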
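
Since the takeoff and arrival buckets use the same boundaries, one option is to define the hour-to-label mapping once and reuse it. The sketch below is an illustration only: `buckets` and `hour_to_label` are hypothetical names, and the hour ranges are read off the conditions above (hour 8 falls in 'Early Morning' because the first matching condition wins).

```python
import pandas as pd

# Hour ranges taken from the conditions above; range(5, 9) covers hours 5-8
buckets = [
    (range(0, 5), 'Night'),
    (range(5, 9), 'Early Morning'),
    (range(9, 13), 'Late Morning'),
    (range(13, 16), 'Early Afternoon'),
    (range(16, 18), 'Late Afternoon'),
    (range(18, 20), 'Early Evening'),
    (range(20, 22), 'Late Evening'),
    (range(22, 24), 'Night'),
]

# Flatten into a lookup with one entry per hour of the day
hour_to_label = {hour: label for hours, label in buckets for hour in hours}

# Made-up hours to show the mapping at the bucket edges
hours = pd.Series([0, 8, 9, 12, 13, 17, 19, 21, 23])
labels = hours.map(hour_to_label)

print(labels.tolist())
```

The same `hour_to_label` dictionary could then be mapped over both the takeoff-hour and arriving-hour columns.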

In [20]:
carrier_data['ARR_DAY_OF_WEEK'] = carrier_data['FL_ARR_DATE'].dt.dayofweek

In [21]:
carrier_data[['ORIGIN', 'DEST', 'DAY_OF_WEEK', 'FL_DATE', 'FL_ARR_DATE', 'ARR_DAY_OF_WEEK']]

| | ORIGIN | DEST | DAY_OF_WEEK | FL_DATE | FL_ARR_DATE | ARR_DAY_OF_WEEK |
|---|---|---|---|---|---|---|
| 0 | SAN | DEN | 2 | 2021-07-13 18:25:00 | 2021-07-13 21:40:00 | 1 |
| 2 | SAN | DEN | 2 | 2021-07-13 12:10:00 | 2021-07-13 15:30:00 | 1 |
| 3 | SAN | DEN | 2 | 2021-07-13 20:50:00 | 2021-07-14 00:05:00 | 2 |
| 4 | SAN | HNL | 2 | 2021-07-13 18:40:00 | 2021-07-13 21:40:00 | 1 |
| 5 | SAN | HNL | 2 | 2021-07-13 08:40:00 | 2021-07-13 11:45:00 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 7580719 | CMH | DCA | 7 | 2022-01-09 12:24:00 | 2022-01-09 13:52:00 | 6 |
| 7580720 | CMH | DCA | 1 | 2022-01-10 12:24:00 | 2022-01-10 13:52:00 | 0 |
| 7580721 | CMH | DCA | 2 | 2022-01-11 12:24:00 | 2022-01-11 13:52:00 | 1 |
| 7580722 | CMH | DCA | 3 | 2022-01-12 12:24:00 | 2022-01-12 13:52:00 | 2 |


In [22]:
# pandas numbers weekdays 0-6 (Monday=0) while DAY_OF_WEEK uses 1-7, so we shift by one to match
carrier_data['ARR_DAY_OF_WEEK'] = carrier_data['ARR_DAY_OF_WEEK'] + 1

In [23]:
# Next we make the day_of_week columns more reader-friendly
day_of_week_translation = {1: 'Monday',
                          2: 'Tuesday',
                          3: 'Wednesday',
                          4: 'Thursday',
                          5: 'Friday',
                          6: 'Saturday',
                          7: 'Sunday'}

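# Assign the result back rather than relying on inplace=True on a selected column
carrier_data['DAY_OF_WEEK'] = carrier_data['DAY_OF_WEEK'].replace(day_of_week_translation)
carrier_data['ARR_DAY_OF_WEEK'] = carrier_data['ARR_DAY_OF_WEEK'].replace(day_of_week_translation)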

In [24]:
# Takeoff Congestion Key
carrier_data['takeoff-congestion-key'] = carrier_data['ORIGIN'] \
                        + carrier_data['DAY_OF_WEEK'].astype(str) \
                        + carrier_data['takeoff-time-of-day']

# Arrival Congestion Key
carrier_data['arrival-congestion-key'] = carrier_data['DEST'] \
                        + carrier_data['ARR_DAY_OF_WEEK'].astype(str) \
                        + carrier_data['arrival-time-of-day']

In [25]:
records = carrier_data.groupby('takeoff-congestion-key')['FL_DATE'].nunique().tolist()

In [26]:
# Now we create a new dataframe that holds data on congestion
airport_congestion_by_hour = carrier_data.groupby('takeoff-congestion-key')['TAIL_NUM'].count()
airport_congestion_by_hour = airport_congestion_by_hour.to_frame()
airport_congestion_by_hour.reset_index(inplace=True)
airport_congestion_by_hour.rename(columns={'TAIL_NUM': 'count_of_flights'}, inplace=True)

In [27]:
airport_congestion_by_hour['num_records'] = records
airport_congestion_by_hour['avg-takeoff-congestion'] = airport_congestion_by_hour['count_of_flights'] / airport_congestion_by_hour['num_records']
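
The congestion average above divides the number of flights for a key by the number of distinct departure timestamps for that key. As a sketch of the same arithmetic, the block below computes both pieces in a single `groupby` with named aggregation, so the two values stay aligned by key rather than by list position. The `sample` dataframe is made up for illustration.

```python
import pandas as pd

# Made-up flights: three share one congestion key, one has another
sample = pd.DataFrame({
    'takeoff-congestion-key': ['ATLMondayNight', 'ATLMondayNight',
                               'ATLMondayNight', 'DENFridayNight'],
    'TAIL_NUM': ['N1', 'N2', 'N3', 'N4'],
    'FL_DATE': pd.to_datetime(['2021-07-12 22:00', '2021-07-12 22:00',
                               '2021-07-19 23:10', '2021-07-16 22:30']),
})

# Count flights and distinct departure timestamps per key in one pass
congestion = (sample.groupby('takeoff-congestion-key')
                    .agg(count_of_flights=('TAIL_NUM', 'count'),
                         num_records=('FL_DATE', 'nunique'))
                    .reset_index())

congestion['avg-takeoff-congestion'] = (congestion['count_of_flights']
                                        / congestion['num_records'])

print(congestion)
```

For 'ATLMondayNight' there are 3 flights across 2 distinct departure timestamps, giving an average of 1.5.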

In [28]:
# Now we calculate landing congestion
airport_arrival_congestion_by_hour = carrier_data.groupby('arrival-congestion-key')['TAIL_NUM'].count()
airport_arrival_congestion_by_hour = airport_arrival_congestion_by_hour.to_frame()
airport_arrival_congestion_by_hour.reset_index(inplace=True)
airport_arrival_congestion_by_hour.rename(columns={'TAIL_NUM': 'count_of_flights_arriving'}, inplace=True)

arr_records = carrier_data.groupby('arrival-congestion-key')['FL_ARR_DATE'].nunique().tolist()

airport_arrival_congestion_by_hour['num_arr_records'] = arr_records
airport_arrival_congestion_by_hour['avg-arrival-congestion'] = airport_arrival_congestion_by_hour['count_of_flights_arriving'] / airport_arrival_congestion_by_hour['num_arr_records']

airport_congestion_by_hour.drop(columns=['count_of_flights', 'num_records'], inplace=True)
airport_arrival_congestion_by_hour.drop(columns=['count_of_flights_arriving', 'num_arr_records'], inplace=True)

In [29]:
# And finally we can combine the two into a congestion dataframe
airport_congestion_by_hour = pd.merge(airport_congestion_by_hour, airport_arrival_congestion_by_hour,
                                      left_on='takeoff-congestion-key', right_on='arrival-congestion-key')
airport_congestion_by_hour.drop(columns=['arrival-congestion-key'], inplace=True)

airport_congestion_by_hour.rename(columns={"takeoff-congestion-key": "congestion-key"}, inplace=True)
airport_congestion_by_hour.to_csv('data/prepared/airport_congestion.csv', index=False)

### Merging in Congestion Data

In [30]:
# Now we add congestion data to our main dataframe
carrier_data = pd.merge(carrier_data, airport_congestion_by_hour, left_on='takeoff-congestion-key', right_on='congestion-key')
# updating key
airport_congestion_by_hour = airport_congestion_by_hour.add_prefix('dest-')
# Now data on the congestion conditions of the airport where the flight is arriving
carrier_data = pd.merge(carrier_data, airport_congestion_by_hour, left_on='arrival-congestion-key', right_on='dest-congestion-key')

### Proximity to Holidays
Anyone who has ever traveled by plane knows that delays seem to be most prevalent around the holidays. That's why one last feature we want to engineer is a measure of proximity to holidays.

In [31]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

date_range = pd.date_range(start='2021-01-01', end='2025-12-31')
cal = calendar()
holidays = cal.holidays(start=date_range.min(), end=date_range.max(), return_name=True)
# `holidays` is a Series of holiday names indexed by date; turn it into a two-column dataframe
holidays = holidays.to_frame()
holidays.reset_index(inplace=True)
holidays.columns = ['holiday_date', 'holiday_name']

# holidays.to_csv('data/prepared/holidays.csv', index=False)

carrier_data.sort_values('FL_DATE', inplace=True)
carrier_data = pd.merge_asof(carrier_data, holidays, left_on='FL_DATE', right_on='holiday_date',
                       direction='nearest', tolerance=pd.Timedelta(days=7))

carrier_data['days-from-holiday'] = (carrier_data['FL_DATE'] - carrier_data['holiday_date']).dt.days
carrier_data['holiday'] = carrier_data.loc[carrier_data['days-from-holiday'] == 0, 'holiday_name']
carrier_data['holiday'] = carrier_data['holiday'].fillna('Not a Holiday')
carrier_data['days-from-holiday'] = carrier_data['days-from-holiday'].astype(str)
carrier_data['days-from-specific-holiday'] = carrier_data['holiday_name'] + '_' + carrier_data['days-from-holiday'].astype(str)

# Cleaning up the results a bit
carrier_data['days-from-specific-holiday'] = carrier_data['days-from-specific-holiday'].fillna('no-close-holiday')

In [32]:
# Once again, we clean up columns we don't need anymore
carrier_data.drop(columns=['holiday_date', 'holiday_name'], inplace=True)
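
To see how the nearest-holiday match behaves, here is a small sketch on three made-up departure times. The `flights` dataframe is hypothetical, and both date columns are cast to the same datetime dtype as a precaution, since `merge_asof` expects matching key dtypes.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Federal holidays in 2021 as a two-column dataframe, sorted by date
holidays = (USFederalHolidayCalendar()
            .holidays(start='2021-01-01', end='2021-12-31', return_name=True)
            .rename('holiday_name')
            .rename_axis('holiday_date')
            .reset_index()
            .sort_values('holiday_date'))
holidays['holiday_date'] = holidays['holiday_date'].astype('datetime64[ns]')

# Made-up departures: shortly before a holiday, on one, and far from any
flights = pd.DataFrame({
    'FL_DATE': pd.to_datetime(['2021-07-02 09:00', '2021-07-05 12:00',
                               '2021-08-10 08:00']).astype('datetime64[ns]')
})

# Match each flight to the nearest holiday within 7 days, if there is one
merged = pd.merge_asof(flights.sort_values('FL_DATE'), holidays,
                       left_on='FL_DATE', right_on='holiday_date',
                       direction='nearest', tolerance=pd.Timedelta(days=7))

merged['days-from-holiday'] = (merged['FL_DATE'] - merged['holiday_date']).dt.days

print(merged)
```

In 2021, July 4 fell on a Sunday, so the calendar lists the observed holiday on Monday, July 5. Because `.dt.days` rounds down, a flight at 09:00 on July 2 (2 days and 15 hours before the holiday) gets -3, and the August flight has no holiday within 7 days, so its value is missing.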

### Time of day takeoff & landing

Lastly, we will create continuous variables that quantify when during the day a flight takes off and lands. This accounts for the possibility that delays are more prevalent at certain points in the day. To do this, we'll measure each scheduled departure and arrival time's distance from midnight in minutes.

In [33]:
carrier_data['takeoff-mins-from-midnight'] = ((pd.to_datetime(carrier_data['CRS_DEP_TIME'])
                                               - pd.to_datetime(carrier_data['CRS_DEP_TIME']).dt.normalize()) \
                                              / pd.Timedelta('1 minute')).astype(int)

carrier_data['CRS_ARR_TIME'] = carrier_data['CRS_ARR_TIME'].replace({'24:00':'00:00'})

carrier_data['landing-mins-from-midnight'] = ((pd.to_datetime(carrier_data['CRS_ARR_TIME'])
                                               - pd.to_datetime(carrier_data['CRS_ARR_TIME']).dt.normalize()) \
                                              / pd.Timedelta('1 minute')).astype(int)
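
The minutes-from-midnight value is simply hours times 60 plus minutes. As a sketch of that arithmetic, the block below computes it straight from 'HH:MM' strings without parsing them as datetimes. The `times` series is made up for illustration.

```python
import pandas as pd

# Made-up scheduled times in the 'HH:MM' format produced earlier
times = pd.Series(['09:40', '15:00', '00:05'])

# Split into hour and minute columns and convert both to integers
parts = times.str.split(':', expand=True).astype(int)

# Minutes since midnight = hours * 60 + minutes
mins_from_midnight = parts[0] * 60 + parts[1]

print(mins_from_midnight.tolist())
```

For example, 09:40 is 9 * 60 + 40 = 580 minutes after midnight.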

#### Filtering for Airports & Airlines Relevant to Business Case
The MVP calls for us to support flights originating from major US airports and 8 major airlines. So we filter down our data for that.

In [35]:
# Create list of relevant airports based on business case
relevant_airports = ['ATL', 'DFW', 'DEN', 'ORD', 'LAX', 'CLT', 'LAS', 'PHX', 
                     'MCO', 'SEA', 'MIA', 'IAH', 'JFK', 'FLL', 'EWR', 'SFO', 'MSP', 'DTW',
                     'BOS', 'SLC', 'PHL', 'BWI', 'TPA', 'SAN', 'MDW', 'LGA', 'BNA', 'IAD',
                     'DAL', 'DCA', 'PDX', 'AUS', 'HOU', 'HNL', 'STL', 'RSW', 'SMF', 'MSY',
                     'SJU', 'RDU', 'OAK', 'MCI', 'CLE', 'IND', 'SAT', 'SNA', 'PIT', 'CVG',
                     'CMH', 'PBI', 'JAX', 'MKE', 'ONT', 'ANC', 'BDL', 'OGG', 'OMA', 'MEM',
                     'BOI', 'RNO', 'CHS', 'OKC']

# Filter dataframe to include only flights originating from relevant airports
carrier_data = carrier_data[carrier_data['ORIGIN'].isin(relevant_airports)]

Before we further filter down the data to our relevant case, we'll save it out for graphing/visualization purposes.

In [36]:
carrier_data.to_csv('data/prepared/data_for_graphing.csv', index=False)

In [4]:
# Create list of relevant IATA airline designators based on business case
relevant_airlines = ['WN', # Southwest
                     'DL', # Delta
                     'OO', # SkyWest
                     'AA', # American Airlines
                     'UA', # United Airlines
                     'B6', # JetBlue
                     'AS', # Alaska Airlines
                     'NK', # Spirit Airlines
                    ]

# Keep only flights operated by the relevant airlines
carrier_data = carrier_data[carrier_data['OP_CARRIER'].isin(relevant_airlines)]

#### Filtering for more frequently visited locations
For the sake of efficiency, we can further filter down our data to include only destinations that are at least somewhat frequently traveled. If a destination appears 1,000 times or fewer in the remaining data, we'll cut those flights.

In [5]:
carrier_data = carrier_data.groupby('DEST').filter(lambda x: len(x) > 1000)
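
A `groupby(...).filter` with a Python lambda can be slow on millions of rows. One possible alternative, sketched below on a made-up sample, is to compute each row's destination count with `transform('size')` and filter with a boolean mask. The threshold `min_flights = 2` is a hypothetical stand-in for the 1000 used above.

```python
import pandas as pd

# Made-up destinations: DEN appears 3 times, BOI twice, HNL once
sample = pd.DataFrame({'DEST': ['DEN', 'DEN', 'DEN', 'HNL', 'BOI', 'BOI']})

# Hypothetical threshold for this tiny sample (the notebook uses 1000)
min_flights = 2

# For every row, how many rows share its destination
dest_counts = sample.groupby('DEST')['DEST'].transform('size')

# Keep only destinations that appear more than min_flights times
filtered = sample[dest_counts > min_flights]

print(filtered)
```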

### Merging in Weather Data

In [33]:
# start = datetime.strptime('2021-06-01', '%Y-%m-%d').date()
# end = datetime.strptime('2022-07-31', '%Y-%m-%d').date()

# current_date = start
# date_list = []

# while current_date <= end:
#     date_list.append(current_date)
#     current_date_plus_30 = current_date + timedelta(days=30)
#     date_list.append(current_date_plus_30)
#     current_date = current_date_plus_30 + timedelta(days=1)

# for i in range(0,len(date_list)):
#     date_list[i] = date_list[i].strftime('%Y-%m-%d')
    
# start_dates = []
# end_dates = []
# for i in range(0,len(date_list)):
#     if i%2 == 0:
#         start_dates.append(date_list[i])
#     else:
#         end_dates.append(date_list[i])
        
# origins = list(carrier_data['origin-lat-long'].unique())
# destinations = list(carrier_data['dest-lat-long'].unique())
# lat_long_list = list(set(origins+destinations))

In [34]:
# def get_keys(path):
#     with open(path) as f:
#         return json.load(f)
    
# keys = get_keys("C:/Users/Robert/.secret/weather_api.json")
# history_data_url = 'http://api.weatherapi.com/v1/history.json'
# api_key = keys['api_key']

# r = requests.get(history_data_url + '?key=' + api_key + '&q=' + location + '&dt=' + start_dates[i] + '&end_dt=' + end_dates[i])
# d = json.loads(r.text)

# weather_latlong = []
# weather_dates = []
# weather_maxtemp_c = []
# weather_mintemp_c = []
# weather_avgtemp_c = []
# weather_totalprecip_mm = []
# weather_avgvis_km = []
# weather_maxwind_kph = []
# weather_avghumidity = []

# for location in lat_long_list:
#     print('Working on {}'.format(location))
#     for i in range(0, len(start_dates)):
#         r = requests.get(history_data_url + '?key=' + api_key + '&q=' + location + '&dt=' + start_dates[i] + '&end_dt=' + end_dates[i])
#         d = json.loads(r.text)
#         for j in range(0,31):
#             weather_latlong.append(location)
#             weather_dates.append(d['forecast']['forecastday'][j]['date'])
#             weather_maxtemp_c.append(d['forecast']['forecastday'][j]['day']['maxtemp_c'])
#             weather_mintemp_c.append(d['forecast']['forecastday'][j]['day']['mintemp_c'])
#             weather_avgtemp_c.append(d['forecast']['forecastday'][j]['day']['avgtemp_c'])
#             weather_totalprecip_mm.append(d['forecast']['forecastday'][j]['day']['totalprecip_mm'])
#             weather_avgvis_km.append(d['forecast']['forecastday'][j]['day']['avgvis_km'])
#             weather_maxwind_kph.append(d['forecast']['forecastday'][j]['day']['maxwind_kph'])
#             weather_avghumidity.append(d['forecast']['forecastday'][j]['day']['avghumidity'])

In [35]:
# weather_df = pd.DataFrame(
#     {'lat-long': weather_latlong,
#      'date': weather_dates,
#      'maxtemp': weather_maxtemp_c,
#      'mintemp': weather_mintemp_c,
#      'avgtemp': weather_avgtemp_c,
#      'totalprecip': weather_totalprecip_mm,
#      'avgvis': weather_avgvis_km,
#      'maxwind': weather_maxwind_kph,
#      'avghumidity': weather_avghumidity
#     })

In [36]:
# weather_df.to_csv('data/downloaded/weather-data.csv', index=False)

In [2]:
weather_df = pd.read_csv('data/downloaded/weather-data.csv')

In [6]:
# First, we need to create keys for matching the weather data to locations and dates
carrier_data['FL_DATE'] = pd.to_datetime(carrier_data['FL_DATE'])
carrier_data['FL_ARR_DATE'] = pd.to_datetime(carrier_data['FL_ARR_DATE'])
carrier_data['weather-key'] = carrier_data['origin-lat-long'] + carrier_data['FL_DATE'].dt.date.astype(str)
carrier_data['dest-weather-key'] = carrier_data['dest-lat-long'] + carrier_data['FL_ARR_DATE'].dt.date.astype(str)
weather_df['weather-key'] = weather_df['lat-long'] + weather_df['date']

In [7]:
# Merge in daily weather data for the origin airport, then for the destination airport
carrier_data = carrier_data.merge(weather_df, on='weather-key')
weather_df = weather_df.add_prefix('dest-')
carrier_data = carrier_data.merge(weather_df, on='dest-weather-key')
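
Because these are inner merges, any flight whose weather key has no match is silently dropped. A quick way to check how many rows that affects is a left merge with `indicator=True`. The sketch below uses made-up keys and a hypothetical `maxtemp` value. `validate='many_to_one'` additionally raises an error if a weather key appears more than once in the weather table.

```python
import pandas as pd

# Made-up flights and weather rows keyed by "lat,long" + date
flights = pd.DataFrame({
    'weather-key': ['32.73,-117.192021-07-13', '39.86,-104.672021-07-13']
})
weather = pd.DataFrame({
    'weather-key': ['32.73,-117.192021-07-13'],
    'maxtemp': [24.1],
})

# Left merge keeps every flight; _merge records whether a weather row matched
checked = flights.merge(weather, on='weather-key', how='left',
                        indicator=True, validate='many_to_one')

unmatched = (checked['_merge'] == 'left_only').sum()
print('Flights without weather data: {}'.format(unmatched))
```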

In [8]:
irrelevant_columns = ['YEAR', 'FL_DATE', 'MKT_CARRIER_FL_NUM', 'OP_CARRIER', 'TAIL_NUM',
                      'OP_CARRIER_FL_NUM', 'CRS_DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'CRS_ARR_TIME',
                      'ARR_DELAY_NEW', 'origin-lat-long', 'origin-tz', 'dest-lat-long_x', 'dest-tz',
                      'flight_duration', 'FL_ARR_DATE_REL_ORIGIN', 'FL_ARR_DATE', 'takeoff-hour',
                      'arriving-hour', 'takeoff-congestion-key', 'arrival-congestion-key', 'congestion-key',
                      'dest-congestion-key', 'days-from-holiday', 'weather-key', 'dest-weather-key',
                      'lat-long', 'date', 'dest-lat-long_y', 'dest-date']

carrier_data.drop(columns=irrelevant_columns, inplace=True)

## Saving out the cleaned data

In [9]:
carrier_data.to_csv('data/prepared/data_for_modeling.csv', index=False)