# Predicting Flight Delay

Problem Set-up:
We define a delayed flight to be one whose departure is delayed by more than 15 minutes. 
The prediction problem is to train a binary classifier that predicts whether or not a flight will be delayed.

Use case:
- This model is intended to help with choosing airlines, flight paths, and airports at the time of booking, well in advance of the scheduled departure (days, weeks, or months ahead of time). The prediction problem therefore focuses on features that are known in advance, rather than on day-of features such as weather and earlier flights from the same day.

Notes:
- We restrict the analysis to relatively large airports: those with more than 20 (domestic) flights a day.
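
One way the airport restriction could be applied is sketched below on a toy table. This is a minimal illustration, not necessarily how `fld.io` does it: the helper `filter_large_airports` and its `min_daily_flights` parameter are hypothetical, and it assumes that "flights a day" means departures, counted using the `ORIGIN_AIRPORT`, `YEAR`, `MONTH`, and `DAY` columns of the flights table.

```python
import pandas as pd


def filter_large_airports(df, min_daily_flights=20):
    """Keep flights departing from airports that average more than
    `min_daily_flights` departures per day (hypothetical helper)."""
    # Count departures per airport per calendar day
    daily = df.groupby(['ORIGIN_AIRPORT', 'YEAR', 'MONTH', 'DAY']).size()
    # Average over the days on which the airport has at least one departure
    mean_daily = daily.groupby(level='ORIGIN_AIRPORT').mean()
    large = mean_daily[mean_daily > min_daily_flights].index
    return df[df['ORIGIN_AIRPORT'].isin(large)]


# Toy data: 'AAA' has 3 departures on one day, 'BBB' has 1
toy = pd.DataFrame({
    'ORIGIN_AIRPORT': ['AAA', 'AAA', 'AAA', 'BBB'],
    'YEAR': [2015] * 4, 'MONTH': [1] * 4, 'DAY': [1] * 4,
})
large_only = filter_large_airports(toy, min_daily_flights=2)
```

With a threshold of 2, only the three `'AAA'` rows remain in `large_only`.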

In [1]:
# Imports
from sklearn.linear_model import LogisticRegression

import numpy as np
import pandas as pd

In [2]:
# Import custom code
import fld.io as flio

In [3]:
# Set data path
dat_path = "/Users/thomasdonoghue/Documents/UCSD/1-Classes/2016-2017/" \
           "2-Winter/CSE255_WebMining/Assignments/Assgn-2/Data/"

In [4]:
# Load all data
airlines_df, airports_df, flights_df = flio.load_data(dat_path, N_flights=10000)

In [5]:
# Drop cancelled flights (take a copy, so that columns can be added later)
flights_df = flights_df[flights_df['CANCELLED'] != 1].copy()
#flights_df = flights_df[np.isfinite(flights_df['DEPARTURE_DELAY'])]
#flights_df = flights_df.dropna(subset=['DEPARTURE_DELAY'])

In [6]:
# Check available features
flights_df.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK',
       'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT',
       'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME',
       'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME',
       'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN',
       'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED',
       'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY',
       'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY',
       'WEATHER_DELAY'],
      dtype='object')
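
Given the use case, only some of these columns would be available at booking time. The sketch below separates them on a toy table. The list `ADVANCE_COLS` is an assumption based on the column names (it is not exhaustive and is not stated by the data source), and both `ADVANCE_COLS` and `advance_features` are hypothetical names.

```python
import pandas as pd

# Columns assumed to be known at booking time (an assumption based on names)
ADVANCE_COLS = ['MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'ORIGIN_AIRPORT',
                'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE',
                'SCHEDULED_TIME', 'SCHEDULED_ARRIVAL', 'DISTANCE']


def advance_features(df):
    """Return only the columns of `df` assumed to be known in advance."""
    present = [col for col in ADVANCE_COLS if col in df.columns]
    return df[present]


toy = pd.DataFrame({
    'MONTH': [1, 7], 'DAY_OF_WEEK': [4, 2], 'DISTANCE': [1448, 2330],
    'DEPARTURE_DELAY': [-11.0, 32.0],   # only known after the fact
    'TAXI_OUT': [21.0, 12.0],           # only known after the fact
})
booking_view = advance_features(toy)
```

Columns such as `DEPARTURE_DELAY`, `TAXI_OUT`, and the `*_DELAY` breakdowns are outcomes of the flight, so a model meant for booking-time use should not see them as inputs.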

In [8]:
# Define helper functions for labels and features
def create_label(row):
    """Return 1 if the flight departed more than 15 minutes late, else 0."""
    return int(row['DEPARTURE_DELAY'] > 15)

def features(df):
    """Return the feature matrix for the flights in df."""
    return df[['DISTANCE', 'DAY_OF_WEEK']].values
    #return np.array(df['DISTANCE']).reshape(-1, 1)
    #return np.hstack([np.array(df['AIR_TIME']).reshape(-1, 1),
    #                 np.array(df['DISTANCE']).reshape(-1, 1)])

def labels(df):
    """Return the label vector for the flights in df."""
    return df['LABEL'].values

In [9]:
# Create labels
flights_df['LABEL'] = flights_df.apply(create_label, axis=1)

In [10]:
# Get features and labels
X = features(flights_df)
y = labels(flights_df)

In [11]:
# Set up model, and train it
model = LogisticRegression()
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# Check training score
model.score(X, y)

0.83586594504579514
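
The score above is computed on the same rows the model was trained on, so it says little about how the model would do on new flights. A minimal sketch of a held-out evaluation follows. It uses synthetic stand-in data rather than the flights data, and the names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the flights data): two features, binary label
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=1000) > 1.0).astype(int)

# Hold out 25% of the rows; stratify to keep the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

demo_model = LogisticRegression()
demo_model.fit(X_train, y_train)

# Accuracy on rows seen during training vs. rows held out
train_acc = demo_model.score(X_train, y_train)
test_acc = demo_model.score(X_test, y_test)
```

The same pattern would apply here by passing `X` and `y` to `train_test_split` and scoring the fitted model on the held-out part only.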

In [13]:
# Count how many flights the model predicts as delayed (0 means none)
sum(model.predict(X))

0

In [14]:
# Check what fraction of flights in the training data are delayed
sum(y) / len(y)

0.16413405495420483
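
Taken together, these outputs show that the training accuracy (0.8359) is exactly one minus the delayed fraction (0.1641): the model predicts "not delayed" for every flight, so it does no better than a majority-class baseline. The sketch below reproduces that pattern on synthetic data and shows one common option, `class_weight='balanced'`, for getting a logistic regression to predict the minority class at all. The data and variable names are illustrative assumptions, not results from the flights data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: about 16% positives, one weakly informative feature
rng = np.random.RandomState(0)
n = 2000
y_demo = (rng.uniform(size=n) < 0.16).astype(int)
X_demo = (0.5 * y_demo + rng.normal(size=n)).reshape(-1, 1)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

# Reweight classes inversely to their frequency
balanced = LogisticRegression(class_weight='balanced').fit(X_demo, y_demo)
n_predicted_delays = int(balanced.predict(X_demo).sum())
```

Here `baseline_acc` equals the fraction of negative examples, which is the number any real model has to beat. Reweighting trades overall accuracy for recall on the delayed class, so accuracy alone is not a good yardstick for this problem; per-class precision and recall are more informative.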