In [2]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy
import scipy.stats as stats
import math

# Plot settings
plt.rcParams['figure.figsize'] = [24, 8]

# Load the dataset
df = pd.read_csv('../dataset/full_data_flightdelay.csv', dtype='unicode')

<figure>
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right">
</figure>

# Semester Project - Part 1
*CSPB 3022 Data Science Algorithms - Spring 2022*

*Author: Thomas Cochran*

## Project Topic

**Project goal:**

The goal of this project is to create a binary classifier that predicts whether a domestic flight will be delayed, given a set of departure conditions.

**Type of problem:**

This is a binary classification problem with two target classes: `delayed` and `not delayed`. A flight belongs to the `delayed` class when its departure delay exceeds 15 minutes, which the dataset records in the `DEP_DEL15` column.
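Because the notebook loads every column as a string (`dtype='unicode'`), the target needs to be converted before it can be used as a label. The following is a minimal sketch of that step, not the project's final preprocessing: the helper name `make_target` is hypothetical, and the four sample values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def make_target(dep_del15: pd.Series) -> pd.Series:
    """Convert a string-valued DEP_DEL15 column to an integer 0/1 label.

    Hypothetical helper: 1 means `delayed`, 0 means `not delayed`.
    """
    # errors="raise" makes unexpected values fail loudly instead of becoming NaN
    return pd.to_numeric(dep_del15, errors="raise").astype("int8")


# Illustrative stand-in for the real data (values are invented)
sample = pd.DataFrame({"DEP_DEL15": ["0", "1", "0", "0"]})

y = make_target(sample["DEP_DEL15"])

# Share of each class; useful for spotting class imbalance early
class_share = y.value_counts(normalize=True).sort_index()
print(class_share)
```

On the full dataset, the same call would be `make_target(df['DEP_DEL15'])`. Checking the class shares first is worthwhile because a strongly imbalanced target affects which evaluation metrics are meaningful.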

**Project motivation:**

Many of us have found ourselves sitting on a boarded plane, probably in a coach middle seat, wondering: why hasn't this plane taken off yet? It would be nice to mentally prepare ourselves for this situation, especially if the person in front of us has already put their seat in full recline mode. This is one motivation for working on this project. An additional motivation is the ability to plan our travel a little more efficiently by predicting future delays and then planning around them.


## Dataset

**Source:**

The dataset used in this project is from Kaggle ([source](https://www.kaggle.com/datasets/threnjen/2019-airline-delays-and-cancellations)). It consists of a single file, `dataset/full_data_flightdelay.csv`, which contains 1.27 GB of data: 6,489,062 rows of domestic airline departures and the associated weather for 2019. The dataset was merged from the following primary sources:

1. Bureau of Transportation Statistics: [Link](https://www.transtats.bts.gov/databases.asp?Z1qr_VQ=E&Z1qr_Qr5p=N8vn6v10&f7owrp6_VQF=D)
2. National Centers for Environmental Information (NOAA): [Link](https://www.ncdc.noaa.gov/cdo-web/datasets)

**Description:**

Departure data from the Bureau of Transportation Statistics consists primarily of monthly performance reports that contain many features for domestic departures, including:

* Day of week
* Delayed or not
* Airline carrier name
* Age of the departing aircraft

The NOAA data is merged with the departure data above, which adds some potentially interesting features that may contribute to delays, such as:

* Departure snowfall and precipitation in inches
* Departure temperature
* Departure wind speed

In the dataset, each row corresponds to one domestic departure. Each row has 26 features: 8 categorical and 18 numerical. A description of each feature can be found in the file `dataset/documentation.md`.
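Since the CSV is read with `dtype='unicode'`, pandas reports every column as a string, so the categorical/numerical split is not visible from `df.dtypes`. One possible way to recover it is sketched below: try to parse each column as a number and see whether every value converts. The function name `split_feature_types` and the demo frame are assumptions made for illustration. The heuristic also treats number-coded categories (such as a month stored as `4`) as numerical, so its counts will not necessarily match the 8/18 split described above.

```python
import pandas as pd


def split_feature_types(frame: pd.DataFrame):
    """Split columns into (numeric, categorical) by attempting numeric parsing.

    Hypothetical helper: a column counts as numeric only if every
    non-missing value can be parsed as a number.
    """
    numeric, categorical = [], []
    for col in frame.columns:
        # Unparseable values become NaN instead of raising
        converted = pd.to_numeric(frame[col], errors="coerce")
        if converted.notna().sum() == frame[col].notna().sum():
            numeric.append(col)
        else:
            categorical.append(col)
    return numeric, categorical


# Small string-typed stand-in that mimics how the notebook loads the CSV
demo = pd.DataFrame({
    "MONTH": ["4", "9"],
    "CARRIER_NAME": ["SkyWest Airlines Inc.", "American Airlines Inc."],
    "PRCP": ["0.0", "0.24"],
})

numeric_cols, categorical_cols = split_feature_types(demo)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real data, `split_feature_types(df)` gives a first-pass grouping that can then be checked against `dataset/documentation.md`.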

In [21]:
# A brief look at a random sample of departure flights and some of their features
df[[ 'MONTH', 'DEP_DEL15', 'CARRIER_NAME', 'NUMBER_OF_SEATS', 
     'DEPARTING_AIRPORT', 'PLANE_AGE', 'PRCP', 'SNOW']].sample(n=8, replace=False)

| | MONTH | DEP_DEL15 | CARRIER_NAME | NUMBER_OF_SEATS | DEPARTING_AIRPORT | PLANE_AGE | PRCP | SNOW |
|---|---|---|---|---|---|---|---|---|
| 2014130 | 4 | 0 | SkyWest Airlines Inc. | 50 | Northwest Arkansas Regional | 17 | 0.0 | 0.0 |
| 1574409 | 4 | 1 | American Airlines Inc. | 150 | Douglas Municipal | 19 | 0.0 | 0.0 |
| 4538813 | 9 | 0 | American Airlines Inc. | 160 | Seattle International | 19 | 0.0 | 0.0 |
| 1722524 | 4 | 0 | Midwest Airline, Inc. | 69 | Ronald Reagan Washington National | 15 | 0.0 | 0.0 |
| 592810 | 2 | 0 | American Eagle Airlines Inc. | 44 | Cleveland-Hopkins International | 18 | 0.0 | 0.0 |
| 2601383 | 6 | 1 | American Airlines Inc. | 172 | Salt Lake City International | 9 | 0.01 | 0.0 |
| 5561489 | 11 | 0 | Delta Air Lines Inc. | 191 | Atlanta Municipal | 1 | 0.24 | 0.0 |
| 1110764 | 3 | 0 | Southwest Airlines Co. | 175 | Pittsburgh International | 5 | 0.0 | 0.0 |


## Data Cleaning and Exploratory Data Analysis (EDA)

TODO