<h1 align="center" style="margin:0;padding:0;">Flight Delays</h1>
<p align="center" style="margin-top:0;padding-top:0;font-style:italic;">By <a href="https://github.com/sudislife/">Sudaksh Mishra</a></p>

Job searching can be exhausting, and it's important to show recruiters that I can code. I came across this competition on Kaggle, and I'm pretty sure everyone is using the CatBoost model because it was part of a course. I thought, why not use this to flex my ML skills? One of my favourite professors said to me during my Master's degree:

> "Sure, you can kill a fly with a tank, but do you really need to use a tank to do that? Similarly you can use a neural network on everything, but should you? It's just overkill."
>
> -- By [Dr. Reda Bouadjenek](https://www.linkedin.com/in/rbouadjenek/)

So today, I will be building an Artificial Neural Network (ANN), maybe with a Recurrent Neural Network (RNN) since the data contains some dates, knowing full well that CatBoost already scores about 73.5%. I really hope I can match that, or at least score better than 70%.

In [2]:
import pandas as pd
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

In [3]:
# Data Source: Yury Kashnitsky. (2018). mlcourse.ai: Flight delays. Kaggle. https://kaggle.com/competitions/flight-delays-fall-2018
train = pd.read_csv('Data/flight_delays_train.csv')
test  = pd.read_csv('Data/flight_delays_test.csv')

In [None]:
train.describe()

| | DepTime | Distance |
|---|---|---|
| count | 100000.0 | 100000.0 |
| mean | 1341.52388 | 729.39716 |
| std | 476.378445 | 574.61686 |
| min | 1.0 | 30.0 |
| 25% | 931.0 | 317.0 |
| 50% | 1330.0 | 575.0 |
| 75% | 1733.0 | 957.0 |
| max | 2534.0 | 4962.0 |


In [None]:
train.head()

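| | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | c-4 | c-20 | c-3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | c-9 | c-2 | c-5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | c-11 | c-25 | c-6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | c-10 | c-7 | c-6 | 1828 | WN | MDW | OMA | 423 | Y |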


**Just by looking at the 2 cells above**

I will need to do at least 3 things before I do anything else with this data.

1. Remove the c- prefix from the Month, DayofMonth and DayOfWeek columns. (Is that inconsistent camel case you see? Yes, it is, and I'm going to leave it there to irritate you.)
2. Figure out how I'm going to one-hot encode UniqueCarrier, Origin, and Dest.
3. Convert N to 0 and Y to 1 in dep_delayed_15min.
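
The three steps above could look roughly like this. It is only a sketch on a toy frame copied from the `train.head()` output; the helper name `preprocess` is my own invention, and using `pd.get_dummies` for the one-hot step is an assumption, not necessarily what the final pipeline will do.

```python
import pandas as pd

# Toy frame copied from the train.head() output shown earlier
sample = pd.DataFrame({
    'Month':             ['c-8', 'c-4', 'c-9', 'c-11', 'c-10'],
    'DayofMonth':        ['c-21', 'c-20', 'c-2', 'c-25', 'c-7'],
    'DayOfWeek':         ['c-7', 'c-3', 'c-5', 'c-6', 'c-6'],
    'DepTime':           [1934, 1548, 1422, 1015, 1828],
    'UniqueCarrier':     ['AA', 'US', 'XE', 'OO', 'WN'],
    'Origin':            ['ATL', 'PIT', 'RDU', 'DEN', 'MDW'],
    'Dest':              ['DFW', 'MCO', 'CLE', 'MEM', 'OMA'],
    'Distance':          [732, 834, 416, 872, 423],
    'dep_delayed_15min': ['N', 'N', 'N', 'N', 'Y'],
})

def preprocess(df):
    """Hypothetical helper: apply the three clean-up steps to a copy of df."""
    out = df.copy()

    # Step 1: strip the 'c-' prefix and store the date parts as integers
    for col in ['Month', 'DayofMonth', 'DayOfWeek']:
        out[col] = out[col].str.replace('c-', '', regex=False).astype(int)

    # Step 3: map the target to 0/1
    out['dep_delayed_15min'] = out['dep_delayed_15min'].map({'N': 0, 'Y': 1})

    # Step 2: one-hot encode the categorical columns (one column per category)
    out = pd.get_dummies(out, columns=['UniqueCarrier', 'Origin', 'Dest'])
    return out

clean = preprocess(sample)
```

On the real data, Origin and Dest contain many more airports than this toy frame, so the one-hot result gets wide quickly. That is worth keeping in mind when sizing the input layer of the ANN.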

## Data Exploration

Let me look at the data the ANN will have to make sense of. As a scholar in ML, I've been taught to dislike:

1. Imbalanced Data
2. Outliers

We like to look for

1. 🌠Correlations in data🌠
2. 

In [27]:
px.histogram(train, 
             x     = 'dep_delayed_15min', 
             color = 'dep_delayed_15min', 
             title = 'Target distribution')

Of course, imbalanced data is always fun to see. As you know, we will need to deal with the imbalance so the model doesn't simply favour the majority class. There are a few ways:

1. Oversampling
2. Undersampling

And I learned a third way recently

3. [BinaryFocalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryFocalCrossentropy)
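
To make the third option a bit more concrete: focal loss doesn't resample anything. It scales the usual cross-entropy by `(1 - p_t) ** gamma`, where `p_t` is the probability the model gave to the true class, so easy examples count for less. Below is a minimal pure-Python sketch of that formula for a single example. The function name `binary_focal_loss` and the toy probabilities are mine, and this is not the Keras implementation, just the idea behind it. I set `gamma=2.0` by default because, as far as I know, that is the common choice.

```python
import math

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=None):
    """Focal loss for a single example (illustrative sketch).

    y_true: 0 or 1
    p_pred: predicted probability of class 1, strictly between 0 and 1
    gamma:  focusing parameter; gamma=0 gives plain binary cross-entropy
    alpha:  optional weight for the positive class (1 - alpha for the negative)
    """
    # p_t is the probability the model assigned to the true class
    p_t = p_pred if y_true == 1 else 1.0 - p_pred

    # Cross-entropy scaled down for confident, correct predictions
    loss = -((1.0 - p_t) ** gamma) * math.log(p_t)

    # Optional class weighting on top of the focusing term
    if alpha is not None:
        loss *= alpha if y_true == 1 else 1.0 - alpha
    return loss

# An on-time flight (0) that the model confidently calls on time
easy = binary_focal_loss(0, 0.1)
# A delayed flight (1) that the model also calls on time
hard = binary_focal_loss(1, 0.1)
```

Here `hard` comes out far larger than `easy`, which is the point: the many easy on-time flights stop dominating the loss, and the rarer misclassified delays get more of the gradient.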

In [16]:
fig = px.histogram(train, 
                   x       = 'UniqueCarrier', 
                   color   = 'dep_delayed_15min', 
                   title   = 'UniqueCarrier vs Number of Delays',
                   barmode = 'group',
                   labels  = {'dep_delayed_15min': 'Delayed >15min'}, 
                  )

# The bars count all flights per carrier, split by delay status
fig.update_layout(yaxis_title = 'Number of Flights', 
                  xaxis_title = 'Carrier Code')

fig.show()

In [25]:
fig = px.histogram(train,
                   x             = 'Distance',
                   color         = 'dep_delayed_15min',
                   title         = 'Distance vs Number of Delays',
                   labels        = {'dep_delayed_15min': 'Delayed >15min'},
                   pattern_shape = 'dep_delayed_15min',
                   marginal      = 'box',
                   nbins         = 100,
                   width         = 1200,
                   height        = 600
                  )

# The bars count all flights per distance bin, split by delay status
fig.update_layout(yaxis_title = 'Number of Flights')
fig.show()

In [9]:
odNumDelay = train.groupby(['Origin', 'Dest'])['dep_delayed_15min'].value_counts()

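# One row per route with the N and Y counts as columns; a missing class becomes 0
delayCounts = odNumDelay.unstack().fillna(0)
# Net count per route: delayed (Y) minus on-time (N); negative means more on-time flights
delayCounts['num_dep_delayed'] = delayCounts['Y'] - delayCounts['N']

# fillna turned the counts into floats, so cast them back to int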
delayCounts['num_dep_delayed'] = delayCounts['num_dep_delayed'].astype(int)
delayCounts['Y'] = delayCounts['Y'].astype(int)
delayCounts['N'] = delayCounts['N'].astype(int)
delayCounts.reset_index(inplace=True)
delayCounts

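| | Origin | Dest | N | Y | num_dep_delayed |
|---|---|---|---|---|---|
| 0 | ABE | ATL | 10 | 2 | -8 |
| 1 | ABE | CLE | 18 | 0 | -18 |
| 2 | ABE | CLT | 2 | 0 | -2 |
| 3 | ABE | CVG | 13 | 4 | -9 |
| 4 | ABE | DTW | 4 | 0 | -4 |
| ... | ... | ... | ... | ... | ... |
| 4424 | YAK | CDV | 3 | 1 | -2 |
| 4425 | YAK | JNU | 4 | 1 | -3 |
| 4426 | YUM | IPL | 3 | 0 | -3 |
| 4427 | YUM | LAX | 13 | 4 | -9 |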
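
The difference Y - N depends a lot on how many flights a route has, so the share of delayed flights per route may be easier to compare. Here is a small sketch of that idea using `pd.crosstab` with `normalize='index'`. The toy frame `routes` is rebuilt from the ABE to ATL and YUM to LAX counts in the table above, and the column name `delay_rate` is my own.

```python
import pandas as pd

# Toy frame rebuilt from two rows of the table above:
# ABE -> ATL has 10 on-time and 2 delayed, YUM -> LAX has 13 on-time and 4 delayed
routes = pd.DataFrame({
    'Origin': ['ABE'] * 12 + ['YUM'] * 17,
    'Dest':   ['ATL'] * 12 + ['LAX'] * 17,
    'dep_delayed_15min': ['N'] * 10 + ['Y'] * 2 + ['N'] * 13 + ['Y'] * 4,
})

# Rows are (Origin, Dest) pairs, columns are N and Y, and each row sums to 1
shares = pd.crosstab([routes['Origin'], routes['Dest']],
                     routes['dep_delayed_15min'],
                     normalize='index')

# Keep only the delayed share and flatten the index back into columns
delay_rate = shares['Y'].rename('delay_rate').reset_index()
```

Both routes have almost the same net count (-8 and -9), but their delayed shares differ (2 out of 12 versus 4 out of 17), which the raw difference hides.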


In [15]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap(
    x          = delayCounts['Origin'], 
    y          = delayCounts['Dest'], 
    z          = delayCounts['num_dep_delayed'],
    colorscale = 'Blues',)
)

fig.update_layout(width  = 850, 
                  height = 800,
                  title  = 'Delayed minus on-time flights between Origin and Destination',
                  xaxis_title = 'Origin',
                  yaxis_title = 'Destination')

fig.show()

## Preprocessing