<h2>Imports</h2>
<ul>
    <li>NumPy</li>
    <li>Pandas</li>
    <li>Matplotlib</li>
    <li>Seaborn</li>
    <li>Scikit-Learn</li>
</ul>

In [24]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import copy

# Scikit-Learn imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

<h2>Data loading & Preprocessing</h2>
<h4>Data loading</h4>
<ul>
    <li>Load data</li>
    <li>Split train and test data</li>
</ul>
<h4>Preprocessing</h4>
<ul>
    <li>Categorical features</li>
    <li>Missing values</li>
    <li>Normalizing</li>
    <li>Eliminate correlated features</li>
</ul>
<p><b>NOTE:</b> Load the data from 2019 and 2020 and concatenate the two data frames</p>

In [13]:
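# Load data
data_2019 = pd.read_csv('./Jan_2019_ontime.csv')
data_2020 = pd.read_csv('./Jan_2020_ontime.csv')

# Stack the two years row-wise (an outer merge would join on every shared column instead)
data = pd.concat([data_2019, data_2020], ignore_index=True)

# Drop the unwanted 'Unnamed: 21' column
# (pandas assigns this name to a header-less column when reading a CSV)
data = data.drop(columns=['Unnamed: 21'])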
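
<p>As a small sketch of the difference between stacking and joining, the toy frames below (hypothetical values, not taken from the dataset) suggest why concatenation is the safer fit for the note above: <code>pd.concat</code> keeps every row, while an outer <code>pd.merge</code> on all shared columns keeps only one copy of a row that appears in both frames.</p>

```python
import pandas as pd

# Toy stand-ins for the two monthly files (hypothetical values)
jan_2019 = pd.DataFrame({"DAY_OF_MONTH": [1, 2], "ORIGIN": ["ATL", "DTW"]})
jan_2020 = pd.DataFrame({"DAY_OF_MONTH": [2, 3], "ORIGIN": ["DTW", "JFK"]})

# Concatenation stacks the rows and rebuilds a clean 0..n-1 index
stacked = pd.concat([jan_2019, jan_2020], ignore_index=True)

# An outer merge joins on all shared columns, so the row that is
# identical in both frames appears only once in the result
joined = pd.merge(jan_2019, jan_2020, how="outer")

print(len(stacked), len(joined))
```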

<h4>Handle Categorical features</h4>
<p>1. Handle DEP_TIME_BLK by splitting it into two separate columns</p>

In [14]:
# Split each 'HHMM-HHMM' block at the hyphen into its start and end
# (the values stay zero-padded strings such as '0600')
DEP_blocks = data['DEP_TIME_BLK'].str.split('-', n=1, expand=True)
DEP_blocks.columns = ['DES_TIME_BLK_no1', 'DES_TIME_BLK_no2']

# Replace the original column with the two new ones and keep the result
data = pd.concat([data.drop(columns=['DEP_TIME_BLK']), DEP_blocks], axis=1)
data

Unnamed: 0,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN,...,DEST,DEP_TIME,DEP_DEL15,ARR_TIME,ARR_DEL15,CANCELLED,DIVERTED,DISTANCE,DES_TIME_BLK_no1,DES_TIME_BLK_no2
0,1,2,9E,20363,9E,N8688C,3280,11953,1195302,GNV,...,ATL,601.0,0.0,722.0,0.0,0.0,0.0,300.0,0600,0659
1,1,2,9E,20363,9E,N348PQ,3281,13487,1348702,MSP,...,CVG,1359.0,0.0,1633.0,0.0,0.0,0.0,596.0,1400,1459
2,1,2,9E,20363,9E,N8896A,3282,11433,1143302,DTW,...,CVG,1215.0,0.0,1329.0,0.0,0.0,0.0,229.0,1200,1259
3,1,2,9E,20363,9E,N8886A,3283,15249,1524906,TLH,...,ATL,1521.0,0.0,1625.0,0.0,0.0,0.0,223.0,1500,1559
4,1,2,9E,20363,9E,N8974C,3284,10397,1039707,ATL,...,FSM,1847.0,0.0,1940.0,0.0,0.0,0.0,579.0,1900,1959
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191326,31,5,9E,20363,9E,N331CA,4812,15412,1541205,TYS,...,DTW,1002.0,1.0,1128.0,1.0,0.0,0.0,443.0,0700,0759
1191327,31,5,9E,20363,9E,N295PQ,4813,11433,1143302,DTW,...,JFK,1747.0,0.0,1933.0,0.0,0.0,0.0,509.0,1700,1759
1191328,31,5,9E,20363,9E,N294PQ,4814,11996,1199603,GSP,...,LGA,554.0,0.0,752.0,0.0,0.0,0.0,610.0,0600,0659
1191329,31,5,9E,20363,9E,N228PQ,4815,10397,1039707,ATL,...,XNA,1714.0,0.0,1811.0,0.0,0.0,0.0,589.0,1700,1759
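
<p>The split can also produce numeric columns directly. The sketch below uses a toy frame (hypothetical rows in the same <code>HHMM-HHMM</code> format) and casts the two halves to integers, on the assumption that later numeric steps such as imputation and scaling prefer numbers to strings. Note that the cast drops leading zeros, so <code>'0600'</code> becomes <code>600</code>.</p>

```python
import pandas as pd

# Toy frame with the same "HHMM-HHMM" block format (hypothetical rows)
flights = pd.DataFrame({"DEP_TIME_BLK": ["0600-0659", "1400-1459", "1900-1959"]})

# Split on the hyphen into two columns, then cast both to integers
bounds = flights["DEP_TIME_BLK"].str.split("-", n=1, expand=True).astype(int)
bounds.columns = ["DES_TIME_BLK_no1", "DES_TIME_BLK_no2"]

# Swap the original column for the two numeric ones
flights = pd.concat([flights.drop(columns=["DEP_TIME_BLK"]), bounds], axis=1)
print(flights)
```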


<h4>Handle Categorical features</h4>
<p>2. Handle the other categorical features with label encoding (each unique value is mapped to an integer code)</p>

In [17]:
# List the names of all categorical columns
categorical_columns = ['OP_UNIQUE_CARRIER', 'OP_CARRIER', 'TAIL_NUM', 'ORIGIN', 'DEST']

# Handle categorical features using label encoder
encoder = LabelEncoder()
for col in categorical_columns:
    data[col] = encoder.fit_transform(data[col])
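
<p>Refitting a single encoder in a loop keeps only the mapping of the last column. If the codes need to be translated back to labels later, one option is to keep one fitted encoder per column. The sketch below shows this on a toy frame; the <code>encoders</code> dictionary and the sample values are illustrative assumptions, not part of the notebook.</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (hypothetical values)
toy = pd.DataFrame({"ORIGIN": ["GNV", "MSP", "GNV"], "DEST": ["ATL", "CVG", "ATL"]})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
for col in ["ORIGIN", "DEST"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the labels within each column
print(toy)

# A stored encoder can map codes back to the original strings
print(encoders["ORIGIN"].inverse_transform([0, 1]))
```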

<h4>Missing values</h4>
<p><b>Strategy:</b> replace each missing value with the <b>mean</b> of its column</p>
<p><b>NOTE:</b> At first I wanted to use <b>KNN</b> imputation, but on a data frame of about 1M x 22 it took too long to run. The KNN imputer is left commented out in the code below.</p>

In [23]:
# Select your desired imputer
imputer = SimpleImputer(strategy='mean')
# imputer = KNNImputer(n_neighbors=5)

# Handle missing values
imputed_data = imputer.fit_transform(data)
data_without_missing_values = pd.DataFrame(imputed_data, columns=data.columns)
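
<p>To make the mean strategy concrete, here is a minimal sketch on a toy frame (hypothetical values). Each missing entry is filled with the mean of the observed values in its own column.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (hypothetical values)
toy = pd.DataFrame({
    "DEP_TIME": [600.0, np.nan, 1200.0],
    "DISTANCE": [300.0, 500.0, np.nan],
})

# fit_transform returns a NumPy array, so rebuild the DataFrame afterwards
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# DEP_TIME gap -> mean of 600 and 1200; DISTANCE gap -> mean of 300 and 500
print(filled)
```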

<h4>Normalizing</h4>
<p><b>NOTE:</b> The method is standardization: each feature is transformed to new values with a mean of zero and a standard deviation of 1</p>
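<p>In formula form, each value x in a column becomes z = (x - mean) / std, computed per column. The sketch below checks this on a small made-up array by comparing <code>StandardScaler</code> with the manual computation. It relies on the population standard deviation (<code>ddof=0</code>), which is NumPy's default.</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: 3 samples, 2 features
X = np.array([[300.0, 1.0], [500.0, 3.0], [700.0, 5.0]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0

print(np.allclose(scaled, manual))
```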

In [None]:
scalar = StandardScaler()
normalized_data = 