# Prediction in Minutes
In this file we will look at a few regressors from scikit-learn and XGBoost and how well they predict the arrival delay of a flight in minutes.

#### Objectives
1. Read the file.
2. Keep only those data points where the flight was late.
3. Split the dataset into train and test data.
4. Fit a few models and look at a few metrics for our predicted values.

The different regressors we use in this file are:
1. Linear Regression [LOOKUP](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
2. Extra Trees Regressor [LOOKUP](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
3. XGBoost Regressor [LOOKUP](https://xgboost.readthedocs.io/en/latest/python/python_api.html)
4. Random Forest Regressor [LOOKUP](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

#### Scope
Knowing a flight's delay in minutes, we can try to adjust the schedule of the other flights so that subsequent flights are not delayed.

#### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np  

In [2]:
# Load the merged flight and weather data and preview the first rows
df = pd.read_csv("Data/Merged_Flight_Weather.csv", index_col = 0)
df.head()


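| Unnamed: 0 | ArrDel15 | ArrDelayMinutes | ArrTime | CRSArrTime | CRSDepTime | DayofMonth | DepDel15 | DepDelayMinutes | DepTime | DestAirportID | ... | pressure | cloudcover | DewPointF | WindGustKmph | tempF | WindChillF | humidity | time | date | airport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 8.0 | 1610.0 | 1602 | 745 | 1 | 0.0 | 0.0 | 741.0 | 12478 | ... | 1030 | 0 | 23 | 8 | 34 | 29 | 65 | 800 | 2016-01-01 | SEA |
| 1 | 1.0 | 33.0 | 1237.0 | 1204 | 820 | 1 | 0.0 | 1.0 | 821.0 | 14107 | ... | 1030 | 0 | 23 | 8 | 34 | 29 | 65 | 800 | 2016-01-01 | SEA |
| 2 | 0.0 | 0.0 | 1138.0 | 1143 | 759 | 1 | 0.0 | 0.0 | 754.0 | 11292 | ... | 1030 | 0 | 23 | 8 | 34 | 29 | 65 | 800 | 2016-01-01 | SEA |
| 3 | 0.0 | 10.0 | 1401.0 | 1351 | 738 | 1 | 0.0 | 2.0 | 740.0 | 12266 | ... | 1030 | 0 | 23 | 8 | 34 | 29 | 65 | 800 | 2016-01-01 | SEA |
| 4 | 0.0 | 0.0 | 1524.0 | 1527 | 745 | 1 | 0.0 | 0.0 | 742.0 | 10397 | ... | 1030 | 0 | 23 | 8 | 34 | 29 | 65 | 800 | 2016-01-01 | SEA |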


In [3]:
# Checking if there are any missing values
df.isnull().sum()

ArrDel15           0
ArrDelayMinutes    0
ArrTime            0
CRSArrTime         0
CRSDepTime         0
DayofMonth         0
DepDel15           0
DepDelayMinutes    0
DepTime            0
DestAirportID      0
FlightDate         0
Month              0
Origin             0
OriginAirportID    0
Quarter            0
Year               0
DeptTime           0
Unnamed: 0.1       0
windspeedKmph      0
winddirDegree      0
weatherCode        0
precipMM           0
visibility         0
pressure           0
cloudcover         0
DewPointF          0
WindGustKmph       0
tempF              0
WindChillF         0
humidity           0
time               0
date               0
airport            0
dtype: int64

In [4]:
# Label-encode the categorical columns (Origin and airport) as integer codes
df["Origin"] = df["Origin"].astype('category')
df["airport"] = df["airport"].astype('category')
df["Origin"] = df["Origin"].cat.codes
df["airport"] = df["airport"].cat.codes
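
To make the encoding step concrete, here is a minimal sketch on a made-up series of airport codes (the values are hypothetical and not taken from the dataset). It shows that `cat.codes` assigns integers following the sorted order of the categories, and how the mapping back to the original labels can be kept before the column is overwritten.

```python
import pandas as pd

# Hypothetical airport codes, for illustration only.
airports = pd.Series(["SEA", "ATL", "SEA", "JFK"])

# Convert to a categorical dtype; string categories are sorted alphabetically.
as_category = airports.astype("category")

# Integer code for each row, based on the position of its category.
codes = as_category.cat.codes

# Keep the code -> label mapping so the encoding can be reversed later.
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```

Note that these integer codes carry an artificial ordering, which tree-based models usually tolerate better than linear ones.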

#### Filtering the data
We will only look at flights that arrived late (`ArrDel15 == 1.0`, i.e. an arrival delay of 15 minutes or more), because it would be redundant to predict the delay in minutes for flights that did not arrive late.

In [5]:
# Looking at only those flights that arrived late
X = df[df.ArrDel15 == 1.0]
y = X['ArrDelayMinutes']
X = X.drop(['ArrDelayMinutes','FlightDate','ArrTime','date','airport','time'], axis = 1)

In [6]:
# Looking at the features
X.columns

Index(['ArrDel15', 'CRSArrTime', 'CRSDepTime', 'DayofMonth', 'DepDel15',
       'DepDelayMinutes', 'DepTime', 'DestAirportID', 'Month', 'Origin',
       'OriginAirportID', 'Quarter', 'Year', 'DeptTime', 'Unnamed: 0.1',
       'windspeedKmph', 'winddirDegree', 'weatherCode', 'precipMM',
       'visibility', 'pressure', 'cloudcover', 'DewPointF', 'WindGustKmph',
       'tempF', 'WindChillF', 'humidity'],
      dtype='object')

In [7]:
# Splitting the data into train and test data
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.80)

In [8]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((308000, 27), (77000, 27), (308000,), (77000,))
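
The split above is random, so the exact metrics will change from run to run. As a minimal sketch of how the split could be made repeatable, the example below passes a fixed `random_state` to `train_test_split` on small made-up arrays; the seed value 42 and the `*_demo` names are arbitrary assumptions, not something used in this file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (hypothetical, for illustration).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Splitting twice with the same seed yields the same partition.
split_a = train_test_split(X_demo, y_demo, train_size=0.80, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.80, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)
```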

### Metrics used
To evaluate our models we will look at a few metrics, namely
1. Mean Absolute Error (MAE): the average absolute difference between the actual and the predicted values.

2. Mean Squared Error (MSE): the average of the squared errors, that is, the average squared difference between the predicted values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss.

3. Root Mean Squared Error (RMSE): the square root of MSE, which brings the error back to the units of the target (minutes).

4. Coefficient of Determination ($R^2$): denoted R² and pronounced "R squared", it is the proportion of the variance in the dependent variable that is predictable from the independent variables.
    

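${MAE} = \frac{1}{n}\sum \limits _{i=1} ^{n} |y_{i} - \hat{y}_{i}| $

${MSE} = \frac{1}{n}\sum \limits _{i=1} ^{n} (y_{i} - \hat{y}_{i})^2 $

${RMSE} = \sqrt{\frac{1}{n}\sum \limits _{i=1} ^{n} (y_{i} - \hat{y}_{i})^2} $

${R^2} = 1 - \frac{\sum \limits _{i=1} ^{n} (y_{i} - \hat{y}_{i})^2}{\sum \limits _{i=1} ^{n} (y_{i} - \bar{y})^2} $

Here $y_{i}$ is the actual delay, $\hat{y}_{i}$ the predicted delay, $\bar{y}$ the mean of the actual delays, and $n$ the number of test samples.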
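
As a worked example of the four metrics, the sketch below computes them by hand in plain Python on four made-up delay values. The function name `regression_metrics` and the sample numbers are hypothetical; $R^2$ is computed as one minus the ratio of the residual sum of squares to the total sum of squares, which is also how scikit-learn's `r2_score` defines it.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error

    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical delays in minutes, made up for illustration only.
y_true = [20.0, 30.0, 40.0, 50.0]
y_pred = [22.0, 28.0, 43.0, 47.0]

scores = regression_metrics(y_true, y_pred)
print(scores)
```

For these values the errors are -2, 2, -3 and 3, so MAE is 2.5, MSE is 6.5, RMSE is about 2.55 and $R^2$ is 0.948.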

# Prediction

## Linear Regression

In [9]:
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()
linear_regressor.fit(X_train,y_train)
y_pred_linear_regression = linear_regressor.predict(X_test)

In [10]:
import sklearn.metrics as metrics
mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred_linear_regression) 
MSE=metrics.mean_squared_error(y_test, y_pred_linear_regression) 
RMSE= np.sqrt(MSE)
R2=metrics.r2_score(y_test, y_pred_linear_regression)
print("MAE: {}".format(mean_absolute_error))
print("MSE: {}".format(MSE))
print("RMSE: {}".format(RMSE))
print("R-Squared: {}".format(R2))

MAE: 12.021289036296457
MSE: 300.15399861065396
RMSE: 17.324953062292952
R-Squared: 0.9412113307019422


## Extra Trees Regressor

In [11]:
from sklearn.ensemble import ExtraTreesRegressor
extra_trees_regressor = ExtraTreesRegressor()
extra_trees_regressor.fit(X_train,y_train)
y_pred_extra_trees = extra_trees_regressor.predict(X_test)

In [12]:
mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred_extra_trees) 
MSE=metrics.mean_squared_error(y_test, y_pred_extra_trees) 
RMSE= np.sqrt(MSE)
R2=metrics.r2_score(y_test, y_pred_extra_trees)
print("MAE: {}".format(mean_absolute_error))
print("MSE: {}".format(MSE))
print("RMSE: {}".format(RMSE))
print("R-Squared: {}".format(R2))

MAE: 2.3501311688311683
MSE: 54.93284868311689
RMSE: 7.411669763495732
R-Squared: 0.9892407594442176


## XGBoost Regressor

In [13]:
from xgboost import XGBRegressor
xgboost_regressor = XGBRegressor()
xgboost_regressor.fit(X_train,y_train)
y_pred_xgboost = xgboost_regressor.predict(X_test)

In [14]:
mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred_xgboost) 
MSE=metrics.mean_squared_error(y_test, y_pred_xgboost) 
RMSE= np.sqrt(MSE)
R2=metrics.r2_score(y_test, y_pred_xgboost)
print("MAE: {}".format(mean_absolute_error))
print("MSE: {}".format(MSE))
print("RMSE: {}".format(RMSE))
print("R-Squared: {}".format(R2))

MAE: 10.635369193300024
MSE: 234.87361550124666
RMSE: 15.325586954542612
R-Squared: 0.9539972568333066


## Random Forest Regressor

In [15]:
from sklearn.ensemble import RandomForestRegressor
random_forest_regressor = RandomForestRegressor()
random_forest_regressor.fit(X_train,y_train)
y_pred_random_forest = random_forest_regressor.predict(X_test)

In [16]:
mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred_random_forest) 
MSE=metrics.mean_squared_error(y_test, y_pred_random_forest) 
RMSE= np.sqrt(MSE)
R2=metrics.r2_score(y_test, y_pred_random_forest)
print("MAE: {}".format(mean_absolute_error))
print("MSE: {}".format(MSE))
print("RMSE: {}".format(RMSE))
print("R-Squared: {}".format(R2))

MAE: 5.785841038961038
MSE: 85.27390864415585
RMSE: 9.234387291215148
R-Squared: 0.9832981081770066
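
To compare the four models side by side, the sketch below collects the metrics printed above (rounded) and ranks the models by RMSE. The dictionary layout and variable names are illustrative assumptions; the numbers are the ones reported by this run and would differ on another random split.

```python
# Metrics copied (rounded) from the outputs printed in this file.
reported = {
    "Linear Regression": {"MAE": 12.021, "RMSE": 17.325, "R2": 0.9412},
    "Extra Trees": {"MAE": 2.350, "RMSE": 7.412, "R2": 0.9892},
    "XGBoost": {"MAE": 10.635, "RMSE": 15.326, "R2": 0.9540},
    "Random Forest": {"MAE": 5.786, "RMSE": 9.234, "R2": 0.9833},
}

# Rank the models from lowest to highest RMSE.
ranking = sorted(reported.items(), key=lambda item: item[1]["RMSE"])

for name, m in ranking:
    print(f"{name:<18} MAE={m['MAE']:.3f}  RMSE={m['RMSE']:.3f}  R2={m['R2']:.4f}")
```

On this run the Extra Trees Regressor has the lowest MAE and RMSE, followed by Random Forest, XGBoost and Linear Regression.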
