<h1>Machine Learning - Predicting Values for The Fatalities Column</h1>

<h2>Preliminary Steps</h2>

Let's begin by importing the necessary libraries:

In [103]:
import math
import pandas as pd
import sklearn
import numpy as np
from sklearn import linear_model, metrics, preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import make_scorer
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

import warnings
warnings.filterwarnings("ignore")

For this step, we need to load the dataframe we cleaned in the Data Handling step:

In [104]:
df = pd.read_csv("df_cleaned.csv")
df.head()

Unnamed: 0,weekday,day,month,year,time,aircraft_type,num_of_engines,engine_type,engine_model,years_active,airframe_hrs,operator,occupants,accident_loc,above_ocean,flight_phase,damage,accident_latitude,accident_longtitude,fatalities
0,7,2.0,8,1919,13.566667,98,3.0,1,51,0.0,0.0,623,14.0,89,0,5.0,4.0,45.396389,10.888056,14.0
1,2,11.0,8,1919,13.566667,159,5.0,1,95,0.75,890.25,1722,7.0,187,0,4.0,3.0,51.94137,1.306789,1.0
2,4,18.0,8,1926,14.5,51,4.0,1,88,0.0,47.0,314,15.0,187,0,5.0,3.0,-51.174,0.868,4.0
3,2,22.0,8,1927,8.25,166,2.0,1,9,1.0,1187.0,1202,11.0,187,0,5.0,3.0,51.25,0.216,1.0
4,3,19.0,3,1929,13.566667,169,3.0,1,78,0.0,0.0,936,4.0,188,0,6.0,3.0,42.3,-83.21666,4.0


<h2>The Process</h2>

The machine learning stage consists of the following steps:

<h3>1) Separating our target column from the data frame</h3>

Our target column is <b>'fatalities'</b>, so we will separate it from the rest of the dataframe:

In [105]:
df_x = df[df.columns[df.columns != 'fatalities']]
ser_y = df['fatalities']

In [106]:
df_x.head()

Unnamed: 0,weekday,day,month,year,time,aircraft_type,num_of_engines,engine_type,engine_model,years_active,airframe_hrs,operator,occupants,accident_loc,above_ocean,flight_phase,damage,accident_latitude,accident_longtitude
0,7,2.0,8,1919,13.566667,98,3.0,1,51,0.0,0.0,623,14.0,89,0,5.0,4.0,45.396389,10.888056
1,2,11.0,8,1919,13.566667,159,5.0,1,95,0.75,890.25,1722,7.0,187,0,4.0,3.0,51.94137,1.306789
2,4,18.0,8,1926,14.5,51,4.0,1,88,0.0,47.0,314,15.0,187,0,5.0,3.0,-51.174,0.868
3,2,22.0,8,1927,8.25,166,2.0,1,9,1.0,1187.0,1202,11.0,187,0,5.0,3.0,51.25,0.216
4,3,19.0,3,1929,13.566667,169,3.0,1,78,0.0,0.0,936,4.0,188,0,6.0,3.0,42.3,-83.21666


In [107]:
ser_y.head()

0    14.0
1     1.0
2     4.0
3     1.0
4     4.0
Name: fatalities, dtype: float64

<h3>2) Splitting our data frame into train and test sets</h3>

We will set the size of our train set to 80% of the dataframe, and the test set to 20% of the dataframe:

In [108]:
X_train, X_test, y_train, y_test = train_test_split(df_x, ser_y, test_size=0.2, random_state=42) 
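To illustrate what the split produces, here is a minimal sketch on synthetic data. The names `demo_x`, `demo_y` and the column labels are made up for the example and are not part of the cleaned dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataframe (hypothetical values)
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
demo_y = pd.Series(rng.normal(size=100), name="target")

# test_size=0.2 keeps 20% of the rows for testing;
# random_state makes the shuffle repeatable
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42
)
print(Xa_train.shape, Xa_test.shape)  # (80, 3) (20, 3)

# Calling the function again with the same random_state gives the same split
Xb_train, Xb_test, _, _ = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42
)
same_split = Xa_test.index.equals(Xb_test.index)
print(same_split)  # True
```

Fixing `random_state` is what makes the scores of the later attempts comparable, as long as the test size stays the same.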

<h3>3) Training models on our train set</h3>

We will fit each model on the train set (the features and the target column):

In [109]:
linreg = LinearRegression().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
elasticnet = ElasticNet().fit(X_train, y_train)
ridge = Ridge().fit(X_train,y_train)

<h3>4) Initial Evaluation</h3>

We will first check how well each model predicts on the raw data frame:

In [110]:
print("Linear Regression:\n")
y_pred1 = linreg.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred1))
print("MSE Score: ", mean_squared_error(y_test, y_pred1))
print("\nLasso:\n")
y_pred2 = lasso.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred2))
print("MSE Score: ", mean_squared_error(y_test, y_pred2))
print("\nElastic Net:\n")
y_pred3 = elasticnet.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred3))
print("MSE Score: ", mean_squared_error(y_test, y_pred3))
print("\nRidge:\n")
y_pred4 = ridge.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred4))
print("MSE Score: ", mean_squared_error(y_test, y_pred4))

Linear Regression:

R2 Score:  0.3874127929526764
MSE Score:  698.7654011872179

Lasso:

R2 Score:  0.3796791673461588
MSE Score:  707.5869859957177

Elastic Net:

R2 Score:  0.35258005492252786
MSE Score:  738.4983761564546

Ridge:

R2 Score:  0.3873796607237897
MSE Score:  698.8031944270111
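The same fit-and-score cell is repeated for every attempt below. As a possible shortcut, the sketch below wraps it in a helper; `evaluate_models` is a hypothetical name, the data is synthetic, and the RMSE column is an addition that the notebook itself does not compute.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_models(X_train, X_test, y_train, y_test):
    """Fit the four linear models and return their test-set R2, MSE and RMSE."""
    models = {
        "Linear Regression": LinearRegression(),
        "Lasso": Lasso(),
        "Elastic Net": ElasticNet(),
        "Ridge": Ridge(),
    }
    rows = []
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rows.append({
            "model": name,
            "R2": r2_score(y_test, y_pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),  # same units as the target
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# train_test_split returns X_train, X_test, y_train, y_test in that order
scores = evaluate_models(*train_test_split(X_demo, y_demo, test_size=0.2, random_state=42))
print(scores)
```

RMSE is often easier to read than MSE because it is in the units of the target: the Linear Regression MSE of about 698.8 above corresponds to an RMSE of roughly 26.4 fatalities.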


<h3>5) Improvement Attempts</h3>

<b>First attempt:</b><br>
We will try to improve our model using <b>feature engineering</b>.<br>
Let's create a new column, <b>'fatality_rate'</b>, that combines values from other columns:

<ul>
    <li>For airborne flight phases (numbered 4, 5 and 7), set the value to the engine type multiplied by 3, plus the damage</li>
    <li>For transitional flight phases (numbered 3 and 6), set the value to the engine type multiplied by 2, plus the damage</li>
    <li>For all other values (below 3 or above 7), which indicate the aircraft is on the ground, set the value to the engine type plus the damage</li>
</ul><br>
In all cases, if the damage is 5 (missing aircraft), add 4 (aircraft destroyed) instead.

In [111]:
# Weight of the engine type by flight phase:
# 1 on the ground, 2 in a transitional phase, 3 when airborne
def phase_weight(phase):
    if phase < 3 or phase > 7:
        return 1
    if phase in (3, 6):
        return 2
    return 3

# Treat damage level 5 (missing aircraft) as 4 (aircraft destroyed)
damage = df['damage'].replace(5, 4)

df['fatality_rate'] = df['flight_phase'].map(phase_weight) * df['engine_type'] + damage

In [112]:
# Inspecting the dataframe with the new column (fatality_rate is appended after fatalities)
df.head()

Unnamed: 0,weekday,day,month,year,time,aircraft_type,num_of_engines,engine_type,engine_model,years_active,...,operator,occupants,accident_loc,above_ocean,flight_phase,damage,accident_latitude,accident_longtitude,fatalities,fatality_rate
0,7,2.0,8,1919,13.566667,98,3.0,1,51,0.0,...,623,14.0,89,0,5.0,4.0,45.396389,10.888056,14.0,7.0
1,2,11.0,8,1919,13.566667,159,5.0,1,95,0.75,...,1722,7.0,187,0,4.0,3.0,51.94137,1.306789,1.0,6.0
2,4,18.0,8,1926,14.5,51,4.0,1,88,0.0,...,314,15.0,187,0,5.0,3.0,-51.174,0.868,4.0,6.0
3,2,22.0,8,1927,8.25,166,2.0,1,9,1.0,...,1202,11.0,187,0,5.0,3.0,51.25,0.216,1.0,6.0
4,3,19.0,3,1929,13.566667,169,3.0,1,78,0.0,...,936,4.0,188,0,6.0,3.0,42.3,-83.21666,4.0,5.0
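The rule can be checked by hand on single rows. The sketch below mirrors the branching of the feature-engineering cell for one row at a time; `fatality_rate_value` is a hypothetical helper name, and the first two calls use the flight phase, engine type and damage of rows 0 and 4 shown above.

```python
def fatality_rate_value(flight_phase, engine_type, damage):
    """Compute the engineered value for a single row."""
    if flight_phase < 3 or flight_phase > 7:
        weight = 1  # on the ground
    elif flight_phase in (3, 6):
        weight = 2  # transitional phase
    else:
        weight = 3  # airborne
    if damage == 5:
        damage = 4  # missing aircraft is counted as destroyed
    return weight * engine_type + damage


# Row 0: flight_phase 5, engine_type 1, damage 4
print(fatality_rate_value(5.0, 1, 4.0))  # 7.0
# Row 4: flight_phase 6, engine_type 1, damage 3
print(fatality_rate_value(6.0, 1, 3.0))  # 5.0
# Made-up row: on the ground, missing aircraft
print(fatality_rate_value(2.0, 1, 5.0))  # 5
```

The first two results match the `fatality_rate` values of 7.0 and 5.0 in the output above.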


We will repeat the earlier steps and evaluate the models again:

In [113]:
df_x = df[df.columns[df.columns != 'fatalities']]
ser_y = df['fatalities']

X_train, X_test, y_train, y_test = train_test_split(df_x, ser_y, test_size=0.2, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
elasticnet = ElasticNet().fit(X_train, y_train)
ridge = Ridge().fit(X_train,y_train)

print("Linear Regression:\n")
y_pred1 = linreg.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred1))
print("MSE Score: ", mean_squared_error(y_test, y_pred1))
print("\nLasso:\n")
y_pred2 = lasso.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred2))
print("MSE Score: ", mean_squared_error(y_test, y_pred2))
print("\nElastic Net:\n")
y_pred3 = elasticnet.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred3))
print("MSE Score: ", mean_squared_error(y_test, y_pred3))
print("\nRidge:\n")
y_pred4 = ridge.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred4))
print("MSE Score: ", mean_squared_error(y_test, y_pred4))

Linear Regression:

R2 Score:  0.4349526022852378
MSE Score:  644.5377360344506

Lasso:

R2 Score:  0.42398590497670574
MSE Score:  657.047218041807

Elastic Net:

R2 Score:  0.4078503609386941
MSE Score:  675.4526952920431

Ridge:

R2 Score:  0.43492092942626803
MSE Score:  644.5738646015349


Great! There was a slight improvement in both scores: for Linear Regression, R2 rose from about 0.39 to 0.43 and MSE fell from about 699 to 645.

<b>Second attempt:</b><br>
We will remove one categorical column with many possible values that has little correlation with the other columns.

In [114]:
df.drop(['operator'], axis=1, inplace=True)
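One way to back such a choice with numbers is to look at each column's number of distinct values and its correlation with the target. The sketch below does this on a synthetic dataframe; `toy` and its values are invented for the example, and only the column names echo the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "occupants": rng.integers(1, 200, size=n),
    "operator": rng.integers(0, 300, size=n),  # label-encoded category with many codes
})
# In this toy data the target depends on occupants only
toy["fatalities"] = 0.3 * toy["occupants"] + rng.normal(scale=5, size=n)

# Number of distinct values per column
cardinality = toy.nunique()

# Absolute Pearson correlation of each feature with the target, weakest first
corr_with_target = (
    toy.corr()["fatalities"].drop("fatalities").abs().sort_values()
)
print(cardinality)
print(corr_with_target)
```

A label-encoded category such as an operator code has no natural ordering, so a linear model can rarely use it well, which is consistent with the small change in scores after dropping it.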

We will repeat the earlier steps and evaluate the models again:

In [115]:
df_x = df[df.columns[df.columns != 'fatalities']]
ser_y = df['fatalities']

X_train, X_test, y_train, y_test = train_test_split(df_x, ser_y, test_size=0.2, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
elasticnet = ElasticNet().fit(X_train, y_train)
ridge = Ridge().fit(X_train,y_train)

print("Linear Regression:\n")
y_pred1 = linreg.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred1))
print("MSE Score: ", mean_squared_error(y_test, y_pred1))
print("\nLasso:\n")
y_pred2 = lasso.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred2))
print("MSE Score: ", mean_squared_error(y_test, y_pred2))
print("\nElastic Net:\n")
y_pred3 = elasticnet.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred3))
print("MSE Score: ", mean_squared_error(y_test, y_pred3))
print("\nRidge:\n")
y_pred4 = ridge.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred4))
print("MSE Score: ", mean_squared_error(y_test, y_pred4))

Linear Regression:

R2 Score:  0.4350066293049677
MSE Score:  644.476108544222

Lasso:

R2 Score:  0.4240301802512717
MSE Score:  656.9967141631122

Elastic Net:

R2 Score:  0.4079015265467766
MSE Score:  675.3943317541706

Ridge:

R2 Score:  0.4349745475350797
MSE Score:  644.5127035474292


Almost no change: R2 moved by less than 0.0001 for every model.

<b>Third attempt:</b><br>
We will change the train-test ratio of our data frame from 80/20 to 70/30.

We will repeat the earlier steps and evaluate the models again:

In [116]:
df_x = df[df.columns[df.columns != 'fatalities']]
ser_y = df['fatalities']

X_train, X_test, y_train, y_test = train_test_split(df_x, ser_y, test_size=0.3, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
elasticnet = ElasticNet().fit(X_train, y_train)
ridge = Ridge().fit(X_train,y_train)

print("Linear Regression:\n")
y_pred1 = linreg.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred1))
print("MSE Score: ", mean_squared_error(y_test, y_pred1))
print("\nLasso:\n")
y_pred2 = lasso.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred2))
print("MSE Score: ", mean_squared_error(y_test, y_pred2))
print("\nElastic Net:\n")
y_pred3 = elasticnet.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred3))
print("MSE Score: ", mean_squared_error(y_test, y_pred3))
print("\nRidge:\n")
y_pred4 = ridge.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred4))
print("MSE Score: ", mean_squared_error(y_test, y_pred4))

Linear Regression:

R2 Score:  0.4237828305949942
MSE Score:  626.0011810602329

Lasso:

R2 Score:  0.4144528668618098
MSE Score:  636.1372350106097

Elastic Net:

R2 Score:  0.3981431556299886
MSE Score:  653.8560726919201

Ridge:

R2 Score:  0.42373177419314567
MSE Score:  626.0566486331461
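To see how much a score depends on the split ratio alone, the same model can be scored for several test sizes in one loop. This is a minimal sketch on synthetic data; `X_demo`, `y_demo` and the chosen ratios are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

r2_by_test_size = {}
for test_size in (0.2, 0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    r2_by_test_size[test_size] = r2_score(y_te, model.predict(X_te))

for test_size, r2 in r2_by_test_size.items():
    print(f"test_size={test_size}: R2={r2:.3f}")
```

Each ratio is scored on a different set of test rows, so small differences between the runs reflect the split as much as the model.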


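No improvement: for Linear Regression, R2 dropped slightly (0.4238 vs. 0.4350), and although the MSE is lower (626.0 vs. 644.5), both scores are now computed on a different, larger test set, so they are not directly comparable with the earlier attempts.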

<b>Fourth attempt:</b><br>
We will normalize our data to see if it has any effect on our model:

In [117]:
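scaler = MinMaxScaler()  # created but not used below
# preprocessing.normalize rescales each row (sample) to unit L2 norm, not each column
d = preprocessing.normalize(df)
scaled_df = pd.DataFrame(d, columns=df.columns)
scaled_df.head()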

Unnamed: 0,weekday,day,month,year,time,aircraft_type,num_of_engines,engine_type,engine_model,years_active,airframe_hrs,occupants,accident_loc,above_ocean,flight_phase,damage,accident_latitude,accident_longtitude,fatalities,fatality_rate
0,0.003636,0.001039,0.004156,0.996878,0.007048,0.050909,0.001558,0.000519,0.026493,0.0,0.0,0.007273,0.046234,0.0,0.002597,0.002078,0.023582,0.005656,0.007273,0.003636
1,0.000938,0.005158,0.003751,0.899879,0.006362,0.07456,0.002345,0.000469,0.044548,0.000352,0.417466,0.003283,0.08769,0.0,0.001876,0.001407,0.024357,0.000613,0.000469,0.002814
2,0.002063,0.009282,0.004125,0.993185,0.007477,0.026299,0.002063,0.000516,0.045379,0.0,0.024237,0.007735,0.096431,0.0,0.002578,0.001547,-0.026389,0.000448,0.002063,0.003094
3,0.000878,0.009658,0.003512,0.845994,0.003622,0.072878,0.000878,0.000439,0.003951,0.000439,0.521118,0.004829,0.082097,0.0,0.002195,0.001317,0.0225,9.5e-05,0.000439,0.002634
4,0.001539,0.009746,0.001539,0.989502,0.006959,0.08669,0.001539,0.000513,0.040011,0.0,0.0,0.002052,0.096437,0.0,0.003078,0.001539,0.021698,-0.042687,0.002052,0.002565


We will repeat the earlier steps and evaluate the models again:

In [118]:
df_x = df[scaled_df.columns[scaled_df.columns != 'fatalities']]
ser_y = df['fatalities']

X_train, X_test, y_train, y_test = train_test_split(df_x, ser_y, test_size=0.2, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)
elasticnet = ElasticNet().fit(X_train, y_train)
ridge = Ridge().fit(X_train,y_train)

print("Linear Regression:\n")
y_pred1 = linreg.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred1))
print("MSE Score: ", mean_squared_error(y_test, y_pred1))
print("\nLasso:\n")
y_pred2 = lasso.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred2))
print("MSE Score: ", mean_squared_error(y_test, y_pred2))
print("\nElastic Net:\n")
y_pred3 = elasticnet.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred3))
print("MSE Score: ", mean_squared_error(y_test, y_pred3))
print("\nRidge:\n")
y_pred4 = ridge.predict(X_test)
print("R2 Score: ", r2_score(y_test, y_pred4))
print("MSE Score: ", mean_squared_error(y_test, y_pred4))

Linear Regression:

R2 Score:  0.4350066293049677
MSE Score:  644.476108544222

Lasso:

R2 Score:  0.4240301802512717
MSE Score:  656.9967141631122

Elastic Net:

R2 Score:  0.4079015265467766
MSE Score:  675.3943317541706

Ridge:

R2 Score:  0.4349745475350797
MSE Score:  644.5127035474292


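The scores are identical to the second attempt. Note that the cell above selects the feature columns from <code>df</code>, not from <code>scaled_df</code>, so the models were trained on the unnormalized values and this run does not actually measure the effect of normalization.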
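For reference, here is a minimal sketch of per-column scaling that does reach the model, shown on synthetic data. It swaps in `MinMaxScaler` inside a scikit-learn pipeline, which is not what the notebook ran; `X_demo`, `y_demo` and `alpha=0.1` are assumptions chosen for the example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_demo = np.column_stack([rng.uniform(1900, 2020, 200), rng.uniform(0, 1, 200)])
y_demo = 0.1 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# normalize() works per row: every sample ends up with unit L2 norm
row_norms = np.linalg.norm(preprocessing.normalize(X_demo), axis=1)

# MinMaxScaler works per column; inside a pipeline it is fitted on the
# training rows only and then applied to the test rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
pipe = make_pipeline(MinMaxScaler(), Lasso(alpha=0.1)).fit(X_tr, y_tr)

scaled_train = pipe.named_steps["minmaxscaler"].transform(X_tr)
print("row norms:", row_norms[:3])
print("column min:", scaled_train.min(axis=0), "column max:", scaled_train.max(axis=0))
print("R2 on the test set:", pipe.score(X_te, y_te))
```

Only the features are scaled here; the target is left untouched. Rescaling columns does not change the predictions of plain Linear Regression, so any effect would be expected mainly in the regularized models (Lasso, Elastic Net, Ridge).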

<h3>6) Determining the best result</h3>

After several attempts to improve our model, we found that the best result is given by the second attempt<br>
with the <b>'Linear Regression'</b> model:<br>
<ul>
    <li>R2 Score of <b>0.4350066293049677</b></li>
    <li>MSE Score of <b>644.476108544222</b></li>
</ul>