## Model Training and Hyperparameter Optimization

### Group: Dominance
* Tito Osadebey<br>
  [Email](mailto:osadebe.tito@gmail.com) | [GitHub](https://github.com/titoausten)
* Hammed Arogundade<br>
  [Email](mailto:arogundadehammed09@gmail.com) | [GitHub](https://github.com/ahmeedaro)
* Waqar Ahmed<br>
  [Email](mailto:waqarahmed695@gmail.com) | [GitHub](https://github.com/waqarahmed6095)

<hr>

### Import required libraries 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Read Dataset

In [2]:
data = pd.read_csv("insurance.csv")
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Label encode categorical features

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Binary categorical features: sex and smoker (region is one-hot encoded in the next cell)
data['sex'] = le.fit_transform(data['sex'])
data['smoker'] = le.fit_transform(data['smoker'])

data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552


In [15]:
# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to booleans)
data = pd.get_dummies(data=data, columns=['region'], dtype=int)
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,charges,region_northeast,region_northwest,region_southeast,region_southwest
0,19,0,27.9,0,1,16884.924,0,0,0,1
1,18,1,33.77,1,0,1725.5523,0,0,1,0
2,28,1,33.0,3,0,4449.462,0,0,1,0
3,33,1,22.705,0,0,21984.47061,0,1,0,0
4,32,1,28.88,0,0,3866.8552,0,1,0,0
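
The two encoding steps above can be tried on a toy frame. This is a minimal sketch, not one of the notebook's own cells: the `toy` DataFrame and its three rows are made up for illustration, and `dtype=int` is passed so the dummy columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row sample with the same categorical columns as the dataset
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Label-encode the two binary columns (a fresh encoder per column)
for col in ["sex", "smoker"]:
    toy[col] = LabelEncoder().fit_transform(toy[col])

# One-hot encode the multi-class column
toy = pd.get_dummies(toy, columns=["region"], dtype=int)
print(toy)
```

`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `female` maps to 0 and `yes` maps to 1 in the outputs above.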


<hr>

### Features and labels

In [16]:
X = data.drop('charges', axis=1)
y = data.loc[:, 'charges']

<hr>

### Splitting the data into training and test datasets.

In [17]:
from sklearn.model_selection import train_test_split
# 70/30 split; X_val and y_test both belong to the same held-out 30% set
X_train, X_val, y_train, y_test = train_test_split(X, y, test_size=0.3)
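
The split above does not fix a seed, so each run produces a different partition and slightly different scores. A minimal sketch of a reproducible split follows, assuming a hypothetical seed of 42 and toy arrays (neither appears in the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same random_state yield the same partition
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```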

<hr>

### Feature Scaling

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_Xtrain = scaler.fit_transform(X_train)
scaled_Xval = scaler.transform(X_val)
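
To make the fit/transform distinction concrete, the sketch below checks on made-up numbers that `StandardScaler` standardizes the validation rows with the mean and standard deviation learned from the training rows only. The arrays `train` and `val` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and validation rows (2 features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
val = np.array([[4.0, 30.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train)   # learns mean/std from train
scaled_val = scaler.transform(val)           # reuses the training statistics

# Same computation by hand, using training statistics only
manual_val = (val - train.mean(axis=0)) / train.std(axis=0)

print(scaled_train.mean(axis=0))             # approximately zero per column
print(np.allclose(scaled_val, manual_val))
```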

<hr>

### Model Selection
This is a regression problem (the target, `charges`, is continuous), so regression algorithms are used.

In [19]:
#Linear Regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

#Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()

#XGB Regressor (eta is an alias of learning_rate, so only learning_rate is set)
from xgboost import XGBRegressor
xgbr = XGBRegressor(n_estimators=1000, max_depth=7, subsample=0.7, learning_rate=0.1)

#Random Forest Regressor, Gradient Boosting Regressor, Extra Trees Regressor and Bagging Regressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, BaggingRegressor
forest = RandomForestRegressor()
boost = GradientBoostingRegressor()
extratree = ExtraTreesRegressor()
bagging = BaggingRegressor()

### Choosing the Best Performing Model Using Cross-Validation

In [20]:
lr.fit(scaled_Xtrain, y_train)
dt.fit(scaled_Xtrain, y_train)
xgbr.fit(scaled_Xtrain, y_train)
forest.fit(scaled_Xtrain, y_train)
boost.fit(scaled_Xtrain, y_train)
extratree.fit(scaled_Xtrain, y_train)
bagging.fit(scaled_Xtrain, y_train)

BaggingRegressor()

In [21]:
from sklearn.model_selection import cross_val_score, RepeatedKFold

models_scores = ['Linear', 'Decision Tree', 'XGB', 'Random Forest', 'Gradient Boosting', 'Extra Trees', 'Bagging']
models = [lr, dt, xgbr, forest, boost, extratree, bagging]
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# The scorer returns negative MAE, so take the absolute value before averaging
for name, estimator in zip(models_scores, models):
    score = cross_val_score(estimator, scaled_Xtrain, y_train,
                            scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    score = np.absolute(score)
    print(f"{name} Regressor Model Mean MAE: {score.mean():.3f}.")

Linear Regressor Model Mean MAE: 4248.836.
Decision Tree Regressor Model Mean MAE: 3099.740.
XGB Regressor Model Mean MAE: 2884.457.
Random Forest Regressor Model Mean MAE: 2620.470.
Gradient Boosting Regressor Model Mean MAE: 2540.690.
Extra Trees Regressor Model Mean MAE: 2709.893.
Bagging Regressor Model Mean MAE: 2760.077.
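
The means above hide how much the MAE varies between folds. A minimal sketch of reporting the spread alongside the mean is shown below. It uses synthetic data from `make_regression` and a plain `LinearRegression`, both chosen for illustration rather than taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data (illustrative stand-in for the insurance features)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 10 folds repeated 3 times gives 30 scores per model
cv_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_absolute_error", cv=cv_demo)

mae = -neg_mae  # flip the sign to get MAE
print(f"{mae.size} scores, MAE = {mae.mean():.3f} +/- {mae.std():.3f}")
```

When two models differ by less than roughly one standard deviation, the ranking between them is probably not reliable.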


#### Gradient Boosting Regressor is the best-performing model, with the lowest mean MAE

<hr>

### Confirming Gradient Boosting Regressor is the Best performing Model

In [22]:
lrpred = lr.predict(scaled_Xval)
dtpred = dt.predict(scaled_Xval)
xgbpred = xgbr.predict(scaled_Xval)
forestpred = forest.predict(scaled_Xval)
boostpred = boost.predict(scaled_Xval)
extrapred = extratree.predict(scaled_Xval)
baggingpred = bagging.predict(scaled_Xval)

In [23]:
from sklearn.metrics import mean_squared_error, r2_score
predictions = [lrpred, dtpred, xgbpred, forestpred, boostpred, extrapred, baggingpred]

for name, estimator, pred in zip(models_scores, models, predictions):
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    # For regressors, .score() returns the R² on the given data
    r2score = estimator.score(scaled_Xval, y_test)
    print(f"\n{name}:\nrmse = {rmse}\nr2_score = {r2score}")


Linear:
rmse = 6013.135885057847
r2_score = 0.7628479052963053

Decision Tree:
rmse = 6010.508790659173
r2_score = 0.7630550800067813

XGB:
rmse = 5253.330590579023
r2_score = 0.8189934077746457

Random Forest:
rmse = 4785.89730452618
r2_score = 0.8497717253969328

Gradient Boosting:
rmse = 4558.953671037306
r2_score = 0.8636813469016373

Extra Trees:
rmse = 4884.0568883012165
r2_score = 0.843546113352146

Bagging:
rmse = 5085.615923441999
r2_score = 0.8303663350463912
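
For reference, both metrics reported above can be computed by hand. The sketch below uses four made-up target values and predictions and relies only on NumPy.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"rmse = {rmse_manual:.4f}")   # rmse = 0.8660
print(f"r2 = {r2_manual:.4f}")       # r2 = 0.8500
```

RMSE is in the same units as `charges`, while R² is unitless, so the two give complementary views of the same errors.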


##### Gradient Boosting Regressor remains the best-performing model on the held-out set, with the lowest RMSE and the highest R² score

<hr>

### Hyper-parameter Optimization
Using randomized search (`RandomizedSearchCV`) on the Gradient Boosting Regressor.

In [24]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {'learning_rate':[0.05,0.25,0.5,1],
              'subsample': [0.9,0.5,0.2,0.1],
              'n_estimators':[100,500,1000,1500],
              'max_depth':[4,6,8,10]}

# The grid has 4 x 4 x 4 x 4 = 256 combinations; n_iter=100 samples 100 of them
random_model = RandomizedSearchCV(estimator=boost, param_distributions=parameters,
                                  scoring='neg_mean_absolute_error', n_iter=100, n_jobs=-1, cv=cv)

In [25]:
model = random_model.fit(scaled_Xtrain, y_train)

In [34]:
print(f"Best Hyperparameters: {model.best_params_}")
best_est = model.best_estimator_
print(f"Best score: {np.absolute(model.best_score_)}")

Best Hyperparameters: {'subsample': 0.9, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.05}
Best score: 2527.8579626646856
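
A scaled-down version of the same search can be run in a few seconds to see what the fitted search object exposes. This is a sketch under stated assumptions: the synthetic data, the two-parameter grid `param_grid`, `n_iter=3`, and the fixed seeds are all illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the scaled training set
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Small illustrative grid: 2 x 2 = 4 combinations, 3 of which are sampled
param_grid = {"learning_rate": [0.05, 0.25], "max_depth": [2, 4]}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # best_score_ is negative MAE, so flip the sign

# With the default refit=True, best_estimator_ is already fitted and can predict
preds = search.best_estimator_.predict(X_demo[:5])
```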


In [35]:
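# best_estimator_ is already refit on the full training data (refit=True by default);
# the explicit fit below retrains it with the same hyperparameters
boosted = best_est
boosted.fit(scaled_Xtrain, y_train)

search_pred = boosted.predict(scaled_Xval)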

<hr>

### Model Evaluation

In [37]:
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, search_pred))
r2score = boosted.score(scaled_Xval, y_test)
# Mean cross-validated MAE of the tuned model on the training set
score = np.absolute(np.mean(cross_val_score(boosted, scaled_Xtrain, y_train,
                                            scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)))

print(f"After optimization:\nrmse = {rmse}\nr2_score = {r2score}\nscore = {score:.3f}.")

After optimization:
rmse = 4551.801085191208
r2_score = 0.8641087546234093
score = 2505.567.


#### Comparing the values before and after hyper-parameter optimization:
* Mean MAE decreased from 2540.690 to 2505.567.
* R² score increased from 0.8637 to 0.8641 (86.37% to 86.41%).
* RMSE decreased from 4558.95 to 4551.80.
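
To put these differences in perspective, the relative changes can be computed directly from the figures listed above. The helper `relative_change` is a hypothetical name introduced for this sketch.

```python
def relative_change(before, after):
    """Percentage change from before to after (negative means a decrease)."""
    return (after - before) / before * 100.0

# Values taken from the comparison above
mae_change = relative_change(2540.690, 2505.567)
rmse_change = relative_change(4558.95, 4551.80)
r2_change = relative_change(0.8637, 0.8641)

print(f"MAE:  {mae_change:+.2f}%")
print(f"RMSE: {rmse_change:+.2f}%")
print(f"R2:   {r2_change:+.2f}%")
```

All three changes are small (on the order of 1% or less), and since the notebook does not fix a random seed for the split or the models, they may fall within run-to-run variation.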