# Wikipedia Notable Life Expectancies
# [Notebook  13: Models](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_models_thanak_2022_10_14.ipynb)
### Context

The
### Objective

The
### Data Dictionary
- Feature: Description

### Importing Libraries

In [None]:
# To structure code automatically
%load_ext nb_black

# To import/export sqlite databases
# import sqlite3 as sql

# To save/open python objects in pickle file
import pickle

# To help with reading, cleaning, and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    BaggingRegressor,
)
from xgboost import XGBRegressor

# To randomly split data, for cross validation, and to check model performance
from sklearn import metrics
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    mean_absolute_percentage_error,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for hyperparameter tuning searches
from scipy.stats import loguniform
from scipy.stats import uniform
from scipy.stats import expon

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To define the maximum number of rows to be displayed in a dataframe
pd.set_option("display.max_rows", 211)

# To set some dataframe visualization attributes
pd.set_option("max_colwidth", 150)

# To suppress scientific notation for a dataframe
# pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
# import warnings

# warnings.filterwarnings("ignore")

# To set some plot visualization attributes
sns.set_theme()
sns.set(font_scale=1.4)
sns.set_palette(
    (
        "midnightblue",
        "goldenrod",
        "maroon",
        "darkolivegreen",
        "cadetblue",
        "tab:purple",
        "yellowgreen",
    )
)
# plt.rc("font", size=12)
# plt.rc("axes", titlesize=15)
# plt.rc("axes", labelsize=14)
# plt.rc("xtick", labelsize=13)
# plt.rc("ytick", labelsize=13)
# plt.rc("legend", fontsize=13)
# plt.rc("legend", fontsize=14)
# plt.rc("figure", titlesize=16)

# To play auditory cue when cell has executed, has warning, or has error and set chime theme
import chime

chime.theme("zelda")

## Data Overview

### [Reading](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_train_preproc.csv), Sampling, and Checking Data Shape

In [None]:
# Reading the dataset
data = pd.read_csv("wp_life_expect_train_preproc.csv")

# Making a working copy
df = data.copy()

# Checking the shape
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

# Checking first 2 rows of the data
df.head(2)

In [None]:
# Checking last 2 rows of the data
df.tail(2)

In [None]:
# Checking a sample of the data
df.sample(5)

### Checking Data Types and Null Values

In [None]:
# Checking data types and null values
df.info()

#### Observations:
- With our dataset loaded, we are ready for modeling.
- We have three variables that need typecasting from object to category, then one-hot encoding just prior to modeling.

#### Typecasting `region`, `prior_region`, and `known_for` as Categorical

In [None]:
# Typecasting prior_region, region, and known_for as categorical
df[["prior_region", "region", "known_for"]] = df[
    ["prior_region", "region", "known_for"]
].astype("category")

# Re-check info
df.info()

## Data Preparation for Modeling
In contrast to building the [linear regression model](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_olsmodel_thanak_2022_10_9.ipynb), we will be tuning these models.  So, we will split the train set into train and validation sets and utilize the `test` set only to check out-of-sample performance of the champion model.  We will load and treat the test set at that point.

### Defining Independent and Dependent Variables for Train and Validation Sets

In [None]:
# Creating list of predictor columns
predictor_cols = [
    "num_references",
    "years",
    "region",
    "prior_region",
    "known_for",
]

# Defining target column
target = "age"

# Defining independent and dependent variables
X = df[predictor_cols]
y = df[target]

# One hot encoding of categorical predictors and typecasting all predictors as float
X = pd.get_dummies(X, drop_first=True).astype("float64")

# Splitting into 70:30 train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Checking shape of train and validation sets
print(
    f"There are {X_train.shape[0]} rows and {X_train.shape[1]} columns in the train set.\n"
)
print(
    f"There are {X_val.shape[0]} rows and {X_val.shape[1]} columns in the validation set.\n"
)

# Checking a sample
X_train.sample()

## Model Building
#### Model Evaluation Criterion
The regressors' predictions will be evaluated with the following performance metrics:
- RMSE
- MAE
- R$^2$
- Adjusted R$^2$
- MAPE
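
For reference, the error metrics in this list can be written out on a tiny example. The sketch below uses made-up ages and predictions (not values from the dataset), and the helper names `rmse`, `mae`, and `mape` are illustrative stand-ins rather than the functions used later in this notebook.

```python
import math

# Hypothetical actual and predicted ages, for illustration only
actual = [50.0, 80.0, 100.0]
predicted = [55.0, 72.0, 100.0]


def rmse(targets, predictions):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mae(targets, predictions):
    """Mean absolute error: average size of the error, in target units."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mape(targets, predictions):
    """Mean absolute percentage error: average error relative to the actual value."""
    ratios = [abs((t - p) / t) for t, p in zip(targets, predictions)]
    return sum(ratios) / len(ratios) * 100


print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"MAE:  {mae(actual, predicted):.3f}")
print(f"MAPE: {mape(actual, predicted):.3f}")
```

RMSE and MAE are in the units of the target (years of age here), while MAPE is a percentage and is undefined when an actual value is zero.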

#### Which Metric to Optimize?
- For hyperparameter tuning, we will optimize R$^2$, which is the proportion of variation in the target that is explained by the predictors.  

- To select the champion model, we will compare Adjusted R$^2$.  It is the metric that represents the amount of variation in the target that is explained by the predictors, with a penalty for more predictors.  The number of included predictors may vary between algorithms, especially as we are including examples of decision tree regressors.  R$^2$ tends to improve with the addition of predictors, even if they contribute very little to the model, whereas the penalty in Adjusted R$^2$ offsets such an increase.
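
To make the penalty concrete, the following sketch applies the standard Adjusted R$^2$ formula, $1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, to one fixed R$^2$ value with two different predictor counts. The sample size, predictor counts, and R$^2$ value are hypothetical and chosen only for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared for a model with n_predictors fit on n_samples rows."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical values: same R-squared and sample size, different predictor counts
r2 = 0.10
n_samples = 1000

few = adjusted_r2(r2, n_samples, n_predictors=5)
many = adjusted_r2(r2, n_samples, n_predictors=50)

print(f"R-squared: {r2:.4f}")
print(f"Adjusted R-squared with 5 predictors:  {few:.4f}")
print(f"Adjusted R-squared with 50 predictors: {many:.4f}")
```

With the same raw R$^2$, the model that uses more predictors receives the lower adjusted score, which is why the adjusted metric is the fairer one when predictor counts differ.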

#### Functions for Checking and Tuning Model Performance

In [None]:
# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# Function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs((targets - predictions) / targets)) * 100


# Function to compute and display different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute and return a dataframe of different metrics to check
    regression model performance
    
    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # Predictions
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # To compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # To compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # To compute RMSE
    mae = mean_absolute_error(target, pred)  # To compute MAE
    mape = mape_score(target, pred)  # To compute MAPE

    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

#### Defining Scorer for Cross-validation and Hyperparameter Tuning

In [None]:
# Type of scoring used to compare parameter combinations--maximizing R-squared
scorer = "r2"

### Building the Models

In [None]:
%%time

# Creating list to store the models
models = []

# Appending models to the list
models.append(('Dtree', DecisionTreeRegressor(random_state=42)))

models.append(('Random Forest', RandomForestRegressor(random_state=42)))

models.append(('Bagging Dtree', BaggingRegressor(random_state=42)))

models.append(('GBM', GradientBoostingRegressor(random_state=42)))

models.append(('AdaBoost Dtree', AdaBoostRegressor(random_state=42)))

models.append(('XGB_gbtree', XGBRegressor(random_state=42)))

models.append(('XGB_gblinear', XGBRegressor(random_state=42, booster='gblinear')))

# Create empty list to store all model's names and CV scores
names = []
results = []

# Loop through all models to get the mean cross validated score
print("\n" "Cross-Validation:" "\n")

for name, model in models:
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=5
    )
    results.append(cv_result)
    names.append(name)
    print(f"{name}: {cv_result.mean()}")
    
print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = r2_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(20, 7))

fig.suptitle("Algorithm Comparison for Cross-validation R-squared Score")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)
plt.xticks(rotation=30)

plt.show()

#### Observations:
- We have negative R$^2$ values for four of the models.  This means they are performing worse than a model that simply predicts the mean of the target for every entry.
- The remaining three models, *GBM*, *XGB_gbtree*, and *XGB_gblinear*, are giving generalized performances on train and validation sets, with R$^2$ scores that are very low but similar to that of [*olsmodel3*](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_olsmodel_thanak_2022_10_9.ipynb) (0.087).  Before hyperparameter tuning, *GBM* is outperforming the other models, including *olsmodel3*, with both train and validation R$^2$ scores of ~0.10.
- We will perform hyperparameter tuning on the top 3 models.  Purely as an exercise we will also keep *Random Forest* in the mix.
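
The meaning of a negative R$^2$ can be checked by hand. In this minimal sketch (toy numbers, not from the dataset; `r2` is a hypothetical helper that mirrors the usual definition), always predicting the target mean scores exactly 0, and a constant guess that sits further from the data than the mean scores below 0.

```python
def r2(targets, predictions):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1 - ss_res / ss_tot


ages = [60.0, 70.0, 80.0, 90.0]  # toy target values
mean_age = sum(ages) / len(ages)

baseline = r2(ages, [mean_age] * len(ages))  # constant-mean model
worse = r2(ages, [50.0] * len(ages))  # constant guess far from the mean

print(f"Constant-mean model R-squared: {baseline}")
print(f"Poor constant guess R-squared: {worse}")
```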

#### Collecting Models with Best Performance

In [None]:
# List of top models so far
top_models = [models[1]] + [models[3]] + models[-2:]

#### Creating Dataframes to Compare Training and Validation Performance of Best Models

In [None]:
# Creating empty dictionary to hold the models
models_to_tune = {}

# For loop to add models to dictionary
for model in top_models:
    key = model[0]
    value = model[1]
    models_to_tune[key] = value

# Initializing dataframes to compare performance of all models
models_train_comp_df = pd.DataFrame()
models_val_comp_df = pd.DataFrame()

# For loop to add performance results of each top model
for name, model in models_to_tune.items():
    models_train_comp_df[name] = model_performance_regression(model, X_train, y_train).T
    models_val_comp_df[name] = model_performance_regression(model, X_val, y_val).T

#### Comparing Top Models Before Hyperparameter Tuning

In [None]:
# Comparing train performance
print(f"Training Performance:")
models_train_comp_df

In [None]:
# Comparing validation performance
print(f"Validation Performance:")
models_val_comp_df

#### Observations:
- Here, we compare the performance on the whole train set to the validation set.
- Only *GBM* and *XGB_gblinear* are giving generalized performances on the two sets.
- These two are performing on par or slightly better than [*olsmodel3*](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_olsmodel_thanak_2022_10_9.ipynb), our linear regression model, for all metrics.
- We will see if hyperparameter tuning improves their performance, again keeping *Random Forest* and *XGB_gbtree* in the mix for demonstration and comparison.

## Hyperparameter Tuning

### *Random Forest Tuned*

In [None]:
# Confirming the model
models_to_tune["Random Forest"]

In [None]:
%%time

# Defining model
Model = RandomForestRegressor(random_state=42)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    # Note: None is not a valid min_samples_leaf value, so candidates that draw it will fail to fit
    "min_samples_leaf": [None] + np.arange(1, 10).tolist(),
    "max_features": ['sqrt'], 
    "max_samples": uniform(loc=0.3, scale=0.5),
    'criterion': ['squared_error'],
    "max_depth": [None]
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
Random_Forest_tuned = RandomForestRegressor(
    random_state=42,  # same seed as the estimator used in the search
    criterion="squared_error",
    max_depth=None,
    max_features="sqrt",
    max_samples=0.3909124836035503,
    min_samples_leaf=4,
    n_estimators=260,
)

# Fit the model on training data
Random_Forest_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
Random_Forest_tuned_train_perf = model_performance_regression(
    Random_Forest_tuned, X_train, y_train
)
print("Training performance:\n", Random_Forest_tuned_train_perf)
Random_Forest_tuned_val_perf = model_performance_regression(
    Random_Forest_tuned, X_val, y_val
)
print("\nValidation performance:\n", Random_Forest_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["Random Forest Tuned"] = Random_Forest_tuned_train_perf.T
models_val_comp_df["Random Forest Tuned"] = Random_Forest_tuned_val_perf.T

#### Observations:
- Hyperparameter tuning improved performance for *Random Forest*.
- The algorithm is still overfitting the train set, compared to the validation set.
- Note that 10% of the fits failed during cross-validation ("UserWarning: One or more of the test scores are non-finite.."), meaning some hyperparameter combinations produced NaN scores.  The most likely cause is the `None` entry in the `min_samples_leaf` list, which is not a valid value for that parameter, so a candidate that samples it fails on every fold.  We are going to allow it here, and go with the results of the successful iterations.  *Random Forest* is not a likely candidate for the champion model.

### *GBM Tuned*

In [None]:
# Confirming the model
models_to_tune["GBM"]

In [None]:
%%time

# Defining model
Model = GradientBoostingRegressor(random_state=42)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    "learning_rate": loguniform(0.001, 1),
    "subsample": uniform(loc=0.3, scale=0.5),
    "max_features": uniform(loc=0.3, scale=0.5),
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
GBM_tuned = GradientBoostingRegressor(
    random_state=42,
    learning_rate=0.08171272700715591,
    max_features=0.6630456668613307,
    n_estimators=368,
    subsample=0.7847684335570795,
)

# Fit the model on training data
GBM_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
GBM_tuned_train_perf = model_performance_regression(GBM_tuned, X_train, y_train)
print("Training performance:\n", GBM_tuned_train_perf)
GBM_tuned_val_perf = model_performance_regression(GBM_tuned, X_val, y_val)
print("\nValidation performance:\n", GBM_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["GBM Tuned"] = GBM_tuned_train_perf.T
models_val_comp_df["GBM Tuned"] = GBM_tuned_val_perf.T

#### Observations:
- The performance for *GBM* is improved with hyperparameter tuning.  
- There is a slight increase in overfitting, but the validation metrics are better.

### *XGB_gbtree Tuned*

In [None]:
# Confirming the model
models_to_tune["XGB_gbtree"]

In [None]:
%%time

# Defining model
Model = XGBRegressor(random_state=42, booster='gbtree')

# Parameter grid to pass in RandomizedSearchCV
param_grid={
    'n_estimators': np.arange(100, 500),
    "learning_rate": uniform(0.1, 0.3), # aka eta
    'gamma': expon(), # aka min_split_loss
    'subsample': uniform(loc=0.6, scale=0.2), # proportion of train set to randomly sample prior to growing trees
    'max_depth': np.arange(3, 8).tolist(),
    'colsample_bytree': uniform(loc=0.3, scale=0.5)
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
XGB_gbtree_tuned = XGBRegressor(
    booster="gbtree",
    random_state=42,
    colsample_bytree=0.42649508399462055,
    gamma=1.188792356281234,
    learning_rate=0.12263036412693079,
    max_depth=3,
    n_estimators=404,
    subsample=0.7391497377969234,
)

# Fit the model on training data
XGB_gbtree_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
XGB_gbtree_tuned_train_perf = model_performance_regression(
    XGB_gbtree_tuned, X_train, y_train
)
print("Training performance:\n", XGB_gbtree_tuned_train_perf)
XGB_gbtree_tuned_val_perf = model_performance_regression(XGB_gbtree_tuned, X_val, y_val)
print("\nValidation performance:\n", XGB_gbtree_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["XGB_gbtree Tuned"] = XGB_gbtree_tuned_train_perf.T
models_val_comp_df["XGB_gbtree Tuned"] = XGB_gbtree_tuned_val_perf.T

#### Observations:
- The performance for *XGB_gbtree* is improved with hyperparameter tuning.  
- There is a slight increase in overfitting, but the validation metrics are better.

### *XGB_gblinear Tuned*

In [None]:
# Confirming the model
models_to_tune["XGB_gblinear"]

In [None]:
%%time

# Defining model
Model = XGBRegressor(random_state=42, booster='gblinear')

# Parameter grid to pass in RandomizedSearchCV
param_grid={
    'n_estimators': np.arange(100, 500),
    'reg_lambda': loguniform(.0001, 1)
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
XGB_gblinear_tuned = XGBRegressor(
    booster="gblinear",
    random_state=42,
    n_estimators=439,
    reg_lambda=0.0009206654892274761,
)

# Fit the model on training data
XGB_gblinear_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
XGB_gblinear_tuned_train_perf = model_performance_regression(
    XGB_gblinear_tuned, X_train, y_train
)
print("Training performance:\n", XGB_gblinear_tuned_train_perf)
XGB_gblinear_tuned_val_perf = model_performance_regression(
    XGB_gblinear_tuned, X_val, y_val
)
print("\nValidation performance:\n", XGB_gblinear_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["XGB_gblinear Tuned"] = XGB_gblinear_tuned_train_perf.T
models_val_comp_df["XGB_gblinear Tuned"] = XGB_gblinear_tuned_val_perf.T

## Model Performance Comparison

### Performance of Various Models Tuned and Untuned

In [None]:
# Displaying train performance of all models
print("Train Performance Comparison:")
models_train_comp_df.sort_index(axis=1)

In [None]:
# Displaying validation performance of all models
print("Validation Performance Comparison:")
models_val_comp_df.sort_index(axis=1)

#### Observations:
- *GBM Tuned* has the highest R$^2$ (0.109) on the validation set, followed by *XGB_gbtree Tuned*, then *GBM*.
- As we did not include the Decision Tree here, we can ignore Adjusted R$^2$, and just compare R$^2$.
- Of these three models with R$^2$ scores over 0.10, there is some variation in overfitting.

#### Comparison of Percentage of Overfitting for R$^2$

In [None]:
# Percentage drop in R-squared from train to validation: (1 - validation/train) * 100
overfit_perc = (
    1
    - (
        models_val_comp_df.loc["R-squared", :]
        / models_train_comp_df.loc["R-squared", :]
    )
) * 100

print(f"Percentage of R-square overfitting:")
overfit_perc.sort_values()

#### Observations:
- *XGB_gblinear* and *XGB_gblinear Tuned* both performed better on the validation set than on the training set, which is interesting.
- Of the top 3 models for R$^2$, *GBM* generalized considerably better than *GBM Tuned* and *XGB_gbtree Tuned*.  
- That said, *GBM Tuned* has the highest R$^2$ score on the validation set.
- Next, we will try another modeling iteration, replacing the `known_for` feature with the original `known for` category columns.  For linear regression, we had to drop categorical columns to eliminate multicollinearity, so entries with multiple `known for` categories were grouped into `two` and `three_to_five` classes.  We retained that approach for the above modeling iteration, but for this iteration we will allow entries to have their original multiple categories.
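
The overfitting percentage used in this comparison is the relative drop in R$^2$ from the train set to the validation set. As a minimal sketch with made-up scores (the function name `overfit_percent` is hypothetical), a positive value means the validation score is lower than the train score, and a negative value means it is higher.

```python
def overfit_percent(train_r2, val_r2):
    """Relative drop in R-squared from train to validation, as a percentage."""
    return (1 - val_r2 / train_r2) * 100


# Made-up scores for illustration, not results from the models above
drop = overfit_percent(train_r2=0.20, val_r2=0.10)  # validation is half of train
gain = overfit_percent(train_r2=0.10, val_r2=0.12)  # validation beats train

print(f"Validation below train: {drop:.1f}%")
print(f"Validation above train: {gain:.1f}%")
```

The ratio is only straightforward to interpret when the train R$^2$ is positive; with a negative train score the sign of the result no longer indicates overfitting.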

## 2nd Modeling Iteration with Original `known for` Category Columns

### Defining Independent and Dependent Variables for Train and Validation Sets

In [None]:
# Creating list of predictor columns
predictor_cols = [
    "num_references",
    "years",
    "region",
    "prior_region",
    'sciences', 
    'social',
    'spiritual',
    'academia_humanities',
    'business_farming',
    'arts',
    'sports',
    'law_enf_military_operator',
    'politics_govt_law',
    'crime'
]

# Defining target column
target = "age"

# Defining independent and dependent variables
X = df[predictor_cols]
y = df[target]

# One hot encoding of categorical predictors and typecasting all predictors as float
X = pd.get_dummies(X, drop_first=True).astype("float64")

# Splitting into 70:30 train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Checking shape of train and validation sets
print(
    f"There are {X_train.shape[0]} rows and {X_train.shape[1]} columns in the train set.\n"
)
print(
    f"There are {X_val.shape[0]} rows and {X_val.shape[1]} columns in the validation set.\n"
)

# Checking a sample
X_train.sample()

In [None]:
# Type of scoring used to compare parameter combinations--maximizing R-squared
scorer = "r2"

### Building the Models

In [None]:
%%time

# Creating list to store the models
models = []

# Appending models to the list
models.append(('Dtree2', DecisionTreeRegressor(random_state=42)))

models.append(('Random Forest2', RandomForestRegressor(random_state=42)))

models.append(('Bagging Dtree2', BaggingRegressor(random_state=42)))

models.append(('GBM2', GradientBoostingRegressor(random_state=42)))

models.append(('AdaBoost Dtree2', AdaBoostRegressor(random_state=42)))

models.append(('XGB_gbtree2', XGBRegressor(random_state=42)))

models.append(('XGB_gblinear2', XGBRegressor(random_state=42, booster='gblinear')))

# Create empty list to store all model's names and CV scores
names = []
results = []

# Loop through all models to get the mean cross validated score
print("\n" "Cross-Validation:" "\n")

for name, model in models:
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=5
    )
    results.append(cv_result)
    names.append(name)
    print(f"{name}: {cv_result.mean()}")
    
print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = r2_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

In [None]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(20, 7))

fig.suptitle("Algorithm Comparison for Cross-validation R-squared Score")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)
plt.xticks(rotation=30)

plt.show()

#### Observations:
- We have negative R$^2$ values for four of the models.  This means they are performing worse than a model that simply predicts the mean of the target for every entry.
- The remaining three models, *GBM*, *XGB_gbtree*, and *XGB_gblinear*, are giving generalized performances on train and validation sets, with R$^2$ scores that are very low but similar to that of [*olsmodel3*](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_olsmodel_thanak_2022_10_9.ipynb) (0.087).  Before hyperparameter tuning, *GBM* is outperforming the other models, including *olsmodel3*, with both train and validation R$^2$ scores of ~0.10.
- We will perform hyperparameter tuning on the top 3 models.  Purely as an exercise we will also keep *Random Forest* in the mix.

#### Collecting Models with Best Performance

In [None]:
# List of top models so far
top_models = [models[1]] + [models[3]] + models[-2:]

#### Adding Models to Training and Validation Performance Comparison Dataframes

In [None]:
# Creating empty dictionary to hold the models
models_to_tune = {}

# For loop to add models to dictionary
for model in top_models:
    key = model[0]
    value = model[1]
    models_to_tune[key] = value

# For loop to add performance results of each top model
for name, model in models_to_tune.items():
    models_train_comp_df[name] = model_performance_regression(model, X_train, y_train).T
    models_val_comp_df[name] = model_performance_regression(model, X_val, y_val).T

#### Comparing Top Models Before Hyperparameter Tuning

In [None]:
# Comparing train performance
print(f"Training Performance:")
models_train_comp_df[list(models_to_tune)]

In [None]:
# Comparing validation performance
print(f"Validation Performance:")
models_val_comp_df[list(models_to_tune)]

#### Observations:
- Here, we compare the performance on the whole train set to the validation set.
- Only *GBM* and *XGB_gblinear* are giving generalized performances on the two sets.
- These two are performing on par or slightly better than [*olsmodel3*](https://github.com/teresahanak/wikipedia-life-expectancy/blob/main/wp_life_expect_olsmodel_thanak_2022_10_9.ipynb), our linear regression model, for all metrics.
- We will see if hyperparameter tuning improves their performance, again keeping *Random Forest* and *XGB_gbtree* in the mix for demonstration and comparison.

## Hyperparameter Tuning

### *Random Forest2 Tuned*

In [None]:
# Confirming the model
models_to_tune["Random Forest2"]

In [None]:
%%time

# Defining model
Model = RandomForestRegressor(random_state=42)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    # Note: None is not a valid min_samples_leaf value, so candidates that draw it will fail to fit
    "min_samples_leaf": [None] + np.arange(1, 10).tolist(),
    "max_features": ['sqrt'], 
    "max_samples": uniform(loc=0.3, scale=0.5),
    'criterion': ['squared_error'],
    "max_depth": [None]
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
Random_Forest2_tuned = RandomForestRegressor(
    random_state=42,  # same seed as the estimator used in the search
    criterion="squared_error",
    max_depth=None,
    max_features="sqrt",
    max_samples=0.3909124836035503,
    min_samples_leaf=4,
    n_estimators=260,
)

# Fit the model on training data
Random_Forest2_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
Random_Forest2_tuned_train_perf = model_performance_regression(
    Random_Forest2_tuned, X_train, y_train
)
print("Training performance:\n", Random_Forest2_tuned_train_perf)
Random_Forest2_tuned_val_perf = model_performance_regression(
    Random_Forest2_tuned, X_val, y_val
)
print("\nValidation performance:\n", Random_Forest2_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["Random Forest2 Tuned"] = Random_Forest2_tuned_train_perf.T
models_val_comp_df["Random Forest2 Tuned"] = Random_Forest2_tuned_val_perf.T

#### Observations:
- Hyperparameter tuning improved performance for *Random Forest*.
- The algorithm is still overfitting the train set, compared to the validation set.
- Note that 10% of the fits failed during cross-validation ("UserWarning: One or more of the test scores are non-finite.."), meaning some hyperparameter combinations produced NaN scores.  The most likely cause is the `None` entry in the `min_samples_leaf` list, which is not a valid value for that parameter, so a candidate that samples it fails on every fold.  We are going to allow it here, and go with the results of the successful iterations.  *Random Forest* is not a likely candidate for the champion model.

### *GBM2 Tuned*

In [None]:
# Confirming the model
models_to_tune["GBM2"]

In [None]:
%%time

# Defining model
Model = GradientBoostingRegressor(random_state=42)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    "learning_rate": loguniform(0.001, 1),
    "subsample": uniform(loc=0.3, scale=0.5),
    "max_features": uniform(loc=0.3, scale=0.5),
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
GBM2_tuned = GradientBoostingRegressor(
    random_state=42,
    learning_rate=0.08171272700715591,
    max_features=0.6630456668613307,
    n_estimators=368,
    subsample=0.7847684335570795,
)

# Fit the model on training data
GBM2_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
GBM2_tuned_train_perf = model_performance_regression(GBM2_tuned, X_train, y_train)
print("Training performance:\n", GBM2_tuned_train_perf)
GBM2_tuned_val_perf = model_performance_regression(GBM2_tuned, X_val, y_val)
print("\nValidation performance:\n", GBM2_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["GBM2 Tuned"] = GBM2_tuned_train_perf.T
models_val_comp_df["GBM2 Tuned"] = GBM2_tuned_val_perf.T

#### Observations:
- The performance for *GBM2* is improved with hyperparameter tuning.  
- There is a slight increase in overfitting, but the validation metrics are better.

### *XGB_gbtree2 Tuned*

In [None]:
# Confirming the model
models_to_tune["XGB_gbtree2"]

In [None]:
%%time

# Defining model
Model = XGBRegressor(random_state=42, booster="gbtree")

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    "learning_rate": uniform(loc=0.1, scale=0.3),  # aka eta; sampled from [0.1, 0.4]
    "gamma": expon(),  # aka min_split_loss
    "subsample": uniform(loc=0.6, scale=0.2),  # proportion of train set to randomly sample prior to growing trees
    "max_depth": np.arange(3, 8).tolist(),
    "colsample_bytree": uniform(loc=0.3, scale=0.5),
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
XGB_gbtree2_tuned = XGBRegressor(
    booster="gbtree",
    random_state=42,
    colsample_bytree=0.42649508399462055,
    gamma=1.188792356281234,
    learning_rate=0.12263036412693079,
    max_depth=3,
    n_estimators=404,
    subsample=0.7391497377969234,
)

# Fit the model on training data
XGB_gbtree2_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
XGB_gbtree2_tuned_train_perf = model_performance_regression(
    XGB_gbtree2_tuned, X_train, y_train
)
print("Training performance:\n", XGB_gbtree2_tuned_train_perf)
XGB_gbtree2_tuned_val_perf = model_performance_regression(
    XGB_gbtree2_tuned, X_val, y_val
)
print("\nValidation performance:\n", XGB_gbtree2_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["XGB_gbtree2 Tuned"] = XGB_gbtree2_tuned_train_perf.T
models_val_comp_df["XGB_gbtree2 Tuned"] = XGB_gbtree2_tuned_val_perf.T

#### Observations:
- The performance for *XGB_gbtree2* is improved with hyperparameter tuning.  
- There is a slight increase in overfitting, but the validation metrics are better.
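
The tuned-model cells copy the winning values from the search output by hand. As an alternative, a fitted `RandomizedSearchCV` exposes them as the `best_params_` dictionary, which can be unpacked straight into the estimator. A minimal sketch on synthetic data, using scikit-learn's `GradientBoostingRegressor` and `"r2"` scoring as stand-ins for the notebook's models and `scorer` (`X_demo`, `y_demo`, `search`, and `tuned_demo` are hypothetical names, and the small search settings are chosen only to keep the example fast):

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's X_train and y_train
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Small randomized search (assumed settings, not the notebook's)
search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions={
        "n_estimators": np.arange(20, 60),
        "learning_rate": loguniform(0.01, 1),
        "subsample": uniform(loc=0.3, scale=0.5),
    },
    n_iter=5,
    scoring="r2",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Rebuild the tuned model by unpacking the best parameters
tuned_demo = GradientBoostingRegressor(random_state=42, **search.best_params_)
tuned_demo.fit(X_demo, y_demo)
print(search.best_params_)
```

Unpacking `best_params_` avoids transcription errors, while hard-coding the values, as done in this notebook, lets the tuned model be rebuilt without rerunning the search.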

### *XGB_gblinear2 Tuned*

In [None]:
# Confirming the model
models_to_tune["XGB_gblinear2"]

In [None]:
%%time

# Defining model
Model = XGBRegressor(random_state=42, booster="gblinear")

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 500),
    "reg_lambda": loguniform(0.0001, 1),
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=42,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print(
    "Best parameters are {} with CV score={}".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)

# Chime notification when cell successfully executes
chime.success()

In [None]:
# Building model with best parameters
XGB_gblinear2_tuned = XGBRegressor(
    booster="gblinear",
    random_state=42,
    n_estimators=439,
    reg_lambda=0.0009206654892274761,
)

# Fit the model on training data
XGB_gblinear2_tuned.fit(X_train, y_train)

In [None]:
# Calculating different metrics
XGB_gblinear2_tuned_train_perf = model_performance_regression(
    XGB_gblinear2_tuned, X_train, y_train
)
print("Training performance:\n", XGB_gblinear2_tuned_train_perf)
XGB_gblinear2_tuned_val_perf = model_performance_regression(
    XGB_gblinear2_tuned, X_val, y_val
)
print("\nValidation performance:\n", XGB_gblinear2_tuned_val_perf)

# Adding model to model comparison dataframes
models_train_comp_df["XGB_gblinear2 Tuned"] = XGB_gblinear2_tuned_train_perf.T
models_val_comp_df["XGB_gblinear2 Tuned"] = XGB_gblinear2_tuned_val_perf.T

#### Observations:


## Model Performance Comparison

### Performance of Various Models Tuned and Untuned

In [None]:
# Displaying train performance of all models
print("Train Performance Comparison:")
models_train_comp_df.sort_index(axis=1)

In [None]:
# Displaying validation performance of all models
print("Validation Performance Comparison:")
models_val_comp_df.sort_index(axis=1)

#### Observations:
- *GBM Tuned* has the highest R$^2$ (0.109) on the validation set, followed by *XGB_gbtree Tuned*, then *GBM*.
- As we did not include the Decision Tree here, we can ignore Adjusted R$^2$, and just compare R$^2$.
- Among these three models, all with R$^2$ scores over 0.10, the degree of overfitting varies.

#### Comparison of Percentage of Overfitting for R$^2$

In [None]:
# Subtracting the ratio of validation R-square/train R-square from 1
overfit_perc = (
    1
    - (
        models_val_comp_df.loc["R-squared", :]
        / models_train_comp_df.loc["R-squared", :]
    )
) * 100

print("Percentage of R-squared overfitting:")
overfit_perc.sort_values()

#### Observations:
- *XGB_gblinear* and *XGB_gblinear Tuned* both performed better on the validation set than on the training set, which is interesting.
- Of the top 3 models for R$^2$, *GBM* generalized considerably better than *GBM Tuned* and *XGB_gbtree Tuned*.  
- That said, *GBM Tuned* has the highest R$^2$ score on the validation set.
- Next, we will try another modeling iteration, replacing the `known_for` feature with the original `known for` category columns.  For linear regression, we had to drop categorical columns to eliminate multicollinearity, so entries with multiple `known for` categories were grouped into `two` and `three_to_five` classes.  We retained that approach for the above modeling iteration, but for the next iteration we will allow entries to keep their original multiple categories.
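
The overfitting percentage used above is `(1 - validation R-squared / train R-squared) * 100`. A minimal worked example with made-up R$^2$ values (the model names and numbers are hypothetical, not results from this notebook) shows how to read the sign:

```python
import pandas as pd

# Hypothetical R-squared values, for illustration only
train_r2 = pd.Series({"model_a": 0.20, "model_b": 0.10, "model_c": 0.08})
val_r2 = pd.Series({"model_a": 0.10, "model_b": 0.09, "model_c": 0.10})

# Same formula as in the cell above
overfit_demo = (1 - val_r2 / train_r2) * 100
print(overfit_demo.sort_values())
```

Here `model_a` loses half of its training R$^2$ on validation (50%), `model_b` loses 10%, and `model_c` comes out negative (-25%) because it scored higher on the validation set than on the training set.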

### *GBM Tuned* Performance on Test Set

In [None]:
# Checking performance of champion model on test set
GBM_tuned_test_perf = model_performance_regression(GBM_tuned, X_test, y_test)
print("Test performance:\n", GBM_tuned_test_perf)

# Creating test and train performance df
champion_df = pd.DataFrame()
champion_df["GBM Tuned Train"] = GBM_tuned_train_perf.T
champion_df["GBM Tuned Test"] = GBM_tuned_test_perf.T
champion_df["Overfit Percentage"] = (
    1 - (champion_df["GBM Tuned Test"] / champion_df["GBM Tuned Train"])
) * 100
champion_df.drop("Adj. R-squared", inplace=True)

In [None]:
# Performance on train and test sets
print(
    f'Average overfit of the {len(champion_df)} metrics is {np.round(champion_df["Overfit Percentage"].mean(), 2)}%.'
)
champion_df

#### Observations:
- *GBM Tuned*'s performance is holding up on the unseen test data.
- We have a model that explains 10.7% of the variation in life span among notable Wikipedia individuals who meet the inclusion criteria.
- The model predicts life expectancy within average errors of 11.5 years and 18.8%.
- Let us check the most important predictive features of the model.
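
Assuming the two average errors quoted above are the mean absolute error (in years) and the mean absolute percentage error, which matches the metrics imported at the top of this notebook, the following sketch shows how each is computed. The ages are made up for illustration, and the notebook's own `model_performance_regression` helper may differ in detail:

```python
import numpy as np

# Hypothetical actual and predicted ages at death, in years
y_true_demo = np.array([60.0, 80.0, 50.0, 100.0])
y_pred_demo = np.array([72.0, 70.0, 60.0, 90.0])

abs_errors = np.abs(y_true_demo - y_pred_demo)

# Mean absolute error: average miss in years
mae_demo = abs_errors.mean()

# Mean absolute percentage error: average miss relative to the true age
mape_demo = (abs_errors / y_true_demo).mean() * 100

print(f"MAE: {mae_demo:.2f} years, MAPE: {mape_demo:.2f}%")
```

Because the percentage error divides by the true age, the same miss in years weighs more heavily for individuals who died young.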

### Feature Importance of *GBM Tuned*

In [None]:
# Plotting feature importances of final model
feature_names = X_train.columns
importances = GBM_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

#### Observations:
- Before deciding on a champion model, we will try another very similar approach.
- Instead of using the extracted feature `known_for`, which grouped entries with multiple `known for` categories, we will let the original features stand.  This approach would not have worked for the basic linear regression model, because we had to drop columns to avoid multicollinearity.
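
To make the difference between the two encodings concrete, here is a minimal sketch on made-up data. The category columns `arts`, `sports`, and `sciences`, the helper `group_known_for`, and the fallback label `other` are hypothetical and are not the notebook's actual preprocessing. Only the `two` and `three_to_five` class names come from the description above:

```python
import pandas as pd

# Hypothetical multi-hot "known for" columns: an entry may have several 1s
demo = pd.DataFrame(
    {"arts": [1, 0, 1, 1], "sports": [0, 1, 1, 1], "sciences": [0, 0, 0, 1]},
    index=["entry_1", "entry_2", "entry_3", "entry_4"],
)


def group_known_for(row):
    """Collapse a multi-hot row into one mutually exclusive label."""
    n_categories = int(row.sum())
    if n_categories == 1:
        return row.idxmax()  # name of the single category
    if n_categories == 2:
        return "two"
    if 3 <= n_categories <= 5:
        return "three_to_five"
    return "other"  # assumed fallback for rows outside 1 to 5 categories


# Grouped version: one label per entry, as with the extracted feature
demo_grouped = demo.apply(group_known_for, axis=1)
print(demo_grouped)
```

The grouped label records only how many categories an entry has once there are several, whereas the original columns in `demo` keep which categories they are.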

In [None]:
print("Complete")

# Chime notification when cell executes
chime.success()

# [Proceed to Data Cleaning Part ]()