<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/Activity_6_2_7_Advanced_Modelling_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!


# Activity: Exploring decision trees

# Objective
The objective of this activity is to combine all the concepts explored this week: various tree-based models, pre- and post-pruning, and interpreting models using SHAP values.

# Instructions

## 1. Data exploration
- 1.1 Load the dataset and conduct basic explorations such as viewing the first few rows, describing the dataset to understand its structure, features, and target variable.
- 1.2 Visualise the distribution of the target variable to check for imbalance.
- 1.3 Correlation analysis to visualise relationships between the target variable and features using Seaborn or Matplotlib.

## 2. Transformations
- 2.1 Encode categorical variables using techniques like one-hot encoding or label encoding.
- 2.2 Normalise or standardise numerical features if required. (Hint: tree-based models don't need it.)

## 3. Compare basic models
- 3.1 Compare basic decision tree models using pre-pruning (early stopping) vs post-pruning (cost-complexity pruning, CCP).
- 3.2 Train basic tree models including decision tree, a bagging model (e.g., random forest), and a boosting model (AdaBoost, gradient boosting, XGBoost).
- 3.3 For each model, train on the training set and check for overfitting using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.
- 3.4 Compare the performance of these models and summarise the findings.

## 4. Hyperparameter tuning: Pre-pruning and post-pruning
- 4.1 Implement hyperparameter tuning for the decision tree model. Explore parameters such as `max_depth`, `min_samples_split`, and `max_features`; for the boosting models, also explore regularisation parameters such as `gamma` (XGBoost) and `learning_rate`.
- 4.2 Compare the performance of the tuned/pruned decision tree model against its baseline version.

## 5. Interpretation using SHAP values
- 5.1 Choose one of the models for interpretation (preferably a complex model like random forest or XGBoost).
- 5.2 Visualise the SHAP values and interpret the results to understand the impact of different features on the model's predictions.

#### Submission guidelines
- Ensure your notebook is well-commented to explain your code and thought process.
- Include visualisations to support your explorations and findings.
- Summarise your insights and conclusions at the end of the Notebook.

This activity is designed to provide hands-on experience with decision trees and their ensemble counterparts, covering the entire machine learning workflow from data preprocessing to model interpretation. It leaves you free to implement what you deem appropriate, so results may vary from user to user. The tasks are based on previous demonstration videos and data sets, so those can be used as a benchmark.


# Data set
Use a classification dataset such as the UCI Machine Learning Repository's Bank Marketing data set. You can find more details about it on
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. This data set has been previously explored in the SHAP demo.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
import xgboost as xgb


bank = pd.read_csv("https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/C2_W6_Datasets/bank-additional-full-processed.csv")
seed = 42  # assumed value: any fixed integer works, it only needs to be defined before use
np.random.seed(seed)


X = bank.drop('y', axis=1).copy()
y = bank['y'].copy()


## 1. Data exploration
- 1.1 Load the data set and conduct basic explorations such as viewing the first few rows, describing the data set to understand its structure, features, and target variable.
- 1.2 Visualise the distribution of the target variable to check for imbalance.
- 1.3 Correlation analysis to visualise relationships between the target variable and features using Seaborn or Matplotlib.
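
Steps 1.2 and 1.3 can be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not on the bank data: the columns `age` and `duration` and all values are hypothetical stand-ins, so swap in `bank` and its real columns when you work through the activity.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank data (hypothetical values, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "duration": rng.integers(0, 1000, size=200),
})
# Make the target depend on duration so a correlation is visible,
# and keep the positive class rare to mimic an imbalanced target
toy["y"] = (toy["duration"] + rng.normal(0, 100, size=200) > 850).astype(int)

# 1.2 Class balance: share of rows in each target class
class_share = toy["y"].value_counts(normalize=True)
print(class_share)

# 1.3 Correlation of each feature with the target
target_corr = toy.corr()["y"].drop("y").sort_values()
print(target_corr)
```

For the visual versions, `sns.countplot(x=toy["y"])` shows the class balance and `sns.heatmap(toy.corr(), annot=True)` shows the full correlation matrix.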


## 2. Transformations
- 2.1 Encode categorical variables using techniques like one-hot encoding or label encoding.
- 2.2 Normalise or standardise numerical features if required. (Hint: tree-based models don't need it.)

In [None]:
std = StandardScaler()
ohe = OneHotEncoder()
lbe = LabelEncoder()
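
One possible way to apply these encoders is sketched below, assuming a raw frame with one categorical column and a string target. The column names `job` and `age` and the rows are hypothetical; the processed CSV loaded above may already be partly encoded, so check its dtypes first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical raw rows, for illustration only
toy = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "age": [34, 51, 29, 42],
    "y": ["no", "yes", "no", "no"],
})
X_toy = toy.drop("y", axis=1)

# One-hot encode the categorical column; pass numeric columns through unscaled,
# since tree-based models do not need standardisation
encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["job"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = encoder.fit_transform(X_toy)

# Label-encode the target: classes are sorted, so "no" -> 0 and "yes" -> 1
y_encoded = LabelEncoder().fit_transform(toy["y"])
print(X_encoded.shape, y_encoded)
```

In the real workflow, fit the encoder on the training split only and reuse it with `transform` on the test split, so that no information leaks from test to train.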

## 3. Compare basic models
- 3.1 Train basic models including logistic regression, decision tree, a bagging model (e.g., random forest), and a boosting model (e.g., XGBoost).
- 3.2 For each model, train on the training set and check for overfitting using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.
- 3.3 Compare the performance of these models and summarise the findings.

In [None]:
lr = LogisticRegression(random_state=seed)
dt = DecisionTreeClassifier(random_state=seed)
rf = RandomForestClassifier(random_state=seed)
ab = AdaBoostClassifier(random_state=seed)
gb = GradientBoostingClassifier(random_state=seed)
xgb_clf = xgb.XGBClassifier(random_state=seed)  # separate name so the xgb module is not overwritten
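
The overfitting check in 3.2 amounts to comparing the same metric on the training and test splits. The sketch below shows the pattern on synthetic data from `make_classification`, an assumption made so the example runs without the bank data; the names `demo_models` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        # A large gap between train and test scores signals overfitting
        "train_auc": roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]),
        "test_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
        "test_f1": f1_score(y_te, model.predict(X_te)),
    }
    print(name, results[name])
```

An unpruned decision tree typically scores near-perfectly on the training split and noticeably lower on the test split, which is the gap that pruning in the next section aims to reduce.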

## 4. Hyperparameter tuning: Pre-pruning and post-pruning
- 4.1 Implement hyperparameter tuning for the decision tree model. Explore parameters such as `max_depth`, `min_samples_split`, and `max_features`; for the boosting models, also explore regularisation parameters such as `gamma` (XGBoost) and `learning_rate`.
- 4.2 Compare the performance of the tuned/pruned decision tree model against its baseline version.
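
To make the pre- versus post-pruning distinction concrete, here is a hedged sketch on synthetic data. Pre-pruning limits growth through constructor arguments, while post-pruning grows the full tree and then selects a cost-complexity `ccp_alpha`. The specific limits (`max_depth=4`, `min_samples_split=20`) and the use of 3-fold cross-validation to pick alpha are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Pre-pruning: stop growth early through constructor limits
pre_pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_split=20, random_state=0
).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then compute its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (it prunes down to the root) and guard against
# tiny negative values caused by floating-point error
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Pick alpha by cross-validation on the training split only
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X_tr, y_tr, cv=3
    ).mean()
    for a in candidate_alphas
]
best_alpha = candidate_alphas[int(np.argmax(cv_scores))]
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, tree.get_n_leaves(), round(tree.score(X_te, y_te), 3))
```

Selecting alpha with cross-validation rather than the test split keeps the test score an honest estimate when the pruned tree is compared against its baseline in 4.2.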


In [None]:
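from xgboost import XGBClassifier  # imported by name so it does not depend on what `xgb` refers to

# Fill in each empty param_grid with the values you want to search
models_param_grids = {
    'Decision Tree': {
        'model': DecisionTreeClassifier(random_state=seed),
        'param_grid': {
            # e.g. 'max_depth', 'min_samples_split', 'max_features', 'ccp_alpha'
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=seed),
        'param_grid': {
            # e.g. 'n_estimators', 'max_depth', 'max_features'
        }
    },
    'Gradient Boosting': {
        'model': GradientBoostingClassifier(random_state=seed),
        'param_grid': {
            # e.g. 'n_estimators', 'learning_rate', 'max_depth'
        }
    },
    'XGBoost': {
        'model': XGBClassifier(random_state=seed),
        'param_grid': {
            # e.g. 'n_estimators', 'learning_rate', 'max_depth', 'gamma'
        }
    }
}

# Search one model at a time; repeat or loop over models_param_grids for the rest.
# scoring='roc_auc' and cv=5 are assumed choices; adjust them to your evaluation plan.
cfg = models_param_grids['Decision Tree']
grid = GridSearchCV(estimator=cfg['model'], param_grid=cfg['param_grid'], scoring='roc_auc', cv=5)
# Call grid.fit(...) on your training split once it exists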
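
A filled-in version of the template might look like the sketch below. It runs on synthetic data and covers only two scikit-learn models so that it stays quick; the grid values, `scoring="roc_auc"`, `cv=3`, and the names `demo_grids` and `best` are all illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)

# Same structure as the template: one estimator and one grid per model name
demo_grids = {
    "Decision Tree": {
        "model": DecisionTreeClassifier(random_state=1),
        "param_grid": {"max_depth": [3, 5, None], "min_samples_split": [2, 20]},
    },
    "Random Forest": {
        "model": RandomForestClassifier(n_estimators=30, random_state=1),
        "param_grid": {"max_depth": [3, None], "max_features": ["sqrt", 0.5]},
    },
}

best = {}
for name, cfg in demo_grids.items():
    # Exhaustive search over the grid with 3-fold cross-validation
    search = GridSearchCV(cfg["model"], cfg["param_grid"], scoring="roc_auc", cv=3)
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on all the data passed to `fit`, which is the one to compare against the untuned baseline in 4.2.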

## 5. Interpretation using SHAP values
- 5.1 Choose one of the models for interpretation (preferably a complex model like random forest or XGBoost).
- 5.2 Visualise the SHAP values and interpret the results to understand the impact of different features on the model's predictions.


In [None]:
import shap
shap.initjs()

# SHAP values
shap_ex = shap.TreeExplainer(model)  # `model` is a placeholder: use your trained tree-based model
vals = shap_ex(X_test)  # `X_test` is a placeholder: use your test features

In [None]:
# Waterfall plot
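# Explains a single prediction, here the first test row.
# If vals has a class dimension (e.g. for some classifiers), select a class first: vals[0, :, 1]
shap.plots.waterfall(vals[0])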

In [None]:
# Force plot
shap.plots.force(vals[0])  # single prediction, here the first test row

In [None]:
# SHAP scatter
shap.plots.scatter(vals[:, 0])  # SHAP values of one feature (here the first) against its feature values

In [None]:
# Beeswarm plot
shap.plots.beeswarm(vals)  # global summary across all test rows

In [None]:
# Violin plot
shap.plots.violin(vals)  # global summary; passing an Explanation assumes a recent shap version
