TODO:
- Overview/Content of Notebook
- How to use a notebook

## ML Overview

## Data Preprocessing
- Missing Data 
	- Just mention this, don't go into detail. Very dataset specific, lots of good resources available.
- Categorical Features
- Label Encoding
- Standardisation & Normalisation

In [3]:
import pandas as pd
import numpy as np
np.random.seed(42)

### Missing Data
- Always check for missing data and handle it. Missing values can break models, either with an explicit error or silently.
- Handling is dataset specific
- Add resources links for this 
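
A quick first check is to count the missing values per column. The sketch below uses a small made-up dataframe (the column names and values are purely illustrative and are not taken from the notebook's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy_df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "span": [120.0, 80.0, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
n_incomplete_rows = int(toy_df.isna().any(axis=1).sum())
print(f"Rows with missing data: {n_incomplete_rows}")
```

Whether to drop or fill such rows depends on the dataset, as noted above.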

In [None]:
from ml_nb_code import get_nan_example
df = get_nan_example()
df

In [None]:
# Drop rows with missing data
df = df.dropna()
# Check that no rows with missing values remain (this should display an empty dataframe)
df.loc[df.isna().any(axis=1)]

In [None]:
df

### Standardisation & Normalisation

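Feature scaling is important because many models, for example those based on distances or gradient descent, are sensitive to the scale of their input features.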

In [None]:
from ml_nb_code import feature_scaling
feature_scaling()

Z-standardisation: $\frac{X - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the feature.

This gives the data zero mean and unit variance.
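
As a small worked check of the formula, the sketch below standardises a handful of made-up values with NumPy and confirms the resulting mean and standard deviation. The numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the feature (5.0 for these values)
sigma = x.std()   # population standard deviation, ddof=0 (2.0 for these values)

# Apply the z-standardisation formula element-wise
z = (x - mu) / sigma

print(f"Mean: {z.mean():.3f}, Std: {z.std():.3f}")
```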

In [None]:
from ml_nb_code import feature_scaling_example
feature_scaling_example()

This can be done either manually or with sklearn.

In [None]:
# Manually
from ml_nb_code import get_fs_data
df = get_fs_data()

df["x1_norm"] = (df["x1"] - df["x1"].mean()) / df["x1"].std()
df["x2_norm"] = (df["x2"] - df["x2"].mean()) / df["x2"].std()

print(f"Mean: {df['x1_norm'].mean()}, Std: {df['x1_norm'].std()}")
print(f"Mean: {df['x2_norm'].mean()}, Std: {df['x2_norm'].std()}")

In [None]:
# Using sklearn
from ml_nb_code import get_fs_data
from sklearn.preprocessing import StandardScaler
df = get_fs_data()

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data (i.e. calculate the mean and standard deviation)
scaler.fit(df[["x1", "x2"]])
# Transform the data
df[["x1_norm", "x2_norm"]] = scaler.transform(df[["x1", "x2"]])

# Can combine the fit and transform steps
# df[["x1_norm", "x2_norm"]] = scaler.fit_transform(df[["x1", "x2"]])

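# Note: StandardScaler uses the population standard deviation (ddof=0),
# whereas pandas' .std() defaults to ddof=1, so pass ddof=0 to get a value of 1
print(f"Mean: {df['x1_norm'].mean()}, Std: {df['x1_norm'].std(ddof=0)}")
print(f"Mean: {df['x2_norm'].mean()}, Std: {df['x2_norm'].std(ddof=0)}")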

### Categorical Features/Inputs

TODO: Change both categorical and label encoding to use the dataset used for the decision tree concepts.


In [None]:
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'Type': ['Resid', 'Comm', 'Indus', 'Resid', 'Indus', 'Comm']})
df

In [None]:
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform the data
encoded_data = encoder.fit_transform(df)

# Create a new dataframe with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Type']), dtype=int)
# Concatenate the original and encoded dataframes
result_df = pd.concat([df, encoded_df], axis=1)
result_df

### Label Encoding
As with the model inputs, most models also require the target variable to be numerical. This is generally done using label encoding.

In [None]:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=np.arange(15).reshape(5, 3), columns=["Feature1", "Feature2", "Feature3"])
df["Target"] = ["Safe", "Unsafe", "Safe", "Safe", "Unsafe"]
df

In [None]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Fit and transform the data
df["Target_encoded"] = label_encoder.fit_transform(df["Target"])
df

In [None]:
# Inverse transform (do we need to show this here? or move to separate coding notebook)
df["Target_2"] = label_encoder.inverse_transform(df["Target_encoded"])
df

## Model Fitting
- Decision Tree
    - Interactive example
    - Sklearn example (Hands on)
    - Visualisation   
- Overfitting/Underfitting

### Decision Tree Concepts

In [None]:
### TODO: Check that this is working correctly?
# From David Dempsey's notebook
from ml_nb_code import decision_tree
decision_tree()
# TASK 1
# move the top slider to divide the dataset, trying both features
# try to separate the safe and unsafe bridges as much as possible
# when you are satisfied with the split of data, check the box to lock the root node

# TASK 2
# repeat the exercise for the lefthand and righthand sliders below
# further separate and subdivide the data, trying to distinguish the two binary classes
# can you construct a decision tree that classifies the two bridge types based on their features?

# Consider the original dataframe given in the cells above. Which part is the feature matrix X, and
# which is the label vector y?
# What are the parameters of this model? What are the hyperparameters?

# TASK 3
# Suppose you are given a new bridge: load_capacity of 45, steel, and 10 years old. What would your model predict?

### Decision Tree (Hands on)
- Hands on data pre-processing of the heart disease dataset (todo: add details)
- Train a decision tree classifier

In [None]:
### Import relevant libraries and load the data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from ml_nb_code import get_heart_df

heart_df = get_heart_df(features=["thalach", "oldpeak", "thal"])
heart_df

# from ml_nb_code import load_iris_df
# iris_df, iris_feature_names = load_iris_df()
# iris_df

In [None]:
## Hands-On - Prepare the data


### Solution -- Hidden

# Normalise the features
numerical_features = ["thalach", "oldpeak"]
std_scaler = StandardScaler()
heart_df[numerical_features] = std_scaler.fit_transform(heart_df[numerical_features])

# Encode the categorical features
categorical_features = ["thal"]
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(heart_df[categorical_features])
heart_df[encoder.get_feature_names_out()] = encoded_data

# Encode the labels
label_encoder = LabelEncoder()
heart_df["target_encoded"] = label_encoder.fit_transform(heart_df["target"])

features = numerical_features + list(encoder.get_feature_names_out())

In [None]:
heart_df

In [None]:
# Fit a decision tree classifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt


# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model (here on the full dataset, no train/validation split yet)
clf.fit(heart_df[features], heart_df["target_encoded"])

# Get model predictions
y_pred = clf.predict(heart_df[features])

# Calculate accuracy
accuracy = accuracy_score(heart_df["target_encoded"], y_pred)
print(f"Accuracy: {accuracy:.2f}")

In [None]:
# Visualize the decision tree
plt.figure(figsize=(10, 6))
plot_tree(clf, filled=True, impurity=False, feature_names=features, class_names=label_encoder.inverse_transform(clf.classes_))
plt.show()

Note: In the plotted tree, the left branch is followed when a node's condition is True and the right branch when it is False.
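
The same structure can also be inspected as text with sklearn's `export_text`, which lists the `<=` (True, left) branch of each split before the `>` (False, right) branch. The sketch below uses a tiny made-up dataset with a single hypothetical feature `x`, not the heart disease data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature, two well-separated classes
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Text version of the tree: the "<=" branch is the True (left) branch
rules = export_text(toy_clf, feature_names=["x"])
print(rules)
```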

In [None]:
# Plot decision boundaries
from ml_nb_code import plot_decision_boundary_heart
plot_decision_boundary_heart(heart_df, clf, features)

## Overfitting and Underfitting

In [None]:
from ml_nb_code import linear_regression_fitting_example
linear_regression_fitting_example()

## Model Evaluation & Hyperparameters
- Measure the model's performance on unseen data (i.e. data not used during training)
- Commonly done by splitting the available (labelled) data into a training set and a validation/testing set
- An 80/20 split is common

#### Train/Validation Split
- Training set: used to train the model
- Validation set: used to evaluate the model
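
As a minimal sketch of what the split does, the example below applies `train_test_split` with `test_size=0.2` to made-up data and checks the resulting set sizes. The arrays are hypothetical placeholders, not the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples with 2 features each, and binary labels
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50) % 2

# 80/20 split: 40 samples for training, 10 for validation
X_toy_train, X_toy_val, y_toy_train, y_toy_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_toy_train)}, Validation samples: {len(X_toy_val)}")
```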

TODO:
- Add interactive example for both train/val split and cross validation to show effect of train/val proportion and number of folds


In [4]:
### Example
### TODO:
# - Find better dataset than iris, does not make the point particularly well...
#   - Needs to show overfitting of default decision tree
# - Make interactive, show effect of validation data size
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from ml_nb_code import get_prepped_heart_df

heart_df, feature_keys = get_prepped_heart_df()

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(heart_df[feature_keys], heart_df["target_encoded"], test_size=0.2, random_state=42)
train_df = pd.concat([X_train, y_train], axis=1)
val_df = pd.concat([X_val, y_val], axis=1)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model on the training data
clf.fit(train_df[feature_keys], train_df["target_encoded"])

# Get model predictions
train_y_pred = clf.predict(train_df[feature_keys])
val_y_pred = clf.predict(val_df[feature_keys])

# Calculate accuracy
train_accuracy = accuracy_score(train_df["target_encoded"], train_y_pred)
val_accuracy = accuracy_score(val_df["target_encoded"], val_y_pred)
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Validation Accuracy: {val_accuracy:.2f}")

Training Accuracy: 0.97
Validation Accuracy: 0.70


In [None]:
heart_df

### Cross Validation
- Allows for more reliable model evaluation
- Gives an indication of the uncertainty in the training process
- Todo: **Add schematic of how this works**
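
Until a schematic is added, the sketch below shows how the data is divided into folds: each sample is used for validation exactly once and for training in all other folds. It uses plain `KFold` on ten made-up samples for readability; note that `cross_val_score` with an integer `cv` uses a stratified variant for classifiers.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples with a single feature
X_toy = np.arange(10).reshape(-1, 1)

# 5 folds: each fold holds out 2 samples for validation
kfold = KFold(n_splits=5)
folds = list(kfold.split(X_toy))

for fold_idx, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {fold_idx}: train={train_idx}, val={val_idx}")
```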

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from ml_nb_code import get_prepped_heart_df
import numpy as np

np.random.seed(5)

heart_df, feature_keys = get_prepped_heart_df()

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Use cross-validation to evaluate the model
cv_scores = cross_val_score(clf, heart_df[feature_keys], heart_df["target_encoded"], cv=5)

# Print the cross-validation scores
print(f"Cross-validation scores: {', '.join([f'{cur_score:.3f}' for cur_score in cv_scores])}")
print(f"Mean cross-validation score: {cv_scores.mean():.2f}")
print(f"Standard deviation of cross-validation scores: {cv_scores.std():.2f}")

Cross-validation scores: 0.667, 0.700, 0.712, 0.593, 0.610
Mean cross-validation score: 0.66
Standard deviation of cross-validation scores: 0.05


### Hyperparameters
- Parameters that are not learned during training
- Often highly relevant for overfitting/underfitting
    - E.g. Depth of Tree in Decision Tree
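
To illustrate, the sketch below compares a few values of the `max_depth` hyperparameter of a decision tree using cross-validation. It runs on synthetic data from `make_classification` (an assumption made for this example, not the heart disease dataset), so the scores themselves are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical, for illustration only)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Evaluate trees of different maximum depth (None means unlimited depth)
depth_scores = {}
for max_depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    scores = cross_val_score(tree, X_toy, y_toy, cv=5)
    depth_scores[max_depth] = scores.mean()
    print(f"max_depth={max_depth}: mean CV accuracy = {scores.mean():.3f}")
```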

In [1]:
### TODO: 
### - Add decision boundary visualisation
### - Details on what those parameters are
### - Update this to use cross validation instead of train/test split
from ml_nb_code import hyperparam_tuning_example
hyperparam_tuning_example()


In [5]:
heart_df

| | thalach | oldpeak | thal | target | thal_Fixed_defect | thal_Normal | thal_Reversable_defect | target_encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.017494 | 1.068965 | Fixed_defect | Presence | 1.0 | 0.0 | 0.0 | 1 |
| 1 | -1.816334 | 0.381773 | Normal | No Precense | 0.0 | 1.0 | 0.0 | 0 |
| 2 | -0.899420 | 1.326662 | Reversable_defect | No Precense | 0.0 | 0.0 | 1.0 | 0 |
| 3 | 1.633010 | 2.099753 | Normal | Presence | 0.0 | 1.0 | 0.0 | 1 |
| 4 | 0.978071 | 0.295874 | Normal | Presence | 0.0 | 1.0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | -1.161395 | -0.734914 | Reversable_defect | No Precense | 0.0 | 0.0 | 1.0 | 0 |
| 298 | -0.768432 | 0.124076 | Reversable_defect | No Precense | 0.0 | 0.0 | 1.0 | 0 |
| 299 | -0.375469 | 2.013854 | Reversable_defect | No Precense | 0.0 | 0.0 | 1.0 | 0 |
| 300 | -1.510696 | 0.124076 | Reversable_defect | No Precense | 0.0 | 0.0 | 1.0 | 0 |


## Hyperparameter Tuning
- What is it

TODO:
- Add some visualisation for it
- Example
- Use all features?
- Visualisation of results

In [12]:
### Grid Search Example
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from ml_nb_code import get_prepped_heart_df

heart_df, feature_keys = get_prepped_heart_df()

# Create a Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4, 10]
}

# Perform the grid search using cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=5, scoring='accuracy')
grid_search.fit(heart_df[feature_keys], heart_df["target_encoded"])

# Print the best parameters and the best score
print(f"Best parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters: {'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 200}
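
The "36 candidates" and "180 fits" in the log follow directly from the grid: 3 × 3 × 4 = 36 parameter combinations, each fitted once per fold. A small sketch of this arithmetic using sklearn's `ParameterGrid` (the variable names here are chosen for this example):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example
example_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4, 10]
}

# Number of parameter combinations and total number of model fits for cv=5
n_candidates = len(ParameterGrid(example_grid))
n_fits = n_candidates * 5

print(f"Candidates: {n_candidates}, Total fits: {n_fits}")
```

The number of fits grows multiplicatively with each added parameter or value, which is worth keeping in mind when extending the grid.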


In [26]:
# Create a dataframe with the results
results = []
for cur_param in param_grid.keys():
    results.append(pd.DataFrame(grid_search.cv_results_)[f"param_{cur_param}"])

results_df = pd.concat(results, axis=1)
results_df["mean_test_score"] = grid_search.cv_results_["mean_test_score"]

In [28]:
results_df.sort_values("mean_test_score", ascending=False)

| | param_n_estimators | param_min_samples_split | param_min_samples_leaf | mean_test_score |
|---|---|---|---|---|
| 35 | 200 | 10 | 10 | 0.760791 |
| 32 | 200 | 5 | 10 | 0.760791 |
| 29 | 200 | 2 | 10 | 0.760791 |
| 31 | 100 | 5 | 10 | 0.757401 |
| 28 | 100 | 2 | 10 | 0.757401 |
| 34 | 100 | 10 | 10 | 0.757401 |
| 25 | 100 | 10 | 4 | 0.757288 |
| 33 | 50 | 10 | 10 | 0.754068 |
| 30 | 50 | 5 | 10 | 0.754068 |
| 27 | 50 | 2 | 10 | 0.754068 |


## Full Hands-On Example
- Add details here for a full hands-on example

TODO:
- Determine which dataset to use for this