# Model Building and Evaluation

Adapted from Wafiq Syed 2020 [How to use Scikit-Learn Datasets for Machine Learning](https://towardsdatascience.com/how-to-use-scikit-learn-datasets-for-machine-learning-d6493b38eca3) and Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

This tutorial covers core concepts in model building and evaluation, focusing on classification and clustering approaches using the Breast Cancer Wisconsin dataset. We'll explore various algorithms and metrics for evaluating their performance.

## Learning Objectives

- Learn to build and evaluate classification models using metrics like accuracy, precision, recall, and F1-score
- Understand how to interpret confusion matrices and ROC curves
- Apply clustering model evaluation using both external and internal validation methods
- Apply practical model evaluation techniques using scikit-learn's metrics module
- Gain hands-on experience with real biomedical data analysis

### Tasks to complete

- Build classification models
- Evaluate model performance using various metrics
- Create and analyze clustering models
- Generate performance visualizations
- Compare different evaluation approaches

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with numpy and sklearn libraries
- Knowledge of basic statistical concepts


## Get Started

- Please select kernel "conda_tensorflow2_p310" from SageMaker notebook instance.


### Import libraries


In [None]:
# Import the warnings module to handle and control warning messages during code execution.
import warnings

# Import the pyplot module from matplotlib for creating plots and visualizations.
import matplotlib.pyplot as plt

# Import the numpy library for numerical computations, especially for handling arrays and matrices.
import numpy as np

# Import the pandas library for data manipulation and analysis, particularly using DataFrames.
import pandas as pd

# Import the metrics module from sklearn (scikit-learn) for evaluating model performance.
# The `metrics` name used in later cells comes from the `from sklearn import ...` line below.
import sklearn.metrics

# Import the interp function from numpy for interpolation, likely used in ROC curve calculations.
from numpy import interp

# Import dendrogram, fcluster, and linkage functions from scipy.cluster.hierarchy for hierarchical clustering and dendrogram plotting.
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Import datasets, linear_model, and metrics modules from sklearn for various machine learning tasks like loading datasets, linear models, and evaluation metrics.
from sklearn import datasets, linear_model, metrics

# Import the clone function from sklearn.base for creating copies of estimators.
from sklearn.base import clone

# Import the KMeans class from sklearn.cluster for K-Means clustering algorithm.
from sklearn.cluster import KMeans

# Import the load_breast_cancer function from sklearn.datasets to load the breast cancer dataset.
from sklearn.datasets import load_breast_cancer

# Import the PCA class from sklearn.decomposition for Principal Component Analysis (dimensionality reduction).
from sklearn.decomposition import PCA

# Import auc and roc_curve functions from sklearn.metrics for calculating Area Under the ROC Curve and plotting ROC curves.
from sklearn.metrics import auc, roc_curve

# Import the train_test_split function from sklearn.model_selection to split datasets into training and testing sets.
from sklearn.model_selection import train_test_split

# Import the KNeighborsClassifier class from sklearn.neighbors for K-Nearest Neighbors classification algorithm.
from sklearn.neighbors import KNeighborsClassifier

# Import LabelEncoder and label_binarize from sklearn.preprocessing for encoding categorical labels and binarizing labels in a one-vs-all fashion.
from sklearn.preprocessing import LabelEncoder, label_binarize

# Sets the backend of matplotlib to the 'inline' backend: With this backend,
# the output of plotting commands is displayed inline within frontends like
# the Jupyter notebook, directly below the code cell that produced it.
# The resulting plots will then also be stored in the notebook document.
%matplotlib inline

## Classification Example

In this example, we’ll be working with the “Breast Cancer Wisconsin” dataset. We will import the data and understand how to read it. We will also build a simple ML model that is able to classify cancer scans either as malignant or benign.


### Import “Breast Cancer Wisconsin” dataset

The dataset can be found in _sklearn.datasets_. Each dataset has a corresponding function used to load the dataset. These functions follow the same format: "load_DATASET()", where DATASET refers to the name of the dataset.


In [None]:
# Loads the breast cancer dataset from scikit-learn's datasets module.
bc = datasets.load_breast_cancer()

These load functions (such as _load_breast_cancer()_) don't return the data in tabular format. They return a **Bunch** object, Scikit-Learn's dictionary-like container whose keys can also be read as attributes.

Let's look at its keys.


In [None]:
# Prints the keys of the 'bc' dictionary.
print(bc.keys())

We can get the following keys:

- **data** is all the feature data (the attributes of the scan that help us identify if the tumor is malignant or benign, such as radius, area, etc.) in a NumPy array.
- **target** is the target data (the variable you want to predict, in this case whether the tumor is malignant or benign) in a NumPy array.
- **feature_names** are the names of the feature variables, in other words the names of the columns in data.
- **target_names** are the names of the target classes, in other words what each value in target stands for.
- **DESCR**, short for DESCRIPTION, is a description of the dataset.
- **filename** is the name of the CSV file holding the data (older Scikit-Learn releases stored the full path here).
- **data_module** is the name of the data module from which the data is loaded.

It’s important to note that every Scikit-Learn dataset is divided into data and target. data holds the features, which are the variables that help the model learn how to predict. target holds the actual labels. In our case, the target is a single column that classifies each tumor as either 0 (malignant) or 1 (benign).
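As a quick check of the structure described above, the sketch below loads the dataset a second time under a hypothetical name (`bunch_demo`, chosen so that it does not disturb `bc`). It shows that a key can be read dictionary-style or as an attribute, and it counts the samples per class.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# `bunch_demo` is an illustrative name, not part of the original notebook.
bunch_demo = load_breast_cancer()

# Dictionary-style and attribute-style access return the same array object.
same_array = bunch_demo["data"] is bunch_demo.data

# Shape of the feature matrix: (n_samples, n_features).
n_samples, n_features = bunch_demo.data.shape

# Number of samples per target value, keyed by the class name.
class_counts = dict(
    zip(bunch_demo.target_names.tolist(), np.bincount(bunch_demo.target).tolist())
)

print(same_array, n_samples, n_features, class_counts)
```

The position of each name in `target_names` is the integer used in `target`, which is why index 0 corresponds to malignant and index 1 to benign.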


Let's take a look at the description of the dataset.


In [None]:
# Prints the description of the dataset object 'bc' to the console.
print(bc.DESCR)

### Working with the Dataset

We can use _pandas_ to explore the dataset.


In [None]:
# Create a pandas DataFrame from the breast cancer feature data, using bc.feature_names as the column names
df = pd.DataFrame(bc.data, columns=bc.feature_names)

# Add a new column named "target" to the DataFrame
df["target"] = bc.target

# Display the first five rows of the DataFrame to show the initial data structure
df.head()

To see a summary of the dataset's columns, run:


In [None]:
# Displays a concise summary of the DataFrame 'df', including column names, data types, non-null values, and memory usage.
df.info()

There are a few things to observe:

- There aren’t any missing values: every column has 569 non-null values. This saves us from having to handle missing data.
- All the data types are numerical. This is important because Scikit-Learn models expect numerical input. In the real world, when we get categorical variables, we encode them as numerical variables first. This dataset contains no categorical variables.

Hence, for this dataset Scikit-Learn has already taken care of the data cleaning. That makes its bundled datasets a convenient starting point for learning ML.
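The two observations above can also be checked programmatically instead of being read off the `df.info()` output. The following is a minimal sketch; `bc_check` and `df_check` are hypothetical names used so that the notebook's own `bc` and `df` stay untouched.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.datasets import load_breast_cancer

# Rebuild the same DataFrame as in the notebook under illustrative names.
bc_check = load_breast_cancer()
df_check = pd.DataFrame(bc_check.data, columns=bc_check.feature_names)
df_check["target"] = bc_check.target

# Total number of missing cells across the whole frame.
total_missing = int(df_check.isna().sum().sum())

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in df_check.dtypes)

print(total_missing, all_numeric)
```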


### Let's do some AI

Let’s build a model that classifies cancer tumors as malignant (spreading) or benign (non-spreading). This will show you how to use the data for your own models. We’ll build a simple K-Nearest Neighbors model.


First, let’s split the dataset into two, one for training the model — giving it data to learn from, and the second for testing the model — seeing how well the model performs on data (scans) it hasn’t seen before.


In [None]:
# Store the feature data from the 'bc' dataset into the variable 'X'.
X = bc.data

# Store the target data (labels) from the 'bc' dataset into the variable 'y'.
y = bc.target

# Import the 'train_test_split' function from Scikit-Learn's model_selection module.
from sklearn.model_selection import train_test_split

# Split the feature data (X) and target data (y) into training and testing sets using the train_test_split function.
# X_train: Feature data for the training set.
# X_test: Feature data for the testing set.
# y_train: Target data (labels) for the training set.
# y_test: Target data (labels) for the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y)
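
The call above relies on the defaults of `train_test_split`: a quarter of the rows go to the test set and the shuffle differs on every run. As a hedged variation (not what the original notebook does), the sketch below fixes `random_state` for reproducibility and passes `stratify` so that both parts keep roughly the same class balance. The `*_demo`, `X_tr`, and `X_te` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays (illustrative names).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the malignant/benign ratio similar in both parts;
# random_state=0 is an arbitrary seed that makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
# Fraction of benign samples (label 1) overall and in the test part.
print(round(y_demo.mean(), 3), round(y_te.mean(), 3))
```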

This gives us two datasets: one for training and one for testing. Let’s move on to training the model.


In [None]:
# Classifier implementing the k-nearest neighbors vote. Initializes a KNeighborsClassifier object with 6 neighbors.
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the k-nearest neighbors classifier using the training features (X_train) and training labels (y_train).
knn.fit(X_train, y_train)

# Calculate and print the mean accuracy of the trained model on the test features (X_test) and test labels (y_test).
print("Model accuracy: ", knn.score(X_test, y_test))

The n_neighbors parameter in KNN determines the number of neighboring data points considered when making a classification decision. Choosing this value involves a trade-off:

- **Smaller values**: Result in more intricate decision boundaries that closely fit the training data but may lead to overfitting and increased sensitivity to noise.
- **Larger values**: Produce smoother decision boundaries that enhance generalization but risk overlooking important patterns in the data.

Selecting an appropriate n_neighbors value requires balancing these factors to capture local patterns without excessive noise influence. This choice should be guided by the dataset's characteristics (e.g., larger datasets or datasets with more noise often benefit from larger values of `n_neighbors`) and your domain knowledge. Additionally, cross-validation is commonly used to test different values and identify the optimal setting for a given dataset.

For the Breast Cancer Wisconsin dataset, a value of 6 was chosen based on prior experimentation showing good performance with this dataset, the relatively small size of the dataset, and the desire to balance capturing local patterns while still avoiding noise.
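The cross-validation approach mentioned above can be sketched as follows. This is an illustration only: the candidate range of 1 to 15 and the 5-fold setting are assumptions, and the result is not claimed to reproduce the value of 6 used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative names so the notebook's X and y are left untouched.
X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Candidate values for n_neighbors (an assumed search range).
candidate_ks = range(1, 16)

# Mean 5-fold cross-validated accuracy for each candidate value.
cv_scores = {}
for k in candidate_ks:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5
    )
    cv_scores[k] = fold_scores.mean()

# Pick the candidate with the highest mean accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 4))
```

Differences between neighboring values of `n_neighbors` are often small, so it is worth looking at the whole `cv_scores` dictionary and not only at the single best value.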


## Clustering Example

In this example, we will learn how to fit a clustering model on the “Breast Cancer Wisconsin” dataset. We use a labeled dataset so that we can compare the clustering results with the actual labels. A point to remember here is that labeled data is usually not available in the real world, which is why we turn to unsupervised methods like clustering. We will cover two different algorithms, one from partition-based clustering and one from hierarchical clustering.


In [None]:
# Load Wisconsin Breast Cancer Dataset

# Import the load_breast_cancer function from the sklearn.datasets module to load the dataset.
from sklearn.datasets import load_breast_cancer

# load data
# Load the Wisconsin Breast Cancer dataset into the variable 'bc'.
bc = load_breast_cancer()

# Store the feature data
# Assign the feature data from the loaded dataset 'bc' to the variable 'X'.
X = bc.data

# store the target data
# Assign the target data (labels) from the loaded dataset 'bc' to the variable 'y'.
y = bc.target
# Print the shape of the feature data 'X' and the names of the features from 'bc.feature_names'.
print(X.shape, bc.feature_names)

It is evident that we have a total of 569 observations and 30 attributes or features for each observation.


### Partition-Based Clustering Example


We will choose the simplest yet most popular partition-based clustering model for our example, the **K-means** algorithm. This is a centroid-based clustering algorithm, which starts with an assumption about the total number of clusters in the data and with random centers assigned to each cluster. It then reassigns each data point to the center closest to it, using Euclidean distance as the distance metric. After each reassignment, it recalculates the center of each cluster. The whole process is repeated iteratively and stops when reassigning data points no longer changes the cluster centers. Variants include algorithms like **K-medoids**.
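To make the assign-then-update loop described above concrete, here is a minimal NumPy sketch of it. It is a teaching aid under simplifying assumptions (random data points as initial centers, a fixed iteration cap) and is not how scikit-learn's `KMeans` is implemented. The function name `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_sketch(points, n_clusters, n_iter=100, seed=0):
    """Plain assign/update K-means loop on a 2-D array of points."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as the initial centers.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[assignments == k].mean(axis=0)
            if np.any(assignments == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assignments, centers


# Toy data: two groups of 2-D points drawn around (0, 0) and (5, 5).
toy_rng = np.random.default_rng(1)
toy = np.vstack([
    toy_rng.normal(0.0, 0.5, size=(50, 2)),
    toy_rng.normal(5.0, 0.5, size=(50, 2)),
])
toy_labels, toy_centers = kmeans_sketch(toy, n_clusters=2)
print(toy_centers)
```

Because the starting centers are random, different seeds can lead to different final clusters, which is one reason library implementations typically run several initializations.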


In [None]:
# Import KMeans class from scikit-learn library for K-means clustering algorithm.
from sklearn.cluster import KMeans

# Initialize KMeans clustering object, specifying the number of clusters and random state.
km = KMeans(n_clusters=2, random_state=2)
# Fit the K-means model to the data X, which performs the clustering.
km.fit(X)

# Get the cluster labels assigned to each data point by the fitted K-means model.
labels = km.labels_
# Get the coordinates of the cluster centers calculated by K-means.
centers = km.cluster_centers_

# Print the cluster labels for the first 10 data points to see cluster assignments.
print(labels[:10])

# Print the cluster centers, representing the mean feature vector for each cluster.
# These centers are numerical values in the feature space (30 dimensions in this case) and indicate the central point of each cluster.
print(centers)

In [None]:
# we will leverage PCA to reduce the input dimensions (30) to two principal components
# and visualize the clusters on top of the same.

# Instantiate PCA object from scikit-learn, specifying that we want to reduce the data to 2 principal components.
pca = PCA(n_components=2)
# Fit PCA to the data X and then transform X to its first 2 principal components, storing the result in bc_pca.
bc_pca = pca.fit_transform(X)

In [None]:
# Visualize the clusters on the reduced 2D feature space for the actual labels as well as the clustered output labels.
# Create a figure and a set of subplots (1 row, 2 columns) for visualization, setting the figure size to 8x4 inches.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# Set the suptitle for the entire figure to "Visualizing breast cancer clusters".
fig.suptitle("Visualizing breast cancer clusters")
# Adjust the spacing between subplots and the top of the figure to prevent overlap with the suptitle.
fig.subplots_adjust(top=0.85, wspace=0.5)
# Set the title for the first subplot (ax1) to "Actual Labels", representing the ground truth labels.
ax1.set_title("Actual Labels")
# Set the title for the second subplot (ax2) to "Clustered Labels", representing the labels from the clustering algorithm.
ax2.set_title("Clustered Labels")

# Iterate through each data point in the dataset using its index 'i'.
for i in range(len(y)):
    # Check if the actual label 'y[i]' for the i-th data point is 0.
    if y[i] == 0:
        # If actual label is 0, scatter plot the i-th data point on the first subplot (ax1)
        # using the first two PCA components (bc_pca[i, 0], bc_pca[i, 1]), color it green ('g'), and use a dot marker ('.').
        c1 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c="g", marker=".")
    # Check if the actual label 'y[i]' for the i-th data point is 1.
    if y[i] == 1:
        # If actual label is 1, scatter plot the i-th data point on the first subplot (ax1)
        # using the first two PCA components, color it red ('r'), and use a dot marker ('.').
        c2 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c="r", marker=".")

    # Check if the clustered label 'labels[i]' for the i-th data point is 0.
    if labels[i] == 0:
        # If clustered label is 0, scatter plot the i-th data point on the second subplot (ax2)
        # using the first two PCA components, color it green ('g'), and use a dot marker ('.').
        c3 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c="g", marker=".")
    # Check if the clustered label 'labels[i]' for the i-th data point is 1.
    if labels[i] == 1:
        # If clustered label is 1, scatter plot the i-th data point on the second subplot (ax2)
        # using the first two PCA components, color it red ('r'), and use a dot marker ('.').
        c4 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c="r", marker=".")

# Create a legend 'l1' for the first subplot (ax1) using the scatter plot handles 'c1' and 'c2', and label the classes as "0" and "1".
l1 = ax1.legend([c1, c2], ["0", "1"])
# Create a legend 'l2' for the second subplot (ax2) using the scatter plot handles 'c3' and 'c4', and label the clusters as "0" and "1".
l2 = ax2.legend([c3, c4], ["0", "1"])

We can clearly see that the clustering has worked quite well: it shows a distinct separation between the clusters labeled 0 and 1, and the result is quite similar to the actual labels. However, there is some overlap where instances have been mislabeled.

Remember that in a real-world scenario you will not have the actual labels to compare with; the main idea is to find structures or patterns in your data in the form of these clusters. Hence, even when you run clustering on labeled data, do not compare cluster label values directly with the actual labels and try to measure accuracy.

Another very important
point to remember is that cluster label values have no significance. The labels 0 and 1 are just values to
distinguish cluster data points from each other.
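A small toy example shows why cluster label values should not be scored like class predictions. The arrays below are made up for illustration. The clustering groups the points exactly as the true labels do, but names the groups the other way round. Plain accuracy depends on the naming, whereas a permutation-invariant score such as the adjusted Rand index does not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical ground truth and a clustering with the same grouping
# but opposite label names.
true_toy = np.array([0, 0, 0, 1, 1, 1])
cluster_toy = np.array([1, 1, 1, 0, 0, 0])

# Swapping the names 0 <-> 1 changes nothing about the grouping itself.
flipped_toy = 1 - cluster_toy

acc_original = accuracy_score(true_toy, cluster_toy)
acc_flipped = accuracy_score(true_toy, flipped_toy)
ari_original = adjusted_rand_score(true_toy, cluster_toy)
ari_flipped = adjusted_rand_score(true_toy, flipped_toy)

print(acc_original, acc_flipped)   # accuracy changes with the naming
print(ari_original, ari_flipped)   # adjusted Rand index does not
```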

Another important note is that if we had asked for more than two clusters, the algorithm would have readily supplied them, but they would have been hard to interpret and many would not make sense. Hence, one caveat of the K-means algorithm is that it works best when we have some idea of the total number of clusters that may exist in the data.


### Hierarchical Clustering Example

We can use the same data to perform a hierarchical clustering and see if the results change much as
compared to K-means clustering and the actual labels.

Agglomerative clustering is hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. The merge criterion is chosen from a candidate set of linkages, and the selected linkage governs the merge strategy. Some examples of linkage criteria are Ward, complete linkage, and average linkage.
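To see what the bottom-up merging produces, the sketch below runs SciPy's `linkage` on four made-up 2-D points with three of the linkage criteria named above. The points and the `*_toy` names are illustrative. Each row of a linkage matrix records one merge: the two clusters joined, the distance at which they were joined, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical points forming two tight pairs far from each other.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Compute one linkage matrix per criterion and keep them for comparison.
Z_by_method = {}
for method in ("ward", "complete", "average"):
    Z_by_method[method] = linkage(toy_points, method=method)
    # Column 2 holds the merge distances, one per merge step.
    print(method, Z_by_method[method][:, 2].round(3))
```

With n observations there are always n - 1 merges, so each matrix here has three rows. The first merges agree across criteria because they join single points, while the final merge distance depends on the chosen linkage.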


In [None]:
# compute the linkage matrix using Ward’s minimum variance criterion.
# Perform hierarchical/agglomerative clustering.
# Returns hierarchical clustering encoded as a linkage matrix.
# Apply hierarchical clustering to the data 'X' using Ward's minimum variance method.
Z = linkage(X, "ward")
# Print the resulting linkage matrix 'Z' to the console.
print(Z)

In [None]:
# Use a dendrogram to visualize the hierarchical clustering distance-based merges.
# Create a new figure for the dendrogram plot with a specified size (width=8 inches, height=3 inches).
plt.figure(figsize=(8, 3))
# Set the title of the dendrogram plot.
plt.title("Hierarchical Clustering Dendrogram")
# Set the label for the x-axis, representing the data points.
plt.xlabel("Data point")
# Set the label for the y-axis, representing the distance between clusters.
plt.ylabel("Distance")

# Plot the hierarchical clustering as a dendrogram using the linkage matrix 'Z'.
dendrogram(Z)

# Add a horizontal line to the dendrogram to help in visually determining a cutoff for cluster formation.
# y: position of the horizontal line on the y-axis (distance), set to 10000 in data coordinates.
# c: color of the line, set to 'k' for black.
# ls: line style, set to '--' for dashed line.
# lw: line weight, set to 0.5 for a thin line.
plt.axhline(y=10000, c="k", ls="--", lw=0.5)
# Display the dendrogram plot.
plt.show()

In [None]:
# Get the cluster labels by cutting the dendrogram at a distance threshold.
max_dist = 10000  # Distance threshold for forming clusters (the dashed line drawn on the dendrogram).

# Form flat clusters from the hierarchical clustering defined by the given linkage matrix.
#   - Z: the linkage matrix returned by scipy's 'linkage', encoding the hierarchical relationships between data points.
#   - max_dist: the threshold distance; observations in the same flat cluster have a cophenetic distance of no more than 'max_dist'.
#   - criterion="distance": form flat clusters based on the 'max_dist' distance threshold.
# Each data point is assigned an integer cluster label, stored in 'hc_labels'.
hc_labels = fcluster(Z, max_dist, criterion="distance")

In [None]:
# This code block aims to visualize and compare cluster outputs based on PCA-reduced dimensions against the original label distribution of a breast cancer dataset.
# Create a figure and a set of subplots (1 row, 2 columns) for visualization, setting the figure size to 8x4 inches.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# Set the main title for the entire figure, which encompasses both subplots.
fig.suptitle("Visualizing breast cancer clusters")
# Adjust subplot parameters to provide more space at the top for the suptitle and adjust horizontal spacing between subplots.
fig.subplots_adjust(top=0.85, wspace=0.5)
# Set the title for the first subplot (ax1) to "Actual Labels", representing the ground truth.
ax1.set_title("Actual Labels")
# Set the title for the second subplot (ax2) to "Hierarchical Clustered Labels", showing the clusters found by hierarchical clustering.
ax2.set_title("Hierarchical Clustered Labels")

# Iterate through each data point in the dataset using an index 'i' ranging from 0 to the length of 'y' (actual labels).
for i in range(len(y)):
    # Check if the actual label 'y[i]' for the i-th data point is 0.
    if y[i] == 0:
        # If the label is 0, create a scatter plot point on the first subplot (ax1) using the PCA-reduced data 'bc_pca[i, 0]' (first principal component) and 'bc_pca[i, 1]' (second principal component).
        # Set the color of the point to green ('g') and the marker style to a dot ('.'). Assign the scatter plot object to 'c1'.
        c1 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c="g", marker=".")
    # Check if the actual label 'y[i]' for the i-th data point is 1.
    if y[i] == 1:
        # If the label is 1, create a scatter plot point on the first subplot (ax1) using the PCA-reduced data 'bc_pca[i, 0]' and 'bc_pca[i, 1]'.
        # Set the color of the point to red ('r') and the marker style to a dot ('.'). Assign the scatter plot object to 'c2'.
        c2 = ax1.scatter(bc_pca[i, 0], bc_pca[i, 1], c="r", marker=".")

    # Check if the hierarchical clustering label 'hc_labels[i]' for the i-th data point is 1.
    if hc_labels[i] == 1:
        # If the cluster label is 1, create a scatter plot point on the second subplot (ax2) using the PCA-reduced data 'bc_pca[i, 0]' and 'bc_pca[i, 1]'.
        # Set the color of the point to green ('g') and the marker style to a dot ('.'). Assign the scatter plot object to 'c3'.
        c3 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c="g", marker=".")
    # Check if the hierarchical clustering label 'hc_labels[i]' for the i-th data point is 2.
    if hc_labels[i] == 2:
        # If the cluster label is 2, create a scatter plot point on the second subplot (ax2) using the PCA-reduced data 'bc_pca[i, 0]' and 'bc_pca[i, 1]'.
        # Set the color of the point to red ('r') and the marker style to a dot ('.'). Assign the scatter plot object to 'c4'.
        c4 = ax2.scatter(bc_pca[i, 0], bc_pca[i, 1], c="r", marker=".")

# Create a legend for the first subplot (ax1) using the scatter plot objects 'c1' and 'c2', labeling them as "0" and "1" respectively, and assign the legend object to 'l1'.
l1 = ax1.legend([c1, c2], ["0", "1"])
# Create a legend for the second subplot (ax2) using the scatter plot objects 'c3' and 'c4', labeling them as "1" and "2" respectively, and assign the legend object to 'l2'.
l2 = ax2.legend([c3, c4], ["1", "2"])

We definitely see two distinct clusters, but there is more overlap between them than with the K-means method, and more instances are mislabeled. Also take note of the label numbers: here the label values are 1 and 2. This reinforces the fact that label values only distinguish the clusters and don’t mean anything by themselves. The advantage of this method is that you do not need to specify the number of clusters beforehand. Instead, you choose a distance cutoff (10000 here) by inspecting the dendrogram, and the number of clusters follows from the underlying data.
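How the number of clusters follows from the distance cutoff can be illustrated on a tiny made-up dataset. The points and the cutoff values below are assumptions chosen for the demonstration and have nothing to do with the value of 10000 used for the breast cancer data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five hypothetical points: two tight pairs and one distant outlier.
toy_points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]]
)
Z_toy = linkage(toy_points, "ward")

# Cut the same hierarchy at increasing distance thresholds and count
# how many flat clusters each cut produces.
clusters_per_cutoff = {}
for cutoff in (0.5, 2.0, 15.0, 100.0):
    flat_labels = fcluster(Z_toy, cutoff, criterion="distance")
    clusters_per_cutoff[cutoff] = len(np.unique(flat_labels))

print(clusters_per_cutoff)
```

A low cutoff leaves every point in its own cluster, and a cutoff above the last merge distance collapses everything into one, so the dendrogram is usually inspected for a large gap between merge distances before a threshold is chosen.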


## Classification Model Evaluation Metrics


In [None]:
# Let's first prepare train and test datasets to build our classification models.
# Split the feature matrix X and target vector y into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    # X is the feature matrix containing the independent variables.
    X,
    # y is the target vector containing the dependent variable (labels).
    y,
    # test_size=0.3 specifies that 30% of the data will be used for testing, and 70% for training.
    test_size=0.3,
    # random_state=42 ensures that the data split is reproducible. Using the same random_state will result in the same split each time the code is run.
    random_state=42,
)
# Print the shapes of the training and testing feature matrices to verify the split.
print(X_train.shape, X_test.shape)

In [None]:
# Import the warnings module to handle warning messages.
import warnings

# Filter all warnings to be ignored to suppress them from output.
warnings.filterwarnings("ignore")

# Initialize a Logistic Regression model from scikit-learn's linear_model module.
logistic = linear_model.LogisticRegression()
# Train the Logistic Regression model using the training data (X_train features and y_train labels).
logistic.fit(X_train, y_train)

### Confusion Matrix


In [None]:
# Define a function to display a formatted confusion matrix.
def display_confusion_matrix(true_labels, predicted_labels, classes=[1, 0]):
    # Determine the total number of classes from the input 'classes' list.
    total_classes = len(classes)
    # Define levels and codes for creating a MultiIndex for pandas DataFrame columns and rows.
    level_labels = [total_classes * [0], list(range(total_classes))]
    # Calculate the confusion matrix using scikit-learn's metrics.confusion_matrix function.
    cm = metrics.confusion_matrix(
        # Pass the true labels to the confusion_matrix function.
        y_true=true_labels,
        # Pass the predicted labels to the confusion_matrix function.
        y_pred=predicted_labels,
        # Specify the classes to be considered in the confusion matrix.
        labels=classes,
    )
    # Create a pandas DataFrame to present the confusion matrix in a structured format.
    cm_frame = pd.DataFrame(
        # Use the calculated confusion matrix 'cm' as the data for the DataFrame.
        data=cm,
        # Define the columns of the DataFrame using pandas MultiIndex for hierarchical column labels ('Predicted:' and class names).
        columns=pd.MultiIndex(levels=[["Predicted:"], classes], codes=level_labels),
        # Define the index (rows) of the DataFrame using pandas MultiIndex for hierarchical row labels ('Actual:' and class names).
        index=pd.MultiIndex(levels=[["Actual:"], classes], codes=level_labels),
    )
    # Print the formatted confusion matrix DataFrame.
    print(cm_frame)


# Use the trained logistic regression model to predict class labels for the test features (X_test).
y_pred = logistic.predict(X_test)
# Display the confusion matrix to evaluate the model's performance.
# Pass the true labels (y_test), predicted labels (y_pred), and class labels [0, 1] to the function.
display_confusion_matrix(true_labels=y_test, predicted_labels=y_pred, classes=[0, 1])

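Each row of the matrix is an actual class and each column a predicted class. With the counts used in the next section (59 and 4 in the first row, 1 and 107 in the second), we can see that out of 63 observations with label 0 (malignant), our model has correctly predicted 59. Similarly, out of 108 observations with label 1 (benign), our model has correctly predicted 107.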


### True Positive, False Positive, True Negative and False Negative


In [None]:
# Set positive class label
positive_class = 1

# True Positive (TP): This is the count of the total number of instances from the
# positive class where the true class label was equal to the predicted class label.
TP = 107

# False Positive (FP): This is the count of the total number of instances from the
# negative class where our model misclassified them by predicting them as positive.
FP = 4

# True Negative (TN): This is the count of the total number of instances from the
# negative class where the true class label was equal to the predicted class label.
TN = 59

# False Negative (FN): This is the count of the total number of instances from the
# positive class where our model misclassified them by predicting them as negative.
FN = 1
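
The four counts above are typed in by hand from the printed matrix. As a sketch of an alternative, they can be unpacked directly from `confusion_matrix`. The example below uses small made-up label arrays (`toy_true`, `toy_pred`) so that it runs without the rest of the notebook; in the notebook, `y_test` and `y_pred` would take their place.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, with 1 as the positive class.
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted
# classes, so flattening row by row yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```

Unpacking the counts this way keeps them in sync with the model's actual predictions if the data split or the model changes.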

### Accuracy

This is one of the most popular measures of classifier performance. It is defined as the overall
accuracy or proportion of correct predictions of the model. The formula for computing accuracy from the
confusion matrix is:

$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}$


In [None]:
# Calculate the framework accuracy using scikit-learn's accuracy_score function and round to 5 decimal places.
fw_acc = round(metrics.accuracy_score(y_true=y_test, y_pred=y_pred), 5)
# Manually compute accuracy using the confusion matrix components (TP, TN, FP, FN) and round to 5 decimal places.
mc_acc = round((TP + TN) / (TP + TN + FP + FN), 5)

# Print the framework-calculated accuracy.
print("Framework Accuracy:", fw_acc)
# Print the manually computed accuracy.
print("Manually Computed Accuracy:", mc_acc)

### Precision

Precision, also known as positive predictive value, is another metric that can be derived from the confusion matrix. It is defined as the proportion of positive-class predictions that are actually correct, out of all the predictions made for the positive class. The formula for precision is as follows:

$Precision=\frac{TP}{TP+FP}$

A model with high precision raises few false alarms: when it predicts the positive class, it is usually right. Precision becomes important in cases where false positives are costly and we need to trust each positive prediction, even if some positive instances are missed.


In [None]:
# Calculates precision using scikit-learn's precision_score function and rounds it to 5 decimal places.
fw_prec = round(metrics.precision_score(y_true=y_test, y_pred=y_pred), 5)
# Manually computes precision using the formula: True Positives (TP) / (True Positives (TP) + False Positives (FP)), and rounds it to 5 decimal places.
mc_prec = round((TP) / (TP + FP), 5)

# Prints the precision calculated using scikit-learn's function.
print("Framework Precision:", fw_prec)
# Prints the precision calculated manually.
print("Manually Computed Precision:", mc_prec)

### Recall

Recall, also known as sensitivity, hit rate, or coverage, measures a model's ability to identify the
relevant data points. It is defined as the fraction of instances of the positive class that were correctly
predicted. The formula for recall is:

$Recall=\frac{TP}{TP+FN}$

Recall becomes an important measure of classifier performance in scenarios where we want to catch
as many instances of a particular class as possible, even when that increases our false positives. For example,
in bank fraud detection, a model with high recall will flag a larger number of potential fraud cases, some of
them false alarms, but it will raise an alarm for most of the truly suspicious cases.


In [None]:
# Calculate the recall score using scikit-learn's metrics.recall_score function, rounding to 5 decimal places.
fw_rec = round(metrics.recall_score(y_true=y_test, y_pred=y_pred), 5)
# Manually compute recall using the formula: True Positives (TP) / (True Positives (TP) + False Negatives (FN)), rounding to 5 decimal places.
mc_rec = round((TP) / (TP + FN), 5)

# Print the recall score calculated using scikit-learn's framework.
print("Framework Recall:", fw_rec)
# Print the recall score manually computed.
print("Manually Computed Recall:", mc_rec)
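
The trade-off between recall and false positives can be made concrete by moving the decision threshold over a set of predicted scores. The sketch below uses hypothetical labels and scores (not output from the model in this tutorial) and assumes that a sample is predicted positive when its score is at or above the threshold.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted positive-class scores.
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95])

toy_results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive when the score reaches the threshold.
    toy_pred = (toy_scores >= threshold).astype(int)
    toy_results[threshold] = (
        precision_score(toy_true, toy_pred),
        recall_score(toy_true, toy_pred),
    )
    print(
        "threshold=%.1f  precision=%.3f  recall=%.3f"
        % (threshold, toy_results[threshold][0], toy_results[threshold][1])
    )
```

Lowering the threshold catches every positive (recall 1.0 at 0.3) at the price of more false positives, while raising it removes the false positives (precision 1.0 at 0.7) but misses some positives.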

### F1-Score

There are some cases in which we want a balanced optimization of both precision and recall.
F1 score is a metric that is the harmonic mean of precision and recall and helps us optimize a classifier for
balanced precision and recall performance.
The formula for the F1 score is:

$F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$


In [None]:
# Calculates the F1-score using scikit-learn's metrics.f1_score function, rounding to 5 decimal places.
fw_f1 = round(metrics.f1_score(y_true=y_test, y_pred=y_pred), 5)
# Manually computes the F1-score using the formula: 2 * (precision * recall) / (precision + recall), rounding to 5 decimal places.
mc_f1 = round((2 * mc_prec * mc_rec) / (mc_prec + mc_rec), 5)

# Prints the F1-score calculated using scikit-learn's framework.
print("Framework F1-Score:", fw_f1)
# Prints the F1-score computed manually.
print("Manually Computed F1-Score:", mc_f1)
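
To see why the harmonic mean is used rather than a plain average, the following sketch compares the two on hypothetical precision/recall pairs that all share the same arithmetic mean. The helper `f1_from` is a name introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall pairs, each with an arithmetic mean of 0.6.
toy_pairs = [(0.6, 0.6), (0.9, 0.3), (1.0, 0.2)]

for p, r in toy_pairs:
    arithmetic = (p + r) / 2
    print(
        "precision=%.1f recall=%.1f  arithmetic mean=%.3f  F1=%.3f"
        % (p, r, arithmetic, f1_from(p, r))
    )
```

The arithmetic mean stays at 0.6 for every pair, while F1 drops from 0.6 to 0.45 and then to about 0.333 as precision and recall drift apart, so a high F1 requires both to be high.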

### Receiver Operating Characteristic (ROC) Curve

The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR)
as the decision threshold of the classifier is varied. It is applicable mostly to scoring classifiers,
i.e. classifiers that return a probability value or score for each class label, from which a class label
can be deduced (based on the maximum probability value).

Both rates are computed from the confusion matrix at a given threshold. TPR, also known as sensitivity
or recall, is the number of correct positive predictions divided by the number of positive samples in the
dataset. FPR, also known as the false alarm rate or (1 - specificity), is the number of incorrect positive
predictions divided by the number of negative samples in the dataset.
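
As a rough sketch of how individual points on the curve arise, the code below computes (TPR, FPR) by hand at a few thresholds and compares them with the output of scikit-learn's `roc_curve`. The labels and scores are hypothetical, and `rates_at` is a helper name introduced here for illustration; it assumes a sample is predicted positive when its score is at or above the threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and classifier scores.
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])


def rates_at(threshold, y_true, y_score):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    pred = y_score >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(tpr), float(fpr)


# One (TPR, FPR) point per threshold, from the strictest to the most lenient.
toy_manual_points = [
    rates_at(t, toy_true, toy_scores) for t in (0.8, 0.4, 0.35, 0.1)
]
print("Manual (TPR, FPR):", toy_manual_points)

# roc_curve adds a starting point at (0, 0) before the first real threshold.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_scores)
print("roc_curve TPR:", toy_tpr)
print("roc_curve FPR:", toy_fpr)
```

Each manual point matches the corresponding entry returned by `roc_curve` after its initial (0, 0) point.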


In [None]:
def get_class_labels(clf, label_encoder=None, class_names=None):
    """
    Retrieves class labels from a classifier, label encoder, or directly from
    provided class names.

    Args:
        clf: A trained classifier object.
        label_encoder: Optional. A fitted LabelEncoder object.
        class_names: Optional. A list of class name strings.

    Returns:
        list: A list of class labels.

    Raises:
        ValueError: If class labels cannot be determined from any of the inputs.
    """
    # Check if the classifier object has class labels defined
    if hasattr(clf, "classes_"):
        # If yes, get class labels from the classifier
        class_labels = clf.classes_
    # Else if a label encoder is provided
    elif label_encoder:
        # If yes, get class labels from the label encoder
        class_labels = label_encoder.classes_
    # Else if class names are directly provided
    elif class_names:
        # If yes, use the provided class names
        class_labels = class_names
    # Else if no class labels can be derived
    else:
        # Raise a ValueError indicating inability to determine prediction classes
        raise ValueError(
            "Unable to derive prediction classes, please specify class_names!"
        )
    return class_labels


def get_prediction_scores(clf, features):
    """
    Gets prediction scores (probabilities or decision function values) from a
    classifier.

    Args:
        clf: A trained classifier object.
        features: Feature matrix to get predictions for.

    Returns:
        An array of prediction scores or probabilities.

    Raises:
        AttributeError: If the classifier doesn't have predict_proba or decision_function methods.
    """
    # Check if the classifier has a predict_proba method (for probability estimates)
    if hasattr(clf, "predict_proba"):
        # Get probability predictions from the classifier
        return clf.predict_proba(features)
    # Else if the classifier has a decision_function method (for decision values)
    elif hasattr(clf, "decision_function"):
        # Get decision function values from the classifier
        return clf.decision_function(features)
    # Else if the classifier has neither predict_proba nor decision_function
    else:
        # Raise an AttributeError indicating the estimator lacks probability or confidence scoring
        raise AttributeError(
            "Estimator doesn't have a probability or confidence scoring system!"
        )


def calculate_binary_roc_data(y_test, y_score):
    """
    Calculates ROC curve data for binary classification.

    Args:
        y_test: True binary labels.
        y_score: Target scores (probabilities or decision function outputs).

    Returns:
        tuple: A tuple containing (fpr, tpr, roc_auc) where:
            - fpr: array of false positive rates
            - tpr: array of true positive rates
            - roc_auc: area under the ROC curve
    """
    # Compute False Positive Rate, True Positive Rate, and thresholds for ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_score)
    # Compute Area Under the ROC Curve (ROC AUC)
    roc_auc = auc(fpr, tpr)

    return fpr, tpr, roc_auc


def plot_binary_roc_curve(fpr, tpr, roc_auc):
    """
    Plots the ROC curve for binary classification.

    Args:
        fpr: Array of false positive rates.
        tpr: Array of true positive rates.
        roc_auc: Area under the ROC curve.
    """
    # Create a new figure for plotting ROC curves
    plt.figure(figsize=(6, 4))

    # Plot the ROC curve for binary classification
    plt.plot(
        fpr,
        tpr,
        label="ROC curve (area = {0:0.2f})".format(roc_auc),
        linewidth=2.5,
    )


def calculate_multiclass_roc_data(y_test, y_score, n_classes):
    """
    Calculates ROC curve data for multi-class classification.

    Args:
        y_test: Binarized true labels.
        y_score: Target scores (probabilities or decision function outputs).
        n_classes: Number of classes.

    Returns:
        tuple: A tuple containing (fpr, tpr, roc_auc) where:
            - fpr: dictionary of false positive rates for each class, plus 'micro' and 'macro' averages
            - tpr: dictionary of true positive rates for each class, plus 'micro' and 'macro' averages
            - roc_auc: dictionary of ROC AUC values for each class, plus 'micro' and 'macro' averages
    """

    # Initialize dictionaries to store false positive rates, true positive
    # rates, and ROC AUC values for each class.
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # Iterate through each class to compute ROC curve and AUC
    for i in range(n_classes):
        # Compute FPR, TPR, and thresholds for each class's ROC curve
        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
        # Compute ROC AUC for each class
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Compute the micro-average ROC curve and ROC area (pooling all classes together)
    fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
    # Compute ROC AUC for micro-average ROC curve
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    # Compute the macro-average ROC curve and ROC area
    # Collect every distinct false positive rate observed across the classes
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    # Initialize an array to store interpolated true positive rates
    mean_tpr = np.zeros_like(all_fpr)
    # Interpolate ROC curves at each point for macro-average
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    # Average true positive rates to get macro-average TPR
    mean_tpr /= n_classes
    # Assign macro-average FPR and TPR to dictionaries
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    # Compute ROC AUC for macro-average ROC curve
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

    return fpr, tpr, roc_auc


def plot_multiclass_roc_curves(fpr, tpr, roc_auc, class_labels):
    """
    Plots ROC curves for multi-class classification, including micro and macro averages.

    Args:
        fpr: Dictionary of false positive rates for each class, plus 'micro' and 'macro' averages.
        tpr: Dictionary of true positive rates for each class, plus 'micro' and 'macro' averages.
        roc_auc: Dictionary of ROC AUC values for each class, plus 'micro' and 'macro' averages.
        class_labels: List of class labels corresponding to the classes.
    """
    # Create a new figure for plotting ROC curves
    plt.figure(figsize=(6, 4))
    # Plot micro-average ROC curve
    plt.plot(
        fpr["micro"],
        tpr["micro"],
        label="micro-average ROC curve (area = {0:0.2f})".format(roc_auc["micro"]),
        linewidth=3,
    )

    # Plot macro-average ROC curve
    plt.plot(
        fpr["macro"],
        tpr["macro"],
        label="macro-average ROC curve (area = {0:0.2f})".format(roc_auc["macro"]),
        linewidth=3,
    )

    # Plot ROC curve for each class
    for i, label in enumerate(class_labels):
        plt.plot(
            fpr[i],
            tpr[i],
            label="ROC curve of class {0} (area = {1:0.2f})".format(label, roc_auc[i]),
            linewidth=2,
            linestyle=":",
        )


def finalize_plot():
    """
    Finalizes and displays the ROC curve plot with appropriate labels and formatting.
    """
    # Plot the diagonal line representing chance level performance
    plt.plot([0, 1], [0, 1], "k--")
    # Set x-axis limits from 0 to 1
    plt.xlim([0.0, 1.0])
    # Set y-axis limits from 0 to 1.05 for better visualization
    plt.ylim([0.0, 1.05])
    # Set x-axis label
    plt.xlabel("False Positive Rate")
    # Set y-axis label
    plt.ylabel("True Positive Rate")
    # Set plot title
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    # Display legend to identify each ROC curve
    plt.legend(loc="lower right")
    # Show the plot
    plt.show()


def plot_model_roc_curve(
    clf, features, true_labels, label_encoder=None, class_names=None
):
    """
    Plots ROC curve(s) for a classifier's predictions.

    Args:
        clf: A trained classifier object.
        features: Feature matrix to make predictions on.
        true_labels: True labels corresponding to the features.
        label_encoder: Optional. A fitted LabelEncoder object.
        class_names: Optional. A list of class name strings.

    Raises:
        ValueError: If the number of classes is less than 2.
    """
    # Get class labels for the classifier
    class_labels = get_class_labels(clf, label_encoder, class_names)

    # Get the number of classes from the class labels
    n_classes = len(class_labels)

    # Binarize the true labels for ROC curve calculation, handling multi-class scenarios
    y_test = label_binarize(true_labels, classes=class_labels)

    # Check if it's a binary classification problem (2 classes)
    if n_classes == 2:
        # Get prediction scores (probabilities or decision values) from the classifier
        scores = get_prediction_scores(clf, features)
        # predict_proba returns one column per class, so take the last (positive class);
        # decision_function already returns a 1-D array in the binary case
        y_score = scores if scores.ndim == 1 else scores[:, -1]
        # Compute the ROC curve data (FPR, TPR, AUC) and plot the curve
        fpr, tpr, roc_auc = calculate_binary_roc_data(y_test, y_score)
        plot_binary_roc_curve(fpr, tpr, roc_auc)

    # Else if it's a multi-class classification problem (more than 2 classes)
    elif n_classes > 2:
        # Get probability predictions from the classifier
        y_score = get_prediction_scores(clf, features)
        fpr, tpr, roc_auc = calculate_multiclass_roc_data(y_test, y_score, n_classes)
        plot_multiclass_roc_curves(fpr, tpr, roc_auc, class_labels)

    # Otherwise the number of classes is less than 2 (invalid case)
    else:
        # Raise a ValueError because an ROC curve needs at least two classes
        raise ValueError("Number of classes should be at least 2")

    finalize_plot()


# Example function call to plot ROC curve for a logistic regression classifier 'logistic'
plot_model_roc_curve(clf=logistic, features=X_test, true_labels=y_test)

Ideally, the best prediction model would give a point
in the top left corner (0, 1), indicating perfect classification (100% sensitivity and 100% specificity). The
diagonal line depicts a classifier that guesses at random. If your ROC curve lies above the diagonal, the
classifier performs better than random guessing, and the closer the curve gets to the top left corner, the
better the classifier. The plot above shows a near-perfect ROC curve.
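
The three reference cases for the area under the curve (perfect, uninformative, and perfectly wrong) can be checked with a short sketch. The labels and scores below are hypothetical and chosen only to hit those extremes; `roc_auc_score` is scikit-learn's function for computing the area directly from labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels: three negatives followed by three positives.
toy_true = np.array([0, 0, 0, 1, 1, 1])

# Every positive is scored above every negative: perfect ranking.
perfect_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
# The same score for every sample: no ranking information at all.
constant_scores = np.full(6, 0.5)
# Every positive is scored below every negative: perfectly wrong ranking.
reversed_scores = 1.0 - perfect_scores

auc_perfect = roc_auc_score(toy_true, perfect_scores)
auc_constant = roc_auc_score(toy_true, constant_scores)
auc_reversed = roc_auc_score(toy_true, reversed_scores)

print("Perfect ranking AUC: ", auc_perfect)   # 1.0
print("Constant scores AUC: ", auc_constant)  # 0.5
print("Reversed ranking AUC:", auc_reversed)  # 0.0
```

An AUC of 0.5 corresponds to the diagonal line, so values well above 0.5 indicate a classifier that ranks positives ahead of negatives more often than chance.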


## Clustering Model Evaluation Metrics

The lack of a validated ground truth, i.e. the absence of true labels in the data, makes the evaluation of clustering models (and unsupervised models in general) very difficult.


### Build two clustering models on the breast cancer dataset

We will use the breast cancer
dataset, available in the variables `X` (the data) and `y` (the observation labels). We will use the K-means
algorithm to fit two models on this data, one with two clusters and one with five clusters, and
then evaluate their performance.


In [None]:
# Import KMeans class from scikit-learn library
from sklearn.cluster import KMeans

# Initialize KMeans clustering with 2 clusters and a fixed random state for reproducibility, then fit it to the data X.
km2 = KMeans(n_clusters=2, random_state=42).fit(X)
# Get the cluster labels assigned by the KMeans model (km2) after fitting.
km2_labels = km2.labels_

# Initialize KMeans clustering with 5 clusters and a fixed random state for reproducibility, then fit it to the data X.
km5 = KMeans(n_clusters=5, random_state=42).fit(X)
# Get the cluster labels assigned by the KMeans model (km5) after fitting.
km5_labels = km5.labels_

### External validation

External validation means validating the clustering model when we have some ground truth available
as labeled data. The presence of external labels reduces most of the complexity of model evaluation as
the clustering (unsupervised) model can be validated in similar fashion to classification models.

Three popular metrics can be used in this scenario:

- **Homogeneity**: A clustering model prediction result satisfies homogeneity if all of
  its clusters contain only data points that are members of a single class (based on the
  true class labels).
- **Completeness**: A clustering model prediction result satisfies completeness if
  all the data points of a specific ground truth class label are also elements of the
  same cluster.
- **V-measure**: The harmonic mean of homogeneity and completeness scores gives us
  the V-measure value.

Each score is bounded between 0 and 1, and higher values are better. Let's compute these
metrics on our two K-means clustering models.
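
Before looking at the real data, a small sketch on hand-made labels (hypothetical, four samples in two classes) shows how the three scores react to different kinds of clustering mistakes. The cluster assignments below are assumptions chosen to isolate each behavior.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical ground truth: two classes with two samples each.
toy_true = [0, 0, 1, 1]

toy_clusterings = {
    # Same grouping as the truth (cluster ids differ, which does not matter).
    "perfect": [1, 1, 0, 0],
    # Every sample in its own cluster: pure clusters, but classes are split up.
    "over_split": [0, 1, 2, 3],
    # Everything in one cluster: classes stay together, but the cluster is mixed.
    "one_cluster": [0, 0, 0, 0],
}

toy_hcv = {}
for name, labels in toy_clusterings.items():
    toy_hcv[name] = homogeneity_completeness_v_measure(toy_true, labels)
    print("%-12s h=%.3f c=%.3f v=%.3f" % ((name,) + tuple(toy_hcv[name])))
```

Over-splitting keeps homogeneity at 1.0 but halves completeness, whereas a single cluster keeps completeness at 1.0 but drives homogeneity (and with it the V-measure) to 0.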


In [None]:
# Calculates the Homogeneity, Completeness, and V-measure metrics for KMeans clustering with 2 clusters and rounds the results to 3 decimal places.
km2_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km2_labels), 3)
# Calculates the Homogeneity, Completeness, and V-measure metrics for KMeans clustering with 5 clusters and rounds the results to 3 decimal places.
km5_hcv = np.round(metrics.homogeneity_completeness_v_measure(y, km5_labels), 3)

# Prints the Homogeneity, Completeness, and V-measure metrics for the KMeans clustering with 2 clusters.
print("Homogeneity, Completeness, V-measure metrics for num clusters=2: ", km2_hcv)
# Prints the Homogeneity, Completeness, and V-measure metrics for the KMeans clustering with 5 clusters.
print("Homogeneity, Completeness, V-measure metrics for num clusters=5: ", km5_hcv)

We can see that the V-measure for the first model with two clusters is better than that of the model with five
clusters, owing to its higher completeness score.


### Internal validation

Internal validation means validating a clustering model by defining metrics that capture the expected
behavior of a good clustering model. A good clustering model can be identified by two very desirable traits:

- Compact groups, i.e. the data points in one cluster lie close to each other.
- Well-separated groups, i.e. any two groups/clusters are as far apart from each
  other as possible.


#### Silhouette Coefficient

Silhouette coefficient is a metric that tries to combine the two requirements of a good clustering model. The
silhouette coefficient is defined for each sample and is a combination of its similarity to the data points in its
own cluster and its dissimilarity to the data points not in its cluster.

The silhouette coefficient is usually bounded between -1 (incorrect clustering) and +1 (excellent quality
dense clusters). A higher value of silhouette coefficient generally means that the clustering model is leading
to clusters that are dense and well separated and distinguishable from each other. Lower scores indicate
overlapping clusters.
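
For a single sample, the coefficient is commonly written as $s=\frac{b-a}{\max(a,b)}$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points of the nearest other cluster. The sketch below works this out by hand on hypothetical 1-D data and compares the result with scikit-learn's `silhouette_samples`; all `toy_` names are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical 1-D toy data forming two tight, well-separated groups.
toy_X = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])

idx = 0  # the sample whose coefficient we compute by hand

# a: mean distance from the sample to the other members of its own cluster.
own_mask = toy_labels == toy_labels[idx]
own_mask[idx] = False  # exclude the sample itself
a = np.mean(np.linalg.norm(toy_X[own_mask] - toy_X[idx], axis=1))

# b: mean distance from the sample to the members of the nearest other cluster
# (with only two clusters, that is simply the other cluster).
other_mask = toy_labels != toy_labels[idx]
b = np.mean(np.linalg.norm(toy_X[other_mask] - toy_X[idx], axis=1))

manual_s = (b - a) / max(a, b)
sklearn_s = silhouette_samples(toy_X, toy_labels, metric="euclidean")[idx]

print("a =", a, " b =", b)
print("Manual silhouette: ", manual_s)
print("silhouette_samples:", sklearn_s)
```

Here $a=1$ and $b=10.5$, giving a coefficient of about 0.905. `silhouette_score`, used in this tutorial, is the mean of these per-sample values.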


In [None]:
# Calculates the silhouette score for k-means clustering with 2 clusters using the 'euclidean' distance metric.
km2_silc = metrics.silhouette_score(X, km2_labels, metric="euclidean")
# Calculates the silhouette score for k-means clustering with 5 clusters using the 'euclidean' distance metric.
km5_silc = metrics.silhouette_score(X, km5_labels, metric="euclidean")

# Prints the silhouette coefficient calculated for the k-means model with 2 clusters.
print("Silhouette Coefficient for num clusters=2: ", km2_silc)
# Prints the silhouette coefficient calculated for the k-means model with 5 clusters.
print("Silhouette Coefficient for num clusters=5: ", km5_silc)

The metric results suggest that we have better
cluster quality with two clusters than with five clusters.


#### Calinski-Harabasz Index

The Calinski-Harabasz index is another metric that we can use to evaluate clustering models when the
ground truth is not known. The Calinski-Harabasz score is given as the ratio of the between-cluster
dispersion to the within-cluster dispersion.

A higher score normally indicates that the clusters are dense and well separated, which
relates to the general principles of clustering models.
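
A minimal sketch of this ratio on hypothetical 1-D data is shown below. It assumes the usual normalization, in which the between-cluster dispersion is divided by (k - 1) and the within-cluster dispersion by (n - k) for n samples and k clusters, and compares the hand computation with scikit-learn's `calinski_harabasz_score`.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Hypothetical 1-D toy data forming two tight, well-separated groups.
toy_X = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_labels = np.array([0, 0, 1, 1])

toy_n = len(toy_X)
toy_cluster_ids = np.unique(toy_labels)
toy_k = len(toy_cluster_ids)
toy_overall_mean = toy_X.mean(axis=0)

toy_between = 0.0  # size-weighted squared distances of centroids to the overall mean
toy_within = 0.0   # squared distances of points to their own centroid
for cluster_id in toy_cluster_ids:
    members = toy_X[toy_labels == cluster_id]
    centroid = members.mean(axis=0)
    toy_between += len(members) * np.sum((centroid - toy_overall_mean) ** 2)
    toy_within += np.sum((members - centroid) ** 2)

manual_ch = (toy_between / (toy_k - 1)) / (toy_within / (toy_n - toy_k))
sklearn_ch = calinski_harabasz_score(toy_X, toy_labels)

print("Between-cluster dispersion:", toy_between)
print("Within-cluster dispersion: ", toy_within)
print("Manual index: ", manual_ch)
print("scikit-learn: ", sklearn_ch)
```

The index has no fixed upper bound, so it is most useful for comparing clusterings of the same data rather than as an absolute quality score.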


In [None]:
# Calculate the Calinski-Harabasz Index for k-means clustering with 2 clusters.
km2_chi = metrics.calinski_harabasz_score(X, km2_labels)
# Calculate the Calinski-Harabasz Index for k-means clustering with 5 clusters.
km5_chi = metrics.calinski_harabasz_score(X, km5_labels)

# Print the Calinski-Harabasz Index for the clustering with 2 clusters.
print("Calinski-Harabasz Index for num clusters=2: ", km2_chi)
# Print the Calinski-Harabasz Index for the clustering with 5 clusters.
print("Calinski-Harabasz Index for num clusters=5: ", km5_chi)

We can see that both scores are high, with the result for five clusters being even higher. This
shows that relying on a single metric value is not sufficient: you should try multiple evaluation
methods, coupled with feedback from data scientists as well as domain experts.


## Conclusion

Through this tutorial, we've learned essential skills in model building and evaluation, including:

- How to properly evaluate classification models using multiple metrics
- Understanding the differences between internal and external clustering validation
- Practical application of evaluation metrics on real biomedical data
- Interpreting model performance through various visualization techniques

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
