<img src="./images/banner.png" width="800">

# Model Evaluation Metrics

In machine learning, evaluating the performance of a model is crucial to understanding its effectiveness and making informed decisions about its deployment. Model evaluation metrics provide quantitative measures to assess how well a model performs on a given task. These metrics help us compare different models, tune hyperparameters, and select the best model for our specific problem.


Choosing the appropriate evaluation metric depends on the type of machine learning problem you are solving, such as regression, classification, clustering, or reinforcement learning. Each type of problem has its own set of evaluation metrics that are commonly used.


In this lecture, we will explore various evaluation metrics for different types of machine learning problems, including:

- Supervised Learning
  - Regression Metrics
  - Classification Metrics
- Unsupervised Learning
  - Clustering Metrics
  - Dimensionality Reduction Metrics
- Reinforcement Learning Metrics


We will also discuss model selection and validation techniques, such as cross-validation, to ensure the robustness and generalizability of our models.


By the end of this lecture, you will have a solid understanding of the importance of model evaluation metrics and how to select the appropriate metric for your specific machine learning problem. You will also gain practical knowledge of how to implement and interpret these metrics using code examples in Python.


**Table of contents**<a id='toc0_'></a>    
- [Supervised Learning Metrics](#toc1_)    
  - [Regression Metrics](#toc1_1_)    
    - [Mean Absolute Error (MAE)](#toc1_1_1_)    
    - [Mean Squared Error (MSE)](#toc1_1_2_)    
    - [Root Mean Squared Error (RMSE)](#toc1_1_3_)    
    - [R-squared (R²)](#toc1_1_4_)    
    - [Adjusted R-squared](#toc1_1_5_)    
  - [Classification Metrics](#toc1_2_)    
    - [Accuracy](#toc1_2_1_)    
    - [Precision](#toc1_2_2_)    
    - [Recall](#toc1_2_3_)    
    - [F1-score](#toc1_2_4_)    
    - [Confusion Matrix](#toc1_2_5_)    
    - [ROC Curve and AUC (Area Under the Curve)](#toc1_2_6_)    
- [Unsupervised Learning Metrics](#toc2_)    
  - [Clustering Metrics](#toc2_1_)    
    - [Silhouette Coefficient](#toc2_1_1_)    
    - [Davies-Bouldin Index](#toc2_1_2_)    
    - [Calinski-Harabasz Index](#toc2_1_3_)    
  - [Dimensionality Reduction Metrics](#toc2_2_)    
    - [Reconstruction Error](#toc2_2_1_)    
    - [Explained Variance](#toc2_2_2_)    
- [Reinforcement Learning Metrics](#toc3_)    
    - [Cumulative Reward](#toc3_1_1_)    
    - [Average Reward per Episode](#toc3_1_2_)    
    - [Average Q-value](#toc3_1_3_)    
- [Model Selection and Validation](#toc4_)    
    - [Cross-validation](#toc4_1_1_)    
    - [Stratified k-fold cross-validation](#toc4_1_2_)    
- [Conclusion](#toc5_)    
  - [Recap of the importance of model evaluation metrics](#toc5_1_)    
  - [Choosing the right metric for your problem](#toc5_2_)    
  - [Further reading and resources](#toc5_3_)    


## <a id='toc1_'></a>[Supervised Learning Metrics](#toc0_)

### <a id='toc1_1_'></a>[Regression Metrics](#toc0_)


Regression metrics are used to evaluate the performance of models that predict continuous values. These metrics measure how well the predicted values match the actual values. Let's explore some commonly used regression metrics:


<img src="./images/regression-metrics.png" width="800">

#### <a id='toc1_1_1_'></a>[Mean Absolute Error (MAE)](#toc0_)

MAE measures the average absolute difference between the predicted and actual values. It gives an idea of the average magnitude of the errors without considering their direction.


In [1]:
from sklearn.metrics import mean_absolute_error

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]

mae = mean_absolute_error(y_true, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

Mean Absolute Error: 0.18


#### <a id='toc1_1_2_'></a>[Mean Squared Error (MSE)](#toc0_)

MSE measures the average squared difference between the predicted and actual values. It penalizes larger errors more heavily than smaller ones.


In [2]:
from sklearn.metrics import mean_squared_error

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]

mse = mean_squared_error(y_true, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Mean Squared Error: 0.03


#### <a id='toc1_1_3_'></a>[Root Mean Squared Error (RMSE)](#toc0_)

RMSE is the square root of the MSE, which puts the error back in the same units as the target variable. Like MAE, it summarizes the typical magnitude of the errors, but because the errors are squared before averaging, large errors weigh more heavily.


In [3]:
from sklearn.metrics import mean_squared_error
import numpy as np

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.2f}")

Root Mean Squared Error: 0.18
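
To make the different sensitivity to large errors concrete, the following sketch recomputes MAE, MSE, and RMSE in plain Python directly from their definitions, then repeats the calculation with one badly missed prediction. The helper name `regression_errors` and the outlier value `8.0` are illustrative assumptions, not part of the lecture's data.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) computed directly from their definitions."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    return mae, mse, rmse


y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]          # same data as the cells above
y_pred_outlier = [1.2, 1.8, 3.2, 4.1, 8.0]  # hypothetical: last prediction is off by 3

mae, mse, rmse = regression_errors(y_true, y_pred)
mae_out, mse_out, rmse_out = regression_errors(y_true, y_pred_outlier)

print(f"clean:   MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
print(f"outlier: MAE={mae_out:.2f}  MSE={mse_out:.2f}  RMSE={rmse_out:.2f}")
```

With the single outlier, MAE grows roughly fourfold (0.18 to 0.74), while RMSE grows more than sevenfold (0.18 to 1.35), which is why RMSE is often preferred when large errors are especially costly.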


#### <a id='toc1_1_4_'></a>[R-squared (R²)](#toc0_)

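R-squared, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A value of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean of the target. As computed by `r2_score`, the value can also be negative when the model fits worse than that baseline.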


In [4]:
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]

r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2:.2f}")

R-squared: 0.98
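
A from-scratch sketch of the definition, R² = 1 − SS_res / SS_tot, can help when interpreting the score. It uses the lecture's data plus two hypothetical prediction vectors: one that always predicts the mean and one with the trend reversed. The function name `r_squared` is an assumption made for illustration; it is not scikit-learn's implementation.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]

good = r_squared(y_true, [1.2, 1.8, 3.2, 4.1, 4.8])  # data from the cell above
baseline = r_squared(y_true, [3, 3, 3, 3, 3])        # always predict the mean
reversed_fit = r_squared(y_true, [5, 4, 3, 2, 1])    # hypothetical, badly wrong model

print(f"good fit:       {good:.2f}")
print(f"mean baseline:  {baseline:.2f}")
print(f"reversed trend: {reversed_fit:.2f}")
```

The mean baseline scores exactly 0, and the reversed predictions score -3, showing that a sufficiently poor model produces a negative R².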


#### <a id='toc1_1_5_'></a>[Adjusted R-squared](#toc0_)

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model. It penalizes the addition of unnecessary predictors and provides a more accurate measure of the model's goodness of fit.


In [5]:
from sklearn.metrics import r2_score

y_true = [1, 2, 3, 4, 5]
y_pred = [1.2, 1.8, 3.2, 4.1, 4.8]
n_samples = len(y_true)
n_features = 1

r2 = r2_score(y_true, y_pred)
adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
print(f"Adjusted R-squared: {adjusted_r2:.2f}")

Adjusted R-squared: 0.98
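
Written as a formula, the calculation in the cell above is as follows, with $n$ the number of samples and $p$ the number of predictors (`n_samples` and `n_features` in the code):

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

For the example, $R^2 \approx 0.983$, $n = 5$, and $p = 1$ give $\bar{R}^2 \approx 1 - 0.017 \times \tfrac{4}{3} \approx 0.977$, which rounds to the 0.98 printed above. Note that `n_features = 1` is an assumption of the example itself, since the toy predictions do not come from a fitted model.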


These regression metrics provide different perspectives on the model's performance. MAE, MSE, and RMSE focus on the magnitude of the errors, while R-squared and adjusted R-squared measure the goodness of fit. It's important to consider multiple metrics and choose the ones that align with your specific problem and goals.

### <a id='toc1_2_'></a>[Classification Metrics](#toc0_)

Classification metrics are used to evaluate the performance of models that predict discrete class labels. These metrics measure how well the model distinguishes between different classes. Let's explore some commonly used classification metrics:


<img src="./images/classification-metrics.png" width="800">

#### <a id='toc1_2_1_'></a>[Accuracy](#toc0_)

Accuracy measures the proportion of correct predictions among the total number of predictions.


In [10]:
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.60


#### <a id='toc1_2_2_'></a>[Precision](#toc0_)

Precision measures the proportion of true positive predictions among all positive predictions. It answers the question: "Out of all instances predicted as positive, how many are actually positive?"


In [9]:
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")

Precision: 0.67


#### <a id='toc1_2_3_'></a>[Recall](#toc0_)

Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances. It answers the question: "Out of all actual positive instances, how many are correctly predicted as positive?"


In [8]:
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2f}")

Recall: 0.67


#### <a id='toc1_2_4_'></a>[F1-score](#toc0_)

The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of a model's performance, especially when the classes are imbalanced.


In [1]:
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

f1 = f1_score(y_true, y_pred)
print(f"F1-score: {f1:.2f}")

F1-score: 0.67


#### <a id='toc1_2_5_'></a>[Confusion Matrix](#toc0_)

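A confusion matrix is a tabular summary of the model's performance, showing the counts of true positive, true negative, false positive, and false negative predictions. In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, so for binary labels 0 and 1 the output reads `[[TN, FP], [FN, TP]]`.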


In [6]:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[1 1]
 [1 2]]
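
The four counts in the matrix are enough to recover accuracy, precision, recall, and F1. The sketch below counts them by hand for the same labels, treating class 1 as the positive class. It is plain Python for illustration only and assumes at least one predicted positive and one actual positive, so that no denominator is zero.

```python
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

# Count each cell of the binary confusion matrix (positive class = 1).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The printed values (0.60, 0.67, 0.67, 0.67) match the scikit-learn results in the preceding cells.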


#### <a id='toc1_2_6_'></a>[ROC Curve and AUC (Area Under the Curve)](#toc0_)

The ROC curve plots the true positive rate (recall) against the false positive rate at various threshold settings. AUC measures the area under the ROC curve, providing an aggregate measure of the model's performance across all possible classification thresholds.


In [7]:
from sklearn.metrics import roc_curve, auc

y_true = [1, 0, 1, 1, 0]
y_pred_proba = [0.8, 0.6, 0.7, 0.4, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)

print(f"AUC: {roc_auc:.2f}")

AUC: 0.83
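
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this interpretation on the same scores by comparing every positive/negative pair. It is an illustrative brute-force calculation, not how `roc_curve` and `auc` compute the value.

```python
from itertools import product

y_true = [1, 0, 1, 1, 0]
y_score = [0.8, 0.6, 0.7, 0.4, 0.3]  # same scores as y_pred_proba above

pos_scores = [s for t, s in zip(y_true, y_score) if t == 1]
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]

# Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos_scores, neg_scores)
)
auc_pairwise = wins / (len(pos_scores) * len(neg_scores))

print(f"Pairwise AUC: {auc_pairwise:.2f}")
```

Five of the six pairs are ranked correctly (only the positive scored 0.4 falls below the negative scored 0.6), giving 5/6 ≈ 0.83.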


These classification metrics provide different perspectives on the model's performance. Accuracy gives an overall measure of correctness, while precision and recall focus on the model's performance for the positive class. The F1-score combines precision and recall into a single metric. The confusion matrix provides a detailed breakdown of the model's predictions, and the ROC curve and AUC assess the model's performance across different classification thresholds.


It's important to consider the characteristics of your problem, such as class imbalance, and choose the metrics that align with your specific goals. Additionally, visualizing the ROC curve can provide insights into the trade-off between true positive rate and false positive rate at different thresholds.
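
A small sketch of why class imbalance matters: on a hypothetical test set with 95 negatives and 5 positives, a "model" that always predicts the majority class looks strong on accuracy while finding none of the positives. The data here is invented purely for illustration.

```python
# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_pos = sum(y_true)

accuracy = correct / len(y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy is 0.95 but recall is 0.00, so accuracy alone would hide the fact that the positive class is never detected.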

## <a id='toc2_'></a>[Unsupervised Learning Metrics](#toc0_)

### <a id='toc2_1_'></a>[Clustering Metrics](#toc0_)


Clustering metrics are used to evaluate the quality of clustering results in unsupervised learning. These metrics assess the compactness and separation of clusters, as well as the overall goodness of the clustering structure. Let's explore some commonly used clustering metrics:


<img src="./images/clustering-metrics.png" width="800">

#### <a id='toc2_1_1_'></a>[Silhouette Coefficient](#toc0_)

The Silhouette Coefficient measures how compact and how well separated the clusters are. For each point, it compares the mean distance to the other points in its own cluster (intra-cluster distance) with the mean distance to the points in the nearest neighboring cluster. The score ranges from -1 to 1: values near 1 indicate well-defined clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.


In [11]:
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = kmeans.labels_

silhouette_avg = silhouette_score(X, labels)
print(f"Silhouette Coefficient: {silhouette_avg:.2f}")

Silhouette Coefficient: 0.75
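
To see what goes into the score, here is a minimal from-scratch sketch of the silhouette calculation with Euclidean distance. The function name `silhouette`, the four toy points, and both label assignments are assumptions chosen so the arithmetic is easy to follow. It requires at least two clusters and is not a substitute for `silhouette_score`.

```python
import math


def silhouette(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # hypothetical toy data

good_score = silhouette(points, [0, 0, 1, 1])  # groups the nearby points together
bad_score = silhouette(points, [0, 1, 0, 1])   # pairs each point with a distant one

print(f"sensible labels: {good_score:.2f}")
print(f"poor labels:     {bad_score:.2f}")
```

The sensible assignment scores about 0.90, while the poor assignment scores about -0.45, illustrating how negative values flag points that sit closer to another cluster than to their own.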


#### <a id='toc2_1_2_'></a>[Davies-Bouldin Index](#toc0_)

The Davies-Bouldin Index measures the average similarity between each cluster and the cluster most similar to it, where similarity is defined as the ratio of within-cluster distances to between-cluster distances. The minimum value is 0, and a lower Davies-Bouldin Index indicates better clustering, with well-separated and compact clusters.


In [12]:
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = kmeans.labels_

db_index = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Index: {db_index:.2f}")

Davies-Bouldin Index: 0.28


#### <a id='toc2_1_3_'></a>[Calinski-Harabasz Index](#toc0_)

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, measures the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index indicates better-defined clusters, with good separation between clusters and compactness within clusters.


In [13]:
from sklearn.metrics import calinski_harabasz_score
from sklearn.cluster import KMeans

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = kmeans.labels_

ch_index = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Index: {ch_index:.2f}")

Calinski-Harabasz Index: 35.59
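
Both the Davies-Bouldin and Calinski-Harabasz indices are built from cluster centroids and distances to them. The sketch below computes each from its definition on a hypothetical four-point dataset where the result can be checked by hand. The helper names are illustrative assumptions, and the code is meant to expose the arithmetic rather than replace the scikit-learn functions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    clusters = group_by_label(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(
        len(members) * math.dist(centroid(members), overall) ** 2
        for members in clusters.values()
    )
    within = sum(
        math.dist(p, centroid(members)) ** 2
        for members in clusters.values()
        for p in members
    )
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = group_by_label(points, labels)
    centers = {label: centroid(members) for label, members in clusters.items()}
    # scatter: mean distance of a cluster's points to its own centroid
    scatter = {
        label: sum(math.dist(p, centers[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # hypothetical toy data
labels = [0, 0, 1, 1]

ch_value = calinski_harabasz(points, labels)
db_value = davies_bouldin(points, labels)
print(f"Calinski-Harabasz: {ch_value:.1f}")
print(f"Davies-Bouldin:    {db_value:.2f}")
```

Here the centroids are 10 units apart and every point is 0.5 from its centroid, so the Calinski-Harabasz value is 200.0 (high is good) and the Davies-Bouldin value is 0.10 (low is good).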


These clustering metrics provide different perspectives on the quality of the clustering results. The Silhouette Coefficient considers both the compactness and separation of clusters, giving an overall measure of the clustering structure. The Davies-Bouldin Index focuses on the similarity between clusters, penalizing clusters that are too close to each other. The Calinski-Harabasz Index measures the ratio of between-cluster dispersion to within-cluster dispersion, favoring well-separated and compact clusters.


It's important to note that these metrics have their own assumptions and limitations. They may be sensitive to the shape, size, and density of clusters, as well as the presence of noise or outliers. Therefore, it's recommended to use multiple metrics and visualize the clustering results to gain a comprehensive understanding of the clustering quality.


Additionally, the interpretation of these metrics may vary depending on the specific problem and domain. It's crucial to consider the characteristics of your data and the goals of your clustering analysis when selecting and interpreting clustering metrics.

### <a id='toc2_2_'></a>[Dimensionality Reduction Metrics](#toc0_)

Dimensionality reduction techniques aim to reduce the number of features while preserving the essential information in the data. Evaluating the quality of dimensionality reduction results is important to ensure that the reduced representation captures the relevant patterns and minimizes information loss. Let's explore two commonly used dimensionality reduction metrics:


<img src="./images/dimensionality-reduction-metrics.png" width="800">

#### <a id='toc2_2_1_'></a>[Reconstruction Error](#toc0_)

Reconstruction error measures the difference between the original data and the reconstructed data after dimensionality reduction. It quantifies the information loss incurred during the reduction process. A lower reconstruction error indicates better preservation of the original data.


In [19]:
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
import numpy as np

X = np.random.rand(100, 20)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_reconstructed = pca.inverse_transform(X_reduced)

reconstruction_error = mean_squared_error(X, X_reconstructed)
print(f"Reconstruction Error: {reconstruction_error:.2f}")

Reconstruction Error: 0.07


In this example, we use Principal Component Analysis (PCA) to reduce the dimensionality of the data from 20 features to 2 components. We then reconstruct the original data from the reduced representation and calculate the reconstruction error using mean squared error. Because `X` is generated with `np.random.rand` without a fixed seed, the exact value will vary slightly between runs.


#### <a id='toc2_2_2_'></a>[Explained Variance](#toc0_)

Explained variance measures the proportion of the total variance in the data that is captured by each principal component. It helps determine the number of components required to retain a desired level of information.


In [20]:
from sklearn.decomposition import PCA

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
pca = PCA(n_components=3)
pca.fit(X)

explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.2f}")

Explained Variance Ratio:
Principal Component 1: 1.00
Principal Component 2: 0.00
Principal Component 3: 0.00


In this example, we apply PCA to the data and calculate the explained variance ratio for each principal component. The explained variance ratio indicates the proportion of the total variance explained by each component. We can see that the first principal component captures essentially 100% of the variance, while the second and third components capture none. This is expected: every row of `X` is the previous row shifted by (3, 3, 3), so all the points lie on a single straight line.


These dimensionality reduction metrics provide insights into the quality of the reduced representation. Reconstruction error quantifies the information loss during the reduction process, with a lower error indicating better preservation of the original data. Explained variance helps determine the number of components needed to retain a desired level of information, allowing us to strike a balance between dimensionality reduction and information preservation.
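
One common way to turn explained variance into a decision is to keep the smallest number of components whose cumulative ratio reaches a target. The sketch below shows that rule in plain Python. The function name `components_needed` and the ratio values are hypothetical; in practice the list would come from `pca.explained_variance_ratio_`.

```python
from itertools import accumulate


def components_needed(explained_variance_ratio, target):
    """Smallest number of leading components whose ratios sum to at least `target`."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    return len(explained_variance_ratio)  # target not reachable: keep everything


# Hypothetical ratios, sorted in decreasing order as PCA reports them.
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]

k_90 = components_needed(ratios, target=0.90)
k_95 = components_needed(ratios, target=0.95)
print(f"components for 90% variance: {k_90}")
print(f"components for 95% variance: {k_95}")
```

With these ratios, three components reach 90% and four are needed for 95%. scikit-learn's `PCA` can apply a similar rule itself when `n_components` is given as a float between 0 and 1 with `svd_solver='full'`.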


It's important to consider the specific requirements of your problem when selecting dimensionality reduction techniques and evaluating their performance. The choice of the number of components depends on factors such as the desired level of information retention, computational efficiency, and interpretability of the reduced representation.


Additionally, visualizing the reduced data using techniques like scatter plots or heatmaps can provide qualitative insights into the structure and patterns captured by the dimensionality reduction process.


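Remember that dimensionality reduction is often used as an exploratory technique, and the metrics discussed here are just a few examples. Other measures, such as the silhouette score computed on the reduced data or the trustworthiness score (`sklearn.manifold.trustworthiness`) for neighborhood-preserving embeddings like t-SNE, can also be used depending on the specific algorithm and the nature of the data.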

## <a id='toc3_'></a>[Reinforcement Learning Metrics](#toc0_)

Reinforcement learning involves an agent learning to make decisions by interacting with an environment to maximize a reward signal. Evaluating the performance of reinforcement learning algorithms requires specialized metrics that capture the agent's learning progress and the quality of its decision-making. Let's explore some commonly used reinforcement learning metrics:


#### <a id='toc3_1_1_'></a>[Cumulative Reward](#toc0_)

Cumulative reward is the sum of rewards obtained by the agent over a certain number of time steps or episodes. It measures the overall performance of the agent in accumulating rewards throughout its interaction with the environment.


```python
import gym
import numpy as np

# Assumes the classic Gym API (gym < 0.26): reset() returns only the observation
# and step() returns (observation, reward, done, info).
def evaluate_cumulative_reward(env, agent, num_episodes):
    cumulative_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            episode_reward += reward
            state = next_state
        cumulative_rewards.append(episode_reward)
    return cumulative_rewards

env = gym.make('CartPole-v1')
agent = YourRLAgent()  # Replace with your actual agent implementation
num_episodes = 100

cumulative_rewards = evaluate_cumulative_reward(env, agent, num_episodes)
print(f"Cumulative Rewards: {cumulative_rewards}")
print(f"Average Cumulative Reward: {np.mean(cumulative_rewards):.2f}")
```


In this example, we define a function `evaluate_cumulative_reward` that evaluates the cumulative reward of an agent over a specified number of episodes. The agent interacts with the environment, and the rewards obtained in each episode are accumulated. Finally, we calculate the average cumulative reward across all episodes.
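
Because the evaluation loop depends on an installed environment and a trained agent, it can be useful to try it first on a stand-in whose result is known in advance. The sketch below uses a hypothetical `CountdownEnv` and `AlternatingAgent`, both invented here, that mimic the `reset`/`step`/`get_action` interface assumed by the lecture's code. It needs no external packages.

```python
class CountdownEnv:
    """Hypothetical stand-in for a Gym environment (classic 4-tuple step API).

    Each episode lasts `horizon` steps; action 1 earns reward 1.0, action 0 earns 0.0.
    """

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the current time step

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done, {}


class AlternatingAgent:
    """Hypothetical fixed policy: action 1 on even time steps, 0 on odd ones."""

    def get_action(self, state):
        return 1 if state % 2 == 0 else 0


def evaluate_cumulative_reward(env, agent, num_episodes):
    cumulative_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0.0
        while not done:
            action = agent.get_action(state)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
        cumulative_rewards.append(episode_reward)
    return cumulative_rewards


rewards = evaluate_cumulative_reward(CountdownEnv(horizon=10), AlternatingAgent(), num_episodes=3)
average = sum(rewards) / len(rewards)
print(f"Cumulative rewards: {rewards}")
print(f"Average cumulative reward: {average:.2f}")
```

The agent collects a reward on five of the ten steps, so every episode totals 5.0 and the average is 5.00. A deterministic check like this helps confirm that the bookkeeping is right before moving to a real environment.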


#### <a id='toc3_1_2_'></a>[Average Reward per Episode](#toc0_)
Average reward per episode is the mean of the total reward the agent collects in an episode, taken over a set of evaluation episodes. It condenses the per-episode totals into a single number that is easy to compare across agents or training checkpoints.


```python
import gym
import numpy as np

# Assumes the classic Gym API (gym < 0.26): reset() returns only the observation
# and step() returns (observation, reward, done, info).
def evaluate_average_reward(env, agent, num_episodes):
    episode_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            episode_reward += reward
            state = next_state
        episode_rewards.append(episode_reward)
    return np.mean(episode_rewards)

env = gym.make('CartPole-v1')
agent = YourRLAgent()  # Replace with your actual agent implementation
num_episodes = 100

average_reward = evaluate_average_reward(env, agent, num_episodes)
print(f"Average Reward per Episode: {average_reward:.2f}")
```


In this example, we define a function `evaluate_average_reward` that evaluates the average reward per episode. The agent interacts with the environment for a specified number of episodes, and the rewards obtained in each episode are recorded. Finally, we calculate the average reward across all episodes.


#### <a id='toc3_1_3_'></a>[Average Q-value](#toc0_)
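Average Q-value is a metric specific to value-based methods such as Q-learning. It measures the average estimated Q-value of the actions taken by the agent, that is, the return the agent expects from its own choices. Rising average Q-values during training suggest that the agent expects higher returns, but they are estimates rather than observed rewards, so they are best read alongside the reward-based metrics above.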


```python
import gym
import numpy as np

# Assumes the classic Gym API (gym < 0.26): reset() returns only the observation
# and step() returns (observation, reward, done, info).
def evaluate_average_q_value(env, agent, num_episodes):
    q_values = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.get_action(state)
            q_value = agent.get_q_value(state, action)
            q_values.append(q_value)
            next_state, _, done, _ = env.step(action)
            state = next_state
    return np.mean(q_values)

env = gym.make('CartPole-v1')
agent = YourQLearningAgent()  # Replace with your actual Q-learning agent implementation
num_episodes = 100

average_q_value = evaluate_average_q_value(env, agent, num_episodes)
print(f"Average Q-value: {average_q_value:.2f}")
```


In this example, we define a function `evaluate_average_q_value` that evaluates the average Q-value of the actions taken by the agent. The agent interacts with the environment for a specified number of episodes, and the Q-values of the selected actions are recorded. Finally, we calculate the average Q-value across all recorded values.


These reinforcement learning metrics provide different perspectives on the agent's performance. Cumulative reward captures the overall performance in accumulating rewards, average reward per episode measures the agent's performance on a per-episode basis, and average Q-value is specific to Q-learning algorithms and indicates the quality of the learned action-value estimates.


It's important to note that these metrics are just a few examples, and the choice of metrics depends on the specific reinforcement learning algorithm and the problem at hand. Other metrics, such as the number of steps per episode, success rate, or time to reach a goal state, can also be used to evaluate the agent's performance.


When interpreting these metrics, it's crucial to consider the characteristics of the environment, the complexity of the task, and the specific goals of the reinforcement learning problem. Comparing the metrics across different agents or algorithmic variations can provide insights into their relative performance and help in selecting the most suitable approach for the given task.

## <a id='toc4_'></a>[Model Selection and Validation](#toc0_)

Model selection and validation are crucial steps in the machine learning workflow to ensure the generalization performance of models and prevent overfitting. These techniques help in selecting the best model architecture and hyperparameters and in assessing the model's performance on unseen data. Let's explore two commonly used model selection and validation techniques:


#### <a id='toc4_1_1_'></a>[Cross-validation](#toc0_)

Cross-validation is a technique that involves splitting the data into multiple subsets, training and evaluating the model on different combinations of these subsets, and aggregating the results to obtain a more robust performance estimate. The most common variant is k-fold cross-validation.


<img src="./images/k-fold.png" width="800">

In [21]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1, random_state=42)

scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average cross-validation score: {np.mean(scores):.2f}")

Cross-validation scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]
Average cross-validation score: 0.98


In this example, we use the `cross_val_score` function from scikit-learn to perform 5-fold cross-validation on a support vector machine (SVM) classifier. The data is split into 5 folds; in each round the model is trained on 4 folds and evaluated on the remaining one. The score for each fold is reported, along with the average cross-validation score. When the estimator is a classifier and `cv` is an integer, `cross_val_score` already uses stratified folds (without shuffling).


Cross-validation helps in assessing the model's performance on different subsets of the data, providing a more reliable estimate of its generalization ability.
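
The splitting step itself is simple. As a minimal sketch, the generator below produces the train and test indices for k consecutive, unshuffled folds, spreading any remainder over the first folds. The name `k_fold_indices` is an assumption for illustration; in real code `sklearn.model_selection.KFold` does this job and also supports shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for fold, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold}: test={test} train={train}")
```

With 10 samples and 3 folds, the test sets have sizes 4, 3, and 3, and every sample appears in exactly one test set.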


#### <a id='toc4_1_2_'></a>[Stratified k-fold cross-validation](#toc0_)

Stratified k-fold cross-validation is a variant of k-fold cross-validation that ensures the class distribution in each fold is representative of the overall class distribution in the dataset. This is particularly useful when dealing with imbalanced datasets.


In [22]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
clf = SVC(kernel='linear', C=1, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    scores.append(score)

print(f"Stratified k-fold cross-validation scores: {scores}")
print(f"Average stratified k-fold cross-validation score: {np.mean(scores):.2f}")

Stratified k-fold cross-validation scores: [1.0, 1.0, 0.9333333333333333, 1.0, 1.0]
Average stratified k-fold cross-validation score: 0.99


In this example, we use the `StratifiedKFold` class from scikit-learn to perform stratified 5-fold cross-validation. The `split` method is used to generate the train and test indices for each fold, ensuring that the class distribution is maintained. We iterate over the folds, train the model on the training set, make predictions on the test set, and calculate the accuracy score for each fold. Finally, we report the scores for each fold and the average stratified k-fold cross-validation score.


Stratified k-fold cross-validation is particularly useful when the class distribution is imbalanced, as it ensures that each fold contains a representative proportion of samples from each class.
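
The idea behind stratification can be sketched in a few lines: group the sample indices by class, then deal each class out across the folds in turn. This is a simplified illustration with a hypothetical helper name and invented labels. It does no shuffling and is not the algorithm `StratifiedKFold` uses internally.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, k):
    """Assign each sample to one of k folds, dealing every class out round-robin."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            fold_of[index] = position % k
    return fold_of


# Hypothetical imbalanced labels: 8 samples of class 0 and 4 of class 1.
labels = [0] * 8 + [1] * 4
fold_of = stratified_fold_assignment(labels, k=4)

fold_counts = []
for fold in range(4):
    class_counts = Counter(label for label, f in zip(labels, fold_of) if f == fold)
    fold_counts.append(dict(class_counts))
    print(f"fold {fold}: {dict(class_counts)}")
```

Every fold receives two samples of class 0 and one of class 1, preserving the 2:1 ratio of the full dataset.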


These model selection and validation techniques help in assessing the model's performance, comparing different models, and selecting the best one for the given task. They provide a more reliable estimate of the model's generalization ability and help in preventing overfitting.


It's important to note that the choice of the number of folds (k) depends on the size of the dataset and the computational resources available. A common choice is k=5 or k=10, but it can be adjusted based on the specific requirements of the problem.


Additionally, other model selection techniques, such as hold-out validation, leave-one-out cross-validation, or nested cross-validation, can be used depending on the characteristics of the dataset and the goals of the analysis.

## <a id='toc5_'></a>[Conclusion](#toc0_)

In this lecture, we have explored the importance of model evaluation metrics in assessing the performance of machine learning models. We covered metrics for supervised learning (regression and classification), unsupervised learning (clustering and dimensionality reduction), and reinforcement learning, along with cross-validation techniques for model selection and validation.


### <a id='toc5_1_'></a>[Recap of the importance of model evaluation metrics](#toc0_)

Model evaluation metrics provide quantitative measures to assess how well a model performs on a given task. They help in comparing different models, selecting the best model for the problem at hand, and making informed decisions about model deployment. Without proper evaluation metrics, it would be challenging to determine the effectiveness and reliability of machine learning models.


Evaluation metrics allow us to:
- Assess the model's performance on unseen data
- Compare different models and select the best one
- Identify areas for improvement and optimize the model
- Communicate the model's performance to stakeholders
- Monitor the model's performance over time


### <a id='toc5_2_'></a>[Choosing the right metric for your problem](#toc0_)

Choosing the appropriate evaluation metric is crucial for effectively assessing the model's performance. The choice of metric depends on the specific problem, the characteristics of the data, and the goals of the analysis.


For supervised learning problems:
- Regression metrics, such as MAE, MSE, RMSE, and R-squared, are used to evaluate the model's ability to predict continuous values.
- Classification metrics, such as accuracy, precision, recall, F1-score, and ROC AUC, are used to assess the model's performance in predicting discrete class labels.
- For multiclass and multilabel classification, per-class scores such as precision, recall, and F1 are combined into a single number using micro-, macro-, or weighted averaging.


For unsupervised learning problems:
- Clustering metrics, such as Silhouette Coefficient, Davies-Bouldin Index, and Calinski-Harabasz Index, are used to evaluate the quality of clustering results.
- Dimensionality reduction metrics, such as reconstruction error and explained variance, are used to assess the quality of the reduced representation.


For reinforcement learning problems:
- Metrics such as cumulative reward, average reward per episode, and average Q-value are used to evaluate the agent's learning progress and decision-making quality.


It's important to consider multiple metrics and understand their strengths and limitations to gain a comprehensive understanding of the model's performance.


### <a id='toc5_3_'></a>[Further reading and resources](#toc0_)

To dive deeper into model evaluation metrics and their applications, here are some recommended resources:

- Scikit-learn documentation on model evaluation: [https://scikit-learn.org/stable/modules/model_evaluation.html](https://scikit-learn.org/stable/modules/model_evaluation.html)
- "Evaluating Machine Learning Models" by Alice Zheng: [https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/](https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/)
- "Machine Learning Mastery" blog by Jason Brownlee: [https://machinelearningmastery.com/](https://machinelearningmastery.com/)
- "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron: [https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)


Remember, model evaluation is an iterative process. It involves selecting appropriate metrics, assessing the model's performance, and refining the model based on the insights gained. By understanding and applying the right evaluation metrics, you can make informed decisions and build reliable and effective machine learning models.