# Evaluating Machine-Learning Models

These lecture notes cover how to evaluate machine-learning models: how to split data into training, validation, and test sets, and what to watch for during evaluation.

## Training, Validation, and Test Sets

When developing and evaluating machine-learning models, split your dataset into three disjoint sets:

*   **Training Set:** Used to train the model. The model learns patterns and parameters from this data.
*   **Validation Set:** Used to tune hyperparameters and evaluate the model's performance during development. Because the model is never trained on it, a gap between training and validation performance reveals overfitting to the training data.
*   **Test Set:** Used for a final, unbiased evaluation of the model's performance after training and hyperparameter tuning are complete. This set should not be used during the training or validation phases.

A common split ratio is 70% for training, 15% for validation, and 15% for testing, but this can vary depending on the dataset size and problem.

Mathematically, if your dataset $D$ has $N$ samples, you can split it into:

*   $D_{train}$ with $N_{train}$ samples
*   $D_{val}$ with $N_{val}$ samples
*   $D_{test}$ with $N_{test}$ samples

where $N_{train} + N_{val} + N_{test} = N$.

The goal is to minimize a loss function $L$ on the training set:

$$ \min_{\theta} \frac{1}{N_{train}} \sum_{(x, y) \in D_{train}} L(f_{\theta}(x), y) $$

where $f_{\theta}$ is the model with parameters $\theta$, $x$ is the input, and $y$ is the true output.

The validation set is used to estimate the generalization error of the trained model during development, for example when comparing hyperparameter settings:

$$ \text{Validation Error} = \frac{1}{N_{val}} \sum_{(x, y) \in D_{val}} L(f_{\theta}(x), y) $$

The test set provides an unbiased estimate of the generalization error on unseen data, provided it played no part in training or model selection:

$$ \text{Test Error} = \frac{1}{N_{test}} \sum_{(x, y) \in D_{test}} L(f_{\theta}(x), y) $$
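
The three quantities above can be sketched in code. This is a minimal illustration, not a prescribed workflow: it assumes cross-entropy (scikit-learn's `log_loss`) as the loss $L$, logistic regression as $f_{\theta}$, and an arbitrary grid of regularization strengths `candidate_C` as the hyperparameter to tune. The helper `mean_loss` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% / 15% / 15% stratified split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

all_labels = np.unique(y)

def mean_loss(model, X_part, y_part):
    """Average cross-entropy loss over one split (hypothetical helper)."""
    return log_loss(y_part, model.predict_proba(X_part), labels=all_labels)

candidate_C = [0.01, 0.1, 1.0, 10.0]  # illustrative hyperparameter grid (assumption)
results = {}
for C in candidate_C:
    # Fitting minimizes the (regularized) average loss on the training set only
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    results[C] = {
        "model": model,
        "train": mean_loss(model, X_train, y_train),
        "val": mean_loss(model, X_val, y_val),
    }
    print(f"C={C}: train loss={results[C]['train']:.4f}, "
          f"validation loss={results[C]['val']:.4f}")

# Model selection uses the validation error, never the test error
best_C = min(results, key=lambda C: results[C]["val"])

# The test set is touched exactly once, after the choice is final
test_error = mean_loss(results[best_C]["model"], X_test, y_test)
print(f"Selected C={best_C}, test loss={test_error:.4f}")
```

Because `best_C` is chosen to make the validation loss small, that validation loss is an optimistic estimate for the selected model. This is why the test loss is computed separately and only once.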

In [None]:
import pandas as pd
from IPython.display import display  # available by default in notebooks; explicit for scripts
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X = iris.data
y = iris.target

# Display the head of the dataset
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['target'] = y
print("Head of the dataset:")
display(iris_df.head())

# Split data into training (70%), validation (15%), and test (15%) sets.
# stratify keeps the class proportions the same in every split.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}")

Head of the dataset:


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


Training set shape: (105, 4)
Validation set shape: (22, 4)
Test set shape: (23, 4)
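
One way to confirm that stratification worked is to count the classes in each split. The sketch below repeats the same split so that it runs on its own; the dictionary name `split_counts` is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same stratified 70/15/15 split as in the cell above
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Count how many samples of each class landed in each split
split_counts = {
    "train": np.bincount(y_train, minlength=3),
    "val": np.bincount(y_val, minlength=3),
    "test": np.bincount(y_test, minlength=3),
}

for name, counts in split_counts.items():
    proportions = np.round(counts / counts.sum(), 2)
    print(f"{name:>5}: counts={counts}, proportions={proportions}")
```

Iris has 50 samples per class, so each split should contain the three classes in nearly equal numbers; counts can differ by one where the split size is not divisible by three.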


## Things to Keep in Mind

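*   **Data Leakage:** Avoid using information from the validation or test sets during the training process. A common example is fitting a preprocessing step, such as a feature scaler, on the full dataset before splitting. Leakage leads to overly optimistic performance estimates.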
*   **Stratified Sampling:** For classification problems, ensure that the proportion of each class is maintained in the training, validation, and test sets. This is especially important for imbalanced datasets.
*   **Cross-Validation:** When dealing with limited data, k-fold cross-validation can be used to make better use of the available data for training and validation. The data is split into $k$ folds, and the model is trained $k$ times, each time using a different fold as the validation set and the remaining $k-1$ folds as the training set. The $k$ validation scores are then averaged.
*   **Evaluation Metrics:** Choose appropriate evaluation metrics based on the problem type (e.g., accuracy, precision, recall, F1-score for classification; MSE, RMSE, $R^2$ for regression).
*   **Baseline Models:** Always compare your model's performance against a simple baseline model (e.g., a random guess, a model predicting the most frequent class).
*   **Interpreting Results:** Don't just look at a single metric. Analyze the confusion matrix, look at per-class performance, and understand the model's strengths and weaknesses.
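
The data-leakage, stratification, and cross-validation points above can be combined in one short sketch. It assumes a `StandardScaler` preprocessing step (not used elsewhere in this notebook) to show how a scikit-learn `Pipeline` refits preprocessing inside each fold, and it assumes $k = 5$. The names `X_dev` and `y_dev` are introduced here for the non-test portion of the data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set first; cross-validation only sees the remaining data
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

# The scaler lives inside the pipeline, so in each fold it is fitted on the
# training folds only and never sees that fold's validation data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_dev, y_dev, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much the estimate depends on the particular split, which a single validation set cannot show.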

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train a simple model (logistic regression) on the training set only
model = LogisticRegression(max_iter=200)  # max_iter raised from the default of 100 so the solver converges
model.fit(X_train, y_train)

# Evaluate on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Test Accuracy: 1.0000

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00         8
           2       1.00      1.00      1.00         8

    accuracy                           1.00        23
   macro avg       1.00      1.00      1.00        23
weighted avg       1.00      1.00      1.00        23
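
To put a test score in context, it helps to compare it against a baseline and to inspect the confusion matrix. The following sketch assumes scikit-learn's `DummyClassifier` with the `most_frequent` strategy as the baseline, and repeats the split so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same stratified 70/15/15 split as earlier in the notebook
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
print("Confusion matrix:")
print(cm)
```

With only 23 test samples, each prediction moves accuracy by about 4.3 percentage points, so a score on this test set is a coarse estimate of generalization.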



## Summary and Conclusion

### Summary

This notebook covered the following points:

*   **Data Splitting:** Crucial for unbiased model evaluation (Training, Validation, Test sets).
*   **Avoiding Data Leakage:** Essential for reliable performance estimates.
*   **Handling Imbalance:** Stratified sampling helps maintain class proportions.
*   **Cross-Validation:** Useful for limited datasets to maximize data utility.
*   **Metric Selection:** Choose metrics appropriate for the problem type.
*   **Baseline Comparison:** Evaluate model performance against simple baselines.
*   **Interpreting Results:** Go beyond single metrics for a comprehensive understanding.

### Conclusion

Effective model evaluation is vital for building robust and reliable machine learning systems. By following best practices like proper data splitting, avoiding data leakage, and using appropriate metrics, we can gain confidence in our model's ability to generalize to unseen data and make informed decisions about model selection and improvement.