# Evaluating Machine Learning Classification and Clustering Models

Machine learning models must be evaluated with appropriate metrics to ensure they perform well and generalize to new data. In this tutorial, we will demonstrate how to evaluate both classification and clustering models using the Iris dataset. We will cover common evaluation metrics, discuss pitfalls, and explain best practices.

We'll use **Python** and **scikit-learn** throughout this notebook (no TensorFlow).

## 1. Dataset and Setup

For this tutorial, we will use the classic **Iris dataset**. The Iris dataset contains 150 samples with 4 features (sepal length, sepal width, petal length, petal width) and 3 classes (Setosa, Versicolor, Virginica). It is balanced (50 samples per class) and is well-suited for both classification and clustering demonstrations.

In [1]:
from sklearn.datasets import load_iris
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data        # feature matrix (150 x 4)
y = iris.target      # labels (0, 1, 2 for the three species)

print("Features shape:", X.shape)
print("Label distribution:", np.bincount(y))
print("Class names:", iris.target_names)

Features shape: (150, 4)
Label distribution: [50 50 50]
Class names: ['setosa' 'versicolor' 'virginica']


## 2. Classification Model Evaluation

We will first train a simple **Logistic Regression** classifier on the Iris dataset. We'll split the data into training and testing sets (70% training, 30% testing), stratified by class so that both sets keep the balanced class proportions, and evaluate the classifier on the test set using several metrics:

- **Accuracy**
- **Precision**
- **Recall**
- **F1-score**
- **Confusion Matrix**
- **ROC Curve / AUC** (using one-vs-rest for one class)

Let's begin by training our classifier.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Train a Logistic Regression classifier
clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

In [3]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate evaluation metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

print(f"Accuracy: {acc:.3f}")
print(f"Precision (macro-average): {prec:.3f}")
print(f"Recall (macro-average): {rec:.3f}")
print(f"F1-score (macro-average): {f1:.3f}")

Accuracy: 0.933
Precision (macro-average): 0.935
Recall (macro-average): 0.933
F1-score (macro-average): 0.933


### Confusion Matrix

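A **confusion matrix** counts, for each true class, how many samples were assigned to each predicted class. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions and each off-diagonal entry shows which pair of classes the model confuses.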

Let's compute and print the confusion matrix for our classifier.

In [4]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)



Confusion Matrix:
 [[15  0  0]
 [ 0 14  1]
 [ 0  2 13]]
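
The macro-averaged scores reported earlier can be recovered by hand from this matrix. The sketch below copies the printed matrix into a NumPy array (the variable names are illustrative, not part of the original notebook) and derives per-class precision and recall from its column and row sums.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [15, 0, 0],
    [0, 14, 1],
    [0, 2, 13],
])

true_positives = np.diag(cm)

# Column sums = how often each class was predicted
precision_per_class = true_positives / cm.sum(axis=0)
# Row sums = how many samples truly belong to each class
recall_per_class = true_positives / cm.sum(axis=1)
accuracy = true_positives.sum() / cm.sum()

print("Per-class precision:", np.round(precision_per_class, 3))
print("Per-class recall:", np.round(recall_per_class, 3))
print(f"Macro precision: {precision_per_class.mean():.3f}")
print(f"Macro recall: {recall_per_class.mean():.3f}")
print(f"Accuracy: {accuracy:.3f}")
```

All three errors involve versicolor and virginica: one versicolor sample was predicted as virginica and two virginica samples as versicolor, while setosa is classified perfectly.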


### ROC Curve and AUC (for one class)

For binary classification, the ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate as the decision threshold varies, and the AUC (area under the curve) summarizes the curve in a single number. In multi-class scenarios, we can compute a ROC curve for one class using a one-vs-rest approach. Here, we'll compute the ROC curve and AUC for class **virginica** (label 2).

First, we need the predicted probabilities for each class.

In [6]:
from sklearn.metrics import roc_curve, roc_auc_score

# Get predicted probabilities for the test set
y_prob = clf.predict_proba(X_test)

# For class 'virginica' (label 2), create binary labels (1 if virginica, else 0)
y_test_binary = (y_test == 2).astype(int)
y_score = y_prob[:, 2]

fpr, tpr, thresholds = roc_curve(y_test_binary, y_score)
auc_score = roc_auc_score(y_test_binary, y_score)
print("AUC for virginica vs. rest:", auc_score)

AUC for virginica vs. rest: 0.9933333333333333
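
The same one-vs-rest idea extends to all three classes. The following is a minimal sketch, not part of the original notebook: it rebuilds the split and classifier so it runs on its own, computes one AUC per class, and compares their mean with the macro-averaged value that `roc_auc_score` returns when `multi_class="ovr"` is passed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Rebuild the same split and classifier as earlier so the block is self-contained
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=42
)
clf = LogisticRegression(max_iter=200, random_state=42).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)

# One-vs-rest AUC for each class in turn
per_class_auc = [
    roc_auc_score((y_test == k).astype(int), y_prob[:, k])
    for k in range(len(iris.target_names))
]
for name, auc in zip(iris.target_names, per_class_auc):
    print(f"AUC for {name} vs. rest: {auc:.3f}")

# Macro average over the three one-vs-rest problems
macro_auc = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")
print(f"Macro-averaged OvR AUC: {macro_auc:.3f}")
```

Reporting the per-class values alongside the average helps, because a single macro figure can hide one weak class.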


## 3. Clustering Model Evaluation

Now, let's evaluate clustering performance. We'll use **K-Means** clustering on the Iris dataset (ignoring the true labels during clustering) and then evaluate the clusters using internal and external metrics:

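- **Silhouette Score**: An internal metric that compares how close each sample is to its own cluster with how close it is to the nearest other cluster (values range from -1 to 1, higher is better).
- **Davies-Bouldin Index (DBI)**: An internal metric that averages, over all clusters, the similarity between each cluster and the one most similar to it (the minimum is 0, lower is better).
- **Adjusted Rand Index (ARI)**: An external metric that compares the clustering against the true labels, corrected for chance (1.0 means perfect agreement, values near 0 mean agreement no better than random labeling, and negative values are possible).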

Let's perform K-Means clustering and compute these metrics.

In [7]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

# Cluster the Iris data into 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Compute clustering metrics
sil_score = silhouette_score(X, cluster_labels)
db_index = davies_bouldin_score(X, cluster_labels)
ari = adjusted_rand_score(y, cluster_labels)  # External metric comparing to true labels

print(f"Silhouette Score: {sil_score:.3f}")
print(f"Davies-Bouldin Index: {db_index:.3f}")
print(f"Adjusted Rand Index (vs true labels): {ari:.3f}")

Silhouette Score: 0.551
Davies-Bouldin Index: 0.666
Adjusted Rand Index (vs true labels): 0.716
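
To see where an ARI below 1.0 comes from, it can help to cross-tabulate true species against cluster IDs. The sketch below is an illustration under stated assumptions rather than part of the original notebook: it sets `n_init=10` explicitly, so its numbers may differ slightly from the output above, and the `relabel` array is a hypothetical renaming of the clusters used only to show that ARI ignores cluster names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris = load_iris()
X, y = iris.data, iris.target

# n_init=10 is an assumption of this sketch, set explicitly so the result
# does not depend on the default of the installed scikit-learn version
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Rows = true species, columns = cluster IDs (the IDs themselves are arbitrary)
table = contingency_matrix(y, cluster_labels)
print("Species vs. cluster counts:\n", table)

# Renaming the clusters (0 -> 2, 1 -> 0, 2 -> 1) leaves ARI unchanged
relabel = np.array([2, 0, 1])
ari_original = adjusted_rand_score(y, cluster_labels)
ari_relabelled = adjusted_rand_score(y, relabel[cluster_labels])
print(f"ARI: {ari_original:.3f}, after renaming clusters: {ari_relabelled:.3f}")
```

Because cluster IDs carry no meaning, comparing them to class labels with accuracy would be misleading. ARI only asks whether pairs of samples are grouped together consistently.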


## 4. Common Pitfalls in Model Evaluation

Even with the right metrics, there are several common pitfalls in evaluating ML models:

### Class Imbalance
- **Issue:** High accuracy can be misleading if one class dominates the dataset, because a model that always predicts the majority class already scores well.
- **Solution:** Use metrics such as precision, recall, and F1-score, and analyze the confusion matrix.
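
Iris is balanced, so this pitfall does not show up in the results above. The sketch below uses synthetic labels (an illustrative 95/5 split, not real data) and a hypothetical "classifier" that always predicts the majority class, to show how accuracy and recall can disagree.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
)

# Synthetic, illustrative labels (not from Iris): 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A stand-in "classifier" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positive predictions exist
prec = precision_score(y_true, y_majority, zero_division=0)
rec = recall_score(y_true, y_majority, zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_majority)

print(f"Accuracy: {acc:.3f}")
print(f"Precision (positive class): {prec:.3f}")
print(f"Recall (positive class): {rec:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```

Accuracy comes out at 0.950 even though the model never finds a positive sample. Recall of 0.000 and balanced accuracy of 0.500 expose the problem.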

### Overfitting
- **Issue:** A model may perform excellently on training data but poorly on unseen data.
- **Solution:** Always evaluate on a hold-out test set or using cross-validation.
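
A minimal sketch of the cross-validation option, assuming 5 stratified folds (the fold count and shuffle seed are choices made for this example, not from the original notebook). It compares accuracy measured on the training data with accuracy estimated on held-out folds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200, random_state=42)

# Each fold is held out once while the model trains on the other four
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

# Accuracy on the data the model was fitted on, for comparison only
train_acc = clf.fit(X, y).score(X, y)

print(f"Training accuracy: {train_acc:.3f}")
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```

A large gap between the two numbers is a typical sign of overfitting. The spread across folds also indicates how much a single train/test split could vary.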

### Data Leakage
- **Issue:** Unintentional use of test data in training (e.g., via preprocessing) can inflate performance metrics.
- **Solution:** Ensure strict separation between training and testing data and fit preprocessing steps on training data only.
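
One common way to keep preprocessing inside the training data is to wrap it in a scikit-learn pipeline. The sketch below assumes a `StandardScaler` step, which the original notebook does not use and which is added here purely for illustration. With the pipeline, the scaler is re-fitted on the training portion of every fold and never sees the held-out samples.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling is part of the model, so cross-validation fits it on training folds only
pipeline = make_pipeline(
    StandardScaler(),  # illustrative preprocessing step, not in the original notebook
    LogisticRegression(max_iter=200, random_state=42),
)
leak_free_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Cross-validated accuracy without leakage: {leak_free_scores.mean():.3f}")
```

The leaky alternative would be to call `StandardScaler().fit_transform(X)` on the full dataset before splitting, which lets statistics from the test samples influence training.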

### Metric Misuse
- **Issue:** Relying on a single metric (like accuracy) or using the wrong metric for the problem can lead to misinterpretation.
- **Solution:** Choose metrics that align with the business goal or research question, and consider multiple evaluation perspectives (e.g., confusion matrices and ROC curves).

## 5. Conclusion

In this tutorial, we demonstrated how to evaluate machine learning models for both classification and clustering tasks using the Iris dataset. For classification, we covered accuracy, precision, recall, F1-score, the confusion matrix, and ROC/AUC. For clustering, we covered the silhouette score, the Davies-Bouldin index, and the Adjusted Rand Index. We also discussed common pitfalls: class imbalance, overfitting, data leakage, and metric misuse.

**Key Takeaways:**
- Always use multiple metrics to get a complete picture of model performance.
- Be cautious of pitfalls such as data leakage and overfitting.
- For clustering, rely on both internal metrics (like silhouette score) and external metrics (if labels are available) to assess performance.

Evaluating models correctly is critical for deploying reliable and trustworthy AI systems.