# Machine Learning Fundamentals: Supervised vs. Unsupervised Learning

**Objective:** Understand the difference between supervised and unsupervised learning, and learn how to evaluate models using core metrics.

**Author:** ML Fundamentals Tutorial  
**Level:** Intern/Beginner  
**Date:** 2025

## Table of Contents
1. [Introduction](#introduction)
2. [Setup & Imports](#setup)
3. [Datasets](#datasets)
4. [Supervised Learning Example](#supervised)
5. [Model Evaluation Metrics](#metrics)
6. [Unsupervised Learning Example](#unsupervised)
7. [Conclusions](#conclusions)

## 1. Introduction <a id='introduction'></a>

### What is Supervised Learning?

**Supervised learning** involves training a model on labeled data, where each input has a known output (target/label). The model learns to map inputs to outputs and can then predict labels for new, unseen data.

**Examples:**
- **Classification:** Predicting discrete categories (e.g., spam vs. not spam, disease diagnosis)
- **Regression:** Predicting continuous values (e.g., house prices, temperature)

### What is Unsupervised Learning?

**Unsupervised learning** works with unlabeled data. The model tries to find patterns, structure, or groupings in the data without any predefined labels.

**Examples:**
- **Clustering:** Grouping similar data points (e.g., customer segmentation, image compression)
- **Dimensionality Reduction:** Reducing feature space while preserving information (e.g., PCA, t-SNE)

### Train/Validation/Test Split

When building machine learning models, we split data into:
- **Training set:** Used to train the model (typically 60-80%)
- **Validation set:** Used to tune hyperparameters (optional, 10-20%)
- **Test set:** Used to evaluate final model performance (10-20%)

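**Why?** Evaluating on data the model never saw during training exposes **overfitting** (the model memorizes training data instead of learning general patterns) and gives an honest estimate of how well the model generalizes to new data.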
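
The three-way split described above can be sketched with two calls to `train_test_split`. This is a minimal illustration on synthetic data; the 60/20/20 proportions and the `_demo` variable names are assumptions chosen for the example and are not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, balanced binary labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Step 1: hold out 20% of the data as the final test set
X_temp, X_test_demo, y_temp, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Step 2: split the remaining 80% into train (75% of it) and validation (25% of it),
# which gives 60% / 20% of the original data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```

The test set is split off first so that nothing done during training or hyperparameter tuning can touch it.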

### Why Model Evaluation Matters

Evaluation metrics help us:
1. Assess model performance objectively
2. Compare different models
3. Understand where the model succeeds or fails
4. Make informed decisions about model deployment

## 2. Setup & Imports <a id='setup'></a>

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report,
    adjusted_rand_score
)

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All libraries imported successfully!")

## 3. Datasets <a id='datasets'></a>

We'll use scikit-learn's built-in datasets:
- **Breast Cancer Dataset:** For supervised classification (binary: malignant vs. benign)
- **Iris Dataset:** For unsupervised clustering (3 flower species)

In [None]:
# Load Breast Cancer dataset for supervised learning
cancer_data = datasets.load_breast_cancer()
X_cancer = cancer_data.data
y_cancer = cancer_data.target

print("Breast Cancer Dataset:")
print(f"  Samples: {X_cancer.shape[0]}")
print(f"  Features: {X_cancer.shape[1]}")
print(f"  Classes: {cancer_data.target_names}")
print(f"  Class distribution: {np.bincount(y_cancer)}\n")

# Load Iris dataset for unsupervised learning
iris_data = datasets.load_iris()
X_iris = iris_data.data
y_iris_true = iris_data.target  # We'll use this only for evaluation

print("Iris Dataset:")
print(f"  Samples: {X_iris.shape[0]}")
print(f"  Features: {X_iris.shape[1]}")
print(f"  Classes: {iris_data.target_names}")
print(f"  Class distribution: {np.bincount(y_iris_true)}")

## 4. Supervised Learning Example <a id='supervised'></a>

### Classification Task: Breast Cancer Diagnosis

We'll build a binary classifier to predict whether a tumor is **malignant (0)** or **benign (1)** based on cell measurements.

In [None]:
# Split data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, 
    test_size=0.2, 
    random_state=RANDOM_STATE,
    stratify=y_cancer  # Maintain class distribution in splits
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Training class distribution: {np.bincount(y_train)}")
print(f"Testing class distribution: {np.bincount(y_test)}")

In [None]:
# Train a Logistic Regression model
log_reg = LogisticRegression(max_iter=10000, random_state=RANDOM_STATE)
log_reg.fit(X_train, y_train)

# Make predictions on test set
y_pred = log_reg.predict(X_test)

print("✓ Model trained successfully!")
print(f"First 10 predictions: {y_pred[:10]}")
print(f"First 10 true labels: {y_test[:10]}")

## 5. Model Evaluation Metrics <a id='metrics'></a>

### Key Metrics Explained

#### 1. Accuracy
**Definition:** Proportion of correct predictions out of total predictions.

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

**When to use:** Good for balanced datasets where all classes are equally important.

**Limitation:** Misleading with imbalanced classes (e.g., if 95% of the data is class A, always predicting A gives 95% accuracy while never identifying a single sample of the other class).
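
The 95% scenario can be checked by hand. The sketch below uses made-up labels (0 for class A, 1 for class B) and a "model" that always predicts class A; the names are hypothetical and chosen only for this illustration.

```python
# Hypothetical imbalanced labels: 95 samples of class A (0) and 5 of class B (1)
y_true_demo = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred_demo = [0] * 100

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy_demo = correct / len(y_true_demo)

# Recall for the minority class: how many of the 5 class-B samples were found
found_b = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
recall_b = found_b / sum(t == 1 for t in y_true_demo)

print(f"Accuracy: {accuracy_demo:.2f}")       # 0.95
print(f"Recall for class B: {recall_b:.2f}")  # 0.00
```

The accuracy looks strong even though the model never finds a single class-B sample.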

#### 2. Precision
**Definition:** Of all positive predictions, how many were actually correct?

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$

**When to use:** When **false positives are costly** (e.g., spam detection—you don't want legitimate emails marked as spam).

#### 3. Recall (Sensitivity)
**Definition:** Of all actual positives, how many did we correctly identify?

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$

**When to use:** When **false negatives are costly** (e.g., disease diagnosis—missing a sick patient is critical).

#### 4. Confusion Matrix
A table showing true vs. predicted labels:

|                | Predicted Negative | Predicted Positive |
|----------------|--------------------|--------------------|
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |
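
The four cells of the table map directly onto the accuracy, precision, and recall formulas above. As a small worked example, the sketch below counts each cell by hand for ten made-up predictions (1 = positive, 0 = negative); the data and the `_demo` names are assumptions for illustration only.

```python
# Hypothetical labels: 1 = positive, 0 = negative
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true_demo, y_pred_demo))

# Count each cell of the confusion matrix
tp_demo = sum(t == 1 and p == 1 for t, p in pairs)  # actual positive, predicted positive
tn_demo = sum(t == 0 and p == 0 for t, p in pairs)  # actual negative, predicted negative
fp_demo = sum(t == 0 and p == 1 for t, p in pairs)  # actual negative, predicted positive
fn_demo = sum(t == 1 and p == 0 for t, p in pairs)  # actual positive, predicted negative

accuracy_demo = (tp_demo + tn_demo) / len(pairs)
precision_demo = tp_demo / (tp_demo + fp_demo)
recall_demo = tp_demo / (tp_demo + fn_demo)

print(f"TP={tp_demo} TN={tn_demo} FP={fp_demo} FN={fn_demo}")  # TP=4 TN=4 FP=1 FN=1
print(f"Accuracy={accuracy_demo:.2f} Precision={precision_demo:.2f} Recall={recall_demo:.2f}")
# Accuracy=0.80 Precision=0.80 Recall=0.80
```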

#### Precision-Recall Tradeoff
- **High Precision, Low Recall:** Model is conservative (only predicts positive when very confident)
- **Low Precision, High Recall:** Model is aggressive (predicts positive more liberally)
- Balance depends on the problem domain and cost of errors
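
One common way this tradeoff shows up is through the decision threshold applied to a model's predicted probabilities. The sketch below uses made-up scores and labels, and a hypothetical helper `precision_recall_at`, to show precision falling and recall rising as the threshold is lowered. It illustrates the idea and is not taken from the tutorial's model.

```python
# Hypothetical predicted probabilities for the positive class, with true labels
scores_demo = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels_demo = [1, 1, 0, 1, 1, 0, 1, 0]


def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A high threshold is conservative; a low threshold is aggressive
for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores_demo, labels_demo)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.40
# threshold=0.50  precision=0.80  recall=0.80
# threshold=0.25  precision=0.71  recall=1.00
```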

In [None]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')

print("=" * 50)
print("SUPERVISED LEARNING - MODEL PERFORMANCE")
print("=" * 50)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f} ({precision*100:.2f}%)")
print(f"Recall:    {recall:.4f} ({recall*100:.2f}%)")
print("=" * 50)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Display detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=cancer_data.target_names))

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=cancer_data.target_names,
            yticklabels=cancer_data.target_names,
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Breast Cancer Classification', fontsize=14, fontweight='bold')
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

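# Extract confusion matrix values for interpretation
# Note: label 1 (benign) is the positive class here, so "negatives" are
# malignant tumors and "positives" are benign ones
tn, fp, fn, tp = cm.ravel()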
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (TN):  {tn}")
print(f"  False Positives (FP): {fp}")
print(f"  False Negatives (FN): {fn}")
print(f"  True Positives (TP):  {tp}")

### Interpretation of Results

**Model Performance Commentary:**

1. **Accuracy:** Our model correctly classifies most samples. However, accuracy alone doesn't tell the full story, especially in medical diagnosis.

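2. **Precision:** scikit-learn treats label 1 (benign) as the positive class, so high precision means that when the model predicts "benign," it is usually correct. A false positive here is a malignant tumor labeled benign, which is the most dangerous kind of error in this setting.

3. **Recall:** High recall means we catch most actual benign cases. In medical diagnosis, we care most about not missing malignant cases (the negative class here), so also check the recall reported for the malignant class in the classification report above.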

4. **Class Imbalance Considerations:**
   - The Breast Cancer dataset is relatively balanced (~63% benign, ~37% malignant)
   - In highly imbalanced scenarios (e.g., 99:1 ratio), accuracy becomes less meaningful
   - We'd rely more on precision, recall, F1-score, or area under ROC curve
   - Consider using class weights or resampling techniques for severe imbalance
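
Because the dataset encodes malignant as 0, the default `recall_score` reports recall for the benign class. A minimal sketch of how recall for the malignant class could be requested instead, using the `pos_label` parameter on made-up labels (the arrays below are hypothetical, not the tutorial's predictions):

```python
from sklearn.metrics import recall_score

# Hypothetical labels using the same encoding as the dataset: 0 = malignant, 1 = benign
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 1, 1]  # one malignant case is missed

# Default: recall for the positive class (label 1, benign)
recall_benign = recall_score(y_true_demo, y_pred_demo)

# pos_label=0 asks for recall on the malignant class instead
recall_malignant = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(f"Recall (benign):    {recall_benign:.2f}")     # 1.00
print(f"Recall (malignant): {recall_malignant:.2f}")  # 0.67
```

The same numbers appear per class in `classification_report`, so this is mainly useful when a single value is needed for monitoring or model selection.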

## 6. Unsupervised Learning Example <a id='unsupervised'></a>

### Clustering Task: Iris Species Grouping

We'll use **K-Means clustering** to group iris flowers based on their measurements, **without using the species labels**.

In [None]:
# Apply K-Means clustering (k=3 since we know there are 3 species)
# In real unsupervised scenarios, we'd use techniques like the elbow method to find k
kmeans = KMeans(n_clusters=3, random_state=RANDOM_STATE, n_init=10)
cluster_labels = kmeans.fit_predict(X_iris)

print("✓ K-Means clustering completed!")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")
print(f"Cluster distribution: {np.bincount(cluster_labels)}")
print(f"\nFirst 10 cluster assignments: {cluster_labels[:10]}")
print(f"First 10 true species labels: {y_iris_true[:10]}")
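
The comment in the cell above mentions the elbow method for choosing k. A minimal sketch of that idea follows: fit K-Means for several values of k and compare the inertia. The range of k values and the `_demo` names are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X_demo = datasets.load_iris().data

# Fit K-Means for several candidate values of k and record the inertia
# (sum of squared distances from each point to its nearest centroid)
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_demo)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia generally keeps shrinking as k grows, so the value to look for is the "elbow" where the improvement levels off, not the minimum.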

### Evaluating Unsupervised Learning

**Challenge:** In true unsupervised learning, we don't have labels, so traditional metrics (accuracy, precision, recall) don't apply directly.

**Approaches:**
1. **Internal metrics:** Silhouette score, inertia (sum of squared distances to cluster centers)
2. **External metrics (when labels available for validation):** Adjusted Rand Index, Normalized Mutual Information
3. **Visual inspection:** Plot clusters in 2D/3D space

**Important Note:** We're using true labels here only for educational purposes to see how well clusters align with actual species. In production, you wouldn't have these labels!
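
As an example of an internal metric, the sketch below computes the silhouette score for a K-Means clustering of the Iris features. It needs only the features and the cluster assignments, so it would also work when no labels exist. The `_demo` variable names are hypothetical.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo = datasets.load_iris().data

# Cluster the features without looking at any species labels
labels_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)

# Silhouette score ranges from -1 to 1; higher values mean tighter,
# better-separated clusters
score_demo = silhouette_score(X_demo, labels_demo)
print(f"Silhouette score (k=3): {score_demo:.3f}")
```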

In [None]:
# Compare clusters to true labels using Adjusted Rand Index
# ARI measures similarity between two clusterings (1 = perfect match, ~0 = chance-level agreement, negative = worse than chance)
ari_score = adjusted_rand_score(y_iris_true, cluster_labels)

print("=" * 50)
print("UNSUPERVISED LEARNING - CLUSTER EVALUATION")
print("=" * 50)
print(f"Adjusted Rand Index: {ari_score:.4f}")
print("  (1.0 = perfect clustering, 0.0 = random clustering)")
print("=" * 50)

# Create a cross-tabulation to see cluster-to-species mapping
print("\nCluster vs. True Species Cross-Tabulation:")
crosstab = pd.crosstab(
    pd.Series(y_iris_true, name='True Species'),
    pd.Series(cluster_labels, name='Cluster'),
    margins=True
)
# Map numeric labels to species names
crosstab.index = [iris_data.target_names[i] if isinstance(i, (int, np.integer)) else i 
                  for i in crosstab.index]
print(crosstab)

In [None]:
# Reduce to 2D using PCA for visualization
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_iris_2d = pca.fit_transform(X_iris)

print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")
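
To judge how many components a 2D plot would need to be faithful, one option is to look at the cumulative explained variance across all components. A minimal sketch, assuming the same unscaled Iris features; `pca_full` and `cumulative` are names introduced here for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_demo = datasets.load_iris().data

# Fit PCA with all components to see how variance accumulates
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for n, total in enumerate(cumulative, start=1):
    print(f"{n} component(s): {total:.1%} of variance")
```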

In [None]:
# Visualize clusters in 2D space
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Clusters from K-Means
scatter1 = axes[0].scatter(X_iris_2d[:, 0], X_iris_2d[:, 1], 
                           c=cluster_labels, cmap='viridis', 
                           s=50, alpha=0.6, edgecolors='black')
axes[0].scatter(pca.transform(kmeans.cluster_centers_)[:, 0], 
                pca.transform(kmeans.cluster_centers_)[:, 1],
                c='red', marker='X', s=200, edgecolors='black', 
                label='Centroids', linewidths=2)
axes[0].set_title('K-Means Clusters (Unsupervised)', fontsize=14, fontweight='bold')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# Plot 2: True species labels (for comparison)
scatter2 = axes[1].scatter(X_iris_2d[:, 0], X_iris_2d[:, 1], 
                           c=y_iris_true, cmap='viridis', 
                           s=50, alpha=0.6, edgecolors='black')
axes[1].set_title('True Species Labels (Ground Truth)', fontsize=14, fontweight='bold')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
axes[1].grid(True, alpha=0.3)
cbar = plt.colorbar(scatter2, ax=axes[1], ticks=[0, 1, 2])
cbar.set_label('Species')
cbar.ax.set_yticklabels(iris_data.target_names)

plt.tight_layout()
plt.show()

### Interpretation of Clustering Results

**Key Observations:**

1. **Visual Comparison:** By comparing the two plots, we can see how well K-Means discovered the natural groupings in the data without any label information.

2. **Adjusted Rand Index:** This score tells us the agreement between clusters and true species. A high score indicates K-Means found meaningful patterns that align with biological species.

3. **Limitations of Applying Supervised Metrics:**
   - **Cluster labels are arbitrary:** Cluster 0 might correspond to species 2, cluster 1 to species 0, etc.
   - **No notion of "correct" in pure unsupervised learning:** We can't calculate accuracy/precision/recall without ground truth labels
   - **Real-world scenario:** In production, you wouldn't have true labels, so you'd rely on:
     - Domain expertise to interpret clusters
     - Internal validation metrics (silhouette score, Davies-Bouldin index)
     - Business objectives (e.g., customer segments that lead to higher sales)

4. **PCA Visualization:** The 2D projection captures ~98% of the variance, so the plot is a faithful picture of the structure in the original four-dimensional data.
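
To make the point about arbitrary cluster labels concrete, the sketch below compares a made-up clustering that groups the points exactly like the true classes but numbers the groups differently. The arrays are hypothetical and chosen only for illustration.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical true classes and cluster IDs that describe the SAME grouping,
# but with the cluster IDs numbered differently
true_demo = [0, 0, 1, 1, 2, 2]
clusters_demo = [2, 2, 0, 0, 1, 1]

# Comparing the raw IDs as if they were class labels is misleading
naive_accuracy = accuracy_score(true_demo, clusters_demo)

# ARI compares which points are grouped together, so renumbering does not matter
ari_demo = adjusted_rand_score(true_demo, clusters_demo)

print(f"Naive accuracy: {naive_accuracy:.2f}")  # 0.00
print(f"ARI:            {ari_demo:.2f}")        # 1.00
```

This is why the tutorial evaluates the clusters with ARI and a cross-tabulation instead of accuracy.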

## 7. Conclusions <a id='conclusions'></a>

### Summary of Key Learnings

#### Supervised vs. Unsupervised Learning

| Aspect | Supervised Learning | Unsupervised Learning |
|--------|---------------------|----------------------|
| **Data** | Labeled (input-output pairs) | Unlabeled (input only) |
| **Goal** | Predict labels for new data | Find hidden patterns/structure |
| **Examples** | Classification, Regression | Clustering, Dimensionality Reduction |
| **Evaluation** | Accuracy, Precision, Recall, etc. | Silhouette score, ARI (if labels available), visual inspection |

#### Model Evaluation Best Practices

1. **Always evaluate on a held-out test set** to detect overfitting
2. **Don't rely on accuracy alone**, especially with imbalanced data
3. **Choose metrics based on business context:**
   - Medical diagnosis → High recall (don't miss sick patients)
   - Spam detection → High precision (don't block legitimate emails)
4. **Use confusion matrix** to understand types of errors
5. **Consider the precision-recall tradeoff** for your specific use case

#### When to Use Each Approach

**Use Supervised Learning when:**
- You have labeled training data
- You want to predict specific outcomes
- Examples: fraud detection, image classification, price prediction

**Use Unsupervised Learning when:**
- You don't have labels or labeling is expensive
- You want to explore data structure
- Examples: customer segmentation, anomaly detection, data compression

### Next Steps for Further Learning

1. **Explore more algorithms:**
   - Supervised: SVM, Neural Networks, Gradient Boosting
   - Unsupervised: DBSCAN, Hierarchical Clustering, Autoencoders

2. **Advanced topics:**
   - Cross-validation for robust evaluation
   - Hyperparameter tuning (grid search and randomized search, e.g., `GridSearchCV` and `RandomizedSearchCV` in scikit-learn)
   - Feature engineering and selection
   - Handling imbalanced datasets (SMOTE, class weights)

3. **Practice with real datasets:**
   - Kaggle competitions
   - UCI Machine Learning Repository
   - Your own domain-specific problems

---

**Congratulations!** You've completed this introduction to supervised and unsupervised learning. Keep practicing and experimenting with different datasets and algorithms!

In [None]:
# Quick reference: All metrics at a glance
print("=" * 60)
print("FINAL SUMMARY - METRICS AT A GLANCE")
print("=" * 60)
print("\nSUPERVISED LEARNING (Breast Cancer Classification):")
print(f"  Model: Logistic Regression")
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  Test samples: {len(y_test)}")
print("\nUNSUPERVISED LEARNING (Iris Clustering):")
print(f"  Algorithm: K-Means (k=3)")
print(f"  Adjusted Rand Index: {ari_score:.4f}")
print(f"  PCA variance explained: {pca.explained_variance_ratio_.sum():.2%}")
print(f"  Total samples: {len(cluster_labels)}")
print("=" * 60)