## Introduction

This project explores visual concept organization in the PatternMind dataset, which contains 25,557 images distributed across 233 categories. Our primary objective is understanding and organization rather than maximizing classification accuracy: we aim to uncover patterns, clusters, and relationships among visual categories in feature space.

Our framework proceeds in four stages: (1) we train a custom CNN to extract 256-dimensional feature embeddings that capture the visual essence of each image; (2) we apply PCA for dimensionality reduction, retaining 85% of variance in 50 components; (3) we use hierarchical clustering on category centroids to discover data-driven macro-categories that reveal natural visual themes without predefined semantic labels; (4) we evaluate K-Means clustering against both fine-grained (233) and macro-category labels using Purity, ARI, and NMI metrics to assess unsupervised organization quality at different granularities.

To establish performance baselines we also compare three supervised classifiers (Logistic Regression, SVM, and Random Forest) on the learned features. Our key findings demonstrate that CNN features encode meaningful visual structure: hierarchical clustering reveals coherent groupings such as space objects (saturn, mars, comet), birds (duck, goose, hummingbird), and vehicles, while simple linear classifiers achieve 40.9% accuracy on 233 classes, far above the 0.43% random baseline.

## 1. Data Loading and Initial Exploration

In [None]:
# Standard library
import os
from collections import Counter

# Numerics, data handling, and image I/O
import numpy as np
import pandas as pd
from PIL import Image

# Plotting
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Deep learning
import keras
import tensorflow as tf
from keras import layers, regularizers
from tensorflow.data import AUTOTUNE

# Classical ML, clustering, and statistics
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import pdist, squareform
from scipy.stats import ttest_rel
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             confusion_matrix, normalized_mutual_info_score,
                             precision_recall_fscore_support, silhouette_score)
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

### 1.1 Load Image Paths and Labels

We load all images from the dataset directory where each subfolder represents a distinct category. The code below creates a DataFrame containing the file path, category name, and numeric label for each image.

In [None]:
dataset_path = "db/patternmind_dataset/"

# Get all category folders (each subfolder is a class)
categories = sorted([d for d in os.listdir(dataset_path) if os.path.isdir(
    os.path.join(dataset_path, d)) and not d.startswith('.')])

# Build a list of all images with their metadata
data = []
for label_id, category in enumerate(categories):
    folder_path = os.path.join(dataset_path, category)
    images = [f for f in os.listdir(
        folder_path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

    for img in images:
        data.append({
            'path': os.path.join(folder_path, img),
            'category': category,
            'label_id': label_id
        })

df = pd.DataFrame(data)

### 1.2 Dataset Statistics and Quality Check

The code below checks for missing values, duplicates and computes summary statistics about the class distribution.

In [None]:
# Check for missing values and data quality
print(f"Missing values in DataFrame: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Detailed statistics
counts = df['category'].value_counts()
print(f"\nTotal images: {len(df)}")
print(f"Total categories: {len(categories)}")
print(f"Average images per category: {len(df) / len(categories):.2f}")
print(f"Min images per category: {counts.min()}")
print(f"Max images per category: {counts.max()}")
print(f"Median images per category: {counts.median()}")
print(f"Std deviation: {counts.std():.2f}")
print("\nEdge cases (fewest images):")
print(counts.tail())
print("\nEdge cases (most images):")
print(counts.head())

# Class imbalance analysis
imbalance_ratio = counts.max() / counts.min()
print(f"\nClass imbalance ratio (max/min): {imbalance_ratio:.2f}")

The output above shows:

- No missing values or duplicates
- 25,557 total images across 233 categories
- Average of ~110 images per category, but with significant variation
- Class imbalance ratio of 10.57x between the most and least frequent categories

## 2. Exploratory Data Analysis (EDA)

### 2.1 Class Distribution Analysis

We first examine how images are distributed across categories.

In [None]:
fig_dist = px.bar(counts, x=counts.index, y=counts.values,
                  title="Class Distribution",
                  labels={'index': 'Category', 'y': 'Count'})
fig_dist.update_layout(xaxis={'categoryorder': 'total descending'})
fig_dist.show()

The bar chart reveals significant class imbalance in the dataset:

- The most frequent category ("clutter") has 761 images
- The least frequent category ("top-hat") has only 72 images
- This is a roughly 10.6x spread in class sizes, matching the imbalance ratio computed in Section 1.2
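
The figures quoted so far can be cross-checked with plain arithmetic. The sketch below hard-codes the totals reported in this notebook instead of reading the dataset, and the variable names are ours. It also derives two trivial reference accuracies, assuming evaluation over the full dataset: guessing uniformly at random and always predicting the largest class.

```python
# Figures quoted in the text; hard-coded so the check runs without the dataset.
total_images = 25_557
num_categories = 233
largest_class = 761   # "clutter"
smallest_class = 72   # "top-hat"

mean_per_category = total_images / num_categories
imbalance_ratio = largest_class / smallest_class

# Accuracy of two trivial classifiers, useful as reference points later on.
uniform_guess_accuracy = 1 / num_categories             # pick a class at random
majority_class_accuracy = largest_class / total_images  # always predict "clutter"

print(f"Mean images per category: {mean_per_category:.2f}")
print(f"Imbalance ratio (max/min): {imbalance_ratio:.2f}")
print(f"Uniform random guess: {uniform_guess_accuracy:.2%}")
print(f"Majority-class guess: {majority_class_accuracy:.2%}")
```

Because of the imbalance, the majority-class reference (about 3%) is noticeably higher than the uniform one (about 0.43%), so it is the stricter of the two trivial baselines.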

This imbalance matters for both supervised classification and unsupervised clustering. During CNN training (Section 3) we use data augmentation to help the model generalize across all categories regardless of frequency. For clustering evaluation (Section 6) we report three complementary metrics: Purity (the share of images that fall in the majority class of their cluster), the Adjusted Rand Index (ARI, which corrects for chance agreement and therefore penalizes trivial solutions such as assigning each point to its own cluster), and Normalized Mutual Information (NMI, which measures the information shared between the K-Means clusters and the reference labels, whether fine-grained categories or hierarchical macro-categories).
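
To make these metrics concrete, here is a minimal sketch on a made-up toy labeling. ARI and NMI come from scikit-learn; `purity_score` is a hypothetical helper of ours (scikit-learn has no built-in purity function) and assumes non-negative integer labels.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def purity_score(true_labels, cluster_labels):
    """Fraction of points that belong to the majority true class of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    majority_total = 0
    for cluster_id in np.unique(cluster_labels):
        members = true_labels[cluster_labels == cluster_id]
        # Count how many members share the most common true label.
        majority_total += np.bincount(members).max()
    return majority_total / len(true_labels)


# Toy example: 6 points, 2 true classes, 2 clusters with one misplaced point.
y_true = [0, 0, 0, 1, 1, 1]
y_clusters = [0, 0, 1, 1, 1, 1]

print(f"Purity: {purity_score(y_true, y_clusters):.3f}")
print(f"ARI:    {adjusted_rand_score(y_true, y_clusters):.3f}")
print(f"NMI:    {normalized_mutual_info_score(y_true, y_clusters):.3f}")

# Purity alone rewards over-segmentation: one cluster per point is "perfectly pure".
singletons = list(range(len(y_true)))
print(f"Purity with singleton clusters: {purity_score(y_true, singletons):.3f}")
print(f"ARI with singleton clusters:    {adjusted_rand_score(y_true, singletons):.3f}")
```

The singleton case shows why we do not rely on Purity alone: it reaches 1.0 for a useless clustering, while ARI drops to 0.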

### 2.2 Image Dimension Analysis

Before feeding images to a neural network, we need to understand the variation in image sizes. Neural networks require fixed-size inputs, so we must choose an appropriate target resolution. We analyze a sample of 5,000 images to understand the distribution of widths and heights.

In [None]:
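# Sample images to analyze dimension variability
sample_size = 5000
sample_df = df.sample(n=sample_size)
sizes = []
for p in sample_df['path']:
    # Use a context manager so each file handle is closed after reading its size
    with Image.open(p) as img:
        sizes.append(img.size)
widths, heights = zip(*sizes)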

# Create side-by-side histograms for width and height distributions
fig_sizes = make_subplots(rows=1, cols=2, subplot_titles=(
    "Width Distribution", "Height Distribution"))
fig_sizes.add_trace(go.Histogram(x=widths, name="Width"), row=1, col=1)
fig_sizes.add_trace(go.Histogram(x=heights, name="Height"), row=1, col=2)
fig_sizes.update_layout(
    title_text=f"Image Size Distribution (Sample n={sample_size})")
fig_sizes.show()

The histograms show that image dimensions vary considerably:

- Most images have widths and heights concentrated around 200-400 pixels
- Some images are very small (thumbnails) while others are much larger

We chose 224×224 resolution (the ImageNet standard) because it is large enough to preserve important visual details like edges, textures, and shapes, while remaining computationally efficient by reducing memory usage and training time.
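
As a rough illustration of what this choice implies, the sketch below resizes a synthetic image with Pillow and computes the memory footprint of the resulting tensor. The random image stands in for a dataset file, and the Pillow call is only for illustration: in the actual pipeline the resizing is done by the Keras loader in Section 3.1.

```python
import numpy as np
from PIL import Image

TARGET_SIZE = (224, 224)  # (width, height), the resolution chosen above

# Synthetic 300x200 RGB image standing in for a dataset file (assumption).
rng = np.random.default_rng(0)
original = Image.fromarray(rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8))

# Resize without preserving aspect ratio, then scale pixels to [0, 1].
resized = original.resize(TARGET_SIZE, Image.BILINEAR)
pixels = np.asarray(resized, dtype=np.float32) / 255.0

bytes_per_image = pixels.nbytes         # 224 * 224 * 3 values * 4 bytes each
bytes_per_batch = bytes_per_image * 32  # batch size used later in the notebook

print(f"Original size (W, H): {original.size}")
print(f"Resized array shape:  {pixels.shape}")
print(f"Memory per image: {bytes_per_image / 1024:.0f} KiB")
print(f"Memory per batch of 32: {bytes_per_batch / 1024**2:.1f} MiB")
```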

### 2.3 Visual Sample Inspection

To understand the visual diversity in our dataset we display random samples from five different categories. This helps us appreciate the challenges our models will face: varying backgrounds, lighting conditions, object orientations and image quality.

In [None]:
sample_categories = np.random.choice(categories, 5, replace=False)
fig_imgs = make_subplots(rows=1, cols=5,
                         subplot_titles=list(sample_categories))

for i, cat in enumerate(sample_categories):
    img_path = df[df['category'] == cat].sample(1).iloc[0]['path']
    # Convert to RGB so grayscale or palette images still yield an (H, W, 3) array
    with Image.open(img_path) as img:
        img_array = np.array(img.convert('RGB'))
    fig_imgs.add_trace(go.Image(z=img_array), row=1, col=i+1)

fig_imgs.update_layout(title_text="Sample Images from Random Categories")
fig_imgs.show()

The random samples reveal substantial visual diversity within and across categories.

To address this variability, we use data augmentation (random flips, rotations, zoom, translations) so the model sees different orientations and scales of the same objects. The CNN’s stacked convolutional layers then learn hierarchical features that stay robust to these variations.

## 3. CNN Feature Extraction

We use a CNN for feature extraction because these networks learn representations at multiple levels of complexity. Early layers detect simple patterns like edges and textures while deeper layers recognize more complex features like shapes and objects. The layer just before the final classification output contains a rich and compressed representation of the image that captures its essential visual characteristics. These learned features work much better for clustering than using raw pixel values directly.

### 3.1 Data Preprocessing and Augmentation

Before training the CNN, we standardize inputs and apply light augmentation:

- Resize to 224×224: fixed input shape
- Normalize to [0, 1]: scale pixels for faster convergence
- Random horizontal flip: handle left/right orientation
- Random rotation (factor 0.2, i.e. up to ±20% of a full turn, or ±72°): cover tilt
- Random zoom (±20%): cover scale changes
- Random translation (±15% height/width): cover shifts
- Random contrast (±20%): handle lighting variations
- Random brightness (±20%): handle exposure variations
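
The Keras preprocessing layers take unitless factors rather than degrees or pixels. Assuming the documented Keras semantics (the `RandomRotation` factor is a fraction of a full turn, and the `RandomTranslation` and `RandomZoom` factors are fractions of the image size), the sketch below converts the settings used in this notebook into concrete ranges for a 224×224 input. The variable names are ours.

```python
import math

IMG_SIZE = 224  # pixels per side after resizing

# Factors as passed to the Keras layers in this notebook.
rotation_factor = 0.2
translation_factor = 0.15
zoom_factor = 0.2

# RandomRotation: factor is a fraction of a full turn (2*pi radians).
max_rotation_rad = rotation_factor * 2 * math.pi
max_rotation_deg = rotation_factor * 360

# RandomTranslation: factor is a fraction of the image height/width.
max_shift_px = translation_factor * IMG_SIZE

print(f"Rotation: up to ±{max_rotation_deg:.0f}° (±{max_rotation_rad:.2f} rad)")
print(f"Translation: up to ±{max_shift_px:.1f} px per axis")
print(f"Zoom: between {-zoom_factor:+.0%} and {zoom_factor:+.0%}")
```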

We split 80% for training and 20% for validation. In this project, we use the validation set both for early stopping during CNN training and for reporting final performance. For a production system requiring extensive hyperparameter tuning, a three-way split (train/validation/test) would prevent overfitting to the validation set. However, given that our CNN training takes approximately 5 hours and our primary objective is feature extraction for clustering rather than achieving peak classification accuracy, we trained the model only once with carefully chosen hyperparameters. This makes the two-way split acceptable for our use case, as we avoid the computational cost of multiple training runs.
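
For orientation, the split sizes can be estimated with simple arithmetic. This sketch assumes that the loader picks up all 25,557 files and that the validation count is the truncated product of the split fraction and the file count; both are assumptions on our side, so the counts Keras prints when loading take precedence.

```python
import math

total_images = 25_557
validation_split = 0.2
batch_size = 32

# Assumption: the validation count is the truncated product, the rest is training.
num_val = int(validation_split * total_images)
num_train = total_images - num_val

train_batches = math.ceil(num_train / batch_size)
val_batches = math.ceil(num_val / batch_size)

print(f"Training images:   {num_train} ({train_batches} batches per epoch)")
print(f"Validation images: {num_val} ({val_batches} batches)")
```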

In [None]:
# type: ignore
IMG_SIZE = (224, 224)
BATCH_SIZE = 32

# Data augmentation: random flips, rotations, zooms, translations, contrast, brightness
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(0.2, 0.2),
    layers.RandomTranslation(0.15, 0.15),
    layers.RandomContrast(0.2),
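    # Inputs are already scaled to [0, 1], so tell the layer the value range
    # (its default assumes pixel values in [0, 255])
    layers.RandomBrightness(0.2, value_range=(0.0, 1.0)),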
], name='augmentation')

# Load training and validation datasets (80/20 split)
train_ds = keras.utils.image_dataset_from_directory(
    dataset_path,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)

val_ds = keras.utils.image_dataset_from_directory(
    dataset_path,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)

# Apply augmentation to training data and normalize both datasets
train_ds = train_ds.map(lambda x, y: (x / 255.0, y),
                        num_parallel_calls=AUTOTUNE)
train_ds = train_ds.cache()
train_ds = train_ds.map(lambda x, y: (data_augmentation(x, training=True), y))
train_ds = train_ds.shuffle(1000).prefetch(AUTOTUNE)

val_ds = val_ds.map(lambda x, y: (x / 255.0, y), num_parallel_calls=AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)

### 3.2 CNN Model Architecture

We design a custom CNN with three convolutional blocks to extract visual features from images.

| Layer Type    | Details                                                                                       | Purpose                                        |
| ------------- | --------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| Conv Block 1  | 32 filters × 2 layers, 3×3 kernel, SAME pad, L2=1e-4, BatchNorm, MaxPool 2×2                  | Detect low-level features (edges, colors)      |
| Conv Block 2  | 64 filters × 2 layers, 3×3 kernel, SAME pad, L2=1e-4, BatchNorm, MaxPool 2×2 + Dropout 0.25   | Detect mid-level features (textures, patterns) |
| Conv Block 3  | 128 filters × 2 layers, 3×3 kernel, SAME pad, L2=1e-4, BatchNorm, MaxPool 2×2 + Dropout 0.25  | Detect high-level features (shapes, parts)     |
| GlobalAvgPool | Averages each of the 128 feature maps to a single value                                       | Aggregate spatial features                     |
| Dense Layer   | 256 neurons, ReLU, L2=1e-4 + Dropout 0.5                                                      | Compact 256-D representation for clustering    |
| Output Layer  | 233 neurons, softmax                                                                          | Classify into one of 233 categories            |

Key choices:

- Activation: ReLU throughout convolutional and dense layers.
- Padding: padding="same" keeps spatial size, preserving edges.
- Regularization: L2 weight decay 1e-4 on all Conv/Dense layers.
- Normalization: BatchNormalization after the second convolution in each block.
- Pooling: MaxPooling halves spatial dimensions in each block; GlobalAveragePooling before Dense.
- Dropout: 0.25 after pooling in blocks 2–3; 0.5 before the output layer.
- Batch size: 32, balancing stability and memory.
- Optimizer: Adam with learning rate 0.0005.
- Loss: Categorical crossentropy.
- Callbacks: Early stopping on val_loss (patience 5, restore best) and ReduceLROnPlateau (factor 0.5, patience 3, min_lr 1e-7).
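
As a back-of-the-envelope check on the size of this architecture, the sketch below traces the feature-map shape through the three blocks and counts parameters in pure Python. The helper names are ours, and we count four parameters per channel for each BatchNormalization layer (scale, offset, moving mean, moving variance), matching the one BatchNormalization layer per block in the model definition of this section.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


side, channels, total = 224, 3, 0
for filters in (32, 64, 128):
    total += conv_params(channels, filters)  # first conv of the block
    total += conv_params(filters, filters)   # second conv of the block
    total += 4 * filters                     # BatchNormalization parameters
    side //= 2                               # 2x2 max pooling halves height and width
    channels = filters
    print(f"After block with {filters} filters: {side}x{side}x{channels}")

total += dense_params(channels, 256)  # GlobalAveragePooling -> Dense(256) "features"
total += dense_params(256, 233)       # Dense(256) -> softmax over 233 categories
print(f"Total parameters: {total:,}")
```

The network is small by modern standards (roughly 0.38 million parameters), which fits our goal of a compact feature extractor rather than a state-of-the-art classifier.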

Our goal isn’t peak classification accuracy; we primarily use the 256-D Dense layer as the feature embedding for clustering.

In [None]:
MODEL_PATH = 'out/models/cnn_feature_extractor_v3.keras'
HISTORY_PATH = 'out/models/training_history_v3.npy'

if os.path.exists(MODEL_PATH):
    model = keras.models.load_model(MODEL_PATH)
    # history_data = np.load(HISTORY_PATH, allow_pickle=True).item()
    print("Model loaded successfully!")
else:
    # Build CNN with 3 convolutional blocks (32, 64, 128 filters)
    model = keras.Sequential([
        layers.Input(shape=(IMG_SIZE[0], IMG_SIZE[1], 3)),

        # Block 1
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),

        # Block 2
        layers.Conv2D(64, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.Conv2D(64, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),

        # Block 3
        layers.Conv2D(128, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.Conv2D(128, (3, 3), activation='relu', padding='same',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),

        layers.GlobalAveragePooling2D(),

        layers.Dense(256, activation='relu', name='features',
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dropout(0.5),
        layers.Dense(len(categories), activation='softmax')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.0005),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Training callbacks for early stopping and learning rate reduction
    callbacks = [
        keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=5,
            restore_best_weights=True
        ),
        keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=3,
            min_lr=1e-7
        )
    ]

    history = model.fit(
        train_ds,
        epochs=50,
        validation_data=val_ds,
        callbacks=callbacks
    )

    # Save the trained model
    os.makedirs('out/models', exist_ok=True)
    model.save(MODEL_PATH)
    print(f"Model saved to {MODEL_PATH}")

    # Save training history
    np.save(HISTORY_PATH, history.history)
    print(f"Training history saved to {HISTORY_PATH}")

### 3.3 Training Performance Visualization

The code below creates two side-by-side plots showing how the model's accuracy and loss changed during training. We plot key epochs where we recorded metrics to show the overall training progression. The vertical line marks epoch 33, where the best validation loss occurred and whose weights were restored by early stopping.

In [None]:
# Training history from reported epochs
epochs_reported = [1, 5, 10, 15, 20, 25, 30, 33, 37, 38]
accuracy_reported = [0.0709, 0.1471, 0.2088, 0.2652,
                     0.3246, 0.3562, 0.3810, 0.4091, 0.4306, 0.4326]
loss_reported = [5.151, 4.347, 3.841, 3.498,
                 3.111, 2.930, 2.778, 2.624, 2.501, 2.496]
val_accuracy_reported = [0.0605, 0.1285, 0.1446, 0.2000,
                         0.2618, 0.2487, 0.2657, 0.3023, 0.3033, 0.3009]
val_loss_reported = [5.081, 4.558, 4.479, 3.997,
                     3.656, 3.726, 3.689, 3.409, 3.445, 3.470]

# Visualize training history
fig_history = make_subplots(
    rows=1, cols=2, subplot_titles=('Accuracy', 'Loss'))

fig_history.add_trace(go.Scatter(x=epochs_reported, y=accuracy_reported, name='Train',
                                 mode='lines+markers', line=dict(color='#2E86AB')), row=1, col=1)
fig_history.add_trace(go.Scatter(x=epochs_reported, y=val_accuracy_reported, name='Val',
                                 mode='lines+markers', line=dict(color='#A23B72')), row=1, col=1)

fig_history.add_trace(go.Scatter(x=epochs_reported, y=loss_reported, name='Train',
                                 mode='lines+markers', line=dict(color='#2E86AB'), showlegend=False), row=1, col=2)
fig_history.add_trace(go.Scatter(x=epochs_reported, y=val_loss_reported, name='Val',
                                 mode='lines+markers', line=dict(color='#A23B72'), showlegend=False), row=1, col=2)

fig_history.add_vline(x=33, line_dash="dash", line_color="green",
                      opacity=0.5, annotation_text="Best (Epoch 33)")

fig_history.update_xaxes(title_text="Epoch")
fig_history.update_yaxes(title_text="Accuracy", row=1, col=1)
fig_history.update_yaxes(title_text="Loss", row=1, col=2)
fig_history.update_layout(title='CNN Training History', height=500, width=1000)
fig_history.show()

| Epoch | Accuracy | Loss  | Val Accuracy | Val Loss | Learning Rate |
| ----- | -------- | ----- | ------------ | -------- | ------------- |
| 1     | 7.09%    | 5.151 | 6.05%        | 5.081    | 0.00050       |
| 5     | 14.71%   | 4.347 | 12.85%       | 4.558    | 0.00050       |
| 10    | 20.88%   | 3.841 | 14.46%       | 4.479    | 0.00050       |
| 15    | 26.52%   | 3.498 | 20.00%       | 3.997    | 0.00050       |
| 20    | 32.46%   | 3.111 | 26.18%       | 3.656    | 0.00025       |
| 25    | 35.62%   | 2.930 | 24.87%       | 3.726    | 0.00025       |
| 30    | 38.10%   | 2.778 | 26.57%       | 3.689    | 0.00025       |
| 33    | 40.91%   | 2.624 | 30.23%       | 3.409    | 0.000125      |
| 37    | 43.06%   | 2.501 | 30.33%       | 3.445    | 0.0000625     |
| 38    | 43.26%   | 2.496 | 30.09%       | 3.470    | 0.0000625     |

The model trained for 38 epochs before early stopping triggered (patience=5). The best validation loss of 3.41 occurred at epoch 33, and restore_best_weights=True restored those weights. While epoch 37 achieved the highest validation accuracy (30.33%), early stopping optimizes for loss, not accuracy.
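
The patience rule behind these numbers can be sketched in a few lines. This is a simplified model of early stopping (no minimum improvement threshold), run on made-up losses rather than the notebook's real history; `simulate_early_stopping` is a hypothetical helper of ours.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Made-up validation losses (not the notebook's real history).
toy_losses = [5.0, 4.5, 4.2, 4.3, 4.1, 4.15, 4.2, 4.12, 4.3, 4.25, 4.4]
best, stopped = simulate_early_stopping(toy_losses, patience=5)
print(f"Best epoch: {best}, training stopped after epoch: {stopped}")
```

With these toy values the best epoch is 5 and the rule stops after epoch 10, mirroring the real run, where the best epoch (33) plus the patience of 5 gives the final epoch (38).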

By the final epoch, training accuracy reached 43.3% while validation accuracy sat at 30.1%, a gap of about 13 percentage points. At the restored epoch 33 the gap was smaller (40.9% vs. 30.2%, about 11 points), but it still indicates moderate overfitting.

That said, 30% validation accuracy on 233 categories is roughly 70 times the 0.43% expected from uniform random guessing. More importantly, classification accuracy is not our goal here. What matters is that the network learned useful visual patterns in its intermediate layers; the 256-dimensional feature embedding it produces is what we use for clustering downstream.

### 3.4 Feature Extraction from All Images

Now we extract 256-dimensional feature vectors from every image in the dataset. We do this by:

1. Removing the final classification layer from the trained CNN
2. Passing each image through the network
3. Collecting the output from the Dense(256) layer

These features capture the "essence" of each image as learned by the CNN: a compact representation that preserves visual similarity. Images that look alike should have similar feature vectors, which makes the embedding suitable for clustering.
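
The idea that similar images end up close together can be illustrated with a toy example. The sketch below uses random vectors as stand-ins for the 256-D embeddings (two synthetic "categories" scattered around different centers) and compares average Euclidean distances within and between them. No real CNN features are involved, and the helper name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256  # same dimensionality as the CNN embedding

# Synthetic stand-ins for embeddings: two categories scattered around distinct centers.
center_a, center_b = rng.normal(size=dim), rng.normal(size=dim)
cat_a = center_a + 0.3 * rng.normal(size=(50, dim))
cat_b = center_b + 0.3 * rng.normal(size=(50, dim))


def mean_pairwise_distance(x, y):
    """Average Euclidean distance between every row of x and every row of y."""
    diffs = x[:, None, :] - y[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()


within = mean_pairwise_distance(cat_a, cat_a)
between = mean_pairwise_distance(cat_a, cat_b)
print(f"Mean distance within a category:  {within:.2f}")
print(f"Mean distance between categories: {between:.2f}")
```

Distance-based methods such as K-Means and Ward linkage rely on exactly this property: within-category distances that are small relative to between-category distances.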

In [None]:
FEATURES_PATH = 'out/features/features_v3.npy'
LABELS_PATH = 'out/features/labels_v3.npy'

if os.path.exists(FEATURES_PATH) and os.path.exists(LABELS_PATH):
    features = np.load(FEATURES_PATH)
    labels = np.load(LABELS_PATH)
    print(f"Features loaded: {features.shape}")
    print(f"Labels loaded: {labels.shape}")
else:
    # Create feature extractor by removing final classification layer
    feature_extractor = keras.Sequential(model.layers[:-1])  # type: ignore

    # Load all images without augmentation
    all_ds = keras.utils.image_dataset_from_directory(
        dataset_path,
        label_mode='int',
        shuffle=False,
        image_size=IMG_SIZE,
        batch_size=BATCH_SIZE
    )

    all_ds = all_ds.map(lambda x, y: (x / 255.0, y))  # type: ignore

    # Extract 256-dimensional features for all images
    features = feature_extractor.predict(all_ds.map(lambda x, y: x))
    labels = np.concatenate([y.numpy() for _, y in all_ds], axis=0)

    # Save features and labels
    os.makedirs('out/features', exist_ok=True)
    np.save(FEATURES_PATH, features)
    np.save(LABELS_PATH, labels)
    print("Features extracted and saved:")
    print(f"Features shape: {features.shape}")
    print(f"Labels shape: {labels.shape}")

## 4. Dimensionality Reduction with PCA

Our CNN outputs 256-dimensional feature vectors for each image, but working with that many dimensions has drawbacks. In high-dimensional spaces, clustering algorithms slow down and direct visualization is impossible. In addition, the CNN's 256 features likely contain redundancy: some dimensions may encode similar information or mostly noise.

Principal Component Analysis (PCA) helps by finding new axes (principal components) that capture the maximum variance in our data. We can then keep only the most important components, throwing away redundant or noisy dimensions while preserving the essential structure. This makes clustering faster and more effective and lets us create 2D visualizations to understand what the CNN learned.

### 4.1 Feature Standardization and Variance Analysis

We standardize all 256 CNN features to zero mean and unit variance then fit PCA to analyze how variance is distributed across components. This tells us how many dimensions actually carry useful information versus redundancy or noise. The cumulative variance plot below guides our choice of how many components to retain for downstream clustering.

In [None]:
features_loaded = np.load(FEATURES_PATH)
labels_loaded = np.load(LABELS_PATH)

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_loaded)

# Fit PCA to analyze variance distribution
pca_full = PCA()
pca_full.fit(features_scaled)

explained_variance = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

fig_variance = go.Figure()
fig_variance.add_trace(go.Scatter(
    x=list(range(1, len(explained_variance) + 1)),
    y=cumulative_variance,
    mode='lines+markers',
    name='Cumulative Variance'
))
fig_variance.update_layout(
    title='PCA Explained Variance',
    xaxis_title='Number of Components',
    yaxis_title='Cumulative Explained Variance',
    hovermode='x'
)
fig_variance.show()

print(
    f"Variance explained by first 2 components: {cumulative_variance[1]:.2%}")
print(
    f"Variance explained by first 10 components: {cumulative_variance[9]:.2%}")
print(
    f"Variance explained by first 50 components: {cumulative_variance[49]:.2%}")
print(
    f"Variance explained by first 60 components: {cumulative_variance[59]:.2%}")
print(
    f"Variance explained by first 75 components: {cumulative_variance[74]:.2%}")
print(
    f"Variance explained by first 100 components: {cumulative_variance[99]:.2%}")
print(
    f"Variance explained by first 120 components: {cumulative_variance[119]:.2%}")

| Components | Variance Explained |
| ---------- | ------------------ |
| First 2    | 15.82%             |
| First 10   | 50.51%             |
| First 50   | 85.06%             |
| First 60   | 87.31%             |
| First 75   | 89.88%             |
| First 100  | 92.95%             |
| First 120  | 94.71%             |

The variance distribution shows that the CNN's 256-dimensional output contains substantial redundancy.

- The first 10 components alone capture over half the variance (50.51%)
- The first 50 components retain 85.06% of the variance
- Beyond 100 components, variance gains diminish significantly
- The curve shows smooth accumulation without a sharp elbow, indicating that variance is distributed across many components rather than concentrated in a few

We choose 50 components for downstream clustering. This configuration retains 85.06% of the total variance while reducing dimensionality by over 80% compared to the original 256 dimensions.
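
The selection rule can be sketched without scikit-learn. The example below builds a synthetic, deliberately redundant feature matrix (not our CNN features), computes the explained variance ratios from the singular values of the standardized data, and picks the smallest number of components that reaches a target share of variance. The 85% target mirrors our choice above; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 500 samples, 20 correlated features
# generated from only 5 underlying factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Standardize, then obtain the component variances from the singular values.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(X_std, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.85
n_components = int(np.searchsorted(cumulative, target) + 1)
print(f"Components needed for {target:.0%} variance: {n_components}")
print(f"Variance retained: {cumulative[n_components - 1]:.2%}")
print(f"Dimensionality reduction: {1 - n_components / X.shape[1]:.0%}")
```

Applied to our numbers, the same arithmetic gives 1 - 50/256 ≈ 80.5% fewer dimensions at 85.06% retained variance.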

### 4.2 2D Visualization

While 50 dimensions is our working choice for clustering, we cannot visualize 50D data directly. We project features to just 2 dimensions (PC1 and PC2) to create scatter plots that give us intuition about the data structure. We visualize 5 categories selected for their semantic diversity: airplanes, saturn, mars, duck, and goose. Note that this 2D projection captures only ~16% of the total variance, so some overlap is expected.

In [None]:
# Reduce to 50D for clustering
pca_50d = PCA(n_components=50)
features_50d = pca_50d.fit_transform(features_scaled)

# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
features_2d = pca_2d.fit_transform(features_scaled)

# Visualize semantically distinct categories
distinct_categories = ['airplanes', 'saturn', 'mars', 'duck', 'goose']
distinct_indices = [i for i, cat in enumerate(
    categories) if cat in distinct_categories]

mask = np.isin(labels_loaded, distinct_indices)
df_pca_distinct = pd.DataFrame({
    'PC1': features_2d[mask, 0],
    'PC2': features_2d[mask, 1],
    'category': [categories[label] for label in labels_loaded[mask]]
})

fig_pca_distinct = px.scatter(
    df_pca_distinct,
    x='PC1',
    y='PC2',
    color='category',
    title='PCA 2D Projection - Semantically Distinct Categories',
    opacity=0.7
)
fig_pca_distinct.update_traces(marker=dict(size=5))
fig_pca_distinct.update_layout(width=1000, height=800)
fig_pca_distinct.show()

The scatter plot reveals clear separation between semantically distinct categories. Airplanes (blue) form a tight, well-defined cluster in the upper-left region, demonstrating that the CNN learned distinctive features for this category.

The space-related categories show interesting behavior: mars (purple) and saturn (orange) occupy the center region and overlap substantially with each other. This makes sense visually since both are celestial bodies with similar color palettes (warm oranges/reds) and spherical shapes. The CNN's features capture this shared "space object" signature.

Similarly, the bird categories duck (red) and goose (green) cluster together in the right portion of the plot, reflecting their shared visual characteristics: feathers, beaks, similar body shapes, and often similar backgrounds (water, grass).

The key insight is that even in just 2 dimensions (capturing only ~16% of variance), the CNN features successfully separate semantically different concepts (vehicles vs. space objects vs. birds) while grouping visually similar categories together. This confirms that the learned representations encode meaningful visual structure that hierarchical and K-Means clustering can exploit in the full 50-dimensional space.

### 4.3 Data-Driven Macro-Category Discovery

Rather than imposing predefined semantic groupings (e.g., "animals," "vehicles"), we let the CNN features reveal natural visual clusters. We apply hierarchical clustering to the 233 category centroids using Ward linkage, which minimizes within-cluster variance at each merge step. The resulting dendrogram shows how categories progressively combine as we relax similarity thresholds.

Categories merging at low heights are visually similar according to the CNN's learned representations, while those merging only at high heights are fundamentally different. By cutting the dendrogram at various heights, we can create different numbers of macro-categories (from 5 broad groups to 30+ finer divisions) for use in our clustering evaluation in Section 6.

In [None]:
# Compute centroids for each of the 233 categories
# A centroid is the mean feature vector of all images in that category
category_centroids = []
for i, category in enumerate(categories):
    mask = labels_loaded == i
    centroid = features_50d[mask].mean(axis=0)
    category_centroids.append(centroid)

category_centroids = np.array(category_centroids)

# Perform hierarchical clustering on category centroids using Ward linkage
# Ward linkage minimizes within-cluster variance at each merge step
linkage_matrix = linkage(category_centroids, method='ward')

color_threshold = 0.5 * max(linkage_matrix[:, 2])  # ~50% of max height

fig_dendro_analysis = ff.create_dendrogram(
    category_centroids,
    labels=categories,
    linkagefun=lambda x: linkage_matrix,
    color_threshold=color_threshold
)

# Add horizontal line showing a suggested cut threshold
fig_dendro_analysis.add_hline(
    y=color_threshold,
    line_dash="dash",
    line_color="red",
    annotation_text=f"Cut threshold ({color_threshold:.1f})",
    annotation_position="top right"
)

fig_dendro_analysis.update_layout(
    title='Hierarchical Clustering Dendrogram of 233 Category Centroids<br>(Ward Linkage, 50D PCA Features)',
    xaxis_title='Category',
    yaxis_title='Ward Distance',
    height=800,
    width=1400,
    xaxis={'tickangle': 90, 'tickfont': {'size': 6}}
)

fig_dendro_analysis.show()

# Analyze the structure at different cut heights
print("Macro-category counts at different cut heights:")
cut_heights = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
max_height = max(linkage_matrix[:, 2])

for ratio in cut_heights:
    cut_height = ratio * max_height
    n_clusters = len(
        set(fcluster(linkage_matrix, cut_height, criterion='distance')))
    print(
        f"Cut at {ratio*100:.0f}% max height ({cut_height:.1f}): {n_clusters} clusters")

# Show example macro-categories at K=15 level
n_macro_example = 15
macro_assignments = fcluster(
    linkage_matrix, n_macro_example, criterion='maxclust')

# Group categories by their macro-category assignment
macro_groups = {}
for cat_idx, macro_id in enumerate(macro_assignments):
    if macro_id not in macro_groups:
        macro_groups[macro_id] = []
    macro_groups[macro_id].append(categories[cat_idx])

# Sort by group size and show the largest macro-categories
sorted_groups = sorted(macro_groups.items(),
                       key=lambda x: len(x[1]), reverse=True)

# Find groups with 3-8 members (likely more coherent)
coherent_groups = [(macro_id, members) for macro_id, members in sorted_groups
                   if 3 <= len(members) <= 8]

for macro_id, members in coherent_groups[:5]:
    print(f"\nMacro-category {macro_id}: {', '.join(members)}")

Our hierarchical clustering shows how the 233 categories naturally group into bigger visual families. By cutting the dendrogram at different heights, we can create different numbers of macro-categories: cutting at 50% of maximum height gives us 6 broad groups, while cutting at 30% gives us 19 smaller clusters. This lets us look at visual patterns at different levels of detail.

When we look at 15 macro-categories, the CNN clearly organizes by visual features rather than semantic meaning. Macro-category 10 groups space and sky phenomena (comet, fireworks, galaxy, lightning, mars, rainbow, saturn) because they share common visual traits: radial patterns, glowing effects, dark backgrounds, and saturated colors (oranges, yellows, purples). Macro-category 11 brings together organic subjects (butterfly, grapes, grasshopper, hibiscus, hummingbird, iris, praying-mantis, spider) based on detailed patterns, bright colors, curved shapes, and fine textures.

However, Macro-category 12 shows an important weakness of visual-only clustering: it puts together conch shells, goldfish, hamburgers, hot dogs, ice cream cones, spaghetti, and teddy bears. Apart from the food items, these have little in common semantically, but they look similar to the CNN: rounded shapes, warm colors (browns, yellows, reds), and compact layouts. This shows that unsupervised clustering groups by appearance in feature space, not by conceptual meaning. Despite this limitation, these results suggest that our CNN learned useful visual patterns for organizing large image collections into understandable groups.
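
The relationship between cut height and cluster count can be illustrated on a toy problem. The sketch below is a minimal example on synthetic 2D points (the names `toy_centroids`, `Z_toy`, and the blob positions are assumptions for illustration, not the project's centroids). It contrasts the two `fcluster` criteria used in this section: `criterion='distance'` cuts at a height and lets the data determine the number of clusters, while `criterion='maxclust'` requests a maximum number of clusters directly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for category centroids: three well-separated blobs in 2D
rng = np.random.default_rng(0)
toy_centroids = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(4, 2)),
    rng.normal(loc=(5, 5), scale=0.1, size=(4, 2)),
    rng.normal(loc=(10, 0), scale=0.1, size=(4, 2)),
])

Z_toy = linkage(toy_centroids, method='ward')
max_height_toy = Z_toy[:, 2].max()

# criterion='distance': cut at a given height; the cluster count follows from the data
counts_by_ratio = {
    ratio: len(np.unique(fcluster(Z_toy, ratio * max_height_toy, criterion='distance')))
    for ratio in (0.1, 0.5, 0.9)
}

# criterion='maxclust': ask for at most k flat clusters
labels_k3 = fcluster(Z_toy, 3, criterion='maxclust')

print("Clusters per cut ratio:", counts_by_ratio)
print("Labels for k=3:", labels_k3)
```

Raising the cut height can only merge clusters, never split them, which is why the cluster counts reported above shrink as the cut ratio grows.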

### 4.4 Category Centroid Distance Heatmap

To visualize how categories relate to each other in the 50D feature space we compute pairwise Euclidean distances between all 233 category centroids. The distance matrix is then reordered according to the hierarchical clustering leaf order, so that visually similar categories appear adjacent to each other. We display the first 100 categories to keep the heatmap readable.

In [None]:
# Compute pairwise Euclidean distances between all 233 category centroids
centroid_distances = pdist(category_centroids, metric='euclidean')
distance_matrix = squareform(centroid_distances)

# Get the dendrogram leaf order (left to right) from the linkage matrix
# This reorders categories so that similar ones are adjacent
leaf_order = leaves_list(linkage_matrix)
ordered_categories = [categories[i] for i in leaf_order]

# Reorder the distance matrix according to clustering
ordered_distance_matrix = distance_matrix[np.ix_(leaf_order, leaf_order)]

# Show a zoomed version
n_zoom = 100
zoom_categories = ordered_categories[:n_zoom]

fig_heatmap_zoom = go.Figure(data=go.Heatmap(
    z=ordered_distance_matrix[:n_zoom, :n_zoom],
    x=zoom_categories,
    y=zoom_categories,
    colorscale='Viridis',
    reversescale=True,
    colorbar=dict(title='Euclidean Distance'),
    hovertemplate='%{x} ↔ %{y}<br>Distance: %{z:.3f}<extra></extra>'
))

fig_heatmap_zoom.update_layout(
    title=f'Zoomed Heatmap: First {n_zoom} Categories (Clustered Order)',
    xaxis=dict(tickangle=45, tickfont=dict(size=9)),
    yaxis=dict(tickfont=dict(size=9), autorange='reversed'),
    width=900,
    height=800
)

fig_heatmap_zoom.show()

The heatmap reveals the distance structure among the first 100 categories ordered by hierarchical clustering. The diagonal shows zero distance (yellow) as expected for self-comparisons. Several block structures are visible along the diagonal, indicating groups of categories with low inter-category distances (teal/green regions).

Notable patterns include a cluster of electronic/household objects in the upper-left (ipod, breadmaker, photocopier, microwave) and another group of elongated objects mid-section (chopsticks, tuning-fork, eyeglasses). The bottom-right corner shows a distinct block containing architectural structures (skyscraper, minaret, pyramid) with relatively low distances to each other but high distances (purple) to most other categories.

The predominant teal coloring across most of the matrix indicates moderate distances between categories, suggesting that while the CNN learned to distinguish broad visual concepts, many categories share enough visual features (shapes, textures, colors) to remain relatively close in feature space.

## 5. Supervised Baseline Classifier

Before evaluating unsupervised clustering, we establish a supervised baseline to measure how well the CNN features support classification when ground-truth labels are available. This upper bound helps us contextualize the clustering results: the gap between supervised accuracy and unsupervised purity reveals how much information is lost when labels are unavailable.

We use Logistic Regression because it's fast, interpretable, and effective with high-dimensional features. Logistic Regression learns simple linear decision boundaries in the 256-dimensional feature space. Strong performance here would confirm that the CNN successfully encoded discriminative visual information suitable for both classification and clustering tasks.

### 5.1 Train-Test Split with Stratification

We split our data into 80% training and 20% test sets using stratified sampling. Stratification ensures that each category maintains the same proportion in both sets.
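
To make the effect of stratification concrete, here is a minimal sketch on made-up labels (an 80/20 class imbalance chosen for illustration; `X_toy` and `y_toy` are hypothetical and unrelated to the dataset). With `stratify` set, both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr_toy)
test_counts = np.bincount(y_te_toy)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)
```

Without `stratify`, small categories could by chance end up with very few or no test samples, which would make per-category metrics unreliable.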

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled,
    labels_loaded,
    test_size=0.2,
    random_state=42,
    stratify=labels_loaded
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Feature dimensions: {X_train.shape[1]}")
print(f"Number of classes: {len(np.unique(y_train))}")

### 5.2 Train Logistic Regression Classifier

We train multiple Logistic Regression configurations on the 256-dimensional CNN features to find the best balance between fitting the training data and generalizing to unseen examples:

| Configuration    | Description                                                   |
| ---------------- | ------------------------------------------------------------- |
| Baseline (C=1.0) | Default regularization strength                               |
| C=0.1            | Stronger L2 regularization to reduce overfitting              |
| C=0.1 + balanced | Regularization with class weighting for imbalanced categories |
| PCA-50D + C=0.1  | Regularization on dimensionality-reduced features             |

Key settings:

- max_iter=1000: Sufficient iterations for convergence with 233 classes
- solver='lbfgs': Efficient optimization for multinomial classification
- C parameter: Controls regularization strength (lower = stronger regularization)
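- n_jobs=-1: Requests all CPU cores; scikit-learn uses this setting only when parallelizing over classes in one-vs-rest mode, so it has little effect on a multinomial lbfgs fit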
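
The role of `C` can be seen directly in the size of the learned weights. The following is a minimal sketch on synthetic data generated with `make_classification` (the dataset shape and the names `X_toy`, `y_toy`, `coef_norms` are assumptions for illustration only). It fits the same model with C=1.0 and C=0.1 and compares the norm of the coefficient matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the CNN features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0
)

coef_norms = {}
for C in (1.0, 0.1):
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_toy, y_toy)
    # Frobenius norm of the weight matrix (intercepts are not penalized)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

print(coef_norms)
```

A lower `C` puts more weight on the L2 penalty relative to the data-fit term, so the fitted weights shrink; this is the mechanism the configurations in the table above rely on to reduce overfitting.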

In [None]:
# type: ignore

# Baseline: default Logistic Regression
lr_baseline = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs',
    n_jobs=-1
)
lr_baseline.fit(X_train, y_train)

# Approach 1: Stronger regularization (C=0.1)
lr_regularized = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs',
    C=0.1,
    n_jobs=-1
)
lr_regularized.fit(X_train, y_train)

# Approach 2: Regularization + class balancing
lr_balanced = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs',
    C=0.1,
    class_weight='balanced',
    n_jobs=-1
)
lr_balanced.fit(X_train, y_train)

# Approach 3: PCA-reduced features (50D)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    features_50d, labels_loaded, test_size=0.2, random_state=42, stratify=labels_loaded
)
lr_pca = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs',
    C=0.1,
    n_jobs=-1
)
lr_pca.fit(X_train_pca, y_train_pca)

# Collect results
results = []
configs = [
    ("Baseline (C=1.0)", lr_baseline, X_train, X_test, y_train, y_test),
    ("C=0.1", lr_regularized, X_train, X_test, y_train, y_test),
    ("C=0.1 + balanced", lr_balanced, X_train, X_test, y_train, y_test),
    ("PCA-50D + C=0.1", lr_pca, X_train_pca, X_test_pca, y_train_pca, y_test_pca),
]

for name, model, X_tr, X_te, y_tr, y_te in configs:
    # Predict on the test split once and reuse the predictions for every metric
    y_pred = model.predict(X_te)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, y_pred)
    _, _, f1_macro, _ = precision_recall_fscore_support(
        y_te, y_pred, average='macro', zero_division=0
    )
    _, _, f1_weighted, _ = precision_recall_fscore_support(
        y_te, y_pred, average='weighted', zero_division=0
    )
    results.append({
        'Configuration': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Gap': train_acc - test_acc,  # train/test gap, as reported in the results table
        'Macro F1': f1_macro,
        'Weighted F1': f1_weighted
    })

df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))

# Select best model for downstream analysis
best_idx = df_results['Test Acc'].idxmax()
best_config = df_results.loc[best_idx, 'Configuration']

# Look up the winning model together with the split it was trained on.
# df_results rows follow the order of `configs`, so the index label is also the list position.
_, lr_classifier, X_train_best, X_test_best, y_train_best, y_test_best = configs[int(best_idx)]

# Use best model for predictions
y_test_pred = lr_classifier.predict(X_test_best)
y_train_pred = lr_classifier.predict(X_train_best)
train_accuracy = accuracy_score(y_train_best, y_train_pred)
test_accuracy = accuracy_score(y_test_best, y_test_pred)

# Final metrics for best model
precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
    y_test_best, y_test_pred, average='macro', zero_division=0
)
precision_weighted, recall_weighted, f1_weighted, _ = precision_recall_fscore_support(
    y_test_best, y_test_pred, average='weighted', zero_division=0
)

print(f"\nBest Model: {best_config}")
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"\nMacro-averaged metrics:")
print(f"  Precision: {precision_macro:.4f}")
print(f"  Recall: {recall_macro:.4f}")
print(f"  F1-Score: {f1_macro:.4f}")
print(f"\nWeighted-averaged metrics:")
print(f"  Precision: {precision_weighted:.4f}")
print(f"  Recall: {recall_weighted:.4f}")
print(f"  F1-Score: {f1_weighted:.4f}")

| Configuration    | Train Acc | Test Acc | Gap   | Macro F1 | Weighted F1 |
| ---------------- | --------- | -------- | ----- | -------- | ----------- |
| Baseline (C=1.0) | 80.7%     | 37.7%    | 43.0% | 31.6%    | 37.3%       |
| C=0.1            | 63.8%     | 40.9%    | 22.9% | 33.8%    | 39.4%       |
| C=0.1 + balanced | 63.9%     | 39.2%    | 24.7% | 33.3%    | 38.7%       |
| PCA-50D + C=0.1  | 46.7%     | 39.2%    | 7.5%  | 31.5%    | 37.1%       |

Best Model: C=0.1 (strongest regularization on full 256D features)

### 5.2.1 Cross-Validation Analysis

To obtain more robust performance estimates, we perform 5-fold stratified cross-validation on our best configuration (C=0.1). This tests the model on different data splits, reducing variance from a single train/test split.
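
When an integer is passed as `cv` for a classifier, scikit-learn uses stratified folds without shuffling. As a minimal sketch on synthetic data (the names `X_toy`, `y_toy`, `cv_toy`, and `scores_toy` are hypothetical), the splitter can also be made explicit, which allows shuffling with a fixed seed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the CNN features (not the PatternMind data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=2, random_state=0
)

# Explicit splitter: stratified folds, shuffled with a fixed seed
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_toy = cross_validate(
    LogisticRegression(C=0.1, max_iter=1000),
    X_toy, y_toy,
    cv=cv_toy,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

mean_acc = scores_toy['test_accuracy'].mean()
std_acc = scores_toy['test_accuracy'].std()
print(f"Test accuracy: {mean_acc:.3f} ± {std_acc:.3f}")
```

Whether shuffling matters depends on how the samples are ordered; stratification already guarantees that every fold sees every class.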

In [None]:
# 5-fold stratified cross-validation
lr_cv = LogisticRegression(
    max_iter=1000,
    random_state=42,
    solver='lbfgs',
    C=0.1,
    n_jobs=-1
)

cv_results = cross_validate(
    lr_cv,
    features_scaled,
    labels_loaded,
    cv=5,
    scoring=['accuracy', 'f1_macro', 'f1_weighted'],
    n_jobs=-1,
    return_train_score=True
)

# Compute statistics
cv_stats = {
    'Train Accuracy': {
        'mean': cv_results['train_accuracy'].mean(),
        'std': cv_results['train_accuracy'].std()
    },
    'Test Accuracy': {
        'mean': cv_results['test_accuracy'].mean(),
        'std': cv_results['test_accuracy'].std()
    },
    'Macro F1': {
        'mean': cv_results['test_f1_macro'].mean(),
        'std': cv_results['test_f1_macro'].std()
    },
    'Weighted F1': {
        'mean': cv_results['test_f1_weighted'].mean(),
        'std': cv_results['test_f1_weighted'].std()
    }
}

print("5-Fold Cross-Validation Results (C=0.1):")
print(
    f"Test Accuracy:  {cv_stats['Test Accuracy']['mean']:.4f} ± {cv_stats['Test Accuracy']['std']:.4f}")
print(
    f"Train Accuracy: {cv_stats['Train Accuracy']['mean']:.4f} ± {cv_stats['Train Accuracy']['std']:.4f}")
print(
    f"Macro F1:       {cv_stats['Macro F1']['mean']:.4f} ± {cv_stats['Macro F1']['std']:.4f}")
print(
    f"Weighted F1:    {cv_stats['Weighted F1']['mean']:.4f} ± {cv_stats['Weighted F1']['std']:.4f}")

| Metric         | Mean   | Std Dev  |
| -------------- | ------ | -------- |
| Train Accuracy | 0.6375 | ± 0.0032 |
| Test Accuracy  | 0.4042 | ± 0.0039 |
| Macro F1       | 0.3356 | ± 0.0033 |
| Weighted F1    | 0.3893 | ± 0.0041 |

The 5-fold cross-validation confirms the robustness of our Logistic Regression classifier with strong regularization (C=0.1), achieving a mean test accuracy of 40.42% ± 0.39% across folds. The low standard deviation (less than 0.5 percentage points) indicates that performance is stable and not heavily dependent on the particular train/test split. The gap of roughly 23 percentage points between train accuracy (63.75%) and test accuracy (40.42%) reveals moderate overfitting, which is expected given the challenge of distinguishing 233 visually similar categories with limited training samples per class. The macro F1 score of 33.56% is notably lower than the weighted F1 of 38.93%, indicating that the classifier performs worse on smaller, less frequent categories than on larger ones, which is consistent with the class imbalance in the dataset. These cross-validation results agree with our single-split findings from Section 5.2 (40.9% test accuracy), confirming that roughly 40% test accuracy is a reliable estimate for this linear classifier on the CNN features and a useful reference point for the unsupervised results.
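
The difference between macro and weighted F1 is easiest to see on a tiny, deliberately imbalanced example. The sketch below uses made-up labels (8 samples of a large class, 2 of a small one; `y_true_toy` and `y_pred_toy` are hypothetical) where the classifier is perfect on the large class but misses half of the small one.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is frequent, class 1 is rare
y_true_toy = [0] * 8 + [1] * 2
# The large class is always recovered; one of the two rare samples is missed
y_pred_toy = [0] * 8 + [0, 1]

# Macro: unweighted mean of per-class F1, so the rare class counts as much as the frequent one
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro')
# Weighted: per-class F1 averaged with weights proportional to class support
weighted_f1 = f1_score(y_true_toy, y_pred_toy, average='weighted')

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

Here the per-class F1 scores are 16/17 for the frequent class and 2/3 for the rare one, so the macro average is pulled down more strongly than the weighted average. The same pattern in our results points to weaker performance on small categories.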

### 5.3 Per-Category Performance Analysis

We examine how classification accuracy varies across individual categories. This reveals which types of images are easiest or hardest to classify, and whether performance correlates with category size or visual distinctiveness.

In [None]:
# Calculate per-category accuracy
per_class_accuracy = []
for i, category in enumerate(categories):
    mask = y_test == i
    if mask.sum() > 0:
        class_acc = (y_test_pred[mask] == i).sum() / mask.sum()
        per_class_accuracy.append({
            'category': category,
            'label_id': i,
            'accuracy': class_acc,
            'test_samples': mask.sum()
        })

df_class_acc = pd.DataFrame(per_class_accuracy)
df_class_acc_sorted = df_class_acc.sort_values('accuracy', ascending=False)

print("\nTop 10 Best Performing Categories:")
print(df_class_acc_sorted.head(10)[
      ['category', 'accuracy', 'test_samples']].to_string(index=False))

print("\nTop 10 Worst Performing Categories:")
print(df_class_acc_sorted.tail(10)[
      ['category', 'accuracy', 'test_samples']].to_string(index=False))

print(f"\nMean per-class accuracy: {df_class_acc['accuracy'].mean():.4f}")
print(f"Median per-class accuracy: {df_class_acc['accuracy'].median():.4f}")
print(f"Std per-class accuracy: {df_class_acc['accuracy'].std():.4f}")

fig_acc_dist = go.Figure()
fig_acc_dist.add_trace(go.Histogram(
    x=df_class_acc['accuracy'],
    nbinsx=30,
    name='Per-Class Accuracy'
))
fig_acc_dist.update_layout(
    title='Distribution of Per-Class Accuracy',
    xaxis_title='Accuracy',
    yaxis_title='Number of Classes',
    showlegend=False,
    height=500
)
fig_acc_dist.show()

| Best Categories | Accuracy | Test Samples | Worst Categories | Accuracy | Test Samples |
| --------------- | -------- | ------------ | ---------------- | -------- | ------------ |
| car-side        | 100.0%   | 21           | windmill         | 0%       | 17           |
| leopards        | 100.0%   | 35           | tuning-fork      | 0%       | 18           |
| faces-easy      | 98.7%    | 79           | rifle            | 0%       | 19           |
| motorbikes      | 98.6%    | 144          | dog              | 0%       | 19           |
| airplanes       | 97.9%    | 144          | mailbox          | 0%       | 17           |

Most categories cluster around 20-40% accuracy while a small subset achieves 80%+ accuracy. Two categories (car-side, leopards) achieve perfect 100% classification and several others exceed 90% (faces-easy, motorbikes, airplanes, sunflower, mars). These high-performing categories share distinctive visual signatures: consistent shapes, unique textures, or characteristic compositions that make them easily separable in feature space.

Categories that fail completely (0% accuracy) include everyday objects that appear in varied contexts (dog, goat, sneaker) or items with ambiguous visual features (tuning-fork, stirrups, spoon). These objects lack the consistent visual patterns needed for reliable classification.

| Statistic                 | Value  |
| ------------------------- | ------ |
| Mean per-class accuracy   | 34.52% |
| Median per-class accuracy | 30.43% |
| Standard deviation        | 23.90% |

The high standard deviation (23.9%) confirms dramatic performance variation across categories.
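
The per-category accuracy computed above is the same quantity as per-class recall. As a minimal sketch on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical), the manual loop can be cross-checked against `recall_score` with `average=None`:

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical ground truth and predictions for three classes
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0])

# Manual per-class accuracy: fraction of each class's samples predicted correctly
manual_acc = np.array([
    (y_pred_toy[y_true_toy == c] == c).mean()
    for c in np.unique(y_true_toy)
])

# Same quantity via scikit-learn: recall per class, no averaging
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)

print("Manual:", manual_acc)
print("Recall:", per_class_recall)
```

Averaging these per-class values gives the mean per-class accuracy in the table, which weights every category equally regardless of its size.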

### 5.3.1 Confusion Matrix Analysis

To understand misclassification patterns, we visualize a confusion matrix for the 30 most frequent categories. This reveals which category pairs the classifier confuses most often, providing insight into visual similarity as perceived by the learned features.

In [None]:
# Select top 30 most frequent categories
top_30_counts = df['category'].value_counts().head(30)
top_30_categories = top_30_counts.index.tolist()
top_30_indices = [categories.index(cat) for cat in top_30_categories]

# Filter predictions and ground truth for top 30 categories
mask_top30 = np.isin(y_test, top_30_indices)
y_test_top30 = y_test[mask_top30]
y_pred_top30 = y_test_pred[mask_top30]

# Map to 0-29 range for confusion matrix
index_mapping = {old_idx: new_idx for new_idx,
                 old_idx in enumerate(top_30_indices)}

# Only include predictions that are in top 30 (filter out misclassifications to other categories)
valid_mask = np.isin(y_pred_top30, top_30_indices)
y_test_top30_filtered = y_test_top30[valid_mask]
y_pred_top30_filtered = y_pred_top30[valid_mask]

y_test_top30_remapped = np.array(
    [index_mapping[idx] for idx in y_test_top30_filtered])
y_pred_top30_remapped = np.array(
    [index_mapping[idx] for idx in y_pred_top30_filtered])

# Compute confusion matrix
cm = confusion_matrix(y_test_top30_remapped,
                      y_pred_top30_remapped, labels=range(30))
# Avoid division by zero
cm_normalized = cm.astype('float') / (cm.sum(axis=1, keepdims=True) + 1e-10)

fig_cm = go.Figure(data=go.Heatmap(
    z=cm_normalized,
    x=top_30_categories,
    y=top_30_categories,
    colorscale='Blues',
    text=cm,
    texttemplate='%{text}',
    textfont={'size': 8},
    hovertemplate='True: %{y}<br>Predicted: %{x}<br>Count: %{text}<br>Rate: %{z:.2%}<extra></extra>',
    colorbar=dict(title='Normalized Rate')
))

fig_cm.update_layout(
    title='Confusion Matrix: Top 30 Most Frequent Categories<br>(Normalized by Row)',
    xaxis_title='Predicted Category',
    yaxis_title='True Category',
    xaxis={'tickangle': 45, 'tickfont': {'size': 8}},
    yaxis={'tickfont': {'size': 8}, 'autorange': 'reversed'},
    width=1000,
    height=900
)

fig_cm.show()

# Identify most common misclassifications within top 30
misclass_pairs = []
for i in range(len(top_30_categories)):
    for j in range(len(top_30_categories)):
        if i != j and cm[i, j] > 0:
            misclass_pairs.append({
                'true_category': top_30_categories[i],
                'predicted_category': top_30_categories[j],
                'count': cm[i, j],
                'error_rate': cm_normalized[i, j]
            })

df_misclass = pd.DataFrame(misclass_pairs).sort_values(
    'count', ascending=False)

print("\nTop 10 Most Common Misclassifications (within Top 30 categories):")
print(df_misclass.head(10)[
      ['true_category', 'predicted_category', 'count', 'error_rate']].to_string(index=False))

# Report how many predictions were outside top 30
n_outside = (~valid_mask).sum()
print(f"\nNote: {n_outside}/{len(y_pred_top30)} predictions ({100*n_outside/len(y_pred_top30):.1f}%) were for categories outside the top 30.")

The confusion matrix and misclassification analysis reveal that the classifier struggles most with visually ambiguous objects and cluttered scenes. The dominant error pattern is objects being misclassified as "clutter" (ladder -> clutter at 26.3%, t-shirt -> clutter at 10.5%), indicating the model conflates objects in messy backgrounds with the clutter category itself. Mattress images frequently confuse the classifier, being misidentified as bathtub (15.4%) or t-shirt (11.5%), likely due to shared visual properties.

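Several misclassifications expose the CNN's reliance on shape over semantics: hot-tub -> hammock (12.5%) reflects curved/suspended forms, while people -> faces-easy (15.8%) shows focus on facial features rather than full-body context. In addition, 25.2% of predictions for these images fell outside the top 30 categories entirely, meaning the classifier often assigns images from frequent classes to one of the 203 less frequent categories. Because those predictions are dropped before the matrix is built, the error rates above are relative to predictions that stayed within the top 30.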
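
Dropping predictions that fall outside the selected categories changes the row-normalized rates. The sketch below illustrates this on made-up labels, where classes 0 and 1 play the role of the frequent subset and class 2 stands for every other category (all names and values are hypothetical).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical data: true labels are all in the subset {0, 1}
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 2, 1, 1, 2, 2])
subset = [0, 1]

# Variant A: discard predictions outside the subset, then normalize each row
keep = np.isin(y_pred_toy, subset)
cm_filtered = confusion_matrix(y_true_toy[keep], y_pred_toy[keep], labels=subset)
rates_filtered = cm_filtered / cm_filtered.sum(axis=1, keepdims=True)

# Variant B: keep every prediction and normalize by the full count of true samples
cm_full = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])[:2]
rates_full = cm_full / cm_full.sum(axis=1, keepdims=True)

print("Filtered rates:\n", rates_filtered)
print("Full rates (subset columns):\n", rates_full[:, :2])
```

In Variant A each row sums to 1 over the subset, so both the diagonal and the off-diagonal rates are inflated relative to Variant B whenever some predictions leave the subset.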

### 5.4 Supervised Classification on Macro-Categories

To further validate that the data-driven macro-categories represent meaningful visual groupings, we train a Logistic Regression classifier on macro-category labels instead of the original 233 fine-grained categories. If macro-categories capture coherent visual concepts, classification accuracy should improve significantly compared to the fine-grained baseline.

In [None]:
# Train Logistic Regression on macro-categories at different granularities
macro_classification_results = []
macro_granularities = [5, 10, 15, 20, 30]


def get_macro_labels(linkage_matrix, n_macro, category_labels):
    """Map each image's category to its macro-category from dendrogram cut."""
    cat_to_macro = fcluster(linkage_matrix, n_macro, criterion='maxclust')
    return np.array([cat_to_macro[cat] - 1 for cat in category_labels])


for n_macro in macro_granularities:
    # Get macro-category labels for all images
    macro_labels_all = get_macro_labels(linkage_matrix, n_macro, labels_loaded)

    # Split data with stratification on macro-labels
    X_train_macro, X_test_macro, y_train_macro, y_test_macro = train_test_split(
        features_scaled,
        macro_labels_all,
        test_size=0.2,
        random_state=42,
        stratify=macro_labels_all
    )

    # Train Logistic Regression with best config (C=0.1)
    lr_macro = LogisticRegression(
        max_iter=1000,
        random_state=42,
        solver='lbfgs',
        C=0.1,  # Best configuration from Section 5.2
        n_jobs=-1
    )
    lr_macro.fit(X_train_macro, y_train_macro)

    # Evaluate
    train_acc = accuracy_score(y_train_macro, lr_macro.predict(X_train_macro))
    test_acc = accuracy_score(y_test_macro, lr_macro.predict(X_test_macro))

    macro_classification_results.append({
        'n_macro_categories': n_macro,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'random_baseline': 1 / n_macro
    })

    print(
        f"Macro-categories: {n_macro:2d} | Train Acc: {train_acc:.4f} | Test Acc: {test_acc:.4f} | Random: {1/n_macro:.4f}")

df_macro_class = pd.DataFrame(macro_classification_results)

# Visualize results
fig_macro_class = go.Figure()
fig_macro_class.add_trace(go.Scatter(
    x=df_macro_class['n_macro_categories'],
    y=df_macro_class['test_accuracy'],
    mode='lines+markers',
    name='Test Accuracy',
    marker=dict(size=10)
))
fig_macro_class.add_trace(go.Scatter(
    x=df_macro_class['n_macro_categories'],
    y=df_macro_class['random_baseline'],
    mode='lines+markers',
    name='Random Baseline',
    line=dict(dash='dash')
))
fig_macro_class.add_hline(
    y=0.409,
    line_dash="dot",
    line_color="red",
    annotation_text="Fine-grained (233 classes): 40.9%",
    annotation_position="top right"
)
fig_macro_class.update_layout(
    title='Logistic Regression Accuracy on Data-Driven Macro-Categories (C=0.1)',
    xaxis_title='Number of Macro-Categories',
    yaxis_title='Test Accuracy',
    height=500,
    width=800
)
fig_macro_class.show()

As we reduce the number of target macro-categories from 30 to 5, test accuracy increases from 54% to 72%, showing that the CNN features support progressively better classification as the task becomes less granular. Part of this gain is expected simply because there are fewer classes to confuse, so the more informative comparison is against chance: at all granularities, test accuracy remains far above the uniform random baseline (1/K, e.g. 20% for 5 groups and about 3.3% for 30), which suggests that the learned features encode genuine visual patterns rather than noise and that hierarchical clustering identified meaningful visual structure. For practical applications with limited labeled data, these results suggest that training classifiers on 5-15 macro-categories is more tractable than attempting fine-grained 233-way classification.
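
The uniform baseline 1/K assumes equally sized macro-categories. When group sizes differ, always predicting the largest group can score higher, so it may be worth checking that baseline as well. A minimal sketch with made-up group sizes (`macro_labels_toy` is hypothetical and not derived from our dendrogram):

```python
import numpy as np

# Hypothetical macro-category labels for 100 images: sizes 60, 25, and 15
macro_labels_toy = np.array([0] * 60 + [1] * 25 + [2] * 15)

n_macro_toy = len(np.unique(macro_labels_toy))

# Uniform random guessing over K groups
uniform_baseline = 1 / n_macro_toy

# Always predicting the most frequent group
majority_baseline = np.bincount(macro_labels_toy).max() / len(macro_labels_toy)

print(f"Uniform baseline:  {uniform_baseline:.3f}")
print(f"Majority baseline: {majority_baseline:.3f}")
```

Applied to the real macro labels, the same two lines would show how much of the macro-level accuracy could be explained by group imbalance alone.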

### 5.5 Logistic Regression vs SVM vs Random Forest

While Logistic Regression provides interpretable linear decision boundaries, other classifiers may capture different patterns in the CNN features. We compare three approaches:

1. Logistic Regression (C=0.1): Our best linear model from Section 5.2
2. Support Vector Machine (SVM): Finds maximum-margin separating hyperplanes, here in the feature space induced by an RBF kernel
3. Random Forest: Ensemble of decision trees using bagging

All models use the same 256D CNN features and train/test split for fair comparison.

In [None]:
# SVM with RBF kernel
svm_classifier = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    random_state=42
)
svm_classifier.fit(X_train, y_train)

# Random Forest
rf_classifier = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)
rf_classifier.fit(X_train, y_train)

# Compare all three classifiers
classifier_comparison = []
for name, clf in [
    ("Logistic Regression (C=0.1)", lr_regularized),
    ("SVM (RBF)", svm_classifier),
    ("Random Forest", rf_classifier)
]:
    train_pred = clf.predict(X_train)
    test_pred = clf.predict(X_test)

    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    gap = train_acc - test_acc

    _, _, f1_macro, _ = precision_recall_fscore_support(
        y_test, test_pred, average='macro', zero_division=0)
    _, _, f1_weighted, _ = precision_recall_fscore_support(
        y_test, test_pred, average='weighted', zero_division=0)

    classifier_comparison.append({
        'Classifier': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Gap': gap,
        'Macro F1': f1_macro,
        'Weighted F1': f1_weighted
    })

df_comparison = pd.DataFrame(classifier_comparison)
print(df_comparison.to_string(index=False))

# Visualization
fig_comparison = go.Figure()
for metric in ['Train Acc', 'Test Acc', 'Macro F1']:
    fig_comparison.add_trace(go.Bar(
        name=metric,
        x=[row['Classifier'] for row in classifier_comparison],
        y=[row[metric] for row in classifier_comparison],
        text=[f"{row[metric]:.3f}" for row in classifier_comparison],
        textposition='auto'
    ))

fig_comparison.update_layout(
    title='Classifier Performance Comparison on 256D CNN Features',
    xaxis_title='Classifier',
    yaxis_title='Score',
    barmode='group',
    height=500,
    width=900,
    yaxis=dict(range=[0, 1])
)
fig_comparison.show()

| Classifier                      | Train Acc | Test Acc  | Gap   | Macro F1  | Weighted F1 |
| ------------------------------- | --------- | --------- | ----- | --------- | ----------- |
| **Logistic Regression (C=0.1)** | 63.8%     | **40.9%** | 22.9% | **33.8%** | **39.4%**   |
| SVM (RBF)                       | 61.0%     | 38.5%     | 22.5% | 30.9%     | 36.2%       |
| Random Forest                   | 98.9%     | 33.2%     | 65.7% | 22.7%     | 27.6%       |

The comparison reveals some surprising results about what works best for classifying CNN features. Logistic Regression, despite being the simplest of the three models, achieves the highest test accuracy at 40.9% and the best Macro F1 score at 33.8%. This tells us something important: the CNN has already done the hard work of transforming images into a feature space where simple linear decision boundaries work well.

The SVM with an RBF kernel performs slightly worse than Logistic Regression, reaching 38.5% test accuracy. The RBF kernel can produce curved decision boundaries by implicitly mapping the data into a higher-dimensional space, but it doesn't help here, at least with the untuned settings we used (C=1.0, gamma='scale'). This reinforces the idea that the 256-dimensional CNN features are already well-organized for linear classification. Adding the kernel's complexity doesn't improve results and actually hurts performance slightly.

Random Forest shows a dramatic case of overfitting. It achieves nearly perfect training accuracy at 98.9% but crashes down to just 33.2% on the test set. With 233 categories and 256 features, the trees can find all sorts of splits that perfectly separate training data but represent noise rather than real visual relationships. These patterns don't hold up when we test on new images.

The lesson here is that regularization matters more than model sophistication when working with CNN features. Logistic Regression's L2 penalty (controlled by C=0.1) prevents it from overfitting by keeping the model weights small. Random Forest, despite being a powerful ensemble method, is only loosely constrained by the settings used here (max_depth=20, min_samples_split=5) and ends up learning patterns that are too specific to the training set.

The CNN has already learned the complex non-linear transformations during its training, so downstream classifiers just need to draw decision boundaries in the resulting feature space. Adding another layer of complexity tends to introduce overfitting without providing better discriminative power.
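
The ability of tree ensembles to memorize training data can be demonstrated in isolation. The sketch below is a toy experiment under stated assumptions: the features are pure Gaussian noise and the labels are random, so there is nothing real to learn (`X_noise`, `y_noise`, and `rf_toy` are hypothetical names, and the forest uses default depth settings rather than our configuration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(400, 10))
y_noise = rng.integers(0, 2, size=400)  # labels independent of the features

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_noise, y_noise, test_size=0.5, random_state=0
)

# Unconstrained trees can isolate individual training samples
rf_toy = RandomForestClassifier(n_estimators=100, random_state=0)
rf_toy.fit(X_tr_toy, y_tr_toy)

train_acc_toy = rf_toy.score(X_tr_toy, y_tr_toy)
test_acc_toy = rf_toy.score(X_te_toy, y_te_toy)
print(f"Train accuracy: {train_acc_toy:.3f}")
print(f"Test accuracy:  {test_acc_toy:.3f}")
```

A large train/test gap on data with no signal shows that high training accuracy from a forest says little by itself, which is why we judge the classifiers above by their test metrics.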

### 5.5.1 Statistical Significance Testing

To verify that performance differences between classifiers are statistically significant rather than due to random variation, we run each model with 10 different random seeds and perform paired t-tests.

In [None]:
# Run each classifier 10 times with different seeds
n_runs = 10
seeds = [42, 123, 456, 789, 1011, 1213, 1415, 1617, 1819, 2021]

classifier_runs = {
    'Logistic Regression (C=0.1)': [],
    'SVM (RBF)': [],
    'Random Forest': []
}

print("Running classifiers with multiple seeds...")
for seed in seeds:
    # Split with different seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        features_scaled, labels_loaded,
        test_size=0.2, random_state=seed, stratify=labels_loaded
    )

    # Logistic Regression
    lr = LogisticRegression(max_iter=1000, random_state=seed,
                            solver='lbfgs', C=0.1, n_jobs=-1)
    lr.fit(X_tr, y_tr)
    lr_acc = accuracy_score(y_te, lr.predict(X_te))
    classifier_runs['Logistic Regression (C=0.1)'].append(lr_acc)

    # SVM
    svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=seed)
    svm.fit(X_tr, y_tr)
    svm_acc = accuracy_score(y_te, svm.predict(X_te))
    classifier_runs['SVM (RBF)'].append(svm_acc)

    # Random Forest
    rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_split=5,
                                random_state=seed, n_jobs=-1)
    rf.fit(X_tr, y_tr)
    rf_acc = accuracy_score(y_te, rf.predict(X_te))
    classifier_runs['Random Forest'].append(rf_acc)

    print(f"Seed {seed}: LR={lr_acc:.4f}, SVM={svm_acc:.4f}, RF={rf_acc:.4f}")

# Compute statistics
stats_results = []
for name, scores in classifier_runs.items():
    stats_results.append({
        'Classifier': name,
        'Mean': np.mean(scores),
        'Std': np.std(scores),  # population std (NumPy default, ddof=0)
        'Min': np.min(scores),
        'Max': np.max(scores)
    })

df_stats = pd.DataFrame(stats_results)
print("\nStatistical Summary (10 runs):")
print(df_stats.to_string(index=False))

# Paired t-tests
print("\nPaired T-Tests (two-tailed):")
lr_scores = classifier_runs['Logistic Regression (C=0.1)']
svm_scores = classifier_runs['SVM (RBF)']
rf_scores = classifier_runs['Random Forest']

t_stat_lr_svm, p_val_lr_svm = ttest_rel(lr_scores, svm_scores)
t_stat_lr_rf, p_val_lr_rf = ttest_rel(lr_scores, rf_scores)
t_stat_svm_rf, p_val_svm_rf = ttest_rel(svm_scores, rf_scores)

print(f"LR vs SVM:  t={t_stat_lr_svm:.3f}, p={p_val_lr_svm:.4f}")
print(f"LR vs RF:   t={t_stat_lr_rf:.3f}, p={p_val_lr_rf:.4f}")
print(f"SVM vs RF:  t={t_stat_svm_rf:.3f}, p={p_val_svm_rf:.4f}")

# Visualize distributions
fig_sig = go.Figure()
for name, scores in classifier_runs.items():
    fig_sig.add_trace(go.Box(
        y=scores,
        name=name,
        boxmean='sd'
    ))

fig_sig.update_layout(
    title='Test Accuracy Distribution Across 10 Random Seeds',
    yaxis_title='Test Accuracy',
    showlegend=True,
    height=500,
    width=900
)
fig_sig.show()

Across 10 random seeds, Logistic Regression (C=0.1) is the strongest classifier on these CNN features. It reaches a mean test accuracy of 41.13% ± 0.44%, ahead of SVM (39.07% ± 0.59%) and Random Forest (33.18% ± 0.52%). All three paired t-tests give p < 0.001, and the standard deviations stay below 0.6 percentage points for every method, so the ranking is stable across train/test splits. One caveat applies: the 10 splits are drawn from the same dataset and therefore overlap, so the runs are not fully independent and the p-values are likely optimistic. With that caveat, the results support the conclusion that, for CNN-derived features, a simple linear classifier with proper regularization outperforms more complex models: the CNN's convolutional layers have already learned the critical non-linear transformations, so downstream classifiers gain more from regularization that prevents overfitting than from additional model complexity.
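
For readers who want to see what the paired test computes, the sketch below derives the t statistic by hand from per-seed differences. The helper name `paired_t_statistic` and the toy accuracies are assumptions for illustration, not the notebook's results. The statistic should agree with the one returned by `scipy.stats.ttest_rel` on the same inputs; the p-value additionally needs the Student t distribution, which is left to SciPy.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on two samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    # The test works on per-pair differences, here one difference per seed
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Illustrative accuracies only (hypothetical values)
toy_a = [0.41, 0.42, 0.40, 0.43]
toy_b = [0.39, 0.39, 0.39, 0.40]

t_stat, dof = paired_t_statistic(toy_a, toy_b)
print(f"t={t_stat:.3f} with {dof} degrees of freedom")
```

Pairing by seed matters because both classifiers see the same split in each run: split-to-split variation cancels in the differences, which is why small but consistent gaps can yield large t values.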

## 6. K-Means Clustering Analysis

Having established data-driven macro-categories through hierarchical clustering (Section 4.3), we now evaluate how well K-Means clustering aligns with these visual groupings. K-Means partitions images into k clusters by minimizing within-cluster variance, assigning each image to the nearest centroid. We measure cluster quality using three complementary metrics: Purity (cluster homogeneity), ARI (Adjusted Rand Index, which accounts for chance agreement), and NMI (Normalized Mutual Information, which measures information overlap). By evaluating these metrics against both the original 233 fine-grained categories and our hierarchically-derived macro-categories (5-30 groups), we can assess whether K-Means captures meaningful visual structure at different levels of granularity.
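
The assign-then-update loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the function names, the toy points, and the fixed starting centroids are assumptions, whereas `KMeans` by default uses k-means++ initialization and, with `n_init=10`, keeps the best of several restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, centroids, n_iter=10):
    """Alternate assignment and centroid-update steps for n_iter rounds."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignments, centroids


# Two well-separated toy groups (made up for this sketch)
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_assignments, toy_centroids = lloyd_kmeans(toy_points, [(0, 0), (10, 10)])
print(toy_assignments, toy_centroids)
```

Each update step moves a centroid to the mean of its members, which is the point that minimizes the within-cluster sum of squared distances for a fixed assignment.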

In [None]:
# Run K-Means with different cluster counts and compute multiple metrics

cluster_counts = [10, 20, 50, 100, 233]


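def calculate_purity(cluster_labels, true_labels):
    """Fraction of samples whose label matches the majority label of their cluster."""
    # Boolean-mask indexing below requires NumPy arrays
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    correct = sum(
        Counter(true_labels[cluster_labels == c]).most_common(1)[0][1]
        for c in np.unique(cluster_labels)
    )
    return correct / len(cluster_labels)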


# Run K-Means and calculate all metrics for different configurations
all_clustering_results = []
for k in cluster_counts:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(features_50d)

    # Metrics against fine-grained (233) categories
    fine_purity = calculate_purity(cluster_labels, labels_loaded)
    fine_ari = adjusted_rand_score(labels_loaded, cluster_labels)
    fine_nmi = normalized_mutual_info_score(labels_loaded, cluster_labels)

    # Metrics against macro-categories at different granularities
    for n_macro in macro_granularities:
        macro_labels = get_macro_labels(linkage_matrix, n_macro, labels_loaded)
        macro_purity = calculate_purity(cluster_labels, macro_labels)
        macro_ari = adjusted_rand_score(macro_labels, cluster_labels)
        macro_nmi = normalized_mutual_info_score(macro_labels, cluster_labels)

        all_clustering_results.append({
            'n_kmeans_clusters': k,
            'n_macro_categories': n_macro,
            'fine_grained_purity': fine_purity,
            'fine_grained_ari': fine_ari,
            'fine_grained_nmi': fine_nmi,
            'macro_purity': macro_purity,
            'macro_ari': macro_ari,
            'macro_nmi': macro_nmi
        })

df_clustering = pd.DataFrame(all_clustering_results)

### Purity, ARI, and NMI

We visualize K-Means clustering quality using three complementary metrics across different granularities. Three heatmaps show Purity, ARI, and NMI across all combinations of K-Means clusters (10, 20, 50, 100, 233) and target category granularities (5, 10, 15, 20, 30 macro-categories, plus the original 233 fine-grained categories).

- Purity measures cluster homogeneity but tends to increase with K regardless of cluster quality, making it unreliable for comparing different cluster counts
- ARI (Adjusted Rand Index) corrects for chance agreement: ARI≈0 is what random cluster assignments would produce, ARI=1 is a perfect match, and values above 0 indicate agreement beyond chance
- NMI (Normalized Mutual Information) quantifies information overlap between cluster assignments and true categories

Higher values (green in heatmaps) indicate better alignment between K-Means clusters and target categories.
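
The contrast between purity and ARI is easiest to see on a tiny example. The sketch below is a pure-Python illustration under stated assumptions: the function names and toy labels are hypothetical, and the ARI uses the standard pair-counting formula rather than calling `adjusted_rand_score`. Placing every point in its own cluster gives perfect purity yet an ARI of zero.

```python
from collections import Counter
from math import comb


def purity(cluster_labels, true_labels):
    """Fraction of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, []).append(t)
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """ARI computed from pair counts in the contingency table."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, cluster_labels))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_cols = sum(comb(v, 2) for v in Counter(cluster_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate partitions; treated as a match here
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (made up for this sketch)
toy_true = ["a", "a", "a", "b", "b", "b"]
singletons = [0, 1, 2, 3, 4, 5]   # one cluster per point
perfect = [0, 0, 0, 1, 1, 1]      # clusters match the true classes

purity_singletons = purity(singletons, toy_true)
ari_singletons = adjusted_rand_index(toy_true, singletons)
ari_perfect = adjusted_rand_index(toy_true, perfect)
print(purity_singletons, ari_singletons, ari_perfect)
```

This is why the heatmaps are read together: a rise in purity with K is only meaningful when ARI or NMI rises as well.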

In [None]:
# Visualization: Three heatmaps for Purity, ARI, and NMI across macro-categories
fig_heatmaps = make_subplots(
    rows=1, cols=3,
    subplot_titles=('Purity', 'Adjusted Rand Index (ARI)',
                    'Normalized Mutual Information (NMI)'),
    horizontal_spacing=0.12
)

# Prepare data for all three heatmaps
pivot_purity = df_clustering.pivot(index='n_kmeans_clusters',
                                   columns='n_macro_categories', values='macro_purity')
pivot_purity[233] = df_clustering.groupby('n_kmeans_clusters')[
    'fine_grained_purity'].first()
pivot_purity = pivot_purity.sort_index(axis=1, ascending=False)

pivot_ari = df_clustering.pivot(index='n_kmeans_clusters',
                                columns='n_macro_categories', values='macro_ari')
pivot_ari[233] = df_clustering.groupby('n_kmeans_clusters')[
    'fine_grained_ari'].first()
pivot_ari = pivot_ari.sort_index(axis=1, ascending=False)

pivot_nmi = df_clustering.pivot(index='n_kmeans_clusters',
                                columns='n_macro_categories', values='macro_nmi')
pivot_nmi[233] = df_clustering.groupby('n_kmeans_clusters')[
    'fine_grained_nmi'].first()
pivot_nmi = pivot_nmi.sort_index(axis=1, ascending=False)

# Purity heatmap
fig_heatmaps.add_trace(
    go.Heatmap(
        z=pivot_purity.values,
        x=[str(c) for c in pivot_purity.columns],
        y=[f'K={k}' for k in pivot_purity.index],
        colorscale='RdYlGn',
        text=np.round(pivot_purity.values, 3),
        texttemplate='%{text}',
        textfont={'size': 10},
        colorbar=dict(title='Purity', x=0.29),
        showscale=True
    ),
    row=1, col=1
)

# ARI heatmap
fig_heatmaps.add_trace(
    go.Heatmap(
        z=pivot_ari.values,
        x=[str(c) for c in pivot_ari.columns],
        y=[f'K={k}' for k in pivot_ari.index],
        colorscale='RdYlGn',
        text=np.round(pivot_ari.values, 3),
        texttemplate='%{text}',
        textfont={'size': 10},
        colorbar=dict(title='ARI', x=0.63),
        showscale=True
    ),
    row=1, col=2
)

# NMI heatmap
fig_heatmaps.add_trace(
    go.Heatmap(
        z=pivot_nmi.values,
        x=[str(c) for c in pivot_nmi.columns],
        y=[f'K={k}' for k in pivot_nmi.index],
        colorscale='RdYlGn',
        text=np.round(pivot_nmi.values, 3),
        texttemplate='%{text}',
        textfont={'size': 10},
        colorbar=dict(title='NMI', x=1.0),
        showscale=True
    ),
    row=1, col=3
)

fig_heatmaps.update_xaxes(title_text='Target Categories', row=1, col=1)
fig_heatmaps.update_xaxes(title_text='Target Categories', row=1, col=2)
fig_heatmaps.update_xaxes(title_text='Target Categories', row=1, col=3)
fig_heatmaps.update_yaxes(title_text='K-Means Clusters', row=1, col=1)

fig_heatmaps.update_layout(
    title_text='K-Means Clustering: Purity, ARI, and NMI Across Different Granularities',
    height=500,
    width=1400
)
fig_heatmaps.show()

The three heatmaps reveal how K-Means clustering performs differently depending on which metric we use and how we group the categories. Looking at purity (left), we see values climbing from 10.6% (K=10, 233 categories) all the way to 66.7% (K=233, 5 macro-categories). This diagonal pattern makes sense: purity goes up when we either create more clusters or reduce the number of target groups we're comparing against. Purity, however, has a built-in bias that makes it look better as K increases, even if the clusters aren't actually meaningful.

The ARI heatmap (middle) tells a more honest story. The best ARI scores (~0.12-0.13) show up in the middle range, around 10-20 macro-categories, suggesting that this is where K-Means structure best matches our hierarchical groupings. Against the original 233 fine-grained categories, ARI stays below 0.10 for all K values, indicating that K-Means does not recover those detailed distinctions.

Finally, NMI (right) provides the most stable view: values increase smoothly from 0.16-0.19 at 5 macro-categories up to 0.34-0.41 at 233 categories. K-Means works well for broad categorization (57-67% purity on 5 macro-categories), but on fine-grained labels its agreement with the true categories stays low (ARI below 0.10), whereas the supervised classifier reached 40.9% accuracy. For real-world use, this means K-Means on CNN features is useful for organizing images into general groups, but labeled data and supervised learning are needed to tell specific categories apart.
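
As a reference for how the NMI values are defined, the following sketch computes NMI from label counts, normalizing mutual information by the arithmetic mean of the two entropies (the default averaging in `normalized_mutual_info_score`). The function names and toy labels are assumptions for illustration, and the handling of the zero-entropy case is a simplification.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same points."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def nmi(u, v):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(u) + entropy(v)) / 2
    # Assumption: if both labelings are constant, report a perfect match
    return mutual_information(u, v) / denom if denom > 0 else 1.0


# Toy labels (made up for this sketch)
toy_classes = ["a", "a", "b", "b"]
nmi_match = nmi(toy_classes, [0, 0, 1, 1])        # same partition
nmi_unrelated = nmi(toy_classes, [0, 1, 0, 1])    # independent partition
print(nmi_match, nmi_unrelated)
```

Unlike ARI, NMI is not corrected for chance, so it tends to rise as the number of clusters or target categories grows; this is one reason to read it alongside ARI rather than in isolation.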

## 7. Conclusions

### Summary of Findings

This project shows that CNN features combined with hierarchical clustering can organize large image collections without exhaustive labeling. Our CNN achieved 30.3% validation accuracy on 233 classes: modest, but sufficient to learn meaningful visual representations. The 256-dimensional embeddings capture genuine similarity: space objects cluster by radial patterns and glowing effects, birds group by feather textures and body shapes, and vehicles form coherent structural clusters.

Supervised classification confirms the features are discriminative. Logistic Regression achieved 40.9% test accuracy, outperforming SVM (38.5%) and Random Forest (33.2%). The CNN already learned the necessary non-linear transformations, so simple regularized classifiers work best. When classifying into 5 macro-categories instead of 233 fine-grained labels, accuracy jumped to 72%, validating that our hierarchical groupings capture real visual coherence.

K-Means clustering achieved 57-67% purity on broad categories but struggled with fine-grained distinctions (ARI < 0.10 for 233 classes). Unsupervised methods work well for coarse organization but cannot replace supervised learning for detailed categorization.

### Limitations

Three main limitations constrain this work. First, our CNN is shallow compared to ResNet or EfficientNet, and we trained it from scratch rather than using ImageNet pretrained weights. This was intentional, since we wanted to see what patterns a CNN learns from our data alone, but it limits feature quality. Second, our evaluation metrics assume ground-truth labels are correct, yet PatternMind categories mix semantic concepts with visual properties inconsistently. The "clutter" category exemplifies this problem: it contains images of messy, disorganized scenes, but our confusion matrix (Section 5.3.1) shows that objects photographed against busy backgrounds (ladders, t-shirts) frequently get misclassified as clutter. The CNN cannot distinguish between an inherently cluttered scene and a clean object in a cluttered context, because both produce similar visual features (complex textures, varied colors, lack of a clear focal point). This label ambiguity propagates through our entire evaluation pipeline, inflating error rates on categories that happen to appear in visually noisy contexts. Third, CNN features optimized for classification don't form geometrically compact clusters, causing substantial overlap between categories in feature space.

### Future Work

Two directions would most improve this work. Transfer learning with a pretrained backbone (ResNet-50 or Vision Transformer) would likely yield stronger features and could cut training time from 5 hours to potentially under 30 minutes. Multimodal embeddings combining visual features with text descriptions could resolve ambiguity zones where visually similar but semantically unrelated items cluster together (e.g., conch shells with hamburgers based on warm colors and rounded shapes).