# Preprocessing
**Author:** Sandy Tam
<br>

**Dataset:** Kaggle - Cat Breeds Dataset

This notebook documents the full data preparation pipeline for my cat breed classification model.  
Steps:

1. Download and load the Kaggle Cat Breeds dataset.
2. Inspect the raw directory structure and class labels.
3. Clean the data (invalid files, empty classes, etc.).
4. Create the train/test splits.
5. Apply standard image preprocessing and augmentations.
6. Save a clean, reproducible dataset layout for model training.
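
A minimal sketch of what step 3 could look like, assuming the same one-folder-per-breed layout used throughout this notebook. It flags `.jpg` files that do not start with the JPEG signature and breed folders with no images. The helper names `looks_like_jpeg`, `find_invalid_images`, and `find_empty_classes` are hypothetical and not part of the original pipeline. The header check is only a cheap first pass: it will not catch truncated or partially corrupted files, which need a full decode to detect.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with the SOI marker followed by 0xFF


def looks_like_jpeg(path):
    """Return True if the file can be read and starts with the JPEG signature."""
    try:
        with open(path, "rb") as f:
            return f.read(3) == JPEG_MAGIC
    except OSError:
        # Missing or unreadable files are treated as invalid.
        return False


def find_invalid_images(root):
    """Walk `root` and return the .jpg paths that fail the header check, sorted."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname.lower().endswith(".jpg"):
                path = os.path.join(dirpath, fname)
                if not looks_like_jpeg(path):
                    bad.append(path)
    return sorted(bad)


def find_empty_classes(root):
    """Return the names of class subfolders that contain no .jpg files."""
    empty = []
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir) and not any(
            f.lower().endswith(".jpg") for f in os.listdir(class_dir)
        ):
            empty.append(name)
    return empty
```

Both finders only report problems and leave the files in place, so the results can be reviewed before anything is deleted.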

In [1]:
import os
import numpy as np
import pandas as pd

import torch
import torch.nn.functional as F

from sklearn.model_selection import train_test_split

### Importing the Dataset

In [2]:
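cat_imgs_dir = "dataset/images"

# Fail early if the image folder is missing.
if not os.path.isdir(cat_imgs_dir):
    raise FileNotFoundError(f"{cat_imgs_dir} not found")

image_paths = []
labels = []

# Each subfolder of cat_imgs_dir is one breed; its name becomes the label.
for breed_name in os.listdir(cat_imgs_dir):
    breed_dir = os.path.join(cat_imgs_dir, breed_name)
    if not os.path.isdir(breed_dir):
        continue  # ignore stray files at the top level
    for fname in os.listdir(breed_dir):
        # Only .jpg files are collected; other extensions are skipped.
        if fname.lower().endswith(".jpg"):
            image_paths.append(os.path.join(breed_dir, fname))
            labels.append(breed_name)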

print("Total images:", len(image_paths))
print("Unique breeds:", len(set(labels)))
print("Example:", image_paths[0], "->", labels[0])

Total images: 126607
Unique breeds: 67
Example: dataset/images\Abyssinian\12136161_252.jpg -> Abyssinian


Each subfolder is named after a cat breed, and that folder name is used as the class label for every image inside it.
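
Because labels come straight from folder names, a quick per-class count helps spot very small or very large classes before splitting. The following is a rough sketch, assuming a list of string labels like the `labels` list built above. The function name `summarize_class_counts` and the returned keys are hypothetical.

```python
from collections import Counter


def summarize_class_counts(labels):
    """Count images per class and report the smallest and largest classes."""
    if not labels:
        raise ValueError("labels is empty")
    counts = Counter(labels)
    # Sort by (count, name) so ties are broken deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    smallest, largest = ordered[0], ordered[-1]
    return {
        "counts": dict(counts),
        "smallest": smallest,  # (class name, image count)
        "largest": largest,    # (class name, image count)
        "imbalance_ratio": largest[1] / smallest[1],
    }
```

A large `imbalance_ratio` suggests that a stratified split, and possibly class weighting during training, is worth considering.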

In [3]:
# Sort the labels so the label-to-index mapping is identical on every run.
unique_labels = sorted(set(labels))
label_to_idx = {lbl: i for i, lbl in enumerate(unique_labels)}
num_classes = len(unique_labels)

print("Number of classes:", num_classes)
print("Example mapping:", list(label_to_idx.items())[:5])

Number of classes: 67
Example mapping: [('Abyssinian', 0), ('American Bobtail', 1), ('American Curl', 2), ('American Shorthair', 3), ('American Wirehair', 4)]
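
The next cell reads images from `dataset/split/train` and `dataset/split/test`, which are assumed to exist already. As a sketch of how such a split could be produced, the code below does a per-class (stratified) shuffle-and-cut using only the standard library. It stands in for whatever the notebook actually used and is not taken from it. The function name `stratified_split` and the defaults `test_fraction=0.2` and `seed=42` are assumptions; 0.2 was chosen only because the sample counts printed later are roughly 80/20. scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(paths, labels, test_fraction=0.2, seed=42):
    """Split (path, label) pairs so each class keeps about the same test share."""
    if len(paths) != len(labels):
        raise ValueError("paths and labels must have the same length")

    # Group image paths by class label.
    by_class = defaultdict(list)
    for path, label in zip(paths, labels):
        by_class[label].append(path)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = sorted(by_class[label])  # fixed order before shuffling
        rng.shuffle(items)
        # Very small classes may round down to zero test images.
        n_test = round(len(items) * test_fraction)
        test.extend((p, label) for p in items[:n_test])
        train.extend((p, label) for p in items[n_test:])
    return train, test
```

Sorting before shuffling means the result depends only on the seed and the set of files, not on the order in which `os.listdir` happened to return them.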


In [4]:
train_dir = "dataset/split/train"
test_dir = "dataset/split/test"

def load_split(split_root, label_to_idx):
    paths = []
    y_int = []
    for breed_name in os.listdir(split_root):
        breed_dir = os.path.join(split_root, breed_name)
        if not os.path.isdir(breed_dir):
            continue
        if breed_name not in label_to_idx:
            continue  # skip folders whose name is not a known breed label
        label_idx = label_to_idx[breed_name]
        for fname in os.listdir(breed_dir):
            if fname.lower().endswith(".jpg"):
                paths.append(os.path.join(breed_dir, fname))
                y_int.append(label_idx)
    return paths, np.array(y_int, dtype=np.int64)

train_paths, train_y_int = load_split(train_dir, label_to_idx)
test_paths, test_y_int = load_split(test_dir, label_to_idx)

print("Train samples:", len(train_paths))
print("Test samples: ", len(test_paths))

y_train_tensor = torch.tensor(train_y_int, dtype=torch.int64)
y_test_tensor  = torch.tensor(test_y_int,  dtype=torch.int64)

y_train_onehot = F.one_hot(y_train_tensor, num_classes=num_classes).float()
y_test_onehot  = F.one_hot(y_test_tensor,  num_classes=num_classes).float()

print("One-hot label shapes:")
print("Train:", y_train_onehot.shape)
print("Test: ", y_test_onehot.shape)

Train samples: 101260
Test samples:  25347
One-hot label shapes:
Train: torch.Size([101260, 67])
Test:  torch.Size([25347, 67])
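
A simple sanity check at this point is to confirm that no image appears in both splits. The sketch below identifies each image by its breed folder and file name, assuming the `<split>/<breed>/<file>` layout used above. It normalizes backslashes because the paths printed earlier mix both separator styles. The names `sample_key` and `find_split_overlap` are hypothetical.

```python
def sample_key(path):
    """Identify an image by (breed folder, file name), ignoring the split root."""
    parts = path.replace("\\", "/").split("/")
    return tuple(parts[-2:])


def find_split_overlap(train_paths, test_paths):
    """Return the sorted (breed, file name) keys present in both splits."""
    train_keys = {sample_key(p) for p in train_paths}
    test_keys = {sample_key(p) for p in test_paths}
    return sorted(train_keys & test_keys)
```

An empty result from `find_split_overlap(train_paths, test_paths)` means no file name is shared within a breed across the two splits. It does not rule out duplicate images saved under different names.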
