# 1. Data Exploration and Preprocessing

## Overview

In this notebook, we'll explore the **Muxspace Facial Expression Dataset** and prepare it for training our emotion classification models.

### What We'll Learn:
1. How to load and inspect the dataset
2. How to analyze the class distribution (are some emotions more common?)
3. How to visualize sample images from each emotion class
4. Which preprocessing steps the images need (resizing, normalization)
5. Which data augmentation techniques help reduce overfitting
6. How to create train/validation/test splits

### Why This Matters:
Before training any machine learning model, we need to understand our data. Exploration catches common issues such as:
- Class imbalance (too many "neutral" faces, too few "fear")
- Noisy labels (mislabeled images)
- Data quality issues (corrupted images, wrong formats)

Catching these early will save you hours of debugging later!

## Step 1: Import Libraries

We'll use:
- **pandas**: For loading and manipulating the CSV labels file
- **numpy**: For numerical operations on image arrays
- **matplotlib/seaborn**: For creating visualizations
- **PIL (Pillow)**: For loading and displaying images
- **torch/torchvision**: For PyTorch data loading utilities

In [None]:
# Standard library imports
import os
from pathlib import Path
from collections import Counter

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Image processing
from PIL import Image

# PyTorch
import torch
from torchvision import transforms

# Set style for nicer plots
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# For reproducibility (NumPy for sampling, PyTorch for augmentation and shuffling)
np.random.seed(42)
torch.manual_seed(42)

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

## Step 2: Define Paths and Constants

We'll set up all the paths to our data and define the emotion classes we're working with.

### The 7 Emotion Classes:
1. **Anger** - Furrowed brows, tight lips
2. **Disgust** - Wrinkled nose, raised upper lip
3. **Fear** - Wide eyes, raised eyebrows
4. **Happiness** - Smile, raised cheeks
5. **Neutral** - Relaxed face, no strong expression
6. **Sadness** - Drooping mouth corners, lowered brows
7. **Surprise** - Raised eyebrows, open mouth

In [None]:
# =============================================================================
# PATH CONFIGURATION
# =============================================================================

# Get the project root (parent of notebook_version folder)
NOTEBOOK_DIR = Path().absolute()
PROJECT_ROOT = NOTEBOOK_DIR.parent

# Data paths
DATA_DIR = PROJECT_ROOT / "data" / "facial_expressions-master"
IMAGES_DIR = DATA_DIR / "images"
LEGEND_PATH = DATA_DIR / "data" / "legend.csv"

# Verify paths exist
print("Checking paths...")
print(f"  Project root: {PROJECT_ROOT}")
print(f"  Data directory exists: {DATA_DIR.exists()}")
print(f"  Images directory exists: {IMAGES_DIR.exists()}")
print(f"  Legend file exists: {LEGEND_PATH.exists()}")

# =============================================================================
# EMOTION CLASSES
# =============================================================================

# The 7 emotion classes we'll use
EMOTION_CLASSES = [
    "anger",
    "disgust", 
    "fear",
    "happiness",
    "neutral",
    "sadness",
    "surprise",
]

# Create mappings between emotion names and numeric labels
# Neural networks need numeric labels, not strings!
EMOTION_TO_IDX = {emotion: idx for idx, emotion in enumerate(EMOTION_CLASSES)}
IDX_TO_EMOTION = {idx: emotion for idx, emotion in enumerate(EMOTION_CLASSES)}

print(f"\nNumber of emotion classes: {len(EMOTION_CLASSES)}")
print(f"\nEmotion to index mapping:")
for emotion, idx in EMOTION_TO_IDX.items():
    print(f"  {emotion}: {idx}")

## Step 3: Load and Explore the Labels

The dataset comes with a CSV file (`legend.csv`) that maps each image filename to its emotion label.

Let's load it and see what we're working with!

In [None]:
# Load the legend CSV file
df_raw = pd.read_csv(LEGEND_PATH)

# Display basic info
print("=" * 60)
print("RAW DATASET OVERVIEW")
print("=" * 60)
print(f"\nTotal samples: {len(df_raw)}")
print(f"\nColumns: {list(df_raw.columns)}")
print(f"\nFirst 10 rows:")
df_raw.head(10)

In [None]:
# Check unique emotion values
print("Unique emotion labels in the raw data:")
print(df_raw['emotion'].value_counts())

### Problem Detected: Inconsistent Label Casing!

Notice how we have both `happiness` and `HAPPINESS`? This is a common data quality issue.

We need to normalize all labels to lowercase to treat them as the same class.
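
Before normalizing, it can help to list exactly which labels collide. The sketch below is a minimal, standard-library illustration: `find_label_variants` and the toy label list are hypothetical and not part of the dataset, and the function assumes every label is a string.

```python
from collections import defaultdict

def find_label_variants(labels):
    """Group raw labels by normalized form; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for raw in labels:
        # Same normalization as the cleaning step: trim whitespace, lowercase
        groups[raw.strip().lower()].add(raw)
    return {name: sorted(raws) for name, raws in groups.items() if len(raws) > 1}

# Toy stand-in for the 'emotion' column (made-up values)
toy_labels = ["happiness", "HAPPINESS", "neutral", "Neutral ", "fear", "fear"]
variants = find_label_variants(toy_labels)
print(variants)
```

On the real data, the same function could be called on `df_raw['emotion']`, provided the column has no missing values.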

In [None]:
# =============================================================================
# DATA CLEANING
# =============================================================================

# Create a copy to avoid modifying the original
df = df_raw.copy()

# Step 1: Normalize emotion labels to lowercase
df['emotion'] = df['emotion'].str.lower().str.strip()

print("After normalizing to lowercase:")
print(df['emotion'].value_counts())

In [None]:
# Step 2: Filter to only include our 7 emotion classes
# (This removes 'contempt' which has very few samples, and the header row 'emotion')

print(f"\nBefore filtering: {len(df)} samples")

df = df[df['emotion'].isin(EMOTION_CLASSES)].copy()

print(f"After filtering to 7 classes: {len(df)} samples")

# Step 3: Create full image paths and verify they exist
df['image_path'] = df['image'].apply(lambda x: IMAGES_DIR / x)
df['exists'] = df['image_path'].apply(lambda x: x.exists())

print(f"\nImages that exist: {df['exists'].sum()}")
print(f"Images missing: {(~df['exists']).sum()}")

# Keep only existing images
df = df[df['exists']].copy()

# Step 4: Add numeric labels
df['label'] = df['emotion'].map(EMOTION_TO_IDX)

print(f"\nFinal dataset size: {len(df)} samples")
df.head()

## Step 4: Analyze Class Distribution

Class imbalance is a major concern in classification tasks. If 80% of our data is "neutral" faces, the model might just predict "neutral" for everything and still get 80% accuracy!
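
To make that arithmetic concrete, here is a small sketch with made-up counts (80 neutral, 15 happiness, 5 fear). The helper `majority_baseline` is hypothetical. It compares plain accuracy with balanced accuracy (the mean of per-class recall) for a model that always predicts the most common class.

```python
from collections import Counter

def majority_baseline(labels):
    """Score a 'model' that always predicts the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(labels)
    # Recall is 1.0 for the majority class and 0.0 for every other class,
    # so the mean per-class recall is 1 / number_of_classes
    balanced_accuracy = 1.0 / len(counts)
    return majority_label, accuracy, balanced_accuracy

toy = ["neutral"] * 80 + ["happiness"] * 15 + ["fear"] * 5
label, acc, bal = majority_baseline(toy)
print(f"Always predicting '{label}': accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```

The gap between the two numbers is why accuracy alone can be misleading on imbalanced data.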

Let's visualize the distribution of emotion classes.

In [None]:
# Count samples per class
class_counts = df['emotion'].value_counts()

# Create a nice bar plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
colors = sns.color_palette('husl', n_colors=len(EMOTION_CLASSES))
bars = axes[0].bar(class_counts.index, class_counts.values, color=colors)
axes[0].set_xlabel('Emotion', fontsize=12)
axes[0].set_ylabel('Number of Samples', fontsize=12)
axes[0].set_title('Class Distribution in Dataset', fontsize=14)
axes[0].tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, count in zip(bars, class_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                 str(count), ha='center', va='bottom', fontsize=10)

# Pie chart
axes[1].pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90)
axes[1].set_title('Class Distribution (Percentage)', fontsize=14)

plt.tight_layout()
plt.show()

# Print statistics
print("\n" + "=" * 60)
print("CLASS DISTRIBUTION STATISTICS")
print("=" * 60)
print(f"\nTotal samples: {len(df)}")
print(f"\nSamples per class:")
for emotion in EMOTION_CLASSES:
    count = class_counts.get(emotion, 0)
    pct = 100 * count / len(df)
    print(f"  {emotion:12s}: {count:5d} ({pct:.1f}%)")

print(f"\nImbalance ratio (max/min): {class_counts.max() / class_counts.min():.1f}x")

### Observations:

1. **Major imbalance**: `neutral` and `happiness` dominate the dataset
2. **Minority classes**: `fear`, `disgust`, `anger` have very few samples
3. **Impact**: We'll need to use techniques like:
   - Weighted loss function (penalize mistakes on minority classes more)
   - Stratified sampling (maintain class ratios in train/val/test splits)
   - Data augmentation (create more variations of minority class samples)
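
As a rough sketch of the weighted-loss idea, the snippet below computes inverse-frequency class weights with the common "balanced" heuristic, `weight_c = N / (K * n_c)`. This is one possible weighting scheme, not necessarily the one used in later notebooks. The function name and toy labels are hypothetical.

```python
import numpy as np

def balanced_class_weights(labels, n_classes):
    """Inverse-frequency weights: rare classes get weights above 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    if (counts == 0).any():
        raise ValueError("every class needs at least one sample")
    # N / (K * n_c): a perfectly balanced dataset gives every class weight 1.0
    return counts.sum() / (n_classes * counts)

# Toy labels: class 0 is common, class 2 is rare
toy_labels = np.array([0] * 6 + [1] * 3 + [2] * 1)
weights = balanced_class_weights(toy_labels, n_classes=3)
print(weights.round(3))
```

Weights like these could be converted to a float tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.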

## Step 5: Visualize Sample Images

Let's look at some actual images from each emotion class to get a feel for the data.

In [None]:
def show_samples_per_class(df, n_samples=4):
    """
    Display sample images from each emotion class.
    
    Args:
        df: DataFrame with image paths and emotion labels
        n_samples: Number of samples to show per class
    """
    fig, axes = plt.subplots(len(EMOTION_CLASSES), n_samples, squeeze=False,
                             figsize=(3*n_samples, 3*len(EMOTION_CLASSES)))
    
    for row, emotion in enumerate(EMOTION_CLASSES):
        # Hide ticks and frames but keep the axes "on": axis('off') would
        # also hide the y-label that names the row
        for ax in axes[row]:
            ax.set_xticks([])
            ax.set_yticks([])
            for spine in ax.spines.values():
                spine.set_visible(False)
        
        # Get samples for this emotion
        emotion_df = df[df['emotion'] == emotion]
        samples = emotion_df.sample(n=min(n_samples, len(emotion_df)), random_state=42)
        
        for col, (_, sample) in enumerate(samples.iterrows()):
            # Load as RGB so grayscale images are not drawn with a false-color map
            img = Image.open(sample['image_path']).convert('RGB')
            axes[row, col].imshow(img)
        
        # Label the row with its emotion on the first column
        axes[row, 0].set_ylabel(emotion.upper(), fontsize=12,
                                rotation=0, ha='right', va='center')
    
    plt.suptitle('Sample Images from Each Emotion Class', fontsize=16, y=1.02)
    plt.tight_layout()
    plt.show()

# Show samples
show_samples_per_class(df, n_samples=4)

## Step 6: Analyze Image Properties

Before we can feed images to a neural network, we need to understand their properties:
- Size (width x height)
- Color channels (RGB vs grayscale)
- File format

In [None]:
# Analyze a sample of images
sample_df = df.sample(n=min(500, len(df)), random_state=42)

widths = []
heights = []
modes = []

for _, row in sample_df.iterrows():
    # The context manager closes each file handle after reading its metadata
    with Image.open(row['image_path']) as img:
        widths.append(img.width)
        heights.append(img.height)
        modes.append(img.mode)

print("=" * 60)
print(f"IMAGE PROPERTIES (from {len(sample_df)} sample images)")
print("=" * 60)

print(f"\nImage sizes:")
print(f"  Width  - min: {min(widths)}, max: {max(widths)}, mean: {np.mean(widths):.0f}")
print(f"  Height - min: {min(heights)}, max: {max(heights)}, mean: {np.mean(heights):.0f}")

print(f"\nColor modes:")
mode_counts = Counter(modes)
for mode, count in mode_counts.items():
    print(f"  {mode}: {count} ({100*count/len(modes):.1f}%)")

# Visualize size distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(widths, bins=30, color='steelblue', edgecolor='white')
axes[0].set_xlabel('Width (pixels)')
axes[0].set_ylabel('Count')
axes[0].set_title('Distribution of Image Widths')
axes[0].axvline(np.mean(widths), color='red', linestyle='--', label=f'Mean: {np.mean(widths):.0f}')
axes[0].legend()

axes[1].hist(heights, bins=30, color='darkorange', edgecolor='white')
axes[1].set_xlabel('Height (pixels)')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of Image Heights')
axes[1].axvline(np.mean(heights), color='red', linestyle='--', label=f'Mean: {np.mean(heights):.0f}')
axes[1].legend()

plt.tight_layout()
plt.show()

### Key Observations:

1. **Variable sizes**: Images have different dimensions - we'll need to resize them all to a fixed size
2. **Color modes**: Most sampled images are color (RGB); any grayscale images show up as `L` mode in the counts above
3. **We'll standardize to**:
   - Size: 224x224 (standard for transfer learning with ImageNet models)
   - Mode: RGB (convert grayscale to RGB)
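
Converting grayscale to RGB adds no new information: the single channel is simply repeated three times so the array has the shape a 3-channel model expects. The NumPy sketch below illustrates the idea on a toy array. `gray_to_rgb` is a hypothetical helper. Pillow's `convert('RGB')` on an `L`-mode image is expected to behave the same way, though we don't call it here.

```python
import numpy as np

def gray_to_rgb(gray):
    """Replicate a (H, W) grayscale array into a (H, W, 3) array."""
    if gray.ndim != 2:
        raise ValueError("expected a 2-D grayscale array")
    # Stack the same channel three times along a new last axis
    return np.stack([gray, gray, gray], axis=-1)

toy_gray = np.arange(12, dtype=np.uint8).reshape(3, 4)
toy_rgb = gray_to_rgb(toy_gray)
print(toy_gray.shape, "->", toy_rgb.shape)
```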

## Step 7: Data Augmentation

Data augmentation applies random transformations to training images, so the model sees a slightly different version of each image every epoch. The dataset on disk doesn't grow, but the variety the model trains on does. This helps:

1. **Prevent overfitting** - Model sees more variations
2. **Improve generalization** - Model learns to handle different conditions
3. **Handle limited data** - Especially important for minority classes

### Augmentations we'll use:

| Augmentation | Why? |
|--------------|------|
| Horizontal Flip | Faces are roughly symmetric |
| Rotation (±10°) | Account for head tilt |
| Color Jitter | Simulate different lighting conditions |
| Normalization | Not an augmentation: a fixed preprocessing step that matches the input statistics ImageNet-pretrained models expect |
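
The normalization step is simple per-channel arithmetic: subtract the channel mean, then divide by the channel standard deviation. The NumPy sketch below uses the ImageNet statistics from this notebook and shows that the step can be reversed exactly, which is what we rely on when displaying normalized images. The helper names and the random toy image are made up for illustration.

```python
import numpy as np

# ImageNet channel statistics, shaped (3, 1, 1) to broadcast over (3, H, W)
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize(img):
    """Channel-wise (x - mean) / std for an image of shape (3, H, W) in [0, 1]."""
    return (img - mean) / std

def denormalize(img):
    """Inverse of normalize: x * std + mean."""
    return img * std + mean

rng = np.random.default_rng(0)
toy = rng.random((3, 4, 4))          # random "image" with values in [0, 1)
restored = denormalize(normalize(toy))
print("Round trip recovers the image:", np.allclose(toy, restored))
```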

In [None]:
# =============================================================================
# DEFINE TRANSFORMS
# =============================================================================

# ImageNet normalization values (required for pretrained models)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

IMG_SIZE = 224  # Standard size for transfer learning

# Training transforms (with augmentation)
train_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),      # Resize to fixed size
    transforms.RandomHorizontalFlip(p=0.5),       # 50% chance of flip
    transforms.RandomRotation(degrees=10),        # Rotate up to 10 degrees
    transforms.ColorJitter(                       # Random color adjustments
        brightness=0.2,
        contrast=0.2
    ),
    transforms.ToTensor(),                        # Convert to tensor [0, 1]
    transforms.Normalize(                         # Normalize with ImageNet stats
        mean=IMAGENET_MEAN,
        std=IMAGENET_STD
    )
])

# Test/Validation transforms (NO augmentation)
test_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])

print("Training transforms:")
print(train_transform)
print("\nTest transforms:")
print(test_transform)

In [None]:
# Visualize augmentation effects
def show_augmentation_examples(image_path, transform, n_examples=6):
    """
    Show multiple augmented versions of the same image.
    """
    original = Image.open(image_path).convert('RGB')
    
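    n_cols = (n_examples + 2) // 2  # enough columns for the original plus n_examples
    fig, axes = plt.subplots(2, n_cols, figsize=(15, 6))
    axes = axes.flatten()
    for ax in axes:
        ax.axis('off')  # hide every panel up front so unused ones stay blank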
    
    # Show original
    axes[0].imshow(original)
    axes[0].set_title('Original', fontsize=12)
    axes[0].axis('off')
    
    # Show augmented versions
    for i in range(1, n_examples + 1):
        # Apply transform (need to undo normalization for display)
        augmented = transform(original)
        
        # Denormalize for display
        mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
        std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
        img_display = augmented * std + mean
        img_display = img_display.permute(1, 2, 0).numpy()
        img_display = np.clip(img_display, 0, 1)
        
        axes[i].imshow(img_display)
        axes[i].set_title(f'Augmented {i}', fontsize=12)
        axes[i].axis('off')
    
    plt.suptitle('Data Augmentation Examples', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

# Pick a sample image
sample_image_path = df.iloc[100]['image_path']
show_augmentation_examples(sample_image_path, train_transform)

## Step 8: Create PyTorch Dataset and DataLoaders

PyTorch uses two key abstractions for data loading:

1. **Dataset**: Defines how to access individual samples
2. **DataLoader**: Handles batching, shuffling, and parallel loading

We'll create a custom Dataset class for our facial expressions.
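
To see how the two abstractions fit together, here is a deliberately simplified, pure-Python sketch. `ToyDataset` and `iterate_batches` are hypothetical stand-ins, not PyTorch's implementation: they skip tensors, collation, and worker processes. The dataset only answers "how many samples?" and "give me sample i"; the loader decides the order and groups samples into batches.

```python
import random

class ToyDataset:
    """Minimal object following the Dataset protocol: __len__ and __getitem__."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, roughly what a DataLoader does for one epoch."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new order, same samples
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

# Ten fake (filename, label) pairs
toy = ToyDataset([(f"img_{i}.jpg", i % 3) for i in range(10)])
batches = list(iterate_batches(toy, batch_size=4, shuffle=True, seed=42))
print([len(b) for b in batches])
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size, which is also why the number of batches per epoch is the dataset size divided by the batch size, rounded up.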

In [None]:
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

class FacialExpressionDataset(Dataset):
    """
    PyTorch Dataset for facial expression images.
    
    This class:
    - Loads images from disk on-demand (memory efficient)
    - Applies transforms (augmentation, normalization)
    - Returns (image_tensor, label) pairs
    """
    
    def __init__(self, dataframe, transform=None):
        """
        Args:
            dataframe: DataFrame with 'image_path' and 'label' columns
            transform: Optional torchvision transforms to apply
        """
        self.data = dataframe.reset_index(drop=True)
        self.transform = transform
    
    def __len__(self):
        """Return the total number of samples."""
        return len(self.data)
    
    def __getitem__(self, idx):
        """
        Get a single sample by index.
        
        Args:
            idx: Index of the sample to retrieve
            
        Returns:
            Tuple of (image_tensor, label)
        """
        # Get image path and label from dataframe
        row = self.data.iloc[idx]
        image_path = row['image_path']
        label = row['label']
        
        # Load image and convert to RGB (handles grayscale images)
        image = Image.open(image_path).convert('RGB')
        
        # Apply transforms if provided
        if self.transform:
            image = self.transform(image)
        
        return image, label

print("FacialExpressionDataset class defined!")

In [None]:
# =============================================================================
# CREATE TRAIN/VALIDATION/TEST SPLITS
# =============================================================================

# Split ratios
TEST_SPLIT = 0.2   # 20% for testing
VAL_SPLIT = 0.1    # 10% for validation (of original data)

# First split: separate test set (20%)
train_val_df, test_df = train_test_split(
    df,
    test_size=TEST_SPLIT,
    random_state=42,
    stratify=df['label']  # IMPORTANT: Maintain class ratios!
)

# Second split: separate validation from training
train_df, val_df = train_test_split(
    train_val_df,
    test_size=VAL_SPLIT / (1 - TEST_SPLIT),  # Adjust ratio
    random_state=42,
    stratify=train_val_df['label']
)

print("=" * 60)
print("DATA SPLITS")
print("=" * 60)
print(f"\nTraining set:   {len(train_df):,} samples ({100*len(train_df)/len(df):.1f}%)")
print(f"Validation set: {len(val_df):,} samples ({100*len(val_df)/len(df):.1f}%)")
print(f"Test set:       {len(test_df):,} samples ({100*len(test_df)/len(df):.1f}%)")
print(f"\nTotal:          {len(train_df) + len(val_df) + len(test_df):,} samples")
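
The second split's `test_size` is rescaled because it applies to the 80% that remains after the test set is removed, not to the full dataset. The hypothetical helper below just replays that arithmetic so we can check the final fractions.

```python
import math

def split_fractions(test_split, val_split):
    """Return (second_test_size, train, val, test) as fractions of the full dataset."""
    remaining = 1.0 - test_split              # share left after removing the test set
    second_test_size = val_split / remaining  # share of the remainder used for validation
    val_fraction = second_test_size * remaining
    train_fraction = remaining - val_fraction
    return second_test_size, train_fraction, val_fraction, test_split

second, train, val, test = split_fractions(test_split=0.2, val_split=0.1)
print(f"second test_size={second:.3f}, train={train:.2f}, val={val:.2f}, test={test:.2f}")
```

With 20% for testing and 10% for validation, the second split takes 12.5% of the remaining data, which works out to 10% of the original.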

In [None]:
# Verify stratification worked (class ratios should be similar)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, split_df) in zip(axes, [('Train', train_df), ('Validation', val_df), ('Test', test_df)]):
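    # Reindex so every split uses the same class order (and the same bar colors)
    counts = split_df['emotion'].value_counts().reindex(EMOTION_CLASSES, fill_value=0)
    ax.bar(counts.index, counts.values, color=colors)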
    ax.set_title(f'{name} Set Distribution (n={len(split_df)})')
    ax.tick_params(axis='x', rotation=45)
    ax.set_ylabel('Count')

plt.tight_layout()
plt.show()

print("Compare the three plots: with stratified sampling, the class ratios should look the same in every split.")
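
Plots are easy to misread, so a numeric check can back them up. The sketch below builds a toy DataFrame with made-up counts and compares class proportions between two splits. Note the assumption: it uses pandas' per-group sampling as a stand-in for `train_test_split(..., stratify=...)`, and `class_proportions` is a hypothetical helper.

```python
import pandas as pd

def class_proportions(frame, column="emotion"):
    """Fraction of rows per class, sorted by class name for easy comparison."""
    return frame[column].value_counts(normalize=True).sort_index()

# Toy data: 80 neutral, 15 happiness, 5 fear
toy = pd.DataFrame({"emotion": ["neutral"] * 80 + ["happiness"] * 15 + ["fear"] * 5})

# Stand-in for a stratified split: sample 20% from each class separately
toy_test = toy.groupby("emotion").sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)

comparison = pd.DataFrame({
    "train": class_proportions(toy_train),
    "test": class_proportions(toy_test),
})
max_gap = (comparison["train"] - comparison["test"]).abs().max()
print(comparison)
print(f"Largest difference in class share: {max_gap:.4f}")
```

On the real splits, the same comparison could be run on `train_df`, `val_df`, and `test_df`; a large gap for any class would suggest the stratification did not work as intended.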

In [None]:
# =============================================================================
# CREATE DATASETS AND DATALOADERS
# =============================================================================

BATCH_SIZE = 32

# Create datasets
train_dataset = FacialExpressionDataset(train_df, transform=train_transform)
val_dataset = FacialExpressionDataset(val_df, transform=test_transform)
test_dataset = FacialExpressionDataset(test_df, transform=test_transform)

# Create data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,      # Shuffle training data each epoch
    num_workers=0,     # Use 0 for Windows compatibility
    pin_memory=torch.cuda.is_available()  # Faster CPU-to-GPU transfer; only useful with a GPU
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,     # No shuffle for validation
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,     # No shuffle for testing
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

print("DataLoaders created!")
print(f"\nBatches per epoch:")
print(f"  Training:   {len(train_loader)}")
print(f"  Validation: {len(val_loader)}")
print(f"  Test:       {len(test_loader)}")

In [None]:
# Test the data loader
print("Testing data loader...")

# Get one batch
images, labels = next(iter(train_loader))

print(f"\nBatch shape: {images.shape}")
print(f"  - Batch size: {images.shape[0]}")
print(f"  - Channels: {images.shape[1]} (RGB)")
print(f"  - Height: {images.shape[2]}")
print(f"  - Width: {images.shape[3]}")
print(f"\nLabels shape: {labels.shape}")
print(f"Labels: {labels.tolist()}")

## Step 9: Visualize a Training Batch

Let's visualize a batch of images as they would be fed to the model (after transforms).

In [None]:
def show_batch(images, labels, n_show=8):
    """
    Visualize a batch of images with their labels.
    """
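    n_show = min(n_show, len(images))  # a batch may hold fewer than n_show images
    n_cols = (n_show + 1) // 2         # round up so odd counts still fit
    fig, axes = plt.subplots(2, n_cols, figsize=(15, 6), squeeze=False)
    axes = axes.flatten()
    for ax in axes:
        ax.axis('off')  # hide every panel up front so unused ones stay blank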
    
    # Denormalize for display
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    
    for i in range(n_show):
        img = images[i] * std + mean
        img = img.permute(1, 2, 0).numpy()
        img = np.clip(img, 0, 1)
        
        label = labels[i].item()
        emotion = IDX_TO_EMOTION[label]
        
        axes[i].imshow(img)
        axes[i].set_title(f'{emotion} ({label})', fontsize=12)
        axes[i].axis('off')
    
    plt.suptitle('Sample Training Batch (After Augmentation)', fontsize=14)
    plt.tight_layout()
    plt.show()

# Get a fresh batch and visualize
images, labels = next(iter(train_loader))
show_batch(images, labels)

## Summary

In this notebook, we've:

1. **Loaded the dataset** - 13,690 labeled facial expression images

2. **Cleaned the data** - Normalized labels, filtered to 7 classes

3. **Analyzed class distribution** - Found significant imbalance (neutral/happiness dominate)

4. **Visualized samples** - Understood what each emotion looks like

5. **Set up data augmentation** - Horizontal flip, rotation, color jitter

6. **Created train/val/test splits** - 70%/10%/20% with stratification

7. **Built PyTorch DataLoaders** - Ready for training!

### Next Steps:
- **Notebook 2**: Build and train a baseline model (HOG + SVM)
- **Notebook 3**: Build and train a custom CNN
- **Notebook 4**: Use transfer learning with ResNet18

In [None]:
# Save processed dataframes for use in other notebooks
import pickle

processed_data = {
    'train_df': train_df,
    'val_df': val_df,
    'test_df': test_df,
    'emotion_classes': EMOTION_CLASSES,
    'emotion_to_idx': EMOTION_TO_IDX,
    'idx_to_emotion': IDX_TO_EMOTION
}

# Save to the current working directory (the notebook_version folder when run from there)
with open('processed_data.pkl', 'wb') as f:
    pickle.dump(processed_data, f)

print("Processed data saved to processed_data.pkl")
print("\nThis file will be used by subsequent notebooks.")
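
A later notebook could reload the file along these lines. This is a sketch under assumptions: the toy payload stands in for the real dictionary of DataFrames, and a temporary directory replaces the notebook folder so the example is self-contained.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for processed_data (the real one also holds the split DataFrames)
toy_payload = {
    "emotion_classes": ["anger", "fear"],
    "emotion_to_idx": {"anger": 0, "fear": 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "processed_data.pkl"

    # Write the payload, as the cell above does
    with open(path, "wb") as f:
        pickle.dump(toy_payload, f)

    # Read it back, as a later notebook would
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print("Round trip preserved the data:", loaded == toy_payload)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.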