# 03: Data Augmentation & Train/Val/Test Split

## Objective
This notebook performs:
1. **Label Mapping** from raw conditions to FINAL model classes (8 clinically refined classes)
2. **Stratified Train/Val/Test Split** (70/15/15) preserving both final class and FST (V vs VI) distributions
3. **Light Augmentation** (≤3×) for training set regularization only
4. **Quality Validation** to ensure clinical realism and no data leakage

## Context
- Total images: 2,155 (FST V–VI from Fitzpatrick17k)
- Original conditions: 112 dermatological diagnoses
- **Final model classes: 8** (clinically refined taxonomy for AI-assisted triage)
- Augmentation purpose: **Regularization, NOT balancing**
- Augmentation strategy: Light, clinically plausible transforms only (≤3× factor)

## Final Classes (Model Output Space)
**8 Clinically Refined Classes:**
1. **malignant** - Malignant lesions requiring urgent specialist referral
2. **benign_neoplastic** - Benign neoplastic lesions (non-urgent)
3. **eczematous_dermatitis** - Inflammatory eczematous conditions
4. **papulosquamous** - Papulosquamous disorders (psoriasis, pityriasis rubra pilaris)
5. **autoimmune** - Autoimmune connective tissue diseases (lichen planus, lupus)
6. **genetic_neurocutaneous** - Genetic and neurocutaneous syndromes (neurofibromatosis, genodermatoses)
7. **pigmentary** - Pigmentary disorders (vitiligo)
8. **parasitic** - Parasitic infestations (scabies)

In [6]:
# Import required libraries
import os
import json
import shutil
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from sklearn.model_selection import train_test_split
from collections import Counter
import albumentations as A
from tqdm.auto import tqdm
import warnings

warnings.filterwarnings('ignore')

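# Set random seeds for reproducibility
# (Python's `random` is seeded as well, since some albumentations versions draw from it)
import random
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)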

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print(f"Random seed set to: {RANDOM_SEED}")
print("Libraries imported successfully!")

Random seed set to: 42
Libraries imported successfully!


## 1. Configuration & Paths

In [7]:
# Define paths
BASE_DIR = Path('../')
DATA_RAW_DIR = BASE_DIR / 'data' / 'raw' / 'fitzpatrick17k'
IMAGES_DIR = DATA_RAW_DIR / 'images'
METADATA_FILE = DATA_RAW_DIR / 'fitzpatrick17k_fst_v_vi.csv'

# Output directories: splits under processed/fitzpatrick17k
OUTPUT_DIR = BASE_DIR / 'data' / 'processed' / 'fitzpatrick17k'
TRAIN_DIR = OUTPUT_DIR / 'train'
VAL_DIR = OUTPUT_DIR / 'val'
TEST_DIR = OUTPUT_DIR / 'test'

# Augmented training set (under augmented/train/ for consistency)
AUGMENTED_DIR = BASE_DIR / 'data' / 'augmented' / 'train'

# Split ratios
TRAIN_RATIO = 0.70
VAL_RATIO = 0.15
TEST_RATIO = 0.15

# Image preprocessing
TARGET_SIZE = (224, 224)

# Create directories
for dir_path in [OUTPUT_DIR, TRAIN_DIR, VAL_DIR, TEST_DIR, AUGMENTED_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

print("Paths configured:")
print(f"  Metadata: {METADATA_FILE}")
print(f"  Images: {IMAGES_DIR}")
print(f"  Output (processed): {OUTPUT_DIR}")
print(f"  Output (augmented): {AUGMENTED_DIR}")
print(f"\nSplit ratios: {TRAIN_RATIO}/{VAL_RATIO}/{TEST_RATIO}")

Paths configured:
  Metadata: ../data/raw/fitzpatrick17k/fitzpatrick17k_fst_v_vi.csv
  Images: ../data/raw/fitzpatrick17k/images
  Output (processed): ../data/processed/fitzpatrick17k
  Output (augmented): ../data/augmented/train

Split ratios: 0.7/0.15/0.15


## 2. Load Data & Verify Image Existence

In [8]:
# Load metadata
df = pd.read_csv(METADATA_FILE)
print(f"Loaded {len(df)} records from metadata")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst 3 rows:")
print(df.head(3))

# Verify image existence
df['image_path'] = df['md5hash'].apply(lambda x: IMAGES_DIR / f"{x}.jpg")
df['image_exists'] = df['image_path'].apply(lambda x: x.exists())

missing_count = (~df['image_exists']).sum()
print(f"\nImage verification:")
print(f"  Total records: {len(df)}")
print(f"  Images found: {df['image_exists'].sum()}")
print(f"  Images missing: {missing_count}")

# Remove missing images
if missing_count > 0:
    print(f"\nRemoving {missing_count} records with missing images...")
    df = df[df['image_exists']].copy()
    print(f"Remaining records: {len(df)}")

print(f"\nFitzpatrick Skin Type distribution:")
print(df['fitzpatrick_scale'].value_counts().sort_index())

Loaded 2155 records from metadata
Columns: ['md5hash', 'fitzpatrick_scale', 'fitzpatrick_centaur', 'label', 'nine_partition_label', 'three_partition_label', 'qc', 'url', 'url_alphanum']

First 3 rows:
                            md5hash  fitzpatrick_scale  fitzpatrick_centaur  \
0  45f7fe0e10214e32e890cad9d29d4811                  6                    5   
1  ddcad677b7b1e9084f3f51a8e026aa8d                  5                    5   
2  1e119546f5bc2b9165bb10ddd7fe5f69                  5                    4   

                   label nine_partition_label three_partition_label   qc  \
0         kaposi sarcoma     malignant dermal             malignant  NaN   
1           hidradenitis         inflammatory        non-neoplastic  NaN   
2  xeroderma pigmentosum       genodermatoses        non-neoplastic  NaN   

                                                 url  \
0  https://www.dermaamin.com/site/images/clinical...   
1  https://www.dermaamin.com/site/images/clinical...   
2  https:

## 3. Label Mapping: Raw Conditions → Final Classes

**FINAL 8-CLASS CLINICAL TAXONOMY**

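This mapping consolidates 112 raw conditions into 8 clinically coherent classes designed for AI-assisted triage. It runs in two stages: `CONDITION_MAPPING` assigns each raw condition to one of 12 intermediate classes, and `CLASS_MAPPING` then collapses those into the 8 final classes: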

**8 Final Classes:**
1. **malignant** - All malignant lesions (SCC, BCC, melanoma, Kaposi sarcoma, etc.)
2. **benign_neoplastic** - Benign neoplastic lesions (seborrheic keratosis, nevi, dermatofibromas, etc.)
3. **eczematous_dermatitis** - Inflammatory/eczematous conditions (eczema, contact dermatitis, etc.)
4. **papulosquamous** - Papulosquamous disorders (psoriasis, pityriasis rubra pilaris)
5. **autoimmune** - Autoimmune connective tissue diseases (lichen planus, lupus)
6. **genetic_neurocutaneous** - Genetic disorders and neurocutaneous syndromes (neurofibromatosis, genodermatoses)
7. **pigmentary** - Pigmentary disorders (vitiligo)
8. **parasitic** - Parasitic infestations (scabies)

**Important:** All class names use underscores (no spaces) for filesystem compatibility.
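
The two-stage lookup can be illustrated on a handful of entries. The sketch below uses small excerpts of the two mapping tables (`condition_excerpt` and `class_excerpt` are illustrative names, not the notebook's full dictionaries), and `to_final_class` is a hypothetical helper. It also includes one check that may be worth running on the full tables: the notebook's final lookup falls back to the intermediate name when no entry exists, so a missing entry would pass through silently unless it is tested for.

```python
# Excerpts of the two mapping tables (illustrative subset only)
condition_excerpt = {
    'kaposi sarcoma': 'malignant_skin_lesions',
    'squamous cell carcinoma': 'squamous_cell_carcinoma',
    'eczema': 'inflammatory',
    'vitiligo': 'vitiligo',
}
class_excerpt = {
    'malignant_skin_lesions': 'malignant',
    'squamous_cell_carcinoma': 'malignant',
    'inflammatory': 'eczematous_dermatitis',
    'vitiligo': 'pigmentary',
}


def to_final_class(raw_label):
    """Return the final class for a raw condition, or None if it is unmapped."""
    intermediate = condition_excerpt.get(raw_label)
    if intermediate is None:
        return None
    # A KeyError here would expose an intermediate class with no final class
    return class_excerpt[intermediate]


# Every intermediate class should have a final class
missing = set(condition_excerpt.values()) - set(class_excerpt)
assert not missing, f"Intermediate classes without a final class: {missing}"

print(to_final_class('kaposi sarcoma'))   # malignant
print(to_final_class('eczema'))           # eczematous_dermatitis
print(to_final_class('not a condition'))  # None
```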

In [9]:
# Stage 2 mapping: intermediate class → final class
CLASS_MAPPING = {
    'malignant_skin_lesions': 'malignant',
    'squamous_cell_carcinoma': 'malignant',
    'benign_neoplastic_lesions': 'benign_neoplastic',
    'inflammatory': 'eczematous_dermatitis',
    'psoriasis': 'papulosquamous',
    'pityriasis_rubra_pilaris': 'papulosquamous',
    'lichen_planus': 'autoimmune',
    'lupus_erythematosus': 'autoimmune',
    'genodermatoses': 'genetic_neurocutaneous',
    'neurofibromatosis': 'genetic_neurocutaneous',
    'vitiligo': 'pigmentary',
    'scabies': 'parasitic'
}

# Define label mapping directly (embedded in notebook for self-containment)
# Format: 'original_condition': ('intermediate_class', 'class_type')
# NOTE: All class names use underscores, NO SPACES

CONDITION_MAPPING = {
    # Independent classes (≥50 train samples) - will be mapped to final classes
    'pityriasis rubra pilaris': ('pityriasis_rubra_pilaris', 'independent'),
    'lichen planus': ('lichen_planus', 'independent'),
    'lupus erythematosus': ('lupus_erythematosus', 'independent'),
    'psoriasis': ('psoriasis', 'independent'),
    'scabies': ('scabies', 'independent'),
    'vitiligo': ('vitiligo', 'independent'),
    'neurofibromatosis': ('neurofibromatosis', 'independent'),
    'squamous cell carcinoma': ('squamous_cell_carcinoma', 'independent'),
    
    # Grouped: Inflammatory (rare inflammatory conditions)
    'neutrophilic dermatoses': ('inflammatory', 'grouped'),
    'scleroderma': ('inflammatory', 'grouped'),
    'nematode infection': ('inflammatory', 'grouped'),
    'papilomatosis confluentes and reticulate': ('inflammatory', 'grouped'),
    'lichen amyloidosis': ('inflammatory', 'grouped'),
    'pityriasis rosea': ('inflammatory', 'grouped'),
    'acanthosis nigricans': ('inflammatory', 'grouped'),
    'fixed eruptions': ('inflammatory', 'grouped'),
    'factitial dermatitis': ('inflammatory', 'grouped'),
    'allergic contact dermatitis': ('inflammatory', 'grouped'),
    'eczema': ('inflammatory', 'grouped'),
    'drug eruption': ('inflammatory', 'grouped'),
    'erythema multiforme': ('inflammatory', 'grouped'),
    'acne vulgaris': ('inflammatory', 'grouped'),
    'granuloma annulare': ('inflammatory', 'grouped'),
    'photodermatoses': ('inflammatory', 'grouped'),
    'lichen simplex': ('inflammatory', 'grouped'),
    'tungiasis': ('inflammatory', 'grouped'),
    'scleromyxedema': ('inflammatory', 'grouped'),
    'acne': ('inflammatory', 'grouped'),
    'necrobiosis lipoidica': ('inflammatory', 'grouped'),
    'dermatomyositis': ('inflammatory', 'grouped'),
    'porphyria': ('inflammatory', 'grouped'),
    'stasis edema': ('inflammatory', 'grouped'),
    'erythema elevatum diutinum': ('inflammatory', 'grouped'),
    'lupus subacute': ('inflammatory', 'grouped'),
    'mucinosis': ('inflammatory', 'grouped'),
    'urticaria pigmentosa': ('inflammatory', 'grouped'),
    'drug induced pigmentary changes': ('inflammatory', 'grouped'),
    'calcinosis cutis': ('inflammatory', 'grouped'),
    'behcets disease': ('inflammatory', 'grouped'),
    'juvenile xanthogranuloma': ('inflammatory', 'grouped'),
    'hidradenitis': ('inflammatory', 'grouped'),
    'seborrheic dermatitis': ('inflammatory', 'grouped'),
    'striae': ('inflammatory', 'grouped'),
    'dyshidrotic eczema': ('inflammatory', 'grouped'),
    'cheilitis': ('inflammatory', 'grouped'),
    'xanthomas': ('inflammatory', 'grouped'),
    'sun damaged skin': ('inflammatory', 'grouped'),
    'pustular psoriasis': ('inflammatory', 'grouped'),
    'myiasis': ('inflammatory', 'grouped'),
    'keratosis pilaris': ('inflammatory', 'grouped'),
    'neurodermatitis': ('inflammatory', 'grouped'),
    'pityriasis lichenoides chronica': ('inflammatory', 'grouped'),
    'lyme disease': ('inflammatory', 'grouped'),
    'stevens johnson syndrome': ('inflammatory', 'grouped'),
    'langerhans cell histiocytosis': ('inflammatory', 'grouped'),
    'erythema nodosum': ('inflammatory', 'grouped'),
    'rhinophyma': ('inflammatory', 'grouped'),
    'paronychia': ('inflammatory', 'grouped'),
    'neurotic excoriations': ('inflammatory', 'grouped'),
    'tick bite': ('inflammatory', 'grouped'),
    'urticaria': ('inflammatory', 'grouped'),
    'livedo reticularis': ('inflammatory', 'grouped'),
    'pediculosis lids': ('inflammatory', 'grouped'),
    'aplasia cutis': ('inflammatory', 'grouped'),
    'erythema annulare centrifigum': ('inflammatory', 'grouped'),
    'acquired autoimmune bullous diseaseherpes gestationis': ('inflammatory', 'grouped'),
    'perioral dermatitis': ('inflammatory', 'grouped'),
    'sarcoidosis': ('inflammatory', 'grouped'),
    'keloid': ('inflammatory', 'grouped'),
    'folliculitis': ('inflammatory', 'grouped'),
    
    # Grouped: Genodermatoses (genetic disorders)
    'dariers disease': ('genodermatoses', 'grouped'),
    'ehlers danlos syndrome': ('genodermatoses', 'grouped'),
    'xeroderma pigmentosum': ('genodermatoses', 'grouped'),
    'tuberous sclerosis': ('genodermatoses', 'grouped'),
    'ichthyosis vulgaris': ('genodermatoses', 'grouped'),
    'epidermolysis bullosa': ('genodermatoses', 'grouped'),
    'hailey hailey disease': ('genodermatoses', 'grouped'),
    'incontinentia pigmenti': ('genodermatoses', 'grouped'),
    'acrodermatitis enteropathica': ('genodermatoses', 'grouped'),
    
    # Grouped: Malignant skin lesions
    'mycosis fungoides': ('malignant_skin_lesions', 'grouped'),
    'actinic keratosis': ('malignant_skin_lesions', 'grouped'),
    'malignant melanoma': ('malignant_skin_lesions', 'grouped'),
    'lentigo maligna': ('malignant_skin_lesions', 'grouped'),
    'superficial spreading melanoma ssm': ('malignant_skin_lesions', 'grouped'),
    'solid cystic basal cell carcinoma': ('malignant_skin_lesions', 'grouped'),
    'basal cell carcinoma morpheiform': ('malignant_skin_lesions', 'grouped'),
    'kaposi sarcoma': ('malignant_skin_lesions', 'grouped'),
    'melanoma': ('malignant_skin_lesions', 'grouped'),
    'basal cell carcinoma': ('malignant_skin_lesions', 'grouped'),
    
    # Grouped: Benign neoplastic lesions
    'prurigo nodularis': ('benign_neoplastic_lesions', 'grouped'),
    'naevus comedonicus': ('benign_neoplastic_lesions', 'grouped'),
    'porokeratosis actinic': ('benign_neoplastic_lesions', 'grouped'),
    'syringoma': ('benign_neoplastic_lesions', 'grouped'),
    'lymphangioma': ('benign_neoplastic_lesions', 'grouped'),
    'mucous cyst': ('benign_neoplastic_lesions', 'grouped'),
    'telangiectases': ('benign_neoplastic_lesions', 'grouped'),
    'seborrheic keratosis': ('benign_neoplastic_lesions', 'grouped'),
    'pilar cyst': ('benign_neoplastic_lesions', 'grouped'),
    'pyogenic granuloma': ('benign_neoplastic_lesions', 'grouped'),
    'nevus sebaceous of jadassohn': ('benign_neoplastic_lesions', 'grouped'),
    'milia': ('benign_neoplastic_lesions', 'grouped'),
    'granuloma pyogenic': ('benign_neoplastic_lesions', 'grouped'),
    'porokeratosis of mibelli': ('benign_neoplastic_lesions', 'grouped'),
    'fordyce spots': ('benign_neoplastic_lesions', 'grouped'),
    'disseminated actinic porokeratosis': ('benign_neoplastic_lesions', 'grouped'),
    'becker nevus': ('benign_neoplastic_lesions', 'grouped'),
    'dermatofibroma': ('benign_neoplastic_lesions', 'grouped'),
    'epidermal nevus': ('benign_neoplastic_lesions', 'grouped'),
    'port wine stain': ('benign_neoplastic_lesions', 'grouped'),
    'halo nevus': ('benign_neoplastic_lesions', 'grouped'),
    'congenital nevus': ('benign_neoplastic_lesions', 'grouped'),
    'nevocytic nevus': ('benign_neoplastic_lesions', 'grouped'),
}

print(f"Label mapping defined with {len(CONDITION_MAPPING)} conditions")
print(f"\nMapping to {len(set([v[0] for v in CONDITION_MAPPING.values()]))} intermediate classes")

# Apply mapping to main dataframe (first to intermediate classes)
df['intermediate_class'] = df['label'].apply(
    lambda x: CONDITION_MAPPING.get(x, (None, None))[0]
)
df['class_type'] = df['label'].apply(
    lambda x: CONDITION_MAPPING.get(x, (None, None))[1]
)

# Check for unmapped conditions
unmapped = df['intermediate_class'].isna().sum()
if unmapped > 0:
    print(f"\n⚠️ WARNING: {unmapped} records could not be mapped")
    print("Unmapped conditions:")
    print(df[df['intermediate_class'].isna()]['label'].unique())
    # Remove unmapped records
    df = df[df['intermediate_class'].notna()].copy()
    print(f"Removed unmapped records. Remaining: {len(df)}")
else:
    print("\n✓ All conditions successfully mapped to intermediate classes")

# Now map to final 8 classes
df['final_class'] = df['intermediate_class'].apply(
    lambda x: CLASS_MAPPING.get(x, x)
)

# Display final class distribution
print(f"\nFinal Class Distribution (8 classes):")
final_class_dist = df['final_class'].value_counts().sort_values(ascending=False)
print(final_class_dist)

print(f"\nTotal final classes: {df['final_class'].nunique()}")

# Show intermediate to final mapping summary
print(f"\nIntermediate → Final Class Mapping Summary:")
intermediate_to_final = df.groupby(['intermediate_class', 'final_class']).size().reset_index(name='count')
for _, row in intermediate_to_final.iterrows():
    print(f"  {row['intermediate_class']:35s} → {row['final_class']:25s} ({row['count']:4d} samples)")

Label mapping defined with 112 conditions

Mapping to 12 intermediate classes

✓ All conditions successfully mapped to intermediate classes

Final Class Distribution (8 classes):
final_class
eczematous_dermatitis     952
genetic_neurocutaneous    244
papulosquamous            212
malignant                 206
autoimmune                205
benign_neoplastic         203
parasitic                  73
pigmentary                 60
Name: count, dtype: int64

Total final classes: 8

Intermediate → Final Class Mapping Summary:
  benign_neoplastic_lesions           → benign_neoplastic         ( 203 samples)
  genodermatoses                      → genetic_neurocutaneous    ( 170 samples)
  inflammatory                        → eczematous_dermatitis     ( 952 samples)
  lichen_planus                       → autoimmune                ( 116 samples)
  lupus_erythematosus                 → autoimmune                (  89 samples)
  malignant_skin_lesions              → malignant                 ( 1

## 4. Prepare Stratification Labels

We stratify by BOTH:
- Final class label
- Fitzpatrick Skin Type (V vs VI)

This ensures both class and skin type distributions are preserved across splits.
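
To see why a combined label preserves both distributions, the sketch below splits toy data group by group. It is a plain-Python illustration of per-group sampling, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper with made-up group sizes.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_frac, seed=42):
    """Split row indices so each label keeps roughly `test_frac` of its rows in test."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(groups):  # sorted for a deterministic group order
        members = groups[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# Toy data: combined label = final class + '_' + FST label
labels = (
    ['eczematous_dermatitis_V'] * 40
    + ['eczematous_dermatitis_VI'] * 20
    + ['malignant_V'] * 20
    + ['malignant_VI'] * 20
)

train_idx, test_idx = stratified_split(labels, test_frac=0.25)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(test_counts))
```

Each (class, FST) group contributes a quarter of its rows to the test side, so the class shares and the V/VI shares in the test set both match the full data.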

In [10]:
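# Create FST binary label (V vs VI); fail fast on any unexpected scale value
# instead of silently labelling it 'VI'
df['fst_label'] = df['fitzpatrick_scale'].map({5: 'V', 6: 'VI'})
assert df['fst_label'].notna().all(), "Unexpected fitzpatrick_scale value (expected 5 or 6)"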

# Create combined stratification label
df['stratify_label'] = df['final_class'] + '_' + df['fst_label']

print("Stratification labels created")
print(f"\nSample stratification labels:")
print(df[['final_class', 'fst_label', 'stratify_label']].head(10))

# Check stratification group sizes
strat_counts = df['stratify_label'].value_counts().sort_values()
print(f"\nStratification group sizes (smallest to largest):")
print(strat_counts.head(10))

min_group_size = strat_counts.min()
print(f"\nSmallest stratification group: {min_group_size} samples")
print(f"Test split requires ≥2 samples per group (with 15% split)")

if min_group_size < 2:
    print("\n⚠️ Some groups are too small for stratification")
    print("Groups with <2 samples:")
    print(strat_counts[strat_counts < 2])

Stratification labels created

Sample stratification labels:
              final_class fst_label             stratify_label
0               malignant        VI               malignant_VI
1   eczematous_dermatitis         V    eczematous_dermatitis_V
2  genetic_neurocutaneous         V   genetic_neurocutaneous_V
3               malignant        VI               malignant_VI
4               malignant         V                malignant_V
5  genetic_neurocutaneous        VI  genetic_neurocutaneous_VI
6          papulosquamous         V           papulosquamous_V
7   eczematous_dermatitis         V    eczematous_dermatitis_V
8       benign_neoplastic         V        benign_neoplastic_V
9   eczematous_dermatitis         V    eczematous_dermatitis_V

Stratification group sizes (smallest to largest):
stratify_label
parasitic_VI                  18
pigmentary_VI                 21
pigmentary_V                  39
benign_neoplastic_VI          44
papulosquamous_VI             48
parasitic_V    

## 5. Train/Val/Test Split (70/15/15)

Stratified split preserving:
- Final class distribution
- FST V vs VI distribution
- No data leakage between splits
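
The 70/15/15 split is done in two passes, so the second pass needs a validation fraction relative to the remaining train+val pool rather than to the full dataset. The arithmetic below is a sketch that assumes a fractional test size is rounded up to a whole number of samples, which is consistent with the split sizes this notebook reports.

```python
import math

TRAIN_RATIO, VAL_RATIO, TEST_RATIO = 0.70, 0.15, 0.15
n_total = 2155  # records loaded in Section 2

# First pass: hold out the test set
n_test = math.ceil(TEST_RATIO * n_total)
n_train_val = n_total - n_test

# Second pass: validation as a fraction of the remaining train+val pool
val_ratio_adjusted = VAL_RATIO / (TRAIN_RATIO + VAL_RATIO)
n_val = math.ceil(val_ratio_adjusted * n_train_val)
n_train = n_train_val - n_val

print(f"adjusted val ratio: {val_ratio_adjusted:.4f}")    # 0.1765
print(f"train/val/test: {n_train}/{n_val}/{n_test}")      # 1507/324/324
```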

In [11]:
def safe_stratified_split(df, stratify_col, test_size, random_state=42):
    """
    Perform stratified split, handling groups with insufficient samples.
    Groups with only 1 sample are randomly assigned.
    """
    # Identify groups with sufficient samples for stratification
    group_counts = df[stratify_col].value_counts()
    min_samples_needed = 2  # Minimum for any split
    
    sufficient_groups = group_counts[group_counts >= min_samples_needed].index
    insufficient_groups = group_counts[group_counts < min_samples_needed].index
    
    # Split sufficient groups with stratification
    df_sufficient = df[df[stratify_col].isin(sufficient_groups)]
    df_insufficient = df[df[stratify_col].isin(insufficient_groups)]
    
    if len(df_sufficient) > 0:
        train_val, test = train_test_split(
            df_sufficient,
            test_size=test_size,
            stratify=df_sufficient[stratify_col],
            random_state=random_state
        )
    else:
        train_val = pd.DataFrame()
        test = pd.DataFrame()
    
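    # For insufficient groups, split randomly without stratification.
    # train_test_split raises ValueError when one side would be empty
    # (e.g. a single leftover row), so such rows all go to the train side.
    if len(df_insufficient) > 0:
        try:
            train_val_insuf, test_insuf = train_test_split(
                df_insufficient,
                test_size=test_size,
                random_state=random_state
            )
        except ValueError:
            train_val_insuf, test_insuf = df_insufficient, df_insufficient.iloc[:0]
        train_val = pd.concat([train_val, train_val_insuf])
        test = pd.concat([test, test_insuf])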
    
    return train_val, test

# First split: train+val vs test
print("Splitting data: train+val vs test...")
train_val_df, test_df = safe_stratified_split(
    df,
    'stratify_label',
    test_size=TEST_RATIO,
    random_state=RANDOM_SEED
)

# Second split: train vs val
print("Splitting train+val: train vs val...")
val_ratio_adjusted = VAL_RATIO / (TRAIN_RATIO + VAL_RATIO)
train_df, val_df = safe_stratified_split(
    train_val_df,
    'stratify_label',
    test_size=val_ratio_adjusted,
    random_state=RANDOM_SEED
)

print(f"\n=== Split Summary ===")
print(f"Train: {len(train_df)} samples ({len(train_df)/len(df)*100:.1f}%)")
print(f"Val:   {len(val_df)} samples ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test:  {len(test_df)} samples ({len(test_df)/len(df)*100:.1f}%)")
print(f"Total: {len(df)} samples")

# Verify no overlap
train_hashes = set(train_df['md5hash'])
val_hashes = set(val_df['md5hash'])
test_hashes = set(test_df['md5hash'])

assert len(train_hashes & val_hashes) == 0, "Data leakage: train-val overlap!"
assert len(train_hashes & test_hashes) == 0, "Data leakage: train-test overlap!"
assert len(val_hashes & test_hashes) == 0, "Data leakage: val-test overlap!"
print("\n✓ No data leakage detected")

Splitting data: train+val vs test...
Splitting train+val: train vs val...

=== Split Summary ===
Train: 1507 samples (69.9%)
Val:   324 samples (15.0%)
Test:  324 samples (15.0%)
Total: 2155 samples

✓ No data leakage detected


## 6. Verify Split Quality

Check that final class and FST distributions are preserved.
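
One way to put a number on "preserved" is the largest gap, in percentage points, between a split's class shares and the full dataset's. The sketch below does this for the train split using the class counts printed in this notebook's outputs; the summary metric itself is an illustrative choice, not something the notebook computes.

```python
# Class counts copied from this notebook's outputs
full_counts = {
    'eczematous_dermatitis': 952, 'genetic_neurocutaneous': 244,
    'papulosquamous': 212, 'malignant': 206, 'autoimmune': 205,
    'benign_neoplastic': 203, 'parasitic': 73, 'pigmentary': 60,
}
train_counts = {
    'eczematous_dermatitis': 666, 'genetic_neurocutaneous': 172,
    'papulosquamous': 148, 'malignant': 144, 'autoimmune': 143,
    'benign_neoplastic': 141, 'parasitic': 51, 'pigmentary': 42,
}


def shares(counts):
    """Convert absolute counts to fractions of the total."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


full_share = shares(full_counts)
train_share = shares(train_counts)

# Absolute gap per class, in percentage points
deviation_pp = {
    cls: abs(full_share[cls] - train_share[cls]) * 100 for cls in full_counts
}
worst_class = max(deviation_pp, key=deviation_pp.get)
print(f"largest gap: {worst_class} ({deviation_pp[worst_class]:.2f} percentage points)")
```

The same comparison can be repeated for the validation and test splits, and for the FST V/VI shares.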

In [12]:
def print_split_statistics(split_df, split_name):
    print(f"\n{'='*60}")
    print(f"{split_name.upper()} SET STATISTICS")
    print(f"{'='*60}")
    
    # Final class distribution
    print(f"\nFinal Class Distribution:")
    class_dist = split_df['final_class'].value_counts().sort_values(ascending=False)
    for cls, count in class_dist.items():
        pct = count / len(split_df) * 100
        print(f"  {cls:35s}: {count:4d} ({pct:5.1f}%)")
    
    # FST distribution
    print(f"\nFitzpatrick Skin Type Distribution:")
    fst_dist = split_df['fst_label'].value_counts().sort_index()
    for fst, count in fst_dist.items():
        pct = count / len(split_df) * 100
        print(f"  FST {fst}: {count:4d} ({pct:5.1f}%)")

# Print statistics for each split
print_split_statistics(train_df, "TRAIN")
print_split_statistics(val_df, "VALIDATION")
print_split_statistics(test_df, "TEST")

# Compare distributions across splits
print(f"\n{'='*60}")
print("DISTRIBUTION COMPARISON")
print(f"{'='*60}")

comparison_df = pd.DataFrame({
    'Train': train_df['final_class'].value_counts(normalize=True) * 100,
    'Val': val_df['final_class'].value_counts(normalize=True) * 100,
    'Test': test_df['final_class'].value_counts(normalize=True) * 100
}).fillna(0).round(1)

print("\nFinal Class Distribution (%) across splits:")
print(comparison_df)


TRAIN SET STATISTICS

Final Class Distribution:
  eczematous_dermatitis              :  666 ( 44.2%)
  genetic_neurocutaneous             :  172 ( 11.4%)
  papulosquamous                     :  148 (  9.8%)
  malignant                          :  144 (  9.6%)
  autoimmune                         :  143 (  9.5%)
  benign_neoplastic                  :  141 (  9.4%)
  parasitic                          :   51 (  3.4%)
  pigmentary                         :   42 (  2.8%)

Fitzpatrick Skin Type Distribution:
  FST V: 1067 ( 70.8%)
  FST VI:  440 ( 29.2%)

VALIDATION SET STATISTICS

Final Class Distribution:
  eczematous_dermatitis              :  143 ( 44.1%)
  genetic_neurocutaneous             :   36 ( 11.1%)
  papulosquamous                     :   32 (  9.9%)
  benign_neoplastic                  :   31 (  9.6%)
  autoimmune                         :   31 (  9.6%)
  malignant                          :   31 (  9.6%)
  parasitic                          :   11 (  3.4%)
  pigmentary      

## 7. Copy and Preprocess Images to Split Directories

All images are:
- Converted to RGB
- Resized to 224×224
- Saved as JPEG
- Organized in ImageFolder-compatible structure: `split/class/image.jpg`

In [13]:
def copy_images_to_split(split_df, split_dir, split_name):
    """
    Copy and preprocess images to split directory.
    Structure: split_dir / final_class / image.jpg
    """
    print(f"\nProcessing {split_name} split...")
    
    # Create class directories
    classes = split_df['final_class'].unique()
    for cls in classes:
        class_dir = split_dir / cls
        class_dir.mkdir(parents=True, exist_ok=True)
    
    # Copy and preprocess images
    copied_count = 0
    error_count = 0
    
    for idx, row in tqdm(split_df.iterrows(), total=len(split_df), desc=f"Copying {split_name}"):
        try:
            # Load image
            img = Image.open(row['image_path'])
            
            # Convert to RGB (handle RGBA, grayscale, etc.)
            if img.mode != 'RGB':
                img = img.convert('RGB')
            
            # Resize to 224×224
            img = img.resize(TARGET_SIZE, Image.Resampling.LANCZOS)
            
            # Save to class directory
            dest_path = split_dir / row['final_class'] / f"{row['md5hash']}.jpg"
            img.save(dest_path, 'JPEG', quality=95)
            
            copied_count += 1
            
        except Exception as e:
            print(f"Error processing {row['md5hash']}: {e}")
            error_count += 1
    
    print(f"\n{split_name} Summary:")
    print(f"  Successfully copied: {copied_count}")
    print(f"  Errors: {error_count}")
    print(f"  Classes: {len(classes)}")
    
    return copied_count

# Process all splits
train_copied = copy_images_to_split(train_df, TRAIN_DIR, "TRAIN")
val_copied = copy_images_to_split(val_df, VAL_DIR, "VAL")
test_copied = copy_images_to_split(test_df, TEST_DIR, "TEST")

print(f"\n{'='*60}")
print(f"Total images copied: {train_copied + val_copied + test_copied}")
print(f"{'='*60}")


Processing TRAIN split...


TRAIN Summary:
  Successfully copied: 1507
  Errors: 0
  Classes: 8

Processing VAL split...


VAL Summary:
  Successfully copied: 324
  Errors: 0
  Classes: 8

Processing TEST split...


TEST Summary:
  Successfully copied: 324
  Errors: 0
  Classes: 8

Total images copied: 2155


## 8. Define Augmentation Policy

### Augmentation Philosophy
- **Purpose**: Regularization, NOT class balancing
- **Scope**: Training set only
- **Intensity**: Light, clinically plausible transforms
- **Maximum factor**: 3× original count (stochastic application)

### Augmentation Strategy by Class Size
Factors are assigned per final class from its training sample count (see Section 9):
1. **≥500 train samples**: 1.5×
2. **200–499 train samples**: 2.0×
3. **100–199 train samples**: 2.5×
4. **<100 train samples**: 3.0×

### Allowed Transforms
- Horizontal and vertical flips
- Small rotations (≤15°)
- Minor brightness/contrast adjustments

### Explicitly REMOVED
- ShiftScaleRotate (geometric distortion)
- Gaussian noise
- Blur filters

In [14]:
# Define augmentation transforms (LIGHT only)
augmentation_pipeline = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.3),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(
        brightness_limit=0.1,
        contrast_limit=0.1,
        p=0.5
    ),
])

print("Augmentation pipeline defined (light transforms only)")
print("\nTransforms:")
print("  - HorizontalFlip (p=0.5)")
print("  - VerticalFlip (p=0.3)")
print("  - Rotate (limit=15°, p=0.5)")
print("  - RandomBrightnessContrast (limit=0.1, p=0.5)")

Augmentation pipeline defined (light transforms only)

Transforms:
  - HorizontalFlip (p=0.5)
  - VerticalFlip (p=0.3)
  - Rotate (limit=15°, p=0.5)
  - RandomBrightnessContrast (limit=0.1, p=0.5)
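
Because every transform fires with its own probability, some augmented files will receive no transform at all and end up as copies of their source image (apart from JPEG re-encoding). The arithmetic below estimates how often that happens, assuming the four draws are independent, as the per-transform `p` values suggest.

```python
# Application probabilities of the four transforms in the pipeline
transform_probs = {
    'HorizontalFlip': 0.5,
    'VerticalFlip': 0.3,
    'Rotate': 0.5,
    'RandomBrightnessContrast': 0.5,
}

# Probability that none of the transforms fires (independent draws assumed)
p_unchanged = 1.0
for p in transform_probs.values():
    p_unchanged *= (1.0 - p)

# Expected number of transforms applied to a single image
expected_applied = sum(transform_probs.values())

print(f"P(no transform applied) = {p_unchanged:.4f}")                 # 0.0875
print(f"Expected transforms per image = {expected_applied:.1f}")      # 1.8
```

Under that assumption, roughly 1 in 11 augmented images is an untransformed copy.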


## 9. Define Class-Specific Augmentation Factors

Augmentation factors are assigned from each final class's training sample count alone: the larger the class, the smaller the factor.

In [15]:
# Count training samples per class
train_class_counts = train_df['final_class'].value_counts().to_dict()

# Define augmentation factors based on class size
# Larger classes get less augmentation to avoid dominance
augmentation_factors = {}

for cls, count in train_class_counts.items():
    if count >= 500:
        # Very large classes (like eczematous_dermatitis): minimal augmentation
        factor = 1.5
    elif count >= 200:
        # Large classes: light augmentation
        factor = 2.0
    elif count >= 100:
        # Medium classes: moderate augmentation
        factor = 2.5
    else:
        # Small classes: higher augmentation
        factor = 3.0
    
    augmentation_factors[cls] = factor

# Display augmentation plan
print("Augmentation Plan:")
print(f"{'Class':<35} {'Train Count':<12} {'Aug Factor':<12} {'Target Count':<12}")
print("="*85)

for cls in sorted(augmentation_factors.keys()):
    train_count = train_class_counts[cls]
    factor = augmentation_factors[cls]
    target = int(train_count * factor)
    
    print(f"{cls:<35} {train_count:<12} {factor:<12.1f} {target:<12}")

print("\nNote: Augmentation is stochastic - actual counts may vary slightly")

Augmentation Plan:
Class                               Train Count  Aug Factor   Target Count
autoimmune                          143          2.5          357         
benign_neoplastic                   141          2.5          352         
eczematous_dermatitis               666          1.5          999         
genetic_neurocutaneous              172          2.5          430         
malignant                           144          2.5          360         
papulosquamous                      148          2.5          370         
parasitic                           51           3.0          153         
pigmentary                          42           3.0          126         

Note: Augmentation is stochastic - actual counts may vary slightly
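
As a cross-check of the plan, the sketch below recomputes the targets from the train counts and totals the number of new images. `augmentation_factor` is an illustrative helper that restates the size tiers used in the cell above.

```python
# Train counts per final class, copied from the plan above
train_counts = {
    'autoimmune': 143, 'benign_neoplastic': 141,
    'eczematous_dermatitis': 666, 'genetic_neurocutaneous': 172,
    'malignant': 144, 'papulosquamous': 148,
    'parasitic': 51, 'pigmentary': 42,
}


def augmentation_factor(count):
    """Size tiers: larger classes get a smaller factor."""
    if count >= 500:
        return 1.5
    if count >= 200:
        return 2.0
    if count >= 100:
        return 2.5
    return 3.0


# Targets are truncated to whole images, as in the plan table
targets = {cls: int(n * augmentation_factor(n)) for cls, n in train_counts.items()}

n_original = sum(train_counts.values())
n_new = sum(targets[cls] - train_counts[cls] for cls in train_counts)
n_augmented_set = n_original + n_new

print(f"original: {n_original}, new: {n_new}, total: {n_augmented_set}")
print(f"overall factor: {n_augmented_set / n_original:.2f}x")
```

The overall factor stays well under the 3× ceiling because the largest class only grows by 1.5×.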


## 10. Apply Augmentation to Training Set

Augment training images according to the factors defined above.
Augmented images are saved to a separate directory so the originals are preserved.

In [16]:
def augment_class(class_name, source_dir, target_dir, augmentation_factor, pipeline):
    """
    Augment images for a specific class.
    """
    source_class_dir = source_dir / class_name
    target_class_dir = target_dir / class_name
    target_class_dir.mkdir(parents=True, exist_ok=True)
    
    # Get all original images
    image_files = list(source_class_dir.glob("*.jpg"))
    original_count = len(image_files)
    
    if original_count == 0:
        print(f"  Warning: No images found for {class_name}")
        return 0
    
    # Copy original images first so every class is present in the augmented
    # set, even when no extra images are needed
    for img_file in image_files:
        shutil.copy2(img_file, target_class_dir / img_file.name)

    # Calculate target augmented count
    target_count = int(original_count * augmentation_factor)
    augmented_needed = target_count - original_count

    if augmented_needed <= 0:
        print(f"  {class_name}: No augmentation needed (factor={augmentation_factor:.1f})")
        return 0
    
    # Generate augmented images
    augmented_count = 0
    failed_attempts = 0
    max_failures = 10 * augmented_needed  # stop instead of looping forever if images keep failing
    
    while augmented_count < augmented_needed and failed_attempts < max_failures:
        # Randomly select source image
        source_img_path = np.random.choice(image_files)
        
        try:
            # Load image and force 3-channel RGB before handing it to the pipeline
            with Image.open(source_img_path) as img:
                img_array = np.array(img.convert("RGB"))
            
            # Apply augmentation
            augmented = pipeline(image=img_array)
            augmented_img = Image.fromarray(augmented['image'])
            
            # Save augmented image
            aug_filename = f"{source_img_path.stem}_aug_{augmented_count}.jpg"
            augmented_img.save(target_class_dir / aug_filename, 'JPEG', quality=95)
            
            augmented_count += 1
            
        except Exception as e:
            print(f"    Error augmenting {source_img_path.name}: {e}")
            failed_attempts += 1
            continue
    
    final_count = original_count + augmented_count
    actual_factor = final_count / original_count
    
    print(f"  {class_name:<35}: {original_count:>4} → {final_count:>4} (factor: {actual_factor:.2f}×)")
    
    return augmented_count

# Apply augmentation to all classes
print("Applying augmentation to training set...\n")

total_augmented = 0
for class_name, factor in augmentation_factors.items():
    aug_count = augment_class(
        class_name,
        TRAIN_DIR,
        AUGMENTED_DIR,
        factor,
        augmentation_pipeline
    )
    total_augmented += aug_count

print("\nAugmentation complete!")
print(f"Total augmented images created: {total_augmented}")

Applying augmentation to training set...

  eczematous_dermatitis              :  666 →  999 (factor: 1.50×)
  genetic_neurocutaneous             :  172 →  430 (factor: 2.50×)
  papulosquamous                     :  148 →  370 (factor: 2.50×)
  malignant                          :  144 →  360 (factor: 2.50×)
  autoimmune                         :  143 →  357 (factor: 2.50×)
  benign_neoplastic                  :  141 →  352 (factor: 2.50×)
  parasitic                          :   51 →  153 (factor: 3.00×)
  pigmentary                         :   42 →  126 (factor: 3.00×)

Augmentation complete!
Total augmented images created: 1640
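
The per-class totals above follow from `int(original_count * augmentation_factor)`, which truncates toward zero. A class with an odd original count and a 2.5× factor therefore ends half an image short of the nominal target (143 → 357, not 358). The sketch below reproduces that arithmetic from the counts and factors printed above; the dictionary name `train_counts_and_factors` is illustrative and does not appear in the notebook.

```python
# Training counts and augmentation factors copied from the cell output above.
train_counts_and_factors = {
    "eczematous_dermatitis": (666, 1.5),
    "genetic_neurocutaneous": (172, 2.5),
    "papulosquamous": (148, 2.5),
    "malignant": (144, 2.5),
    "autoimmune": (143, 2.5),
    "benign_neoplastic": (141, 2.5),
    "parasitic": (51, 3.0),
    "pigmentary": (42, 3.0),
}

# Same formula as augment_class: truncate the product to an integer.
expected_totals = {
    name: int(count * factor)
    for name, (count, factor) in train_counts_and_factors.items()
}

# Images generated per class = class total after augmentation minus originals.
generated = {
    name: expected_totals[name] - count
    for name, (count, _) in train_counts_and_factors.items()
}
total_generated = sum(generated.values())

print(expected_totals["autoimmune"])   # 357
print(total_generated)                 # 1640
print(sum(expected_totals.values()))   # 3147
```

The total of 1640 generated images matches the figure reported by the cell, and 1507 originals plus 1640 generated gives 3147 files in the augmented training directory.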


## 11. Validation: Verify Augmentation Completion

Verify that augmentation succeeded by counting the files in each class directory.

**Note:** Visual inspection of medical images is skipped in the notebook to avoid
displaying potentially disturbing dermatological content. Images can be inspected
directly in the augmented directories if needed for quality verification.

In [17]:
# Count augmented images per class
print("Augmentation Verification:")
print(f"{'Class':<35} {'Original':<12} {'Augmented':<12} {'Factor':<12}")
print("="*75)

for class_dir in sorted(AUGMENTED_DIR.iterdir()):
    if class_dir.is_dir():
        all_images = list(class_dir.glob("*.jpg"))
        aug_images = [f for f in all_images if "_aug_" in f.name]
        original_count = len(all_images) - len(aug_images)
        # The "Augmented" column is the class total after augmentation (originals + generated)
        total_count = len(all_images)
        factor = total_count / original_count if original_count > 0 else 0
        
        print(f"{class_dir.name:<35} {original_count:<12} {total_count:<12} {factor:<12.2f}")

print("\n✓ Augmentation verification complete")
print("\nNote: To visually inspect samples, navigate to the augmented directories and")
print("      view files directly using an image viewer or external tool.")

Augmentation Verification:
Class                               Original     Augmented    Factor      
===========================================================================
autoimmune                          143          357          2.50        
benign_neoplastic                   141          352          2.50        
eczematous_dermatitis               666          999          1.50        
genetic_neurocutaneous              172          430          2.50        
malignant                           144          360          2.50        
papulosquamous                      148          370          2.50        
parasitic                           51           153          3.00        
pigmentary                          42           126          3.00        

✓ Augmentation verification complete

Note: To visually inspect samples, navigate to the augmented directories and
      view files directly using an image viewer or external tool.
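
Because visual inspection is skipped, a file count alone does not confirm that every augmented image is readable and correctly shaped. One possible complement is a programmatic check of mode and size. The sketch below is an assumption about how such a check could look, not part of the original notebook: `check_images` is a hypothetical helper, the 224×224 RGB expectation is taken from the preprocessing described in this notebook, and the demo runs on synthetic files in a temporary folder.

```python
import tempfile
from pathlib import Path

from PIL import Image


def check_images(class_dir, expected_size=(224, 224)):
    """Return (filename, problem) pairs for files that fail basic checks."""
    problems = []
    for path in sorted(Path(class_dir).glob("*.jpg")):
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode so damaged files raise here
                if img.mode != "RGB":
                    problems.append((path.name, f"mode {img.mode}"))
                elif img.size != expected_size:
                    problems.append((path.name, f"size {img.size}"))
        except OSError as exc:  # Pillow's decode errors derive from OSError
            problems.append((path.name, f"unreadable: {exc}"))
    return problems


# Demo on synthetic files: one valid image, one wrong size, one corrupt file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "pigmentary"
    demo_dir.mkdir()
    Image.new("RGB", (224, 224), (120, 80, 60)).save(demo_dir / "a.jpg", "JPEG")
    Image.new("RGB", (200, 224), (120, 80, 60)).save(demo_dir / "a_aug_0.jpg", "JPEG")
    (demo_dir / "b_aug_1.jpg").write_bytes(b"not a jpeg")
    demo_problems = check_images(demo_dir)

for name, problem in demo_problems:
    print(f"{name}: {problem}")
```

In the notebook, the same helper could be called on each class folder under `AUGMENTED_DIR`; an empty list for every class would indicate that all files decode as 224×224 RGB images.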


## 12. Generate Final Report

Summary statistics for documentation and reproducibility.

In [19]:
# Count final images in each split
def count_images_in_split(split_dir):
    counts = {}
    for class_dir in split_dir.iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = len(list(class_dir.glob("*.jpg")))
    return counts

train_final_counts = count_images_in_split(TRAIN_DIR)
val_final_counts = count_images_in_split(VAL_DIR)
test_final_counts = count_images_in_split(TEST_DIR)
augmented_final_counts = count_images_in_split(AUGMENTED_DIR)

# Create report
report = {
    'metadata': {
        'date_created': pd.Timestamp.now().isoformat(),
        'random_seed': RANDOM_SEED,
        'split_ratios': {
            'train': TRAIN_RATIO,
            'val': VAL_RATIO,
            'test': TEST_RATIO
        },
        'image_size': TARGET_SIZE,
        'total_original_images': len(df),
        'total_final_classes': df['final_class'].nunique()
    },
    'split_statistics': {
        'train': {
            'total_images': sum(train_final_counts.values()),
            'per_class': train_final_counts
        },
        'val': {
            'total_images': sum(val_final_counts.values()),
            'per_class': val_final_counts
        },
        'test': {
            'total_images': sum(test_final_counts.values()),
            'per_class': test_final_counts
        },
        'augmented': {
            'total_images': sum(augmented_final_counts.values()),
            'per_class': augmented_final_counts
        }
    },
    'augmentation_policy': {
        'purpose': 'Regularization, not class balancing',
        'scope': 'Training set only',
        'max_factor': 3.0,
        'transforms': [
            'HorizontalFlip',
            'VerticalFlip',
            'Rotate (±15°)',
            'RandomBrightnessContrast (±0.1)'
        ],
        'factors': augmentation_factors
    },
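    # Values below are recorded by hand from the leakage validation run earlier
    # in the notebook; they are not recomputed in this cell.
    'data_leakage_check': {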
        'train_val_overlap': 0,
        'train_test_overlap': 0,
        'val_test_overlap': 0,
        'status': 'PASS'
    }
}

# Save report
results_dir = BASE_DIR / 'results' / 'augmentation'
results_dir.mkdir(parents=True, exist_ok=True)
report_path = results_dir / 'augmentation_report.json'

with open(report_path, 'w') as f:
    json.dump(report, f, indent=2)

print("Final Report Generated")
print("="*60)
print(json.dumps(report, indent=2))
print("="*60)
print(f"\nReport saved to: {report_path}")

Final Report Generated
============================================================
{
  "metadata": {
    "date_created": "2026-02-09T12:20:53.011437",
    "random_seed": 42,
    "split_ratios": {
      "train": 0.7,
      "val": 0.15,
      "test": 0.15
    },
    "image_size": [
      224,
      224
    ],
    "total_original_images": 2155,
    "total_final_classes": 8
  },
  "split_statistics": {
    "train": {
      "total_images": 1507,
      "per_class": {
        "genetic_neurocutaneous": 172,
        "pigmentary": 42,
        "eczematous_dermatitis": 666,
        "malignant": 144,
        "benign_neoplastic": 141,
        "papulosquamous": 148,
        "autoimmune": 143,
        "parasitic": 51
      }
    },
    "val": {
      "total_images": 324,
      "per_class": {
        "genetic_neurocutaneous": 36,
        "pigmentary": 9,
        "eczematous_dermatitis": 143,
        "malignant": 31,
        "benign_neoplastic": 31,
        "papulosquamous": 32,
        "autoimmune": 31,
        "parasitic": 11
      }
    },
    "test": {
     

## 13. Summary

### What This Notebook Does ✓
1. Maps 112 raw conditions → 8 final prediction classes (clinical taxonomy)
2. Creates stratified 70/15/15 splits preserving class + FST distributions
3. Preprocesses all images (RGB, 224×224, JPEG)
4. Applies light augmentation (≤3×) to training set only
5. Validates no data leakage
6. Generates ImageFolder-compatible directory structure
7. Uses consistent naming (underscores, no spaces) for all folders
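
The leakage validation in step 5 can be recomputed directly from the split directories rather than recorded as constants. The following is a minimal sketch under one stated assumption: each image is identified by its file name, so two splits leak if they share a name. Identical images stored under different names would not be caught. `split_filenames` and `leakage_report` are hypothetical helpers, and the demo uses empty placeholder files in a temporary folder.

```python
import tempfile
from itertools import combinations
from pathlib import Path


def split_filenames(split_dir):
    """Collect image file names from every class folder of one split."""
    return {p.name for p in Path(split_dir).glob("*/*.jpg")}


def leakage_report(split_dirs):
    """Count shared file names for every pair of splits."""
    names = {split: split_filenames(d) for split, d in split_dirs.items()}
    report = {
        f"{a}_{b}_overlap": len(names[a] & names[b])
        for a, b in combinations(names, 2)
    }
    has_leak = any(count > 0 for count in report.values())
    report["status"] = "FAIL" if has_leak else "PASS"
    return report


# Demo: x2.jpg is deliberately placed in both train and test.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {"train": ["x1.jpg", "x2.jpg"], "val": ["x3.jpg"], "test": ["x2.jpg"]}
    for split, files in layout.items():
        class_dir = root / split / "malignant"
        class_dir.mkdir(parents=True)
        for name in files:
            (class_dir / name).touch()
    demo_report = leakage_report({s: root / s for s in layout})

print(demo_report)
```

Passing `{'train': TRAIN_DIR, 'val': VAL_DIR, 'test': TEST_DIR}` would yield the same keys as the `data_leakage_check` entry in the report, with the counts computed rather than typed in.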

### Final 8 Classes
1. malignant
2. benign_neoplastic
3. eczematous_dermatitis
4. papulosquamous
5. autoimmune
6. genetic_neurocutaneous
7. pigmentary
8. parasitic

### Output Layout
- **Processed splits:** `data/processed/fitzpatrick17k/{train,val,test}/`
- **Augmented training set:** `data/augmented/train/`

### Next Steps
For `04_model_training.ipynb` use:
- **Train:** `data/augmented/train/`
- **Val:** `data/processed/fitzpatrick17k/val/`
- **Test:** `data/processed/fitzpatrick17k/test/`
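
Since training reads from `data/augmented/train/` while validation and test read from `data/processed/fitzpatrick17k/`, the three directories need identical class folders so that label indices line up. Loaders such as torchvision's `ImageFolder` (assumed here to be the loader used in `04_model_training.ipynb`) assign indices by sorted folder name. The sketch below mimics that convention with the standard library only; `class_to_index` is a hypothetical helper and the demo builds an empty stand-in directory tree.

```python
import tempfile
from pathlib import Path

# The eight final classes listed in this notebook.
FINAL_CLASSES = [
    "malignant", "benign_neoplastic", "eczematous_dermatitis", "papulosquamous",
    "autoimmune", "genetic_neurocutaneous", "pigmentary", "parasitic",
]


def class_to_index(split_dir):
    """Map class folder names to indices in sorted order (hypothetical helper)."""
    classes = sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Demo on an empty stand-in tree; point the paths at the real directories instead.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for split in ("train", "val", "test"):
        for name in FINAL_CLASSES:
            (root / split / name).mkdir(parents=True)
    mappings = {s: class_to_index(root / s) for s in ("train", "val", "test")}

splits_agree = mappings["train"] == mappings["val"] == mappings["test"]
index_map = mappings["train"]

print(splits_agree)            # True
print(index_map["malignant"])  # 4
```

If the mappings differ between directories, for example because a class folder is missing from one split, predictions would be scored against the wrong labels, so this comparison is worth running before training.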