<a href="https://colab.research.google.com/github/tamara-kostova/MSc_Thesis_Neuroimaging/blob/master/03_data_loaders_augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03 – Data Loaders and Augmentation

This notebook defines the PyTorch datasets, data loaders, and augmentation pipelines used for
training and evaluating deep learning models on neuroimaging data (MRI and CT).

All images have already been preprocessed in the previous step (grayscale conversion and resizing).
Here, we focus on:
- Dataset abstractions
- Train/validation/test data loaders
- Medically plausible data augmentation


In [4]:
BASE_DIR = "/content/drive/MyDrive/MSc_Thesis_Neuroimaging"
RAW_DIR = f"{BASE_DIR}/data/raw"
PROC_DIR = f"{BASE_DIR}/data/processed"
SPLIT_DIR = f"{BASE_DIR}/data/split"

In [6]:
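import glob
import os

def collect_images(class_path):
    """Return the paths of all PNG/JPEG files directly inside class_path."""
    files = []
    for ext in ("*.png", "*.jpg", "*.jpeg"):
        files.extend(glob.glob(os.path.join(class_path, ext)))
    return sorted(files)  # glob order is arbitrary; sort for a stable sample order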

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Dataset class
Key design choices:
- Images are loaded in **grayscale**, consistent with MRI and CT imaging
- Labels are inferred from folder structure
- Class-to-index mappings are explicit and reusable across splits
- The same dataset class supports binary and multiclass classification tasks
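
The label inference and mapping reuse described above can be sketched in isolation. This is a minimal illustration rather than the notebook's dataset class: the helper `build_class_to_idx` and the folder names `healthy` and `tumor` are hypothetical, and the directories are temporary ones created only for the demonstration.

```python
import os
import tempfile


def build_class_to_idx(data_dir, class_to_idx=None):
    """Map each sub-folder name of data_dir to an integer label.

    Hypothetical helper: an existing mapping is extended, never renumbered,
    so labels stay consistent when it is reused for another split.
    """
    mapping = dict(class_to_idx) if class_to_idx else {}
    for name in sorted(os.listdir(data_dir)):
        if os.path.isdir(os.path.join(data_dir, name)) and name not in mapping:
            mapping[name] = len(mapping)
    return mapping


# Demo on throwaway folders (the class names are made up for illustration).
with tempfile.TemporaryDirectory() as root:
    for split, classes in {"train": ["tumor", "healthy"], "val": ["tumor"]}.items():
        for cls in classes:
            os.makedirs(os.path.join(root, split, cls))

    train_map = build_class_to_idx(os.path.join(root, "train"))
    # The val split lacks "healthy"; reusing train_map keeps "tumor" at index 1.
    val_map = build_class_to_idx(os.path.join(root, "val"), train_map)
    # Built on its own, the val split would number "tumor" from 0 instead.
    val_alone = build_class_to_idx(os.path.join(root, "val"))
```

Without the shared mapping, a split that is missing a class would assign a different index to the same class name, which is why the mapping is built once and passed on to the later splits.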

In [7]:
import os
import numpy as np
from torch.utils.data import Dataset
from PIL import Image

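class NeuroImageDataset(Dataset):
    """Grayscale image dataset whose labels come from the sub-folder names of data_dir."""

    def __init__(self, data_dir, transform=None, class_to_idx=None):
        self.data_dir = data_dir
        self.transform = transform

        self.samples = []
        # Copy any mapping passed in so the caller's dict is never mutated.
        self.class_to_idx = dict(class_to_idx) if class_to_idx else {}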

        for class_name in sorted(os.listdir(data_dir)):
            class_path = os.path.join(data_dir, class_name)
            if not os.path.isdir(class_path):
                continue

            if class_name not in self.class_to_idx:
                self.class_to_idx[class_name] = len(self.class_to_idx)

            label = self.class_to_idx[class_name]

            for img_path in collect_images(class_path):
                self.samples.append((img_path, label))

    def __len__(self):
        return len(self.samples)

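    def __getitem__(self, idx):
        img_path, label = self.samples[idx]

        # Load as single-channel 8-bit grayscale; the context manager closes the file.
        with Image.open(img_path) as img:
            image = np.array(img.convert("L"))

        # Albumentations transforms take keyword arguments and return a dict.
        if self.transform:
            image = self.transform(image=image)["image"]

        return image, label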


## Augmentations
Augmentations are chosen to be **medically plausible**:
- Small rotations (up to ±10°) to simulate patient positioning variability
- Horizontal flips where anatomically acceptable
- Mild brightness/contrast changes and Gaussian noise to simulate scanner differences


In [8]:
import albumentations as A
from albumentations.pytorch import ToTensorV2

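train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),               # left-right mirror
    A.Rotate(limit=10, p=0.5),             # random angle in [-10, 10] degrees
    A.RandomBrightnessContrast(p=0.2),     # brightness/contrast jitter
    A.GaussNoise(p=0.15),                  # additive Gaussian noise
    A.Normalize(mean=(0.5,), std=(0.5,)),  # 8-bit [0, 255] -> [-1, 1]
    ToTensorV2()                           # (H, W) array -> (1, H, W) tensor
])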

val_test_aug = A.Compose([
    A.Normalize(mean=(0.5,), std=(0.5,)),
    ToTensorV2()
])
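
As a quick check of what `A.Normalize(mean=(0.5,), std=(0.5,))` does to 8-bit pixels: with the default `max_pixel_value=255.0`, Albumentations computes `(pixel - mean * max_pixel_value) / (std * max_pixel_value)`. The plain-Python sketch below reproduces that arithmetic; `normalize_pixel` is a hypothetical helper written for illustration and is not part of Albumentations.

```python
def normalize_pixel(value, mean=0.5, std=0.5, max_pixel_value=255.0):
    """Reproduce (value - mean * max) / (std * max) for a single pixel."""
    return (value - mean * max_pixel_value) / (std * max_pixel_value)


black = normalize_pixel(0)        # darkest 8-bit value
white = normalize_pixel(255)      # brightest 8-bit value
mid_gray = normalize_pixel(127.5) # midpoint of the 8-bit range
print(black, white, mid_gray)
```

Black maps to -1, white to 1, and mid-gray to 0, so the tensors produced by both pipelines lie in the range [-1, 1].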


## DataLoaders
PyTorch `DataLoader` objects are created for each dataset split:
- **Training**: shuffled batches
- **Validation/Test**: deterministic order

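This setup provides:
- No information leakage between splits, provided the split folders themselves are disjoint
- Efficient batching and parallel data loading (`num_workers=2`, `pin_memory=True`)
- Reproducible evaluation, because validation and test batches are neither shuffled nor augmented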
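
The absence of leakage depends on the split folders being disjoint, which the loaders cannot verify by themselves. One way to check it is to compare content hashes across splits; the sketch below does that on temporary folders. The helper names `file_hashes` and `find_duplicates` and the dummy files are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def file_hashes(split_dir):
    """Return {SHA-256 hex digest: relative path} for every file under split_dir."""
    hashes = {}
    for dirpath, _, filenames in os.walk(split_dir):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            hashes[digest] = os.path.relpath(path, split_dir)
    return hashes


def find_duplicates(dir_a, dir_b):
    """List (path in dir_a, path in dir_b) pairs with byte-identical content."""
    a, b = file_hashes(dir_a), file_hashes(dir_b)
    return sorted((a[h], b[h]) for h in a.keys() & b.keys())


# Demo on throwaway folders with one deliberately duplicated file.
with tempfile.TemporaryDirectory() as root:
    demo_files = {
        "train": {"a.png": b"scan-1", "b.png": b"scan-2"},
        "test": {"c.png": b"scan-2", "d.png": b"scan-3"},
    }
    for split, files in demo_files.items():
        os.makedirs(os.path.join(root, split))
        for name, payload in files.items():
            with open(os.path.join(root, split, name), "wb") as fh:
                fh.write(payload)

    leaks = find_duplicates(os.path.join(root, "train"), os.path.join(root, "test"))
    print(leaks)
```

Hashing only catches byte-identical copies. Re-encoded images, or different slices from the same patient, would need a separate patient-level check.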

In [11]:
from torch.utils.data import DataLoader

dataset_paths = {
    "tumor_binary": f"{SPLIT_DIR}/MRI_tumor_binary_norm",
    "stroke_binary": f"{SPLIT_DIR}/CT_stroke_binary_norm"
}

for name, base_path in dataset_paths.items():
    print(f"\n{name.upper()}")

    class_to_idx = None  # shared mapping across splits

    for split in ["train", "val", "test"]:
        split_path = os.path.join(base_path, split)

        if not os.path.exists(split_path):
            continue

        ds = NeuroImageDataset(
            split_path,
            transform=train_aug if split == "train" else val_test_aug,
            class_to_idx=class_to_idx
        )

        class_to_idx = ds.class_to_idx  # preserve mapping

        dl = DataLoader(
            ds,
            batch_size=32,
            shuffle=(split == "train"),
            num_workers=2,
            pin_memory=True
        )

        images, labels = next(iter(dl))
        print(f"{split}: images={images.shape}, labels={labels.shape}")



TUMOR_BINARY
train: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])
val: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])
test: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])

STROKE_BINARY
train: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])
val: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])
test: images=torch.Size([32, 1, 224, 224]), labels=torch.Size([32])
