# Image Data Analysis Roadmap
*This notebook lays out a structured roadmap for analyzing image datasets, from setup, loading, cleaning, and preprocessing through exploratory analysis, augmentation, and evaluation. Students can follow these stages to prepare and analyze image data for tasks such as classification, detection, and segmentation.*

## 1) Setup & Imports

**Core**

* `numpy`, `pandas`, `matplotlib.pyplot`, `PIL.Image`, `cv2`
* For deep learning: `torch`, `torchvision` (or `tensorflow`, `keras`)
* Augmentations: `albumentations`, `torchvision.transforms`
* Hashing/duplicates: `imagehash`, `PIL`
* EXIF: `piexif`, `PIL.ExifTags`
* Format/IO at scale: `tqdm`, `pathlib`, `pyyaml` (imported as `yaml`), `json`, `webdataset`/`tfrecord`

**Reproducibility**

* Set seeds (`random`, `numpy`, framework seeds)
* Log package versions (`pip freeze`), GPU info
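
A minimal seeding helper might look like the sketch below. The name `set_seed` and the default seed value are illustrative choices, and the PyTorch branch only runs when `torch` is importable; with TensorFlow you would call that framework's own seeding function instead.

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed the Python and NumPy RNGs, plus PyTorch when it is installed."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional dependency
    except ImportError:
        return
    torch.manual_seed(seed)


# Re-seeding should reproduce the same random draws.
set_seed(42)
first = (random.random(), float(np.random.rand()))
set_seed(42)
second = (random.random(), float(np.random.rand()))
print(first == second)
```

Seeding alone does not make GPU training bit-for-bit deterministic, so it is still worth logging versions and hardware alongside the seed.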

---

## 2) Load Image Data

**Folder structures**

* Classification: `data/{train,val,test}/{class_name}/*.jpg`
* Detection/Segmentation: images + labels (COCO/YOLO/Pascal VOC; masks as PNG)
* External sources: Kaggle ZIPs, Google Drive, S3

**Methods**

* Build a manifest DataFrame: file path, split, class/labels, width, height, format, checksum/hash
* Validate paths and count images per split and per class
* For detection/segmentation: parse label files; verify image–annotation alignment

**Quick checks**

* Can every listed image be opened?
* Any zero-byte or unreadable files?
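
As a starting point, the manifest and the zero-byte check can be sketched with the standard library only. This assumes the classification layout `root/{split}/{class_name}/file`; the helper names `build_manifest` and `zero_byte_files` are made up for illustration. Width, height, and format are left out because they need an image library such as `PIL`.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root, extensions=(".jpg", ".jpeg", ".png")):
    """Walk root/{split}/{class_name}/* and return one dict per image file."""
    rows = []
    for path in sorted(Path(root).glob("*/*/*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        data = path.read_bytes()
        rows.append({
            "path": str(path),
            "split": path.parent.parent.name,
            "label": path.parent.name,
            "size_bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return rows


def zero_byte_files(manifest):
    """Return the paths of files that are empty on disk."""
    return [row["path"] for row in manifest if row["size_bytes"] == 0]


# Demo on a throwaway folder; the file contents are placeholders, not real images.
with tempfile.TemporaryDirectory() as tmp:
    class_dir = Path(tmp) / "train" / "cat"
    class_dir.mkdir(parents=True)
    (class_dir / "a.jpg").write_bytes(b"not-a-real-jpeg")
    (class_dir / "b.jpg").write_bytes(b"")  # simulated zero-byte file
    manifest = build_manifest(tmp)
    empty = [Path(p).name for p in zero_byte_files(manifest)]

print(len(manifest), empty)
```

The list of dicts can be passed to `pandas.DataFrame(manifest)` to get the manifest DataFrame. A non-empty file can still be unreadable, so a decode attempt per image is needed on top of this check.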

**Task-specific Data Loaders (you must build these; no code is provided here)**

* **Classification:** a loader that returns `(image, class_label)`
* **Detection:** a loader that returns `(image, bounding_boxes, class_labels)` and applies **bbox/mask-aware** transforms
* **Segmentation:** a loader that returns `(image, mask)` with **aligned** image–mask sizes

**Loader requirements**

* Apply consistent preprocessing (resize, normalization, augmentations)
* Cleanly support `train/val/test` splits
* Be modular so datasets can be swapped without changing training logic

---

## 3) Inspect Image Data

**Image metadata**

* Resolution stats (min/median/max, aspect ratios)
* Color mode (`L`, `RGB`, `RGBA`); color profile/ICC
* EXIF presence (orientation, datetime)
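
The resolution statistics can be summarized from a plain list of `(width, height)` pairs, for example taken from the manifest. The sketch below uses made-up sizes and a hypothetical helper name, `resolution_summary`.

```python
from statistics import median


def resolution_summary(sizes):
    """sizes: list of (width, height) pairs -> min/median/max for width, height, aspect ratio."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    ratios = [w / h for w, h in sizes]

    def stats(values):
        return {"min": min(values), "median": median(values), "max": max(values)}

    return {
        "width": stats(widths),
        "height": stats(heights),
        "aspect_ratio": stats(ratios),
    }


# Toy sizes for illustration only.
summary = resolution_summary([(640, 480), (1024, 768), (500, 500)])
print(summary["width"], summary["aspect_ratio"])
```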

**Dataset composition**

* Per-class counts & imbalance ratios
* Split sizes & class balance **per split**
* Annotation coverage (avg boxes per image, mask area %, category frequency)
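
Per-split class counts and imbalance ratios need nothing beyond `collections`. The sketch below assumes the labels arrive as `(split, label)` pairs and defines the imbalance ratio as largest class count divided by smallest, which is one common convention among several.

```python
from collections import Counter, defaultdict


def counts_per_split(records):
    """records: iterable of (split, label) pairs -> {split: Counter(label -> count)}."""
    table = defaultdict(Counter)
    for split, label in records:
        table[split][label] += 1
    return dict(table)


def imbalance_ratio(counts):
    """Largest class count divided by smallest class count (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


# Toy records for illustration only.
records = (
    [("train", "cat")] * 6 + [("train", "dog")] * 2
    + [("val", "cat")] * 2 + [("val", "dog")] * 1
)
table = counts_per_split(records)
ratios = {split: imbalance_ratio(counts) for split, counts in table.items()}
print(ratios)
```

Comparing the ratios across splits is a quick way to spot a split whose class balance drifts from the others.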

**Visual peeks**

* Random grids per class
* A few annotated examples (boxes/masks overlay)

---

## 4) Data Cleaning (Quality & Integrity)

**File & format**

* Remove corrupted/unreadable files (`PIL`/`cv2` try/except)
* Standardize formats (e.g., all `.jpg` or `.png`)
* Fix EXIF orientation and strip problematic EXIF if needed

**Near-duplicates / leakage**

* Perceptual hashing (`imagehash.average_hash/phash/dhash`) to find duplicates
* Cluster by hash distance to remove or keep one representative
* **Split leakage check**: ensure near-duplicates do not cross train/val/test
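
To make the idea concrete, here is a NumPy-only sketch in the spirit of an average hash. It is a simplified stand-in, not the `imagehash` implementation (which resizes with `PIL`), and the helper names and the `max_distance=5` threshold are assumptions to tune on your own data.

```python
import numpy as np


def average_hash(gray, hash_size=8):
    """Boolean hash_size x hash_size grid: True where a block mean exceeds the mean of all blocks.

    Assumes a 2-D grayscale array with at least hash_size pixels per side.
    """
    h, w = gray.shape
    # Crop so height and width divide evenly into hash_size blocks.
    h_crop, w_crop = h - h % hash_size, w - w % hash_size
    blocks = gray[:h_crop, :w_crop].astype(np.float64)
    blocks = blocks.reshape(hash_size, h_crop // hash_size, hash_size, w_crop // hash_size)
    small = blocks.mean(axis=(1, 3))
    return small > small.mean()


def hamming(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(hash_a != hash_b))


def cross_split_near_duplicates(items, max_distance=5):
    """items: list of (name, split, hash). Compares all pairs, so cost grows as O(n^2)."""
    hits = []
    for i, (name_a, split_a, hash_a) in enumerate(items):
        for name_b, split_b, hash_b in items[i + 1:]:
            if split_a != split_b and hamming(hash_a, hash_b) <= max_distance:
                hits.append((name_a, name_b))
    return hits


# Synthetic images: a gradient, a brighter copy, and a mirrored version.
base = np.tile(np.linspace(0, 200, 64), (64, 1))
brighter = base + 20
mirrored = base[:, ::-1]

items = [
    ("a.jpg", "train", average_hash(base)),
    ("b.jpg", "val", average_hash(brighter)),
    ("c.jpg", "val", average_hash(mirrored)),
]
leaks = cross_split_near_duplicates(items)
print(leaks)
```

The brighter copy hashes identically to the original because the threshold is relative to the image mean, which is why perceptual hashes catch re-exposed or re-encoded copies that a checksum misses.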

**Label QA**

* Class spelling/ontology normalization (e.g., `cat` vs `cats`)
* Out-of-range boxes, negative coords, boxes outside image bounds → fix or drop
* Masks: empty masks, mismatched sizes, non-binary pixels → fix
* Inter-annotator agreement sample (Cohen’s κ for category labels)
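
The "fix or drop" rule for boxes can be sketched as a small clipping helper. It assumes boxes in `(x_min, y_min, x_max, y_max)` pixel format; the name `clip_box` and the `min_size` threshold are illustrative.

```python
def clip_box(box, width, height, min_size=1.0):
    """Clip a pixel-space box to the image; return None if too little of it survives."""
    x_min, y_min, x_max, y_max = box
    x_min = max(0.0, min(x_min, width))
    x_max = max(0.0, min(x_max, width))
    y_min = max(0.0, min(y_min, height))
    y_max = max(0.0, min(y_max, height))
    # Inverted or fully out-of-image boxes end up with non-positive extent.
    if x_max - x_min < min_size or y_max - y_min < min_size:
        return None
    return (x_min, y_min, x_max, y_max)


fixed = clip_box((-10, 5, 50, 500), width=100, height=200)    # partly outside: clipped
dropped = clip_box((120, 10, 150, 20), width=100, height=200)  # fully outside: dropped
print(fixed, dropped)
```

Logging how many boxes were clipped versus dropped gives a measurable record of the cleaning step.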

**Image quality**

* Blur detection (Laplacian variance); flag too-blurry images
* Noise/artifacts (hot/dead pixels, compression blocks) → consider denoising
* Exposure issues: histogram clipping; extreme brightness/contrast
* Broken transparency (unexpected alpha channel)
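
The Laplacian-variance blur score can be illustrated without OpenCV. The sketch below applies the 4-neighbour Laplacian to interior pixels only, so its border handling differs from `cv2.Laplacian`; the helper names are made up, and any cutoff for "too blurry" has to be calibrated per dataset.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels; lower means blurrier."""
    g = gray.astype(np.float64)
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def box_blur3(gray):
    """3x3 mean filter over interior pixels (output is 2 pixels smaller per side)."""
    g = gray.astype(np.float64)
    h, w = g.shape
    shifted = [g[dy:dy + h - 2, dx:dx + w - 2] for dy in range(3) for dx in range(3)]
    return sum(shifted) / 9.0


# Synthetic test images: a sharp checkerboard, its blurred copy, and a flat image.
yy, xx = np.indices((64, 64))
sharp = (((yy // 8) + (xx // 8)) % 2) * 255.0
blurred = box_blur3(sharp)
flat = np.full((64, 64), 128.0)

scores = {
    "sharp": laplacian_variance(sharp),
    "blurred": laplacian_variance(blurred),
    "flat": laplacian_variance(flat),
}
print(scores)
```

Plotting this score over the whole dataset before choosing a threshold avoids discarding images that are merely low in texture.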

**Ethics/PII**

* Faces and license plates: blur or remove if policy requires
* Copyright/source audit trail

---

## 5) Splitting Strategy (Prevent Leakage)

**Methods**

* **Stratified** split by class for classification
* **Group-aware** split (patient ID, video ID, scene/location) to avoid correlated leakage
* **Time-aware** split if the task is temporal
* Ensure **aspect-ratio distribution** and **quality distributions** are similar across splits
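
One simple way to get a group-aware split is to hash the group ID into a bucket, so every image from the same patient, video, or scene lands in the same split. This is a sketch under assumptions: `assign_split` and the fractions are illustrative, the resulting split sizes are only approximate, and it does not stratify by class.

```python
import hashlib


def assign_split(group_id, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a group ID to 'train', 'val', or 'test'."""
    digest = hashlib.md5(str(group_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # roughly uniform in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


# Hypothetical records: (image name, group ID).
records = [
    ("img_001.jpg", "patient_A"),
    ("img_002.jpg", "patient_A"),
    ("img_003.jpg", "patient_B"),
    ("img_004.jpg", "patient_C"),
]
splits = {image: assign_split(group) for image, group in records}
print(splits)
```

Because the assignment depends only on the group ID, adding new images later never moves existing ones between splits.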

**Validation design**

* K-fold or GroupKFold for small datasets
* Fixed hold-out + cross-val on train when tuning

---

## 6) Preprocessing (Standardization)

**Geometry**

* Resize with preserved content:
  * Classification: fixed size (e.g., 224×224, 299×299) with center crop or letterbox padding
  * Detection: keep aspect ratio + letterbox
  * Segmentation: consistent size for image & mask
* Optional cropping (center, face/object-aware, or content-aware)
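
The letterbox arithmetic is easy to get wrong for boxes, so here is a sketch of the geometry only (no pixel resampling). The helper names are hypothetical and boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixels.

```python
def letterbox_params(width, height, target_w, target_h):
    """Scale and padding needed to fit an image into the target size while keeping aspect ratio."""
    scale = min(target_w / width, target_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x, pad_y = target_w - new_w, target_h - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "scale": scale,
        "new_size": (new_w, new_h),
        "pad": (left, top, pad_x - left, pad_y - top),  # left, top, right, bottom
    }


def box_to_letterbox(box, params):
    """Map a box from original pixel coordinates into the letterboxed image."""
    scale = params["scale"]
    left, top = params["pad"][0], params["pad"][1]
    x_min, y_min, x_max, y_max = box
    return (x_min * scale + left, y_min * scale + top,
            x_max * scale + left, y_max * scale + top)


params = letterbox_params(640, 480, 416, 416)
mapped = box_to_letterbox((0, 0, 640, 480), params)
print(params["new_size"], params["pad"])
```

The same `scale` and `pad` values must be inverted at inference time to map predictions back to original image coordinates.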

**Photometric**

* Scale pixel values to [0, 1] or [-1, 1]; for mean/std normalization, use per-channel statistics (dataset-specific or ImageNet)
* Color space conversion (`BGR↔RGB`, `RGB↔Lab/HSV`)
* Histogram equalization / CLAHE (on grayscale or per channel; watch for color shifts when applied per channel)
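
A NumPy sketch of mean/std normalization follows. It assumes HWC `uint8` RGB input; the constants are the commonly used ImageNet channel statistics, and `channel_stats` shows how dataset-specific values could be computed instead. Function names are illustrative.

```python
import numpy as np

# Widely used ImageNet RGB statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize(image_uint8, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """HWC uint8 RGB -> float32 HWC, scaled to [0, 1] and standardized per channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - mean) / std


def channel_stats(images):
    """Per-channel mean and std over a list of HWC uint8 RGB images, on the [0, 1] scale."""
    pixels = np.concatenate([img.reshape(-1, 3) for img in images]).astype(np.float64) / 255.0
    return pixels.mean(axis=0), pixels.std(axis=0)


white = np.full((2, 2, 3), 255, dtype=np.uint8)
black = np.zeros((2, 2, 3), dtype=np.uint8)

normalized = normalize(white)
mean, std = channel_stats([white, black])
print(normalized[0, 0], mean, std)
```

Images loaded with `cv2` arrive in BGR order, so they need a channel swap (for example `image[..., ::-1]`) before RGB statistics are applied. Dataset statistics should be computed on the training split only.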

**Denoise / Deblur (if needed)**

* Median/Bilateral filter (small, conservative)
* Non-local means; gentle settings
* Avoid heavy deblurring unless justified (it can introduce artifacts that no longer match the labels)

**Consistency**

* Ensure preprocessing mirrors inference (document transforms)
* Cache preprocessed outputs if I/O bound

---

## 7) Data Augmentation (Task-Specific)

**Geometric**

* Flip (H/V), small rotations, scale, shear, translate, perspective
* For detection: use bbox/mask-aware transforms (Albumentations with bbox/mask params)
* Maintain label validity after transforms
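
To show what "bbox-aware" means in practice, the sketch below flips an image horizontally and updates its boxes by hand. In a real pipeline a library such as Albumentations (with `BboxParams`) would handle this; the function here is only an illustration and assumes `(x_min, y_min, x_max, y_max)` pixel boxes.

```python
import numpy as np


def hflip_with_boxes(image, boxes):
    """Flip an HW or HWC image left-right and mirror its boxes accordingly."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    out = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    out[:, 0] = width - boxes[:, 2]
    out[:, 2] = width - boxes[:, 0]
    return flipped, out


# Toy image with a filled rectangle that the box encloses exactly.
image = np.zeros((4, 10), dtype=np.uint8)
image[1:3, 2:5] = 1
flipped, new_boxes = hflip_with_boxes(image, [(2, 1, 5, 3)])

x0, y0, x1, y1 = new_boxes[0].astype(int)
print(new_boxes[0], flipped[y0:y1, x0:x1].sum())
```

Checking that the transformed box still encloses the object, as the demo does, is a cheap test worth running on a few real samples for every geometric transform.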

**Photometric**

* Brightness/contrast, gamma, hue/saturation, grayscale, Gaussian noise
* JPEG compression artifacts for robustness

**Cut-style / Mix-style**

* Cutout, Random Erasing
* MixUp, CutMix (classification)
* Mosaic, MixUp (detection, e.g., YOLO-style)
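
MixUp for classification reduces to a convex combination of two samples and their one-hot labels. The sketch below draws one mixing coefficient per batch from a Beta distribution; `alpha=0.2` is an assumed starting value, not a recommendation from this roadmap.

```python
import numpy as np


def mixup(images, labels_onehot, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = float(rng.beta(alpha, alpha))    # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))    # partner index for each sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y, lam


rng = np.random.default_rng(0)
images = rng.random((4, 8, 8, 3))          # toy batch of 4 RGB images
labels = np.eye(3)[[0, 1, 2, 0]]           # one-hot labels for 3 classes

mixed_x, mixed_y, lam = mixup(images, labels, rng=rng)
print(mixed_x.shape, mixed_y.sum(axis=1))
```

The mixed labels are soft targets, so the loss must accept class probabilities rather than integer class indices.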

**Policies**

* RandAugment, AutoAugment (classification)
* Probability scheduling (e.g., lighter augmentation early in training, or tapering it off near the end)
* Keep an **augment vs. no-augment** ablation to justify choices

---

## 8) Feature Engineering (Optional, Classical)

**Classical descriptors**

* Color histograms, Haralick textures (GLCM), LBP
* SIFT/ORB (if allowed) → Bag-of-Visual-Words / VLAD

**Deep features**

* Extract embeddings from pretrained CNN backbones (ResNet, ViT) for:
  * EDA (t-SNE/UMAP)
  * Simple classifiers (Logistic/Linear SVM) as baselines

---

## 9) Exploratory Data Analysis (EDA)

**Distribution plots**

* Class counts (bar), per-class sample grids
* Image width/height & aspect ratio histograms
* Sharpness/blur score distribution; brightness/contrast distributions
* Channel means/stds per split

**Qualitative panels**

* Before/after preprocessing and augmentation comparisons
* Outlier gallery: extreme sizes, extreme brightness, heavy blur

**Embeddings**

* t-SNE/UMAP of deep features to see cluster separability & mislabeled points

**Annotation EDA**

* Detection: box size distribution, aspect ratios, per-image box counts
* Segmentation: mask area %, boundary complexity, class co-occurrence

**Leakage & imbalance**

* Near-duplicate heatmaps across splits
* Imbalance visuals; decide on weighted loss, focal loss, or sampling
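
If a weighted loss is chosen, one common heuristic sets each class weight to `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The sketch below is one option among several, and the function name is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels for illustration only.
weights = balanced_class_weights(["cat"] * 6 + ["dog"] * 2)
print(weights)
```

With these weights every class contributes the same total weight to the loss, which is a reasonable baseline to compare against focal loss or resampling.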

---

## 10) Baselines & Checks (Tiny Models)

**Baselines**

* Classical: deep-feature + logistic regression
* DL quick baseline: small CNN or pretrained head (frozen backbone)

**Sanity**

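* **Label shuffle test**: train with randomly permuted labels; validation performance should fall to roughly chance level
* Train on a **small subset**: the model should overfit (near-zero training loss); if it cannot, suspect a bug in the pipeline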
* Quick Grad-CAM on a few images to check the model looks at the object, not corners/watermarks

---

## 11) Documentation & Dataset Card

**Include**

* Data source(s), licenses, collection dates
* Preprocessing & augmentation summary
* Known limitations/biases & PII handling
* Splits rationale and leakage checks
* Versioning: dataset vX.Y with changelog
* Repro steps: scripts/commands to rebuild

---

## 12) Packaging for Training

**Pipelines**

* PyTorch `Dataset`/`DataLoader` or TF `tf.data`
* COCO/YOLO/VOC converters (one canonical format)
* Sharding: TFRecords or WebDataset (tar shards) for speed
* Caching: memory-mapped arrays of decoded images, or fast on-the-fly decoding (e.g., OpenCV with TurboJPEG)

**Performance**

* Worker count, pinned memory, prefetching
* Mixed precision readiness (if training later)

---

## 13) Readiness Checklist (Go/No-Go)

* [ ] All images open; formats standardized
* [ ] No cross-split near-duplicates
* [ ] Labels validated (boxes/masks sane)
* [ ] Splits stratified/group-aware and balanced
* [ ] Preprocessing & augmentation defined and reproducible
* [ ] EDA done; issues addressed or documented
* [ ] Baseline sanity checks passed
* [ ] Dataset card complete; license clear

---

## 14) Task-Specific Metrics (for when you evaluate)

* **Classification**: Accuracy, Precision/Recall/F1 (macro), AUROC, confusion matrix
* **Detection**: mAP@[.5:.95], per-class AP, AR at a fixed number of detections per image (e.g., AR@1, AR@10, AR@100)
* **Segmentation**: mIoU, Dice/F1, boundary F-score
* **Image quality tasks**: PSNR, SSIM (if relevant)
* **Robustness**: performance under augment-like perturbations
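
A few of these metrics are short enough to write out, which helps when checking library outputs. The sketch below covers box IoU (the building block of mAP), binary-mask IoU and Dice, and macro F1. Function names are illustrative, and treating two empty masks as a perfect match is a convention assumed here.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou_dice(pred, target):
    """IoU and Dice for two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0    # both masks empty: count as a match
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


pred = np.zeros((4, 4), dtype=np.uint8)
pred[0:2, :] = 1
target = np.zeros((4, 4), dtype=np.uint8)
target[1:3, :] = 1

iou_box = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
iou_mask, dice_mask = mask_iou_dice(pred, target)
f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
print(iou_box, iou_mask, dice_mask, f1)
```

mIoU is the mean of the per-class mask IoU, computed by applying `mask_iou_dice` to one class at a time.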

---

### Notes for Students

* **Start simple**: get a clean baseline before heavy augmentation.
* **Measure, don’t guess**: every cleaning step should show a measurable benefit or clear risk reduction (e.g., leakage).
* **Document everything**: record what you changed and why; future you will thank present you.
