# Face vs No-Face Classification Report

**Author:** Valentinos Sanguinetti  
**Date:** 10 November 2025  
**Environment:** `deep-learning` project


## Executive Summary

This report documents the end-to-end workflow for detecting and classifying faces in grayscale images. We trained a suite of CNN variants, including fine-tuned pretrained networks, and evaluated them on a held-out test set. We also generated multi-model detection overlays on full-scene images to assess qualitative performance. All assets referenced here are produced by the tooling in this repository (`train_all.py`, `detect_all_models.py`, `evaluate_models.py`).


## Dataset & Model Overview

- **Dataset format:** Images organised via `ImageFolder` (`train_images/0` = no-face, `train_images/1` = face; likewise for `test_images/`). All inputs are normalised per-model.
- **Model zoo:** Custom CNN variants (`tiny`, `baseline`, `bn`, `threeconv`, `residual`, `improved`, `attention`) plus fine-tuned pretrained architectures (`resnet18`, `mobilenetv2`, `efficientnet`).
- **Training setup:** `train_all.py` runs 10 epochs per model, saving the best checkpoint to `artifacts/<model>/best_model.pt`.
- **Evaluation tooling:** `evaluate_models.py` aggregates metrics, bootstrapped confidence intervals, and plots; `detect_all_models.py` provides qualitative detections per scene.
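
The folder-per-class layout described above can be sanity-checked before training. The snippet below is a minimal sketch that mimics the `ImageFolder` convention (sorted sub-directory names become integer labels) using only the standard library. The helper names `index_image_folder` and `count_per_class`, and the list of accepted file extensions, are assumptions made for illustration and are not part of the repository.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the dataset actually contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".pgm", ".bmp"}


def index_image_folder(root):
    """Return (samples, class_to_idx) for a root/<class_name>/<image> layout.

    Hypothetical helper: sub-directory names are sorted and mapped to
    consecutive integer labels, which is the convention ImageFolder follows.
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}
    samples = []
    for name in class_names:
        for path in sorted((root / name).rglob("*")):
            # Skip directories and non-image files such as notes or caches.
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((path, class_to_idx[name]))
    return samples, class_to_idx


def count_per_class(samples):
    """Count how many samples carry each integer label."""
    counts = {}
    for _, label in samples:
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `count_per_class(index_image_folder("train_images")[0])` gives a quick view of the class balance, which is useful context when reading precision and recall later in the report.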


In [None]:
from pathlib import Path

import pandas as pd
from IPython.display import display, Markdown, Image

ARTIFACTS_DIR = Path("artifacts")
EVAL_DIR = ARTIFACTS_DIR / "evaluation"
DETECTIONS_DIR = ARTIFACTS_DIR / "detections"


In [None]:
summary_path = EVAL_DIR / "summary.csv"
if summary_path.exists():
    summary_df = pd.read_csv(summary_path)
    summary_df = summary_df.sort_values(by="accuracy", ascending=False).reset_index(drop=True)
    display(Markdown("### Test-Set Metrics per Model"))
    display(summary_df.style.format({
        "accuracy": "{:.4f}",
        "precision": "{:.4f}",
        "recall": "{:.4f}",
        "f1": "{:.4f}"
    }))
else:
    display(Markdown(f"⚠️ The file `{summary_path}` was not found. Run `evaluate_models.py` first."))


In [None]:
bootstrap_path = EVAL_DIR / "bootstrap_summary.csv"
if bootstrap_path.exists():
    boot_df = pd.read_csv(bootstrap_path)
    display(Markdown("### Bootstrap Confidence Intervals"))
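    # Reshape to one row per model, with (metric, statistic) column pairs.
    bootstrap_table = (
        boot_df.pivot(index="model", columns="metric", values=["mean", "ci_low", "ci_high"])
        .swaplevel(0, 1, axis=1)
        .sort_index(axis=1)
    )
    display(bootstrap_table.style.format("{:.4f}"))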
else:
    display(Markdown("Bootstrap summary not found. You can generate it via `python evaluate_models.py --bootstrap 1000 --ci 95`."))
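
For readers unfamiliar with the `mean`, `ci_low`, and `ci_high` columns, the following is a minimal sketch of a percentile bootstrap for accuracy. It is not the implementation in `evaluate_models.py`, which is not shown in this report. The function name `bootstrap_accuracy_ci`, its parameters, and the simple index-based percentile lookup are assumptions chosen for illustration.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, ci=95.0, seed=0):
    """Percentile-bootstrap estimate of accuracy with a confidence interval.

    Hypothetical helper: resamples the test set with replacement and
    recomputes accuracy on each resample.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]

    stats = []
    for _ in range(n_resamples):
        # Draw n indices with replacement and average the per-sample hits.
        hits = sum(correct[rng.randrange(n)] for _ in range(n))
        stats.append(hits / n)
    stats.sort()

    # Simple index-based percentiles over the sorted resample accuracies.
    alpha = (100.0 - ci) / 200.0
    low_idx = int(alpha * (n_resamples - 1))
    high_idx = int((1.0 - alpha) * (n_resamples - 1))
    return {
        "mean": sum(stats) / n_resamples,
        "ci_low": stats[low_idx],
        "ci_high": stats[high_idx],
    }
```

A narrow gap between `ci_low` and `ci_high` means the metric is stable under resampling of the test set; a wide gap means the point estimate in the summary table should be read with caution.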


In [None]:
metric_plots = ["accuracy.png", "f1.png", "precision.png", "recall.png"]
existing_plots = [EVAL_DIR / name for name in metric_plots if (EVAL_DIR / name).exists()]

if existing_plots:
    display(Markdown("### Metric Comparison Charts"))
    for plot_path in existing_plots:
        display(Markdown(f"**{plot_path.stem.title()}**"))
        display(Image(filename=str(plot_path)))
else:
    display(Markdown("No metric comparison charts found. Run `evaluate_models.py` to generate them."))


In [None]:
if DETECTIONS_DIR.exists():
    display(Markdown("### Qualitative Detection Results"))
    scene_dirs = sorted([p for p in DETECTIONS_DIR.iterdir() if p.is_dir()])
    if not scene_dirs:
        display(Markdown("Detection overlays not found. Run `detect_all_models.py <image>` to generate them."))
    else:
        for scene in scene_dirs[:2]:
            display(Markdown(f"#### Scene: `{scene.name}`"))
            model_images = sorted(scene.glob("*.png"))
            for img_path in model_images[:6]:
                display(Markdown(f"Model: **{img_path.stem}**"))
                display(Image(filename=str(img_path)))
else:
    display(Markdown("Detection overlays directory not found. Run the detection scripts first."))


## Key Takeaways

- Fine-tuned pretrained networks (`efficientnet`, `resnet18`, `mobilenetv2`) generally offer the strongest accuracy, while lighter CNN variants (`tiny`, `baseline`) remain competitive when latency is critical.
- Bootstrapped confidence intervals quantify the robustness of each model’s metrics; wider intervals suggest a need for more data or regularisation.
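- Detection overlays suggest that the models transfer well to cluttered scenes, capturing multiple faces and reporting a probability estimate for each detection. Overlays are a qualitative check and do not on their own establish that these probabilities are calibrated.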
- For production usage, we recommend combining a high-accuracy classifier (e.g., `efficientnet`) with threshold tuning informed by the bootstrap analysis and qualitative review.
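
The threshold tuning recommended above can be sketched as a sweep over candidate decision thresholds on held-out predictions. This is an illustrative example only: the function names `precision_recall_f1` and `best_threshold`, the candidate grid, and the choice of F1 as the selection criterion are assumptions, not something the repository prescribes.

```python
def precision_recall_f1(y_true, probs, threshold):
    """Compute precision, recall, and F1 for the face class (label 1)."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, probs):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    # Guard against empty denominators when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


def best_threshold(y_true, probs, candidates=None):
    """Return (threshold, f1) for the candidate with the highest F1.

    Ties are resolved in favour of the lowest threshold, because max()
    keeps the first maximum it encounters.
    """
    if candidates is None:
        # Assumed grid: 0.05, 0.10, ..., 0.95.
        candidates = [i / 100 for i in range(5, 100, 5)]
    scored = [(precision_recall_f1(y_true, probs, t)[2], t) for t in candidates]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1
```

If false positives are costlier than misses in the target application, the same sweep can select on precision at a minimum acceptable recall instead of on F1.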
