# TP3: Object Detection with YOLO

**Day 3 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day3/tp3.ipynb)

---

## Objectives

In this practical, you will learn **object detection** - the task of locating and classifying multiple objects in an image.

By the end of this practical, you will:

1. **Understand object detection**: Classification + localization with bounding boxes
2. **Learn YOLO format**: How to represent bounding box annotations
3. **Fine-tune YOLOv8**: Adapt a pre-trained model to detect aquarium animals
4. **Evaluate detection**: Understand mAP and analyze predictions

---

# Part 1: Object Detection vs Classification

## What's the Difference?

| Task | Input | Output | Example |
|------|-------|--------|---------|
| **Classification** | Image | Single label | "This is a fish" |
| **Detection** | Image | Multiple boxes + labels | "Fish at (x1,y1,x2,y2), Shark at (x3,y3,x4,y4)" |

```
Classification:                    Detection:
┌─────────────────┐               ┌─────────────────┐
│                 │               │ ┌────┐          │
│    (image)      │   →  "fish"   │ │fish│ ┌─────┐  │
│                 │               │ └────┘ │shark│  │
│                 │               │        └─────┘  │
└─────────────────┘               └─────────────────┘
```
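
To make the difference concrete, the two kinds of output can be sketched as plain Python data. The `Detection` class and the example values below are illustrative assumptions, not part of any library used in this practical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical container for illustration)."""
    label: str
    box: tuple  # pixel corners (x1, y1, x2, y2)
    confidence: float


# Classification: a single label for the whole image
classification_output = "fish"

# Detection: a variable-length list with one entry per object found
detection_output = [
    Detection("fish", (40.0, 30.0, 120.0, 90.0), 0.91),
    Detection("shark", (200.0, 80.0, 380.0, 170.0), 0.84),
]

for det in detection_output:
    print(f"{det.label} at {det.box} (confidence {det.confidence:.2f})")
```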

## YOLO: You Only Look Once

**YOLO** is a family of single-stage object detectors: one forward pass over the whole image predicts every box and class label at once.

Key advantages:
- **Real-time**: Small variants can process video at 30+ FPS on a modern GPU
- **End-to-end**: A single neural network predicts boxes and classes
- **Transfer learning**: Pre-trained on COCO (80 classes), easy to fine-tune

## Setup

In [None]:
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q ultralytics

import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from pathlib import Path
import yaml
import random

print("Setup complete!")

---

# Part 2: The Aquarium Dataset

We'll use the **Aquarium Dataset** - images from real aquariums with labeled marine animals.

**7 classes**: fish, jellyfish, penguin, puffin, shark, starfish, stingray

**638 images**: 448 train / 127 validation / 63 test

This is a great dataset for learning detection because:
- Multiple objects per image
- Varying object sizes
- Relevant to marine biology research

In [None]:
# Download and extract the dataset
!wget -q https://www.raphaelcousin.com/modules/sandbox/aquarium_yolo.zip
!unzip -q -o aquarium_yolo.zip
!rm aquarium_yolo.zip

# Check the structure
print("Dataset structure:")
!ls -la

In [None]:
# Load dataset configuration
with open('data.yaml', 'r') as f:
    data_config = yaml.safe_load(f)

class_names = data_config['names']
num_classes = data_config['nc']

print(f"Number of classes: {num_classes}")
print(f"\nClasses:")
for i, name in enumerate(class_names):
    print(f"  {i}: {name}")

In [None]:
# Count images in each split
train_images = list(Path('train/images').glob('*.jpg'))
valid_images = list(Path('valid/images').glob('*.jpg'))
test_images = list(Path('test/images').glob('*.jpg'))

print(f"Dataset splits:")
print(f"  Train:      {len(train_images)} images")
print(f"  Validation: {len(valid_images)} images")
print(f"  Test:       {len(test_images)} images")
print(f"  Total:      {len(train_images) + len(valid_images) + len(test_images)} images")

## Understanding YOLO Label Format

YOLO uses a simple text format for bounding box annotations:

```
class_id  x_center  y_center  width  height
```

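All coordinates are **normalized** to the range 0-1: `x_center` and `width` are divided by the image width, `y_center` and `height` by the image height. The origin (0,0) is the top-left corner: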

```
┌─────────────────────────────┐
│ (0,0)                       │
│      ┌─────────┐            │
│      │ (x_c,   │            │
│      │  y_c)   │ height     │
│      │    *    │            │
│      └─────────┘            │
│         width               │
│                       (1,1) │
└─────────────────────────────┘
```
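
The conversion between this normalized format and pixel corners can be sketched as follows. The helper names `yolo_to_pixels` and `pixels_to_yolo` and the 640x480 example image are assumptions made for illustration.

```python
def yolo_to_pixels(x_center, y_center, width, height, img_width, img_height):
    """Convert a normalized YOLO box to pixel corners (x1, y1, x2, y2)."""
    x1 = (x_center - width / 2) * img_width
    y1 = (y_center - height / 2) * img_height
    x2 = (x_center + width / 2) * img_width
    y2 = (y_center + height / 2) * img_height
    return x1, y1, x2, y2


def pixels_to_yolo(x1, y1, x2, y2, img_width, img_height):
    """Convert pixel corners back to a normalized YOLO box."""
    x_center = (x1 + x2) / 2 / img_width
    y_center = (y1 + y2) / 2 / img_height
    width = (x2 - x1) / img_width
    height = (y2 - y1) / img_height
    return x_center, y_center, width, height


# Hypothetical example: a box centered in a 640x480 image that covers
# half the width and half the height.
corners = yolo_to_pixels(0.5, 0.5, 0.5, 0.5, img_width=640, img_height=480)
print(corners)  # (160.0, 120.0, 480.0, 360.0)

# Round trip back to the normalized format
print(pixels_to_yolo(*corners, img_width=640, img_height=480))
```

Because the values are relative, the same label file stays valid if the image is resized.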

In [None]:
# Look at a sample label file
sample_label = list(Path('train/labels').glob('*.txt'))[0]
print(f"Sample label file: {sample_label.name}")
print(f"\nContents (class_id, x_center, y_center, width, height):")
print("-" * 50)
with open(sample_label) as f:
    for line in f:
        parts = line.strip().split()
        class_id = int(parts[0])
        coords = [float(x) for x in parts[1:]]
        print(f"  Class {class_id} ({class_names[class_id]:10s}): x={coords[0]:.3f}, y={coords[1]:.3f}, w={coords[2]:.3f}, h={coords[3]:.3f}")

## Visualize the Dataset

In [None]:
# Colors for each class
colors = plt.cm.tab10(np.linspace(0, 1, num_classes))

def plot_image_with_boxes(image_path, label_path, ax=None):
    """Plot image with YOLO format bounding boxes."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 8))
    
    # Load image
    img = Image.open(image_path)
    img_width, img_height = img.size
    ax.imshow(img)
    
    # Load and draw boxes
    if label_path.exists():
        with open(label_path) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) != 5:
                    continue  # skip blank or malformed lines
                class_id = int(parts[0])
                x_center, y_center, width, height = [float(x) for x in parts[1:]]
                
                # Convert normalized coords to pixels
                x1 = (x_center - width/2) * img_width
                y1 = (y_center - height/2) * img_height
                box_w = width * img_width
                box_h = height * img_height
                
                # Draw box
                rect = patches.Rectangle(
                    (x1, y1), box_w, box_h,
                    linewidth=2, edgecolor=colors[class_id], facecolor='none'
                )
                ax.add_patch(rect)
                
                # Add label
                ax.text(x1, y1-5, class_names[class_id], color=colors[class_id],
                       fontsize=10, fontweight='bold',
                       bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    ax.axis('off')
    return ax

In [None]:
# Display sample images with bounding boxes
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

sample_images = random.sample(train_images, 6)

for ax, img_path in zip(axes.flat, sample_images):
    label_path = Path('train/labels') / (img_path.stem + '.txt')
    plot_image_with_boxes(img_path, label_path, ax=ax)

plt.suptitle('Aquarium Dataset - Training Samples', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### Question 1

Look at the sample images:
1. How many objects are typically in each image?
2. Which classes seem most common? Which are rare?
3. What challenges do you notice? (occlusion, size variation, etc.)
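
A handful of random images can be misleading about class frequency, so it may help to count boxes per class across all label files. The sketch below is one possible way to do this; the function name `count_class_instances` is an assumption, and the usage lines rely on the `class_names` list loaded earlier.

```python
from collections import Counter
from pathlib import Path


def count_class_instances(label_dir):
    """Count how many boxes of each class ID appear in a folder of YOLO label files."""
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


# Usage in this notebook (assumes the dataset has been extracted):
# counts = count_class_instances("train/labels")
# for class_id, n in counts.most_common():
#     print(f"{class_names[class_id]:10s}: {n}")
```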

---

# Part 3: Load Pre-trained YOLO

YOLOv8 comes pre-trained on **COCO** (80 classes including some animals).

Let's see how well it works on our aquarium images **before fine-tuning**.

In [None]:
from ultralytics import YOLO

# Load pre-trained YOLOv8 nano (smallest, fastest)
model_pretrained = YOLO('yolov8n.pt')

print("Pre-trained YOLOv8n loaded!")
print(f"Model trained on COCO dataset (80 classes)")

In [None]:
# Test on a sample image
sample_image = random.choice(train_images)
results = model_pretrained.predict(source=str(sample_image), conf=0.25, verbose=False)

# Display result
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Original with ground truth
label_path = Path('train/labels') / (sample_image.stem + '.txt')
plot_image_with_boxes(sample_image, label_path, ax=axes[0])
axes[0].set_title('Ground Truth Labels', fontsize=12)

# Pre-trained model predictions
axes[1].imshow(Image.open(sample_image))
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
    conf = box.conf[0].cpu().numpy()
    cls = int(box.cls[0].cpu().numpy())
    cls_name = model_pretrained.names[cls]
    
    rect = patches.Rectangle((x1, y1), x2-x1, y2-y1,
                             linewidth=2, edgecolor='red', facecolor='none')
    axes[1].add_patch(rect)
    axes[1].text(x1, y1-5, f'{cls_name} {conf:.2f}', color='red', fontsize=9,
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

axes[1].axis('off')
axes[1].set_title(f'Pre-trained YOLO Predictions ({len(results[0].boxes)} detections)', fontsize=12)

plt.tight_layout()
plt.show()

print(f"\nPre-trained model detected: {[model_pretrained.names[int(b.cls[0])] for b in results[0].boxes]}")

### Question 2

1. Did the pre-trained model detect the aquarium animals correctly?
2. What classes from COCO might it confuse with our classes?
3. Why do we need to fine-tune instead of using the pre-trained model directly?

---

# Part 4: Fine-tune YOLO on Aquarium Data

Now we'll fine-tune the pre-trained model on our specific dataset.

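**Transfer learning for detection:**
- Reuse the backbone features learned on COCO (edges, textures, shapes) as the starting point
- Replace the 80-class COCO detection head with one for our 7 classes and train it on the aquarium images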

In [None]:
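# Write a dataset config with an absolute root, so the relative split paths
# resolve from the current directory rather than Ultralytics' default datasets folder
data_yaml_content = f"""path: {Path.cwd().resolve()}
train: train/images
val: valid/images
test: test/images

nc: {num_classes}
names: {class_names}
"""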

with open('aquarium.yaml', 'w') as f:
    f.write(data_yaml_content)

print("Created aquarium.yaml:")
print(data_yaml_content)

In [None]:
import torch

# Load a fresh model for fine-tuning
model = YOLO('yolov8n.pt')

# Use the first GPU if available, otherwise fall back to the CPU (much slower)
device = 0 if torch.cuda.is_available() else 'cpu'

# Fine-tune on our dataset
# Note: In Colab with GPU, this takes ~5-10 minutes
results = model.train(
    data='aquarium.yaml',
    epochs=30,           # Number of training epochs
    imgsz=640,           # Image size
    batch=16,            # Batch size (reduce if out of memory)
    patience=10,         # Stop early after 10 epochs without validation improvement
    device=device,       # GPU index or 'cpu'
    verbose=True,
    plots=True           # Generate training plots
)

### Understanding Training Metrics

During training, YOLO reports several metrics:

| Metric | Description |
|--------|-------------|
| **box_loss** | Bounding-box regression error against ground truth (lower = better) |
| **cls_loss** | Classification error on the predicted classes (lower = better) |
| **mAP50** | Mean Average Precision at IoU threshold 0.5 (higher = better) |
| **mAP50-95** | mAP averaged over IoU thresholds 0.50 to 0.95 in steps of 0.05 (stricter metric) |
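
Both mAP metrics rest on **IoU** (Intersection over Union), the overlap between a predicted box and a ground-truth box. A minimal sketch of the computation, assuming boxes are given as pixel corners `(x1, y1, x2, y2)`; the function name and example boxes are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: the prediction is shifted 20 px to the right of the ground truth
ground_truth = (0, 0, 100, 100)
prediction = (20, 0, 120, 100)
score = iou(ground_truth, prediction)
print(f"IoU = {score:.3f}")                # IoU = 0.667
print("Match at IoU=0.5:", score >= 0.5)   # Match at IoU=0.5: True
```

At the mAP50 threshold this prediction counts as a correct detection; at a stricter threshold such as 0.75 it would not.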

In [None]:
# Display training results
from IPython.display import Image as IPImage, display

# Show training curves
results_dir = Path(model.trainer.save_dir)
print(f"Results saved to: {results_dir}")

# Display training plots if they exist
plots = ['results.png', 'confusion_matrix.png', 'F1_curve.png', 'PR_curve.png']
for plot_name in plots:
    plot_path = results_dir / plot_name
    if plot_path.exists():
        print(f"\n{plot_name}:")
        display(IPImage(filename=str(plot_path), width=700))

---

# Part 5: Evaluate the Fine-tuned Model

Let's test our fine-tuned model on the validation set and compare to the pre-trained model.

In [None]:
# Load the best model from training
best_model_path = results_dir / 'weights' / 'best.pt'
model_finetuned = YOLO(str(best_model_path))

print(f"Loaded fine-tuned model from: {best_model_path}")

In [None]:
# Evaluate on validation set
metrics = model_finetuned.val(data='aquarium.yaml', verbose=False)

print("\n" + "="*50)
print("VALIDATION RESULTS")
print("="*50)
print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
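print("\nPer-class AP50:")
# ap_class_index gives the class ID for each AP value (classes absent from
# the validation set are skipped, so positions alone are not reliable)
for class_id, ap in zip(metrics.box.ap_class_index, metrics.box.ap50):
    print(f"  {class_names[class_id]:12s}: {ap:.3f}")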

## Visual Comparison: Before vs After Fine-tuning

In [None]:
def visualize_predictions(image_path, model, ax, title):
    """Run model on image and visualize predictions."""
    results = model.predict(source=str(image_path), conf=0.25, verbose=False)
    
    img = Image.open(image_path)
    ax.imshow(img)
    
    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        conf = box.conf[0].cpu().numpy()
        cls = int(box.cls[0].cpu().numpy())
        cls_name = model.names[cls]
        
        color = colors[cls] if cls < len(colors) else 'red'
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1,
                                 linewidth=2, edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        ax.text(x1, y1-5, f'{cls_name} {conf:.2f}', color=color, fontsize=9,
                fontweight='bold', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    ax.axis('off')
    ax.set_title(f'{title}\n({len(results[0].boxes)} detections)', fontsize=11)
    return len(results[0].boxes)

In [None]:
# Compare on validation images
val_samples = random.sample(valid_images, 4)

fig, axes = plt.subplots(4, 3, figsize=(15, 18))

for row, img_path in enumerate(val_samples):
    # Ground truth
    label_path = Path('valid/labels') / (img_path.stem + '.txt')
    plot_image_with_boxes(img_path, label_path, ax=axes[row, 0])
    axes[row, 0].set_title('Ground Truth', fontsize=11)
    
    # Pre-trained model
    visualize_predictions(img_path, model_pretrained, axes[row, 1], 'Pre-trained YOLO')
    
    # Fine-tuned model
    visualize_predictions(img_path, model_finetuned, axes[row, 2], 'Fine-tuned YOLO')

plt.suptitle('Comparison: Ground Truth vs Pre-trained vs Fine-tuned', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Question 3

Comparing the results:
1. How much did fine-tuning improve detection accuracy?
2. Which classes are detected best? Which are still challenging?
3. Do you notice any false positives or false negatives?

---

# Part 6: Test on Unseen Images

Let's inspect the fine-tuned model's predictions on the test set, images that were used neither for training nor for validation.

In [None]:
# Run inference on test images
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

test_samples = random.sample(test_images, 6)

for ax, img_path in zip(axes.flat, test_samples):
    visualize_predictions(img_path, model_finetuned, ax, '')

plt.suptitle('Fine-tuned Model - Test Set Predictions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

# Part 7: Exercise - Experiment with Training

Try modifying the training parameters to improve results.

In [None]:
# TODO: Experiment with different training settings
# Try modifying these parameters:

# Option 1: Use a larger model (better accuracy, slower)
# model = YOLO('yolov8s.pt')  # small instead of nano

# Option 2: Train longer
# epochs = 50  # instead of 30

# Option 3: Different image size
# imgsz = 800  # instead of 640

# Option 4: Adjust learning rate
# lr0 = 0.001  # initial learning rate

# Uncomment and run to experiment:
# model_exp = YOLO('yolov8s.pt')
# results_exp = model_exp.train(
#     data='aquarium.yaml',
#     epochs=50,
#     imgsz=640,
#     batch=16,
#     device=0,
# )

### Question 4

1. What is the trade-off between model size (nano/small/medium) and accuracy?
2. How does the number of epochs affect overfitting?
3. For a real scientific application, how would you decide which model to use?

---

# Summary

## Key Takeaways

1. **Object detection** = classification + localization (multiple objects per image)

2. **YOLO format**: `class_id x_center y_center width height` (normalized 0-1)

3. **Fine-tuning strategy**:
   - Start with pre-trained weights (COCO)
   - Train on your domain-specific data
   - Monitor mAP metrics during training

4. **Key metrics**:
   - **mAP50**: Standard detection metric (IoU threshold = 0.5)
   - **mAP50-95**: Stricter metric (averaged over multiple IoU thresholds)

5. **Model selection**:
   - YOLOv8n: Fastest, least accurate
   - YOLOv8s/m/l/x: Progressively more accurate but slower
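
The mAP metrics from takeaway 4 can be sketched in a few lines. This is a simplified version that sums the area under the precision-recall curve directly; evaluation libraries typically interpolate the curve at fixed recall points, so their values can differ slightly. The function name and the example hit lists are assumptions for illustration.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    is_true_positive: booleans, one per detection, sorted by descending
                      confidence (True = matched a ground-truth box at the
                      chosen IoU threshold).
    num_ground_truth: total number of ground-truth boxes for this class.
    """
    if num_ground_truth == 0:
        return 0.0

    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


# Hypothetical results at IoU=0.5 for two classes with 2 ground-truth boxes each
per_class_ap = {
    "fish": average_precision([True, False, True], num_ground_truth=2),
    "shark": average_precision([True, True], num_ground_truth=2),
}
map50 = sum(per_class_ap.values()) / len(per_class_ap)
print(f"fish AP50 = {per_class_ap['fish']:.3f}")  # fish AP50 = 0.833
print(f"mAP50     = {map50:.3f}")                 # mAP50     = 0.917
```

Repeating the same computation at IoU thresholds 0.50, 0.55, ..., 0.95 and averaging the results gives mAP50-95.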

## Scientific Applications

Object detection is used in many scientific domains:
- **Biology**: Cell counting, animal tracking, species identification
- **Medicine**: Tumor detection, organ localization in scans
- **Ecology**: Wildlife monitoring, population surveys
- **Astronomy**: Galaxy/star detection in telescope images

---

## Reflection Questions

Before finishing, think about:

1. **In your research**: What objects would you want to detect? (cells, animals, particles, structures?)

2. **Data annotation**: How would you create bounding box labels for your data? (manual annotation tools like LabelImg, CVAT)

3. **Pre-trained models**: Would COCO pre-training help for your domain, or would you need domain-specific pre-training?

4. **Deployment**: How would you use a trained detection model in practice? (batch processing, real-time video, edge devices?)