# Understanding YOLO v2: Better, Faster, Stronger

YOLO v2 represents a significant improvement over YOLO v1 in object detection. I'll explain the key concepts, improvements, and technical details in depth.

## Recap of YOLO v1

YOLO (You Only Look Once) v1 was a single-shot detector that treated object detection as a regression problem. Unlike region-based approaches, YOLO v1:

- Processed the entire image in a single pass
- Used a grid-based approach (dividing the image into 7×7 grid cells)
- Each grid cell predicted 2 bounding boxes and confidence scores
- Used a unified architecture to predict bounding boxes and class probabilities
- Provided class probability distribution for each grid cell (not per box)

The main limitation was that YOLO v1 could detect at most 49 objects, one per cell of the 7×7 grid. Each cell predicted a single set of class probabilities, so it was responsible for only one object regardless of how many boxes it predicted.

## YOLO v2 Improvements

YOLO v2 aimed to address several limitations of YOLO v1:

1. **Accuracy**: YOLO v1 was faster than Faster R-CNN but significantly less accurate (~10 mAP difference)
2. **Localization issues**: Poor bounding box placement
3. **Recall rate**: Missed too many objects

Through a series of modifications, YOLO v2 improved mAP on Pascal VOC 2007 from 63.4% to 78.6% while maintaining speed advantages.

## Key Modifications in YOLO v2

### 1. Batch Normalization

- Added batch normalization to all convolutional layers
- Stabilized training and improved convergence
- Provided regularization effect, eliminating the need for dropout
- Result: +2% mAP improvement

### 2. High-Resolution Classifier

YOLO v1's training process:
- Pre-trained on ImageNet at 224×224 resolution
- Directly fine-tuned for detection at 448×448 resolution

YOLO v2's improved approach:
- Pre-trained on ImageNet at 224×224 resolution
- Fine-tuned the classifier on ImageNet at 448×448 for 10 epochs
- Then fine-tuned for detection at 448×448
- Result: +4% mAP improvement

This gradual resolution increase helped the network adapt better to higher resolution inputs.

### 3. Convolutional With Anchor Boxes

YOLO v1 used fully connected layers for the final prediction, limiting the number of boxes it could predict (98 boxes total).

YOLO v2 made several changes:
- Removed the fully connected layers
- Made the network fully convolutional
- Increased grid resolution from 7×7 to 13×13
- Introduced anchor boxes (predefined box shapes)
- Predicted class probabilities per box rather than per cell
- Result: Recall rose from 81% to 88%, at the cost of a small drop in mAP (69.5% to 69.2%)
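
The jump in the number of candidate boxes can be checked with a little arithmetic. The sketch below assumes a 416×416 input and a total downsampling factor of 32, which are the values given in the YOLO v2 paper rather than in the list above, and 5 anchors per cell; the function names are illustrative only.

```python
def grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output grid for a square input image."""
    return input_size // stride


def total_boxes(grid: int, boxes_per_cell: int) -> int:
    """Number of boxes predicted over the whole grid."""
    return grid * grid * boxes_per_cell


yolo_v1_boxes = total_boxes(grid=7, boxes_per_cell=2)
yolo_v2_grid = grid_size(416)  # assumed input size; 416 / 32 = 13
yolo_v2_boxes = total_boxes(grid=yolo_v2_grid, boxes_per_cell=5)

print(yolo_v1_boxes, yolo_v2_grid, yolo_v2_boxes)  # 98 13 845
```

An odd grid size such as 13 gives the feature map a single center cell, which helps with large objects that tend to sit in the middle of the image.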

### 4. Dimension Clusters (Anchor Box Selection)

Rather than using manually defined anchor boxes, YOLO v2 used k-means clustering on the training dataset's ground truth boxes to determine optimal anchor box shapes:

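- Ran k-means clustering on the widths and heights of all ground truth bounding boxes, using d = 1 - IoU(box, centroid) as the distance so that large boxes do not dominate the error
- Found that 5 anchor boxes offered the best trade-off between model complexity and recall
- The 5 clustered anchors matched ground truth boxes about as well as the 9 hand-picked anchors in Faster R-CNN (average IoU of 61.0 vs. 60.9)
- Result: Comparable box coverage with fewer anchors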
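
A minimal sketch of this clustering step follows, in plain Python. It assumes boxes are given as `(width, height)` pairs, uses `1 - IoU` as the distance (as in the YOLO v2 paper), and updates each centroid with the mean of its members, which is one common choice rather than the only one. The toy boxes, the function names, and the fixed seed are all illustrative.

```python
import random


def shape_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at a common corner."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs; nearest centroid = highest IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Minimising 1 - IoU is the same as maximising IoU.
            best = max(range(k), key=lambda i: shape_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return sorted(centroids)


# Toy ground-truth shapes (hypothetical data): small, tall, and wide boxes.
boxes = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
         (4.0, 8.0), (4.2, 7.6), (3.8, 8.4),
         (9.0, 3.0), (9.5, 3.2), (8.7, 2.9)]
anchors = kmeans_anchors(boxes, k=3)
avg_iou = sum(max(shape_iou(b, a) for a in anchors) for b in boxes) / len(boxes)
print(anchors, round(avg_iou, 3))
```

Like any k-means run, the result depends on the initial centroids and may settle in a local optimum, so in practice the average IoU is usually compared across several values of `k` and several seeds.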

### 5. Direct Location Prediction

YOLO v2 modified how it predicted box coordinates:
- YOLO v1: Directly predicted box centers relative to the grid cell and width/height relative to the whole image, with no anchor priors
- Faster R-CNN: Predicted offsets relative to anchor boxes
- YOLO v2: Hybrid approach:
  - Predicted x,y coordinates relative to grid cell (using sigmoid to constrain values 0-1)
  - Predicted width/height as multipliers of anchor box dimensions (using exponential function to ensure positive values)

This approach prevented the network from predicting boxes far from the grid cell's position, improving stability and accuracy.
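
The decoding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: coordinates are expressed in grid-cell units, `cell_x` and `cell_y` are the column and row of the predicting cell, and the anchor size used in the example is made up.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Turn raw outputs into (center_x, center_y, width, height) in grid units."""
    bx = cell_x + sigmoid(tx)     # center cannot leave the predicting cell
    by = cell_y + sigmoid(ty)
    bw = anchor_w * math.exp(tw)  # exponential keeps the width positive
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


# All-zero outputs give the anchor itself, centered in cell (6, 6).
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=6, cell_y=6, anchor_w=3.0, anchor_h=2.0)
print(box)  # (6.5, 6.5, 3.0, 2.0)
```

Dividing the result by the grid size (13 here) converts it to coordinates normalised to the image.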

### 6. Fine-Grained Features (Pass-through Layer)

To improve detection of small objects:
- Added a "pass-through" layer that brought features from earlier in the network
- Connected a 26×26×512 feature map from an earlier layer to the 13×13 feature map
- Reshaped the 26×26×512 into 13×13×2048 and concatenated with existing features
- Result: +1% mAP improvement
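
The reshape is a space-to-depth rearrangement: each 2×2 spatial block is stacked into the channel dimension, so no values are discarded. The sketch below works on nested lists indexed as `feature_map[row][col][channel]`; the exact channel ordering is an assumption and may differ from the ordering used in Darknet.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange H x W x C into (H/block) x (W/block) x (C*block*block)."""
    height = len(feature_map)
    width = len(feature_map[0])
    out = []
    for row in range(0, height, block):
        out_row = []
        for col in range(0, width, block):
            stacked = []
            # Stack the channels of every pixel in the block, row by row.
            for dy in range(block):
                for dx in range(block):
                    stacked.extend(feature_map[row + dy][col + dx])
            out_row.append(stacked)
        out.append(out_row)
    return out


# Tiny example: a 4x4x1 map becomes 2x2x4.
small = [[[r * 4 + c] for c in range(4)] for r in range(4)]
packed = space_to_depth(small)
print(len(packed), len(packed[0]), len(packed[0][0]))  # 2 2 4
print(packed[0][0])  # [0, 1, 4, 5]

# Same arithmetic at full size: 26x26x512 -> 13x13x2048.
print(26 // 2, 26 // 2, 512 * 2 * 2)  # 13 13 2048
```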

### 7. Multi-Scale Training

Because YOLO v2 was fully convolutional, it could process different image sizes:
- During training, randomly changed input resolution every 10 batches
- Used resolutions from 320×320 to 608×608 in steps of 32 (the network's downsampling factor)
- Result: Model learned to predict well at different scales
- Provided a speed/accuracy trade-off at inference time
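
A rough sketch of the resolution schedule is shown below. The default size range follows the set {320, 352, ..., 608} listed in the YOLO v2 paper; the function name, the seed, and the idea of precomputing the schedule as a list are illustrative assumptions rather than how Darknet implements it.

```python
import random

STRIDE = 32  # sizes must be multiples of the network's downsampling factor


def resolution_schedule(num_batches, period=10, min_size=320, max_size=608, seed=0):
    """Return the input size used for each batch, re-drawn every `period` batches."""
    scales = list(range(min_size, max_size + 1, STRIDE))
    rng = random.Random(seed)
    size = rng.choice(scales)
    schedule = []
    for batch in range(num_batches):
        if batch > 0 and batch % period == 0:
            size = rng.choice(scales)  # pick a new resolution
        schedule.append(size)
    return schedule


schedule = resolution_schedule(30)
print(schedule[0], schedule[10], schedule[20])
```

At inference time the same weights can be run at a small size for speed or a large size for accuracy, since only the grid dimensions change.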

### New Backbone: Darknet-19

YOLO v2 introduced a new classification backbone:
- 19 convolutional layers (hence the name)
- Fewer parameters than VGG-16 (used in many other detectors)
- 5.58 billion operations per forward pass vs 30.69 billion for VGG-16
- Achieved 72.9% top-1 accuracy on ImageNet (comparable to GoogLeNet)
- Cheaper to run than the original GoogLeNet-inspired YOLO backbone (8.52 billion operations)

## Technical Details of YOLO v2 Architecture

### Network Output Format
YOLO v2's output for a 13×13 grid with 5 anchor boxes and 20 classes:
- 13×13×125 tensor (125 = 5×(5+20))
- For each grid cell and anchor box combination:
  - 5 values: tx, ty, tw, th, confidence score
  - 20 values: class probabilities
- Total output values: 13×13×5×(5+20) = 21,125 (these are predictions, not learned parameters)
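
To make the layout concrete, the sketch below splits the 125 values of one grid cell into five per-anchor predictions. It assumes the values are stored anchor by anchor, with the four box terms first, then confidence, then the class scores; the actual memory layout in a given framework may differ.

```python
GRID, ANCHORS, CLASSES = 13, 5, 20
VALUES_PER_BOX = 5 + CLASSES  # tx, ty, tw, th, confidence, then class scores


def split_cell(cell_vector):
    """Split one cell's output vector into a list of per-anchor predictions."""
    if len(cell_vector) != ANCHORS * VALUES_PER_BOX:
        raise ValueError("expected %d values per cell" % (ANCHORS * VALUES_PER_BOX))
    predictions = []
    for a in range(ANCHORS):
        chunk = cell_vector[a * VALUES_PER_BOX:(a + 1) * VALUES_PER_BOX]
        predictions.append({
            "box": chunk[:4],        # tx, ty, tw, th
            "confidence": chunk[4],
            "classes": chunk[5:],    # one score per class
        })
    return predictions


cell = list(range(125))  # dummy values so the slicing is easy to follow
preds = split_cell(cell)
print(len(preds), preds[1]["box"], preds[1]["confidence"])  # 5 [25, 26, 27, 28] 29
print(GRID * GRID * ANCHORS * VALUES_PER_BOX)  # 21125
```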

### Target Calculation
For each ground truth object:
- Assign it to the grid cell containing its center
- Assign it to the anchor box with highest IoU
- Set target values:
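  - σ(tx), σ(ty): offset of the object center within its grid cell (0 to 1)
  - tw, th: log of the ground truth width/height divided by the anchor box width/height (the inverse of the exponential used at prediction time)
  - confidence: for the responsible box, the IoU between the predicted box and the ground truth (some implementations simply use 1); 0 for boxes with no object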
  - class probabilities: one-hot encoded vector for the correct class
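
The assignment steps above can be sketched as follows. The example assumes ground truth boxes are normalised to the image (values in 0 to 1), anchors are given in grid-cell units, and the width/height targets are log ratios, the inverse of the exponential used when predicting. The three anchors and the helper names are hypothetical.

```python
import math


def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share a center, so only their shapes matter."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def encode_target(cx, cy, w, h, anchors, grid=13):
    """Map one ground truth box to its cell, its anchor, and regression targets."""
    gx, gy = cx * grid, cy * grid    # center in grid units
    col = min(int(gx), grid - 1)     # clamp so cx == 1.0 stays inside the grid
    row = min(int(gy), grid - 1)
    gw, gh = w * grid, h * grid      # size in grid units
    best = max(range(len(anchors)),
               key=lambda i: shape_iou(gw, gh, anchors[i][0], anchors[i][1]))
    anchor_w, anchor_h = anchors[best]
    return {
        "cell": (row, col),
        "anchor": best,
        "xy_offset": (gx - col, gy - row),  # target for sigmoid(tx), sigmoid(ty)
        "tw_th": (math.log(gw / anchor_w), math.log(gh / anchor_h)),
    }


anchors = [(1.0, 1.0), (3.0, 5.0), (8.0, 4.0)]  # hypothetical anchors
target = encode_target(cx=0.5, cy=0.5, w=3 / 13, h=5 / 13, anchors=anchors)
print(target["cell"], target["anchor"], target["xy_offset"])  # (6, 6) 1 (0.5, 0.5)
```

Because the example box has the same shape as the second anchor, its `tw_th` targets come out at (approximately) zero.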

### Loss Function
YOLO v2 used a multi-part squared error loss function:
1. Loss for grid cells with no objects (objectness score)
2. Anchor box alignment loss (applied only during an early warm-up phase; the reference Darknet implementation uses roughly the first 12,800 training images)
3. Bounding box coordinate loss (x,y,w,h)
4. Object confidence score loss
5. Class probability loss

### Performance Comparison
On Pascal VOC dataset:
- YOLO v2 (544×544): 78.6% mAP at 40 FPS
- SSD500: 76.8% mAP at 19 FPS
- Faster R-CNN: 73.2% mAP at 7 FPS

On COCO dataset:
- YOLO v2: 44.0% mAP at IoU 0.5 (21.6% averaged over IoU 0.5 to 0.95), slightly behind SSD512 and the strongest Faster R-CNN variants
- Still considerably faster than those methods

## Key Improvements Summary

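Step-by-step mAP on Pascal VOC 2007, following the ablation reported in the YOLO v2 paper:

1. **Batch Normalization**: +2.4% mAP (63.4 to 65.8)
2. **High-resolution classifier**: +3.7% mAP (65.8 to 69.5)
3. **Convolutional with anchor boxes**: -0.3% mAP (69.5 to 69.2), but recall improved from 81% to 88%
4. **Darknet-19 backbone**: +0.4% mAP (69.2 to 69.6)
5. **Dimension clusters and direct location prediction** (reported together): +4.8% mAP (69.6 to 74.4)
6. **Fine-grained features (pass-through)**: +1.0% mAP (74.4 to 75.4)
7. **Multi-scale training**: +1.4% mAP (75.4 to 76.8)
8. **High-resolution detector (544×544 input)**: +1.8% mAP (76.8 to 78.6)

The cumulative effect was a 15.2-point mAP improvement over YOLO v1 (63.4% to 78.6%) while maintaining high frame rates.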

In summary, YOLO v2 represents a significant advancement in real-time object detection, offering a better balance between accuracy and speed compared to its predecessor and competitors. Its innovations in anchor box determination, feature extraction, and training methodology have influenced many subsequent object detection models.