# YOLO-9000: An Object Detector for 9000 Classes



## The Challenge: Detecting 9000 Object Classes

YOLO-9000 aimed to detect objects from more than 9000 different classes, far more than the detection datasets available at the time covered:
- Pascal VOC: Only 20 object classes
- COCO: Only 80 object classes

The primary challenge was obtaining labeled data. While object classification datasets like ImageNet contained millions of images with class labels, object detection requires additional bounding box annotations showing where each object is located in the image. These annotations are extremely time-consuming to create manually, making it impractical to build a dataset with bounding boxes for 9000 classes.

## The Core Innovation: Joint Training on Classification and Detection Datasets

YOLO-9000 solved this problem by combining:
1. **COCO dataset**: 80 classes with both classification labels and bounding box annotations
2. **ImageNet dataset**: Thousands of classes with only classification labels (no bounding boxes)

This required several key innovations:

### 1. The WordTree Hierarchy

The researchers created a hierarchical tree structure called "WordTree" to organize object classes, where:
- More general categories (like "dog") became parent nodes
- More specific categories (like "German Shepherd" or "Siberian Husky") became child nodes
- COCO's 80 classes (with bounding boxes) were positioned as intermediate nodes
- ImageNet's thousands of fine-grained classes became leaf nodes

This hierarchy established relationships between classes that had bounding box annotations (COCO) and those that didn't (ImageNet).
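As a rough illustration of the structure described above, the sketch below stores a tiny, made-up slice of such a hierarchy as a child-to-parent map. The node names, the `PARENT` table, and the helpers `path_to_root` and `sibling_groups` are hypothetical choices for this example; they are not the authors' code or the real WordTree contents.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent); not the real WordTree.
ROOT = "physical object"
PARENT = {
    "animal": ROOT,
    "dog": "animal",
    "cat": "animal",
    "German Shepherd": "dog",
    "Siberian Husky": "dog",
}

def path_to_root(node):
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while path[-1] != ROOT:
        path.append(PARENT[path[-1]])
    return path

def sibling_groups():
    """Group nodes by parent; siblings are mutually exclusive choices."""
    groups = defaultdict(list)
    for child, parent in PARENT.items():
        groups[parent].append(child)
    return dict(groups)

print(path_to_root("Siberian Husky"))
print(sibling_groups()["dog"])
```

In this toy tree, a fine-grained label such as "Siberian Husky" shares its path with the coarser label "dog", which is the kind of link that lets box-annotated and classification-only labels inform each other.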

### 2. Modified Classification Approach

Traditional image classification uses a single softmax layer across all classes, assuming classes are mutually exclusive. YOLO-9000 modified this approach:

- **Multiple softmax groups**: Instead of a single softmax across all classes, they applied softmax separately to each group of related classes in the hierarchy.
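- **Conditional probabilities**: Each node predicts a probability conditioned on its parent, for example P(German Shepherd | Dog). The absolute probability of a class is the product of the conditionals along its path to the root, e.g. P(German Shepherd) = P(German Shepherd | Dog) × P(Dog | Animal) × P(Animal).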
- **Tree traversal for prediction**: Starting from the root, they followed the path with highest probability scores, stopping when confidence dropped below 0.5.
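The first two points can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `GROUPS` table, the raw scores in `logits`, and the helper names are invented for the example, and the probability of the root is simply taken to be 1.

```python
import math

# Hypothetical sibling groups: parent -> children sharing one softmax.
GROUPS = {
    "physical object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["German Shepherd", "Siberian Husky"],
}
PARENT = {child: parent for parent, kids in GROUPS.items() for child in kids}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_probs(logits):
    """Apply a separate softmax inside each sibling group.

    Returns P(node | parent) for every non-root node.
    """
    cond = {}
    for kids in GROUPS.values():
        probs = softmax([logits[k] for k in kids])
        cond.update(zip(kids, probs))
    return cond

def absolute_prob(node, cond):
    """Multiply conditionals along the path to the root (P(root) assumed 1)."""
    prob = 1.0
    while node in PARENT:
        prob *= cond[node]
        node = PARENT[node]
    return prob

# Made-up raw scores, for illustration only.
logits = {"animal": 2.0, "vehicle": 0.0, "dog": 1.5, "cat": 0.5,
          "German Shepherd": 0.2, "Siberian Husky": 0.1}
cond = conditional_probs(logits)
print(absolute_prob("German Shepherd", cond))
```

Because each group is normalized separately, the absolute probabilities of a node's children always add up to the absolute probability of the node itself.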

This approach had a key benefit: "performance degrades gracefully on new or unknown objects." If the model wasn't confident about a specific subcategory (e.g., "Sighthound"), it could stop at the parent category ("Hunting Dog"), whose confidence is always at least as high because a child's probability is its parent's probability multiplied by a conditional of at most 1.
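The fallback behaviour can be illustrated with a greedy descent over a toy tree. The sketch assumes the threshold is applied to the running product of conditional probabilities, which is one plausible reading of the rule above. The `CHILDREN` table, the probability values, and the `predict` helper are all hypothetical.

```python
# Hypothetical tree and conditional probabilities P(child | parent).
CHILDREN = {
    "physical object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["German Shepherd", "Siberian Husky"],
}

def predict(cond, root="physical object", threshold=0.5):
    """Greedy descent: follow the most likely child while confidence holds.

    Returns the deepest node on that path whose absolute probability is
    still >= threshold, together with that probability.
    """
    node, prob = root, 1.0
    while node in CHILDREN:
        best = max(CHILDREN[node], key=lambda child: cond[child])
        next_prob = prob * cond[best]
        if next_prob < threshold:
            break  # too uncertain: keep the more general label
        node, prob = best, next_prob
    return node, prob

# Confident that it is a dog, but undecided between the two breeds.
unsure = {"animal": 0.95, "vehicle": 0.05, "dog": 0.9, "cat": 0.1,
          "German Shepherd": 0.55, "Siberian Husky": 0.45}
# Same scene, but with a clear breed prediction.
sure = {**unsure, "German Shepherd": 0.9, "Siberian Husky": 0.1}

print(predict(unsure)[0])  # stops at "dog"
print(predict(sure)[0])    # descends to "German Shepherd"
```

With the made-up numbers above, the uncertain case stops at "dog" because going one level deeper would push the product below 0.5, while the confident case reaches the breed.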

### 3. Joint Training Methodology

YOLO-9000 trained on both datasets simultaneously:
- For COCO images (with bounding boxes): Backpropagated both classification and detection losses
- For ImageNet images (no bounding boxes): Backpropagated only classification loss

This clever approach allowed the model to learn to predict bounding boxes even for classes that never had explicit box annotations during training, leveraging the hierarchical relationships between classes.
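The per-image loss selection can be sketched as follows. This is a deliberately simplified stand-in, not the actual YOLO v2 loss: `cross_entropy`, `box_loss`, `sample_loss`, and the dictionary layout of predictions and labels are all assumptions made for illustration.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under `probs`."""
    return -math.log(probs[target])

def box_loss(pred_box, true_box):
    """Stand-in localization loss: squared error over (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def sample_loss(prediction, label):
    """Loss for one image, depending on which dataset it came from.

    `label["box"]` is None for classification-only (ImageNet-style) images.
    """
    loss = cross_entropy(prediction["class_probs"], label["class"])
    if label["box"] is not None:  # detection image: add the box term
        loss += box_loss(prediction["box"], label["box"])
    return loss

# Hypothetical prediction and labels for a single image.
prediction = {"class_probs": {"dog": 0.7, "cat": 0.3},
              "box": (0.5, 0.5, 0.2, 0.3)}
coco_label = {"class": "dog", "box": (0.4, 0.5, 0.2, 0.3)}
imagenet_label = {"class": "dog", "box": None}

print(sample_loss(prediction, coco_label))      # classification + box
print(sample_loss(prediction, imagenet_label))  # classification only
```

A mixed batch would just sum or average `sample_loss` over its images, so classification-only images never push gradients into the box-regression terms.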

## Technical Implementation Details

- Based on YOLO v2 architecture with some modifications
- Used Darknet-19 as the backbone network
- Reduced anchor boxes from 5 to 3 per grid cell to accommodate more class predictions
- Created WordTree-1K with 1369 classes for initial training
- Later expanded to WordTree with 9418 labels (combining COCO and top-9000 ImageNet classes)
- Darknet-19 achieved 71.9% top-1 accuracy and 90.4% top-5 accuracy on the 1369-class dataset
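A quick back-of-the-envelope calculation shows why cutting the number of anchor boxes matters at this label count. The sketch assumes the YOLO v2 output layout, in which every anchor box predicts 4 box coordinates, 1 objectness score, and its own set of class scores. The helper `channels_per_cell` is a hypothetical name.

```python
def channels_per_cell(num_anchors, num_classes, box_terms=5):
    """Outputs per grid cell: each anchor predicts box_terms + class scores."""
    return num_anchors * (box_terms + num_classes)

NUM_LABELS = 9418  # size of the combined WordTree, as listed above

for anchors in (5, 3):
    print(anchors, channels_per_cell(anchors, NUM_LABELS))
# 5 47115
# 3 28269
```

Under that assumption, going from 5 to 3 anchors removes 40% of the output values at every grid cell.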

## Results and Performance

- **Overall performance**: 19.7 mAP (mean Average Precision) on the ImageNet detection validation set, which covers 200 classes rather than all 9000
- **Novel objects**: 16.0 mAP on the 156 of those classes that do not appear in COCO (objects that never had bounding box training data)
- **Best performing categories**: Animals, because objectness predictions generalize well from the animal classes already in COCO
- **Worst performing categories**: Clothing and equipment, such as sunglasses (COCO has bounding boxes for "person" but none for any type of clothing)

## Examples of Detection Results

The materials show several examples demonstrating the system's capabilities:
1. **Multiple dogs**: Instead of just detecting "dog" (COCO label), it correctly identified "German Shepherd" and "Siberian Husky" (ImageNet labels) with appropriate bounding boxes
2. **Dolphin and person**: Instead of just "person," it identified "skin diver" along with a specific dolphin type
3. **Mushroom**: Shows limitations with objects that have no corresponding class in COCO; bounding box prediction was poor

## Significance

YOLO-9000 represented a breakthrough in object detection by:
1. Dramatically expanding the number of detectable object classes without requiring exhaustive bounding box annotations
2. Creating a novel hierarchical approach to classification that accommodates overlapping categories
3. Developing a joint training methodology that could leverage both fully-annotated detection datasets and classification-only datasets
4. Demonstrating that transfer of bounding box prediction is possible across related classes

This approach paved the way for more scalable object detection systems that could recognize a much wider range of objects than was previously possible.