# YOLOv9 Video Object Detection

This notebook demonstrates object detection on video files using an MIT-licensed implementation of YOLOv9.

**Key Features:**
- Uses MIT-licensed YOLO implementation
- Processes video frame by frame
- Saves output video with detections

> _NOTE_: here we use an MIT-licensed implementation of YOLOv9. There are many versions of YOLO around, and several of them require an enterprise license for any commercial use. At the time of writing, this is the best openly licensed version we are aware of, but some non-open versions could work better for your task. It's up to you to decide what fits your use case best. The usage is always similar, so you will be able to apply what you learn here whatever you choose.

## Part 1: Detecting objects in a video frame by frame

In [None]:
# Define video paths
input_video = "cars_on_bridge.mp4"
output_video = "output_detected.m4v"

# Only needed on the Udacity workspace. Comment this out if running on another system.
import os
os.environ['HF_HOME'] = '/voc/data/huggingface'
os.environ['OLLAMA_MODELS'] = '/voc/data/ollama/cache'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['PATH'] = f"/voc/data/ollama/bin:/voc/data/ffmpeg/bin:{os.environ.get('PATH', '')}"
os.environ['LD_LIBRARY_PATH'] = f"/voc/data/ollama/lib:/voc/data/ffmpeg/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

Let's start with a small utility function that returns the frame rate (FPS) of the video and the size of each frame in pixels:

In [2]:
import cv2 

def get_fps_and_video_size(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Get frame size
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    cap.release()
    
    return fps, (frame_width, frame_height)

fps, frame_size = get_fps_and_video_size(input_video)

## YOLOv9 Inference on Video

Because this is custom code, we need to do a bit of work to get it running. First, let's define a helper function to load the model:

In [3]:
from hydra import compose, initialize_config_module
from omegaconf import DictConfig
import torch

from yolo.tools.solver import InferenceModel


def get_model_instance(input_video: str) -> tuple[InferenceModel, DictConfig]:

    # Select device (use GPU if available)
    device = (
        "cuda"
        if torch.cuda.is_available()
        else "mps" if torch.backends.mps.is_available() else "cpu"
    )
    print(f"Using device: {device}")

    # This is necessary to avoid issues with tensors on different devices
    # for this particular version of YOLO
    torch.set_default_device(device)

    # We load the default YOLO configuration, then we override some of its parameters
    # (this is the idiomatic way of doing things with Hydra, a configuration management tool)
    with initialize_config_module(config_module="yolo.config", version_base=None):
        cfg = compose(
            config_name="config",
            # These are the parameters we want to override
            overrides=[
                "task.task=inference",
                # v9-s is the smallest model
                "model=v9-s",
                # We point to our video file
                f"task.data.source={input_video}",
                # We do not want to track on Weights and Biases
                "use_wandb=false",
                # We set our device
                f"device={device}",
            ],
        )
    # This is how a model is loaded and set up
    # with this implementation of YOLOv9
    model = InferenceModel(cfg).to(device)
    model.eval()
    # This custom step is needed to set up the
    # post-processing stage of the model (which includes
    # Non-Maximum Suppression)
    model.setup(cfg.task.task)

    return model, cfg


model, cfg = get_model_instance(input_video)

Using device: mps
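The `overrides` list passed to `compose` above is a set of dotted `key=value` strings applied on top of the default configuration. The sketch below illustrates that idea with plain dictionaries. It is not Hydra's implementation: the `apply_overrides` helper and the default values are hypothetical, values are kept as strings, and in real Hydra an entry such as `model=v9-s` may select a whole configuration file rather than set a single value.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate levels on demand
            node = node.setdefault(name, {})
        # Values stay strings in this sketch (Hydra also parses types)
        node[leaf] = value
    return result


# Hypothetical defaults, for illustration only
defaults = {"task": {"task": "train"}, "model": "v9-c", "use_wandb": "true"}
overridden = apply_overrides(
    defaults, ["task.task=inference", "model=v9-s", "use_wandb=false"]
)
print(overridden)
```

The defaults are left untouched, which mirrors the pattern used in the cell above: start from a shared configuration and change only what this run needs.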


In [4]:
from typing import Callable
import numpy as np
from tqdm import tqdm
from yolo.tools.data_augmentation import PadAndResize
from torchvision.transforms.functional import to_tensor
from PIL import Image
import cv2
from torch.amp import autocast


def preprocess_frame(
    frame: np.ndarray,
    pad_and_resize: PadAndResize,
    device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor, Image.Image]:
    # We need to pad and resize every frame to match the expected
    # input resolution of the model

    frame = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    untransformed_frame = frame.copy()

    # PadAndResize can also operate on the ground truth boxes,
    # which we don't have here (because this is inference on unknown data)
    # So we use a dummy tensor
    fake_boxes = torch.zeros((1, 6))
    transformed_frame, _, transform_info = pad_and_resize(frame, fake_boxes)
    transformed_frame = to_tensor(transformed_frame)
    batch_of_one = transformed_frame[None]
    rev_tensor = transform_info[None]

    batch_of_one = batch_of_one.to(device)
    rev_tensor = rev_tensor.to(device)

    return batch_of_one, rev_tensor, untransformed_frame


def run_inference_on_one_frame(
    model: InferenceModel, frame: np.ndarray, pad_and_resize: Callable
) -> tuple[Image.Image, torch.Tensor]:
    
    # Pre-process the frame and get:
    # the batch of one (the pre-processed frame ready to be fed to the model)
    # the rev_tensor (the information needed to reverse the transformations)
    # the untransformed_frame (the original frame, needed for visualization)
    batch_of_one, rev_tensor, untransformed_frame = preprocess_frame(
        frame, pad_and_resize, device=model.device
    )

    # Run YOLO. This will return the raw outputs of the model
    outputs = model(batch_of_one)

    # Re-format outputs and apply Non-Maximum Suppression to remove
    # duplicate detections
    predicts = model.post_process(outputs, rev_tensor=rev_tensor)

    # We expect only one element in the batch (one frame)
    assert len(predicts) == 1

    return untransformed_frame, predicts[0].detach().cpu()

def run_inference_on_video(
    model: InferenceModel, input_video: str
) -> list:

    # We use opencv to loop through the frames of the video
    cap = cv2.VideoCapture(input_video)
    # Get the total number of frames in the video
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # We need to pad and resize every frame to match the expected
    # input resolution of the model (`cfg` is the global configuration
    # returned by get_model_instance above)
    pad_and_resize = PadAndResize(cfg.image_size)

    results = []

    with torch.no_grad():

        # NOTE: this is absolutely necessary for good results with this
        # version of YOLO. Failing to do this will result in very poor
        # performance, because of the way the model has been trained.
        with autocast(model.device.type):

            for _ in tqdm(range(n_frames), total=n_frames):

                # Read frame from the video
                ret, frame = cap.read()

                if not ret:
                    # Video is finished
                    break

                untransformed_frame, predicts = run_inference_on_one_frame(
                    model, frame, pad_and_resize
                )

                # Append results for this frame
                results.append([untransformed_frame, predicts])

    cap.release()

    return results


results = run_inference_on_video(model, input_video)

100%|██████████| 56/56 [00:06<00:00,  8.00it/s]
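In `preprocess_frame`, each frame is padded and resized to the model's input resolution, and `rev_tensor` carries what is needed to map the predicted boxes back to the original frame. The sketch below shows the usual "letterbox" arithmetic behind this kind of transform. It is only an illustration: the helper names, the centered padding, and the 1280x720 and 640x640 sizes are assumptions, and the actual `PadAndResize` implementation may differ in its details.

```python
def letterbox_params(src_wh, dst_wh):
    """Scale and padding that fit a source frame into the target size, keeping aspect ratio."""
    src_w, src_h = src_wh
    dst_w, dst_h = dst_wh
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = int(src_w * scale), int(src_h * scale)
    # Assumption: the resized frame is centered, padding is split evenly
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return scale, pad_left, pad_top


def to_original_coords(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from model-input pixels back to the original frame."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_left) / scale,
        (y1 - pad_top) / scale,
        (x2 - pad_left) / scale,
        (y2 - pad_top) / scale,
    )


# Assumed sizes for illustration: a 1280x720 frame and a 640x640 model input
scale, pad_left, pad_top = letterbox_params((1280, 720), (640, 640))
print(scale, pad_left, pad_top)  # 0.5 0 140

restored = to_original_coords((320, 320, 480, 400), scale, pad_left, pad_top)
print(restored)  # (640.0, 360.0, 960.0, 520.0)
```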



## Display Results

Now we write a small utility function that transforms the results into images with the boxes overlaid:

In [5]:
from yolo.tools.drawer import draw_bboxes


def visualize(results, class_list):
    return [
        draw_bboxes(origin_frame, predicts, idx2label=class_list)
        for origin_frame, predicts in results
    ]

frames = visualize(results, cfg.dataset.class_list)

Finally, let's turn the frames into a video so we can see the results:

In [None]:
import numpy as np
from IPython.display import Video
import PIL.Image


def frames_to_video(frames, output_name='output.mp4', fps=5):

    # Read first frame to get dimensions
    first_frame = frames[0] # type: PIL.Image.Image
    width, height = first_frame.size
    
    # Create video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    video = cv2.VideoWriter(output_name, fourcc, fps, (width, height))
    
    # Add frames to video
    for frame in frames:
        # Convert RGB -> BGR
        frame = cv2.cvtColor(np.array(frame), cv2.COLOR_RGB2BGR)
        video.write(frame)

    video.release()
    print(f"Video saved as {output_name}")
    return output_name

video_file = frames_to_video(frames, fps=fps, output_name="output.mp4")

display(
        Video("output.mp4", embed=False)
)

Video saved as output.mp4


We can see that things are working, but the cars are no longer detected once they are far enough along the bridge. This is a typical problem with YOLO-based detectors, and we will see how to address it in Part 2.

## Part 2: Object tracking in videos

In this second part we are going to move from merely detecting objects independently frame by frame, to tracking each object through the frames. We are going to use the `supervision` library, so let's start by importing it:

In [7]:
import supervision as sv

### ByteTrack: track objects across frames

ByteTrack is a multi-object tracking algorithm that associates object detections across video frames to create consistent tracks (trajectories) for each object.

It works in two steps:

1. High-confidence association: Match high-confidence detections with existing tracks using similarity measures (IoU overlap + motion prediction based on previous frames)
2. Low-confidence recovery: Match remaining unmatched tracks with low-confidence detections that were initially ignored

In practice, ByteTrack starts new tracks for unmatched high-confidence detections, keeps unmatched tracks alive for a short time, and deletes tracks that stay unmatched for too long. This allows it to recover from brief occlusions and to handle newly appearing objects.
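The two association steps can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the `supervision` implementation: it uses greedy IoU matching, has no motion prediction, and drops unmatched tracks immediately instead of keeping them alive for a few frames. The function names and thresholds are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair track ids with detection indices by decreasing IoU."""
    pairs = sorted(
        (
            (iou(box, det[0]), tid, j)
            for tid, box in tracks.items()
            for j, det in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break
        if tid in used_tracks or j in used_dets:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches


def update_tracks(tracks, detections, next_id, high=0.5, min_iou=0.3):
    """One tracking step. `tracks` maps track id -> last box; detections are (box, score)."""
    high_dets = [d for d in detections if d[1] >= high]
    low_dets = [d for d in detections if d[1] < high]

    # Step 1: associate high-confidence detections with existing tracks
    first = greedy_match(tracks, high_dets, min_iou)
    updated = {tid: high_dets[j][0] for tid, j in first.items()}

    # Step 2: try to recover the remaining tracks with low-confidence detections
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low_dets, min_iou)
    updated.update({tid: low_dets[j][0] for tid, j in second.items()})

    # Unmatched high-confidence detections start new tracks
    matched_high = set(first.values())
    for j, (box, _) in enumerate(high_dets):
        if j not in matched_high:
            updated[next_id] = box
            next_id += 1
    return updated, next_id


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [
    ((1, 0, 11, 10), 0.9),        # confident, overlaps track 1
    ((51, 50, 61, 60), 0.2),      # weak, overlaps track 2
    ((100, 100, 110, 110), 0.8),  # confident, overlaps nothing
]
tracks, next_id = update_tracks(tracks, detections, next_id=3)
print(tracks, next_id)
```

In this toy frame, track 2 survives only because of the second step: its detection has a score of 0.2, which a tracker that discards low-confidence detections would never see.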

Great, now let's define a class that encapsulates what we need to do to track objects in videos:

In [8]:
class YOLOVideoTrackerBasic:
    """
    A class that encapsulates the logic for tracking objects in a video using YOLO and ByteTrack.
    """

    def __init__(self, video_file: str):

        self.model, self.cfg = get_model_instance(video_file)

        # Get FPS
        fps, image_size = get_fps_and_video_size(video_file)

        # This is the algorithm that does the tracking.
        # We're using the default parameters here, but
        # you can tune them if you want to.
        self.byte_tracker = sv.ByteTrack(frame_rate=fps)

        self.bounding_box_annotator = sv.BoxAnnotator()
        self.label_annotator = sv.LabelAnnotator()

        # We need to pad and resize every frame to match the expected
        # input resolution of the model
        self.pad_and_resize = PadAndResize(self.cfg.image_size)

    @staticmethod
    def yolo_to_sv_detections(yolo_outputs: torch.Tensor):
        """
        Re-organize information in the format expected by the supervision tracker
        """

        yolo_outputs = yolo_outputs.cpu().numpy()

        detections = sv.Detections(
            # yolo_outputs is a tensor of shape (n_detections, 6)
            # where each detection is (class_id, x1, y1, x2, y2, score)
            xyxy=yolo_outputs[:, 1:5],  # box coordinates
            confidence=yolo_outputs[:, 5],  # confidence score
            class_id=yolo_outputs[:, 0].astype(int),  # class id as integer
        )

        return detections

    def _yolo_inference(self, image_slice: np.ndarray) -> sv.Detections:
        """
        Runs inference on one frame and returns results in the format 
        expected by the supervision tracker
        """
        _, predicts = run_inference_on_one_frame(
            self.model, image_slice, self.pad_and_resize
        )
        return self.yolo_to_sv_detections(predicts)

    def run_on_one_frame(self, frame: np.ndarray, index: int) -> np.ndarray:

        detections = self._yolo_inference(frame)
        # We update the tracker with the new detections
        detections = self.byte_tracker.update_with_detections(detections)

        labels = [
            f"{self.cfg.dataset.class_list[int(class_id)]} {tracker_id} {confidence:0.2f}"
            for _, class_id, confidence, tracker_id in zip(
                detections.xyxy,
                detections.class_id,
                detections.confidence,
                detections.tracker_id,
            )
        ]

        annotated_frame = self.bounding_box_annotator.annotate(
            scene=frame.copy(), detections=detections
        )

        annotated_frame = self.label_annotator.annotate(
            scene=annotated_frame, detections=detections, labels=labels
        )

        return annotated_frame

In [16]:
input_video = "cars_on_bridge.mp4"
output_video = "output_detected.m4v"

processor = YOLOVideoTrackerBasic(video_file=input_video)


# Supervision makes it very easy to apply a processing function
# to every frame of a video and save the result to a new video file
sv.process_video(
    source_path=input_video,
    target_path=output_video,
    callback=processor.run_on_one_frame,
    show_progress=True,
)

# Convert from m4v to mp4 so we can display it here
!ffmpeg -i {output_video} -c:v libx264 -tag:v avc1 cars_on_bridge_detected_orig.mp4 -y > /dev/null 2>&1


Using device: mps


Processing video:   0%|          | 0/56 [00:00<?, ?it/s]

In [17]:

display(
        Video("cars_on_bridge_detected_orig.mp4", embed=False)
)

We can see that objects keep their "identity" throughout the video. This is the core concept of object tracking! However, the car detections still disappear towards the end. Let's fix that!

### Fixing the small objects problem

Detecting small objects is a known challenge for YOLO algorithms. Here, a small object is one whose typical size is much smaller than the dimensions of the image.

We can see this happening in the video above: as the cars become smaller, the detections become unstable and then cars are not detected anymore.
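A bit of arithmetic shows why distant cars are hard to detect. The sketch below computes how much of the frame an object covers and how many pixels it still occupies once the frame is resized for the model. All numbers are assumptions chosen for illustration (a 1280x720 frame, a 640x640 model input, and two made-up box sizes), and the `size_after_resize` helper is hypothetical.

```python
def size_after_resize(box_wh, frame_wh, model_input_wh):
    """Return (relative_area, resized_wh) for an object box in a frame."""
    frame_w, frame_h = frame_wh
    box_w, box_h = box_wh
    relative_area = (box_w * box_h) / (frame_w * frame_h)
    # Same aspect-preserving scale factor a pad-and-resize step would use
    scale = min(model_input_wh[0] / frame_w, model_input_wh[1] / frame_h)
    return relative_area, (box_w * scale, box_h * scale)


# Assumed sizes: a nearby car and a distant car in a 1280x720 frame
near = size_after_resize((256, 144), (1280, 720), (640, 640))
far = size_after_resize((32, 18), (1280, 720), (640, 640))
print(near)
print(far)
```

With these numbers the distant car is only 16 by 9 pixels at the model's input resolution, which leaves the detector very little to work with.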

There is a basic trick that works quite well in practice: we divide the image into overlapping sub-images, run YOLO independently on each, then remove redundant detections. In `supervision`, this is done with the `sv.InferenceSlicer` class. Let's add it to our detector class:

In [10]:
class YOLOVideoTrackerWithSlicing:
    """
    A class that encapsulates the logic for tracking objects in a video using YOLO and ByteTrack,
    optionally running YOLO on overlapping slices of each frame.
    """

    def __init__(
        self, video_file: str, with_slicing: bool = True
    ):
        
        self.model, self.cfg = get_model_instance(video_file)

        # Get FPS
        fps, image_size = get_fps_and_video_size(video_file)

        # This is the algorithm that does the tracking.
        # We're using the default parameters here, but
        # you can tune them if you want to.
        self.byte_tracker = sv.ByteTrack(frame_rate=fps)

        self.bounding_box_annotator = sv.BoxAnnotator()
        self.label_annotator = sv.LabelAnnotator()

        # We need to pad and resize every frame to match the expected
        # input resolution of the model
        self.pad_and_resize = PadAndResize(self.cfg.image_size)

        if with_slicing:
            self.slicer = sv.InferenceSlicer(
                # We slice the image with overlapping slices of
                # size half the image size, with an overlap of one sixth
                # of the image size
                slice_wh=(image_size[0] // 2, image_size[1] // 2),
                overlap_wh=((image_size[0] // 6), (image_size[1] // 6)),
                # This is the function that will be called on each slice
                callback=self._yolo_inference,
                overlap_ratio_wh=None,  # this is just to avoid a warning
            )
        else:
            # No slicing, just run YOLO on the whole image
            self.slicer = self._yolo_inference

    @staticmethod
    def yolo_to_sv_detections(yolo_outputs: torch.Tensor):
        """
        Re-organize information in the format expected by the supervision tracker
        """

        yolo_outputs = yolo_outputs.cpu().numpy()

        detections = sv.Detections(
            # yolo_outputs is a tensor of shape (n_detections, 6)
            # where each detection is (class_id, x1, y1, x2, y2, score)
            xyxy=yolo_outputs[:, 1:5],  # box coordinates
            confidence=yolo_outputs[:, 5],  # confidence score
            class_id=yolo_outputs[:, 0].astype(int),  # class id as integer
        )

        return detections

    def _yolo_inference(self, image_slice: np.ndarray) -> sv.Detections:
        """
        Runs inference on one frame and returns results in the format
        expected by the supervision tracker
        """
        _, predicts = run_inference_on_one_frame(
            self.model, image_slice, self.pad_and_resize
        )
        return self.yolo_to_sv_detections(predicts)

    def run_on_one_frame(self, frame: np.ndarray, index: int) -> np.ndarray:

        detections = self.slicer(frame)
        # We update the tracker with the new detections
        detections = self.byte_tracker.update_with_detections(detections)

        labels = [
            f"{self.cfg.dataset.class_list[int(class_id)]} {tracker_id} {confidence:0.2f}"
            for _, class_id, confidence, tracker_id in zip(
                detections.xyxy,
                detections.class_id,
                detections.confidence,
                detections.tracker_id,
            )
        ]

        annotated_frame = self.bounding_box_annotator.annotate(
            scene=frame.copy(), detections=detections
        )

        annotated_frame = self.label_annotator.annotate(
            scene=annotated_frame, detections=detections, labels=labels
        )

        return annotated_frame

In [18]:
input_video = "cars_on_bridge.mp4"
output_video = "output_detected.m4v"

processor = YOLOVideoTrackerWithSlicing(video_file=input_video, with_slicing=True)

sv.process_video(
    source_path=input_video,
    target_path=output_video,
    callback=processor.run_on_one_frame,
    show_progress=True,
)

# Convert from m4v to mp4 so we can display it here
!ffmpeg -i {output_video} -c:v libx264 -tag:v avc1 cars_on_bridge_detected_2.mp4 -y > /dev/null 2>&1

display(
        Video("cars_on_bridge_detected_2.mp4", embed=False)
)

Using device: mps


Processing video:   0%|          | 0/56 [00:00<?, ?it/s]
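To make the slicing step used above more concrete, here is a minimal sketch of its three ingredients: generating overlapping slices, shifting slice-local boxes back into full-frame coordinates, and removing the duplicates that appear where slices overlap. It is an illustration under simplifying assumptions, not the `sv.InferenceSlicer` implementation: the helper names are hypothetical, the duplicate removal ignores class ids, and the 1200x600 frame is made up so that half-size slices with a one-sixth overlap give round numbers.

```python
def slice_offsets(image_wh, slice_wh, overlap_wh):
    """Top-left corners of overlapping slices that cover the whole image."""

    def starts(total, size, overlap):
        step = size - overlap  # assumes overlap < size
        positions = list(range(0, max(total - size, 0) + 1, step))
        # Make sure the last slice reaches the image border
        if positions[-1] + size < total:
            positions.append(total - size)
        return positions

    xs = starts(image_wh[0], slice_wh[0], overlap_wh[0])
    ys = starts(image_wh[1], slice_wh[1], overlap_wh[1])
    return [(x, y) for y in ys for x in xs]


def shift_box(box, offset):
    """Move a slice-local (x1, y1, x2, y2) box into full-frame coordinates."""
    dx, dy = offset
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the highest-scoring box among heavily overlapping ones."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, other) < iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


# Half-size slices with a one-sixth overlap on an assumed 1200x600 frame
offsets = slice_offsets((1200, 600), (600, 300), (200, 100))
print(len(offsets), offsets[:3])

# The same car seen in two overlapping slices, in slice-local coordinates
per_slice = [
    ((0, 0), ((450, 100, 550, 160), 0.8)),
    ((400, 0), ((50, 100, 150, 160), 0.9)),
]
full_frame = [(shift_box(box, offset), score) for offset, (box, score) in per_slice]
merged = nms(full_frame)
print(merged)
```

Both slice-local boxes land on the same full-frame box after shifting, so only the higher-scoring one survives the merge.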

### Object counting

Now that our detections are of better quality, we can see one more common application of these technologies: object counting. In a typical scenario, we want to count how many objects enter or exit a certain area. In the case of our bridge, this would allow us for example to count how many vehicles pass on the bridge in a given unit of time, for traffic planning purposes.

With `supervision` this is easy to do: we need to define a line on the video, then the library can count how many vehicles cross that line in either direction ("in" or "out"). Let's extend our class for this use case:

In [12]:
class YOLOVideoObjectCounter:
    """
    A class that encapsulates the logic for counting objects in a video using YOLO and ByteTrack.
    """

    def __init__(
        self,
        video_file: str,
        with_slicing: bool = True,
        line_zone: sv.LineZone = None,
    ):
        
        self.model, self.cfg = get_model_instance(video_file)

        # Get FPS
        fps, image_size = get_fps_and_video_size(video_file)

        # This is the algorithm that does the tracking.
        # We're using the default parameters here, but
        # you can tune them if you want to.
        self.byte_tracker = sv.ByteTrack(frame_rate=fps)
        self.line_zone = line_zone

        # These are utilities to draw on the video for visualization
        # purposes
        self.line_zone_annotator = sv.LineZoneAnnotator(
            thickness=2, text_thickness=2, text_scale=1
        )
        self.bounding_box_annotator = sv.BoxAnnotator()
        self.label_annotator = sv.LabelAnnotator()

        # We need to pad and resize every frame to match the expected
        # input resolution of the model
        self.pad_and_resize = PadAndResize(self.cfg.image_size)

        if with_slicing:
            self.slicer = sv.InferenceSlicer(
                # We slice the image with overlapping slices of
                # size half the image size, with an overlap of one sixth
                # of the image size
                slice_wh=(image_size[0] // 2, image_size[1] // 2),
                overlap_wh=((image_size[0] // 6), (image_size[1] // 6)),
                # This is the function that will be called on each slice
                callback=self._yolo_inference,
                overlap_ratio_wh=None,  # this is just to avoid a warning
            )
        else:
            # No slicing, just run YOLO on the whole image
            self.slicer = self._yolo_inference

    @staticmethod
    def yolo_to_sv_detections(yolo_outputs: torch.Tensor):
        """
        Re-organize information in the format expected by the supervision tracker
        """

        yolo_outputs = yolo_outputs.cpu().numpy()

        detections = sv.Detections(
            # yolo_outputs is a tensor of shape (n_detections, 6)
            # where each detection is (class_id, x1, y1, x2, y2, score)
            xyxy=yolo_outputs[:, 1:5],  # box coordinates
            confidence=yolo_outputs[:, 5],  # confidence score
            class_id=yolo_outputs[:, 0].astype(int),  # class id as integer
        )

        return detections

    def _yolo_inference(self, image_slice: np.ndarray) -> sv.Detections:
        """
        Runs inference on one frame and returns results in the format
        expected by the supervision tracker
        """
        _, predicts = run_inference_on_one_frame(
            self.model, image_slice, self.pad_and_resize
        )
        return self.yolo_to_sv_detections(predicts)

    def run_on_one_frame(self, frame: np.ndarray, index: int) -> np.ndarray:

        detections = self.slicer(frame)
        # We update the tracker with the new detections
        detections = self.byte_tracker.update_with_detections(detections)

        if self.line_zone is not None:
            # Counting cars and trucks only
            # class_id 2 is car, class_id 7 is truck
            # You can change this to count other classes
            car_detections = detections[(detections.class_id == 2) | (detections.class_id == 7)]
            self.line_zone.trigger(car_detections)

        labels = [
            f"{self.cfg.dataset.class_list[int(class_id)]} {tracker_id} {confidence:0.2f}"
            for _, class_id, confidence, tracker_id in zip(
                detections.xyxy,
                detections.class_id,
                detections.confidence,
                detections.tracker_id,
            )
        ]

        annotated_frame = self.bounding_box_annotator.annotate(
            scene=frame.copy(), detections=detections
        )

        annotated_frame = self.label_annotator.annotate(
            scene=annotated_frame, detections=detections, labels=labels
        )

        if self.line_zone is not None:
            # Apply counting annotation to show the line and the
            # counts
            annotated_frame = self.line_zone_annotator.annotate(
                annotated_frame, line_counter=self.line_zone
            )

        return annotated_frame

In [13]:
input_video = "cars_on_bridge.mp4"
output_video = "output_detected.m4v"

# Let's define a line in the video
# We use a horizontal line in the middle of the bridge
# (placed at one quarter of the frame height from the top)
_, image_size = get_fps_and_video_size(input_video)
START = sv.Point(0, image_size[1] // 4)
END = sv.Point(image_size[0], image_size[1] // 4)
line_zone = sv.LineZone(
    start=START, 
    end=END,
    # We trigger the count when the center of the bounding
    # box crosses the line
    triggering_anchors=[sv.Position.CENTER],
)

# This works as before
processor = YOLOVideoObjectCounter(video_file=input_video, with_slicing=True, line_zone=line_zone)

sv.process_video(
    source_path=input_video,
    target_path=output_video,
    callback=processor.run_on_one_frame,
    show_progress=True,
)

# Convert from m4v to mp4 so we can display it here
!ffmpeg -i {output_video} -c:v libx264 -tag:v avc1 cars_on_bridge_detected.mp4 -y > /dev/null 2>&1

display(
        Video("cars_on_bridge_detected.mp4", embed=False)
)


Using device: mps


Processing video:   0%|          | 0/56 [00:00<?, ?it/s]
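The counting itself relies on a simple geometric test: for each tracked object, check on which side of the line its anchor point lies, and count a crossing when that side changes between frames. The sketch below illustrates the idea in plain Python. It is not the `sv.LineZone` implementation: the `LineCounter` class is hypothetical, it treats the line as infinite, and which direction it calls "in" is an arbitrary choice that may not match the convention used by `supervision`.

```python
def side_of_line(point, start, end):
    """Sign of the cross product: positive on one side of the line, negative on the other."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    return (ex - sx) * (py - sy) - (ey - sy) * (px - sx)


class LineCounter:
    """Counts how many tracked objects cross the line through `start` and `end`."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.last_side = {}  # tracker id -> side seen in the previous frame
        self.in_count = 0
        self.out_count = 0

    def update(self, centers):
        """`centers` maps tracker id -> (x, y) box center for the current frame."""
        for tracker_id, center in centers.items():
            value = side_of_line(center, self.start, self.end)
            if value == 0:
                continue  # exactly on the line: wait for the next frame
            side = value > 0
            previous = self.last_side.get(tracker_id)
            self.last_side[tracker_id] = side
            if previous is None or previous == side:
                continue  # first sighting, or still on the same side
            if side:
                self.in_count += 1
            else:
                self.out_count += 1


# A horizontal line at y=100, and two tracked objects moving in opposite directions
counter = LineCounter(start=(0, 100), end=(200, 100))
frames_of_centers = [
    {1: (50, 80), 2: (150, 120)},
    {1: (50, 95), 2: (150, 90)},
    {1: (50, 110)},
]
for centers in frames_of_centers:
    counter.update(centers)
print(counter.in_count, counter.out_count)  # 1 1
```

This also shows why tracking matters for counting: without a stable tracker id there is no way to tell that the point below the line in one frame is the same object that was above it in the previous one.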

Now let's test it on a different traffic video, where objects cross the line in both directions, so both the "in" and the "out" counts increase. In this case, we count cars travelling in both directions on a highway:

In [14]:
input_video = "two_lanes_cut.mp4"
output_video = "traffic_detected.m4v"


# Let's define a line in the video
# We use a horizontal line halfway down the image
_, image_size = get_fps_and_video_size(input_video)
START = sv.Point(0, image_size[1] // 2)
END = sv.Point(image_size[0], image_size[1] // 2)
line_zone = sv.LineZone(
    start=START, 
    end=END,
    # We trigger the count when the center of the bounding
    # box crosses the line
    triggering_anchors=[sv.Position.CENTER],
)

processor = YOLOVideoObjectCounter(video_file=input_video, with_slicing=True, line_zone=line_zone)

sv.process_video(
    source_path=input_video,
    target_path=output_video,
    callback=processor.run_on_one_frame,
    show_progress=True,
)

# Convert from m4v to mp4 so we can display it here
!ffmpeg -i {output_video} -c:v libx264 -tag:v avc1 traffic_detected.mp4 -y > /dev/null 2>&1

display(
        Video("traffic_detected.mp4", embed=False)
)

Using device: mps


Processing video:   0%|          | 0/300 [00:00<?, ?it/s]

As we can see, this works fairly well! It is not perfect, but we also didn't spend any time optimizing parameters!