<a href="https://colab.research.google.com/github/tahamsi/computer-vision/blob/main/week-10/Object_tracking_and_counting_with_yolo11_supervision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/tahamsi/computer-vision)

# Object Tracking


**Object tracking** is a computer vision task that locates one or more objects in a video and follows them across consecutive frames. The goal is to maintain each object's identity and trajectory over time as it moves through the scene.

The process of object tracking typically involves the following steps:

* Detection: Initially, an object detector or segmentation algorithm identifies and localizes objects within the first frame of the video sequence.

* Initialization: Once the object is detected in the first frame, a bounding box or a specific region of interest around the object is defined, and its characteristics (such as appearance features, color, shape, or motion) are extracted to create a representation.

* Tracking: Using the defined characteristics, the tracker continuously predicts the object's position or state in subsequent frames by updating and adjusting the initial representation. This is done by estimating the object's location, size, orientation, and other relevant attributes.

* Updating: As the object moves, changes direction, or experiences occlusion, the tracking algorithm adapts to these variations, maintaining the object's trajectory and characteristics across frames.
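
The steps above can be sketched with a deliberately simple tracker. This is a minimal illustration, not the method used later in this notebook: the `GreedyIoUTracker` class and its `iou_threshold` parameter are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and a track is dropped as soon as it goes unmatched for one frame.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Hypothetical toy tracker: matches each detection to the best-overlapping track."""

    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, Box] = {}  # track ID -> last known box
        self._next_id = 1

    def update(self, detections: List[Box]) -> List[int]:
        """Return one track ID per detection, creating new IDs when nothing matches."""
        unmatched = dict(self.tracks)
        new_tracks: Dict[int, Box] = {}
        assigned: List[int] = []
        for box in detections:
            best_id, best_score = None, self.iou_threshold
            for track_id, track_box in unmatched.items():
                score = iou(box, track_box)
                if score >= best_score:
                    best_id, best_score = track_id, score
            if best_id is None:
                # initialization: no existing track overlaps enough, start a new one
                best_id = self._next_id
                self._next_id += 1
            else:
                # tracking: this track is taken for the current frame
                del unmatched[best_id]
            new_tracks[best_id] = box  # updating: store the latest box
            assigned.append(best_id)
        self.tracks = new_tracks  # tracks left unmatched are dropped immediately
        return assigned


tracker = GreedyIoUTracker()
print(tracker.update([(0, 0, 10, 10), (100, 100, 120, 120)]))  # [1, 2]
print(tracker.update([(102, 101, 122, 121), (1, 0, 11, 10)]))  # [2, 1]
print(tracker.update([(500, 500, 510, 510)]))                  # [3]
```

Practical trackers add motion prediction and keep lost tracks alive for a number of frames, so that identities survive short occlusions.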

## YOLO
YOLO (You Only Look Once) is a popular family of deep learning models for real-time object detection. It was introduced by Joseph Redmon and colleagues and is known for detecting multiple objects in an image or video frame with high speed and accuracy. YOLO treats object detection as a single regression problem, predicting bounding boxes and class probabilities directly from the full image in one evaluation.

### Key Features

* Single-Pass Detection: Unlike traditional object detection methods that use a multi-stage process (e.g., region proposal and classification), YOLO processes an image in a single neural network pass, making it extremely fast and suitable for real-time applications.
* Grid-Based Prediction: YOLO divides the input image into a grid and assigns each grid cell the responsibility of predicting bounding boxes and their associated class probabilities if the center of an object falls within that cell.
* End-to-End Learning: The model is trained end-to-end, optimizing for both object localization and classification simultaneously.
* Speed and Efficiency: YOLO is capable of processing images at high frame rates, making it suitable for applications that require real-time performance, such as video surveillance, autonomous vehicles, and interactive systems.
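
The grid-based assignment described above can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the function name `responsible_cell` is hypothetical, the 7×7 grid and 448×448 input follow the original YOLO formulation, and later YOLO versions, including YOLO11, use different prediction heads.

```python
from typing import Tuple


def responsible_cell(
    cx: float, cy: float, image_w: int, image_h: int, grid_size: int = 7
) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing the object center (cx, cy)."""
    # scale the center to grid units, clamping centers that sit on the far edge
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


# an object centered at (200, 300) in a 448x448 image falls in row 4, column 3
print(responsible_cell(200, 300, 448, 448))  # (4, 3)
```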

## YOLO11
[Ultralytics YOLO11](https://docs.ultralytics.com/modes/track/) is a state-of-the-art model that builds on the success of previous YOLO versions, incorporating new features and enhancements to further improve performance and flexibility. **YOLO11** is designed to be fast, accurate, and user-friendly, making it an ideal choice for a variety of tasks, including object detection, tracking, instance segmentation, image classification, and pose estimation.

The output from Ultralytics trackers aligns with standard object detection while incorporating object IDs, enabling seamless object tracking in video streams and advanced analytics. Here’s why Ultralytics YOLO is an excellent choice for your object tracking needs:

* **Efficiency**: Processes video streams in real-time with high accuracy.
* **Flexibility**: Supports various tracking algorithms and configurable options to suit diverse use cases.
* **Ease of Use**: Features a straightforward Python API and CLI for quick setup and deployment.
* **Customizability**: Compatible with custom-trained YOLO models, making it ideal for domain-specific applications.

## Before you start

Let's make sure that we have access to a GPU. We can use the `nvidia-smi` command to check. If no GPU is listed, navigate to `Edit -> Notebook settings -> Hardware accelerator`, set it to `GPU`, and then click `Save`.
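
The same check can be scripted from Python, which is convenient when a notebook should fail early without a GPU. This is a minimal sketch using only the standard library; the helper name `nvidia_smi_available` is hypothetical, and it only confirms that the `nvidia-smi` tool runs successfully, not that PyTorch can use the device.

```python
import shutil
import subprocess


def nvidia_smi_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False
    try:
        result = subprocess.run(
            [executable], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA driver visible:", nvidia_smi_available())
```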

In [None]:
!nvidia-smi

## Install YOLO

Install the Ultralytics package, along with all required dependencies, in a Python environment (version 3.8 or higher) with `PyTorch` (version 1.8 or higher) using the following command: `pip install ultralytics`.

In [1]:
!pip install ultralytics

from IPython import display
display.clear_output()

In [2]:
import os
HOME = os.getcwd()
!mkdir -p {HOME}/data

### Load Data

In [12]:
!wget https://raw.githubusercontent.com/tahamsi/computer-vision/refs/heads/main/images/people.mp4 -P {HOME}/data
source_video = f"{HOME}/data/people.mp4"
display.clear_output()

## Build a model

Choose the base model for tracking: either an object detection model (for example `yolo11n.pt`) or an instance segmentation model (for example `yolo11n-seg.pt`, used here). Whichever you select, the corresponding checkpoint is downloaded automatically on first use.

In [4]:
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.
Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11n-seg.pt to 'yolo11n-seg.pt'...


100%|██████████| 5.90M/5.90M [00:00<00:00, 53.1MB/s]


## ByteTrack

In this implementation, we use **ByteTrack** as the tracking algorithm.

[ByteTrack](https://github.com/ifzhang/ByteTrack) is a multi-object tracking (MOT) algorithm that improves tracking accuracy by associating both high-confidence and low-confidence detections across frames. Many trackers discard low-confidence detections; ByteTrack instead tries to match them to existing tracks in a second association step, which helps with occlusions and crowded scenes. In its default form, it associates detections with tracks using motion cues (Kalman-filter prediction and bounding-box overlap) rather than appearance features, which keeps it computationally efficient. ByteTrack reported state-of-the-art results on the MOT Challenge benchmarks when it was published, and it is widely used for multi-object tracking in surveillance, autonomous driving, and video analytics.
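
The two-stage idea can be sketched in plain Python. This is a simplified illustration, not the ByteTrack implementation: it uses greedy matching on the last known boxes, whereas the real algorithm matches against motion-predicted boxes with an optimal assignment step. The function name `byte_style_associate` and the thresholds `high_thresh` and `match_iou` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def byte_style_associate(
    tracks: Dict[int, Box],
    detections: List[Tuple[Box, float]],
    high_thresh: float = 0.5,
    match_iou: float = 0.3,
) -> Tuple[Dict[int, int], List[int]]:
    """Return (track ID -> detection index, unmatched high-confidence detection indices)."""
    high = [i for i, (_, score) in enumerate(detections) if score >= high_thresh]
    low = [i for i, (_, score) in enumerate(detections) if score < high_thresh]
    matches: Dict[int, int] = {}
    remaining = dict(tracks)  # tracks not yet matched in this frame

    def greedy_match(candidates: List[int]) -> List[int]:
        unmatched = []
        for i in candidates:
            box = detections[i][0]
            best_id, best_score = None, match_iou
            for track_id, track_box in remaining.items():
                overlap = iou(box, track_box)
                if overlap >= best_score:
                    best_id, best_score = track_id, overlap
            if best_id is None:
                unmatched.append(i)
            else:
                matches[best_id] = i
                del remaining[best_id]
        return unmatched

    # stage 1: high-confidence detections get first pick of the tracks
    unmatched_high = greedy_match(high)
    # stage 2: low-confidence detections may only rescue leftover tracks;
    # low-confidence detections that match nothing are discarded
    greedy_match(low)
    return matches, unmatched_high


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [
    ((1, 0, 11, 10), 0.9),        # confident detection near track 1
    ((51, 50, 61, 60), 0.2),      # weak detection near track 2 (e.g. partly occluded)
    ((200, 200, 210, 210), 0.1),  # weak detection near nothing
]
print(byte_style_associate(tracks, detections))  # ({1: 0, 2: 1}, [])
```

In this example, track 2 survives the frame only because the weak detection is considered in the second stage; a tracker that dropped all detections below `high_thresh` would have lost it.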

### Supervision
You can either install and set up ByteTrack directly or use [Roboflow Supervision](https://github.com/roboflow/supervision), which conveniently packages it into an easy-to-use module. Supervision is designed to be model-agnostic, allowing integration with any classification, detection, or segmentation model.

In [5]:
!pip install supervision

display.clear_output()

## Import required libraries

In [6]:
import supervision as sv
import numpy as np

### Initialization

Restrict the detections to the object classes of interest. Because the goal is to count moving objects, also define a counting line in the video, preferably close to the camera, by specifying the pixel coordinates of its start and end points.
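
To see how a counting line can register a crossing, consider the sign of a 2D cross product: it tells which side of the line a point lies on, and a sign change between two frames means the point crossed. This is a simplified sketch with hypothetical helper names (`side_of_line`, `crossed`); it treats the line as infinite and tracks a single point, whereas Supervision's `LineZone` applies its own logic to the tracked bounding boxes.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in pixels


def side_of_line(start: Point, end: Point, point: Point) -> float:
    """Cross product of (end - start) and (point - start); the sign gives the side."""
    (x1, y1), (x2, y2), (px, py) = start, end, point
    return (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)


def crossed(start: Point, end: Point, prev_point: Point, curr_point: Point) -> bool:
    """True if the point moved from one side of the line to the other."""
    before = side_of_line(start, end, prev_point)
    after = side_of_line(start, end, curr_point)
    return before * after < 0


# horizontal line at y = 700, matching the coordinates used in this notebook
start, end = (50, 700), (1700, 700)
print(crossed(start, end, (800, 650), (800, 720)))  # True: moved across y = 700
print(crossed(start, end, (800, 650), (820, 680)))  # False: stayed on the same side
```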

In [8]:
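# mapping from class ID to class name
class_names = model.model.names
# class IDs of interest: 0 is "person" in the COCO-pretrained model
class_ids = [0]

# endpoints of the counting line in pixel coordinates (x, y)
start = sv.Point(50, 700)
end = sv.Point(1700, 700)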

target_video = f"{HOME}/result.mp4"

Read the input video's metadata (resolution, frame rate, and frame count) into a `VideoInfo` instance.

In [13]:
video_info = sv.VideoInfo.from_video_path(source_video)

Create a `ByteTrack` instance.

In [14]:
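# create ByteTrack instance
byte_tracker = sv.ByteTrack(
    track_activation_threshold=0.25,  # detection confidence needed to activate a track
    lost_track_buffer=30,             # number of frames to keep a lost track
    minimum_matching_threshold=0.8,   # threshold for matching tracks with detections
    frame_rate=30,                    # frame rate of the video
    minimum_consecutive_frames=3)     # frames an object must be tracked before it counts as valid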

byte_tracker.reset()

### Connect YOLO to ByteTrack

In [15]:
# create frame generator
generator = sv.get_video_frames_generator(source_video)

# create LineZone instance (previously named LineCounter)
line_zone = sv.LineZone(start=start, end=end)

# create instance of BoxAnnotator, LabelAnnotator, and TraceAnnotator
box_annotator = sv.BoxAnnotator(thickness=4)
label_annotator = sv.LabelAnnotator(text_thickness=2, text_scale=1.5, text_color=sv.Color.BLACK)
trace_annotator = sv.TraceAnnotator(thickness=4, trace_length=50)

# create LineZoneAnnotator instance (previously named LineCounterAnnotator)
line_zone_annotator = sv.LineZoneAnnotator(thickness=4, text_thickness=4, text_scale=2)

# define the callback function applied to each frame during video processing
def callback(frame: np.ndarray, index: int) -> np.ndarray:
    # run the model on a single frame and convert the result to supervision Detections
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)
    # keep only detections whose class ID is in class_ids defined above
    detections = detections[np.isin(detections.class_id, class_ids)]
    # update the tracker; the returned detections carry tracker IDs
    detections = byte_tracker.update_with_detections(detections)
    labels = [
        f"#{tracker_id} {model.model.names[class_id]} {confidence:0.2f}"
        for confidence, class_id, tracker_id
        in zip(detections.confidence, detections.class_id, detections.tracker_id)
    ]
    annotated_frame = frame.copy()
    annotated_frame = trace_annotator.annotate(
        scene=annotated_frame, detections=detections)
    annotated_frame = box_annotator.annotate(
        scene=annotated_frame, detections=detections)
    annotated_frame = label_annotator.annotate(
        scene=annotated_frame, detections=detections, labels=labels)

    # update line counter
    line_zone.trigger(detections)
    # return frame with box and line annotated result
    return line_zone_annotator.annotate(annotated_frame, line_counter=line_zone)

### Process the video

In [16]:
sv.process_video(
    source_path=source_video,
    target_path=target_video,
    callback=callback
)