These are the packages required to run the model.

In [None]:
%pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
%pip install opencv-python ultralytics
%pip install git+https://github.com/openai/CLIP.git


In [2]:
import cv2
import torch
from ultralytics import YOLOWorld
import numpy as np

<a class="anchor" id="1" name="1"></a>
## **YOLO-World Model**
The detect_objects function uses YOLO-World, an open-vocabulary detection model: it can detect any object that can be described in words, even if that object never appeared in its training data. The weight file used here is yolov8x-worldv2.pt. The "v8x" refers to the YOLOv8 Extra Large variant, which is the most accurate of the YOLOv8 model sizes but also the slowest.


In [None]:
def detect_objects(video_path, output_path, weight_file='yolov8x-worldv2.pt', classes=None, frame_skip=5, confidence_threshold=0.6):
    """
    Object Detection Pipeline using YOLO-World that saves annotated video.

    Args:
        video_path (str): Path to the input video.
        output_path (str): Path to save the annotated output video.
        weight_file (str): Path to YOLO-World model weights.
        classes (list of str): List of class names to detect.
        frame_skip (int): Process every 'frame_skip' frames to save compute.
        confidence_threshold (float): Minimum confidence to keep a detection.

    Returns:
        None
    """
    # Load the model
    model = YOLOWorld(weight_file)

    # Zero-shot detection
    if classes is not None:
        model.set_classes(classes)

    # Open the video
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")

    # Video writer setup
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
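    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0:
        fps = 30.0  # Assumed fallback: some containers report 0 FPS

    # Only every frame_skip-th frame is written, so divide the frame rate
    # (true division, not floor) to keep the original playback duration
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps / frame_skip, (width, height))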

    frame_id = 0

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        frame_id += 1

        # Skip frames if needed
        if frame_id % frame_skip != 0:
            continue

        # Run detection
        results = model.predict(frame)
        boxes = results[0].boxes
        predictions = boxes.data.cpu().numpy()

        for pred in predictions:
            # Each row is [x1, y1, x2, y2, score, class_id]
            x1, y1, x2, y2 = (int(v) for v in pred[:4])
            score, class_id = float(pred[4]), int(pred[5])

            if score < confidence_threshold:
                continue

            label = f"{model.names[class_id]} {score:.2f}"
            color = (0, 255, 0)

            # Draw bounding box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)

            # Draw label background
            (text_width, text_height), baseline = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
            cv2.rectangle(frame, (x1, y1 - text_height - baseline), (x1 + text_width, y1), color, -1)

            # Draw label text
            cv2.putText(frame, label, (x1, y1 - baseline), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

        out.write(frame)

    cap.release()
    out.release()
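
The frame-skipping arithmetic in detect_objects can be checked in isolation. The sketch below is a minimal illustration, not part of the pipeline itself: `sampled_frame_ids` and `output_fps` are hypothetical helper names, and the 30 FPS fallback is an assumption for files that report no frame rate. It shows which frames the loop keeps and what playback rate preserves the original duration.

```python
def sampled_frame_ids(total_frames, frame_skip):
    """Return the 1-based frame ids kept by the rule frame_id % frame_skip == 0."""
    return [i for i in range(1, total_frames + 1) if i % frame_skip == 0]


def output_fps(source_fps, frame_skip, fallback_fps=30.0):
    """Playback rate that keeps the original duration after skipping frames."""
    if source_fps <= 0:
        # Assumed fallback for containers that report 0 FPS
        source_fps = fallback_fps
    # True division: floor division would shorten or stretch the output video
    return source_fps / frame_skip


print(sampled_frame_ids(12, 5))  # frames 5 and 10 are processed
print(output_fps(30.0, 5))       # 6.0
```

With a 29.97 FPS source and frame_skip=5, floor division would give 5.0 FPS instead of about 5.99, so the annotated video would play noticeably slower than the original.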


In the main block, we first specify the input video path and the path where the annotated output video will be saved. We then apply zero-shot object detection by giving the model a list of text prompts (here, common walking hazards), which lets it detect those classes without any additional training.
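
As a quick illustration of how the prompt list and the confidence threshold work together, the sketch below applies the same filtering rule as detect_objects to a few hand-written rows. The rows are made up for illustration and are not real model output. `filter_detections` is a hypothetical helper, and the sketch assumes that class ids index into the prompt list in the order given.

```python
def filter_detections(rows, class_names, confidence_threshold=0.6):
    """Keep rows scoring at least the threshold and build their text labels."""
    kept = []
    for x1, y1, x2, y2, score, class_id in rows:
        if score < confidence_threshold:
            continue  # Same rule as in detect_objects: drop weak detections
        label = f"{class_names[int(class_id)]} {score:.2f}"
        kept.append(((int(x1), int(y1), int(x2), int(y2)), label))
    return kept


walking_hazards = ['person', 'bicycle', 'car']

# Illustrative rows in the layout [x1, y1, x2, y2, score, class_id]
rows = [
    (10.0, 20.0, 110.0, 220.0, 0.91, 0.0),   # confident 'person'
    (300.0, 40.0, 420.0, 160.0, 0.35, 2.0),  # weak 'car', filtered out
    (50.0, 60.0, 90.0, 120.0, 0.60, 1.0),    # exactly at the threshold, kept
]

for box, label in filter_detections(rows, walking_hazards):
    print(box, label)
```

A detection scoring exactly at the threshold is kept, because the check only discards scores strictly below it.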

In [7]:
if __name__ == "__main__":
    video_file = r"C:\Users\tyler\OneDrive\Desktop\Computer Vision\CV Project\SD_1.mov"  # Input video path (example)
    output_video_file = r"C:\Users\tyler\OneDrive\Desktop\Computer Vision\CV Project\annotated_output.mov"  # Output video path
    walking_hazards = [
        'person', 'bicycle', 'car', 'motorcycle', 'bus', 'truck', 'traffic light', 'stop sign'
    ]

    detect_objects(video_file, output_video_file, classes=walking_hazards)
    print(f"Video saved to {output_video_file}")


0: 384x640 3 persons, 2 cars, 2 traffic lights, 64.5ms
Speed: 3.6ms preprocess, 64.5ms inference, 24.0ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 persons, 2 cars, 2 traffic lights, 35.6ms
Speed: 2.6ms preprocess, 35.6ms inference, 1.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 3 persons, 2 cars, 2 traffic lights, 35.6ms
Speed: 1.6ms preprocess, 35.6ms inference, 1.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 persons, 3 cars, 2 traffic lights, 33.4ms
Speed: 1.7ms preprocess, 33.4ms inference, 1.7ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 1 person, 3 cars, 3 traffic lights, 32.8ms
Speed: 2.2ms preprocess, 32.8ms inference, 2.5ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 4 persons, 3 cars, 2 traffic lights, 32.7ms
Speed: 1.7ms preprocess, 32.7ms inference, 2.1ms postprocess per image at shape (1, 3, 384, 640)

0: 384x640 2 persons, 3 cars, 2 traffic lights, 32.7ms
Speed: 2.2ms preprocess, 3