# Object Detection - YOLO & OWL-ViT
This tutorial demonstrates how to use YOLO (You Only Look Once) from the [Ultralytics](https://github.com/ultralytics/ultralytics) library for object detection, and how it compares with OWL-ViT. It includes steps for:

- Running object detection inference on images/videos
- Fine-tuning YOLO for custom datasets
- Comparing YOLO with OWL-ViT for zero-shot object detection


## 1. Perform Object Detection Inference
First, we'll use the pre-trained YOLOv8 model from Ultralytics to detect objects in a sample image. This involves loading the model, passing an image as input, and interpreting the model's predictions.

**Key Concepts:**
- **Inference**: The process of using a trained model to make predictions on new data.
- **YOLOv8**: A version of the YOLO (You Only Look Once) architecture released by Ultralytics, known for its speed and accuracy in object detection tasks.

**Steps:**
1. Load the YOLOv8 model using the Ultralytics library.
2. Perform inference on a sample image to detect objects.
3. Visualize the results, including bounding boxes and class labels.

**Support Material:**
- https://docs.ultralytics.com/models/yolov8/
- https://docs.ultralytics.com/tasks/detect/

In [2]:
# Import YOLO and load a pre-trained model
from ultralytics import YOLO

# Load the YOLOv8 pre-trained model
model = YOLO('yolov8n.pt')  # nano model for quick inference

# Run inference on a sample image; save=True writes the annotated image
# to a runs/detect/predict* folder (it does not display it)
results = model('images/street_scene.jpg', save=True)

for result in results:
    print(result.boxes)  # Boxes object for bounding box outputs




image 1/1 /workspaces/MultimodalInteraction_ObjDet/images/street_scene.jpg: 384x640 13 persons, 1 bicycle, 9 cars, 2 motorcycles, 1 traffic light, 1 bench, 4 birds, 1 handbag, 1 potted plant, 124.3ms
Speed: 1.8ms preprocess, 124.3ms inference, 1.3ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs/detect/predict4
ultralytics.engine.results.Boxes object with attributes:

cls: tensor([ 2.,  0.,  0.,  0., 58.,  0.,  2.,  9.,  0., 14.,  0.,  3.,  0.,  1.,  2., 14., 14.,  2.,  0.,  2.,  2.,  0.,  0., 26.,  0.,  3.,  2.,  0.,  0.,  2., 14., 13.,  2.])
conf: tensor([0.9098, 0.9041, 0.9005, 0.8934, 0.8477, 0.8331, 0.8173, 0.7737, 0.7585, 0.7313, 0.6779, 0.6606, 0.6198, 0.5686, 0.5105, 0.5057, 0.5043, 0.4675, 0.4564, 0.4517, 0.4201, 0.4165, 0.4037, 0.4015, 0.3767, 0.3745, 0.3659, 0.3221, 0.3095, 0.3049, 0.2999, 0.2989, 0.2811])
data: tensor([[9.5592e-01, 3.6429e+02, 6.0592e+02, 6.1893e+02, 9.0984e-01, 2.0000e+00],
        [1.1789e+03, 4.2397e+02, 1.4806e+03, 8.6406
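The `cls` tensor above holds numeric class ids rather than names. As a minimal sketch of how those ids relate to the summary line ("13 persons, 1 bicycle, 9 cars, ..."), the snippet below counts detections per class. It assumes the ids were copied by hand from the printed output and uses a hand-written subset of the COCO id-to-name mapping; with a live result the same information should be available from `result.boxes.cls` and `result.names`.

```python
from collections import Counter

# Subset of the COCO class-id -> name mapping, limited to the ids that
# occur in the output above (assumption: written out by hand here; in
# Ultralytics the full mapping is exposed as `result.names`).
COCO_NAMES = {
    0: "person", 1: "bicycle", 2: "car", 3: "motorcycle", 9: "traffic light",
    13: "bench", 14: "bird", 26: "handbag", 58: "potted plant",
}

# Class ids copied from the printed `cls` tensor
# (with a live result: `result.boxes.cls.tolist()`).
cls_ids = [2, 0, 0, 0, 58, 0, 2, 9, 0, 14, 0, 3, 0, 1, 2, 14, 14, 2, 0, 2, 2,
           0, 0, 26, 0, 3, 2, 0, 0, 2, 14, 13, 2]


def count_by_class(ids, names):
    """Return a Counter mapping class name -> number of detections."""
    return Counter(names[int(i)] for i in ids)


counts = count_by_class(cls_ids, COCO_NAMES)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

The per-class counts produced this way match the summary line that Ultralytics prints for the image.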

## 2. Fine-Tuning YOLO on Custom Dataset
Fine-tuning YOLO requires a dataset in the YOLO format. We'll use a small public dataset for demonstration and adapt the pre-trained YOLO model to it. This process, known as fine-tuning, lets YOLO specialize in detecting objects that were not part of its original training.

**Key Concepts:**
- **Fine-tuning**: Adapting a pre-trained model to new data by continuing the training process.
- **Custom Dataset**: A dataset that contains specific objects relevant to a new application, different from those YOLO was trained on (e.g., https://docs.ultralytics.com/datasets/detect/signature/).

**Steps:**
1. Prepare the custom dataset by organizing images and labels in the required format.
2. Configure the YOLO training pipeline.
3. Train the model and evaluate its performance.
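For step 1, YOLO-format labels are plain text files with one line per object: `class x_center y_center width height`, with coordinates normalized to the image size. The sketch below shows the conversion from a pixel-space box; the function names and the example box are hypothetical and only illustrate the arithmetic.

```python
def xyxy_to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) box to a YOLO label line."""
    xmin, ymin, xmax, ymax = box
    cx = (xmin + xmax) / 2 / img_w   # normalized box center, x
    cy = (ymin + ymax) / 2 / img_h   # normalized box center, y
    w = (xmax - xmin) / img_w        # normalized box width
    h = (ymax - ymin) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def yolo_line_to_xyxy(line, img_w, img_h):
    """Inverse conversion: parse a YOLO label line back to pixel coordinates."""
    class_id, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    return int(class_id), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical example: one object of class 0 in a 640x480 image.
label_line = xyxy_to_yolo_line(0, (100, 50, 300, 150), img_w=640, img_h=480)
print(label_line)
print(yolo_line_to_xyxy(label_line, img_w=640, img_h=480))
```

Each image gets its own label file containing one such line per annotated object.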

**Support Material:** 
- https://docs.ultralytics.com/modes/train/
- https://docs.ultralytics.com/modes/val/




In [3]:
# Download a sample dataset (e.g., Signature)
!wget -q https://github.com/ultralytics/assets/releases/download/v0.0.0/signature.zip
!unzip -q signature.zip -d ./datasets

In [4]:
# Train YOLO on the dataset
results = model.train(data='./datasets/signature.yaml', epochs=10, imgsz=320, batch=2, workers=0, device='cpu')

New https://pypi.org/project/ultralytics/8.3.113 available 😃 Update with 'pip install -U ultralytics'
Ultralytics 8.3.39 🚀 Python-3.10.17 torch-2.5.1+cu124 CPU (Intel Xeon Platinum 8370C 2.80GHz)


[34m[1mengine/trainer: [0mtask=detect, mode=train, model=yolov8n.pt, data=./datasets/signature.yaml, epochs=10, time=None, patience=100, batch=2, imgsz=320, save=True, save_period=-1, cache=False, device=cpu, workers=0, project=None, name=train3, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False

100%|██████████| 11.3M/11.3M [00:00<00:00, 427MB/s]
Unzipping /workspaces/MultimodalInteraction_ObjDet/datasets/signature.zip to /workspaces/MultimodalInteraction_ObjDet/datasets/signature...: 100%|██████████| 364/364 [00:00<00:00, 3177.56file/s]

Dataset download success ✅ (0.8s), saved to /workspaces/MultimodalInteraction_ObjDet/datasets

Overriding model.yaml nc=80 with nc=1

                   from  n    params  module                                       arguments                     
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]                 
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]                
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]                
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]             
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  6                  -1  2    197632  ultralytics.nn.modules




 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]                  
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]                
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]                 
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384,

[34m[1mtrain: [0mScanning /workspaces/MultimodalInteraction_ObjDet/datasets/signature/train/labels... 143 images, 0 backgrounds, 0 corrupt: 100%|██████████| 143/143 [00:00<00:00, 2296.52it/s]

[34m[1mtrain: [0mNew cache created: /workspaces/MultimodalInteraction_ObjDet/datasets/signature/train/labels.cache



[34m[1mval: [0mScanning /workspaces/MultimodalInteraction_ObjDet/datasets/signature/valid/labels... 35 images, 0 backgrounds, 0 corrupt: 100%|██████████| 35/35 [00:00<00:00, 3391.65it/s]

[34m[1mval: [0mNew cache created: /workspaces/MultimodalInteraction_ObjDet/datasets/signature/valid/labels.cache





Plotting labels to runs/detect/train3/labels.jpg... 
[34m[1moptimizer:[0m 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
[34m[1moptimizer:[0m AdamW(lr=0.002, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 320 train, 320 val
Using 0 dataloader workers
Logging results to [1mruns/detect/train3[0m
Starting training for 10 epochs...
Closing dataloader mosaic

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
       1/10         0G      2.304      3.136       2.03          1        320: 100%|██████████| 72/72 [00:22<00:00,  3.14it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 9/9 [00:01<00:00,  6.93it/s]
                   all         35         35      0.531      0.257      0.381      0.287
       2/10         0G      1.504      2.125      1.407          1        320: 100%|██████████| 72/72 [00:22<00:00,  3.24it/s]
                   all         35         35       0.93      0.764      0.861      0.573
       3/10         0G      1.132       1.54      1.117          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.31it/s]
                   all         35         35          1      0.882       0.93      0.646
       4/10         0G     0.9756      1.176     0.9967          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.28it/s]
                   all         35         35      0.991      0.914      0.941      0.713
       5/10         0G     0.8739      1.084     0.9717          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.28it/s]
                   all         35         35      0.969      0.943      0.949      0.737
       6/10         0G     0.7951     0.9716     0.9219          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.34it/s]
                   all         35         35      0.997      0.943      0.961      0.742
       7/10         0G     0.6961     0.8455     0.8985          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.31it/s]
                   all         35         35          1      0.936       0.96      0.781
       8/10         0G     0.6881      0.817     0.8926          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.33it/s]
                   all         35         35      0.998      0.943      0.967      0.837
       9/10         0G      0.653     0.7805     0.9034          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.34it/s]
                   all         35         35      0.998      0.943      0.967      0.845
      10/10         0G     0.6666     0.8182     0.8876          1        320: 100%|██████████| 72/72 [00:21<00:00,  3.31it/s]
                   all         35         35      0.997      0.943      0.968      0.842
10 epochs completed in 0.065 hours.
Optimizer stripped from runs/detect/train3/weights/last.pt, 6.2MB
Optimizer stripped from runs/detect/train3/weights/best.pt, 6.2MB

Validating runs/detect/train3/weights/best.pt...
Ultralytics 8.3.39 🚀 Python-3.10.17 torch-2.5.1+cu124 CPU (Intel Xeon Platinum 8370C 2.80GHz)
Model summary (fused): 168 layers, 3,005,843 parameters, 0 gradients, 8.1 GFLOPs


                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 9/9 [00:01<00:00,  6.59it/s]


                   all         35         35      0.998      0.943      0.967      0.845
Speed: 0.3ms preprocess, 27.6ms inference, 0.0ms loss, 0.5ms postprocess per image
Results saved to runs/detect/train3
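The validation columns in the log (P, R, mAP50, mAP50-95) all rest on Intersection over Union (IoU) between predicted and ground-truth boxes: mAP50 counts a prediction as correct at IoU >= 0.5, while mAP50-95 averages over IoU thresholds from 0.50 to 0.95. The sketch below is a simplified illustration of IoU and of precision and recall at a single threshold, using made-up boxes; it is not the metric code Ultralytics runs, which is more involved.

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_thr=0.5):
    """Greedy matching: each ground-truth box can be matched at most once.

    preds  -- list of (score, box) tuples
    truths -- list of ground-truth boxes
    """
    matched, tp = set(), 0
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(box, truth)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical boxes: two ground-truth objects, three predictions
# (two good matches and one false positive).
truths = [(10, 10, 50, 50), (60, 60, 100, 100)]
preds = [(0.9, (12, 12, 50, 50)), (0.8, (200, 200, 240, 240)), (0.4, (58, 62, 100, 100))]

p, r = precision_recall(preds, truths)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case two of three predictions match a ground-truth box, so precision is 2/3 while recall is 1.0.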


In [9]:
# Load the fine-tuned model; adjust the path to match your own training run's output folder
model = YOLO("runs/detect/train3/weights/best.pt")

# Predict with the model; conf=0.5 keeps only detections with confidence >= 0.5
# (adjust this and other predict parameters if detections are missed)
results = model.predict("images/example_signature.jpg", conf=0.5)


image 1/1 /workspaces/MultimodalInteraction_ObjDet/images/example_signature.jpg: 320x256 1 signature, 35.9ms
Speed: 1.1ms preprocess, 35.9ms inference, 0.7ms postprocess per image at shape (1, 3, 320, 256)
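The `conf=0.5` argument discards detections whose confidence falls below the threshold. To give a rough feel for how strongly that choice affects the number of detections, the sketch below sweeps a few thresholds over the confidence values printed for the street scene in Section 1 (not the signature model, which produced a single detection here). The helper name `kept_at` is hypothetical.

```python
# Confidence values copied from the `conf` tensor printed in Section 1.
conf = [0.9098, 0.9041, 0.9005, 0.8934, 0.8477, 0.8331, 0.8173, 0.7737,
        0.7585, 0.7313, 0.6779, 0.6606, 0.6198, 0.5686, 0.5105, 0.5057,
        0.5043, 0.4675, 0.4564, 0.4517, 0.4201, 0.4165, 0.4037, 0.4015,
        0.3767, 0.3745, 0.3659, 0.3221, 0.3095, 0.3049, 0.2999, 0.2989,
        0.2811]


def kept_at(threshold, scores):
    """Number of detections that survive a given confidence threshold."""
    return sum(score >= threshold for score in scores)


for thr in (0.25, 0.5, 0.75):
    print(f"conf >= {thr:.2f}: {kept_at(thr, conf)} of {len(conf)} detections kept")
```

A higher threshold trades recall for precision: fewer spurious boxes, but more missed objects.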


## 3. Zero-Shot Learning with OWL-ViT
Switch to `OWL-ViT` to see how it performs with zero-shot learning capabilities. Zero-shot means detecting objects of classes the model was never specifically trained to detect.

OWL-ViT (Vision Transformer for Open-World Localization) is a model designed for open-vocabulary object detection. Unlike traditional detectors with a fixed set of classes, OWL-ViT combines a vision transformer with text embeddings, enabling it to:

- Understand textual descriptions of objects, even if it hasn't seen them during training.
- Detect and classify objects based on descriptive input, making it suitable for diverse applications.
- Perform zero-shot learning by generalizing to new object classes without additional training.

**Steps in Using OWL-ViT:**
1. **Model Initialization**: Set up the OWL-ViT model.
2. **Text Input for Object Descriptions**: Provide descriptive prompts (e.g., 'a red car' or 'a black cat') to guide detection.
3. **Inference and Visualization**: Process an image or video, detect objects based on the text descriptions, and visualize the results.

OWL-ViT excels in scenarios where predefined object classes are insufficient, such as detecting rare or domain-specific objects.
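The idea of matching image regions against text prompts can be illustrated with a toy example. The sketch below assigns each region the prompt whose embedding is most similar to it. All vectors are made-up 3-dimensional values and cosine similarity is used purely for illustration; this is not OWL-ViT's actual scoring code, and real embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical text embeddings, one per prompt (values invented for the demo).
text_embeddings = {
    "a car": [0.9, 0.1, 0.0],
    "a church": [0.0, 0.2, 0.9],
    "a person on the floor": [0.1, 0.9, 0.1],
}

# Hypothetical embeddings of two image regions.
region_embeddings = [[0.8, 0.2, 0.1], [0.2, 0.95, 0.0]]


def label_regions(regions, prompts):
    """For each region, return the (prompt, similarity) pair with the highest similarity."""
    results = []
    for region in regions:
        scored = [(name, cosine(region, emb)) for name, emb in prompts.items()]
        results.append(max(scored, key=lambda pair: pair[1]))
    return results


labelled = label_regions(region_embeddings, text_embeddings)
labels = [name for name, _ in labelled]
for name, score in labelled:
    print(f"{name}: {score:.3f}")
```

Because the prompts are only text, adding a new class amounts to adding a new entry to the prompt list, with no retraining.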

**Support Material**:
- https://huggingface.co/docs/transformers/en/model_doc/owlvit


In [11]:
from PIL import Image

import matplotlib.pyplot as plt
import matplotlib.patheffects as pe


from transformers import pipeline

image = Image.open("images/street_scene.jpg")


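def preprocess_outputs(output):
    """Split pipeline detections into parallel lists of scores, labels, and boxes."""
    input_scores = [x["score"] for x in output]
    input_labels = [x["label"] for x in output]
    # Read the box keys explicitly so the order is always [xmin, ymin, xmax, ymax]
    input_boxes = [
        [x["box"]["xmin"], x["box"]["ymin"], x["box"]["xmax"], x["box"]["ymax"]]
        for x in output
    ]
    # Boxes are wrapped in an outer list (one entry per image); callers index [0]
    return input_scores, input_labels, [input_boxes]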


def show_box(box, ax):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(
        plt.Rectangle((x0, y0), w, h, edgecolor="green", facecolor=(0, 0, 0, 0), lw=2)
    )


def show_boxes_and_labels_on_image(raw_image, boxes, labels, scores):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    for i, box in enumerate(boxes):
        show_box(box, plt.gca())
        plt.text(
            x=box[0],
            y=box[1] - 12,
            s=f"{labels[i]}: {scores[i]:,.4f}",
            c="beige",
            path_effects=[pe.withStroke(linewidth=4, foreground="darkgreen")],
        )
    plt.axis("on")
    plt.show()

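OWL_checkpoint = "google/owlvit-base-patch32"

# Text prompts describing the objects to look for
text = ["a person on the floor", "a church", "a car"]

# Load the zero-shot object detection pipeline
detector = pipeline(
    model=OWL_checkpoint,
    task="zero-shot-object-detection",
)

# Run detection with the prompts as candidate labels
output = detector(
    image,
    candidate_labels=text,
)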

print(output)

input_scores, input_labels, input_boxes = preprocess_outputs(output)

# Show the image with the bounding boxes
show_boxes_and_labels_on_image(
    image,
    input_boxes[0],
    input_labels,
    input_scores
)


[{'score': 0.5236293077468872, 'label': 'a car', 'box': {'xmin': -7, 'ymin': 364, 'xmax': 603, 'ymax': 616}}, {'score': 0.4921436309814453, 'label': 'a car', 'box': {'xmin': 577, 'ymin': 341, 'xmax': 874, 'ymax': 460}}, {'score': 0.4799966812133789, 'label': 'a car', 'box': {'xmin': 339, 'ymin': 326, 'xmax': 617, 'ymax': 422}}, {'score': 0.4671841859817505, 'label': 'a car', 'box': {'xmin': 1025, 'ymin': 339, 'xmax': 1088, 'ymax': 380}}, {'score': 0.41778579354286194, 'label': 'a car', 'box': {'xmin': 1543, 'ymin': 370, 'xmax': 1792, 'ymax': 487}}, {'score': 0.3334088623523712, 'label': 'a person on the floor', 'box': {'xmin': 584, 'ymin': 691, 'xmax': 1154, 'ymax': 1000}}, {'score': 0.25368502736091614, 'label': 'a car', 'box': {'xmin': 1031, 'ymin': 339, 'xmax': 1090, 'ymax': 370}}, {'score': 0.19445128738880157, 'label': 'a person on the floor', 'box': {'xmin': 408, 'ymin': 595, 'xmax': 658, 'ymax': 923}}, {'score': 0.16051188111305237, 'label': 'a person on the floor', 'box': {'xmi

<Figure size 1000x1000 with 1 Axes>
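The raw output above contains a box that extends past the image edge (`xmin: -7`), several low-scoring detections, and two nearly identical `'a car'` boxes around x=1030. A possible clean-up step is sketched below: clip boxes to the image, drop low scores, and suppress same-label duplicates by IoU. The detections are the first eight entries of the printed output with scores rounded to four decimals; the 1920x1080 image size and both thresholds are assumptions chosen for illustration (with the real image, `image.size` gives the width and height).

```python
detections = [
    {"score": 0.5236, "label": "a car", "box": {"xmin": -7, "ymin": 364, "xmax": 603, "ymax": 616}},
    {"score": 0.4921, "label": "a car", "box": {"xmin": 577, "ymin": 341, "xmax": 874, "ymax": 460}},
    {"score": 0.4800, "label": "a car", "box": {"xmin": 339, "ymin": 326, "xmax": 617, "ymax": 422}},
    {"score": 0.4672, "label": "a car", "box": {"xmin": 1025, "ymin": 339, "xmax": 1088, "ymax": 380}},
    {"score": 0.4178, "label": "a car", "box": {"xmin": 1543, "ymin": 370, "xmax": 1792, "ymax": 487}},
    {"score": 0.3334, "label": "a person on the floor", "box": {"xmin": 584, "ymin": 691, "xmax": 1154, "ymax": 1000}},
    {"score": 0.2537, "label": "a car", "box": {"xmin": 1031, "ymin": 339, "xmax": 1090, "ymax": 370}},
    {"score": 0.1945, "label": "a person on the floor", "box": {"xmin": 408, "ymin": 595, "xmax": 658, "ymax": 923}},
]


def clip_box(box, width, height):
    """Clamp box coordinates to the image bounds."""
    return {
        "xmin": max(0, min(box["xmin"], width)),
        "ymin": max(0, min(box["ymin"], height)),
        "xmax": max(0, min(box["xmax"], width)),
        "ymax": max(0, min(box["ymax"], height)),
    }


def box_iou(a, b):
    """Intersection over Union of two box dicts."""
    iw = max(0, min(a["xmax"], b["xmax"]) - max(a["xmin"], b["xmin"]))
    ih = max(0, min(a["ymax"], b["ymax"]) - max(a["ymin"], b["ymin"]))
    inter = iw * ih
    area_a = (a["xmax"] - a["xmin"]) * (a["ymax"] - a["ymin"])
    area_b = (b["xmax"] - b["xmin"]) * (b["ymax"] - b["ymin"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(dets, width, height, score_thr=0.2, iou_thr=0.5):
    """Clip boxes, drop low scores, and suppress same-label duplicates."""
    kept = []
    for det in sorted(dets, key=lambda d: d["score"], reverse=True):
        if det["score"] < score_thr:
            continue
        box = clip_box(det["box"], width, height)
        # A detection is a duplicate if a higher-scoring box with the same
        # label already covers most of the same area.
        is_duplicate = any(
            k["label"] == det["label"] and box_iou(k["box"], box) > iou_thr
            for k in kept
        )
        if not is_duplicate:
            kept.append({"score": det["score"], "label": det["label"], "box": box})
    return kept


# Assumed image size; replace with `image.size` for the real image.
cleaned = postprocess(detections, width=1920, height=1080)
for det in cleaned:
    print(det)
```

The cleaned list can be passed to `preprocess_outputs` and the plotting helper in the same way as the raw `output`.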