# Course 3: YOLO - You Only Look Once

YOLO (You Only Look Once) is a widely used family of object-detection models. In this notebook, we use a pre-trained model from the `ultralytics` library to detect objects in an image, and then in a video.

The [official Ultralytics documentation](https://docs.ultralytics.com/) provides detailed information about YOLOv8 and the other models the library supports.

First, we should install the `ultralytics` library.

```bash
pip install ultralytics
```

Then we can use `yolo` either from the command line or from Python code.

```bash
yolo version
```
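
The installation can also be checked from Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is introduced here purely for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the library is installed, otherwise None
print(installed_version("ultralytics"))
```

If this prints `None`, the `pip install` step above most likely ran in a different Python environment from the one the notebook uses.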

## The YOLO model

In [1]:
from ultralytics import YOLO
import torch

model = YOLO('yolov9e.pt')

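# Use Apple's Metal backend (MPS) when it is available; otherwise fall back to the CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')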

model = model.to(device)

### Predict the image

Taking a screenshot of *Honkai: Star Rail* as an example, we can recognise the objects in the image.

In [2]:
import cv2

image = cv2.imread('yolo/hsr.png')

results = model([image])  # return a list of Results objects

# Process results list
for result in results:
    boxes = result.boxes  # Boxes object for bounding box outputs
    masks = result.masks  # Masks object for segmentation masks outputs
    keypoints = result.keypoints  # Keypoints object for pose outputs
    probs = result.probs  # Probs object for classification outputs
    obb = result.obb  # Oriented boxes object for OBB outputs
    
    result.show()


0: 320x640 1 person, 11 chairs, 1 dining table, 1 clock, 339.0ms
Speed: 9.1ms preprocess, 339.0ms inference, 95.4ms postprocess per image at shape (1, 3, 320, 640)
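
The summary line above (`1 person, 11 chairs, ...`) can be reproduced from the result itself. The sketch below assumes the class indices come from `result.boxes.cls` and the index-to-name mapping from `result.names`; the helper `count_classes` and the sample values are hypothetical, so the block runs without a model.

```python
from collections import Counter


def count_classes(class_ids, names):
    """Count detections per class name.

    class_ids: iterable of class indices (floats or ints), one per detected box.
    names: dict mapping class index to a human-readable class name.
    """
    return dict(Counter(names[int(i)] for i in class_ids))


# With a real result this would be:
#   count_classes(result.boxes.cls.tolist(), result.names)
# The values below are made-up sample data, not output from the model.
example_names = {0: "person", 56: "chair", 60: "dining table"}
example_ids = [0.0, 56.0, 56.0, 60.0]
print(count_classes(example_ids, example_names))
```

Working on plain lists keeps the helper independent of the tensor type and the device the model runs on.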


### Predict the video

Taking a recording of *Zenless Zone Zero* as an example, we run the model on every frame of the video.

In [8]:
video = cv2.VideoCapture('./yolo/zzz.mp4')

while True:
    ret, frame = video.read()
    if not ret:
        break

    results = model(frame)
    
    for result in results:
        boxes = result.boxes
        masks = result.masks
        keypoints = result.keypoints
        probs = result.probs
        obb = result.obb
        
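        # result.boxes is a Boxes object; iterating over it yields one box at a time
        if boxes is not None: 
            for box in boxes:
                # xywh holds the box centre (x, y) followed by its width and height
                x, y, w, h = map(int, box.xywh.to('cpu')[0])
                cv2.rectangle(frame, (x - w // 2, y - h // 2), (x + w // 2, y + h // 2), (0, 255, 0), 2)  # Draw bounding box

        # result.keypoints is only set for pose models; each item covers one detected instance
        if keypoints is not None:
            for keypoint in keypoints:
                for x, y in keypoint.xy[0].to('cpu'):
                    cv2.circle(frame, (int(x), int(y)), 5, (0, 0, 255), -1)  # Draw keypoints
    
        # result.masks is only set for segmentation models
        if masks is not None:
            for mask in masks:
                # Resize the mask to the frame size, then threshold it into a boolean array
                m = cv2.resize(mask.data[0].cpu().numpy(), (frame.shape[1], frame.shape[0])) > 0.5
                frame[m] = (0, 255, 255)  # Apply mask to the frame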

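    # Display the resulting frame
    cv2.imshow('Frame', frame)
    # Poll the keyboard once per frame; press 'q' to stop early
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release the video and close the window once the loop has finished
video.release()
cv2.destroyAllWindows()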


0: 288x640 1 person, 1 backpack, 6 tvs, 1 laptop, 1 keyboard, 80.2ms
Speed: 2.3ms preprocess, 80.2ms inference, 6.4ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 backpack, 1 bottle, 8 tvs, 1 laptop, 42.8ms
Speed: 1.7ms preprocess, 42.8ms inference, 5.9ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 bottle, 10 tvs, 1 laptop, 44.3ms
Speed: 2.5ms preprocess, 44.3ms inference, 6.6ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 bottle, 11 tvs, 1 laptop, 1 mouse, 43.1ms
Speed: 1.6ms preprocess, 43.1ms inference, 5.3ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 bottle, 1 cup, 9 tvs, 1 laptop, 43.4ms
Speed: 2.1ms preprocess, 43.4ms inference, 5.7ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 bottle, 10 tvs, 1 laptop, 1 book, 41.3ms
Speed: 1.8ms preprocess, 41.3ms inference, 6.4ms postprocess per image at shape (1, 3, 288, 640)

0: 288x640 1 person, 1 bottl
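
The rectangle drawn in the loop relies on converting a centre-based `xywh` box into its two corner points. As a minimal sketch of that arithmetic, the hypothetical helper `xywh_to_corners` below works on plain numbers so it can be checked without a model or a video.

```python
def xywh_to_corners(cx, cy, w, h):
    """Convert a box given by its centre (cx, cy), width and height
    into the (top-left, bottom-right) integer corners that cv2.rectangle expects."""
    x1 = int(cx - w / 2)
    y1 = int(cy - h / 2)
    x2 = int(cx + w / 2)
    y2 = int(cy + h / 2)
    return (x1, y1), (x2, y2)


# A 40x20 box centred at (100, 50)
top_left, bottom_right = xywh_to_corners(100, 50, 40, 20)
print(top_left, bottom_right)
```

The `Boxes` object also exposes the corners directly as `box.xyxy`, which avoids the conversion altogether.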

Alternatively, we can use the `yolo` command to detect objects in images and videos.

In [11]:
!yolo detect predict source=yolo/zzz.mp4 > ignore.txt