# **BOOTCAMP @ GIKI (Content designed by Usama Arshad) WEEK 3**

---



Week 3: Day 14 - Foundations of CNNs (Object Detection - YOLO)

## YOLO: You Only Look Once

### What is YOLO?

YOLO (You Only Look Once) is a popular real-time object detection algorithm. It can detect multiple objects in images or videos and draw bounding boxes around them. YOLO is known for its speed and accuracy, making it suitable for real-time applications.

### How Does YOLO Work?

YOLO treats object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Here's a simplified explanation:

1. **Input Image**: The input image is divided into a grid of S x S cells.
2. **Bounding Boxes**: Each grid cell predicts B bounding boxes and their confidence scores.
3. **Class Probabilities**: Each grid cell also predicts class probabilities for the object.
4. **Combining Predictions**: YOLO combines these predictions to produce final bounding boxes with associated class labels.
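
As a rough illustration of the four steps above, the sketch below assumes the settings from the original YOLOv1 paper (S = 7, B = 2, C = 20, none of which are stated in this notebook) and uses random numbers in place of a real network output. The layout of each cell vector (boxes first, then class probabilities) and the helper name `decode_cell` are assumptions made for illustration only.

```python
import numpy as np

S, B, C = 7, 2, 20   # assumed grid size, boxes per cell, number of classes
rng = np.random.default_rng(0)
output = rng.random((S, S, B * 5 + C))   # stand-in for the network output

def decode_cell(output, row, col):
    """Split one grid cell's vector into B boxes and C class probabilities."""
    cell = output[row, col]
    boxes = cell[:B * 5].reshape(B, 5)   # each box: x, y, w, h, confidence
    class_probs = cell[B * 5:]           # shared by all boxes in this cell
    # Class-specific score = box confidence * class probability
    scores = boxes[:, 4:5] * class_probs[np.newaxis, :]
    return boxes, class_probs, scores

boxes, class_probs, scores = decode_cell(output, 3, 3)
print(output.shape, boxes.shape, class_probs.shape, scores.shape)
# (7, 7, 30) (2, 5) (20,) (2, 20)
```

The final detections are then obtained by thresholding these scores and removing overlapping boxes.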

### Advantages of YOLO

- **Speed**: YOLO is incredibly fast because it makes predictions with a single network pass.
- **Accuracy**: Because the network sees the whole image at once, it uses global context when predicting, which helps reduce false detections on background regions.
- **Simplicity**: YOLO's design is simple, making it easy to understand and implement.

### Different Versions of YOLO

Over the years, YOLO has evolved through several versions, each improving upon the last.

#### YOLOv1

- **Introduction**: The first version of YOLO, introduced in 2016.
- **Architecture**: Uses a single convolutional neural network (CNN) to predict bounding boxes and class probabilities directly.
- **Speed**: Achieves real-time processing speeds.

#### YOLOv2 (YOLO9000)

- **Introduction**: Released in 2017, also known as YOLO9000.
- **Improvements**:
  - **Batch Normalization**: Helps in stabilizing the training process and improving accuracy.
  - **Anchor Boxes**: Introduces anchor boxes for better bounding box predictions.
  - **Multi-Scale Training**: Trains the model at different scales to improve performance.
- **Speed and Accuracy**: Better balance between speed and accuracy compared to YOLOv1.
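
To make the anchor box idea more concrete, here is a minimal sketch of how a YOLOv2-style prediction can be decoded relative to a grid cell and an anchor (prior) box. The formulas follow the YOLOv2 paper; the function name `decode_anchor_box` and the sample numbers are hypothetical, and all values are in grid-cell units.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_anchor_box(t, cell, anchor):
    """Turn raw outputs (tx, ty, tw, th) into a box in grid-cell units."""
    tx, ty, tw, th = t
    cx, cy = cell      # top-left corner of the grid cell
    pw, ph = anchor    # anchor (prior) width and height
    bx = sigmoid(tx) + cx      # the sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # width and height rescale the anchor
    bh = ph * math.exp(th)
    return bx, by, bw, bh

print(decode_anchor_box((0.0, 0.0, 0.0, 0.0), cell=(4, 6), anchor=(3.0, 2.0)))
# (4.5, 6.5, 3.0, 2.0)
```

With all raw outputs at zero, the box sits at the center of its cell and has exactly the anchor's size, which is why well-chosen anchors give the network a good starting point.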

#### YOLOv3

- **Introduction**: Released in 2018.
- **Improvements**:
  - **Darknet-53 Backbone**: Uses a deeper network (53 convolutional layers) for feature extraction.
  - **Multi-Scale Predictions**: Makes predictions at three different scales to detect objects of various sizes.
  - **Improved Bounding Box Predictions**: Enhances the accuracy of bounding box predictions.
- **Performance**: Improved performance and accuracy over YOLOv2.
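
The three prediction scales can be sketched numerically. The example below assumes the usual YOLOv3 setup (strides of 32, 16, and 8 with three anchors per scale) and the 416x416 input size used in the code later in this notebook; the function names are hypothetical.

```python
def yolov3_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Grid width/height at each of the three detection scales."""
    return [input_size // s for s in strides]

def total_predictions(input_size=416, anchors_per_scale=3):
    """Number of candidate boxes produced before thresholding and NMS."""
    return sum(anchors_per_scale * g * g for g in yolov3_grid_sizes(input_size))

print(yolov3_grid_sizes())    # [13, 26, 52]
print(total_predictions())    # 10647
```

The coarse 13x13 grid is mainly responsible for large objects, and the fine 52x52 grid for small ones. With the 80 COCO classes, each candidate box is a vector of 85 values (4 box coordinates, 1 objectness score, 80 class scores).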

#### YOLOv4

- **Introduction**: Released in 2020.
- **Improvements**:
  - **CSPDarknet53 Backbone**: Uses Cross-Stage Partial connections for better feature extraction.
  - **Bag of Freebies (BoF)**: Includes various training techniques that improve accuracy without increasing inference time.
  - **Bag of Specials (BoS)**: Includes additional layers and modules that enhance performance.
- **State-of-the-Art**: Achieved state-of-the-art performance in real-time object detection at the time of its release.

#### YOLOv5

- **Introduction**: Developed by Ultralytics and released in 2020.
- **Improvements**:
  - **PyTorch Implementation**: Implements YOLO in the PyTorch framework, making it more accessible.
  - **Ease of Use**: Provides a user-friendly interface and pre-trained models for quick deployment.
  - **Enhanced Features**: Includes various improvements in training techniques and model architecture.
- **Popularity**: Widely adopted due to its ease of use and integration with PyTorch.

#### YOLOv6 and Beyond

- **Continued Evolution**: YOLO continues to evolve with new versions being developed, incorporating advancements in neural network architectures and training techniques.
- **Focus Areas**: Emphasis on improving accuracy, speed, and robustness in real-world applications.

### Applications of YOLO

YOLO is used in a variety of applications due to its speed and accuracy:

- **Autonomous Vehicles**: Detecting pedestrians, vehicles, and other objects in real-time.
- **Surveillance**: Monitoring security cameras for suspicious activities.
- **Robotics**: Enabling robots to perceive and interact with their environment.
- **Healthcare**: Assisting in medical imaging analysis.
- **Augmented Reality**: Enhancing real-world experiences with virtual objects.

### Summary

YOLO has revolutionized the field of object detection with its real-time capabilities and high accuracy. Each version of YOLO has brought significant improvements, making it a versatile and powerful tool for various applications. Understanding the different versions and their enhancements helps in selecting the right model for specific use cases.



![Yolo Main Structure](https://projectgurukul.org/wp-content/uploads/2022/01/yolo-cnn.webp)


In [1]:
# tkinter ships with Python's standard library; the PyPI package named "tk" is not Tkinter
pip install opencv-python pillow numpy



**YOLOv3 CFG file:**
https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg

**YOLOv3 names file:**
https://github.com/pjreddie/darknet/blob/master/data/coco.names

**YOLOv3 weights:**
https://github.com/patrick013/Object-Detection---Yolov3/blob/master/model/yolov3.weights



---


**Note: This code must be run in a local (offline) IDE, because it opens Tkinter and OpenCV windows and accesses the webcam.**

In [None]:
import tkinter as tk
from tkinter import filedialog, messagebox
from PIL import Image, ImageTk
import cv2
import numpy as np
import threading

class YOLOFaceDetectionApp:
    def __init__(self, root):
        self.root = root
        self.root.title("YOLO Object Detection and Filters")

        self.weights_path = ""
        self.cfg_path = ""
        self.names_path = ""

        self.panel = tk.Label(root)
        self.panel.pack(padx=10, pady=10)

        btn_frame = tk.Frame(root)
        btn_frame.pack(fill=tk.X, pady=10)

        btn_select_weights = tk.Button(btn_frame, text="Select Weights", command=self.select_weights)
        btn_select_weights.pack(side=tk.LEFT, padx=10)

        btn_select_cfg = tk.Button(btn_frame, text="Select CFG", command=self.select_cfg)
        btn_select_cfg.pack(side=tk.LEFT, padx=10)

        btn_select_names = tk.Button(btn_frame, text="Select Names", command=self.select_names)
        btn_select_names.pack(side=tk.LEFT, padx=10)

        btn_select_image = tk.Button(btn_frame, text="Select Image", command=self.select_image)
        btn_select_image.pack(side=tk.LEFT, padx=10)

        btn_select_video = tk.Button(btn_frame, text="Select Video", command=self.select_video)
        btn_select_video.pack(side=tk.LEFT, padx=10)

        btn_live_video = tk.Button(btn_frame, text="Live Video", command=self.toggle_live_video)
        btn_live_video.pack(side=tk.LEFT, padx=10)

        btn_detect_objects = tk.Button(btn_frame, text="Detect Objects", command=self.toggle_detect_objects)
        btn_detect_objects.pack(side=tk.LEFT, padx=10)

        btn_edge_detection = tk.Button(btn_frame, text="Edge Detection", command=self.toggle_edge_detection)
        btn_edge_detection.pack(side=tk.LEFT, padx=10)

        btn_sharpen = tk.Button(btn_frame, text="Sharpen", command=self.toggle_sharpen)
        btn_sharpen.pack(side=tk.LEFT, padx=10)

        btn_blur = tk.Button(btn_frame, text="Blur", command=self.toggle_blur)
        btn_blur.pack(side=tk.LEFT, padx=10)

        btn_sepia = tk.Button(btn_frame, text="Sepia", command=self.toggle_sepia)
        btn_sepia.pack(side=tk.LEFT, padx=10)

        btn_negative = tk.Button(btn_frame, text="Negative", command=self.toggle_negative)
        btn_negative.pack(side=tk.LEFT, padx=10)

        btn_cartoon = tk.Button(btn_frame, text="Cartoon", command=self.toggle_cartoon)
        btn_cartoon.pack(side=tk.LEFT, padx=10)

        self.image_path = None
        self.video_path = None
        self.image = None
        self.video_capture = None
        self.net = None
        self.classes = None
        self.output_layers = None
        self.filter_mode = None
        self.detect_objects_flag = False
        self.running = False
        self.thread = None

    def select_weights(self):
        self.weights_path = filedialog.askopenfilename()
        self.load_yolo()

    def select_cfg(self):
        self.cfg_path = filedialog.askopenfilename()
        self.load_yolo()

    def select_names(self):
        self.names_path = filedialog.askopenfilename()
        self.load_yolo()

    def load_yolo(self):
        if self.weights_path and self.cfg_path and self.names_path:
            try:
                self.net = cv2.dnn.readNet(self.weights_path, self.cfg_path)
                self.layer_names = self.net.getLayerNames()
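                # getUnconnectedOutLayers() returns differently shaped index arrays across
                # OpenCV versions; asking for the layer names directly avoids that problem
                self.output_layers = self.net.getUnconnectedOutLayersNames()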
                with open(self.names_path, "r") as f:
                    self.classes = [line.strip() for line in f.readlines()]
                messagebox.showinfo("YOLO", "YOLO model loaded successfully.")
            except Exception as e:
                messagebox.showerror("YOLO Error", f"Error loading YOLO: {e}")

    def select_image(self):
        self.image_path = filedialog.askopenfilename()
        if self.image_path:
            self.load_image()

    def select_video(self):
        self.video_path = filedialog.askopenfilename()
        if self.video_path:
            if self.running:
                self.stop_running()
            else:
                self.running = True
                self.thread = threading.Thread(target=self.detect_objects_video)
                self.thread.start()

    def toggle_live_video(self):
        if self.running:
            self.stop_running()
        else:
            self.video_capture = cv2.VideoCapture(0)
            self.running = True
            self.thread = threading.Thread(target=self.show_live_video)
            self.thread.start()

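    def load_image(self):
        image = cv2.imread(self.image_path)
        if image is None:
            # cv2.imread returns None (it does not raise) for unreadable or non-image files
            messagebox.showerror("Image Error", f"Could not read image: {self.image_path}")
            return
        # OpenCV loads BGR; PIL and Tkinter expect RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageTk.PhotoImage(image)

        self.panel.config(image=image)
        # Keep a reference so the image is not garbage-collected
        self.panel.image = image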

    def toggle_detect_objects(self):
        self.detect_objects_flag = not self.detect_objects_flag
        if self.image_path:
            self.apply_filter_to_image()

    def toggle_edge_detection(self):
        if self.filter_mode == "edge":
            self.filter_mode = None
        else:
            self.filter_mode = "edge"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def toggle_sharpen(self):
        if self.filter_mode == "sharpen":
            self.filter_mode = None
        else:
            self.filter_mode = "sharpen"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def toggle_blur(self):
        if self.filter_mode == "blur":
            self.filter_mode = None
        else:
            self.filter_mode = "blur"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def toggle_sepia(self):
        if self.filter_mode == "sepia":
            self.filter_mode = None
        else:
            self.filter_mode = "sepia"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def toggle_negative(self):
        if self.filter_mode == "negative":
            self.filter_mode = None
        else:
            self.filter_mode = "negative"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def toggle_cartoon(self):
        if self.filter_mode == "cartoon":
            self.filter_mode = None
        else:
            self.filter_mode = "cartoon"
        if self.image_path:
            self.apply_filter_to_image()
        if not self.running:
            self.toggle_live_video()

    def _detect_objects(self, frame):
        return self.apply_yolo(frame)

    def detect_objects_video(self):
        cap = cv2.VideoCapture(self.video_path)
        while self.running:
            ret, frame = cap.read()
            if not ret:
                break
            if self.detect_objects_flag:
                frame = self.apply_yolo(frame)
            if self.filter_mode == "edge":
                frame = cv2.Canny(frame, 100, 200)
            elif self.filter_mode == "sharpen":
                kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
                frame = cv2.filter2D(frame, -1, kernel)
            elif self.filter_mode == "blur":
                frame = cv2.GaussianBlur(frame, (15, 15), 0)
            elif self.filter_mode == "sepia":
                frame = self.apply_sepia(frame)
            elif self.filter_mode == "negative":
                frame = self.apply_negative(frame)
            elif self.filter_mode == "cartoon":
                frame = self.apply_cartoon(frame)
            cv2.imshow("Video", frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        cap.release()
        cv2.destroyAllWindows()
        self.running = False

    def show_live_video(self):
        while self.running:
            ret, frame = self.video_capture.read()
            if not ret:
                break
            if self.detect_objects_flag:
                frame = self.apply_yolo(frame)
            if self.filter_mode == "edge":
                frame = cv2.Canny(frame, 100, 200)
            elif self.filter_mode == "sharpen":
                kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
                frame = cv2.filter2D(frame, -1, kernel)
            elif self.filter_mode == "blur":
                frame = cv2.GaussianBlur(frame, (15, 15), 0)
            elif self.filter_mode == "sepia":
                frame = self.apply_sepia(frame)
            elif self.filter_mode == "negative":
                frame = self.apply_negative(frame)
            elif self.filter_mode == "cartoon":
                frame = self.apply_cartoon(frame)
            cv2.imshow("Live Video", frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        self.stop_running()

    def stop_running(self):
        self.running = False
        self.detect_objects_flag = False
        self.filter_mode = None
        if self.video_capture is not None:
            self.video_capture.release()
            self.video_capture = None
        cv2.destroyAllWindows()

    def apply_yolo(self, image):
        if not self.net:
            messagebox.showerror("YOLO Error", "YOLO model is not loaded. Please load the model first.")
            return image

        height, width, channels = image.shape
        blob = cv2.dnn.blobFromImage(image, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
        self.net.setInput(blob)
        outs = self.net.forward(self.output_layers)

        class_ids = []
        confidences = []
        boxes = []

        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)
                    # x, y, w, h are already ints, so no extra type check is needed
                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(int(class_id))

        # Non-maximum suppression: score threshold 0.5, overlap (IoU) threshold 0.4
        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        # The return shape differs across OpenCV versions (and is empty when
        # nothing is kept), so flatten it before iterating
        for i in np.array(indexes).flatten():
            x, y, w, h = boxes[i]
            label = str(self.classes[class_ids[i]])
            confidence = confidences[i]
            color = (0, 255, 0)
            cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
            cv2.putText(image, f"{label} {confidence:.2f}", (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
        return image

    def apply_filter_to_image(self):
        image = cv2.imread(self.image_path)
        if self.detect_objects_flag:
            image = self.apply_yolo(image)
        if self.filter_mode == "edge":
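            # Canny returns a single-channel image; convert it back to 3 channels
            # so the BGR-to-RGB conversion further down does not fail
            image = cv2.cvtColor(cv2.Canny(image, 100, 200), cv2.COLOR_GRAY2BGR)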
        elif self.filter_mode == "sharpen":
            kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
            image = cv2.filter2D(image, -1, kernel)
        elif self.filter_mode == "blur":
            image = cv2.GaussianBlur(image, (15, 15), 0)
        elif self.filter_mode == "sepia":
            image = self.apply_sepia(image)
        elif self.filter_mode == "negative":
            image = self.apply_negative(image)
        elif self.filter_mode == "cartoon":
            image = self.apply_cartoon(image)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageTk.PhotoImage(image)

        self.panel.config(image=image)
        self.panel.image = image

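    def apply_sepia(self, image):
        # Rows produce the output B, G, R channels; columns weight the input
        # B, G, R channels, because OpenCV frames are in BGR order
        sepia_filter = np.array([[0.131, 0.534, 0.272],
                                 [0.168, 0.686, 0.349],
                                 [0.189, 0.769, 0.393]])
        image = cv2.transform(image, sepia_filter)
        image = np.clip(image, 0, 255)
        return image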

    def apply_negative(self, image):
        return cv2.bitwise_not(image)

    def apply_cartoon(self, image):
        # Convert to gray scale
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Apply median blur
        gray = cv2.medianBlur(gray, 5)
        # Detect edges in gray image
        edges = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                      cv2.THRESH_BINARY, 9, 9)
        # Apply bilateral filter to color image
        color = cv2.bilateralFilter(image, 9, 300, 300)
        # Combine color image with edges
        cartoon = cv2.bitwise_and(color, color, mask=edges)
        return cartoon

if __name__ == "__main__":
    root = tk.Tk()
    app = YOLOFaceDetectionApp(root)
    root.mainloop()
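
**What does the NMS step do?**

The `cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)` call in `apply_yolo` removes duplicate detections of the same object. The sketch below is a simplified, pure-Python version of greedy non-maximum suppression, written only to illustrate the idea; it is not OpenCV's actual implementation, and the names `iou` and `simple_nms` are hypothetical. Boxes use the same `[x, y, w, h]` layout as the code above.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def simple_nms(boxes, scores, score_threshold=0.5, nms_threshold=0.4):
    """Keep the best-scoring boxes, dropping any that overlap a kept box too much."""
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    order = sorted(candidates, key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [200, 200, 50, 50]]
scores = [0.9, 0.8, 0.7]
print(simple_nms(boxes, scores))   # [0, 2]
```

The second box overlaps the first almost completely (IoU of about 0.92), so it is suppressed, while the third box does not overlap either and is kept.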


# Some Questions:
**Why Convert the Image to RGB?**

OpenCV loads images in BGR (Blue, Green, Red) channel order by default. However, many deep learning models, including the Darknet YOLOv3 model used here, expect input in RGB (Red, Green, Blue) order, although some models (notably many trained with Caffe) expect BGR instead. Converting from BGR to RGB ensures the channels match what the model was trained on. In this notebook the same conversion is also applied before display, because PIL and Tkinter expect RGB.
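
As a small illustration, the BGR-to-RGB conversion is just a reversal of the channel axis. The sketch below uses plain NumPy on a made-up two-pixel image; for 3-channel images it gives the same pixel values as `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)`.

```python
import numpy as np

# A 1x2 "image" in OpenCV's BGR order: one pure-blue pixel, one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

rgb = bgr[..., ::-1]   # reverse the last (channel) axis
print(rgb[0, 0].tolist(), rgb[0, 1].tolist())
# [0, 0, 255] [255, 0, 0]
```

If this step is skipped, a model or display library that expects RGB will see red objects as blue and vice versa.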

**Why Perform Other Preprocessing Steps?**

* Normalization (scalefactor):
  * Reason: Models generally perform better with normalized inputs. Normalization often involves scaling pixel values to a range of [0, 1] or [-1, 1].
  * Implementation: scalefactor=0.00392 (approximately 1/255) scales pixel values from [0, 255] to approximately [0, 1].
* Resizing (size):
  * Reason: Neural networks expect input images of a fixed size. Resizing ensures that the image dimensions match the expected input size of the model.
  * Implementation: size=(416, 416) resizes the image to 416x416 pixels.
* Mean Subtraction (mean):
  * Reason: Mean subtraction is a normalization technique that can improve model performance by centering the data around zero.
  * Implementation: mean=(0, 0, 0) means no mean subtraction is applied here, but the parameter could be used to subtract the average pixel value of each channel.
* Channel Swapping (swapRB):
  * Reason: Converts the image from BGR to RGB order.
  * Implementation: swapRB=True swaps the red and blue channels.
* Cropping (crop):
  * Reason: Controls how the image is fitted to the model's input size.
  * Implementation: crop=False resizes the image directly to 416x416 without cropping, so the aspect ratio is not preserved. With crop=True, the image would instead be resized to cover the target size and then cropped from the center.
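
To tie these parameters together, here is a minimal NumPy sketch of roughly what `cv2.dnn.blobFromImage` does with the settings used in `apply_yolo`. It assumes the image is already 416x416 (the resize step is left out), and the function name `manual_blob` is hypothetical; it is meant as an illustration, not a replacement for the OpenCV function.

```python
import numpy as np

def manual_blob(image_bgr, scalefactor=1 / 255.0, mean=(0, 0, 0), swap_rb=True):
    """Approximate blobFromImage for an image that already has the target size."""
    img = image_bgr.astype(np.float32)
    if swap_rb:
        img = img[..., ::-1]                               # BGR -> RGB
    img = (img - np.array(mean, dtype=np.float32)) * scalefactor
    blob = img.transpose(2, 0, 1)[np.newaxis, ...]         # HWC -> NCHW with batch of 1
    return np.ascontiguousarray(blob)

frame = np.zeros((416, 416, 3), dtype=np.uint8)
frame[..., 0] = 255              # pure blue in BGR order
blob = manual_blob(frame)
print(blob.shape, blob.dtype)    # (1, 3, 416, 416) float32
```

After the channel swap, the blue values end up in the last of the three channel planes, and every pixel value lies in approximately [0, 1].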


