<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/computer-vision-case-studies/object-detections/yolo-implementations/yolo-v1/implementing_YOLOV1_from_scratch_using_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Implementing YOLOV1 from scratch using Keras

In this notebook I am going to implement YOLOV1 as described in the paper [You Only Look Once](https://arxiv.org/abs/1506.02640). The goal is to replicate the model as described in the paper and in the process, understand the nuances of using Keras on a complex problem.

<img src='https://www.maskaravivek.com/post/yolov1/featured_hu2959f475cef1ef9098f72ca1a1294bd8_186245_720x0_resize_lanczos_2.png?raw=1' width='800'/>

**Reference**

[Implementing YOLOV1 from scratch using Keras Tensorflow 2.0](https://www.maskaravivek.com/post/yolov1/)

##Setup

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer, Dropout, Flatten, Reshape
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import ModelCheckpoint


import argparse
import xml.etree.ElementTree as ET
import os

import cv2 as cv
import numpy as np

import matplotlib.pyplot as plt    # for plotting the images
%matplotlib inline

I will use the [VOC 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/) dataset because its size is manageable, which makes it easy to run on Google Colab.

First, I download and extract the dataset.

In [None]:
!wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
!wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar

!tar xvf VOCtrainval_06-Nov-2007.tar
!tar xvf VOCtest_06-Nov-2007.tar

!rm VOCtrainval_06-Nov-2007.tar
!rm VOCtest_06-Nov-2007.tar

##Data Preprocessing

Next, I process the annotations and write the labels to a text file, which is easier to consume than XML. Each line of the text file holds an image path followed by one `xmin,ymin,xmax,ymax,class_id` group per object.

In [3]:
parser = argparse.ArgumentParser(description='Build Annotations.')
parser.add_argument('dir', default='..', help='Annotations.')

sets = [('2007', 'train'), ('2007', 'val'), ('2007', 'test')]

classes_num = {
    'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
    'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
    'horse': 12, 'motorbike': 13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
    'sofa': 17, 'train': 18, 'tvmonitor': 19
}

In [10]:
def convert_annotation(year, image_id, f):
  in_file = os.path.join("VOCdevkit/VOC%s/Annotations/%s.xml" % (year, image_id))
  tree = ET.parse(in_file)
  root = tree.getroot()

  for obj in root.iter("object"):
    difficult = obj.find("difficult").text
    cls = obj.find("name").text

    # Skip classes we do not track and objects flagged as difficult.
    if cls not in classes_num or int(difficult) == 1:
      continue

    cls_id = classes_num[cls]
    xmlbox = obj.find("bndbox")

    b = (int(xmlbox.find("xmin").text), int(xmlbox.find("ymin").text), int(xmlbox.find("xmax").text), int(xmlbox.find("ymax").text))
    f.write(" " + ",".join([str(a) for a in b]) + "," + str(cls_id))

In [11]:
for year, image_set in sets:
  print(year, image_set)
  with open(os.path.join("VOCdevkit/VOC%s/ImageSets/Main/%s.txt" % (year, image_set)), "r") as f:
    image_ids = f.read().strip().split()
  with open(os.path.join("VOCdevkit", "%s_%s.txt" % (year, image_set)), "w") as f:
    for image_id in image_ids:
      f.write("%s/VOC%s/JPEGImages/%s.jpg" % ("VOCdevkit", year, image_id))
      convert_annotation(year, image_id, f)
      f.write("\n")

2007 train
2007 val
2007 test
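To make the file format concrete, here is a minimal sketch of reading one line back into an image path and a list of boxes. The helper name `parse_label_line` and the sample line are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
def parse_label_line(line):
    """Split one line of the label file into an image path and a list of boxes."""
    parts = line.strip().split()
    image_path = parts[0]
    # Every remaining token has the form "xmin,ymin,xmax,ymax,class_id".
    boxes = [tuple(int(v) for v in token.split(",")) for token in parts[1:]]
    return image_path, boxes


# Hypothetical line in the same layout the cell above writes (values are made up).
sample = "VOCdevkit/VOC2007/JPEGImages/example.jpg 10,20,110,220,11 30,40,200,300,14"
path, boxes = parse_label_line(sample)
print(path)
print(boxes)
```

An image without any usable object produces a line with only the path, in which case `boxes` is an empty list.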


Next, I add a function to prepare the input and the output. The input is a `(448, 448, 3)` image and the output is a `(7, 7, 30)` tensor. The output shape follows from `S x S x (B * 5 + C)`.

`S x S` is the number of grid cells, `B` is the number of bounding boxes predicted per grid cell, and `C` is the number of classes. With `S = 7`, `B = 2`, and `C = 20` (the Pascal VOC classes), this gives `7 x 7 x 30`.
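As a quick check of the arithmetic, the sketch below computes the output shape from the values used in the paper for Pascal VOC (`S = 7`, `B = 2`, `C = 20`). The channel layout at the end is an assumption read off the label-building code in this notebook (classes first, then one box, then its response flag), not something the formula itself dictates.

```python
S = 7   # grid cells per side
B = 2   # bounding boxes predicted per grid cell
C = 20  # number of Pascal VOC classes

# Each box contributes 5 values: x, y, w, h and a confidence score.
depth = B * 5 + C
output_shape = (S, S, depth)
print(output_shape)  # (7, 7, 30)

# Channel layout assumed from this notebook's label matrix.
class_slice = slice(0, C)      # channels 0..19: one-hot class
box_slice = slice(C, C + 4)    # channels 20..23: x, y, w, h
response_index = C + 4         # channel 24: 1 if the cell contains an object
```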

In [12]:
def read(image_path, label):
  image = cv.imread(image_path)
  image = cv.cvtColor(image, cv.COLOR_BGR2RGB)  # OpenCV loads images as BGR
  image_h, image_w = image.shape[0:2]
  image = cv.resize(image, (448, 448))
  image = image / 255.

  label_matrix = np.zeros([7, 7, 30])
  for l in label:
    l = l.split(",")
    l = np.array(l, dtype=int)  # np.int was removed in recent NumPy releases

    xmin = l[0]
    ymin = l[1]
    xmax = l[2]
    ymax = l[3]

    cls = l[4]

    x = (xmin + xmax) / 2 / image_w
    y = (ymin + ymax) / 2 / image_h
    w = (xmax - xmin) / image_w
    h = (ymax - ymin) / image_h

    loc = [7 * x, 7 * y]
    loc_i = int(loc[1])
    loc_j = int(loc[0])

    y = loc[1] - loc_i
    x = loc[0] - loc_j

    # Only the first object whose center falls in a cell is recorded.
    if label_matrix[loc_i, loc_j, 24] == 0:
      label_matrix[loc_i, loc_j, cls] = 1  # one-hot class
      label_matrix[loc_i, loc_j, 20:24] = [x, y, w, h]
      label_matrix[loc_i, loc_j, 24] = 1  # response

  return image, label_matrix
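The grid encoding in `read` can be followed on a single box. The sketch below repeats the same arithmetic in a standalone helper; the name `encode_box` and the sample coordinates are hypothetical and chosen only to give round numbers.

```python
def encode_box(xmin, ymin, xmax, ymax, image_w, image_h, S=7):
    """Map a pixel-space box to (row, col, x_offset, y_offset, w, h) on an S x S grid."""
    # Box center and size, normalized to [0, 1] by the image dimensions.
    x = (xmin + xmax) / 2 / image_w
    y = (ymin + ymax) / 2 / image_h
    w = (xmax - xmin) / image_w
    h = (ymax - ymin) / image_h

    # The cell that contains the center, and the center's offset inside that cell.
    col = int(S * x)
    row = int(S * y)
    return row, col, S * x - col, S * y - row, w, h


# Made-up box on a 448 x 448 image.
row, col, x_off, y_off, w, h = encode_box(100, 200, 300, 400, image_w=448, image_h=448)
print(row, col, round(x_off, 4), round(y_off, 4))  # 4 3 0.125 0.6875
```

Note that `x` and `y` end up relative to the grid cell, while `w` and `h` stay relative to the whole image, which matches how the label matrix is filled above.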

##Defining custom generator

Next, I am defining a custom generator that returns a batch of input and outputs.