TL;DR Learn how to build a custom dataset for YOLO v5 (darknet compatible) and use it to fine-tune a large object detection model. The model will be ready for real-time object detection on mobile devices.
In this tutorial, you'll learn how to fine-tune a pre-trained YOLO v5 model for detecting and classifying clothing items from images.
Here's what we'll go over:
- Install required libraries
- Build a custom dataset in YOLO/darknet format
- Learn about YOLO model family history
- Fine-tune the largest YOLO v5 model
- Evaluate the model
- Look at some predictions
How good is our final model going to be?
Let's start by installing some libraries required by the YOLOv5 project:
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install numpy==1.17
!pip install PyYAML==5.3.1
!pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
We'll also install Apex by NVIDIA to speed up the training of our model (this step is optional):
!git clone https://github.com/NVIDIA/apex && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user && cd .. && rm -rf apex
The dataset contains annotations for clothing items - bounding boxes around shirts, tops, jackets, sunglasses, and more. It comes from DataTurks and is available on Kaggle.
!gdown --id 1uWdQ2kn25RSQITtBHa9_zayplm27IXNC
The dataset contains a single JSON file with URLs to all images and bounding box data.
Let's import all required libraries:
from pathlib import Path
from tqdm import tqdm
import numpy as np
import json
import urllib.request
import PIL.Image as Image
import cv2
import torch
import torchvision
from IPython.display import display
from sklearn.model_selection import train_test_split
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline
%config InlineBackend.figure_format='retina'
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
rcParams['figure.figsize'] = 16, 10
np.random.seed(42)
Each line in the dataset file contains a JSON object. Let's create a list of all annotations:
clothing = []
with open("clothing.json") as f:
  for line in f:
    clothing.append(json.loads(line))
Here's an example annotation:
clothing[0]
{'annotation': [{'imageHeight': 312,
'imageWidth': 147,
'label': ['Tops'],
'notes': '',
'points': [{'x': 0.02040816326530612, 'y': 0.2532051282051282},
{'x': 0.9931972789115646, 'y': 0.8108974358974359}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/4bb8fd9d-8d52-46c7-aa2a-9c18af10aed6___Data_xxl-top-4437-jolliy-original-imaekasxahykhd3t.jpeg',
'extras': None}
We have the labels, image dimensions, bounding box points (normalized to the 0-1 range), and a URL to the image file. To recover pixel coordinates, you multiply by the image dimensions - e.g., an x of 0.0204 on this 147-pixel-wide image is pixel column 3.
Do we have images with multiple annotations?
for c in clothing:
  if len(c['annotation']) > 1:
    display(c)
{'annotation': [{'imageHeight': 312,
'imageWidth': 265,
'label': ['Jackets'],
'notes': '',
'points': [{'x': 0, 'y': 0.6185897435897436},
{'x': 0.026415094339622643, 'y': 0.6185897435897436}]},
{'imageHeight': 312,
'imageWidth': 265,
'label': ['Skirts'],
'notes': '',
'points': [{'x': 0.01509433962264151, 'y': 0.03205128205128205},
{'x': 1, 'y': 0.9839743589743589}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/b3be330c-c211-45bb-b244-11aef08021c8___Data_free-sk-5108-mudrika-original-imaf4fz626pegq9f.jpeg',
'extras': None}
Just a single example. We'll need to handle it, though.
Let's get all unique categories:
categories = []
for c in clothing:
  for a in c['annotation']:
    categories.extend(a['label'])
categories = list(set(categories))
categories.sort()
categories
['Jackets',
'Jeans',
'Shirts',
'Shoes',
'Skirts',
'Tops',
'Trousers',
'Tshirts',
'sunglasses']
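Before splitting the data, it can be useful to get a feel for the class balance. Here's a minimal sketch using collections.Counter (the exact counts will depend on your copy of the dataset):

from collections import Counter

label_counts = Counter()
for c in clothing:
  for a in c['annotation']:
    label_counts.update(a['label'])

label_counts.most_common()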
We have 9 different categories. Let's split the data into a training and validation set:
train_clothing, val_clothing = train_test_split(clothing, test_size=0.1)
len(train_clothing), len(val_clothing)
(453, 51)
Let's have a look at an image from the dataset. We'll start by downloading it:
row = train_clothing[10]
img = urllib.request.urlopen(row["content"])
img = Image.open(img)
img = img.convert('RGB')
img.save("demo_image.jpeg", "JPEG")
Here's what our sample annotation looks like:
row
{'annotation': [{'imageHeight': 312,
'imageWidth': 145,
'label': ['Tops'],
'notes': '',
'points': [{'x': 0.013793103448275862, 'y': 0.22756410256410256},
{'x': 1, 'y': 0.7948717948717948}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/ec339ad6-6b73-406a-8971-f7ea35d47577___Data_s-top-203-red-srw-original-imaf2nfrxdzvhh3k.jpeg',
'extras': None}
We can use OpenCV to read the image:
img = cv2.cvtColor(cv2.imread('demo_image.jpeg'), cv2.COLOR_BGR2RGB)
img.shape
(312, 145, 3)
Let's add the bounding box on top of the image along with the label:
for a in row['annotation']:
  for label in a['label']:
    w = a['imageWidth']
    h = a['imageHeight']
    points = a['points']
    p1, p2 = points

    x1, y1 = p1['x'] * w, p1['y'] * h
    x2, y2 = p2['x'] * w, p2['y'] * h

    cv2.rectangle(
      img,
      (int(x1), int(y1)),
      (int(x2), int(y2)),
      color=(0, 255, 0),
      thickness=2
    )

    ((label_width, label_height), _) = cv2.getTextSize(
      label,
      fontFace=cv2.FONT_HERSHEY_PLAIN,
      fontScale=1.75,
      thickness=2
    )

    cv2.rectangle(
      img,
      (int(x1), int(y1)),
      (int(x1 + label_width + label_width * 0.05), int(y1 + label_height + label_height * 0.25)),
      color=(0, 255, 0),
      thickness=cv2.FILLED
    )

    cv2.putText(
      img,
      label,
      org=(int(x1), int(y1 + label_height + label_height * 0.25)), # bottom left
      fontFace=cv2.FONT_HERSHEY_PLAIN,
      fontScale=1.75,
      color=(255, 255, 255),
      thickness=2
    )
The point coordinates are converted back to pixels and used to draw rectangles over the image. Here's the result:
plt.imshow(img)
plt.axis('off');
YOLO v5 requires the dataset to be in the darknet format. Here's an outline of what it looks like:
- One .txt label file per image
- One row per object
- Each row contains:
class_index bbox_x_center bbox_y_center bbox_width bbox_height
- Box coordinates must be normalized between 0 and 1
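To make the format concrete, here's a minimal sketch of converting one corner-point annotation (like the ones in our JSON) into a darknet row - the same arithmetic the dataset helper below uses (to_darknet_row is just an illustrative name):

def to_darknet_row(category_idx, p1, p2):
  # the corner points are already normalized, so no division by image size is needed
  bbox_width = p2['x'] - p1['x']
  bbox_height = p2['y'] - p1['y']
  x_center = p1['x'] + bbox_width / 2
  y_center = p1['y'] + bbox_height / 2
  return f"{category_idx} {x_center} {y_center} {bbox_width} {bbox_height}"

to_darknet_row(categories.index('Tops'), {'x': 0.02, 'y': 0.25}, {'x': 0.99, 'y': 0.81})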
Let's create a helper function that builds a dataset in the correct format for us:
def create_dataset(clothing, categories, dataset_type):

  images_path = Path(f"clothing/images/{dataset_type}")
  images_path.mkdir(parents=True, exist_ok=True)

  labels_path = Path(f"clothing/labels/{dataset_type}")
  labels_path.mkdir(parents=True, exist_ok=True)

  for img_id, row in enumerate(tqdm(clothing)):

    image_name = f"{img_id}.jpeg"

    img = urllib.request.urlopen(row["content"])
    img = Image.open(img)
    img = img.convert("RGB")
    img.save(str(images_path / image_name), "JPEG")

    label_name = f"{img_id}.txt"

    with (labels_path / label_name).open(mode="w") as label_file:
      for a in row['annotation']:
        for label in a['label']:
          category_idx = categories.index(label)

          points = a['points']
          p1, p2 = points

          x1, y1 = p1['x'], p1['y']
          x2, y2 = p2['x'], p2['y']

          bbox_width = x2 - x1
          bbox_height = y2 - y1

          label_file.write(
            f"{category_idx} {x1 + bbox_width / 2} {y1 + bbox_height / 2} {bbox_width} {bbox_height}\n"
          )
We'll use it to create the train and validation datasets:
create_dataset(train_clothing, categories, 'train')
create_dataset(val_clothing, categories, 'val')
Let's have a look at the file structure:
!tree clothing -L 2
clothing
├── images
│   ├── train
│   └── val
└── labels
    ├── train
    └── val
6 directories, 0 files
And a single annotation example:
!cat clothing/labels/train/0.txt
4 0.525462962962963 0.5432692307692308 0.9027777777777778 0.9006410256410257
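As a quick sanity check, we can decode that row back into normalized corner coordinates (a small sketch; your 0.txt may hold a different example, since the split is shuffled):

with open("clothing/labels/train/0.txt") as f:
  class_idx, x_center, y_center, bbox_width, bbox_height = map(float, f.readline().split())

x1, y1 = x_center - bbox_width / 2, y_center - bbox_height / 2
x2, y2 = x_center + bbox_width / 2, y_center + bbox_height / 2

categories[int(class_idx)], (x1, y1), (x2, y2)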
The YOLO abbreviation stands for You Only Look Once. YOLO models are one-stage object detectors.
One-stage vs two-stage object detectors. Image from the YOLO v4 paper
YOLO models are very light and fast. They are not the most accurate object detectors around, though. Still, they are the choice of many (if not all) practitioners interested in real-time object detection (FPS > 30).
Joseph Redmon introduced YOLO v1 in the 2016 paper You Only Look Once: Unified, Real-Time Object Detection. The implementation uses the Darknet Neural Networks library.
He also co-authored the 2017 YOLO v2 paper, YOLO9000: Better, Faster, Stronger, a significant improvement over the first iteration with much better localization of objects.
The final iteration, from the original author, was published in the 2018 paper YOLOv3: An Incremental Improvement.
Then things got a bit wacky. Alexey Bochkovskiy published YOLOv4: Optimal Speed and Accuracy of Object Detection on April 23, 2020. The project has an open-source repository on GitHub.
YOLO v5 got open-sourced on May 30, 2020 by Glenn Jocher from ultralytics. There is no published paper, but the complete project is on GitHub.
The community at Hacker News got into a heated debate about the project naming. Even the guys at Roboflow wrote an article about it, Responding to the Controversy about YOLOv5. They also did a great comparison between YOLO v4 and v5.
My opinion? As long as you put out your work for the whole world to use/see - I don't give a flying fuck. I am not going to comment on points/arguments that are obvious.
YOLO v5 is implemented in PyTorch, but most of the details are abstracted away; all you need is the project itself (along with the required dependencies).
Let's start by cloning the GitHub repo and checking out a specific commit (to ensure reproducibility):
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!git checkout ec72eea62bf5bb86b0272f2e65e413957533507f
We need two configuration files. One for the dataset and one for the model we're going to use. Let's download them:
!gdown --id 1ZycPS5Ft_0vlfgHnLsfvZPhcH6qOAqBO -O data/clothing.yaml
!gdown --id 1czESPsKbOWZF7_PkCcvRfTiUUJfpx12i -O models/yolov5x.yaml
The model config changes the number of classes to 9 (matching our dataset). The dataset config clothing.yaml is a bit more complex:
train: ../clothing/images/train/
val: ../clothing/images/val/
nc: 9
names:
[
"Jackets",
"Jeans",
"Shirts",
"Shoes",
"Skirts",
"Tops",
"Trousers",
"Tshirts",
"sunglasses",
]
This file specifies the paths to the training and validation sets. It also gives the number of classes and their names (the order must match the class indices used in the label files).
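If you'd rather generate clothing.yaml yourself instead of downloading it, a minimal sketch (assuming you run it from the yolov5 directory with the categories list still in scope) could look like this:

import yaml

data_config = {
  'train': '../clothing/images/train/',
  'val': '../clothing/images/val/',
  'nc': len(categories),
  # the order defines the class indices written to the label files
  'names': categories
}

with open('data/clothing.yaml', 'w') as f:
  yaml.dump(data_config, f)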
Fine-tuning an existing model is very easy. We'll use the largest model YOLOv5x (89M parameters), which is also the most accurate.
In our case, we don't really care about speed; we just want the best accuracy we can get. Which checkpoint you should use for a different problem depends on its context, so take a look at the overview of the pre-trained checkpoints.
To train a model on a custom dataset, we'll call the train.py script. We'll pass a couple of parameters:
- img 640 - resize the images to 640x640 pixels
- batch 4 - 4 images per batch
- epochs 30 - train for 30 epochs
- data ./data/clothing.yaml - path to dataset config
- cfg ./models/yolov5x.yaml - model config
- weights yolov5x.pt - use pre-trained weights from the YOLOv5x model
- name yolov5x_clothing - name of our model
- cache - cache dataset images for faster training
!python train.py --img 640 --batch 4 --epochs 30 \
--data ./data/clothing.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt \
--name yolov5x_clothing --cache
The training took around 30 minutes on a Tesla P100. The best model checkpoint is saved to weights/best_yolov5x_clothing.pt.
The project includes a great utility function plot_results() that allows you to evaluate your model's performance on the last training run:
from utils.utils import plot_results
plot_results();
Looks like the mean average precision (mAP) is getting better throughout the training. The model might benefit from more training, but it is good enough.
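If you want to try squeezing out a bit more, one option is to continue training from the best checkpoint for a few extra epochs. Here's a sketch reusing only the flags we've already seen (the extra run is optional, and the name yolov5x_clothing_more is arbitrary):

!python train.py --img 640 --batch 4 --epochs 10 \
  --data ./data/clothing.yaml --cfg ./models/yolov5x.yaml \
  --weights weights/best_yolov5x_clothing.pt \
  --name yolov5x_clothing_more --cache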
Let's pick 50 images from the validation set and copy them to inference/images to see how our model does on those:
!find ../clothing/images/val/ -maxdepth 1 -type f | head -50 | xargs cp -t "./inference/images/"
We'll use the detect.py script to run our model on the images. Here are the parameters we're using:
- weights weights/best_yolov5x_clothing.pt - checkpoint of the model
- img 640 - resize the images to 640x640 px
- conf 0.4 - take into account predictions with confidence of 0.4 or higher
- source ./inference/images/ - path to the images
!python detect.py --weights weights/best_yolov5x_clothing.pt \
--img 640 --conf 0.4 --source ./inference/images/
We'll write a helper function to show the results:
def load_image(img_path: Path, resize=True):
  img = cv2.cvtColor(cv2.imread(str(img_path)), cv2.COLOR_BGR2RGB)
  img = cv2.resize(img, (128, 256), interpolation=cv2.INTER_AREA)
  return img

def show_grid(image_paths):
  images = [load_image(img) for img in image_paths]
  images = torch.as_tensor(images)
  images = images.permute(0, 3, 1, 2)
  grid_img = torchvision.utils.make_grid(images, nrow=11)
  plt.figure(figsize=(24, 12))
  plt.imshow(grid_img.permute(1, 2, 0))
  plt.axis('off');
Here are some of the images along with the detected clothing:
img_paths = list(Path("inference/output").glob("*.jpeg"))[:22]
show_grid(img_paths)
To be honest with you, I am really blown away by the results!
You now know how to create a custom dataset and fine-tune one of the YOLO v5 models on your own. Nice!
Here's what you've learned:
- Install required libraries
- Build a custom dataset in YOLO/darknet format
- Learn about YOLO model family history
- Fine-tune the largest YOLO v5 model
- Evaluate the model
- Look at some predictions
How well does your model do on your dataset? Let me know in the comments below.
In the next part, you'll learn how to deploy your model to a mobile device.