TL;DR Learn how to build a custom dataset for YOLO v5 (darknet compatible) and use it to fine-tune a large object detection model. The model will be ready for real-time object detection on mobile devices.
In this tutorial, you'll learn how to fine-tune a pre-trained YOLO v5 model for detecting and classifying clothing items from images.
Here's what we'll go over:
- Install required libraries
- Build a custom dataset in YOLO/darknet format
- Learn about YOLO model family history
- Fine-tune the largest YOLO v5 model
- Evaluate the model
- Look at some predictions
How good is our final model going to be?
Let's start by installing some libraries required by the YOLOv5 project:
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install numpy==1.17
!pip install PyYAML==5.3.1
!pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
We'll also install Apex by NVIDIA to speed up the training of our model (this step is optional):
!git clone https://github.com/NVIDIA/apex && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . --user && cd .. && rm -rf apex
The dataset contains annotations for clothing items - bounding boxes around shirts, tops, jackets, sunglasses, and more. It comes from DataTurks and is available on Kaggle.
!gdown --id 1uWdQ2kn25RSQITtBHa9_zayplm27IXNC
The dataset contains a single JSON file with URLs to all images and bounding box data.
Let's import all required libraries:
from pathlib import Path
from tqdm import tqdm
import numpy as np
import json
import urllib.request
import PIL.Image as Image
import cv2
import torch
import torchvision
from IPython.display import display
from sklearn.model_selection import train_test_split
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline
%config InlineBackend.figure_format='retina'
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
rcParams['figure.figsize'] = 16, 10
np.random.seed(42)
Each line in the dataset file contains a JSON object. Let's create a list of all annotations:
clothing = []
with open("clothing.json") as f:
  for line in f:
    clothing.append(json.loads(line))
Here's an example annotation:
clothing[0]
{'annotation': [{'imageHeight': 312,
'imageWidth': 147,
'label': ['Tops'],
'notes': '',
'points': [{'x': 0.02040816326530612, 'y': 0.2532051282051282},
{'x': 0.9931972789115646, 'y': 0.8108974358974359}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/4bb8fd9d-8d52-46c7-aa2a-9c18af10aed6___Data_xxl-top-4437-jolliy-original-imaekasxahykhd3t.jpeg',
'extras': None}
We have the labels, image dimensions, bounding box points (normalized to the 0-1 range), and a URL to the image file. To recover pixel coordinates, you multiply by the image dimensions - e.g., an x of 0.0204 on this 147-pixel-wide image is pixel column 3.
Do we have images with multiple annotations?
for c in clothing:
  if len(c['annotation']) > 1:
    display(c)
{'annotation': [{'imageHeight': 312,
'imageWidth': 265,
'label': ['Jackets'],
'notes': '',
'points': [{'x': 0, 'y': 0.6185897435897436},
{'x': 0.026415094339622643, 'y': 0.6185897435897436}]},
{'imageHeight': 312,
'imageWidth': 265,
'label': ['Skirts'],
'notes': '',
'points': [{'x': 0.01509433962264151, 'y': 0.03205128205128205},
{'x': 1, 'y': 0.9839743589743589}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/b3be330c-c211-45bb-b244-11aef08021c8___Data_free-sk-5108-mudrika-original-imaf4fz626pegq9f.jpeg',
'extras': None}
Just a single example. We'll need to handle it, though.
Let's get all unique categories:
categories = []
for c in clothing:
  for a in c['annotation']:
    categories.extend(a['label'])
categories = list(set(categories))
categories.sort()
categories
['Jackets',
'Jeans',
'Shirts',
'Shoes',
'Skirts',
'Tops',
'Trousers',
'Tshirts',
'sunglasses']
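Before splitting the data, it can be useful to get a feel for the class balance. Here's a minimal sketch using collections.Counter (the exact counts will depend on your copy of the dataset):

from collections import Counter

label_counts = Counter()
for c in clothing:
  for a in c['annotation']:
    label_counts.update(a['label'])

label_counts.most_common()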
We have 9 different categories. Let's split the data into a training and validation set:
train_clothing, val_clothing = train_test_split(clothing, test_size=0.1)
len(train_clothing), len(val_clothing)
(453, 51)
Let's have a look at an image from the dataset. We'll start by downloading it:
row = train_clothing[10]
img = urllib.request.urlopen(row["content"])
img = Image.open(img)
img = img.convert('RGB')
img.save("demo_image.jpeg", "JPEG")
Here's what our sample annotation looks like:
row
{'annotation': [{'imageHeight': 312,
'imageWidth': 145,
'label': ['Tops'],
'notes': '',
'points': [{'x': 0.013793103448275862, 'y': 0.22756410256410256},
{'x': 1, 'y': 0.7948717948717948}]}],
'content': 'http://com.dataturks.a96-i23.open.s3.amazonaws.com/2c9fafb063ad2b650163b00a1ead0017/ec339ad6-6b73-406a-8971-f7ea35d47577___Data_s-top-203-red-srw-original-imaf2nfrxdzvhh3k.jpeg',
'extras': None}
We can use OpenCV to read the image:
img = cv2.cvtColor(cv2.imread('demo_image.jpeg'), cv2.COLOR_BGR2RGB)
img.shape
(312, 145, 3)
Let's add the bounding box on top of the image along with the label:
for a in row['annotation']:
  for label in a['label']:
    w = a['imageWidth']
    h = a['imageHeight']
    points = a['points']
    p1, p2 = points

    x1, y1 = p1['x'] * w, p1['y'] * h
    x2, y2 = p2['x'] * w, p2['y'] * h

    cv2.rectangle(
      img,
      (int(x1), int(y1)),
      (int(x2), int(y2)),
      color=(0, 255, 0),
      thickness=2
    )

    ((label_width, label_height), _) = cv2.getTextSize(
      label,
      fontFace=cv2.FONT_HERSHEY_PLAIN,
      fontScale=1.75,
      thickness=2
    )

    cv2.rectangle(
      img,
      (int(x1), int(y1)),
      (int(x1 + label_width + label_width * 0.05), int(y1 + label_height + label_height * 0.25)),
      color=(0, 255, 0),
      thickness=cv2.FILLED
    )

    cv2.putText(
      img,
      label,
      org=(int(x1), int(y1 + label_height + label_height * 0.25)), # bottom left
      fontFace=cv2.FONT_HERSHEY_PLAIN,
      fontScale=1.75,
      color=(255, 255, 255),
      thickness=2
    )
The point coordinates are converted back to pixels and used to draw rectangles over the image. Here's the result:
plt.imshow(img)
plt.axis('off');
YOLO v5 requires the dataset to be in the darknet format. Here's an outline of what it looks like:
- One .txt label file per image
- One row per object
- Each row contains:
class_index bbox_x_center bbox_y_center bbox_width bbox_height
- Box coordinates must be normalized between 0 and 1
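To make the format concrete, here's a minimal sketch of converting one corner-point annotation (like the ones in our JSON) into a darknet row - the same arithmetic the dataset helper below uses (to_darknet_row is just an illustrative name):

def to_darknet_row(category_idx, p1, p2):
  # the corner points are already normalized, so no division by image size is needed
  bbox_width = p2['x'] - p1['x']
  bbox_height = p2['y'] - p1['y']
  x_center = p1['x'] + bbox_width / 2
  y_center = p1['y'] + bbox_height / 2
  return f"{category_idx} {x_center} {y_center} {bbox_width} {bbox_height}"

to_darknet_row(categories.index('Tops'), {'x': 0.02, 'y': 0.25}, {'x': 0.99, 'y': 0.81})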
Let's create a helper function that builds a dataset in the correct format for us:
def create_dataset(clothing, categories, dataset_type):

  images_path = Path(f"clothing/images/{dataset_type}")
  images_path.mkdir(parents=True, exist_ok=True)

  labels_path = Path(f"clothing/labels/{dataset_type}")
  labels_path.mkdir(parents=True, exist_ok=True)

  for img_id, row in enumerate(tqdm(clothing)):

    image_name = f"{img_id}.jpeg"

    img = urllib.request.urlopen(row["content"])
    img = Image.open(img)
    img = img.convert("RGB")
    img.save(str(images_path / image_name), "JPEG")

    label_name = f"{img_id}.txt"

    with (labels_path / label_name).open(mode="w") as label_file:
      for a in row['annotation']:
        for label in a['label']:
          category_idx = categories.index(label)

          points = a['points']
          p1, p2 = points

          x1, y1 = p1['x'], p1['y']
          x2, y2 = p2['x'], p2['y']

          bbox_width = x2 - x1
          bbox_height = y2 - y1

          label_file.write(
            f"{category_idx} {x1 + bbox_width / 2} {y1 + bbox_height / 2} {bbox_width} {bbox_height}\n"
          )
We'll use it to create the train and validation datasets:
create_dataset(train_clothing, categories, 'train')
create_dataset(val_clothing, categories, 'val')
Let's have a look at the file structure:
!tree clothing -L 2
clothing
├── images
│   ├── train
│   └── val
└── labels
    ├── train
    └── val
6 directories, 0 files
And a single annotation example:
!cat clothing/labels/train/0.txt
4 0.525462962962963 0.5432692307692308 0.9027777777777778 0.9006410256410257
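As a quick sanity check, we can decode that row back into normalized corner coordinates (a small sketch; your 0.txt may hold a different example, since the split is shuffled):

with open("clothing/labels/train/0.txt") as f:
  class_idx, x_center, y_center, bbox_width, bbox_height = map(float, f.readline().split())

x1, y1 = x_center - bbox_width / 2, y_center - bbox_height / 2
x2, y2 = x_center + bbox_width / 2, y_center + bbox_height / 2

categories[int(class_idx)], (x1, y1), (x2, y2)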
The YOLO abbreviation stands for You Only Look Once. YOLO models are one-stage object detectors.
One-stage vs two-stage object detectors. Image from the YOLO v4 paper
YOLO models are very light and fast. They are not the most accurate object detectors around, though. Still, they are the choice of many (if not all) practitioners interested in real-time object detection (FPS > 30).
Joseph Redmon introduced YOLO v1 in the 2016 paper You Only Look Once: Unified, Real-Time Object Detection. The implementation uses the Darknet Neural Networks library.
He also co-authored the 2017 YOLO v2 paper, YOLO9000: Better, Faster, Stronger, a significant improvement over the first iteration with much better localization of objects.
The final iteration, from the original author, was published in the 2018 paper YOLOv3: An Incremental Improvement.
Then things got a bit wacky. Alexey Bochkovskiy published YOLOv4: Optimal Speed and Accuracy of Object Detection on April 23, 2020. The project has an open-source repository on GitHub.
YOLO v5 got open-sourced on May 30, 2020 by Glenn Jocher from ultralytics. There is no published paper, but the complete project is on GitHub.
The community at Hacker News got into a heated debate about the project naming. Even the guys at Roboflow wrote an article about it, Responding to the Controversy about YOLOv5. They also did a great comparison between YOLO v4 and v5.
My opinion? As long as you put out your work for the whole world to use/see - I don't give a flying fuck. I am not going to comment on points/arguments that are obvious.
YOLO v5 is implemented in PyTorch, but most of the details are abstracted away; all you need is the project itself (along with the required dependencies).
Let's start by cloning the GitHub repo and checking out a specific commit (to ensure reproducibility):
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!git checkout ec72eea62bf5bb86b0272f2e65e413957533507f
We need two configuration files. One for the dataset and one for the model we're going to use. Let's download them:
!gdown --id 1ZycPS5Ft_0vlfgHnLsfvZPhcH6qOAqBO -O data/clothing.yaml
!gdown --id 1czESPsKbOWZF7_PkCcvRfTiUUJfpx12i -O models/yolov5x.yaml
The model config changes the number of classes to 9 (matching our dataset). The dataset config clothing.yaml is a bit more complex:
train: ../clothing/images/train/
val: ../clothing/images/val/
nc: 9
names:
[
"Jackets",
"Jeans",
"Shirts",
"Shoes",
"Skirts",
"Tops",
"Trousers",
"Tshirts",
"sunglasses",
]
This file specifies the paths to the training and validation sets. It also gives the number of classes and their names (the order must match the class indices used in the label files).
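If you'd rather generate clothing.yaml yourself instead of downloading it, a minimal sketch (assuming you run it from the yolov5 directory with the categories list still in scope) could look like this:

import yaml

data_config = {
  'train': '../clothing/images/train/',
  'val': '../clothing/images/val/',
  'nc': len(categories),
  # the order defines the class indices written to the label files
  'names': categories
}

with open('data/clothing.yaml', 'w') as f:
  yaml.dump(data_config, f)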
Fine-tuning an existing model is very easy. We'll use the largest model YOLOv5x (89M parameters), which is also the most accurate.
In our case, we don't really care about speed; we just want the best accuracy we can get. Which checkpoint you should use for a different problem depends on its context, so take a look at the overview of the pre-trained checkpoints.
To train a model on a custom dataset, we'll call the train.py script. We'll pass a couple of parameters:
- img 640 - resize the images to 640x640 pixels
- batch 4 - 4 images per batch
- epochs 30 - train for 30 epochs
- data ./data/clothing.yaml - path to dataset config
- cfg ./models/yolov5x.yaml - model config
- weights yolov5x.pt - use pre-trained weights from the YOLOv5x model
- name yolov5x_clothing - name of our model
- cache - cache dataset images for faster training
!python train.py --img 640 --batch 4 --epochs 30 \
--data ./data/clothing.yaml --cfg ./models/yolov5x.yaml --weights yolov5x.pt \
--name yolov5x_clothing --cache
The training took around 30 minutes on a Tesla P100. The best model checkpoint is saved to weights/best_yolov5x_clothing.pt.
The project includes a great utility function plot_results() that allows you to evaluate your model's performance on the last training run:
from utils.utils import plot_results
plot_results();
Looks like the mean average precision (mAP) is getting better throughout the training. The model might benefit from more training, but it is good enough.
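If you want to try squeezing out a bit more, one option is to continue training from the best checkpoint for a few extra epochs. Here's a sketch reusing only the flags we've already seen (the extra run is optional, and the name yolov5x_clothing_more is arbitrary):

!python train.py --img 640 --batch 4 --epochs 10 \
  --data ./data/clothing.yaml --cfg ./models/yolov5x.yaml \
  --weights weights/best_yolov5x_clothing.pt \
  --name yolov5x_clothing_more --cache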
Let's pick 50 images from the validation set and copy them to inference/images to see how our model does on those:
!find ../clothing/images/val/ -maxdepth 1 -type f | head -50 | xargs cp -t "./inference/images/"
We'll use the detect.py script to run our model on the images. Here are the parameters we're using:
- weights weights/best_yolov5x_clothing.pt - checkpoint of the model
- img 640 - resize the images to 640x640 px
- conf 0.4 - take into account predictions with confidence of 0.4 or higher
- source ./inference/images/ - path to the images
!python detect.py --weights weights/best_yolov5x_clothing.pt \
--img 640 --conf 0.4 --source ./inference/images/
We'll write a helper function to show the results:
def load_image(img_path: Path, resize=True):
  img = cv2.cvtColor(cv2.imread(str(img_path)), cv2.COLOR_BGR2RGB)
  img = cv2.resize(img, (128, 256), interpolation=cv2.INTER_AREA)
  return img

def show_grid(image_paths):
  images = [load_image(img) for img in image_paths]
  images = torch.as_tensor(images)
  images = images.permute(0, 3, 1, 2)
  grid_img = torchvision.utils.make_grid(images, nrow=11)
  plt.figure(figsize=(24, 12))
  plt.imshow(grid_img.permute(1, 2, 0))
  plt.axis('off');
Here are some of the images along with the detected clothing:
img_paths = list(Path("inference/output").glob("*.jpeg"))[:22]
show_grid(img_paths)
To be honest with you, I am really blown away by the results!
You now know how to create a custom dataset and fine-tune one of the YOLO v5 models on your own. Nice!
Here's what you've learned:
- Install required libraries
- Build a custom dataset in YOLO/darknet format
- Learn about YOLO model family history
- Fine-tune the largest YOLO v5 model
- Evaluate the model
- Look at some predictions
How well does your model do on your dataset? Let me know in the comments below.
In the next part, you'll learn how to deploy your model to a mobile device.