<h1><center>Efficient YoloV5 Dataset Generator</center></h1>     

<center><img src = "https://i.imgur.com/iatgdo5.jpg" width = "635" height = "235"/></center>         

This dataset was built to be compatible with the training script (`train.py`) that can be found [HERE](https://github.com/ultralytics/yolov5). I also have a training notebook with WandB integration that you can find [HERE](https://www.kaggle.com/coldfir3/yolov5-train/edit/run/81607643). The inference notebook is still WIP. The resulting Kaggle dataset can be found [HERE](https://www.kaggle.com/coldfir3/great-barrier-reef-yolov5).

The three main tasks in converting this dataset to Yolo format are:
1. Splitting into train and test
1. Converting the bboxes to yolo format `[xc, yc, w, h]` and saving them into text files
1. Arranging the files in the expected folders and writing the `.yaml`

<h3 style='background:orange; color:black'><center>Consider upvoting this notebook if you found it helpful.</center></h3>

In [None]:
import os
from ast import literal_eval

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm

import shutil

In [None]:
TRAIN_PATH = '/kaggle/input/tensorflow-great-barrier-reef/train_images'

## Reading the train data

In [None]:
df = pd.read_csv('/kaggle/input/tensorflow-great-barrier-reef/train.csv')
df.tail()

## Background images (no detections)

For this dataset I decided to use only `N=6000` images in total. This is because most of the images don't have any annotations, and Yolo recommends the following:
> Background images. Background images are images with no objects that are added to a dataset to reduce False Positives (FP). We recommend about 0-10% background images to help reduce FPs (COCO has 1000 background images for reference, 1% of the total). No labels are required for background images.

As you can see below, the dataset has far more background images than that.

In [None]:
n_with_annotations = (df['annotations'] != '[]').sum()
len(df), n_with_annotations
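
As a rough guide to how many background images a given share translates to, here is a minimal sketch. The helper `n_background_for_ratio` is hypothetical (it is not part of this notebook or of Yolo), and the counts in the example are made up for illustration.

```python
def n_background_for_ratio(n_annotated, background_ratio):
    """Number of background images so they make up `background_ratio` of the total."""
    if not 0 <= background_ratio < 1:
        raise ValueError("background_ratio must be in [0, 1)")
    # Solve n_bg / (n_annotated + n_bg) = background_ratio for n_bg
    return round(n_annotated * background_ratio / (1 - background_ratio))

# Illustrative numbers only: 900 annotated images and a 10% background share
n_bg = n_background_for_ratio(900, 0.10)
print(n_bg)
```

Adding the result to the number of annotated images gives a value of `N` that matches the chosen share.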

In [None]:
N = 6000
df = pd.concat([
    df[df['annotations'] != '[]'],
    df[df['annotations'] == '[]'].sample(N - n_with_annotations)
]).sample(frac=1).reset_index(drop = True)
df.tail()
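
Note that `.sample()` is called without a seed in the cell above, so each run selects a different subset. If reproducibility matters, one option is to pass `random_state`. The sketch below uses a toy dataframe; the helper name `subsample`, the column values and the seed 42 are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real train.csv: 3 annotated rows, 7 background rows
toy = pd.DataFrame({
    'video_frame': range(10),
    'annotations': ['[{"x": 1}]'] * 3 + ['[]'] * 7,
})

def subsample(df, n_total, seed=42):
    """Keep every annotated row plus a seeded random sample of background rows."""
    has_boxes = df['annotations'] != '[]'
    n_background = int(n_total - has_boxes.sum())
    return pd.concat([
        df[has_boxes],
        df[~has_boxes].sample(n_background, random_state=seed),
    ]).sample(frac=1, random_state=seed).reset_index(drop=True)

first = subsample(toy, n_total=5)
second = subsample(toy, n_total=5)
print(first.equals(second))
```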

## Train Val split

For this data, I believe the only consistent way to split the images between train/val is by video.

In [None]:
df['video_id'].value_counts()

So we will use video 2 as validation and videos 0 and 1 as training. For better final performance I advise you to do a 3-fold split on this data and ensemble the final models using WBF, which you can find [HERE](https://github.com/ZFTurbo/Weighted-Boxes-Fusion).
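
A minimal sketch of that 3-fold idea, holding out one video per fold. The helper name `video_folds` and the toy dataframe are assumptions for illustration; training one model per fold and the WBF ensembling step are not shown.

```python
import pandas as pd

def video_folds(df, video_col='video_id'):
    """Yield (held_out_video, train_df, valid_df), one fold per distinct video."""
    for video in sorted(df[video_col].unique()):
        is_valid = df[video_col] == video
        yield int(video), df[~is_valid], df[is_valid]

# Toy stand-in for the real dataframe
toy = pd.DataFrame({'video_id': [0, 0, 1, 1, 1, 2], 'video_frame': range(6)})

# Map each held-out video to (number of train rows, number of validation rows)
fold_sizes = {video: (len(tr), len(va)) for video, tr, va in video_folds(toy)}
print(fold_sizes)
```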

In [None]:
# Video 2 is held out for validation; videos 0 and 1 are used for training
df['is_valid'] = df['video_id'] == 2

df['annotations'] = df['annotations'].apply(literal_eval)
df['path'] = df.apply(lambda row: f"{TRAIN_PATH}/video_{row['video_id']}/{row['video_frame']}.jpg", axis = 1)

df.tail()

## The .yaml file

Yolo uses a `.yaml` file to indicate the number and names of the classes as well as the location of the images/labels.

In [None]:
%%writefile config.yaml

train: train 
val: valid

nc: 1  
names: ['starfish'] 

## Converting to yolo bbox format

Yolo uses a bbox format of `[xc, yc, w, h]`, where `(xc, yc)` is the box center and all four values are normalized by the image size, so we need to adjust our dataset to reflect that notation.

<center><img src = "https://user-images.githubusercontent.com/26833433/91506361-c7965000-e886-11ea-8291-c72b98c25eec.jpg" width = "635" height = "235"/></center>        

In [None]:
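def to_yolo(box, img_w = 1280, img_h = 720):
    # box: dict with the top-left corner ('x', 'y') and size ('width', 'height') in pixels
    w = box['width']
    h = box['height']
    # Box center; no rounding, so odd widths/heights keep their half-pixel offset
    xc = box['x'] + w / 2
    yc = box['y'] + h / 2

    # Normalize by the image size, as the Yolo label format expects
    return [xc/img_w, yc/img_h, w/img_w, h/img_h]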
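
As a quick check of the conversion, the sketch below converts one box and maps it back to pixels. It defines its own copy of the conversion (without rounding the center) so the block runs on its own; `from_yolo` is a hypothetical helper added for illustration, and the box values are made up.

```python
def to_yolo(box, img_w=1280, img_h=720):
    """Top-left pixel box -> normalized [xc, yc, w, h]."""
    w, h = box['width'], box['height']
    xc = box['x'] + w / 2
    yc = box['y'] + h / 2
    return [xc / img_w, yc / img_h, w / img_w, h / img_h]

def from_yolo(yolo_box, img_w=1280, img_h=720):
    """Normalized [xc, yc, w, h] -> top-left pixel box (inverse of to_yolo)."""
    xc, yc, w, h = yolo_box
    width, height = w * img_w, h * img_h
    return {'x': xc * img_w - width / 2, 'y': yc * img_h - height / 2,
            'width': width, 'height': height}

# Illustrative box: 50x40 pixels with its top-left corner at (100, 200)
box = {'x': 100, 'y': 200, 'width': 50, 'height': 40}
yolo_box = to_yolo(box)
restored = from_yolo(yolo_box)
print(yolo_box)
print(restored)
```

If the round trip does not return the original box (up to floating-point error), the conversion or the assumed image size is off.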

## Looping through the dataframe and saving the files

In [None]:
os.makedirs('train/images', exist_ok=True)
os.makedirs('train/labels', exist_ok=True)
os.makedirs('valid/images', exist_ok=True)
os.makedirs('valid/labels', exist_ok=True)

In [None]:
for i, row in tqdm(df.iterrows(), total = len(df)):
    
    bboxes = row['annotations']
    bboxes = [to_yolo(bbox) for bbox in bboxes]
    
    base_dir = 'valid' if row['is_valid'] else 'train'
    fname = f"{row['video_id']}_{row['video_frame']}"
    
    with open(f'{base_dir}/labels/{fname}.txt', 'w+') as f:
        for bbox in bboxes:
            f.write('0 ' + ' '.join([str(round(b, 3)) for b in bbox]) + '\n')
    shutil.copyfile(row['path'], f"{base_dir}/images/{fname}.jpg")
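
Before training, it can be worth checking that every label line has the expected shape: a class index followed by four values in [0, 1]. A minimal sketch, where `check_label_file` is a hypothetical helper and the temporary files stand in for the generated `.txt` files:

```python
import os
import tempfile

def check_label_file(path, n_classes=1):
    """Return the number of boxes in a label file; raise ValueError on a malformed line."""
    n_boxes = 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.split()
            if not parts:
                continue  # tolerate blank lines
            if len(parts) != 5:
                raise ValueError(f"{path}:{line_no}: expected 5 fields, got {len(parts)}")
            cls = int(parts[0])
            coords = [float(p) for p in parts[1:]]
            if not 0 <= cls < n_classes:
                raise ValueError(f"{path}:{line_no}: class index {cls} out of range")
            if any(not 0.0 <= c <= 1.0 for c in coords):
                raise ValueError(f"{path}:{line_no}: coordinate outside [0, 1]")
            n_boxes += 1
    return n_boxes

with tempfile.TemporaryDirectory() as tmp:
    # A well-formed file with one box
    good_path = os.path.join(tmp, 'good.txt')
    with open(good_path, 'w') as f:
        f.write('0 0.45 0.55 0.04 0.07\n')
    n_ok = check_label_file(good_path)

    # A file whose box center lies outside the image
    bad_path = os.path.join(tmp, 'bad.txt')
    with open(bad_path, 'w') as f:
        f.write('0 1.2 0.55 0.04 0.07\n')
    try:
        check_label_file(bad_path)
        bad_rejected = False
    except ValueError:
        bad_rejected = True

print(n_ok, bad_rejected)
```

An empty label file returns 0, which is what the loop above writes for background images.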

## Zipping the files so Kaggle can assemble a dataset

The final dataset can be found [HERE](https://www.kaggle.com/coldfir3/great-barrier-reef-yolov5).

In [None]:
shutil.make_archive('valid', 'zip', 'valid')
shutil.make_archive('train', 'zip', 'train')

In [None]:
shutil.rmtree('valid') 
shutil.rmtree('train') 