# Creating Datasets for YOLO

This notebook prepares datasets for the YOLO (You Only Look Once) object detection algorithm as part of a particle detection challenge. The competition involves analyzing 3D volumetric images to locate particles with high precision. Existing approaches that work on the 3D images directly can be computationally expensive, especially in VRAM usage. This notebook instead converts each 3D volume into 2D slices that are normalized for YOLO.

The code runs only after the "dataFolder" and "yoloFolder" paths are updated to match the environment where it is executed.

## Install and Import modules

In [None]:
!pip install zarr

Collecting zarr
  Downloading zarr-3.0.1-py3-none-any.whl.metadata (9.5 kB)
Collecting donfig>=0.8 (from zarr)
  Downloading donfig-0.8.1.post1-py3-none-any.whl.metadata (5.0 kB)
Collecting numcodecs>=0.14 (from numcodecs[crc32c]>=0.14->zarr)
  Downloading numcodecs-0.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.9 kB)
Collecting crc32c>=2.7 (from numcodecs[crc32c]>=0.14->zarr)
  Downloading crc32c-2.7.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Downloading zarr-3.0.1-py3-none-any.whl (181 kB)
Downloading donfig-0.8.1.post1-py3-none-any.whl (21 kB)
Downloading numcodecs-0.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.9 MB)


In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import zarr
from tqdm import tqdm
import glob, os
import cv2

In [None]:
# Mount Google Drive and point dataFolder at the dataset folder

from google.colab import drive
drive.mount('/content/gdrive')
dataFolder = "gdrive/My Drive/MLDM - Deep Learning/CRYOET PROJECT - DEEP LEARNING/2D U-NET APPROACH/1 - SOURCE DATA/1 - ORIGINAL DATASET/"

Mounted at /content/gdrive


In [None]:
# List the tomogram folders in the training dataset

runs = sorted(glob.glob(f'{dataFolder}train/overlay/ExperimentRuns/*'))
runs = [os.path.basename(x) for x in runs]
i2r_dict = {i:r for i, r in zip(range(len(runs)), runs)}
r2t_dict = {r:i for i, r in zip(range(len(runs)), runs)}
i2r_dict

{0: 'TS_5_4',
 1: 'TS_69_2',
 2: 'TS_6_4',
 3: 'TS_6_6',
 4: 'TS_73_6',
 5: 'TS_86_3',
 6: 'TS_99_9'}

## Normalize Function
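To save each slice as an image, the volume is rescaled to integer values between 0 and 255. Values are first clipped to the 0.5th and 99.5th percentiles so that a few extreme voxels do not compress the useful intensity range.

The constant 1e-12 is a small epsilon that prevents division by zero when the clipped volume is constant.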

In [None]:
def convert_to_8bit(x):

    #Compute Percentile Bounds
    lower, upper = np.percentile(x, (0.5, 99.5))

    #Clipping Values
    x = np.clip(x, lower, upper)

    #Normalizes the data to fit within the 8-bit grayscale range
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255

    return x.round().astype("uint8")
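As a quick check of the normalization, the sketch below applies the same function to a synthetic volume. The random volume, the injected outlier, and the names `vol`, `out`, and `flat` are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def convert_to_8bit(x):
    # Same steps as the notebook's function: percentile clip, then rescale to 0..255
    lower, upper = np.percentile(x, (0.5, 99.5))
    x = np.clip(x, lower, upper)
    x = (x - x.min()) / (x.max() - x.min() + 1e-12) * 255
    return x.round().astype("uint8")

# Synthetic stand-in for a tomogram (assumption: real volumes are float arrays)
rng = np.random.default_rng(0)
vol = rng.normal(0.0, 1.0, size=(4, 32, 32)).astype("float32")
vol[0, 0, 0] = 1e6  # a single extreme voxel, removed by the percentile clip

out = convert_to_8bit(vol)
print(out.dtype, out.min(), out.max())

# A constant input exercises the epsilon: no division by zero, all zeros out
flat = convert_to_8bit(np.zeros((2, 2), dtype="float32"))
print(flat.max())
```

Without the clipping step, the single outlier would stretch the range so far that nearly every other voxel would map to 0.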

## Label Information

In [None]:
# Particle types and their radii of interest

p2i_dict = {
        'apo-ferritin': 0,
        'beta-amylase': 1,
        'beta-galactosidase': 2,
        'ribosome': 3,
        'thyroglobulin': 4,
        'virus-like-particle': 5
    }

i2p = {v:k for k, v in p2i_dict.items()}

particle_radius = {
        'apo-ferritin': 60,
        'beta-amylase': 65,
        'beta-galactosidase': 90,
        'ribosome': 150,
        'thyroglobulin': 130,
        'virus-like-particle': 135,
    }



In [None]:
particle_names = ['apo-ferritin', 'beta-amylase', 'beta-galactosidase', 'ribosome', 'thyroglobulin', 'virus-like-particle']
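To get a feel for how large each particle appears in a slice, the sketch below converts the radii into box diameters in voxels. It assumes the radii are expressed in the same physical units as the voxel spacing of 10 (the main function divides both coordinates and radii by 10); `VOXEL_SPACING` and `diameters_vox` are names introduced here for illustration.

```python
# Assumption: radii share the units of the voxel spacing ("VoxelSpacing10.000")
VOXEL_SPACING = 10.0

particle_radius = {
    'apo-ferritin': 60,
    'beta-amylase': 65,
    'beta-galactosidase': 90,
    'ribosome': 150,
    'thyroglobulin': 130,
    'virus-like-particle': 135,
}

# Box diameter in voxels before any resizing: 2 * radius / spacing
diameters_vox = {name: 2 * r / VOXEL_SPACING for name, r in particle_radius.items()}

for name, d in diameters_vox.items():
    print(f"{name}: {d:.1f} voxels")
```

The same diameter also approximates how many consecutive slices a particle can appear in, since the slices are one voxel apart along z.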

## Main function for making datasets for YOLO

In [None]:
def make_annotate_yolo(run_name, is_train_path=True):
    # to split validation
    is_train_path = 'train' if is_train_path else 'val'

    # read a volume
    vol = zarr.open(f'{dataFolder}train/static/ExperimentRuns/{run_name}/VoxelSpacing10.000/denoised.zarr', mode='r') #bug fixed. Thanks to @pratyushh
    # use largest images
    vol = vol['0']
    # normalize [0, 255]
    vol2 = convert_to_8bit(vol)

    n_imgs = vol2.shape[0]
    # process each slice
    for j in range(n_imgs):
        newvol = vol2[j]
        newvolf = np.stack([newvol]*3, axis=-1)
        # resize to 640x640; YOLO expects an image size that is a multiple of 32
        newvolf = cv2.resize(newvolf, (640,640))
        # save as 1 slice
        cv2.imwrite(f'{dataFolder}images/{is_train_path}/{run_name}_{j*10}.png', newvolf)
        # make txt file for annotation
        with open(f'{dataFolder}labels/{is_train_path}/{run_name}_{j*10}.txt', 'w'):
            pass # make empty file

    # process each particle type
    for particle in tqdm(particle_names):
        # skip beta-amylase: its weight is 0, so it does not need to be detected
        if particle == "beta-amylase":
            continue
        json_each_particle = f"{dataFolder}train/overlay/ExperimentRuns/{run_name}/Picks/{particle}.json"
        df = pd.read_json(json_each_particle)
        # pick each coordinate of particles
        for axis in "x", "y", "z":
            df[axis] = df.points.apply(lambda x: x["location"][axis])


        radius = particle_radius[particle]
        for i, row in df.iterrows():
            # The particle radius determines which slices around the center contain the particle.
            start_z = np.round(row['z'] - radius).astype(np.int32)
            start_z = max(0, start_z//10) # 10 is the voxel spacing
            end_z = np.round(row['z'] + radius).astype(np.int32)
            end_z = min(n_imgs, end_z//10) # 10 is the voxel spacing

            # annotate the slices strictly between start_z and end_z
            for j in range(start_z+1, end_z):
                # append one annotation line: class x_center y_center width height (normalized)
                # assuming the volume axes are (z, y, x): x is normalized by shape[2], y by shape[1]
                with open(f'{dataFolder}labels/{is_train_path}/{run_name}_{j*10}.txt', 'a') as f:
                    f.write(f'{p2i_dict[particle]} {row["x"]/10/vol2.shape[2]} {row["y"]/10/vol2.shape[1]} {radius/10/vol2.shape[2]*2} {radius/10/vol2.shape[1]*2} \n')
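The annotation arithmetic can be followed on a single pick. The sketch below is a minimal standalone version of the two steps (slice range, then label line). It assumes the volume axes are ordered (z, y, x); the helper names `slice_range` and `yolo_label`, the pick coordinates, and the volume shape are hypothetical values chosen for illustration.

```python
import numpy as np

def slice_range(z, radius, n_slices, spacing=10):
    """Slice indices that receive an annotation for a particle centered at z."""
    start = max(0, int(np.round(z - radius)) // spacing)
    end = min(n_slices, int(np.round(z + radius)) // spacing)
    return range(start + 1, end)

def yolo_label(class_id, x, y, radius, height, width, spacing=10):
    """One YOLO line: class x_center y_center box_width box_height, all in [0, 1]."""
    xc = x / spacing / width         # x runs along the image width
    yc = y / spacing / height        # y runs along the image height
    w = 2 * radius / spacing / width
    h = 2 * radius / spacing / height
    return f"{class_id} {xc} {yc} {w} {h}"

# Hypothetical pick and volume shape (assumptions for this example only)
n_z, height, width = 184, 630, 630
x, y, z = 3150.0, 1575.0, 600.0
radius, class_id = 150, 3            # ribosome, class 3 in p2i_dict

slices = list(slice_range(z, radius, n_z))
label = yolo_label(class_id, x, y, radius, height, width)
print(slices[0], slices[-1], len(slices))
print(label)
```

Because the coordinates are normalized, the later resize to 640x640 does not change the label values.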


## Prepare Folders

Create the folders that receive the 2D images and text labels, and define the destination folder for the finished YOLO dataset.

In [None]:
yoloFolder = "gdrive/My Drive/MLDM - Deep Learning/CRYOET PROJECT - DEEP LEARNING/2D U-NET APPROACH/1 - SOURCE DATA/2 - YOLO DATASET/"

In [None]:
os.makedirs(f"{dataFolder}images/train", exist_ok=True)
os.makedirs(f"{dataFolder}images/val", exist_ok=True)
os.makedirs(f"{dataFolder}labels/val", exist_ok=True)
os.makedirs(f"{dataFolder}labels/train", exist_ok=True)

## Main loop to make slice images and annotations

In [None]:
runs

['TS_5_4', 'TS_69_2', 'TS_6_4', 'TS_6_6', 'TS_73_6', 'TS_86_3', 'TS_99_9']

In [None]:
# use TS_5_4 as validation
for i, r in enumerate(runs):
    make_annotate_yolo(r, is_train_path=(i != 0))

100%|██████████| 6/6 [00:22<00:00,  3.79s/it]
100%|██████████| 6/6 [00:24<00:00,  4.12s/it]
100%|██████████| 6/6 [00:36<00:00,  6.08s/it]
100%|██████████| 6/6 [00:23<00:00,  3.85s/it]
100%|██████████| 6/6 [00:36<00:00,  6.17s/it]
100%|██████████| 6/6 [00:43<00:00,  7.19s/it]
100%|██████████| 6/6 [00:39<00:00,  6.53s/it]


Move the generated images and labels into the YOLO dataset folder.

In [None]:
import shutil
#os.makedirs(f'{dataFolder}datasets/czii_det2d', exist_ok=True)
# give each destination explicitly so that train and val are handled the same way
shutil.move(f'{dataFolder}images/train', f'{yoloFolder}images/train')
shutil.move(f'{dataFolder}images/val', f'{yoloFolder}images/val')
shutil.move(f'{dataFolder}labels/train', f'{yoloFolder}labels/train')
shutil.move(f'{dataFolder}labels/val', f'{yoloFolder}labels/val')
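After the move, it can be worth confirming that every image still has a matching label file. The sketch below is one possible check; the helper `unmatched_files` and the temporary demo folder are assumptions introduced here, and in the notebook the function would be pointed at the `images/...` and `labels/...` folders under `yoloFolder` instead.

```python
import os
import tempfile

def unmatched_files(images_dir, labels_dir):
    """Return (images without a label, labels without an image) as sorted name stems."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(images_dir) if f.endswith(".png")}
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(labels_dir) if f.endswith(".txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)

# Demo on a throwaway folder (hypothetical file names)
with tempfile.TemporaryDirectory() as root:
    img_dir = os.path.join(root, "images", "train")
    lbl_dir = os.path.join(root, "labels", "train")
    os.makedirs(img_dir)
    os.makedirs(lbl_dir)
    for stem in ("TS_demo_0", "TS_demo_10"):
        open(os.path.join(img_dir, stem + ".png"), "w").close()
    open(os.path.join(lbl_dir, "TS_demo_0.txt"), "w").close()

    missing_labels, missing_images = unmatched_files(img_dir, lbl_dir)

print(missing_labels, missing_images)
```

Both lists should be empty for a complete dataset, since the main function writes an empty label file for every slice before adding annotations.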

## Create the yaml file for Training
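Training needs a YAML configuration file that gives the dataset root, the image folders, and the class names. Class 1 (beta-amylase) stays in the list so that the class indices match `p2i_dict`, even though no labels are written for it.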

In [None]:
%%writefile 'gdrive/My Drive/MLDM - Deep Learning/CRYOET PROJECT - DEEP LEARNING/2D U-NET APPROACH/1 - SOURCE DATA/2 - YOLO DATASET/czii_conf.yaml'

path: gdrive/My Drive/MLDM - Deep Learning/CRYOET PROJECT - DEEP LEARNING/2D U-NET APPROACH/1 - SOURCE DATA/2 - YOLO DATASET/ # dataset root dir
train: images/train # train images (relative to 'path')
val: images/val # val images (relative to 'path')

# Classes
names:
  0: apo-ferritin
  1: beta-amylase
  2: beta-galactosidase
  3: ribosome
  4: thyroglobulin
  5: virus-like-particle