<a target="_blank" href="https://colab.research.google.com/github/mmeagher/experiments/blob/main/jupyter-notebooks/Object%20Classification%20and%20Localization/data-prep-fine-tuning.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
"""
Author: Ryleigh J. Bruce
Date: June 17, 2024

Purpose: Preparing a dataset for fine tuning a YOLO model.


Note: The author generated this text in part with GPT-4,
OpenAI’s large-scale language-generation model. Upon generating
draft code, the author reviewed, edited, and revised the code
to their own liking and takes ultimate responsibility for
the content of this code.

"""

## Introduction
This notebook walks through the process of preparing datasets for training YOLO object detection models. It shows how to download datasets like COCO (Common Objects in Context), convert them to the right format, and split them into training and validation sets. The guide is designed to make dataset preparation easier and more automated for computer vision projects.

## Critical Uses & Adaptability

### What the Notebook Can Be Used For:
**Dataset Exploration:** This notebook facilitates the exploration and preparation of datasets for object detection.

**Educational Purposes & Demonstrations:** The notebook serves as a guide to using Python scripts and machine learning libraries to process image datasets.

### How the Notebook Can Be Adapted:

**Integration with Spatial Design:** This notebook can be applied to site analysis by preparing datasets specific to design projects. The code blocks for dataset preparation and export are particularly relevant.

**Variables & Customization:** Users can modify variables such as dataset paths, split percentages, and export directories to suit their needs. The code block for loading datasets using `fiftyone.zoo.load_zoo_dataset` demonstrates how to swap datasets effectively.

### Mount the Notebook to Google Drive

Here we import the `drive` module, which links the Colab environment to Google Drive, where the desired dataset is stored. Once the drive is mounted, any file in Google Drive can be accessed directly from the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Download a Dataset from the FiftyOne Dataset Zoo

The FiftyOne Dataset Zoo is a collection of high-quality datasets that can be downloaded through the open-source FiftyOne library for training machine learning models. The datasets are often preprocessed, minimizing the amount of labor required to prepare them for training. More information about the FiftyOne Dataset Zoo can be found at the following link: https://docs.voxel51.com/user_guide/dataset_zoo/index.html.

### Install the Necessary Libraries

First, the FiftyOne library must be installed in the Colab environment. This is done using the `!pip install` command.

In [None]:
!pip install fiftyone

Both the `fiftyone` and `fiftyone.zoo` modules must be imported in order to interact with the collection of datasets.

In [None]:
import fiftyone
import fiftyone.zoo

### Load the Dataset

Within the `try` block, the `fiftyone.zoo.load_zoo_dataset()` function loads the specified dataset into the `dataset` variable. In this example the script downloads a portion of the COCO dataset. The `split` argument indicates whether the training or validation portion of the dataset should be loaded, and `label_types` defines which types of labels are loaded with the dataset. `classes` restricts the download to images containing the listed classes rather than the entire COCO dataset, which drastically reduces the download time. `max_samples` sets the maximum number of images to download from that subset.

The ‘except’ block ensures that any errors that occur are caught and a corresponding error message is printed along with the exception details.

In [None]:
try:
    dataset = fiftyone.zoo.load_zoo_dataset(
      "coco-2017", # adjust this string according to the desired dataset
      split="train", #optional
      label_types=["detections", "segmentations"],
      classes=['bird', 'cat', 'dog', 'bear'], #optional
      max_samples=3000,
    )
    print("Dataset loaded successfully.")
except Exception as e:
    print(f"Error loading dataset: {e}")
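
Before exporting, it can help to confirm what was actually downloaded. The following is a minimal sketch rather than part of the original workflow: `summarize_dataset` is a hypothetical helper, and it assumes the detection labels are stored in a field named `detections` (the same name passed to `label_field` in the export step). It relies only on `len(dataset)` and FiftyOne's `count_values()` aggregation.

```python
def summarize_dataset(dataset, label_field="detections"):
    """Return (number of samples, {class label: number of boxes})."""
    # A FiftyOne Detections field keeps its boxes in a list attribute that is
    # also called "detections", hence the repeated name in the path.
    # Assumption: the label field is named "detections".
    label_path = f"{label_field}.detections.label"
    counts = dataset.count_values(label_path)
    return len(dataset), dict(counts)
```

Calling `summarize_dataset(dataset)` after the loading cell returns the sample count and a per-class tally. The tally may list classes beyond the four requested, since images are selected for containing at least one requested class and can contain other labeled objects as well.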

### Export the Dataset

The `export_dir` variable specifies the path for the dataset to be saved.

In [None]:
export_dir = "/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/coco"

The export directory path, export format, label field, and split are defined within the ‘try’ block. It is critical to ensure that `dataset_type` is `fiftyone.types.YOLOv5Dataset`, or else the dataset will be unusable for fine tuning the YOLO model. A print statement alerts the user when the script has been successfully completed.

The ‘except’ block ensures that any errors that occur are caught and a corresponding error message is printed along with the exception details.

Depending on the number of images being exported, the script may take a significant amount of time to complete. If the dataset is being written to Google Drive, please allow additional time for the files to appear.

In [None]:
# Export the dataset in YOLO format
try:
  dataset.export(
    export_dir=export_dir,
    dataset_type=fiftyone.types.YOLOv5Dataset,
    label_field="detections", # This field specifies where the relevant detection labels are stored
    split='train'  # This line is optional unless specifically handling splits differently
  )
  print("Dataset exported successfully.")
except Exception as e:
  print(f"Error exporting dataset: {e}")
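
Because files can take a while to appear on Google Drive, a quick count of what has been written is a useful check. The sketch below uses only the standard library; `count_files_by_extension` is a hypothetical helper name, and it makes no assumption about how the exporter arranges its subfolders because it walks the whole directory tree.

```python
import os
from collections import Counter

def count_files_by_extension(root_dir):
    """Walk root_dir recursively and count files per lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return dict(counts)
```

For example, `count_files_by_extension(export_dir)` returns a dictionary such as one entry for `.jpg` and one for `.txt`. Comparing the number of image files with the number of label files gives a rough indication of whether the export finished.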

## Split the Dataset

### Import the Necessary Libraries

In order to sort the dataset into a suitable training and validation split for model finetuning, certain Python libraries need to be imported. These are the `os`, `shutil`, `random`, and `logging` modules from the Python standard library, which provide functions for interacting with files, shuffling the data, and debugging.

In [None]:
import os
import shutil
import random
import logging

This line of code configures the log that the script uses to keep track of events that occur while it is running. With this format, every log entry records when the event occurred (`asctime`), how important it was (`levelname`), and what the event was (`message`).

In [None]:
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
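
To see what an entry in this format looks like, the sketch below formats a single hand-built record with the same format string. The logger name `demo`, the file name `demo.py`, and the message text are placeholders chosen for illustration.

```python
import logging

# Same format string as the logging configuration in this notebook
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# Build one WARNING record by hand so the formatted line can be inspected
record = logging.LogRecord(
    name="demo", level=logging.WARNING, pathname="demo.py", lineno=0,
    msg="Label file not found: example.txt", args=None, exc_info=None,
)
line = formatter.format(record)
print(line)
```

The printed line starts with a timestamp, followed by the level name `WARNING` and then the message, each separated by ` - `.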

Here the paths are configured for the original dataset, as well as the destination folders for the images and labels. The paths defined in the next cell produce the following file organization, in which every image has a label file of the same name in a parallel `labels` folder:

```
dataset_directory
├── images
│   ├── train
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   └── ...
│   └── val
│       ├── image3.jpg
│       ├── image4.jpg
│       └── ...
└── labels
    ├── train
    │   ├── image1.txt
    │   ├── image2.txt
    │   └── ...
    └── val
        ├── image3.txt
        ├── image4.txt
        └── ...
```
It is crucial that the training and validation files follow a consistent layout like this one, or else the finetuning script will not be able to pair the images with their labels.

`validation_split_percentage = 0.2` ensures that 20% of the dataset is reserved for validation, while the remaining 80% is used to train the model. This split percentage can be modified according to the available data and the specific requirements of the model, but 0.2 is a common choice.


In [None]:
# Paths configuration
dataset_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/coco-dataset'
train_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/images/train'
val_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/images/val'
train_labels_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/labels/train'
val_labels_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/labels/val'
validation_split_percentage = 0.2
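
As a quick illustration of what a split value means in file counts, the sketch below applies the same `int(n * (1 - val_split))` rule that the `sort_data` function uses later in this notebook. `split_counts` is a hypothetical helper introduced only for this example.

```python
def split_counts(n_items, val_split):
    """Return (n_train, n_val) for a dataset of n_items files."""
    # int() truncates, so any remainder ends up in the validation set
    split_index = int(n_items * (1 - val_split))
    return split_index, n_items - split_index

for n in (10, 3000):
    n_train, n_val = split_counts(n, 0.2)
    print(f"{n} items -> {n_train} train, {n_val} validation")
```

With `max_samples=3000` and a split of 0.2, this works out to 2400 training files and 600 validation files, provided every image has a matching label.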

The `os.makedirs()` function is then used to create each of the previously defined directories. Because `exist_ok=True` is passed, directories that already exist are left as they are and no error is raised.

In [None]:
# Ensure target directories exist
os.makedirs(train_path, exist_ok=True)
os.makedirs(val_path, exist_ok=True)
os.makedirs(train_labels_path, exist_ok=True)
os.makedirs(val_labels_path, exist_ok=True)

### Define the Sorting Function

This large code block defines the `sort_data` function for later use. It incorporates several error handling blocks to ensure graceful failure should there be something wrong with the files.

The `sort_data` function finds the images and their matching label files and randomly shuffles them using the random module imported at the beginning of the script. Then, the files are moved into separate directories for training and validation images and labels. An ‘if’ block is included to print a warning if any images are missing labels.

For large image datasets this script may take a long time to execute in Colab. If the dataset is being sorted into folders on Google Drive (as it is in this example), the images and label files may take additional time to appear in the directories once the script has completed.

In [None]:
# Function to sort data into train and validation sets
def sort_data(dataset_path, train_path, val_path, train_labels_path, val_labels_path, val_split):
    try:
        images_path = os.path.join(dataset_path, 'images')
        labels_path = os.path.join(dataset_path, 'labels')

        image_extensions = ['.png', '.jpg', '.jpeg']
        image_files = [f for f in os.listdir(images_path) if os.path.splitext(f)[1].lower() in image_extensions]
        label_files = {os.path.splitext(f)[0] for f in os.listdir(labels_path) if f.lower().endswith('.txt')}

        matching_files = [os.path.splitext(img)[0] for img in image_files if os.path.splitext(img)[0] in label_files]

        random.shuffle(matching_files)
        split_index = int(len(matching_files) * (1 - val_split))
        train_images = matching_files[:split_index]
        val_images = matching_files[split_index:]

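        # Images without a label file are not moved; collect them so they can be reported below
        missing_labels = [os.path.splitext(img)[0] + '.txt' for img in image_files if os.path.splitext(img)[0] not in label_files]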

        for img in train_images:
            for ext in image_extensions:
                image_file = os.path.join(images_path, img + ext)
                if os.path.exists(image_file):
                    shutil.move(image_file, os.path.join(train_path, img + ext))
                    break  # Break out of the loop once the correct extension is found

            label_file = os.path.join(labels_path, img + '.txt')
            if os.path.exists(label_file):
                shutil.move(label_file, os.path.join(train_labels_path, img + '.txt'))
            else:
                logging.warning(f"Label file not found: {label_file}")

        for img in val_images:
            for ext in image_extensions:
                image_file = os.path.join(images_path, img + ext)
                if os.path.exists(image_file):
                    shutil.move(image_file, os.path.join(val_path, img + ext))
                    break  # Break out of the loop once the correct extension is found

            label_file = os.path.join(labels_path, img + '.txt')
            if os.path.exists(label_file):
                shutil.move(label_file, os.path.join(val_labels_path, img + '.txt'))
            else:
                logging.warning(f"Label file not found: {label_file}")

        if missing_labels:
            logging.warning("Some images are missing label files.")
            logging.warning(f"Missing labels: {missing_labels}")
        logging.info("Data sorted into training and validation directories")
    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")
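
Because `sort_data` shuffles with the global random state and moves files, rerunning it produces a different split each time. The sketch below shows one possible way to preview a reproducible split without touching any files. `plan_split` and its `seed` parameter are hypothetical additions for illustration, not part of the original function.

```python
import random

def plan_split(stems, val_split, seed=0):
    """Return (train_stems, val_stems) without moving any files."""
    ordered = sorted(stems)       # fixed starting order regardless of input order
    rng = random.Random(seed)     # private generator; the global random state is untouched
    rng.shuffle(ordered)
    split_index = int(len(ordered) * (1 - val_split))
    return ordered[:split_index], ordered[split_index:]
```

Calling `plan_split` twice with the same file stems and the same seed returns the same two lists, which makes it easier to compare training runs on an identical split.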

### Sort the Dataset

This block calls the previously defined `sort_data` function, passing in the folder paths and the validation split defined in an earlier cell. When the script has finished sorting the dataset, a message is printed to alert the user.

In [None]:
if __name__ == "__main__":
    sort_data(dataset_path, train_path, val_path, train_labels_path, val_labels_path, validation_split_percentage)

print('Dataset has been sorted into training and validation folders.')
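
Once the files have appeared in the destination folders, it is worth confirming that every image still has a label and vice versa. The following sketch compares file stems between an image folder and a label folder. `find_unpaired` is a hypothetical helper, and the extension list mirrors the one used in `sort_data`.

```python
import os

IMAGE_EXTENSIONS = {'.png', '.jpg', '.jpeg'}

def find_unpaired(images_dir, labels_dir):
    """Return (images without a label, labels without an image) as sorted lists of stems."""
    image_stems = {
        os.path.splitext(f)[0] for f in os.listdir(images_dir)
        if os.path.splitext(f)[1].lower() in IMAGE_EXTENSIONS
    }
    label_stems = {
        os.path.splitext(f)[0] for f in os.listdir(labels_dir)
        if f.lower().endswith('.txt')
    }
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

For example, `find_unpaired(train_path, train_labels_path)` and `find_unpaired(val_path, val_labels_path)` should both return two empty lists if the sort completed cleanly.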

## Creating a YAML File

### Write the File Contents

In order to train a YOLO model successfully, it is crucial to have an accurate .yaml file. This file tells the model where to find the training and validation files, how many classes there are, and what the classes are called. The class names **must** be listed in the same order as the class indices used in the dataset label files, or else training will fail or the predictions will carry the wrong names.

The content of the .yaml file will be assigned in string format to the `yaml_content` variable.

`train` and `val` are the paths to the training and validation **image** folders, respectively. If the directory structure set up in the 'Split the Dataset' section was followed correctly, the model will be able to find the corresponding label folders without their paths being provided.

`nc` is the total number of classes contained in the dataset. In this example, a custom deer dataset has been aggregated with the COCO dataset, yielding a total of 80 classes. Because class indices start at 0, the deer class, which is last in the list of names, is labeled 79 in the label files.

In [None]:
yaml_content = """
train: /content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/images/train
val: /content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/images/val
nc: 80
names: ['dog', 'person', 'bench', 'potted plant', 'dining table', 'cup', 'knife', 'spoon',
        'cake', 'book', 'umbrella', 'handbag', 'bird', 'cell phone', 'car', 'tie', 'backpack',
        'traffic light', 'teddy bear', 'chair', 'clock', 'parking meter', 'elephant', 'cow',
        'boat', 'skateboard', 'baseball bat', 'baseball glove', 'bottle', 'truck', 'couch',
        'tennis racket', 'sports ball', 'fork', 'vase', 'zebra', 'horse', 'train', 'surfboard',
        'bus', 'fire hydrant', 'frisbee', 'suitcase', 'cat', 'bowl', 'bicycle', 'motorcycle',
        'airplane', 'tv', 'stop sign', 'laptop', 'wine glass', 'microwave', 'sink',
        'refrigerator', 'giraffe', 'sheep', 'broccoli', 'banana', 'oven', 'apple', 'orange',
        'kite', 'snowboard', 'remote', 'pizza', 'bed', 'skis', 'donut', 'sandwich', 'hot dog',
        'bear', 'toaster', 'scissors', 'toilet', 'toothbrush', 'carrot', 'mouse', 'keyboard',
        'deer']
"""

`yaml_path` is simply the path to where the finished .yaml file will be saved.

In [None]:
# Change this path to where you want to save the YAML file in your Google Drive
yaml_path = '/content/drive/MyDrive/shared-data/Notebook datafiles/Finetuning-YOLOv5-animals/coco-deer-training.yaml'

The script begins by opening a file at the path stored in the `yaml_path` variable using the `open()` function. The `'w'` argument opens the file for writing, and the file object is assigned to the variable `f`. `f.write(yaml_content)` writes the `yaml_content` string to the opened file. The file is closed automatically once the `with` block is exited.

Once the file is closed a print statement indicates where the .yaml file has been saved.

In [None]:
with open(yaml_path, 'w') as f:
    f.write(yaml_content)

print(f"YAML file saved to {yaml_path}")
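
Since the class indices in the label files have to agree with `nc`, a quick scan of the label folders can catch mismatches before training starts. The sketch below assumes the usual YOLO label format, in which each line of a `.txt` file begins with an integer class index followed by the normalized box coordinates. Both function names are hypothetical helpers for illustration.

```python
import os

def class_ids_in_labels(labels_dir):
    """Collect every class index that appears in the YOLO label files of labels_dir."""
    class_ids = set()
    for name in os.listdir(labels_dir):
        if not name.lower().endswith('.txt'):
            continue
        with open(os.path.join(labels_dir, name), 'r') as f:
            for line in f:
                parts = line.split()
                if parts:  # skip blank lines
                    class_ids.add(int(parts[0]))
    return class_ids

def ids_outside_range(class_ids, nc):
    """Return the class indices that do not fit in the range 0 .. nc - 1."""
    return sorted(i for i in class_ids if not 0 <= i < nc)
```

For example, `ids_outside_range(class_ids_in_labels(train_labels_path), 80)` should return an empty list for the configuration in this notebook; any index it does return has no corresponding entry in `names`.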

### Check the YAML File

If desired, the .yaml file can be loaded in order to check its contents. First the `yaml` library must be imported.

The `open()` function is used again with `yaml_path`, but rather than creating a new file it opens the newly created .yaml file. The `'r'` argument opens the file for reading, and `as f` assigns the file object to `f`. `yaml.safe_load` reads the content of the file and converts it to a Python dictionary or a list (depending on the file contents). The `safe_load` function only constructs simple Python objects, so a malicious .yaml file cannot trigger arbitrary code when it is loaded.

In [None]:
import yaml
with open(yaml_path, 'r') as f:
    loaded_yaml = yaml.safe_load(f)
print(loaded_yaml)
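
Printing the dictionary shows its contents, but a few automated checks can catch the most common mistakes. The sketch below is one possible set of checks, assuming `names` is a list as it is in this notebook. `check_data_config` is a hypothetical helper, not part of any YOLO library.

```python
def check_data_config(config):
    """Return a list of human-readable problems found in a YOLO data config dict."""
    problems = []
    for key in ("train", "val", "nc", "names"):
        if key not in config:
            problems.append(f"missing key: {key}")
    if "nc" in config and "names" in config:
        names = config["names"]
        if config["nc"] != len(names):
            problems.append(f"nc is {config['nc']} but names has {len(names)} entries")
        if len(set(names)) != len(names):
            problems.append("names contains duplicates")
    return problems
```

Calling `check_data_config(loaded_yaml)` returns an empty list when the four expected keys are present, `nc` matches the number of names, and no name is repeated.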