<a href="https://colab.research.google.com/github/samuelyoon17/sam2-auto-annotation-pipeline/blob/main/Step_1_Uploading_GWHD_2021_Dataset_to_Roboflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1 - Uploading GWHD 2021 Dataset to Roboflow

Goal
* To upload the original GWHD 2021 (Object Detection) Dataset to Roboflow

Resources
* https://www.kaggle.com/code/vbookshelf/gwhd-how-to-parse-the-data
* https://roboflow.com/formats/coco-json
* https://www.v7labs.com/blog/coco-dataset-guide
* https://github.com/levan92/cocojson/blob/main/docs/coco.md
* https://www.geeksforgeeks.org/python/reading-and-writing-json-to-a-file-in-python/
* https://www.geeksforgeeks.org/python/json-dumps-in-python/
* https://docs.roboflow.com/developer/upload-a-dataset
* https://www.kaggle.com/datasets/vbookshelf/global-wheat-head-dataset-2021

Dataset
* https://zenodo.org/records/5092309#.Y7ksF-xBzUL





# Environment Set-Up

In [None]:
!pip install roboflow

Collecting roboflow
  Downloading roboflow-1.2.0-py3-none-any.whl.metadata (9.7 kB)
Collecting idna==3.7 (from roboflow)
  Downloading idna-3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting opencv-python-headless==4.10.0.84 (from roboflow)
  Downloading opencv_python_headless-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting pillow-heif<2 (from roboflow)
  Downloading pillow_heif-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting pillow-avif-plugin<2 (from roboflow)
  Downloading pillow_avif_plugin-1.5.2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting python-dotenv (from roboflow)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from roboflow)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading roboflow-1.2.0-py3-none-any.whl (86 kB)

In [None]:
import pandas as pd
import numpy as np
import os
import cv2
import ast
import matplotlib.pyplot as plt
import json
import shutil
import roboflow
from roboflow import Roboflow

In [None]:
# Location to where the original dataset is stored
dataset_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/gwhd_2021'

In [None]:
# Load training data

path = os.path.join(dataset_path, 'competition_train.csv')
df_train = pd.read_csv(path)

In [None]:
# Load validation data

path = os.path.join(dataset_path, 'competition_val.csv')
df_val = pd.read_csv(path)

In [None]:
# Load testing data

path = os.path.join(dataset_path, 'competition_test.csv')
df_test = pd.read_csv(path)

# Create COCO JSON files



In [None]:
def generate_json(df, split_dict):
  """
  Generate COCO JSON annotations and image information from a pandas dataframe.

  This function iterates through a pandas DataFrame containing image names and
  bounding box information, and populates a dictionary with the necessary
  data to create a COCO JSON file for object detection annotations.

  Args:
    df (pandas.DataFrame): DataFrame containing 'image_name' and 'BoxesString' columns.
                           'BoxesString' should contain bounding box coordinates
                           in the format 'xmin ymin xmax ymax;xmin ymin xmax ymax;...'.
    split_dict (dict): A dictionary to store the COCO JSON data. This dictionary
                       should have the following keys initialized: 'info', 'licenses',
                       'categories', 'images', and 'annotations'. The function
                       will append image and annotation data to the 'images' and
                       'annotations' lists respectively.

  Returns:
    None: The function modifies the `split_dict` in-place.
  """
  annotation_counter = 1
  for i in range(len(df)):
    image_name = df.loc[i, 'image_name']
    box_str = df.loc[i, 'BoxesString'] # Bounding box coordinates given as (xmin, ymin, xmax, ymax)

    if box_str != "no_box":
      box_list = box_str.split(';')
      for index, item in enumerate(box_list):
        coords_list = item.split(' ')
        xmin = ast.literal_eval(coords_list[0])
        ymin = ast.literal_eval(coords_list[1])
        xmax = ast.literal_eval(coords_list[2])
        ymax = ast.literal_eval(coords_list[3])

        split_dict['annotations'].append(
          {
                "id": annotation_counter,
                "image_id": i+1,
                "category_id": 1,
                "bbox": [
                    xmin,
                    ymin,
                    xmax - xmin,
                    ymax - ymin
                ],
                "area": (xmax - xmin) * (ymax - ymin),
                "iscrowd": 0
          }
        )
        annotation_counter += 1
    else:
      # Images without boxes are skipped, so they are not listed under 'images'
      continue


    split_dict['images'].append(
        {
            "id": i+1,
            "license": 1,
            "file_name": image_name,
            "height": 1024,
            "width": 1024
        }
    )
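
The conversion inside `generate_json` can be checked on a single string. The sketch below is a standalone illustration, not part of the pipeline: `boxes_string_to_coco` is a hypothetical helper, and it assumes the coordinates in `BoxesString` are integers. It shows how `xmin ymin xmax ymax` corners become the COCO `[x, y, width, height]` layout.

```python
def boxes_string_to_coco(box_str):
    """Convert 'xmin ymin xmax ymax;...' into a list of COCO [x, y, w, h] boxes.

    Hypothetical helper for illustration; assumes integer coordinates.
    """
    if box_str == "no_box":
        return []
    boxes = []
    for item in box_str.split(";"):
        xmin, ymin, xmax, ymax = (int(value) for value in item.split())
        # COCO stores the top-left corner plus width and height
        boxes.append([xmin, ymin, xmax - xmin, ymax - ymin])
    return boxes


example = "10 20 110 70;0 0 50 50"  # made-up boxes, not taken from the dataset
coco_boxes = boxes_string_to_coco(example)
print(coco_boxes)  # [[10, 20, 100, 50], [0, 0, 50, 50]]
```

The `area` field written by `generate_json` is the product of the last two values of each box, for example `100 * 50` for the first box above.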

In [None]:
# Training dictionary to store data for training COCO JSON file
train_dict = {
  "info" : {
      "year": "2021",
      "version": "1",
      "description": "Training COCO JSON file for training files",

  },
  "licenses": [
      {
          "id": 1,
          "url": "https://zenodo.org/records/5092309#.Y7ksF-xBzUL",
          "name": "Zenodo"
      }
  ],
  "categories": [
      {
          "id": 1,
          "name": "wheat",
      }
  ],

  "images": [],
  "annotations": []
}

# Validation dictionary to store data for validation COCO JSON file
val_dict = {
  "info" : {
      "year": "2021",
      "version": "1",
      "description": "Validation COCO JSON file for validation files",

  },
  "licenses": [
      {
          "id": 1,
          "url": "https://zenodo.org/records/5092309#.Y7ksF-xBzUL",
          "name": "Zenodo"
      }
  ],
  "categories": [
      {
          "id": 1,
          "name": "wheat",
      }
  ],

  "images": [],
  "annotations": []
}


# Testing dictionary to store data for testing COCO JSON file
test_dict = {
  "info" : {
      "year": "2021",
      "version": "1",
      "description": "Test COCO JSON file for testing files",

  },
  "licenses": [
      {
          "id": 1,
          "url": "https://zenodo.org/records/5092309#.Y7ksF-xBzUL",
          "name": "Zenodo"
      }
  ],
  "categories": [
      {
          "id": 1,
          "name": "wheat",
      }
  ],

  "images": [],
  "annotations": []
}

# Populates training, validation, and testing dictionaries with their respective
# data in JSON format
generate_json(df_train, train_dict)
generate_json(df_val, val_dict)
generate_json(df_test, test_dict)
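
Before writing the dictionaries to disk, it may help to confirm that they are internally consistent. The sketch below is one possible check, not a full COCO validator: `check_coco_dict` is a hypothetical helper, and the toy dictionary is made up for the demonstration. It looks for duplicate ids, annotations that point at an image or category that is not listed, and boxes with a non-positive width or height.

```python
def check_coco_dict(coco):
    """Return a list of problems found in a COCO-style dictionary (hypothetical helper)."""
    problems = []
    image_ids = [image["id"] for image in coco["images"]]
    annotation_ids = [ann["id"] for ann in coco["annotations"]]
    if len(image_ids) != len(set(image_ids)):
        problems.append("duplicate image ids")
    if len(annotation_ids) != len(set(annotation_ids)):
        problems.append("duplicate annotation ids")

    known_images = set(image_ids)
    known_categories = {category["id"] for category in coco["categories"]}
    for ann in coco["annotations"]:
        if ann["image_id"] not in known_images:
            problems.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in known_categories:
            problems.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            problems.append(f"annotation {ann['id']}: non-positive box size")
    return problems


# Toy input for the demonstration only
toy = {
    "categories": [{"id": 1, "name": "wheat"}],
    "images": [{"id": 1, "file_name": "a.png", "height": 1024, "width": 1024}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50]},
        {"id": 2, "image_id": 7, "category_id": 1, "bbox": [0, 0, 0, 50]},
    ],
}
toy_problems = check_coco_dict(toy)
print(toy_problems)
```

In this notebook the same function could be called as `check_coco_dict(train_dict)`, and likewise for `val_dict` and `test_dict`; an empty list means none of these checks failed.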

# Save JSON files

In [None]:
# Location for saving new dataset
new_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets'

In [None]:
# Create new directory and subfolders
# The directory hierarchy is designed to match Roboflow's expected input format for object detection datasets

new_path = os.path.join(new_path, 'GWHD 2021 COCO Object Detection')

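# exist_ok=True avoids an error if the folders already exist
os.makedirs(new_path, exist_ok=True)

os.makedirs(os.path.join(new_path, 'train'), exist_ok=True)
os.makedirs(os.path.join(new_path, 'valid'), exist_ok=True)
os.makedirs(os.path.join(new_path, 'test'), exist_ok=True)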

In [None]:
# Create and save JSON annotation files in the COCO format for each split
with open(os.path.join(new_path, 'train', '_annotations.coco.json'), 'w') as outfile:
  json.dump(train_dict, outfile, indent = 4)

with open(os.path.join(new_path, 'valid', '_annotations.coco.json'), 'w') as outfile:
  json.dump(val_dict, outfile, indent = 4)

with open(os.path.join(new_path, 'test', '_annotations.coco.json'), 'w') as outfile:
  json.dump(test_dict, outfile, indent = 4)

# Organize Dataset Splits for Roboflow

Copy the images into their respective split folders (train, valid, and test) so that each folder holds both its images and the corresponding COCO JSON annotation file. This directory structure is required for uploading the dataset to Roboflow, and it keeps the three splits separate on the platform.
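
The layout described above can be verified with a few lines of code. The following is a minimal sketch: `missing_layout_items` is a hypothetical helper, the folder names and the `_annotations.coco.json` file name are the ones used in this notebook, and the demonstration runs on a temporary directory rather than on the real dataset.

```python
import os
import tempfile

ANNOTATION_FILE = "_annotations.coco.json"  # file name used in this notebook


def missing_layout_items(root, splits=("train", "valid", "test")):
    """List split folders or annotation files that are absent under root."""
    missing = []
    for split in splits:
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            missing.append(split_dir)
            continue
        annotation_path = os.path.join(split_dir, ANNOTATION_FILE)
        if not os.path.isfile(annotation_path):
            missing.append(annotation_path)
    return missing


# Demonstration on a throwaway directory that lacks the 'test' folder
with tempfile.TemporaryDirectory() as root:
    for split in ("train", "valid"):
        os.makedirs(os.path.join(root, split))
        open(os.path.join(root, split, ANNOTATION_FILE), "w").close()
    report = [os.path.relpath(path, root) for path in missing_layout_items(root)]

print(report)  # ['test']
```

Calling `missing_layout_items(new_path)` on the real dataset folder should return an empty list once the folders and JSON files have been created.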

In [None]:
def copy_images_to_splits(df, original_path, new_location):
  """
  Copies images from the original dataset directory to the new split directory
  based on image names in the provided DataFrame.

  This function iterates through a pandas DataFrame containing image names and
  copies each corresponding image file from the original dataset directory
  to a specified new location.

  Args:
    df (pandas.DataFrame): DataFrame containing an 'image_name' column.
    original_path (str): The path to the directory containing the original images.
    new_location (str): The path to the destination directory for the copied images.

  Returns:
    None.
  """
  for i in range(len(df)):
    image_name = df.loc[i, 'image_name']
    image_path = os.path.join(original_path, image_name)
    try:
      if not os.path.exists(os.path.join(new_location, image_name)):
        shutil.copy(image_path, new_location)
    except OSError as e:
      # Report the failure and move on so that one bad file does not stop the copy
      print(f"Failed to copy image {image_name}: {e}\nProceeding to the next image")

In [None]:
# Copies images from the original location to each split's new location
copy_images_to_splits(df_train, os.path.join(dataset_path, 'images'), os.path.join(new_path, 'train'))
copy_images_to_splits(df_val, os.path.join(dataset_path, 'images'), os.path.join(new_path, 'valid'))
copy_images_to_splits(df_test, os.path.join(dataset_path, 'images'), os.path.join(new_path, 'test'))

In [None]:
# Check the number of images in each folder
# Expecting 3657, 1476, 1382 for training, validation, and testing splits

train_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train'
valid_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid'
test_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test'

split_type = {'train': train_path, 'valid': valid_path, 'test': test_path}

for split, path in split_type.items():
  png_count = 0
  for filename in os.listdir(path):
      if filename.endswith('.png'):
          png_count += 1
      else:
        print(f"Non-png file: {filename}")

  print(f"Number of PNG files in '{split}': {png_count}")

Non-png file: _annotations.coco.json
Number of PNG files in 'train': 3655
Non-png file: _annotations.coco.json
Number of PNG files in 'valid': 1476
Non-png file: _annotations.coco.json
Number of PNG files in 'test': 1381


In [None]:
# Determine which images from the original directory were not copied to any split

folder_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/gwhd_2021/images' # Replace with the actual path to your folder

train_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train'
valid_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid'
test_path = '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test'

for filename in os.listdir(folder_path):
  if not(os.path.exists(os.path.join(train_path, filename)) or os.path.exists(os.path.join(valid_path, filename)) or os.path.exists(os.path.join(test_path, filename))):
    print(f'{filename} cannot be found')

b588e1b55fbf8c4c6af08886013c0c36b70dd617fb1a5070829295a5c3ab31a8.png cannot be found
8cc5870f73527da07937acc806002e1272e6200095c656ca326a680a90fab507.png cannot be found
094dcc9098204e6f751504515f3b0a7b5f5ad500a8bd2ec10124ed3d4fdbb6ed.png cannot be found
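
The check above starts from the files in the original folder. A complementary check starts from the names listed in each CSV and compares them with what is on disk. The sketch below uses set differences for this: `compare_names` is a hypothetical helper, and the file names in the demonstration are made up.

```python
import os
import tempfile


def compare_names(expected_names, folder, extension=".png"):
    """Return (missing, unexpected) file names relative to the expected list."""
    on_disk = {name for name in os.listdir(folder) if name.endswith(extension)}
    expected = set(expected_names)
    # missing: listed but not on disk; unexpected: on disk but not listed
    return sorted(expected - on_disk), sorted(on_disk - expected)


# Demonstration with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.png", "b.png", "extra.png"):
        open(os.path.join(folder, name), "w").close()
    missing, unexpected = compare_names(["a.png", "b.png", "c.png"], folder)

print(missing)     # ['c.png']
print(unexpected)  # ['extra.png']
```

In this notebook the call would look like `compare_names(df_train['image_name'], train_path)`, repeated for the validation and testing splits.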


# Upload OD GWHD 2021 Dataset to Roboflow

In [None]:
roboflow.login(force=True)

visit https://app.roboflow.com/auth-cli to get your authentication token.
Paste the authentication token here: ··········


In [None]:
# Replace PLACEHOLDER_FOR_API_KEY with your Roboflow Private API key
# More directions can be found on the website below
# https://docs.roboflow.com/developer/authentication/find-your-roboflow-api-key
API_KEY = "PLACEHOLDER_FOR_API_KEY"
rf = Roboflow(api_key=API_KEY)
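
As an alternative to pasting the key into the notebook, the key can be read from an environment variable so that it is not saved with the file. This is only a sketch: the variable name `ROBOFLOW_API_KEY` and the helper `get_api_key` are assumptions made for this example, not something the Roboflow package requires.

```python
import os


def get_api_key(env_var="ROBOFLOW_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage, once the variable has been set outside the notebook:
# API_KEY = get_api_key()
# rf = Roboflow(api_key=API_KEY)
```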

In [None]:
# Connect to gwhd-2021 workspace on Roboflow
workspace = rf.workspace("gwhd-2021")
print(rf.workspace())

loading Roboflow workspace...
loading Roboflow workspace...
{
  "name": "GWHD 2021",
  "url": "gwhd-2021",
  "projects": []
}


In [None]:
# Upload training split
workspace.upload_dataset(
    '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train',
    'gwhd2021OD',
    num_workers = 10,
    project_license = "MIT",
    project_type = "object-detection",
    batch_name = "Train",
    num_retries=5
)

# Upload testing split
workspace.upload_dataset(
    '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test',
    'gwhd2021OD',
    num_workers = 10,
    project_license = "MIT",
    project_type = "object-detection",
    batch_name = "Test",
    num_retries=5
)

# Upload validation split
workspace.upload_dataset(
    '/content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid',
    'gwhd2021OD',
    num_workers = 10,
    project_license = "MIT",
    project_type = "object-detection",
    batch_name = "Valid",
    num_retries=5
)

loading Roboflow project...
loading Roboflow project...


100%|██████████| 3655/3655 [00:00<00:00, 6868.11it/s]


Created project gwhd-2021/gwhd2021od-od3w5
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train/0077c64686e9712d9b8efcd930ce5b0d68c72d8fb50ee99ed01ff2fd73e6d1d2.png (jgtS8GHohkQ1PKdKImfz) [1.8s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train/0099b614c6a3eaf82daaef1aaa2607dc537183b818d7f531e636cf2756fa046e.png (FsfJMENW1gZO8EtbTQ7q) [2.6s] / annotations = OK [0.3s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train/004b381a051838dc0cc8ff293e09823faa1dd6f26e82ffa99af6bbef6fe6168c.png (ZjcVISyVSOAOJjpCQww5) [2.6s] / annotations = OK [0.3s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/train/0007634580386bd39d4d0d24df58893c3bb967e12d6fc065ce8659e9acacc928.png (kyOrxHdNGxoR8X825v72) [2.6s] / annotations = OK [0.4s]
[UPLO

100%|██████████| 1381/1381 [00:00<00:00, 3519.45it/s]


Created project gwhd-2021/gwhd2021od-youfr
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test/024c2faa620413c4f44b84fa696bfd4fe3625b9b269e0523b1f5198af49f0572.png (aIn9TSreELblQIkqEMKs) [1.7s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test/01db0f2a94e02ddb59f03f319b0d4f639c2ec44ad87016ff214cce6023a97d35.png (eCVE8vBlOot1OoGlKoci) [1.8s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test/01225c4ab5e78e7f6a292e2648642d87853cb88dc639c0e16c286615f63d41bb.png (rY5TKCaRevDfOM3k4Sku) [2.0s] / annotations = OK [0.3s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/test/00890d0d95e9c6841d98c4c5846f84e09a6f87e7224f0e05872f35856c803ebf.png (5sBeILHBOfEkdbR5Kz7z) [2.1s] / annotations = OK [0.4s]
[UPLOADED

100%|██████████| 1476/1476 [00:00<00:00, 10046.43it/s]


Created project gwhd-2021/gwhd2021od-dgg1n
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid/00e6e6ba993877c21066a4050c601e66f47d9b21a8d3dce9b399e4882d6ba3f1.png (th2xwyP2oTMuwCBd58YM) [1.8s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid/00319488e879a811698174d9f26ef174f2f108a13e12edee5a3c50899ed26336.png (HFVpLfd9AwoIyPE79re7) [2.2s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid/004cf579e4a96bfadc9c626f4fc6f5270795d01f49b6f00628879667586219b5.png (q7VjpF5ta3aMCsQvQcZu) [2.2s] / annotations = OK [0.4s]
[UPLOADED] /content/drive/MyDrive/GWHD 2021 Segmentation Research/Datasets/GWHD 2021 COCO Object Detection/valid/00c63d3b51b886f9c29ca196e1e212a2c790408ae5428a1181ce958a83d8a6be.png (CBzujp46cKNo25SGxrRP) [2.4s] / annotations = OK [0.3s]
[UPLO
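
The three `upload_dataset` calls above differ only in the folder and the batch name, so they could be driven from a small mapping. The sketch below reuses the keyword arguments from those calls. `upload_splits` is a hypothetical helper, and `RecordingWorkspace` is a stand-in written for this example that records calls instead of contacting Roboflow, so the block runs without credentials.

```python
import os


def upload_splits(workspace, dataset_root, project_name, splits):
    """Call upload_dataset once per split folder (hypothetical helper)."""
    for folder, batch_name in splits.items():
        workspace.upload_dataset(
            os.path.join(dataset_root, folder),
            project_name,
            num_workers=10,
            project_license="MIT",
            project_type="object-detection",
            batch_name=batch_name,
            num_retries=5,
        )


class RecordingWorkspace:
    """Stand-in for a Roboflow workspace; records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def upload_dataset(self, path, project_name, **kwargs):
        self.calls.append((os.path.basename(path), project_name, kwargs["batch_name"]))


fake = RecordingWorkspace()
upload_splits(
    fake,
    "dataset_root",  # placeholder path for the demonstration
    "gwhd2021OD",
    {"train": "Train", "test": "Test", "valid": "Valid"},
)
print(fake.calls)
```

With the real `workspace` object from this notebook, the call would be `upload_splits(workspace, new_path, 'gwhd2021OD', {...})` with the same mapping.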