<a href="https://colab.research.google.com/github/rafal-bro/techlabs-instance-segmentation/blob/main/1_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Data Preprocessing**

## Introduction

The classical first step in any data science and machine learning project is to collect the necessary data. The [Wild Intelligence Lab (WIL)](https://wildintelligencelab.com) provided us with a small number of already labeled drone images that are of interest for this project. We added unlabeled image data from the *savmap* dataset, which is publicly available [here](https://zenodo.org/record/1204408#.YlB2x7hCTfY).
Like the images provided by WIL, the additional images were recorded in the same reserve and show small areas of steppe landscape from above. They are 4000x3000 pixels in size.

As the goal of this work is to provide accurate detections of landscape features using deep learning techniques, we needed to extend our training dataset by adding labels. This is a manual process in which a polygon is drawn around each object and a class is then assigned to it. Various applications for this labeling task exist. We chose the `labelme` package, which can be easily added to the working environment and invoked from the terminal. Part of the *savmap* dataset was labeled, and the remaining images were reserved for inference, i.e. model predictions.

![](https://drive.google.com/uc?id=1S0u8sbXGtFDBkFZkJpMgTZjOztH-mw2R)

       Fig 1:  Annotation process in labelme


Fig. 1 shows the labeling process. Corresponding to our detection goals, we defined six classes: tree, bush, dead tree, road, aardvark hole, and animal. In the labelme GUI, they are marked with different colours. The so-called masks resulting from the labeling process are stored in one JSON file per image. Once all instances were labeled, we performed data cleaning and filtering (i.e. exclusion of images without labels) and created a COCO-like dataset using the package `labelme2coco`.
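
The filtering step (excluding images without labels) can be sketched as follows. This is a minimal illustration and not the project's `preprocessing` helper: the function name `find_labeled_annotations` is hypothetical, and the sketch assumes the standard labelme layout in which each JSON file holds a `shapes` list with one entry per drawn polygon and a `label` field per shape.

```python
import json
from collections import Counter
from pathlib import Path


def find_labeled_annotations(folder):
    """Return the labelme JSON files in `folder` that contain at least one
    shape, together with a count of instances per class label.

    Hypothetical helper for illustration only.
    """
    labeled_files = []
    class_counts = Counter()
    for json_path in sorted(Path(folder).glob("*.json")):
        with open(json_path, "r", encoding="utf-8") as f:
            record = json.load(f)
        shapes = record.get("shapes", [])
        if not shapes:
            # no polygon was drawn on this image, so it is excluded
            continue
        labeled_files.append(json_path)
        class_counts.update(shape["label"] for shape in shapes)
    return labeled_files, class_counts
```

The returned class counts give a quick check of how balanced the six classes are before the COCO-like dataset is built.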

**What is the COCO format?** [COCO (Common Objects in Context)](https://cocodataset.org/#home) is a large image dataset designed for object detection, semantic segmentation, and instance segmentation. A COCO-like dataset consists of a folder containing the image data and a single file that describes the instance shapes and locations for all images in computer-readable form. This dataset file stores its annotations in JSON format, describing object classes, bounding boxes, and bitmasks. A JSON file in COCO format has the following structure:

```
{
    "images": [
        {
            "height": 2400,
            "width": 2400,
            "id": 0,
            "license": 1,
            "file_name": "<filename0>.<ext>"
        },...
    ],
    "annotations": [
        {
            "id": 0,
            "image_id": 0,
            "category_id": 2,
            "bbox": [260, 177, 231, 199],
            "segmentation": [...],
            "area": 45969,
            "iscrowd": 0
        },...
    ],
    "categories": [...
        {
            "id": 2,
            "name": "tree",
            "supercategory": "tree"
        },...
    ]
}
```
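
To make the `bbox` and `area` fields concrete, the sketch below derives both from a polygon given as a flat COCO-style coordinate list `[x1, y1, x2, y2, ...]`. The function name `polygon_to_bbox_and_area` and the rectangle used as input are illustrative assumptions; the rectangle is chosen so that its values match the example annotation above, and the real `segmentation` entry (elided in the example) is not reproduced here.

```python
def polygon_to_bbox_and_area(segmentation):
    """Compute a COCO-style bbox [x_min, y_min, width, height] and the
    polygon area from a flat coordinate list [x1, y1, x2, y2, ...]."""
    xs = segmentation[0::2]
    ys = segmentation[1::2]
    x_min, y_min = min(xs), min(ys)
    bbox = [x_min, y_min, max(xs) - x_min, max(ys) - y_min]

    # shoelace formula: sum of cross products of consecutive vertices
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return bbox, abs(twice_area) / 2.0


# illustrative input: a 231 x 199 rectangle with its top-left corner at (260, 177)
rectangle = [260, 177, 491, 177, 491, 376, 260, 376]
bbox, area = polygon_to_bbox_and_area(rectangle)
print(bbox, area)  # [260, 177, 231, 199] 45969.0
```

Note that the bounding box is stored as top-left corner plus width and height, not as two corner points.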

As we annotated images of size 4000x3000 pixels, the whole process of labeling was very time-consuming. To ensure a uniform image size throughout the dataset (the images provided by WIL were 2048x2048 pixels) and to allow a larger batch size during training (lower memory requirement per iteration), we decided to crop each image into smaller tiles of 1024x1024 pixels. The previously created COCO-like dataset was cropped using the `sahi` Python package, a vision library for slicing images and COCO-like datasets into smaller parts [8]. After slicing, the dataset was filtered again and images without labels were removed.
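
The tiling idea can be illustrated with a small pure-Python sketch that computes the crop boxes for an image. It assumes no overlap between tiles and shifts the last row and column back so that every tile stays inside the image; both choices are assumptions of this sketch, the function name `tile_boxes` is hypothetical, and the exact tile placement used by `sahi` or the project's helper functions may differ.

```python
def tile_boxes(image_width, image_height, tile_size):
    """Return (x_min, y_min, x_max, y_max) boxes that cover the image with
    square tiles of side `tile_size` (hypothetical helper)."""

    def starts(length):
        # start offsets along one axis; the last tile is shifted back so
        # that it ends exactly at the image border
        if length <= tile_size:
            return [0]
        positions = list(range(0, length - tile_size, tile_size))
        positions.append(length - tile_size)
        return positions

    boxes = []
    for y in starts(image_height):
        for x in starts(image_width):
            boxes.append(
                (x, y, min(x + tile_size, image_width), min(y + tile_size, image_height))
            )
    return boxes


savmap_tiles = tile_boxes(4000, 3000, 1024)  # 4 columns x 3 rows
wil_tiles = tile_boxes(2048, 2048, 1024)     # 2 columns x 2 rows
print(len(savmap_tiles), len(wil_tiles))     # 12 4
```

Under these assumptions, the tiles in the last row and column of a 4000x3000 image partly overlap their neighbours, so an object near those borders can appear in more than one tile.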

Overall, we collected 1680 labeled images of 1024x1024 pixels for model development. Finally, we performed a train-test split in the ratio of 80:20, matching the `split=0.8` setting used in the code below. The test set is excluded from model development, and the training set is used for model training in the next notebook.

The steps described above are implemented below.

## Setting up the Environment

In [None]:
!pip install labelme2coco==0.1.2
!pip install sahi
!pip install funcy
!pip install Pillow==9.0.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import modules and mount drive.

In [None]:
import os
import sys

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Specify paths.

In [None]:
# specify path to the project root folder
PROJECT_ROOT = r'/content/drive/MyDrive/techlabs_instance_segmentation'

# specify folder containing raw image data as well as images with labelme annotations - should be placed in project 'data' folder
RAW_DATA_FOLDER = 'savmap_dataset_labeled'

# specify desired folder name for results
DATA_FOLDER = 'kuzikus_coco'

Set flags.

In [None]:
# if True, images will be cropped
CROP_IMGS = True

# specify desired image size for crop 
IMG_SIZE = 1024

# if True, inference dataset will be created i.e. images without labels will be prepared for detection
CREATE_INFERENCE = True

Specify image settings.

In [None]:
# specify initial image format, make sure that all images have the same extension - resulting images will be jpg for memory reasons
IMG_FORMAT = 'png'

In [None]:
# root path of data 
DATA_ROOT = os.path.join(PROJECT_ROOT, 'data')

# path to folder containing helper functions
HELPER_DIR = os.path.join(PROJECT_ROOT, 'helper_functions')

# import helper functions
sys.path.append(HELPER_DIR)
import preprocessing as pre

In [None]:
RAW_DATA_DIR = os.path.join(DATA_ROOT, RAW_DATA_FOLDER)
DATA_DIR = os.path.join(DATA_ROOT, DATA_FOLDER)

## Create COCO-like Datasets

First, clean the data and create the base COCO dataset. This might take some time.

In [None]:
pre.create_coco(RAW_DATA_DIR, IMG_FORMAT, DATA_DIR)

Crop images to desired size and save new dataset to folder. New images without any instances will be automatically excluded.

In [None]:
pre.crop_images_coco(DATA_DIR, IMG_SIZE)

indexing coco dataset annotations...


Loading coco annotations: 100%|██████████| 192/192 [00:01<00:00, 124.80it/s]
  2%|▏         | 3/192 [00:01<02:21,  1.33it/s]06/11/2022 01:09:03 - ERROR - shapely.geos -   TopologyException: Input geom 0 is invalid: Self-intersection at 1285.8375655477105 1962.0938106445351
06/11/2022 01:09:03 - INFO - shapely.geos -   Self-intersection at or near point 1285.8375655477105 1962.0938106445351
 48%|████▊     | 93/192 [01:09<01:16,  1.30it/s]06/11/2022 01:10:11 - ERROR - shapely.geos -   TopologyException: Input geom 0 is invalid: Self-intersection at 1228.1736975940537 1099.41970398383
06/11/2022 01:10:11 - INFO - shapely.geos -   Self-intersection at or near point 1228.1736975940537 1099.41970398383
 70%|███████   | 135/192 [01:38<00:56,  1.02it/s]06/11/2022 01:10:40 - ERROR - shapely.geos -   TopologyException: Input geom 0 is invalid: Self-intersection at 2331.5296523517382 351.61145194274025
06/11/2022 01:10:40 - INFO - shapely.geos -   Self-intersection at or near point 2331.529652351

Saved 1680 entries in /content/drive/MyDrive/techlabs_instance_segmentation/data/kuzikus_coco_1024


Perform the train-test split of the cropped dataset.

In [None]:
COCO_FILE_PATH = os.path.join(DATA_DIR+f'_{IMG_SIZE}', f'{DATA_FOLDER}_{IMG_SIZE}.json')

In [None]:
# 80% will be training and the remaining 20% test
pre.split_coco(COCO_FILE_PATH, split=0.8)
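
The internals of `pre.split_coco` are not shown in this notebook. As a rough sketch of what an image-level split of a COCO-like dataset involves, the hypothetical function `split_coco_dict` below shuffles the images, assigns a fraction of them to the training subset, and keeps every annotation together with its image. The fixed seed and the rounding rule are assumptions of the sketch.

```python
import random


def split_coco_dict(coco, train_fraction, seed=0):
    """Split a COCO-like dict into (train, test) dicts at image level.

    Hypothetical helper for illustration; not the project's implementation.
    """
    images = list(coco["images"])
    random.Random(seed).shuffle(images)  # reproducible shuffle
    n_train = int(round(train_fraction * len(images)))

    subsets = []
    for subset_images in (images[:n_train], images[n_train:]):
        image_ids = {img["id"] for img in subset_images}
        subsets.append({
            "images": subset_images,
            # an annotation follows the image it belongs to
            "annotations": [
                a for a in coco["annotations"] if a["image_id"] in image_ids
            ],
            # both subsets share the full category list
            "categories": coco["categories"],
        })
    return subsets[0], subsets[1]
```

Splitting by image rather than by annotation ensures that no image contributes instances to both the training and the test set.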

Crop inference dataset.

In [None]:
pre.crop_images_inference(DATA_DIR, RAW_DATA_DIR, IMG_SIZE, IMG_FORMAT)

7993 images saved in /content/drive/MyDrive/techlabs_instance_segmentation/data/kuzikus_coco_1024_inference
