# Prepare the Ground Truth object detection labeling output for training

This notebook walks through the steps we took to process the object detection label output from Ground Truth and prepare it for model training in SageMaker:

1. [Join together outputs from multiple labeling jobs](#join_output)
1. [Filter out labels that did not meet our quality bar](#filter_bad_labels)
1. [Inject class labels (if you didn't have the Ground Truth workers pick classes)](#inject_class)
1. [Split train/validation data](#split_train)
1. [Augment the data (optional)](#data_aug)

## Setup

In [None]:
BUCKET = 'robcost-potatohead'
JOB_NAME = 'demo' 

### Import dependencies and define helper functions

In [None]:
import numpy as np
import random
import os, shutil
import json
import boto3
import botocore
import sagemaker

In [None]:
sagemaker_client = boto3.client('sagemaker')

def make_tmp_folder(folder_name):
    try:
        os.makedirs(folder_name, exist_ok=False)
    except FileExistsError:
        print("{} folder already exists".format(folder_name))
        
def read_manifest_file(file_path):
    with open(file_path, 'r') as f:
        output = [json.loads(line.strip()) for line in f.readlines()]
        return output

### Specify the Ground Truth labeling job id(s) 

In [None]:
## if using your own Ground Truth labeling job, replace below with appropriate job IDs
LABEL_JOB_IDS = [
    'TestingJob1']


In [None]:
TMP_FOLDER_NAME = 'tmp'
make_tmp_folder(TMP_FOLDER_NAME)


## 1. Join outputs from multiple jobs <a id='join_output'></a>

To be able to iterate on Ground Truth jobs, we created several smaller labeling jobs for our dataset instead of a single large job containing the full dataset. 

The code below takes one or more Ground Truth job IDs, downloads each job's output (in Augmented Manifest File format), and joins the outputs into a single list for further manipulation.

In [None]:
joined_outputs = []

def get_output_manifest_s3_uri(label_job_id):
    # Look up the output manifest location of your own Ground Truth labeling job.
    # To use the label outputs from our sample dataset instead, comment out the
    # return statement below and uncomment the following line:
    # return f's3://greengrass-object-detection-blog/ground-truth-output/{label_job_id}.output.manifest'
    return sagemaker_client.describe_labeling_job(LabelingJobName=label_job_id)['LabelingJobOutput']['OutputDatasetS3Uri']

for label_job_id in LABEL_JOB_IDS: 
    output_manifest_s3_uri = get_output_manifest_s3_uri(label_job_id)
    output_manifest_fname = "{}-{}".format(label_job_id, os.path.split(output_manifest_s3_uri)[1])
    !aws s3 cp $output_manifest_s3_uri $TMP_FOLDER_NAME/$output_manifest_fname
    output_manifest_local_path = os.path.join(TMP_FOLDER_NAME, output_manifest_fname)
    output_manifest_lines = read_manifest_file(output_manifest_local_path)
    print("loaded {} lines from {}".format(len(output_manifest_lines), output_manifest_local_path))
    joined_outputs += output_manifest_lines
    
print("loaded total of {} lines".format(len(joined_outputs)))
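
Each line of an output manifest is a standalone JSON object (JSON Lines). The sketch below parses a made-up two-line manifest the same way `read_manifest_file` does. The S3 paths, box coordinates, and class map in the sample are illustrative assumptions rather than values from a real job, and `parse_manifest_text` is a hypothetical helper name.

```python
import json

# Hypothetical manifest content; real files come from the labeling job output.
sample_manifest = "\n".join([
    json.dumps({
        "source-ref": "s3://example-bucket/images/img_001.jpg",
        "bb": {
            "annotations": [{"class_id": 0, "left": 10, "top": 20, "width": 50, "height": 40}],
            "image_size": [{"width": 640, "height": 480, "depth": 3}],
        },
        "bb-metadata": {"class-map": {"0": "box"}, "type": "groundtruth/object-detection"},
    }),
    json.dumps({
        "source-ref": "s3://example-bucket/images/img_002.jpg",
        "bb": {"annotations": [], "image_size": [{"width": 640, "height": 480, "depth": 3}]},
        "bb-metadata": {"class-map": {"0": "box"}, "type": "groundtruth/object-detection"},
    }),
])

def parse_manifest_text(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

sample_entries = parse_manifest_text(sample_manifest)
for entry in sample_entries:
    # Print the image location and how many bounding boxes it carries
    print(entry["source-ref"], len(entry["bb"]["annotations"]), "box(es)")
```

Skipping blank lines makes the parser tolerant of a trailing newline at the end of the file.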

### Example labels

In [None]:
joined_outputs[15]

In [None]:
joined_outputs[-15]

## 2. Discard any bad labels from visual inspection <a id="filter_bad_labels"></a>

You can manually review the labeled bounding boxes in the Ground Truth console and add the IDs of any images that did not meet your quality bar to the set below.

In [None]:
TO_DISCARD = set([])

In [None]:
filtered_manifest = []
count_filtered = 0
for line in joined_outputs:
    filename = os.path.split(line["source-ref"])[1]
    imageid = os.path.splitext(filename)[0]
    if imageid not in TO_DISCARD:
        filtered_manifest.append(line)
    else:
        count_filtered += 1

print("filtered out {} labels. {} labels remain".format(count_filtered, len(filtered_manifest)))
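
The IDs in `TO_DISCARD` must match the image file name without its extension, because that is what the loop above compares against. A minimal sketch of that derivation, using made-up S3 paths and a hypothetical helper name `image_id_from_source_ref`:

```python
import os

def image_id_from_source_ref(source_ref):
    """Return the file name without its extension, e.g. 'img_001'."""
    filename = os.path.split(source_ref)[1]
    return os.path.splitext(filename)[0]

# Hypothetical values for illustration only.
example_discard = {"img_002"}
example_refs = [
    "s3://example-bucket/images/img_001.jpg",
    "s3://example-bucket/images/img_002.jpg",
]

# Keep only the references whose image ID is not marked for discard
kept = [ref for ref in example_refs if image_id_from_source_ref(ref) not in example_discard]
print(kept)
```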

In [None]:
## example entry
filtered_manifest[2]

## 3. Inject class labels from metadata <a id="inject_class"></a>

As the examples above show, every annotation says `'class_id': 0` regardless of which object it actually is, because we didn't ask the Ground Truth workers to classify the objects they were labeling.

We can use the metadata that we injected into the manifest (the `color` and `object` fields) to insert the correct class ID.

In [None]:
NEW_CLASS_MAP = {"blue box": 0, "yellow box": 1}
REVERSE_CLASS_MAP = {"0": "blue box", "1": "yellow box"}

In [None]:
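import copy

classified_manifest = []
for line in filtered_manifest:
    if line["object"] == "box":
        # Deep copy so the nested annotations in filtered_manifest are not modified in place
        transformed_line = copy.deepcopy(line)
        annotations = transformed_line['bb']['annotations']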
        new_annotations = []
        if line["color"] == "blue":
            for annotation in annotations:
                annotation["class_id"] = NEW_CLASS_MAP["blue box"]
                new_annotations.append(annotation)
        elif line["color"] == "yellow":
            for annotation in annotations:
                annotation["class_id"] = NEW_CLASS_MAP["yellow box"]
                new_annotations.append(annotation)
        transformed_line['bb']['annotations'] = new_annotations
        transformed_line['bb-metadata']['class-map'] = REVERSE_CLASS_MAP

        classified_manifest.append(transformed_line)

In [None]:
classified_manifest[15]

In [None]:
classified_manifest[-15]
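
Beyond spot-checking individual entries, it can help to count annotations per class ID across the whole manifest. The sketch below assumes the label attribute is named `bb`, as in the cells above. `count_class_ids` is a hypothetical helper and the sample entries are made up.

```python
from collections import Counter

def count_class_ids(manifest, label_attribute="bb"):
    """Count annotations per class_id across all manifest entries."""
    counts = Counter()
    for entry in manifest:
        for annotation in entry[label_attribute]["annotations"]:
            counts[annotation["class_id"]] += 1
    return counts

# Hypothetical entries standing in for classified_manifest.
example_manifest = [
    {"bb": {"annotations": [{"class_id": 0}, {"class_id": 0}]}},
    {"bb": {"annotations": [{"class_id": 1}]}},
]
print(count_class_ids(example_manifest))
```

Calling `count_class_ids(classified_manifest)` on the real data should show one count per key in `REVERSE_CLASS_MAP`. If a class is missing, the metadata fields for that class are probably missing too.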

## 4. Split dataset between train and validation <a id='split_train'></a>

SageMaker requires two datasets for training: a training set and a validation set. The training set contains the images and annotations the model learns from. The validation set is not used for learning; it is used to “validate” that each training pass improves the model's accuracy and to compare accuracy across training jobs during hyperparameter tuning.

In [None]:
def train_validation_split(labels, split_factor=0.9):
    np.random.shuffle(labels)

    dataset_size = len(labels)
    train_test_split_index = round(dataset_size*split_factor)

    train_data = labels[:train_test_split_index]
    validation_data = labels[train_test_split_index:]
    return train_data, validation_data

In [None]:
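# Note: this splits the original joined manifest, which bypasses the filtered and
# classified manifests built in steps 2 and 3. To use the processed labels, swap in:
# train_data, validation_data = train_validation_split(np.array(classified_manifest), split_factor=0.9)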
train_data, validation_data = train_validation_split(np.array(joined_outputs), split_factor=0.9)

print("training data size:{}\nvalidation data size:{}".format(train_data.shape[0], validation_data.shape[0]))
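
The shuffle in `train_validation_split` is unseeded, so the split differs on every run. If reproducible splits matter (for example, when comparing training jobs), one option is to shuffle a copy with a fixed seed. This is a sketch under that assumption, not the method used in this notebook. `seeded_split` and its `seed` parameter are hypothetical names.

```python
import random

def seeded_split(labels, split_factor=0.9, seed=42):
    """Shuffle a copy of labels with a fixed seed, then split it."""
    shuffled = list(labels)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    split_index = round(len(shuffled) * split_factor)
    return shuffled[:split_index], shuffled[split_index:]

# Integers stand in for manifest entries in this illustration.
example_labels = list(range(20))
train_part, validation_part = seeded_split(example_labels, split_factor=0.9)
print(len(train_part), len(validation_part))
```

Using a dedicated `random.Random` instance keeps the seed local to the split and leaves the global random state untouched.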

In [None]:
with open(os.path.join(TMP_FOLDER_NAME, 'train.manifest'), 'w') as f:
    for line in train_data:
        f.write(json.dumps(line))
        f.write('\n')
    
with open(os.path.join(TMP_FOLDER_NAME,'validation.manifest'), 'w') as f:
    for line in validation_data:
        f.write(json.dumps(line))
        f.write('\n')

In [None]:
!wc -l $TMP_FOLDER_NAME/train.manifest
!wc -l $TMP_FOLDER_NAME/validation.manifest

In [None]:
!aws s3 cp $TMP_FOLDER_NAME/train.manifest s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest
!aws s3 cp $TMP_FOLDER_NAME/validation.manifest s3://$BUCKET/training-manifest/$JOB_NAME/validation.manifest

## 5. Data augmentation (optional) <a id='data_aug'></a>

In [None]:
%%time
%run ./scripts/flip_images.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -b $BUCKET

In [None]:
%run ./scripts/flip_annotations.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -p $JOB_NAME
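
The scripts themselves are not shown in this notebook. As a rough sketch of the idea, assuming a horizontal flip and the Ground Truth bounding box fields (`left`, `top`, `width`, `height`), mirroring an image moves each box's left edge to `image_width - left - width` while the other fields stay the same. The function name below is hypothetical and is not taken from `flip_annotations.py`.

```python
def flip_annotation_horizontally(annotation, image_width):
    """Return a copy of a bounding box mirrored across the vertical axis."""
    flipped = dict(annotation)
    # The old right edge (left + width) becomes the new left edge, measured from the other side
    flipped["left"] = image_width - annotation["left"] - annotation["width"]
    return flipped

# Made-up box and image width for illustration.
box = {"class_id": 0, "left": 10, "top": 20, "width": 50, "height": 40}
flipped_box = flip_annotation_horizontally(box, image_width=640)
print(flipped_box)
```

Flipping the same box twice returns the original coordinates, which makes a handy sanity check.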

# Next step

Now we are ready to start training jobs. Move on to the [next notebook](./02_sagemaker_training_API.ipynb) to submit a SageMaker training job that trains our custom object detection model.