# Create and run SageMaker Ground Truth Labeling job

This notebook creates a Ground Truth labeling job in SageMaker and lets you track the status of the job. Once the job has completed, you can move on to the Prepare Data and Labels notebook.

In [None]:
BUCKET = '<S3 Bucket Name>' # Valid name for S3 bucket.
IMG_FOLDER = 'images' # Any valid S3 prefix.
MANIFEST_FOLDER = 'manifest' # Any valid S3 prefix.
CLASS_NAME = '<Target object label name>' # The single label that will be annotated in the Ground Truth job.

In [None]:
# Example values used for testing. Replace them with your own, or skip this cell.

BUCKET = 'robcost-potato' # Valid name for S3 bucket.
IMG_FOLDER = 'images' # Any valid S3 prefix.
MANIFEST_FOLDER = 'manifest' # Any valid S3 prefix.
CLASS_NAME = 'potatohead' # The single label that will be annotated in the Ground Truth job.

## Import Dependencies

In [None]:
import sagemaker
import numpy as np
import random
import os, shutil
import json
import boto3
import time

## Create asset bucket

In [None]:
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name

In [None]:
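# Create the bucket in the same region as this notebook.

s3 = boto3.client('s3', region_name=region)
if region == 'us-east-1':
    # us-east-1 is the default location and rejects an explicit LocationConstraint.
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(Bucket=BUCKET, CreateBucketConfiguration={'LocationConstraint': region})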

In [None]:
bucket_region = s3.head_bucket(Bucket=BUCKET)['ResponseMetadata']['HTTPHeaders']['x-amz-bucket-region']
assert bucket_region == region, "Your S3 bucket {} and this notebook need to be in the same region.".format(BUCKET)

## Upload images to be annotated

<span style="color:red">**IMPORTANT - you must now upload your images to the bucket you specified in the previous cell, under a folder called /images.**</span>
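
The upload itself is not covered by this notebook. As a minimal sketch, assuming the images sit in a local folder, a helper like the one below could push them to the bucket. The function name `upload_images`, the extension list, and the `./my_images` folder are hypothetical; `upload_file` is the standard boto3 S3 client call.

```python
import os

# Hypothetical list of extensions treated as images.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def upload_images(local_dir, bucket, prefix, s3_client):
    """Upload every image file in local_dir to s3://bucket/prefix/ and return the keys."""
    uploaded_keys = []
    for fname in sorted(os.listdir(local_dir)):
        if not fname.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip anything that does not look like an image
        key = '{}/{}'.format(prefix, fname)
        s3_client.upload_file(os.path.join(local_dir, fname), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys

# Example usage inside this notebook ('./my_images' is a hypothetical local folder):
# upload_images('./my_images', BUCKET, IMG_FOLDER, boto3.client('s3'))
```

Passing the client in as an argument keeps the helper easy to try out without touching a real bucket.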

In [None]:
%%time
# Enumerate the objects under s3://BUCKET/IMG_FOLDER and build manifest.json from them.
%run ./scripts/generate_gt_manifest.py -b $BUCKET -k $IMG_FOLDER

# Upload the generated manifest to S3.
s3_client = boto3.client('s3')
with open('manifest.json') as file:
    manifest_body = file.read()
    s3_client.put_object(Body=manifest_body, Bucket=BUCKET, Key=MANIFEST_FOLDER + "/manifest.json")
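
The `generate_gt_manifest.py` script is not shown here. For orientation, a Ground Truth input manifest is a JSON Lines file in which each line points at one image through a `source-ref` key. The sketch below illustrates that file format only; it is not the content of the script, and `build_manifest_lines` and the sample keys are hypothetical.

```python
import json

def build_manifest_lines(bucket, keys):
    """Return one JSON line per image, each pointing at the image with 'source-ref'."""
    return [json.dumps({'source-ref': 's3://{}/{}'.format(bucket, key)}) for key in keys]

# Hypothetical bucket and keys, used only to show the shape of the file.
lines = build_manifest_lines('my-bucket', ['images/a.jpg', 'images/b.jpg'])
print('\n'.join(lines))
```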

## Specify the categories

To run an object detection labeling job, you must decide on a set of classes the annotators can choose from. At the moment, Ground Truth only supports annotating one object detection class at a time. To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 BUCKET.

In [None]:
CLASS_LIST = [CLASS_NAME]
print("Label space is {}".format(CLASS_LIST))

json_body = {
    'labels': [{'label': label} for label in CLASS_LIST]
}
with open('class_labels.json', 'w') as f:
    json.dump(json_body, f)
    
with open('class_labels.json') as file:
    labels_body = file.read()  # avoid shadowing the built-in name `object`
    s3_client.put_object(Body=labels_body, Bucket=BUCKET, Key=MANIFEST_FOLDER + "/class_labels.json")

You should now see class_labels.json in s3://BUCKET/MANIFEST_FOLDER/.

## Create the instruction template
Part or all of your images will be annotated by human annotators. It is essential to provide good instructions. Good instructions are:

 1. Concise. We recommend limiting verbal/textual instruction to two sentences and focusing on clear visuals.
 2. Visual. In the case of object detection, we recommend providing several labeled examples with different numbers of boxes.
 
When used through the AWS Console, Ground Truth helps you create the instructions using a visual wizard. When using the API, you need to create an HTML template for your instructions. Below, we prepare a very simple but effective template and upload it to your S3 bucket.

NOTE: If you use any images in your template, they need to be publicly accessible. You can enable public access to files in your S3 bucket through the S3 Console, as described in the S3 documentation.

**Testing your instructions**

**It is very easy to create broken instructions.** This might cause your labeling job to fail. It might also cause your job to complete with meaningless results if, for example, the annotators have no idea what to do or the instructions are misleading. At the moment, the only way to test the instructions is to run your job with a private workforce, which lets you run a mock labeling job for free.

It is helpful to show examples of correctly labeled images in the instructions. The following code block builds the instruction template, writes a local test copy (instructions.html), and uploads the template to s3://BUCKET/MANIFEST_FOLDER/.

In [None]:
from IPython.display import HTML, display  # IPython.core.display is deprecated for these imports

def make_template(test_template=False, save_fname='instructions.template'):
    template = r"""<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
    <crowd-form>
      <crowd-bounding-box
        name="boundingBox"
        src="{{{{ task.input.taskObject | grant_read_access }}}}"
        header="Dear Annotator, please draw a tight box around each {class_name} you see. Thank you!"
        labels="{labels_str}"
      >
        <full-instructions header="Please annotate each {class_name}.">

    <ol>
        <li><strong>Inspect</strong> the image</li>
        <li><strong>Determine</strong> whether any instance of the specified label is visible in the picture.</li>
        <li><strong>Outline</strong> each instance of the specified label in the image using the provided “Box” tool.</li>
    </ol>
    <ul>
        <li>Boxes should fit tight around each object</li>
        <li>Do not include parts of the object that are overlapped or cannot be seen, even if you think you can interpolate the whole shape.</li>
        <li>Avoid including shadows.</li>
        <li>If the target is off screen, draw the box up to the edge of the image.</li>
    </ul>

        </full-instructions>
        <short-instructions>
            <p>Short Instructions</p>
        </short-instructions>
      </crowd-bounding-box>
    </crowd-form>
    """.format(class_name=CLASS_NAME,
               labels_str=str(CLASS_LIST) if test_template else '{{ task.input.labels | to_json | escape }}')
    with open(save_fname, 'w') as f:
        f.write(template)

        
make_template(test_template=True, save_fname='instructions.html')
make_template(test_template=False, save_fname='instructions.template')

with open('instructions.template') as file:
    template_body = file.read()  # avoid shadowing the built-in name `object`
    s3_client.put_object(Body=template_body, Bucket=BUCKET, Key=MANIFEST_FOLDER + "/instructions.template")

You should now be able to find your template in s3://BUCKET/MANIFEST_FOLDER/instructions.template.
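
The template cell relies on how `str.format` treats braces: `{{` and `}}` are escapes for single literal braces, so `{{{{ ... }}}}` in the source becomes the Liquid placeholder `{{ ... }}` in the output, while text passed in as an argument is inserted unchanged. A shortened, illustrative version of the template (not the full one) shows the effect:

```python
# Shortened stand-in for the instruction template; only the brace handling matters here.
template = (
    '<crowd-bounding-box '
    'src="{{{{ task.input.taskObject | grant_read_access }}}}" '
    'header="Label each {class_name}" '
    'labels="{labels_str}">'
)

rendered = template.format(
    class_name='potatohead',
    # Argument values are inserted verbatim, so these braces are not escaped again.
    labels_str='{{ task.input.labels | to_json | escape }}',
)
print(rendered)
```

This is why the test copy and the uploaded template differ only in the `labels` attribute.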

## Create a Private Workforce for the labeling job

This step creates the Amazon Cognito user pool, user group, and SageMaker private work team required for the job. Workers (users) who belong to the group are assigned the task of annotating the images.

In [None]:
# create Cognito pool, team, and workers
cognito = boto3.client('cognito-idp')
myPool = cognito.create_user_pool(PoolName='sagemaker-groundtruth-user-pool')
myPoolId = myPool["UserPool"]["Id"]
print('Cognito Pool ID: ' + myPoolId)

In [None]:
# Keep the last 9 characters of the pool ID to build a unique domain name.
myPoolIdNumber = myPoolId[-9:]
print('Cognito Pool ID Number: ' + myPoolIdNumber)

In [None]:
myDomain = cognito.create_user_pool_domain(
    Domain='sagemaker-groundtruth-workteam-' + myPoolIdNumber.lower(),
    UserPoolId=myPoolId
)

In [None]:
# create Cognito App Client
appClient = cognito.create_user_pool_client(
    UserPoolId=myPoolId,
    ClientName=CLASS_NAME,
    GenerateSecret=True,
)
appClientId = appClient["UserPoolClient"]["ClientId"]
print('Cognito Pool App Client ID: ' + appClientId)

In [None]:
group = cognito.create_group(
    GroupName='sagemaker-groundtruth-user-group',
    UserPoolId=myPoolId
)

In [None]:
# create private work team
sagemaker_client = boto3.client('sagemaker')
workteam = sagemaker_client.create_workteam(
    WorkteamName=CLASS_NAME + '-Team',
    MemberDefinitions=[
        {
            'CognitoMemberDefinition': {
                'UserPool': myPoolId,
                'UserGroup': 'sagemaker-groundtruth-user-group',
                'ClientId': appClientId
            }
        },
    ],
    Description='Private work team for labeling ' + CLASS_NAME
)
private_workteam_arn = workteam["WorkteamArn"]

In [None]:
workteamDetail = sagemaker_client.describe_workteam(
    WorkteamName=CLASS_NAME + '-Team'
)
LabelPortalURL = workteamDetail["Workteam"]["SubDomain"]
print('Label Portal URL: https://' + LabelPortalURL)

In [None]:
appClientNew = cognito.describe_user_pool_client(
    UserPoolId=myPoolId,
    ClientId=appClientId
)

In [None]:
callbackURLs = appClientNew["UserPoolClient"]["CallbackURLs"]
logoutURLs = appClientNew["UserPoolClient"]["LogoutURLs"]

In [None]:
# need to enable the App Client settings, enable Cognito Pool and allowed OAuth2.0 settings.
appClient = cognito.update_user_pool_client(
    UserPoolId=myPoolId,
    ClientId=appClientId,
    AllowedOAuthFlows=['code','implicit'],
    AllowedOAuthScopes=['email','openid','profile'],
    AllowedOAuthFlowsUserPoolClient=True,
    CallbackURLs=callbackURLs,
    LogoutURLs=logoutURLs,
    SupportedIdentityProviders=['COGNITO'],
)

In [None]:
emailMessage = f"""<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
        <div>
            <h2> <span style="font-family:\'Amazon Ember\',sans-serif;color:#333333">You\'re invited to work on a labeling
                    project. <o:p></o:p> </span></h2>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color:initial;word-spacing:0px;padding-bottom:30px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#333333">You will need this
                    Username and temporary password to log in the first time. <o:p></o:p> </span></p>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color: initial;word-spacing:0px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#333333">User name:
                    <b>{{username}}</b>
                    <o:p></o:p>
                </span></p>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color:initial;word-spacing:0px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#333333">Temporary password:
                    <b>{{####}}</b>
                    <o:p></o:p>
                </span></p>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color:initial;word-spacing:0px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#333333">Open the link below to
                    log in: <o:p></o:p> </span></p>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color:initial;word-spacing:0px;padding-bottom:30px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#007DBC"> <a href="https://{LabelPortalURL}"
                        target="_blank">https://{LabelPortalURL}</a>
                    <o:p></o:p>
                </span></p>
            <p
                style="font-variant-ligatures: normal;font-variant-caps: normal;orphans:2;text-align:start;widows:2;-webkit-text-stroke-width: 0px;text-decoration-style:initial;text-decoration-color:initial;word-spacing:0px">
                <span style="font-size:13.5pt;font-family:\'Amazon Ember\',sans-serif;color:#333333">After you log in with
                    your temporary password, you are required to create a new one. <o:p></o:p> </span></p>
        </div>
    </body>
</html>"""

In [None]:
# now update the email template used by cognito to invite users
cognito.update_user_pool(
    UserPoolId=myPoolId,
    AdminCreateUserConfig={
        'AllowAdminCreateUserOnly': True,
        'UnusedAccountValidityDays': 7,
        'InviteMessageTemplate': {
            'EmailMessage': emailMessage,
            'EmailSubject': "You're invited to work on a labeling project."
        }
    }
)
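
The cells above set up the pool, group, work team, and invitation email, but do not add any workers. As a hedged sketch, workers could be created and placed in the group with the Cognito `admin_create_user` and `admin_add_user_to_group` calls. The helper name `invite_workers`, the username scheme, and the example address are assumptions made for illustration.

```python
def invite_workers(cognito_client, user_pool_id, group_name, emails):
    """Create one Cognito user per email address and add each user to the labeling group."""
    usernames = []
    for email in emails:
        # Hypothetical naming choice: use the part before '@' as the username.
        username = email.split('@')[0]
        cognito_client.admin_create_user(
            UserPoolId=user_pool_id,
            Username=username,
            UserAttributes=[{'Name': 'email', 'Value': email}],
            DesiredDeliveryMediums=['EMAIL'],  # send the invitation by email
        )
        cognito_client.admin_add_user_to_group(
            UserPoolId=user_pool_id,
            Username=username,
            GroupName=group_name,
        )
        usernames.append(username)
    return usernames

# Example usage inside this notebook (the address is a placeholder):
# invite_workers(cognito, myPoolId, 'sagemaker-groundtruth-user-group', ['annotator@example.com'])
```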

## Submit the Ground Truth job request
The API starts a Ground Truth job by submitting a request. The request contains the 
full configuration of the annotation task, and allows you to modify the fine details of
the job that are fixed to default values when you use the AWS Console. The parameters that make up the request are described in more detail in the [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateLabelingJob.html).

After you submit the request, you should be able to see the job in your AWS Console, at `Amazon SageMaker > Labeling Jobs`.
You can track the progress of the job there. This job will take several hours to complete. If your job
is larger (say 100,000 images), the speed and cost benefit of auto-labeling should be larger.

In [None]:
# Specify ARNs for resources needed to run an object detection job.
ac_arn_map = {'us-west-2': '081040173940',
              'us-east-1': '432418664414',
              'us-east-2': '266458841044',
              'eu-west-1': '568282634449',
              'ap-northeast-1': '477331159723',
              'ap-southeast-2': '454466003867'}

prehuman_arn = 'arn:aws:lambda:{}:{}:function:PRE-BoundingBox'.format(region, ac_arn_map[region])
acs_arn = 'arn:aws:lambda:{}:{}:function:ACS-BoundingBox'.format(region, ac_arn_map[region]) 
labeling_algorithm_specification_arn = 'arn:aws:sagemaker:{}:027400017018:labeling-job-algorithm-specification/object-detection'.format(region)
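
The account map above covers six regions, so `ac_arn_map[region]` raises a bare `KeyError` anywhere else. A small sketch of a friendlier lookup follows; the function name `bounding_box_lambda_arns` is hypothetical, and the ARN patterns are copied from the cell above.

```python
def bounding_box_lambda_arns(region, account_by_region):
    """Build the pre-annotation and consolidation Lambda ARNs, failing clearly on unknown regions."""
    if region not in account_by_region:
        raise ValueError(
            'Region {} is not in the account map; supported regions: {}'.format(
                region, sorted(account_by_region)))
    account = account_by_region[region]
    return {
        'prehuman': 'arn:aws:lambda:{}:{}:function:PRE-BoundingBox'.format(region, account),
        'acs': 'arn:aws:lambda:{}:{}:function:ACS-BoundingBox'.format(region, account),
    }

# Example usage inside this notebook:
# arns = bounding_box_lambda_arns(region, ac_arn_map)
```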

In [None]:
%%time
task_description = 'Dear Annotator, please draw a box around each {}. Thank you!'.format(CLASS_NAME)
task_keywords = ['image', 'object', 'detection']
task_title = 'Please draw a box around each {}.'.format(CLASS_NAME)
job_name = CLASS_NAME + str(int(time.time()))

human_task_config = {
      "AnnotationConsolidationConfig": {
        "AnnotationConsolidationLambdaArn": acs_arn,
      },
      "PreHumanTaskLambdaArn": prehuman_arn,
      "MaxConcurrentTaskCount": 200, # 200 images will be sent at a time to the workteam.
      "NumberOfHumanWorkersPerDataObject": 5, # We will obtain and consolidate 5 human annotations for each image.
      "TaskAvailabilityLifetimeInSeconds": 21600, # Your workteam has 6 hours to complete all pending tasks.
      "TaskDescription": task_description,
      "TaskKeywords": task_keywords,
      "TaskTimeLimitInSeconds": 300, # Each image must be labeled within 5 minutes.
      "TaskTitle": task_title,
      "UiConfig": {
        "UiTemplateS3Uri": 's3://{}/{}/instructions.template'.format(BUCKET, MANIFEST_FOLDER),
      }
    }

human_task_config["WorkteamArn"] = private_workteam_arn

ground_truth_request = {
        "InputConfig" : {
          "DataSource": {
            "S3DataSource": {
              "ManifestS3Uri": 's3://{}/{}/{}'.format(BUCKET, MANIFEST_FOLDER, 'manifest.json'),
            }
          },
          "DataAttributes": {
            "ContentClassifiers": [
              "FreeOfPersonallyIdentifiableInformation",
              "FreeOfAdultContent"
            ]
          },  
        },
        "OutputConfig" : {
          "S3OutputPath": 's3://{}/{}/output/'.format(BUCKET, IMG_FOLDER),
        },
        "HumanTaskConfig" : human_task_config,
        "LabelingJobName": job_name,
        "RoleArn": role, 
        "LabelAttributeName": "category",
        "LabelCategoryConfigS3Uri": 's3://{}/{}/class_labels.json'.format(BUCKET, MANIFEST_FOLDER),
    }

ground_truth_request["LabelingJobAlgorithmsConfig"] = {
    "LabelingJobAlgorithmSpecificationArn": labeling_algorithm_specification_arn
}
label_job = sagemaker_client.create_labeling_job(**ground_truth_request)
print(label_job)

# STOP HERE!!!
## <span style="color:red">**You must now use the Labeling Portal to label your images before proceeding!!!**</span>
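
While labeling is in progress, the job status can also be checked from the notebook instead of the console. The sketch below polls `describe_labeling_job` and reads the `LabelingJobStatus` and `LabelCounters` fields of the response. The helper name `wait_for_labeling_job`, the polling interval, and the injectable `sleep` argument are assumptions made for illustration.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = ('Completed', 'Failed', 'Stopped')

def wait_for_labeling_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the labeling job until it reaches a terminal status and return that status."""
    while True:
        desc = sm_client.describe_labeling_job(LabelingJobName=job_name)
        status = desc['LabelingJobStatus']
        counters = desc.get('LabelCounters', {})
        print('{}: {} labeled, {} unlabeled'.format(
            status, counters.get('TotalLabeled', 0), counters.get('Unlabeled', 0)))
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)

# Example usage inside this notebook:
# wait_for_labeling_job(sagemaker_client, job_name)
```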

------------

# Prepare Labeled Data for Training

Here you will split the dataset, augment it with additional data, and create the manifest files required for training.

In [None]:
label_job_id = label_job["LabelingJobArn"].split("/")[1]

In [None]:
def train_validation_split(labels, split_factor=0.9):
    np.random.shuffle(labels)

    dataset_size = len(labels)
    train_test_split_index = round(dataset_size*split_factor)

    train_data = labels[:train_test_split_index]
    validation_data = labels[train_test_split_index:]
    return train_data, validation_data

def read_manifest_file(file_path):
    with open(file_path, 'r') as f:
        output = [json.loads(line.strip()) for line in f.readlines()]
        return output

In [None]:
output_manifest_s3_uri = sagemaker_client.describe_labeling_job(LabelingJobName=label_job_id)['LabelingJobOutput']['OutputDatasetS3Uri']
print('Job Output Manifest: ' + output_manifest_s3_uri)

In [None]:
output_manifest_fname = "{}-{}".format(label_job_id, os.path.split(output_manifest_s3_uri)[1])
!aws s3 cp $output_manifest_s3_uri $output_manifest_fname
output_manifest_local_path = output_manifest_fname
output_manifest_lines = read_manifest_file(output_manifest_local_path)

In [None]:
train_data, validation_data = train_validation_split(np.array(output_manifest_lines), split_factor=0.9)
print("training data size:{}\nvalidation data size:{}".format(train_data.shape[0], validation_data.shape[0]))
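
Before training, it can be worth checking how many boxes each image actually received. The sketch below assumes the usual Ground Truth bounding-box output layout, where each manifest entry stores its boxes in an `annotations` list under the label attribute name (`category` in this job). The helper `count_boxes` and the sample entries are hypothetical.

```python
def count_boxes(manifest_entries, label_attribute='category'):
    """Map each image's source-ref to the number of bounding boxes recorded for it."""
    counts = {}
    for entry in manifest_entries:
        # Entries without the label attribute are treated as having no boxes.
        annotations = entry.get(label_attribute, {}).get('annotations', [])
        counts[entry['source-ref']] = len(annotations)
    return counts

# Hypothetical entries shaped like lines of the output manifest.
sample_entries = [
    {'source-ref': 's3://my-bucket/images/a.jpg',
     'category': {'annotations': [
         {'class_id': 0, 'left': 10, 'top': 20, 'width': 30, 'height': 40}]}},
    {'source-ref': 's3://my-bucket/images/b.jpg'},
]
print(count_boxes(sample_entries))

# Example usage inside this notebook:
# count_boxes(output_manifest_lines)
```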

In [None]:
with open('./train.manifest', 'w') as f:
    for line in train_data:
        f.write(json.dumps(line))
        f.write('\n')
    
with open('./validation.manifest', 'w') as f:
    for line in validation_data:
        f.write(json.dumps(line))
        f.write('\n')

In [None]:
!wc -l ./train.manifest
!wc -l ./validation.manifest

In [None]:
train_manifest_location = 's3://{}/{}/train.manifest'.format(BUCKET,MANIFEST_FOLDER)
validation_manifest_location = 's3://{}/{}/validation.manifest'.format(BUCKET,MANIFEST_FOLDER)
print(validation_manifest_location)

In [None]:
s3_client = boto3.client('s3')
with open('train.manifest') as file:
    train_body = file.read()  # avoid shadowing the built-in name `object`
    s3_client.put_object(Body=train_body, Bucket=BUCKET, Key=MANIFEST_FOLDER + "/train.manifest")

with open('validation.manifest') as file:
    validation_body = file.read()
    s3_client.put_object(Body=validation_body, Bucket=BUCKET, Key=MANIFEST_FOLDER + "/validation.manifest")

In [None]:
def make_tmp_folder(folder_name):
    try:
        os.makedirs(folder_name, exist_ok=False)
    except FileExistsError:
        print("{} folder already exists".format(folder_name))
        
TMP_FOLDER_NAME = 'tmp'
make_tmp_folder(TMP_FOLDER_NAME)

In [None]:
%%time
print(TMP_FOLDER_NAME)
%run ./scripts/flip_images.py -m s3://$BUCKET/$MANIFEST_FOLDER/train.manifest -d $TMP_FOLDER_NAME -b $BUCKET

In [None]:
%%time
%run ./scripts/flip_annotations.py -m s3://$BUCKET/$MANIFEST_FOLDER/train.manifest -d $TMP_FOLDER_NAME -p $CLASS_NAME -c $CLASS_NAME

### The updated TRAINING manifest file is now uploaded to: *s3://$BUCKET/$CLASS_NAME/all_augmented.json*

### The VALIDATION manifest is located here: *s3://$BUCKET/$MANIFEST_FOLDER/validation.manifest*
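
The flip scripts are not shown in this notebook. As an illustration of what a horizontal flip does to an annotation, the sketch below mirrors one bounding box, assuming boxes are stored as pixel `left`, `top`, `width`, and `height` values. The function name `flip_box_horizontally` and the sample box are hypothetical, and the actual scripts may work differently.

```python
def flip_box_horizontally(box, image_width):
    """Return a copy of the box mirrored around the vertical center line of the image."""
    flipped = dict(box)
    # The right edge of the original box becomes the left edge of the flipped box.
    flipped['left'] = image_width - box['left'] - box['width']
    return flipped

box = {'class_id': 0, 'left': 10, 'top': 20, 'width': 30, 'height': 40}
print(flip_box_horizontally(box, image_width=100))
```

Only `left` changes; `top`, `width`, and `height` are unaffected by a horizontal flip, and flipping twice returns the original box.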

# Next step

Now we are ready to start training jobs! Move on to the [next notebook](./02_sagemaker_training_API.ipynb) to submit a SageMaker training job that trains our custom object detection model.