# SageMaker Bear Classification

This notebook is my own attempt at image classification. It pulls a set of bear photos (categorized as "polar", "brown" and "no") and trains a classification model that predicts which kind of bear (again, "brown", "polar" or "no" bear) is in a photo.

This notebook is based on the following sources:

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/imageclassification_mscoco_multi_label/Image-classification-multilabel-lst.ipynb

[2] https://github.com/aws-samples/aws-deeplens-reinvent-2019-workshops/blob/master/AIM405-Advanced/Lab2/lab2-image-classification.ipynb

Begin by setting up what SageMaker needs to run: the execution role, the session, and the S3 bucket and prefix.

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'bear'

print('using bucket %s'%bucket)

Retrieve the container image for the built-in image classification algorithm.

In [None]:
training_image = sagemaker.image_uris.retrieve('image-classification', sess.boto_region_name, version='1')
print(training_image)

# Data preparation

We begin by extracting the open_images_bears.zip file. The archive contains CSV files that list bear images from public image sources, one file per dataset split and class.

Extract the CSV files, then download the listed images to the notebook instance.
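
Before extracting, it can help to check what the archive holds. The sketch below lists the CSV files in a zip archive and splits each name into its stage and label, assuming the `<stage>-<label>-...` naming that the download cell further down relies on. The helper name `list_csv_members` and the sample file names are illustrative and not part of the original notebook.

```python
import io
import zipfile


def list_csv_members(zip_source):
    """Return (file name, stage, label) for every CSV file in the archive."""
    members = []
    with zipfile.ZipFile(zip_source, 'r') as archive:
        for name in archive.namelist():
            if not name.endswith('.csv'):
                continue
            # Assumed naming convention: '<stage>-<label>-<anything>.csv'
            parts = name.split('-')
            if len(parts) >= 3:
                members.append((name, parts[0], parts[1]))
    return members


# Build a tiny in-memory archive with made-up file names to try the helper out.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('train-brown-images.csv', 'ImageID\n')
    archive.writestr('val-polar-images.csv', 'ImageID\n')
    archive.writestr('README.txt', 'not a csv')
buffer.seek(0)

for name, stage, label in list_csv_members(buffer):
    print(name, stage, label)
```

For the real archive, `list_csv_members('./open_images_bears.zip')` should show which splits and classes are present, provided the file names follow the assumed pattern.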

In [None]:
!conda update --all -y
!conda install -c conda-forge pycocotools -y
!pip install tqdm

In [None]:
import requests
import os
import csv
import zipfile
from tqdm import tqdm
import time

ZIP_FILE = './open_images_bears.zip'
ERRORS_FILE = 'download-errors.txt'
CSV_DIR = './image_csv/'
DATA_DIR = './data/'
if not os.path.isdir(DATA_DIR):
    os.mkdir(DATA_DIR)
    
with zipfile.ZipFile(ZIP_FILE, 'r') as f:
    f.extractall(os.path.expanduser(CSV_DIR))
    
files = [name for name in os.listdir(CSV_DIR) if name.endswith('.csv')]

def download(url, path):
    """Download url to path; raise if the response is too small to be an image."""
    r = requests.get(url, allow_redirects=True, timeout=30)
    if len(r.content) < 1024:
        # Report the image ID (the file name without its extension)
        raise Exception((path.split('/')[-1]).split('.')[0])
    with open(path, 'wb') as out:
        out.write(r.content)
        
# Start with an empty error log
with open(ERRORS_FILE, 'w') as f:
    f.write('')

for idx, fn in enumerate(files):
    print('{}/{} {} is being processed.'.format(idx + 1, len(files), fn))
    time.sleep(1)
    with open(CSV_DIR + fn, 'r') as f:
        reader = csv.reader(f)
        records = list(reader)[1:]  # skip the header row
    # File names start with '<stage>-<label>-'
    stage = fn.split('-')[0]
    lbl = fn.split('-')[1]
    dir_path = DATA_DIR + '{}/{}'.format(stage, lbl)
    os.makedirs(dir_path, exist_ok=True)

    for row in tqdm(records):
        path = dir_path + '/{}.jpg'.format(row[0])
        try:
            # If the thumbnail URL is empty, download the original URL
            if not row[13]:
                download(row[5], path)
            else:
                download(row[13], path)
        except Exception:
            # Log the image ID of every failed download
            with open(ERRORS_FILE, 'a') as f:
                f.write(row[0] + '\n')
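
The size check in `download` catches empty or tiny responses, but an error page larger than 1 KB would still be saved as a `.jpg`. As a minimal sketch of a stricter check, the hypothetical helper below tests whether the bytes start with the JPEG start-of-image marker (`FF D8 FF`). It is not part of the original notebook and assumes every image is a JPEG.

```python
JPEG_MAGIC = b'\xff\xd8\xff'  # JPEG files begin with the start-of-image marker


def looks_like_jpeg(content, min_size=1024):
    """Return True if content is at least min_size bytes and starts like a JPEG."""
    return len(content) >= min_size and content.startswith(JPEG_MAGIC)


# Made-up payloads for illustration only
fake_jpeg = JPEG_MAGIC + b'\x00' * 2048
html_error_page = b'<html>' + b' ' * 2048

print(looks_like_jpeg(fake_jpeg))        # True
print(looks_like_jpeg(html_error_page))  # False
print(looks_like_jpeg(JPEG_MAGIC))       # False (too small)
```

Inside `download`, the condition `len(r.content) < 1024` could be swapped for `not looks_like_jpeg(r.content)` if this stricter behavior is wanted.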

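Now create the .lst files that pair each image with its class: one line per image, holding an index, a class label (0 = brown, 1 = polar, 2 = no bear), and the image file name, separated by tabs. These files are the input data for our model.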

In [None]:
import os
import glob
import random

SEARCH_CRITERION = '**/*.jpg'
train_images = glob.glob(os.path.join(DATA_DIR + 'train', SEARCH_CRITERION), recursive=True)
val_images = glob.glob(os.path.join(DATA_DIR + 'val', SEARCH_CRITERION), recursive=True)
test_images = glob.glob(os.path.join(DATA_DIR + 'test', SEARCH_CRITERION), recursive=True)

def create_data_file(image_list, image_type):
    """Write image-<image_type>.lst with one 'index<TAB>label<TAB>file name' line per image."""
    with open('image-' + image_type + '.lst', 'w') as fp:
        for idx, image_path in enumerate(image_list):
            # Class labels: 0 = brown bear, 1 = polar bear, 2 = no bear
            if '/brown/' in image_path:
                label = 0
            elif '/polar/' in image_path:
                label = 1
            else:
                label = 2
            # The images are uploaded to a flat S3 prefix, so only the file name is needed
            fp.write('{}\t{}\t{}\n'.format(idx, label, os.path.basename(image_path)))

random.shuffle(train_images)
random.shuffle(val_images)
random.shuffle(test_images)

create_data_file(train_images, 'train')
create_data_file(val_images, 'val')
create_data_file(test_images, 'test')
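
To confirm the `.lst` files look as expected, one option is to read one back and count the images per class. The sketch below assumes the three tab-separated columns written by `create_data_file` (index, label, file name). The helper name `count_labels` and the sample rows are illustrative.

```python
import collections
import csv
import io

CLASS_NAMES = {0: 'brown', 1: 'polar', 2: 'no'}


def count_labels(lst_file):
    """Count images per class in an open .lst file (index, label, file name)."""
    counts = collections.Counter()
    for row in csv.reader(lst_file, delimiter='\t'):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        counts[CLASS_NAMES[int(row[1])]] += 1
    return counts


# Made-up rows in the same layout as image-train.lst
sample = io.StringIO('0\t1\taaaa.jpg\n1\t0\tbbbb.jpg\n2\t1\tcccc.jpg\n')
print(count_labels(sample))
```

On the real data this would be called as `count_labels(open('image-train.lst'))`. A heavily skewed count is worth knowing about before training.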

Display a random training image to make sure everything looks OK.

In [None]:
import random
from IPython.display import Image

# randrange(n) picks from 0..n-1, so every training image can be selected
rand_image = random.randrange(len(train_images))
print(train_images[rand_image])
Image(train_images[rand_image])

Upload the images and the .lst files to S3 in preparation for training.

In [None]:
# Four channels: train, validation, train_lst, and validation_lst
s3train = 's3://{}/{}/train/'.format(bucket, prefix)
s3validation = 's3://{}/{}/validation/'.format(bucket, prefix)
s3train_lst = 's3://{}/{}/train_lst/'.format(bucket, prefix)
s3validation_lst = 's3://{}/{}/validation_lst/'.format(bucket, prefix)

# upload the image files to train and validation channels
!aws s3 cp ./data/train/brown $s3train --recursive --quiet
!aws s3 cp ./data/train/no $s3train --recursive --quiet
!aws s3 cp ./data/train/polar $s3train --recursive --quiet
!aws s3 cp ./data/val/brown $s3validation --recursive --quiet
!aws s3 cp ./data/val/no $s3validation --recursive --quiet
!aws s3 cp ./data/val/polar $s3validation --recursive --quiet

# upload the lst files to train_lst and validation_lst channels
!aws s3 cp image-train.lst $s3train_lst --quiet
!aws s3 cp image-val.lst $s3validation_lst --quiet
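
Because the three class folders are copied into the same S3 prefix, two images that share a file name in different class folders would overwrite each other in S3, while the `.lst` file would still list that name under both labels. A quick sketch of a check for this, using a hypothetical helper `find_duplicate_names` and made-up paths:

```python
import collections
import os


def find_duplicate_names(image_paths):
    """Return {file name: [paths]} for file names that occur under more than one path."""
    by_name = collections.defaultdict(list)
    for path in image_paths:
        by_name[os.path.basename(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


# Made-up paths for illustration; in the notebook you would pass train_images or val_images
paths = [
    './data/train/brown/0001.jpg',
    './data/train/polar/0002.jpg',
    './data/train/no/0001.jpg',
]
print(find_duplicate_names(paths))
```

An empty result means the flat upload is safe for that split.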

# Training

Begin by configuring the estimator that will train our model.

In [None]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
multilabel_ic = sagemaker.estimator.Estimator(training_image,
                                              role,
                                              instance_count=1,
                                              instance_type='ml.p3.2xlarge',
                                              volume_size=50,  # GB of storage attached to the training instance
                                              max_run=360000,  # maximum training time in seconds (100 hours)
                                              input_mode='File',
                                              output_path=s3_output_location,
                                              sagemaker_session=sess)

In [None]:
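multilabel_ic.set_hyperparameters(num_layers=50,  # ResNet depth
                                  use_pretrained_model=1,  # start from pretrained weights (transfer learning)
                                  image_shape='3,224,224',  # channels, height, width
                                  num_classes=3,  # brown, polar, no bear
                                  mini_batch_size=64,
                                  epochs=10,
                                  resize=256,
                                  learning_rate=0.001,
                                  num_training_samples=1814,  # should equal len(train_images)
                                  use_weighted_loss=1,
                                  augmentation_type='crop_color_transform',
                                  precision_dtype='float16',
                                  multi_label=0)  # single-label: each image belongs to exactly one class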
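
To get a feel for how `num_training_samples`, `mini_batch_size`, and `epochs` interact, the rough arithmetic below estimates the number of mini-batches per epoch and in total. It assumes a final partial batch counts as one step, which may not match exactly how the algorithm handles the remainder.

```python
import math

num_training_samples = 1814  # value from the hyperparameters above
mini_batch_size = 64
epochs = 10

# Assumption: a trailing partial batch is counted as a full step
batches_per_epoch = math.ceil(num_training_samples / mini_batch_size)
total_batches = batches_per_epoch * epochs

print(batches_per_epoch)  # 29
print(total_batches)      # 290
```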

In [None]:
train_data = sagemaker.inputs.TrainingInput(s3train, distribution='FullyReplicated', 
                        content_type='application/x-image', s3_data_type='S3Prefix')
train_data_lst = sagemaker.inputs.TrainingInput(s3train_lst, distribution='FullyReplicated', 
                        content_type='application/x-image', s3_data_type='S3Prefix')

validation_data = sagemaker.inputs.TrainingInput(s3validation, distribution='FullyReplicated', 
                        content_type='application/x-image', s3_data_type='S3Prefix')
validation_data_lst = sagemaker.inputs.TrainingInput(s3validation_lst, distribution='FullyReplicated', 
                        content_type='application/x-image', s3_data_type='S3Prefix')

data_channels = {'train': train_data, 'validation': validation_data, 'train_lst': train_data_lst, 
                        'validation_lst': validation_data_lst}

Start the training job.

In [None]:
multilabel_ic.fit(inputs=data_channels, logs=True)

# Inference

Deploy the trained model to an endpoint where we can make inferences.

In [None]:
ic_classifier = multilabel_ic.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

Pick a random image from our test dataset, send it to the endpoint, and print the probability of each class.

In [None]:
import random
from IPython.display import Image
import json

# randrange(n) picks from 0..n-1, so every test image can be selected
rand_image = random.randrange(len(test_images))
print(test_images[rand_image])

# Read the image as raw bytes and send it to the endpoint
with open(test_images[rand_image], 'rb') as image:
    payload = bytearray(image.read())
results = ic_classifier.predict(payload, initial_args={'ContentType': 'application/x-image'})

# The response is treated as a JSON list with one probability per class, in label order
prob = json.loads(results)
classes = ['Brown Bear', 'Polar Bear', 'No Bear']
for name, p in zip(classes, prob):
    print('%s:%f ' % (name, p), end='')
Image(test_images[rand_image])
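
The cell above treats the response as a JSON list with one probability per class, in label order. To turn that into a single prediction, take the index of the largest value. A minimal sketch with a made-up response body (the helper name `top_class` is illustrative):

```python
import json

CLASSES = ['Brown Bear', 'Polar Bear', 'No Bear']


def top_class(response_body, classes=CLASSES):
    """Return (class name, probability) for the most likely class."""
    probabilities = json.loads(response_body)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best], probabilities[best]


# Made-up response for illustration, not real model output
fake_response = b'[0.07, 0.91, 0.02]'
print(top_class(fake_response))  # ('Polar Bear', 0.91)
```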

As you can see, the model does a pretty good job of predicting the right class: the highest probability should fall on the kind of bear shown in the image. This should give you an idea of how to build an image classification model with SageMaker.
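
A single random image only hints at model quality. One way to quantify it would be to send every test image to the endpoint and compare the predicted label with the label implied by the folder name. The sketch below shows only the scoring step, on made-up label lists. `accuracy` and `confusion_counts` are hypothetical helpers, and the numbers are not results from this model.

```python
import collections


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    if not true_labels:
        return 0.0
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return collections.Counter(zip(true_labels, predicted_labels))


# Made-up labels (0 = brown, 1 = polar, 2 = no bear), for illustration only
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

print(accuracy(true_labels, predicted_labels))
print(confusion_counts(true_labels, predicted_labels)[(0, 1)])
```

The off-diagonal pairs in `confusion_counts`, such as `(0, 1)` for a brown bear predicted as polar, show which classes the model confuses.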

Clean up by deleting the endpoint so it does not keep accruing charges.

In [None]:
ic_classifier.delete_endpoint()