# Data Preparation

In this notebook we prepare the data for image classification by performing the following tasks:
- Splitting the data into training and validation sets.
- Creating .lst files for the image classification model.
- Uploading the data to our S3 bucket.

In [7]:
import pandas as pd
import glob
import os
import shutil
import random

In [31]:
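# Create the target folders; exist_ok avoids an error when the cell is re-run
os.makedirs('./data/train', exist_ok=True)
os.makedirs('./data/validation', exist_ok=True)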

## Splitting Dataset into Training and Validation Folders

In [58]:
# Split the images into training and validation folders.
# The dataset holds 499 hot dog and 499 not hot dog images.
train_split = 0.80

def split_data(train_split):
    training_directory = './data/train/'
    validation_directory = './data/validation/'

    images = glob.glob('./data/*.jpg')
    # '/hot_dog' only matches file names that start with hot_dog,
    # so not_hot_dog files are excluded from the first list.
    hotdog_images = [file for file in images if '/hot_dog' in file]
    nothotdog_images = [file for file in images if 'not_hot_dog' in file]

    # Target number of training images per class
    num_of_each = int(len(images) * train_split * 0.5)

    # Note: the range end is num_of_each + 1, so indices 0..num_of_each are moved
    for i in range(0, num_of_each + 1):
        shutil.move(hotdog_images[i], training_directory + os.path.basename(hotdog_images[i]))
        shutil.move(nothotdog_images[i], training_directory + os.path.basename(nothotdog_images[i]))

    # Everything still left in ./data goes to the validation folder
    rest_of_images = glob.glob('./data/*.jpg')
    for image in rest_of_images:
        file_name = os.path.basename(image)
        shutil.move(image, validation_directory + file_name)

In [59]:
# Split the data
split_data(train_split)
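
The split above takes files in whatever order `glob` returns them, which is not guaranteed to be random. A minimal sketch of a shuffled, per-class split follows. It works on lists of file names only (no files are moved), and `split_per_class` and its `seed` parameter are hypothetical names that are not part of this notebook.

```python
import random

def split_per_class(filenames, train_split=0.8, seed=0):
    """Return (train, validation) lists, splitting each class separately."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, validation = [], []
    classes = {
        'hot_dog': [f for f in filenames if f.startswith('hot_dog')],
        'not_hot_dog': [f for f in filenames if f.startswith('not_hot_dog')],
    }
    for files in classes.values():
        files = sorted(files)      # fixed starting order before shuffling
        rng.shuffle(files)
        cut = int(len(files) * train_split)
        train.extend(files[:cut])
        validation.extend(files[cut:])
    return train, validation

# Stand-in file names; the real notebook would pass the names found in ./data
names = [f'hot_dog_{i}.jpg' for i in range(10)] + \
        [f'not_hot_dog_{i}.jpg' for i in range(10)]
train, validation = split_per_class(names)
print(len(train), len(validation))
```

Splitting each class on its own keeps the class ratio the same in both sets, and the fixed seed makes the result repeatable between runs.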

Check that the dataset was split properly.

In [61]:
# Verify that the split worked as expected
training_images = glob.glob('./data/train/*.jpg')
validation_images = glob.glob('./data/validation/*.jpg')

training_hotdogs = [image for image in training_images if '/hot_dog' in image]
training_nots = [image for image in training_images if 'not_hot_dog' in image]

val_hotdogs = [image for image in validation_images if '/hot_dog' in image]
val_nots = [image for image in validation_images if 'not_hot_dog' in image]

print("Number of Training Images: {}".format(len(training_images)))
print("Number of Validation Images: {}".format(len(validation_images)))

print("Number of Hot Dogs in Training: {}".format(len(training_hotdogs)))
print("Number of Not Hot Dogs in Training: {}".format(len(training_nots)))

print("Number of Hot Dogs in Validation: {}".format(len(val_hotdogs)))
print("Number of Not Hot Dogs in Validation: {}".format(len(val_nots)))

Number of Training Images: 799
Number of Validation Images: 199
Number of Hot Dogs in Training: 400
Number of Not Hot Dogs in Training: 399
Number of Hot Dogs in Validation: 99
Number of Not Hot Dogs in Validation: 100
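
The same check can be written more compactly by counting classes from the file-name prefix. This is only a sketch: `count_classes` and the example paths are hypothetical, and it assumes every image is named `hot_dog_*.jpg` or `not_hot_dog_*.jpg` as in this dataset.

```python
import os
from collections import Counter

def count_classes(paths):
    """Count images per class based on the file-name prefix."""
    counts = Counter()
    for path in paths:
        name = os.path.basename(path)
        if name.startswith('not_hot_dog'):
            counts['not_hot_dog'] += 1
        elif name.startswith('hot_dog'):
            counts['hot_dog'] += 1
        else:
            counts['unknown'] += 1  # anything that matches neither prefix
    return counts

# Stand-in paths; in the notebook this would be glob.glob('./data/train/*.jpg')
example_paths = [
    './data/train/hot_dog_1.jpg',
    './data/train/not_hot_dog_1.jpg',
    './data/train/hot_dog_2.jpg',
]
counts = count_classes(example_paths)
print(dict(counts))
```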


## Creating *.lst* files

Amazon SageMaker's built-in image classification algorithm accepts two input formats: RecordIO, or image files listed in a .lst file.

A .lst file is a tab-separated file with three columns that lists the image files. The first column is the image index, the second column is the class label, and the third column is the file path of the image.
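
To make the layout concrete, here is a small sketch that formats two rows in that three-column shape. The helper name `lst_line` is hypothetical; the labels follow this notebook's convention of 1 for hot dog and 0 for not hot dog.

```python
def lst_line(index, label, relative_path):
    """Format one .lst row: image index, class label, image path (tab-separated)."""
    return f'{index}\t{label}\t{relative_path}'

# Two illustrative rows: (index, label, file name)
rows = [(0, 1, 'hot_dog_374.jpg'), (1, 0, 'not_hot_dog_274.jpg')]
lst_text = '\n'.join(lst_line(*row) for row in rows)
print(lst_text)
```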

We will create a dataframe for each of the training and validation sets and save each dataframe as a .lst file.

In [64]:
# Creating Image Dataframe
def create_img_dataframe(directory):
    labels = []
    filenames = []
    
    folder = './data/{}/*.jpg'.format(directory)
    images = glob.glob(folder)
    
    for image in images:
        if '/hot_dog' in image:
            labels.append(1)
            filenames.append(os.path.basename(image))
        elif 'not_hot_dog' in image:
            labels.append(0)
            filenames.append(os.path.basename(image))
    
    df = pd.DataFrame(list(zip(labels, filenames)), columns = ['labels', 's3_path'])
    return df

In [66]:
train_df = create_img_dataframe('train')
train_df.head()

| | labels | s3_path |
|---|---|---|
| 0 | 0 | not_hot_dog_274.jpg |
| 1 | 0 | not_hot_dog_112.jpg |
| 2 | 1 | hot_dog_374.jpg |
| 3 | 1 | hot_dog_409.jpg |
| 4 | 0 | not_hot_dog_94.jpg |


In [68]:
val_df = create_img_dataframe('validation')
val_df.head()

| | labels | s3_path |
|---|---|---|
| 0 | 1 | hot_dog_447.jpg |
| 1 | 1 | hot_dog_27.jpg |
| 2 | 1 | hot_dog_102.jpg |
| 3 | 1 | hot_dog_136.jpg |
| 4 | 1 | hot_dog_328.jpg |


In [69]:
# Save a dataframe as a tab-separated .lst file: index, label, file path
def save_to_lst(df, prefix):
    df[['labels', 's3_path']].to_csv(f'{prefix}.lst', sep='\t', index=True, header=False)

save_to_lst(train_df.copy(), 'train')
save_to_lst(val_df.copy(), 'validation')
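
Before uploading, it can be worth reading a .lst file back to confirm it has the expected three columns. The sketch below uses only the standard library; `read_lst` is a hypothetical helper, and it parses a tiny stand-in file rather than the real `train.lst`.

```python
import csv
import os
import tempfile

def read_lst(path):
    """Parse a .lst file into (index, label, image_path) tuples."""
    with open(path, newline='') as handle:
        reader = csv.reader(handle, delimiter='\t')
        # Unpacking fails loudly if a row does not have exactly three columns
        return [(int(idx), int(label), image) for idx, label, image in reader]

# Write a tiny stand-in file so the sketch runs without the real dataset
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.lst')
with open(sample_path, 'w', newline='') as handle:
    handle.write('0\t0\tnot_hot_dog_274.jpg\n1\t1\thot_dog_374.jpg\n')

rows = read_lst(sample_path)
print(rows)
```

In the notebook itself, the same function could be pointed at `train.lst` and `validation.lst`.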

## Upload data to S3 Bucket

In [70]:
# Set up the SageMaker session and the target S3 bucket

import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
bucket = 'not-hot-dog'
role = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name

os.environ['DEFAULT_S3_BUCKET'] = bucket

In [72]:
# Sync the local image folders to the S3 bucket
!aws s3 sync ./data/train s3://${DEFAULT_S3_BUCKET}/train/
!aws s3 sync ./data/validation s3://${DEFAULT_S3_BUCKET}/validation/

In [73]:
# Upload lst files into s3 bucket
boto3.Session().resource('s3').Bucket(bucket).Object('train.lst').upload_file('./train.lst')
boto3.Session().resource('s3').Bucket(bucket).Object('validation.lst').upload_file('./validation.lst')
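
The two upload calls can also be wrapped in a small helper. This is a sketch under assumptions: `upload_lst_files` and `RecordingClient` are hypothetical names, and the stand-in client only records calls so the example runs without AWS credentials. The helper relies on the boto3 S3 client method `upload_file(Filename, Bucket, Key)`.

```python
import os

def upload_lst_files(s3_client, bucket, local_paths):
    """Upload each file to the bucket root, keyed by its base name."""
    uploaded = []
    for path in local_paths:
        key = os.path.basename(path)
        s3_client.upload_file(path, bucket, key)  # (Filename, Bucket, Key)
        uploaded.append(f's3://{bucket}/{key}')
    return uploaded

class RecordingClient:
    """Hypothetical stand-in for an S3 client; records calls instead of uploading."""
    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))

fake = RecordingClient()
uris = upload_lst_files(fake, 'not-hot-dog', ['./train.lst', './validation.lst'])
print(uris)
```

With real credentials, passing `boto3.client('s3')` in place of the stand-in should perform the same uploads as the cell above.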