# Part II: Data Preprocessing

From our data exploration, we know we need to make the following changes. Codifying them ensures they are applied consistently.

1. rescale the images to be the same size
1. crop the border from around the image
1. change the image to ensure it's grayscale and not RGB
1. create pipeline tasks that can be reused for data ingestion and against batch offline data

# Setup

In [None]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import clear_output

In [None]:
# the scratch directory is listed in .gitignore so it is never committed to git
%env SCRATCH=../scratch
! [ -e "${SCRATCH}" ] || mkdir -p "${SCRATCH}"

scratch_path = os.environ.get('SCRATCH', './scratch')

# Transform the images

It is best practice not to overwrite your raw data, so we will write the transformed images to a new directory. The transformation steps are:
1. resize all the images to 96x96
1. crop the border
1. convert to grayscale
1. because cropping makes the images smaller and we want to maintain the 96x96 input, resize them back to 96x96
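
To see why the final resize is needed, the crop arithmetic can be worked through directly. This is a minimal sketch that assumes the crop bounds used by the processing function defined in the next section (10 pixels trimmed from the top, 15 from the bottom, 5 from the left, and 6 from the right). It uses plain Python ranges rather than real image data, and the name IMAGE_SIZE is introduced only for illustration.

```python
# Crop bounds, copied from the processing function defined in the next section
IMAGE_SIZE = 96  # illustrative name, not part of the original code
CROP_TOP = 10
CROP_BOT = IMAGE_SIZE - 15
CROP_L = 5
CROP_R = IMAGE_SIZE - 6

# Slicing a range mimics slicing the rows and columns of an image array
rows = range(IMAGE_SIZE)[CROP_TOP:CROP_BOT]
cols = range(IMAGE_SIZE)[CROP_L:CROP_R]

cropped_shape = (len(rows), len(cols))
print(cropped_shape)  # (71, 85)
```

The cropped image is 71 pixels high and 85 pixels wide, so it no longer matches the 96x96 input size until it is resized.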

## Define the processing function

The %%writefile cell magic saves the contents of a cell to a Python file. The %run magic command then executes that file so the functions it defines can be called from the notebook.
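
For readers working outside a notebook, a roughly similar write-then-run flow can be sketched with the standard library. This is an analogy, not how IPython implements the magics: runpy.run_path stands in for %run, and the file name and the double function are hypothetical.

```python
import os
import runpy
import tempfile

# Hypothetical stand-in for the source that %%writefile would save
source = "def double(x):\n    return 2 * x\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "example_script.py")
    with open(script_path, "w") as handle:
        handle.write(source)

    # run_path executes the file and returns the globals it defined
    namespace = runpy.run_path(script_path)

# The function defined in the script is now callable from this session
double = namespace["double"]
print(double(21))  # 42
```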

In [None]:
%%writefile ../src/process_images.py

import os
import cv2

CROP_TOP = 10
CROP_BOT = 96-15
CROP_L = 5
CROP_R = 96-6

def process_images_in_directory(input_directory, output_directory):
    # Create the output directory if it doesn't exist
    os.makedirs(output_directory, exist_ok=True)

    for filename in os.listdir(input_directory):
        file_path = os.path.join(input_directory, filename)

        if os.path.isdir(file_path):
            # If it's a subdirectory, recursively process its contents
            subdirectory_output = os.path.join(output_directory, filename)
            process_images_in_directory(file_path, subdirectory_output)
        elif filename.endswith(".png"):  # You can adjust the file extension as needed
            # Read the image using OpenCV
            image = cv2.imread(file_path)

            if image is not None:
                # Resize the image to 96x96
                resized_image = cv2.resize(image, (96, 96))

                # Crop the border: keep rows 10:81 and columns 5:90
                cropped_image = resized_image[CROP_TOP:CROP_BOT, CROP_L:CROP_R]
                
                # Convert the images to grayscale
                grayscale_image = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY)
                
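                # Resize back to 96x96 so the saved image matches the expected input size
                final_image = cv2.resize(grayscale_image, (96, 96))

                # Save the processed image to the output directory
                output_path = os.path.join(output_directory, filename)
                cv2.imwrite(output_path, final_image)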

                print(f"Processed and saved {filename} to {output_path}")
            else:
                print(f"Skipping {filename}: Unable to read the image")

# Example usage:
# Replace "input_root_directory" and "output_root_directory" with your root input and output directory paths
# process_images_in_directory("input_root_directory", "output_root_directory")

## Load the function in the notebook

In [None]:
%run ../src/process_images.py
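
The function recurses into subdirectories and mirrors the input layout under the output directory. Before running it on real data, it can help to preview that mapping. The following is a minimal dry-run sketch using only the standard library. The helper plan_outputs and the sample file names are hypothetical, and it uses os.walk rather than the explicit recursion in the function above.

```python
import os
import tempfile


def plan_outputs(input_directory, output_directory, extension=".png"):
    """Return sorted (source, destination) pairs, mirroring subdirectories."""
    pairs = []
    for root, _dirs, files in os.walk(input_directory):
        # Path of the current folder relative to the input root ("." at the top)
        relative = os.path.relpath(root, input_directory)
        for filename in files:
            if filename.endswith(extension):
                source = os.path.join(root, filename)
                destination = os.path.normpath(
                    os.path.join(output_directory, relative, filename)
                )
                pairs.append((source, destination))
    return sorted(pairs)


# Build a small throwaway tree to demonstrate the mapping
with tempfile.TemporaryDirectory() as tmp_dir:
    input_root = os.path.join(tmp_dir, "train")
    os.makedirs(os.path.join(input_root, "left", "thumb"))
    sample_files = (
        "top.png",
        os.path.join("left", "thumb", "a.png"),
        os.path.join("left", "notes.txt"),  # skipped: not a .png file
    )
    for relative_file in sample_files:
        open(os.path.join(input_root, relative_file), "w").close()

    planned = plan_outputs(input_root, os.path.join(tmp_dir, "processed"))
    relative_destinations = [os.path.relpath(dst, tmp_dir) for _src, dst in planned]

print(relative_destinations)
```

Only the two .png files appear in the plan, each under the same relative subdirectory in the output tree.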

## Set the input and output directory to process the training data

In [None]:
input_directory = scratch_path + "/train"  
output_directory = scratch_path + "/processed/hand" 

In [None]:
process_images_in_directory(input_directory, output_directory)

clear_output()

## Set the input and output directory to process the real data

In [None]:
input_directory = scratch_path + "/real"  
output_directory = scratch_path + "/processed/real"  

In [None]:
process_images_in_directory(input_directory, output_directory)

clear_output()

## Clean up unused data from disk

This code checks whether the real and train directories exist within scratch_path before removing them with shutil.rmtree(). The check avoids the FileNotFoundError that shutil.rmtree() raises when the directory does not exist.
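
An alternative sketch, assuming you do not need to know whether a directory was actually present: shutil.rmtree() accepts ignore_errors=True, which turns removal of a missing directory into a no-op. This also hides other failures such as permission errors, so the explicit existence check may still be preferable. The temporary scratch directory here is created only for illustration.

```python
import os
import shutil
import tempfile

# Hypothetical scratch layout: only "real" exists, "train" is missing
scratch_path = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch_path, "real"))

for name in ("real", "train"):
    target = os.path.join(scratch_path, name)
    # With ignore_errors=True, a missing directory does not raise
    shutil.rmtree(target, ignore_errors=True)

remaining = os.listdir(scratch_path)
print(remaining)  # []

# Remove the temporary scratch directory itself
shutil.rmtree(scratch_path)
```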

In [None]:
import shutil

if os.path.exists(scratch_path + "/real"):
    shutil.rmtree(scratch_path + "/real")
if os.path.exists(scratch_path + "/train"):
    shutil.rmtree(scratch_path + "/train")

# Upload the processed data to S3

In [None]:
# list buckets using the AWS CLI (with no path argument, `aws s3 ls` lists buckets, not objects)
! aws s3 ls

## List the dynamically created bucket for the demo

In [None]:
import boto3

s3_client = boto3.client('s3')
response = s3_client.list_buckets()

filtered_buckets = []

for bucket in response['Buckets']:
    bucket_name = bucket['Name']
    if bucket_name.startswith('sagemaker-fingerprint-'):
        filtered_buckets.append(bucket_name)
    
print(filtered_buckets)
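
The same prefix filter can be written as a small function that is testable without AWS credentials. This is a sketch: the helper filter_bucket_names and the sample bucket names are hypothetical, and the sample dictionary only imitates the Buckets and Name keys of the list_buckets response used above.

```python
def filter_bucket_names(response, prefix="sagemaker-fingerprint-"):
    """Return the names of buckets in the response that start with prefix."""
    return [
        bucket["Name"]
        for bucket in response.get("Buckets", [])
        if bucket["Name"].startswith(prefix)
    ]


# Hypothetical response with the same shape as the one used above
sample_response = {
    "Buckets": [
        {"Name": "sagemaker-fingerprint-demo"},
        {"Name": "unrelated-bucket"},
    ]
}

print(filter_bucket_names(sample_response))  # ['sagemaker-fingerprint-demo']
```

With a real client, the call would be filter_bucket_names(s3_client.list_buckets()).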

## Upload the processed data to the demo bucket

To upload the processed data to the bucket, change the following cell from Raw to Code and execute it.