# Iris Training and Prediction with SageMaker Scikit-learn
### Modified Version of AWS Example:
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/Scikit-learn%20Estimator%20Example%20With%20Batch%20Transform.ipynb  

The following modifications were made:  
1. Incorporated scripts for local mode hosting  
2. Added Train and Test Channels  
3. Visualize results (confusion matrix and reports)  
4. Added steps to deploy using model artifacts stored in S3  
  
The following script changes were made:  
1. Uses the RandomForest algorithm
2. Refactored the script to follow the template provided in the TensorFlow example:
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.ipynb


This tutorial shows you how to use [Scikit-learn](https://scikit-learn.org/stable/) with SageMaker by utilizing the pre-built container. Scikit-learn is a popular Python machine learning framework. It includes a number of different algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing.

The [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module makes it easy to take existing scikit-learn code and run it on SageMaker. We will show this by training a model on the Iris dataset and generating a set of predictions. For more information about the Scikit-learn container, see the [sagemaker-scikit-learn-container](https://github.com/aws/sagemaker-scikit-learn-container) repository and the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) repository.

For more on Scikit-learn, please visit the Scikit-learn website: <http://scikit-learn.org/stable/>.

### Table of contents
* [Upload the data for training](#upload_data)
* [Create a Scikit-learn script to train with](#create_sklearn_script)
* [Create the SageMaker Scikit Estimator](#create_sklearn_estimator)
* [Train the SKLearn Estimator on the Iris data](#train_sklearn)
* [Using the trained model to make inference requests](#inference)
 * [Deploy the model](#deploy)
 * [Choose some data and use it for a prediction](#prediction_request)
 * [Endpoint cleanup](#endpoint_cleanup)
* [Batch Transform](#batch_transform)
 * [Prepare Input Data](#prepare_input_data)
 * [Run Transform Job](#run_transform_job)
 * [Check Output Data](#check_output_data)

First, let's create our SageMaker session and role, and create an S3 prefix to use for the notebook example.

### Local Mode Execution - requires docker compose

In [None]:
# Jupyter Notebook Setup. No longer needed as Jupyter Notebook
# has docker pre-installed for local mode execution

#!/bin/bash ./setup.sh

In [1]:
# JupyterLab: Use this script to install docker components in JupyterLab
# This is needed for local mode training and deployment
# Modified to incorporate additional checks: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/sagemaker_studio_docker_cli_install/sagemaker-ubuntu-jammy-docker-cli-install.sh
!/bin/bash "../../docker_install.sh"

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]      
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [3413 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2738 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [52.2 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1513 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [33.8 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [3323 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2454 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1224 kB]
Get:13 http://

In [None]:
import os
import sys
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import itertools

import sklearn
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix

# SageMaker SKLearn Estimator
from sagemaker.sklearn.estimator import SKLearn

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

In [None]:
print('Python version', sys.version)
print('Sklearn version', sklearn.__version__)

## Training Data

In [None]:
column_list_file = 'iris_train_column_list.txt'
train_file = 'iris_train.csv'
test_file = 'iris_validation.csv'

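# Read the comma-separated column names; strip any trailing newline
# so the last column name stays clean
with open(column_list_file, 'r') as f:
    columns = f.read().strip().split(',')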

In [None]:
# Specify your bucket name
bucket_name = 'chandra-ml-sagemaker-us-west-2'

training_folder = r'iris/train'
test_folder = r'iris/test'
model_folder = r'iris/model/'

training_data_uri = r's3://' + bucket_name + r'/' + training_folder
testing_data_uri = r's3://' + bucket_name + r'/' + test_folder
model_data_uri = r's3://' + bucket_name + r'/' + model_folder

In [None]:
training_data_uri,testing_data_uri,model_data_uri

In [None]:
sagemaker_session.upload_data(train_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

In [None]:
sagemaker_session.upload_data(test_file, 
                              bucket=bucket_name, 
                              key_prefix=test_folder)

Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to the S3 bucket specified above.
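When a single file is uploaded, the resulting object key is the key prefix followed by the file name. The sketch below only illustrates that naming; the helper `uploaded_object_uri` and the bucket `example-bucket` are hypothetical and not part of the SDK.

```python
import posixpath


def uploaded_object_uri(bucket, key_prefix, local_path):
    """Build the S3 URI where a single uploaded file is expected to land."""
    file_name = posixpath.basename(local_path)
    return "s3://" + bucket + "/" + posixpath.join(key_prefix, file_name)


# Hypothetical bucket name, used only for illustration.
uri = uploaded_object_uri("example-bucket", "iris/train", "iris_train.csv")
print(uri)
```

Note that the training channels in this notebook point at the folder prefixes (`iris/train`, `iris/test`), so every object under a prefix is made available to the training job.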

## Create a Scikit-learn script to train with <a class="anchor" id="create_sklearn_script"></a>
SageMaker can run a scikit-learn script using the `SKLearn` estimator. When the script runs on SageMaker, a number of helpful environment variables are available to access properties of the training environment, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.
* `SM_OUTPUT_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the `SKLearn` estimator's `fit()` method, the following environment variables will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel
* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script that we will run in this notebook is shown below:

In [None]:
!pygmentize 'scikit_learn_iris.py'
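The actual script is printed by the cell above. For orientation, here is a minimal sketch of the argument handling such a script typically contains. It is not the contents of `scikit_learn_iris.py`: the `parse_args` helper, the default hyperparameter values, and the fallback directories are assumptions made for illustration. The hyperparameter names match the ones passed to the estimator later in this notebook.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--n_estimators", type=int, default=10)
    parser.add_argument("--max_depth", type=int, default=None)

    # Paths default to the environment variables set by the container.
    # The fallback directories are placeholders for runs outside SageMaker.
    parser.add_argument("--model_dir",
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--test",
                        default=os.environ.get("SM_CHANNEL_TEST", "data/test"))

    # parse_known_args ignores any extra arguments the caller may pass.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--n_estimators", "50", "--max_depth", "5"])
print(args.n_estimators, args.max_depth, args.model_dir)
```

Defaulting the paths to environment variables lets the same script run inside the container and on a local machine without changes.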

Because the Scikit-learn container imports your training script, you should always put your training code in a main guard (`if __name__ == '__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.
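A minimal sketch of that layout follows. The `model_fn(model_dir)` function is the hook the Scikit-learn container calls to load a model for hosting. Everything else is an assumption for illustration: real scripts usually persist the estimator with `joblib`, whereas this sketch uses `pickle`, a placeholder object, and the file name `model.pkl` so that it runs without scikit-learn installed.

```python
import os
import pickle
import tempfile


def save_model(model, model_dir):
    """Write the model to model_dir, where SageMaker collects artifacts."""
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)


def model_fn(model_dir):
    """Load the model for hosting; importing this file has no side effects."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Training logic runs only when the file is executed as a script,
    # not when the container imports it to call model_fn.
    model_dir = os.environ.get("SM_MODEL_DIR") or tempfile.mkdtemp()
    save_model({"placeholder": "trained model goes here"}, model_dir)
```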

For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers.

## Create SageMaker Scikit Estimator <a class="anchor" id="create_sklearn_estimator"></a>

To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: The ARN of the IAM role that SageMaker assumes to run the training job.
* __instance_type__ *(optional)*: The type of SageMaker instance used for training (named `train_instance_type` in SDK 1.x). __Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary of hyperparameters passed to the training script as command-line arguments.

To see the code for the SKLearn Estimator, see here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn

In [None]:
#instance_type='ml.m5.xlarge'

instance_type='local'

In [None]:
#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

# SDK 2.x version does not require train prefix for instance count and type
# Specify framework and python Version

# Updated framework version from 0.20.0 to 0.23-1

estimator = SKLearn(entry_point='scikit_learn_iris.py',
                    framework_version = "0.23-1",
                    py_version = 'py3',
                    instance_type= instance_type,                     
                    role=role, 
                    output_path=model_data_uri,
                    base_job_name='sklearn-iris',
                    hyperparameters={'n_estimators': 50,'max_depth':5})

## Train SKLearn Estimator on Iris data <a class="anchor" id="train_sklearn"></a>
Training is very simple: just call `fit` on the Estimator. This starts a SageMaker training job that downloads the data for us, invokes our scikit-learn code (in the provided script file), and saves any model artifacts that the script creates.

In [None]:
estimator.fit({'training':training_data_uri,'testing':testing_data_uri})
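The keys of the dictionary passed to `fit` are the channel names, and they determine the environment variable names the script sees. Below is a minimal sketch of the `SM_CHANNEL_[channel_name]` naming rule described earlier, assuming the channel name is simply upper-cased. The helper `channel_env_var` and the example bucket are hypothetical.

```python
def channel_env_var(channel_name):
    """Return the environment variable name for an input channel."""
    return "SM_CHANNEL_" + channel_name.upper()


# Hypothetical S3 locations, used only for illustration.
channels = {
    "training": "s3://example-bucket/iris/train",
    "testing": "s3://example-bucket/iris/test",
}

for name in channels:
    print(name, "->", channel_env_var(name))
```

With the channel names used in the call above, the training script would read `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING`.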

In [None]:
estimator.latest_training_job.job_name

In [None]:
estimator.model_data

## Using the trained model to make inference requests <a class="anchor" id="inference"></a>

### Deploy the model <a class="anchor" id="deploy"></a>

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count and instance type.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, 
                           instance_type=instance_type)

### Choose some data and use it for a prediction <a class="anchor" id="prediction_request"></a>

To try out some predictions, we load the validation file (`iris_validation.csv`) and run predictions against its feature columns.

In [None]:
df = pd.read_csv(test_file, names=columns)

In [None]:
# Encode Class Labels to integers
# Labeled Classes
labels=[0,1,2]
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
le = preprocessing.LabelEncoder()
le.fit(classes)

In [None]:
df.head()

In [None]:
X_test = df.iloc[:,1:]
print(X_test[:5])

In [None]:
result = predictor.predict(X_test)

In [None]:
result

In [None]:
df['predicted_class'] = result

In [None]:
df.head()

<h2>Confusion Matrix</h2>
A confusion matrix is a table that summarizes the performance of a classification model.<br><br>
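To make the table concrete, here is a minimal pure-Python sketch of how the counts are formed: rows correspond to the true class and columns to the predicted class, which is the same layout scikit-learn's `confusion_matrix` uses. The helper `confusion_counts` and the six sample labels are hypothetical and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return counts[i][j]: samples with true label i predicted as label j."""
    index = {label: k for k, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    return counts


# Hypothetical labels for six samples (not results from this notebook).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

counts = confusion_counts(y_true, y_pred, labels=[0, 1, 2])
for row in counts:
    print(row)
```

Correct predictions fall on the diagonal; the single off-diagonal entry is the class 1 sample that was predicted as class 2.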

In [None]:
# Reference: 
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    #else:
    #    print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(df['encoded_class'],
                              df['predicted_class'],labels=labels)

In [None]:
cnf_matrix

In [None]:
# Plot confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix - Count')

In [None]:
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix - Normalized',normalize=True)

In [None]:
print(classification_report(
    df['encoded_class'],
    df['predicted_class'],
    labels=labels,
    target_names=classes))
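The report lists precision and recall for each class. As a minimal sketch of how these two metrics follow from a confusion matrix: precision divides the diagonal entry by its column total, and recall divides it by its row total. The helper `precision_recall` and the example matrix are hypothetical and are not results from this notebook.

```python
def precision_recall(counts):
    """Compute per-class (precision, recall) from a confusion matrix.

    counts[i][j] is the number of samples with true class i predicted as j.
    """
    n = len(counts)
    results = []
    for k in range(n):
        true_positive = counts[k][k]
        predicted_k = sum(counts[i][k] for i in range(n))  # column total
        actual_k = sum(counts[k])                          # row total
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


# Hypothetical 3-class confusion matrix (not results from this notebook).
example = [[2, 0, 0],
           [0, 1, 1],
           [0, 0, 2]]

for k, (p, r) in enumerate(precision_recall(example)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f}")
```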

### Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up.

In [None]:
# SDK 2
predictor.delete_endpoint()
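An endpoint that is left running keeps incurring charges, so it can help to make cleanup automatic. One possible pattern, sketched here under stated assumptions, wraps the predictor in a context manager that always calls `delete_endpoint()`, even if a prediction raises. The `temporary_endpoint` helper and the `StubPredictor` class are hypothetical; the stub stands in for a real predictor so the sketch runs without AWS.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and always delete its endpoint afterwards."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


class StubPredictor:
    """Stand-in for a real predictor so the sketch runs without AWS."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        return [0 for _ in data]

    def delete_endpoint(self):
        self.deleted = True


stub = StubPredictor()
with temporary_endpoint(stub) as p:
    p.predict([[5.1, 3.5, 1.4, 0.2]])
print(stub.deleted)
```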

## Another way to deploy an endpoint: using trained model artifacts
https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html#scikit-learn-predictor
https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#working-with-existing-model-data-and-training-jobs

In [None]:
model_data = estimator.model_data

In [None]:
model_data

In [None]:
# Updated framework version from 0.20.0 to 0.23-1
import sagemaker.sklearn

model = sagemaker.sklearn.model.SKLearnModel(model_data=model_data,
                                             role=role, entry_point='scikit_learn_iris.py',
                                             framework_version = "0.23-1",
                                             py_version = 'py3')

In [None]:
predictor_2 = model.deploy(initial_instance_count=1, instance_type=instance_type)

In [None]:
predictor_2.predict(X_test[:20])

In [None]:
predictor_2.delete_endpoint()