# Adaptation Exercise: From PyTorch Analysis of Moon Data to Sklearn Analysis of Iris Data

References:

* MLE Nanodegree Lesson "Deploy Custom Model" (with [Moon Data Exercise](https://github.com/udacity/ML_SageMaker_Studies/blob/master/Moon_Data/Moon_Classification_Exercise.ipynb))
* [Iris Training and Prediction with Sagemaker Scikit-learn notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_iris/scikit_learn_estimator_example_with_batch_transform.ipynb) by AWS Sagemaker GitHub tutorials

### Get data

In [1]:
import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs('./data', exist_ok=True)
np.savetxt('./data/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')
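
To make the CSV layout concrete, here is a minimal sketch that repeats the same `np.insert` and `np.savetxt` calls on a tiny made-up array (the values and the three-column format string are illustrative assumptions, not Iris data). It shows that the label ends up in the first column, and that a multi-specifier `fmt` string supplies its own `, ` separator.

```python
import io

import numpy as np

# Toy stand-ins for iris.data and iris.target (illustrative values only).
features = np.array([[5.1, 3.5], [6.0, 2.2]])
targets = np.array([0, 1])

# Insert the labels as column 0, so each row reads: label, feature, feature.
joined = np.insert(features, 0, targets, axis=1)

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
np.savetxt(buffer, joined, delimiter=',', fmt='%1.1f, %1.3f, %1.3f')
csv_text = buffer.getvalue()
print(csv_text)
```

Because the format string already spells out the full row, each field after the first is preceded by a space in the written file.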

### Sagemaker Resources

In [2]:
prefix = 'scikit_iris'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
sagemaker_session

<sagemaker.session.Session at 0x7f95c4cf1c50>

In [3]:
# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

In [4]:
# default S3 bucket
bucket = sagemaker_session.default_bucket()
bucket

'sagemaker-us-east-1-906713186745'

### Upload Data to S3

In [6]:
# specify where to upload in S3
prefix = 'sagemaker/iris-data'

# upload to S3
input_data = sagemaker_session.upload_data(path='data/', bucket=bucket, key_prefix=prefix)
input_data

's3://sagemaker-us-east-1-906713186745/sagemaker/iris-data'

In [8]:
# iterate through S3 objects and print contents
import boto3
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    print(obj.key)

sagemaker/iris-data/iris.csv


### Create Sagemaker Estimator & Train

In [9]:
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = 'src/train.py'
hyperparameters = {'max_leaf_nodes': 30}

model = SKLearn(
    entry_point = script_path, #path to the Python script SageMaker runs for training and prediction TODO: further abstract
    framework_version = FRAMEWORK_VERSION,
    instance_type = "ml.c4.xlarge",
    role = role,
    sagemaker_session = sagemaker_session,
    hyperparameters = hyperparameters
)
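
The notebook does not show `src/train.py`, so the following is only a sketch of what such an entry point might look like, not the author's actual script. It assumes a `DecisionTreeClassifier` (suggested by the `max_leaf_nodes` hyperparameter), a label-first CSV as written above, and the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables that SageMaker training containers set. The helper name `train` and the file name `model.joblib` are hypothetical.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def train(train_dir, model_dir, max_leaf_nodes=None):
    """Fit a decision tree on every CSV in train_dir and save it to model_dir."""
    files = [os.path.join(train_dir, name)
             for name in sorted(os.listdir(train_dir)) if name.endswith('.csv')]
    data = pd.concat([pd.read_csv(path, header=None) for path in files])

    labels = data.iloc[:, 0]     # first column holds the class label
    features = data.iloc[:, 1:]  # remaining columns hold the features

    clf = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf.fit(features.values, labels.values)

    # Hypothetical artifact name; model_fn below must load the same file.
    joblib.dump(clf, os.path.join(model_dir, 'model.joblib'))
    return clf


def model_fn(model_dir):
    """Load the saved model; the scikit-learn serving container calls this hook."""
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# Only run the training entry point inside a SageMaker training container,
# which is assumed here to be signalled by SM_MODEL_DIR being set.
if __name__ == '__main__' and 'SM_MODEL_DIR' in os.environ:
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --max_leaf_nodes 30.
    parser.add_argument('--max_leaf_nodes', type=int, default=None)
    parser.add_argument('--model-dir', default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    args, _ = parser.parse_known_args()
    train(args.train, args.model_dir, args.max_leaf_nodes)
```

Defining `model_fn` in the same script is what lets the deployed endpoint load the model without a separate inference script.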

In [10]:
model.fit({'train':input_data})

2021-04-22 02:22:29 Starting - Starting the training job...
2021-04-22 02:22:52 Starting - Launching requested ML instancesProfilerReport-1619058149: InProgress
......
2021-04-22 02:23:53 Starting - Preparing the instances for training......
2021-04-22 02:24:59 Downloading - Downloading input data...
2021-04-22 02:25:23 Training - Downloading the training image...
2021-04-22 02:25:57 Uploading - Uploading generated training model
2021-04-22 02:25:57 Completed - Training job completed
2021-04-22 02:25:44,741 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-04-22 02:25:44,743 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-04-22 02:25:44,753 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-04-22 02:25:45,113 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-04-22 02:25:45,126 sagemaker-training-to

### Deploy trained model

In [11]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

-------------------!

### Apply deployed trained model on new data

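Select some rows from the original Iris dataset to use as test data. The full CSV was uploaded as the training channel, so these rows were most likely seen during training; treat the predictions below as a sanity check of the endpoint rather than a held-out evaluation: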

In [12]:
import itertools
import pandas as pd

shape = pd.read_csv("data/iris.csv", header=None)

a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:,1:]
test_y = test_data.iloc[:,0]
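
The index arithmetic above relies on the Iris rows being ordered by class in three blocks of 50. A minimal sketch with more descriptive (hypothetical) variable names shows which rows end up selected: the last ten of each class, minus the final row, which `indices[:-1]` drops.

```python
import itertools

class_offsets = [50 * i for i in range(3)]   # start row of each class block
within_class = [40 + i for i in range(10)]   # last ten rows inside a block

# Every combination of block start and in-block position, in class order.
indices = [i + j for i, j in itertools.product(class_offsets, within_class)]
selected = indices[:-1]                      # drops row 149

print(len(indices), len(selected))           # 30 29
print(selected[:3], selected[-3:])           # [40, 41, 42] [146, 147, 148]
```

This is why the third class contributes nine samples instead of ten in the outputs that follow.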

Unlike the PyTorch version of this exercise, the predictor's default `.predict(...)` is sufficient here, so no separate `predict.py` script is needed.

In [13]:
print(predictor.predict(test_X.values))
print(test_y.values)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]


### More generic prediction and evaluation

In [16]:
def evaluate(test_preds, test_labels, verbose=True):
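    """
    Compute binary classification metrics from predictions and labels.
    Assumes labels and predictions take values in {0, 1}; with more
    classes the counts below mix classes and the metrics are not meaningful.
    Ref: Udacity
    """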
    
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actuals'], colnames=['predictions']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}

In [14]:
test_preds = np.squeeze(np.round(predictor.predict(test_X.values)))

In [17]:
metrics = evaluate(test_preds, test_y)

predictions  0.0  1.0  2.0
actuals                   
0.0           10    0    0
1.0            0   10    0
2.0            0    0    9

Recall:     0.679
Precision:  0.679
Accuracy:   0.679
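
The crosstab shows all 29 samples on the diagonal, yet the reported metrics are 0.679. That figure comes from applying the binary formulas to three-class labels: recall, for example, becomes 19 / 28. The sketch below reproduces that number and then computes accuracy plus per-class precision and recall instead. The function name `evaluate_multiclass` is a hypothetical helper, and the arrays are rebuilt from the crosstab rather than fetched from the endpoint.

```python
import numpy as np


def evaluate_multiclass(test_preds, test_labels):
    """Return overall accuracy and per-class precision/recall (one-vs-rest)."""
    preds = np.asarray(test_preds)
    labels = np.asarray(test_labels)
    per_class = {}
    for c in np.unique(np.concatenate([labels, preds])):
        tp = int(np.sum((preds == c) & (labels == c)))
        fp = int(np.sum((preds == c) & (labels != c)))
        fn = int(np.sum((preds != c) & (labels == c)))
        per_class[float(c)] = {
            'Precision': tp / (tp + fp) if tp + fp else 0.0,
            'Recall': tp / (tp + fn) if tp + fn else 0.0,
        }
    return {'Accuracy': float(np.mean(preds == labels)), 'PerClass': per_class}


# Labels and predictions reconstructed from the crosstab: 10, 10 and 9 correct.
labels = np.array([0.0] * 10 + [1.0] * 10 + [2.0] * 9)
preds = labels.copy()

# Why the binary formulas report 0.679: class 2 is counted as both TP and FN.
tp = np.logical_and(labels, preds).sum()
fn = np.logical_and(labels, 1 - preds).sum()
binary_recall = tp / (tp + fn)
print(round(float(binary_recall), 3))   # 0.679

result = evaluate_multiclass(preds, labels)
print(result['Accuracy'])               # 1.0
```

For routine use, `sklearn.metrics.classification_report` gives the same per-class breakdown without hand-written counting.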

