# Overview

This notebook demonstrates using Red Hat OpenShift Service on AWS (ROSA) with the Amazon SageMaker Python SDK for machine learning development.

In [3]:
# TODO create ODH SageMaker Notebook image
!pip install -q -r ../../requirements.txt



## Verify your environment variables

In [4]:
# Uncomment the line below to display the environment variables.
# If they are not set, enter them in the notebook image spawner before starting the server.
#!env | grep 'AWS\|S3\|ARN' | sort
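As an alternative to printing the values, which would also display the secret access key, the check can be sketched in Python so that only the names of missing variables are reported. The variable names are the ones read by the cells in this notebook; the helper `missing_env_vars` is a hypothetical name introduced for illustration, and `S3_ENDPOINT_URL` is left out on the assumption that it is optional.

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = [
    "AWS_DEFAULT_REGION",
    "EXECUTION_ROLE_ARN",
    "S3_ACCESS_KEY_ID",
    "S3_SECRET_ACCESS_KEY",
    "S3_BUCKET",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper: `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    # Only names are printed, never values.
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```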

# Imports

The Amazon SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker. With the SDK, you can train and deploy models using popular deep learning frameworks, algorithms provided by Amazon, or your own algorithms. See https://sagemaker.readthedocs.io/en/stable/

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which lets Python developers write software that uses services such as Amazon S3 and Amazon EC2. Boto3 is maintained and published by Amazon Web Services. See https://github.com/boto/boto3

In [5]:
# import the Amazon SageMaker Python SDK
import sagemaker
from sagemaker import get_execution_role

# the TensorFlow estimator used to train the model
from sagemaker.tensorflow import TensorFlow

# import the AWS SDK for Python
import boto3

# import os for miscellaneous operating-system-dependent functionality
import os

# import numpy for array handling; it underpins libraries such as scikit-learn and SciPy
import numpy as np

# import keras, an open-source library that provides a Python interface for artificial neural networks
import keras

# this module provides a few toy datasets (already vectorized, in NumPy format) for debugging a model or writing simple examples
from keras.datasets import fashion_mnist

In [6]:
# print the imported versions
print("SageMaker " + sagemaker.__version__)
print("Boto3 " + boto3.__version__)
print("keras " + keras.__version__)

SageMaker 2.116.0
Boto3 1.26.8
keras 2.10.0


# AWS Configurations

## Set your region

In [7]:
# session manages interactions with the Amazon SageMaker APIs and any other AWS services needed.
sess = sagemaker.Session(boto3.session.Session(region_name=os.getenv('AWS_DEFAULT_REGION')))

## Set your IAM role

In [8]:
# If you do not have an execution role yet, create one: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html
# To create the role or find its ARN, use the IAM console, for example https://us-east-1.console.aws.amazon.com/iamv2
# Set the IAM role ARN for the execution role
role = os.getenv('EXECUTION_ROLE_ARN')

# Print the role to verify its value
print(role)

arn:aws:iam::913990135345:role/AmazonSageMaker-ExecutionRole


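With a known role name, you can also look up the role ARN programmatically through the IAM `get_role` API instead of reading it from an environment variable. This works whether the notebook runs locally or on SageMaker, provided your credentials are allowed to read the role.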
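A minimal sketch of such a lookup follows, assuming your credentials permit `iam:GetRole`. The helper name `resolve_role_arn` is hypothetical. The IAM client is passed in as an argument so the function can be exercised with a stub and without AWS access.

```python
def resolve_role_arn(role_name, iam_client=None):
    """Return the ARN of the IAM role called `role_name`.

    Hypothetical helper. When no client is supplied, a boto3 IAM client is
    created lazily, which requires boto3 and valid AWS credentials.
    """
    if iam_client is None:
        import boto3  # imported here so the helper itself has no hard dependency
        iam_client = boto3.client("iam")
    # GetRole returns a dict whose 'Role' entry carries the role's 'Arn'.
    return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]


# Usage in the notebook (performs a real IAM call, so it is left commented out):
# role = resolve_role_arn("AmazonSageMaker-ExecutionRole")
```

Note that `RoleName` expects the role's name (for example `AmazonSageMaker-ExecutionRole`), not its full ARN.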

## Configure the S3 storage

In [9]:
# For S3 storage
# These values are read from environment variables; set them to your own credentials
# AWS region name
s3_region = os.getenv('AWS_DEFAULT_REGION')

# S3 endpoint URL
s3_endpoint_url = os.getenv('S3_ENDPOINT_URL')

# S3 access key ID
s3_access_key_id = os.getenv('S3_ACCESS_KEY_ID')

# S3 secret access key
# TODO make OCP secret
# TODO OIDC identity
s3_secret_access_key = os.getenv('S3_SECRET_ACCESS_KEY')

# S3 bucket name
s3_bucket = os.getenv('S3_BUCKET')

# configure the boto3 S3 client
s3 = boto3.client('s3',
                  region_name = s3_region,
                  #endpoint_url = s3_endpoint_url,
                  aws_access_key_id = s3_access_key_id,
                  aws_secret_access_key = s3_secret_access_key)
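The commented-out `endpoint_url` suggests the endpoint is optional. One way to handle that, sketched here under that assumption, is to build the client arguments in a small helper that adds `endpoint_url` only when `S3_ENDPOINT_URL` is set. The helper name `s3_client_kwargs` is hypothetical; the keyword names are the ones `boto3.client` accepts.

```python
import os


def s3_client_kwargs(env=None):
    """Build keyword arguments for boto3.client('s3', ...) from the environment.

    Hypothetical helper: `endpoint_url` is included only when S3_ENDPOINT_URL
    is set to a non-empty value.
    """
    env = os.environ if env is None else env
    kwargs = {
        "region_name": env.get("AWS_DEFAULT_REGION"),
        "aws_access_key_id": env.get("S3_ACCESS_KEY_ID"),
        "aws_secret_access_key": env.get("S3_SECRET_ACCESS_KEY"),
    }
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Usage in the notebook (needs boto3 and valid credentials):
# s3 = boto3.client("s3", **s3_client_kwargs())
```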

## Test your S3 bucket connection

In [10]:
# verify the S3 bucket connection by listing your buckets
s3.list_buckets()

{'ResponseMetadata': {'RequestId': 'Y0VB4H40SWMN7YVC',
  'HostId': 'rLokpljRWxUuaRoXRCYMoWaCvVKpKsO1Ud/6JRuoy2X5aqR7hE44hmYD8rMMuRJj6CSQc3dGTPw=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'rLokpljRWxUuaRoXRCYMoWaCvVKpKsO1Ud/6JRuoy2X5aqR7hE44hmYD8rMMuRJj6CSQc3dGTPw=',
   'x-amz-request-id': 'Y0VB4H40SWMN7YVC',
   'date': 'Fri, 11 Nov 2022 23:21:34 GMT',
   'content-type': 'application/xml',
   'transfer-encoding': 'chunked',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'Buckets': [{'Name': 'ocpml-gx2m4-image-registry-us-east-2-dgtlkgqesfwwtnwjjnhhwvbra',
   'CreationDate': datetime.datetime(2022, 11, 8, 22, 1, 43, tzinfo=tzlocal())},
  {'Name': 'sagemaker-ocpai',
   'CreationDate': datetime.datetime(2022, 11, 9, 0, 2, 29, tzinfo=tzlocal())},
  {'Name': 'sagemaker-us-east-2-913990135345',
   'CreationDate': datetime.datetime(2022, 11, 9, 20, 38, 1, tzinfo=tzlocal())}],
 'Owner': {'ID': '4594cadc576d5fc744e25ba027c3674a5af3f95c887743c5e8fae4be0f8c53fd'}}
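Rather than reading the response by eye, the check can be sketched as a small function that confirms the configured bucket appears in the `Buckets` list. The helper names are hypothetical, and the sample response below is a trimmed copy of the output above, used only so the sketch runs without AWS access.

```python
def bucket_names(list_buckets_response):
    """Extract bucket names from an S3 list_buckets() response dict."""
    return [bucket["Name"] for bucket in list_buckets_response.get("Buckets", [])]


def has_bucket(list_buckets_response, name):
    """Return True if `name` is among the buckets in the response."""
    return name in bucket_names(list_buckets_response)


# Trimmed sample shaped like the response shown above.
sample_response = {
    "ResponseMetadata": {"HTTPStatusCode": 200},
    "Buckets": [
        {"Name": "sagemaker-ocpai"},
        {"Name": "sagemaker-us-east-2-913990135345"},
    ],
}

print(has_bucket(sample_response, "sagemaker-ocpai"))

# Usage in the notebook:
# assert has_bucket(s3.list_buckets(), s3_bucket), "bucket not found: " + str(s3_bucket)
```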

# Configure your Datasets

In [11]:
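# x_train is the data and y_train is the label
# x_val is the data and y_val is the label
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

# save the datasets locally as .npz files before uploading
# (the array keys 'image' and 'label' are an assumption; they must match what mnist_keras_tf.py loads)
os.makedirs('data', exist_ok=True)
np.savez('data/training', image=x_train, label=y_train)
np.savez('data/validation', image=x_val, label=y_val)

# upload with the S3 client configured above; the files are stored under the 'data/' prefix in your bucket
training = 'data/training.npz'
validation = 'data/validation.npz'
s3.upload_file(training, s3_bucket, training)
s3.upload_file(validation, s3_bucket, validation)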

### Upload data to your S3

In [12]:
!echo $S3_BUCKET

sagemaker-ocpai


In [13]:
# set the training path for the data (s3_bucket is read from S3_BUCKET above)
training_input_path = f's3://{s3_bucket}/data/training.npz'

# set the validation path for the data
validation_input_path = f's3://{s3_bucket}/data/validation.npz'

# set the location to store the trained model
output_path = f's3://{s3_bucket}/models/'

# Define the model

In [14]:
# list of parameters https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html
# list of instance types https://aws.amazon.com/sagemaker/pricing/

tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py', # Path (absolute or relative) to the local Python source file executed as the training entry point
                          role=role,                       # IAM role ARN that SageMaker assumes to run the training job
                          instance_count=1,                # Number of EC2 instances to use
                          instance_type='ml.m5.2xlarge',   # Type of EC2 instance to use, for example, 'ml.c4.xlarge'
                          framework_version='1.15',        # TensorFlow version used to run the training code
                          py_version='py3',                # Python version used to run the training code
                          hyperparameters={'epochs': 1},   # Hyperparameters used during training
                          model_dir=output_path            # S3 location where checkpoint data and models can be exported during training
                          )

# Optional parameters, not used here:
#   image_uri='123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0'  # custom training image
#   distribution=...                                                     # how to run distributed training for data or model parallelism
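The entry-point script `mnist_keras_tf.py` is not shown in this notebook. As a rough sketch of how such a script might receive its inputs: SageMaker passes hyperparameters and `model_dir` as command-line arguments and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The function name `parse_training_args` and the example paths are hypothetical, and this is not the author's script.

```python
import argparse
import os


def parse_training_args(argv=None, env=None):
    """Parse the arguments a SageMaker script-mode entry point typically receives.

    Hypothetical helper: channel directories default to the SM_CHANNEL_*
    environment variables set inside the training container.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags, e.g. --epochs 1
    parser.add_argument("--epochs", type=int, default=1)
    # The TensorFlow estimator passes its model_dir value as --model_dir
    parser.add_argument("--model_dir", type=str, default=None)
    # One directory per channel named in fit(): 'training' and 'validation'
    parser.add_argument("--training", type=str, default=env.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--validation", type=str, default=env.get("SM_CHANNEL_VALIDATION"))
    # Ignore any flags this sketch does not know about
    args, _unknown = parser.parse_known_args(argv)
    return args


# Illustrative values only; inside the container these come from SageMaker.
args = parse_training_args(
    argv=["--epochs", "1", "--model_dir", "s3://sagemaker-ocpai/models/"],
    env={"SM_CHANNEL_TRAINING": "/tmp/data/training",
         "SM_CHANNEL_VALIDATION": "/tmp/data/validation"},
)
print(args.epochs, args.training, args.validation)
```

The channel names match the keys of the dictionary passed to `fit()`, so renaming a key there changes the environment variable the script must read.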

# Train the model

In [15]:
# Train! This starts a SageMaker training job on the instance type configured above,
# using the SageMaker TensorFlow container.
# Docker and docker-compose are only needed for local mode (instance_type='local').

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

2022-11-11 23:21:43 Starting - Starting the training job...
2022-11-11 23:22:06 Starting - Preparing the instances for training
ProfilerReport-1668208903: InProgress
......
2022-11-11 23:23:23,260 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2022-11-11 23:23:23,269 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-11 23:23:23,491 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-11 23:23:23,507 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-11 23:23:23,522 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-11 23:23:23,533 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/train