# AWS Machine Learning Purpose-built Accelerators Tutorial
## Learn how to use [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) and [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) with [Amazon SageMaker](https://aws.amazon.com/sagemaker/), to optimize your ML workload
## Part 3/3 - Compiling and deploying a BERT model to AWS Inferentia1 with SageMaker + [Hugging Face Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index)

**SageMaker Studio kernel: PyTorch 1.13 Python 3.9 CPU - ml.t3.medium**

In this tutorial, you'll learn how to compile a model for AWS Inferentia and then deploy it to a SageMaker real-time endpoint powered by AWS Inferentia1. First, you'll kick off a SageMaker job to compile the model; this only needs to be done once. After that, you can deploy the model to a SageMaker endpoint and get some predictions.

In section 2, you extract some metadata from the Optimum Neuron API and render a table with the currently tested/supported models (similar models that are not listed may also be compatible, but you need to verify that yourself). Use this table to see which models can be selected for deployment. If you also need to fine-tune your model, check the similar table in the **Part 2** notebook to see which models can be fine-tuned with AWS Trainium using HF Optimum Neuron. That way you can plan your end-to-end solution before you start implementing it.

## 1) Install some required packages

In [None]:
%pip install -U sagemaker

## 2) Supported models/tasks

Models with **[TP]** after the name support Tensor Parallelism

In [None]:
from IPython.display import Markdown, display

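# Pass the path explicitly as a filename so the file content is rendered, not the path string
display(Markdown(filename="../docs/optimum_neuron_models.md"))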

## 3) Prepare the model to deploy to Inferentia 1

In [None]:
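import os
import boto3
import shutil
import sagemaker
from packaging.version import Version

print(sagemaker.__version__)
# Compare versions numerically; comparing the raw strings is lexicographic and unreliable
if Version(sagemaker.__version__) < Version("2.146.0"):
    print("You need to upgrade sagemaker, or restart the kernel if you already upgraded")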

training_job_name=""

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name

# Reuse the training job name saved by the Part 2 notebook, if available
if os.path.isfile("training_job_name.txt"):
    with open("training_job_name.txt", "r") as f:
        training_job_name = f.read().strip()
if len(training_job_name) == 0:
    raise Exception("Please run the Part 2 notebook, or set training_job_name to the name of the training job you ran in that notebook")
checkpoint_s3_uri=f"s3://{bucket}/output/{training_job_name}/output/model.tar.gz"

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")
print(f"Training job name: {training_job_name}")
print(f"Model S3 URI: {checkpoint_s3_uri}")
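
The SDK version check in this cell depends on comparing version numbers rather than raw strings. As a minimal sketch of why that matters, the snippet below uses only the standard library; `parse_version` is a hypothetical helper (not part of the SageMaker SDK) and assumes plain `MAJOR.MINOR.PATCH` strings without pre-release suffixes.

```python
def parse_version(text):
    """Turn a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


minimum = "2.146.0"
older = "2.50.0"  # numerically older than the minimum

# String comparison goes character by character, so "5" > "1" wins here
# and the older release wrongly appears to satisfy the minimum.
print(older >= minimum)

# Comparing integer tuples gives the numeric answer.
print(parse_version(older) >= parse_version(minimum))
```

In the notebook itself, a dedicated version parser such as the one in the `packaging` library is the more robust choice, since it also handles pre-release and development suffixes.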

In [None]:
%%writefile src/requirements.txt
--extra-index-url=https://pip.repos.neuron.amazonaws.com
neuron-cc[tensorflow]==1.22.0
optimum[neuron]==1.20.0
optimum-neuron==0.0.23

### 3.1) Model compilation file

In [None]:
%%writefile src/compile_inf1.py
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

import os
os.environ['NEURON_RT_NUM_CORES'] = '1'
import sys
import json
import torch
import shutil
import tarfile
import logging
import argparse
import subprocess
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSequenceClassification

## Helper functions that will be used for Inference after deploying the endpoint
## Model and tokenizer loader
def model_fn(model_dir, context=None):
    tokenizer = AutoTokenizer.from_pretrained(os.environ.get("MODEL_ID", "bert-base-uncased"))
    model = NeuronModelForSequenceClassification.from_pretrained(model_dir)
    return model,tokenizer

def input_fn(input_data, content_type, context=None):
    if content_type == 'application/json':
        req = json.loads(input_data)
        prompt = req.get('prompt')
        if prompt is None or len(prompt) < 3:
            raise ValueError("Invalid prompt. Provide an input like: {'prompt': 'text text text'}")
        return prompt
    else:
        raise Exception(f"Unsupported mime type: {content_type}. Supported: application/json")

def predict_fn(input_object, model_tokenizer, context=None):
    try:
        model,tokenizer = model_tokenizer
        inputs = tokenizer(input_object, truncation=True, return_tensors="pt")
        logits = model(**inputs).logits
        idx = logits.argmax(1, keepdim=True)
        conf = torch.gather(logits, 1, idx)
        return torch.cat([idx,conf], 1)
    except Exception as e:
        print(e)
        return None

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.    
    parser.add_argument("--task", type=str, default="")
    parser.add_argument("--dynamic_batch_size", type=bool, default=False, action=argparse.BooleanOptionalAction)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--sequence_length", type=int, default=1)
    parser.add_argument("--is_model_compressed", type=bool, default=False, action=argparse.BooleanOptionalAction)
    
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])    
    parser.add_argument("--checkpoint_dir", type=str, default=os.environ["SM_CHANNEL_CHECKPOINT"])
    
    args, _ = parser.parse_known_args()
    # Set up logging        
    logging.basicConfig(
        level=logging.getLevelName("DEBUG"),
        handlers=[logging.StreamHandler(sys.stdout)],
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    logger = logging.getLogger(__name__)
    logger.info(args)
    logger.info(f"Checkpoint files: {os.listdir(args.checkpoint_dir)}")
    
    model_path = args.checkpoint_dir
    if args.is_model_compressed:
        logger.info("Decompressing model file...")
        with tarfile.open(os.path.join(args.checkpoint_dir, "model.tar.gz"), 'r:gz') as tar:
            tar.extractall(os.path.join(args.checkpoint_dir, "model"))
        model_path = os.path.join(args.checkpoint_dir, "model")
        logger.info(f"Done! Model path: {model_path}")
        logger.info(f"Model path files: {os.listdir(model_path)}")

    cmd  = "optimum-cli export neuron --disable-validation "
    cmd += f"--model {model_path} "
    cmd += f"--task {args.task} "
    cmd += f"--sequence_length {args.sequence_length} "
    cmd += f"--batch_size {args.batch_size} "
    if args.dynamic_batch_size: cmd += "--dynamic-batch-size "
    cmd += args.model_dir
    logger.info(f"Final command: {cmd}")
    subprocess.check_call(cmd.split(' '))

    code_path = os.path.join(args.model_dir, 'code')
    os.makedirs(code_path, exist_ok=True)

    shutil.copy(__file__, os.path.join(code_path, "inference.py"))
    shutil.copy('requirements.txt', os.path.join(code_path, 'requirements.txt'))
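
The compilation script assembles the `optimum-cli export neuron` call as one string and splits it on spaces. A minimal sketch of the same command built as an argument list is shown below, which avoids surprises when a path contains whitespace. `build_export_command` is a hypothetical helper, and the two `/opt/ml/...` paths are assumptions standing in for `SM_CHANNEL_CHECKPOINT` and `SM_MODEL_DIR`; the flags are the ones already used in `compile_inf1.py`.

```python
import shlex


def build_export_command(model_path, task, sequence_length, batch_size,
                         dynamic_batch_size, output_dir):
    """Assemble the export command as a list of arguments."""
    cmd = [
        "optimum-cli", "export", "neuron", "--disable-validation",
        "--model", model_path,
        "--task", task,
        "--sequence_length", str(sequence_length),
        "--batch_size", str(batch_size),
    ]
    if dynamic_batch_size:
        cmd.append("--dynamic-batch-size")
    cmd.append(output_dir)  # the output directory is the last positional argument
    return cmd


cmd = build_export_command(
    model_path="/opt/ml/input/data/checkpoint/model",  # assumed checkpoint location
    task="text-classification",
    sequence_length=512,
    batch_size=1,
    dynamic_batch_size=True,
    output_dir="/opt/ml/model",  # assumed model directory
)
# shlex.join renders the list as a shell-style string for logging
print(shlex.join(cmd))
```

A list built this way can be passed directly to `subprocess.check_call(cmd)`.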

In [None]:
batch_size=1
sequence_length=512
task="text-classification"
model_id="bert-base-uncased"
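
The Neuron compiler produces a graph for fixed input shapes, so the `sequence_length` chosen here is baked into the compiled model. As a rough illustration of what a fixed length implies for inputs, the sketch below truncates or right-pads a list of token ids. `fit_to_length`, the pad id `0`, and the token ids are assumptions made up for this example; they are not taken from the tokenizer or from Optimum Neuron.

```python
def fit_to_length(token_ids, sequence_length, pad_id=0):
    """Truncate or right-pad token ids; return (ids, attention_mask)."""
    ids = list(token_ids[:sequence_length])
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = sequence_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Made-up token ids, shorter than the target length
ids, mask = fit_to_length([11, 22, 33, 44], sequence_length=8)
print(ids)
print(mask)

# Longer inputs are cut to the target length
ids_long, mask_long = fit_to_length(range(20), sequence_length=8)
print(len(ids_long), sum(mask_long))
```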

### 3.2) Compile model

In [None]:
import json
import logging
from sagemaker.utils import name_from_base
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="compile_inf1.py", # compilation script, run as a SageMaker training job
    source_dir="src",
    role=role,
    base_job_name="inf1-compile", # SageMaker appends a timestamp to build the job name
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
    instance_count=1,
    instance_type='ml.c5.2xlarge',
    output_path=f"s3://{bucket}/output",
    disable_profiler=True,
    # Inf1 models can be compiled on any CPU
    # so, let's use a regular CPU PyTorch image on a C5 instance
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker",
    volume_size = 512,
    hyperparameters={     
        "task": task,
        "batch_size": batch_size,
        "sequence_length": sequence_length,
        "dynamic_batch_size": True,
        "is_model_compressed": True
    }
)
estimator.framework_version = '1.13.1' # workaround when using image_uri
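
The hyperparameters above reach `compile_inf1.py` as command-line arguments. The sketch below shows how a parser like the one in the script treats a boolean flag, assuming (this is not verified here) that the container passes each hyperparameter as `--name value`. With `argparse.BooleanOptionalAction`, the presence of the flag enables it and the trailing value is reported as an unknown argument, which `parse_known_args` tolerates.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dynamic_batch_size", default=False,
                    action=argparse.BooleanOptionalAction)
parser.add_argument("--batch_size", type=int, default=1)

# Assumed argument layout: every hyperparameter arrives as "--name value"
argv = ["--dynamic_batch_size", "True", "--batch_size", "4"]
args, unknown = parser.parse_known_args(argv)
print(args.dynamic_batch_size, args.batch_size, unknown)

# The flag alone switches the option on, so a literal "False" does not disable it
args_false, unknown_false = parser.parse_known_args(["--dynamic_batch_size", "False"])
print(args_false.dynamic_batch_size, unknown_false)
```

Under that assumption, the practical way to turn such an option off is to leave it out of `hyperparameters` rather than set it to `False`.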

In [None]:
estimator.fit({"checkpoint": checkpoint_s3_uri})
model_data=estimator.model_data
print(f"Model data: {model_data}")

## 4) Deploy a SageMaker real-time endpoint

In [None]:
import logging
from sagemaker.utils import name_from_base
from sagemaker.pytorch.model import PyTorchModel

# The number of NeuronCores depends on the inf1 instance type you deploy to.
# We'll ask SageMaker to launch 1 worker per core

pytorch_model = PyTorchModel(    
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.18.2-ubuntu20.04",
    model_data=model_data,
    role=role,
    name=name_from_base('bert-spam-classifier'),
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
    model_server_workers=4, # 1 worker per core
    framework_version="1.13.1",
    env = {
        'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600',
        'MODEL_ID': model_id
    },
    # for production, it is important to define vpc_config and use a VPC endpoint
    #vpc_config={
    #    'Subnets': ['<SUBNET1>', '<SUBNET2>'],
    #    'SecurityGroupIds': ['<SECURITYGROUP1>', '<DEFAULTSECURITYGROUP>']
    #}
)
pytorch_model._is_compiled_model = True

In [None]:
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type='ml.inf1.xlarge'
)
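
The `model_server_workers=4` setting matches the four NeuronCores of an `ml.inf1.xlarge`. If you deploy to a different Inf1 size, a small lookup like the sketch below can derive the worker count. `INF1_NEURON_CORES` and `workers_for` are hypothetical names, and the core counts assume 4 NeuronCores per Inferentia1 chip; check the instance documentation before relying on them.

```python
# NeuronCores per Inf1 instance size (assumes 4 NeuronCores per Inferentia1 chip)
INF1_NEURON_CORES = {
    "ml.inf1.xlarge": 4,     # 1 chip
    "ml.inf1.2xlarge": 4,    # 1 chip
    "ml.inf1.6xlarge": 16,   # 4 chips
    "ml.inf1.24xlarge": 64,  # 16 chips
}


def workers_for(instance_type, workers_per_core=1):
    """Return the number of model server workers for an Inf1 instance type."""
    if instance_type not in INF1_NEURON_CORES:
        raise ValueError(f"Unknown Inf1 instance type: {instance_type}")
    return INF1_NEURON_CORES[instance_type] * workers_per_core


print(workers_for("ml.inf1.xlarge"))
```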

## 5) Run a simple test

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

In [None]:
import time

labels={0: "not spam", 1: "spam"}
not_spam=" Deezer.com 10,406,168 Artist DB\n\nWe have scraped the Deezer Artist DB, right now there are 10,406,168 listings according to Deezer.com\n\nPlease note in going through part of the list, it is obvious there are mistakes inside their system.\n\nExamples include and Artist with &amp; in its name might also be found with "and" but the Albums for each have different totals etc. Have no clue if there are duplicate albums etc do this error in their system. Even a comma in a name could mean the Artist shows up more than once, I saw in 1 instance that 1 Artist had 6 different ArtistIDs due to spelling errors.\n\nSo what is this DB, very simple, it gives you the ArtistID and the actual name of the Artist in another column. If you want to see the artist you add the baseurl to the ArtistID\n\nAn example is ArtistID 115 is AC/DC\n\n[https://www.deezer.com/us/artist/115](https://www.deezer.com/us/artist/115)\n\nYou do not have to use [https://www.deezer.com/us/artist/](https://www.deezer.com/us/artist/) if your first language is other than English, just see if Deezer supports your language and use that baseref\n\nFrench for example is [https://www.deezer.com/fr/artist/115](https://www.deezer.com/fr/artist/115)\n\nI am providing the DB in 3 different formats:\n\n \n\nI tried posting download links here but it seems Reddit does not like that so get them here:\n\n[https://pastebin\\[DOT\\]com/V3KJbgif](https://pastebin.com/V3KJbgif)\n\n&amp;#x200B;\n\n**Special thanks go to** [**/user/KoalaBear84**](https://www.reddit.com/user/KoalaBear84) **for writing the scraper.**\n\n&amp;#x200B;\n\n**Cross Posted to related Reddit Groups**"
spam="🚨 ATTENTION ALL USERS! 🚨\n\n🆘 Are you looking for a way to GET RICH QUICK? 🆘\n\n💰 Don't waste your time with boring old jobs! 💰\n\n💸 Join our CRAZY MONEY-MAKING SYSTEM today! 💸\n\n🤑 Just sign up and start earning BIG BUCKS right away! 🤑\n\n👉 Plus, if you refer your friends, you'll get even MORE CASH! 👈\n\n🔥 This is the HOTTEST OFFER of the year! 🔥\n\n👍 Don't wait"
for i,text in enumerate([not_spam, spam]):
    t=time.time()
    pred = predictor.predict({"prompt": text})
    elapsed = (time.time()-t)*1000
    print(f"Elapsed time: {elapsed}")
    print(f"Pred: {i} - {labels[pred[0][0]]} / score: {pred[0][1]}")

Elapsed time: 105.842058181762695
Pred: 0 - not spam / score: 4.6610636711120605
Elapsed time: 110.35146522521973
Pred: 1 - spam / score: 4.273118495941162
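
Note that the `score` printed above is the raw logit of the winning class (`predict_fn` gathers it from `logits`), not a probability. Turning it into a probability needs the logits of all classes, which this endpoint does not return. The sketch below shows the softmax step in plain Python; the logit values are made up for illustration and were not produced by the endpoint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for ["not spam", "spam"]
probs = softmax([4.66, -4.10])
print([round(p, 4) for p in probs])
```

To expose probabilities from the endpoint, `predict_fn` would have to return the full logits (or apply softmax itself) instead of only the top value.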


## 6) Delete endpoint

In [None]:
predictor.delete_model()
predictor.delete_endpoint()