<h3 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049; text-align: center; border-radius: 5px 5px; padding: 5px"> Bring your BERT Model to AWS SageMaker with Script Mode </h3>

<img src = "img/BringYourOwnModel.png">

In my [previous post](https://medium.com/analytics-vidhya/aws-sagemaker-train-deploy-and-update-a-hugging-face-bert-model-eeefc8211368) I discussed how to train a Hugging Face BERT model using on-demand and spot instances, deploy the fine-tuned model on a SageMaker real-time endpoint, and update that endpoint. SageMaker makes this work easier because it supports managed training and inference for a variety of [ML frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/frameworks.html) such as XGBoost, TensorFlow, PyTorch, and Hugging Face.

Now let's look at the use cases for SageMaker [script mode](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-script-mode/index.html):

1. When we have trained a model outside of SageMaker (for example in a local Jupyter notebook, Google Colab, an AWS EC2 instance, or a SageMaker notebook instance), we can bring the fine-tuned model together with a custom inference script and its dependent libraries (specified in a requirements.txt file) and deploy it on a SageMaker endpoint for inference.

2. When we want to train a model with a custom algorithm that is not covered by the [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), or with a custom training script, we can make use of script mode.

This notebook demonstrates how to host a fine-tuned BERT model with a custom inference script on a SageMaker real-time endpoint.

Please refer to the accompanying [Medium article](https://medium.com/@vinayakshanawad/bring-your-own-model-with-amazon-sagemaker-script-mode-6cf374747f9e).

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Development Environment and Permissions </h2>

NOTE: You can run this demo in SageMaker Studio, on your local machine, or on a SageMaker Notebook Instance.

If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).
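On a local machine, `get_execution_role()` may not be able to resolve a role, so one common workaround is to look the role up by name through the IAM API. The sketch below assumes a role with SageMaker permissions already exists in your account; the helper name `resolve_role_arn` and the role name `sagemaker_execution_role` are hypothetical and not part of the original notebook.

```python
def resolve_role_arn(iam_client, role_name):
    """Return the ARN of an existing IAM role, looked up by its name.

    `iam_client` is expected to behave like boto3.client("iam").
    """
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Usage on a local machine (needs AWS credentials and boto3 installed);
# the role name is an assumption, replace it with your own:
#   import boto3
#   role = resolve_role_arn(boto3.client("iam"), "sagemaker_execution_role")
```

The resulting ARN can then be used wherever the notebook passes `role`.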

In [None]:
from sagemaker import get_execution_role
import boto3
import sagemaker

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'huggingface-bert-model-deploy'
sm_client = boto3.client("sagemaker")


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Store Model Artifacts </h2>

As discussed in my previous post, I took the dataset from a [Kaggle competition](https://www.kaggle.com/c/nlp-getting-started/overview), which consists of fake and real tweets about disasters. The task is to classify each tweet as describing a real disaster or not.

I trained a HuggingFace BERT model on on-demand instances with the hyperparameter value epoch = 2. We store the fine-tuned [BERT](https://arxiv.org/abs/1810.04805) model artifacts in the `model/` directory and define `model_path` to point to it.

**Note**: I used a BERT model fine-tuned on on-demand instances to show how we can bring our own model with SageMaker script mode, but in a real-life scenario we can bring in models that were trained outside of SageMaker.

In [3]:
import os

model_path = 'model'
os.listdir(os.path.join("/home/ec2-user/SageMaker/", model_path))

['config.json', 'model.tar.gz', '.ipynb_checkpoints', 'pytorch_model.bin']

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Write the Inference Script </h2> 

Since we are bringing our own model to SageMaker, we must create an inference script. The script runs inside the HuggingFace container. It should include a function for loading the model and, optionally, functions for generating predictions and for input/output processing. The HuggingFace container provides default implementations for generating a prediction and for input/output processing; by including these functions in your script you override the defaults. You can find additional [details here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).
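To make the handler functions concrete, here is a minimal sketch of how such a script can be laid out. It is not the author's `inference.py`: the label order in `LABELS`, the JSON-only input handling, and the lazy imports are assumptions made for illustration, while `MAX_LEN = 64` and the `bert-base-uncased` tokenizer mirror the settings visible in the notebook.

```python
import json

MAX_LEN = 64  # same maximum sentence length as in the notebook's script
LABELS = ["Not a disaster", "Real disaster"]  # assumed mapping: class 0, class 1
_tokenizer = None  # loaded lazily and cached across requests


def model_fn(model_dir):
    """Load the fine-tuned model from the unpacked model.tar.gz directory."""
    from transformers import BertForSequenceClassification
    model = BertForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model


def input_fn(request_body, request_content_type):
    """Turn a JSON request body into a list of sentences."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    data = json.loads(request_body)
    return [data] if isinstance(data, str) else list(data)


def predict_fn(sentences, model):
    """Tokenize the sentences and map the highest-scoring class to a label."""
    global _tokenizer
    import torch
    from transformers import BertTokenizer
    if _tokenizer is None:
        _tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
    encoded = _tokenizer(sentences, padding="max_length", truncation=True,
                         max_length=MAX_LEN, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoded).logits
    return [LABELS[i] for i in logits.argmax(dim=1).tolist()]


def output_fn(prediction, accept):
    """Serialize the list of labels as JSON (the only format this sketch emits)."""
    return json.dumps(prediction)
```

Only `model_fn` is required; the other three functions replace the container defaults and can be dropped if the default behaviour is sufficient.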

**Note**:

To install additional libraries at container startup, we can add a requirements.txt file that lists the libraries to be installed with pip. Within the archive, the HuggingFace container expects all inference code and the requirements.txt file to be inside the `code/` directory.

In the next cell we'll see our inference script for the BERT model, which predicts whether a tweet describes a real disaster or not.

You will notice that it uses the [transformers library from Hugging Face](https://huggingface.co/docs/transformers/index), which is installed with a pip command in the inference script; any other additional libraries need to be installed in the same way.

In [4]:
!mkdir -p {model_path}/code

! cp code/inference.py {model_path}/code/inference.py
! cp code/requirements.txt {model_path}/code/requirements.txt

In [5]:
!pygmentize {model_path}/code/inference.py

import json
import logging
import os
import sys
import numpy as np
import torch
from transformers import BertForSequenceClassification, BertTokenizer

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

MAX_LEN = 64  # this is the max length of the sentence

print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


def model_fn(model_dir):
    de

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Package Model </h2> 

For hosting, SageMaker requires the deployment package to be structured in a compatible format. It expects all files to be packaged in a gzip-compressed tar archive named "model.tar.gz". Within the archive, the HuggingFace container expects all inference code files to be inside the `code/` directory. The SageMaker documentation explains the required directory structure in detail.

In [6]:
!tar -czvf {model_path}/model.tar.gz -C {model_path}/ .

./
./config.json
./model.tar.gz
./.ipynb_checkpoints/
./code/
./code/inference.py
./code/requirements.txt
./pytorch_model.bin
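The shell `tar` command writes `model.tar.gz` into the directory it is archiving, so, as the listing shows, a previous archive and `.ipynb_checkpoints/` end up inside the new one. A minimal Python sketch of the same packaging step that leaves those entries out could look like this; the helper name `package_model` and its `skip` parameter are illustrative and not part of the original notebook.

```python
import os
import tarfile


def package_model(model_dir, archive_name="model.tar.gz", skip=(".ipynb_checkpoints",)):
    """Create <model_dir>/<archive_name> from the contents of model_dir.

    The archive itself and every top-level entry listed in `skip` are left
    out, so repeated runs do not nest an old archive inside the new one.
    """
    archive_path = os.path.join(model_dir, archive_name)
    excluded = set(skip) | {archive_name}
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(os.listdir(model_dir)):
            if entry in excluded:
                continue
            # arcname keeps paths relative, e.g. "code/inference.py"
            tar.add(os.path.join(model_dir, entry), arcname=entry)
    return archive_path


# Usage with the notebook's directory layout:
#   package_model("model")
```

Directories such as `code/` are added recursively, so the inference script and requirements.txt keep their place under `code/` inside the archive.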


<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Upload HuggingFace model to S3 </h2> 

In [None]:
from sagemaker.s3 import S3Uploader

model_data = S3Uploader.upload('model/model.tar.gz', 's3://{0}/{1}/models'.format(bucket,prefix))
model_data

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Create SageMaker Real-time Endpoint </h2> 

After we upload the BERT model to S3, we can deploy our endpoint. To create a real-time endpoint with boto3 you need to create a "SageMaker Model", a "SageMaker Endpoint Configuration", and a "SageMaker Endpoint". The "SageMaker Model" contains our model configuration, including the S3 path to which we uploaded the HuggingFace model. The "SageMaker Endpoint Configuration" contains the configuration for the endpoint, such as the instance type and instance count. The "SageMaker Endpoint" is the actual endpoint.
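The container image used in the next cell is pinned to `us-east-2`, while `region` comes from the current session, and SageMaker generally expects the image to live in the same region as the endpoint. A small sketch of deriving the URI from the region is shown below. It assumes the same registry account and image tag are available in your region, which does not hold for every region; the helper name is hypothetical. The SageMaker SDK also offers `sagemaker.image_uris.retrieve` for looking up framework images.

```python
def huggingface_inference_image_uri(
    region,
    account="763104351884",  # registry account taken from the notebook's URI (assumed valid in your region)
    tag="1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04",
):
    """Build the ECR URI of the HuggingFace PyTorch inference image for a region."""
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-inference:{tag}"
    )


# Usage with the notebook's `region` variable:
#   image_uri = huggingface_inference_image_uri(region)
```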

In [None]:
# create SageMaker Model
image_uri = "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04"
deployment_name = "huggingface-bert-model"

primary_container = {
    'Image': image_uri,
    'ModelDataUrl': model_data,
    'Environment': {
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_REGION': region,
        'SAGEMAKER_SUBMIT_DIRECTORY': model_data
    }
}

create_model_response = sm_client.create_model(ModelName = f"{deployment_name}-v1",
                                              ExecutionRoleArn = get_execution_role(),
                                              PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

# create SageMaker Endpoint configuration
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = f"{deployment_name}-epc",
    ProductionVariants=[
        {
        'InstanceType':'ml.m5.4xlarge',
        'InitialInstanceCount':1,
        'ModelName': f"{deployment_name}-v1",
        'VariantName':'AllTraffic',
        'InitialVariantWeight':1
        }
    ])

print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

# create SageMaker Endpoint
endpoint_params = {
    'EndpointName': f"{deployment_name}-ep", 'EndpointConfigName': f"{deployment_name}-epc"}
endpoint_response = sm_client.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))
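`create_endpoint` returns as soon as the request is accepted, while provisioning continues in the background. Before sending requests it can help to block until the endpoint is in service. The sketch below uses the boto3 `endpoint_in_service` waiter; the helper name `wait_until_in_service` is an assumption for illustration.

```python
def wait_until_in_service(sm_client, endpoint_name):
    """Block until the endpoint is InService, then return its status.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    The waiter raises an error if the endpoint creation fails.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"]


# Usage with the names defined in the cell above:
#   status = wait_until_in_service(sm_client, f"{deployment_name}-ep")
#   print(status)
```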

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Get Predictions </h2> 

Now that our API endpoint is deployed, we can send it text to get predictions from our BERT model. You can use the SageMaker SDK or the SageMaker Runtime API to invoke the endpoint.

In [10]:
from sagemaker.huggingface.model import HuggingFacePredictor

predictor = HuggingFacePredictor(f"{deployment_name}-ep")

predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

test_sentences = ["I met my friend today by accident",
                  "Frank had a severe head injury after the car accident last month", 
                  "Just happened a terrible car crash"
                  ]

result = predictor.predict(test_sentences)
print("result:", result)

result: ['Not a disaster', 'Real disaster', 'Real disaster']
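The same request can also be sent through the SageMaker Runtime API mentioned above, which is useful for callers that do not have the SageMaker SDK installed. This is a minimal sketch assuming the endpoint accepts and returns JSON, as it does with the serializers configured in the cell above; the helper name `invoke_json` is hypothetical.

```python
import json


def invoke_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to an endpoint and decode the JSON response.

    `runtime_client` is expected to behave like
    boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())


# Usage with the endpoint created earlier:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   print(invoke_json(runtime, f"{deployment_name}-ep", test_sentences))
```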


<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Update a SageMaker Real-time Endpoint </h2>

As we know, machine learning is a highly iterative process. During the course of a single project, data scientists and ML engineers routinely train thousands of different models in search of maximum accuracy and other metrics. Indeed, the number of possible combinations of algorithms, data sets, and training parameters (aka hyperparameters) is practically unlimited.

For example, I trained a BERT model with an updated hyperparameter value (epoch = 3) to see whether model performance improves.

If model performance improves, we can update the existing SageMaker endpoint with the new model.

In [11]:
# replace the existing model artifacts (model trained with epoch = 2) with updated artifacts (model trained with epoch = 3) and package it
!tar -czvf {model_path}/model.tar.gz -C {model_path}/ .

./
./config.json
./model.tar.gz
./.ipynb_checkpoints/
./code/
./code/inference.py
./code/requirements.txt
./pytorch_model.bin


In [None]:
# upload updated model artifacts to S3
model_data = S3Uploader.upload('model/model.tar.gz', 's3://{0}/{1}/models'.format(bucket,prefix))
model_data

In [13]:
# Create SageMaker model
deployment_name = "huggingface-bert-model-v2"

primary_container = {
    'Image': image_uri,
    'ModelDataUrl': model_data,
    'Environment': {
        'SAGEMAKER_PROGRAM': 'inference.py',
        'SAGEMAKER_REGION': region,
        'SAGEMAKER_SUBMIT_DIRECTORY': model_data
    }
}

create_model_response = sm_client.create_model(
   ModelName = deployment_name, ExecutionRoleArn = role, PrimaryContainer = primary_container
)

# update SageMaker Endpoint
predictor.update_endpoint(initial_instance_count=1, instance_type="ml.m5.4xlarge", model_name=deployment_name)

-----!
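The update can also be done with boto3 alone, following the same pattern used to create the endpoint: register a new endpoint configuration that points at the new model, then switch the endpoint to it. The sketch below is an alternative to `predictor.update_endpoint`, not what the cell above runs; the helper name and the configuration name in the usage comment are assumptions.

```python
def update_endpoint_to_model(sm_client, endpoint_name, model_name, config_name,
                             instance_type="ml.m5.4xlarge", instance_count=1):
    """Point an existing endpoint at a new model via a new endpoint config.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "ModelName": model_name,
                "VariantName": "AllTraffic",
                "InitialVariantWeight": 1,
            }
        ],
    )
    # The endpoint keeps its name; only the configuration behind it changes.
    return sm_client.update_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=config_name
    )


# Usage (the config name "huggingface-bert-model-v2-epc" is hypothetical):
#   update_endpoint_to_model(sm_client, "huggingface-bert-model-ep",
#                            "huggingface-bert-model-v2",
#                            "huggingface-bert-model-v2-epc")
```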

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Delete the Real-time Endpoint </h2> 

In [14]:
# delete endpoint
predictor.delete_endpoint()
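Deleting the endpoint does not remove the SageMaker Model resources created earlier with `create_model`, and an endpoint configuration that is no longer attached to the endpoint may also be left behind. A minimal cleanup sketch, assuming the resources still exist (deleting a missing resource raises an error); the helper name `delete_leftovers` is hypothetical.

```python
def delete_leftovers(sm_client, model_names, endpoint_config_names=()):
    """Delete SageMaker models and endpoint configs by name.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    Returns the names that were deleted, in order.
    """
    deleted = []
    for name in model_names:
        sm_client.delete_model(ModelName=name)
        deleted.append(name)
    for name in endpoint_config_names:
        sm_client.delete_endpoint_config(EndpointConfigName=name)
        deleted.append(name)
    return deleted


# Usage with the model names created in this notebook:
#   delete_leftovers(sm_client,
#                    ["huggingface-bert-model-v1", "huggingface-bert-model-v2"])
```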

<h2 style = "font-size:35px; font-family:Garamond ; font-weight : normal; background-color: #007580; color :#fed049   ; text-align: center; border-radius: 5px 5px; padding: 5px"> Conclusion </h2> 

In this post, we discussed the use cases for script mode and how script mode can accelerate the model deployment process. We successfully deployed our own fine-tuned Hugging Face Transformer model to Amazon SageMaker for inference using a real-time endpoint. Real-time endpoints are a great option for inference workloads, especially when we have real-time, interactive, low-latency requirements. Real-time endpoints are fully managed and support autoscaling.

Thanks for reading! If you have any questions, feel free to contact me.