# Deploy a Hugging Face Model to Serverless SageMaker

- Load a model from the Hugging Face Hub and store it on S3.
- Register the Hugging Face model as a SageMaker model.
- Expose your Hugging Face model to the outside world.
- Make predictions.


### Links

- https://huggingface.co/docs/sagemaker/inference#deploy-a-model-from-the-hub
- https://www.youtube.com/watch?v=l9QZuazbzWM

## Prerequisites

- Fork this repo and launch a SageMaker Studio notebook.
- Clone your fork into the notebook environment.
- Set the kernel to a PyTorch kernel.

In [2]:
!pip install sagemaker transformers --upgrade --quiet



In [3]:
import torch
import sagemaker

from transformers import pipeline
from sagemaker.huggingface.model import HuggingFaceModel

## 1. Load a HuggingFace Model and Store on S3

In [4]:
pretrained_classifier = pipeline("sentiment-analysis")
pretrained_classifier.save_pretrained("./model/")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


We save the pretrained model locally to mimic a situation where
<br>
we fine-tuned a pretrained model and then saved it to S3.

In [5]:
! cd model/ && tar zcvf model.tar.gz *

config.json
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt


In [6]:
sess = sagemaker.Session()
default_bucket = sess.default_bucket()
model_path = f"s3://{default_bucket}/sagemaker-studio/huggingface-on-serverless-sagemaker/distilbert-base-uncased-finetuned-sst-2-english/model.tar.gz"

In [7]:
!aws s3 cp model/model.tar.gz $model_path

upload: model/model.tar.gz to s3://sagemaker-eu-west-1-077590795309/sagemaker-studio/huggingface-on-serverless-sagemaker/distilbert-base-uncased-finetuned-sst-2-english/model.tar.gz
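
The packaging step above can also be done from Python. The following is a minimal sketch, not the notebook's own method: `package_model` is a hypothetical helper name, and it assumes the model files sit directly in the model directory so that they end up at the root of the archive, as in the `tar` listing above.

```python
import tarfile
from pathlib import Path


def package_model(model_dir: str, archive_name: str = "model.tar.gz") -> Path:
    """Pack every entry of model_dir into a gzipped tarball, placed at the archive root."""
    model_path = Path(model_dir)
    archive_path = model_path / archive_name
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(model_path.iterdir()):
            if item.name == archive_name:
                continue  # never pack the archive into itself
            # arcname drops the directory prefix so files land at the archive root
            tar.add(item, arcname=item.name)
    return archive_path
```

Calling `package_model("./model")` would produce `model/model.tar.gz`, which can then be uploaded with the `aws s3 cp` command shown above.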


## 2. Register HuggingFace Model as Sagemaker Model

In [8]:
role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data=model_path,                                                
    role=role,                              
    transformers_version="4.12",                            
    pytorch_version="1.9",                                  
    py_version='py38', 
)

In [9]:
# Create a SageMaker model.
# https://github.com/aws/sagemaker-python-sdk/blob/d635faff4ac54f80465f7bc7f3181f67336e249a/src/sagemaker/model.py#L261
# This calls a private SDK method. It may not be the best way to create a SageMaker model, but I did not find a better one.

huggingface_model._create_sagemaker_model(instance_type="ml.m5.xlarge", accelerator_type=None, tags=None)
sagemaker_model_name = huggingface_model.name
sagemaker_model_name

'huggingface-pytorch-inference-2022-04-26-11-26-46-198'

## 3. Expose your HuggingFace Model to the Outside World

If you want to expose your model to the outside world,
<br>
you need to connect it to an API Gateway. We do this via a Lambda function.
<br>
<br>
__API Gateway <-> Lambda <-> SageMaker Endpoint__

### 3.1 Lambda to Forward the Request

In [10]:
%%writefile lambda_handler.py
import json
import boto3
import os

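# Create the client once per container so that warm invocations reuse it.
runtime_client = boto3.client("runtime.sagemaker")
sagemaker_endpoint_name = os.environ["SAGEMAKER_ENDPOINT_NAME"]

def handler(event, context):
    # With the Lambda proxy integration, API Gateway passes the raw request body as a string.
    print(f"making a prediction on the text: {event['body']}")

    response = runtime_client.invoke_endpoint(
        Body=event["body"],
        EndpointName=sagemaker_endpoint_name,
        Accept="application/json",
        ContentType="application/json",
    )

    # The endpoint returns a byte stream; decode it so the response body is a plain JSON string.
    prediction = response["Body"].read().decode("utf-8")
    print(f"prediction: {prediction}")

    return {
        'statusCode': 200,
        'body': prediction
    }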



Overwriting lambda_handler.py
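
Before deploying, the forwarding logic can be smoke-tested locally without AWS credentials. The sketch below mirrors the handler above but takes the client as an argument; `make_handler` and `FakeRuntimeClient` are hypothetical names introduced for illustration, and the canned prediction is made-up test data, not real model output.

```python
import io
import json


def make_handler(runtime_client, endpoint_name):
    """Build a Lambda handler that forwards the request body to a SageMaker endpoint."""

    def handler(event, context):
        response = runtime_client.invoke_endpoint(
            Body=event["body"],
            EndpointName=endpoint_name,
            Accept="application/json",
            ContentType="application/json",
        )
        # The response body is a byte stream; decode it for API Gateway.
        body = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": body}

    return handler


class FakeRuntimeClient:
    """Hypothetical stand-in for the boto3 SageMaker runtime client."""

    def invoke_endpoint(self, **kwargs):
        self.last_call = kwargs  # remember the arguments for inspection
        canned = json.dumps([{"label": "POSITIVE", "score": 0.99}])
        return {"Body": io.BytesIO(canned.encode("utf-8"))}


fake_client = FakeRuntimeClient()
handler = make_handler(fake_client, "my-endpoint")
result = handler({"body": '{"inputs": "great"}'}, None)
print(result)
```

Injecting the client keeps the module importable on a machine that has neither `boto3` configured nor the `SAGEMAKER_ENDPOINT_NAME` variable set.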


### 3.2 Serverless.yml

Define a `serverless.yml` file that describes the stack:

- API Gateway + Lambda
- Sagemaker Endpoint

In [11]:
# Cell magic that writes the cell to a file after filling {name} placeholders from the notebook's globals.
# https://github.com/ipython/ipython/issues/6701#issuecomment-382640776
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
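
The magic relies on plain `str.format`, so it helps to see how placeholders behave in isolation. This is a small illustration with made-up values (`example-model-name` and the `Literal` key are placeholders, not part of the stack): a `{name}` is substituted, a doubled brace survives as a literal brace, and an unknown name raises `KeyError`.

```python
template = """\
ProductionVariants:
  - ModelName: {sagemaker_model_name}
    Literal: {{not_a_placeholder}}
"""

# In the notebook the values come from globals(); a dict stands in for them here.
values = {"sagemaker_model_name": "example-model-name"}
rendered = template.format(**values)
print(rendered)

# A placeholder with no matching variable fails loudly instead of writing a broken file.
try:
    "{undefined_name}".format(**values)
except KeyError as err:
    print(f"missing variable: {err}")
```

One consequence is that any literal curly braces in a templated file have to be doubled, and every placeholder must match a variable that exists in the notebook at the time the cell runs.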

In [18]:
%%writetemplate serverless.yml
service: huggingface-on-serverless-sagemaker

provider:
  name: aws
  region: eu-west-1 
  runtime: python3.8
  iam:
    role:
      managedPolicies:
        - arn:aws:iam::aws:policy/AdministratorAccess


functions:
  huggingface:
    handler: lambda_handler.handler
    timeout: 120
    memorySize: 128 
    events:
      - http:
          path: prediction
          method: post
    environment:
      SAGEMAKER_ENDPOINT_NAME: !GetAtt SageMakerEndpoint.EndpointName

resources:
  Resources:
    SageMakerEndpointConfig:
      Type: AWS::SageMaker::EndpointConfig
      Properties:
        ProductionVariants:
          - ModelName: {sagemaker_model_name}
#             InitialInstanceCount: 1
            InitialVariantWeight: 1.0
            VariantName: SageMakerModel
            ServerlessConfig:
              MaxConcurrency: 200
              MemorySizeInMB: 1024

    SageMakerEndpoint:
      Type: AWS::SageMaker::Endpoint
      Properties:
        EndpointConfigName: !GetAtt SageMakerEndpointConfig.EndpointConfigName
        EndpointName: huggingface-serverless-sagemaker-endpoint


### 3.3 Deploy The Stack

- First, commit and push the generated `serverless.yml`, which contains the name of the SageMaker model you created.

- Open an [AWS CloudShell](https://docs.aws.amazon.com/cloudshell/latest/userguide/getting-started.html#launch-region-shell) session and run:

```
git clone <URL of your fork>

cd huggingface-on-serverless-sagemaker/

npm install serverless

/home/cloudshell-user/node_modules/serverless/bin/serverless.js deploy
```

## 4. Make a Prediction

Replace the endpoint URL in the cell below with the one from your own deployment (the deploy command prints it).

It should look like:

     https://8xkeg6xv2b.execute-api.eu-west-1.amazonaws.com/dev/prediction

In [13]:
%%time
!curl -d '{"inputs":"some very much wow positive text!"}' -H "Content-Type: application/json" -X POST  https://8xkeg6xv2b.execute-api.eu-west-1.amazonaws.com/dev/prediction

[{"label":"POSITIVE","score":0.9998674392700195}]CPU times: user 181 ms, sys: 44.4 ms, total: 226 ms
Wall time: 10.6 s
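
The same request can be sent from Python instead of `curl`. This is a minimal sketch using only the standard library; `build_prediction_request` and `predict` are hypothetical helper names, and the URL you pass in must be the endpoint of your own deployment.

```python
import json
import urllib.request


def build_prediction_request(endpoint_url: str, text: str) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(endpoint_url: str, text: str, timeout: float = 60.0):
    """Send the request and parse the JSON list of label/score dicts."""
    request = build_prediction_request(endpoint_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `predict(your_endpoint_url, "some very much wow positive text!")` should return the parsed prediction. The generous default timeout is a guess meant to leave room for a cold start, like the roughly 10 second wall time seen above.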


## 5. Burst of 10 000 (Serverless Sagemaker max concurrency = 200)

- The default maximum concurrency of a Lambda function is 1000.
- The maximum concurrency of a serverless SageMaker endpoint is 200.
- We update the `serverless.yml` to set the endpoint's maximum concurrency (here together with a larger memory size):

        resources:
          Resources:
            SageMakerEndpointConfig:
              Type: AWS::SageMaker::EndpointConfig
              Properties:
                ProductionVariants:
                  - ModelName: {sagemaker_model_name}
                    InitialVariantWeight: 1.0
                    VariantName: SageMakerModel
                    ServerlessConfig:
                      MaxConcurrency: 200 # <--- set the max concurrency here
                      MemorySizeInMB: 4096

In [17]:
%%time
from joblib import Parallel, delayed
import shlex
import subprocess

def process(i):
    print(".", end="", flush=True)  # progress marker, one dot per request
    cmd = """curl -d '{"inputs":"some very much wow positive text!"}' -H "Content-Type: application/json" -X POST  https://8xkeg6xv2b.execute-api.eu-west-1.amazonaws.com/dev/prediction"""
    subprocess.check_call(shlex.split(cmd))
    
results = Parallel(n_jobs=100)(delayed(process)(i) for i in range(10000))

KeyboardInterrupt: 

Results:

- Minimum latency is 20 ms.
- Maximum latency is 7100 ms.
- Average latency is 6500 ms when the burst starts.
- Average latency is 80 ms once the endpoint has caught up with the burst.
- 100% of the requests were successful.

![assets/serverless-sagemaker-max-concurrency-1000.png](assets/serverless-sagemaker-max-concurrency-1000.png)
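
Latency figures like the ones above can also be collected on the client side. The sketch below is one possible way to do that and not necessarily how the numbers in this notebook were obtained; `timed_call` and `run_burst` are hypothetical helpers, and a short `time.sleep` stands in for the real HTTP call so the example runs offline.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request) -> float:
    """Run one request and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    send_request()
    return (time.perf_counter() - start) * 1000.0


def run_burst(send_request, n_requests: int, n_workers: int) -> dict:
    """Fire n_requests calls from n_workers threads and summarize the latencies."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(send_request), range(n_requests))
        )
    return {
        "count": len(latencies),
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "mean_ms": statistics.mean(latencies),
    }


# Stand-in for a real HTTP call; replace it with a function that posts to your endpoint.
summary = run_burst(lambda: time.sleep(0.01), n_requests=20, n_workers=5)
print(summary)
```

Threads are sufficient here because each worker spends its time waiting on the network, and measuring inside the worker keeps queueing time in the pool out of the reported latency.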

## 6. Remove the Stack

```
/home/cloudshell-user/node_modules/serverless/bin/serverless.js remove
```