# Deploy Small Language Models Cost-efficiently with Amazon SageMaker and AWS Graviton

As organizations look to incorporate AI capabilities into their applications, Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks. [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html), AWS's fully managed machine learning service, provides a platform for deploying these ML models with multiple inference options, allowing organizations to optimize for cost, latency, and throughput. However, the computational requirements and costs associated with running these large, powerful LLMs can be prohibitive:

- Traditional LLMs with billions of parameters require significant computational resources, often necessitating GPU instances with substantial memory.
- This computational intensity and cost have led to growing interest in smaller, more efficient language models that can run on CPU infrastructure while still delivering good performance for specific use cases.
- [AWS Graviton processors](https://aws.amazon.com/ec2/graviton/), specifically designed for cloud workloads, offer an optimal platform for running these quantized models, providing up to 50% better price performance compared to traditional x86-based instances for ML inference workloads.

In this notebook, we'll demonstrate how to deploy a quantized [DeepSeek R1 distilled 8B model](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF) on Amazon SageMaker AI using Graviton-based instances, highlighting the challenges of running large LLMs and the benefits of running efficient language models on cost-optimized hardware.


### Architecture and Components
Our solution leverages Amazon SageMaker with AWS Graviton3 processors to run small language models cost-efficiently. The key components include:

* Amazon SageMaker AI hosted endpoints for model serving
* AWS Graviton3-based instances (ml.c7g series) for computation
* Llama.cpp for CPU-optimized inference
* Pre-quantized GGUF format models

[Llama.cpp](https://github.com/ggerganov/llama.cpp) uses GGUF, a binary format that stores the model weights together with their metadata. Existing models must be converted to GGUF before they can be used for inference. This notebook downloads a model that is already quantized and stored as GGUF.
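
As a quick sanity check before packaging a model, you can confirm that a downloaded file really is GGUF by reading its header. The sketch below is a minimal illustration, assuming GGUF version 2 or later, where the header is the 4-byte magic `GGUF` followed by a little-endian `uint32` version and two `uint64` counts. The helper name `read_gguf_header` is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF header.

    Assumes GGUF v2 or later, where both counts are 64-bit integers.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("code/DeepSeek-R1-Distill-Llama-8B-Q5_K_S.gguf")` after the download step later in this notebook should return the format version and counts without loading the weights.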

### Deployment Process
To deploy your model on SageMaker with Graviton, you'll need to:

1. Create a Docker container compatible with ARM64 architecture
2. Package your model and inference code
3. Create a SageMaker model
4. Configure and launch an endpoint




### Preparation
Install the required Python packages and set up the SageMaker session variables used throughout the notebook.

In [None]:
!sudo apt-get install -y zip
!pip install huggingface-hub

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import json

sagemaker_client = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
print(f"Role: {role}")

boto_session = boto3.Session()
sagemaker_session = sagemaker.session.Session(boto_session) # sagemaker session for interacting with different AWS APIs
region = sagemaker_session.boto_region_name  # public attribute; avoids relying on the private _region_name

default_bucket = sagemaker_session.default_bucket()  # bucket to house model artifacts

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

To run the model on a Graviton processor, you need a Docker container that supports the ARM64 instance type and has the necessary packages installed. With Amazon SageMaker, you can package your own algorithms so that they can be trained and deployed in the SageMaker environment. This notebook guides you through an example of how to extend one of our existing, predefined SageMaker deep learning framework containers. You can find a [list of available pre-built containers here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 
1. [Extending our PyTorch graviton containers](#Extending-our-PyTorch-containers)

### Extending our PyTorch containers
In this example, we extend the prebuilt SageMaker PyTorch inference container that supports Graviton instances and add a Python inference script that works with the DeepSeek distilled model.

#### How Amazon SageMaker runs your Docker container

* Typically you specify a program (e.g. a script) as an `ENTRYPOINT` in the Dockerfile. That program runs at startup and decides what to do. The original `ENTRYPOINT` specified within the SageMaker PyTorch container is [here](https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.5.1/py3/Dockerfile.cpu#L142).

#### Running your container during training

Currently, our SageMaker PyTorch container utilizes [console_scripts](http://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point) to make use of the `train` command issued at training time. The line that gets invoked during `train` is defined within the setup.py file inside [SageMaker Containers](https://github.com/aws/sagemaker-containers/blob/master/setup.py#L48), our common SageMaker deep learning container framework. When this command is run, it will invoke the [trainer class](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/cli/train.py) to run, which will finally invoke our [PyTorch container code](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/training.py) to run your Python file.

A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

In this example, we will only use the inference container, as shown below.

#### Running your container during hosting

Hosting has a very different model than training because hosting responds to inference requests that come in via HTTP. Currently, the SageMaker PyTorch container [uses](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103) our [recommended Python serving stack](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_server.py#L44) to provide robust and scalable serving of inference requests:

![Request serving stack](./stack.png)

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well. 
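
To make the two routes concrete, here is a toy server built only on the Python standard library. It is a sketch of the contract described above, not the serving stack that the SageMaker PyTorch container actually runs. The class name `InferenceHandler`, the helper `make_server`, and the echo "model" are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Toy handler exposing the two routes SageMaker calls on a hosting container."""

    def _send(self, status, body=b"", content_type="application/json"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: answer 200 once the container can accept requests
        if self.path == "/ping":
            self._send(200)
        else:
            self._send(404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Placeholder "model": echo the payload instead of running inference
        self._send(200, json.dumps({"echo": payload}).encode("utf-8"))

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=8080):
    """Bind the toy server locally; hosting containers listen on port 8080."""
    return HTTPServer(("127.0.0.1", port), InferenceHandler)
```

Running `make_server().serve_forever()` locally and sending `GET /ping` or `POST /invocations` with a JSON body mirrors what SageMaker does against a real container, which would bind to all interfaces rather than `127.0.0.1`.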

The container has the model files in the same place that they were written to during training:

    /opt/ml
    `-- model
        `-- <model files>

#### Custom files available to build the container used in this example

The `container` directory has all the components you need to extend the SageMaker PyTorch container to use as a sample algorithm:

    .
    |-- Dockerfile
    `-- code
        |-- inference.py
        `-- requirements.txt

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image for *inference*. More details are provided below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We invoke the commands directly later in this notebook, but you can copy and run the script for your own algorithms.
* __`code`__ is the directory which contains our user code to be invoked.

In this application, we install and/or update a few libraries, and copy one script in the container, which will be used as `ENTRYPOINT`. You may only need that many, but if you have many supporting routines, you may wish to install more and use more files.

The files that we put in the container are:

* __`inference.py`__ is the program that implements the inference handlers (`model_fn`, `input_fn`, `predict_fn`, and `output_fn`) invoked by the serving stack
* __`requirements.txt`__ is the text file that lists additional Python packages to install at deployment time

#### The inference Dockerfile

The Dockerfile describes the image that we want to build. We use the SageMaker PyTorch *inference* image as the base.

So the SageMaker PyTorch ECR image that supports Graviton in this case would be:
* FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker

Next, we install the required additional libraries and add the code that implements our specific algorithm to the container, and set up the right environment for it to run under.

Let's look at the Dockerfile for this example.

In [None]:
%%writefile Dockerfile
FROM 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker-v1.1

RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
    ninja-build \
    cmake \
    libopenblas-dev \
    build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/*
RUN python3 -m pip install --upgrade pip
RUN pip uninstall ninja -y
RUN python3 -m pip install --upgrade huggingface-hub pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context
ENV FORCE_CMAKE=1

RUN CMAKE_ARGS="-DCMAKE_CXX_FLAGS='-mcpu=native -fopenmp' -DCMAKE_C_FLAGS='-mcpu=native -fopenmp'" python3 -m pip install llama-cpp-python==0.3.4 --verbose

In [None]:
!awk -v region="$AWS_REGION" '{gsub(/<region>/, region)}1' Dockerfile > Dockerfile.tmp && mv Dockerfile.tmp Dockerfile
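
The `awk` command above reads `AWS_REGION` from the shell environment. If that variable is not set for your kernel, the placeholder is replaced with an empty string and the `FROM` line becomes invalid. As an alternative, the following sketch performs the same substitution in Python using the `region` variable defined earlier in the notebook. The helper name `render_dockerfile` is hypothetical.

```python
from pathlib import Path


def render_dockerfile(path, region):
    """Replace the <region> placeholder in a Dockerfile in place and return the new text."""
    if not region:
        raise ValueError("region must be a non-empty string, e.g. 'us-east-1'")
    dockerfile = Path(path)
    text = dockerfile.read_text().replace("<region>", region)
    dockerfile.write_text(text)
    return text
```

For example, `print(render_dockerfile("Dockerfile", region))` rewrites the file and shows the resolved base image URI so you can check it before starting the build.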

### Permissions

Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions, because it uses AWS CodeBuild to build the container image and creates an IAM service role for the build project. You can add the inline policy below to the role that you used to start your notebook instance. There's no need to restart the notebook instance afterwards; the new permissions are available immediately.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "codebuild:BatchGetProjects",
                "iam:PassRole",
                "iam:DeleteRolePolicy",
                "iam:ListAttachedRolePolicies",
                "codebuild:ListBuilds",
                "iam:CreateRole",
                "iam:DeleteRole",
                "codebuild:StartBuild",
                "iam:PutRolePolicy",
                "iam:ListRolePolicies",
                "codebuild:CreateProject",
                "codebuild:BatchGetBuilds"
            ],
            "Resource": "*"
        }
    ]
}
```
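
If you prefer to attach the policy programmatically rather than through the console, a sketch along these lines may help. It assumes the caller already has `iam:PutRolePolicy` on the target role (for example, an administrator running it outside the notebook). The policy name `graviton-codebuild-access` and both helper names are hypothetical.

```python
import json


def role_name_from_arn(role_arn):
    """Extract the role name from an ARN such as arn:aws:iam::123456789012:role/service-role/MyRole."""
    return role_arn.split("/")[-1]


def attach_inline_policy(role_arn, policy_document, policy_name="graviton-codebuild-access"):
    """Attach `policy_document` (a dict) to the role as an inline policy."""
    import boto3  # imported lazily so role_name_from_arn works without AWS access

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyName=policy_name,  # hypothetical name chosen for this example
        PolicyDocument=json.dumps(policy_document),
    )
```

Here `policy_document` would be the JSON shown above loaded as a Python dict, and `role_arn` the value printed by `sagemaker.get_execution_role()`.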

### Building and registering the inference container

The following shell script builds the container image with AWS CodeBuild and pushes it to Amazon ECR using `docker push`. We use CodeBuild rather than building in the notebook because the image targets the ARM64 architecture and should be built on an ARM64 (Graviton) host; the CodeBuild `ARM_CONTAINER` environment provides that compute environment for the Docker build.

This code looks for an ECR repository in the account you're using and the current default region. If the repository doesn't exist, the script will create it. In addition, since we are using the SageMaker PyTorch image as the base, we will need to retrieve ECR credentials to pull this public image.

Note that the role used by CodeBuild needs permission to push images to the ECR registry.
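
The image pushed by the build is later referenced by its full ECR URI when the SageMaker model is created. As a small illustration of how that URI is assembled, assuming a standard commercial AWS region (the hypothetical helper `ecr_image_uri` does not cover partitions with a different domain suffix):

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in a private Amazon ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"
```

With the repository name used in this notebook, `ecr_image_uri(sagemaker_session.account_id(), region, "llama-cpp-python")` should match the `image_uri` passed to `PyTorchModel` further down.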

In [None]:
%%bash
#!/bin/bash

# Exit on any error
set -e

# Configuration variables
PROJECT_NAME="arm-docker-build"
AWS_REGION="${AWS_REGION:-us-east-1}"
ECR_REPO_NAME="llama-cpp-python"
IMAGE_TAG="latest"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUILDSPEC_FILE="buildspec.yml"
S3_BUCKET="${AWS_ACCOUNT_ID}-codebuild-source"
CODEBUILD_ROLE_NAME="codebuild-${PROJECT_NAME}-service-role"

# Function to wait for CodeBuild project to be ready
wait_for_codebuild_project() {
    echo "Waiting for CodeBuild project to be ready..."
    while true; do
        if aws codebuild batch-get-projects --names "${PROJECT_NAME}" --region "${AWS_REGION}" | grep -q "\"name\": \"${PROJECT_NAME}\""; then
            echo "CodeBuild project is ready"
            break
        fi
        echo "Still waiting for CodeBuild project..."
        sleep 10
    done
}

# Function to check if project exists
check_project_exists() {
    aws codebuild batch-get-projects --names "${PROJECT_NAME}" --region "${AWS_REGION}" | grep -q "\"name\": \"${PROJECT_NAME}\"" || return 1
}

# Create CodeBuild project
create_codebuild_project() {
    echo "Creating CodeBuild project..."
    
    aws codebuild create-project \
        --name "${PROJECT_NAME}" \
        --description "Docker build project for ARM64 architecture" \
        --source "{
            \"type\": \"S3\",
            \"location\": \"${S3_BUCKET}/source.zip\"
        }" \
        --artifacts "{
            \"type\": \"NO_ARTIFACTS\"
        }" \
        --environment "{
            \"type\": \"ARM_CONTAINER\",
            \"image\": \"aws/codebuild/amazonlinux2-aarch64-standard:2.0\",
            \"computeType\": \"BUILD_GENERAL1_LARGE\",
            \"privilegedMode\": true,
            \"environmentVariables\": [
                {
                    \"name\": \"AWS_DEFAULT_REGION\",
                    \"value\": \"${AWS_REGION}\",
                    \"type\": \"PLAINTEXT\"
                },
                {
                    \"name\": \"AWS_ACCOUNT_ID\",
                    \"value\": \"${AWS_ACCOUNT_ID}\",
                    \"type\": \"PLAINTEXT\"
                },
                {
                    \"name\": \"ECR_REPO_NAME\",
                    \"value\": \"${ECR_REPO_NAME}\",
                    \"type\": \"PLAINTEXT\"
                }
            ]
        }" \
        --service-role "arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CODEBUILD_ROLE_NAME}" \
        --region "${AWS_REGION}"

    # Wait for project to be ready
    wait_for_codebuild_project
}

# Create IAM role for CodeBuild
create_codebuild_role() {
    echo "Creating IAM role for CodeBuild..."
    
    # Create trust policy
    cat << EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "codebuild.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

    # Create policy document
    cat << EOF > policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:/aws/codebuild/${PROJECT_NAME}",
                "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:/aws/codebuild/${PROJECT_NAME}:*"
            ],
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::${S3_BUCKET}/*"
            ],
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:GetBucketAcl",
                "s3:GetBucketLocation"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:CompleteLayerUpload",
                "ecr:GetAuthorizationToken",
                "ecr:InitiateLayerUpload",
                "ecr:PutImage",
                "ecr:UploadLayerPart",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": "*"
        }
    ]
}
EOF

    # Create IAM role
    aws iam create-role \
        --role-name ${CODEBUILD_ROLE_NAME} \
        --assume-role-policy-document file://trust-policy.json

    # Create IAM policy
    aws iam put-role-policy \
        --role-name ${CODEBUILD_ROLE_NAME} \
        --policy-name codebuild-policy \
        --policy-document file://policy.json

    # Clean up policy files
    rm trust-policy.json policy.json

    # Wait for role to be available
    echo "Waiting for IAM role to be available..."
    sleep 20
}

# Check and create S3 bucket
if ! aws s3api head-bucket --bucket "${S3_BUCKET}" 2>/dev/null; then
    echo "Creating S3 bucket: ${S3_BUCKET}"
    if [ "${AWS_REGION}" = "us-east-1" ]; then
        aws s3api create-bucket \
            --bucket "${S3_BUCKET}" \
            --region "${AWS_REGION}"
    else
        aws s3api create-bucket \
            --bucket "${S3_BUCKET}" \
            --region "${AWS_REGION}" \
            --create-bucket-configuration LocationConstraint="${AWS_REGION}"
    fi

    # Configure bucket
    aws s3api put-bucket-versioning \
        --bucket "${S3_BUCKET}" \
        --versioning-configuration Status=Enabled

    aws s3api put-bucket-encryption \
        --bucket "${S3_BUCKET}" \
        --server-side-encryption-configuration '{
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "AES256"
                    }
                }
            ]
        }'
fi

# Check if role exists
if aws iam get-role --role-name "${CODEBUILD_ROLE_NAME}" 2>/dev/null; then
    echo "Role ${CODEBUILD_ROLE_NAME} exists. Deleting..."
    
    # First detach all policies from the role
    for policy_arn in $(aws iam list-attached-role-policies --role-name "${CODEBUILD_ROLE_NAME}" --query 'AttachedPolicies[*].PolicyArn' --output text); do
        echo "Detaching policy: ${policy_arn}"
        aws iam detach-role-policy \
            --role-name "${CODEBUILD_ROLE_NAME}" \
            --policy-arn "${policy_arn}"
    done

    # Delete any inline policies
    for policy_name in $(aws iam list-role-policies --role-name "${CODEBUILD_ROLE_NAME}" --query 'PolicyNames[*]' --output text); do
        echo "Deleting inline policy: ${policy_name}"
        aws iam delete-role-policy \
            --role-name "${CODEBUILD_ROLE_NAME}" \
            --policy-name "${policy_name}"
    done

    # Delete the role
    aws iam delete-role --role-name "${CODEBUILD_ROLE_NAME}"
    
    # Wait a bit for deletion to propagate
    echo "Waiting for role deletion to propagate..."
    sleep 10
fi

# Create new role
echo "Creating new role: ${CODEBUILD_ROLE_NAME}"
create_codebuild_role

# Verify the role was created
if aws iam get-role --role-name "${CODEBUILD_ROLE_NAME}" >/dev/null 2>&1; then
    echo "Role ${CODEBUILD_ROLE_NAME} successfully created"
else
    echo "Failed to create role ${CODEBUILD_ROLE_NAME}"
    exit 1
fi

# Check and create CodeBuild project
if ! check_project_exists; then
    create_codebuild_project
fi

# Delete buildspec.yml if it exists and create new one
if [ -f "$BUILDSPEC_FILE" ]; then
    echo "Removing existing buildspec file..."
    rm -f "$BUILDSPEC_FILE"
fi

# Create new buildspec.yml
if [ ! -f "$BUILDSPEC_FILE" ]; then
    echo "Creating new buildspec file..."
    cat << EOF > "$BUILDSPEC_FILE"
version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws ecr get-login-password --region \$AWS_DEFAULT_REGION | docker login --username AWS --password-stdin \$AWS_ACCOUNT_ID.dkr.ecr.\$AWS_DEFAULT_REGION.amazonaws.com
      - aws ecr get-login-password --region \$AWS_DEFAULT_REGION | docker login --username AWS --password-stdin 763104351884.dkr.ecr.\$AWS_DEFAULT_REGION.amazonaws.com
      - REPOSITORY_URI=\$AWS_ACCOUNT_ID.dkr.ecr.\$AWS_DEFAULT_REGION.amazonaws.com/\$ECR_REPO_NAME
      - COMMIT_HASH=\$(echo \$CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=\${COMMIT_HASH:=latest}
  build:
    commands:
      - echo Build started on \`date\`
      - echo Building the Docker image...
      - docker build -t \$REPOSITORY_URI:latest .
      - docker tag \$REPOSITORY_URI:latest \$REPOSITORY_URI:\$IMAGE_TAG
  post_build:
    commands:
      - echo Build completed on \`date\`
      - echo Pushing the Docker images...
      - docker push \$REPOSITORY_URI:latest
      - docker push \$REPOSITORY_URI:\$IMAGE_TAG
      - echo Writing image definitions file...
      - printf '{"ImageURI":"%s"}' \$REPOSITORY_URI:\$IMAGE_TAG > imageDefinitions.json
artifacts:
  files:
    - imageDefinitions.json
EOF
fi

# Create ECR repository if needed
if aws ecr describe-repositories --repository-names "${ECR_REPO_NAME}" --region "${AWS_REGION}" >/dev/null 2>&1; then
    echo "Repository ${ECR_REPO_NAME} already exists"
else
    echo "Creating ECR repository..."
    aws ecr create-repository \
        --repository-name "${ECR_REPO_NAME}" \
        --region "${AWS_REGION}"
fi

echo "Creating source package..."
zip -r source.zip . -x "*.git*" -x "*.cache*" -x "*code*" -x "*.ipynb*" -x "*.tar.gz*"

echo "Uploading source to S3..."
aws s3 cp source.zip "s3://${S3_BUCKET}/source.zip"

# Wait a moment for S3 upload to complete
sleep 5

# Start build with retry logic
echo "Starting CodeBuild job..."
MAX_RETRIES=3
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
    if BUILD_ID=$(aws codebuild start-build \
        --project-name "${PROJECT_NAME}" \
        --region "${AWS_REGION}" \
        --query 'build.id' \
        --output text 2>/dev/null); then
        echo "Build started with ID: ${BUILD_ID}"
        break
    else
        RETRY_COUNT=$((RETRY_COUNT + 1))
        if [ $RETRY_COUNT -lt $MAX_RETRIES ]; then
            echo "Failed to start build, retrying in 10 seconds..."
            sleep 10
        else
            echo "Failed to start build after ${MAX_RETRIES} attempts"
            exit 1
        fi
    fi
done

# Monitor build progress
while true; do
    BUILD_STATUS=$(aws codebuild batch-get-builds \
        --ids "${BUILD_ID}" \
        --region "${AWS_REGION}" \
        --query 'builds[0].buildStatus' \
        --output text 2>/dev/null)
    
    echo "Build status: ${BUILD_STATUS}"
    
    if [ "${BUILD_STATUS}" = "SUCCEEDED" ]; then
        echo "Build completed successfully!"
        break
    elif [ "${BUILD_STATUS}" = "FAILED" ] || [ "${BUILD_STATUS}" = "FAULT" ] || [ "${BUILD_STATUS}" = "TIMED_OUT" ] || [ "${BUILD_STATUS}" = "STOPPED" ]; then
        echo "Build ended with status ${BUILD_STATUS}."
        exit 1
    fi
    
    sleep 30
done

echo "Docker image built and pushed to ECR successfully!"

### Writing your own inference script (inference.py)

Because we use a pre-packaged SageMaker PyTorch container, the only requirement for the inference script is that it defines the following template functions:
- `model_fn()` reads the contents of the model directory, which SageMaker populates from the `tar.gz` archive stored in S3. We use it to load the model.
- `input_fn()` is used here simply to parse the data received from a request made to the endpoint.
- `predict_fn()` calls the model returned by `model_fn()` to run inference on the output of `input_fn()`.

Optionally, an `output_fn()` can be defined to format the response, using the output of `predict_fn()`.
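
To show how these handlers fit together, the sketch below chains them for a single request using a stand-in model. It is only an illustration of the call order. `FakeModel` and `handle` are hypothetical names, and the real serving stack also passes content types and other details differently.

```python
import json


class FakeModel:
    """Stand-in for the loaded model; returns a canned chat-completion-like dict."""

    def create_chat_completion(self, messages):
        last = messages[-1]["content"]
        return {"choices": [{"message": {"role": "assistant", "content": f"echo: {last}"}}]}


def model_fn(model_dir):
    return FakeModel()  # the real script loads the GGUF file from model_dir here


def input_fn(request_body, request_content_type):
    return json.loads(request_body)


def predict_fn(input_data, model):
    return model.create_chat_completion(messages=input_data)


def output_fn(prediction, response_content_type):
    return json.dumps(prediction)


def handle(request_body, model, content_type="application/json"):
    """Chain the handlers in the order they are applied to one request."""
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, content_type)
```

The model is loaded once with `model_fn()` at startup, while `input_fn()`, `predict_fn()`, and `output_fn()` run for every request.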

In [None]:
%%writefile code/inference.py
import json
import logging
import os
from llama_cpp import Llama
from multiprocessing import cpu_count

worker_count = os.environ.get('SAGEMAKER_MODEL_SERVER_WORKERS', cpu_count())

def input_fn(request_body, request_content_type, context):
    return json.loads(request_body)

def model_fn(model_dir):
    model=Llama(
        # model_path=f'{model_dir}/Llama-3.2-3B-Instruct-Q4_0.gguf',
        model_path=f"{model_dir}/DeepSeek-R1-Distill-Llama-8B-Q5_K_S.gguf",
        verbose=False,
        n_threads=max(1, cpu_count() // int(worker_count))  # Graviton has 1 vCPU = 1 physical core; never drop below 1 thread
    )
    logging.info("Loaded model successfully")
    return model

def predict_fn(input_data, model, context):
    response = model.create_chat_completion(
        messages=input_data
    )
    return response

def output_fn(prediction, response_content_type, context):
    return json.dumps(prediction)
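
The `n_threads` setting in `model_fn()` divides the instance's vCPUs evenly across the model server workers, so the workers do not compete for the same cores. A worked example of that arithmetic, using a hypothetical helper name:

```python
def threads_per_worker(total_vcpus, workers):
    """Split the available vCPUs evenly across model server workers (at least one thread each)."""
    return max(1, total_vcpus // int(workers))


# ml.c7g.12xlarge exposes 48 vCPUs; with 2 workers each one gets 24 threads
print(threads_per_worker(48, 2))
```

Because `SAGEMAKER_MODEL_SERVER_WORKERS` arrives as a string when it is set through the environment, the helper converts it with `int()` before dividing.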

In [None]:
# from huggingface_hub import hf_hub_download
# hf_hub_download(repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF", filename="Llama-3.2-3B-Instruct-Q4_0.gguf", local_dir='./code')
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF", filename="DeepSeek-R1-Distill-Llama-8B-Q5_K_S.gguf", local_dir='./code')

In [None]:
!tar -czf model.tar.gz code/

In [None]:
model_data = sagemaker_session.upload_data('./model.tar.gz', key_prefix=f'{prefix}-llama-cpp-model')
model_data

In [None]:
from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel


pytorch_model = PyTorchModel(model_data=model_data, 
                             role=role,
                             entry_point='inference.py', 
                             image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
                             model_server_workers=2
)

predictor = pytorch_model.deploy(instance_type='ml.c7g.12xlarge', initial_instance_count=1)

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

We can use the SageMaker Python SDK to invoke the endpoint as shown below:

In [None]:
%%time
prompt = [{
            "role": "system",
            "content": "You are a helpful assistant that outputs with 500 words",
        },
        {"role": "user", "content": "Who won the world series in 2020"}]
predictor.predict(prompt)

In [None]:
endpoint_name = predictor.endpoint_name
print(endpoint_name)

You can also invoke the endpoint with the low-level API, using the boto3 SageMaker runtime client:

In [None]:
client = boto3.client('sagemaker-runtime')

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
print(response['Body'].read().decode("utf-8"))
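
The endpoint returns the dictionary produced by `create_chat_completion`, serialized as JSON. The sketch below pulls the generated text out of such a response. It assumes an OpenAI-style chat completion layout (`choices[0].message.content`) and that the model wraps its reasoning in `<think>` tags, which is typical for DeepSeek R1 distilled models but not guaranteed for every prompt. The helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(completion):
    """Return (reasoning, answer) from an OpenAI-style chat completion dict."""
    content = completion["choices"][0]["message"]["content"]
    head, sep, tail = content.partition("</think>")
    if not sep:
        # No reasoning block found: treat the whole content as the answer
        return "", content.strip()
    # Drop the opening tag if present; some chat templates emit only the closing tag
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

For example, `reasoning, answer = split_reasoning(json.loads(body))`, where `body` is the decoded response string, separates the chain of thought from the final answer.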

### Inference Recommender
SageMaker Inference Recommender is a SageMaker capability that reduces the time required to get machine learning (ML) models into production by automating load tests and optimizing model performance across instance types. You can use Inference Recommender to select a real-time inference endpoint that delivers the best performance at the lowest cost.

You can start an Inference Recommender job in minutes and receive an optimized endpoint configuration within hours, which removes weeks of manual testing and tuning.

Inference Recommender uses metadata about your ML model to recommend the best instance types and endpoint configurations for deployment. You can provide as much or as little information as you'd like but the more information you provide, the better your recommendations will be.

ML Frameworks: `TENSORFLOW, PYTORCH, XGBOOST, SAGEMAKER-SCIKIT-LEARN`

ML Domains: `COMPUTER_VISION, NATURAL_LANGUAGE_PROCESSING, MACHINE_LEARNING`

Example ML Tasks: `CLASSIFICATION, REGRESSION, IMAGE_CLASSIFICATION, OBJECT_DETECTION, SEGMENTATION, MASK_FILL, TEXT_CLASSIFICATION, TEXT_GENERATION, OTHER`

Note: Select the task that is the closest match to your model. Choose `OTHER` if none apply.

First, we need to create an archive that contains individual files that Inference Recommender can send to your SageMaker Endpoints. Inference Recommender will randomly sample files from this archive so make sure it contains a similar distribution of payloads you'd expect in production. Note that your inference code must be able to read in the file formats from the sample payload.

In [None]:
raw = predictor.serializer.serialize([
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs with 500 words",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ])

In [None]:
# `raw` is already a JSON string, so write it to the payload file directly
with open("samplepayload", "w") as f:
    f.write(raw)

In [None]:
!cat samplepayload

In [None]:
!tar -czf payload.tar.gz samplepayload
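
Since Inference Recommender samples files from the archive, you may want more than one payload in it. The following sketch builds such an archive with the standard library. The file naming scheme and the helper name `build_payload_archive` are assumptions made for illustration.

```python
import io
import json
import tarfile


def build_payload_archive(payloads, archive_path="payload.tar.gz"):
    """Write each payload as its own JSON file inside a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for i, payload in enumerate(payloads):
            data = json.dumps(payload).encode("utf-8")
            info = tarfile.TarInfo(name=f"samplepayload_{i}.json")
            info.size = len(data)  # the size must be set before adding an in-memory file
            tar.addfile(info, io.BytesIO(data))
    return archive_path
```

Each element of `payloads` would be a list of chat messages in the same shape as `prompt` above, ideally covering the range of prompt lengths you expect in production.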

Next, we'll upload the packaged sample payload (payload.tar.gz) created above to S3. The S3 location will be used as input to our Inference Recommender job later in this notebook.

In [None]:
payload = sagemaker_session.upload_data('./payload.tar.gz', key_prefix=f'{prefix}-llama-cpp-python-payload')

#### Run an Inference Recommendations Job

In the SageMaker Python SDK, Inference Recommender jobs are started with the model's `right_size()` method.

In [None]:
from sagemaker.parameter import CategoricalParameter
from sagemaker.inference_recommender import Phase, ModelLatencyThreshold


pytorch_model.right_size(payload, 
                         supported_content_types=['application/json'],
                         supported_instance_types=['ml.c7g.8xlarge', 'ml.c7g.12xlarge'],
                         framework='PYTORCH',
                         job_duration_in_seconds=3600,
                         hyperparameter_ranges=[{
                             'instance_types': CategoricalParameter(['ml.c7g.8xlarge', 'ml.c7g.12xlarge']),
                             'SAGEMAKER_MODEL_SERVER_WORKERS': CategoricalParameter(["1", "2"])
                         }],
                         phases=[Phase(120, 1, 1), Phase(120, 2, 1), Phase(120, 7, 1)],
                         traffic_type='PHASES',
                         model_latency_thresholds=[ModelLatencyThreshold('P95', 50000)],
                         max_invocations=60,
                         log_level="Quiet"
                        )
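
Each `Phase` in the call above takes three positional values: the phase duration in seconds, the initial number of users, and the spawn rate. The sketch below labels those values with the corresponding field names and adds up the load-test time. It uses plain tuples so it runs without the SageMaker SDK, and the helper name `describe_phases` is hypothetical.

```python
PHASE_FIELDS = ("DurationInSeconds", "InitialNumberOfUsers", "SpawnRate")


def describe_phases(phases):
    """Label each (duration, initial_users, spawn_rate) tuple and total the durations."""
    described = [dict(zip(PHASE_FIELDS, phase)) for phase in phases]
    total_seconds = sum(phase[0] for phase in phases)
    return described, total_seconds


# Same values as the Phase(...) objects passed to right_size()
described, total_seconds = describe_phases([(120, 1, 1), (120, 2, 1), (120, 7, 1)])
print(total_seconds)  # 360 seconds of traffic per tested configuration
```

The three phases therefore generate six minutes of traffic per configuration, which fits comfortably within the `job_duration_in_seconds=3600` limit set above.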

Once the inference recommender job has finished, you can navigate to the SageMaker AI console to check the job results.

Each inference recommendation includes `InstanceType`, `InitialInstanceCount`, `EnvironmentParameters` which are tuned environment variable parameters for better performance. We also include performance and cost metrics such as `MaxInvocations`, `ModelLatency`, `CostPerHour` and `CostPerInference`. We believe these metrics will help you narrow down to a specific endpoint configuration that suits your use case. 

For example:

- If you care most about overall price-performance with an emphasis on throughput, focus on the `CostPerInference` metric.
- If you need a balance between latency and throughput, focus on the `ModelLatency` and `MaxInvocations` metrics.

| Metric | Description |
| --- | --- |
| ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. <br /> Units: Microseconds |
| MaxInvocations | The expected maximum number of InvokeEndpoint requests per minute that the endpoint can serve. <br /> Units: None |
| CostPerHour | The estimated cost per hour for your real-time endpoint. <br /> Units: US Dollars |
| CostPerInference | The estimated cost per inference for your real-time endpoint. <br /> Units: US Dollars |
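
To get a feel for how these metrics relate, the sketch below derives a rough per-request cost from an hourly price and a sustained request rate, and converts `ModelLatency` from microseconds to seconds. This is a back-of-the-envelope approximation with hypothetical helper names and made-up input numbers, not necessarily how Inference Recommender computes `CostPerInference`.

```python
def cost_per_inference(cost_per_hour, invocations_per_minute):
    """Rough cost of one request when the endpoint is kept busy at the given rate."""
    if invocations_per_minute <= 0:
        raise ValueError("invocations_per_minute must be positive")
    return cost_per_hour / (invocations_per_minute * 60)


def microseconds_to_seconds(latency_us):
    """Convert a ModelLatency value (microseconds) to seconds."""
    return latency_us / 1_000_000


# Hypothetical numbers: $1.50 per hour at 10 requests per minute
print(f"{cost_per_inference(1.50, 10):.4f} USD per request")
print(f"{microseconds_to_seconds(12_500_000)} s model latency")
```

The estimate assumes the endpoint is fully utilized; at lower traffic the effective cost per request is higher because the hourly price stays the same.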

### Cleanup

In [None]:
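predictor.delete_model()  # remove the SageMaker model first, while the endpoint config still exists
predictor.delete_endpoint()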

## Reference
- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [SageMaker multi-model endpoint bring your own container](https://github.com/aws/amazon-sagemaker-examples/tree/f671af53c3f7c77172e5803a4ff5a3ea8672ecb6/deploy_and_monitor/sm-multi_model_endpoint_bring_your_own_container)
