# Deploy Small Language Models Cost-efficiently with Amazon SageMaker and AWS Graviton

As organizations look to incorporate AI capabilities into their applications, Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks. [Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html), AWS's fully managed machine learning service, provides a platform for deploying these ML models with multiple inference options, allowing organizations to optimize for cost, latency, and throughput. However, the computational requirements and costs associated with running these large, powerful LLMs can be prohibitive:

- Traditional LLMs with billions of parameters require significant computational resources, often necessitating GPU instances with substantial memory.
- This computational intensity and cost have led to growing interest in smaller, more efficient language models that can run on CPU infrastructure while still delivering good performance for specific use cases.
- [AWS Graviton processors](https://aws.amazon.com/ec2/graviton/), specifically designed for cloud workloads, offer an optimal platform for running these quantized models, providing up to 50% better price performance compared to traditional x86-based instances for ML inference workloads.

In this notebook, we'll demonstrate how to deploy a quantized [DeepSeek R1 distilled 8B model](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF) on Amazon SageMaker AI using Graviton-based instances, highlighting the challenges of running large LLMs and the benefits of running efficient language models on cost-optimized hardware.


### Architecture and Components
Our solution leverages Amazon SageMaker with AWS Graviton3 processors to run small language models cost-efficiently. The key components include:

* Amazon SageMaker AI hosted endpoints for model serving
* AWS Graviton3-based instances (ml.c7g series) for computation
* Llama.cpp for CPU-optimized inference
* Pre-quantized GGUF format models

[Llama.cpp](https://github.com/ggerganov/llama.cpp) uses GGUF, a binary format that stores the model and its metadata in a single file. Existing models need to be converted to GGUF format before they can be used for inference.
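
To make the format concrete, the sketch below reads the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key-value counts). `read_gguf_header` is a hypothetical helper written for this example, not part of Llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes a little-endian GGUF v2+ header; raises ValueError otherwise.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ" = little-endian uint32 followed by two uint64 values (20 bytes).
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Running this against a downloaded `.gguf` file is a quick way to confirm that the download is a GGUF file before uploading it to S3.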

### Deployment Process
To deploy your model on SageMaker with Graviton, you'll need to:

1. Create a Docker container compatible with ARM64 architecture
2. Package your model and inference code
3. Create a SageMaker model
4. Configure and launch an endpoint




### Preparation
Install the required Python packages and prepare the environment variables.

In [None]:
!sudo apt-get install -y zip
!pip install huggingface-hub

In [None]:
import boto3
import botocore
import sagemaker
import sys
import time
import json

sagemaker_client = boto3.client("sagemaker")
role = sagemaker.get_execution_role()
print(f"Role: {role}")

boto_session = boto3.Session()
sagemaker_session = sagemaker.session.Session(boto_session) # sagemaker session for interacting with different AWS APIs
region = sagemaker_session.boto_region_name  # public attribute; avoids relying on the private _region_name

default_bucket = sagemaker_session.default_bucket()  # bucket to house model artifacts

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

To run the model on a Graviton processor, you need a Docker container that supports the instance type and has the necessary packages installed. With Amazon SageMaker, you can package your own algorithms that can then be trained and deployed in the SageMaker environment. This notebook guides you through an example of how to extend one of our existing, predefined SageMaker deep learning framework containers. You can find a [list of available pre-built containers here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 
1. [Extending our PyTorch graviton containers](#Extending-our-PyTorch-containers)

### Extending our PyTorch containers
In this example, we extend a prebuilt SageMaker PyTorch container that supports Graviton instances and add a Python inference script that works with the DeepSeek distilled model.

#### How Amazon SageMaker runs your Docker container

* Typically, you specify a program (for example, a script) as an `ENTRYPOINT` in the Dockerfile. That program runs at startup and decides what to do. The original `ENTRYPOINT` specified within the SageMaker PyTorch container is [here](https://github.com/aws/deep-learning-containers/blob/1074667d84b69139eb91f1a2c5c6314269c1b792/pytorch/training/docker/2.5/py3/Dockerfile.cpu#L336).

#### Running your container during training

Currently, our SageMaker PyTorch container utilizes [console_scripts](http://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point) to make use of the `train` command issued at training time. The line that gets invoked during `train` is defined within the setup.py file inside [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit/blob/e2d79421b1454f2e9b342c0b3366078a21b6eb18/setup.py#L94), our common SageMaker deep learning container framework. When this command is run, it will invoke the [trainer class](https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/cli/train.py) to run, which will finally invoke our [PyTorch container code](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/training.py) to run your Python file.

A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

In this example, we use only the inference container, as described below.

#### Running your container during hosting

Hosting works very differently from training because a hosted container responds to inference requests that arrive over HTTP. Currently, the SageMaker PyTorch containers [use](https://github.com/aws/deep-learning-containers/blob/9fc00f0fa5a942304ac4fdb3812034c275dcfe72/pytorch/inference/docker/2.5/py3/Dockerfile.arm64.cpu#L151-L155) [TorchServe](https://pytorch.org/serve/) to provide robust and scalable serving of inference requests:

![Request serving stack](https://user-images.githubusercontent.com/880376/83180095-c44cc600-a0d7-11ea-97c1-23abb4cdbe4d.jpg)

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well. 
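
As an illustration of this contract only (in the container used here, TorchServe handles these routes, not this code), a minimal sketch with Python's standard library might look like the following. The `InferenceHandler` class, the `make_server` helper, and the echo response are hypothetical choices made for the example; port 8080 is the port SageMaker sends requests to.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Toy handler implementing the two routes SageMaker calls."""

    def _send(self, status, body=b"", content_type="application/json"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: 200 tells SageMaker the container accepts requests.
        if self.path == "/ping":
            self._send(200)
        else:
            self._send(404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder "model": echo the request back to the caller.
        self._send(200, json.dumps({"echo": payload}).encode("utf-8"))

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=8080):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` would start the toy server. A real container replaces the echo with a call into the loaded model.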

The container has the model files in the same place that they were written to during training:

    /opt/ml
    `-- model
        `-- <model files>

#### Custom files available to build the container used in this example

The `container` directory has all the components you need to extend the SageMaker PyTorch container to use as a sample algorithm:

    .
    |-- Dockerfile
    `-- code
        |-- inference.py
        `-- requirements.txt

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image for *inference*. More details are provided below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We invoke the script later in this notebook, and you can copy and adapt it for your own algorithms.
* __`code`__ is the directory which contains our user code to be invoked.

In this application, we install and/or update a few libraries for running Llama.cpp in Python

The files that we put in the container are:

* __`inference.py`__ is the program that implements our inference code (used only for inference container)
* __`requirements.txt`__ is the text file that contains additional python packages which will be installed during deployment time

#### The inference Dockerfile

The Dockerfile describes the image that we want to build. We start from the SageMaker PyTorch *inference* image as the base.

So the SageMaker PyTorch ECR image that supports Graviton in this case would be:
* FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker

Note: You can retrieve container image URIs with code such as:
```
from sagemaker import image_uris

image_uris.retrieve('pytorch', 'us-east-1', '2.4', image_scope='inference_graviton')
```

Next, we install the required additional libraries and add the code that implements our specific algorithm to the container, and set up the right environment for it to run under.

Let's look at the Dockerfile for this example.

In [None]:
%%writefile Dockerfile
FROM 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker

RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
    ninja-build \
    cmake \
    libopenblas-dev \
    build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/*
RUN python3 -m pip install --upgrade pip
RUN pip uninstall ninja -y
RUN python3 -m pip install --upgrade huggingface-hub pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings starlette-context
ENV FORCE_CMAKE=1

RUN CMAKE_ARGS="-DCMAKE_CXX_FLAGS='-mcpu=native -fopenmp' -DCMAKE_C_FLAGS='-mcpu=native -fopenmp'" python3 -m pip install llama-cpp-python --verbose

In [None]:
!awk -v region="$AWS_REGION" '{gsub(/<region>/, region)}1' Dockerfile > Dockerfile.tmp && mv Dockerfile.tmp Dockerfile
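
The `awk` command above relies on `AWS_REGION` being set in the shell environment. If it is not, the same substitution could be done in Python with the `region` variable defined earlier, as in this sketch. `fill_region` is a hypothetical helper name.

```python
from pathlib import Path


def fill_region(dockerfile_path, region):
    """Replace the <region> placeholder in a Dockerfile, in place."""
    path = Path(dockerfile_path)
    text = path.read_text()
    path.write_text(text.replace("<region>", region))
    return path


# Usage in this notebook (assumes `region` was set in the preparation cell):
# fill_region("Dockerfile", region)
```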

### Permissions

Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions. This is because it uses `codebuild` to build the container image and creates new repositories in Amazon ECR. You can add the inline policy below to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions are available immediately.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "codebuild:BatchGetProjects",
                "iam:PassRole",
                "iam:DeleteRolePolicy",
                "iam:ListAttachedRolePolicies",
                "codebuild:ListBuilds",
                "iam:CreateRole",
                "iam:DeleteRole",
                "codebuild:StartBuild",
                "iam:PutRolePolicy",
                "iam:ListRolePolicies",
                "codebuild:CreateProject",
                "codebuild:BatchGetBuilds"
            ],
            "Resource": "*"
        }
    ]
}
```

### Building and registering the inference container

The following shell script builds the container image using `codebuild` and pushes it to ECR using `docker push`. We use `codebuild` rather than building locally in the notebook because Graviton-compatible Docker images need to be built on a Graviton (ARM64) instance, and `codebuild` provides that compute environment for the Docker build.

This code looks for an ECR repository in the account you're using and the current default region. If the repository doesn't exist, the script will create it. In addition, since we are using the SageMaker PyTorch image as the base, we will need to retrieve ECR credentials to pull this public image.

Note that the role used by `codebuild` needs permission to push images to the ECR registry.
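
The contents of `build_and_push.sh` are not reproduced here. As a rough sketch of the look-up-or-create step described above, the same logic could be written with the boto3 ECR client as follows. `ensure_ecr_repository` is a hypothetical helper, and the repository name in the usage comment is taken from the image URI used later in this notebook.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it is missing."""
    try:
        response = ecr_client.describe_repositories(
            repositoryNames=[repository_name]
        )
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet in this account and region.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]


# Usage (requires AWS credentials with ECR permissions):
# import boto3
# uri = ensure_ecr_repository(boto3.client("ecr"), "llama-cpp-python")
```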

In [None]:
!bash build_and_push.sh

### Writing your own inference script (inference.py)

Because we use a pre-packaged SageMaker PyTorch container, the inference script only needs to define the following template functions:
- `model_fn()` loads the model from the model directory, which SageMaker populates from the model artifacts stored in S3.
- `input_fn()` formats the data received from a request made to the endpoint.
- `predict_fn()` runs inference with the model returned by `model_fn()` on the output of `input_fn()`.

Optionally, an `output_fn()` can be defined to format the response, using the output of `predict_fn()`.
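
To make the call order concrete, the following sketch chains these functions the way a model server would for a single request. It uses a stub in place of a real model so it can run anywhere. `StubModel` and `handle` are hypothetical names for this example, and the simplified signatures omit the optional `context` argument.

```python
import json


class StubModel:
    """Stand-in for the real model; returns a canned chat completion."""

    def create_chat_completion(self, messages, **kwargs):
        last = messages[-1]["content"]
        return {
            "choices": [
                {"message": {"role": "assistant", "content": f"echo: {last}"}}
            ]
        }


def model_fn(model_dir):
    return StubModel()  # a real script would load weights from model_dir


def input_fn(request_body, request_content_type):
    return json.loads(request_body)


def predict_fn(input_data, model):
    return model.create_chat_completion(**input_data)


def output_fn(prediction, response_content_type):
    return json.dumps(prediction)


def handle(request_body, model, content_type="application/json"):
    """Chain the hooks in the order they run for one request."""
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, content_type)


model = model_fn("/opt/ml/model")  # loaded once, then reused across requests
body = json.dumps({"messages": [{"role": "user", "content": "hello"}]})
result = json.loads(handle(body, model))
```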

In [None]:
%%writefile code/inference.py
import json
import logging
import os
from llama_cpp import Llama
from multiprocessing import cpu_count

worker_count = os.environ.get('SAGEMAKER_MODEL_SERVER_WORKERS', cpu_count())
model_file = os.environ.get('MODEL_FILE_GGUF', 'DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf')

def input_fn(request_body, request_content_type, context):
    return json.loads(request_body)

def model_fn(model_dir):
    model=Llama(
        model_path=f"{model_dir}/{model_file}",
        verbose=False,
        n_threads=cpu_count() // int(worker_count) # Graviton has 1 vCPU = 1 physical core
    )
    logging.info("Loaded model successfully")
    return model

def predict_fn(input_data, model, context):
    response = model.create_chat_completion(
        **input_data
    )
    return response

def output_fn(prediction, response_content_type, context):
    return json.dumps(prediction)

In [None]:
# from huggingface_hub import hf_hub_download
# hf_hub_download(repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF", filename="Llama-3.2-3B-Instruct-Q4_0.gguf", local_dir='./code')
from huggingface_hub import hf_hub_download
file_name="DeepSeek-R1-Distill-Llama-8B-Q4_0.gguf"

hf_hub_download(repo_id="bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF", filename=file_name, local_dir='./code')

Normally you would compress model files into a tar file. However, this can lengthen startup time because large files must be downloaded and then untarred. To improve startup times, SageMaker AI supports uncompressed model artifacts, which removes the need to untar large files.

We upload all our files to an S3 prefix and then pass the location to the model with `"CompressionType": "None"`.

In [None]:
model_data = sagemaker_session.upload_data(f'./code/{file_name}', key_prefix=f'{prefix}-llama-cpp-model')
script = sagemaker_session.upload_data('./code/inference.py', key_prefix=f'{prefix}-llama-cpp-model')  # keep model_data pointing at the GGUF file

In [None]:
model_path = f"s3://{sagemaker_session.default_bucket()}/{prefix}-llama-cpp-model/"
model_path

In [None]:
from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel


pytorch_model = PyTorchModel(model_data={
                                "S3DataSource": {
                                    "S3Uri": model_path,
                                    "S3DataType": "S3Prefix",
                                    "CompressionType": "None",
                                }
                            },
                             role=role,
                             env={
                                 'MODEL_FILE_GGUF':file_name
                             },
                             image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
                             model_server_workers=2
)

predictor = pytorch_model.deploy(instance_type='ml.c7g.12xlarge', initial_instance_count=1)

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

We can use the SageMaker Python SDK to invoke the endpoint as shown below:

In [None]:
%%time
prompt = {
            'messages':[
                {"role": "user", "content": "Who won the world series in 2020"}
            ],
    'repeat_penalty': 1.1,
    'temperature': 0.1
}
predictor.predict(prompt)

In [None]:
endpoint_name = predictor.endpoint_name
print(endpoint_name)

You can also invoke the endpoint with the low-level API, using the boto3 SageMaker runtime client:

In [None]:
client = boto3.client('sagemaker-runtime')

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
print(response['Body'].read().decode("utf-8"))
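
The response body is the JSON produced by `create_chat_completion`, which follows the OpenAI-style chat completion layout (`choices[0]["message"]["content"]`). The sketch below pulls the assistant text out of that JSON. It also assumes, as is typical for DeepSeek R1 distilled models but not guaranteed, that any reasoning is wrapped in `<think>...</think>` tags. `split_reasoning` is a hypothetical helper.

```python
import json
import re


def split_reasoning(response_body):
    """Return (reasoning, answer) from a chat-completion JSON string."""
    completion = json.loads(response_body)
    content = completion["choices"][0]["message"]["content"]
    match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
    return reasoning, answer
```

Because the `Body` stream can only be read once, pass the decoded string from a fresh `invoke_endpoint` call, or store the decoded body in a variable before printing it.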

### Inference Recommender
SageMaker Inference Recommender is the capability of SageMaker that reduces the time required to get machine learning (ML) models in production by automating load tests and optimizing model performance across instance types. You can use Inference Recommender to select a real-time inference endpoint that delivers the best performance at the lowest cost.

You can start an Inference Recommender job in minutes and get an optimized endpoint configuration within hours, instead of spending weeks on manual testing and tuning.

Inference Recommender uses metadata about your ML model to recommend the best instance types and endpoint configurations for deployment. You can provide as much or as little information as you'd like but the more information you provide, the better your recommendations will be.

ML Frameworks: `TENSORFLOW, PYTORCH, XGBOOST, SAGEMAKER-SCIKIT-LEARN`

ML Domains: `COMPUTER_VISION, NATURAL_LANGUAGE_PROCESSING, MACHINE_LEARNING`

Example ML Tasks: `CLASSIFICATION, REGRESSION, IMAGE_CLASSIFICATION, OBJECT_DETECTION, SEGMENTATION, MASK_FILL, TEXT_CLASSIFICATION, TEXT_GENERATION, OTHER`

Note: Select the task that is the closest match to your model. Choose `OTHER` if none apply.

First, we need to create an archive that contains individual files that Inference Recommender can send to your SageMaker endpoints. Inference Recommender randomly samples files from this archive, so make sure it contains a distribution of payloads similar to what you expect in production. Note that your inference code must be able to read the file formats used in the sample payloads.
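
The cells that follow build an archive with a single payload using shell commands. As an alternative sketch for the case where you want several distinct payloads in the archive, Python's `tarfile` module can write each one as its own file. `build_payload_archive` and the member file names are hypothetical choices for this example.

```python
import io
import json
import tarfile


def build_payload_archive(payloads, archive_path="payload.tar.gz"):
    """Write each payload as its own JSON file in a gzip-compressed tar."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for index, payload in enumerate(payloads):
            data = json.dumps(payload).encode("utf-8")
            info = tarfile.TarInfo(name=f"samplepayload_{index}.json")
            info.size = len(data)  # required so the member is not empty
            tar.addfile(info, io.BytesIO(data))
    return archive_path


# Usage with illustrative prompts (replace with production-like requests):
# build_payload_archive([
#     {"messages": [{"role": "user", "content": "Who won the world series in 2020"}]},
#     {"messages": [{"role": "user", "content": "Summarize what GGUF is."}]},
# ])
```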

In [None]:
raw = predictor.serializer.serialize({'messages':[
        {"role": "user", "content": "Who won the world series in 2020"},
    ]})

In [None]:
import json
json_raw = json.dumps(raw)
!echo {json_raw} > samplepayload

In [None]:
!cat samplepayload

In [None]:
!tar -czf payload.tar.gz samplepayload

Next, we'll upload the payload archive (payload.tar.gz) that was created above to S3. The S3 location will be used as input to our Inference Recommender job later in this notebook.

In [None]:
payload = sagemaker_session.upload_data('./payload.tar.gz', key_prefix=f'{prefix}-llama-cpp-python-payload')

#### Run an Inference Recommendations Job

The SageMaker Python SDK exposes Inference Recommender through the model's `.right_size()` method.

In [None]:
from sagemaker.parameter import CategoricalParameter
from sagemaker.inference_recommender import Phase, ModelLatencyThreshold


pytorch_model.right_size(payload, 
                         supported_content_types=['application/json'],
                         supported_instance_types=['ml.c7g.8xlarge', 'ml.c7g.12xlarge'],
                         framework='PYTORCH',
                         job_duration_in_seconds=3600,
                         hyperparameter_ranges=[{
                             'instance_types': CategoricalParameter(['ml.c7g.8xlarge', 'ml.c7g.12xlarge']),
                             'SAGEMAKER_MODEL_SERVER_WORKERS': CategoricalParameter(["1", "2", "4",])
                         }],
                         phases=[Phase(120, 1, 1), Phase(120, 2, 1), Phase(120, 7, 1)],
                         traffic_type='PHASES',
                         model_latency_thresholds=[ModelLatencyThreshold('P99', 50000)],
                         max_invocations=120,
                         log_level="Quiet"
                        )

Once the inference recommender job has finished, you can navigate to the SageMaker AI console to check the job results.

Each inference recommendation includes `InstanceType`, `InitialInstanceCount`, and `EnvironmentParameters`, which are environment variables tuned for better performance. It also includes performance and cost metrics such as `MaxInvocations`, `ModelLatency`, `CostPerHour`, and `CostPerInference`. These metrics can help you narrow down to a specific endpoint configuration that suits your use case.

For example:

- If your priority is overall price-performance with an emphasis on throughput, focus on the `CostPerInference` metric.
- If your priority is a balance between latency and throughput, focus on the `ModelLatency` and `MaxInvocations` metrics.

| Metric | Description |
| --- | --- |
| ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. <br /> Units: Microseconds |
| MaxInvocations | The maximum number of InvokeEndpoint requests sent to a model endpoint. <br /> Units: None |
| CostPerHour | The estimated cost per hour for your real-time endpoint. <br /> Units: US Dollars |
| CostPerInference | The estimated cost per inference for your real-time endpoint. <br /> Units: US Dollars |
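
As a sketch of how these metrics might be combined, the function below picks the cheapest recommendation per inference among those that meet a latency limit. It assumes each recommendation is a dictionary with a `Metrics` entry holding the fields named above and an `EndpointConfiguration` entry holding `InstanceType`. The sample values are made up for illustration and are not results from this notebook.

```python
def pick_cheapest(recommendations, max_latency):
    """Return the lowest CostPerInference entry within the latency limit.

    max_latency must use the same units as the ModelLatency values.
    Returns None when no recommendation meets the limit.
    """
    eligible = [
        rec for rec in recommendations
        if rec["Metrics"]["ModelLatency"] <= max_latency
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda rec: rec["Metrics"]["CostPerInference"])


# Hypothetical recommendations, for illustration only.
sample_recommendations = [
    {"EndpointConfiguration": {"InstanceType": "ml.c7g.8xlarge"},
     "Metrics": {"ModelLatency": 42000, "CostPerInference": 0.0040}},
    {"EndpointConfiguration": {"InstanceType": "ml.c7g.12xlarge"},
     "Metrics": {"ModelLatency": 30000, "CostPerInference": 0.0052}},
]
best = pick_cheapest(sample_recommendations, max_latency=50000)
```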

### Cleanup

In [None]:
predictor.delete_endpoint()

## Reference
- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [SageMaker multi-model endpoint bring your own container](https://github.com/aws/amazon-sagemaker-examples/tree/f671af53c3f7c77172e5803a4ff5a3ea8672ecb6/%20%20%20%20%20deploy_and_monitor/sm-multi_model_endpoint_bring_your_own_container)
