# Deploy TinyLlama Model on AWS Inferentia2 using AWS SageMaker

We will deploy the TinyLlama 1.1B model with DJL Serving, using the AWS Large Model Inference (LMI) container. The model is hosted on a SageMaker endpoint backed by an Inferentia2 instance.

### 1. Setup SageMaker

In [2]:
import sagemaker
from sagemaker import Model, image_uris, serializers

In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

### 2. Prepare Model Serving Artifacts

- We use DJL Serving as the model serving framework. DJL Serving needs a configuration file, *serving.properties*, for model deployment. It includes the following settings:
  - option.entryPoint: the Python handler module that serves the model
  - option.model_id: Hugging Face model ID or S3 path that stores the model
  - option.batch_size: model inference batch size
  - option.neuron_optimize_level: Neuron compiler optimization level, e.g., 1 for fast compilation and 3 for best performance
  - option.tensor_parallel_degree: number of NeuronCores to be used
  - option.load_in_8bit: enable/disable int8 weight quantization to reduce the memory footprint
  - option.n_positions: maximum sequence length
  - option.dtype: data type of weights and activations
  - option.model_loading_timeout: model loading timeout in seconds

- In this example, the model serving framework (DJL Serving) will pull the model from Hugging Face model hub according to *model_id* in the *serving.properties* file.

In [4]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.model_id=TinyLlama/TinyLlama-1.1B-Chat-V0.3
option.batch_size=1
option.neuron_optimize_level=1
option.tensor_parallel_degree=2
option.load_in_8bit=false
option.n_positions=256
option.dtype=fp16
option.model_loading_timeout=1500

Writing serving.properties
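Before packaging, it can help to sanity-check the file contents. The sketch below is a minimal illustration, not DJL Serving's own parser: `parse_properties` and `SAMPLE` are hypothetical names, and the parser only handles plain `key=value` lines with `#` comments.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


# A shortened copy of the configuration above, inlined so the sketch
# runs without touching the file system.
SAMPLE = """\
engine=Python
option.entryPoint=djl_python.transformers_neuronx
option.batch_size=1
option.tensor_parallel_degree=2
option.n_positions=256
"""

props = parse_properties(SAMPLE)
tp_degree = int(props["option.tensor_parallel_degree"])
print(f"tensor_parallel_degree={tp_degree}, n_positions={props['option.n_positions']}")
```

In a notebook, the same function could be applied to `open("serving.properties").read()` right after the `%%writefile` cell.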


Note: You can also experiment with other values in *serving.properties*, for example `option.batch_size=4` for a larger batch size, or `option.load_in_8bit=true` to enable `int8` weight quantization.
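To get a feel for what `int8` weight quantization saves, here is a back-of-the-envelope estimate of the weight storage. It assumes a nominal 1.1 billion parameters (taken from the model name) and counts weights only; the KV cache, activations, and runtime overhead are ignored, and a real quantized model may keep some tensors in higher precision. `weight_footprint_gib` is a hypothetical helper.

```python
def weight_footprint_gib(num_params, bytes_per_param):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 1.1e9  # assumption: nominal parameter count from the model name

fp16_gib = weight_footprint_gib(NUM_PARAMS, 2)  # fp16: 2 bytes per weight
int8_gib = weight_footprint_gib(NUM_PARAMS, 1)  # int8: 1 byte per weight

print(f"fp16 weights: ~{fp16_gib:.2f} GiB")
print(f"int8 weights: ~{int8_gib:.2f} GiB")
```

Under these assumptions the weights take roughly 2 GiB in `fp16` and about half that in `int8`.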

- Package the configuration file in a tarball

In [5]:
%%sh
mkdir -p mycode
mv serving.properties mycode/
tar czvf mycode.tar.gz mycode/

mycode/
mycode/serving.properties
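If a shell is not available, the same packaging step can be sketched with Python's standard `tarfile` module. This is an illustrative equivalent of the `tar czvf` command above, run in a temporary directory so it does not touch the notebook's files; `make_tarball` is a hypothetical helper.

```python
import os
import tarfile
import tempfile


def make_tarball(src_dir, out_path):
    """Create a gzip-compressed tarball of src_dir and return its member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps only the top-level folder name inside the archive
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    code_dir = os.path.join(tmp, "mycode")
    os.makedirs(code_dir)
    # Placeholder content; the real file is the serving.properties written earlier.
    with open(os.path.join(code_dir, "serving.properties"), "w") as f:
        f.write("engine=Python\n")
    names = make_tarball(code_dir, os.path.join(tmp, "mycode.tar.gz"))

print(names)
```

Listing the member names is a cheap way to confirm that `serving.properties` sits under the `mycode/` folder, matching the layout produced by the shell command.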


- Upload the tarball of the model serving artifacts to the default SageMaker S3 bucket

In [6]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mycode.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-474844183433/large-model-lmi/code/mycode.tar.gz


### 3. Set the DJL-NeuronX container

In [None]:
instance_type = "ml.inf2.xlarge"
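The `option.tensor_parallel_degree` value in *serving.properties* should not exceed the number of NeuronCores on the chosen instance. The check below is a small sketch: the `NEURON_CORES` table is an assumption based on published Inf2 instance specifications (two NeuronCores per Inferentia2 accelerator) and should be verified against current AWS documentation; `check_tp_degree` is a hypothetical helper.

```python
# Assumed NeuronCore counts per Inf2 instance size; verify against AWS docs.
NEURON_CORES = {
    "ml.inf2.xlarge": 2,
    "ml.inf2.8xlarge": 2,
    "ml.inf2.24xlarge": 12,
    "ml.inf2.48xlarge": 24,
}


def check_tp_degree(instance_type, tp_degree):
    """Raise if tp_degree exceeds the NeuronCores available on instance_type."""
    cores = NEURON_CORES[instance_type]
    if tp_degree > cores:
        raise ValueError(
            f"tensor_parallel_degree={tp_degree} exceeds {cores} NeuronCores "
            f"on {instance_type}"
        )
    return cores


available = check_tp_degree("ml.inf2.xlarge", 2)
print(f"ml.inf2.xlarge: {available} NeuronCores available, using 2")
```

With the settings used here (`ml.inf2.xlarge` and a tensor parallel degree of 2), the check passes and both NeuronCores are used.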

In [7]:
inference_image_uri = image_uris.retrieve(
    framework="djl-neuronx", region=region, version="0.25.0", instance_type=instance_type
)
inference_image_uri

### 4. Create SageMaker Endpoint

- Create a SageMaker endpoint backed by an Inferentia2 instance

In [8]:
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

- Create a model wrapper that combines the Docker container image and the model serving artifacts
- Model deployment typically takes about 5 to 7 minutes, because the model is compiled during the process

In [9]:
import time
t0 = time.time()
model = Model(image_uri=inference_image_uri, model_data=code_artifact, role=role)
model._is_compiled_model = True  # private SDK flag: tell SageMaker the model is compiled (the Neuron compiler handles this)
model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             container_startup_health_check_timeout=900,
             volume_size=256,
             endpoint_name=endpoint_name)
t1 = time.time()
print(f"elapsed time: {t1-t0}")

---------!elapsed time: 302.71287631988525


- Create a predictor to submit inference requests and receive responses
- Requests and responses are in JSON format

In [10]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer()
)

### 5. Inference test
- Submit an inference request to the model server and receive the inference result

In [14]:
prompts = ["tell me a story of the little red riding hood"]
results = predictor.predict(
    {"inputs": prompts, "parameters": {"max_new_tokens": 256, "do_sample": True}}
)

In [15]:
import json
for result in json.loads(results):
    generated_text = result['generated_text']
    print(f"{generated_text}\n")

tell me a story of the little red riding hood. And this is how it happened...
The little red riding hood was walking to the forest with her grandmother. The forest was very beautiful, with tall trees and bright flowers.
Little Riding Hood heard a rustling sound and turned around. She was surprised to see a huge wolf howling at her. She thought that was only a wild animal.
"You there! You scared me just walk away," the wolf growled.
Little Riding Hood stepped towards the wolf. She couldn't help but feel a mixture of anger and excitement at her situation.
The wolf lunged forward and hit her hard across the face. It was a surprise attack! Little Riding Hood lost her words when she struck the wolf. Blood was streaming down her face.
The wolf howled as he realised that little Riding Hood wasn't a threat. He let go of her and continued his way towards home. Little Riding Hood felt a little disappointed, but also relieved.
But as she was walking back home, she heard a scream. It was her grand
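The request and response shapes used above can be exercised offline, without a live endpoint. The sketch below is illustrative only: `build_payload` and `extract_texts` are hypothetical helpers, and `fake_response` is a made-up body that mimics the list of `generated_text` objects parsed in the previous cell.

```python
import json


def build_payload(prompts, max_new_tokens=256, do_sample=True):
    """Build the request body in the shape sent to the endpoint above."""
    return {
        "inputs": list(prompts),
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": do_sample},
    }


def extract_texts(raw_body):
    """Pull generated_text out of a JSON list response (bytes or str)."""
    return [item["generated_text"] for item in json.loads(raw_body)]


payload = build_payload(["tell me a story of the little red riding hood"])
body = json.dumps(payload)  # JSON string form of the request

# Made-up response body, for illustration only.
fake_response = b'[{"generated_text": "tell me a story. Once upon a time"}]'
texts = extract_texts(fake_response)
print(texts[0])
```

Keeping `do_sample` as a real boolean means it serializes to JSON `true` rather than the string `"true"`.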

### 6. Clean up the environment

In [16]:
cleanup = False
if cleanup:
    sess.delete_endpoint(endpoint_name)
    sess.delete_endpoint_config(endpoint_name)
    model.delete_model()