# Deploy the Falcon-40B-Instruct model to a SageMaker endpoint using the Hugging Face LLM container

In this notebook, we deploy the Falcon-40B-Instruct model using the Hugging Face LLM inference container provided by SageMaker.

## Install the latest version of sagemaker and the additional libraries needed

In [None]:
!pip uninstall -q -y sagemaker
!pip install -q sagemaker
!pip install -q pyyaml

## Required imports and variables

Import the necessary packages and get the execution role and the SageMaker session object.

In [None]:
import sagemaker
import boto3

sm_client = boto3.client('sagemaker')

role = sagemaker.get_execution_role()

sess = sagemaker.Session()

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


## Fetch SageMaker HuggingFace Large Language Model Container image URI

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"Falcon40B ECR image uri hosted in AWS: {llm_image}")


## Create the HuggingFaceModel object for Falcon-40B-Instruct from the Hugging Face model hub

Now we create the `HuggingFaceModel` object and set the values used for deployment:
- the instance type on which the model is deployed
- the number of GPUs to use per replica of the model, which depends on the number of GPUs available on the chosen instance type
- the ECR image URI of the inference container used to serve the Falcon-40B model
- the execution role
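The GPU count can be sanity-checked with rough memory arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter) and 24 GB of memory per GPU on `ml.g5` instances; it ignores the KV cache and activations, so it is only a lower bound. The `GPUS_PER_INSTANCE` table and the helper names are illustrative and not part of the SageMaker SDK.

```python
# Illustrative lookup: (number of GPUs, GB of memory per GPU) for a few instance types.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": (1, 24),
    "ml.g5.12xlarge": (4, 24),
    "ml.g5.48xlarge": (8, 24),
}


def weights_gb(params_billion, bytes_per_param=2):
    """Approximate size of the model weights in GB (assumes 16-bit weights by default)."""
    return params_billion * bytes_per_param


def weights_fit(instance_type, params_billion, bytes_per_param=2):
    """True if the weights alone fit in the combined GPU memory of the instance."""
    gpus, gb_per_gpu = GPUS_PER_INSTANCE[instance_type]
    return weights_gb(params_billion, bytes_per_param) <= gpus * gb_per_gpu


# Roughly 80 GB of weights against 4 x 24 GB = 96 GB on ml.g5.12xlarge
print(weights_gb(40), weights_fit("ml.g5.12xlarge", 40))
```

Under these assumptions a single-GPU `ml.g5.2xlarge` cannot hold the weights, which is why the model is sharded across all four GPUs of the `ml.g5.12xlarge`.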

Apart from this, we also specify the configuration for [text-generation-inference](https://github.com/huggingface/text-generation-inference) via a config dictionary, which is passed to the container as environment variables. The supported configurations are listed in [sagemaker-entrypoint.sh](https://github.com/huggingface/text-generation-inference/blob/main/sagemaker-entrypoint.sh).
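Before creating the model, it can help to check the config dictionary for common mistakes. The following is a minimal sketch; `check_tgi_env` is a hypothetical helper, not part of the SageMaker SDK or text-generation-inference. It relies on two facts: container environment variable values must be strings (hence the `json.dumps` calls in this notebook), and text-generation-inference requires the maximum input length to be smaller than the maximum total tokens.

```python
import json


def check_tgi_env(env, gpus_on_instance):
    """Return a list of problems found in the env dict (empty if none were found)."""
    # Every environment variable value must be a string
    problems = [f"{key} is not a string" for key, value in env.items()
                if not isinstance(value, str)]
    # The model cannot be sharded over more GPUs than the instance has
    if int(env["SM_NUM_GPUS"]) > gpus_on_instance:
        problems.append("SM_NUM_GPUS exceeds the GPUs on the instance")
    # The input budget must leave room for at least one generated token
    if int(env["MAX_INPUT_LENGTH"]) >= int(env["MAX_TOTAL_TOKENS"]):
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return problems


example_env = {
    "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",
    "SM_NUM_GPUS": json.dumps(4),
    "MAX_INPUT_LENGTH": json.dumps(1024),
    "MAX_TOTAL_TOKENS": json.dumps(2048),
}
print(check_tgi_env(example_env, gpus_on_instance=4))
```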

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# SageMaker deployment settings
instance_type = "ml.g5.12xlarge"  # instance with 4 GPUs
number_of_gpu = 4                 # GPUs used per replica of the model
health_check_timeout = 300        # seconds; increase if the model takes longer to load

# Define model and endpoint configuration parameters
# (environment variable values must be strings, hence json.dumps)
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of the input text, in tokens
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max total tokens (input text plus generated text)
}


llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)


## Deploy model to an endpoint

In [None]:
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout
)
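Once the endpoint is in service, it can be queried with a JSON payload in the format used by text-generation-inference: an `inputs` string plus an optional `parameters` dictionary. The sketch below only builds the payload; `build_payload`, the prompt, and the parameter values are illustrative choices, not settings taken from this notebook. The `llm.predict` call is left commented out because it needs the live endpoint.

```python
def build_payload(prompt, max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Build a request body for the text-generation-inference container."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,                 # sample instead of greedy decoding
            "max_new_tokens": max_new_tokens,  # input tokens + this must stay within MAX_TOTAL_TOKENS
            "temperature": temperature,
            "top_p": top_p,
            "return_full_text": False,         # return only the generated continuation
        },
    }


payload = build_payload("Explain what a SageMaker endpoint is in one sentence.")

# With the endpoint deployed above, the request would look like this:
# response = llm.predict(payload)
# print(response[0]["generated_text"])
```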


## Write the endpoint name to the configuration file read by the Streamlit app

In [None]:
import yaml

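# Store the endpoint name so that the Streamlit app can look it up
dict_file = {'endpoint_name': llm.endpoint_name}

with open('../endpoint_config.yaml', 'w') as file:
    yaml.dump(dict_file, file)  # writes to the file; returns None when a stream is given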
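On the consuming side, the app can read the endpoint name back from the same file. This is a minimal sketch of how that might look; `load_endpoint_name` is a hypothetical helper, and the actual Streamlit app may load the file differently.

```python
import yaml


def load_endpoint_name(path="../endpoint_config.yaml"):
    """Read the endpoint name written by the notebook."""
    with open(path) as file:
        # safe_load parses plain YAML without constructing arbitrary Python objects
        return yaml.safe_load(file)["endpoint_name"]
```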

## Delete the endpoint

In [None]:
response = sm_client.delete_endpoint(
    EndpointName=llm.endpoint_name
)
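The call above removes only the endpoint; the endpoint configuration and the model created during deployment remain in the account. As an alternative to that cell (not in addition to it), the predictor returned by `deploy` can clean up all three. The sketch below wraps the two predictor methods in a hypothetical `cleanup` helper.

```python
def cleanup(predictor):
    """Delete the model, the endpoint, and the endpoint configuration."""
    # Delete the model first, while the endpoint still exists to look it up
    predictor.delete_model()
    # Then delete the endpoint together with its configuration
    predictor.delete_endpoint(delete_endpoint_config=True)


# With the predictor from this notebook:
# cleanup(llm)
```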