# How to deploy Llama 2 LLM to Amazon SageMaker using HuggingFace LLM DLC

> This notebook is based on [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm) from the Hugging Face blog. However, because the Llama 2 models are gated, you need additional steps and settings to gain access and deploy them.

> It is highly recommended to run this notebook in SageMaker Studio (Data Science 2.0 image; a t3.medium instance is sufficient) with proper permissions to use SageMaker services. If you are running it locally or elsewhere, set up a proper AWS credential profile in your environment first.

## TL;DR

The purpose of this notebook is to provide guidance on deploying the Llama 2 open-source models using the SageMaker real-time inference service. The AWS Machine Learning blog post [Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/) covers deployment through SageMaker JumpStart, but the Llama 2 family of models is currently only available in Amazon SageMaker Studio in the `us-east-1`, `us-west-2`, `eu-west-1` and `ap-southeast-1` regions. If you are planning to deploy the model(s) in other regions, this notebook is a good reference for you.

The example covers:
1. [Apply Llama 2 models access](#1-apply-llama-2-models-access)
2. [Setup development environment](#2-setup-development-environment)
3. [Retrieve the new Hugging Face LLM DLC](#3-retrieve-the-new-hugging-face-llm-dlc)
4. [Deploy llama 2 to Amazon SageMaker](#4-deploy-open-assistant-12b-to-amazon-sagemaker)
5. [Run inference and chat with our model](#5-run-inference-and-chat-with-our-model)

## What is Hugging Face LLM Inference DLC?

Hugging Face LLM DLC is a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. 
Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:
* Tensor Parallelism and custom CUDA kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)
* Logits warpers (temperature scaling, top-k, repetition penalty ...)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
* Stop sequences, Log probabilities
* Token streaming using Server-Sent Events (SSE)
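
To make the "logits warpers" item above more concrete, here is a minimal sketch of temperature scaling followed by top-k filtering on a toy list of logits. It is plain Python for illustration only and is not TGI's implementation; the function names `warp_logits` and `softmax` and the example values are assumptions.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=None):
    """Apply temperature scaling, then mask everything outside the top_k logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperatures below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    if top_k is not None and 0 < top_k < len(scaled):
        # The k-th largest value is the cut-off; ties at the cut-off are all kept.
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else -math.inf for x in scaled]
    return scaled


def softmax(logits):
    """Turn logits into probabilities; masked (-inf) entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: four candidate tokens, keep only the two most likely ones.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(warp_logits(logits, temperature=0.7, top_k=2))
print(probs)
```

With `top_k=2`, the two lowest-scoring tokens end up with probability zero, so sampling can only pick one of the two remaining tokens.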

Officially supported model architectures are currently: 
* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala) - ***llama 2 is available as of now***
* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)
* [Galactica](https://huggingface.co/facebook/galactica-120b)
* [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)
* [Starcoder](https://huggingface.co/bigcode/starcoder) / [SantaCoder](https://huggingface.co/bigcode/santacoder)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and Inference API for LLM models on the Hugging Face Hub. 

## 1. Apply Llama 2 models access

### Llama access application steps

> To use the Hugging Face Model Hub, make sure you use exactly the same email address on the Meta model access application as on your Hugging Face account.

1. Apply for access via the [Meta Llama Access Form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). Once you submit the request, you may receive an email confirming it.
   * Within 1-2 days, Meta may send you an email with model access instructions. It contains a download link to be used with [download.sh](https://github.com/facebookresearch/llama/blob/main/download.sh). If the link later expires, you can re-submit the request to get a new link.
   * Once you are approved, you have two options for obtaining the model for deployment with SageMaker real-time inference:
     * [Preferred] Use the SageMaker LLM DLC with the Hugging Face Model Hub. (This notebook follows this track.)
     * Download the model artifacts, package and upload them to an S3 bucket, and then use SageMaker real-time inference. (This option may suit organizations with highly regulated cloud security and no internet access; I will follow up with another blog on security practices for the SageMaker real-time inference service.)

![meta llama access application](./images/meta_llama_access_application.png)

2. Apply for access on the Hugging Face Model Hub. Because the Meta Llama models are gated, you have to sign up for a Hugging Face account and request access on the model page, e.g. [Llama 2 7B chat-hf model](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the page for the fine-tuned LLMs called Llama-2-Chat (optimized for dialogue use cases).

![huggingface model hub meta llama access application](./images/hugging_face_llama_application.png)

  * Once Meta approves your request, the Hugging Face request should be approved shortly afterwards.

3. Generate a READ access token

  * Go to [Access Tokens Setting](https://huggingface.co/settings/tokens) and generate a READ access token to use for the model deployment later.
  * Run the shell command below to create the `.env` file. (This avoids hardcoding your token in the notebook, which could leak the credential if you commit or share the notebook in a public repository.)
  * Then copy your token and add it to the `.env` file.


In [1]:
!echo "HF_API_TOKEN=" > ./.env
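
Once the token is in the `.env` file, a later step needs to read it back. The following is a minimal sketch using only the standard library; the helper names `load_env_file` and `get_hf_token` are hypothetical, and the assumption is that the file holds plain `KEY=VALUE` lines such as the `HF_API_TOKEN=` entry created above. A package such as `python-dotenv` could be used instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Return HF_API_TOKEN from the environment or the .env file; fail if it is empty."""
    token = os.environ.get("HF_API_TOKEN") or load_env_file(path).get("HF_API_TOKEN", "")
    if not token:
        raise ValueError(f"HF_API_TOKEN is empty; add your READ token to {path}")
    return token
```

Calling `get_hf_token()` right after creating the file fails with a clear message until the token has been pasted in, which catches a forgotten step before any deployment starts.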

## 2. Setup development environment

We are going to use the `sagemaker` Python SDK to deploy the [Llama 2 7B chat-hf model](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) to Amazon SageMaker real-time inference. We need an AWS account configured and the required packages installed.
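
If you run the notebook outside SageMaker Studio, a quick way to see whether an AWS credential profile is configured is sketched below. It uses only the standard library and checks the standard environment variables and the shared credentials file (`~/.aws/credentials` by default); the function name `find_aws_credential_sources` is hypothetical. It does not validate the credentials, and inside SageMaker Studio an empty result is expected because credentials come from the execution role instead.

```python
import configparser
import os
from pathlib import Path


def find_aws_credential_sources(credentials_path=None, environ=None):
    """List where static AWS credentials appear to be configured (no validation)."""
    environ = os.environ if environ is None else environ
    sources = []
    # Static keys exported in the shell.
    if environ.get("AWS_ACCESS_KEY_ID") and environ.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # Named profiles in the shared credentials file (INI format).
    path = Path(
        credentials_path
        or environ.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    if path.is_file():
        parser = configparser.ConfigParser()
        parser.read(path)
        sources.extend(f"profile:{name}" for name in parser.sections())
    return sources
```

An empty list on a local machine suggests that a credential profile still needs to be set up (for example with `aws configure`) before the SDK calls in this notebook can work.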

In [1]:
!pip install -r requirements.txt

Collecting sagemaker==2.173.0 (from -r requirements.txt (line 1))
  Downloading sagemaker-2.173.0.tar.gz (854 kB)
  Preparing metadata (setup.py) ... done
Collecting boto3<2.0,>=1.26.131 (from sagemaker==2.173.0->-r requirements.txt (line 1))
  Obtaining dependency information for boto3<2.0,>=1.26.131 from https://files.pythonhosted.org/packages/46/a7/487512e3328f2566d72aed3b7059fd8dff18c95d9bcbbe16c5ecc13e6fc5/boto3-1.28.17-py3-none-any.whl.metadata
  Downloading boto3-1.28.17-py3-none-any.whl.metadata (6.6 kB)
Collecting cloudpickle==2.2.1 (from sagemaker==2.173.0->-r requirements.txt (line 1))
  Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting google-pasta (from sagemaker==2.173.0->-r requirements.txt (line 1))
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)

## 3. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to the `HuggingFaceModel` class as its `image_uri`. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` function provided by the `sagemaker` SDK. It returns the URI of the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).
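
A minimal sketch of that lookup is shown below, assuming the `sagemaker` SDK from `requirements.txt` is installed. The wrapper name `retrieve_llm_image_uri` is hypothetical; it passes the region explicitly and leaves `version` unset so the SDK chooses its default. To pin a release, pass one of the versions from the list linked above that your installed SDK supports.

```python
def retrieve_llm_image_uri(region, version=None):
    """Look up the Hugging Face LLM DLC image URI for a region (sketch)."""
    # Imported here so the sketch can be read and loaded without the SDK installed.
    from sagemaker.huggingface import get_huggingface_llm_image_uri

    # "huggingface" is the backend name for the TGI-based LLM container.
    return get_huggingface_llm_image_uri(
        "huggingface",
        region=region,
        version=version,
    )


# Example usage (the region is an arbitrary choice for illustration):
# llm_image = retrieve_llm_image_uri("ap-northeast-1")
# print(f"llm image uri: {llm_image}")
```

The returned string is what gets passed as `image_uri` when creating the `HuggingFaceModel`.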