# Chat completion: Run Llama 2 models in SageMaker JumpStart

In [19]:
%pip install --upgrade --quiet sagemaker

Note: you may need to restart the kernel to use updated packages.


***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [39]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (424) from primary with message "{
  "code":424,
  "message":"prediction failure",
  "error":"Need to pass custom_attributes='accept_eula=true' as part of header. This means you have read and accept the end-user-license-agreement (EULA) of the model. EULA can be found in model card description or from https://ai.meta.com/resources/models-and-libraries/llama-downloads/."
}". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/jumpstart-dft-meta-textgeneration-llama-2-7b-f in account 342367142984 for more information.
CPU times: user 25.7 ms, sys: 0 ns, total: 25.7 ms
Wall time: 69.6 ms


In [2]:
(
    model_id,
    model_version,
) = (
    "meta-textgeneration-llama-2-7b-f",
    "*",
)

## Deploy model


In [3]:
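# Deploy the model

from sagemaker.jumpstart.model import JumpStartModel

# Pass the version selected above; "*" resolves to the latest available version
model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = model.deploy()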

-------------!

In [37]:
import sagemaker

# endpoint_name = "meta-textgeneration-llama-2-7b-f-2023-07-29-13-50-12-604"  # Replace with the actual endpoint name

endpoint_name = "jumpstart-dft-meta-textgeneration-llama-2-7b-f"
# instance_type = "ml.m4.xlarge"  # Replace with the instance type used during endpoint creation

# Initialize the SageMaker predictor
sess = sagemaker.Session()
predictor = sagemaker.predictor.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),  # Use JSON format for input
    deserializer=sagemaker.deserializers.JSONDeserializer(),  # Use JSON format for output
)

print(predictor)

Predictor: {'endpoint_name': 'jumpstart-dft-meta-textgeneration-llama-2-7b-f', 'sagemaker_session': <sagemaker.session.Session object at 0x7f203c1987f0>, 'serializer': <sagemaker.base_serializers.JSONSerializer object at 0x7f203c199000>, 'deserializer': <sagemaker.base_deserializers.JSONDeserializer object at 0x7f203c198eb0>}


### Changing instance type
---


Models are supported on the following instance types:

 - Llama 2 7B and 7B-F: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 13B and 13B-F: `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 70B and 70B-F: `ml.g5.48xlarge`, `ml.p4d.24xlarge`

By default, the `JumpStartModel` class selects an instance type available in your region. To use a different one, pass `instance_type` to the `JumpStartModel` constructor:

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`
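
The supported-instance list above can be turned into a quick check that runs before deployment. This is a sketch only: `SUPPORTED_INSTANCE_TYPES`, `model_size`, and `check_instance_type` are hypothetical names, and the table is copied from this notebook rather than queried from SageMaker, so it may go stale.

```python
# Supported instance types, copied from the list above
SUPPORTED_INSTANCE_TYPES = {
    "7b": ["ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
           "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge"],
    "13b": ["ml.g5.12xlarge", "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge"],
    "70b": ["ml.g5.48xlarge", "ml.p4d.24xlarge"],
}


def model_size(model_id: str) -> str:
    """Extract the size token ('7b', '13b' or '70b') from a Llama 2 model ID."""
    for part in model_id.split("-"):
        if part in SUPPORTED_INSTANCE_TYPES:
            return part
    raise ValueError(f"Unrecognized Llama 2 model ID: {model_id}")


def check_instance_type(model_id: str, instance_type: str) -> None:
    """Raise ValueError if the instance type is not listed for this model size."""
    allowed = SUPPORTED_INSTANCE_TYPES[model_size(model_id)]
    if instance_type not in allowed:
        raise ValueError(
            f"{instance_type} is not listed for {model_id}; choose one of {allowed}"
        )


check_instance_type("meta-textgeneration-llama-2-7b-f", "ml.g5.12xlarge")
print(model_size("meta-textgeneration-llama-2-70b-f"))
```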

---

## Invoke the endpoint

***
### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** The model generates text until the output length (excluding the input context) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to appear, and a lower temperature favors high-probability words. As `temperature` approaches 0, decoding becomes greedy. If specified, it must be a positive float.
* **top_p:** At each generation step, the model samples from the smallest set of words whose cumulative probability reaches `top_p`. If specified, it must be a float between 0 and 1.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
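
The constraints above can be checked on the client before a request is sent. The sketch below is one possible validator, not the endpoint's own logic: `validate_parameters` is a hypothetical name, rejecting unknown keys is an assumption, and so is treating the `top_p` range as excluding 0 and including 1.

```python
ALLOWED_PARAMETERS = {"max_new_tokens", "temperature", "top_p"}


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_parameters(parameters: dict) -> dict:
    """Return the parameters unchanged, or raise ValueError on a violation."""
    unknown = set(parameters) - ALLOWED_PARAMETERS
    if unknown:
        # Assumption: keys outside the documented set are treated as mistakes
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")

    if "max_new_tokens" in parameters:
        value = parameters["max_new_tokens"]
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")

    if "temperature" in parameters:
        value = parameters["temperature"]
        if not _is_number(value) or value <= 0:
            raise ValueError("temperature must be a positive float")

    if "top_p" in parameters:
        value = parameters["top_p"]
        # Assumption: 0 is excluded and 1 is allowed
        if not _is_number(value) or not (0 < value <= 1):
            raise ValueError("top_p must be a float between 0 and 1")

    return parameters


checked = validate_parameters({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6})
print(checked)
```

Any subset of the three keys passes, including an empty dictionary.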

***
### Notes
- This model supports only the 'system', 'user' and 'assistant' roles. A dialog starts with an optional 'system' message, followed by 'user' and 'assistant' messages that alternate (u/a/u/a/u...).
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
- To support a 4K context length, this model restricts query payloads to a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
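
The role ordering described in the notes can be checked before a dialog is sent. This is a minimal sketch with two assumptions: the 'system' message is treated as optional, and the dialog is required to end with a 'user' message, following the u/a/u pattern. `validate_dialog` is a hypothetical helper name.

```python
def validate_dialog(dialog: list) -> None:
    """Raise ValueError unless roles follow [system,] user, assistant, user, ..."""
    if not dialog:
        raise ValueError("dialog must contain at least one message")

    # Skip a leading system message, if present
    turns = dialog[1:] if dialog[0]["role"] == "system" else dialog
    if not turns:
        raise ValueError("dialog needs a user message after the system message")

    for idx, msg in enumerate(turns):
        expected = "user" if idx % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError(
                f"turn {idx}: expected role '{expected}', got '{msg['role']}'"
            )

    # Assumption: the model replies to a user message, so the dialog ends with one
    if turns[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")


validate_dialog([
    {"role": "system", "content": "Always answer briefly."},
    {"role": "user", "content": "what is the recipe of mayonnaise?"},
])
print("dialog ok")
```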

***
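
The `max_new_tokens` recommendations in the notes above can be applied automatically. The sketch below caps a requested value by model size and by the remaining context budget. `cap_max_new_tokens` is a hypothetical helper, reading "4K" as 4096 tokens is an assumption, and `prompt_tokens` is an estimate the caller must supply, because no tokenizer is used here.

```python
# Recommended upper bounds from the notes above, keyed by model size
RECOMMENDED_MAX_NEW_TOKENS = {"7b": 1500, "13b": 1000, "70b": 500}
CONTEXT_LIMIT = 4096  # assumption: "4K" total tokens means 4096


def cap_max_new_tokens(model_id: str, requested: int, prompt_tokens: int = 0) -> int:
    """Return `requested`, lowered to the recommended and context-bound limits."""
    size = next(
        (part for part in model_id.split("-") if part in RECOMMENDED_MAX_NEW_TOKENS),
        None,
    )
    if size is None:
        raise ValueError(f"Unrecognized Llama 2 model ID: {model_id}")

    # Keep prompt plus generated tokens strictly below the context limit
    budget = CONTEXT_LIMIT - 1 - prompt_tokens
    if budget <= 0:
        raise ValueError("the prompt leaves no room for generated tokens")

    return min(requested, RECOMMENDED_MAX_NEW_TOKENS[size], budget)


print(cap_max_new_tokens("meta-textgeneration-llama-2-7b-f", 2000))
```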

In [4]:
%pip install --upgrade gradio

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting gradio
  Downloading gradio-3.39.0-py3-none-any.whl (19.9 MB)
Collecting altair<6.0,>=4.2.0 (from gradio)
  Downloading altair-5.0.1-py3-none-any.whl (471 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.100.1-py3-none-any.whl (65 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.3.0 (from gradio)
  Downloading gradio_client-0.3.0-py3-none-any.whl (294 kB)

In [52]:
%pip install --upgrade typing-extensions

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Hyperparameters for the LLM
parameters = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
}

In [56]:
## source
import gradio as gr


def history_to_dialog_format(chat_history: list[tuple[str, str]]) -> list[dict]:
    """Convert Gradio chat history, a list of (user, assistant) pairs, to the dialog format."""
    dialog = []
    for exchange in chat_history:
        # Each exchange holds the user message followed by the assistant reply
        for idx, message in enumerate(exchange):
            role = "user" if idx % 2 == 0 else "assistant"
            dialog.append({"role": role, "content": message})
    return dialog


with gr.Blocks() as demo:
    gr.Markdown("## Llama2 assistant")
    with gr.Column():
        # Set the height in the constructor rather than through .style()
        chatbot = gr.Chatbot(height=800)
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")  # Clears the chat window

    def respond(message, chat_history):
        # Rebuild the whole conversation and append the new user message
        dialog = history_to_dialog_format(chat_history)
        dialog.append({"role": "user", "content": message})
        # Send the request to the endpoint
        llm_response = predictor.predict(
            {"inputs": [dialog], "parameters": parameters},
            custom_attributes="accept_eula=true",
        )
        print(llm_response[0])
        # The batch size is 1, so the reply is the first and only entry
        parsed_response = llm_response[0]["generation"]["content"]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)


Running on local URL:  http://127.0.0.1:7885
Running on public URL: https://007e0575140b0fc262.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




{'generation': {'role': 'assistant', 'content': " Hello! *smiling* It's nice to meet you too! I'm here to help answer any questions you may have, while being as safe, respectful, and honest as possible. Please feel free to ask me anything, and I'll do my best to provide a helpful and socially unbiased response. If a question doesn't make sense or is not factually coherent, I'll explain why instead of answering something not correct. And if I don't know the answer to a question, I'll let you know instead of sharing false information. Is there anything you'd like to know or discuss?"}}
{'generation': {'role': 'assistant', 'content': ' China has a rich and diverse culinary culture, offering a wide variety of delicious dishes. Here are some popular and iconic Chinese foods that you might enjoy:\n\n1. Peking Duck: A classic dish from Beijing, Peking duck is roasted to perfection and served with pancakes, scallions, and hoisin sauce.\n2. Xiaolongbao (Soup Dumplings): These steamed dumplings 
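
The request and response shapes used by the chat cell can be tried without an endpoint. The sketch below flattens Gradio-style (user, assistant) pairs into a dialog, wraps it in a payload with a batch size of 1, and extracts the reply from a response shaped like the output printed above. `pairs_to_dialog` is a hypothetical name, and the history and response contents are made up for illustration.

```python
def pairs_to_dialog(chat_history):
    """Flatten (user, assistant) pairs into role-tagged messages."""
    dialog = []
    for user_msg, assistant_msg in chat_history:
        dialog.append({"role": "user", "content": user_msg})
        dialog.append({"role": "assistant", "content": assistant_msg})
    return dialog


# Made-up history with one completed exchange
history = [("Hello!", "Hi, how can I help?")]
dialog = pairs_to_dialog(history)
dialog.append({"role": "user", "content": "What food is China known for?"})

# One dialog per request, because the endpoint accepts a batch size of 1
payload = {"inputs": [dialog], "parameters": {"max_new_tokens": 256}}

# Made-up response in the shape printed above: one entry per dialog in the batch
sample_response = [
    {"generation": {"role": "assistant", "content": " Peking duck is a classic."}}
]
reply = sample_response[0]["generation"]["content"].strip()
print(reply)
```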

In [49]:
import gradio as gr

# Hyperparameters for the LLM
parameters = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
}


def respond(message, chat_history):
    # Convert the chat history to the dialog format and append the new user message
    dialog = history_to_dialog_format(chat_history)
    dialog.append({"role": "user", "content": message})
    print(dialog)
    # Send the request to the endpoint
    llm_response = predictor.predict(
        {"inputs": [dialog], "parameters": parameters},
        custom_attributes="accept_eula=true",
    )
    print(llm_response[0])
    # gr.ChatInterface tracks the history itself, so return only the reply text
    parsed_response = llm_response[0]["generation"]["content"]
    return parsed_response


demo = gr.ChatInterface(
    respond,
    title="Llama 2 7B-chat",
    retry_btn=None,
    undo_btn=None,
    clear_btn=None,
)

demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7880
Running on public URL: https://01fc22d1d05465453f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


KeyboardInterrupt: 

In [42]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f"> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")

### Example 1

## Clean up the endpoint

In [None]:
# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()