# Chat completion: Run Llama 2 models in SageMaker JumpStart

---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy a JumpStart model for Text Generation using the Llama 2 fine-tuned model optimized for dialogue use cases.

To perform inference on these models, you need to pass custom_attributes='accept_eula=true' as part of the request header. Doing so indicates that you have read and accepted the model's end-user license agreement (EULA). The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. By default, this notebook sets custom_attributes='accept_eula=false', so all inference requests will fail until you explicitly change this custom attribute.
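
A request sent outside the SageMaker Python SDK would carry the same attribute. The sketch below assumes the boto3 `sagemaker-runtime` client, whose `invoke_endpoint` call accepts a `CustomAttributes` argument. The helper name `build_invoke_kwargs` and the endpoint name are hypothetical, and the actual network call is left as a comment because it needs AWS credentials and a deployed endpoint.

```python
import json


def build_invoke_kwargs(endpoint_name, payload, accept_eula=False):
    """Build keyword arguments for the SageMaker runtime `invoke_endpoint` call.

    `accept_eula` defaults to False, mirroring this notebook's default. Set it
    to True only after reading and accepting the model's EULA.
    """
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
        # Custom attributes travel with the request as a key=value string.
        "CustomAttributes": f"accept_eula={str(accept_eula).lower()}",
    }


kwargs = build_invoke_kwargs(
    "my-llama-2-endpoint",  # hypothetical endpoint name
    {"inputs": [[{"role": "user", "content": "Hello"}]]},
)
print(kwargs["CustomAttributes"])  # accept_eula=false

# With credentials and a live endpoint, the request could then be sent as:
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**kwargs)
```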

Note: The custom_attributes used to pass the EULA acceptance are key/value pairs. The key and value are separated by '=' and pairs are separated by ';'. If the same key is passed more than once, the last value is kept and passed to the script handler (in this case, it is used for conditional logic). For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
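
The "last value wins" rule can be illustrated with a small parser. This is only a sketch of the semantics described in the note, not the code that runs on the endpoint; the function name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw):
    """Parse 'key=value; key=value' pairs; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, _, value = pair.partition("=")
        # Assigning into the dict overwrites any earlier value for the same key.
        attributes[key.strip()] = value.strip()
    return attributes


parsed = parse_custom_attributes("accept_eula=false; accept_eula=true")
print(parsed)  # {'accept_eula': 'true'}
```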

---

## Setup

***

In [37]:
%pip install --upgrade --quiet sagemaker

Note: you may need to restart the kernel to use updated packages.


***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [38]:
(
    model_id,
    model_version,
) = (
    "meta-textgeneration-llama-2-7b-f",
    "*",
)

## Deploy model

***
You can now deploy the model using SageMaker JumpStart.
***

In [39]:
%%time
from sagemaker.jumpstart.model import JumpStartModel
import time
start = time.time()
model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = model.deploy()

-----------------!
CPU times: user 148 ms, sys: 19 ms, total: 167 ms
Wall time: 9min 3s


### Changing instance type
---


Models are supported on the following instance types:

 - Llama 2 7B and 7B-F: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 13B and 13B-F: `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 70B and 70B-F: `ml.g5.48xlarge`, `ml.p4d.24xlarge`

By default, the JumpStartModel class selects a default instance type available in your region. To use a different instance type, specify `instance_type` when constructing the JumpStartModel:

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`
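
Before overriding the instance type, it can help to check the choice against the list above. The following is a minimal sketch that copies that list into a dictionary; the names `SUPPORTED_INSTANCE_TYPES` and `is_supported_instance_type` are hypothetical and not part of the SageMaker SDK.

```python
# Instance types copied from the list above, keyed by model size.
SUPPORTED_INSTANCE_TYPES = {
    "7b": [
        "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
        "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "13b": ["ml.g5.12xlarge", "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge"],
    "70b": ["ml.g5.48xlarge", "ml.p4d.24xlarge"],
}


def is_supported_instance_type(model_id, instance_type):
    """Return True if `instance_type` is listed for the size found in `model_id`."""
    tokens = model_id.split("-")
    for size, instance_types in SUPPORTED_INSTANCE_TYPES.items():
        if size in tokens:
            return instance_type in instance_types
    raise ValueError(f"no known model size in {model_id!r}")


print(is_supported_instance_type("meta-textgeneration-llama-2-7b-f", "ml.g5.2xlarge"))   # True
print(is_supported_instance_type("meta-textgeneration-llama-2-13b-f", "ml.g5.2xlarge"))  # False
```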

---

## Invoke the endpoint

***
### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to be sampled, while a lower temperature concentrates sampling on high-probability words. As `temperature` -> 0, sampling approaches greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words whose cumulative probability reaches `top_p`. If specified, it must be a float between 0 and 1.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
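
To make the interaction between `temperature` and `top_p` concrete, here is an illustrative sketch of temperature scaling followed by top-p (nucleus) filtering over a toy vocabulary. It is not the endpoint's implementation; the function name `top_p_candidates` and the scores are made up for the example.

```python
import math


def top_p_candidates(logits, temperature, top_p):
    """Return the tokens kept by temperature scaling followed by top-p filtering."""
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1)
    # the distribution produced by the softmax.
    scaled = {token: logit / temperature for token, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((token, weight / total) for token, weight in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Keep the most probable tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, probability in ranked:
        kept.append(token)
        cumulative += probability
        if cumulative >= top_p:
            break
    return kept


logits = {"the": 2.0, "a": 1.0, "zebra": -1.0}  # made-up scores
print(top_p_candidates(logits, temperature=1.0, top_p=0.9))   # ['the', 'a']
print(top_p_candidates(logits, temperature=0.1, top_p=0.9))   # ['the']
print(top_p_candidates(logits, temperature=10.0, top_p=0.9))  # ['the', 'a', 'zebra']
```

With a low temperature only the top token survives, which matches the greedy-decoding limit; with a high temperature the distribution flattens and every token stays in the candidate set.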

***
### Notes
- This model only supports the 'system', 'user', and 'assistant' roles. A dialog may start with an optional 'system' message, followed by 'user' and 'assistant' messages that strictly alternate (u/a/u/a/u...).
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
- In order to support a 4k context length, this model has restricted query payloads to only utilize a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
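
The role ordering and the batch size of 1 can be checked on the client before a request is sent. The helpers below are a hypothetical sketch (`validate_dialog` and `build_payload` are not SDK functions). Requiring the final message to come from the user is an assumption of this sketch, since the model is expected to answer the latest user turn.

```python
def validate_dialog(dialog):
    """Raise ValueError unless roles follow [system,] user, assistant, user, ..."""
    roles = [message["role"] for message in dialog]
    if roles and roles[0] == "system":
        roles = roles[1:]  # an optional system message may come first
    if not roles:
        raise ValueError("dialog needs at least one user message")
    for turn, role in enumerate(roles):
        expected = "user" if turn % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {turn} has role {role!r}, expected {expected!r}")
    if roles[-1] != "user":  # assumption: the model replies to a user message
        raise ValueError("dialog must end with a user message")


def build_payload(dialog, max_new_tokens, top_p=0.9, temperature=0.6):
    """Wrap a single validated dialog in a payload with a batch size of 1."""
    validate_dialog(dialog)
    return {
        "inputs": [dialog],  # exactly one dialog per request
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


dialog = [
    {"role": "system", "content": "Always answer with Haiku"},
    {"role": "user", "content": "I am going to Paris, what should I see?"},
]
payload = build_payload(dialog, max_new_tokens=512)
print(len(payload["inputs"]))  # 1
```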

***

In [40]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f"> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")

***
### Example 1
***

In [41]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: what is the recipe of mayonnaise?

> Assistant:  Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings. Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 2 egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* 1/2 teaspoon salt
* 1/4 teaspoon black pepper

Instructions:

1. In a small bowl, whisk together the egg yolks and salt until well combined.
2. Slowly pour the oil into the egg yolk mixture while continuously whisking. The mixture will begin to thicken and emulsify as you add the oil.
3. Once you have added about 1/4 cup (60 ml) of oil, stop whisking and let the mixture sit for 5-10 minutes to allow the oil and egg yolks to fully emulsify.
4. After the mixture has sat for a few minutes, give it a few quick whisk strokes to check that it has thickened evenly. If it's still a bit too thin, let it sit for a few more minutes and 

***
### Example 2
Use of assistant role
***

In [42]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "I am going to Paris, what should I see?"},
            {
                "role": "assistant",
                "content": """\
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
            },
            {"role": "user", "content": "What is so great about #1?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: I am going to Paris, what should I see?

Assistant: Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.

User: What is so great about #1?

> Assistant:  The Eiffel Tower is considered one of 

***
### Example 3
Use of system role
***

In [43]:
%%time

payload = {
    "inputs": [
        [
            {"role": "system", "content": "Always answer with Haiku"},
            {"role": "user", "content": "I am going to Paris, what should I see?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

System: Always answer with Haiku

User: I am going to Paris, what should I see?

> Assistant:  Eiffel Tower high
Love locks on bridge embrace
City of light shines


CPU times: user 2.05 ms, sys: 2.45 ms, total: 4.51 ms
Wall time: 643 ms


***
### Example 4
Use of system role
***

In [44]:
%%time

payload = {
    "inputs": [
        [
            {
                "role": "system",
                "content": "Always answer with emojis",
            },
            {"role": "user", "content": "How to go from Beijing to NY?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

System: Always answer with emojis

User: How to go from Beijing to NY?

> Assistant:  Here's how to go from Beijing to NY 🛬🗽:

1. Fly 🛫: There are several airlines that offer direct or connecting flights from Beijing Capital International Airport (PEK) to John F. Kennedy International Airport (JFK) or LaGuardia Airport (LGA) in New York.
2. Train 🚂: You can take a train from Beijing to New York, but it's a long journey that takes several days. The train route goes through Moscow and then to New York.
3. Bus 🚌: Taking a bus from Beijing to New York is another option, but it's a long journey that takes around 24 hours.
4. Drive 🚗: If you prefer to drive, you can take a car from Beijing to New York, but it's a long journey that takes around 20 days.

Let me know if you have any other questions! 😊


CPU times: user 1.89 ms, sys: 3.54 ms, total: 5.43 ms
Wall time: 6.88 s


***
### Code generation
***

In [45]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Write a program to compute factorial in python:"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: Write a program to compute factorial in python:

> Assistant:  Sure! Here is a simple program in Python to compute the factorial of a given number:
```
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```
Explanation:

* The function `factorial` takes in a single argument `n`, which is the number for which we want to compute the factorial.
* We use a recursive approach to compute the factorial. If `n` is 0, the factorial is defined to be 1. Otherwise, we multiply `n` by the factorial of `n-1`.
* The function returns the computed factorial.

Here is an example of how to use the `factorial` function:
```
print(factorial(5))  # prints 120
```
I hope this helps! Let me know if you have any questions.


CPU times: user 6.79 ms, sys: 180 µs, total: 6.97 ms
Wall time: 6.1 s


***
### Content Generation
Zero-shot prompting
***

In [46]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Building a website can be done in 10 simple steps:"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Building a website can be done in 10 simple steps:

> Assistant:  Sure! Here are 10 simple steps to help you build a website:

Step 1: Define Your Website's Purpose and Goals
Before you start building your website, you need to determine its purpose and goals. What do you want your website to achieve? What kind of content will you publish on it? Who is your target audience? Answering these questions will help you create a clear vision for your website and guide your decision-making throughout the building process.

Step 2: Choose a Domain Name and Web Hosting Provider
Your domain name is the address of your website (e.g., [www.yoursite.com](http://www.yoursite.com)), while your web hosting provider is the company that will store your website's files and make them accessible to the public. You can choose from a variety of domain name registrars and web hosting providers, but make sure to do your research and choose a reputable one to ensure your website is reliable and secure.

Ste

***
### Translation
Few shot learning
***

In [47]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": """Translate English to French:

                                            sea otter => loutre de mer

                                            peppermint => menthe poivrée

                                            plush girafe => girafe peluche

                                            cheese =>"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Translate English to French:

                                            sea otter => loutre de mer

                                            peppermint => menthe poivrée

                                            plush girafe => girafe peluche

                                            cheese =>

> Assistant:  Sure! Here are the translations for the remaining words:

* sea otter: loutre de mer (as you already provided)
* peppermint: menthe poivrée (corrected spelling)
* plush giraffe: girafe peluche (corrected spelling)
* cheese: fromage (this is the standard French word for "cheese")


CPU times: user 5.04 ms, sys: 0 ns, total: 5.04 ms
Wall time: 2.64 s


***
### CoT Prompting
[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf)  
[Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916.pdf)  

The techniques from these papers are incorporated into more recent models such as Llama 2, so these models are able to handle such prompts out of the box.
***

In [50]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": """I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

> Assistant:  Great, let's solve this problem step by step!

You started with 10 apples.

* You gave 2 apples to the neighbor, so you have 10 - 2 = 8 apples left.
* You gave 2 apples to the repairman, so you have 8 - 2 = 6 apples left.

Then, you went and bought 5 more apples. So, you now have 6 + 5 = 11 apples.

Finally, you ate 1 apple, so you have 11 - 1 = 10 apples left.

Therefore, you remain with 10 apples.


CPU times: user 18.3 ms, sys: 270 µs, total: 18.6 ms
Wall time: 5.2 s


***
### Sentiment analysis
***

In [30]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": """"I hate it when my phone battery dies."
                                            Sentiment: Negative
                                            ###
                                            Tweet: "My day has been :+1:"
                                            Sentiment: Positive
                                            ###
                                            Tweet: "This is the link to the article"
                                            Sentiment: Neutral
                                            ###
                                            Tweet: "This new music video was incredibile"
                                            Sentiment:"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: "I hate it when my phone battery dies."
                                            Sentiment: Negative
                                            ###
                                            Tweet: "My day has been :+1:"
                                            Sentiment: Positive
                                            ###
                                            Tweet: "This is the link to the article"
                                            Sentiment: Neutral
                                            ###
                                            Tweet: "This new music video was incredibile"
                                            Sentiment:

> Assistant:  Sure! Here are the sentiment analyses for the given tweets:

1. "I hate it when my phone battery dies." - Negative
2. "My day has been :+1:" - Positive
3. "This is the link to the article" - Neutral
4. "This new music video was incredibile" - Positive

In the first tweet, the user expresses displeas

***
### Question answering
***

In [31]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Could you remind me when was the C programming language invented?"
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Could you remind me when was the C programming language invented?

> Assistant:  The C programming language was invented in the early 1970s by Dennis Ritchie, a computer scientist at Bell Labs.

The first edition of the C language specification, known as "ANSI C," was published in 1983 by the American National Standards Institute (ANSI). This standardized the language and provided a common definition for C programming.

So, the C programming language was invented in the early 1970s, and the first standard for the language was published in 1983.


CPU times: user 0 ns, sys: 4.94 ms, total: 4.94 ms
Wall time: 3.57 s


***
### Recipe generation
***

In [32]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content":  "What is the recipe for a delicious lemon rice?"
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: What is the recipe for a delicious lemon rice?

> Assistant:  Lemon rice is a popular side dish in many cuisines, particularly in Indian and Middle Eastern cooking. Here is a simple recipe for a delicious lemon rice dish:

Ingredients:

* 1 cup uncooked white or brown rice
* 2 cups water
* 1/2 cup freshly squeezed lemon juice (about 2-3 lemons)
* 2 tablespoons vegetable oil
* 1 small onion, finely chopped
* 1 small carrot, grated
* 1 teaspoon ground cumin
* 1 teaspoon ground coriander
* Salt and pepper, to taste
* Fresh cilantro, chopped (optional)

Instructions:

1. Rinse the rice in a fine-mesh sieve until the water runs clear. Drain and set aside.
2. Heat the oil in a large saucepan over medium heat. Add the onion and cook, stirring occasionally, until it is lightly browned and translucent, about 5 minutes.
3. Add the grated carrot and cook for another 2-3 minutes, stirring occasionally.
4. Add the rice to the saucepan and stir to coat the rice with the oil and mix with the on

  
  
***
### Summarization
***

In [33]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content":  """Starting today, the state-of-the-art Falcon 40B foundation model from Technology
    Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
    that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
    started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
    programmatically through the SageMaker Python SDK.
    Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
    ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
    benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
    exceptional performance without specialized fine-tuning. To make it easier for customers to access this
    state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
    Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
    needs for applications such as translation, question answering, and summarizing information.
    Falcon 40B are generally available today through Amazon SageMaker JumpStart in US East (Ohio),
    US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
    Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
    with availability in additional AWS Regions coming soon. To learn how to use this new feature,
    please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
    Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
    state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=false")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Starting today, the state-of-the-art Falcon 40B foundation model from Technology
    Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
    that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
    started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
    programmatically through the SageMaker Python SDK.
    Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
    ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
    benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
    exceptional performance without specialized fine-tuning. To make it easier for customers to access this
    state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMak

In [34]:
from datetime import timedelta
end = time.time()
print(timedelta(seconds=end-start))

0:10:40.183282


## Clean up the endpoint

In [35]:
# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()