# Chat completion: Run Llama 2 models in SageMaker JumpStart

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|jumpstart-foundation-models|llama-2-chat-completion.ipynb)

---

---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy a JumpStart model for Text Generation using the Llama 2 fine-tuned model optimized for dialogue use cases.

To perform inference on these models, you need to pass `custom_attributes='accept_eula=true'` as part of the request header. Passing this value confirms that you have read and accepted the model's end-user license agreement (EULA), which can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. Inference requests fail until this attribute is explicitly set to `true`.

Note: The custom attributes used to pass the EULA are key/value pairs. The key and value are separated by '=', and pairs are separated by ';'. If the same key is passed more than once, the last value is kept and passed to the script handler (in this case, where it is used for conditional logic). For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
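
The parsing rule described in the note can be sketched in a few lines of Python. This is only an illustration of the "last value wins" behavior, not the endpoint's actual handler code; the function name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled separators
        key, _, value = pair.partition("=")
        # Assigning to the same key again overwrites the earlier value.
        attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
# prints {'accept_eula': 'true'}
```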

---

## Setup

***

In [2]:
%pip install --upgrade --quiet sagemaker

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.3.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [3]:
(
    model_id,
    model_version,
) = (
    "meta-textgeneration-llama-2-7b-f",
    "*",
)

## Deploy model

***
You can now deploy the model using SageMaker JumpStart.
***

In [11]:
%%time
from sagemaker.jumpstart.model import JumpStartModel

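# Pass the version selected above so that `model_version` is actually used.
model = JumpStartModel(model_id=model_id, model_version=model_version)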
predictor = model.deploy()

-----------------!
CPU times: user 111 ms, sys: 28.2 ms, total: 140 ms
Wall time: 9min 4s


### Changing instance type
---


Models are supported on the following instance types:

 - Llama 2 7B and 7B-F: `ml.g5.2xlarge`, `ml.g5.4xlarge`, `ml.g5.8xlarge`, `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 13B and 13B-F: `ml.g5.12xlarge`, `ml.g5.24xlarge`, `ml.g5.48xlarge`, `ml.p4d.24xlarge`
 - Llama 2 70B and 70B-F: `ml.g5.48xlarge`, `ml.p4d.24xlarge`

By default, the `JumpStartModel` class selects a default instance type available in your region. To use a different instance type, specify `instance_type` when you construct the `JumpStartModel`:

`my_model = JumpStartModel(model_id=model_id, instance_type="ml.g5.12xlarge")`
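
Before deploying on a non-default instance type, it can help to check the choice against the list above. The sketch below copies that list into a dictionary for the three chat model IDs used in this notebook; the names `SUPPORTED_INSTANCE_TYPES` and `check_instance_type` are hypothetical helpers, not part of the SageMaker SDK.

```python
# Instance types copied from the list above, keyed by the chat model IDs
# this notebook runs with.
SUPPORTED_INSTANCE_TYPES = {
    "meta-textgeneration-llama-2-7b-f": [
        "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.12xlarge",
        "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "meta-textgeneration-llama-2-13b-f": [
        "ml.g5.12xlarge", "ml.g5.24xlarge", "ml.g5.48xlarge", "ml.p4d.24xlarge",
    ],
    "meta-textgeneration-llama-2-70b-f": ["ml.g5.48xlarge", "ml.p4d.24xlarge"],
}


def check_instance_type(model_id: str, instance_type: str) -> str:
    """Return instance_type if it is listed for model_id, else raise ValueError."""
    supported = SUPPORTED_INSTANCE_TYPES.get(model_id)
    if supported is None:
        raise ValueError(f"Unknown model ID: {model_id}")
    if instance_type not in supported:
        raise ValueError(
            f"{instance_type} is not listed for {model_id}; choose one of {supported}"
        )
    return instance_type


print(check_instance_type("meta-textgeneration-llama-2-13b-f", "ml.g5.12xlarge"))
# prints ml.g5.12xlarge
```

The validated value can then be passed as `instance_type` to `JumpStartModel`.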

---

## Invoke the endpoint

***
### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** The model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to be sampled, while a lower temperature favors high-probability words. As `temperature` approaches 0, sampling approaches greedy decoding. If specified, it must be a positive float.
* **top_p:** At each step of text generation, the model samples from the smallest possible set of words whose cumulative probability reaches `top_p`. If specified, it must be a float between 0 and 1.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
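
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch; `validate_parameters` is a hypothetical helper, and treating the `top_p` bounds as inclusive is an assumption, since the text only says "between 0 and 1".

```python
def validate_parameters(parameters: dict) -> dict:
    """Check the documented constraints; raise ValueError on a violation."""

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if "max_new_tokens" in parameters:
        value = parameters["max_new_tokens"]
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")
    if "temperature" in parameters:
        value = parameters["temperature"]
        if not is_number(value) or value <= 0:
            raise ValueError("temperature must be a positive number")
    if "top_p" in parameters:
        value = parameters["top_p"]
        # Assumption: both bounds are treated as inclusive here.
        if not is_number(value) or not 0 <= value <= 1:
            raise ValueError("top_p must be between 0 and 1")
    return parameters


validate_parameters({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6})
```

Because any subset of the parameters is allowed, the helper only checks the keys that are present.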

***
### Notes
- This model supports only the 'system', 'user', and 'assistant' roles. A dialog may start with an optional 'system' message, followed by 'user' and 'assistant' messages that alternate, beginning with 'user' (u/a/u/a/u...).
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens below 4K.
- To support a 4K context length, the endpoint restricts query payloads to a batch size of 1. Payloads with larger batch sizes receive an endpoint error before inference.
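
The role-ordering and batch-size rules in these notes can be expressed as a small client-side check. This is a sketch under stated assumptions: `validate_inputs` is a hypothetical helper, and requiring the last message to come from the user is an assumption based on the examples in this notebook, all of which end with a user turn.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_inputs(inputs: list) -> None:
    """Raise ValueError if `inputs` breaks the batch-size or role-order rules."""
    if len(inputs) != 1:
        raise ValueError("only a batch size of 1 is supported")
    roles = [message["role"] for message in inputs[0]]
    unknown = set(roles) - ALLOWED_ROLES
    if unknown:
        raise ValueError(f"unsupported roles: {sorted(unknown)}")
    # An optional system message may come first; the rest must alternate.
    turns = roles[1:] if roles and roles[0] == "system" else roles
    if not turns:
        raise ValueError("dialog needs at least one user message")
    for index, role in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {index} should be '{expected}', got '{role}'")
    # Assumption: the dialog must end with a user turn for the model to answer.
    if turns[-1] != "user":
        raise ValueError("the last message must come from the user")


validate_inputs(
    [[
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"},
    ]]
)
```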

***

In [15]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f"> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")

***
### Example 1
***

In [16]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: what is the recipe of mayonnaise?

> Assistant:  Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 2 egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* Salt and pepper to taste

Instructions:

1. In a small bowl, whisk together the egg yolks and vinegar or lemon juice until the mixture is smooth and slightly thickened.
2. Slowly pour the oil into the egg yolk mixture while continuously whisking. The mixture will start to thicken and emulsify as you add the oil.
3. Continue whisking until the mixture is smooth and thick, and has a consistent texture throughout. This should take about 5-7 minutes.
4. Taste and adjust the seasoning as needed with salt and pepper.
5. Transfer the mayonnaise to a jar or airtight container and store it in the refrigerator for up to 1 week.

Note: It's importa

***
### Example 2
Use of assistant role
***

In [17]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "I am going to Paris, what should I see?"},
            {
                "role": "assistant",
                "content": """\
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
            },
            {"role": "user", "content": "What is so great about #1?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: I am going to Paris, what should I see?

Assistant: Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.

User: What is so great about #1?

> Assistant:  The Eiffel Tower is considered one of 
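
To continue a conversation, the model's reply can be appended to the dialog as an 'assistant' message before the next 'user' message is added. The sketch below assumes the response structure used by `print_dialog` (`response[0]['generation']` with `role` and `content` keys); `next_turn_payload` is a hypothetical helper.

```python
import copy


def next_turn_payload(payload: dict, response: list, user_message: str) -> dict:
    """Return a new payload with the model's reply and a new user message appended."""
    new_payload = copy.deepcopy(payload)  # leave the original payload untouched
    dialog = new_payload["inputs"][0]
    reply = response[0]["generation"]
    dialog.append({"role": reply["role"], "content": reply["content"]})
    dialog.append({"role": "user", "content": user_message})
    return new_payload


# Mocked response with the same shape that print_dialog reads.
first_payload = {
    "inputs": [[{"role": "user", "content": "I am going to Paris, what should I see?"}]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
mock_response = [{"generation": {"role": "assistant", "content": "1. The Eiffel Tower ..."}}]

follow_up = next_turn_payload(first_payload, mock_response, "What is so great about #1?")
print([message["role"] for message in follow_up["inputs"][0]])
# prints ['user', 'assistant', 'user']
```

The returned payload keeps the user/assistant alternation, so it can be sent with `predictor.predict` in the same way as the examples here.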

***
### Example 3
Use of system role
***

In [18]:
%%time

payload = {
    "inputs": [
        [
            {"role": "system", "content": "Always answer with Haiku"},
            {"role": "user", "content": "I am going to Paris, what should I see?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

System: Always answer with Haiku

User: I am going to Paris, what should I see?

> Assistant:  Eiffel Tower high
Love locks on bridges glisten
City of romance


CPU times: user 4.74 ms, sys: 0 ns, total: 4.74 ms
Wall time: 645 ms


***
### Example 4
Use of system role
***

In [19]:
%%time

payload = {
    "inputs": [
        [
            {
                "role": "system",
                "content": "Always answer with emojis",
            },
            {"role": "user", "content": "How to go from Beijing to NY?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

System: Always answer with emojis

User: How to go from Beijing to NY?

> Assistant:  Here's how to go from Beijing to NY 🛬🗺️:

1. Fly 🛫: There are several airlines that offer direct flights from Beijing to New York, such as Air China, China Eastern, and Delta.
2. Train 🚂: You can take a train from Beijing to New York, but it's a long journey that takes several days. The train route goes through Russia and Eastern Europe before reaching New York.
3. Bus 🚌: Taking a bus from Beijing to New York is another option, but it's a long journey that can take around 24 hours.
4. Drive 🚗: If you have a lot of time and want to see some beautiful scenery, you can drive from Beijing to New York. The journey takes around 20 days and goes through Russia, Eastern Europe, and Canada.

Whichever option you choose, make sure to plan ahead and book your tickets in advance to avoid any last-minute hassles. 😊


CPU times: user 5.45 ms, sys: 0 ns, total: 5.45 ms
Wall time: 7.41 s


***
### Code generation
***

In [20]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Write a program to compute factorial in python:"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: Write a program to compute factorial in python:

> Assistant:  Sure! Here is a simple program in Python to compute the factorial of a given number:
```
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```
Here's how the program works:

1. The `factorial` function takes in a single argument `n`, which is the number for which we want to compute the factorial.
2. If `n` is equal to 0, the function returns 1, because the factorial of 0 is defined to be 1.
3. Otherwise, the function recursively calls itself with `n-1` as the argument.
4. The result of the recursive call is multiplied by `n` to give the final result of the factorial of `n`.

Here's an example of how to use the `factorial` function:
```
print(factorial(5))  # prints 120
```
I hope this helps! Let me know if you have any questions.


CPU times: user 5.16 ms, sys: 0 ns, total: 5.16 ms
Wall time: 7.04 s


***
### Content Generation
***

In [21]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Building a website can be done in 10 simple steps:"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Building a website can be done in 10 simple steps:

> Assistant:  Sure! Here are 10 simple steps to build a website:

Step 1: Define Your Website's Purpose and Goals
Before you start building your website, it's important to define its purpose and goals. What is the website for? What do you want to achieve with it? Who is your target audience? Answering these questions will help you create a clear vision for your website and guide your design and content decisions.

Step 2: Choose a Domain Name and Web Hosting Provider
Your domain name is the address of your website (e.g., [www.yoursite.com](http://www.yoursite.com)), while your web hosting provider is the company that will store your website's files and make them accessible to visitors. Choose a domain name that is easy to remember, relevant to your content, and available for registration. Select a reliable web hosting provider that meets your needs in terms of storage, bandwidth, and technical support.

Step 3: Plan Your Website

***
### Translation
***

In [22]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": """Translate English to French:

                                            sea otter => loutre de mer

                                            peppermint => menthe poivrée

                                            plush girafe => girafe peluche

                                            cheese =>"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Translate English to French:

                                            sea otter => loutre de mer

                                            peppermint => menthe poivrée

                                            plush girafe => girafe peluche

                                            cheese =>

> Assistant:  Sure! Here are the translations of the English words to French:

* Sea otter => Loutre de mer (loutre means "otter" in French, and de mer means "of the sea")
* Peppermint => Menthe poivrée (menthe means "mint" in French, and poivrée means "peppery")
* Plush giraffe => Girafe peluche (girafe means "giraffe" in French, and peluche means "plush")
* Cheese => Fromage (fromage is the French word for "cheese")


CPU times: user 13.4 ms, sys: 1.17 ms, total: 14.6 ms
Wall time: 4.18 s
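
The triple-quoted prompt above carries the source-code indentation into the text sent to the model, as the echoed dialog shows. One way to avoid that is to assemble the few-shot prompt from a list of example pairs. This is an optional sketch; `build_few_shot_prompt` is a hypothetical helper, and the `=>` layout simply mirrors the prompt used in this example.

```python
def build_few_shot_prompt(instruction: str, examples: list, query: str) -> str:
    """Build 'instruction / source => target ... / query =>' without stray indentation."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
        lines.append("")  # blank line between examples, as in the prompt above
    lines.append(f"{query} =>")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

The resulting string can be used as the `content` of the user message.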


***
### Sentiment analysis
***

In [23]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": """"I hate it when my phone battery dies."
                                            Sentiment: Negative
                                            ###
                                            Tweet: "My day has been :+1:"
                                            Sentiment: Positive
                                            ###
                                            Tweet: "This is the link to the article"
                                            Sentiment: Neutral
                                            ###
                                            Tweet: "This new music video was incredibile"
                                            Sentiment:"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: "I hate it when my phone battery dies."
                                            Sentiment: Negative
                                            ###
                                            Tweet: "My day has been :+1:"
                                            Sentiment: Positive
                                            ###
                                            Tweet: "This is the link to the article"
                                            Sentiment: Neutral
                                            ###
                                            Tweet: "This new music video was incredibile"
                                            Sentiment:

> Assistant:  Positive


CPU times: user 5.2 ms, sys: 0 ns, total: 5.2 ms
Wall time: 122 ms


***
### Question answering
***

In [24]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "Could you remind me when was the C programming language invented?"
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Could you remind me when was the C programming language invented?

> Assistant:  The C programming language was first introduced in 1972 by Dennis Ritchie at Bell Labs. It was initially called "C with Classes" and was designed as a portable, high-level language for systems programming. The language was officially named "C" in 1973, and it has since become one of the most widely used programming languages in the world.


CPU times: user 15.3 ms, sys: 349 µs, total: 15.7 ms
Wall time: 2.51 s


***
### Recipe generation
***

In [25]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content":  "What is the recipe for a delicious lemon rice?"
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: What is the recipe for a delicious lemon rice?

> Assistant:  Lemon rice is a popular dish in many parts of the world, and it's easy to make at home. Here's a simple recipe for a delicious and flavorful lemon rice dish:

Ingredients:

* 1 cup of uncooked white or brown rice
* 2 cups of water
* 1/2 cup of freshly squeezed lemon juice (about 2-3 lemons)
* 2 tablespoons of olive oil
* 1 teaspoon of grated ginger
* 1/2 teaspoon of ground cumin
* Salt and pepper to taste
* Optional: 1/4 cup of chopped fresh herbs like parsley or cilantro

Instructions:

1. Rinse the rice in a fine mesh strainer until the water runs clear. Drain and set aside.
2. Heat the olive oil in a large saucepan over medium heat. Add the grated ginger and sauté for 1-2 minutes until fragrant.
3. Add the rice to the saucepan and stir to coat the rice with the oil and ginger. Cook for 2-3 minutes until the rice is lightly browned.
4. Add the lemon juice, water, salt, and pepper to the saucepan. Stir to combine and 

  
  
***
### Summarization
***

In [26]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content":  """Starting today, the state-of-the-art Falcon 40B foundation model from Technology
    Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
    that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
    started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
    programmatically through the SageMaker Python SDK.
    Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
    ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
    benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
    exceptional performance without specialized fine-tuning. To make it easier for customers to access this
    state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
    Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
    needs for applications such as translation, question answering, and summarizing information.
    Falcon 40B are generally available today through Amazon SageMaker JumpStart in US East (Ohio),
    US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
    Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
    with availability in additional AWS Regions coming soon. To learn how to use this new feature,
    please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
    Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
    state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:"""
            },
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)
    

User: Starting today, the state-of-the-art Falcon 40B foundation model from Technology
    Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
    that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
    started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
    programmatically through the SageMaker Python SDK.
    Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
    ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
    benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
    exceptional performance without specialized fine-tuning. To make it easier for customers to access this
    state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMak

## Clean up the endpoint

In [10]:
# Delete the SageMaker endpoint
predictor.delete_model()
predictor.delete_endpoint()
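
If the examples run as a script rather than cell by cell, an error partway through can leave the endpoint running. One possible safeguard is a context manager that always runs the two delete calls shown above. This is a sketch; `endpoint_cleanup` is a hypothetical helper, not part of the SageMaker SDK.

```python
from contextlib import contextmanager


@contextmanager
def endpoint_cleanup(predictor):
    """Yield the predictor and delete its model and endpoint on exit."""
    try:
        yield predictor
    finally:
        # Same calls and order as the cleanup cell above; they run even if
        # the body of the `with` block raises.
        predictor.delete_model()
        predictor.delete_endpoint()
```

Usage would look like `with endpoint_cleanup(model.deploy()) as predictor: ...`, with the inference calls placed inside the `with` block.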