## Run inference on the Llama 2 endpoint you have created.

In [2]:
import json
import boto3

### Supported Parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to appear in the output sequence, and a lower temperature favors high-probability words. As `temperature` approaches 0, generation approaches greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, the next word is sampled according to its likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.
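
To make the effect of `temperature`, `top_k`, and `top_p` more concrete, here is a minimal pure-Python sketch of how these three settings are commonly applied to a next-token distribution. It is an illustration only and not the endpoint's actual implementation; the helper name `filter_distribution` and the toy scores are made up for this example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalized token probabilities left after filtering.

    Hypothetical helper for illustration; `logits` maps token -> raw score.
    """
    # Temperature scaling: a small temperature sharpens the distribution,
    # a large temperature flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # top_k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}


# Toy scores, invented for the example.
toy_logits = {"happy": 2.0, "love": 1.0, "42": 0.5, "cheese": -1.0}
print(filter_distribution(toy_logits, temperature=0.6, top_p=0.9))
```

With `do_sample` enabled, the next token would then be drawn from the returned distribution; with a temperature close to 0, almost all of the probability mass ends up on the single most likely token.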

We may specify any subset of the parameters mentioned above while invoking an endpoint. Next, we show an example of how to invoke an endpoint with these arguments.
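
Because any subset of parameters may be sent, it can help to check a parameter dictionary on the client before calling the endpoint. The sketch below encodes the constraints from the list above. The function `validate_parameters` is a hypothetical helper and not part of any SDK. It makes three assumptions: it skips `num_beams`, whose constraint depends on `num_return_sequences`; it treats the bounds of `top_p` as inclusive; and it accepts integers wherever a float is expected.

```python
def validate_parameters(params):
    """Return a list of problems found in `params` (empty if none were found)."""
    problems = []

    def is_positive_int(value):
        # bool is a subclass of int in Python, so rule it out explicitly.
        return isinstance(value, int) and not isinstance(value, bool) and value > 0

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    for name in ("max_length", "max_new_tokens", "top_k"):
        if name in params and not is_positive_int(params[name]):
            problems.append(f"{name} must be a positive integer")

    if "no_repeat_ngram_size" in params:
        value = params["no_repeat_ngram_size"]
        if not (is_positive_int(value) and value > 1):
            problems.append("no_repeat_ngram_size must be an integer greater than 1")

    if "temperature" in params:
        value = params["temperature"]
        if not (is_number(value) and value > 0):
            problems.append("temperature must be a positive float")

    if "top_p" in params:
        value = params["top_p"]
        # Assumption: the bounds 0 and 1 are treated as inclusive here.
        if not (is_number(value) and 0 <= value <= 1):
            problems.append("top_p must be a float between 0 and 1")

    for name in ("early_stopping", "do_sample", "return_full_text"):
        if name in params and not isinstance(params[name], bool):
            problems.append(f"{name} must be a boolean")

    if "stop" in params:
        value = params["stop"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            problems.append("stop must be a list of strings")

    return problems


print(validate_parameters({"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6}))
print(validate_parameters({"max_new_tokens": 0, "stop": "###"}))
```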

**NOTE**: If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` to no more than 1500, 1000, and 500, respectively, while keeping the total number of tokens less than 4K.
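
The recommendation in the note can be turned into a small guard. This is a rough sketch: `capped_max_new_tokens` is a hypothetical helper, "4K" is assumed to mean 4096 tokens, and the prompt's token count is assumed to be supplied by the caller, since no tokenizer is used in this notebook.

```python
# Recommended ceilings from the note above, keyed by model size.
RECOMMENDED_MAX_NEW_TOKENS = {"7b": 1500, "13b": 1000, "70b": 500}
MAX_TOTAL_TOKENS = 4096  # assumption: "4K" is read as 4096 tokens


def capped_max_new_tokens(model_size, prompt_tokens, requested=None):
    """Return a max_new_tokens value that respects the recommended limits."""
    ceiling = RECOMMENDED_MAX_NEW_TOKENS[model_size]
    # Keep prompt + generated tokens strictly below the total limit.
    remaining = MAX_TOTAL_TOKENS - prompt_tokens - 1
    if remaining <= 0:
        raise ValueError("prompt leaves no room for generated tokens")
    limit = min(ceiling, remaining)
    return limit if requested is None else min(requested, limit)


print(capped_max_new_tokens("7b", prompt_tokens=100))
print(capped_max_new_tokens("70b", prompt_tokens=100, requested=64))
```

The returned value can be placed directly in the `"max_new_tokens"` field of a request's `"parameters"` dictionary.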

***

In [3]:
zero_shot_prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that ",
    """A brief message congratulating the team on the launch:

Hi everyone,

I just """,
]
few_shot_prompts = [
    """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>""",
]

# Build one request payload per prompt; all prompts share the same generation parameters.
payloads = []
for prompt in zero_shot_prompts + few_shot_prompts:
    payloads.append(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6, "return_full_text": False},
        }
    )

### Query the endpoint that you have created

---

In [4]:
endpoint_name = 'jumpstart-dft-meta-textgeneration-llama-2-7b'


def query_endpoint(payload):
    """Send a JSON payload to the endpoint and return the parsed JSON response."""
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream containing UTF-8 encoded JSON.
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response
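
The request and response handling can also be exercised without a live endpoint. The following sketch assumes a variant of the function that takes the client as an argument. `query_endpoint_with` and `FakeRuntimeClient` are hypothetical names introduced here, and the fake client only imitates the response shape used in this notebook: a `"Body"` stream holding a JSON list of objects with a `generated_text` field.

```python
import io
import json


def query_endpoint_with(client, endpoint_name, payload):
    """Send `payload` to `endpoint_name` through `client` and return parsed JSON."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode("utf8"))


class FakeRuntimeClient:
    """Hypothetical stand-in for the SageMaker runtime client, for offline checks."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Echo the prompt back in the same shape the notebook reads from.
        prompt = json.loads(Body)["inputs"]
        body = json.dumps([{"generated_text": f"(echo) {prompt}"}])
        return {"Body": io.BytesIO(body.encode("utf8"))}


result = query_endpoint_with(FakeRuntimeClient(), "my-endpoint", {"inputs": "Hello"})
print(result[0]["generated_text"])
```

Against a real endpoint, the same function would be called with `boto3.client("sagemaker-runtime")` as the first argument. Passing the client in also avoids creating a new client for every request.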

In [5]:
for payload in payloads:
    query_response = query_endpoint(payload)
    print(payload["inputs"])
    print(f"> {query_response[0]['generated_text']}")
    print("\n======\n")

I believe the meaning of life is
>  to be happy and to live a life that you enjoy.
I believe the meaning of life is to be happy and to live a life that you enjoy. I believe the meaning of life is to be happy and to live a life that you enjoy. I believe the meaning of life is to be happy and to live


Simply put, the theory of relativity states that 
> 1) space and time are relative to the observer, and 2) the speed of light is constant in a vacuum.
The theory was developed by Albert Einstein in 1905, and was proven by experiment in 1919.
Einstein's theory of relativ


A brief message congratulating the team on the launch:

Hi everyone,

I just 
> wanted to take a moment to say thank you to everyone for the amazing work you've done in getting the new version of the site up and running.  It's been a long time coming, and we're thrilled to finally be able to show it off.  We've got a lot


Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe =>

In [9]:
# import torch
# Requires the `langchain` package to be installed; otherwise the import fails as shown below.
from langchain import HuggingFacePipeline

ModuleNotFoundError: No module named 'langchain'