# Blog: From SageMaker Endpoint to AWS Lambda and API Gateway

This notebook walks through three steps:

1. Use SageMaker to stand up an endpoint ([source](https://plainenglish.io/blog/how-to-use-llama-2-with-an-api-on-aws-to-power-your-ai-apps#step-1-go-to-aws-sagemaker))
2. Create a Lambda function (see the code below)
3. Stand up an API for the Lambda function

## Payload

The payload looks like the following:

```json
{
  "body": {
    "inputs": "<s>[INST] what is the recipe of mayonnaise? [/INST] ",
    "parameters": {
      "max_new_tokens": 256,
      "top_p": 0.9,
      "temperature": 0.6
    }
  }
}
```

This is the `Event JSON` input. You can copy and paste it directly into the `Event JSON` section of your Lambda function's test configuration.
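
If you would rather build this event programmatically than paste it by hand, a minimal sketch follows. The helper name `build_event` is hypothetical (it is not part of the original post); the prompt template and parameter defaults are copied from the payload above.

```python
import json


def build_event(prompt: str, max_new_tokens: int = 256,
                top_p: float = 0.9, temperature: float = 0.6) -> dict:
    """Build a test event in the same shape as the payload shown above."""
    return {
        "body": {
            # Llama 2 chat prompt template, as used in the payload above
            "inputs": f"<s>[INST] {prompt} [/INST] ",
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "top_p": top_p,
                "temperature": temperature,
            },
        }
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the Event JSON box
    print(json.dumps(build_event("what is the recipe of mayonnaise?"), indent=2))
```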

## Deploying a Model on AWS SageMaker

### Prerequisites
The rest of this post assumes you already have a model endpoint deployed on AWS SageMaker. If you do not, follow the steps below to set one up from scratch.

#### Step 1: Access AWS SageMaker
1. Log in to your AWS Management Console.
2. Use the search bar at the top to find AWS SageMaker.
3. Select AWS SageMaker to open its dashboard.

#### Step 2: Set up a Domain on AWS SageMaker
1. In the AWS SageMaker dashboard, click "Domains" in the left sidebar.
2. Choose "Create a Domain".
3. Make sure the "Quick Setup" option is selected.
4. Fill out the form:
   - Enter a domain name of your choice.
   - Keep the suggested values for the remaining settings unless you have a reason to change them.
   - If you are new to SageMaker, select "Create a new role" in the Execution role section. If you already have a suitable role, you can select it instead.
5. Click "Submit" to create your domain.
6. Note the username displayed on the screen; you will need it to deploy your model.

##### Troubleshooting
If domain creation fails (for example, because of user permissions or VPC configuration), follow the troubleshooting steps AWS suggests or consult the AWS documentation.

#### Step 3: Start a SageMaker Studio Session
1. After your domain has been set up, click on the "Studio" link in the left sidebar.
2. Select the domain and user profile you configured earlier.
3. Click "Open Studio" to launch the session.

#### Step 4: Select and Deploy the Llama-2-7b-chat Model
1. Within SageMaker Studio, navigate to "Models, notebooks, and solutions" under the SageMaker JumpStart tab.
2. Use the search bar to locate the Llama 2 model, specifically the 7b chat model.
3. Click on the model to open its detail page.
4. You can adjust deployment settings here if necessary. For simplicity, proceed with the default SageMaker settings.
5. Deploy the model as configured. Note: Larger variants, such as the 70B version, require a much larger instance. If deployment fails because of instance limits, consider requesting a quota increase through AWS Service Quotas.
6. Allow 5-10 minutes for the deployment to complete. Once it finishes, a confirmation screen appears.
7. Note the model's endpoint name; you will need it for API calls.
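
Before wiring the endpoint into Lambda, it can help to confirm that it has finished deploying. The sketch below assumes a boto3 SageMaker client (`boto3.client("sagemaker")`) and uses its `describe_endpoint` call, which reports an `EndpointStatus` such as `Creating` or `InService`. The function name `wait_until_in_service` and the polling defaults are illustrative assumptions, not part of the original post.

```python
import time


def wait_until_in_service(sm_client, endpoint_name: str,
                          poll_seconds: float = 30.0, max_polls: int = 40) -> str:
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state.

    sm_client is assumed to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker"); it is passed in so the function is easy to test.
    Returns the last status seen (still 'Creating' if max_polls is exhausted).
    """
    status = "Unknown"
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            break  # e.g. "InService" or "Failed"
        time.sleep(poll_seconds)
    return status
```

With real credentials, this might be called as `wait_until_in_service(boto3.client("sagemaker"), "<your-endpoint-name>")`, where the endpoint name is the one noted in the last step.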

## Lambda

The Lambda function is defined as follows:

```py
import json
import os

import boto3

# Grab environment variables
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']  # Get the SageMaker endpoint name from environment variables
print(ENDPOINT_NAME)

runtime = boto3.client('runtime.sagemaker')  # Create a SageMaker runtime client


def lambda_handler(event: dict, context: object) -> dict:
    """
    Lambda function handler that invokes a SageMaker endpoint.

    Args:
        event (dict): The input event data
        context (object): The Lambda function context

    Returns:
        dict: A dictionary with the response status code and body
    """
    # Invoke the SageMaker endpoint
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,  # Use the environment variable for the endpoint name
        ContentType='application/json',  # Specify the content type as JSON
        Body=json.dumps(event['body']),  # Pass the input data from the event
        CustomAttributes="accept_eula=true",  # Accept the EULA
        # Inference component name from this particular deployment; replace with your own
        InferenceComponentName="meta-textgeneration-llama-2-7b-f-20240509-223751"
    )

    # Parse the response as JSON
    result = json.loads(response['Body'].read().decode())

    # Return a response with a 200 status code and the result as the body
    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }

```
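
The handler above reads `event['body']` as a dictionary, which matches the test event from the Payload section. Depending on how the API Gateway integration is configured, the body may instead arrive as a JSON string (this is the case with Lambda proxy integration). A minimal sketch of a helper that accepts both shapes follows; `extract_payload` is a hypothetical name, not part of the original function.

```python
import json


def extract_payload(event: dict) -> dict:
    """Return the model payload from a Lambda event.

    Accepts a body that is already a dict (as in the test event above)
    or a JSON-encoded string (assumed shape for proxy-style integrations).
    """
    body = event.get("body")
    if body is None:
        raise ValueError("event has no 'body' field")
    if isinstance(body, str):
        body = json.loads(body)  # decode a string body into a dict
    if "inputs" not in body:
        raise ValueError("payload is missing the 'inputs' field")
    return body
```

Inside the handler, this could replace the direct lookup, for example `Body=json.dumps(extract_payload(event))`.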

## Python API Call via `requests`

```py
import json
import requests


def call_llama(prompt: str, max_new_tokens: int = 50, temperature: float = 0.9) -> str:
    """
    Calls the Llama API to generate text based on a given prompt, controlling the length and randomness.

    Args:
        prompt (str): The prompt text to send to the Llama model for text generation.
        max_new_tokens (int, optional): The maximum number of tokens that the model should generate. Defaults to 50.
        temperature (float, optional): Controls the randomness of the output. Lower values make the model more deterministic.
            A higher value increases randomness. Defaults to 0.9.

    Returns:
        str: The generated text response from the Llama model.

    Raises:
        Exception: If the API call returns a non-200 status code.
    """
    # API endpoint for the Llama model
    api_url = "https://v6rkdcyir7.execute-api.us-east-1.amazonaws.com/beta"

    # Configuration for the request body
    json_body = {
        "body": {
            "inputs": f"<s>[INST] {prompt} [/INST]",
            "parameters": {
                "max_new_tokens": max_new_tokens,
                # Nucleus sampling: sample from the smallest set of tokens
                # whose cumulative probability reaches this value
                "top_p": 0.9,
                "temperature": temperature
            }
        }
    }

    # Headers to indicate that the payload is JSON
    headers = {"Content-Type": "application/json"}

    # Perform the POST request to the Llama API
    response = requests.post(api_url, headers=headers, json=json_body)

    # Check the status code before parsing, so a failed call raises a clear error
    if response.status_code != 200:
        raise Exception(f"Error calling Llama API: {response.status_code}")

    # The 'body' field of the response is itself a JSON-encoded string
    response_body = response.json()['body']

    # Decode that string into a list of results
    body_list = json.loads(response_body)

    # Extract the 'generated_text' from the first item in the list
    generated_text = body_list[0]['generated_text']

    # Separate the answer from the instruction and return it
    return generated_text.split("[/INST]")[-1].strip()
```


```py
%%time

# Example usage
prompt = "tell me a joke"
response = call_llama(prompt)
print(response)
```

Output:

```text
Sure! Here's a classic one:

Why don't scientists trust atoms?

Because they make up everything!

I hope that made you smile! Do you want to hear another one?
CPU times: user 88.5 ms, sys: 5.01 ms, total: 93.5 ms
Wall time: 2.49 s
```
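
The parsing steps inside `call_llama` can also be pulled out into a pure function, which makes them testable without calling the API. The sketch below assumes the response shape implied by the code above: a JSON object whose `body` is a JSON string encoding a list with a `generated_text` entry. The name `extract_answer` and the sample data are illustrative, not real API output.

```python
import json


def extract_answer(response_json: dict) -> str:
    """Pull the model's answer out of a decoded API response."""
    # 'body' is a JSON-encoded string, so it needs a second decode
    body_list = json.loads(response_json["body"])
    generated_text = body_list[0]["generated_text"]
    # Keep only the text after the last instruction marker
    return generated_text.split("[/INST]")[-1].strip()


# Hand-written sample in the assumed response shape
sample = {"body": json.dumps([{"generated_text": "<s>[INST] hi [/INST] Hello there!"}])}
print(extract_answer(sample))  # prints: Hello there!
```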
