# Task 3: Deploy a model for serverless inference

## Task 3.1: Environment setup

Import the required packages and set up the execution role, Region, clients, and default bucket.

In [None]:
#install-dependencies
import boto3
import sagemaker
import sagemaker_datawrangler
import sys
import time

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sess = boto3.Session()
sm = sess.client('sagemaker')
prefix = 'sagemaker/mlasms'
bucket = sagemaker.Session().default_bucket()
s3_client = boto3.client("s3")

Save the model from the training and tuning lab in the default Amazon Simple Storage Service (Amazon S3) bucket. Set up a model using **create_model** and configure **ModelDataUrl** to reference the trained model.

In [None]:
#set-up-model
# Upload the model to your Amazon S3 bucket
s3_client.upload_file(
    Filename="model.tar.gz", Bucket=bucket, Key=f"{prefix}/models/model.tar.gz"
)

# Set a date to use in the model name
create_date = time.strftime("%Y-%m-%d-%H-%M-%S")
model_name = 'income-model-{}'.format(create_date)

# Retrieve the container image
container = sagemaker.image_uris.retrieve(
    region=region,
    framework='xgboost',
    version='1.5-1'
)

# Set up the model
income_model = sm.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': f's3://{bucket}/{prefix}/models/model.tar.gz',
    }
)

## Task 3.2: Create an endpoint from the provided synthesized, retrained model

Amazon SageMaker Serverless Inference is a purpose-built inference option that helps you deploy and scale machine learning (ML) models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and that can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, so you do not need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance, and automatic scaling.

There are three steps to creating a serverless endpoint using the Amazon SageMaker Python SDK. These are the same steps that are used for the real-time endpoint, but the steps have different configurations:
1. Create a SageMaker model in SageMaker.
2. Create an endpoint configuration for an HTTPS endpoint.
3. Create an HTTPS endpoint.

You have already created a model. You are now ready to create an endpoint configuration and an endpoint. 

First, set up the endpoint configuration name and the memory size that you want to use. Then, call the CreateEndpointConfig API.

To create an endpoint configuration, you need to set the following options:
- **VariantName**: The name of the production variant (one or more models in production).
- **ModelName**: The name of the model that you want to host. This is the name that you specified when you created the model.
- **ServerlessConfig**: This is where the endpoint is set as serverless. Configure the values for **MemorySizeInMB** and **MaxConcurrency**.
    - **MemorySizeInMB**: The allocated memory size (1024, 2048, 3072, 4096, 5120, or 6144 MB).
    - **MaxConcurrency**: The number of concurrent invocations (1 to 200).
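
The limits in the list above can be checked locally before calling the API. The following is a minimal sketch, not part of the lab: `build_serverless_config` is a hypothetical helper (it does not exist in Boto3 or the SageMaker SDK), and the allowed values are copied from the list above.

```python
# Hypothetical helper: validates values against the limits listed above
# before they are passed to create_endpoint_config.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_RANGE = (1, 200)


def build_serverless_config(memory_size_in_mb, max_concurrency):
    """Return a ServerlessConfig dict, or raise ValueError for out-of-range values."""
    if memory_size_in_mb not in ALLOWED_MEMORY_SIZES_MB:
        raise ValueError(
            f"MemorySizeInMB must be one of {ALLOWED_MEMORY_SIZES_MB}, got {memory_size_in_mb}"
        )
    low, high = MAX_CONCURRENCY_RANGE
    if not low <= max_concurrency <= high:
        raise ValueError(
            f"MaxConcurrency must be between {low} and {high}, got {max_concurrency}"
        )
    return {"MemorySizeInMB": memory_size_in_mb, "MaxConcurrency": max_concurrency}


serverless_config = build_serverless_config(2048, 20)
print(serverless_config)
```

The returned dictionary has the same shape as the **ServerlessConfig** value of a production variant, so a typo such as `2000` would fail here rather than in the API call.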

In [None]:
#create-endpoint-configuration 
# Create an endpoint config name based on the date so that you can search endpoints by creation time.
endpoint_config_name = 'income-model-serverless-endpoint-{}'.format(create_date)

endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "ModelName": model_name,
            "VariantName": "variant1",  # The name of the production variant
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,  # The allocated memory size in MB
                "MaxConcurrency": 20  # The number of concurrent invocations
            }
        }
    ]
)

print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")

Next, create an endpoint. When you create a serverless endpoint, SageMaker provisions and manages the compute resources for you. Then, you can make inference requests to the endpoint and receive model predictions in response. SageMaker scales the compute resources up and down as needed to handle your request traffic, and you only pay for what you use.

You can choose either a container provided by SageMaker or bring your own. A serverless endpoint has a minimum RAM size of 1024 MB and a maximum RAM size of 6144 MB. Serverless Inference auto-assigns compute resources proportional to the memory that you select.

When the endpoint is in service, the helper function prints the endpoint Amazon Resource Name (ARN). Endpoint creation can take as long as 7 minutes to run.

In [None]:
#create-endpoint
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = '{}-name'.format(endpoint_config_name)

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, 
    EndpointConfigName=endpoint_config_name
) 

def wait_for_endpoint_creation_complete(endpoint_name):
    """Helper function to wait for the completion of creating an endpoint"""
    response = sm.describe_endpoint(EndpointName=endpoint_name)
    status = response.get("EndpointStatus")
    # Poll every 15 seconds while the endpoint is still being created
    while status == "Creating":
        print("Waiting for Endpoint Creation")
        time.sleep(15)
        response = sm.describe_endpoint(EndpointName=endpoint_name)
        status = response.get("EndpointStatus")

    endpoint_arn = response.get("EndpointArn")
    if status != "InService":
        print(f"Failed to create endpoint, response: {response}")
        failure_reason = response.get("FailureReason", "")
        raise SystemExit(
            f"Failed to create endpoint {endpoint_arn}, status: {status}, reason: {failure_reason}"
        )
    print(f"Endpoint {endpoint_arn} successfully created.")

wait_for_endpoint_creation_complete(endpoint_name)
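
The helper in the cell above polls for as long as the status stays *Creating*, with no upper bound on the wait. The following is a minimal sketch of the same polling pattern with a timeout. The `describe` and `sleep` parameters are assumptions introduced so that the logic can be tried without AWS access; in this notebook, `describe` would be `sm.describe_endpoint`. Boto3 also offers a built-in waiter, `sm.get_waiter('endpoint_in_service')`, which may be preferable in practice.

```python
import time


def wait_for_endpoint(describe, endpoint_name, delay=15, max_attempts=40, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint leaves 'Creating'.

    Returns the final describe response. Raises RuntimeError if creation fails
    and TimeoutError if the endpoint is still creating after max_attempts polls.
    """
    for _ in range(max_attempts):
        response = describe(EndpointName=endpoint_name)
        status = response.get("EndpointStatus")
        if status == "InService":
            return response
        if status != "Creating":
            reason = response.get("FailureReason", "")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        sleep(delay)
    raise TimeoutError(f"Endpoint {endpoint_name} still creating after {max_attempts} polls")


# Stand-in for sm.describe_endpoint that reports 'Creating' twice, then 'InService'.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_fake_statuses)}


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(result["EndpointStatus"])
```

With the default values (40 polls, 15 seconds apart), the sketch gives up after about 10 minutes, which leaves some margin over the creation time mentioned above.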


In SageMaker Studio, you can view the endpoint details under the **Endpoints** tab.

The next step opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **serverless_inference.ipynb** tab to the side or choose (right-click) the **serverless_inference.ipynb** tab and choose **New View for Notebook**. You can now have the directions visible as you explore the endpoint.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the endpoint, return to the notebook by choosing the **serverless_inference.ipynb** tab.

1. Choose the **SageMaker Home** icon.
2. Choose **Deployments**.
3. Choose **Endpoints**.

SageMaker Studio displays the **Endpoints** tab.

4. Select the endpoint that has **income-model-serverless-** in the **Name** column.

If the endpoint does not appear, choose the refresh icon until the endpoint appears.

SageMaker Studio displays the **ENDPOINT DETAILS** tab.

5. Choose the **AWS settings** tab.

If you opened the endpoint before it finished creating, choose the refresh icon until the **Endpoint status** changes from *Creating* to *InService*.

The **Endpoint type** is listed as **Serverless**. The **Endpoint runtime settings** section shows the configurations that you chose earlier in the notebook.

## Task 3.3: Invoke an endpoint for a serverless inference with customer records

After you deploy your model using SageMaker hosting services, you can test your model on that endpoint by sending it test data.

If your endpoint receives no traffic for a while and then suddenly receives new requests, it can take some time to spin up the compute resources that process them. This is called a cold start. Because serverless endpoints provision compute resources on demand, your endpoint might experience cold starts. A cold start can also occur if your concurrent requests exceed the current concurrent request usage. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.

Refer to [Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html) for more information about how serverless inference and cold starts work.

You received several more customer records. Confirm that the endpoint is working by invoking it with a set of records that have an income value of 1 and a set of records that have an income value of 0. The cell prints the prediction scores for each record.
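
The request body in the following cell is assembled by hand from string literals. As a sketch, the same `text/csv` payload can be built from lists of feature values. `records_to_csv` is a hypothetical helper, and the sample rows are copied from that cell.

```python
def records_to_csv(records):
    """Serialize rows of feature values into a newline-terminated CSV payload (bytes)."""
    lines = (",".join(str(value) for value in row) for row in records)
    return "".join(line + "\n" for line in lines).encode("utf-8")


records = [
    [47, 0, 4, 9, 0, 3, 4, 0, 1, 0, 1902, 60],
    [53, 0, 0, 0, 0, 2, 4, 0, 1, 0, 0, 40],
]
payload = records_to_csv(records)
print(payload)
```

The resulting bytes could be passed as `Body` to `invoke_endpoint` together with `ContentType='text/csv'`.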

In [None]:
#invoke-endpoint-serverless-records
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region)

response = sagemaker_runtime.invoke_endpoint(
    ContentType='text/csv',
    EndpointName=endpoint_name, 
    Body=bytes('47,0,4,9,0,3,4,0,1,0,1902,60\n' +
                '53,0,0,0,0,2,4,0,1,0,0,40\n' +
                '44,0,0,0,2,0,1,0,1,14344,0,40\n', 'utf-8')
)

print(response)

print('\nTesting with records that have an income value of 1:')
print('The returned scores are: {}'.format(response['Body'].read().decode('utf-8')))

response = sagemaker_runtime.invoke_endpoint(
    ContentType='text/csv',
    EndpointName=endpoint_name, 
    Body=bytes('19,0,1,1,1,1,2,1,0,0,0,35\n' +
                '56,2,1,1,0,1,0,0,0,0,0,50\n' +
                '61,2,0,0,0,0,0,0,0,0,0,40\n', 'utf-8')
)

print('\nTesting with records that have an income value of 0:')
print('The returned scores are: {}'.format(response['Body'].read().decode('utf-8')))
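
The response body is plain text. A minimal sketch for turning it into numbers and class labels follows. It assumes that the model returns one probability-like score per record, that the scores are separated by commas or newlines, and that 0.5 is a reasonable cutoff. None of these assumptions is confirmed by the notebook, so compare against the printed output before relying on it. `parse_scores` and `scores_to_labels` are hypothetical helpers, and the sample string is made up.

```python
import re


def parse_scores(body_text):
    """Split a text response on commas or whitespace and convert each item to float."""
    return [float(token) for token in re.split(r"[,\s]+", body_text.strip()) if token]


def scores_to_labels(scores, threshold=0.5):
    """Map each score to 1 if it exceeds the (assumed) threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


sample_body = "0.91\n0.12\n0.67\n"  # made-up scores, not real endpoint output
scores = parse_scores(sample_body)
print(scores)
print(scores_to_labels(scores))
```

In the notebook, `sample_body` would be replaced by the decoded response, for example `response['Body'].read().decode('utf-8')`. The body stream can be read only once, so store the decoded text in a variable if you also want to print it.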

## Task 3.4: Delete the endpoint

Clean up the endpoint in three steps. First, delete the endpoint. Then, delete the endpoint configuration. Finally, if you no longer need the model that you deployed, delete the model.

In [None]:
#delete-resources
# Delete endpoint
sm.delete_endpoint(EndpointName=endpoint_name)

# Delete endpoint configuration
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
                   
# Delete model
sm.delete_model(ModelName=model_name)

### Conclusion

Congratulations! You have successfully created a serverless endpoint using the SageMaker Python SDK and invoked the endpoint.

The next task of the lab focuses on deploying a model for inference using asynchronous inference.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with **Task 4: Deploy a model for asynchronous inference**.