## Task 2: Deploy a model for real-time inference

After you build, train, and evaluate your machine learning (ML) model to confirm that it solves the intended business problem, you deploy the model to support decision-making in business operations. Amazon SageMaker offers a broad range of deployment options, from low-latency, high-throughput endpoints to long-running inference jobs. With SageMaker Inference, you can either set up an endpoint that returns inferences or run batch inferences from your model.

SageMaker provides multiple inference options so that you can pick the one that best suits your workload. In this lab, you use real-time inference, which is a good fit for workloads that need low, consistent latency and are sensitive to throughput. Real-time inference gives you a persistent, fully managed endpoint (REST API) that can handle sustained traffic, backed by the instance type of your choice. It supports payload sizes up to 6 MB and processing times of up to 60 seconds.
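
As a rough illustration of the payload limit, the following sketch checks whether a CSV payload fits before you send it to an endpoint. The helper name `fits_real_time_payload_limit` is hypothetical, the sample record is made up, and the byte value assumes that 6 MB means 6 × 1024 × 1024 bytes. Check the current service quotas before you rely on the exact number.

```python
# Assumption: the 6 MB limit is interpreted as 6 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def fits_real_time_payload_limit(payload: str, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True if the UTF-8 encoded payload is within the size limit."""
    return len(payload.encode("utf-8")) <= limit


# Illustrative CSV record only; it is not taken from the lab dataset.
row = "39,77516,13,2174,0,40"
print(fits_real_time_payload_limit(row))
```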

To prepare the model for deployment, you define an endpoint configuration that specifies the name of one or more models in production (variants) and the ML compute instances that you want SageMaker to launch to host each production variant. For each production variant, you specify the number of ML compute instances to deploy, and you can configure the endpoint to scale those instances elastically. When you specify two or more instances, SageMaker launches them in multiple Availability Zones to support continuous availability. SageMaker manages deployment of the instances. When you have your model and endpoint configuration, use the [CreateEndpoint API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) to create your endpoint and provide the endpoint configuration to SageMaker. The service then launches the ML compute instances and deploys the model or models as specified in the configuration.

In this task, you deploy a model, create an endpoint, and run real-time inference on the endpoint.

### Task 2.1: Set up the environment

Before you deploy your model, import the necessary dependencies and set up the AWS clients.

In [None]:
# Import dependencies
import boto3
import pandas as pd
import sagemaker
import time
import numpy as np
import seaborn as sns

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sess = boto3.Session()
sm = sess.client('sagemaker')
sm_runtime = sess.client("sagemaker-runtime")
s3_client = boto3.client("s3")

Next, import the dataset.

In [None]:
# Import the dataset 
s3 = boto3.resource('s3')
for buckets in s3.buckets.all():
    if 'labdatabucket' in buckets.name:
        bucket = buckets.name
print("Bucket: ", bucket)
prefix = 'scripts/data'

# Download the file to use for inference later in the notebook
s3_client.download_file(bucket, f'{prefix}/test/adult_data_processed_test.csv', 'adult_data_processed_test.csv')

# Open adult_data_processed_test.csv with pandas, save the first column as df_labels, remove the labels from the dataset, and save the result to adult_data_processed_test_no_target.csv
df = pd.read_csv('adult_data_processed_test.csv')
df_labels = df.iloc[:, 0]
df = df.drop(df.columns[0], axis=1)
df.to_csv('adult_data_processed_test_no_target.csv', index=False, header=False)
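
To see what this step does to each record, here is a minimal sketch on a tiny in-memory CSV that uses only the standard library instead of pandas. The column names and values are made up for illustration and are not taken from the lab dataset.

```python
import csv
import io

# Illustrative data only: the first column is the label, the rest are features.
raw = "income,age,hours\n1,39,40\n0,50,13\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # Read the header row separately so that it is not written back
rows = list(reader)

labels = [int(r[0]) for r in rows]  # Same idea as df.iloc[:, 0]
features = [r[1:] for r in rows]    # Same idea as dropping the first column

# Write the features without a header or index, like the file used for inference
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(features)

print(labels)
print(out.getvalue())
```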

Then import the best model that you selected from the previous lab. The **model.tar.gz** file was uploaded during the lab environment creation. 

In [None]:
# Upload the model to your Amazon S3 bucket
s3_client.upload_file(
    Filename="model.tar.gz", Bucket=bucket, Key=f"{prefix}/models/model.tar.gz"
)

# Set a date to use in the model name
create_date = time.strftime("%Y-%m-%d-%H-%M-%S")
model_name = 'income-model-{}'.format(create_date)

# Retrieve the container image
container = sagemaker.image_uris.retrieve(
    region=region, 
    framework='xgboost', 
    version='1.5-1'
)

# Set up the model
income_model = sm.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': f's3://{bucket}/{prefix}/models/model.tar.gz',
    }
)

### Task 2.2: Configure an endpoint

You created the model in the previous step, so you are now ready to create an endpoint configuration.

Set up the endpoint configuration name and the instance type that you want to use.

To create an endpoint configuration, you need to set the following options:
- **VariantName**: The name of the production variant (one or more models in production).
- **ModelName**: The name of the model that you want to host. This is the name that you specified when you created the model.
- **InstanceType**: The compute instance type.
- **InitialInstanceCount**: The number of instances to launch initially.

To log the inputs to your endpoint and the inference outputs from SageMaker real-time endpoints to Amazon S3, you can enable a feature called Data Capture. Data Capture is commonly used to record information that can be used for training, debugging, and monitoring. When Data Capture is enabled, Amazon SageMaker Studio displays more details about the endpoint when you explore it. The endpoint configuration in this task includes a Data Capture configuration to show you how to enable it.

Refer to [Capture Data](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html) for more information about adding Data Capture.

In [None]:
# Create an endpoint config name. Here you create one based on the date
# so that you can search endpoints based on creation time.
endpoint_config_name = 'income-model-real-time-endpoint-{}'.format(create_date)                              
instance_type = 'ml.m5.xlarge'   
initial_sampling_percentage = 25 # Choose a value between 0 and 100
capture_modes = [ "Input",  "Output" ] # Specify input, output, or both

endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, # You will specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": instance_type, # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ],
    DataCaptureConfig= {
        'EnableCapture': True, # Whether data should be captured or not.
        'InitialSamplingPercentage' : initial_sampling_percentage,
        'DestinationS3Uri': f's3://{bucket}/data-capture',
        'CaptureOptions': [{"CaptureMode" : capture_mode} for capture_mode in capture_modes]
    }
)

print(f"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}")
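
The Data Capture settings are a plain dictionary, so you can validate them before you call `create_endpoint_config`. The following is a minimal sketch under that assumption. The helper name `build_data_capture_config` is hypothetical, and the bucket name in the usage line is a placeholder.

```python
def build_data_capture_config(bucket, sampling_percentage, capture_modes):
    """Build a DataCaptureConfig dictionary (hypothetical helper, not a SageMaker API)."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")

    # The lab uses "Input" and "Output" as the capture modes
    allowed_modes = {"Input", "Output"}
    unknown_modes = set(capture_modes) - allowed_modes
    if not capture_modes or unknown_modes:
        raise ValueError(f"capture_modes must be a non-empty subset of {sorted(allowed_modes)}")

    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": f"s3://{bucket}/data-capture",
        "CaptureOptions": [{"CaptureMode": mode} for mode in capture_modes],
    }


# "example-bucket" is a placeholder, not the lab bucket
config = build_data_capture_config("example-bucket", 25, ["Input", "Output"])
print(config)
```

You could pass the returned dictionary as the `DataCaptureConfig` argument in place of the inline dictionary.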

### Task 2.3: Create an endpoint

Next, create an endpoint. When you create a real-time endpoint, SageMaker launches the ML compute instances and deploys one or more models as specified in the configuration. In this lab, you deploy only one model for inference, but SageMaker also supports multi-model endpoints. Refer to [Invoke a Multi-Model Endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/invoke-multi-model-endpoint.html) for more information about multi-model endpoints.

When the endpoint is in service, the helper function will print the endpoint Amazon Resource Name (ARN). Endpoint creation will take approximately 3–7 minutes to run.

In [None]:
# Create the endpoint
# Set the name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = '{}-name'.format(endpoint_config_name)

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, 
    EndpointConfigName=endpoint_config_name
) 

def wait_for_endpoint_creation_complete(endpoint_name):
    """Helper function that waits for endpoint creation to complete"""
    response = sm.describe_endpoint(EndpointName=endpoint_name)
    status = response.get("EndpointStatus")
    # Poll every 15 seconds while the endpoint is still being created
    while status == "Creating":
        print("Waiting for Endpoint Creation")
        time.sleep(15)
        response = sm.describe_endpoint(EndpointName=endpoint_name)
        status = response.get("EndpointStatus")

    endpoint_arn = response.get("EndpointArn")
    if status != "InService":
        print(f"Failed to create endpoint, response: {response}")
        failure_reason = response.get("FailureReason", "")
        raise SystemExit(
            f"Failed to create endpoint {endpoint_arn}, status: {status}, reason: {failure_reason}"
        )
    print(f"Endpoint {endpoint_arn} successfully created.")

wait_for_endpoint_creation_complete(endpoint_name)
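
As an alternative to a hand-written polling loop, boto3 provides waiters for some SageMaker resources. The following sketch assumes that the SageMaker client exposes a waiter named `endpoint_in_service`, which is my understanding of the boto3 API. The function name `wait_with_builtin_waiter` and its default values are hypothetical.

```python
def wait_with_builtin_waiter(sm_client, endpoint_name, delay_seconds=15, max_attempts=60):
    """Block until the endpoint is in service, then return its ARN.

    Assumption: sm_client is a boto3 SageMaker client with an
    "endpoint_in_service" waiter. The waiter raises an exception if the
    endpoint fails or if the maximum number of attempts is reached.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointArn"]


# Example usage in the notebook (requires a live endpoint):
# endpoint_arn = wait_with_builtin_waiter(sm, endpoint_name)
```

The waiter removes the need to manage the loop yourself, but the helper in the notebook prints the failure reason, which can be more convenient while you are learning.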

In SageMaker Studio, you can review the endpoint details under the **Endpoints** tab.

The next step opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **real_time_inference.ipynb** tab to the side or choose the **real_time_inference.ipynb** tab, and then from the toolbar, select **File** and **New View for Notebook**. You can now have the directions visible as you explore the endpoint.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the endpoint, return to the notebook by choosing the **real_time_inference.ipynb** tab.

1. Choose the **SageMaker Home** icon.
2. Choose **Deployments**.
3. Choose **Endpoints**.

SageMaker Studio displays the **Endpoints** tab.

4. Select the endpoint that has **income-model-real-time-** in the **Name** column.

If the endpoint does not appear, choose the refresh icon until the endpoint appears in the list.

SageMaker Studio displays the **ENDPOINT DETAILS** tab.

5. Choose the **AWS settings** tab.

If you opened the endpoint before it finished creating, choose the refresh icon until the **Endpoint status** changes from *Creating* to *InService*.

The **Endpoint type** is listed as **Real-time**. The **Data capture settings** and **Endpoint configuration settings** sections show the configurations that you chose earlier in the notebook.
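
You can read the same details programmatically. The following is a minimal sketch that summarizes a `describe_endpoint` response. The helper name `summarize_endpoint` and the keys of the returned dictionary are hypothetical, and the response field names reflect my understanding of the DescribeEndpoint API.

```python
def summarize_endpoint(sm_client, endpoint_name):
    """Return the endpoint status and Data Capture settings as a small dictionary.

    Assumption: sm_client is a boto3 SageMaker client.
    """
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    capture = response.get("DataCaptureConfig", {})
    return {
        "status": response.get("EndpointStatus"),
        "endpoint_config": response.get("EndpointConfigName"),
        "data_capture_enabled": capture.get("EnableCapture", False),
        "sampling_percentage": capture.get("CurrentSamplingPercentage"),
        "capture_destination": capture.get("DestinationS3Uri"),
    }


# Example usage in the notebook (requires a live endpoint):
# print(summarize_endpoint(sm, endpoint_name))
```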

### Task 2.4: Test an endpoint and generate a prediction

After you deploy your model by using SageMaker hosting services, you can test your model on that endpoint by sending it test data.

You have several customer records that you know show an income less than 50,000 USD (an **income** value of **1**), and several that have an income greater than or equal to 50,000 USD (an **income** value of **0**). Invoke the endpoint with these records and view the returned scores.

To view real-time predictions from the endpoint, you read the returned body text from the response, which contains the prediction score. The score for each record ranges from **0** to **1**, with numbers closer to **1** indicating that those customers are more likely to have an income less than 50,000 USD. For example, a customer with a prediction score of **0.42** is more likely to have an income less than 50,000 USD than a customer with a prediction score of **0.14**. To calculate precision, recall, and F1 score, each score is converted to 0 or 1 by using a cutoff value of 0.5, and the results are compared to the target data labels that you saved earlier in the notebook.

**Note:** There are over 5,000 records in the **adult_data_processed_test_no_target.csv** dataset. It takes 2–3 minutes to complete all of the real-time inference requests.

In [None]:
# Get model predictions
predictions = []

# Set the cutoff value for the binary classification problem
cutoff_value = 0.5

# Determine the classification based on a probability value
def convert_probability_to_binary(probability):
    return 1 if probability >= cutoff_value else 0

print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")
with open("adult_data_processed_test_no_target.csv", "r") as f:
    for row in f:
        # Send one record per request
        payload = row.rstrip("\n")
        response = sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload
        )

        # The response body contains the prediction score for the record
        pred_probability = response["Body"].read().decode("utf-8")
        predictions.append(convert_probability_to_binary(float(pred_probability)))

# Convert the list of 0 and 1 predictions to a numpy array
pred_np = np.array(predictions)

print("Done!")
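
Sending one record per request keeps the example simple, but it also accounts for most of the run time. The following sketch shows one way to group records into larger payloads and to parse several scores from one response. It assumes that the model container accepts multiple newline-separated CSV rows in a single request and returns scores separated by commas or newlines; verify both assumptions for your container before you use it. The function names `batch_rows` and `parse_scores` are hypothetical.

```python
def batch_rows(rows, batch_size):
    """Yield newline-joined payloads that contain up to batch_size rows each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue  # Skip blank lines
        batch.append(row)
        if len(batch) == batch_size:
            yield "\n".join(batch)
            batch = []
    if batch:
        yield "\n".join(batch)  # Remaining rows that do not fill a full batch


def parse_scores(body_text):
    """Parse scores that are separated by commas, newlines, or both."""
    tokens = body_text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


# Illustrative rows only; in the notebook you would iterate over the CSV file
payloads = list(batch_rows(["1,2\n", "3,4\n", "5,6\n"], batch_size=2))
print(payloads)
print(parse_scores("0.1,0.9\n0.5\n"))
```

Each payload must still fit within the real-time inference payload limit, so choose the batch size with that in mind.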

### Task 2.5: Evaluate the prediction power with a confusion matrix

Because ML models are approximations of the real world, some of the model's predictions are likely to be wrong. In some applications, all types of prediction errors have equal impact. In others, one kind of error can be much more costly or consequential than another, whether measured in dollars, time, or something else. In ML, the accuracy of a model is defined as the number of correct predictions divided by the total number of predictions. In this lab, you are trying to predict whether people make less than 50,000 USD so that you can promote government assistance services to qualified citizens.

The confusion matrix compares the predicted and actual values. It illustrates in a table the number of correct and incorrect predictions for each class by comparing an observation's predicted class and its true class. When you run the code cell, you see **true positive (TP)**, **true negative (TN)**, **false positive (FP)**, and **false negative (FN)** values.

In [None]:
#evaluate-prediction-power
# Configure the metrics for a confusion matrix with the original labels and the predictions
cf_matrix = confusion_matrix(df_labels, pred_np)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                        cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
            zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

# Display the confusion matrix
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

When you evaluate inference results, there are several measures that you can use to see how effective and accurate your model is. The most common measures are **accuracy**, **precision**, **recall**, and **F1 score**. 

- **Accuracy**: A measure calculated by dividing the sum of true positives and true negatives by the total number of predictions. The calculation is (TP + TN) / (TP + TN + FP + FN).
- **Precision**: A measure calculated by dividing the true positives by the sum of true positives and false positives. The calculation is TP / (TP + FP).
- **Recall**: A measure calculated by dividing the true positives by the sum of true positives and false negatives. The calculation is TP / (TP + FN).
- **F1 score**: A measure calculated by multiplying precision and recall, dividing that number by the sum of precision and recall, and then multiplying the result by 2. The calculation is 2 * ((precision * recall) / (precision + recall)).
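
The four measures above can be worked through with a small sketch. The function name `classification_metrics` is hypothetical, and the counts in the usage line are made up for illustration; they are not results from the lab model.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 score from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only
metrics = classification_metrics(tp=80, tn=90, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```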

For this dataset, you want high *precision*. The advocacy group does not have enough funds to provide support for everyone who makes less than 50,000 USD. They want to maximize their impact by correctly identifying, with high *precision*, individuals who make less than 50,000 USD.

In [None]:
# Display the classification report
print('You can view a full classification report below:\n')

# Calculate the precision, recall, and F1 score of the predictions against df_labels using sklearn
print(classification_report(df_labels, pred_np))

What is the **precision** value listed next to the line starting with **1**? Is **recall** higher or lower than **precision**? 

If you want to tune your model, your goal is to continue increasing precision, even if it means that recall drops. The trade-off between precision and recall is a common ML problem. The F1 score combines both measures into a single number, which is useful when a use case needs a balance between them. For this use case, it is fine if the F1 score and the recall both drop, as long as precision increases.
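
One simple way to see this trade-off is to change the cutoff value that converts scores to classes. The following sketch compares two cutoffs on a handful of made-up labels and scores; the data is illustrative only and does not come from the lab model. The function name `precision_recall_at_cutoff` is hypothetical.

```python
def precision_recall_at_cutoff(labels, scores, cutoff):
    """Return (precision, recall) for the positive class at the given cutoff."""
    preds = [1 if score >= cutoff else 0 for score in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels and scores only
labels = [1, 1, 1, 1, 0, 0, 0, 1]
scores = [0.95, 0.80, 0.65, 0.55, 0.60, 0.30, 0.52, 0.40]

for cutoff in (0.5, 0.7):
    precision, recall = precision_recall_at_cutoff(labels, scores, cutoff)
    print(f"cutoff={cutoff}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, the higher cutoff removes the false positives and raises precision, but it also misses more true positives and lowers recall.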


### Task 2.6: Delete the endpoint

Cleaning up an endpoint can be accomplished in three steps. First, delete the endpoint. Then delete the endpoint configuration. Finally, if you no longer need the model that you deployed, delete the model.

In [None]:
#delete-resources
# Delete endpoint
sm.delete_endpoint(EndpointName=endpoint_name)

# Delete endpoint configuration
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
                   
# Delete model
sm.delete_model(ModelName=model_name)
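
To confirm that the cleanup worked, you can list any endpoints that still match the lab naming pattern. The following is a minimal sketch; the helper name `remaining_endpoints` is hypothetical, it reads only the first page of results, and it assumes that `list_endpoints` accepts a `NameContains` filter, which is my understanding of the ListEndpoints API.

```python
def remaining_endpoints(sm_client, name_fragment):
    """Return (name, status) pairs for endpoints whose names contain name_fragment.

    Assumption: sm_client is a boto3 SageMaker client. Only the first page of
    results is read, which is enough for a small lab account.
    """
    response = sm_client.list_endpoints(NameContains=name_fragment)
    return [
        (endpoint["EndpointName"], endpoint.get("EndpointStatus"))
        for endpoint in response.get("Endpoints", [])
    ]


# Example usage in the notebook:
# print(remaining_endpoints(sm, "income-model-real-time-"))
```

An endpoint can still appear in the list with a deleting status for a short time after you call `delete_endpoint`.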


### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.