# XGBoost Inference Notebook
 
## Classification on Amazon SageMaker

Perform a classification task on the given dataset.<br>
Using the given features, you will train an XGBoost decision tree model to predict a person's salary (the `WAGP` column), which is categorized into multiple bins.<br>

--- 

#### Tasks: 

- Perform Exploratory Data Analysis on the given dataset
- Save preprocessed datasets to Amazon S3
- Use the Amazon SageMaker platform to train an XGBoost model
- Evaluate the model on the test set using real-time inference
- Perform hyperparameter tuning on the XGBoost model


### 11. Get Amazon IAM execution role & instance region

Make sure to create an S3 bucket or reuse one from prior exercises.

In [42]:

import os
import warnings

# boto3 is needed later to create the SageMaker client
import boto3
import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

warnings.filterwarnings('ignore')

Get and store the IAM execution role, SageMaker session, instance region, and SageMaker client in the cell below.

#### **Expected output:** Print the instance region

In [43]:
# Define the IAM role; this will be necessary when defining your model
iam_role = get_execution_role()

# Set SageMaker session handle
sess = sagemaker.Session()

# Set the region of the instance 
my_region = sess.boto_session.region_name

print(my_region)

us-west-2


### 12. Prepare test data

Drop the label (`WAGP_CAT`) column and load the dataframe values in an array as shown in the demo.

#### **Expected Output -** Shape of the dataframe values array in tuple format

In [11]:
bucket = "sb-aws-bucket-mgt"
prefix = "data"
print('Using bucket ' + bucket)

Using bucket sb-aws-bucket-mgt


In [33]:

# Build the S3 URI of the test set (pandas reads s3:// paths via the s3fs package)
data_fname = "s3://{}/{}/{}".format(bucket, prefix, "model_data/test_data.csv")
test_df = pd.read_csv(data_fname)
test_df_array = test_df.drop(['WAGP_CAT'], axis=1).values
test_df_array.shape

(131085, 11)

### 13. Show the name and status of the deployed model endpoints 

Use the SageMaker client to get the list of active endpoints. Note that `list_endpoints()` returns a dictionary whose `Endpoints` key holds a list of dictionaries, one per endpoint.
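
The response from `list_endpoints()` is a dictionary whose `Endpoints` entry holds the per-endpoint dictionaries. The sketch below shows one way to pick out the endpoints that are ready to serve requests. It runs on a hand-written sample dictionary rather than a live API call; the helper name `in_service_endpoint_names` and the `example-endpoint` entry are hypothetical and used only for illustration.

```python
# Sample data shaped like the dictionary returned by list_endpoints().
# Two endpoint names are copied from this notebook's output; the
# "example-endpoint" entry is made up for illustration.
sample_response = {
    "Endpoints": [
        {"EndpointName": "xgboost-2024-03-20-01-47-34-585", "EndpointStatus": "InService"},
        {"EndpointName": "xgboost-2024-03-20-01-08-44-058", "EndpointStatus": "InService"},
        {"EndpointName": "example-endpoint", "EndpointStatus": "Creating"},
    ]
}


def in_service_endpoint_names(response):
    """Return the names of endpoints whose status is 'InService'."""
    return [
        ep["EndpointName"]
        for ep in response.get("Endpoints", [])
        if ep.get("EndpointStatus") == "InService"
    ]


active_names = in_service_endpoint_names(sample_response)
print(active_names)
```

With a live client, the same helper could be applied to the value returned by `sagemaker_client.list_endpoints()`.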


#### **Expected Output -** Name of active endpoints and their status

In [44]:

# Create a SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Get the list of active endpoints
endpoints = sagemaker_client.list_endpoints()
# Print endpoint information
for endpoint in endpoints['Endpoints']:
    print("Endpoint Name:", endpoint['EndpointName'])
    # print("EndpointArn:", endpoint['EndpointArn'])
    print("Status:", endpoint['EndpointStatus'])
    print()

Endpoint Name: xgboost-2024-03-20-01-47-34-585
Status: InService

Endpoint Name: xgboost-2024-03-20-01-08-44-058
Status: InService



### 14. Real-Time Inference using deployed endpoints 

Use the `sagemaker.predictor.Predictor` class to connect to the model deployed at the endpoint, as shown in the demo.

Use the following values for the parameters:
* `endpoint_name = <name-of-active-endpoint>`
* `sagemaker_session = sess`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `deserializer = sagemaker.deserializers.BytesDeserializer()`

Next, use the loaded model to make predictions on the test data array.

**NOTE:** Predictions are returned as a bytes object, so the contents need to be decoded into a string and converted to a numeric array. Refer to the demo for assistance.
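
The decoding step can be tried without an endpoint. The sketch below assumes the response is a UTF-8 byte string of numbers separated by commas or newlines; the sample bytes and the helper name `parse_predictions` are made up for illustration. It uses plain Python instead of `np.fromstring` so that both separators are handled.

```python
from collections import Counter


def parse_predictions(raw_bytes):
    """Decode a CSV-style byte response into a list of floats."""
    text = raw_bytes.decode("utf-8")
    # Treat newlines like commas so both separators are handled
    tokens = text.replace("\n", ",").split(",")
    # Skip empty tokens, e.g. the one left by a trailing newline
    return [float(tok) for tok in tokens if tok.strip()]


# Hypothetical response bytes, standing in for predictor.predict(...)
sample_bytes = b"0.0,1.0,0.0,2.0,0.0"
parsed = parse_predictions(sample_bytes)

# Counter plays the role of pd.Series(...).value_counts() here
counts = Counter(parsed)
print(parsed)
print(counts)
```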

#### **Expected Output:** Show the value counts of the predictions array

Hint: You can use `pd.Series` to convert the predictions array into a Series and then call `value_counts()` on it.



In [59]:

predictor = sagemaker.predictor.Predictor(endpoint_name='xgboost-2024-03-20-01-47-34-585',
                                          sagemaker_session=sess,
                                          serializer=sagemaker.serializers.CSVSerializer(),
                                          deserializer=sagemaker.deserializers.BytesDeserializer())
# Send the test array to the endpoint and decode the byte response into a string
predictions = predictor.predict(data=test_df_array).decode('utf-8')
# Parse the comma-separated string into a NumPy array
predictions_array = np.fromstring(predictions, sep=',')
# Convert predictions array into a pandas Series
predictions_series = pd.Series(predictions_array)

# Show the value counts of the predictions Series
print(predictions_series.value_counts())

0.0    131085
Name: count, dtype: int64


### Classification report 

Use the `classification_report` function to see how your model performs on the test set.

In [51]:
from sklearn.metrics import classification_report

#### **Expected output** - Classification report. Use `print` for formatting

**NOTE** - Ignore any warnings

In [53]:

y_true = test_df['WAGP_CAT'].values
y_pred = predictions_array.astype(int)

print(y_pred)
print(y_true)

print(classification_report(y_true, y_pred))

[0 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]
              precision    recall  f1-score   support

           0       0.64      1.00      0.78     83390
           1       0.00      0.00      0.00     36865
           2       0.00      0.00      0.00     10830

    accuracy                           0.64    131085
   macro avg       0.21      0.33      0.26    131085
weighted avg       0.40      0.64      0.49    131085
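
Because the value counts show that this model predicted class 0 for every row, the figures in the report follow directly from the class supports. The short calculation below is a sanity check, not part of the original exercise: it reproduces a few of the reported numbers by hand, using only the support counts printed in the report.

```python
# Support (number of true rows) per class, copied from the report above
support = {0: 83390, 1: 36865, 2: 10830}
total = sum(support.values())

# Every row was predicted as class 0, so all class-0 rows are recovered
# (recall 1.0) and class-0 precision equals the share of class-0 rows.
precision_0 = support[0] / total
recall_0 = 1.0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Accuracy is the share of rows whose true class is 0
accuracy = support[0] / total

# Classes 1 and 2 are never predicted, so their recall and precision are 0
macro_recall = (recall_0 + 0.0 + 0.0) / 3
weighted_precision = (support[0] / total) * precision_0

print(round(precision_0, 2), round(f1_0, 2), round(accuracy, 2))
print(round(macro_recall, 2), round(weighted_precision, 2))
```

The accuracy of 0.64 is therefore the majority-class share, which is what a model that always answers 0 would score on this test set.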



### Show the name and status of the deployed _TUNED_ model endpoint

Use the SageMaker client to get the list of active endpoints. Note that `list_endpoints()` returns a dictionary whose `Endpoints` key holds a list of dictionaries, one per endpoint.


#### **Expected Output -** Name of active endpoints after model tuning and their status

In [55]:
sagemaker_client = boto3.client('sagemaker')

# Get the list of active endpoints
endpoints = sagemaker_client.list_endpoints()
# Print endpoint information
for endpoint in endpoints['Endpoints']:
    print("Endpoint Name:", endpoint['EndpointName'])
    # print("EndpointArn:", endpoint['EndpointArn'])
    print("Status:", endpoint['EndpointStatus'])
    print()

Endpoint Name: XGBoost-Tuner-240320-0212-002-2b47cacb
Status: InService

Endpoint Name: xgboost-2024-03-20-01-47-34-585
Status: InService

Endpoint Name: xgboost-2024-03-20-01-08-44-058
Status: InService



### Real-Time Inference using deployed endpoints after model tuning 

Use the `sagemaker.predictor.Predictor` class to connect to the tuned model deployed at the endpoint, as shown in the demo.

Use the following values for the parameters:
* `endpoint_name = <name-of-active-endpoint>`
* `sagemaker_session = sess`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `deserializer = sagemaker.deserializers.BytesDeserializer()`

Next, use the loaded model to make predictions on the test data array.

**NOTE:** Predictions are returned as a bytes object, so the contents need to be decoded into a string and converted to a numeric array. Refer to the demo for assistance.
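
The notebook sends the whole test array in a single request. If a single request ever becomes too large for the endpoint, one option is to send the rows in smaller batches and concatenate the results. The sketch below illustrates that idea under stated assumptions: `predict_in_batches` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a function that would call the endpoint and parse its response.

```python
def predict_in_batches(rows, predict_fn, batch_size=1000):
    """Call predict_fn on consecutive slices of rows and collect the results."""
    results = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        results.extend(predict_fn(batch))
    return results


def fake_predict(batch):
    """Stand-in for an endpoint call: returns one dummy prediction per row."""
    return [0.0 for _ in batch]


# 2,500 made-up rows with two features each
rows = [[i, i + 1] for i in range(2500)]
batched = predict_in_batches(rows, fake_predict, batch_size=1000)
print(len(batched))
```

With a real endpoint, `predict_fn` could wrap `predictor.predict`, decode the bytes, and parse them into numbers before returning.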

#### **Expected Output:** Show the value counts of the predictions array





In [60]:
predictor = sagemaker.predictor.Predictor(endpoint_name='XGBoost-Tuner-240320-0212-002-2b47cacb',
                                          sagemaker_session=sess,
                                          serializer=sagemaker.serializers.CSVSerializer(),
                                          deserializer=sagemaker.deserializers.BytesDeserializer())
# Send the test array to the tuned endpoint and decode the byte response into a string
predictions = predictor.predict(data=test_df_array).decode('utf-8')
# Parse the comma-separated string into a NumPy array
predictions_array = np.fromstring(predictions, sep=',')
# Convert predictions array into a pandas Series
predictions_series = pd.Series(predictions_array)

# Show the value counts of the predictions Series
print(predictions_series.value_counts())


0.0    95945
1.0    32714
2.0     2426
Name: count, dtype: int64


### 21. Classification report after model tuning

Use the `classification_report` function to see how your model performs on the test set after hyperparameter tuning.

In [None]:

y_true = test_df['WAGP_CAT'].values
y_pred = predictions_array.astype(int)

print(y_pred)
print(y_true)

print(classification_report(y_true, y_pred))
