# MGTA 466: Programming Assignment 5 - XGBoost Inference Notebook
<!-- 
## Classification on Amazon SageMaker

Perform a classification task on the given dataset.<br>
Using the features given, you will train a XGBoost decision tree model to predict a given person's salary (the `WAGP` column) - which will be categorized into multiple bins.<br>

--- 

#### Tasks: 

- Perform Exploratory Data Analysis on the given dataset
- Save preprocessed datasets to Amazon S3
- Use the Amazon Sagemaker platform to train an XGBoost model
- Evaluate the model on the test set using real-time inference
- Perform hyperparameter tuning on the XGBoost model

#### Submission on Gradescope:
You need to submit the following three files under "PA5":
- The current notebook - **PA5_Starter.ipynb**
    - **IMPORTANT** - Make sure all the cell outputs are present in the notebook
- The inference notebook - **PA5_Inference.ipynb**
- Screenshot of SageMaker dashboard showing no running jobs (nothing should be in green) - **sagemaker_ss.png**
 -->
#### IMPORTANT submission guidelines enforced by autograder. Please read carefully:
  * Make sure that all the cells in this notebook are executed before submission
  * Some cells are marked **DO NOT DELETE**. These cells cannot be deleted and the output of these cells will be used for autograding
  * You can add cells or delete other cells (deleting is NOT recommended), but the **Expected Output** for each task MUST be the output of the cell marked as such
  * DO NOT print anything other than the *exact* expected output. Do not include any sentences describing the output. This is strictly enforced by the autograder which checks for an *exact* match of the expected output. For example, if you are expected to print the PySpark version:
      * '10.9.8' - <span style="color:#093">CORRECT</span>
      * 'The PySpark version is 10.9.8' - <span style="color:#FF0000">INCORRECT</span>
  * You can add cells for printing debugging information anywhere, but do not print anything else in **Expected Output** cells other than the expected output for the task
  
**NOTE** - In this Assignment, some of the cells may have additional logging output and that is acceptable

---

Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame. 

Pandas API documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Amazon SageMaker API documentation: https://sagemaker.readthedocs.io/en/stable/

Amazon SageMaker Tutorials: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html 

---

### Please make sure to complete steps 1 to 10.1 in the starter notebook before proceeding with step 11 below

### 11. Get Amazon IAM execution role & instance region

Make sure to create an S3 bucket or reuse one from a prior exercise.

In [1]:
import os, sagemaker
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sagemaker import get_execution_role



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Get and store the IAM execution role, the SageMaker session, the instance region, and the SageMaker client in the cell below.

#### **Expected output:** Print the instance region

In [2]:
# Define IAM role- this will be necessary when defining your model
iam_role = get_execution_role()

# Set SageMaker session handle
sess = sagemaker.Session()

# Set the region of the instance 
my_region = sess.boto_session.region_name

# Set the sagemaker client
sagemaker_client = sess.boto_session.client('sagemaker')  # reuse the session created above

print("Success - the SageMaker instance is in the " + my_region + " region")

Success - the SageMaker instance is in the us-west-2 region


### 12. Prepare test data - 0.5 points

Read the test data from the S3 location where you stored it in step 5 of the starter notebook.
 
Drop the target (`WAGP`) column and load the remaining dataframe values into an array, as shown in the demo.

#### **Expected Output -** Print shape of the dataframe values array in tuple format

In [3]:
bucket = "wew-s3-demo"
prefix = "data"
print('Using bucket ' + bucket)

data_fname = "s3://{}/{}/{}".format(bucket, prefix, "model_data/test_data.csv")
test_df  = pd.read_csv(data_fname)

Using bucket wew-s3-demo


In [4]:
test_df_array = test_df.drop(['WAGP'], axis=1).values

In [None]:
test_df_array.shape

131988
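
As a small illustration of the step above, the sketch below drops the target column from a toy dataframe and compares the shapes before and after. The toy rows and the feature column names (`AGEP`, `WKHP`) are placeholders chosen for this example, not values taken from the assignment data.

```python
import pandas as pd

# Tiny stand-in for the test set; feature names and values are hypothetical.
toy_df = pd.DataFrame({
    "WAGP": [52000.0, 31000.0, 78000.0],
    "AGEP": [34, 27, 51],
    "WKHP": [40, 35, 45],
})

# Keep the target aside, then build the feature array without it.
toy_target = toy_df["WAGP"].values
toy_features = toy_df.drop(columns=["WAGP"]).values

print(toy_df.shape)        # (3, 3): still includes the target column
print(toy_features.shape)  # (3, 2): one column fewer than the dataframe
```

The array built from the dropped dataframe has one column fewer than the dataframe itself, which is a quick way to confirm the target is not being sent to the endpoint as a feature.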

### 13. Show the name and status of the deployed model endpoint - 1 point

Use the SageMaker client to print the list of active endpoints.

<b>Hint: </b> Note that this returns a list of dictionaries. Use a for loop to print out the name and status of each endpoint in the list.

Useful Function: [sagemaker_client.list_endpoints()](https://boto3.amazonaws.com/v1/documentation/api/1.26.94/reference/services/sagemaker/client/list_endpoints.html)

#### **Expected Output -** Name and status of active endpoint on separate lines

In [6]:
# List endpoints
endpoints = sagemaker_client.list_endpoints()

In [7]:
# Print endpoint information
for endpoint in endpoints['Endpoints']:
    print("Endpoint Name:", endpoint['EndpointName'])
    # print("EndpointArn:", endpoint['EndpointArn'])
    print("Status:", endpoint['EndpointStatus'])
    print()

Endpoint Name: xgboost-2025-03-17-07-02-35-198
Status: InService
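
Rather than copying the endpoint name by hand into later cells, one option is to pick it out of the response programmatically. The sketch below works on a mocked response dictionary with made-up endpoint names, so it runs without AWS access; `in_service_names` is a hypothetical helper, not part of the SageMaker SDK.

```python
# Mock of the dictionary shape returned by sagemaker_client.list_endpoints();
# the endpoint names are invented for illustration.
mock_response = {
    "Endpoints": [
        {"EndpointName": "demo-endpoint-a", "EndpointStatus": "InService"},
        {"EndpointName": "demo-endpoint-b", "EndpointStatus": "Creating"},
    ]
}


def in_service_names(response):
    """Return the names of endpoints whose status is 'InService'."""
    return [
        ep["EndpointName"]
        for ep in response.get("Endpoints", [])
        if ep["EndpointStatus"] == "InService"
    ]


active = in_service_names(mock_response)
print(active)
```

With a real client, the same helper could be applied to the `endpoints` dictionary from the cell above.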



### 14. Real-Time Inference using deployed endpoints - 1 point

Use the `sagemaker.predictor.Predictor()` class to connect to the model deployed at the endpoint, as shown in the demo. 

Use the following values for the parameters:
* `endpoint_name = <name-of-active-endpoint>`
* `sagemaker_session = sess`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `deserializer = sagemaker.deserializers.BytesDeserializer()`

Next, use the loaded model to make predictions on the test data array.

**NOTE:** Predictions are returned as a bytes object, so the contents need to be decoded into a string and converted to a numeric array. Refer to the demo for assistance.
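
As a rough illustration of the decoding step described in the note, the sketch below parses a made-up bytes payload into a NumPy array. The payload is a placeholder; it also assumes values may be separated by commas or newlines, which may not match what your endpoint actually returns.

```python
import numpy as np

# Hypothetical response body; a real endpoint returns one value per input row.
raw_response = b"59386.5625,44999.6171875,31737.544921875"

# Step 1: bytes -> str
text = raw_response.decode("utf-8")

# Step 2: str -> list of tokens, treating newlines as separators too (assumption)
tokens = [tok for tok in text.replace("\n", ",").split(",") if tok.strip()]

# Step 3: tokens -> float array
values = np.array(tokens, dtype=float)
print(values.shape)
```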

#### **Expected Output:** Show the predictions array as a Pandas Series

Hint: You can use `pd.Series` to convert the predictions array into a Series.



In [8]:
predictor = sagemaker.predictor.Predictor(endpoint_name='xgboost-2025-03-17-07-02-35-198',
                                          sagemaker_session=sess,
                                          serializer=sagemaker.serializers.CSVSerializer(),
                                          deserializer=sagemaker.deserializers.BytesDeserializer())

In [9]:
predictions = predictor.predict(data=test_df_array).decode('utf-8')  # call the endpoint and decode the bytes response
predictions_array = np.fromstring(predictions, sep=',')  # parse the comma-separated string into a float array

In [10]:
predictions_df = pd.Series(predictions_array)
predictions_df

0         59386.562500
1         44999.617188
2         31737.544922
3         54657.617188
4         59719.851562
              ...     
131983    54472.523438
131984    66361.296875
131985    39539.285156
131986    50260.347656
131987    59719.851562
Length: 131988, dtype: float64
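
The cell above sends the whole test array in one request. Real-time endpoints limit the size of a request payload, so if a single call is rejected for a larger dataset, one possible workaround is to predict in row blocks and concatenate the results. The sketch below shows the idea with a stand-in prediction function; `predict_in_chunks`, `fake_predict`, and the chunk size are all hypothetical.

```python
import numpy as np


def predict_in_chunks(predict_fn, features, chunk_size=10_000):
    """Call predict_fn on consecutive row blocks and concatenate the results."""
    parts = []
    for start in range(0, len(features), chunk_size):
        chunk = features[start:start + chunk_size]
        parts.append(np.asarray(predict_fn(chunk), dtype=float))
    return np.concatenate(parts) if parts else np.array([], dtype=float)


# Stand-in for the endpoint call: returns row sums instead of real predictions.
def fake_predict(rows):
    return rows.sum(axis=1)


toy_features = np.arange(12, dtype=float).reshape(6, 2)
chunked = predict_in_chunks(fake_predict, toy_features, chunk_size=4)
print(chunked.shape)
```

With a real endpoint, `predict_fn` would wrap the `predictor.predict(...)` call together with the decoding step, so that each chunk comes back as a numeric array.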

### 15. Calculate RMSE - 1 point

Use the `root_mean_squared_error` function from `sklearn.metrics` to see how your model performs on the test set. 

In [11]:
from sklearn.metrics import root_mean_squared_error

#### **Expected output** - RMSE on test set between the predicted values and target values. Use `print()` for formatting

In [12]:
rmse = root_mean_squared_error(test_df['WAGP'].values, predictions_array)
print(rmse)

47402.46804613207
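
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on toy numbers, which can serve as a sanity check; `rmse_manual` is a hypothetical helper and the values are not from the assignment data. Note that `root_mean_squared_error` was added in scikit-learn 1.4, so older environments may not have it.

```python
import numpy as np


def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5/3.
toy_rmse = rmse_manual(toy_true, toy_pred)
print(toy_rmse)
```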


### Please make sure to complete steps 16 to 18.1 in the starter notebook before proceeding with step 19 below

**NOTE:** Do not forget to delete the endpoint after use, as described in step 10.2 of the starter notebook.

### 19. Show the name and status of the deployed _TUNED_ model endpoint - 0.5 points

Use the SageMaker client to get the list of active endpoints. Note that this returns a list of dictionaries.


#### **Expected Output -** Name of active endpoints after model tuning and their status

In [13]:
# List endpoints
endpoints = sagemaker_client.list_endpoints()

In [14]:
# Print endpoint information
for endpoint in endpoints['Endpoints']:
    print("Endpoint Name:", endpoint['EndpointName'])
    # print("EndpointArn:", endpoint['EndpointArn'])
    print("Status:", endpoint['EndpointStatus'])
    print()

Endpoint Name: XGBoost-Tuner-250317-0711-004-0cc24b8c
Status: InService



### 20. Real-Time Inference using deployed endpoints after model tuning - 0.5 points

Use the `sagemaker.predictor.Predictor()` class to connect to the tuned model deployed at the endpoint, as shown in the demo. 

Use the following values for the parameters:
* `endpoint_name = <name-of-active-endpoint>`
* `sagemaker_session = sess`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `deserializer = sagemaker.deserializers.BytesDeserializer()`

Next, use the loaded model to make predictions on the test data array.

**NOTE:** Predictions are returned as a bytes object, so the contents need to be decoded into a string and converted to a numeric array. Refer to the demo for assistance.

#### **Expected Output:** Show the predictions array as a Pandas Series

Hint: You can use `pd.Series` to convert the predictions array into a Series.



In [15]:
predictor2 = sagemaker.predictor.Predictor(endpoint_name='XGBoost-Tuner-250317-0711-004-0cc24b8c',
                                          sagemaker_session=sess,
                                          serializer=sagemaker.serializers.CSVSerializer(),
                                          deserializer=sagemaker.deserializers.BytesDeserializer())

In [16]:
predictions2 = predictor2.predict(data=test_df_array).decode('utf-8')  # call the tuned endpoint and decode the bytes response
predictions_array2 = np.fromstring(predictions2, sep=',')  # parse the comma-separated string into a float array

In [17]:
predictions_df2 = pd.Series(predictions_array2)
predictions_df2

0         46521.738281
1         56275.281250
2         34065.164062
3         53835.527344
4         52869.089844
              ...     
131983    51478.023438
131984    60065.277344
131985    29250.210938
131986    48395.433594
131987    61578.914062
Length: 131988, dtype: float64

### 21. Calculate RMSE - 0.5 points

Use the `root_mean_squared_error` function to see how your tuned model performs on the test set. 

#### **Expected output** - RMSE on test set between the predicted values and target values. Use `print()` for formatting

In [18]:
rmse2 = root_mean_squared_error(test_df['WAGP'].values, predictions_array2)
print(rmse2)

45599.975478876884
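
To put the two results side by side, the sketch below computes the absolute and relative reduction in RMSE from the values printed in steps 15 and 21. The variable names are illustrative only, and this comparison should go in a separate debugging cell, not in an **Expected Output** cell.

```python
rmse_baseline = 47402.46804613207   # RMSE printed in step 15
rmse_tuned = 45599.975478876884     # RMSE printed in step 21

# Lower RMSE is better, so a positive gain means the tuned model improved.
absolute_gain = rmse_baseline - rmse_tuned
relative_gain = absolute_gain / rmse_baseline

print(f"Absolute reduction: {absolute_gain:.2f}")
print(f"Relative reduction: {relative_gain:.2%}")
```

On these numbers the tuned model lowers the test RMSE by roughly 3.8%.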


### Go back to the starter notebook and delete any active endpoints (step 18.2) before submitting the assignment