# MGTA 466: Programming Assignment 5 - XGBoost using SageMaker

## Regression on Amazon SageMaker

Perform a regression task on the given dataset.<br>
Using the features given, you will train an XGBoost decision tree model to predict a given person's salary (the `WAGP` column).<br>

--- 

#### Tasks: 

- Perform Exploratory Data Analysis on the given dataset
- Save preprocessed datasets to Amazon S3
- Use the Amazon SageMaker platform to train an XGBoost model
- Evaluate the model on the test set using real-time inference
- Perform hyperparameter tuning on the XGBoost model

#### Submission on Gradescope:
You need to submit the following three files under "PA5":
- The current notebook - **PA5_Starter.ipynb** & the inference notebook - **PA5_Inference.ipynb**
    - **IMPORTANT** - Make sure all the cell outputs are present in the notebook
- Screenshot of active endpoints showing that the status is `InService` - **tuned_endpoint.png**

#### IMPORTANT submission guidelines enforced by autograder. Please read carefully:
  * Make sure that all the cells in this notebook are executed before submission
  * Some cells are marked **DO NOT DELETE**. These cells cannot be deleted and the output of these cells will be used for autograding
  * You can add cells or delete (NOT recommended) other cells, but the **Expected Output** for each of the tasks MUST be the output of the cells marked as such
  * DO NOT print anything other than the *exact* expected output. Do not include any sentences describing the output. This is strictly enforced by the autograder which checks for an *exact* match of the expected output. For example, if you are expected to print the PySpark version:
      * '10.9.8' - <span style="color:#093">CORRECT</span>
      * 'The PySpark version is 10.9.8' - <span style="color:#FF0000">INCORRECT</span>
  * You can add cells for printing debugging information anywhere, but do not print anything else in **Expected Output** cells other than the expected output for the task
  
**NOTE** - In this Assignment, some of the cells may have additional logging output and that is acceptable. **Any question that asks you to print the output requires the use of the print() function.**

---

Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. `pandas.DataFrame`.

Pandas API documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Amazon SageMaker API documentation: https://sagemaker.readthedocs.io/en/stable/

Amazon SageMaker Tutorials: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html

---

### 1. Get Amazon IAM execution role & instance region

 Make sure to create an S3 bucket or re-use the ones from prior exercises

 **NOTE** - You can safely ignore any warnings

In [1]:
# ALL YOUR IMPORTS HERE
import sagemaker
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sagemaker import get_execution_role



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


Get and store the IAM execution role, SageMaker Session, instance region & the SageMaker client in the cell below.

#### **Expected output:** Print the instance region

In [2]:
# Define IAM role- this will be necessary when defining your model
iam_role = get_execution_role()

# Set SageMaker session handle
sess = sagemaker.Session()

# Set the region of the instance 
my_region = sess.boto_session.region_name

print("Success - the SageMaker instance is in the " + my_region + " region")

Success - the SageMaker instance is in the us-west-2 region


### 2. Read data using pandas and select features - 1 point

#### 2.1 Read data from the given s3 bucket path into a pandas dataframe - 0.5 points

#### **Expected output**: First five rows of the dataframe

In [3]:
file_path = "s3://mgta466-w25/data/person_records_merged.csv"
train_df = pd.read_csv(file_path, storage_options={"anon": False})
train_df.head(5)




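|   | SERIALNO | SPORDER | PUMA | ST | ADJINC | AGEP | CIT | CITWP | COW | DDRS | ... | RACWHT | RC | SFN | SFR | SOCP | VPS | WAOB | FHINS3C | FHINS4C | FHINS5C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84 | 1 | 2600 | 1 | 1007549 | 19 | 1 | NaN | NaN | 2.0 | ... | 1 | 0 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN |
| 1 | 154 | 1 | 2500 | 1 | 1007549 | 55 | 1 | NaN | 1.0 | 2.0 | ... | 0 | 0 | NaN | NaN | 411011.0 | NaN | 1 | NaN | NaN | NaN |
| 2 | 154 | 2 | 2500 | 1 | 1007549 | 56 | 1 | NaN | 6.0 | 2.0 | ... | 0 | 0 | NaN | NaN | 493050.0 | NaN | 1 | NaN | NaN | NaN |
| 3 | 154 | 3 | 2500 | 1 | 1007549 | 21 | 1 | NaN | NaN | 2.0 | ... | 0 | 0 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN |
| 4 | 154 | 4 | 2500 | 1 | 1007549 | 21 | 1 | NaN | NaN | 1.0 | ... | 0 | 0 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN |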


### Description of Columns

There are lots of columns in the original dataset. However, we'll only use the following columns whose descriptions are given below:

WAGP - Wages or salary income past 12 months

AGEP -  Age

COW - Class of worker

JWMNP - Travel time to work

JWTR - Means of transportation to work

MAR - Marital status

PERNP - Total person's earnings

NWAV - Available for work

NWLA - On layoff from work

NWLK - Looking for work

NWAB - Temporary absence from work

SCHL - Educational attainment

WKW - Weeks worked during past 12 months

#### 2.2 Feature selection - Select the features in the columns listed above and filter data - 0.5 points
Select only the columns listed above and filter the pandas dataframe to remove `WAGP` values less than or equal to 0

**Note:** Make sure `WAGP` column is the first column. XGBoost expects target variables to be in the first column

#### **Expected output**: First five rows of the dataframe after feature selection and filtering

In [4]:
# Define required columns (ensuring WAGP is first)
required_columns = [
    "WAGP",
    "AGEP", "COW", "JWMNP", "JWTR", "MAR", "PERNP",
    "NWAV", "NWLA", "NWLK", "NWAB", "SCHL", "WKW"
]

# Select only these columns from the dataframe
df = train_df[required_columns]

# Remove rows where WAGP <= 0
df = df[df["WAGP"] > 0]

df.head(5)

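|   | WAGP | AGEP | COW | JWMNP | JWTR | MAR | PERNP | NWAV | NWLA | NWLK | NWAB | SCHL | WKW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 52000.0 | 55 | 1.0 | 30.0 | 1.0 | 1 | 52000.0 | 5.0 | 3.0 | 3.0 | 3.0 | 20.0 | 1.0 |
| 5 | 39000.0 | 63 | 3.0 | 15.0 | 1.0 | 3 | 39000.0 | 5.0 | 3.0 | 3.0 | 3.0 | 21.0 | 1.0 |
| 7 | 1100.0 | 20 | 1.0 | NaN | NaN | 5 | 1100.0 | 1.0 | 2.0 | 1.0 | 2.0 | 16.0 | 6.0 |
| 11 | 90000.0 | 59 | 1.0 | 10.0 | 1.0 | 1 | 90000.0 | 1.0 | 2.0 | 2.0 | 2.0 | 16.0 | 1.0 |
| 12 | 46000.0 | 56 | 1.0 | 45.0 | 1.0 | 1 | 46000.0 | 5.0 | 3.0 | 3.0 | 3.0 | 18.0 | 1.0 |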


### 3. Data processing - 1 point

#### 3.1 Remove highly correlated column - 0.5 points

In [5]:
df.corr()['WAGP']

WAGP     1.000000
AGEP     0.204185
COW      0.079328
JWMNP    0.108181
JWTR     0.005556
MAR     -0.241447
PERNP    0.983637
NWAV     0.109769
NWLA     0.142831
NWLK     0.149386
NWAB     0.131806
SCHL     0.296058
WKW     -0.310829
Name: WAGP, dtype: float64

As seen from the correlation values, column `PERNP` is highly correlated with the wage and must be removed
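
A correlation this close to 1 usually signals target leakage: `PERNP` (total earnings) presumably contains the wage itself, so a model could predict `WAGP` almost perfectly without learning anything from the other features. The sketch below computes the Pearson coefficient by hand on the five rows displayed in step 2.2, where `PERNP` equals `WAGP` in every row. `pearson` is a hypothetical helper written for this illustration; on the full dataset `df.corr()` does the same job.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values copied from the five rows shown in step 2.2 (illustration only;
# five rows are far too few to estimate a real correlation).
wagp = [52000.0, 39000.0, 1100.0, 90000.0, 46000.0]
pernp = [52000.0, 39000.0, 1100.0, 90000.0, 46000.0]
agep = [55, 63, 20, 59, 56]

r_pernp = pearson(wagp, pernp)  # identical columns, so the coefficient is 1
r_agep = pearson(wagp, agep)    # an ordinary feature correlates less strongly

print(round(r_pernp, 6))
print(round(r_agep, 3))
```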

#### **Expected output** - Columns of the dataframe after removing `PERNP`

In [6]:
df_processed = df.drop('PERNP', axis=1)
df_processed.columns

Index(['WAGP', 'AGEP', 'COW', 'JWMNP', 'JWTR', 'MAR', 'NWAV', 'NWLA', 'NWLK',
       'NWAB', 'SCHL', 'WKW'],
      dtype='object')

#### 3.2. Dropping NAs - 0.5 points

Drop rows with any nulls in any of the columns

#### **Expected output** - Number of rows in the cleaned dataframe

In [7]:
df_cleaned = df_processed.dropna()
df_cleaned.shape[0]

1257026

### 4. Splitting data and converting to CSV - 1 point

Split the dataset into train, validation, and test sets using sklearn's `train_test_split`.
Look up the API definition of `train_test_split` to see what values you need to pass

First, split the dataframe into two parts - `train_data` and `val_data` with a 70:30 ratio, and then
split the `train_data` into `train_data` and `test_data` in an 85:15 ratio.

Use the following parameters for train_test_split:
* `random_state = 466`
* `shuffle = True`
* `train_size = 0.7`, `test_size = 0.3` for the first split
* `train_size = 0.85`, `test_size = 0.15` for the second split

**IMPORTANT** - Use `random_state=466` as one of the parameters of the `train_test_split` function to maintain consistency across submissions

#### **Expected output** - Size of train, validation and test data in a tuple format - (length of train, length of validation, length of test)

In [8]:
train_data, val_data = train_test_split(df_cleaned, train_size=0.7, test_size=0.3, random_state=466, shuffle=True)

train_data, test_data = train_test_split(train_data, train_size=0.85, test_size=0.15, random_state=466, shuffle=True)

train_data.shape[0], val_data.shape[0], test_data.shape[0]

(747930, 377108, 131988)
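
The sizes above can be checked by hand. The sketch below assumes that scikit-learn, when given fractional sizes, rounds the test portion up and the train portion down; `split_sizes` is a hypothetical helper that only mirrors that arithmetic and does not split any data.

```python
import math

def split_sizes(n_samples, train_size, test_size):
    """Row counts for fractional sizes: test rounded up, train rounded down."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test

n_rows = 1257026  # rows left after dropna() in step 3.2

# First split: 70% train, 30% validation.
n_train_full, n_val = split_sizes(n_rows, 0.7, 0.3)
# Second split: 85% of the remaining train rows stay in train, 15% become test.
n_train, n_test = split_sizes(n_train_full, 0.85, 0.15)

sizes = (n_train, n_val, n_test)
print(sizes)  # (747930, 377108, 131988)
```

Because the second split is applied to the 70% portion, the final shares of the cleaned data are roughly 59.5% train, 30% validation, and 10.5% test.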

### Write prepared data to files.
Write the `train_data`, `val_data`, and `test_data` to csv files using the `.to_csv()` method

Use `index = False` as a parameter, as shown in the demo.

**NOTE:** Use `header = False` as another parameter for `train_data` and `val_data`, while for `test_data` use `header = True`

In [9]:
train_data.to_csv("train_data.csv", index=False, header = False)
val_data.to_csv("val_data.csv", index=False, header = False)
test_data.to_csv("test_data.csv", index=False, header=True)
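
SageMaker's built-in XGBoost algorithm reads CSV training input without a header row and treats the first column as the label, which is why the training and validation files drop the header. The sketch below shows the two layouts using only the standard library. `to_csv_text` is a hypothetical helper, not part of pandas or SageMaker, and the rows are copied from the step 2.2 output with `PERNP` removed.

```python
import csv
import io

columns = ["WAGP", "AGEP", "COW", "JWMNP", "JWTR", "MAR",
           "NWAV", "NWLA", "NWLK", "NWAB", "SCHL", "WKW"]

# Two rows from the step 2.2 output; WAGP (the target) comes first.
rows = [
    [52000.0, 55, 1.0, 30.0, 1.0, 1, 5.0, 3.0, 3.0, 3.0, 20.0, 1.0],
    [39000.0, 63, 3.0, 15.0, 1.0, 3, 5.0, 3.0, 3.0, 3.0, 21.0, 1.0],
]

def to_csv_text(rows, header=None):
    """Write rows as CSV text, with an optional header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

train_text = to_csv_text(rows)                 # layout of header=False
test_text = to_csv_text(rows, header=columns)  # layout of header=True

print(train_text)
print(test_text)
```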

### 5. Save processed data to S3 - 1 point

This step is needed for using XGBoost with Amazon SageMaker, because SageMaker reads its training data from S3. The example for the training data is given; you need to do the same for the validation and test data.

#### **Expected output** - Path of train, validation and test data in AWS S3 in tuple format - (train_path, val_path, test_path)

In [10]:
bucket = 'wew-s3-demo'

In [11]:
# DO NOT DELETE THIS CELL

prefix = "data"
key_prefix = prefix + "/model_data"
train_path = sess.upload_data(
    path="train_data.csv", bucket=bucket, key_prefix=key_prefix)
print('Train data uploaded to ' + train_path)

val_path = sess.upload_data(
    path="val_data.csv", bucket=bucket, key_prefix=key_prefix)
print('Validation data uploaded to ' + val_path)

test_path = sess.upload_data(
    path="test_data.csv", bucket=bucket, key_prefix=key_prefix)
print('Test data uploaded to ' + test_path)

Train data uploaded to s3://wew-s3-demo/data/model_data/train_data.csv
Validation data uploaded to s3://wew-s3-demo/data/model_data/val_data.csv
Test data uploaded to s3://wew-s3-demo/data/model_data/test_data.csv
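
The returned paths follow the pattern `s3://<bucket>/<key_prefix>/<file name>`. The sketch below rebuilds them offline so the expected tuple can be checked without touching S3. `expected_s3_uri` is a hypothetical helper, not a SageMaker function, and it assumes a single local file is uploaded per call.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Hypothetical helper: S3 URI of one uploaded file under key_prefix."""
    filename = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))

bucket = "wew-s3-demo"
key_prefix = "data" + "/model_data"

paths = tuple(
    expected_s3_uri(bucket, key_prefix, name)
    for name in ("train_data.csv", "val_data.csv", "test_data.csv")
)
print(paths)
```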


### 6. Create channels for train and validation data to feed to model - 1 point
1. Set up data channels for the training, validation, and test data as shown in the demo
2. Set the output location for the model

You'll have to use the [`TrainingInput`](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput) class and pass the `s3_data` and `content_type` parameters

#### **Expected output** - [`config`](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput.config) of `TrainingInput` created for training data

Refer - https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput.config

In [12]:
# Set data channels

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path, content_type='csv')
s3_input_val = sagemaker.inputs.TrainingInput(s3_data=val_path, content_type='csv')
s3_input_test = sagemaker.inputs.TrainingInput(s3_data=test_path, content_type='csv')

# config
s3_input_train.config

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
   'S3Uri': 's3://wew-s3-demo/data/model_data/train_data.csv',
   'S3DataDistributionType': 'FullyReplicated'}},
 'ContentType': 'csv'}

#### **Expected output** - Model's output location in AWS S3

NOTE - Output format should be `s3://<bucket-name>/<path-to-model-folder>`

In [13]:
# Set model output location

output_location = "s3://{}/{}/model".format(bucket,prefix)
# location
output_location

's3://wew-s3-demo/data/model'

### 7. Create the XGBoost model - 2 points

In [14]:
from sagemaker import image_uris  # image_uris is a top-level module of the SageMaker SDK
xgb_image = image_uris.retrieve(framework="xgboost", region=my_region, version='latest')

### Create an Estimator using sagemaker.estimator.Estimator.
You'll need to pass the `xgb_image` and the `iam_role` parameters as the first two parameters to `sagemaker.estimator.Estimator`. `xgb_image` was created in the previous step, and `iam_role` in the first step

Use the following values for other parameters:
* `instance_count = 1`
* `instance_type = 'ml.m5.xlarge'`
* `output_path = output_location` from step 6
* `sagemaker_session = sess`

#### **Expected output** - `output_path` of the xgb Estimator. Note that this output should be the same as the model output path above

In [16]:
xgb_model = sagemaker.estimator.Estimator(xgb_image,
                                          iam_role, 
                                          instance_count=1, 
                                          instance_type='ml.m5.xlarge',
                                          output_path=output_location,
                                          sagemaker_session=sess)
xgb_model.output_path

's3://wew-s3-demo/data/model'

### 8. Set model hyperparameters - 1 point
Set the hyperparameters for the model. You'll have to use the `set_hyperparameters()` method.
Refer to the demo for how it's done.

Read the below references for more information:
* https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
* https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters

Use the following values for the parameters:
* `max_depth = 1`
* `min_child_weight = 2`
* `early_stopping_rounds=5`
* `objective='reg:linear'`
* `num_round=100`
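
Of these values, `early_stopping_rounds` is the least obvious: training may end before `num_round` is reached if the validation metric stops improving. The sketch below shows the general idea on made-up scores. `rounds_until_stop` is a hypothetical helper and is not how XGBoost implements the check internally.

```python
def rounds_until_stop(validation_scores, early_stopping_rounds):
    """Number of rounds run before stopping; a lower score is better."""
    best_score = float("inf")
    rounds_without_improvement = 0
    for round_number, score in enumerate(validation_scores, start=1):
        if score < best_score:
            best_score = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                return round_number
    return len(validation_scores)

# Made-up validation errors, for illustration only.
toy_scores = [10.0, 9.0, 8.5, 8.6, 8.7, 8.6, 8.9, 9.0, 9.1, 9.2]

stopped_at = rounds_until_stop(toy_scores, early_stopping_rounds=5)
print(stopped_at)  # 8: the best score was in round 3, then five rounds without improvement
```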

#### **Expected output** - Hyperparameters of the xgb Estimator in Python `dict` format

In [17]:
xgb_model.set_hyperparameters(max_depth = 1,             
                              min_child_weight = 2,
                              early_stopping_rounds=5,
                              objective='reg:linear',
                              num_round=100)
xgb_model.hyperparameters()

{'max_depth': 1,
 'min_child_weight': 2,
 'early_stopping_rounds': 5,
 'objective': 'reg:linear',
 'num_round': 100}

### 9. Train model using train and validation data channels - 1 point
Use the `.fit()` method to fit the model using the training and validation data channels. 
Execute the XGBoost training job.

NOTE:  This step may take several minutes. <br>
Also, pass `logs = False` to the `fit` function to avoid printing extra info logs. These are different from the training logs that the `fit` function generates automatically.

#### **Expected output** - Training log. Note that you only have to call `.fit` on the xgb Estimator, with the required parameters, and the logs will  be automatically generated

In [18]:
%%time

xgb_model.fit({'train': s3_input_train, 'validation': s3_input_val}, logs = False)


2025-03-17 07:00:19 Starting - Starting the training job
2025-03-17 07:00:33 Starting - Preparing the instances for training
2025-03-17 07:00:52 Downloading - Downloading input data
2025-03-17 07:01:17 Downloading - Downloading the training image
2025-03-17 07:01:58 Training - Training image download completed. Training in progress.
2025-03-17 07:02:18 Uploading - Uploading generated training model
2025-03-17 07:02:31 Completed - Training job completed
CPU times: user 201 ms, sys: 9.57 ms, total: 210 ms
Wall time: 2min 17s


### 10. Real-time Inference - 1.5 points

#### 10.1 Deploy endpoint for inference - 1 point

1. Deploy the model that you fit in the previous step as an endpoint for real-time inference
2. Delete the endpoint after performing inference in the inference notebook

Use the `.deploy()` method to deploy the model and create an endpoint for real-time inference as shown in the demo.

Use the following values for the parameters:
* `initial_instance_count = 1`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `instance_type = 'ml.m5.xlarge'`
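
With a CSV serializer, each record sent to the endpoint is a comma-separated line of feature values in the same column order used for training, without the `WAGP` target. The sketch below approximates that payload for one flat record. `to_csv_payload` is a hypothetical stand-in; the real `CSVSerializer` accepts more input types than this.

```python
def to_csv_payload(features):
    """Hypothetical stand-in: join one record's feature values with commas."""
    return ",".join(str(value) for value in features)

# Features of the first row shown in step 2.2, in training column order
# (AGEP, COW, JWMNP, JWTR, MAR, NWAV, NWLA, NWLK, NWAB, SCHL, WKW),
# without WAGP (the target) and PERNP (dropped in step 3.1).
first_row_features = [55, 1.0, 30.0, 1.0, 1, 5.0, 3.0, 3.0, 3.0, 20.0, 1.0]

payload = to_csv_payload(first_row_features)
print(payload)  # 55,1.0,30.0,1.0,1,5.0,3.0,3.0,3.0,20.0,1.0
```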

**NOTE**:  This step may take several minutes

### **Go to the inference notebook to perform model inference in steps 11 - 15 after deploying the model in this step**

#### **Expected Output -** Print the name of the deployed endpoint

In [19]:
%%time

# Deploy the model trained in step 9 as a real-time endpoint

xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                    instance_type='ml.m5.xlarge', serializer = sagemaker.serializers.CSVSerializer())

xgb_predictor.endpoint_name

--------!
CPU times: user 90.2 ms, sys: 14.2 ms, total: 104 ms
Wall time: 4min 32s


'xgboost-2025-03-17-07-02-35-198'

### Please make sure to complete steps 11 to 15 in the inference notebook before proceeding with step 10.2 below

#### 10.2 Delete endpoint - 0.5 point

**NOTE**: There is a limit on the number of active endpoints


#### **Expected Output -** Delete endpoint logs. Note that these will be automatically generated once you delete the endpoints.

In [20]:
xgb_predictor.delete_endpoint()

# Delete model if no longer needed
xgb_predictor.delete_model()

### 16. Hyperparameter tuning - 2 points

Read through the following links for more information:
* https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
* https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-supports-random-search-and-hyperparameter-scaling/

We'll perform hyperparameter tuning on two hyperparameters:

1. min_child_weight: 1 to 10
2. max_depth: 2 to 10

We'll use a `Random` search strategy. The code has been given for you, assuming the XGBoost estimator is stored in the variable `xgb_model`.

`max_parallel_jobs` is set to 2 so that too many instances are not created for Hyperparameter tuning
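
Conceptually, a random search draws each hyperparameter independently from its range for every training job. The sketch below illustrates this with the standard library for four jobs. `sample_configurations` is a hypothetical helper; SageMaker's own sampler is not reproduced here, so the values it draws will differ from the tuning job's.

```python
import random

def sample_configurations(ranges, n_jobs, seed):
    """Draw one integer per hyperparameter for each job (bounds inclusive)."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_jobs)
    ]

# Ranges from the list above: min_child_weight 1 to 10, max_depth 2 to 10.
ranges = {"min_child_weight": (1, 10), "max_depth": (2, 10)}

configs = sample_configurations(ranges, n_jobs=4, seed=123)
for config in configs:
    print(config)
```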

In [24]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

Define the hyperparameters to tune in the dictionary `hyperparameter_ranges`

In [25]:
hyperparameter_ranges = {
    'min_child_weight': IntegerParameter(1, 10),
    'max_depth': IntegerParameter(2, 10)}

In [26]:
optimizer = HyperparameterTuner(
    estimator=xgb_model,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name='XGBoost-Tuner',
    objective_type='Minimize',
    objective_metric_name='validation:rmse',
    max_jobs=4,
    max_parallel_jobs=2,
    strategy='Random',
    random_seed=123)
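
The tuner above ranks training jobs by `validation:rmse` and keeps the lowest. As a reminder of what that metric measures, the sketch below computes the root mean squared error by hand. The wages come from the step 2.2 output, while the predictions are made up for illustration, and `rmse` is a hypothetical helper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual_wages = [52000.0, 39000.0, 1100.0]     # WAGP values from step 2.2
toy_predictions = [50000.0, 42000.0, 5100.0]  # made-up predictions

error = rmse(actual_wages, toy_predictions)
print(round(error, 2))
```

Since the errors are squared before averaging, a few large misses on high salaries can dominate the score.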

Now that we have created the optimizer, we need to call `.fit()` on it to start the tuning job.

Refer to the demo and see how to call `fit()` and pass the appropriate data channels.

#### **Expected output** - Tuning log. Note that you only have to call `.fit` on the optimizer, with the required parameters, and the logs will be automatically generated

In [27]:
%%time

optimizer.fit({'train': s3_input_train, 'validation': s3_input_val}, logs = False)

.........................................!
CPU times: user 249 ms, sys: 37.3 ms, total: 287 ms
Wall time: 3min 35s


### 17. Best hyperparameters - 1 point

#### **Expected output** - Hyperparameters of the `best_estimator` of the `HyperparameterTuner` in Python's dictionary(`dict`) format

In [28]:
optimizer.best_estimator().hyperparameters()


2025-03-17 07:15:20 Starting - Found matching resource for reuse
2025-03-17 07:15:20 Downloading - Downloading the training image
2025-03-17 07:15:20 Training - Training image download completed. Training in progress.
2025-03-17 07:15:20 Uploading - Uploading generated training model
2025-03-17 07:15:20 Completed - Resource retained for reuse


{'_tuning_objective_metric': 'validation:rmse',
 'early_stopping_rounds': '5',
 'max_depth': '8',
 'min_child_weight': '2',
 'num_round': '100',
 'objective': 'reg:linear'}
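
Note that the values come back as strings, unlike the integers passed to `set_hyperparameters()` in step 8. If the numbers are needed for later comparisons, they can be converted as in the sketch below; `parse_hyperparameters` is a hypothetical helper and the dictionary is copied from the output above.

```python
best_hyperparameters = {
    "_tuning_objective_metric": "validation:rmse",
    "early_stopping_rounds": "5",
    "max_depth": "8",
    "min_child_weight": "2",
    "num_round": "100",
    "objective": "reg:linear",
}

def parse_hyperparameters(raw):
    """Convert digit-only strings to int and leave other values unchanged."""
    return {
        key: int(value) if value.isdigit() else value
        for key, value in raw.items()
    }

parsed = parse_hyperparameters(best_hyperparameters)
tuned = {name: parsed[name] for name in ("min_child_weight", "max_depth")}
print(tuned)  # {'min_child_weight': 2, 'max_depth': 8}
```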

### 18. Real-time inference after Hyperparameter tuning - 1.5 points

#### 18.1 Deploy the tuned model as an endpoint - 1 point
**NOTE**:  This step may take several minutes

1. Deploy the tuned model with the best parameters that you got in the previous steps as an endpoint for real-time inference
2. Delete the endpoint after performing inference in the inference notebook

Use the `.deploy()` method to deploy the model and create an endpoint for real-time inference as shown in the demo.

Use the following values for the parameters:
* `initial_instance_count = 1`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `instance_type = 'ml.m5.xlarge'`


#### **Expected Output -** Print the name of the deployed endpoint

In [29]:
%%time

# Deploy best model from hyperparameter tuning

tuned_model_predictor = optimizer.deploy(initial_instance_count=1,
                    instance_type='ml.m5.xlarge', serializer = sagemaker.serializers.CSVSerializer())

tuned_model_predictor.endpoint_name


2025-03-17 07:15:20 Starting - Found matching resource for reuse
2025-03-17 07:15:20 Downloading - Downloading the training image
2025-03-17 07:15:20 Training - Training image download completed. Training in progress.
2025-03-17 07:15:20 Uploading - Uploading generated training model
2025-03-17 07:15:20 Completed - Resource retained for reuse


--------!
CPU times: user 121 ms, sys: 14.7 ms, total: 136 ms
Wall time: 4min 37s


'XGBoost-Tuner-250317-0711-004-0cc24b8c'

#### 18.1.5 Screenshot of deployed endpoint (tuned_endpoint.png) - 0 points
Provide a screenshot (png file) of the endpoint of the tuned model named 'tuned_endpoint.png', showing the endpoint name, creation time, last updated time, your username, and that the status is `InService`.

**NOTE:** Your submission will not be graded if no screenshot is provided.

### **Go to the inference notebook to perform inference on the tuned model in steps 19 - 21 after taking the screenshot of the tuned model endpoint in this step**

### Please make sure to complete steps 19 to 21 in the inference notebook before proceeding with step 18.2 below

#### 18.2 Delete endpoint - 0.5 points

**NOTE**: There is a limit on the number of active endpoints


#### **Expected Output -** Delete endpoint logs. Note that these will be automatically generated once you delete the endpoints.

In [30]:
# DO NOT DELETE THIS CELL

tuned_model_predictor.delete_endpoint()

# Delete model if no longer needed
tuned_model_predictor.delete_model()