# XGBoost using SageMaker

## Classification on Amazon SageMaker

Perform a classification task on the given dataset.<br>
Using the given features, you will train an XGBoost decision tree model to predict a person's salary. The target is the `WAGP_CAT` column, which holds wages discretized into bins.<br>

--- 

#### Tasks: 

- Perform Exploratory Data Analysis on the given dataset
- Save preprocessed datasets to Amazon S3
- Use the Amazon SageMaker platform to train an XGBoost model
- Evaluate the model on the test set using real-time inference
- Perform hyperparameter tuning on the XGBoost model

### 1. Get Amazon IAM execution role & instance region

Make sure to create an S3 bucket, or reuse one from a prior exercise.

In [2]:
# ALL YOUR IMPORTS HERE
import os, sagemaker
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sagemaker import get_execution_role

Get and store the IAM execution role, SageMaker Session, instance region & the SageMaker client in the cell below.

#### **Expected output:** Print the instance region

In [3]:
# DO NOT DELETE THIS CELL
# YOUR CODE HERE

# Define IAM role- this will be necessary when defining your model
iam_role = get_execution_role()

# Set SageMaker session handle
sess = sagemaker.Session()

# Set the region of the instance 
my_region = sess.boto_session.region_name

print(my_region)

us-west-2


### 2. Read data using pandas and select features

#### Read data from the given s3 bucket path into a pandas dataframe 

#### **Expected output**: First five rows of the dataframe

In [6]:

bucket = "mgta466-w24"
prefix = "data"
print('Using bucket ' + bucket)

Using bucket mgta466-w24


In [None]:
# Build the S3 path from the bucket and prefix defined above
data_fname = "s3://{}/{}/{}".format(bucket, prefix, "person_records_merged_discretized.csv.zip")

# pandas infers zip compression from the file extension; reading s3:// paths requires s3fs
val_df = pd.read_csv(data_fname)
df = val_df.copy()
df.head()

### Description of Columns

There are lots of columns in the original dataset. However, we'll only use the following columns whose descriptions are given below:

WAGP_CAT - Discretized wages or salary income past 12 months

AGEP -  Age

COW - Class of worker

JWMNP - Travel time to work

JWTR - Means of transportation to work

MAR - Marital status

PERNP - Total person's earnings

NWAV - Available for work

NWLA - On layoff from work

NWLK - Looking for work

NWAB - Temporary absence from work

SCHL - Educational attainment

WKW - Weeks worked during past 12 months

#### 2.2 Feature selection - Select the features in the columns listed above
Note: Make sure `WAGP_CAT` is the first column. For CSV input, SageMaker's built-in XGBoost algorithm expects the label in the first column.
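
The label-first rule can also be enforced programmatically rather than by typing the column list in the right order. The snippet below is a minimal sketch of that idea; `label_first` is a hypothetical helper written for this illustration and is not part of pandas or SageMaker.

```python
def label_first(columns, label):
    """Return the column names with `label` moved to the front; the rest keep their order."""
    if label not in columns:
        raise KeyError(f"{label!r} is not one of the columns")
    return [label] + [c for c in columns if c != label]


# A few column names from the list above, deliberately out of order.
cols = ['AGEP', 'COW', 'WAGP_CAT', 'SCHL']
ordered = label_first(cols, 'WAGP_CAT')
print(ordered)
```

In the notebook this could be applied as `df = df[label_first(list(df.columns), 'WAGP_CAT')]`.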

#### **Expected output**: First five rows of the dataframe after feature selection

In [10]:
# DO NOT DELETE THIS CELL
# YOUR CODE HERE
# Selecting only the specified columns
selected_columns = ['WAGP_CAT', 'AGEP', 'COW', 'JWMNP', 'JWTR', 'MAR', 'PERNP', 'NWAV', 'NWLA', 'NWLK', 'NWAB', 'SCHL', 'WKW']
df= df[selected_columns]
print(df.head())


   WAGP_CAT  AGEP  COW  JWMNP  JWTR  MAR    PERNP  NWAV  NWLA  NWLK  NWAB  \
0         0    19  NaN    NaN   NaN    5      0.0   5.0   2.0   2.0   2.0   
1         1    55  1.0   30.0   1.0    1  52000.0   5.0   3.0   3.0   3.0   
2         0    56  6.0    NaN  11.0    1  99000.0   5.0   3.0   3.0   3.0   
3         0    21  NaN    NaN   NaN    5      0.0   5.0   2.0   2.0   2.0   
4         0    21  NaN    NaN   NaN    5      0.0   5.0   2.0   2.0   2.0   

   SCHL  WKW  
0  19.0  NaN  
1  20.0  1.0  
2  16.0  1.0  
3  19.0  NaN  
4  19.0  NaN  


### 3. Data processing

#### 3.1 Remove highly correlated column

In [11]:
df.corr()['WAGP_CAT']

WAGP_CAT    1.000000
AGEP       -0.023210
COW        -0.044870
JWMNP       0.121411
JWTR       -0.049439
MAR        -0.166626
PERNP       0.791990
NWAV        0.108792
NWLA        0.305329
NWLK        0.298288
NWAB        0.292421
SCHL        0.273946
WKW        -0.302723
Name: WAGP_CAT, dtype: float64

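As the correlation values show, `PERNP` is highly correlated with the wage category (0.79). Total earnings include wages, so keeping this column would leak the target into the features; it must be removed.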
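
The same check can be automated by flagging any feature whose absolute correlation with the target exceeds a cutoff. This is a sketch only: `flag_leaky_features` is a hypothetical helper, and the 0.7 threshold is an assumption chosen for illustration, not a value given in the assignment.

```python
def flag_leaky_features(corr_with_target, target, threshold=0.7):
    """Return feature names whose absolute correlation with the target exceeds `threshold`."""
    return [name for name, r in corr_with_target.items()
            if name != target and abs(r) > threshold]


# Values copied from the df.corr()['WAGP_CAT'] output above.
corr = {
    'WAGP_CAT': 1.000000, 'AGEP': -0.023210, 'COW': -0.044870,
    'JWMNP': 0.121411, 'JWTR': -0.049439, 'MAR': -0.166626,
    'PERNP': 0.791990, 'NWAV': 0.108792, 'NWLA': 0.305329,
    'NWLK': 0.298288, 'NWAB': 0.292421, 'SCHL': 0.273946,
    'WKW': -0.302723,
}
suspects = flag_leaky_features(corr, 'WAGP_CAT')
print(suspects)  # prints ['PERNP']
```

In the notebook, the dictionary could be built with `df.corr()['WAGP_CAT'].to_dict()` instead of being typed by hand.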

#### **Expected output** - Columns of the dataframe after removing `PERNP`

In [12]:
# DO NOT DELETE THIS CELL
# YOUR CODE HERE
# Remove the 'PERNP' column
selected_columns.remove('PERNP')

# Selecting the remaining columns
df = df[selected_columns]

# Printing the columns of the selected DataFrame
print(df.columns)


Index(['WAGP_CAT', 'AGEP', 'COW', 'JWMNP', 'JWTR', 'MAR', 'NWAV', 'NWLA',
       'NWLK', 'NWAB', 'SCHL', 'WKW'],
      dtype='object')


#### 3.2. Dropping NAs 

Drop rows with any nulls in any of the columns

#### **Expected output** - Number of rows in the cleaned dataframe

In [13]:
# DO NOT DELETE THIS CELL
# YOUR CODE HERE
df = df.dropna()
df.shape[0]

1248427

### 4. Splitting data and converting to CSV

Split the dataset into train, validation, and test sets using sklearn's `train_test_split`.
Look up the API definition of `train_test_split` to see what values you need to pass

First, split the dataframe into two parts - `train_data` and `val_data` in a 70:30 ratio, and then
split the `train_data` into `train_data` and `test_data` in an 85:15 ratio.

Use the following parameters for train_test_split:
* `random_state = 466`
* `shuffle = True`
* `train_size = 0.7`, `test_size = 0.3` for the first split
* `train_size = 0.85`, `test_size = 0.15` for the second split

**IMPORTANT** - Use `random_state=466` as one of the parameters of the `train_test_split` function to maintain consistency across submissions
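
Because the second split is applied to the 70% that remains, the final shares are roughly 59.5% train, 30% validation, and 10.5% test. The arithmetic can be sketched as follows, assuming (as scikit-learn appears to do for fractional sizes) that the held-out part is rounded up and the remainder goes to the other part. `split_sizes` is a hypothetical helper for this illustration.

```python
import math


def split_sizes(n_rows, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out share up."""
    n_held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - n_held_out, n_held_out


n_clean = 1248427  # rows left after dropna() in section 3.2

# First split: 70% stays in train_data, 30% becomes val_data.
n_train_full, n_val = split_sizes(n_clean, 0.30)
# Second split: 15% of the remaining train_data becomes test_data.
n_train, n_test = split_sizes(n_train_full, 0.15)

print((n_train, n_val, n_test))
print([round(n / n_clean, 3) for n in (n_train, n_val, n_test)])
```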

#### **Expected output** - Size of train, validation and test data in a tuple format - (length of train, length of validation, length of test)

In [15]:
# DO NOT DELETE THIS CELL
random_state = 466
# YOUR CODE HERE
from sklearn.model_selection import train_test_split

# Splitting the dataframe into train_data and val_data with a 70:30 ratio
train_data, val_data = train_test_split(df, test_size=0.3, random_state=466, shuffle=True)

# Splitting the train_data into train_data and test_data with an 85:15 ratio
train_data, test_data = train_test_split(train_data, test_size=0.15, random_state=466, shuffle=True)

# Getting the sizes of train, validation, and test data
train_size = len(train_data)
val_size = len(val_data)
test_size = len(test_data)
print((train_size, val_size, test_size))

(742813, 374529, 131085)


### Write prepared data to files.
Write the `train_data`, `val_data`, and `test_data` to csv files using the `.to_csv()` method

Pass `index = False` as a parameter, as shown in the demo.

**NOTE:** Use `header = False` as another parameter for `train_data` and `val_data`, while for `test_data` use `header = True`
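
The header rule matters because SageMaker's built-in XGBoost algorithm expects CSV training input without a header row. A quick sanity check before uploading is to confirm that the first row of each training file is entirely numeric. The sketch below uses only the standard library; `first_row_is_numeric` is a hypothetical helper, and the two throwaway files stand in for `train_data.csv` and `test_data.csv`.

```python
import csv
import os
import tempfile


def first_row_is_numeric(path):
    """Return True if every field in the first CSV row parses as a float (no header row)."""
    with open(path, newline='') as f:
        first_row = next(csv.reader(f))
    try:
        for field in first_row:
            float(field)
    except ValueError:
        return False
    return True


# Two small demo files: one written without a header, one with.
tmp_dir = tempfile.mkdtemp()
no_header = os.path.join(tmp_dir, 'no_header.csv')
with_header = os.path.join(tmp_dir, 'with_header.csv')
with open(no_header, 'w', newline='') as f:
    csv.writer(f).writerows([[1, 55, 1.0], [0, 56, 6.0]])
with open(with_header, 'w', newline='') as f:
    csv.writer(f).writerows([['WAGP_CAT', 'AGEP', 'COW'], [1, 55, 1.0]])

print(first_row_is_numeric(no_header), first_row_is_numeric(with_header))
```

In the notebook, calling `first_row_is_numeric('train_data.csv')` after writing the files should return `True` if the header was left out as intended.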

In [35]:
# YOUR CODE HERE
# Write prepared data to files

train_data.to_csv('train_data.csv', index=False, header=False)
val_data.to_csv('val_data.csv', index=False, header=False)
test_data.to_csv('test_data.csv', index=False, header=True)

### 5. Save processed data to S3 

This step is needed for using XGBoost with Amazon SageMaker, because SageMaker reads training data from S3. The example for the training data is given; you need to do the same for the validation and test data.
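
When a single file is uploaded, the returned path is the bucket, the key prefix, and the file name joined together. The sketch below reproduces that string so the expected output can be checked by eye; it is an illustration of the URI layout, not the SDK's implementation. `expected_s3_uri` and the bucket name `my-bucket` are hypothetical.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the s3:// URI for a file stored under `key_prefix` with its own file name."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"


# 'my-bucket' is a placeholder; substitute your own bucket name.
uris = tuple(
    expected_s3_uri('my-bucket', 'data/model_data', name)
    for name in ('train_data.csv', 'val_data.csv', 'test_data.csv')
)
print(uris)
```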

#### **Expected output** - Path of train, validation and test data in AWS S3 in tuple format - (train_path, val_path, test_path)

In [36]:
bucket = 'sb-aws-bucket-mgt'

In [37]:
# DO NOT DELETE THIS CELL

prefix = "data"
key_prefix = prefix + "/model_data"

train_path = sess.upload_data(
    path='train_data.csv', bucket=bucket,
    key_prefix=key_prefix)

val_path = sess.upload_data(
    path="val_data.csv", bucket=bucket, key_prefix=key_prefix)

test_path = sess.upload_data(
    path="test_data.csv", bucket=bucket, key_prefix=key_prefix)

# YOUR CODE HERE
print((train_path,val_path,test_path))


('s3://sb-aws-bucket-mgt/data/model_data/train_data.csv', 's3://sb-aws-bucket-mgt/data/model_data/val_data.csv', 's3://sb-aws-bucket-mgt/data/model_data/test_data.csv')


### 6. Create channels for train and validation data to feed to model
1. Set up data channels for the training, validation, and test data as shown in the demo
2. Set the output location for the model

You'll have to use the [`TrainingInput`](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput) class and pass the `s3_data` and `content_type` parameters

#### **Expected output** - [`config`](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput.config) of `TrainingInput` created for training data

Refer - https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput.config

In [38]:

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path, content_type='csv')
s3_input_val = sagemaker.inputs.TrainingInput(s3_data=val_path, content_type='csv')
s3_input_test = sagemaker.inputs.TrainingInput(s3_data=test_path, content_type='csv')

# Set model output location

output_location = "s3://{}/{}/model".format(bucket,prefix)
print((s3_input_train.config))

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sb-aws-bucket-mgt/data/model_data/train_data.csv', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}


#### **Expected output** - Model's output location in AWS S3

NOTE - Output format should be `s3://<bucket-name>/<path-to-model-folder>`

In [39]:
output_location

's3://sb-aws-bucket-mgt/data/model'

### 7. Create the XGBoost model 

In [23]:
# CODE FOR XGB IMAGE
from sagemaker import image_uris
xgb_image = image_uris.retrieve(framework="xgboost", region=my_region, version='latest')

### Create an Estimator using sagemaker.estimator.Estimator.


#### **Expected output** - `output_path` of the xgb Estimator. Note that this output should be the same as the model output path above

In [24]:
# DO NOT DELETE THIS CELL
# YOUR CODE HERE
xgb_model = sagemaker.estimator.Estimator(xgb_image,
                                          iam_role, 
                                          instance_count=1, 
                                          instance_type='ml.m5.xlarge',
                                          output_path=output_location,
                                          sagemaker_session=sess)

### 8. Set model hyperparameters
Set the hyperparameters for the model. You'll have to use the `set_hyperparameters()` method.
Refer to the demo for how it's done.
The wages have been discretized into 3 bins so the number of classes for our classification problem is 3

Read the below references for more information:
* https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
* https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters

Use the following values for the parameters:
* `num_class = 3`
* `max_depth = 1`
* `min_child_weight = 2`
* `early_stopping_rounds=5`
* `objective='multi:softmax'`
* `num_round=100`
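
With `objective='multi:softmax'` and `num_class = 3`, the model returns a single predicted class per row, namely the class with the highest softmax probability. The sketch below shows that step in plain Python; the raw scores are made up for illustration, and the helper names are hypothetical rather than part of XGBoost.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores):
    """Return the index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


raw_scores = [0.2, 1.5, -0.3]  # made-up scores for one row, one per wage bin
probs = softmax(raw_scores)
predicted = predict_class(raw_scores)
print([round(p, 3) for p in probs], predicted)
```

Here the second bin (index 1) has the largest score, so it is the predicted class.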

#### **Expected output** - Hyperparameters of the xgb Estimator in Python `dict` format

In [40]:
xgb_model.set_hyperparameters(num_class = 3,
    max_depth = 1,
    min_child_weight = 2,
    early_stopping_rounds=5,
    objective='multi:softmax',
    num_round=100)

In [41]:
 #Hyperparameters of the xgb Estimator in Python dict format
xgb_model.hyperparameters()

{'num_class': 3,
 'max_depth': 1,
 'min_child_weight': 2,
 'early_stopping_rounds': 5,
 'objective': 'multi:softmax',
 'num_round': 100}

### 9. Train model using train and validation data channels
Use the `.fit()` method to fit the model using the training and validation data channels. 
Execute the XGBoost training job.

NOTE: This step may take several minutes. <br>
Also, pass `logs = False` to the fit function to avoid printing extra info logs. These are different from the training logs that the fit function generates automatically.

#### **Expected output** - Training log. Note that you only have to call `.fit` on the xgb Estimator with the required parameters, and the logs will be generated automatically

In [42]:

import time

# Record the start time
start_time = time.time()

xgb_model.fit({'train': s3_input_train, 'validation': s3_input_val}, logs = False)
end_time = time.time()

INFO:sagemaker:Creating training-job with name: xgboost-2024-03-20-01-45-07-319



2024-03-20 01:45:07 Starting - Starting the training job....
2024-03-20 01:45:36 Starting - Preparing the instances for training......
2024-03-20 01:46:09 Downloading - Downloading input data....
2024-03-20 01:46:34 Downloading - Downloading the training image.....
2024-03-20 01:47:05 Training - Training image download completed. Training in progress...
2024-03-20 01:47:20 Uploading - Uploading generated training model.
2024-03-20 01:47:31 Completed - Training job completed


In [43]:
print("Wall time:",(end_time-start_time)/60)

Wall time: 2.4540852944056195


### 10. Real-time Inference

#### 10.1 Deploy endpoint for inference

1. Deploy the model that you fit in the previous step as an endpoint for real-time inference
2. Delete the endpoint after performing inference in the inference notebook

Use the `.deploy()` method to deploy the model and create an endpoint for real-time inference as shown in the demo.

Use the following values for the parameters:
* `initial_instance_count = 1`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `instance_type = 'ml.m5.xlarge'`

**NOTE**:  This step may take several minutes

**Go to the inference notebook to perform the inference after deploying the model in this step**
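
For the inference step, each row has to be sent as a CSV line without the label, and the response has to be parsed back into class labels. The sketch below shows one way to do both. The helper names are hypothetical, and the response format (comma- or newline-separated numbers) is an assumption about what the endpoint returns, so check it against a real response.

```python
def row_to_csv(features):
    """Serialize one row of feature values (label already removed) to a CSV line."""
    return ','.join(str(v) for v in features)


def parse_predictions(response):
    """Parse a comma- or newline-separated response body into integer class labels."""
    text = response.decode('utf-8') if isinstance(response, bytes) else response
    tokens = text.replace('\n', ',').split(',')
    return [int(float(tok)) for tok in tokens if tok.strip()]


# Feature values from the second row of the preview in section 2.2,
# with WAGP_CAT (the label) and PERNP (removed in section 3.1) left out.
payload = row_to_csv([55, 1.0, 30.0, 1.0, 1, 5.0, 3.0, 3.0, 3.0, 20.0, 1.0])
print(payload)

# Made-up response body of the assumed shape; a real one would come from
# model_predictor.predict(payload) while the endpoint is running.
labels = parse_predictions(b'1.0,0.0,2.0')
print(labels)
```

Remember that `test_data.csv` was written with a header and still contains the label column, so both need to be dropped before building payloads from it.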

#### **Expected Output -** Print the name of the deployed endpoint

In [44]:

model_predictor = xgb_model.deploy(initial_instance_count=1,
                    instance_type='ml.m5.xlarge', serializer = sagemaker.serializers.CSVSerializer())
model_predictor.endpoint_name

INFO:sagemaker:Creating model with name: xgboost-2024-03-20-01-47-34-585
INFO:sagemaker:Creating endpoint-config with name xgboost-2024-03-20-01-47-34-585
INFO:sagemaker:Creating endpoint with name xgboost-2024-03-20-01-47-34-585


-----!

'xgboost-2024-03-20-01-47-34-585'

#### 10.2 Delete endpoint 

**NOTE**: There is a limit on the number of active endpoints


#### **Expected Output -** Delete endpoint logs. Note that these will be automatically generated once you delete the endpoints.

In [57]:

model_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2024-03-20-01-47-34-585
INFO:sagemaker:Deleting endpoint with name: xgboost-2024-03-20-01-47-34-585


### 16. Hyperparameter tuning 

Read through the following links for more information:
* https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
* https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-supports-random-search-and-hyperparameter-scaling/

We'll perform hyperparameter tuning on two hyperparameters:

1. min_child_weight: 1 to 10
2. max_depth: 2 to 10

We'll use a `Random` search strategy. The code has been given for you, assuming the XGBoost estimator is stored in the variable `xgb_model`.

`max_parallel_jobs` is set to 2 so that hyperparameter tuning does not create too many instances at once
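
Conceptually, random search draws each hyperparameter independently from its range for every training job. The sketch below illustrates the idea with Python's `random` module. It is not SageMaker's sampler, so the same seed will not reproduce the tuner's choices; `sample_random_configs` is a hypothetical helper, and the ranges are assumed to be inclusive at both ends.

```python
import random


def sample_random_configs(ranges, n_jobs, seed):
    """Draw `n_jobs` configurations, each value sampled uniformly from its inclusive range."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_jobs)
    ]


# Ranges from the list above; four draws mirrors a small tuning budget.
ranges = {'max_depth': (2, 10), 'min_child_weight': (1, 10)}
configs = sample_random_configs(ranges, n_jobs=4, seed=123)
for cfg in configs:
    print(cfg)
```

With only four draws from a 9 x 10 grid, most combinations are never tried, which is why the best configuration found is a good candidate rather than a guaranteed optimum.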

In [45]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

Define the hyperparameters to tune in the dictionary `hyperparameter_ranges`

In [46]:
xgb_model.hyperparameters()

{'num_class': 3,
 'max_depth': 1,
 'min_child_weight': 2,
 'early_stopping_rounds': 5,
 'objective': 'multi:softmax',
 'num_round': 100}

In [47]:

hyperparameter_ranges = {
    'max_depth': IntegerParameter(2, 10),
    'min_child_weight':IntegerParameter(1, 10)
     }


In [49]:
optimizer = HyperparameterTuner(
    estimator=xgb_model,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name='XGBoost-Tuner',
    objective_type='Minimize',
    objective_metric_name='validation:merror',
    max_jobs=4,
    max_parallel_jobs=2,
    strategy='Random',
    random_seed=123)
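
The tuner minimizes `validation:merror`, the multiclass error rate: the number of wrongly classified rows divided by the total number of rows. A minimal sketch of that calculation, using made-up labels and a hypothetical `merror` helper, is:

```python
def merror(y_true, y_pred):
    """Multiclass error rate: share of rows whose predicted class differs from the true class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Made-up labels for five rows across the three wage bins.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1]
error = merror(y_true, y_pred)
print(error)  # 2 of 5 rows are wrong, so this prints 0.4
```

Accuracy is simply `1 - merror`, which is why the objective type is `Minimize`.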

Now that we have created the optimizer, we need to call `.fit()` on it to start the tuning job.

Refer to the demo and see how to call `fit()` and pass the appropriate data channels.

#### **Expected output** - Tuning log. Note that you only have to call `.fit` on the optimizer, with the required parameters, and the logs will be automatically generated

In [52]:
#%%time
start_time = time.time()

optimizer.fit({'train': s3_input_train, 'validation': s3_input_val}, logs = False)
end_time = time.time()

INFO:sagemaker:Creating hyperparameter tuning job with name: XGBoost-Tuner-240320-0212


.................................................!


### 17. Best hyperparameters 

#### **Expected output** - Hyperparameters of the `best_estimator` of the `HyperparameterTuner` in Python's dictionary(`dict`) format

In [54]:
optimizer.best_estimator().hyperparameters()


2024-03-20 02:15:46 Starting - Preparing the instances for training
2024-03-20 02:15:46 Downloading - Downloading the training image
2024-03-20 02:15:46 Training - Training image download completed. Training in progress.
2024-03-20 02:15:46 Uploading - Uploading generated training model
2024-03-20 02:15:46 Completed - Resource reused by training job: XGBoost-Tuner-240320-0212-004-806ca846


{'_tuning_objective_metric': 'validation:merror',
 'early_stopping_rounds': '5',
 'max_depth': '9',
 'min_child_weight': '2',
 'num_class': '3',
 'num_round': '100',
 'objective': 'multi:softmax'}
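
Note that the values in this dictionary come back as strings (`'9'`, not `9`). If they are needed as numbers, for example to compare against the original settings, they can be converted with a small helper. `coerce_numeric` below is a hypothetical function written for this illustration.

```python
def coerce_numeric(params):
    """Convert string values that look like ints or floats; leave everything else unchanged."""
    converted = {}
    for key, value in params.items():
        try:
            converted[key] = int(value)
        except (TypeError, ValueError):
            try:
                converted[key] = float(value)
            except (TypeError, ValueError):
                converted[key] = value
    return converted


# Dictionary copied from the output above.
best = {
    '_tuning_objective_metric': 'validation:merror',
    'early_stopping_rounds': '5',
    'max_depth': '9',
    'min_child_weight': '2',
    'num_class': '3',
    'num_round': '100',
    'objective': 'multi:softmax',
}
typed = coerce_numeric(best)
print(typed['max_depth'], type(typed['max_depth']).__name__)
```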

### 18. Real-time inference after hyperparameter tuning

#### 18.1 Deploy the tuned model as an endpoint
**NOTE**:  This step may take several minutes

1. Deploy the tuned model with the best parameters that you got in the previous steps as an endpoint for real-time inference
2. Delete the endpoint after performing inference in the inference notebook

Use the `.deploy()` method to deploy the model and create an endpoint for real-time inference as shown in the demo.

Use the following values for the parameters:
* `initial_instance_count = 1`
* `serializer = sagemaker.serializers.CSVSerializer()`
* `instance_type = 'ml.m5.xlarge'`

#### **Expected Output -** Print the name of the deployed endpoint

In [55]:
#%%time

start_time = time.time()

tuned_model_predictor = optimizer.deploy(initial_instance_count=1,
                    instance_type='ml.m5.xlarge', serializer = sagemaker.serializers.CSVSerializer())

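end_time = time.time()

# Print the endpoint name; a bare expression is only displayed when it is the cell's last line
print(tuned_model_predictor.endpoint_name)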


2024-03-20 02:15:46 Starting - Preparing the instances for training
2024-03-20 02:15:46 Downloading - Downloading the training image
2024-03-20 02:15:46 Training - Training image download completed. Training in progress.
2024-03-20 02:15:46 Uploading - Uploading generated training model
2024-03-20 02:15:46 Completed - Resource reused by training job: XGBoost-Tuner-240320-0212-004-806ca846

INFO:sagemaker:Creating model with name: XGBoost-Tuner-2024-03-20-02-22-23-756





INFO:sagemaker:Creating endpoint-config with name XGBoost-Tuner-240320-0212-002-2b47cacb
INFO:sagemaker:Creating endpoint with name XGBoost-Tuner-240320-0212-002-2b47cacb


----!

In [59]:
tuned_model_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: XGBoost-Tuner-240320-0212-002-2b47cacb
INFO:sagemaker:Deleting endpoint with name: XGBoost-Tuner-240320-0212-002-2b47cacb
