### Let's download the data

The data comes from [manifoldailearning](https://raw.githubusercontent.com/manifoldailearning/mlops-with-aws-datascientists/main/Section-13-Feature-Engineering/Dataset/bank-additional-full.csv).

The dataset contains marketing campaign data from a Portuguese bank. The goal is to recommend which customer segments to target to improve campaign performance.

In [10]:
!wget https://raw.githubusercontent.com/manifoldailearning/mlops-with-aws-datascientists/main/Section-13-Feature-Engineering/Dataset/bank-additional-full.csv

--2024-04-01 22:46:03--  https://raw.githubusercontent.com/manifoldailearning/mlops-with-aws-datascientists/main/Section-13-Feature-Engineering/Dataset/bank-additional-full.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5146674 (4.9M) [text/plain]
Saving to: ‘bank-additional-full.csv.1’


2024-04-01 22:46:03 (295 MB/s) - ‘bank-additional-full.csv.1’ saved [5146674/5146674]
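Note that `wget` saved the file as `bank-additional-full.csv.1` because a file with the original name already existed, so later cells that read `./bank-additional-full.csv` use the earlier copy. A minimal sketch of a download helper that always writes to the same file name is shown here. The `download` function and its `overwrite` parameter are hypothetical names introduced for illustration, not part of the original notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str, overwrite: bool = True) -> Path:
    """Fetch `url` into `dest`; replace an existing file only when `overwrite` is True."""
    path = Path(dest)
    if path.exists() and not overwrite:
        return path  # keep the copy that is already on disk
    # Write to a temporary name first so a failed download never leaves a partial file.
    tmp = path.with_name(path.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(path)
    return path


# Example call (requires network access, so it is left commented out):
# download("https://raw.githubusercontent.com/manifoldailearning/"
#          "mlops-with-aws-datascientists/main/Section-13-Feature-Engineering/"
#          "Dataset/bank-additional-full.csv", "bank-additional-full.csv")
```

Passing `-O bank-additional-full.csv` to `wget` should give the same overwrite behaviour from the shell.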



In [48]:
import os
import sagemaker
from sagemaker import get_execution_role, Session


role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [7]:
sess = Session()
bucket = sess.default_bucket()  # get the default bucket

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


### Feature Engineering with SageMaker Processing
Let's use SageMaker Processing to preprocess the data. SageMaker Processing runs the preprocessing job on a SageMaker-managed cluster that is separate from the notebook instance used for experimentation, so it can scale to terabytes of data and meet the demands of production-level ML workloads.

In [9]:
# Store the data in our default bucket
input_source = sess.upload_data('./bank-additional-full.csv', bucket=bucket)
input_source

's3://sagemaker-us-east-1-021332873423/data/bank-additional-full.csv'

In [11]:
# Define S3 paths for the train, test and validation sets
train_path = f"s3://{bucket}/train"
test_path = f"s3://{bucket}/test"
validation_path = f"s3://{bucket}/validation"

In [23]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Load the scikit-learn processing container
sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.t3.medium",
    instance_count=1,
    base_job_name="sklearnprocessing",
)

# Run the feature engineering script inside the container
sklearn_processor.run(
    code="feature-engg-script.py",
    inputs=[
        # The raw CSV is copied from S3 to this path inside the container
        ProcessingInput(
            source=input_source,
            destination="/opt/ml/processing/input",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    # Files the script writes under each source path are uploaded to the matching S3 destination
    outputs=[
        ProcessingOutput(
            output_name="train_data",
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(
            output_name="validation_data",
            source="/opt/ml/processing/output/validation",
            destination=validation_path,
        ),
        ProcessingOutput(
            output_name="test_data",
            source="/opt/ml/processing/output/test",
            destination=test_path,
        ),
    ],
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker:Creating processing-job with name sklearnprocessing-2024-04-02-00-04-23-615


................................................................................## Processing completed. Exiting.
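The contents of `feature-engg-script.py` are not shown in this notebook. As a rough sketch of what the final steps of such a script might look like, the example below moves the label into the first column, splits the rows, and writes CSV files without a header, which is the layout the built-in XGBoost algorithm expects for CSV input (the training log's `label_column=0` is consistent with this). The function names, the `y` label key in the usage comment, and the 70/20/10 split are assumptions made for illustration. The sketch uses only the standard library, whereas the real script may well use pandas and scikit-learn.

```python
import csv
import random
from typing import Dict, List, Sequence, Tuple


def to_xgboost_rows(rows: Sequence[Dict], label_key: str) -> List[list]:
    """Return each record as a list with the label first, followed by the features."""
    feature_keys = [k for k in rows[0] if k != label_key]
    return [[r[label_key]] + [r[k] for k in feature_keys] for r in rows]


def split_rows(rows: Sequence[list], train_end: float = 0.7,
               validation_end: float = 0.9, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle the rows and cut them at the given cumulative fractions (assumed 70/20/10)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    i = int(len(shuffled) * train_end)
    j = int(len(shuffled) * validation_end)
    return shuffled[:i], shuffled[i:j], shuffled[j:]


def write_headerless_csv(rows: Sequence[list], path: str) -> None:
    """Write rows as CSV with no header line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


# Inside a processing job the outputs would go under the container paths
# configured above, for example (hypothetical file name):
# train, validation, test = split_rows(to_xgboost_rows(records, "y"))
# write_headerless_csv(train, "/opt/ml/processing/output/train/train.csv")
```

Anything written under `/opt/ml/processing/output/train` is uploaded to `train_path` when the job finishes, and likewise for the validation and test directories.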



### Training
Let's train an XGBoost model.

In [31]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/train'.format(bucket), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/validation/'.format(bucket), content_type='csv')

In [32]:
import boto3  # needed for boto3.Session() below; not imported in earlier cells

container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [36]:
role = get_execution_role()
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/output'.format(bucket),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


INFO:sagemaker:Creating training-job with name: xgboost-2024-04-02-00-36-20-185


2024-04-02 00:36:20 Starting - Starting the training job...
2024-04-02 00:36:35 Starting - Preparing the instances for training...
2024-04-02 00:37:15 Downloading - Downloading input data...
2024-04-02 00:37:45 Downloading - Downloading the training image......
2024-04-02 00:38:40 Training - Training image download completed. Training in progress..
Arguments: train
[2024-04-02:00:38:50:INFO] Running standalone xgboost training.
[2024-04-02:00:38:50:INFO] File size need to be processed in the node: 4.35mb. Available memory size in the node: 8496.19mb
[2024-04-02:00:38:50:INFO] Determined delimiter of CSV input is ','
[00:38:50] S3DistributionType set as FullyReplicated
[00:38:50] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2024-04-02:00:38:50:INFO] Determined delimiter of CSV input is ','
[00:38:50] S3DistributionType set as FullyReplicated
[00

### Deployment and Prediction on Test set

In [37]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: xgboost-2024-04-02-00-42-23-491
INFO:sagemaker:Creating endpoint-config with name xgboost-2024-04-02-00-42-23-491
INFO:sagemaker:Creating endpoint with name xgboost-2024-04-02-00-42-23-491


-------!

In [38]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()


In [43]:
import pandas as pd
import os
import numpy as np

In [44]:
# loading the test data

# Build the S3 URIs with f-strings; os.path.join is meant for local file paths
test_data_x = pd.read_csv(f"{test_path}/test_script_x.csv", header=None)
test_data_y = pd.read_csv(f"{test_path}/test_script_y.csv", header=None)

In [45]:
# Split the test data into mini-batches of rows, invoke the endpoint on each batch,
# and convert the comma-separated responses into a NumPy array.
def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma before parsing
    return np.fromstring(predictions[1:], sep=',')


predictions = predict(test_data_x, xgb_predictor)

  return bound(*args, **kwds)
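To make the batching in `predict` concrete, here is a small sketch that reproduces the chunk sizes `np.array_split` would produce for a given row count, without calling the endpoint. The helper name `batch_sizes` is hypothetical. The 4,119-row figure is the total of the four cells in the confusion matrix reported for this test set.

```python
from typing import List


def batch_sizes(n_rows: int, rows: int = 500) -> List[int]:
    """Chunk sizes produced by np.array_split(data, int(n_rows / rows + 1))."""
    n_batches = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_batches)
    # array_split gives the first `extra` chunks one additional row each.
    return [base + 1] * extra + [base] * (n_batches - extra)


sizes = batch_sizes(4119)
print(len(sizes), "batches, largest has", max(sizes), "rows")
```

Because the batch count is `int(n / rows + 1)`, a row count that is an exact multiple of `rows` yields one more batch than strictly needed (for example, 1,000 rows are sent as three batches), which is harmless but keeps every batch at or below `rows`.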


In [46]:
# Creating the confusion matrix
pd.crosstab(index=test_data_y[0], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])


| actuals \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 3584 | 51 |
| 1 | 383 | 101 |
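`np.round(predictions)` turns the predicted probabilities into class labels with an implicit cutoff of 0.5. As an illustration of how a different cutoff changes the confusion matrix, the sketch below counts the four cells for an arbitrary threshold. The function name and the toy labels and scores are made up for the example; they are not taken from the model's output.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], scores: Sequence[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) when scores above `threshold` are labelled 1."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, scores):
        # Strict '>' mirrors np.round, which rounds exactly 0.5 down to 0.
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data for illustration only
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.3, 0.45, 0.9]
for cutoff in (0.5, 0.35):
    print(cutoff, confusion_counts(y_true, scores, cutoff))
```

Lowering the cutoff typically converts some false negatives into true positives at the cost of more false positives, which is one lever to consider when the positive class is under-detected.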


### Conclusions
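From the confusion matrix, the model performs much better on class 0 than on class 1: it correctly classifies 3,584 of 3,635 class 0 examples (about 98.6%) but only 101 of 484 class 1 examples (about 20.9%). The 383 false negatives far outnumber the 51 false positives, so recall on the positive class is the main weakness. The model can be tuned further to improve performance.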
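The standard summary metrics can be computed directly from the four cells of the confusion matrix. The short calculation below uses the counts reported above; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix (rows = actuals, columns = predictions)
tn, fp = 3584, 51   # actual class 0
fn, tp = 383, 101   # actual class 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found
specificity = tn / (tn + fp)      # share of actual negatives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

Accuracy alone looks favourable here because class 0 dominates the test set, which is why recall on class 1 is the more informative number for this problem.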

In [47]:
# Delete the endpoint to avoid further billing
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting endpoint configuration with name: xgboost-2024-04-02-00-42-23-491
INFO:sagemaker:Deleting endpoint with name: xgboost-2024-04-02-00-42-23-491
