# Tutorial: Build, Train and Deploy an ML Model with SageMaker

Amazon SageMaker is a fully managed service that provides the infrastructure needed to build, train and deploy machine learning (ML) models quickly and at scale. In this example we'll use it to build, train and deploy a model with the XGBoost algorithm.

The goal of the exercise is to predict whether a bank customer will enroll for a certificate of deposit.

## Environment and libraries

The first step is to import the required libraries and define the environment variables you need to prepare the data, and to train and deploy the ML model.

In [1]:
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.display import display
from time import gmtime, strftime
from sagemaker.serializers import CSVSerializer  # SDK v2 replacement for sagemaker.predictor.csv_serializer

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name    # Set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " +
     containers[my_region] + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the us-east-1 region. You will use the 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.
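The lookup `containers[my_region]` raises a bare `KeyError` when the notebook runs in a region that is missing from the dictionary. A minimal sketch of a more explicit lookup follows; `resolve_container` is a hypothetical helper, not part of the SageMaker SDK, and the mapping repeats only two of the entries above.

```python
# Hypothetical helper: look up the XGBoost image for a region and fail
# with a readable message when the region is not in the mapping.
CONTAINERS = {
    'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
    'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
}

def resolve_container(region, containers=CONTAINERS):
    """Return the container image URI for `region` or raise ValueError."""
    try:
        return containers[region]
    except KeyError:
        supported = ', '.join(sorted(containers))
        raise ValueError(
            "No XGBoost container listed for region '{}'. "
            "Supported regions: {}".format(region, supported)
        ) from None

print(resolve_container('us-east-1'))
```

Failing here, before any bucket or training job exists, keeps the error close to its cause.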


## Data preparation

The data used here is the Bank Marketing Data Set (https://archive.ics.uci.edu/ml/datasets/bank+marketing), which contains information on customer demographics, responses to marketing events, external factors, and a column that identifies whether the customer enrolled in a product offered by the bank.

Start by preprocessing the data and uploading it to an Amazon S3 bucket. We'll create the bucket here. Bucket names must be globally unique across all AWS accounts, so if you don't receive a success message after running the code, change the bucket name and try again.

In [2]:
bucket_name = 'sagemaker-tutorial-bucket-1234'
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={ 'LocationConstraint': my_region })
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

S3 bucket created successfully
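Because bucket names are shared by all AWS accounts, a fixed name is likely to collide with an existing bucket. One way to reduce retries is to append a random suffix. The sketch below is only an illustration: `make_bucket_name` and `is_valid_bucket_name` are hypothetical helpers, and the validation covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, hyphens and periods; starting and ending with a letter or digit).

```python
import re
import uuid

# Basic S3 naming rules only: 3-63 characters, lowercase letters, digits,
# hyphens and periods, beginning and ending with a letter or digit.
_BUCKET_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')

def is_valid_bucket_name(name):
    """Check a candidate name against the basic rules above."""
    return bool(_BUCKET_RE.fullmatch(name)) and '..' not in name

def make_bucket_name(base='sagemaker-tutorial-bucket'):
    """Append a random 8-character hex suffix to make collisions unlikely."""
    name = '{}-{}'.format(base, uuid.uuid4().hex[:8])
    if not is_valid_bucket_name(name):
        raise ValueError('Invalid bucket name: ' + name)
    return name

candidate = make_bucket_name()
print(candidate)
```

The result could be assigned to `bucket_name` before the bucket is created.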


Download the data to your SageMaker instance and load the data into a dataframe.

In [3]:
try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ', e)
    
try:
    model_data = pd.read_csv('./bank_clean.csv', index_col=0)
    print('Success: data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)


Success: downloaded bank_clean.csv.
Success: data loaded into dataframe.


Shuffle and split the data into training data (70% of the total data available) and test data (the remaining 30%).

In [4]:
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7*len(model_data))])
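
The split shuffles every row and then cuts the shuffled sequence at the index `int(0.7 * len(model_data))`. The same logic can be sketched without pandas; `shuffle_split` is a hypothetical helper and the integers stand in for the real records. The row count of 41,188 is inferred from the 28,831 training rows reported in the training log plus the 12,357 predictions returned later.

```python
import random

def shuffle_split(rows, train_fraction=0.7, seed=1729):
    """Shuffle a copy of `rows`, then cut it at int(train_fraction * len(rows))."""
    shuffled = list(rows)
    # Seeded for reproducibility; the order differs from pandas' sample().
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

rows = range(41188)                      # stand-in for the data rows
train_rows, test_rows = shuffle_split(rows)
print(len(train_rows), len(test_rows))   # 28831 12357
```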

## Train the ML model

We'll use the training dataset to train an ML model. The SageMaker pre-built XGBoost algorithm expects CSV training data with no header row and the target variable in the first column, so first move `y_yes` to the front and drop the header, then upload the file to the S3 bucket and define a training input that points to it.

In [5]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
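
The layout written by the cell above, label first and no header row, can be illustrated on two made-up records. This is only a sketch: `to_training_csv` is a hypothetical helper, and the feature columns `age` and `campaign` and their values are illustrative.

```python
import csv
import io

def to_training_csv(records, label='y_yes', drop=('y_no',)):
    """Write records as header-less CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for record in records:
        features = [value for key, value in record.items()
                    if key != label and key not in drop]
        writer.writerow([record[label]] + features)
    return buffer.getvalue()

# Made-up records; the real rows have many more feature columns.
records = [
    {'age': 56, 'campaign': 1, 'y_no': 1, 'y_yes': 0},
    {'age': 41, 'campaign': 3, 'y_no': 0, 'y_yes': 1},
]
print(to_training_csv(records))
```

Each output line starts with the 0/1 label, followed by the features in their original order.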

Set up the Amazon SageMaker session, create an instance of the XGBoost model (an estimator) and define the model's hyperparameters.

In [6]:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(containers[my_region], role, instance_count=1, instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix), sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6, subsample=0.8, silent=0,
                        objective='binary:logistic', num_round=100)
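
Checking hyperparameter values locally before launching a job can catch typos early. The sketch below encodes the value ranges given in the XGBoost parameter documentation for the settings used here; `check_hyperparameters` is a hypothetical helper, not an SDK function, and the rule for `num_round` (a positive integer) is an assumption.

```python
# Ranges from the XGBoost parameter documentation:
# eta in [0, 1], gamma >= 0, max_depth >= 0, min_child_weight >= 0,
# subsample in (0, 1]. num_round is assumed to be a positive integer.
CHECKS = {
    'max_depth': lambda v: isinstance(v, int) and v >= 0,
    'eta': lambda v: 0 <= v <= 1,
    'gamma': lambda v: v >= 0,
    'min_child_weight': lambda v: v >= 0,
    'subsample': lambda v: 0 < v <= 1,
    'num_round': lambda v: isinstance(v, int) and v > 0,
}

def check_hyperparameters(params):
    """Return the names of parameters whose values fall outside the ranges."""
    return [name for name, ok in CHECKS.items()
            if name in params and not ok(params[name])]

params = dict(max_depth=5, eta=0.2, gamma=4, min_child_weight=6,
              subsample=0.8, objective='binary:logistic', num_round=100)
print(check_hyperparameters(params))   # an empty list means all checks passed
```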

Start the training job, which fits the gradient-boosted trees on an `ml.m4.xlarge` instance.

In [7]:
xgb.fit({'train': s3_input_train})

2021-04-20 23:23:00 Starting - Starting the training job...
2021-04-20 23:23:25 Starting - Launching requested ML instances
ProfilerReport-1618960980: InProgress
......
2021-04-20 23:24:25 Starting - Preparing the instances for training.........
2021-04-20 23:25:45 Downloading - Downloading input data...
2021-04-20 23:26:26 Training - Training image download completed. Training in progress..
Arguments: train
[2021-04-20:23:26:27:INFO] Running standalone xgboost training.
[2021-04-20:23:26:28:INFO] Path /opt/ml/input/data/validation does not exist!
[2021-04-20:23:26:28:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8413.57mb
[2021-04-20:23:26:28:INFO] Determined delimiter of CSV input is ','
[23:26:27] S3DistributionType set as FullyReplicated
[23:26:28] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[23:26:28] s


2021-04-20 23:26:46 Uploading - Uploading generated training model
2021-04-20 23:26:46 Completed - Training job completed
Training seconds: 62
Billable seconds: 62


## Deploy the trained model

Deploy the trained model to a hosted `ml.m4.xlarge` instance and create a SageMaker endpoint to access it (this takes a while to run).

In [8]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------!

To generate predictions, drop the label columns from the test data, send the features to the endpoint as CSV, and parse the returned scores into an array.

In [9]:
from sagemaker.serializers import CSVSerializer

test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values    # Load the data into an array
xgb_predictor.serializer = CSVSerializer()                            # Set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')  # Generate predictions
predictions_array = np.fromstring(predictions[1:], sep=',')           # Turn prediction into an array
print(predictions_array.shape)

(12357,)
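
The endpoint returns its scores as one comma-separated string, which the cell above converts to an array. A sketch of the same parsing step without NumPy, followed by thresholding at 0.5; the response string is made up and `parse_scores` is a hypothetical helper.

```python
def parse_scores(body):
    """Convert a comma-separated response body into a list of floats."""
    return [float(token) for token in body.split(',') if token.strip()]

response_body = '0.0512,0.9130,0.4875,0.6201'   # made-up endpoint response
scores = parse_scores(response_body)

# For scores between 0 and 1, "score > 0.5" gives the same labels as
# np.round, which rounds exactly 0.5 down to 0.
labels = [1 if score > 0.5 else 0 for score in scores]
print(labels)   # [0, 1, 0, 1]
```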


## Evaluate model performance

Compare the actual values with the predicted values, rounded to 0 or 1, in a table called a **confusion matrix**.

In [10]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]
p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.5%

Predicted      No Purchase    Purchase
Observed
No Purchase    90% (10769)    37% (167)
Purchase        10% (1133)     63% (288) 
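
The percentages in the table are computed per predicted column, so 63% is the share of predicted purchases that were real purchases (precision). The usual summary metrics can be worked out from the four counts in the output above; the function name `summarize` is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tp + tn) / total,
        'precision': tp / (tp + fp),   # share of predicted purchases that are correct
        'recall': tp / (tp + fn),      # share of actual purchases that are found
    }

# Counts copied from the confusion matrix printed above.
metrics = summarize(tn=10769, fp=167, fn=1133, tp=288)
for name, value in metrics.items():
    print('{:<10}{:.1%}'.format(name, value))
```

Recall comes out at about 20%, so the model misses roughly four out of five customers who did enroll, which the 89.5% overall rate hides.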



## Clean up

Terminate the resources used:

- Delete your model and endpoint
- Delete training artifacts and S3 bucket
- Delete SageMaker notebook (in the SageMaker console)

In [11]:
# Delete the model first: delete_model() looks up the endpoint configuration,
# which delete_endpoint() removes
xgb_predictor.delete_model()

In [12]:
# Delete the endpoint and its configuration
xgb_predictor.delete_endpoint()
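
`delete_model()` resolves the model through the endpoint configuration, so it needs to run while that configuration still exists, that is, before `delete_endpoint()`. A sketch of a helper that fixes the order follows; `clean_up` and `FakePredictor` are hypothetical names, and the stand-in class only records calls so the example runs without AWS access.

```python
def clean_up(predictor):
    """Delete the model first, then the endpoint, reporting each outcome."""
    results = {}
    for step in ('delete_model', 'delete_endpoint'):
        try:
            getattr(predictor, step)()
            results[step] = 'ok'
        except Exception as error:      # keep going so the later step still runs
            results[step] = 'failed: {}'.format(error)
    return results

class FakePredictor:
    """Stand-in used only to show the call order without AWS access."""
    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append('delete_model')

    def delete_endpoint(self):
        self.calls.append('delete_endpoint')

fake = FakePredictor()
print(clean_up(fake), fake.calls)
```

With a real predictor the call would be `clean_up(xgb_predictor)`.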

In [13]:
# Delete training artifacts (this empties the bucket; call bucket_to_delete.delete()
# afterwards to remove the bucket itself)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'EVGYM6ECXKYSBMR7',
   'HostId': 'Cij+Zh3rO+L/28Rom+AFD9VFWJ3laPCqiA7FUFsiw/zWrcFOxwPpvBQCdZ4YJGR07TvM5D9gdPU=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'Cij+Zh3rO+L/28Rom+AFD9VFWJ3laPCqiA7FUFsiw/zWrcFOxwPpvBQCdZ4YJGR07TvM5D9gdPU=',
    'x-amz-request-id': 'EVGYM6ECXKYSBMR7',
    'date': 'Wed, 21 Apr 2021 00:06:12 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-04-20-22-55-51-729/profiler-output/system/incremental/2021042022/1618959540.algo-1.json'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-04-20-23-23-00-555/profiler-output/system/incremental/2021042023/1618961160.algo-1.json'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2021-04-20-22-55-51-729/rule-output/ProfilerReport-1618959351/profiler-output/profiler-reports/BatchSize.jso