# Titanic challenge with SageMaker - Remote Training

This notebook is part of a series on learning SageMaker through the Titanic challenge. The original challenge is defined at https://www.kaggle.com/c/titanic/data. In this notebook I experiment with using SageMaker features from my local machine instead of a SageMaker Notebook Instance.

In some cases our local machines are powerful enough to handle the data, and we only want to use AWS resources for training and serving models. This saves the cost of running Notebook Instances. The only difference from using a SageMaker Notebook Instance is that we must set up the development environment ourselves and provide proper credentials to make requests to AWS from the local machine.

Requirements:
- A local environment with Python and the necessary libraries/packages.
- S3 buckets to store the data and the output.
- An IAM user with permission to write to S3 and use the SageMaker service.
- An IAM role with at least S3 permissions, to be attached to the training instances.
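
As a rough illustration of the S3 permissions mentioned in the last requirement, the sketch below builds a minimal policy document as a Python dict. The function name `build_s3_policy` and the bucket name `my-example-bucket` are hypothetical, and a real training role may need further permissions (for example for logging), so treat this as a starting point rather than a complete policy.

```python
import json


def build_s3_policy(bucket_name):
    """Return a minimal IAM policy dict granting read/write on one bucket.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::{}".format(bucket_name)],
            },
            {
                # Reading and writing apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(bucket_name)],
            },
        ],
    }


policy = build_s3_policy("my-example-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy once the bucket name is replaced.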

The code uses version 2.18.0 of the SageMaker Python SDK and the built-in XGBoost algorithm.


## 1. Preparation

In [1]:
# import libraries
import boto3
import sagemaker

import pandas as pd
import numpy as np


In [2]:
# Define bucket name and prefix
bucket = '<bucket-name>' 
prefix = 'prefix'

# Create the boto3 and SageMaker sessions (credentials come from the local AWS configuration)
boto_session = boto3.Session()
session = sagemaker.Session(boto_session=boto_session)

# A role with the rights of reading and writing to S3
role = '<your:arn:role>'

In [3]:
# define the local data path
train_data_file = './data/processed/exp-raw/train.csv'
validation_data_file = './data/processed/exp-raw/validation.csv'
test_data_file = './data/processed/exp-raw/test.csv'

In [4]:
# upload local data into s3
s3_train_uri = session.upload_data(path=train_data_file, bucket=bucket, key_prefix='/'.join((prefix, 'basic/train')))
s3_validate_uri = session.upload_data(path=validation_data_file, bucket=bucket, key_prefix='/'.join((prefix, 'basic/validation')))
s3_test_uri = session.upload_data(path=test_data_file, bucket=bucket, key_prefix='/'.join((prefix, 'basic/test')))
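
Each `upload_data` call returns the S3 URI of the uploaded object. The sketch below shows the shape of that URI, assuming a single-file upload keeps the local file name and places it under `key_prefix`. The helper `expected_s3_uri` and the bucket name are hypothetical and only illustrate the naming; they do not call AWS.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI that a single-file upload is expected to return."""
    # Keep only the file name; accept both "/" and "\" as local separators.
    file_name = local_path.replace("\\", "/").rsplit("/", 1)[-1]
    key = posixpath.join(key_prefix, file_name)
    return "s3://{}/{}".format(bucket, key)


print(expected_s3_uri("my-example-bucket", "prefix/basic/train",
                      "./data/processed/exp-raw/train.csv"))
# s3://my-example-bucket/prefix/basic/train/train.csv
```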


## 2. Model training

In [5]:
# Get the XGBoost image uri corresponding to the current region
from sagemaker import image_uris
container = image_uris.retrieve('xgboost', session.boto_region_name, 'latest')

In [12]:
# Define the estimator and hyperparams
xgb = sagemaker.estimator.Estimator(container,
                                    role, # role to be attached to instance to access data
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    use_spot_instances=True,
                                    max_run=400,
                                    max_wait=600)

# Note: the hyperparameter name is early_stopping_rounds (plural)
xgb.set_hyperparameters(eval_metric='auc',
                        objective='binary:logistic',
                        num_round=100,
                        early_stopping_rounds=10)

In [13]:
# Define the input points for xgboost
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=s3_train_uri, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=s3_validate_uri, content_type='csv')
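
With `content_type='csv'`, the built-in XGBoost algorithm reads the label from the first column (the training log reports `label_column=0`) and expects no header row. The sketch below is one way to sanity-check a file before uploading it. The function `check_xgboost_csv` is hypothetical, and the assumption that labels are 0 or 1 comes from the `binary:logistic` objective used here.

```python
import csv
import io


def check_xgboost_csv(text):
    """Return a list of problems found in CSV text meant for training."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ["file is empty"]
    problems = []
    for line_no, row in enumerate(rows, start=1):
        try:
            values = [float(v) for v in row]
        except ValueError:
            # A header row or an empty cell ends up here.
            problems.append(
                "line {}: non-numeric value (header or missing value?)".format(line_no))
            continue
        if values[0] not in (0.0, 1.0):
            problems.append(
                "line {}: label {} is not 0 or 1".format(line_no, row[0]))
    return problems


good = "0,3,22.0\n1,1,38.0\n"
bad = "Survived,Pclass,Age\n0,3,22.0\n"
print(check_xgboost_csv(good))  # []
print(check_xgboost_csv(bad))   # one problem reported for the header line
```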

In [15]:
# Train the model
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2020-11-23 17:09:30 Starting - Starting the training job...
2020-11-23 17:09:32 Starting - Launching requested ML instances......
2020-11-23 17:10:39 Starting - Preparing the instances for training......
2020-11-23 17:11:39 Downloading - Downloading input data...
2020-11-23 17:12:31 Training - Training image download completed. Training in progress..
Arguments: train
[2020-11-23:17:12:32:INFO] Running standalone xgboost training.
[2020-11-23:17:12:32:INFO] File size need to be processed in the node: 0.03mb. Available memory size in the node: 8474.34mb
[2020-11-23:17:12:32:INFO] Determined delimiter of CSV input is ','
[17:12:32] S3DistributionType set as FullyReplicated
[17:12:32] 712x10 matrix with 7120 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-11-23:17:12:32:INFO] Determined delimiter of CSV input is ','
[17:12:32] S3DistributionType set as FullyReplicated
[17:12


2020-11-23 17:12:44 Uploading - Uploading generated training model
2020-11-23 17:12:44 Completed - Training job completed
Training seconds: 65
Billable seconds: 24
Managed Spot Training savings: 63.1%
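
The savings figure in the log is consistent with the share of training time that was not billed. A minimal sketch of that arithmetic, using the two values printed above (the function name `spot_savings_percent` is made up for this example):

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage of the training time that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Values taken from the training log: 65 training seconds, 24 billable seconds.
print(round(spot_savings_percent(65, 24), 1))  # 63.1
```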


## 3. Make predictions

In [2]:
# Create a Batch Transform transformer from the trained estimator
xgb_transformer = xgb.transformer(instance_count=1,
                                  instance_type='ml.m4.xlarge',
                                  strategy='MultiRecord',
                                  assemble_with='Line',
                                  output_path='s3://{}/{}/prediction/'.format(bucket, prefix))

In [17]:
xgb_transformer.transform(s3_test_uri, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

Wall time: 0 ns
............................2020-11-23T17:20:53.026:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-11-23 17:20:52 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-11-23 17:20:52 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-11-23 17:20:52 +0000] [1] [INFO] Using worker: gevent
[2020-11-23 17:20:52 +0000] [36] [INFO] Booting worker with pid: 36
[2020-11-23 17:20:52 +0000] [37] [INFO] Booting worker with pid: 37
[2020-11-23:17:20:53:INFO] Model loaded successfully for worker : 36
[2020-11-23:17:20:53:INFO] Model loaded successfully for worker : 37
[2020-11-2
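
Batch Transform writes one output object per input file, named after the input file with `.out` appended, which is why the next cell reads `test.csv.out`. The sketch below derives that location. The helper `transform_output_uri` and the bucket name are hypothetical and serve only to show the naming.

```python
def transform_output_uri(output_path, input_uri):
    """Derive the expected output URI for one Batch Transform input file."""
    # The output keeps the input file name and gains an ".out" suffix.
    file_name = input_uri.rstrip("/").rsplit("/", 1)[-1]
    return "{}/{}.out".format(output_path.rstrip("/"), file_name)


print(transform_output_uri(
    "s3://my-example-bucket/prefix/prediction/",
    "s3://my-example-bucket/prefix/basic/test/test.csv"))
# s3://my-example-bucket/prefix/prediction/test.csv.out
```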

In [21]:
# Read the predictions from S3 (pandas needs the s3fs package to read s3:// paths)
result_df = pd.read_csv('s3://{}/{}/prediction/test.csv.out'.format(bucket, prefix), header=None, names=['Survived'])

In [22]:
result_df.head(5)

   Survived
0  0.001603
1  0.067456
2  0.005251
3  0.010396
4  0.260264
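
With the `binary:logistic` objective the model outputs probabilities, while the Kaggle submission expects 0 or 1 in the `Survived` column. A minimal sketch of the conversion follows, assuming a cut-off of 0.5. Both the threshold and the helper name `to_labels` are choices made for this example, not something fixed by the notebook.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels using a fixed threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# The first five predictions shown above.
sample = [0.001603, 0.067456, 0.005251, 0.010396, 0.260264]
print(to_labels(sample))                 # [0, 0, 0, 0, 0]
print(to_labels(sample, threshold=0.2))  # [0, 0, 0, 0, 1]
```

The same rule can be applied to the whole column with `(result_df['Survived'] >= 0.5).astype(int)`.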
