# Titanic challenge with SageMaker - XGBoost as a framework

This notebook is part of a series on learning SageMaker through the Titanic challenge. The original challenge is defined at https://www.kaggle.com/c/titanic/data. In this notebook I experiment with using SageMaker XGBoost as a framework. In particular, I use a customized training script to make it perform cross validation. The whole training process runs locally.

Requirements:
- Docker installed locally.
- A development environment with Python and the necessary libraries/packages.
- Because SageMaker XGBoost hit errors when handling CSV files in local mode, the data is uploaded to S3 instead. We therefore need an S3 bucket and a role with access to S3.
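Since local mode runs the training container through Docker, it can help to confirm that Docker is reachable before starting. The following is a minimal sketch, not part of the original notebook; the helper name `docker_available` is hypothetical, and it assumes that `docker info` fails when the daemon is not running.

```python
import shutil
import subprocess


def docker_available() -> bool:
    """Return True if a docker executable is on PATH and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False
    try:
        # Assumption: `docker info` exits with a non-zero code when the daemon
        # cannot be reached.
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker ready for local mode:", docker_available())
```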

This code uses the SageMaker Python SDK 2.18.0 and the SageMaker XGBoost framework container, version 1.0-1.


## 1. Preparation

In [1]:
# import libraries
import boto3
import sagemaker

import pandas as pd
import numpy as np


In [2]:
# Define bucket name and prefix
bucket = '<your-bucket-name>' 
prefix = '<your-bucket-prefix>' # for example, sagemaker/titanic

# Define IAM role and sagemaker client
boto_session = boto3.Session()
session = sagemaker.Session(boto_session=boto_session)
role = '<your-role-arn>'

In [3]:
# define the local data path
train_data_file = './data/processed/exp-raw/train.csv'
validation_data_file = './data/processed/exp-raw/validation.csv'

In [79]:
# load both splits and merge them into one dataset; cross validation creates its own folds
train_df = pd.read_csv(train_data_file,header=None)
validation_df = pd.read_csv(validation_data_file,header=None)

merged_df = pd.concat([train_df,validation_df])


In [81]:
print(train_df.shape, validation_df.shape,merged_df.shape)

(712, 11) (179, 11) (891, 11)


In [82]:
merged_df.to_csv('./data/processed/train.csv', header=False, index=False)

In [4]:
s3_train_uri = session.upload_data(path='./data/processed/train.csv', bucket=bucket, key_prefix='/'.join((prefix, 'train')))
# s3_validate_uri = session.upload_data(path='data/processed/exp-raw/validation.csv', bucket=bucket, key_prefix='/'.join((prefix, 'validation')))
# s3_test_uri = session.upload_data(path='data/processed/exp-raw/test.csv', bucket=bucket, key_prefix='/'.join((prefix, 'test')))
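To make clear what `s3_train_uri` should look like, here is a small sketch of how the URI is composed for a single uploaded file. The helper `expected_s3_uri` is hypothetical and only mirrors my understanding of what `upload_data` returns (bucket, key prefix, then the file name); the bucket and prefix values are placeholders.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Compose the S3 URI expected for one uploaded file (illustrative only)."""
    # Only the file name of the local path is kept; the local directories are dropped.
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return "s3://{}/{}/{}".format(bucket, key_prefix, filename)


# Placeholder bucket and prefix, following the naming used in this notebook.
example_prefix = "/".join(("sagemaker/titanic", "train"))
print(expected_s3_uri("my-bucket", example_prefix, "./data/processed/train.csv"))
```

The estimator later receives this URI through the `train` channel, so the file name inside the prefix does not need to be referenced again.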

In [5]:
# Define the input channels for XGBoost
input_train = sagemaker.inputs.TrainingInput(s3_data=s3_train_uri, content_type='csv')
# input_validation = sagemaker.inputs.TrainingInput(s3_data=s3_validate_uri, content_type='csv')

We need to write a training script for XGBoost (saved as scripts/entry.py, the entry point passed to the estimator) that:
- Defines the parameters that will be passed to XGBoost.
- Reads the input data and converts it to a DMatrix for XGBoost, using a helper function from the SageMaker XGBoost container.
- Calls XGBoost's .cv() function to perform cross validation.
- Writes the result to the output directory; data in this directory is uploaded to S3.
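To illustrate what stratified k-fold cross validation means, here is a plain-Python sketch of splitting row indices into folds with similar class proportions. It is not XGBoost's implementation; the function `stratified_folds` and the toy labels are hypothetical and serve only to show the idea.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Split row indices into n_folds groups with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels: 6 positives and 14 negatives (hypothetical, not the Titanic data).
labels = [1] * 6 + [0] * 14
folds = stratified_folds(labels, n_folds=5, seed=1819)

for k, test_idx in enumerate(folds):
    test_set = set(test_idx)
    # Each fold is held out once; the remaining rows are used for training.
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    positives = sum(labels[i] for i in test_idx)
    print("fold", k, "train rows:", len(train_idx),
          "test rows:", len(test_idx), "test positives:", positives)
```

Each fold serves once as the held-out part while the model is trained on the rest, which is why a 5-fold run trains the model 5 times.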

In [None]:
import argparse
import json
import logging
import os
import pandas as pd
import pickle as pkl

from sagemaker_containers import entry_point
from sagemaker_xgboost_container.data_utils import get_dmatrix
from sagemaker_xgboost_container import distributed

import xgboost as xgb

import boto3

def _xgb_train(
        params,
        dtrain,
        evals,
        num_boost_round,
        num_fold,
        seed,
        model_dir,
        output_data_dir,
        is_master,
        early_stopping_round):
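    """Run XGBoost cross validation with the given arguments, with rabit initialized.

    This is our rabit execution function.

    :param params: XGBoost hyperparameters passed to xgb.cv().
    :param dtrain: Training data as a DMatrix.
    :param num_boost_round: Maximum number of boosting rounds.
    :param num_fold: Number of cross-validation folds.
    :param seed: Random seed passed to xgb.cv().
    :param output_data_dir: Directory where the cv result is written.
    :param early_stopping_round: Value passed to xgb.cv() as early_stopping_rounds.
    :param is_master: True if current node is master host in distributed training, or is running single node training job. Note that rabit_run will include this argument.
    """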

    cvresult = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        nfold=num_fold,
        metrics=['auc'],
        early_stopping_rounds=early_stopping_round,
        stratified=True,
        seed=seed)

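    if is_master:
        # Only the master host writes the result.
        model_location = os.path.join(output_data_dir, 'cv-result')
        # Use a context manager so the file is closed after writing.
        with open(model_location, 'wb') as f:
            pkl.dump(cvresult, f)
        logging.info("Stored cv result at {}".format(model_location))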



if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # cross validation params
    parser.add_argument('--k_fold', type=int)

    # XGBoost hyperparameters and cross-validation settings. Any argument
    # that the estimator does not pass stays None.
    parser.add_argument('--max_depth', type=int)
    parser.add_argument('--eta', type=float)
    # gamma (minimum split loss) is a real-valued parameter, so parse it as float
    parser.add_argument('--gamma', type=float)
    parser.add_argument('--min_child_weight', type=int)
    parser.add_argument('--subsample', type=float)
    parser.add_argument('--verbose', type=int)
    parser.add_argument('--objective', type=str)
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--early_stopping_round', type=int)
    parser.add_argument('--eval_metric', type=str)
    parser.add_argument('--num_fold', type=int)
    parser.add_argument('--seed', type=int)

    # Sagemaker specific arguments. Defaults are set in the environment
    # variables.
    parser.add_argument('--output_data_dir', type=str,
                        default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model_dir', type=str,
                        default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str,
                        default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--sm_hosts', type=str, default=os.environ['SM_HOSTS'])
    parser.add_argument('--sm_current_host', type=str,
                        default=os.environ['SM_CURRENT_HOST'])

    args, _ = parser.parse_known_args()

    # Get SageMaker host information from runtime environment variables
    sm_hosts = json.loads(args.sm_hosts)
    sm_current_host = args.sm_current_host

    dtrain = get_dmatrix(args.train, 'csv')
    watchlist = [(dtrain, 'train')]

    train_hp = {
        'max_depth': args.max_depth,
        'eta': args.eta,
        'gamma': args.gamma,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'verbose': args.verbose,
        'objective': args.objective,
        'eval_metric': args.eval_metric
    }

    xgb_train_args = dict(
        params=train_hp,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=args.num_round,
        num_fold=args.num_fold,
        seed=args.seed,
        model_dir=args.model_dir,
        early_stopping_round=args.early_stopping_round,
        output_data_dir=args.output_data_dir)

    if len(sm_hosts) > 1:
        # Wait until all hosts are able to find each other
        entry_point._wait_hostname_resolution()

        # Execute training function after initializing rabit.
        distributed.rabit_run(
            exec_fun=_xgb_train,
            args=xgb_train_args,
            include_in_training=(dtrain is not None),
            hosts=sm_hosts,
            current_host=sm_current_host,
            update_rabit_args=True
        )
    else:
        # If single node training, call training method directly.
        if dtrain:
            xgb_train_args['is_master'] = True
            _xgb_train(**xgb_train_args)
        else:
            raise ValueError("Training channel must have data to train model.")

## 2. Model training

In [38]:
params = {
        "eval_metric": "auc",
        "early_stopping_round":"50",
        "objective":"binary:logistic",
        "num_round":"50",
        "num_fold":"5",
        "seed":"1819"}
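In script mode, SageMaker hands hyperparameters to the entry script as command-line arguments, which is why the script declares each one with argparse and a type. Below is a minimal sketch of that round trip. It assumes a `--name value` argument shape and declares only a subset of the script's arguments for illustration.

```python
import argparse

params = {
    "eval_metric": "auc",
    "early_stopping_round": "50",
    "objective": "binary:logistic",
    "num_round": "50",
    "num_fold": "5",
    "seed": "1819",
}

# Flatten the dict into the argument list the script would receive (assumed shape).
argv = []
for name, value in params.items():
    argv.extend(["--" + name, value])

# Declare only a few of the script's arguments to keep the sketch short.
parser = argparse.ArgumentParser()
parser.add_argument("--max_depth", type=int)
parser.add_argument("--num_round", type=int)
parser.add_argument("--num_fold", type=int)
parser.add_argument("--seed", type=int)
parser.add_argument("--objective", type=str)
parser.add_argument("--eval_metric", type=str)

# parse_known_args ignores arguments that were not declared instead of failing.
args, unknown = parser.parse_known_args(argv)
print(args.num_round, args.num_fold, args.seed, args.max_depth)
print(unknown)
```

Two things are worth noticing: string values are converted to the declared types, and a hyperparameter that is declared but never passed (here `max_depth`) ends up as `None`.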

In [39]:
# Define the estimator and hyperparams
from sagemaker.xgboost.estimator import XGBoost
xgb = XGBoost(
    entry_point='scripts/entry.py',
    framework_version='1.0-1',
    hyperparameters=params,
    role=role, 
    instance_count=1, 
    instance_type='local',
    output_path='s3://{}/{}/output'.format(bucket, prefix)
)

In [49]:
%%time
xgb.fit({'train': input_train})

Training took 14.4 s.

## 3. Verify the result

In [41]:
# The job name is generated automatically if we don't specify one when calling .fit()
job_name='sagemaker-xgboost-2020-12-16-09-52-02-780'

In [42]:
boto3.client('s3').download_file(bucket, '{}/output/{}/output.tar.gz'.format(prefix,job_name), 'output/cross_val_result.tar.gz')

In [43]:
!tar -xf  output/cross_val_result.tar.gz -C output
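If the `tar` command is not available (for example on some Windows setups), the same extraction can be done with Python's standard library. This is a sketch under that assumption; the helper name `extract_archive` is hypothetical, and the paths are the ones used in this notebook.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target_dir)
    return names


archive_path = "output/cross_val_result.tar.gz"
# Only extract when the archive has actually been downloaded.
if os.path.exists(archive_path):
    print(extract_archive(archive_path, "output"))
else:
    print("Archive not found:", archive_path)
```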

In [44]:
!ls -l output

total 8
drwxrwxrwx 1 jiji jiji    0 déc.   8 16:15 cross_val_result1
-rwxrwxrwx 1 jiji jiji  789 déc.   8 17:33 cross_val_result2.tar.gz
-rwxrwxrwx 1 jiji jiji 2449 déc.  16 10:52 cross_val_result.tar.gz
drwxrwxrwx 1 jiji jiji  152 déc.  16 10:52 data


In [45]:
import pickle as pkl 

with open('output/data/cv-result', 'rb') as pickle_file:
    cv_result = pkl.load(pickle_file)

In [46]:
cv_result

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.891615       0.007249       0.839727      0.033336
1        0.914508       0.003851       0.846446       0.03169
2         0.92948       0.002587       0.862434       0.02866
3        0.936343       0.001959        0.86123      0.025279
4        0.944799       0.006118       0.857478      0.028622
5        0.952957       0.005311         0.8624      0.023567
6        0.958371       0.004624       0.860739      0.025974
7        0.961795       0.004114        0.86078      0.024435
8        0.965844       0.004601       0.858598       0.02532
9        0.969108       0.003785       0.859689      0.026967


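The result shows the score statistics for each boosting round: the mean and standard deviation of the AUC across the folds, on both the training part and the held-out part. Here I set the maximum number of rounds to 50 and the number of folds to 5, so the model is trained 5 times with different train/validation splits, and each training builds up to 50 boosted trees.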
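A natural next step is to pick the round with the best held-out score. The sketch below does this in plain Python, using the test-auc-mean values copied from the rows displayed above; the variable names are illustrative only.

```python
# (round, test-auc-mean) pairs copied from the displayed cv result.
cv_rows = [
    (0, 0.839727),
    (1, 0.846446),
    (2, 0.862434),
    (3, 0.86123),
    (4, 0.857478),
    (5, 0.8624),
    (6, 0.860739),
    (7, 0.86078),
    (8, 0.858598),
    (9, 0.859689),
]

# Select the round with the highest mean AUC on the held-out folds.
best_round, best_auc = max(cv_rows, key=lambda row: row[1])
print("best round:", best_round, "test-auc-mean:", best_auc)
```

With the DataFrame loaded from the pickle, the same lookup can be written as `cv_result['test-auc-mean'].idxmax()`.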