# Titanic challenge with XGBoost - Modeling

This notebook is part of a series on learning SageMaker through the Titanic challenge. The original challenge is defined at https://www.kaggle.com/c/titanic/data. In this notebook I experiment with calling the boto3 SDK directly instead of the high-level SageMaker Python SDK.

This notebook runs on my local machine and uses AWS resources for training and serving.

Requirements:
- A local environment with Python and the necessary libraries/packages (boto3, pandas, sagemaker).
- An S3 bucket to store the data and the output.
- An IAM user with permission to write to S3 and to use the SageMaker service.
- An IAM role with at least S3 permissions, to be attached to the training instances.



In [1]:
# import libraries
import boto3
import pandas as pd

In [2]:
# Define bucket name and prefix
bucket = '<bucket-name>' 
prefix = 'prefix'

# Create a boto3 session (its region is used later to look up the training image)
boto_session = boto3.Session()

# ARN of an IAM role that can read from and write to S3
role = '<role-arn>'

In [3]:
boto_session.region_name

'eu-west-1'

In [4]:
# define the local data path
train_data_file = './data/processed/exp-raw/train.csv'
validation_data_file = './data/processed/exp-raw/validation.csv'
test_data_file = './data/processed/exp-raw/test.csv'

In [5]:
s3_client = boto3.client('s3')

In [14]:
# Upload the local data to S3
try:
    s3_client.upload_file(train_data_file, bucket, '{}/train.csv'.format(prefix))
    s3_client.upload_file(validation_data_file, bucket, '{}/validation.csv'.format(prefix))
    s3_client.upload_file(test_data_file, bucket, '{}/test.csv'.format(prefix))
except Exception as e:
    print("Unexpected error:", e)
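
The three uploads above follow the same pattern, so they can be folded into a loop. The following is a minimal sketch, not part of the original notebook: `upload_splits` is a hypothetical helper name, and it assumes each file should keep its local file name under `prefix`. The client is passed in as an argument so the helper works with the `s3_client` created earlier.

```python
import os


def upload_splits(s3_client, bucket, prefix, local_paths):
    """Upload each local file to s3://<bucket>/<prefix>/<file name>.

    `s3_client` is expected to be a boto3 S3 client (anything with an
    `upload_file(Filename, Bucket, Key)` method works).
    Returns the S3 URIs of the uploaded objects.
    """
    uris = []
    for path in local_paths:
        # Assumption: the object key reuses the local file name.
        key = "{}/{}".format(prefix, os.path.basename(path))
        s3_client.upload_file(path, bucket, key)
        uris.append("s3://{}/{}".format(bucket, key))
    return uris
```

In this notebook it would be called as `upload_splits(s3_client, bucket, prefix, [train_data_file, validation_data_file, test_data_file])`.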

## 2. Train/tune the model with boto3

In [22]:
from sagemaker import image_uris

container = image_uris.retrieve('xgboost', boto_session.region_name, 'latest')

In [23]:
s3_input_train = 's3://{}/{}/train.csv'.format(bucket, prefix)
s3_input_validation ='s3://{}/{}/validation.csv'.format(bucket, prefix)

In [24]:
# Define the tuning configuration
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta",
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight",
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha",            
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "5",
          "MinValue": "2",
          "Name": "max_depth",
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 6,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:auc",
      "Type": "Maximize"
    }
}
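
Note that the range bounds in the configuration above are passed as strings, not numbers. A small helper can build the entries from plain numbers and catch inverted bounds early. This is only a sketch: `param_range` is a hypothetical name, and the values simply mirror the ranges defined above.

```python
def param_range(name, min_value, max_value):
    """Build one entry for ContinuousParameterRanges or IntegerParameterRanges."""
    if min_value > max_value:
        raise ValueError("{}: min {} is greater than max {}".format(name, min_value, max_value))
    # The bounds are sent as strings, matching the configuration above.
    return {"Name": name, "MinValue": str(min_value), "MaxValue": str(max_value)}


# Same ranges as in the tuning configuration above.
parameter_ranges = {
    "CategoricalParameterRanges": [],
    "ContinuousParameterRanges": [
        param_range("eta", 0, 1),
        param_range("min_child_weight", 1, 10),
        param_range("alpha", 0, 2),
    ],
    "IntegerParameterRanges": [param_range("max_depth", 2, 5)],
}
```

The resulting `parameter_ranges` dictionary could be used as the value of `"ParameterRanges"` in `tuning_job_config`.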

In [26]:
# Define the training job

training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": container,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "EnableManagedSpotTraining": True,
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/xgboost/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 5
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "auc",
      "num_round": "100",
      "objective": "binary:logistic",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 2400,
      "MaxWaitTimeInSeconds": 3600
    }
}
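
The `train` and `validation` channels in the definition above differ only in their name and S3 URI. As a sketch, a hypothetical helper `csv_channel` can generate both entries; the URIs below are placeholders for illustration, whereas the notebook would pass `s3_input_train` and `s3_input_validation`.

```python
def csv_channel(name, s3_uri):
    """Return one InputDataConfig entry for an uncompressed CSV channel."""
    return {
        "ChannelName": name,
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
            }
        },
    }


# Placeholder URIs (assumption); substitute the real S3 locations.
input_data_config = [
    csv_channel("train", "s3://example-bucket/prefix/train.csv"),
    csv_channel("validation", "s3://example-bucket/prefix/validation.csv"),
]
```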

In [46]:
# Create the tuning job through the boto3 SageMaker client
tuning_job_name = "tuning-via-boto3"
smclient = boto3.client('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                            HyperParameterTuningJobConfig = tuning_job_config,
                                            TrainingJobDefinition = training_job_definition)
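
Because the tuning job name is fixed, re-running the cell above would try to create a job whose name already exists. One way around this, sketched below, is to append a timestamp. `unique_job_name` is a hypothetical helper, and the 32-character limit is my understanding of the constraint on tuning job names, so treat it as an assumption and check the API reference.

```python
import time


def unique_job_name(base, max_len=32):
    """Append a UTC timestamp to `base`, truncating `base` to fit `max_len`."""
    suffix = time.strftime("-%y%m%d-%H%M%S", time.gmtime())  # 14 characters
    return base[: max_len - len(suffix)] + suffix


tuning_job_name = unique_job_name("tuning-via-boto3")
```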

In [32]:
smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus']

'Completed'
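
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: `wait_for_tuning_job` is a hypothetical helper, it treats `InProgress` and `Stopping` as the only non-final statuses, and the `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import time


def wait_for_tuning_job(sm_client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll a tuning job until it leaves the InProgress/Stopping states.

    `sm_client` is expected to be a boto3 SageMaker client.
    Returns the last status seen (for example 'Completed' or 'Failed').
    """
    while True:
        status = sm_client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )["HyperParameterTuningJobStatus"]
        if status not in ("InProgress", "Stopping"):
            return status
        sleep(poll_seconds)
```

With the names used in this notebook, the call would be `wait_for_tuning_job(smclient, tuning_job_name)`.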

## 3. Deploy to an endpoint using boto3

In [47]:
best_training_job = smclient.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name)['BestTrainingJob']
best_training_job

In [48]:
model_name = best_training_job['TrainingJobName'] + '-mod'

info = smclient.describe_training_job(TrainingJobName=best_training_job['TrainingJobName'])
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = smclient.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

In [49]:
endpoint_config_name = 'DEMO-XGBoostEndpointConfigBoto3'
print(endpoint_config_name)
create_endpoint_config_response = smclient.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

In [52]:
%%time
import time

endpoint_name = 'DEMO-XGBoostEndpointConfigBoto3'
print(endpoint_name)
create_endpoint_response = smclient.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = smclient.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = smclient.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)
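
The loop above stops as soon as the status is no longer `Creating`, which also happens when endpoint creation fails. boto3 provides an `endpoint_in_service` waiter for the SageMaker client that can replace the manual loop. The sketch below wraps it in a hypothetical helper, `wait_until_in_service`; the delay and attempt counts are arbitrary choices, not values from the original notebook.

```python
def wait_until_in_service(sm_client, endpoint_name, delay=30, max_attempts=60):
    """Block until the endpoint is in service, then return its status.

    `sm_client` is expected to be a boto3 SageMaker client. The waiter
    raises an exception if the endpoint fails or the attempts run out.
    """
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

Here it would be called as `wait_until_in_service(smclient, endpoint_name)`.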


To run inference through an endpoint, we need a client for the SageMaker runtime service.

In [25]:
endpoint_name = 'DEMO-XGBoostEndpointConfigBoto3'
runtime = boto3.client('sagemaker-runtime')


In [26]:
# Try predicting the first line of the test data
with open(test_data_file) as f:
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/csv',
                                       Body=f.readline())

In [27]:
import json

result = json.loads(response['Body'].read().decode())

In [28]:
result

0.041144978255
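
With the `binary:logistic` objective, the value returned is a probability rather than a class, so it still has to be turned into a 0/1 label. The sketch below is one way to do that. `to_labels` is a hypothetical helper, the 0.5 cut-off is an assumption rather than a tuned value, and the parsing assumes that a multi-row response separates scores with commas or newlines.

```python
def to_labels(body_text, threshold=0.5):
    """Parse a text response of scores and pair each with a 0/1 label."""
    # Assumption: scores are separated by commas and/or newlines.
    tokens = body_text.replace("\n", ",").split(",")
    scores = [float(tok) for tok in tokens if tok.strip()]
    return [(score, int(score >= threshold)) for score in scores]


# The score returned above falls below the threshold, so the label is 0.
single = to_labels("0.041144978255")
```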