# Train and deploy a regression model with the Amazon SageMaker XGBoost Algorithm using Script Mode
_**Distributed training for regression with Amazon SageMaker XGBoost script mode**_

---

## Runtime

This notebook takes approximately 5 minutes to run.

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
  1. [Fetch the dataset](#Fetch-the-dataset)
  1. [Ingest data](#Ingest-data)
1. [Create an XGBoost training script](#Create-an-XGBoost-training-script)
1. [Train the XGBoost model](#Train-the-XGBoost-model)
1. [Deploy the XGBoost model](#Deploy-the-XGBoost-model)
1. [Cleanup](#Cleanup)

---

# SEBTAC's ANALYSIS SUMMARY

OBJECTIVE:
- Analyze the parallelization implemented for the XGBoost Algorithm in Amazon SageMaker. Two implementations tested:
    - XGBoost as a framework
    - XGBoost as a built-in algorithm

BASED ON:
- xgboost_abalone_dist_script_mode.ipynb from amazon-sagemaker-examples
- https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

MODIFICATIONS:
- data_split() produces 4 training files. The source data is doubled up to 10 times (ENLARGE) before the split, and the files are additionally copy-duplicated to produce the final 32 files for GPU testing.
- TRAIN TIME INSIDE.PY was added for XGBoost as a framework to bring the time measurement as close to the training step as possible.
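
To make the enlargement concrete: the loop in `data_split()` runs `data = data + data` once per `ENLARGE` step, so the row count doubles on every pass. The sketch below works through that arithmetic. The base row count of 4,177 is an assumption (the commonly cited size of the Abalone dataset), and `enlarged_rows` is a hypothetical helper, not part of the notebook.

```python
def enlarged_rows(base_rows: int, enlarge: int) -> int:
    """Row count after `enlarge` rounds of `data = data + data`."""
    return base_rows * 2 ** enlarge


BASE_ROWS = 4177  # assumption: row count of the original Abalone file

for enlarge in (1, 9, 10):
    total = enlarged_rows(BASE_ROWS, enlarge)
    # Each training file receives 24 percent of the enlarged data
    per_train_file = int((24 / 100.0) * total)
    print(enlarge, total, per_train_file)
```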

METHODOLOGY:
- Parallelization is confirmed through execution-time measurements, with the expectation that a stronger node and a higher level of parallelization give a shorter execution time.

- The desired time measure would capture only the time spent on training and exclude the time spent on the initial data download to the instance, output generation, and instance shutdown. Not all measures guarantee this, and not all measures are available in all scenarios.

- Up to 3 time measures are used in the analysis:
    - TRAIN TIME OUTSIDE
        - Execution time of the .fit() method. It generally differentiates very little between runs because it also captures the time required to start and stop the EC2 instances.
    - TRAIN TIME INSIDE.PY OR "@tracker xx.xxxxxxxxxxxxxx secs between node start and job finish"
        - These measures are intended to capture only the time spent executing the XGBoost algorithm.
        - TRAIN TIME INSIDE.PY is available for the IN FRAMEWORK tests and excludes the initial data preparation and closing steps from the measurement.
        - "@tracker xx.xxxxxxxxxxxxxx secs between node start and job finish" is available for the IN ALGORITHM tests, but it seems to also cover some data preparation and output generation steps, so its values are not comparable with TRAIN TIME INSIDE.PY.
            - This measure does not appear to be available for single-instance tests.
    - Billable seconds: 592/4
        - This is an internal AWS measure that seems to capture the effects of parallelization well. On the surface its values run contrary to expectations: increasing parallelization usually increases the value, because it is the sum of billable seconds across all running instances. The measure used here is therefore the average execution time per instance: Billable Seconds / Number of Instances. At that level, increasing the instance count does lead to a shorter execution time per instance. This measure also captures the time spent on data and output preparation.

- It would be best to implement the equivalent of the TRAIN TIME INSIDE.PY measure for the built-in algorithm, but this requires creating a custom container, which is not the objective of this work.
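
The two log-derived measures above can be extracted with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, the regular expression assumes the `@tracker` line has exactly the wording quoted above, and the sample log line (including its number) is made up for illustration.

```python
import re
from typing import Optional

# Matches the "@tracker <seconds> secs between node start and job finish" wording
TRACKER_PATTERN = re.compile(
    r"@tracker\s+([0-9]+(?:\.[0-9]+)?)\s+secs between node start and job finish"
)


def parse_tracker_seconds(log_line: str) -> Optional[float]:
    """Return the seconds reported in an '@tracker ... secs' log line, or None."""
    match = TRACKER_PATTERN.search(log_line)
    return float(match.group(1)) if match else None


def billable_seconds_per_instance(billable_seconds: float, instance_count: int) -> float:
    """Average billable seconds per instance (Billable Seconds / Number of Instances)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return billable_seconds / instance_count


# The 592/4 example from the text
print(billable_seconds_per_instance(592, 4))  # 148.0

# Hypothetical log line; the value 12.5 is invented for this example
sample = "INFO @tracker 12.5 secs between node start and job finish"
print(parse_tracker_seconds(sample))  # 12.5
```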

FINDINGS:
- General
    - Both the XGBoost framework and the built-in algorithm show positive effects of parallelization.
        - In this analysis it is not possible to say which approach is more efficient, because the available measures differ and some of them include elements external to the actual model training.
    - For the clearest impact it is important to work with larger sample sizes, hence the increase in file sizes and in the number of files.
    - "Pipe" input-data mode was not tested, as it seems not to work with the LIBSVM format.
    - Data sharding provides benefits in multi-instance training. The effect might be present also w...

- XGBoost as a FRAMEWORK
    - Both upgrading the instance and increasing the instance count speed up training.
        - The effect is most visible between 1 and 2 instances. It flattens when adding more instances, but this is possibly due to the still-low sample sizes.
    - Training on GPUs has a strong positive effect, possibly even in terms of cost-efficiency.
        - Training on an instance with 8 GPUs does not provide a further speed-up. This can be due to several factors:
            - The data is too small.
            - The framework implementation uses only a single GPU per instance, so GPU execution can be parallelized only with multiple instances. This requires further research.
            - Data and output preparation takes a significant portion of the execution time with the available sample size.

- XGBoost as a built-in ALGORITHM
    - Same results as for the FRAMEWORK.
    - "FastFile" input-data mode is slower, but possibly due to the small sample sizes.
    - Multi-GPU training indicates some speed-up, but it is difficult to say whether this is due to the stronger GPU type or to parallelization. The weak effect can be attributed to the still too-small sample size.

## Introduction

This notebook demonstrates the use of Amazon SageMaker XGBoost to train and host a regression model. [XGBoost (eXtreme Gradient Boosting)](https://xgboost.readthedocs.io) is a popular and efficient machine learning (ML) algorithm used for regression and classification tasks on tabular datasets. It implements a technique known as gradient boosting on trees, performs remarkably well in ML competitions, and gets a lot of attention from customers.

We use the [Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), originally from the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/abalone). In this libsvm converted version, the nominal feature (Male/Female/Infant) has been converted into a real valued feature as required by XGBoost. The age of the abalone snail is predicted from eight physical measurements.  

---
## Setup

This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance on the Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 bucket and prefix to use for the training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
1. The IAM role ARN used to grant access to your data.

In [None]:
!pip3 install -U sagemaker

In [3]:
import os
import boto3
import re
import sagemaker

role = sagemaker.get_execution_role()
region = sagemaker.Session().boto_region_name

bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-xgboost-dist-script"

### Fetch the dataset

The following methods split the data into train/test/validation datasets and upload files to S3.

In [4]:
import io
import boto3
import random

random.seed(42)


def data_split(FILE_DATA,
               DATA_DIR,
               FILE_TRAIN_0,
               FILE_TRAIN_1,
               FILE_TRAIN_2,
               FILE_TRAIN_3,               
               FILE_VALIDATION,
               FILE_TEST,
               PERCENT_TRAIN_0,
               PERCENT_TRAIN_1,
               PERCENT_TRAIN_2,
               PERCENT_TRAIN_3,               
               PERCENT_VALIDATION,
               PERCENT_TEST,
               ENLARGE
              ):

    # Read all lines of the source file into memory
    with open(FILE_DATA, "r") as source_file:
        data = source_file.readlines()

    print(-1, len(data))

    # Each pass doubles the data, so ENLARGE passes multiply it by 2**ENLARGE
    for i in range(ENLARGE):
        data = data + data
        print(i, len(data))

    # Open one output file per split
    train_file_0 = open(DATA_DIR + "/" + FILE_TRAIN_0, "w")
    train_file_1 = open(DATA_DIR + "/" + FILE_TRAIN_1, "w")
    train_file_2 = open(DATA_DIR + "/" + FILE_TRAIN_2, "w")
    train_file_3 = open(DATA_DIR + "/" + FILE_TRAIN_3, "w")
    valid_file = open(DATA_DIR + "/" + FILE_VALIDATION, "w")
    tests_file = open(DATA_DIR + "/" + FILE_TEST, "w")

    num_of_data = len(data)
    num_train_0 = int((PERCENT_TRAIN_0 / 100.0) * num_of_data)
    num_train_1 = int((PERCENT_TRAIN_1 / 100.0) * num_of_data)
    num_train_2 = int((PERCENT_TRAIN_2 / 100.0) * num_of_data)
    num_train_3 = int((PERCENT_TRAIN_3 / 100.0) * num_of_data)    
    num_valid = int((PERCENT_VALIDATION / 100.0) * num_of_data)
    num_tests = int((PERCENT_TEST / 100.0) * num_of_data)

    data_fractions = [num_train_0, num_train_1, num_train_2, num_train_3, num_valid, num_tests]
    split_data = [[], [], [], [], [], []]

    rand_data_ind = 0

    for split_ind, fraction in enumerate(data_fractions):
        print(5)
        for i in range(fraction):
            #print(6)
            rand_data_ind = random.randint(0, len(data) - 1)
            split_data[split_ind].append(data[rand_data_ind])
            data.pop(rand_data_ind)

    for l in split_data[0]:
        train_file_0.write(l)

    for l in split_data[1]:
        train_file_1.write(l)

    for l in split_data[2]:
        train_file_2.write(l)
        
    for l in split_data[3]:
        train_file_3.write(l)        

    for l in split_data[4]:
        valid_file.write(l)

    for l in split_data[5]:
        tests_file.write(l)

    train_file_0.close()
    train_file_1.close()
    train_file_2.close()
    train_file_3.close()    
    valid_file.close()
    tests_file.close()

def write_to_s3(fobj, bucket, key):
    return (
        boto3.Session(region_name=region)
        .resource("s3")
        .Bucket(bucket)
        .Object(key)
        .upload_fileobj(fobj)
    )

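def upload_to_s3(bucket, channel, filename):
    key = prefix + "/" + channel
    # Note: the printed URL appends the local filename; the object itself is stored under `key`
    url = "s3://{}/{}/{}".format(bucket, key, filename)
    print("Writing to {}".format(url))
    # Close the local file once the upload has finished
    with open(filename, "rb") as fobj:
        write_to_s3(fobj, bucket, key)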

### Ingest data

Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by SageMaker Processing, Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.

In [None]:
%%time

s3 = boto3.client("s3")

# Load the dataset
FILE_DATA = "abalone"
s3.download_file(
    "sagemaker-sample-files", "datasets/tabular/uci_abalone/abalone.libsvm", FILE_DATA
)

# Split the downloaded data into train/test/validation files
FILE_TRAIN_0 = "abalone.train_0"
FILE_TRAIN_1 = "abalone.train_1"
FILE_TRAIN_2 = "abalone.train_2"
FILE_TRAIN_3 = "abalone.train_3"
FILE_VALIDATION = "abalone.validation"
FILE_TEST = "abalone.test"
PERCENT_TRAIN_0 = 24
PERCENT_TRAIN_1 = 24
PERCENT_TRAIN_2 = 24
PERCENT_TRAIN_3 = 24
PERCENT_VALIDATION = 2
PERCENT_TEST = 2
ENLARGE = 10

DATA_DIR = "data"

if not os.path.exists(DATA_DIR):
    os.mkdir(DATA_DIR)

data_split(FILE_DATA,
           DATA_DIR,
           FILE_TRAIN_0,
           FILE_TRAIN_1,
           FILE_TRAIN_2,
           FILE_TRAIN_3,           
           FILE_VALIDATION,
           FILE_TEST,
           PERCENT_TRAIN_0,
           PERCENT_TRAIN_1,
           PERCENT_TRAIN_2,
           PERCENT_TRAIN_3,           
           PERCENT_VALIDATION,
           PERCENT_TEST,
           ENLARGE
          )

# Upload the files to the S3 bucket
upload_to_s3(bucket, "train/train_0.libsvm", DATA_DIR + "/" + FILE_TRAIN_0)
upload_to_s3(bucket, "train/train_1.libsvm", DATA_DIR + "/" + FILE_TRAIN_1)
upload_to_s3(bucket, "train/train_2.libsvm", DATA_DIR + "/" + FILE_TRAIN_2)
upload_to_s3(bucket, "train/train_3.libsvm", DATA_DIR + "/" + FILE_TRAIN_3)
upload_to_s3(bucket, "validation/validation.libsvm", DATA_DIR + "/" + FILE_VALIDATION)
upload_to_s3(bucket, "test/test.libsvm", DATA_DIR + "/" + FILE_TEST)

print("DONE!")
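
The analysis summary says the four training files were copy-duplicated to reach 32 files, but the cell that did this is not shown. Below is a minimal sketch of one way to do it, assuming the extra files are plain copies of the first four, cycled in order. The function name `duplicate_train_files` and the cycling order are assumptions, not the author's code.

```python
import os
import shutil


def duplicate_train_files(data_dir, n_source=4, n_total=32, stem="abalone.train_"):
    """Copy train_0..train_{n_source-1} cyclically until n_total files exist."""
    created = []
    for i in range(n_source, n_total):
        # File i is a copy of source file (i modulo n_source)
        src = os.path.join(data_dir, stem + str(i % n_source))
        dst = os.path.join(data_dir, stem + str(i))
        shutil.copyfile(src, dst)
        created.append(dst)
    return created
```

Calling `duplicate_train_files("data")` after `data_split()` would leave `abalone.train_4` through `abalone.train_31` next to the original four files, ready to be passed to `upload_to_s3`.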

In [14]:
DATA_DIR = "data"

FILE_TRAIN_4 = "abalone.train_4"
FILE_TRAIN_5 = "abalone.train_5"
FILE_TRAIN_6 = "abalone.train_6"
FILE_TRAIN_7 = "abalone.train_7"
FILE_TRAIN_8 = "abalone.train_8"
FILE_TRAIN_9 = "abalone.train_9"
FILE_TRAIN_10 = "abalone.train_10"
FILE_TRAIN_11 = "abalone.train_11"
FILE_TRAIN_12 = "abalone.train_12"
FILE_TRAIN_13 = "abalone.train_13"
FILE_TRAIN_14 = "abalone.train_14"
FILE_TRAIN_15 = "abalone.train_15"

upload_to_s3(bucket, "train/train_4.libsvm", DATA_DIR + "/" + FILE_TRAIN_4)
upload_to_s3(bucket, "train/train_5.libsvm", DATA_DIR + "/" + FILE_TRAIN_5)
upload_to_s3(bucket, "train/train_6.libsvm", DATA_DIR + "/" + FILE_TRAIN_6)
upload_to_s3(bucket, "train/train_7.libsvm", DATA_DIR + "/" + FILE_TRAIN_7)
upload_to_s3(bucket, "train/train_8.libsvm", DATA_DIR + "/" + FILE_TRAIN_8)
upload_to_s3(bucket, "train/train_9.libsvm", DATA_DIR + "/" + FILE_TRAIN_9)
upload_to_s3(bucket, "train/train_10.libsvm", DATA_DIR + "/" + FILE_TRAIN_10)
upload_to_s3(bucket, "train/train_11.libsvm", DATA_DIR + "/" + FILE_TRAIN_11)
upload_to_s3(bucket, "train/train_12.libsvm", DATA_DIR + "/" + FILE_TRAIN_12)
upload_to_s3(bucket, "train/train_13.libsvm", DATA_DIR + "/" + FILE_TRAIN_13)
upload_to_s3(bucket, "train/train_14.libsvm", DATA_DIR + "/" + FILE_TRAIN_14)
upload_to_s3(bucket, "train/train_15.libsvm", DATA_DIR + "/" + FILE_TRAIN_15)

Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_4.libsvm/data/abalone.train_4
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_5.libsvm/data/abalone.train_5
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_6.libsvm/data/abalone.train_6
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_7.libsvm/data/abalone.train_7
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_8.libsvm/data/abalone.train_8
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_9.libsvm/data/abalone.train_9
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_10.libsvm/data/abalone.train_10
Writing to s3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/train_11.libsvm/data/abalo


## Create an XGBoost training script

SageMaker can now run an XGBoost script using the XGBoost estimator. When run on SageMaker, a number of helpful environment variables are available to access properties of the training environment, such as:

- `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.
- `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

When two input channels, `train` and `validation`, are used in the call to the XGBoost estimator's `fit()` method, the following environment variables are set, following the format `SM_CHANNEL_[channel_name]`:

- `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel.
- `SM_CHANNEL_VALIDATION`: Same as above, but for the 'validation' channel.
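
As a rough illustration of the `SM_CHANNEL_[channel_name]` convention, the sketch below collects every channel directory from an environment mapping. `channel_dirs` is a hypothetical helper, and the sample environment is hand-written for the example rather than captured from a real training job.

```python
import os


def channel_dirs(environ=os.environ):
    """Map lower-cased channel names to the directories in SM_CHANNEL_* variables."""
    prefix = "SM_CHANNEL_"
    return {
        name[len(prefix):].lower(): path
        for name, path in environ.items()
        if name.startswith(prefix)
    }


# Hand-written sample environment (illustrative values only)
fake_env = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_VALIDATION": "/opt/ml/input/data/validation",
    "SM_MODEL_DIR": "/opt/ml/model",
}
print(channel_dirs(fake_env))
```

Inside a training script the function can be called without arguments, in which case it reads the real process environment.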

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to the `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script that we run in this notebook is provided as the accompanying file (`abalone.py`) and also shown below:

```python

import argparse
import json
import logging
import os
import pandas as pd
import pickle as pkl

from sagemaker_containers import entry_point
from sagemaker_xgboost_container.data_utils import get_dmatrix
from sagemaker_xgboost_container import distributed

import xgboost as xgb


def _xgb_train(params, dtrain, evals, num_boost_round, model_dir, is_master):
    """Run xgb train on arguments given with rabit initialized.

    This is our rabit execution function.

    :param args_dict: Argument dictionary used to run xgb.train().
    :param is_master: True if current node is master host in distributed training,
                        or is running single node training job.
                        Note that rabit_run includes this argument.
    """
    booster = xgb.train(params=params,
                        dtrain=dtrain,
                        evals=evals,
                        num_boost_round=num_boost_round)

    if is_master:
        model_location = model_dir + '/xgboost-model'
        pkl.dump(booster, open(model_location, 'wb'))
        logging.info("Stored trained model at {}".format(model_location))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here.
    parser.add_argument('--max_depth', type=int,)
    parser.add_argument('--eta', type=float)
    parser.add_argument('--gamma', type=int)
    parser.add_argument('--min_child_weight', type=int)
    parser.add_argument('--subsample', type=float)
    parser.add_argument('--verbosity', type=int)
    parser.add_argument('--objective', type=str)
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--tree_method', type=str, default="auto")
    parser.add_argument('--predictor', type=str, default="auto")

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--sm_hosts', type=str, default=os.environ.get('SM_HOSTS'))
    parser.add_argument('--sm_current_host', type=str, default=os.environ.get('SM_CURRENT_HOST'))

    args, _ = parser.parse_known_args()

    # Get SageMaker host information from runtime environment variables
    sm_hosts = json.loads(args.sm_hosts)
    sm_current_host = args.sm_current_host

    dtrain = get_dmatrix(args.train, 'libsvm')
    dval = get_dmatrix(args.validation, 'libsvm')
    watchlist = [(dtrain, 'train'), (dval, 'validation')] if dval is not None else [(dtrain, 'train')]

    train_hp = {'max_depth': args.max_depth,
                'eta': args.eta,
                'gamma': args.gamma,
                'min_child_weight': args.min_child_weight,
                'subsample': args.subsample,
                'verbosity': args.verbosity,
                'objective': args.objective,
                'tree_method': args.tree_method,
                'predictor': args.predictor,
               }

    xgb_train_args = dict(params=train_hp,
                          dtrain=dtrain,
                          evals=watchlist,
                          num_boost_round=args.num_round,
                          model_dir=args.model_dir
                         )

    if len(sm_hosts) > 1:
        # Wait until all hosts are able to find each other
        entry_point._wait_hostname_resolution()

        # Execute training function after initializing rabit.
        distributed.rabit_run(exec_fun=_xgb_train,
                              args=xgb_train_args,
                              include_in_training=(dtrain is not None),
                              hosts=sm_hosts,
                              current_host=sm_current_host,
                              update_rabit_args=True
                             )
    else:
        # If single node training, call training method directly.
        if dtrain:
            xgb_train_args['is_master'] = True
            _xgb_train(**xgb_train_args)
        else:
            raise ValueError("Training channel must have data to train model.")


def model_fn(model_dir):
    """Deserialize and return fitted model.

    Note that this should have the same name as the serialized model in the _xgb_train method
    """
    model_file = 'xgboost-model'
    booster = pkl.load(open(os.path.join(model_dir, model_file), 'rb'))
    return booster
```

Because the container imports your training script, always put your training code in a main guard (`if __name__ == '__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.

For more information about training environment variables, please visit the [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit).
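
The analysis summary refers to a TRAIN TIME INSIDE.PY measurement, which the script listing above does not include. The sketch below shows one way such a timer could be wrapped around a training call. `timed_call` is a hypothetical helper and `fake_train` is a stand-in for `xgb.train(...)`, so the snippet runs without XGBoost installed. It is not the author's actual instrumentation.

```python
import time


def timed_call(label, fn, *args, **kwargs):
    """Run fn, print the elapsed wall-clock seconds under `label`, return (result, elapsed)."""
    start = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - start
    print(label, elapsed)
    return result, elapsed


def fake_train(num_boost_round):
    # Stand-in for xgb.train(...); it only burns a little CPU time
    return sum(i * i for i in range(num_boost_round * 1000))


booster, seconds = timed_call("TRAIN TIME INSIDE.PY", fake_train, num_boost_round=50)
```

In `abalone.py` the same wrapper could surround only the `xgb.train(...)` call inside `_xgb_train`, so that data loading and model saving stay outside the measured interval.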

## Train the XGBoost model

After setting training parameters, we kick off training, and poll for status until training is complete.

To run our training script on SageMaker, we construct a `sagemaker.xgboost.estimator.XGBoost` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script that SageMaker runs for training and prediction.
* __role__: Role ARN
* __instance_type__ *(optional)*: The type of SageMaker instances for training.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary passed to the train function as hyperparameters.

In [46]:
hyperparams = {
    # "tree_method": "gpu_hist",
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}

instance_type = "ml.m5.2xlarge"
output_path = "s3://{}/{}/{}/output".format(bucket, prefix, "abalone-dist-xgb")
content_type = "libsvm"

In [59]:
#################################################
########### Use XGBoost as a framework ##########
#################################################

### Open Source distributed script mode ###

from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost

session = Session()
script_path = "abalone.py"

instance_type = "ml.c5.4xlarge"  # alternatives: "ml.m5.2xlarge", "ml.p3.2xlarge", "ml.p3.16xlarge"

# ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.m5.xlarge, ml.m4.10xlarge,
# ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.m5.2xlarge, ml.m4.4xlarge

xgb_script_mode_estimator = XGBoost(entry_point=script_path,
                                    framework_version="1.7-1",  # Note: framework_version is mandatory
                                    hyperparameters=hyperparams,
                                    role=role,
                                    instance_count=1,
                                    instance_type=instance_type,
                                    output_path=output_path,
                                   )

train_input = TrainingInput(
    "s3://{}/{}/{}/".format(bucket, prefix, "train"),
    content_type=content_type,
    # "File", "FastFile" or "Pipe" ("Pipe" possibly works only for Parquet and RecordIO-protobuf input formats)
    input_mode="Pipe",
    # "FullyReplicated" or "ShardedByS3Key"
    distribution="ShardedByS3Key",
)
validation_input = TrainingInput(
    "s3://{}/{}/{}/".format(bucket, prefix, "validation"),
    content_type=content_type,
)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py3.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: ml.c5.4xlarge.


In [60]:
"s3://{}/{}/{}/".format(bucket, prefix, "train")

's3://sagemaker-us-east-1-578864530451/sagemaker/DEMO-xgboost-dist-script/train/'

### Train an XGBoost Estimator on the Abalone data 


Training is as simple as calling `fit()` on the Estimator. This starts a SageMaker training job that downloads the data, invokes the entry point code (in the provided script file), and saves any model artifacts that the script creates.

In [61]:
import time

In [62]:
start_time = time.time()
xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})
print("TRAIN TIME OUTSIDE", time.time() - start_time)

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-05-11-17-42-55-612


Using provided s3_resource
2023-05-11 17:42:55 Starting - Starting the training job...
2023-05-11 17:43:13 Starting - Preparing the instances for training......
2023-05-11 17:44:12 Downloading - Downloading input data...
2023-05-11 17:44:42 Training - Training image download completed. Training in progress...
[2023-05-11 17:45:01.077 ip-10-2-96-228.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-05-11 17:45:01.137 ip-10-2-96-228.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-05-11:17:45:01:INFO] Imported framework sagemaker_xgboost_container.training
[2023-05-11:17:45:01:INFO] No GPUs detected (normal if no gpus installed)
[2023-05-11:17:45:01:INFO] Invoking user training script.
[2023-05-11:17:45:01:INFO] Module abalone does not provide a setup.py.
Generating setup.py
[2023-05-11:17:45:01:INFO] Generating setup.cfg
[2023-05-11:17:45:01:INFO] Gen

UnexpectedStatusException: Error for Training job sagemaker-xgboost-2023-05-11-17-42-55-612: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python3 -m abalone --eta 0.2 --gamma 4 --max_depth 5 --min_child_weight 6 --num_round 50 --objective reg:squarederror --subsample 0.7 --verbosity 2", exit code: 1

In [None]:
Recorded timings for the XGBoost-as-a-framework runs, in seconds:

| File size | instance_count | instance_type | S3DistributionType | TrainingInputMode | Files | TRAIN TIME OUTSIDE | TRAIN TIME INSIDE.PY | Host |
|---|---|---|---|---|---|---|---|---|
| File_Size^1 | 2 | ml.m5.2xlarge | FullyReplicated | File | 4 | 195.38521814346313 | 6.88726019859314 | |
| File_Size^1 | 1 | ml.m5.2xlarge | FullyReplicated | File | 4 | 253.56789135932922 | 0.1523737907409668 | |
| File_Size^1 | 2 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 226.4941885471344 | 3.7223174571990967 | |
| File_Size^1 | 2 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 317.8059356212616 | 3.8782737255096436 | |
| File_Size^1 | 1 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 254.0368688106537 | 0.22278976440429688 | |
| File_Size^9 | 2 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 195.24812078475952 | 9.903223037719727 | |
| File_Size^9 | 1 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 224.5310173034668 | 49.18486475944519 | |
| File_Size^9 | 4 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 199.1698350906372 | 10.55151915550232 | |
| File_Size^10 | 4 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 201.93874502182007 | 14.228670835494995 | algo-4 |
| File_Size^10 | 2 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 195.4813358783722 | 16.20287251472473 | algo-1 |
| File_Size^10 | 1 | ml.m5.2xlarge | ShardedByS3Key | File | 4 | 288.803368806839 | 116.7438588142395 | algo-1 |
| File_Size^10 | 1 | ml.m5.xlarge | ShardedByS3Key | File | 4 | 318.66870307922363 | 163.08214235305786 | algo-1 |
| File_Size^10 | 1 | ml.m5.4xlarge | ShardedByS3Key | File | 4 | 256.3569004535675 | 78.06328439712524 | algo-1 |
| File_Size^10 | 1 | ml.c5.4xlarge | ShardedByS3Key | File | 4 | 255.56110048294067 | 68.70053768157959 | algo-1 |
| File_Size^10 | 1 | ml.c4.8xlarge | ShardedByS3Key | File | 4 | 321.32916283607483 | 113.5972547531128 | algo-1 |
| File_Size^10 | 1 | ml.c5.9xlarge | ShardedByS3Key | File | 4 | 259.23794627189636 | 76.00906419754028 | algo-1 |
| File_Size^10 | 2 | ml.c5.4xlarge | ShardedByS3Key | File | 4 | 196.1646420955658 | 16.606664419174194 | algo-2 |
| File_Size^10 | 4 | ml.c5.4xlarge | ShardedByS3Key | File | 4 | 198.0961720943451 | 16.196258783340454 | algo-3 |
| File_Size^10 | 1 | ml.p3.2xlarge | ShardedByS3Key | File | 4 | 283.93668007850647 | 1.416290044784546 | algo-1 |
| File_Size^10 | 1 | ml.p3.2xlarge | ShardedByS3Key | File | 16 | 224.08697986602783 | 2.3939082622528076 | algo-1 |
| File_Size^10 | 1 | ml.p3.2xlarge | ShardedByS3Key | File | 32 | 256.29279017448425 | 4.323373556137085 | algo-1 |
| File_Size^10 | 1 | ml.p3.16xlarge | ShardedByS3Key | File | 32 | 256.29279017448425 | 4.410764455795288 | algo-1 |
| File_Size^10 | 1 | ml.c5.4xlarge | ShardedByS3Key | File | 32 | xxx.xxxxxxxxxxxxx | 166.97041201591492 | algo-1 |
| File_Size^10 | 1 | ml.c5.4xlarge | ShardedByS3Key | FastFile | 32 | 385.2515449523926 | 172.4998059272766 | algo-1 |
| File_Size^10 | 1 | ml.c5.4xlarge | ShardedByS3Key | Pipe | 32 | FAILED | FAILED | |

The "Pipe" run failed. Possibly "Pipe" works only with Parquet and Recordio-protobuf input formats.
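
To compare runs more directly, the recorded TRAIN TIME INSIDE.PY values can be turned into speed-up and per-instance efficiency figures. The sketch below does this for the File_Size^10 runs on ml.m5.2xlarge with ShardedByS3Key and File mode. The helper functions are hypothetical, and the single-instance run is assumed to be the baseline.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline."""
    return baseline_seconds / seconds


def efficiency(baseline_seconds, seconds, instance_count):
    """Speed-up divided by instance count; 1.0 would mean linear scaling."""
    return speedup(baseline_seconds, seconds) / instance_count


# TRAIN TIME INSIDE.PY by instance_count (File_Size^10, ml.m5.2xlarge, ShardedByS3Key, File)
inside_seconds = {
    1: 116.7438588142395,
    2: 16.20287251472473,
    4: 14.228670835494995,
}

baseline = inside_seconds[1]
for count, seconds in sorted(inside_seconds.items()):
    print(
        count,
        round(speedup(baseline, seconds), 2),
        round(efficiency(baseline, seconds, count), 2),
    )
```

With these numbers the two-instance run comes out roughly seven times faster than the single-instance run, and the step from 2 to 4 instances adds comparatively little, which matches the flattening described in the findings.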



In [9]:
#################################################
###### Use XGBoost as a built-in algorithm ######
#################################################

import sagemaker, time
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparams = {
    # "tree_method": "gpu_hist",
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
}



# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/DEMO-xgboost-dist-script"
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# Look up the URI of the SageMaker-managed XGBoost container image for this region.
# Set the version argument to the XGBoost release you want to use.
region = boto3.Session().region_name
xgboost_container = image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparams,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge',  # alternatives: 'ml.c5.2xlarge', 'ml.p3.16xlarge'
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'),
                            input_mode="File",  # or "FastFile"; "Pipe" failed here (possibly limited to Parquet and RecordIO-protobuf)
                            distribution="FullyReplicated",  # or "ShardedByS3Key"
                            content_type=content_type)

validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), 
                                 content_type=content_type)

# execute the XGBoost training job

start_time = time.time()
estimator.fit({'train': train_input, 'validation': validation_input})
print("TRAIN TIME OUTSIDE", time.time() - start_time)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-05-12-17-03-26-283


2023-05-12 17:03:26 Starting - Starting the training job...
2023-05-12 17:03:42 Starting - Preparing the instances for training......
2023-05-12 17:04:37 Downloading - Downloading input data...
2023-05-12 17:05:22 Training - Training image download completed. Training in progress....
[2023-05-12 17:05:42.305 ip-10-0-72-219.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-05-12 17:05:42.364 ip-10-0-72-219.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-05-12:17:05:42:INFO] Imported framework sagemaker_xgboost_container.training
[2023-05-12:17:05:42:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2023-05-12:17:05:42:INFO] No GPUs detected (normal if no gpus installed)
[2023-05-12:17:05:42:INFO] Running XGBoost Sagemaker in algorithm mode
[2023-05-12:17:05:42:INFO] Determined 0 GPU(s) available on the i

In [None]:
instance_count=1, instance_type = "ml.m5.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:11:42:INFO] @tracker !?!?! NOT AVAILABLE WITH SINGLE INSTANCE !?!?! secs between node start and job finish
Billable seconds: 316/1
TRAIN TIME OUTSIDE 415.39980149269104

instance_count=2, instance_type = "ml.m5.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:11:42:INFO] @tracker 107.9842574596405 secs between node start and job finish
Billable seconds: 406/2
TRAIN TIME OUTSIDE 355.3289113044739

instance_count=1, instance_type = "ml.m5.2xlarge", "S3DistributionType":"FullyReplicated", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:11:42:INFO] @tracker !?!?! NOT AVAILABLE WITH SINGLE INSTANCE !?!?! secs between node start and job finish
Billable seconds: 308
TRAIN TIME OUTSIDE 419.74195885658264

instance_count=2, instance_type = "ml.m5.2xlarge", "S3DistributionType":"FullyReplicated", "TrainingInputMode":"File", 32 Files
[2023-05-12:16:57:47:INFO] @tracker 216.17183685302734 secs between node start and job finish
Billable seconds: 616
TRAIN TIME OUTSIDE 399.7200503349304

instance_count=2, instance_type = "ml.m5.xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:27:31:INFO] @tracker 134.74564719200134 secs between node start and job finish
Billable seconds: 456/2
TRAIN TIME OUTSIDE 327.17588686943054

instance_count=2, instance_type = "ml.m5.4xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:34:05:INFO] @tracker 92.06404709815979 secs between node start and job finish
Billable seconds: 356/2
TRAIN TIME OUTSIDE 292.1142199039459

instance_count=4, instance_type = "ml.m5.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:40:18:INFO] @tracker 54.68413758277893 secs between node start and job finish
Billable seconds: 572/4
TRAIN TIME OUTSIDE 271.30717182159424

instance_count=4, instance_type = "ml.c5.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:48:53:INFO] @tracker 49.115756034851074 secs between node start and job finish
Billable seconds: 592/4
TRAIN TIME OUTSIDE 268.46313285827637

instance_count=1, instance_type = "ml.p3.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:11:42:INFO] @tracker !?!?! NOT AVAILABLE WITH SINGLE INSTANCE !?!?! secs between node start and job finish
Billable seconds: 123/1
TRAIN TIME OUTSIDE 253.56209897994995

instance_count=1, instance_type = "ml.p3.16xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"File", 32 Files
[2023-05-11:20:11:42:INFO] @tracker !?!?! NOT AVAILABLE WITH SINGLE INSTANCE !?!?! secs between node start and job finish
Billable seconds: 103/1
TRAIN TIME OUTSIDE 345.8657298088074

instance_count=1, instance_type = "ml.p3.2xlarge", "S3DistributionType":"ShardedByS3Key", "TrainingInputMode":"Pipe", 32 Files
--FAILED-- probably Pipe mode does not work with the data format
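
The `@tracker` times recorded above for the multi-instance `ml.m5` runs can be compared directly. The following is a short sketch of that arithmetic. It uses only values copied from the notes above; the `ratio` helper and the dictionary layout are illustrative choices.

```python
# @tracker seconds copied from the notes above (File mode, 32 files).
# Key: (instance_type, instance_count, distribution)
tracker_secs = {
    ("ml.m5.xlarge", 2, "ShardedByS3Key"): 134.74564719200134,
    ("ml.m5.2xlarge", 2, "ShardedByS3Key"): 107.9842574596405,
    ("ml.m5.4xlarge", 2, "ShardedByS3Key"): 92.06404709815979,
    ("ml.m5.2xlarge", 4, "ShardedByS3Key"): 54.68413758277893,
    ("ml.m5.2xlarge", 2, "FullyReplicated"): 216.17183685302734,
}


def ratio(slower, faster):
    """Return how many times longer the `slower` run took than the `faster` run."""
    return tracker_secs[slower] / tracker_secs[faster]


baseline = ("ml.m5.2xlarge", 2, "ShardedByS3Key")

# Doubling the instance count (2 -> 4) at the same instance size.
count_speedup = ratio(baseline, ("ml.m5.2xlarge", 4, "ShardedByS3Key"))

# Doubling the instance size (2xlarge -> 4xlarge) at the same instance count.
size_speedup = ratio(baseline, ("ml.m5.4xlarge", 2, "ShardedByS3Key"))

# Cost of replicating all files to every node instead of sharding them.
replication_penalty = ratio(("ml.m5.2xlarge", 2, "FullyReplicated"), baseline)

print("2 -> 4 instances: {:.2f}x faster".format(count_speedup))
print("2xlarge -> 4xlarge: {:.2f}x faster".format(size_speedup))
print("FullyReplicated vs ShardedByS3Key: {:.2f}x slower".format(replication_penalty))
```

On these figures, doubling the instance count gives about a 1.97x speedup, doubling the instance size gives about 1.17x, and `FullyReplicated` takes about 2.00x as long as `ShardedByS3Key` on two instances.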

## Deploy the XGBoost model

After training, we use the estimator to create an Amazon SageMaker endpoint – a hosted and managed prediction service to perform inference.

You can also optionally specify other functions to customize the deserialization of the input request (`input_fn()`), serialization of the predictions (`output_fn()`), and how predictions are made (`predict_fn()`). The defaults work for our current use-case so we don’t need to define them.
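
As a hedged illustration of what overriding these hooks could look like, the sketch below accepts CSV rows and returns JSON. The function names and argument order follow the SageMaker inference handler convention. The CSV/JSON formats and the direct `model.predict` call on plain lists are assumptions made for illustration: a real XGBoost `Booster` would need the rows wrapped in an `xgboost.DMatrix`, and the container may expect a different return type from `output_fn()`.

```python
import json


def input_fn(request_body, request_content_type):
    """Deserialize a CSV request body into a list of float rows."""
    if request_content_type != "text/csv":
        raise ValueError("Unsupported content type: {}".format(request_content_type))
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    return [
        [float(value) for value in line.split(",")]
        for line in request_body.strip().splitlines()
        if line.strip()
    ]


def predict_fn(input_data, model):
    """Run the model on the parsed rows (assumes model.predict accepts lists)."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Serialize the predictions as a JSON document."""
    if accept != "application/json":
        raise ValueError("Unsupported accept type: {}".format(accept))
    return json.dumps({"predictions": [float(p) for p in prediction]})
```

Defined in the training script, functions like these would replace the default handlers; this notebook keeps the defaults.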

In [None]:
predictor = xgb_script_mode_estimator.deploy(
    initial_instance_count=1, instance_type="ml.m5.2xlarge"
)

In [None]:
test_file = DATA_DIR + "/" + FILE_TEST
with open(test_file, "r") as f:
    payload = f.read()

In [None]:
runtime_client = session.sagemaker_runtime_client
response = runtime_client.invoke_endpoint(
    EndpointName=predictor.endpoint_name, ContentType="text/libsvm", Body=payload
)
result = response["Body"].read().decode("ascii")
print("Predicted values are {}.".format(result))

## Cleanup

When you're done with this exercise, please run the cell below to delete the hosted endpoint and avoid any additional charges.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()