# How to use boto3 to train and run inference with script-mode XGBoost

This notebook shows how to train an XGBoost model in script mode and then run inference with batch transform, using boto3 instead of the SageMaker SDK. The typical use case is AWS Lambda. A Lambda function is limited to 250 MB uncompressed, which is not enough to fit the SageMaker SDK alongside other ML libraries such as pandas + sklearn. In this case, the Lambda function can use a pandas+sklearn layer (which is already almost 250 MB uncompressed) and interact with SageMaker via boto3.

The high-level steps are as follows:

- create an entrypoint script, following the same conventions and requirements as an sklearn or mxnet entrypoint script.
- package the entrypoint script (+dependencies) and upload it to S3.
- in the create training job request, add a few **hyperparameters** to specify the entrypoint script name & package.
- in the create model request, add a few **environment variables** to specify the entrypoint script name & package. The environment variables ensure that an endpoint or a batch transform job can locate & invoke the entrypoint.
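
As a rough illustration of the first step, a script-mode entrypoint usually parses its hyperparameters from the command line and reads data and model locations from environment variables set by the container. The sketch below assumes the `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, and `SM_CHANNEL_VALIDATION` variables used by SageMaker framework containers. The hyperparameter selection and defaults are illustrative only, `parse_args` is a hypothetical helper, and the actual training logic is omitted.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and container-provided paths (illustrative names)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments (hypothetical selection).
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--learning_rate', type=float, default=0.3)

    # Locations default to environment variables assumed to be set by the container.
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', default=os.environ.get('SM_CHANNEL_VALIDATION'))

    # Ignore arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Using `parse_known_args` keeps the script tolerant of extra arguments, which is convenient while the set of hyperparameters is still changing.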

# Package script to S3

To recap, with the SageMaker SDK, each fit() call makes the SDK package the entrypoint script (+dependencies) and upload it from local disk to S3. With boto3, we need to package and upload our script (+dependencies) to S3 ourselves. This is acceptable because Lambda is typically used in a production or operational pipeline, where there is a clear need for a stable, controlled entrypoint package deployed beforehand.

In [None]:
# Package entrypoint script (+optional dependencies)
!GZIP=--best tar -czvf sourcedir.tar.gz sm_entry_point.py

# Upload entrypoint package to S3
%env ENTRYPOINT_SRC=s3://bucket/path/to/src/
!aws s3 cp sourcedir.tar.gz $ENTRYPOINT_SRC

# Synchronize env var to Python variable for later usage
import os
entrypoint_src = os.environ['ENTRYPOINT_SRC']
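
The shell commands above suit a notebook, but the `tar` and `aws` command-line tools may not be available in other environments, so the same packaging can also be done in Python. The following is a minimal sketch using the standard `tarfile` module. `package_entrypoint` and `split_s3_uri` are hypothetical helper names, and the upload step (shown only as a comment) assumes boto3's `upload_file`.

```python
import os
import tarfile


def package_entrypoint(paths, out_path='sourcedir.tar.gz'):
    """Bundle the entrypoint script (+dependencies) into a gzipped tarball."""
    with tarfile.open(out_path, 'w:gz') as tar:
        for path in paths:
            # Store each file under its base name, at the archive root.
            tar.add(path, arcname=os.path.basename(path))
    return out_path


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith('s3://'):
        raise ValueError(f'Not an S3 URI: {uri}')
    bucket, _, key = uri[len('s3://'):].partition('/')
    return bucket, key


# Usage (not executed here; requires AWS credentials and the script on disk):
#   import boto3
#   archive = package_entrypoint(['sm_entry_point.py'])
#   bucket, prefix = split_s3_uri('s3://bucket/path/to/src/')
#   boto3.client('s3').upload_file(archive, bucket, prefix + archive)
```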

# Train

Create a training request with specific hyperparameters.

In [None]:
import boto3

sess = boto3.Session()
sm_client = sess.client('sagemaker')

# Must use xgb container version 0.90
image = '783357654285.dkr.ecr.ap-southeast-2.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3'

train_job_name = 'my-training-job-12572'   # Job names allow only alphanumerics and hyphens
sm_role = "CHANGE_ME: SM_EXECUTION_ROLE"
train_channel = 's3://bucket/path/to/data/train/'
valid_channel = 's3://bucket/path/to/data/valid/'
train_output = 's3://bucket/path/to/sagemaker_train_output/'   # NOTE: SageMaker will output to $train_output/$sm_jobname/output/model.tar.gz.

entrypoint_name = 'sm_entry_point.py'
entrypoint_src = 's3://bucket/path/to/src/sourcedir.tar.gz...'
hyperparameters = {
    'sagemaker_program': entrypoint_name,             # Mandatory
    'sagemaker_submit_directory': entrypoint_src,     # Mandatory
    'sagemaker_container_log_level': '20',            # Optional
    'sagemaker_enable_cloudwatch_metrics': 'false',   # Optional

    # Additional hyperparameter supported by entrypoint
    # Additional hyperparameter supported by entrypoint
    # Additional hyperparameter supported by entrypoint
    # ...
}

train_param = {
    "TrainingJobName": train_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File"
    },
    "RoleArn": sm_role,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    },
    "ResourceConfig": {
        "InstanceCount": 1,              # Must be an integer, not a string
        "InstanceType": "ml.m5.large",
        "VolumeSizeInGB": 30
    },
    "HyperParameters": hyperparameters,
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_channel
                }
            },
            "ContentType": "text/csv",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": valid_channel
                }
            },
            "ContentType": "text/csv",
            "CompressionType": "None",
            "RecordWrapperType": "None",
            "InputMode": "File"
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": train_output
    }
}

# boto3 accepts keyword arguments only, so unpack the request dict.
sm_client.create_training_job(**train_param)
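
`create_training_job` returns as soon as the job is submitted, so a pipeline usually needs to wait for completion before creating the model. boto3 offers a `training_job_completed_or_stopped` waiter for this. The sketch below instead polls `describe_training_job` directly so that the logic is explicit. `wait_for_training_job` is a hypothetical helper, and the polling interval is an arbitrary choice.

```python
import time

TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll until the training job reaches a terminal status, then return its description."""
    while True:
        desc = client.describe_training_job(TrainingJobName=job_name)
        status = desc['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)

    if status != 'Completed':
        # FailureReason may be absent, e.g. when the job was stopped manually.
        reason = desc.get('FailureReason', 'no reason given')
        raise RuntimeError(f'Training job {job_name} ended as {status}: {reason}')
    return desc


# Usage with the client and job name defined earlier:
#   train_job = wait_for_training_job(sm_client, train_job_name)
```

Passing the client and the `sleep` function as arguments keeps the helper easy to exercise with a stub, without touching AWS.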

# Create Model

We specify the entrypoint settings of a model by setting a few environment variables.

The following example inspects the training job to deduce which environment variables the model must use. It also pulls a few settings (e.g., the image URL) from the training job, essentially treating the training job's metadata as the single, centralized source of configuration.

In [None]:
from botocore.exceptions import ClientError

model_name = 'my-model'   # Model names allow only alphanumerics and hyphens

# Fetch the training job's metadata; it drives the model settings below.
train_job = sm_client.describe_training_job(TrainingJobName=train_job_name)

# Attempt to delete an existing model that has the same name as the request.
try:
    response = sm_client.delete_model(ModelName=model_name)
except ClientError as e:
    if 'Could not find model' in e.response['Error']['Message']:
        pass        # No previous model, so we're okay to proceed further
    else:
        raise       # Unexpected error, so escalate it.
else:
    print(f'Deleted pre-existing model {model_name}')

# The environment variables to indicate entry point.
# Will automatically synchronize from the training job.
env = {}
if 'sagemaker_program' in train_job['HyperParameters']:
    env['SAGEMAKER_PROGRAM'] = train_job['HyperParameters']['sagemaker_program']
    env['SAGEMAKER_SUBMIT_DIRECTORY'] = train_job['HyperParameters']['sagemaker_submit_directory']
    env['SAGEMAKER_CONTAINER_LOG_LEVEL'] = '20'
    env['SAGEMAKER_ENABLE_CLOUDWATCH_METRICS'] = 'False'

# Create an SM model.
# Will automatically synchronize settings from the training job.
response = sm_client.create_model(
    ModelName=model_name,
    Containers=[
        {
            'Image': train_job['AlgorithmSpecification']['TrainingImage'],
            'ModelDataUrl': train_job['ModelArtifacts']['S3ModelArtifacts'],
            'Environment': env,
        },
    ],
    ExecutionRoleArn=sm_role,      # NOTE: can use a different role than training
    EnableNetworkIsolation=False,
)

# Inference

Nothing special on the inference side. Simply create an endpoint or a batch transform job using the model, and SageMaker will correctly use the entry point specified by the model's environment variables.
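
For example, a batch transform job can be requested with `create_transform_job`. The sketch below only builds the request. The job name, model name, S3 locations, instance type, and the `text/csv` content type with line splitting are placeholder assumptions to adapt to your data, and `build_transform_request` is a hypothetical helper.

```python
def build_transform_request(job_name, model_name, input_s3, output_s3,
                            instance_type='ml.m5.large'):
    """Build keyword arguments for SageMaker's create_transform_job."""
    return {
        'TransformJobName': job_name,
        'ModelName': model_name,
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': input_s3},
            },
            'ContentType': 'text/csv',   # Assumes CSV input
            'SplitType': 'Line',         # One record per line
        },
        'TransformOutput': {'S3OutputPath': output_s3},
        'TransformResources': {'InstanceType': instance_type, 'InstanceCount': 1},
    }


# Placeholder names and paths, for illustration only.
request = build_transform_request(
    'my-transform-job',
    'my-model',
    's3://bucket/path/to/data/test/',
    's3://bucket/path/to/transform_output/',
)

# Submit with the client created earlier (requires AWS credentials):
#   sm_client.create_transform_job(**request)
```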

# Tips on XGBoost Hyperparameters

There are a few differences in hyperparameters depending on whether the XGBoost container runs in algorithm mode or script mode.

Renamed hyperparameters:

Algorithm Mode | Script Mode
--- | ---
`num_round` | `n_estimators`
`eta` | `learning_rate`


In addition, the `objective` hyperparameter also differs in the accepted values:

- algorithm mode: see the [SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html). Supported values include `reg:linear`, `reg:logistic`, `multi:softmax`. Note that `reg:squarederror`, and possibly a few others, are not supported.
- script mode: see [supported values](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).
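
When reusing an algorithm-mode configuration, the two renames in the table above can be applied mechanically. A minimal sketch follows. `to_script_mode` is a hypothetical helper that covers only those two renames and does not translate `objective` values.

```python
# Algorithm-mode name -> script-mode name, taken from the table above.
RENAMES = {'num_round': 'n_estimators', 'eta': 'learning_rate'}


def to_script_mode(hyperparameters):
    """Return a copy of the hyperparameters with algorithm-mode names renamed."""
    return {RENAMES.get(name, name): value for name, value in hyperparameters.items()}


algo_hp = {'num_round': '100', 'eta': '0.2', 'max_depth': '5'}
print(to_script_mode(algo_hp))
# {'n_estimators': '100', 'learning_rate': '0.2', 'max_depth': '5'}
```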