# Train a CatBoost Model using Script Mode

This notebook demonstrates how to train and deploy a CatBoost model in Amazon SageMaker. The method used is called Script Mode: we write a script that trains the model and pass it to an estimator from the SageMaker Python SDK. For more information, see [Using Scikit-learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html).

## Runtime
This notebook takes approximately 15 minutes to run.

## Contents
1. [Download data](#Download-data)
1. [Train model](#Train-model)
1. [Deploy and test endpoint](#Deploy-and-test-endpoint)
1. [Cleanup](#Cleanup)

In [None]:
import boto3
import sagemaker

from sagemaker import get_execution_role

role = get_execution_role()

account_id = role.split(':')[4]
region = boto3.Session().region_name
sess = sagemaker.session.Session()
bucket = sess.default_bucket()

print('Account: {}'.format(account_id))
print('Region: {}'.format(region))
print('Role: {}'.format(role))
print('S3 Bucket: {}'.format(bucket))
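The account ID above is read from the fifth colon-separated field of the role ARN. As a minimal sketch of that parsing with basic validation added, the helper below is hypothetical (it is not part of the SageMaker SDK), and the example ARN is made up for illustration.

```python
def account_id_from_role_arn(role_arn: str) -> str:
    """Return the account ID embedded in an IAM role ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {role_arn!r}")
    return parts[4]


# Hypothetical ARN, for illustration only.
example_arn = "arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole"
print(account_id_from_role_arn(example_arn))
```

IAM ARNs leave the region field empty, which is why two colons appear in a row before the account ID.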

## Download data
We use pandas to split a small dataset into a training set and a test set.

We could also write the training script so that it loads all the data and runs cross-validation itself.
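To make the cross-validation alternative concrete, here is a dependency-free sketch of how a script could generate shuffled k-fold index splits. The function name `k_fold_indices` and its defaults are assumptions for illustration; in practice `sklearn.model_selection.KFold` provides the same behaviour.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) index lists for shuffled k-fold CV."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_splits-th shuffled index, so fold sizes
    # differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        # Training indices are all folds except the validation fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_indices(n_samples=10, n_splits=5):
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so averaging the per-fold scores uses every row once for evaluation.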

In [None]:
import os

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

In [None]:
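# We use the Boston housing dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.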
data = load_boston()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [None]:
local_train = 'train.csv'
local_test = 'test.csv'

trainX.to_csv(local_train)
testX.to_csv(local_test)

In [None]:
# Upload the data to S3. SageMaker reads the training data from S3.
train_location = sess.upload_data(
    path=local_train,
    bucket=bucket,
    key_prefix='catboost')

test_location = sess.upload_data(
    path=local_test,
    bucket=bucket,
    key_prefix='catboost')

print(train_location, test_location)
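The printed locations are S3 URIs. As a rough sketch, assuming `upload_data` returns a URI of the form `s3://<bucket>/<key_prefix>/<file name>` when given a single file, the two hypothetical helpers below show how such a URI is composed and how it can be split back into bucket and key. The bucket name is made up.

```python
import os
from urllib.parse import urlparse


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the URI we assume upload_data returns for a single file."""
    name = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{name}"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, for illustration only.
uri = expected_s3_uri("example-bucket", "catboost", "train.csv")
print(uri)
print(split_s3_uri(uri))
```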

## Train model
The model is trained using an Estimator class from the SageMaker Python SDK. First, get the execution role for training. This role must be able to access the S3 bucket that holds the training and test data uploaded in the previous step.

In [None]:
# Use the current execution role for training. It needs access to S3
role = sagemaker.get_execution_role()
print(role)

Next, define the estimator. We use `SKLearn`, an Estimator class specifically designed to train scikit-learn models, and set the following parameters:
1. The script used to train the model (i.e. `entry_point`). This is the heart of the Script Mode method. Additionally, set the `script_mode` parameter to `True`.
1. The role that grants access to the S3 bucket containing the training and test data (i.e. `role`).
1. The number of instances (i.e. `instance_count`) and the type of instance (i.e. `instance_type`) to use for training.
1. The version of scikit-learn to use (i.e. `framework_version`).
1. The training hyperparameters (i.e. `hyperparameters`).

After setting these parameters, call the `fit` method to train the model.

In [None]:
# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

from sagemaker.sklearn import SKLearn

instance_type = "ml.c5.xlarge"  # set to "local" to train in local mode instead
# if instance_type == "local":
#     train_location = "file:///home/ec2-user/SageMaker/catboost_sagemaker/train.csv"
#     test_location = "file:///home/ec2-user/SageMaker/catboost_sagemaker/test.csv"

sk_estimator = SKLearn(
    entry_point="train.py",
    source_dir="./",
    role=role,
    instance_count=1,
    instance_type=instance_type,
    py_version="py3",
    framework_version="0.23-1",
    script_mode=True,
    hyperparameters={'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
                     'target': 'target'},
)

# Train the estimator
sk_estimator.fit({'train': train_location, 'test': test_location}, logs=True)
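The `train.py` entry point is not shown in this notebook. As a minimal sketch of how such a script could receive its inputs in Script Mode: hyperparameters arrive as command-line arguments, while the model directory and the `train` and `test` channel directories are exposed through the `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, and `SM_CHANNEL_TEST` environment variables. The function name `parse_args` and the local fallback defaults are assumptions, and the actual CatBoost training code is omitted.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters from the estimator are passed as command-line arguments.
    parser.add_argument("--features", type=str, required=True)
    parser.add_argument("--target", type=str, required=True)
    # Locations are set by the training container as environment variables.
    # The fallback values are assumptions for running the script locally.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "."))
    args = parser.parse_args(argv)
    # The feature names are sent as one space-separated string.
    args.features = args.features.split()
    return args


# Simulate the arguments a training job would pass (shortened feature list).
args = parse_args(["--features", "CRIM ZN INDUS", "--target", "target"])
print(args.features, args.target, args.train, args.model_dir)
```

A real script would then read the CSV files from `args.train` and `args.test`, fit the model on the listed feature columns, and save it under `args.model_dir` so that SageMaker can package it for deployment.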

## Deploy and test endpoint
After training the model, deploy it as an endpoint by calling the estimator's `deploy` method. As shown in the code below, you can set the number of instances (i.e. `initial_instance_count`) and the instance type (i.e. `instance_type`) used to host the model.

In [None]:
import time

sk_endpoint_name = "sklearn-model-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
sk_predictor = sk_estimator.deploy(
    initial_instance_count=1, instance_type="ml.m5.large", endpoint_name=sk_endpoint_name
)
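Appending a UTC timestamp keeps endpoint names unique across runs. The sketch below wraps that idea in a hypothetical helper and checks the result against an assumed naming rule (alphanumeric characters separated by hyphens, at most 63 characters); verify the exact constraint in the SageMaker documentation before relying on it.

```python
import re
import time

# Assumed constraint: alphanumerics separated by hyphens, max 63 characters.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
_MAX_LENGTH = 63


def unique_endpoint_name(prefix, now=None):
    """Return '<prefix>-<UTC timestamp>'; 'now' is seconds since the epoch."""
    timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    name = f"{prefix}-{timestamp}"
    if len(name) > _MAX_LENGTH or not _NAME_PATTERN.match(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


print(unique_endpoint_name("sklearn-model"))
```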

After the endpoint is fully deployed, it can be invoked with the [SageMaker Runtime Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) (the method used in the code cell below) or with the [scikit-learn Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-predictor). If you use the latter, make sure to use a [Serializer](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) to serialize your data properly.

In [None]:
import json

client = sess.sagemaker_runtime_client

# One row of input values, wrapped in the "Input" key the endpoint expects
request_body = {
    "Input": [[0, 0.09178, 0.0, 4.05, 0.0, 0.51, 6.416, 84.1, 2.6463, 5.0, 296.0, 16.6, 395.5, 9.04]]
}
payload = json.dumps(request_body)

response = client.invoke_endpoint(
    EndpointName=sk_endpoint_name, ContentType="application/json", Body=payload
)

result = json.loads(response["Body"].read().decode())["Output"]
print("Predicted result {}".format(result))

## Cleanup
If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources.

In [None]:
sk_predictor.delete_model()
sk_predictor.delete_endpoint()