# Integrating AWS SageMaker with YData Fabric

This tutorial shows how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest) from within YData Fabric. We use the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html), which ships with Scikit-Learn.

Some useful resources for the implementation:
* Using Scikit-learn with the SageMaker Python SDK: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SageMaker Python SDK, Scikit-learn API reference: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 SageMaker client: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

The first step is to install the SageMaker Python SDK using pip and then import the relevant libraries.

In [3]:
%%capture

pip install sagemaker

In [4]:
import datetime
import time
import tarfile
import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.tuner import IntegerParameter
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

AWS access keys are required to connect to AWS from YData Labs. You can set them as variables in the cell below, or read them from a CSV or text file so that the credentials are not stored in the notebook.

In [5]:
# You may read these as variables from a config file programmatically

ACCESS_KEY = "XXX"
SECRET_KEY = "XXX"
ROLE_ARN = "XXX"
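As a minimal sketch of reading the credentials from a file instead of hard-coding them, the helper below parses an INI file with the standard library. The file name `aws_credentials.ini`, the `[aws]` section, and the key names are assumptions made for this illustration, not a YData Fabric or AWS convention.

```python
import configparser
from pathlib import Path


def load_aws_credentials(path, section="aws"):
    """Return (access_key, secret_key, role_arn) read from an INI file.

    The section and key names are hypothetical; adapt them to your own file.
    """
    # ConfigParser.read() silently ignores missing files, so check explicitly.
    if not Path(path).is_file():
        raise FileNotFoundError(f"credentials file not found: {path}")

    # Disable interpolation so that '%' characters in a value are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    cfg = parser[section]
    return cfg["access_key"], cfg["secret_key"], cfg["role_arn"]
```

With such a file in place, the cell above could become `ACCESS_KEY, SECRET_KEY, ROLE_ARN = load_aws_credentials("aws_credentials.ini")`. Keep the file out of version control.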

In [7]:
# Create a boto3 client

sm_boto3 = boto3.client(
    "sagemaker",
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY, 
    region_name='eu-west-1'
)

In [8]:
# Create a boto3 session

boto_sess = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    region_name='eu-west-1'
)

In [9]:
# Pass boto3 session to create a sagemaker session
sess = sagemaker.Session(boto_session = boto_sess)

In [10]:
region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print("Using bucket " + bucket)

Using bucket sagemaker-eu-west-1-788946076961


## Prepare data
We load a dataset from sklearn, split it and send it to S3.

In [11]:
# we use the California housing dataset
data = fetch_california_housing()

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX["target"] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX["target"] = y_test

In [13]:
trainX.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.2143 | 37.0 | 5.288235 | 0.973529 | 860.0 | 2.529412 | 33.81 | -118.12 | 2.285 |
| 1 | 5.3468 | 42.0 | 6.364322 | 1.08794 | 957.0 | 2.404523 | 37.16 | -121.98 | 2.799 |
| 2 | 3.9191 | 36.0 | 6.110063 | 1.059748 | 711.0 | 2.235849 | 38.45 | -122.69 | 1.83 |
| 3 | 6.3703 | 32.0 | 6.0 | 0.990196 | 1159.0 | 2.272549 | 34.16 | -118.41 | 4.658 |
| 4 | 2.3684 | 17.0 | 4.795858 | 1.035503 | 706.0 | 2.088757 | 38.57 | -121.33 | 1.5 |


In [14]:
trainX.to_csv("california_housing_train.csv")
testX.to_csv("california_housing_test.csv")

In [22]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path="california_housing_train.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

testpath = sess.upload_data(
    path="california_housing_test.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)
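The values returned by `upload_data` are S3 URIs that are later passed to the training job. As a rough sketch, and assuming the SDK composes the URI for a single uploaded file as bucket, key prefix, then file name, the expected shape can be reproduced as follows. The helper name and the bucket name `my-bucket` are hypothetical.

```python
import os


def expected_upload_uri(bucket, key_prefix, local_path):
    """Build the S3 URI a single-file upload is assumed to produce.

    This mirrors the assumed layout s3://<bucket>/<key_prefix>/<file name>;
    it does not contact AWS.
    """
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# Hypothetical bucket name, used only for illustration.
uri = expected_upload_uri(
    "my-bucket", "sagemaker/sklearncontainer", "california_housing_train.csv"
)
print(uri)
```

Printing `trainpath` and `testpath` after the upload is a quick way to confirm the actual locations.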

After preparing the data, we need a script that handles both training and inference and is driven by the parameters passed to it. We use SageMaker's Script Mode for this.

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-prem, etc.).

In [15]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == "__main__":

    print("extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="california_housing_train.csv")
    parser.add_argument("--test-file", type=str, default="california_housing_test.csv")
    parser.add_argument(
        "--features", type=str
    )  # in this script we ask user to explicitly name features
    parser.add_argument(
        "--target", type=str
    )  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print("training model")
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
    )

    model.fit(X_train, y_train)

    # compute the absolute error on the test set
    print("validating model")
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print a few percentiles of the absolute error as performance metrics
    for q in [10, 50, 90]:
        print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print("model persisted at " + path)
    print(args.min_samples_leaf)

Overwriting script.py
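To illustrate how the script receives its hyperparameters, the sketch below parses a simulated command line with a subset of the same `argparse` definitions. The argument list is an assumption made for this example (including the made-up `--some-unknown-flag`); the exact arguments the SageMaker container passes are not shown in this tutorial.

```python
import argparse

# Same style of definitions as in script.py, reduced to four arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--n-estimators", type=int, default=10)
parser.add_argument("--min-samples-leaf", type=int, default=3)
parser.add_argument("--features", type=str)
parser.add_argument("--target", type=str)

# Simulated command line (hypothetical values, for illustration only).
argv = [
    "--n-estimators", "100",
    "--features", "MedInc HouseAge AveRooms",
    "--target", "target",
    "--some-unknown-flag", "1",
]

# parse_known_args() collects unrecognised arguments instead of failing.
args, unknown = parser.parse_known_args(argv)

# Dashes in option names become underscores in attribute names.
feature_list = args.features.split()
print(args.n_estimators, args.min_samples_leaf, feature_list, unknown)
```

Because `--min-samples-leaf` is not supplied here, its default of 3 is used, and the space-separated `--features` string is split into a list of column names exactly as the script does.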


## Launching a training job

We can use the Scikit-learn Estimator from the SageMaker Python SDK to launch the training job directly from the Fabric platform.

In [20]:
# We use the Estimator from the SageMaker Python SDK

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="script.py",
    sagemaker_session = sess,
    role=ROLE_ARN,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "features": "MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude",
        "target": "target",
    },
)
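The `metric_definitions` entry tells SageMaker how to pull the `median-AE` value out of the training logs. As a rough local check, the same regular expression can be applied with Python's `re` module to lines shaped like the ones `script.py` prints. Using `re` here is a stand-in for the service's own matching, and the 10th and 90th percentile values in the sample lines are made up for illustration.

```python
import re

# Regex copied from the metric definition in the estimator.
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Sample lines in the format printed by script.py (values are illustrative).
sample_lines = [
    "AE-at-10th-percentile: 0.031",
    "AE-at-50th-percentile: 0.201123",
    "AE-at-90th-percentile: 0.95",
]
values = [extract_metric(line) for line in sample_lines]
print(values)
```

Only the 50th percentile line matches, which is why the script must keep printing the metric in exactly this format.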

When the training job is launched, you'll see verbose log output from the training.

In [23]:
# launch the training job; wait=True blocks until the job finishes and streams its logs
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

2022-11-02 11:25:08 Starting - Starting the training job...
2022-11-02 11:25:32 Starting - Preparing the instances for training
ProfilerReport-1667388308: InProgress
.........
2022-11-02 11:27:00 Downloading - Downloading input data...
2022-11-02 11:27:33 Training - Training image download completed. Training in progress.
2022-11-02 11:27:32,900 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2022-11-02 11:27:32,904 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-02 11:27:32,912 sagemaker_sklearn_container.training INFO     Invoking user training script.
2022-11-02 11:27:33,292 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-02 11:27:33,304 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-11-02 11:27:33,315 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus

## Launching a tuning job

After training a model on the dataset, to optimize performance, we can tune the hyperparameters of the model by exploring values within a range.

In [24]:
# we use the Hyperparameter Tuner
# Define exploration boundaries
hyperparameter_ranges = {
    "n-estimators": IntegerParameter(20, 100),
    "min-samples-leaf": IntegerParameter(2, 6),
}

# create Optimizer
Optimizer = sagemaker.tuner.HyperparameterTuner(
    estimator=sklearn_estimator,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name="RF-tuner",
    objective_type="Minimize",
    objective_metric_name="median-AE",
    metric_definitions=[
        {"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}
    ],  # extract tracked metric from logs with regexp
    max_jobs=10,
    max_parallel_jobs=2,
)
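To get a feel for how much of the search space the tuner can cover, the arithmetic below counts the candidate configurations. It assumes that both bounds of each `IntegerParameter` range are inclusive; the helper name is hypothetical.

```python
def integer_range_size(low, high):
    """Number of integers in [low, high], assuming both bounds are inclusive."""
    return high - low + 1


n_estimators_values = integer_range_size(20, 100)    # 81 candidate values
min_samples_leaf_values = integer_range_size(2, 6)   # 5 candidate values

total_configurations = n_estimators_values * min_samples_leaf_values
max_jobs = 10  # same budget as in the tuner definition

coverage = max_jobs / total_configurations
print(total_configurations, round(coverage, 4))
```

With 405 possible combinations and a budget of 10 jobs, the tuner evaluates roughly 2.5% of the grid, so raising `max_jobs` is the main lever if the results look unstable.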

In [25]:
Optimizer.fit({"train": trainpath, "test": testpath})

......................................................................................!


In [26]:
# get tuner results in a df
results = Optimizer.analytics().dataframe()
while results.empty:
    time.sleep(1)
    results = Optimizer.analytics().dataframe()
results.head()

| | min-samples-leaf | n-estimators | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 98.0 | RF-tuner-221102-1128-010-45a8b574 | Completed | 0.201123 | 2022-11-02 11:35:06+00:00 | 2022-11-02 11:35:53+00:00 | 47.0 |
| 1 | 2.0 | 99.0 | RF-tuner-221102-1128-009-93a89dc8 | Completed | 0.201474 | 2022-11-02 11:34:54+00:00 | 2022-11-02 11:35:41+00:00 | 47.0 |
| 2 | 2.0 | 100.0 | RF-tuner-221102-1128-008-81f1fcb9 | Completed | 0.201943 | 2022-11-02 11:34:11+00:00 | 2022-11-02 11:34:53+00:00 | 42.0 |
| 3 | 2.0 | 74.0 | RF-tuner-221102-1128-007-c854ea19 | Completed | 0.205801 | 2022-11-02 11:34:00+00:00 | 2022-11-02 11:34:42+00:00 | 42.0 |
| 4 | 2.0 | 97.0 | RF-tuner-221102-1128-006-7a8199b2 | Completed | 0.205158 | 2022-11-02 11:33:19+00:00 | 2022-11-02 11:34:00+00:00 | 41.0 |
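Since the objective type is `Minimize`, the best job is the one with the smallest `FinalObjectiveValue`. The sketch below picks it from plain dictionaries holding the five rows displayed above, so it runs without AWS access; it only covers the rows shown by `head()`, not necessarily all ten jobs. On the real DataFrame, `results.sort_values("FinalObjectiveValue")` would give the same ordering.

```python
# Rows copied from the tuner output shown above (first five jobs only).
rows = [
    {"TrainingJobName": "RF-tuner-221102-1128-010-45a8b574", "n-estimators": 98, "FinalObjectiveValue": 0.201123},
    {"TrainingJobName": "RF-tuner-221102-1128-009-93a89dc8", "n-estimators": 99, "FinalObjectiveValue": 0.201474},
    {"TrainingJobName": "RF-tuner-221102-1128-008-81f1fcb9", "n-estimators": 100, "FinalObjectiveValue": 0.201943},
    {"TrainingJobName": "RF-tuner-221102-1128-007-c854ea19", "n-estimators": 74, "FinalObjectiveValue": 0.205801},
    {"TrainingJobName": "RF-tuner-221102-1128-006-7a8199b2", "n-estimators": 97, "FinalObjectiveValue": 0.205158},
]

# Minimize objective: the lowest median absolute error wins.
best = min(rows, key=lambda row: row["FinalObjectiveValue"])
print(best["TrainingJobName"], best["FinalObjectiveValue"])
```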


## Deploy to a real-time endpoint

An `Estimator` can be deployed directly after training with `Estimator.deploy()`. Here we instead showcase the longer process of creating a model from S3 artifacts, which can also be used to deploy a model that was trained in a different session or even outside SageMaker.

In [27]:
sklearn_estimator.latest_training_job.wait(logs="None")
artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)["ModelArtifacts"]["S3ModelArtifacts"]

print("Model artifact persisted at " + artifact)


2022-11-02 11:29:13 Starting - Preparing the instances for training
2022-11-02 11:29:13 Downloading - Downloading input data
2022-11-02 11:29:13 Training - Training image download completed. Training in progress.
2022-11-02 11:29:13 Uploading - Uploading generated training model
2022-11-02 11:29:13 Completed - Training job completed
Model artifact persisted at s3://sagemaker-eu-west-1-788946076961/rf-scikit-2022-11-02-11-25-08-109/output/model.tar.gz
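If you want to inspect the artifact before deploying it, two small standard-library helpers may be handy: one splits the S3 URI into the bucket and key that an S3 download call needs, the other lists the files inside a downloaded `model.tar.gz`. Both helper names are hypothetical, and the download step itself is left out of this sketch. The archive is expected to contain the `model.joblib` file that `script.py` writes to the model directory.

```python
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_archive_members(archive_path):
    """Return the member names of a local gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# URI copied from the output above.
bucket_name, key = split_s3_uri(
    "s3://sagemaker-eu-west-1-788946076961/rf-scikit-2022-11-02-11-25-08-109/output/model.tar.gz"
)
print(bucket_name, key)
```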


In [32]:
model = SKLearnModel(
    model_data=artifact,
    role=ROLE_ARN,
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
    sagemaker_session=sess
)

In [33]:
predictor = model.deploy(instance_type="ml.c5.large", initial_instance_count=1)

-----!

## Inference with the Python SDK

Finally, we can run inference against the deployed model by passing it the test dataset.

In [34]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

[0.48969083 0.73482163 4.8449463  ... 1.25522349 2.98857476 4.00746016]


Once you have finished experimenting, make sure you delete the endpoint so that it stops incurring costs.

In [35]:
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)


{'ResponseMetadata': {'RequestId': '40f6e5e2-9c96-4ea1-ad8f-d9e95e3c054a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '40f6e5e2-9c96-4ea1-ad8f-d9e95e3c054a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 02 Nov 2022 11:49:17 GMT'},
  'RetryAttempts': 0}}
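To make the cleanup harder to forget, the deletion can be wrapped in a context manager so that it runs even if an inference call raises. This is a sketch rather than part of the SageMaker SDK: the name `temporary_endpoint` is hypothetical, and it only assumes an object exposing `endpoint_name` (as predictors do in sagemaker>=2) and a client exposing `delete_endpoint`, such as `predictor` and `sm_boto3` from this tutorial.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor, sagemaker_client):
    """Yield the predictor and delete its endpoint when the block exits.

    Only the endpoint itself is deleted here; the endpoint configuration
    and the model are left in place.
    """
    try:
        yield predictor
    finally:
        # Runs on normal exit and when an exception escapes the block.
        sagemaker_client.delete_endpoint(EndpointName=predictor.endpoint_name)
```

It could be used as `with temporary_endpoint(predictor, sm_boto3) as p: print(p.predict(testX[data.feature_names]))`.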