## Develop, Train, Optimize and Deploy Bring-Your-Own Scikit-Learn Models on SageMaker
* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html
* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html
* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client

In this notebook we show how to use Amazon SageMaker to develop and train a Scikit-Learn based ML model (Random Forest). We also demonstrate hosting a bring-your-own Scikit-Learn model on SageMaker. The model is trained on the Boston house-price data.

References

 * Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
 * Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
 
 
 
 
**This sample is provided for demonstration purposes only. Conduct appropriate testing before adapting this code for your own use cases.**

## Compilation of Whirlpool train_labor Jupyter notebook Scikit-Learn libraries

In [None]:
!pip install textblob
import traceback, os, re, gzip, pickle, json, datetime, multiprocessing
import nltk, sklearn
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from scipy.sparse import csr_matrix
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from textblob import TextBlob
#import aunsight_connections as au_con
#from dslib.ioutils import aunsight_connector
import transformer_labor as transformer

## Compilation of Whirlpool train_parts Jupyter notebook Scikit-Learn libraries

In [313]:

!pip install sklearn_hierarchical_classification
import nltk, gzip, pickle, multiprocessing, traceback, os, json, sklearn
import datetime
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from scipy.sparse import csr_matrix
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn_hierarchical_classification.classifier import HierarchicalClassifier
from sklearn_hierarchical_classification.constants import ROOT
from textblob import TextBlob
import transformer_parts

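import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()  # `sess` was never defined above; a default session is assumed
region = sess.boto_region_name
bucket = sess.default_bucket()
prefix = 'sickit_learn_demo'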


## Compilation of Whirlpool train_parts Scikit-Learn Pipelines

In [314]:
 
pipeline = Pipeline([
        ('transformations', FeatureUnion([
            ('desc_pipe', Pipeline([ # This creates a sparse matrix with one column per word
                ('corp', transformer.desc_to_corpus('[^A-Za-z 0-9]')),
                ('vect', TfidfVectorizer(lowercase=True, stop_words = "english", min_df = 10, max_df = .8, ngram_range=(1,5))),
            ])),
            ('article_pipe', Pipeline([ # This creates a sparse matrix with one column per word
                ('corp', transformer.request_to_corpus('[^A-Za-z 0-9]')),
                ('vect', TfidfVectorizer(lowercase=True, stop_words = "english", min_df = 10, max_df = .8, ngram_range=(1,5))),
            ])),
            ('mod_pipe', Pipeline([ 
                ('prep', transformer.prep_mod()), # needed to pass into TfidfVectorizer
                ('vect', TfidfVectorizer(lowercase=True, min_df=10, max_df = .8, norm='l2', tokenizer=transformer.tokenize_mods, ngram_range=(1,5)))
            ]))
        ])),
        ('estimator', RidgeClassifier())
    ])
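The `transformer` module used above is not shown in this notebook, so the following is only a minimal sketch of how a `FeatureUnion` of per-column TF-IDF pipelines feeding a `RidgeClassifier` fits together. `ColumnToCorpus`, the column names `description` and `request`, and the toy data are hypothetical stand-ins, not the actual `transformer_labor` or `transformer_parts` code. The `min_df`, `max_df` and `ngram_range` settings are left at their defaults because the toy corpus is far too small for them.

```python
import re

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnToCorpus(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for transformer.desc_to_corpus / request_to_corpus."""

    def __init__(self, column, pattern="[^A-Za-z 0-9]"):
        self.column = column
        self.pattern = pattern

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # Replace every character matched by the pattern with a space
        return [re.sub(self.pattern, " ", str(text)) for text in X[self.column]]


toy_pipeline = Pipeline([
    ("transformations", FeatureUnion([
        # Each branch yields a sparse TF-IDF matrix; FeatureUnion stacks them side by side
        ("desc_pipe", Pipeline([
            ("corp", ColumnToCorpus("description")),
            ("vect", TfidfVectorizer(lowercase=True, stop_words="english")),
        ])),
        ("request_pipe", Pipeline([
            ("corp", ColumnToCorpus("request")),
            ("vect", TfidfVectorizer(lowercase=True, stop_words="english")),
        ])),
    ])),
    ("estimator", RidgeClassifier()),
])

# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    "description": ["door seal torn!", "pump motor noisy", "door latch broken", "pump leaking water"],
    "request": ["replace seal", "replace pump", "repair latch", "replace pump"],
})
toy_labels = ["door", "pump", "door", "pump"]

toy_pipeline.fit(toy_df, toy_labels)
toy_predictions = toy_pipeline.predict(toy_df)
```

Because every branch receives the same input DataFrame, each custom transformer is responsible for selecting its own column before the vectorizer runs.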
    
     

In [315]:
# we use the Boston housing dataset (load_boston was removed in scikit-learn 1.2)
from sklearn.datasets import load_boston
data = load_boston()

In [316]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX["target"] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX["target"] = y_test

In [317]:
trainX.head()

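| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09103 | 0.0 | 2.46 | 0.0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3.0 | 193.0 | 17.8 | 394.12 | 4.82 | 37.9 |
| 1 | 3.53501 | 0.0 | 19.58 | 1.0 | 0.871 | 6.152 | 82.6 | 1.7455 | 5.0 | 403.0 | 14.7 | 88.01 | 15.02 | 15.6 |
| 2 | 0.03578 | 20.0 | 3.33 | 0.0 | 0.4429 | 7.82 | 64.5 | 4.6947 | 5.0 | 216.0 | 14.9 | 387.31 | 3.76 | 45.4 |
| 3 | 0.38735 | 0.0 | 25.65 | 0.0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2.0 | 188.0 | 19.1 | 359.29 | 27.26 | 15.7 |
| 4 | 0.06724 | 0.0 | 3.24 | 0.0 | 0.46 | 6.333 | 17.2 | 5.2146 | 4.0 | 430.0 | 16.9 | 375.21 | 7.34 | 22.6 |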


In [318]:
trainX.to_csv("boston_train.csv")
testX.to_csv("boston_test.csv")

In [319]:
# Send the data to S3; SageMaker reads the training data from S3.

trainpath = sess.upload_data(
    path="boston_train.csv", bucket=bucket, key_prefix="sickit_learn_demo/data"
)

testpath = sess.upload_data(
    path="boston_test.csv", bucket=bucket, key_prefix="sickit_learn_demo/data"
)
print(trainpath)

s3://sagemaker-us-east-2-708870595954/sickit_learn_demo/data/boston_train.csv


## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [320]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == "__main__":

   
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't use all sklearn RandomForest hyperparameters
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="boston_train.csv")
    parser.add_argument("--test-file", type=str, default="boston_test.csv")
    parser.add_argument(
        "--features", type=str
    )  # in this script we ask user to explicitly name features
    parser.add_argument(
        "--target", type=str
    )  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
    )

    model.fit(X_train, y_train)

    # print abs error
    print("validating model")
    abs_err = np.abs(model.predict(X_test) - y_test)

    # print couple perf metrics
    for q in [10, 50, 90]:
        print("AE-at-" + str(q) + "th-percentile: " + str(np.percentile(a=abs_err, q=q)))

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    
    joblib.dump(model, path)

Overwriting script.py
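The argument handling in `script.py` can be tried out without a training job. The sketch below repeats a subset of the script's `argparse` setup and feeds it a simulated command line and environment. The function name `parse_training_args`, the `/tmp/...` paths, and the `--unused-flag` argument are assumptions made up for illustration. It shows how hyperparameters arrive as command-line arguments while the `SM_*` environment variables supply the directory defaults.

```python
import argparse


def parse_training_args(argv, environ):
    """Parse hyperparameters from argv, with SM_* environment variables as defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    parser.add_argument("--model-dir", type=str, default=environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--features", type=str)
    parser.add_argument("--target", type=str)
    # parse_known_args ignores arguments the parser does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulated inputs; these paths and the extra flag are illustrative assumptions
fake_env = {"SM_MODEL_DIR": "/tmp/model", "SM_CHANNEL_TRAIN": "/tmp/train"}
fake_argv = [
    "--n-estimators", "2",
    "--features", "CRIM ZN INDUS",
    "--target", "target",
    "--unused-flag", "1",
]

args = parse_training_args(fake_argv, fake_env)
feature_list = args.features.split()  # space-separated string -> list of column names
```

Note that `argparse` maps `--n-estimators` to `args.n_estimators`, which is why the hyperparameter keys use hyphens while the script reads attributes with underscores.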


### Launching a training job with Scikit-Learn

In [321]:
# We use the SKLearn Estimator from the SageMaker Python SDK
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    model_dir="s3://sagemaker-us-east-2-708870595954/sickit_learn_demo/model",
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={
        "n-estimators": 2,
        "min-samples-leaf": 3,
        "features": "CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT",
        "target": "target",
    },
)
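The `metric_definitions` entry above relies on a regular expression matching a line that `script.py` prints. As a rough local check, the same pattern can be applied with Python's `re` module. The log lines and values below are made up for illustration, and `extract_metric` is a hypothetical helper; SageMaker performs its own matching on the job logs, so this only approximates that behavior.

```python
import re

# Same pattern as in metric_definitions
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"

# Made-up log lines in the format printed by script.py
sample_log = [
    "validating model",
    "AE-at-10th-percentile: 0.25",
    "AE-at-50th-percentile: 1.75",
    "AE-at-90th-percentile: 6.5",
]


def extract_metric(lines, pattern):
    """Return the first captured value as a float, or None if no line matches."""
    for line in lines:
        match = re.search(pattern, line)
        if match:
            return float(match.group(1))
    return None


median_ae = extract_metric(sample_log, METRIC_REGEX)
```

Only the 50th-percentile line matches, so the 10th and 90th percentile values are printed to the log but not captured as the `median-AE` metric.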

In [322]:
sklearn_estimator.fit({"train": trainpath, "test": testpath})

2021-09-08 18:47:48 Starting - Starting the training job...
2021-09-08 18:47:49 Starting - Launching requested ML instances
ProfilerReport-1631126867: InProgress
......
2021-09-08 18:49:05 Starting - Preparing the instances for training......
2021-09-08 18:50:05 Downloading - Downloading input data...
2021-09-08 18:50:47 Training - Training image download completed. Training in progress.
2021-09-08 18:50:47 Uploading - Uploading generated training model
2021-09-08 18:50:37,656 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-09-08 18:50:37,658 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-08 18:50:37,667 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-09-08 18:50:37,955 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-09-08 18:50:40,982 sagemaker-training-toolkit INFO     No GPUs detected (

## Deploying the Scikit-Learn model to a real-time model hosting endpoint

In [310]:

from sagemaker.sklearn.model import SKLearnModel

# The trained model lives in <job name>/output/model.tar.gz.
# source/sourcedir.tar.gz holds only the training script, so it cannot be served.
model_data = sklearn_estimator.model_data
#model_data = 's3://sagemaker-us-east-2-708870595954/rf-scikit-2021-08-31-22-11-01-583/output/model.tar.gz'

model = SKLearnModel(
    model_data=model_data,
    role=get_execution_role(),
    entry_point="script.py",
    framework_version=FRAMEWORK_VERSION,
)
print(model_data)
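For a model trained outside SageMaker, `model_data` must point to a gzipped tar archive that `model_fn` can read. Since `model_fn` in `script.py` loads `model.joblib` from `model_dir`, the sketch below assumes the archive should hold `model.joblib` at its root. The helper names, the placeholder file, and the bucket and job names are hypothetical; a real archive would contain the output of `joblib.dump`. The URI helper follows the `<job name>/output/model.tar.gz` layout seen in the commented-out path of this notebook.

```python
import os
import tarfile
import tempfile


def package_model(model_path, archive_path):
    """Write model_path into a gzipped tar archive, placed at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


def training_artifact_uri(bucket, job_name):
    """Build the S3 URI of a training job's model artifact (default layout assumed)."""
    return f"s3://{bucket}/{job_name}/output/model.tar.gz"


# Placeholder file standing in for a real joblib.dump output
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, "model.joblib")
with open(placeholder, "wb") as handle:
    handle.write(b"placeholder bytes, not a real model")

archive = package_model(placeholder, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    member_names = tar.getnames()

# Hypothetical bucket and job name
example_uri = training_artifact_uri("my-bucket", "rf-scikit-example-job")
```

The resulting archive would still need to be uploaded to S3, for example with `sess.upload_data`, before its URI can be passed as `model_data`.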


In [None]:
print(bucket)  # confirm which bucket holds the model artifact

### Creating the SageMaker endpoint and hosting the Scikit-Learn model

In [None]:
from sagemaker.model_monitor import DataCaptureConfig
prefix = 'sickit_learn_demo'
s3_capture_upload_path = f"s3://{bucket}/{prefix}/output/inferencedata"
endpoint_name = 'Sickit-Learn-latest004' 
print("EndpointName={}".format(endpoint_name))

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    endpoint_name=endpoint_name,
    data_capture_config=data_capture_config,
)

### Predicting the Results

In [None]:
print(predictor.predict(testX[data.feature_names]))