## 3.0_train_deploy_model
 
by: Tom Goral

## OUTLINE

The purpose of this notebook is to train and test predictive models using data prepared by 2.0_feature_engineering.ipynb, so that the most accurate model can be identified and deployed.

1. [STEP 1: SETUP NOTEBOOK](#STEP-1:-SETUP-NOTEBOOK)
2. [STEP 2: LOAD DATA](#STEP-2:-LOAD-DATA)
3. [STEP 3: SPLIT INPUT INTO TRAIN, VALIDATE, TEST](#STEP-3:-SPLIT-INPUT-INTO-TRAIN,-VALIDATE,-TEST)
4. [STEP 4: UPLOAD DATA TO S3](#STEP-4:-UPLOAD-DATA-TO-S3)
5. [STEP 5: TRY SEVERAL PREDICTOR MODELS](#STEP-5:-TRY-SEVERAL-PREDICTOR-MODELS)
6. [STEP 6: IDENTIFY THE BEST PREDICTOR](#STEP-6:-IDENTIFY-THE-BEST-PREDICTOR)
7. [STEP 7: OPTIMIZE THE BEST PREDICTOR](#STEP-7:-OPTIMIZE-THE-BEST-PREDICTOR)
8. [STEP 8: DEPLOY THE BEST PREDICTOR](#STEP-8:-DEPLOY-THE-BEST-PREDICTOR)
9. [STEP 9: TUNE BEST PREDICTOR](#STEP-9:-TUNE-BEST-PREDICTOR)

## STEP 1: SETUP NOTEBOOK

In [None]:
# CUSTOM LIBRARIES
from utilities.xl2df         import xl2df
from utilities.hist_plot     import hist_plot
from utilities.print_metrics import print_metrics
from utilities.df2input      import df2input


# STANDARD LIBRARIES
import os
import sys
import numpy as np
import pandas as pd
import pickle, gzip, urllib.request, json


import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")  # Suppress matplotlib user warnings


from timeit import default_timer as timer
from time import gmtime, strftime
import datetime
now = datetime.datetime.now()

In [None]:
# SAGEMAKER LIBRARIES

import sagemaker
from   sagemaker                         import get_execution_role
from   sagemaker.amazon.amazon_estimator import get_image_uri
from   sagemaker.predictor               import csv_serializer
from   sagemaker.sklearn.processing      import SKLearnProcessor
import boto3


session  = sagemaker.Session()                # Identify SageMaker Session
role     = get_execution_role()               # Identify IAM Role
bucket   = session.default_bucket()           # Identify S3 bucket
region   = boto3.Session().region_name        # Identify Region
sm_boto3 = boto3.client('sagemaker')          # Identify Client

print('session: ', session)
print('   role: ', role)
print(' bucket: ', bucket)
print(' region: ', region)
print(' client: ', sm_boto3)

## STEP 2: LOAD DATA

Load the features and responses of the cleaned, prepared data for training.<br>
Load the original anonymized data, which is used later to compare the predictions of the different models.

In [None]:
features =  xl2df('data/features.xlsx','features',0)  
response =  xl2df('data/response.xlsx','response',0)
df       =  xl2df('data/anonymous.xlsx','anonymous',0)  

## STEP 3: SPLIT INPUT INTO TRAIN, VALIDATE, TEST

In [None]:
#  preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import normalize
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer


# estimators
from sklearn.neighbors           import KNeighborsRegressor
from sklearn.linear_model        import LinearRegression
from sklearn.ensemble            import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree                import DecisionTreeRegressor
from sklearn                     import svm, preprocessing
from sagemaker.sklearn.estimator import SKLearn


# model accuracy
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:

random_state=42

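# Hold out 33% of the data for testing; 67% remains for training and validation.
X_train, X_test, y_train, y_test = train_test_split(features, response, test_size=0.33, random_state=random_state)

# Split the remaining 67% again: 67% of it for training, 33% of it for validation
# (roughly 45% train / 22% validation / 33% test of the full data set).
X_train, X_val, y_train, y_val   = train_test_split(X_train, y_train, test_size=0.33, random_state=random_state)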

#std_scaler  = StandardScaler()
#X_train     = std_scaler.fit_transform(X_train)   # fit the scaler on the training set only
#X_val       = std_scaler.transform(X_val)         # reuse the training-set scaling
#X_test      = std_scaler.transform(X_test)


print ("Training   set has {} samples.".format(X_train.shape[0]))
print ("Testing    set has {} samples.".format(X_test.shape[0]))
print ("Validation set has {} samples.".format(X_val.shape[0]))
print ("Total data set has {} samples.".format(X_train.shape[0]+X_val.shape[0]+X_test.shape[0]))
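
The two successive `test_size=0.33` splits above do not produce three equal parts. The sketch below works out the approximate sizes for a hypothetical data set of 1,000 rows, assuming scikit-learn rounds the held-out share up to a whole number of rows. The helper name `split_sizes` is illustrative only and is not part of this project.

```python
import math

def split_sizes(n_samples, test_size=0.33):
    """Approximate the row counts left by two successive hold-out splits."""
    n_test = math.ceil(test_size * n_samples)      # first split: hold out the test set
    n_remaining = n_samples - n_test
    n_val = math.ceil(test_size * n_remaining)     # second split: hold out the validation set
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)
for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print("{:<10} {:>4} rows ({:.0%})".format(name, n, n / 1000))
```

Under these assumptions, a little under half of the rows end up in the training set.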

## STEP 4: UPLOAD DATA TO S3

When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.


Training Data Formats
Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify text/csv as the Content Type. Amazon SageMaker requires that a CSV file doesn't have a header record and that the target variable is in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case 'text/csv;label_size=0'.
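
As a minimal sketch of that layout, the snippet below writes a tiny made-up frame in the form described above: target in the first column, no header row, no index column. The column names and values are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical data: 'cost' is the target, 'qty' and 'mm' are features
y = pd.DataFrame({"cost": [10, 20]})
X = pd.DataFrame({"qty": [1, 2], "mm": [3, 4]})

buffer = io.StringIO()
# Target in the first column, then the features; drop the header and the index
pd.concat([y, X], axis=1).to_csv(buffer, header=False, index=False)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```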

In [None]:
# Local data directory
data_dir = 'data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [None]:
# Scikit Learn Input 

FILE_TRAIN      = os.path.join(data_dir, 'train.csv')
FILE_VALIDATION = os.path.join(data_dir, 'validation.csv')
FILE_TEST       = os.path.join(data_dir, 'test.csv')

# save to local directory
pd.concat([y_train, X_train], axis=1).to_csv(FILE_TRAIN)
pd.concat([y_val, X_val], axis=1).to_csv(FILE_VALIDATION)
pd.concat([y_test, X_test], axis=1).to_csv(FILE_TEST)

# upload to S3 for SageMaker
prefix         = 'capstone' 
s3_train       = session.upload_data(FILE_TRAIN,       bucket=bucket, key_prefix=prefix)
s3_validation  = session.upload_data(FILE_VALIDATION,  bucket=bucket, key_prefix=prefix)
s3_test        = session.upload_data(FILE_TEST,        bucket=bucket, key_prefix=prefix)

In [None]:
FILE_TRAIN_XGB      = os.path.join(data_dir, 'train_xgb.csv')
FILE_VALIDATION_XGB = os.path.join(data_dir, 'validation_xgb.csv')
FILE_TEST_XGB       = os.path.join(data_dir, 'test_xgb.csv')

pd.concat([y_train, X_train], axis=1).to_csv(FILE_TRAIN_XGB, header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(FILE_VALIDATION_XGB, header=False, index=False)
X_test.to_csv(FILE_TEST_XGB, header=False, index=False)




prefix             = 'capstone' 
s3_train_xgb       = session.upload_data(FILE_TRAIN_XGB,       bucket=bucket, key_prefix=prefix)
s3_validation_xgb  = session.upload_data(FILE_VALIDATION_XGB,  bucket=bucket, key_prefix=prefix)
s3_test_xgb        = session.upload_data(FILE_TEST_XGB,        bucket=bucket, key_prefix=prefix)

In [None]:
'''def csv2sk(filename):
    df = pd.read_csv(filename)
    df.rename(columns = {'Unnamed: 0':''}, inplace = True)
    df.set_index(df.iloc[:, 0], inplace = True)
    del df['']
    features = df.iloc[0:,1:]
    response = df.iloc[0:, 0]
    #response = df.iloc[0:, :1]
   
    return features, response

train_features, train_response           = csv2sk(FILE_TRAIN_SKL)
validation_features, validation_response = csv2sk(FILE_VALIDATION_SKL)
test_features, test_response             = csv2sk(FILE_TEST_SKL)

train_response'''

## STEP 5: TRY SEVERAL PREDICTOR MODELS

### This was by far the most difficult part of this capstone assignment. I struggled with applying SageMaker's AWS containers, high- and low-level models, and built-in versus custom models. The syntax and process were not straightforward to me either. Nevertheless, three approaches are tried below:<br>
<br>
1. SageMaker Custom Model with a script (Random Forest)<br>
2. SageMaker Built-In Model (XGBoost)<br>
3. Scikit Models run on my PC (Decision Tree, KNN, AdaBoost)<br>

### 5.1 SageMaker Custom Model with a script (Random Forest)

In [None]:

%%writefile utilities/script.py

import argparse
import os
import numpy as np
import pandas as pd
import joblib

import subprocess as sb 
import sys
mypackage = 'sagemaker'
sb.call([sys.executable, "-m", "pip", "install", mypackage]) 


# estimators
from sklearn.neighbors           import KNeighborsRegressor
from sklearn.linear_model        import LinearRegression
from sklearn.ensemble            import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree                import DecisionTreeRegressor
from sklearn                     import svm, preprocessing
from sagemaker.sklearn.estimator import SKLearn



# model accuracy
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split



def model_fn(model_dir):
    # Load the fitted model from the artifact stored in model_dir
    model =  joblib.load(os.path.join(model_dir, "model.joblib"))
    return model

def predict_fn(input_data, model):
    # Apply the model to the input data and return its predictions
    return model.predict(input_data)










if __name__ =='__main__':

    print('\n1. EXTRACT ARGUMENTS:')
    parser = argparse.ArgumentParser()
    parser.add_argument('--choice', type=str, default ='rfr')
    parser.add_argument('--random_state', type=int, default=42)
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--min-samples-leaf', type=int, default=3)
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='train.csv')
    parser.add_argument('--test-file', type=str, default='test.csv')
    args, _ = parser.parse_known_args()

   
    print('2.         LOAD DATA:')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file),index_col =0)
    train_df = train_df.astype('float32')
    X_train = train_df[train_df.columns[1:]]
    y_train = train_df[train_df.columns[0]]    
    test_df = pd.read_csv(os.path.join(args.test, args.test_file),index_col =0)
    test_df = test_df.astype('float32')
    X_test = test_df[test_df.columns[1:]]
    y_test = test_df[test_df.columns[0]]
    

    print('3.        LOAD MODEL: ',args.choice)
    
    if args.choice == 'tree':        
        model = DecisionTreeRegressor(random_state=args.random_state)  # baseline
        
    elif args.choice == 'knn':        
        model =  KNeighborsRegressor()
       
    elif args.choice == 'rfr':        
        model = RandomForestRegressor(
                n_estimators=args.n_estimators,
                min_samples_leaf=args.min_samples_leaf,
                random_state = args.random_state,
                n_jobs=-1)
        
    elif args.choice == 'ada':        
        model =  AdaBoostRegressor(random_state=args.random_state)
    
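    elif args.choice == 'linreg':
        model =  LinearRegression()  

    else:
        # Fail early instead of hitting an undefined `model` further down
        raise ValueError('Unknown --choice: {}'.format(args.choice))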
    

    print('4.         FIT MODEL: ',args.choice)
    model.fit(X_train, y_train)
    
    
    print('5.        TEST MODEL: ',args.choice)
    preds_model     = model.predict(X_test)
    print('         mse: {:.4f}'.format(mean_squared_error(y_test,preds_model)))
    print('        rmse: {:.4f}'.format(np.sqrt(mean_squared_error(y_test,preds_model))))
    print('         mae: {:.4f}'.format(mean_absolute_error(y_test,preds_model)))
    print('          r2: {:.4f}'.format(r2_score(y_test,preds_model)))
     
        
    
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('6.    SAVED MODEL TO: ' + path)
  

In [None]:
# Run from command line

! python utilities/script.py --choice rfr \
                             --n-estimators 100 \
                             --min-samples-leaf 2 \
                             --model-dir ./ \
                             --train ./data \
                             --test ./data \
                             --train-file train.csv \
                             --test-file test.csv

In [None]:
# INSTANTIATE SageMaker Custom Model , SciKit Random Forest 

from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point='utilities/script.py',
    role = role,
    train_instance_count=1,
    train_instance_type='ml.c5.xlarge',
    framework_version='0.23-1',
    base_job_name='tmg',
    hyperparameters = {'n-estimators': 100,
                       'min-samples-leaf': 3})
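
In script mode, each entry of `hyperparameters` reaches the entry-point script as a command-line flag, which is why the keys above match the `argparse` options in `utilities/script.py`. The sketch below reproduces that hand-off locally with a hand-written argument list, which is an assumption standing in for what the training container passes. `--extra-flag` is a made-up option that shows what `parse_known_args` does with flags it does not recognize.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--n-estimators', type=int, default=10)
parser.add_argument('--min-samples-leaf', type=int, default=3)

# Stand-in for the arguments the training container would pass (assumption)
argv = ['--n-estimators', '100', '--min-samples-leaf', '3', '--extra-flag', 'x']

# parse_known_args returns the parsed options plus anything it did not recognize
args, unknown = parser.parse_known_args(argv)

# Hyphens in flag names become underscores in attribute names
print(args.n_estimators, args.min_samples_leaf, unknown)
```

Using `parse_known_args` rather than `parse_args` keeps the script from exiting when it receives arguments it did not declare.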


In [None]:
%%time
# TRAIN SageMaker Custom Model , SciKit Random Forest

trainpath = s3_train
testpath  = s3_test
estimator.fit({'train':trainpath, 'test': testpath}, wait=True)

In [None]:
%%time
# DEPLOY SageMaker Custom Model , SciKit Random Forest

predictor = estimator.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1)
print(predictor.endpoint)

In [None]:
%%time
# TEST SageMaker Custom Model , SciKit Random Forest

y_preds = predictor.predict(X_test)
print('mse: ',mean_squared_error(y_test,y_preds))
print('mae: ',mean_absolute_error(y_test,y_preds))
print('r2 : ',r2_score(y_test,y_preds))

In [None]:
# DELETE SageMaker Custom Model , SciKit Random Forest
predictor.delete_endpoint()

### 5.2 SageMaker Built-In Model (XGBoost)

In [None]:
# INSTANTIATE SageMaker Built-In Model , xgboost 

container = get_image_uri(region, 'xgboost')

xgb = sagemaker.estimator.Estimator(container, # The image name of the training container
                                    role,      # The IAM role to use (our current role in this case)
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                                                        # Where to save the output (the model artifacts)
                                    sagemaker_session=session) # The current SageMaker session


xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='reg:linear',
                        early_stopping_rounds=10,
                        num_round=200)

In [None]:
%%time
# TRAIN SageMaker Built-In Model , xgboost 

# Use the header-less, label-first CSV files written for XGBoost in Step 4
s3_input_train      = sagemaker.s3_input(s3_data=s3_train_xgb, content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_validation_xgb, content_type='text/csv')

xgb.fit({'train': s3_input_train , 'validation': s3_input_validation}, wait=True)

In [None]:
# TEST SageMaker Built-In Model , xgboost

xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(s3_test_xgb, content_type='text/csv', split_type='Line')  # features only, no header
xgb_transformer.wait()

In [None]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir
y_pred = pd.read_csv(os.path.join(data_dir, 'test_xgb.csv.out'), header=None)
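
The `xgboost` column of the comparison table in Step 6 is only filled if the batch-transform output joins the other predictions. Below is a minimal sketch of that step, assuming the output file holds one prediction per line. The in-memory text and the `predictions` dictionary are made-up stand-ins, not results from this project.

```python
import io
import pandas as pd

# Stand-in for the transform output file: one predicted value per line
fake_output = io.StringIO("12.5\n7.25\n30.0\n")
y_pred = pd.read_csv(fake_output, header=None)

# Hypothetical dictionary of predictions keyed by model name
predictions = {"tree": [12.0, 8.0, 29.0]}

# Flatten the single-column frame to a 1-D array, like the scikit-learn outputs
predictions["xgboost"] = y_pred[0].to_numpy()
print(predictions["xgboost"])
```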

### 5.3 Scikit Models run on my PC (Decision Tree, KNN, AdaBoost)

In [None]:
# Instantiate models
random_state=42

tree       = DecisionTreeRegressor(random_state=random_state)   #baseline
knn        = KNeighborsRegressor()
rfr        = RandomForestRegressor(random_state=random_state)
ada        = AdaBoostRegressor(random_state=random_state)
linreg     = LinearRegression()
algorithms = {'tree': tree,'knn':knn,'rfr': rfr,'ada': ada, 'linreg':linreg}

# Fit models

tree.fit(X_train, y_train.to_numpy().ravel())
knn.fit(X_train, y_train.to_numpy().ravel())
rfr.fit(X_train, y_train.to_numpy().ravel())
ada.fit(X_train, y_train.to_numpy().ravel())
linreg.fit(X_train, y_train.to_numpy().ravel())

# Test models

preds_tree   = tree.predict(X_test)
preds_knn    = knn.predict(X_test) 
preds_rfr    = rfr.predict(X_test)
preds_ada    = ada.predict(X_test)
preds_linreg = linreg.predict(X_test)
predictions  = {'tree': preds_tree,'knn':preds_knn,'rfr': preds_rfr,'ada': preds_ada, 'linreg':preds_linreg}

## STEP 6: IDENTIFY THE BEST PREDICTOR

In [None]:
dfMetrics = pd.DataFrame(index=['mse','rmse','mae','r2'],columns=['tree','knn','rfr','ada','linreg','xgboost'])

for k,v in predictions.items():
    # Use a single .loc[row, column] assignment; chained indexing may write to a copy
    dfMetrics.loc['mse',  k] = round(mean_squared_error(y_test,v),4)
    dfMetrics.loc['rmse', k] = round(np.sqrt(mean_squared_error(y_test,v)),4)
    dfMetrics.loc['mae',  k] = round(mean_absolute_error(y_test,v),4)
    dfMetrics.loc['r2',   k] = round(r2_score(y_test,v),4)
    
dfMetrics
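
For reference, the four metrics in the table can be written out directly. The sketch below computes them with NumPy on a small made-up example. The values are illustrative and are not results from this data set.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat  = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_hat - y_true
mse    = np.mean(errors ** 2)                    # mean squared error
rmse   = np.sqrt(mse)                            # same units as the target
mae    = np.mean(np.abs(errors))                 # mean absolute error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2     = 1.0 - ss_res / ss_tot                   # share of variance explained

print(round(mse, 4), round(rmse, 4), round(mae, 4), round(r2, 4))
```

Lower mse, rmse, and mae are better; r2 is better the closer it is to 1.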

In [None]:
# Dataframe to compare the predictions

# One column per model, indexed like the test set
dfResults = pd.DataFrame(predictions, index=y_test.index)
dfResults



# Add the actual cost and the original attributes for each test row.
# Whole-column assignment avoids chained indexing, which may write to a copy.
attributes = ['class', 'sub', 'assy', 'head', 'drive', 'thread', 'nom',
              'point', 'heat', 'lock', 'plate', 'qty', 'mm']

dfResults['ACTUAL'] = df['cost'].loc[dfResults.index]
for col in attributes:
    dfResults[col] = df[col].loc[dfResults.index]

dfResults

In [None]:
  
    
# Actual vs. predicted values, one panel per model
fig, axs = plt.subplots(2, 2)
axs[0, 0].scatter(y_test, preds_tree)
axs[0, 0].set_title('Decision Tree')
axs[0, 1].scatter(y_test, preds_knn, color='tab:orange')   # color must be passed by keyword
axs[0, 1].set_title('KNN')
axs[1, 0].scatter(y_test, preds_rfr, color='tab:green')
axs[1, 0].set_title('Random Forest')
axs[1, 1].scatter(y_test, preds_ada, color='tab:red')
axs[1, 1].set_title('AdaBoost')

for ax in axs.flat:
    ax.set(xlabel='actual', ylabel='predicted')

# Hide x labels and tick labels for top plots and y ticks for right plots.
for ax in axs.flat:
    ax.label_outer()

## STEP 7: OPTIMIZE THE BEST PREDICTOR

## STEP 8: DEPLOY THE BEST PREDICTOR