## 3.0_train_test_models
 
by: Tom Goral

## OUTLINE

The purpose of this notebook is to train, validate, and test predictive models using data from 1.1_clean_prepare_data.ipynb.

1. [STEP 1: SETUP NOTEBOOK](#STEP-1:-SETUP-NOTEBOOK)
2. [STEP 2: LOAD DATA](#STEP-2:-LOAD-DATA)
3. [STEP 3: SPLIT INTO TRAIN, VALIDATE, TEST](#STEP-3:-SPLIT-INTO-TRAIN,-VALIDATE,-TEST)
4. [STEP 4: UPLOAD DATA TO S3](#STEP-4:-UPLOAD-DATA-TO-S3)
5. [STEP 5: EXPLORE CANDIDATE PREDICTORS](#STEP-5:-EXPLORE-CANDIDATE-PREDICTORS)
6. [STEP 6: TRAIN FIT PREDICTORS](#STEP-6:-TRAIN-FIT-PREDICTORS)
7. [STEP 7: TEST PREDICTORS](#STEP-7:-TEST-PREDICTORS)
8. [STEP 8: IDENTIFY BEST PREDICTOR](#STEP-8:-IDENTIFY-BEST-PREDICTOR)
9. [STEP 9: TUNE BEST PREDICTOR](#STEP-9:-TUNE-BEST-PREDICTOR)

## STEP 1: SETUP NOTEBOOK

In [1]:
# CUSTOM LIBRARIES
from utilities.xl2df         import xl2df
from utilities.hist_plot     import hist_plot
from utilities.print_metrics import print_metrics
from utilities.df2input      import df2input


# STANDARD LIBRARIES
import os
import sys
import numpy as np
import pandas as pd
import pickle, gzip, urllib.request, json


import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")  # Suppress matplotlib user warnings


from timeit import default_timer as timer
from time import gmtime, strftime
import datetime
now = datetime.datetime.now()

In [2]:
# SAGEMAKER LIBRARIES

import sagemaker
from   sagemaker                         import get_execution_role
from   sagemaker.amazon.amazon_estimator import get_image_uri
from   sagemaker.predictor               import csv_serializer
from   sagemaker.sklearn.processing      import SKLearnProcessor
import boto3


session  = sagemaker.Session()                # Identify SageMaker Session
role     = get_execution_role()               # Identify IAM Role
bucket   = session.default_bucket()           # Identify S3 bucket
region   = boto3.Session().region_name        # Identify Region
sm_boto3 = boto3.client('sagemaker')          # Identify Client

print('session: ', session)
print('   role: ', role)
print(' bucket: ', bucket)
print(' region: ',region)

session:  <sagemaker.session.Session object at 0x7f2254063f28>
   role:  arn:aws:iam::634491126024:role/service-role/AmazonSageMaker-ExecutionRole-20200619T082443
 bucket:  sagemaker-us-east-2-634491126024
 region:  us-east-2


## STEP 2: LOAD DATA

Load the cleaned, prepared data.

In [3]:
features =  xl2df('data/features.xlsx','features',0)  
response =  xl2df('data/response.xlsx','response',0)  


reading file: data/features.xlsx , sheet: features, index_col: 0
loaded File data/features.xlsx in 31 seconds
rows: 7965, cols: 362, cells: 2883330


reading file: data/response.xlsx , sheet: response, index_col: 0
loaded File data/response.xlsx in 0 seconds
rows: 7965, cols: 1, cells: 7965



## STEP 3: SPLIT INTO TRAIN, VALIDATE, TEST

In [4]:
#  preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import normalize
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer


# estimators
from sklearn.neighbors    import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble     import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree         import DecisionTreeRegressor
from sklearn              import svm, preprocessing


# model accuracy
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import train_test_split


random_state=42

# Hold out 33% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(features, response, test_size=0.33, random_state=random_state)

# Split the remaining rows again: 67% for training, 33% for validation.
X_train, X_val, y_train, y_val   = train_test_split(X_train, y_train, test_size=0.33, random_state=random_state)

# Optional scaling: fit on the training set only, then apply the same transform to the other sets.
#std_scaler  = StandardScaler()
#X_train     = std_scaler.fit_transform(X_train)
#X_val       = std_scaler.transform(X_val)
#X_test      = std_scaler.transform(X_test)


print ("Training   set has {} samples.".format(X_train.shape[0]))
print ("Validation set has {} samples.".format(X_val.shape[0]))
print ("Testing    set has {} samples.".format(X_test.shape[0]))
print ("Total data set has {} samples.".format(X_train.shape[0]+X_val.shape[0]+X_test.shape[0]))

Training   set has 3575 samples.
Validation set has 1761 samples.
Testing    set has 2629 samples.
Total data set has 7965 samples.
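The sample counts printed above follow from applying `test_size=0.33` twice. A minimal sketch of the arithmetic, assuming scikit-learn rounds the test share up to a whole number of samples and gives the remainder to the training side; the helper name `split_sizes` is hypothetical:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 7965
n_rest, n_test = split_sizes(n_total, 0.33)   # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.33)    # second split: carve validation out of the rest

for name, n in [('train', n_train), ('validation', n_val), ('test', n_test)]:
    print('{:<10} {:>5}  {:.1%} of all rows'.format(name, n, n / n_total))
```

Because the second split is taken from the rows left after the first, the overall proportions are roughly 45% training, 22% validation, and 33% testing rather than thirds of the full data set.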


## STEP 4: UPLOAD DATA TO S3

When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3. We can use the SageMaker API to do this and hide some of the details.


### Training Data Formats

Many Amazon SageMaker algorithms support training with data in CSV format. To train on CSV data, specify `text/csv` as the content type in the input data channel specification. Amazon SageMaker requires that the CSV file has no header record and that the target variable is in the first column. For unsupervised learning algorithms, which have no target, specify the number of label columns in the content type, for example `text/csv;label_size=0`.
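The layout described here (no header row, target in the first column) can be sketched with the standard library alone. The toy values are hypothetical; the notebook itself produces the same layout with pandas via `to_csv(..., header=False, index=False)`.

```python
import csv
import io

# Hypothetical toy data: one target value and two feature values per row.
targets  = [10.5, 7.25]
features = [[1, 0], [0, 1]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for y, x in zip(targets, features):
    writer.writerow([y] + x)      # target first, then the features; no header row is written

csv_text = buffer.getvalue()
print(csv_text)
```

Each line of `csv_text` starts with the target value, which is the position a built-in algorithm reading `text/csv` expects it in.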

In [5]:
# Local data directory
data_dir = 'data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [6]:
# Scikit Learn Input 

FILE_TRAIN_SKL      = os.path.join(data_dir, 'train_skl.csv')
FILE_VALIDATION_SKL = os.path.join(data_dir, 'validation_skl.csv')
FILE_TEST_SKL       = os.path.join(data_dir, 'test_skl.csv')

# save to local directory
pd.concat([y_train, X_train], axis=1).to_csv(FILE_TRAIN_SKL)
pd.concat([y_val, X_val], axis=1).to_csv(FILE_VALIDATION_SKL)
pd.concat([y_test, X_test], axis=1).to_csv(FILE_TEST_SKL)

# upload to S3 for SageMaker
prefix         = 'capstone' 
s3_train_skl       = session.upload_data(FILE_TRAIN_SKL,       bucket=bucket, key_prefix=prefix)
s3_validation_skl  = session.upload_data(FILE_VALIDATION_SKL,  bucket=bucket, key_prefix=prefix)
s3_test_skl        = session.upload_data(FILE_TEST_SKL,        bucket=bucket, key_prefix=prefix)
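`upload_data` returns the S3 URI of each uploaded file, which is why the return values are kept in variables. A small sketch of the shape that URI is expected to take for a single file, assuming the `s3://<bucket>/<key_prefix>/<file name>` pattern; the helper `expected_s3_uri` and the bucket name `my-bucket` are hypothetical:

```python
import os

def expected_s3_uri(local_path, bucket, key_prefix):
    """Build the URI where a file is expected to land: s3://<bucket>/<key_prefix>/<file name>."""
    return 's3://{}/{}/{}'.format(bucket, key_prefix, os.path.basename(local_path))

uri = expected_s3_uri(os.path.join('data', 'train_skl.csv'), 'my-bucket', 'capstone')
print(uri)
```

Comparing this string with the value returned by `session.upload_data` is a quick way to confirm that a file went to the intended prefix.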

In [1]:
import os  # needed when this cell runs in a fresh kernel
os.environ.get('SM_CHANNEL_TRAIN')  # set inside a SageMaker training container; None elsewhere

In [None]:
FILE_TRAIN_XGB      = os.path.join(data_dir, 'train_xgb.csv')
FILE_VALIDATION_XGB = os.path.join(data_dir, 'validation_xgb.csv')
FILE_TEST_XGB       = os.path.join(data_dir, 'test_xgb.csv')

pd.concat([y_train, X_train], axis=1).to_csv(FILE_TRAIN_XGB, header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(FILE_VALIDATION_XGB, header=False, index=False)
X_test.to_csv(FILE_TEST_XGB, header=False, index=False)

# upload to S3
prefix         = 'capstone' 
s3_train       = session.upload_data(FILE_TRAIN_XGB,       bucket=bucket, key_prefix=prefix)
s3_validation  = session.upload_data(FILE_VALIDATION_XGB,  bucket=bucket, key_prefix=prefix)
s3_test        = session.upload_data(FILE_TEST_XGB,        bucket=bucket, key_prefix=prefix)

In [None]:
def csv2sk(filename):
    """Read a CSV saved as [index, response, features...] and split it into features and response."""
    df = pd.read_csv(filename, index_col=0)   # the first column holds the saved row index
    response = df.iloc[:, 0]                  # the response is the first data column
    features = df.iloc[:, 1:]                 # the remaining columns are the features
    return features, response

train_features, train_response           = csv2sk(FILE_TRAIN_SKL)
validation_features, validation_response = csv2sk(FILE_VALIDATION_SKL)
test_features, test_response             = csv2sk(FILE_TEST_SKL)

train_response

In [None]:
! python script.py

## STEP 5: EXPLORE CANDIDATE PREDICTORS

In [None]:
! python script.py --n-estimators 20 \
                   --min-samples-leaf 2 \
                   --model-dir  $bucket \
                   --train-file $s3_train \
                   --test-file  $s3_test
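`script.py` is not shown in this notebook. As a rough sketch only, a training script accepting the flags used above could parse them as follows; every default value here is an assumption, and `SM_MODEL_DIR` is the environment variable SageMaker sets inside a training container.

```python
import argparse
import os

def build_parser():
    """Argument parser for a hypothetical script.py; defaults are illustrative only."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--min-samples-leaf', type=int, default=3)
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train-file', type=str, default='train_skl.csv')
    parser.add_argument('--test-file', type=str, default='test_skl.csv')
    return parser

# Parse an explicit argument list so the sketch runs outside the command line too.
args = build_parser().parse_args(['--n-estimators', '20', '--min-samples-leaf', '2'])
print(args.n_estimators, args.min_samples_leaf, args.train_file)
```

argparse converts the dashes in each flag to underscores, so `--n-estimators` is read back as `args.n_estimators`.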

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(entry_point='script.py', role=role,
                            train_instance_count=1, train_instance_type='ml.c5.xlarge',
                            framework_version='0.23-1', base_job_name='xx')

In [None]:
print(sklearn_estimator)

In [None]:
# launch training job; wait=True blocks until the job finishes

s3_input_train      = sagemaker.s3_input(s3_data=s3_train, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_validation, content_type='csv')
sklearn_estimator.fit({'train':s3_train, 'val': s3_validation}, wait=True)

In [None]:
from sagemaker.sklearn.estimator import SKLearn

In [None]:
rf = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                           max_depth=None, max_features='auto', max_leaf_nodes=None,
                           max_samples=None, min_impurity_decrease=0.0,
                           min_impurity_split=None, min_samples_leaf=1,
                           min_samples_split=2, min_weight_fraction_leaf=0.0,
                           n_estimators=100, n_jobs=None, oob_score=False,
                           random_state=42, verbose=0, warm_start=False)


random_grid = {'bootstrap': [True, False], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],'min_samples_leaf': [1, 2, 4],'min_samples_split': [2, 5, 10],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


from sklearn.model_selection import RandomizedSearchCV  # not imported in the STEP 3 cell

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3,
                               verbose=2, random_state=random_state)

# Fit the random search model
rf_random.fit(X_train, y_train)
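To get a feel for the cost of this search, the size of the grid and the number of model fits can be worked out directly. The sketch below copies the grid from the cell above and uses only the standard library:

```python
from math import prod

random_grid = {'bootstrap': [True, False],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
               'max_features': ['auto', 'sqrt'],
               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

n_combinations = prod(len(values) for values in random_grid.values())  # size of the full grid
n_iter, cv = 100, 3
n_fits = n_iter * cv                                                   # fits done during the search

print('full grid     :', n_combinations, 'combinations')
print('sampled       :', n_iter, 'combinations')
print('model fits    :', n_fits)
```

The random search samples 100 of the 3960 possible combinations and fits each one 3 times, plus one final refit of the best combination when `refit=True` (the default).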

In [None]:
s3_input_train      = sagemaker.s3_input(s3_data=s3_train, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_validation, content_type='csv')
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})  # xgb is defined in a later cell

### Run directly from Notebook instead of S3

In [None]:
tree           = DecisionTreeRegressor(random_state=random_state)  # benchmark
tree.fit(X_train, y_train)
print(tree)
print()
preds_tree     = tree.predict(X_test)
print('mse: ',mean_squared_error(y_test,preds_tree))
print('mae: ',mean_absolute_error(y_test,preds_tree))
print('r2 : ',r2_score(y_test,preds_tree))

In [None]:
tree           = DecisionTreeRegressor(random_state=random_state)  # benchmark
tree.fit(train_features, train_response)
print(tree)
print()
preds_tree     = tree.predict(test_features)
print('mse: ',mean_squared_error(test_response,preds_tree))
print('mae: ',mean_absolute_error(test_response,preds_tree))
print('r2 : ',r2_score(test_response,preds_tree))

In [None]:
knn            = KNeighborsRegressor()
knn.fit(X_train, y_train)
print(knn)
print()
preds_knn     = knn.predict(X_test)
print('mse: ',mean_squared_error(y_test,preds_knn))
print('mae: ',mean_absolute_error(y_test,preds_knn))
print('r2 : ',r2_score(y_test,preds_knn))

In [None]:
knn            = KNeighborsRegressor()
knn.fit(train_features, train_response)
print(knn)
print()
preds_knn     = knn.predict(test_features)
print('mse: ',mean_squared_error(test_response,preds_knn))
print('mae: ',mean_absolute_error(test_response,preds_knn))
print('r2 : ',r2_score(test_response,preds_knn))

In [None]:
rfr            = RandomForestRegressor(random_state=random_state)
rfr.fit(X_train, y_train)
print(rfr)
print()
preds_rfr     = rfr.predict(X_test)
print('mse: ',mean_squared_error(y_test,preds_rfr))
print('mae: ',mean_absolute_error(y_test,preds_rfr))
print('r2 : ',r2_score(y_test,preds_rfr))

In [None]:
rfr            = RandomForestRegressor(random_state=random_state)
rfr.fit(train_features, train_response)
print(rfr)
print()
preds_rfr     = rfr.predict(test_features)
print('mse: ',mean_squared_error(test_response,preds_rfr))
print('mae: ',mean_absolute_error(test_response,preds_rfr))
print('r2 : ',r2_score(test_response,preds_rfr))

In [None]:
ada            = AdaBoostRegressor(random_state=random_state)
ada.fit(X_train, y_train)
print(ada)
print()
preds_ada     = ada.predict(X_test)
print('mse: ',mean_squared_error(y_test,preds_ada))
print('mae: ',mean_absolute_error(y_test,preds_ada))
print('r2 : ',r2_score(y_test,preds_ada))

In [None]:
linreg         = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg)
print()
preds_linreg     = linreg.predict(X_test)
print('mse: ',mean_squared_error(y_test,preds_linreg))
print('mae: ',mean_absolute_error(y_test,preds_linreg))
print('r2 : ',r2_score(y_test,preds_linreg))

In [None]:
# import and instantiate the SageMaker SKLearn estimator
from sagemaker.sklearn.estimator import SKLearn

# instantiate a sklearn estimator
estimator = SKLearn(entry_point='train.py',
                    source_dir='utilities',
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    sagemaker_session=session)

In [None]:

tree           = DecisionTreeRegressor(random_state=random_state)  # benchmark
knn            = KNeighborsRegressor()
rfr            = RandomForestRegressor(random_state=random_state)
ada            = AdaBoostRegressor(random_state=random_state)
linreg         = LinearRegression()




In [None]:

container = get_image_uri(region, 'xgboost')                   # training container
xgb = sagemaker.estimator.Estimator(container,                 # estimator object from container
                                    role,     
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge', 
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='reg:linear',
                        early_stopping_rounds=10,
                        num_round=200)
                                                                   
s3_input_train      = sagemaker.s3_input(s3_data=s3_train, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_validation, content_type='csv')



algorithms          = {'tree': tree,'knn':knn,'rfr': rfr,'ada': ada, 'linreg':linreg, 'xgb':xgb}
algorithm_list      = list(algorithms)   # model names, in insertion order

## STEP 6: TRAIN FIT PREDICTORS

In [None]:
### TRAIN REGRESSION MODELS

tree.fit(X_train, y_train)
knn.fit(X_train, y_train)
rfr.fit(X_train, y_train)
ada.fit(X_train, y_train)
linreg.fit(X_train, y_train)



xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
xgb_transformer.transform(s3_test, content_type='text/csv', split_type='Line')
xgb_transformer.wait()

## STEP 7: TEST PREDICTORS

In [None]:
preds_tree     = tree.predict(X_test)
preds_knn      = knn.predict(X_test) 
preds_rfr      = rfr.predict(X_test)
preds_ada      = ada.predict(X_test)
preds_linreg   = linreg.predict(X_test)
predictions    = {'tree': preds_tree,'knn':preds_knn,'rfr': preds_rfr,
                  'ada': preds_ada, 'linreg':preds_linreg}

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

# We need to tell the endpoint what format the data we are sending is in
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

Y_pred = xgb_predictor.predict(X_test.values).decode('utf-8')
# predictions is currently a comma delimited string and so we would like to break it up
# as a numpy array.
Y_pred = np.fromstring(Y_pred, sep=',')
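The decode-and-split step can also be written without NumPy. A minimal sketch, assuming the endpoint returns comma-separated numbers as bytes; the helper `parse_csv_predictions` and the example payload are hypothetical:

```python
def parse_csv_predictions(payload):
    """Turn a comma- or newline-delimited response into a list of floats."""
    text = payload.decode('utf-8') if isinstance(payload, bytes) else payload
    # Treat newlines like commas and skip empty fields such as a trailing newline.
    return [float(value) for value in text.replace('\n', ',').split(',') if value.strip()]

example = b'12.5,3.0,7.25'
parsed = parse_csv_predictions(example)
print(parsed)
```

Handling newlines as well as commas keeps the parser usable if the response puts one prediction per line.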

In [None]:
Y_pred

In [None]:
xgb_predictor.delete_endpoint()

## STEP 8: IDENTIFY BEST PREDICTOR

In [None]:
# A good model has R2 near 1; R2 near 0 (or below) is no better than predicting the mean
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

labels = {'tree': 'Decision Tree', 'knn': 'KNN', 'rfr': 'Random Forest Regressor',
          'ada': 'AdaBoost', 'linreg': 'Linear Regression'}

for key, preds in predictions.items():
    print('\n' + labels[key])
    print('mse: ', mean_squared_error(y_test, preds))
    print('mae: ', mean_absolute_error(y_test, preds))
    print('r2 : ', r2_score(y_test, preds))
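For reference, the scores used to compare models in this notebook (MSE, MAE, and R2) can be computed by hand. This is a plain-Python sketch of the standard definitions, not the scikit-learn implementation, and the toy numbers are hypothetical:

```python
def regression_metrics(actual, predicted):
    """Return (mse, mae, r2) for two equal-length sequences of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n               # mean squared error
    mae = sum(abs(e) for e in errors) / n               # mean absolute error
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                            # share of variance explained
    return mse, mae, r2

mse, mae, r2 = regression_metrics([3, 5, 7], [2, 5, 9])
print('mse:', mse, 'mae:', mae, 'r2:', r2)
```

MSE and MAE are in the units of the response (squared and absolute respectively), so lower is better, while R2 is unitless and higher is better.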

In [None]:
# Collect actual and predicted values side by side, one column per model
dfResults = y_test.copy()                    # copy so y_test itself is not modified

for name, preds in predictions.items():
    dfResults[name] = np.ravel(preds)        # flatten (n, 1) outputs such as LinearRegression

dfResults = dfResults.rename(columns={"cost": "ACTUAL"})   # rename returns a new DataFrame
dfResults

In [None]:
output_path = 's3://{}/{}/output'.format(bucket, prefix)
output_path

## STEP 9: TUNE BEST PREDICTOR

##  GO TO:  3.0_deploy_monitor.ipynb