# Introduction

In this workshop, we will go through the steps of training and deploying a model and then training and testing a possible replacement model using the SageMaker Shadow Test feature. We'll do this entirely in code, making use of the <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_inference_experiment.html">SageMaker API</a>. The models will analyze and classify network traffic.  

## Contents

1) [Setup](#setup)
2) [Basic Training](#basic_training)
3) [Register the Models](#register)
4) [Create Endpoint Config](#create_endpoint)
5) [Deploy and Predict](#deploy)
6) [Create a Shadow Test](#shadow)
7) [Evaluate](#eval)

To train our models we will use the <a href="https://registry.opendata.aws/cse-cic-ids2018/">CSE-CIC-IDS2018</a> dataset by CIC and ISCX, which is used for security testing and malware prevention.
The dataset includes a large volume of raw network traffic logs, plus pre-processed data in which network connections have been reconstructed and relevant features extracted with CICFlowMeter, a tool that outputs network connection features as CSV files. Each record is labeled either as benign traffic or as one of several types of malicious traffic, for a total of 15 classes.

Starting from this featurized dataset, we have executed additional pre-processing for the purpose of this lab:
<ul>
    <li>Encoded class labels</li>
    <li>Replaced invalid string attribute values generated by CICFlowMeter (e.g. inf and Infinity)</li>
    <li>Executed one hot encoding of discrete attributes</li>
    <li>Removed invalid headers logged multiple times in the same CSV file</li>
    <li>Reduced the size of the featurized dataset to ~1.3GB (from ~6.3GB) to speed-up training, while making sure that all classes are well represented</li>
    <li>Executed stratified random split of the dataset into training (80%) and validation (20%) sets</li>
</ul>
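
The cleaning steps listed above can be sketched with pandas on a toy frame. This is only an illustration, not the code that produced the workshop dataset: the raw column names (`Flow Byts/s`, `Protocol`, `Label`), the two-class label map, and the choice to replace infinite values with 0 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a CICFlowMeter CSV read as strings (names and values are illustrative).
# The third row imitates a header line that was logged again in the middle of the file.
raw = pd.DataFrame({
    "Flow Byts/s": ["12.5", "Infinity", "Flow Byts/s", "inf", "3.0"],
    "Protocol": ["6", "17", "Protocol", "6", "0"],
    "Label": ["Benign", "Bot", "Label", "Benign", "Bot"],
})

# Remove repeated header rows.
clean = raw[raw["Label"] != "Label"].copy()

# Parse the rate column; "inf" and "Infinity" both become float infinity.
# Assumption: infinities are replaced with 0 (the workshop does not state the value used).
clean["Flow Byts/s"] = clean["Flow Byts/s"].astype(float).replace([np.inf, -np.inf], 0.0)

# Encode class labels as integers (hypothetical two-class map for this toy frame).
clean["Target"] = clean["Label"].map({"Benign": 0, "Bot": 1})

# One-hot encode the discrete protocol attribute into prot_0, prot_6, prot_17.
clean["Protocol"] = clean["Protocol"].astype(int)
clean = pd.get_dummies(clean.drop(columns="Label"), columns=["Protocol"], prefix="prot", dtype=int)

print(clean)
```

The resulting `prot_0`, `prot_6`, and `prot_17` columns mirror the one-hot protocol columns that appear at the end of the workshop dataset.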

Classes are represented and have been encoded as follows (train + validation):


| Label                    | Encoded | N. records |
|:-------------------------|:-------:|-----------:|
| Benign                   |    0    |    1000000 |
| Bot                      |    1    |     200000 |
| DoS attacks-GoldenEye    |    2    |      40000 |
| DoS attacks-Slowloris    |    3    |      10000 |
| DDoS attacks-LOIC-HTTP   |    4    |     300000 |
| Infilteration            |    5    |     150000 |
| DDOS attack-LOIC-UDP     |    6    |       1730 |
| DDOS attack-HOIC         |    7    |     300000 |
| Brute Force -Web         |    8    |        611 |
| Brute Force -XSS         |    9    |        230 |
| SQL Injection            |   10    |         87 |
| DoS attacks-SlowHTTPTest |   11    |     100000 |
| DoS attacks-Hulk         |   12    |     250000 |
| FTP-BruteForce           |   13    |     150000 |
| SSH-Bruteforce           |   14    |     150000 |       
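
The counts in the table are strongly imbalanced, which may be one reason a weighted F1 score is tracked alongside accuracy later in the workshop. As a quick sketch, the share of each class can be computed directly from the numbers above (the dictionary simply copies the table).

```python
# Record counts copied from the table above (train + validation).
class_counts = {
    "Benign": 1000000,
    "Bot": 200000,
    "DoS attacks-GoldenEye": 40000,
    "DoS attacks-Slowloris": 10000,
    "DDoS attacks-LOIC-HTTP": 300000,
    "Infilteration": 150000,
    "DDOS attack-LOIC-UDP": 1730,
    "DDOS attack-HOIC": 300000,
    "Brute Force -Web": 611,
    "Brute Force -XSS": 230,
    "SQL Injection": 87,
    "DoS attacks-SlowHTTPTest": 100000,
    "DoS attacks-Hulk": 250000,
    "FTP-BruteForce": 150000,
    "SSH-Bruteforce": 150000,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Print the classes from most to least frequent.
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<26}{share:9.4%}")
```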

The final pre-processed dataset has been saved to a public Amazon S3 bucket for your convenience, and will represent the inputs to the training processes.
<a id='setup'></a>
### Let's get started!

First, we set some variables, including the AWS region we are working in, the IAM (Identity and Access Management) execution role of the notebook instance, and the Amazon S3 bucket where we will store data, models, and outputs. We will use the Amazon SageMaker default bucket for the selected AWS region and define a key prefix so that all objects share the same prefix for easier discoverability.

In [1]:
%pip install jsonlines --quiet
%pip install sagemaker --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import time
import glob
import json
import jsonlines
import base64
import io

import boto3
import sagemaker
from sagemaker.model_monitor import DataCaptureConfig
from sagemaker.sklearn.estimator import SKLearn

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from IPython.display import display, clear_output

pd.options.display.max_columns = 100

region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker.Session().default_bucket()
prefix = 'xgboost-webtraffic'
os.environ["AWS_REGION"] = region

print(f'REGION:  {region}')
print(f'ROLE:    {role}')
print(f'BUCKET:  {bucket_name}')

# These are the classifications that have been encoded as ints; we'll use these for analysis
class_list = ['Benign','Bot','DoS attacks-GoldenEye','DoS attacks-Slowloris','DDoS attacks-LOIC-HTTP','Infilteration','DDOS attack-LOIC-UDP','DDOS attack-HOIC','Brute Force-Web','Brute Force-XSS','SQL Injection','DoS attacks-SlowHTTPTest','DoS attacks-Hulk','FTP-BruteForce','SSH-Bruteforce']

REGION:  us-east-1
ROLE:    arn:aws:iam::278578987671:role/SageMaker-IoTRole
BUCKET:  sagemaker-us-east-1-278578987671


#### Now we can copy the dataset from the public Amazon S3 bucket to the Amazon SageMaker default bucket used in this workshop. To do this, we will use the AWS SDK for Python (boto3) as follows:

In [3]:
s3 = boto3.resource('s3')

source_bucket_name = "endtoendmlapp"
source_bucket_prefix = "shadowmodel/data/"
source_bucket = s3.Bucket(source_bucket_name)

In [4]:


for s3_object in source_bucket.objects.filter(Prefix=source_bucket_prefix):
    copy_source = {
        'Bucket': source_bucket_name,
        'Key': s3_object.key
    }
    # Keep the last folder of the source key and rename '.part' files to '.csv'
    folder, file_name = s3_object.key.split('/')[-2:]
    dest_key = f"{prefix}/data/{folder}/{file_name.replace('.part', '.csv')}"
    print('Copying {0} ...'.format(s3_object.key))
    s3.Bucket(bucket_name).copy(copy_source, dest_key)
    
print(f'Data copy from source bucket, {source_bucket_name}/{source_bucket_prefix}, to destination bucket {bucket_name}/{prefix}/data/, complete!')

Data copy from source bucket, endtoendmlapp/shadowmodel/data/, to destination bucket sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/, complete!
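
The copy loop above renames each object as it goes. As a small sketch, the key mapping can be isolated in a helper and tried without touching S3. The function name `destination_key` and the sample source key are hypothetical; only the mapping rule itself comes from the loop.

```python
def destination_key(source_key: str, prefix: str) -> str:
    """Map a source object key to its key in the workshop bucket.

    Keeps the last folder of the source key and renames '.part' files to '.csv',
    mirroring the copy loop above.
    """
    folder, file_name = source_key.split("/")[-2:]
    return f"{prefix}/data/{folder}/{file_name.replace('.part', '.csv')}"


# Hypothetical source key, shaped like the objects under shadowmodel/data/.
print(destination_key("shadowmodel/data/train/0.part", "xgboost-webtraffic"))
# xgboost-webtraffic/data/train/0.csv
```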


# Data

Let's download some of the data to the notebook to quickly explore the dataset structure:

In [5]:
train_file_path = 's3://' + bucket_name + '/' + prefix + '/data/train/0.csv'
val_file_path = 's3://' + bucket_name + '/' + prefix + '/data/val/'

print(train_file_path)
print(val_file_path)

s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.csv
s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/


In [6]:
!mkdir -p data/train/ data/val/
!aws s3 cp {train_file_path} data/train/ 
!aws s3 cp {val_file_path} data/val/ --recursive

download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.csv to data/train/0.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/4.csv to data/val/4.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/2.csv to data/val/2.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/1.csv to data/val/1.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/3.csv to data/val/3.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/0.csv to data/val/0.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/8.csv to data/val/8.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/5.csv to data/val/5.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/6.csv to data/val/6.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/9.csv to data/val/9.

In [7]:
df = pd.read_csv('data/train/0.csv')
df

Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,0,445,64443,5,4,373,172,140,0,74.600000,70.283711,133,0,43.000000,62.753486,8457.086107,139.658303,8.055375e+03,1.105582e+04,21474,3,64403,1.610075e+04,1.073215e+04,21537,3,64398,2.146600e+04,129.201393,21547,21317,0,0,0,0,112,92,77.587946,62.070357,0,140,54.500000,64.198044,4121.388889,0,0,0,1,0,0,0,0,0.0,60.555556,74.600000,43.000000,0.0,0.0,0.0,0.0,0.0,0.0,5,373,4,172,8192,0,3,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
1,12,80,1527,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1309.757695,1.527000e+03,0.000000e+00,1527,1527,1527,1.527000e+03,0.000000e+00,1527,1527,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,1309.757695,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
2,7,80,5573,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,358.873138,5.573000e+03,0.000000e+00,5573,5573,5573,5.573000e+03,0.000000e+00,5573,5573,0,0.000000e+00,0.000000,0,0,0,0,0,0,40,0,358.873138,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,32738,-1,0,20,0.0,0.0,0,0,0.0,0.0,0,0,21,2,2018,2,0,1,0
3,12,80,44934,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,44.509725,4.493400e+04,0.000000e+00,44934,44934,44934,4.493400e+04,0.000000e+00,44934,44934,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,44.509725,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
4,0,443,60108569,4,2,148,252,74,0,37.000000,42.723920,126,126,126.000000,0.000000,6.654625,0.099819,1.202171e+07,2.677679e+07,59921494,44882,60108569,2.003619e+07,3.454169e+07,59921494,93516,60013670,6.001367e+07,0.000000,60013670,60013670,1,0,0,0,80,40,0.066546,0.033273,0,126,67.714286,51.774235,2680.571429,0,1,0,0,1,0,0,0,0.0,79.000000,37.000000,126.000000,0.0,0.0,0.0,0.0,0.0,0.0,4,148,2,252,257,7010,1,20,93559.0,0.0,93559,93559,59921494.0,0.0,59921494,59921494,20,2,2018,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212208,14,22,298760,21,21,1912,2665,640,0,91.047619,139.679088,976,0,126.904762,263.574639,15319.989289,140.581068,7.286829e+03,2.230386e+04,122248,2,298722,1.493610e+04,3.049364e+04,122248,320,298752,1.493760e+04,33997.036726,126346,7,0,0,0,0,680,680,70.290534,70.290534,0,976,106.441860,207.291869,42969.919158,0,0,0,1,0,0,0,0,1.0,108.976190,91.047619,126.904762,0.0,0.0,0.0,0.0,0.0,0.0,21,1912,21,2665,26883,230,16,32,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0
212209,0,50684,29,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,68965.517241,2.900000e+01,0.000000e+00,29,29,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,20,20,34482.758621,34482.758621,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,172,255,0,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212210,4,80,1639274,3,4,20,964,20,0,6.666667,11.547005,964,0,241.000000,482.000000,600.265727,4.270183,2.732123e+05,6.690496e+05,1638904,5,339,1.695000e+02,2.326381e+02,334,5,1639268,5.464227e+05,946116.600700,1638904,26,0,0,0,0,72,92,1.830078,2.440105,0,964,123.000000,339.887376,115523.428600,0,0,1,1,0,0,0,1,1.0,140.571429,6.666667,241.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212211,5,52848,309,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,6472.491909,3.090000e+02,0.000000e+00,309,309,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,24,20,3236.245955,3236.245955,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,1024,0,0,24,0.0,0.0,0,0,0.0,0.0,0,0,3,1,2018,2,0,1,0


In [None]:
df.isnull().sum()

In [None]:
df.info()

In [8]:
%%time
val_csv_files = glob.glob("./data/val/*.csv")
df_list = (pd.read_csv(file) for file in val_csv_files)
val_df= pd.concat(df_list, ignore_index=True)
val_df

CPU times: user 3.4 s, sys: 213 ms, total: 3.61 s
Wall time: 3.75 s


Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,0,443,116890705,11,9,875,355,517,0,79.545455,154.043087,156,0,39.444444,50.953683,10.522650,0.171100,6.152142e+06,1.830035e+07,58205691,12,116890705,11689070.5,2.448741e+07,58340043,67,116743185,1.459290e+07,2.692019e+07,58300088,52,0,0,0,0,232,192,0.094105,0.076995,0,517,58.571429,116.108816,13481.257143,0,0,0,1,0,0,0,0,0.0,61.500000,79.545455,39.444444,0.0,0.0,0.0,0.0,0.0,0.0,11,875,9,355,8192,176,6,20,288290.0,187099.040088,420589,155991,58083049.5,173441.27261,58205691,57960408,23,2,2018,4,0,1,0
1,5,54653,26,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,76923.076920,2.600000e+01,0.000000e+00,26,26,26,26.0,0.000000e+00,26,26,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,0,76923.076920,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,260,-1,0,20,0.0,0.000000,0,0,0.0,0.00000,0,0,3,1,2018,2,0,1,0
2,1,8080,9422,3,4,326,129,326,0,108.666667,188.216188,112,0,32.250000,53.767245,48291.233280,742.942050,1.570333e+03,3.370254e+03,8438,1,656,328.0,3.196123e+02,554,102,8902,2.967333e+03,4.743364e+03,8438,1,0,0,0,0,72,92,318.403736,424.538315,0,326,56.875000,115.406657,13318.696430,0,0,1,1,0,0,0,1,1.0,65.000000,108.666667,32.250000,0.0,0.0,0.0,0.0,0.0,0.0,3,326,4,129,8192,219,1,20,0.0,0.000000,0,0,0.0,0.00000,0,0,3,2,2018,5,0,1,0
3,4,80,1094264,3,4,20,964,20,0,6.666667,11.547005,964,0,241.000000,482.000000,899.234554,6.396994,1.823773e+05,4.465232e+05,1093839,4,389,194.5,9.687363e+01,263,126,1094258,3.647527e+05,6.314073e+05,1093839,32,0,0,0,0,72,92,2.741569,3.655425,0,964,123.000000,339.887376,115523.428600,0,0,1,1,0,0,0,1,1.0,140.571429,6.666667,241.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.000000,0,0,0.0,0.00000,0,0,20,2,2018,1,0,1,0
4,0,51497,40,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,50000.000000,4.000000e+01,0.000000e+00,40,40,0,0.0,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,20,20,25000.000000,25000.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,509,254,0,20,0.0,0.000000,0,0,0.0,0.00000,0,0,22,2,2018,3,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530527,5,53,1531,1,1,45,86,45,45,45.000000,0.000000,86,86,86.000000,0.000000,85564.990200,1306.335728,1.531000e+03,0.000000e+00,1531,1531,0,0.0,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,8,8,653.167864,653.167864,45,86,58.666667,23.671361,560.333333,0,0,0,0,0,0,0,0,1.0,88.000000,45.000000,86.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,45,1,86,-1,-1,0,8,0.0,0.000000,0,0,0.0,0.00000,0,0,3,1,2018,2,0,0,1
530528,0,51374,2968,5,2,935,266,935,0,187.000000,418.144712,266,0,133.000000,188.090404,404649.595700,2358.490566,4.946667e+02,7.760888e+02,2008,5,2968,742.0,1.121770e+03,2380,9,2008,2.008000e+03,0.000000e+00,2008,2008,0,0,0,0,124,40,1684.636119,673.854447,0,935,150.125000,330.514939,109240.125000,0,0,1,1,0,0,0,1,0.0,171.571429,187.000000,133.000000,0.0,0.0,0.0,0.0,0.0,0.0,5,935,2,266,65535,32768,1,20,0.0,0.000000,0,0,0.0,0.00000,0,0,21,2,2018,2,0,1,0
530529,7,80,16022,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,124.828361,1.602200e+04,0.000000e+00,16022,16022,16022,16022.0,0.000000e+00,16022,16022,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,0,124.828361,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,32738,-1,0,20,0.0,0.000000,0,0,0.0,0.00000,0,0,21,2,2018,2,0,1,0
530530,12,80,2726,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,733.675715,2.726000e+03,0.000000e+00,2726,2726,2726,2726.0,0.000000e+00,2726,2726,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,64,0,733.675715,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.000000,0,0,0.0,0.00000,0,0,16,2,2018,4,0,1,0


In [None]:
val_df

#### Here we set aside some data to evaluate our models once they're deployed

In [9]:
newval_df, holdout = train_test_split(val_df, test_size=.02, random_state=42, stratify=val_df['Target'])
holdout = holdout.dropna()
print(holdout.shape)
print(newval_df.shape)

(10576, 85)
(519921, 85)
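
Passing `stratify` keeps the class proportions of the holdout close to those of the full validation set, which matters here because some classes are rare. A minimal sketch on a made-up three-class frame (the data and the `feature` column are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced 'Target' column (80% / 16% / 4%).
toy = pd.DataFrame({
    "Target": [0] * 1000 + [1] * 200 + [2] * 50,
    "feature": range(1250),
})

# Same call shape as in the cell above: 2% holdout, stratified on the target.
rest, holdout_toy = train_test_split(
    toy, test_size=0.02, random_state=42, stratify=toy["Target"]
)

full_share = toy["Target"].value_counts(normalize=True).sort_index()
holdout_share = holdout_toy["Target"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "holdout": holdout_share}))
```

Without `stratify`, a 2% random sample of this toy frame could easily miss class 2 altogether.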


In [10]:
holdout.to_csv('./data/holdout.csv',index=False)
newval_df.to_csv('./data/newval.csv',index=False)
del val_df, newval_df

Here we upload our validation data for the model training.

In [11]:
val_data_path = f"s3://{bucket_name}/{prefix}/data/newval/newval.csv"
holdout_data_path = f"s3://{bucket_name}/{prefix}/data/newval/holdout.csv"
!aws s3 cp ./data/newval.csv {val_data_path}
!aws s3 cp ./data/holdout.csv {holdout_data_path}

upload: data/newval.csv to s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/newval/newval.csv
upload: data/holdout.csv to s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/newval/holdout.csv


<a id='basic_training'></a>
# Training

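We will run the training with the SageMaker scikit-learn estimator in script mode: each model is trained by its own script in `./code/`, which gives us full control over the training process. We train a histogram-based gradient boosting model first and a random forest second.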


In [12]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/train'.format(bucket_name, prefix), content_type='text/csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/newval/newval.csv'.format(bucket_name, prefix), content_type='text/csv')

## Model 1 - Hist Gradient Boosting

In [13]:
output_path = f's3://{bucket_name}/{prefix}/output/'

FRAMEWORK_VERSION = "1.0-1"

estimator1 = SKLearn(
    entry_point="histgradientboost.py",
    source_dir='./code/',
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="hgbc-scikit",
    metric_definitions=[
        {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
        {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"}
    ],
    output_path=output_path,
)
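
The `metric_definitions` above tell SageMaker which values to pull out of the training logs: each regex is matched against log lines, and the first capture group becomes the metric value. The sketch below applies the same two patterns locally. The log lines and their values are hypothetical, and `extract_metrics` is a helper written for this example, not part of the SageMaker SDK.

```python
import re

metric_definitions = [
    {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
    {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"},
]

# Hypothetical log lines; real values come from the training script's output.
log_lines = [
    "Loading training data",
    "Accuracy is: 0.9731",
    "Weighted F1 Score is: 0.9688",
]


def extract_metrics(lines, definitions):
    """Return {metric name: value} for every definition that matches a line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(log_lines, metric_definitions))
# {'Accuracy': 0.9731, 'WeightedF1': 0.9688}
```

For the metrics to show up, the training script has to print lines in exactly the form the patterns expect.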

In [14]:
estimator1.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: hgbc-scikit-2023-03-31-15-00-34-254


2023-03-31 15:00:34 Starting - Starting the training job...
2023-03-31 15:01:08 Starting - Preparing the instances for training......
2023-03-31 15:01:57 Downloading - Downloading input data...
2023-03-31 15:02:37 Training - Training image download completed. Training in progress...
2023-03-31 15:02:51,627 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-03-31 15:02:51,629 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 15:02:51,638 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-03-31 15:02:51,821 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 15:02:51,832 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 15:02:51,844 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 15:02:51

In [15]:
estimator1.model_data

's3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/output/hgbc-scikit-2023-03-31-15-00-34-254/output/model.tar.gz'

## Model 2 - Sklearn Random Forest


In [16]:
output_path = f's3://{bucket_name}/{prefix}/output/'
output_path

's3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/output/'

In [17]:
output_path = f's3://{bucket_name}/{prefix}/output/'

FRAMEWORK_VERSION = "1.0-1"

estimator2 = SKLearn(
    entry_point="randomforest.py",
    source_dir='./code/',
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[
        {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
        {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"}
    ],
    output_path=output_path,
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 5
    }
)

In [18]:
estimator2.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: rf-scikit-2023-03-31-16-09-48-329


2023-03-31 16:09:49 Starting - Starting the training job...
2023-03-31 16:10:05 Starting - Preparing the instances for training...
2023-03-31 16:10:52 Downloading - Downloading input data...
2023-03-31 16:11:27 Training - Downloading the training image...
2023-03-31 16:11:33 Training - Training image download completed. Training in progress.
2023-03-31 16:11:46,878 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-03-31 16:11:46,881 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 16:11:46,889 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-03-31 16:11:47,081 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 16:11:47,092 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-03-31 16:11:47,105 sagemaker-training-toolkit INFO     No GPUs detecte

In [19]:
estimator2.model_data

's3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/output/rf-scikit-2023-03-31-16-09-48-329/output/model.tar.gz'

#### Note the accuracy, F1 score, and classification report above.

<a id='register'></a>
## Register our models

In [None]:
sm_client = boto3.Session().client('sagemaker')

In [None]:
model_name1 = "PROD-HGB-Webtraffic"
model_name2 = "SHADOW-RF-Webtraffic"

print(f"Prod model name: {model_name1}")
print(f"Shadow model name: {model_name2}")

# Register the production model (histogram gradient boosting)
resp = sm_client.create_model(
    ModelName=model_name1,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": estimator1.training_image_uri(),
        "Mode": "SingleModel",
        "ModelDataUrl": estimator1.model_data,
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_SUBMIT_DIRECTORY": json.loads(estimator1.hyperparameters()['sagemaker_submit_directory']),
            "SAGEMAKER_PROGRAM": json.loads(estimator1.hyperparameters()['sagemaker_program']),
        },
    },
)

# Register the shadow model (random forest)
resp = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": estimator2.training_image_uri(),
        "Mode": "SingleModel",
        "ModelDataUrl": estimator2.model_data,
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_SUBMIT_DIRECTORY": json.loads(estimator2.hyperparameters()['sagemaker_submit_directory']),
            "SAGEMAKER_PROGRAM": json.loads(estimator2.hyperparameters()['sagemaker_program']),
        },
    },
)

In [22]:
print(estimator2.training_image_uri())
print(estimator2.hyperparameters()['sagemaker_submit_directory'])
print(estimator2.hyperparameters()['sagemaker_program'])

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3
"s3://sagemaker-us-east-1-278578987671/rf-scikit-2023-03-31-16-09-48-329/source/sourcedir.tar.gz"
"randomforest.py"


<a id='deploy'></a>
## Deploy!

Let's deploy the first model to a production real-time SageMaker endpoint. This is an HTTPS endpoint that is active 24 hours per day, 7 days per week, and it stays active until we delete it. The predictor returned by `deploy()` for a scikit-learn estimator serializes requests as NumPy arrays by default, so we can pass an array straight to `predict()`. We also enable data capture so that requests and responses are written to S3.

In [None]:
from sagemaker.model_monitor import DataCaptureConfig

data_capture_s3 = f's3://{bucket_name}/{prefix}/datacapture_test/'

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=data_capture_s3
)

In [None]:
predictor = estimator1.deploy(initial_instance_count=3,instance_type='ml.m5.2xlarge', data_capture_config=data_capture_config)

In [None]:
predictor.endpoint_name

## Predict

Here we use the SageMaker predictor object to call our deployed endpoint. We send a set of hardcoded feature values that should be classified as 4, DDoS attacks-LOIC-HTTP, which is the value we expect to see printed.

In [None]:
# single prediction
# We expect 4 - DDoS attacks-LOIC-HTTP as the predicted class.
test_values = [80,1056736,3,4,20,964,20,0,6.666666667,11.54700538,964,0,241.0,482.0,931.1691850999999,6.6241710320000005,176122.6667,431204.4454,1056315,2,394,197.0,275.77164469999997,392,2,1056733,352244.3333,609743.1115,1056315,24,0,0,0,0,72,92,2.8389304419999997,3.78524059,0,964,123.0,339.8873763,115523.4286,0,0,1,1,0,0,0,1,1.0,140.5714286,6.666666667,241.0,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0]
result = predictor.predict(np.array(test_values).reshape(1, -1))
print(result)
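
The endpoint returns the encoded class, so a lookup in `class_list` turns it back into a readable label. A small sketch follows; `label_for` is a helper introduced here for illustration, and `example_result` stands in for the value returned by `predictor.predict()` so that no endpoint is called.

```python
# Same order as the class_list defined in the setup cell.
class_list = ['Benign', 'Bot', 'DoS attacks-GoldenEye', 'DoS attacks-Slowloris',
              'DDoS attacks-LOIC-HTTP', 'Infilteration', 'DDOS attack-LOIC-UDP',
              'DDOS attack-HOIC', 'Brute Force-Web', 'Brute Force-XSS', 'SQL Injection',
              'DoS attacks-SlowHTTPTest', 'DoS attacks-Hulk', 'FTP-BruteForce', 'SSH-Bruteforce']


def label_for(prediction):
    """Return the class name for the first encoded prediction in a batch."""
    return class_list[int(prediction[0])]


# Stand-in for `result`; a real call would return an array such as [4].
example_result = [4]
print(label_for(example_result))
# DDoS attacks-LOIC-HTTP
```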

<a id='shadow'></a>
# Create a Shadow Test 

## Create a Shadow Test using an Existing Endpoint

Now we will create a shadow test using the existing production endpoint. During the test we will send the holdout data we set aside earlier to the endpoint, and we will stop the test through the API later in the notebook. Note that we could also specify the test start and end times when we create the inference experiment. If we don't provide start and end times, the experiment starts immediately and concludes after 7 days. Because we are reusing an existing production endpoint, SageMaker will update that endpoint with the new model variants and, if needed, will also update the inference instance type of the production variant.


In [None]:
shadowtestname = 'ShadowInferenceTestExistingEP'
infexperimentarn = sm_client.create_inference_experiment(
    Name=shadowtestname,
    Type='ShadowMode',
    Description='Shadow inference test created via boto3 python API using an existing EP',
    RoleArn=role,
    EndpointName=predictor.endpoint_name,
    ModelVariants=[
        {
            'ModelName': model_name1,
            'VariantName': 'AllTraffic',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m5.2xlarge',
                    'InstanceCount': 3 
                }
            }
        },
        
        {
            'ModelName': model_name2,
            'VariantName': 'Shadow-01',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m5.2xlarge',
                    'InstanceCount': 3 
                }
            }
        },
    ],
    DataStorageConfig={
        'Destination':data_capture_s3,
    },
    ShadowModeConfig={
        'SourceModelVariantName': 'AllTraffic',
        'ShadowModelVariants': [
            {
                'ShadowModelVariantName': 'Shadow-01',
                'SamplingPercentage': 100
            },
        ]
    },
)   
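
The call above starts the test immediately. To run it in a fixed window instead, the boto3 reference for `create_inference_experiment` describes a `Schedule` argument with `StartTime` and `EndTime`; please check the reference for your boto3 version before relying on it. The sketch below only builds such a dictionary. `build_schedule` and its default values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_schedule(start_in_minutes=10, duration_hours=2):
    """Build a start/end window in UTC for an inference experiment."""
    start = datetime.now(timezone.utc) + timedelta(minutes=start_in_minutes)
    end = start + timedelta(hours=duration_hours)
    return {"StartTime": start, "EndTime": end}


schedule = build_schedule()
print(schedule)

# Assumed usage, not executed here:
# sm_client.create_inference_experiment(..., Schedule=schedule)
```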


In [None]:
shadowtestdescribe = sm_client.describe_inference_experiment(Name=shadowtestname)
shadowtestdescribe

##### We need to wait for the test to be active before we send data

In [None]:
from time import sleep

def wait_until_test_complete(test_name):
    """Poll the shadow test until it is running, or stop early if it failed or was cancelled."""
    print(f'Waiting on shadow test: {test_name}')
    while True:
        # use the test_name argument rather than the global shadowtestname
        shadowtestdescribe = sm_client.describe_inference_experiment(Name=test_name)
        status = shadowtestdescribe["Status"].lower()
        print(f'Status: {status}')
        if status in ('failed', 'cancelled'):
            print("Failure detected. Exiting Loop.")
            print(shadowtestdescribe)
            return
        elif status == 'running':
            print("Shadow test is running! Exiting Loop.")
            return
        sleep(60)

In [None]:
wait_until_test_complete(shadowtestname)
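
The loop above polls forever if the test never reaches a terminal or running state. One possible variation, sketched here under our own assumptions, adds a timeout and takes the describe call as an argument so it can be tried without AWS. `wait_for_status`, its parameters, and the canned status sequence are all hypothetical.

```python
import time


def wait_for_status(describe, target="running", failed=("failed", "cancelled"),
                    poll_seconds=60, timeout_seconds=1800, sleep=time.sleep):
    """Poll describe() until the target or a failure status appears, or time runs out."""
    waited = 0
    while True:
        status = describe()["Status"].lower()
        if status == target or status in failed:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for sm_client.describe_inference_experiment: replays a canned sequence.
fake_statuses = iter(["Creating", "Starting", "Running"])
final = wait_for_status(lambda: {"Status": next(fake_statuses)}, sleep=lambda seconds: None)
print(final)
# running
```

With a real client, the first argument would be something like `lambda: sm_client.describe_inference_experiment(Name=shadowtestname)`.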

In [None]:
sm_client.describe_inference_experiment(Name=shadowtestname)

## Simulate Production Traffic

We will now simulate production traffic by looping over the holdout data and sending each row to the endpoint. In a real production use case you won't need to do this, since actual production data will already be flowing to the production endpoint. Because our shadow test is now active, both the production variant and the shadow variant will receive each inference input. Only the production output is returned in the response; however, since we configured the test to capture data, both the production and shadow variant responses are recorded in S3.


In [None]:
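%%time
# this should take ~ 2 minutes to complete
indexes = []
actuals = []
i = 0
for index, row in holdout.iterrows():
    vals = row.to_numpy()
    # Column 0 is the label; the remaining columns are the features.
    # The inference_id lets us match captured predictions back to this row.
    prediction = predictor.predict(vals[1:].reshape(1, -1), inference_id=f'shadow test, index {index}')
    actuals.append(vals[0])
    indexes.append(index)

    # Print progress every 1000 requests
    i += 1
    if i % 1000 == 0:
        print(i)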
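
The loop above tags every request with an inference ID of the form `shadow test, index <row index>`, and the row index is later recovered from the last space-separated token. A small sketch of that round trip follows; the helper names `make_inference_id` and `parse_inference_id` are hypothetical and introduced only for illustration.

```python
def make_inference_id(index, prefix="shadow test, index"):
    """Build the inference ID string attached to one request."""
    return f"{prefix} {index}"


def parse_inference_id(inference_id):
    """Recover the integer row index from an inference ID.

    Returns None when the last token is not an integer.
    """
    token = inference_id.rsplit(" ", 1)[-1]
    try:
        return int(token)
    except ValueError:
        return None


ids = [make_inference_id(i) for i in (0, 17, 4021)]
recovered = [parse_inference_id(s) for s in ids]
print(recovered)
```

Returning `None` for an unparseable ID lets the caller skip that record explicitly instead of relying on an exception handler.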

Since our shadow test was running when we sent the data to our endpoint, we can get the shadow model's predictions from S3.

In [None]:
storage = shadowtestdescribe['DataStorageConfig']['Destination']+predictor.endpoint_name +'/'
storage

In [None]:
!aws s3 ls {storage}

#### Now let's copy the captured data from S3 to the local EFS connected to SageMaker Studio

In [None]:
!mkdir -p ./data/datacapture/

In [None]:
!aws s3 cp {storage} ./data/datacapture/  --recursive

##### Both the shadow and the production variants captured data during the test. The data is saved in JSON Lines format. Because we included an inference_id with each inference request, we can use it to match the captured records to the holdout dataset. Our goal is to evaluate the performance of the production and shadow models we deployed to the endpoint.
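
Each line of a capture file is one JSON object with a base64-encoded request and response. The sketch below decodes a single record using only the fields that the parsing cells in this notebook read (`captureData.endpointInput.data`, `captureData.endpointOutput.data` and `eventMetadata.inferenceId`). The record is synthetic and its payload bytes are placeholders; real records carry additional fields, and in this notebook the decoded bytes are NumPy `.npy` data that is then passed to `np.load(io.BytesIO(...))`. The function name `decode_capture_line` is hypothetical.

```python
import base64
import json


def decode_capture_line(line):
    """Parse one JSON Lines record into (inference_id, input_bytes, output_bytes)."""
    record = json.loads(line)
    capture = record["captureData"]
    inference_id = record["eventMetadata"].get("inferenceId")
    raw_in = base64.b64decode(capture["endpointInput"]["data"])
    raw_out = base64.b64decode(capture["endpointOutput"]["data"])
    return inference_id, raw_in, raw_out


# Synthetic record for illustration only; payload bytes are placeholders.
sample = json.dumps({
    "captureData": {
        "endpointInput": {"data": base64.b64encode(b"input-bytes").decode("ascii")},
        "endpointOutput": {"data": base64.b64encode(b"output-bytes").decode("ascii")},
    },
    "eventMetadata": {"inferenceId": "shadow test, index 42"},
})
decoded = decode_capture_line(sample)
print(decoded)
```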

In [None]:
shadowfiles = glob.glob('./data/datacapture/Shadow-01/**/*.jsonl',recursive=True)
prodfiles = glob.glob('./data/datacapture/AllTraffic/**/*.jsonl',recursive=True)

In [None]:
print(len(shadowfiles),len(prodfiles))

In [None]:
shadowin = []
shadowout = []
shadowid = []

for f in shadowfiles:
    print(f)
    with jsonlines.open(f) as reader:
        for obj in reader:
            try:
                # The row index is the last token of the inference ID
                infid = obj['eventMetadata']['inferenceId'].split(' ')
                rowid = int(infid[-1])

                # input to model
                model_input = base64.b64decode(obj['captureData']['endpointInput']['data'])
                features = np.load(io.BytesIO(model_input))[0].tolist()

                # output from model
                model_output = base64.b64decode(obj['captureData']['endpointOutput']['data'])
                pred = np.load(io.BytesIO(model_output))[0]
            except Exception:
                # Skip records that are missing fields or cannot be decoded
                continue

            # Append only after the whole record parsed, so the lists stay aligned
            shadowid.append(rowid)
            shadowin.append(features)
            shadowout.append(pred)

In [None]:
shadowdf = pd.DataFrame(data=shadowout,index=shadowid,columns=['Shadow'])

In [None]:
shadowdf

In [None]:
shadowdf['Shadow'] = pd.to_numeric(shadowdf['Shadow'])
shadowdf['Shadow'] = shadowdf['Shadow'].astype(int)

In [None]:
shadowdf = pd.merge(shadowdf,holdout['Target'],left_index=True,right_index=True)
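
`pd.merge` performs an inner join by default, so any holdout row without a captured prediction silently drops out of the evaluation. A minimal sketch of a coverage check that could be run before merging is shown below. The function name `capture_coverage` is hypothetical; in this notebook `sent_ids` would correspond to the holdout index and `captured_ids` to `shadowid`.

```python
from collections import Counter


def capture_coverage(sent_ids, captured_ids):
    """Compare the IDs that were sent with the IDs found in the capture files.

    Returns (missing, duplicated): IDs sent but never captured, and IDs
    captured more than once.
    """
    counts = Counter(captured_ids)
    missing = sorted(set(sent_ids) - set(counts))
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    return missing, duplicated


missing, duplicated = capture_coverage(sent_ids=[0, 1, 2, 3, 4],
                                       captured_ids=[0, 1, 1, 3])
print(missing, duplicated)
```

Duplicated IDs matter as well, because a repeated index would add extra rows to the merged frame.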

In [None]:
acc = accuracy_score(shadowdf['Target'],shadowdf['Shadow'])
wf1 = f1_score(shadowdf['Target'],shadowdf['Shadow'],average='weighted')
print(acc, wf1)
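
The weighted F1 score reported above is the mean of the per-class F1 scores, with each class weighted by its number of true samples. The plain-Python sketch below spells out that computation for illustration. It is not the scikit-learn implementation, and the function name `weighted_f1` is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when undefined
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


example_score = weighted_f1(y_true=[0, 0, 0, 1, 1, 2],
                            y_pred=[0, 0, 1, 1, 1, 2])
print(round(example_score, 4))
```

Because the weights follow class frequency, a rare attack class has little influence on this number, which is why the per-class report is also worth reading.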

In [None]:
print(classification_report(shadowdf['Target'],shadowdf['Shadow']))

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
cm = confusion_matrix(shadowdf['Target'],shadowdf['Shadow'])
normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
clist = [class_list[i] for i in np.sort(shadowdf['Target'].unique())]
sns.heatmap(normalized_cm, ax=ax, annot=cm, fmt='',xticklabels=clist,yticklabels=clist)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Shadow Endpoint Confusion Matrix')
plt.show()
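
The heatmap above is colored by the row-normalized confusion matrix (each row divided by its number of true samples) and annotated with the raw counts. The sketch below shows the normalization in plain Python on a made-up matrix, including the case of a row that sums to zero, which would otherwise cause a division by zero. The function name `row_normalize` is hypothetical.

```python
def row_normalize(cm):
    """Divide each row by its sum; rows that sum to zero stay all zeros."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Made-up counts; the third class never appears as a true label.
counts = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 0]]
normalized = row_normalize(counts)
for row in normalized:
    print(row)
```

After row normalization the diagonal entries are the per-class recall values.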

In [None]:
%%time

prodin = []
prodout = []
prodid = []

for f in prodfiles:
    print(f)
    with jsonlines.open(f) as reader:
        for obj in reader:
            try:
                # The row index is the last token of the inference ID
                infid = obj['eventMetadata']['inferenceId'].split(' ')
                rowid = int(infid[-1])

                # input to model
                model_input = base64.b64decode(obj['captureData']['endpointInput']['data'])
                features = np.load(io.BytesIO(model_input))[0].tolist()

                # output from model
                model_output = base64.b64decode(obj['captureData']['endpointOutput']['data'])
                pred = np.load(io.BytesIO(model_output))[0]
            except Exception:
                # Skip records that are missing fields or cannot be decoded
                continue

            # Append only after the whole record parsed, so the lists stay aligned
            prodid.append(rowid)
            prodin.append(features)
            prodout.append(pred)

In [None]:
proddf = pd.DataFrame(data=prodout,index=prodid,columns=['Prod'])

In [None]:
proddf

In [None]:
proddf['Prod'] = pd.to_numeric(proddf['Prod'])
proddf['Prod'] = proddf['Prod'].astype(int)

In [None]:
# Line up our production model predictions with the true value based on the index
proddf = pd.merge(proddf,holdout['Target'],left_index=True,right_index=True)

In [None]:
acc = accuracy_score(proddf['Target'],proddf['Prod'])
wf1 = f1_score(proddf['Target'],proddf['Prod'],average='weighted')
print(acc, wf1)

In [None]:
print(classification_report(proddf['Target'],proddf['Prod']))

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
cm = confusion_matrix(proddf['Target'],proddf['Prod'])
normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
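# Label only the classes present in the targets, as in the shadow model plot
prod_clist = [class_list[i] for i in np.sort(proddf['Target'].unique())]
sns.heatmap(normalized_cm, ax=ax, annot=cm, fmt='',xticklabels=prod_clist,yticklabels=prod_clist)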
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Production Endpoint Confusion Matrix')
plt.show()

##### We can see that our shadow model performs slightly better on our production data than the production model, so let's promote the shadow to production. Note that the promotion takes a few minutes, during which the production model stays active.
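
Because both models scored the same requests, the comparison can also be made request by request rather than only through aggregate metrics. The sketch below counts how often each model is right where the other is wrong. The function name `paired_outcomes` and the example values are hypothetical; in this notebook the inputs would come from aligning `proddf` and `shadowdf` on their shared index.

```python
def paired_outcomes(y_true, prod_pred, shadow_pred):
    """Count per-request outcomes for two models scored on the same requests."""
    counts = {"both_right": 0, "only_prod_right": 0,
              "only_shadow_right": 0, "both_wrong": 0}
    for t, p, s in zip(y_true, prod_pred, shadow_pred):
        if p == t and s == t:
            counts["both_right"] += 1
        elif p == t:
            counts["only_prod_right"] += 1
        elif s == t:
            counts["only_shadow_right"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


# Made-up labels and predictions for illustration.
result = paired_outcomes(y_true=[0, 1, 2, 1, 0],
                         prod_pred=[0, 1, 1, 0, 0],
                         shadow_pred=[0, 1, 2, 0, 1])
print(result)
```

A small accuracy gap is more convincing when `only_shadow_right` clearly exceeds `only_prod_right` than when the two counts are close.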

In [None]:
sm_client.stop_inference_experiment(
    Name=shadowtestname,
    ModelVariantActions={
        'Shadow-01': 'Promote',
        'AllTraffic': 'Remove'
    },
    DesiredState='Completed',
    Reason='Shadow variant performed better in validation'
)

In [None]:
# Here we show that the shadow model is now deployed to production
sm_client.describe_endpoint(EndpointName = predictor.endpoint_name)
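
The `describe_endpoint` response is a large dictionary. The sketch below pulls out only the endpoint status and the variant names, which is enough to see which variants are serving once the promotion finishes. It assumes the response keys `EndpointStatus`, `ProductionVariants` and `ShadowProductionVariants`; the function name `summarize_variants` is hypothetical, and the example dictionary is hand-written with illustrative values rather than real output.

```python
def summarize_variants(endpoint_description):
    """Return (status, production variant names, shadow variant names)."""
    prod = [v["VariantName"]
            for v in endpoint_description.get("ProductionVariants", [])]
    shadow = [v["VariantName"]
              for v in endpoint_description.get("ShadowProductionVariants", [])]
    return endpoint_description.get("EndpointStatus"), prod, shadow


# Hand-written stand-in for a describe_endpoint response (illustrative values).
example = {
    "EndpointStatus": "InService",
    "ProductionVariants": [{"VariantName": "Shadow-01"}],
}
summary = summarize_variants(example)
print(summary)
```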

##### Finally, let's gracefully stop the deployed endpoint.

## Clean Up

In [None]:
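def wait_until_complete(test_name):
    """Poll the shadow test until it has stopped and can be deleted."""
    print(f'Waiting on shadow test: {test_name}')
    while True:
        # Describe the test passed in, not the global shadowtestname
        shadowtestdescribe = sm_client.describe_inference_experiment(Name=test_name)
        status = shadowtestdescribe["Status"].lower()
        print(f'Status: {status}')
        # A cancelled test has also stopped; without this check the loop would never exit
        if status in ("completed", "cancelled"):
            print("Shadow test is stopped, ok to delete. Exiting Loop.")
            return
        sleep(60)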

In [None]:
wait_until_complete(shadowtestname)

In [None]:
#predictor.delete_endpoint()
sm_client.delete_inference_experiment(
    Name=shadowtestname
)
sm_client.delete_endpoint(EndpointName=predictor.endpoint_name)

# References

* A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) - https://registry.opendata.aws/cse-cic-ids2018/
* AIM362 - Re:Invent 2019 SageMaker Debugger and Model Monitor - https://github.com/aws-samples/reinvent2019-aim362-sagemaker-debugger-model-monitor