# Introduction

In this workshop, we walk through training and deploying a **network traffic classification model**. We train two versions of the model and deploy them as a production variant and a shadow variant. We then use SageMaker Shadow Tests to manage the comparison between the production and shadow variants.

## Contents

1) [Setup](#setup)
2) [Basic Training](#basic_training)
3) [Register the Models](#register)
4) [Create Endpoint Config](#create_endpoint)
5) [Deploy and Predict](#deploy)
6) [Create a Shadow Test](#shadow)
7) [Evaluate](#eval)

To train our model, we use the <a href="https://registry.opendata.aws/cse-cic-ids2018/">CSE-CIC-IDS2018</a> dataset by CIC and ISCX, which is used for security testing and malware prevention.
The dataset includes a large volume of raw network traffic logs, plus pre-processed data in which network connections have been reconstructed and relevant features extracted using CICFlowMeter, a tool that outputs network connection features as CSV files. Each record is labeled either as benign traffic or as one of 14 types of malicious traffic, for a total of 15 classes.

Starting from this featurized dataset, we have executed additional pre-processing for the purpose of this lab:
<ul>
    <li>Encoded class labels</li>
    <li>Replaced invalid string attribute values generated by CICFlowMeter (e.g. inf and Infinity)</li>
    <li>Executed one hot encoding of discrete attributes</li>
    <li>Removed invalid headers logged multiple times in the same CSV file</li>
    <li>Reduced the size of the featurized dataset to ~1.3GB (from ~6.3GB) to speed-up training, while making sure that all classes are well represented</li>
    <li>Executed stratified random split of the dataset into training (80%) and validation (20%) sets</li>
</ul>
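
The pre-processing code itself is not part of this notebook. As a rough illustration only, two of the steps above (replacing invalid values such as inf and Infinity, and the stratified 80/20 split) could be sketched with the standard library as follows. The function names `clean_value` and `stratified_split` are hypothetical and are not the code used to build the dataset.

```python
import math
import random
from collections import defaultdict


def clean_value(raw):
    """Parse a CSV field; map 'inf', 'Infinity', 'NaN' and non-numeric text to None."""
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_fraction))
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(val_idx)


# Toy example: 10 records of class 0 and 5 records of class 1.
labels = [0] * 10 + [1] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
print(clean_value("Infinity"), clean_value("3.5"))
```

Splitting per class, rather than over the whole dataset at once, is what keeps very rare classes present in both sets.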

Classes are represented and encoded as follows (train + validation):


| Label                    | Encoded | N. records |
|:-------------------------|:-------:|-----------:|
| Benign                   |    0    |    1000000 |
| Bot                      |    1    |     200000 |
| DoS attacks-GoldenEye    |    2    |      40000 |
| DoS attacks-Slowloris    |    3    |      10000 |
| DDoS attacks-LOIC-HTTP   |    4    |     300000 |
| Infilteration            |    5    |     150000 |
| DDOS attack-LOIC-UDP     |    6    |       1730 |
| DDOS attack-HOIC         |    7    |     300000 |
| Brute Force -Web         |    8    |        611 |
| Brute Force -XSS         |    9    |        230 |
| SQL Injection            |   10    |         87 |
| DoS attacks-SlowHTTPTest |   11    |     100000 |
| DoS attacks-Hulk         |   12    |     250000 |
| FTP-BruteForce           |   13    |     150000 |
| SSH-Bruteforce           |   14    |     150000 |       
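
The record counts above are heavily imbalanced. The short sketch below uses only the numbers from the table to compute each class's share of the dataset, which helps explain why a class-weighted metric is useful when comparing models later on.

```python
# Record counts copied from the table above (train + validation).
counts = {
    "Benign": 1000000,
    "Bot": 200000,
    "DoS attacks-GoldenEye": 40000,
    "DoS attacks-Slowloris": 10000,
    "DDoS attacks-LOIC-HTTP": 300000,
    "Infilteration": 150000,
    "DDOS attack-LOIC-UDP": 1730,
    "DDOS attack-HOIC": 300000,
    "Brute Force -Web": 611,
    "Brute Force -XSS": 230,
    "SQL Injection": 87,
    "DoS attacks-SlowHTTPTest": 100000,
    "DoS attacks-Hulk": 250000,
    "FTP-BruteForce": 150000,
    "SSH-Bruteforce": 150000,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
rarest = min(counts, key=counts.get)

# Print classes from most to least frequent with their percentage share.
for label, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:<26}{n:>9}  {100 * n / total:6.3f}%")
print(f"Total records: {total}, rarest class: {rarest}")
```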

The final pre-processed dataset has been saved to a public Amazon S3 bucket for your convenience, and will represent the inputs to the training processes.
<a id='setup'></a>
### Let's get started!

First, we set some variables, including the AWS region we are working in, the IAM (Identity and Access Management) execution role of the notebook instance, and the Amazon S3 bucket where we will store data, models, outputs, etc. We will use the Amazon SageMaker default bucket for the selected AWS region, and then define a key prefix so that all objects share the same prefix for easier discoverability.

In [22]:
import os
import boto3
import sagemaker
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from IPython.display import display, clear_output
from sagemaker.sklearn.estimator import SKLearn
import pandas as pd
import numpy as np
import time

pd.options.display.max_columns = 100

region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'xgboost-webtraffic'
os.environ["AWS_REGION"] = region

print(f'REGION:  {region}')
print(f'ROLE:    {role}')
print(f'BUCKET:  {bucket_name}')


REGION:  us-east-1
ROLE:    arn:aws:iam::278578987671:role/SageMaker-IoTRole
BUCKET:  sagemaker-us-east-1-278578987671


Now we can copy the dataset from the public Amazon S3 bucket to the Amazon SageMaker default bucket used in this workshop. To do this, we use the AWS SDK for Python (boto3) as follows:

In [23]:
s3 = boto3.resource('s3')

source_bucket_name = "endtoendmlapp"
source_bucket_prefix = "aim362/data/"
source_bucket = s3.Bucket(source_bucket_name)

for s3_object in source_bucket.objects.filter(Prefix=source_bucket_prefix):
    copy_source = {
        'Bucket': source_bucket_name,
        'Key': s3_object.key
    }
    print('Copying {0} ...'.format(s3_object.key))
    s3.Bucket(bucket_name).copy(copy_source, prefix+'/data/'+s3_object.key.split('/')[-2]+'/'+s3_object.key.split('/')[-1])
    
print(f'Data copy from source bucket, {source_bucket_name}/{source_bucket_prefix}, to destination bucket {bucket_name}/{prefix}/data/, complete!')

Copying aim362/data/train/0.part ...
Copying aim362/data/train/1.part ...
Copying aim362/data/train/2.part ...
Copying aim362/data/train/3.part ...
Copying aim362/data/train/4.part ...
Copying aim362/data/train/5.part ...
Copying aim362/data/train/6.part ...
Copying aim362/data/train/7.part ...
Copying aim362/data/train/8.part ...
Copying aim362/data/train/9.part ...
Copying aim362/data/val/0.part ...
Copying aim362/data/val/1.part ...
Copying aim362/data/val/2.part ...
Copying aim362/data/val/3.part ...
Copying aim362/data/val/4.part ...
Copying aim362/data/val/5.part ...
Copying aim362/data/val/6.part ...
Copying aim362/data/val/7.part ...
Copying aim362/data/val/8.part ...
Copying aim362/data/val/9.part ...
Data copy from source bucket, endtoendmlapp/aim362/data/, to destination bucket sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/, complete!
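
The destination key in the copy loop above is built from the last two segments of the source key. A small stand-alone sketch of that mapping is shown below; the helper name `destination_key` is hypothetical and only mirrors the string logic used in the loop.

```python
def destination_key(source_key, prefix):
    """Map a source key such as 'aim362/data/train/0.part' to '<prefix>/data/train/0.part'."""
    parts = source_key.split('/')
    # parts[-2] is the split name (train or val), parts[-1] is the file name.
    return f"{prefix}/data/{parts[-2]}/{parts[-1]}"


print(destination_key("aim362/data/train/0.part", "xgboost-webtraffic"))
print(destination_key("aim362/data/val/9.part", "xgboost-webtraffic"))
```

Note that this mapping assumes every listed key has at least two path segments and ends with a file name.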


Let's download some of the data to the notebook to quickly explore the dataset structure:

# Data

TODO - USE SOME OF THE VALIDATION DATA AS PRODUCTION (HOLD OUT DATA)

In [24]:
train_file_path = 's3://' + bucket_name + '/' + prefix + '/data/train/0.part'
val_file_path = 's3://' + bucket_name + '/' + prefix + '/data/val/0.part'

print(train_file_path)
print(val_file_path)

s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.part
s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/0.part


In [25]:
!mkdir -p data/train/ data/val/
!aws s3 cp {train_file_path} data/train/ 
!aws s3 cp {val_file_path} data/val/ 

download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.part to data/train/0.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/0.part to data/val/0.part


In [26]:
df = pd.read_csv('data/train/0.part')
df

Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,0,445,64443,5,4,373,172,140,0,74.600000,70.283711,133,0,43.000000,62.753486,8457.086107,139.658303,8.055375e+03,1.105582e+04,21474,3,64403,1.610075e+04,1.073215e+04,21537,3,64398,2.146600e+04,129.201393,21547,21317,0,0,0,0,112,92,77.587946,62.070357,0,140,54.500000,64.198044,4121.388889,0,0,0,1,0,0,0,0,0.0,60.555556,74.600000,43.000000,0.0,0.0,0.0,0.0,0.0,0.0,5,373,4,172,8192,0,3,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
1,12,80,1527,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1309.757695,1.527000e+03,0.000000e+00,1527,1527,1527,1.527000e+03,0.000000e+00,1527,1527,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,1309.757695,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
2,7,80,5573,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,358.873138,5.573000e+03,0.000000e+00,5573,5573,5573,5.573000e+03,0.000000e+00,5573,5573,0,0.000000e+00,0.000000,0,0,0,0,0,0,40,0,358.873138,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,32738,-1,0,20,0.0,0.0,0,0,0.0,0.0,0,0,21,2,2018,2,0,1,0
3,12,80,44934,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,44.509725,4.493400e+04,0.000000e+00,44934,44934,44934,4.493400e+04,0.000000e+00,44934,44934,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,44.509725,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
4,0,443,60108569,4,2,148,252,74,0,37.000000,42.723920,126,126,126.000000,0.000000,6.654625,0.099819,1.202171e+07,2.677679e+07,59921494,44882,60108569,2.003619e+07,3.454169e+07,59921494,93516,60013670,6.001367e+07,0.000000,60013670,60013670,1,0,0,0,80,40,0.066546,0.033273,0,126,67.714286,51.774235,2680.571429,0,1,0,0,1,0,0,0,0.0,79.000000,37.000000,126.000000,0.0,0.0,0.0,0.0,0.0,0.0,4,148,2,252,257,7010,1,20,93559.0,0.0,93559,93559,59921494.0,0.0,59921494,59921494,20,2,2018,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212208,14,22,298760,21,21,1912,2665,640,0,91.047619,139.679088,976,0,126.904762,263.574639,15319.989289,140.581068,7.286829e+03,2.230386e+04,122248,2,298722,1.493610e+04,3.049364e+04,122248,320,298752,1.493760e+04,33997.036726,126346,7,0,0,0,0,680,680,70.290534,70.290534,0,976,106.441860,207.291869,42969.919158,0,0,0,1,0,0,0,0,1.0,108.976190,91.047619,126.904762,0.0,0.0,0.0,0.0,0.0,0.0,21,1912,21,2665,26883,230,16,32,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0
212209,0,50684,29,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,68965.517241,2.900000e+01,0.000000e+00,29,29,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,20,20,34482.758621,34482.758621,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,172,255,0,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212210,4,80,1639274,3,4,20,964,20,0,6.666667,11.547005,964,0,241.000000,482.000000,600.265727,4.270183,2.732123e+05,6.690496e+05,1638904,5,339,1.695000e+02,2.326381e+02,334,5,1639268,5.464227e+05,946116.600700,1638904,26,0,0,0,0,72,92,1.830078,2.440105,0,964,123.000000,339.887376,115523.428600,0,0,1,1,0,0,0,1,1.0,140.571429,6.666667,241.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212211,5,52848,309,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,6472.491909,3.090000e+02,0.000000e+00,309,309,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,24,20,3236.245955,3236.245955,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,1024,0,0,24,0.0,0.0,0,0,0.0,0.0,0,0,3,1,2018,2,0,1,0


In [27]:
df.shape

(212213, 85)

In [41]:
val_df = pd.read_csv('data/val/0.part')
val_df

Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,1,8080,11083,3,4,326,129,326,0,108.666667,188.216188,112,0,32.25,53.767245,41053.866280,631.597943,1.847167e+03,4098.995165,10207,32,525,262.5,293.449314,470,55,10645,3548.333333,5769.573410,10207,33,0,0,0,0,72,92,270.684833,360.913110,0,326,56.875,115.406657,13318.69643,0,0,1,1,0,0,0,1,1.0,65.000000,108.666667,32.25,0.0,0.0,0.0,0.0,0.0,0.0,3,326,4,129,8192,219,1,20,0.0,0.0,0,0,0.0,0.000000,0,0,3,2,2018,5,0,1,0
1,7,80,15521,2,0,0,0,0,0,0.000000,0.000000,0,0,0.00,0.000000,0.000000,128.857677,1.552100e+04,0.000000,15521,15521,15521,15521.0,0.000000,15521,15521,0,0.000000,0.000000,0,0,0,0,0,0,40,0,128.857677,0.000000,0,0,0.000,0.000000,0.00000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.00,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,32738,-1,0,20,0.0,0.0,0,0,0.0,0.000000,0,0,21,2,2018,2,0,1,0
2,0,0,112639437,3,0,0,0,0,0,0.000000,0.000000,0,0,0.00,0.000000,0.000000,0.026634,5.631972e+07,16.263456,56319730,56319707,112639437,56319718.5,16.263456,56319730,56319707,0,0.000000,0.000000,0,0,0,0,0,0,0,0,0.026634,0.000000,0,0,0.000,0.000000,0.00000,0,0,0,0,0,0,0,0,0.0,0.000000,0.000000,0.00,0.0,0.0,0.0,0.0,0.0,0.0,3,0,0,0,-1,-1,0,0,0.0,0.0,0,0,56319718.5,16.263456,56319730,56319707,20,2,2018,1,1,0,0
3,4,80,1314236,3,4,20,964,20,0,6.666667,11.547005,964,0,241.00,482.000000,748.723974,5.326288,2.190393e+05,536380.293600,1313921,1,285,142.5,200.111219,284,1,1314233,438077.666700,758502.587400,1313921,27,0,0,0,0,72,92,2.282695,3.043593,0,964,123.000,339.887376,115523.42860,0,0,1,1,0,0,0,1,1.0,140.571429,6.666667,241.00,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.000000,0,0,20,2,2018,1,0,1,0
4,0,55882,85914380,2,0,0,0,0,0,0.000000,0.000000,0,0,0.00,0.000000,0.000000,0.023279,8.591438e+07,0.000000,85914380,85914380,85914380,85914380.0,0.000000,85914380,85914380,0,0.000000,0.000000,0,0,0,0,0,0,40,0,0.023279,0.000000,0,0,0.000,0.000000,0.00000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.00,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,62561,-1,0,20,0.0,0.0,0,0,85914380.0,0.000000,85914380,85914380,23,2,2018,4,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53049,7,80,1309,3,4,313,935,313,0,104.333333,180.710634,935,0,233.75,467.500000,953399.541600,5347.593583,2.181667e+02,285.431895,731,6,1080,540.0,270.114790,731,349,1303,434.333333,567.215421,1081,21,0,0,0,0,72,92,2291.825821,3055.767762,0,935,156.000,333.275090,111072.28570,0,0,1,1,0,0,0,1,1.0,178.285714,104.333333,233.75,0.0,0.0,0.0,0.0,0.0,0.0,3,313,4,935,65535,219,1,20,0.0,0.0,0,0,0.0,0.000000,0,0,21,2,2018,2,0,1,0
53050,0,53,69062,2,2,76,418,38,38,38.000000,0.000000,209,209,209.00,0.000000,7152.992963,57.918971,2.302067e+04,24356.658891,48524,1,20537,20537.0,0.000000,20537,20537,1,1.000000,0.000000,1,1,0,0,0,0,16,16,28.959486,28.959486,38,209,106.400,93.660557,8772.30000,0,0,0,0,0,0,0,0,1.0,133.000000,38.000000,209.00,0.0,0.0,0.0,0.0,0.0,0.0,2,76,2,418,-1,-1,1,8,0.0,0.0,0,0,0.0,0.000000,0,0,22,2,2018,3,0,0,1
53051,7,80,1253,3,4,249,935,249,0,83.000000,143.760217,935,0,233.75,467.500000,944932.162800,5586.592179,2.088333e+02,246.961063,612,6,984,492.0,169.705627,612,372,1247,415.666667,506.869148,987,20,0,0,0,0,72,92,2394.253791,3192.338388,0,935,148.000,329.717195,108713.42860,0,0,1,1,0,0,0,1,1.0,169.142857,83.000000,233.75,0.0,0.0,0.0,0.0,0.0,0.0,3,249,4,935,65535,219,1,20,0.0,0.0,0,0,0.0,0.000000,0,0,21,2,2018,2,0,1,0
53052,0,53,1237,1,1,32,128,32,32,32.000000,0.000000,128,128,128.00,0.000000,129345.189976,1616.814875,1.237000e+03,0.000000,1237,1237,0,0.0,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,8,8,808.407437,808.407437,32,128,64.000,55.425626,3072.00000,0,0,0,0,0,0,0,0,1.0,96.000000,32.000000,128.00,0.0,0.0,0.0,0.0,0.0,0.0,1,32,1,128,-1,-1,0,8,0.0,0.0,0,0,0.0,0.000000,0,0,20,2,2018,1,0,0,1


In [43]:
from sklearn.model_selection import train_test_split
holdout, val_df = train_test_split(val_df, test_size=.2, random_state=42)
print(holdout.shape)
print(val_df.shape)

<a id='basic_training'></a>
# Training

We will run the training using the built-in XGBoost algorithm. Note that you can also use script mode if you need greater customization of the training process.


In [120]:
container = sagemaker.image_uris.retrieve('xgboost',region,version='1.0-1')

print(container)

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3


In [121]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/train'.format(bucket_name, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/val'.format(bucket_name, prefix), content_type='csv')

## Model 1 - XGBoost

In [122]:
hyperparameters = {
    "max_depth": "3",
    "eta": "0.1",
    "gamma": "6",
    "min_child_weight": "6",
    "objective": "multi:softmax",
    "num_class": "15",
    "num_round": "10"
}

output_path = f's3://{bucket_name}/{prefix}/output/'

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)



In [123]:
estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

2023-01-23 17:55:26 Starting - Starting the training job...
2023-01-23 17:55:50 Starting - Preparing the instances for trainingProfilerReport-1674496525: InProgress
......
2023-01-23 17:56:51 Downloading - Downloading input data......
2023-01-23 17:57:51 Training - Training image download completed. Training in progress.
[2023-01-23 17:57:41.240 ip-10-0-158-250.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined deli

## Model 2 - Sklearn Random Forest

In [135]:
import glob

In [144]:
# Read every locally downloaded .part file into a single DataFrame.
# Note: this globs ./data/val/ (the validation parts), not ./data/train/.
train_csv_files = glob.glob("./data/val/*.part")
print(train_csv_files)
df_list = (pd.read_csv(file) for file in train_csv_files)
train_df = pd.concat(df_list, axis=0, ignore_index=True)

['./data/val/0-Copy1.part', './data/val/0.part']


In [158]:
train_df.columns

Index(['Target', 'Dst Port', 'Flow Duration', 'Tot Fwd Pkts', 'Tot Bwd Pkts',
       'TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Max',
       'Fwd Pkt Len Min', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Std',
       'Bwd Pkt Len Max', 'Bwd Pkt Len Min', 'Bwd Pkt Len Mean',
       'Bwd Pkt Len Std', 'Flow Byts/s', 'Flow Pkts/s', 'Flow IAT Mean',
       'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Tot',
       'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
       'Bwd IAT Tot', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
       'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags',
       'Bwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
       'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
       'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
       'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
       'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
       'Fwd Seg Size 

In [159]:
train_df.dropna(inplace=True)

In [160]:
train_df.isna().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [165]:
output_path = f's3://{bucket_name}/{prefix}/output/'

FRAMEWORK_VERSION = "0.23-1"

estimator2 = SKLearn(
    entry_point="randomforest.py",
    source_dir='./code/',
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[
        {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
        {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"}
    ],
    output_path=output_path,
    hyperparameters={
        "n-estimators": 50,
        "min-samples-leaf": 2
    }
)
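
The `randomforest.py` entry point is not shown in this notebook. The sketch below illustrates the contract the estimator definition above implies. In script mode, hyperparameters arrive as command-line flags, and the metric regexes are matched against lines the script prints. The argument defaults and the sample log line are assumptions made for illustration, not the actual script.

```python
import argparse
import os
import re


def parse_args(argv=None):
    """Parse hyperparameters the way a script-mode entry point typically does."""
    parser = argparse.ArgumentParser()
    # Each hyperparameter is passed as a flag, e.g. --n-estimators 50.
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    # Locations provided through environment variables inside the training container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION", "data/val"))
    return parser.parse_args(argv)


# Regexes copied from metric_definitions above.
ACCURACY_REGEX = r"Accuracy is: ([0-9.]+).*$"
F1_REGEX = r"Weighted F1 Score is: ([0-9.]+).*$"


def extract_metric(regex, log_line):
    """Return the captured metric value as a float, or None if the line does not match."""
    match = re.search(regex, log_line)
    return float(match.group(1)) if match else None


args = parse_args(["--n-estimators", "50", "--min-samples-leaf", "2"])
sample_line = "Accuracy is: 0.9731"  # made-up log line, for illustration only
print(args.n_estimators, args.min_samples_leaf)
print(extract_metric(ACCURACY_REGEX, sample_line))
```

Checking the regexes locally against the exact strings the script prints is a cheap way to catch metrics that would otherwise silently fail to appear for the training job.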



In [166]:
estimator2.fit({'train': s3_input_train, 'validation': s3_input_validation})

2023-01-23 19:50:59 Starting - Starting the training job...
2023-01-23 19:51:24 Starting - Preparing the instances for trainingProfilerReport-1674503458: InProgress
......
2023-01-23 19:52:24 Downloading - Downloading input data.....
2023-01-23 19:53:06,336 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-01-23 19:53:06,340 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-23 19:53:06,383 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-01-23 19:53:06,546 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-23 19:53:06,558 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-23 19:53:06,570 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-23 19:53:06,580 sagemaker-training-toolkit INFO     Invoking 

In order to make sure that our code works for inference, we can deploy the trained model and execute some inferences.

<a id='register'></a>
## Register our models

In [33]:
sm_client = boto3.Session().client('sagemaker')

In [35]:
model_name1 = "PROD-XGBoost-Webtraffic"
model_name2 = "SHADOW-XGBoost-Webtraffic"

print(f"Prod model name: {model_name1}")
print(f"Shadow model name: {model_name2}")

resp = sm_client.create_model(
    ModelName=model_name1,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": container, 
                      "ModelDataUrl": estimator.model_data
                     }
)

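# The shadow model was trained with the scikit-learn container, so serve it from that image
# rather than the XGBoost one. This assumes randomforest.py also defines model_fn for inference.
resp = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": estimator2.training_image_uri(),
                      "ModelDataUrl": estimator2.model_data,
                      "Environment": {
                          "SAGEMAKER_PROGRAM": "randomforest.py",
                          "SAGEMAKER_SUBMIT_DIRECTORY": estimator2.uploaded_code.s3_prefix
                      }
                     }
)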

Prod model name: PROD-XGBoost-Webtraffic
Shadow model name: SHADOW-XGBoost-Webtraffic


<a id='create_endpoint'></a>
## Create Endpoint Config

Here we create the endpoint configuration for the production endpoint. We also enable data capture for 100% of the input and output traffic to the production endpoint. Note that we could also use the high-level SageMaker Python SDK (estimator.deploy()).

In [46]:
ep_config_name = "shadow-xgboost-epconfig"

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=ep_config_name,
    ProductionVariants=[
        {
            "VariantName": model_name1,
            "ModelName": model_name1,
            "InstanceType": "ml.m4.xlarge", 
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
        }
    ],
    DataCaptureConfig=
    {
            "EnableCapture": True,
            "InitialSamplingPercentage":100,
            "DestinationS3Uri":f"s3://{bucket_name}/{prefix}/datacapture/",
            "CaptureOptions":[{'CaptureMode': 'Input'}, {'CaptureMode': 'Output'}]
    },        
)
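
With data capture enabled, requests and responses are written to the S3 destination as JSON Lines files. The sketch below parses one such record. The sample line is made up, and the field names follow the capture record layout as commonly documented, so verify them against a real capture file before relying on this.

```python
import json

# Made-up capture record for illustration; real records are produced by the endpoint.
SAMPLE_LINE = json.dumps({
    "captureData": {
        "endpointInput": {
            "observedContentType": "text/csv",
            "mode": "INPUT",
            "data": "80,1056736,3,4",
            "encoding": "CSV",
        },
        "endpointOutput": {
            "observedContentType": "text/csv; charset=utf-8",
            "mode": "OUTPUT",
            "data": "4.0",
            "encoding": "CSV",
        },
    },
    "eventMetadata": {"eventId": "example-id", "inferenceTime": "2023-01-10T20:50:00Z"},
    "eventVersion": "0",
})


def parse_capture_line(line):
    """Extract the request payload, response payload, and timestamp from one capture record."""
    record = json.loads(line)
    capture = record["captureData"]
    return {
        "input": capture["endpointInput"]["data"],
        "output": capture["endpointOutput"]["data"],
        "time": record["eventMetadata"]["inferenceTime"],
    }


parsed = parse_capture_line(SAMPLE_LINE)
print(parsed)
```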

<a id='deploy'></a>
## Deploy!

In [94]:
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge')

-------!

In [95]:
predictor.endpoint_name

'sagemaker-xgboost-2023-01-09-22-54-56-577'

In [48]:
endpoint_name = "xgboost-webtraffic"
create_endpoint_api_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=ep_config_name,
)
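
`create_endpoint` returns immediately, while the endpoint itself takes several minutes to become available. A minimal polling sketch is shown below. `FakeClient` is a stand-in so the example runs without AWS access; in the notebook you would pass `sm_client` instead. The helper name and polling parameters are illustrative choices.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint is InService, or raise on failure or timeout."""
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not in service after {max_polls} polls")


class FakeClient:
    """Stand-in for the SageMaker client that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


fake = FakeClient(["Creating", "Creating", "InService"])
print(wait_for_endpoint(fake, "xgboost-webtraffic", poll_seconds=0))
```

boto3 also provides a built-in waiter for this, `sm_client.get_waiter('endpoint_in_service')`, which may be preferable when custom handling is not needed.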

## Predict

Now when we send a request to the deployed endpoint, we receive a response from the production variant. Once a shadow test is running, the shadow variant also receives a copy of the input payload, but its response is not returned to the caller.

In [49]:
sm_runtime = boto3.Session().client("sagemaker-runtime")

In [74]:
# We expect 4 - DDoS attacks-LOIC-HTTP as the predicted class for this instance.
test_values = "80,1056736,3,4,20,964,20,0,6.666666667,11.54700538,964,0,241.0,482.0,931.1691850999999,6.6241710320000005,176122.6667,431204.4454,1056315,2,394,197.0,275.77164469999997,392,2,1056733,352244.3333,609743.1115,1056315,24,0,0,0,0,72,92,2.8389304419999997,3.78524059,0,964,123.0,339.8873763,115523.4286,0,0,1,1,0,0,0,1,1.0,140.5714286,6.666666667,241.0,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0"

In [75]:
response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                      ContentType='text/csv',
                                      Body=test_values)

In [76]:
response['Body'].read()

b'4.0'
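
The endpoint returns the encoded class as bytes. A small helper, sketched below, can turn that response into the class id and its label. The helper name `decode_prediction` is hypothetical, and the label list mirrors the encoding table in the introduction.

```python
# Class labels in encoded order (index = encoded class id).
CLASS_NAMES = [
    'Benign', 'Bot', 'DoS attacks-GoldenEye', 'DoS attacks-Slowloris',
    'DDoS attacks-LOIC-HTTP', 'Infilteration', 'DDOS attack-LOIC-UDP',
    'DDOS attack-HOIC', 'Brute Force-Web', 'Brute Force-XSS', 'SQL Injection',
    'DoS attacks-SlowHTTPTest', 'DoS attacks-Hulk', 'FTP-BruteForce', 'SSH-Bruteforce',
]


def decode_prediction(body):
    """Convert a single-prediction response body such as b'4.0' into (class_id, label)."""
    class_id = int(float(body.decode('utf-8').strip()))
    return class_id, CLASS_NAMES[class_id]


print(decode_prediction(b'4.0'))
```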

<a id='shadow'></a>
# Create a Shadow Test 

## Create a Shadow Test using an Existing Endpoint

In [105]:
infexperimentarn = sm_client.create_inference_experiment(
    Name='ShadowInferenceTestExistingEP',
    Type='ShadowMode',
    Schedule={
        'StartTime': datetime(2023, 1, 10, 20, 50),
        'EndTime': datetime(2023, 1, 16, 22, 46, 10)
    },
    Description='Shadow inference test created via boto3 python API using an existing EP',
    RoleArn=role,
    EndpointName='sagemaker-xgboost-2023-01-09-22-54-56-577',
    ModelVariants=[
        {
            'ModelName': 'sagemaker-xgboost-2023-01-09-22-54-56-577',
            'VariantName': 'AllTraffic',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m4.xlarge',
                    'InstanceCount': 1 
                }
            }
        },
        
        {
            'ModelName': 'SHADOW-XGBoost-Webtraffic',
            'VariantName': 'Shadow-01',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m5.xlarge',
                    'InstanceCount': 1 
                }
            }
        },
    ],
    DataStorageConfig={
        'Destination':f's3://{bucket_name}/{prefix}/datacapture_test/',
    },
    ShadowModeConfig={
        'SourceModelVariantName': 'AllTraffic',
        'ShadowModelVariants': [
            {
                'ShadowModelVariantName': 'Shadow-01',
                'SamplingPercentage': 100
            },
        ]
    },
)   


## New Shadow Experiment with a New Endpoint

In [103]:
datetime.now()

datetime.datetime(2023, 1, 10, 20, 48, 22, 532584)
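
The experiment schedules in this notebook use hard-coded timestamps chosen by checking the current time first. As an alternative, the schedule can be derived from the current time. The sketch below assumes a start a few minutes in the future and a fixed duration; the helper name and default values are illustrative.

```python
from datetime import datetime, timedelta


def build_schedule(now, start_delay_minutes=5, duration_days=6):
    """Return a Schedule dict that starts shortly after `now` and runs for `duration_days`."""
    start = now + timedelta(minutes=start_delay_minutes)
    return {"StartTime": start, "EndTime": start + timedelta(days=duration_days)}


# Using the timestamp printed above as the reference point.
schedule = build_schedule(datetime(2023, 1, 10, 20, 48, 22))
print(schedule)
```

The resulting dictionary can be passed as the `Schedule` argument of `create_inference_experiment`, with `datetime.now()` as the reference point.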

In [101]:
infexperimentarn = sm_client.create_inference_experiment(
    Name='ShadowInferenceTestNEWEP',
    Type='ShadowMode',
    Schedule={
        'StartTime': datetime(2023, 1, 10, 20, 27, 0),
        'EndTime': datetime(2023, 1, 16, 22, 46, 10)
    },
    Description='Shadow inference test created via boto3 python API',
    RoleArn=role,
    EndpointName='shadowTestEPCodeNewEP',
    ModelVariants=[
        {
            'ModelName': 'PROD-XGBoost-Webtraffic',
            'VariantName': 'Production-01',
            'InfrastructureConfig': {
                'InfrastructureType': 'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m5.xlarge',
                    'InstanceCount': 1
                }
            },
        },
        {
            'ModelName': 'SHADOW-XGBoost-Webtraffic',
            'VariantName': 'Shadow-01',
            'InfrastructureConfig': {
                'InfrastructureType': 'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m4.xlarge',
                    'InstanceCount': 1
                }
            },
        }
    ],
    DataStorageConfig={
        'Destination':f's3://{bucket_name}/{prefix}/datacapture_test/',
    },
    ShadowModeConfig={
        'SourceModelVariantName': 'Production-01',
        'ShadowModelVariants': [
            {
                'ShadowModelVariantName': 'Shadow-01',
                'SamplingPercentage': 80
            }
        ]
    },
)   

In [93]:
sm_client.describe_inference_experiment(Name='ShadowTestConsoleNewEP')

{'Arn': 'arn:aws:sagemaker:us-east-1:431615879134:inference-experiment/shadowtestconsolenewep',
 'Name': 'ShadowTestConsoleNewEP',
 'Type': 'ShadowMode',
 'Schedule': {'StartTime': datetime.datetime(2023, 1, 9, 22, 46, 10, 675000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2023, 1, 16, 22, 46, 10, 675000, tzinfo=tzlocal())},
 'Status': 'Creating',
 'CreationTime': datetime.datetime(2023, 1, 9, 22, 48, 47, 280000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 1, 9, 22, 48, 50, 552000, tzinfo=tzlocal()),
 'RoleArn': 'arn:aws:iam::431615879134:role/SageMakerGeospatial',
 'EndpointMetadata': {'EndpointName': 'shadowTestEPConsoleNew',
  'EndpointConfigName': 'ShadowTestConsol-EpConfig-bcplOiiNchDzIxrN',
  'EndpointStatus': 'Creating'},
 'ModelVariants': [{'ModelName': 'PROD-XGBoost-Webtraffic',
   'VariantName': 'Production-01',
   'InfrastructureConfig': {'InfrastructureType': 'RealTimeInference',
    'RealTimeInferenceConfig': {'InstanceType': 'ml.m5.xlarge',
    

In [96]:
sm_client.describe_inference_experiment(Name='ShadowTestExistingConsole')

{'Arn': 'arn:aws:sagemaker:us-east-1:431615879134:inference-experiment/shadowtestexistingconsole',
 'Name': 'ShadowTestExistingConsole',
 'Type': 'ShadowMode',
 'Schedule': {'StartTime': datetime.datetime(2023, 1, 9, 22, 53, 21, 645000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2023, 1, 16, 22, 53, 21, 645000, tzinfo=tzlocal())},
 'Status': 'Creating',
 'CreationTime': datetime.datetime(2023, 1, 9, 23, 2, 8, 474000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 1, 9, 23, 2, 11, 725000, tzinfo=tzlocal()),
 'RoleArn': 'arn:aws:iam::431615879134:role/sagemaker-test-role',
 'EndpointMetadata': {'EndpointName': 'sagemaker-xgboost-2023-01-09-22-54-56-577',
  'EndpointConfigName': 'sagemaker-xgboost-2023-01-09-22-54-56-577',
  'EndpointStatus': 'Updating'},
 'ModelVariants': [{'ModelName': 'sagemaker-xgboost-2023-01-09-22-54-56-577',
   'VariantName': 'AllTraffic',
   'InfrastructureConfig': {'InfrastructureType': 'RealTimeInference',
    'RealTimeInferenceConfig': {

<a id='eval'></a>
# Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model by sending an HTTP POST request. First, we need to set up serializers and deserializers for passing our test_data NumPy arrays to the model behind the endpoint.


TODO - EVALUATE BOTH MODELS USING WEIGHTED F1 (OR SIMILAR METRIC).  USE THE GROUND TRUTH TO ASSESS MODEL QUALITY

Train Model 1

## Simulate Production Traffic

We will now simulate production traffic by looping over the production data. In a real production use case you won't need to do this, since actual production data will be flowing to the production endpoint.

Note that we are not capturing the inference request or output ourselves. We could do this, but since we have configured data capture, we let SageMaker handle the data capture to Amazon S3.

In [None]:
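predictor.serializer = sagemaker.serializers.CSVSerializer()  # send each row as text/csv
prod = holdout.to_numpy()[:, 1:]  # assumption: hold-out rows with the Target column dropped
for p in prod:
    predictor.predict(p)  # the prediction is intentionally discarded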



Now, we'll use a simple function to:

1. Loop over our test dataset
2. Split it into mini-batches of rows
3. Convert those mini-batches to CSV string payloads
4. Retrieve mini-batch predictions by invoking the XGBoost endpoint
5. Collect predictions and convert from the CSV output our model provides into a NumPy array



In [None]:
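def predict(data, rows=500):
    # Split the array into mini-batches of roughly `rows` rows each.
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a CSV string of predicted class ids; append it to the running string.
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma, then parse the CSV string into a NumPy array.
    return np.fromstring(predictions[1:], sep=',')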
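
The batching, CSV conversion, and response parsing steps can also be written without relying on a configured serializer. The following is a plain-Python sketch of steps 2, 3, and 5 from the list above. The helper names are hypothetical, and the parser assumes predictions come back separated by commas or newlines.

```python
def iter_batches(rows, batch_size=500):
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def to_csv_payload(batch):
    """Render a batch of rows as a CSV string, one record per line."""
    return "\n".join(",".join(str(value) for value in row) for row in batch)


def parse_predictions(text):
    """Parse comma- or newline-separated predictions into a list of floats."""
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


rows = [[80, 1527, 2], [443, 60108569, 4], [22, 298760, 21]]
for batch in iter_batches(rows, batch_size=2):
    print(repr(to_csv_payload(batch)))
print(parse_predictions("4.0,0.0\n12.0"))
```

Each payload produced this way can be sent with `invoke_endpoint` using `ContentType='text/csv'`, as in the single-record example earlier.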

In [None]:
predictions = predict(val_df.to_numpy()[:,1:])

In [None]:
predictions.shape

In [None]:
actual = val_df.to_numpy()[:,0]

In [None]:
actual.shape

In [None]:
class_list = ['Benign','Bot','DoS attacks-GoldenEye','DoS attacks-Slowloris','DDoS attacks-LOIC-HTTP','Infilteration','DDOS attack-LOIC-UDP','DDOS attack-HOIC','Brute Force-Web','Brute Force-XSS','SQL Injection','DoS attacks-SlowHTTPTest','DoS attacks-Hulk','FTP-BruteForce','SSH-Bruteforce']
fig, ax = plt.subplots(figsize=(15,10))
cm = confusion_matrix(actual,predictions)
normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(normalized_cm, ax=ax, annot=cm, fmt='',xticklabels=class_list,yticklabels=class_list)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
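
Alongside the confusion matrix, a single summary score is useful for comparing the two models. The sketch below computes a support-weighted F1 score in plain Python, the same kind of metric named in the training metric definitions. It is an illustrative implementation run on toy data; with scikit-learn available, `f1_score(actual, predictions, average='weighted')` should give the equivalent value.

```python
from collections import Counter


def weighted_f1(actual, predicted):
    """Average the per-class F1 scores, weighting each class by its number of true records."""
    labels = sorted(set(actual) | set(predicted))
    support = Counter(actual)
    total = len(actual)
    score = 0.0
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Toy example with two classes.
toy_actual = [0, 0, 1, 1]
toy_predicted = [0, 1, 1, 1]
print(round(weighted_f1(toy_actual, toy_predicted), 4))
```

Because each class is weighted by its support, frequent classes such as Benign dominate this score. Per-class F1 values are worth inspecting separately for the rare attack types.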

Finally, let's delete the deployed endpoint so that it stops incurring charges.

TODO - PROMOTE THE SHADOW VARIANT TO PROD VIA CODE, INCLUDE CODE ON HOW TO ROLL BACK TO N-1 VERSION OF THE MODEL

In [None]:
predictor.delete_endpoint()

# References

* A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) - https://registry.opendata.aws/cse-cic-ids2018/
* AIM362 - Re:Invent 2019 SageMaker Debugger and Model Monitor - https://github.com/aws-samples/reinvent2019-aim362-sagemaker-debugger-model-monitor