# Introduction

In this workshop, we will walk through the steps of training and deploying a **network traffic classification model**. We will train two models and deploy them as a production variant and a shadow variant. We will use SageMaker Shadow Tests to manage the comparison between the production and shadow variants.

## Contents

1) [Setup](#setup)
2) [Basic Training](#basic_training)
3) [Register the Models](#register)
4) [Create Endpoint Config](#create_endpoint)
5) [Deploy and Predict](#deploy)
6) [Create a Shadow Test](#shadow)
7) [Evaluate](#eval)

To train our model we will use the <a href="https://registry.opendata.aws/cse-cic-ids2018/">CSE-CIC-IDS2018</a> dataset by CIC and ISCX, which is used for security testing and malware prevention.
The dataset includes a large volume of raw network traffic logs, plus pre-processed data in which network connections have been reconstructed and relevant features have been extracted using CICFlowMeter, a tool that outputs network connection features as CSV files. Each record is labeled either as benign traffic or as one of several types of malicious traffic, for a total of 15 classes.

Starting from this featurized dataset, we have executed additional pre-processing for the purpose of this lab:
<ul>
    <li>Encoded class labels</li>
    <li>Replaced invalid string attribute values generated by CICFlowMeter (e.g. inf and Infinity)</li>
    <li>Executed one hot encoding of discrete attributes</li>
    <li>Removed invalid headers logged multiple times in the same CSV file</li>
    <li>Reduced the size of the featurized dataset to ~1.3GB (from ~6.3GB) to speed-up training, while making sure that all classes are well represented</li>
    <li>Executed stratified random split of the dataset into training (80%) and validation (20%) sets</li>
</ul>

Classes are represented and have been encoded as follows (train + validation):


| Label                    | Encoded | N. records |
|:-------------------------|:-------:|-----------:|
| Benign                   |    0    |    1000000 |
| Bot                      |    1    |     200000 |
| DoS attacks-GoldenEye    |    2    |      40000 |
| DoS attacks-Slowloris    |    3    |      10000 |
| DDoS attacks-LOIC-HTTP   |    4    |     300000 |
| Infilteration            |    5    |     150000 |
| DDOS attack-LOIC-UDP     |    6    |       1730 |
| DDOS attack-HOIC         |    7    |     300000 |
| Brute Force -Web         |    8    |        611 |
| Brute Force -XSS         |    9    |        230 |
| SQL Injection            |   10    |         87 |
| DoS attacks-SlowHTTPTest |   11    |     100000 |
| DoS attacks-Hulk         |   12    |     250000 |
| FTP-BruteForce           |   13    |     150000 |
| SSH-Bruteforce           |   14    |     150000 |       
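
To get a feel for how imbalanced the classes are, we can compute each class's share of the records. This is only a small sketch: the counts are copied from the table above, and the names `class_counts` and `shares` are ours, not part of the workshop code.

```python
# Record counts copied from the table above (train + validation).
class_counts = {
    "Benign": 1000000,
    "Bot": 200000,
    "DoS attacks-GoldenEye": 40000,
    "DoS attacks-Slowloris": 10000,
    "DDoS attacks-LOIC-HTTP": 300000,
    "Infilteration": 150000,
    "DDOS attack-LOIC-UDP": 1730,
    "DDOS attack-HOIC": 300000,
    "Brute Force -Web": 611,
    "Brute Force -XSS": 230,
    "SQL Injection": 87,
    "DoS attacks-SlowHTTPTest": 100000,
    "DoS attacks-Hulk": 250000,
    "FTP-BruteForce": 150000,
    "SSH-Bruteforce": 150000,
}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# List classes from most to least frequent with their share of all records.
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<26} {class_counts[label]:>9} {share:9.4%}")

print(f"Total records: {total}")
```

The rarest classes (SQL Injection, Brute Force -XSS, Brute Force -Web) each account for well under 0.1% of the records, which is why the splits in this lab are stratified.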

The final pre-processed dataset has been saved to a public Amazon S3 bucket for your convenience, and will represent the inputs to the training processes.
<a id='setup'></a>
### Let's get started!

First, we set some variables, including the AWS region we are working in, the IAM (Identity and Access Management) execution role of the notebook instance, and the Amazon S3 bucket where we will store data, models, outputs, etc. We will use the Amazon SageMaker default bucket for the selected AWS region, and define a key prefix so that all objects share the same prefix for easier discoverability.

In [1]:
import os
import boto3
import sagemaker
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from IPython.display import display, clear_output
from sagemaker.sklearn.estimator import SKLearn
from sklearn.model_selection import train_test_split
from sagemaker.model_monitor import DataCaptureConfig
import pandas as pd
import numpy as np
import time
import glob
import json

pd.options.display.max_columns = 100

region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'xgboost-webtraffic'
os.environ["AWS_REGION"] = region

print(f'REGION:  {region}')
print(f'ROLE:    {role}')
print(f'BUCKET:  {bucket_name}')

REGION:  us-east-1
ROLE:    arn:aws:iam::278578987671:role/SageMaker-IoTRole
BUCKET:  sagemaker-us-east-1-278578987671


Now we can copy the dataset from the public Amazon S3 bucket to the Amazon SageMaker default bucket used in this workshop. To do this, we will use the AWS SDK for Python (boto3) as follows:

In [2]:
s3 = boto3.resource('s3')

source_bucket_name = "endtoendmlapp"
source_bucket_prefix = "aim362/data/"
source_bucket = s3.Bucket(source_bucket_name)

for s3_object in source_bucket.objects.filter(Prefix=source_bucket_prefix):
    copy_source = {
        'Bucket': source_bucket_name,
        'Key': s3_object.key
    }
    print('Copying {0} ...'.format(s3_object.key))
    s3.Bucket(bucket_name).copy(copy_source, prefix+'/data/'+s3_object.key.split('/')[-2]+'/'+s3_object.key.split('/')[-1])
    
print(f'Data copy from source bucket, {source_bucket_name}/{source_bucket_prefix}, to destination bucket {bucket_name}/{prefix}/data/, complete!')

Copying aim362/data/train/0.part ...
Copying aim362/data/train/1.part ...
Copying aim362/data/train/2.part ...
Copying aim362/data/train/3.part ...
Copying aim362/data/train/4.part ...
Copying aim362/data/train/5.part ...
Copying aim362/data/train/6.part ...
Copying aim362/data/train/7.part ...
Copying aim362/data/train/8.part ...
Copying aim362/data/train/9.part ...
Copying aim362/data/val/0.part ...
Copying aim362/data/val/1.part ...
Copying aim362/data/val/2.part ...
Copying aim362/data/val/3.part ...
Copying aim362/data/val/4.part ...
Copying aim362/data/val/5.part ...
Copying aim362/data/val/6.part ...
Copying aim362/data/val/7.part ...
Copying aim362/data/val/8.part ...
Copying aim362/data/val/9.part ...
Data copy from source bucket, endtoendmlapp/aim362/data/, to destination bucket sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/, complete!
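
The destination key in the copy loop above is built by keeping only the last two path components of each source key (the split folder and the file name). A minimal sketch of that mapping, using a hypothetical helper `destination_key` that is not part of the workshop code:

```python
def destination_key(source_key: str, prefix: str) -> str:
    """Map a source key such as 'aim362/data/train/0.part'
    to '<prefix>/data/train/0.part'."""
    parts = source_key.split('/')
    split_name, file_name = parts[-2], parts[-1]
    return f"{prefix}/data/{split_name}/{file_name}"


# Same prefix as used in this notebook.
print(destination_key("aim362/data/train/0.part", "xgboost-webtraffic"))
print(destination_key("aim362/data/val/9.part", "xgboost-webtraffic"))
```

Pulling the mapping into a function like this makes the copy loop easier to read and lets us check the key layout without touching S3.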


Let's download some of the data to the notebook to quickly explore the dataset structure:

# Data

In [3]:
train_file_path = 's3://' + bucket_name + '/' + prefix + '/data/train/0.part'
val_file_path = 's3://' + bucket_name + '/' + prefix + '/data/val/'

print(train_file_path)
print(val_file_path)

s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.part
s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/


In [4]:
!mkdir -p data/train/ data/val/
!aws s3 cp {train_file_path} data/train/ 
!aws s3 cp {val_file_path} data/val/ --recursive

download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/train/0.part to data/train/0.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/3.part to data/val/3.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/2.part to data/val/2.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/4.part to data/val/4.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/7.part to data/val/7.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/8.part to data/val/8.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/9.part to data/val/9.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/1.part to data/val/1.part
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/newval.csv to data/val/newval.csv
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/dat

In [5]:
df = pd.read_csv('data/train/0.part')
df

Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,0,445,64443,5,4,373,172,140,0,74.600000,70.283711,133,0,43.000000,62.753486,8457.086107,139.658303,8.055375e+03,1.105582e+04,21474,3,64403,1.610075e+04,1.073215e+04,21537,3,64398,2.146600e+04,129.201393,21547,21317,0,0,0,0,112,92,77.587946,62.070357,0,140,54.500000,64.198044,4121.388889,0,0,0,1,0,0,0,0,0.0,60.555556,74.600000,43.000000,0.0,0.0,0.0,0.0,0.0,0.0,5,373,4,172,8192,0,3,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
1,12,80,1527,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1309.757695,1.527000e+03,0.000000e+00,1527,1527,1527,1.527000e+03,0.000000e+00,1527,1527,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,1309.757695,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
2,7,80,5573,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,358.873138,5.573000e+03,0.000000e+00,5573,5573,5573,5.573000e+03,0.000000e+00,5573,5573,0,0.000000e+00,0.000000,0,0,0,0,0,0,40,0,358.873138,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,32738,-1,0,20,0.0,0.0,0,0,0.0,0.0,0,0,21,2,2018,2,0,1,0
3,12,80,44934,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,44.509725,4.493400e+04,0.000000e+00,44934,44934,44934,4.493400e+04,0.000000e+00,44934,44934,0,0.000000e+00,0.000000,0,0,0,0,0,0,64,0,44.509725,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,225,-1,0,32,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
4,0,443,60108569,4,2,148,252,74,0,37.000000,42.723920,126,126,126.000000,0.000000,6.654625,0.099819,1.202171e+07,2.677679e+07,59921494,44882,60108569,2.003619e+07,3.454169e+07,59921494,93516,60013670,6.001367e+07,0.000000,60013670,60013670,1,0,0,0,80,40,0.066546,0.033273,0,126,67.714286,51.774235,2680.571429,0,1,0,0,1,0,0,0,0.0,79.000000,37.000000,126.000000,0.0,0.0,0.0,0.0,0.0,0.0,4,148,2,252,257,7010,1,20,93559.0,0.0,93559,93559,59921494.0,0.0,59921494,59921494,20,2,2018,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212208,14,22,298760,21,21,1912,2665,640,0,91.047619,139.679088,976,0,126.904762,263.574639,15319.989289,140.581068,7.286829e+03,2.230386e+04,122248,2,298722,1.493610e+04,3.049364e+04,122248,320,298752,1.493760e+04,33997.036726,126346,7,0,0,0,0,680,680,70.290534,70.290534,0,976,106.441860,207.291869,42969.919158,0,0,0,1,0,0,0,0,1.0,108.976190,91.047619,126.904762,0.0,0.0,0.0,0.0,0.0,0.0,21,1912,21,2665,26883,230,16,32,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0
212209,0,50684,29,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,68965.517241,2.900000e+01,0.000000e+00,29,29,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,20,20,34482.758621,34482.758621,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,172,255,0,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212210,4,80,1639274,3,4,20,964,20,0,6.666667,11.547005,964,0,241.000000,482.000000,600.265727,4.270183,2.732123e+05,6.690496e+05,1638904,5,339,1.695000e+02,2.326381e+02,334,5,1639268,5.464227e+05,946116.600700,1638904,26,0,0,0,0,72,92,1.830078,2.440105,0,964,123.000000,339.887376,115523.428600,0,0,1,1,0,0,0,1,1.0,140.571429,6.666667,241.000000,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
212211,5,52848,309,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,6472.491909,3.090000e+02,0.000000e+00,309,309,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000,0,0,0,0,0,0,24,20,3236.245955,3236.245955,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,1024,0,0,24,0.0,0.0,0,0,0.0,0.0,0,0,3,1,2018,2,0,1,0


In [6]:
%%time
val_csv_files = glob.glob("./data/val/*.part")
df_list = (pd.read_csv(file) for file in val_csv_files)
val_df= pd.concat(df_list, ignore_index=True)
val_df

CPU times: user 3.37 s, sys: 408 ms, total: 3.78 s
Wall time: 4.19 s


Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
0,0,53,5416,1,1,42,58,42,42,42.000000,0.000000,58,58,58.0,0.000000,1.846381e+04,369.276219,5416.000000,0.000000,5416,5416,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,8,8,184.638109,184.638109,42,58,47.333333,9.237604,85.333333,0,0,0,0,0,0,0,0,1.0,71.000000,42.000000,58.0,0.0,0.0,0.0,0.0,0.0,0.0,1,42,1,58,-1,-1,0,8,0.0,0.0,0,0,0.0,0.0,0,0,22,2,2018,3,0,0,1
1,5,53,1166,1,1,40,139,40,40,40.000000,0.000000,139,139,139.0,0.000000,1.535163e+05,1715.265866,1166.000000,0.000000,1166,1166,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,8,8,857.632933,857.632933,40,139,73.000000,57.157677,3267.000000,0,0,0,0,0,0,0,0,1.0,109.500000,40.000000,139.0,0.0,0.0,0.0,0.0,0.0,0.0,1,40,1,139,-1,-1,0,8,0.0,0.0,0,0,0.0,0.0,0,0,28,2,2018,2,0,0,1
2,11,21,2,1,1,0,0,0,0,0.000000,0.000000,0,0,0.0,0.000000,0.000000e+00,1000000.000000,2.000000,0.000000,2,2,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,40,20,500000.000000,500000.000000,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,26883,0,0,40,0.0,0.0,0,0,0.0,0.0,0,0,16,2,2018,4,0,1,0
3,0,80,47674,3,4,437,860,437,0,145.666667,252.302068,860,0,215.0,430.000000,2.720560e+04,146.830558,7945.666667,12109.572340,23631,16,23857,11928.50,16599.331688,23666,191,24043,8014.333333,13629.834347,23752,16,0,0,0,0,72,92,62.927382,83.903176,0,860,162.125000,320.778712,102898.982143,0,0,0,1,0,0,0,0,1.0,185.285714,145.666667,215.0,0.0,0.0,0.0,0.0,0.0,0.0,3,437,4,860,8192,31,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0
4,14,22,6,1,1,0,0,0,0,0.000000,0.000000,0,0,0.0,0.000000,0.000000e+00,333333.333333,6.000000,0.000000,6,6,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,32,32,166666.666667,166666.666667,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,241,230,0,32,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530527,0,53,371,1,1,32,48,32,32,32.000000,0.000000,48,48,48.0,0.000000,2.156334e+05,5390.835580,371.000000,0.000000,371,371,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,8,8,2695.417790,2695.417790,32,48,37.333333,9.237604,85.333333,0,0,0,0,0,0,0,0,1.0,56.000000,32.000000,48.0,0.0,0.0,0.0,0.0,0.0,0.0,1,32,1,48,-1,-1,0,8,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,0,1
530528,0,65151,1219,5,2,935,316,935,0,187.000000,418.144712,316,0,158.0,223.445743,1.026251e+06,5742.411813,203.166667,265.106331,683,6,1219,304.75,468.834992,996,6,683,683.000000,0.000000,683,683,0,0,0,0,124,40,4101.722724,1640.689089,0,935,156.375000,333.478608,111207.982100,0,0,1,1,0,0,0,1,0.0,178.714286,187.000000,158.0,0.0,0.0,0.0,0.0,0.0,0.0,5,935,2,316,65535,32768,1,20,0.0,0.0,0,0,0.0,0.0,0,0,21,2,2018,2,0,1,0
530529,14,22,6,1,1,0,0,0,0,0.000000,0.000000,0,0,0.0,0.000000,0.000000e+00,333333.333333,6.000000,0.000000,6,6,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,32,32,166666.666667,166666.666667,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,241,230,0,32,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0
530530,13,21,3,1,1,0,0,0,0,0.000000,0.000000,0,0,0.0,0.000000,0.000000e+00,666666.666667,3.000000,0.000000,3,3,0,0.00,0.000000,0,0,0,0.000000,0.000000,0,0,0,0,0,0,40,20,333333.333333,333333.333333,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,26883,0,0,40,0.0,0.0,0,0,0.0,0.0,0,0,14,2,2018,2,0,1,0


In [7]:
newval_df, holdout = train_test_split(val_df, test_size=.2, random_state=42, stratify=val_df['Target'])
print(holdout.shape)
print(newval_df.shape)

(106107, 85)
(424425, 85)
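
The two shapes above can be checked by hand. The sketch below assumes that `train_test_split` rounds the number of test rows up and gives the remainder to the other split, which is consistent with the printed shapes; the variable names are ours.

```python
from math import ceil

# Total validation rows, recovered from the two shapes printed above.
n_samples = 106107 + 424425
test_size = 0.2

# Assumption: the test split is rounded up, the rest goes to the other split.
n_holdout = ceil(test_size * n_samples)
n_newval = n_samples - n_holdout

print(n_samples, n_holdout, n_newval)
```

Because the split is stratified on `Target`, each class should appear in `holdout` and `newval_df` in roughly the same proportion as in the full validation set.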


In [8]:
holdout.to_csv('./data/holdout.csv',index=False)
newval_df.to_csv('./data/val/newval.csv',index=False)
del val_df, newval_df

In [9]:
val_data_path = f"s3://{bucket_name}/{prefix}/data/val/newval.csv"
!aws s3 cp ./data/val/newval.csv {val_data_path}

upload: data/val/newval.csv to s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/data/val/newval.csv


<a id='basic_training'></a>
# Training

We will run the training using the built-in XGBoost algorithm. Note that you can also use script mode if you need greater customization of the training process.


In [10]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/train'.format(bucket_name, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/data/val/newval.csv'.format(bucket_name, prefix), content_type='csv')

## Model 1 - XGBoost

In [11]:
container = sagemaker.image_uris.retrieve('xgboost',region,version='1.0-1')
print(container)

683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3


In [12]:
hyperparameters = {
    "max_depth": "3",
    "eta": "0.1",
    "gamma": "6",
    "min_child_weight": "6",
    "objective": "multi:softmax",
    "num_class": "15",
    "num_round": "10"
}

output_path = f's3://{bucket_name}/{prefix}/output/'

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)



In [13]:
estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-01-31-20-14-45-931


2023-01-31 20:14:46 Starting - Starting the training job.........
2023-01-31 20:15:47 Starting - Preparing the instances for training...
2023-01-31 20:16:42 Downloading - Downloading input data.......
[2023-01-31 20:17:50.091 ip-10-0-131-137.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimi

## Model 2 - Sklearn Random Forest

In [14]:
output_path = f's3://{bucket_name}/{prefix}/output/'

FRAMEWORK_VERSION = "0.23-1"

estimator2 = SKLearn(
    entry_point="randomforest.py",
    source_dir='./code/',
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="rf-scikit",
    metric_definitions=[
        {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
        {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"}
    ],
    output_path=output_path,
    hyperparameters={
        "n-estimators": 50,
        "min-samples-leaf": 2
    }
)
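
The `metric_definitions` above tell SageMaker which regular expressions to apply to the training logs in order to extract metrics. As a rough illustration, the sketch below applies the same two patterns to a few made-up log lines. The log lines and the helper `extract_metrics` are hypothetical; the real values depend on what `randomforest.py` prints.

```python
import re

# Same patterns as in the estimator definition above.
metric_definitions = [
    {"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"},
    {"Name": "WeightedF1", "Regex": "Weighted F1 Score is: ([0-9.]+).*$"},
]

# Hypothetical log lines, for illustration only.
sample_log = [
    "Accuracy is: 0.9712",
    "Weighted F1 Score is: 0.9688",
    "Some unrelated log line",
]


def extract_metrics(lines, definitions):
    """Return {metric name: value} for every pattern that matches a line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(sample_log, metric_definitions)
print(metrics)
```

For these patterns to pick anything up, the training script has to print lines in exactly this form.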

In [15]:
estimator2.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating training-job with name: rf-scikit-2023-01-31-20-51-48-116


2023-01-31 20:51:48 Starting - Starting the training job...
2023-01-31 20:52:15 Starting - Preparing the instances for training......
2023-01-31 20:53:07 Downloading - Downloading input data......
2023-01-31 20:54:03 Training - Training image download completed. Training in progress.
2023-01-31 20:54:04,127 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2023-01-31 20:54:04,131 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-31 20:54:04,177 sagemaker_sklearn_container.training INFO     Invoking user training script.
2023-01-31 20:54:04,359 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-31 20:54:04,371 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-31 20:54:04,384 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-01-31 20:54:0

In order to make sure that our code works for inference, we can deploy the trained model and execute some inferences.

<a id='register'></a>
## Register our models

In [16]:
sm_client = boto3.Session().client('sagemaker')

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


In [45]:
model_name1 = "PROD-XGBoost-Webtraffic"
model_name2 = "SHADOW-RandomForest-Webtraffic"

print(f"Prod model name: {model_name1}")
print(f"Shadow model name: {model_name2}")

resp = sm_client.create_model(
    ModelName=model_name1,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": container, 
                      "ModelDataUrl": estimator.model_data
                     }
)

resp = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
                      "Image": estimator2.training_image_uri(),
                      "Mode": "SingleModel",
                      "ModelDataUrl": estimator2.model_data,
                      "Environment": {
                          "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                          "SAGEMAKER_SUBMIT_DIRECTORY":json.loads(estimator2.hyperparameters()['sagemaker_submit_directory']),
                          "SAGEMAKER_PROGRAM":json.loads(estimator2.hyperparameters()['sagemaker_program']),
                      },
                     }
)

Prod model name: PROD-XGBoost-Webtraffic
Shadow model name: SHADOW-RandomForest-Webtraffic


<a id='deploy'></a>
## Deploy!

TODO - Explain serialization and why we need to include it here, plus why we need to implement the input_fn in the RF code
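
Until that explanation is written, here is a rough sketch of the idea. An endpoint receives the request body as text or bytes, so a Python list of feature values has to be converted into a format the container understands, in this case a single CSV row. The helper `to_csv_row` below is hypothetical and only approximates what a CSV serializer does for a flat list of numbers; it is not the SageMaker implementation.

```python
import csv
import io


def to_csv_row(values) -> str:
    """Render a flat list of values as one CSV row, without a trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue().rstrip("\r\n")


# A shortened, made-up feature vector (the real one has many more values).
payload = to_csv_row([80, 1056736, 3, 6.666666667, 0.0])
print(payload)
```

Note that the payload carries only feature values, in the same column order as the training data and without the `Target` column.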

In [46]:
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge', serializer=sagemaker.serializers.CSVSerializer())

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-01-31-22-52-36-533
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-01-31-22-52-36-533
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-01-31-22-52-36-533


--------!

In [47]:
predictor.endpoint_name

'sagemaker-xgboost-2023-01-31-22-52-36-533'

## Predict

When we send a prediction request to the deployed endpoint, we receive a response from the production variant. Once the shadow test created in the next section is running, the shadow variant will also receive the input payload.

In [87]:
# single prediction
# We expect 4 - DDoS attacks-LOIC-HTTP as the predicted class.
test_values = [80,1056736,3,4,20,964,20,0,6.666666667,11.54700538,964,0,241.0,482.0,931.1691850999999,6.6241710320000005,176122.6667,431204.4454,1056315,2,394,197.0,275.77164469999997,392,2,1056733,352244.3333,609743.1115,1056315,24,0,0,0,0,72,92,2.8389304419999997,3.78524059,0,964,123.0,339.8873763,115523.4286,0,0,1,1,0,0,0,1,1.0,140.5714286,6.666666667,241.0,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0]
result = predictor.predict(test_values)
print(result)

b'4.0'
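
The endpoint returns the predicted class as bytes. To turn that into a readable label, we can decode it and look it up in the encoding table from the introduction. This is a small sketch; `LABELS` and `decode_prediction` are our own names.

```python
# Encoded class -> label, copied from the table in the introduction.
LABELS = {
    0: "Benign",
    1: "Bot",
    2: "DoS attacks-GoldenEye",
    3: "DoS attacks-Slowloris",
    4: "DDoS attacks-LOIC-HTTP",
    5: "Infilteration",
    6: "DDOS attack-LOIC-UDP",
    7: "DDOS attack-HOIC",
    8: "Brute Force -Web",
    9: "Brute Force -XSS",
    10: "SQL Injection",
    11: "DoS attacks-SlowHTTPTest",
    12: "DoS attacks-Hulk",
    13: "FTP-BruteForce",
    14: "SSH-Bruteforce",
}


def decode_prediction(raw: bytes) -> str:
    """Convert a response such as b'4.0' into its class label."""
    class_id = int(float(raw.decode("utf-8")))
    return LABELS[class_id]


print(decode_prediction(b'4.0'))
```

For the response above, this prints `DDoS attacks-LOIC-HTTP`, which matches the class we expected.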


<a id='shadow'></a>
# Create a Shadow Test 

## Create a Shadow Test using an Existing Endpoint

Now we will create a shadow test using the existing endpoint. We will stop this test using the API. Note that we could also specify the test start and stop times when we create the inference experiment. If we don't provide the start and end times, the experiment starts immediately and concludes after 7 days. Because we are using an existing endpoint for this test, SageMaker will update that endpoint with the new variants. The update also applies the inference compute specified for the production variant.


In [50]:
shadowtestname = 'ShadowInferenceTestExistingEP'
infexperimentarn = sm_client.create_inference_experiment(
    Name=shadowtestname,
    Type='ShadowMode',
    Description='Shadow inference test created via boto3 python API using an existing EP',
    RoleArn=role,
    EndpointName=predictor.endpoint_name,
    ModelVariants=[
        {
            'ModelName': model_name1,
            'VariantName': 'AllTraffic',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m4.xlarge',
                    'InstanceCount': 1 
                }
            }
        },
        
        {
            'ModelName': model_name2,
            'VariantName': 'Shadow-01',
            'InfrastructureConfig': {
                'InfrastructureType':'RealTimeInference',
                'RealTimeInferenceConfig': {
                    'InstanceType': 'ml.m4.xlarge',
                    'InstanceCount': 1 
                }
            }
        },
    ],
    DataStorageConfig={
        'Destination':f's3://{bucket_name}/{prefix}/datacapture_test/',
    },
    ShadowModeConfig={
        'SourceModelVariantName': 'AllTraffic',
        'ShadowModelVariants': [
            {
                'ShadowModelVariantName': 'Shadow-01',
                'SamplingPercentage': 100
            },
        ]
    },
)   


In [51]:
shadowtestdescribe = sm_client.describe_inference_experiment(Name=shadowtestname)
shadowtestdescribe

{'Arn': 'arn:aws:sagemaker:us-east-1:278578987671:inference-experiment/shadowinferencetestexistingep',
 'Name': 'ShadowInferenceTestExistingEP',
 'Type': 'ShadowMode',
 'Schedule': {'StartTime': datetime.datetime(2023, 1, 31, 22, 57, 7, 245000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2023, 2, 7, 22, 57, 7, 245000, tzinfo=tzlocal())},
 'Status': 'Creating',
 'Description': 'Shadow inference test created via boto3 python API using an existing EP',
 'CreationTime': datetime.datetime(2023, 1, 31, 22, 57, 6, 828000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 1, 31, 22, 57, 7, 226000, tzinfo=tzlocal()),
 'RoleArn': 'arn:aws:iam::278578987671:role/SageMaker-IoTRole',
 'EndpointMetadata': {'EndpointName': 'sagemaker-xgboost-2023-01-31-22-52-36-533'},
 'ModelVariants': [{'ModelName': 'PROD-XGBoost-Webtraffic',
   'VariantName': 'AllTraffic',
   'InfrastructureConfig': {'InfrastructureType': 'RealTimeInference',
    'RealTimeInferenceConfig': {'InstanceType': 'ml.m

TODO - Add comment about waiting for the experiment to be active.  Add code for a "waiter"
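
One possible shape for such a waiter is a simple polling loop. The sketch below is an assumption on our part, not an official SageMaker waiter: `wait_for_status` is a hypothetical helper, and the status names `Running`, `Completed` and `Cancelled` are assumed (only `Creating` appears in the output above). The describe call is passed in as a function so the loop can be tried out without AWS access.

```python
import time


def wait_for_status(describe, target="Running", terminal=("Completed", "Cancelled"),
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe() until its 'Status' equals target and return that response.

    Raises RuntimeError if a terminal status is seen first, and TimeoutError
    if the target is not reached within max_polls attempts.
    """
    for _ in range(max_polls):
        response = describe()
        status = response["Status"]
        if status == target:
            return response
        if status in terminal:
            raise RuntimeError(f"Experiment ended with status {status!r}")
        sleep(poll_seconds)
    raise TimeoutError(f"Status did not reach {target!r} after {max_polls} polls")


# Offline demonstration with a fake describe call (no AWS access needed).
fake_statuses = iter(["Creating", "Creating", "Running"])
result = wait_for_status(lambda: {"Status": next(fake_statuses)},
                         sleep=lambda seconds: None)
print(result["Status"])
```

In the notebook, the same helper could be called as `wait_for_status(lambda: sm_client.describe_inference_experiment(Name=shadowtestname))`, reusing the client and experiment name defined earlier.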

<a id='eval'></a>
# Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model very easily, simply by making an HTTP POST request.

In [26]:
# single prediction
# We expect 4 - DDoS attacks-LOIC-HTTP as the predicted class.
test_values = [80,1056736,3,4,20,964,20,0,6.666666667,11.54700538,964,0,241.0,482.0,931.1691850999999,6.6241710320000005,176122.6667,431204.4454,1056315,2,394,197.0,275.77164469999997,392,2,1056733,352244.3333,609743.1115,1056315,24,0,0,0,0,72,92,2.8389304419999997,3.78524059,0,964,123.0,339.8873763,115523.4286,0,0,1,1,0,0,0,1,1.0,140.5714286,6.666666667,241.0,0.0,0.0,0.0,0.0,0.0,0.0,3,20,4,964,8192,211,1,20,0.0,0.0,0,0,0.0,0.0,0,0,20,2,2018,1,0,1,0]
result = predictor.predict(test_values)
print(result)

b'4.0'


In [114]:
storage = shadowtestdescribe['DataStorageConfig']['Destination']+predictor.endpoint_name +'/'
storage

's3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/datacapture_test/sagemaker-xgboost-2023-01-31-22-52-36-533/'

In [53]:
!aws s3 ls {storage}

In [88]:
from math import floor

In [174]:
holdout

Unnamed: 0,Target,Dst Port,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,Fwd Pkt Len Std,Bwd Pkt Len Max,Bwd Pkt Len Min,Bwd Pkt Len Mean,Bwd Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,day,month,year,dayofweek,prot_0,prot_6,prot_17
518708,13,21,2,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,1.000000e+06,2.000000e+00,0.000000e+00,2,2,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,20,500000.000000,500000.000000,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,26883,0,0,40,0.000000,0.000000,0,0,0.0,0.000000,0,0,14,2,2018,2,0,1,0
108975,1,8080,565,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,3.539823e+03,5.650000e+02,0.000000e+00,565,565,565,5.650000e+02,0.000000e+00,565,565,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,0,3539.823009,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,2052,-1,0,20,0.000000,0.000000,0,0,0.0,0.000000,0,0,3,2,2018,5,0,1,0
53378,0,443,61115303,13,16,327,6219,196,0,25.153846,61.962147,1460,0,388.687500,611.821512,107.109017,4.745129e-01,2.182689e+06,4.236597e+06,10191195,1,61115303,5.092942e+06,5.291137e+06,10202876,354,61103447,4.073563e+06,5.142929e+06,10202998,1,0,0,0,0,272,392,0.212713,0.261800,0,1460,218.200000,479.154816,229589.337931,0,0,1,1,0,0,0,1,1.0,225.724138,25.153846,388.687500,0.0,0.0,0.0,0.0,0.0,0.0,13,327,16,6219,8192,131,7,20,37097.833333,62086.358548,163831,11681,10148762.5,52866.122543,10191195,10067738,20,2,2018,1,0,1,0
363848,11,21,28,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,7.142857e+04,2.800000e+01,0.000000e+00,28,28,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,20,35714.285710,35714.285710,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,26883,0,0,40,0.000000,0.000000,0,0,0.0,0.000000,0,0,16,2,2018,4,0,1,0
379387,14,22,6,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,3.333333e+05,6.000000e+00,0.000000e+00,6,6,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,32,32,166666.666667,166666.666667,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,1,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,241,230,0,32,0.000000,0.000000,0,0,0.0,0.000000,0,0,14,2,2018,2,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314180,13,21,1,1,1,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,2.000000e+06,1.000000e+00,0.000000e+00,1,1,0,0.000000e+00,0.000000e+00,0,0,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,20,1000000.000000,1000000.000000,0,0,0.000000,0.000000,0.000000,0,0,0,1,0,0,0,0,1.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,26883,0,0,40,0.000000,0.000000,0,0,0.0,0.000000,0,0,14,2,2018,2,0,1,0
465531,0,3389,4164902,10,7,1144,1581,677,0,114.400000,209.896906,1173,0,225.857143,430.098604,654.277099,4.081729e+00,2.603064e+05,4.626654e+05,1748809,3,4164902,4.627669e+05,5.863784e+05,1748809,3,4035005,6.725008e+05,7.157786e+05,1969136,132264,0,0,0,0,212,152,2.401017,1.680712,0,1173,151.388889,305.039851,93049.310458,0,0,1,1,0,0,0,1,0.0,160.294118,114.400000,225.857143,0.0,0.0,0.0,0.0,0.0,0.0,10,1144,7,1581,8192,62856,5,20,0.000000,0.000000,0,0,0.0,0.000000,0,0,20,2,2018,1,0,1,0
189547,4,80,41171942,2,0,0,0,0,0,0.000000,0.000000,0,0,0.000000,0.000000,0.000000,4.857677e-02,4.120000e+07,0.000000e+00,41200000,41200000,41200000,4.120000e+07,0.000000e+00,41200000,41200000,0,0.000000e+00,0.000000e+00,0,0,0,0,0,0,40,0,0.048577,0.000000,0,0,0.000000,0.000000,0.000000,0,0,0,0,1,0,0,0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0,0,2049,-1,0,20,0.000000,0.000000,0,0,41200000.0,0.000000,41200000,41200000,20,2,2018,1,0,1,0
519158,1,8080,11443,3,4,326,129,326,0,108.666667,188.216188,112,0,32.250000,53.767245,39762.300100,6.117277e+02,1.907167e+03,4.248820e+03,10572,21,546,2.730000e+02,3.139554e+02,495,51,10969,3.656333e+03,5.991685e+03,10572,24,0,0,0,0,72,92,262.169012,349.558682,0,326,56.875000,115.406657,13318.696430,0,0,1,1,0,0,0,1,1.0,65.000000,108.666667,32.250000,0.0,0.0,0.0,0.0,0.0,0.0,3,326,4,129,8192,219,1,20,0.000000,0.000000,0,0,0.0,0.000000,0,0,3,2,2018,5,0,1,0


### Here we use the SageMaker API to call the endpoint on the validation data we held back from training

In [176]:
%%time

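actuals = []
predictions = []
for i, (index, row) in enumerate(holdout.iterrows(), start=1):
    vals = row.values
    # vals[0] is the Target label; the remaining values are the features.
    # The row index is sent as the inference ID so that captured
    # predictions can be joined back to the true labels later.
    prediction = predictor.predict(vals[1:], inference_id=str(index))
    actuals.append(vals[0])
    predictions.append(floor(float(prediction.decode())))
    if i % 10000 == 0:
        print(i)  # progress indicator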

10000
20000
30000


KeyboardInterrupt: 

In [95]:
from sklearn import metrics as m
print(f'accuracy: {m.accuracy_score(actuals, predictions)}')
print(f'F1: {m.f1_score(actuals, predictions, average = "macro")}')
print(m.confusion_matrix(actuals, predictions))

accuracy: 0.9666939975684922
F1: 0.8497288303519585
[[38448    19     5     0   289  1226     0    12     0     0     0     0
      1     0     0]
 [    7  7993     0     0     0     0     0     0     0     0     0     0
      0     0     0]
 [    0     0  1562     0     0     0     0     0     0     0     0     0
     38     0     0]
 [    8     0   108   277     0     0     0     0     0     0     0     0
      7     0     0]
 [   51     0     0     0 11949     0     0     0     0     0     0     0
      0     0     0]
 [ 1718     0     1     0     0  4281     0     0     0     0     0     0
      0     0     0]
 [    0     0     0     0     0     0    69     0     0     0     0     0
      0     0     0]
 [    0     0     0     0     0     0     0 12000     0     0     0     0
      0     0     0]
 [   21     0     0     0     0     0     0     0     4     0     0     0
      0     0     0]
 [    5     0     0     0     0     0     0     0     0     4     0     0
      0     0     0
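The gap between accuracy (about 0.967) and macro F1 (about 0.850) comes from a few small classes that are often misclassified. As a rough illustration, the sketch below computes per-class recall for three rows copied from the confusion matrix printed above (rows are actual classes, columns are predicted classes). The helper name `recall` is only for this example.

```python
# Rows copied from the confusion matrix output above, keyed by encoded class label.
rows = {
    1: [7, 7993, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    3: [8, 0, 108, 277, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0],
    8: [21, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0],
}


def recall(label, row):
    """Fraction of records of class `label` that were predicted as `label`."""
    total = sum(row)
    return row[label] / total if total else 0.0


recalls = {label: recall(label, row) for label, row in rows.items()}
for label, value in recalls.items():
    print(f"class {label}: recall = {value:.4f} over {sum(rows[label])} records")
```

Class 8 has only 25 records in this sample and most of them are predicted as class 0, which barely moves accuracy but pulls the macro average down because every class counts equally.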

Since our shadow test was running when we sent the data to our endpoint, we can get the test model's predictions from the data capture files in S3.

In [31]:
!mkdir -p ./data/datacapture/

In [115]:
!aws s3 cp {storage} ./data/datacapture/  --recursive

download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/datacapture_test/sagemaker-xgboost-2023-01-31-22-52-36-533/AllTraffic/2023/02/01/00/32-25-209-914b7940-65cf-4727-917c-3ba5e40dc7e1.jsonl to data/datacapture/AllTraffic/2023/02/01/00/32-25-209-914b7940-65cf-4727-917c-3ba5e40dc7e1.jsonl
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/datacapture_test/sagemaker-xgboost-2023-01-31-22-52-36-533/AllTraffic/2023/01/31/23/05-36-040-2ce4adf3-f0c9-46d6-958d-0fbd780077a0.jsonl to data/datacapture/AllTraffic/2023/01/31/23/05-36-040-2ce4adf3-f0c9-46d6-958d-0fbd780077a0.jsonl
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/datacapture_test/sagemaker-xgboost-2023-01-31-22-52-36-533/AllTraffic/2023/01/31/23/01-36-017-49d9233d-e080-4b62-ae10-e7b7f12ce790.jsonl to data/datacapture/AllTraffic/2023/01/31/23/01-36-017-49d9233d-e080-4b62-ae10-e7b7f12ce790.jsonl
download: s3://sagemaker-us-east-1-278578987671/xgboost-webtraffic/datacapture_test/sagemaker-

In [116]:
import glob

shadowfiles = glob.glob('./data/datacapture/Shadow-01/**/*.jsonl',recursive=True)
shadowfiles

['./data/datacapture/Shadow-01/2023/02/01/17/57-14-770-777f973d-2587-4c56-aced-b1a85a4abc7b.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/17/58-14-815-f12755b0-05ce-4add-be46-0430f3ff1160.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/17/55-10-378-0e1479e1-2d21-4cdb-95d4-a237e5a319f7.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/17/59-14-834-135b0c43-b556-4e98-84be-38e30c681706.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/05-15-069-1ad7ceb9-d7eb-4896-b591-4d1e79b04c70.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/03-15-007-900a9096-c286-46f3-a0e4-ce2dd2b6ea81.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/00-14-861-698566e3-37fb-492c-bf0a-0fb8bac5c112.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/08-15-199-6770001e-b9f2-4562-a4d9-6a1f1c66f594.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/04-15-042-87875fd1-9096-4b0d-8f21-20555f628da0.jsonl',
 './data/datacapture/Shadow-01/2023/02/01/18/11-15-223-a0670323-9d37-44f8-b083-055a42763867.jsonl',


In [34]:
prodfiles = glob.glob('./data/datacapture/AllTraffic/**/*.jsonl',recursive=True)
prodfiles

['./data/datacapture/AllTraffic/2023/01/31/21/40-31-621-e43c82d1-6583-49b7-acd7-3c1024cee6a3.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/34-31-601-38c89294-0471-4a18-8650-9604f50297f4.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/35-31-605-04ca50a2-bd91-4ad3-886f-a173329a2720.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/36-31-606-d687368c-a1b2-4587-abb1-7e7816ad826a.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/31-31-586-5621fe48-89e2-4a40-90e1-e46d59d8d9be.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/38-31-615-1f448050-d4c7-416a-b128-82aa760d238e.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/37-31-609-13b772f9-ddd1-4334-98c3-91da7c6e3120.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/33-31-595-789f4c38-9453-4f42-bf49-174ff616dfb0.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/39-31-619-edf34f2c-d663-4d23-94bc-51c688b8b3f6.jsonl',
 './data/datacapture/AllTraffic/2023/01/31/21/32-31-593-228825e7-abd7-4b73-b713-7d00006fefc

In [35]:
%pip install jsonlines

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting jsonlines
  Downloading jsonlines-3.1.0-py3-none-any.whl (8.6 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-3.1.0
Note: you may need to restart the kernel to use updated packages.


In [36]:
import jsonlines

In [37]:
import base64

In [161]:
shadowin = []
shadowout = []

for f in shadowfiles:
    print(f)
    with jsonlines.open(f) as reader:
        for obj in reader:

            # input to model
            model_input = base64.b64decode(obj['captureData']['endpointInput']['data']).decode('UTF-8').split(',')
            shadowin.append(model_input)
            
            # output from model
            model_output = base64.b64decode(obj['captureData']['endpointOutput']['data']).decode('UTF-8').strip('[').strip(']')
            metadata = obj['eventMetadata']
            if 'inferenceId' in metadata:
                shadowout.append([model_output, metadata['inferenceId']])

./data/datacapture/Shadow-01/2023/02/01/17/57-14-770-777f973d-2587-4c56-aced-b1a85a4abc7b.jsonl
./data/datacapture/Shadow-01/2023/02/01/17/58-14-815-f12755b0-05ce-4add-be46-0430f3ff1160.jsonl
./data/datacapture/Shadow-01/2023/02/01/17/55-10-378-0e1479e1-2d21-4cdb-95d4-a237e5a319f7.jsonl
./data/datacapture/Shadow-01/2023/02/01/17/59-14-834-135b0c43-b556-4e98-84be-38e30c681706.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/05-15-069-1ad7ceb9-d7eb-4896-b591-4d1e79b04c70.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/03-15-007-900a9096-c286-46f3-a0e4-ce2dd2b6ea81.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/00-14-861-698566e3-37fb-492c-bf0a-0fb8bac5c112.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/08-15-199-6770001e-b9f2-4562-a4d9-6a1f1c66f594.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/04-15-042-87875fd1-9096-4b0d-8f21-20555f628da0.jsonl
./data/datacapture/Shadow-01/2023/02/01/18/11-15-223-a0670323-9d37-44f8-b083-055a42763867.jsonl
./data/datacapture/Shadow-01/2023/02/01/
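To make the parsing loop above easier to follow, here is a small sketch that decodes a single data capture record using only the standard library. The record is synthetic and made up for illustration: it contains only the keys that the loop reads (`captureData`, `endpointInput`, `endpointOutput`, `eventMetadata`, `inferenceId`), and the values are placeholders. The helper names `encode` and `decode_record` are hypothetical.

```python
import base64
import json


def encode(text):
    """Base64-encode a string the way the captured payloads are stored."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Synthetic record (illustrative values only), limited to the keys read above.
line = json.dumps({
    "captureData": {
        "endpointInput": {"data": encode("21,2,1,1")},
        "endpointOutput": {"data": encode("[13]")},
    },
    "eventMetadata": {"inferenceId": "518708"},
})


def decode_record(line):
    """Return (features, prediction string, inference ID or None) for one JSON line."""
    obj = json.loads(line)
    capture = obj["captureData"]
    features = base64.b64decode(capture["endpointInput"]["data"]).decode("utf-8").split(",")
    output = base64.b64decode(capture["endpointOutput"]["data"]).decode("utf-8").strip("[]")
    inference_id = obj["eventMetadata"].get("inferenceId")
    return features, output, inference_id


features, output, inference_id = decode_record(line)
print(features, output, inference_id)
```

Using `.get("inferenceId")` mirrors the `if 'inferenceId' in metadata` check above: requests sent without an inference ID cannot be joined back to a true label.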

In [178]:
true_df = pd.DataFrame(holdout.Target.values)
true_df.set_index(holdout.index, inplace=True)
true_df

Unnamed: 0,0
518708,13
108975,1
53378,0
363848,11
379387,14
...,...
314180,13
465531,0
189547,4
519158,1


In [219]:
shadow_df = pd.DataFrame(shadowout)
# Column 1 holds the inference ID, which is the holdout row index sent with each request
shadow_df['int_index'] = shadow_df[1].map(int)
shadow_df.set_index('int_index', inplace=True)
shadow_df.rename(columns={0:'prediction'}, inplace=True)
true_df.rename(columns={0:'actual'}, inplace=True)
shadow_df.drop(columns=[1], inplace=True)
print(shadow_df.head())
print(true_df.head())
# The default inner join keeps only rows that have both a captured prediction and a true label
all_df = shadow_df.merge(true_df, left_index=True, right_index=True)
all_df

          prediction
int_index           
518708            13
518708            13
108975             1
101496            11
396506             2
        actual
518708      13
108975       1
53378        0
363848      11
379387      14


Unnamed: 0,prediction,actual
41,0,0
91,0,0
106,0,0
117,4,4
126,0,0
...,...,...
530275,4,4
530284,14,14
530288,0,0
530301,5,5
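With predictions and true labels side by side, the shadow variant can be scored. The sketch below shows the idea in plain Python on a few hypothetical rows shaped like `shadowout` (prediction string, inference ID) and the holdout labels. It keeps one prediction per inference ID, since the same ID can appear more than once in the captured data (as `518708` does above). The function names and sample values are assumptions for illustration only.

```python
# Hypothetical sample data: [prediction, inference_id] pairs and true labels by row index.
shadow_rows = [["13", "518708"], ["13", "518708"], ["1", "108975"], ["0", "363848"]]
actual_by_id = {518708: 13, 108975: 1, 53378: 0, 363848: 11}


def join_predictions(shadow_rows, actual_by_id):
    """Pair each captured prediction with its true label, one entry per inference ID."""
    pairs = {}
    for raw_prediction, inference_id in shadow_rows:
        key = int(inference_id)
        # Skip IDs without a known label and repeated captures of the same ID.
        if key in actual_by_id and key not in pairs:
            pairs[key] = (int(float(raw_prediction)), actual_by_id[key])
    return pairs


def accuracy(pairs):
    """Share of joined rows where the prediction equals the true label."""
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, actual in pairs.values() if predicted == actual)
    return hits / len(pairs)


pairs = join_predictions(shadow_rows, actual_by_id)
shadow_accuracy = accuracy(pairs)
print(f"joined rows: {len(pairs)}, accuracy: {shadow_accuracy:.3f}")
```

The same comparison on the production variant's captured predictions gives a like-for-like basis for deciding whether to promote the shadow model.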


In [99]:
model_output[:10]

'0'

In [194]:
shadow_df

Unnamed: 0_level_0,prediction,actual
1,Unnamed: 1_level_1,Unnamed: 2_level_1
518708,13,
518708,13,
108975,1,
101496,11,
396506,2,
...,...,...
416288,14,
399604,0,
469836,4,
381563,5,


In [40]:
%%time

prodin = []
prodout = []

for f in prodfiles:
    print(f)
    with jsonlines.open(f) as reader:
        for obj in reader:
            # input to model
            model_input = base64.b64decode(obj['captureData']['endpointInput']['data']).decode('UTF-8').split(',')
            prodin.append(model_input)
            
            # output from model
            model_output = base64.b64decode(obj['captureData']['endpointOutput']['data']).decode('UTF-8').strip('[').strip(']')
            prodout.append(model_output)

./data/datacapture/AllTraffic/2023/01/31/21/40-31-621-e43c82d1-6583-49b7-acd7-3c1024cee6a3.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/34-31-601-38c89294-0471-4a18-8650-9604f50297f4.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/35-31-605-04ca50a2-bd91-4ad3-886f-a173329a2720.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/36-31-606-d687368c-a1b2-4587-abb1-7e7816ad826a.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/31-31-586-5621fe48-89e2-4a40-90e1-e46d59d8d9be.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/38-31-615-1f448050-d4c7-416a-b128-82aa760d238e.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/37-31-609-13b772f9-ddd1-4334-98c3-91da7c6e3120.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/33-31-595-789f4c38-9453-4f42-bf49-174ff616dfb0.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/39-31-619-edf34f2c-d663-4d23-94bc-51c688b8b3f6.jsonl
./data/datacapture/AllTraffic/2023/01/31/21/32-31-593-228825e7-abd7-4b73-b713-7d00006fefc9.jsonl
./data/datacapture/AllTraffic/

In [41]:
len(prodout)

96093

Our shadow metrics look good, so let's promote that model to production:

In [None]:
# Promote 
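The promotion cell is left as a placeholder. One possible shape for it, as a hedged sketch: the `stop_inference_experiment` call accepts a `ModelVariantActions` mapping whose values can be `Promote`, `Remove`, or `Retain`, and a `DesiredState` of `Completed` or `Cancelled`. The helper below only builds the request; the variant names `AllTraffic` and `Shadow-01` are taken from the data capture prefixes and are assumptions, as is the choice to remove the old production variant. Check the API reference before running this against a real endpoint.

```python
def build_promote_request(experiment_name, production_variant, shadow_variant, reason):
    """Build keyword arguments for stop_inference_experiment that promote the shadow variant."""
    return {
        "Name": experiment_name,
        "ModelVariantActions": {
            shadow_variant: "Promote",     # shadow model becomes the production model
            production_variant: "Remove",  # assumption: drop the old production variant
        },
        "DesiredState": "Completed",
        "Reason": reason,
    }


# Placeholder names for illustration; substitute the values used in this notebook.
request = build_promote_request(
    experiment_name="my-shadow-test",
    production_variant="AllTraffic",
    shadow_variant="Shadow-01",
    reason="Shadow metrics look good",
)
print(request)

# With a boto3 SageMaker client, the request would be sent like this:
# sm_client.stop_inference_experiment(**request)
```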

If we discover problems after promoting our test model, we can easily roll back:

In [None]:
# roll back to n-1
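The rollback cell is also a placeholder. A minimal sketch, assuming the previous (n-1) endpoint configuration still exists and its name was recorded before promotion: point the endpoint back at that configuration with `update_endpoint`. The function name `roll_back`, the stand-in client, and all resource names below are hypothetical; the stand-in lets the example run without AWS credentials.

```python
def roll_back(sm_client, endpoint_name, previous_config_name):
    """Point the endpoint back at the endpoint configuration that was live before promotion."""
    return sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=previous_config_name,
    )


class RecordingClient:
    """Stand-in for a boto3 SageMaker client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update_endpoint(self, **kwargs):
        self.calls.append(kwargs)
        return {"EndpointArn": "arn:placeholder"}


client = RecordingClient()
response = roll_back(client, "my-endpoint", "my-endpoint-config-previous")
print(client.calls)
```

In the notebook, `sm_client` would take the place of `RecordingClient()`, and the endpoint returns to service once the update finishes.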

## Simulate Production Traffic

We will now simulate production traffic by looping over the production data. In a real production use case you won't need to do this, since actual production data will be flowing to the production endpoint.

Note that we are not capturing the inference requests or outputs ourselves. We could, but since we have configured data capture, we let SageMaker write the captured data to S3.

In [None]:
for p in prod:
    predictor.predict(p).decode('utf-8')



Now, we'll use a simple function to:

1. Loop over our test dataset
2. Split it into mini-batches of rows
3. Convert those mini-batches to CSV string payloads
4. Retrieve mini-batch predictions by invoking the XGBoost endpoint
5. Collect predictions and convert from the CSV output our model provides into a NumPy array



In [None]:
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')
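To see what the helper above sends and receives without a live endpoint, here is a standard-library sketch of the same steps. It differs from the helper in two labeled ways: it slices fixed-size batches instead of using `np.array_split`, and it builds the CSV payload by hand rather than relying on the predictor's serializer. `fake_predict` is a stand-in that returns class 0 for every row, and the sample rows are made up.

```python
def to_csv_payloads(rows, batch_size=500):
    """Yield one CSV string per mini-batch, one line per row."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield "\n".join(",".join(str(value) for value in row) for row in batch)


def fake_predict(payload):
    """Stand-in for predictor.predict: one comma-separated prediction per input row."""
    n_rows = payload.count("\n") + 1
    return ",".join(["0.0"] * n_rows).encode("utf-8")


def predict_all(rows, predict_fn, batch_size=500):
    """Send each mini-batch to predict_fn and collect the parsed predictions."""
    predictions = []
    for payload in to_csv_payloads(rows, batch_size):
        response = predict_fn(payload).decode("utf-8")
        predictions.extend(float(value) for value in response.split(","))
    return predictions


# Made-up feature rows for illustration.
rows = [[80, 1.5, 0], [443, 2.0, 1], [21, 0.1, 0], [22, 0.2, 1], [8080, 3.3, 0]]
payloads = list(to_csv_payloads(rows, batch_size=2))
predictions = predict_all(rows, fake_predict, batch_size=2)
print(payloads[0])
print(predictions)
```

Batching trades fewer endpoint invocations against larger request payloads, so the batch size needs to stay within the endpoint's request size limit.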

In [None]:
predictions = predict(val_df.to_numpy()[:,1:])

In [None]:
predictions.shape

In [None]:
actual = val_df.to_numpy()[:,0]

In [None]:
actual.shape

In [None]:
class_list = ['Benign','Bot','DoS attacks-GoldenEye','DoS attacks-Slowloris','DDoS attacks-LOIC-HTTP','Infilteration','DDOS attack-LOIC-UDP','DDOS attack-HOIC','Brute Force-Web','Brute Force-XSS','SQL Injection','DoS attacks-SlowHTTPTest','DoS attacks-Hulk','FTP-BruteForce','SSH-Bruteforce']
fig, ax = plt.subplots(figsize=(15,10))
cm = confusion_matrix(actual,predictions)
normalized_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(normalized_cm, ax=ax, annot=cm, fmt='',xticklabels=class_list,yticklabels=class_list)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Finally, let's gracefully stop the shadow test and delete it.

TODO - PROMOTE THE SHADOW VARIANT TO PROD VIA CODE, INCLUDE CODE ON HOW TO ROLL BACK TO N-1 VERSION OF THE MODEL

In [None]:
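sm_client.stop_inference_experiment(
    Name=shadowtestname,
    ModelVariantActions={
        # Key is the shadow variant's name; 'Shadow-01' matches the data capture prefix used above
        'Shadow-01': 'Remove'
    },
    DesiredState='Cancelled',
    Reason='Workshop complete; cleaning up the shadow test'
)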

In [None]:
#predictor.delete_endpoint()
sm_client.delete_inference_experiment(
    Name=shadowtestname
)

# References

* A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) - https://registry.opendata.aws/cse-cic-ids2018/
* AIM362 - re:Invent 2019 SageMaker Debugger and Model Monitor - https://github.com/aws-samples/reinvent2019-aim362-sagemaker-debugger-model-monitor