## Udacity ML Engineering Nanodegree - Capstone Project

# Identifying attacker application sessions through supervised learning over known attacker sessions
## Project Domain
This project bears on the domain of security. I work for a company that publishes expensive content on the web, making it available to subscribers only. We frequently discover cases in which attackers have stolen credentials from our customers and use them to perform unauthorized content downloads. Detecting such activity quickly, without human analysis, has always been difficult given the large amounts of data involved.

## Problem Statement

Given a selection of log data containing records that describe different usage activities, determine whether a given session is likely to describe unauthorized content access. I will take a supervised learning approach, using labeled input data that includes both known attacker sessions and “innocent” sessions, with the goal of producing a model that identifies a high percentage of attack sessions with very few false positives. A successful proof of concept would identify 70% of attacker sessions with no more than 5% false positives.

## Datasets and Inputs

- known-attacker.txt – a file containing log entries collected from sessions manually identified as belonging to known attackers. Covering a span of around two years, this file contains ~60K attacker events.

- mixed-1.txt through mixed-3.txt – files containing an unfiltered selection of time intervals known to contain attacker activities. These files contain a total of ~200K events, of which less than 0.5% are attacker events.

The events in each file are recorded as tab-separated rows in the following columns:

SessionNo LogTime CustID GroupID ProfID Act BadActor

The ‘SessionNo’ column groups activities into sets of consecutive user actions. Each action carries an ‘Act’ value that identifies what the user did, e.g., logged in, performed a search, or downloaded content. The CustID, GroupID, and ProfID columns identify unique customer organizations, and are not expected to be useful in the learning exercise per se.

Each file has been labeled manually with a notation whether each event belongs to an attacker session. The ‘BadActor’ column indicates whether the given event belongs to an attacker session.

Given the low ratio of attacker to innocent events in the mixed files, I expect it will be necessary to augment the data to improve training results. [Buczak and Guven, 2016] suggests dropping negative rows or duplicating positive rows; I plan to augment training data using the long history of attacker-only rows in known-attacker.txt.
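As an illustration of the "dropping negative rows" option mentioned above, here is a minimal sketch on toy data. The function name `downsample_negatives`, the `neg_per_pos` ratio, and the toy frame are assumptions made for this example; the notebook itself rebalances by adding attacker-only rows instead.

```python
import pandas as pd

def downsample_negatives(df, label_col, neg_per_pos, seed=0):
    """Keep every positive row and at most neg_per_pos negative rows per positive row."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_keep = min(len(neg), len(pos) * neg_per_pos)
    kept_neg = neg.sample(n=n_keep, random_state=seed)
    # Shuffle so positives and negatives are interleaved.
    return (pd.concat([pos, kept_neg])
              .sample(frac=1.0, random_state=seed)
              .reset_index(drop=True))

# Toy data (made up): 2 attacker rows among 20.
toy = pd.DataFrame({"BadActor": [1, 1] + [0] * 18, "Act": ["111"] * 20})
balanced = downsample_negatives(toy, "BadActor", neg_per_pos=3)
print(balanced["BadActor"].value_counts())
```

Downsampling discards data, so it is only attractive when negatives are plentiful, which is the case in the mixed files.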

## Solution Statement - further experiments

After completing the session-times experiment, I wanted to evaluate some approaches for measuring performance of the model against partial sessions. Partial-session processing is useful for live incident response; I want to be able to identify an attack in progress and block it, so I can't wait for a session to complete. In this notebook I explore several kinds of partial sessions - first N transactions, last N transactions, and first N% transactions.

In addition, I wanted to improve the software I'm using to run these experiments. I wanted to be able to run many training jobs at the same time and start many endpoints consecutively; in addition, I wanted to get the Sagemaker boilerplate code out of the notebook so it's easy to see what's going on in a given experiment.

## Prior research
As far as I can see, the problem of identifying attackers from logs of activities at the business-use-case level has not been thoroughly researched. I did find numerous papers on analysis of net flows and HTTP logs that can be read for useful parallel techniques.

Pietraszek, Tadeusz and Axel Tanner. “Data mining and machine learning - Towards reducing false positives in intrusion detection.” Inf. Sec. Techn. Report 10 (2005): 169-183. Uses machine learning to identify candidate alerts from an IDS for human labeling. Labels are used in supervised learning to refine selection of alerts.

Buczak, Anna L. and Erhan Guven. “A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection.” IEEE Communications Surveys & Tutorials 18 (2016): 1153-1176. In the domain of IP net flow analysis, establishes terminology and reviews a broad selection of techniques for modeling attacks based on collections of internet packets.

Sperotto, Anna et al. “An Overview of IP Flow-Based Intrusion Detection.” IEEE Communications Surveys & Tutorials 12 (2010): 343-356. In-depth discussion of net flow analysis, distinguishing parts of the net-flow modeling problem rather than how to analyze collections of successfully-captured packets.

Moh, Melody et al. “Detecting Web Attacks Using Multi-stage Log Analysis.” 2016 IEEE 6th International Conference on Advanced Computing (IACC) (2016): 733-738. Overview of an approach for managing high volumes of HTTP logs and analyzing for presence of SQL injection attacks. Uses Bayes net classification in WEKA, produces an enriched analyst workstation environment in Kibana.

## Benchmark Model & Evaluation Metrics
The total volume of known attacker traffic is extremely low, less than .05% for a given victim. My approach to computing a baseline is to assume that *P(attack)* for a given row is 0. I will compare the confusion matrix for this baseline to the one for predictions from my ML model.
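A minimal sketch of this baseline, assuming labels are encoded as 0/1 as in the BadActor column. The helper `confusion_counts` and the toy labels are made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

# Toy labels (made up): 2 attacker sessions out of 10.
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# Baseline: P(attack) = 0 for every row, so every prediction is 0.
baseline_pred = np.zeros_like(y_true)
baseline = confusion_counts(y_true, baseline_pred)
print(baseline)  # (8, 0, 2, 0)
```

The baseline never raises a false alarm but also never catches an attacker, so any useful model has to move sessions out of the false-negative cell without filling the false-positive one.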

## Project Design
I will execute the following plan:

1. Import the data into Jupyter & SageMaker in order to study it in place

2. Choose an approach for feature engineering:

    a. Can I engineer a session row that contains enough information to produce a useful result in one of the algorithms I’ve already used, like XGBoost?

    b. Do I need to use LSTM or convolution or some other learning algorithm with a memory that can learn sequences?

3. Produce a repeatable process that can convert our raw log file into a dataset that my chosen algorithm can process

4. Train a model, and execute the model against labelled data in batch transform mode

5. Use the result to calculate true and false positives and negatives. Given the small population of positive results – attacker sessions are a small fraction of total traffic – precision and recall are the most important metrics.

    a. High precision means that my false positive rate is low. It’s critical not to misidentify innocent traffic as malicious.

    b. High recall is the next most important metric. I need to identify as great a fraction of actual attack traffic as possible.
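The metrics in step 5 can be sketched from the four confusion-matrix counts. This is an illustration only: the function names are hypothetical, the counts are made up, and reading the "no more than 5% false positives" target as a false positive rate (false positives over all innocent sessions) is an assumption.

```python
def session_metrics(tn, fp, fn, tp):
    """Compute precision, recall, and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "false_positive_rate": fpr}

def meets_poc_target(metrics, min_recall=0.70, max_fpr=0.05):
    """Check the proof-of-concept target: >= 70% recall, <= 5% false positives."""
    return (metrics["recall"] >= min_recall
            and metrics["false_positive_rate"] <= max_fpr)

# Made-up counts: 10 attacker sessions, 1000 innocent sessions.
m = session_metrics(tn=960, fp=40, fn=2, tp=8)
print(m, meets_poc_target(m))
```

With classes this imbalanced, a false positive rate under 5% can still mean low precision (here 8 of 48 flagged sessions are real attacks), which is why both numbers are worth reporting.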

The training set includes the following files:

|File|Rows|Contents|
|---|---|---|
|mixed-1.txt|119474|raw transactions|
|mixed-2.txt|43608|raw transactions|
|mixed-3.txt|30844|raw transactions|
|known-attacker.txt|61917|raw transactions for a known attacker|

All files include a 'BadActor' column that labels a transaction as belonging to a known attacker or not. The 'mixed-' files consist of whole hours of activity in which there is an attack, while known-attacker.txt contains all attacks for the known attacker over the last two years. In all, we have 193926 transactions from the mixed files, which are overwhelmingly "innocent" sessions, and 61917 transactions from the known attacker. At least 24% of the 255843 transactions are therefore labeled BadActor = 1 (the attacker-only rows are about 32% as numerous as the mixed rows), giving us a reasonable proportion in both classes.

# Sequence testing phase 1. 

## Warm up
Import standard libraries and prepare the environment.

In [1]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

import boto3
import sagemaker
from sagemaker import get_execution_role


In [2]:
# S3 bucket name
bucket = 'sk-mlai-harvesting'

# common column names
bad_col='BadActor'
sess_col='SessionNo'
txn_col='Act'
logtime_col = 'LogTime'

# paths
csv_path = "out"
!mkdir -p out


In [13]:
mixed = []
for i in range(3):
    name = 'data/mixed-{}.txt'.format(i+1)
    m = pd.read_csv(name,sep='\t')
    print( "File: {} Rows: {}".format( name, len(m)))
    mixed.append(m)

name = 'data/known-attacker.txt'
known = pd.read_csv(name ,sep='\t')
print( "File: {} Rows: {}".format( name, len(known)))

# Load data for extra day of test data - need to have full set of transaction types
name = 'data/single.txt'
single = pd.read_csv(name ,sep='\t')
single[txn_col]= single[txn_col].astype(str)

mixed.append(known)

txn = pd.concat(mixed)
print( "Total mixed transaction rows: {}".format(len(txn)))

txn[logtime_col] = pd.to_datetime(txn[logtime_col])
txn[txn_col]= txn[txn_col].astype(str)
# txn[txn[bad_col]==1]

File: data/mixed-1.txt Rows: 119474
File: data/mixed-2.txt Rows: 43608
File: data/mixed-3.txt Rows: 30844
File: data/known-attacker.txt Rows: 61917
Total mixed transaction rows: 255843


We need to compute the full list of transaction types appearing in all of our data in order to make sure that our columns line up later.

## Data conversion and feature engineering
In real life, a session consists of a series of rows of transactions of different types, and each transaction type records a variable number of additional metadata attributes describing a logged event, for a total of over 30 columns of extracted data. In addition, our tagging process has given each row a BadActor label.

|sessionno|txn id|BadActor|parm1|parm2|...|
|---------|------|--------|-----|-----|---|
|1240|111|0|query string|...|...|
|1240|112|0|meta|...|...|
|2993|301|1|meta|...|...|



We drop most of this information, including the temporal sequence of the log entries, and convert each session into a single row of data. Almost all of the columns go away, replaced by counts of transaction types in the session.

|sessionno|BadActor|111|112|113|...|301|302|...|
|---------|--------|---|---|---|---|---|---|---|
|1240|0|1|1|0|...|0|0|...|
|2993|1|0|0|0|...|1|0|...|
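The flattening step can be illustrated on the three example rows above. This sketch uses `pd.crosstab` for brevity, whereas the notebook's own functions below use `pd.pivot_table` with `aggfunc=len`; the toy frame is made up to match the example tables.

```python
import pandas as pd

# The three example log rows from the table above.
log = pd.DataFrame({
    "SessionNo": [1240, 1240, 2993],
    "Act": ["111", "112", "301"],
    "BadActor": [0, 0, 1],
})

# Count how often each transaction type occurs in each (session, label) pair.
counts = pd.crosstab([log["SessionNo"], log["BadActor"]], log["Act"])

# Drop the 'Act' columns name and turn the row index back into ordinary columns.
flat = counts.rename_axis(None, axis=1).reset_index()
print(flat)
```

Transaction types that never occur in the input simply have no column, which is why the notebook later pads each table out to the full list of types.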

In [14]:
# 'Innocent' log entries
txns = pd.DataFrame(np.sort(txn['Act'].unique()))

# Harvesting log entries
known_txns = pd.DataFrame(np.sort(known['Act'].unique()))

all_txns=pd.concat([txn,single])

all_txn_types = np.sort(all_txns[txn_col].unique())
all_txn_types


array(['111', '112', '114', '115', '116', '117', '118', '119', '121',
       '123', '124', '125', '126', '127', '135', '201', '215', '216',
       '217', '219', '311', '312', '315', '316', '317', '401', '402',
       '403', '404', '406', '407', '410', '411', '511', '513', '601',
       '607'], dtype=object)

# Truncating sessions
Before we flatten the sessions, we're going to truncate them. This may be a better match for the real world, in which at best we will be able to scan sliding windows of transactions, with a scaling assumption that we may not scan every event.

We'll try several approaches at once:
- dropping out D% of transactions from every session
- taking only the first N transactions from every session
- taking only the last N transactions from every session
- taking N consecutive transactions from the middle of every session. 
- choosing N transactions as above, but dropping every session without at least N transactions.
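The cells below implement the drop, first-N, and last-N variants. The last two items in the list could be sketched as follows; `middle_n` and `require_min_length` are hypothetical helpers written for this illustration, not functions from the notebook.

```python
import pandas as pd

def middle_n(df, n=5):
    """Keep n consecutive rows centred in the session (the whole session if it is shorter)."""
    start = max((len(df) - n) // 2, 0)
    return df.iloc[start:start + n]

def require_min_length(txn, sess_col, n):
    """Drop every session that has fewer than n transactions."""
    sizes = txn.groupby(sess_col)[sess_col].transform("size")
    return txn[sizes >= n]

# Toy log (made up): session 1 has 7 rows, session 2 has 2 rows.
toy = pd.DataFrame({
    "SessionNo": [1] * 7 + [2] * 2,
    "Act": list("abcdefg") + ["x", "y"],
})

long_only = require_min_length(toy, "SessionNo", 3)
middle = pd.concat(middle_n(g, 3) for _, g in long_only.groupby("SessionNo"))
print(middle)
```

Filtering by length first keeps very short sessions from being compared against a window they cannot fill.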


In [15]:
def drop_column_groups( txn_c ):
    txn_c.columns = txn_c.columns.droplevel(0)
    # rename_axis/reset_index return new frames, so the result has to be returned
    return txn_c.rename_axis(None, axis=1).reset_index()

def get_session_groups( txn ):
    txn_g = txn.groupby(sess_col)
    return txn_g

txng = get_session_groups(txn)
txng

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f0cbabba630>

In [16]:

def drop_pct( df, n = .1 ):
    return df.sample(frac= 1-n)

def first_n( df, n = 5):
    return df.head( n )
    
def last_n( df, n = 5):
    return df.tail( n )


In [17]:
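def grappl( fun, parm ):
    '''
    Build a function that applies fun(session_df, parm) to every session group
    and drops the redundant session column (the session id is kept in the index).
    '''
    return lambda df: df.apply(fun, parm).drop(sess_col, axis=1)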


In [18]:
def add_missing_columns( df_target, source_cols):
    '''
    If there are columns in the list 'source' that are not in the DataFrame 'target',
    add new columns to 'target' that are populated with 0.0.
    '''
    df = df_target
    target_cols = df_target.columns
    missing_cols = set(source_cols) - set(target_cols)
    
    print( "Missing columns: {}".format(missing_cols)) 
    if 0 < len(missing_cols):
        new_cols = dict([(col,0.0) for col in missing_cols])
        df = df_target.assign(**new_cols)
    
    return df

In [19]:
def flatten_txns( txn_log ):
    '''
    On a flat list of sessions, run a pivot table on transaction type counts by session, and eliminate extraneous columns.
    Flatten the pivot table and simplify the index.
    '''
    txn_narrow = txn_log[[sess_col, txn_col,bad_col]]
    txn_pivot = pd.pivot_table(txn_narrow, index=[sess_col,bad_col], columns = [txn_col],aggfunc=[len]).fillna(0)
    txn_pivot.columns = txn_pivot.columns.droplevel(0)           # the pivot table has a two-level index
    txn_flat = txn_pivot.rename_axis(None, axis=1).reset_index() # these two lines get rid of it so we have a simple table

    txn_full = add_missing_columns(txn_flat, all_txn_types)
    txn_xs = txn_full.drop(columns=[sess_col,bad_col])
    txn_xs = txn_xs.reindex(sorted(txn_xs.columns), axis=1)

    txn_sort = pd.concat([txn_full[[sess_col,bad_col]], txn_xs],sort=True,axis=1)
    
    return txn_sort

In [20]:
def flatten_groups( txn_log ):
    '''
    On a set of session groups, run a pivot table on transaction type counts by session, and eliminate extraneous columns.
    Flatten all groups into one table and simplify the index.
    '''
    txn_narrow = txn_log[[txn_col,bad_col]] # for groups, don't need to drop the session column because it's already an index column.
    txn_pivot = pd.pivot_table(txn_narrow, index=[sess_col,bad_col], columns = [txn_col],aggfunc=[len]).fillna(0)
    txn_pivot.columns = txn_pivot.columns.droplevel(0)           # the pivot table has a two-level index
    txn_flat = txn_pivot.rename_axis(None, axis=1).reset_index() # these two lines get rid of it so we have a simple table
    txn_full = add_missing_columns(txn_flat, all_txn_types)
    txn_xs = txn_full.drop(columns=[sess_col,bad_col])

    txn_xs = txn_xs.reindex(sorted(txn_xs.columns), axis=1)
#     txn_xs = txn_xs.drop('index',axis=1)

    txn_sort = pd.concat([txn_full[[sess_col,bad_col]], txn_xs],sort=True,axis=1)

    return txn_sort

In [21]:
jobs = [['drop0', [drop_pct, 0.0]], # control - keep the whole session, just to make sure that everything is working
    ['drop10', [drop_pct, .2]], # note: despite the name, this job drops 20% of each session's rows
    ['first5', [first_n, 5]],
    ['first10', [first_n, 10]],
    ['last5', [last_n, 5]],
    ['last10', [last_n, 10]]
       ]
job_names = [job[0] for job in jobs]
[(name, func, parm) for [name, [func, parm]] in jobs]

[('drop0', <function __main__.drop_pct(df, n=0.1)>, 0.0),
 ('drop10', <function __main__.drop_pct(df, n=0.1)>, 0.2),
 ('first5', <function __main__.first_n(df, n=5)>, 5),
 ('first10', <function __main__.first_n(df, n=5)>, 10),
 ('last5', <function __main__.last_n(df, n=5)>, 5),
 ('last10', <function __main__.last_n(df, n=5)>, 10)]

In [22]:
def prep_jobs(df, jobs):
    groups = [[name, grappl(fun, parm)(df)] for [name, [fun, parm]] in jobs]

    flats = [[name,flatten_groups( txn )] for [name,txn] in groups] #.reset_index()
    return flats

In [23]:
%%time

gs = prep_jobs(txng, jobs )


Missing columns: {'312'}
Missing columns: {'312'}
Missing columns: {'607', '312', '601', '402'}
Missing columns: {'607', '312', '601', '402'}
Missing columns: {'312', '402'}
Missing columns: {'312', '402'}
CPU times: user 3min 31s, sys: 835 ms, total: 3min 32s
Wall time: 3min 31s


In [24]:
pd.DataFrame([(g[1].columns) for g in gs]).transpose()

Unnamed: 0,0,1,2,3,4,5
0,SessionNo,SessionNo,SessionNo,SessionNo,SessionNo,SessionNo
1,BadActor,BadActor,BadActor,BadActor,BadActor,BadActor
2,111,111,111,111,111,111
3,112,112,112,112,112,112
4,114,114,114,114,114,114
5,115,115,115,115,115,115
6,116,116,116,116,116,116
7,117,117,117,117,117,117
8,118,118,118,118,118,118
9,119,119,119,119,119,119


In [25]:
txngs= gs
[(name,len(df), len(df.columns)) for [name,df] in txngs]
# gs[0]

[('drop0', 24114, 39),
 ('drop10', 24114, 39),
 ('first5', 24112, 39),
 ('first10', 24112, 39),
 ('last5', 24112, 39),
 ('last10', 24113, 39)]

## Producing pools of training and testing data

In order to support simultaneous execution of multiple jobs, this notebook introduces a new scheme for piping data through to models.

The normal flow runs as follows:

input data(s3) -> df's on notebook instance -> train.csv, test.csv, validate.csv on notebook instance -> s3/out/ -> Sagemaker instances

Hardcoding these filenames is fine for playing around in a notebook, but it limits us to one job at a time.

In this approach, every job has a base Name. This name will carry through from S3 into the Sagemaker instances.
Training data files will reside in `<s3bucket>/<key>/out/Name`.

As before, in each `Name` subfolder, we will divide the combined good and bad data pools as follows:
- a training set that the model iterates over during the learning process
- a test set that is used to evaluate the model during training
- a validation set that is kept separate to test the model after training is complete. We need separate test and validate pools in order to make sure that we're not overfitting the model to a single set of test data.

All of these functions are in the FrameSplitter class. 
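The FrameSplitter source is not shown in this notebook. As a rough idea of what a three-way split involves, here is a minimal sketch; the function names, the 60/20/20 proportions, and the toy frame are all assumptions and are not taken from FrameSplitter. The label-first, header-free layout reflects the CSV format that SageMaker's built-in XGBoost expects for training data.

```python
import pandas as pd

def split_three_way(df, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into train, test, and validate frames."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled.iloc[:n_train]
    test = shuffled.iloc[n_train:n_train + n_test]
    validate = shuffled.iloc[n_train + n_test:]
    return train, test, validate

def label_first(df, y_col, drop_cols):
    """Drop identifier columns and move the label column to the front."""
    xs = df.drop(columns=[y_col] + list(drop_cols))
    return pd.concat([df[[y_col]], xs], axis=1)

# Toy flattened sessions (made up).
toy = pd.DataFrame({
    "SessionNo": range(10),
    "BadActor": [0, 1] * 5,
    "111": range(10),
})
train, test, validate = split_three_way(toy)
train_out = label_first(train, "BadActor", ["SessionNo"])
# train_out.to_csv("out/<Name>_train.csv", header=False, index=False)  # per-job file name
print(len(train), len(test), len(validate))
```

Splitting on flattened session rows, rather than on raw transactions, keeps every transaction of a session in the same pool.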


In [26]:
from importlib import reload
import lib.JupHelper.JupHelper as jh
reload(jh)

<module 'lib.JupHelper.JupHelper' from '/home/ec2-user/SageMaker/ml-udacity/lib/JupHelper/JupHelper.py'>

In [27]:
import lib.JupHelper.JupHelper as jh

csv = jh.FrameSplitter( bad_col, [sess_col]) # FrameSplitter only holds onto the definition of the y_col and the x_cols - everything else is passed in

In [28]:
job_csvs = csv.make_all_csvs( gs )           # Drives the whole conversion process - look in the class for other helper methods

In [29]:
csvs = csv.get_all_csv_names(gs)

# Upload to S3 and move into SageMaker

Move all of the current csv's up into S3 for SageMaker, then start configuring jobs.

In [30]:
from importlib import reload
import lib.JupHelper.JupHelper as jh
reload(jh)

<module 'lib.JupHelper.JupHelper' from '/home/ec2-user/SageMaker/ml-udacity/lib/JupHelper/JupHelper.py'>

In [31]:
from sagemaker.amazon.amazon_estimator import get_image_uri

import boto3
from time import gmtime, strftime
import time

class XgbHyper:
    def __init__(self):
        self.params = {
            "max_depth":"5",
            "eta":"0.2",
            "gamma":"4",
            "min_child_weight":"6",
            "subsample":"0.7",
            "silent":"0",
            "objective":"binary:logistic",
            "num_round":"50"
        }

class TrainRunner:
    def __init__( self, prefix, region='us-east-1' ):
        '''
        Wrapper object to run multiple SageMaker training jobs.
        '''
        self.prefix = prefix
        self.region = region

        self.running_jobs = []
        self.models = []
        self.sg_client = boto3.client('sagemaker', region_name=region )
        # sagemaker session, role
        self.sagemaker_session = sagemaker.Session()
        self.role = sagemaker.get_execution_role()
    
    def start_jobs(self, jobs, bucket ):
        for (name, csvs) in jobs:
            job_name = self.make_job_name( name )
            print( job_name )
            self.start_training_job( bucket, job_name, csvs )
            self.running_jobs.append([name,job_name])
        return self.running_jobs
    
    def clear_jobs(self):
        self.running_jobs = []
        self.models = []
        
        #delete endpoints
        #delete configs
    
    def start_training_job(self, bucket, job_name, s3_inputs, image='xgboost', instance = [1, "ml.m4.4xlarge", 5 ], hyper=XgbHyper(), verbose=True):
        self.image = image
        self.s3_input = s3_inputs
        self.instance = instance
        self.hyper = hyper
        self.bucket = bucket

        self.container = get_image_uri( self.region, image )  # honor the image argument (defaults to 'xgboost')
        job_config = self.make_job_config( job_name, s3_inputs[0], s3_inputs[1], s3_inputs[2])
        res = self.launch_training_job( job_config )
        if verbose:
            print( "Started {}. Response: {}".format( job_name, res))
            
        return res
    
    def make_s3_url(self, file):
        return "s3://{}/{}".format(self.bucket, file)
    
    def get_job_status(self, job):
        [name, job_name] = job
        return self.sg_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    
    def check_jobs_still_running( self ):
        still_running = False
        for job in self.running_jobs:
            status = self.get_job_status( job )
            if status !='Completed' and status !='Failed':
                still_running = True
                break
        
        return still_running
    
    def map_jobs( self, func ):
        for job in self.running_jobs:
            func( job )
    
    def print_job_status( self, job ):
        [name, job_name] = job
        
        print("{}: {}".format(job, self.get_job_status(job) )  )
        
    def trace_jobs( self ):
        self.map_jobs( self.print_job_status )
                    
    def make_job_name( self, job_name ):
        name = "{}-{}-{}".format( self.prefix, job_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
        return name
            
    def launch_training_job( self, job_config ):
        self.sg_client.create_training_job( **job_config )
    
    def wait_for_jobs( self ):
        while True:
            print("Checking job statuses:")
            self.trace_jobs()
            if not self.check_jobs_still_running():
                break
            time.sleep(15)

    def get_model_name(self, name):
        return name + "-model"
    
    def get_endpoint_config_name(self, name):
        return name + "-config"
     
    def get_endpoint_name(self, name):
        return name + "-endpoint"

    def create_models( self ):
        for [name, job_name] in self.running_jobs:
            info = self.sg_client.describe_training_job(TrainingJobName=job_name)
            model_data = info['ModelArtifacts']['S3ModelArtifacts']
            primary_container = {
                'Image': self.container,
                'ModelDataUrl': model_data
            }
            res = self.sg_client.create_model(
                ModelName = self. get_model_name(job_name),
                ExecutionRoleArn = self.role,
                PrimaryContainer = primary_container)
            print(res['ModelArn'])
            self.models.append( res )
            
    def create_endpoint_configs(self, job_name ):
        endpoint_config_name = self.get_endpoint_config_name(job_name)
        model_name = self.get_model_name( job_name )
        print(endpoint_config_name)
        res = self.sg_client.create_endpoint_config(
            EndpointConfigName = endpoint_config_name,
            ProductionVariants=[{
                'InstanceType':'ml.m4.xlarge',
                'InitialVariantWeight':1,
                'InitialInstanceCount':1,
                'ModelName':model_name,
                'VariantName':'AllTraffic'}])
        print("Endpoint Config Arn: " + res['EndpointConfigArn'])
        
    def create_endpoints(self):
        for [name, job_name] in self.running_jobs:
            self.create_endpoint_configs(job_name)
            endpoint_name = self.get_endpoint_name(job_name)
            print(endpoint_name)
            res = self.sg_client.create_endpoint(
                EndpointName=endpoint_name,
                EndpointConfigName=self.get_endpoint_config_name( job_name ))
            print(res['EndpointArn'])

    def wait_for_endpoints(self):
        '''
        Poll every 30 seconds until all endpoints for the running jobs are InService.
        '''
        while True:
            print("Checking endpoint statuses:")
            all_in_service = True

            for [name, job_name] in self.running_jobs:
                endpoint_name = self.get_endpoint_name( job_name )
                resp = self.sg_client.describe_endpoint(EndpointName=endpoint_name)
                status = resp['EndpointStatus']
                print( "Endpoint {}: {}".format(endpoint_name, status ))
                if status == 'Failed':
                    raise RuntimeError("Endpoint {} failed to deploy".format(endpoint_name))
                if status != 'InService':
                    all_in_service = False

            if all_in_service:
                break
            time.sleep(30)
        
        print( "All endpoints created.")
        
    def test_model(self, job_name, csv):
        '''
        job_name - name of a job that has run all the way through to an endpoint
        csv - a csv with the same y_col and x_col structure as the training data
        returns - a single-column dataframe with the predictions from the model.
        '''
        endpoint_name = self.get_endpoint_name( job_name )
        with open(csv, 'r') as f:
            payload = f.read().strip()
        # invoke_endpoint lives on the 'sagemaker-runtime' client, not the 'sagemaker' client
        runtime = boto3.client('sagemaker-runtime', region_name=self.region)
        response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
        result = response['Body'].read()
        result = result.decode("utf-8")
        result = result.split(',')
        result = [round(float(i)) for i in result]  # threshold probabilities at 0.5
        return pd.DataFrame( result )
            
    def make_job_config(self, job_name, train, test, val ):
        return {
        "AlgorithmSpecification": {
            "TrainingImage": self.container,
            "TrainingInputMode": "File"
        },
        "RoleArn": self.role,
        "OutputDataConfig": {
            "S3OutputPath": os.path.join("s3://", self.bucket, "out", "xgb-class") 
        },
        "ResourceConfig": {
            "InstanceCount": self.instance[0],
            "InstanceType": self.instance[1],
            "VolumeSizeInGB": self.instance[2]
        },
        "TrainingJobName": job_name,
        "HyperParameters": self.hyper.params,        
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 3600
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( train ), # "s3://sagemaker-mlai-harvesting/out/train.csv" , 
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            },
            {
                "ChannelName": "test",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( test ), # "s3://sagemaker-mlai-harvesting/out/test.csv" , 
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            },
            {
                "ChannelName": "validation",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( val ), # "s3://sagemaker-mlai-harvesting/out/validate.csv" ,
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            }
        ]
    }


class SageHelper:
    def __init__(self, bucket, image, region='us-east-1'):
        self.bucket = bucket
        self.image = image
        self.container = get_image_uri(region, self.image)
        self.s3_client = boto3.client('s3')
    
    def s3_upload(self, file, s3_path="", verbose = True):
        target = os.path.join(s3_path, file) 
        if verbose:
            print( "Uploading {} to s3://{}/{}".format(file, self.bucket, target))
        # use the bucket this helper was constructed with, not the notebook global;
        # upload_file returns None on success
        response = self.s3_client.upload_file( file, self.bucket, target )
        if verbose:
            print( response )
        return response
    
        
   

In [32]:
sh = SageHelper( bucket, 'xgboost' )

In [33]:
for (job, csvs ) in job_csvs:
    print( "Uploading files for {}".format(job))
    for file in csvs:
        sh.s3_upload( file )

Uploading files for drop0
Uploading out/drop0_train.csv to s3://sk-mlai-harvesting/out/drop0_train.csv
None
Uploading out/drop0_test.csv to s3://sk-mlai-harvesting/out/drop0_test.csv
None
Uploading out/drop0_validate.csv to s3://sk-mlai-harvesting/out/drop0_validate.csv
None
Uploading files for drop10
Uploading out/drop10_train.csv to s3://sk-mlai-harvesting/out/drop10_train.csv
None
Uploading out/drop10_test.csv to s3://sk-mlai-harvesting/out/drop10_test.csv
None
Uploading out/drop10_validate.csv to s3://sk-mlai-harvesting/out/drop10_validate.csv
None
Uploading files for first5
Uploading out/first5_train.csv to s3://sk-mlai-harvesting/out/first5_train.csv
None
Uploading out/first5_test.csv to s3://sk-mlai-harvesting/out/first5_test.csv
None
Uploading out/first5_validate.csv to s3://sk-mlai-harvesting/out/first5_validate.csv
None
Uploading files for first10
Uploading out/first10_train.csv to s3://sk-mlai-harvesting/out/first10_train.csv
None
Uploading out/first10_test.csv to s3://sk-ml

# Prepare and train a model
Boilerplate code mostly copied from Amazon sample code at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb, with ample room for improvement.

In [34]:
tr= TrainRunner("harvest-xgb-bc")

In [35]:
%%time

running_jobs = tr.start_jobs(job_csvs, bucket)

tr.wait_for_jobs()

harvest-xgb-bc-drop0-2019-07-17-02-54-00
Started harvest-xgb-bc-drop0-2019-07-17-02-54-00. Response: None
harvest-xgb-bc-drop10-2019-07-17-02-54-00
Started harvest-xgb-bc-drop10-2019-07-17-02-54-00. Response: None
harvest-xgb-bc-first5-2019-07-17-02-54-01
Started harvest-xgb-bc-first5-2019-07-17-02-54-01. Response: None
harvest-xgb-bc-first10-2019-07-17-02-54-06
Started harvest-xgb-bc-first10-2019-07-17-02-54-06. Response: None
harvest-xgb-bc-last5-2019-07-17-02-54-08
Started harvest-xgb-bc-last5-2019-07-17-02-54-08. Response: None
harvest-xgb-bc-last10-2019-07-17-02-54-11
Started harvest-xgb-bc-last10-2019-07-17-02-54-11. Response: None
Checking job statuses:
['drop0', 'harvest-xgb-bc-drop0-2019-07-17-02-54-00']: InProgress
['drop10', 'harvest-xgb-bc-drop10-2019-07-17-02-54-00']: InProgress
['first5', 'harvest-xgb-bc-first5-2019-07-17-02-54-01']: InProgress
['first10', 'harvest-xgb-bc-first10-2019-07-17-02-54-06']: InProgress
['last5', 'harvest-xgb-bc-last5-2019-07-17-02-54-08']: InPr

# Launching endpoints for trained models.

In a straight-through pipeline, launch endpoints, run tests, collect output, and shut endpoints down.

In [37]:
tr.create_models()

arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-drop0-2019-07-17-02-54-00-model
arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-drop10-2019-07-17-02-54-00-model
arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-first5-2019-07-17-02-54-01-model
arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-first10-2019-07-17-02-54-06-model
arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-last5-2019-07-17-02-54-08-model
arn:aws:sagemaker:us-east-1:617644144259:model/harvest-xgb-bc-last10-2019-07-17-02-54-11-model


In [38]:
tr.create_endpoints()

harvest-xgb-bc-drop0-2019-07-17-02-54-00-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:617644144259:endpoint-config/harvest-xgb-bc-drop0-2019-07-17-02-54-00-config
harvest-xgb-bc-drop0-2019-07-17-02-54-00-endpoint
arn:aws:sagemaker:us-east-1:617644144259:endpoint/harvest-xgb-bc-drop0-2019-07-17-02-54-00-endpoint
harvest-xgb-bc-drop10-2019-07-17-02-54-00-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:617644144259:endpoint-config/harvest-xgb-bc-drop10-2019-07-17-02-54-00-config
harvest-xgb-bc-drop10-2019-07-17-02-54-00-endpoint
arn:aws:sagemaker:us-east-1:617644144259:endpoint/harvest-xgb-bc-drop10-2019-07-17-02-54-00-endpoint
harvest-xgb-bc-first5-2019-07-17-02-54-01-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:617644144259:endpoint-config/harvest-xgb-bc-first5-2019-07-17-02-54-01-config
harvest-xgb-bc-first5-2019-07-17-02-54-01-endpoint
arn:aws:sagemaker:us-east-1:617644144259:endpoint/harvest-xgb-bc-first5-2019-07-17-02-54-01-endpoint
harvest-xgb-bc-first

In [39]:
tr.wait_for_endpoints()

Checking endpoint statuses:
Endpoint harvest-xgb-bc-drop0-2019-07-17-02-54-00-endpoint: Creating
Endpoint harvest-xgb-bc-drop10-2019-07-17-02-54-00-endpoint: Creating
Endpoint harvest-xgb-bc-first5-2019-07-17-02-54-01-endpoint: Creating
Endpoint harvest-xgb-bc-first10-2019-07-17-02-54-06-endpoint: Creating
Endpoint harvest-xgb-bc-last5-2019-07-17-02-54-08-endpoint: Creating
Endpoint harvest-xgb-bc-last10-2019-07-17-02-54-11-endpoint: Creating
Checking endpoint statuses:
Endpoint harvest-xgb-bc-drop0-2019-07-17-02-54-00-endpoint: Creating
Endpoint harvest-xgb-bc-drop10-2019-07-17-02-54-00-endpoint: Creating
Endpoint harvest-xgb-bc-first5-2019-07-17-02-54-01-endpoint: Creating
Endpoint harvest-xgb-bc-first10-2019-07-17-02-54-06-endpoint: Creating
Endpoint harvest-xgb-bc-last5-2019-07-17-02-54-08-endpoint: Creating
Endpoint harvest-xgb-bc-last10-2019-07-17-02-54-11-endpoint: Creating
Checking endpoint statuses:
Endpoint harvest-xgb-bc-drop0-2019-07-17-02-54-00-endpoint: Creating
Endpoint 

# Test the model
Currently, we launch an endpoint to test the model. The endpoint exposes a simple web service that accepts a POST request containing rows of our model's X values (all columns other than BadActor) and returns a corresponding list of Y values (BadActor predictions).

The endpoint approach is best suited to interactive use, such as blacklisting a harvesting session as soon as it is identified. For offline analysis, this should be reconfigured to run batch transform jobs instead, which are cheaper to run and simpler to invoke.
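
For the offline case, a batch transform job is described by a single request. The sketch below only builds that request, using the parameter names of boto3's `create_transform_job` call. The helper name `build_transform_request`, the job and model names, the S3 locations, and the instance type are all assumptions for illustration.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m4.xlarge", instance_count=1):
    """Build keyword arguments for SageMaker's create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # same payload format the endpoint receives
            "SplitType": "Line",        # one CSV row per record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


request = build_transform_request(
    job_name="example-transform-job",                    # hypothetical
    model_name="example-model",                          # hypothetical
    input_s3_uri="s3://example-bucket/test/single.csv",  # hypothetical
    output_s3_uri="s3://example-bucket/predictions/",    # hypothetical
)
# To submit (requires AWS credentials; not run here):
#   boto3.client("sagemaker").create_transform_job(**request)
```

Unlike an endpoint, the transform job releases its instances when it finishes, so there is nothing to shut down afterwards.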

# Compute the confusion metrics

Test all of the experiments against a new dataset and compute the confusion matrix values. A confusion matrix counts the true and false positives and negatives; accuracy, precision, and recall are derived from those counts.
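
In terms of the four counts, true positives $TP$, false positives $FP$, true negatives $TN$, and false negatives $FN$, the derived metrics used in this notebook are:

$$
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
$$

Here a "positive" is a session labeled or predicted as BadActor = 1.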

In [40]:
import json
from itertools import islice
import math
import struct

region = 'us-east-1'
sg_client = boto3.client('runtime.sagemaker', region_name=region)

payload_csv = "out/single.csv"

# Use our remaining data to test our model
# test_mixed = []
# for i in range(3):
#     name = 'data/mixed-{}.txt'.format(i+1)
#     m = pd.read_csv(name,sep='\t')
#     print( "File: {} Rows: {}".format( name, len(m)))
#     test_mixed.append(m)

# df = pd.concat(test_mixed)

# Load the test file, flatten it to one row per session, and split off the labels.
df = pd.read_csv("data/single.txt", sep="\t")
df[txn_col] = df[txn_col].astype(str)
flat = flatten_txns(df).drop([sess_col], axis=1)
label = flat[[bad_col]].copy()  # single-column dataframe of expected labels
flat = flat.drop(bad_col, axis=1)

# Write the feature columns only, with no header and no index.
flat.to_csv(path_or_buf=payload_csv, header=False, index=False)


Missing columns: {'404', '311', '601'}


In [41]:
label[label[bad_col]==1]

Unnamed: 0,BadActor
303,1
2671,1
4358,1
4559,1
5422,1
5742,1
5941,1
6222,1
6633,1
7271,1


In [42]:
def test_model(endpoint_name, csv):
    '''
    endpoint_name - name of a running endpoint for one of the trained models
    csv - path to a header-less csv with the same x_col structure as the training data (no label column)
    returns - a single-column dataframe with the rounded 0/1 predictions from the model.
    '''
    with open(csv, 'r') as f:
        payload = f.read().strip()
    response = sg_client.invoke_endpoint(EndpointName=endpoint_name,
                                         ContentType='text/csv',
                                         Body=payload)
    # The response body is a comma-separated list of scores; round each to a 0/1 label.
    result = response['Body'].read().decode("utf-8")
    result = [round(float(i)) for i in result.split(',')]
    return pd.DataFrame(result)

def compute_confusion(reference, test):
    '''
    reference - single-column dataframe of expected y-values - labels
    test - single-column dataframe of computed y-values for comparison
    returns - (accuracy, precision, recall, tp, fp, tn, fn)
    '''
    # Compare row by row, by position rather than by index label.
    comp = pd.concat([reference.reset_index(drop=True),
                      test.reset_index(drop=True)], axis=1)
    comp.columns = ['label', 'prediction']
    label_positive = comp['label'] == 1
    predict_positive = comp['prediction'] == 1
    tp = len(comp[label_positive & predict_positive])
    fp = len(comp[~label_positive & predict_positive])
    tn = len(comp[~label_positive & ~predict_positive])
    fn = len(comp[label_positive & ~predict_positive])
    m = len(comp)

    eps = 0.000000001  # avoid division by 0 when there are no positives
    accuracy = (tp + tn) / m
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)

    print("accuracy: {} precision: {} recall {}".format(accuracy, precision, recall))
    return (accuracy, precision, recall, tp, fp, tn, fn)


In [44]:
results = []
endpoints = [tr.get_endpoint_name(training_job) for (name, training_job) in tr.running_jobs]
for endpoint in endpoints:
    result = test_model(endpoint, payload_csv)
    results.append(result)
    print(compute_confusion(label, result))

accuracy: 0.9989184512221502 precision: 0.0 recall 0.0
(0.9989184512221502, 0.0, 0.0, 0, 0, 9236, 10)
accuracy: 0.9974042829331603 precision: 0.06249999999609375 recall 0.09999999999
(0.9974042829331603, 0.06249999999609375, 0.09999999999, 1, 15, 9221, 9)
accuracy: 0.9988102963443651 precision: 0.0 recall 0.0
(0.9988102963443651, 0.0, 0.0, 0, 1, 9235, 10)
accuracy: 0.9988102963443651 precision: 0.3333333332222222 recall 0.09999999999
(0.9988102963443651, 0.3333333332222222, 0.09999999999, 1, 2, 9234, 9)
accuracy: 0.9539260220635951 precision: 0.002392344497601932 recall 0.09999999999
(0.9539260220635951, 0.002392344497601932, 0.09999999999, 1, 417, 8819, 9)
accuracy: 0.9964308890330954 precision: 0.0399999999984 recall 0.09999999999
(0.9964308890330954, 0.0399999999984, 0.09999999999, 1, 24, 9212, 9)


In [47]:
# results[0].columns = [bad_col]
# results[0]==label
# results[0][results[0][bad_col]==1]

Unnamed: 0,BadActor
337,1
495,1
2306,1
3426,1
3909,1
4560,1
6398,1
6633,1
7379,1
8695,1


In normal traffic, almost no sessions are malicious. As a baseline, predict that no session is a Bad Actor and compute the same metrics.

In [48]:
baseline = label.copy()
baseline[bad_col] = 0  # predict "not a Bad Actor" for every session
compute_confusion(label, baseline)

accuracy: 0.9989184512221502 precision: 0.0 recall 0.0


(0.9989184512221502, 0.0, 0.0, 0, 0, 9236, 10)
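
With only 10 positives among 9246 test rows, accuracy barely moves between runs, so precision and recall carry most of the information. The sketch below recomputes the metrics from two of the count tuples printed above: the baseline and the fourth tuple printed by the test loop. The helper `summarize` is illustrative, and the F1 score is an addition that the notebook itself does not compute.

```python
def summarize(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall (not used in the notebook).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the output above.
baseline_metrics = summarize(tp=0, fp=0, tn=9236, fn=10)
fourth_run_metrics = summarize(tp=1, fp=2, tn=9234, fn=9)

for name, metrics in [("baseline", baseline_metrics), ("fourth run", fourth_run_metrics)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

On these counts the all-negative baseline has the higher accuracy of the two, even though it finds no attacker sessions at all.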

# Conclusion
In a series of five related experiments, I explored an approach for distinguishing malicious sessions from innocent sessions and demonstrated its effectiveness.

The initial bag-of-transactions approach was extremely successful, with high accuracy and recall of almost 85%.
Adding session time features slightly improved recall at a tiny cost in precision.

Finally, using partial sessions did not greatly impair the performance of the algorithm, showing that the algorithm may be resilient enough for use in live traffic.

All experiments performed much better than the baseline assumption that there is no attack traffic.

# Remaining work

1. Fix wait_for_endpoints.
1. Move TrainingHelper and the other helpers into the JupyterHelper package.
1. Do a final cleanup so that a single panel runs everything from start to finish.
    - Consider creating a DataHelper class to move some of the data-handling code out of the notebook.
1. Finish writing the batch test driver so that it prints a comparison of the experiments' results.
1. Restore the session-timing results that seem to have dropped out of this architecture.
    - This might require a separate notebook.
    - Consider combining the flatten method with the grouping approach, or adding an 'if' so that flatten drops sess_col only when it is present.
1. Review the rubric, make corrections, and submit.
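
For the batch test driver item above, one possible shape for the comparison is a table with one row per experiment, built from the tuples that `compute_confusion` returns. This is a sketch only: the helper names `comparison_rows` and `format_table` and the experiment names `exp_a` and `exp_b` are hypothetical, and the sample values are rounded placeholders.

```python
# Field order matches the tuple returned by compute_confusion.
FIELDS = ("accuracy", "precision", "recall", "tp", "fp", "tn", "fn")


def comparison_rows(named_results):
    """named_results: dict of experiment name -> 7-tuple in FIELDS order.

    Returns a list of dicts sorted by recall, then precision, best first.
    """
    rows = [dict(zip(FIELDS, values), experiment=name)
            for name, values in named_results.items()]
    rows.sort(key=lambda r: (r["recall"], r["precision"]), reverse=True)
    return rows


def format_table(rows):
    """Render the rows as fixed-width text, one line per experiment."""
    header = "{:<12}{:>10}{:>10}{:>10}{:>6}{:>6}{:>6}{:>6}".format("experiment", *FIELDS)
    lines = [header]
    for r in rows:
        lines.append(
            "{:<12}{:>10.4f}{:>10.4f}{:>10.4f}{:>6}{:>6}{:>6}{:>6}".format(
                r["experiment"], *(r[f] for f in FIELDS)))
    return "\n".join(lines)


example = {
    "exp_a": (0.9989, 0.0, 0.0, 0, 0, 9236, 10),
    "exp_b": (0.9988, 0.3333, 0.1, 1, 2, 9234, 9),
}
rows = comparison_rows(example)
print(format_table(rows))
```

Sorting by recall first reflects the project goal of catching attacker sessions; a different sort key may suit other uses.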