# Harvesting @ MLAI Training - First Round Overview

This experiment trains an XGBoost model to distinguish harvesting sessions from ordinary sessions. In this model, we use a simplified representation of user sessions based on the common bag-of-words representation. This representation discards sequences and timing, reducing a user session to a row of counts for each transaction type.

While this was the first attempt to train a model to recognize the activity of a known harvester, the technique proved quite successful. The approach identified 84% of the harvester's sessions in the test sample with 0 false positives.

The rest of this document describes and implements the experiment, closing with some suggested next steps.

## Training Data
The training dataset consists of several hours of raw transaction logs containing activity from all users, with the full collection of harvesting activity from the LiquidTension harvester across two years. All LiquidTension (LT) activity is labeled 'BadActor' = 1, while all other traffic is assumed to be innocent and labeled 'BadActor' = 0. Since LiquidTension is currently our only easily identified harvester, we need his full range of activity so that BadActor sessions are numerous enough, relative to innocent sessions, for training to work properly.

The training set includes the following files:

|File       |Contents                             |Rows|
|-----------|-------------------------------------|----|
|may1.tsv|raw transactions|119474|
|may2.tsv|raw transactions|43608|
|may3.tsv|raw transactions|30844|
|lt-only.tsv|raw transactions for a known attacker|61917|

In all, we have 193926 transactions from "innocent" sessions and 61917 LT transactions. The LT transactions amount to about 32% of the innocent count, or about 24% of all transactions, so both classes are reasonably well represented.

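The proportions quoted above can be checked from the row counts in the file table. The sketch below is plain arithmetic on those counts; the variable names are introduced here for illustration only.

```python
# Row counts copied from the file table above.
rows = {
    "may1.tsv": 119474,
    "may2.tsv": 43608,
    "may3.tsv": 30844,
    "lt-only.tsv": 61917,
}

# Rows from the three daily logs (treated as "innocent") and from the LT-only file.
innocent_rows = sum(n for name, n in rows.items() if name.startswith("may"))
lt_rows = rows["lt-only.tsv"]
total_rows = innocent_rows + lt_rows

print("innocent rows:", innocent_rows)
print("LT rows / innocent rows: {:.3f}".format(lt_rows / innocent_rows))
print("LT rows / all rows:      {:.3f}".format(lt_rows / total_rows))
```
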

## Warm up
Import standard libraries and prepare the environment.

In [1]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

import boto3
import sagemaker
from sagemaker import get_execution_role

%matplotlib inline

!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [41]:
# sagemaker session, role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 bucket name
bucket = sagemaker_session.default_bucket()

# common column names
bad_col='BadActor'
sess_col='SessionNo'
txn_col='Act'
logtime_col = 'LogTime'



## Download
Retrieve the datafiles from the project's designated S3 bucket.

In [83]:
s3 = boto3.resource('s3')
b = s3.Bucket('sagemaker-mlai-harvesting')

# b.download_file( 'data/MLAI_ParsedDataSet.tsv', 'data/data.tsv')
b.download_file( "data/MinimalLogs/Minimal_May01.rpt", 'data/may1.tsv')
b.download_file( "data/MinimalLogs/Minimal_May02.rpt", 'data/may2.tsv')
b.download_file( "data/MinimalLogs/Minimal_May03.rpt", 'data/may3.tsv')
b.download_file( "data/MinimalLogs/Minimal_OnlyLT.rpt", 'data/lt-only.tsv')


may1 = pd.read_csv('data/may1.tsv',sep='\t')
may2 = pd.read_csv('data/may2.tsv',sep='\t')
may3 = pd.read_csv('data/may3.tsv',sep='\t')
lt = pd.read_csv('data/lt-only.tsv',sep='\t')



# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
txn = pd.concat([may1, may2, may3, lt])
txn[logtime_col] = pd.to_datetime(txn[logtime_col])
txn[txn[bad_col]==1]

Unnamed: 0,SessionNo,LogTime,CustID,GroupID,ProfID,Act,BadActor
7704,-40132942,2019-05-01 19:18:52,s8873650,main,ehost,111,1
7707,-1,2019-05-01 19:18:53,s8873650,main,ehost,201,1
7863,-40132942,2019-05-01 19:19:22,s8873650,main,ehost,121,1
8391,-1,2019-05-01 19:20:29,s8875270,main,ehost,201,1
8396,1731108217,2019-05-01 19:20:29,s8875270,main,ehost,111,1
8722,1731108217,2019-05-01 19:21:07,s8875270,main,ehost,121,1
8963,401087102,2019-05-01 19:21:51,s8875834,main,ehost,111,1
8968,-1,2019-05-01 19:21:52,s8875834,main,ehost,201,1
9356,401087102,2019-05-01 19:22:47,s8875834,main,ehost,121,1
9690,401087102,2019-05-01 19:23:39,s8875834,main,ehost,124,1


## Data conversion and feature engineering
In real life, a session consists of a series of rows of transactions of different types, and each transaction type records a variable number of additional metadata attributes describing a logged event, for a total of over 30 columns of extracted data. In addition, our tagging process has given each row a BadActor label.

|sessionno|txn id|BadActor|parm1|parm2|...|
|---------|------|--------|-----|-----|---|
|1240|111|0|query string|...|...|
|1240|112|0|meta|...|...|
|2993|301|1|meta|...|...|


In [4]:
# Distinct transaction types seen in the combined log
txns = pd.DataFrame(np.sort(txn['Act'].unique()))

# Distinct transaction types used by the harvester
lt_txns = pd.DataFrame(np.sort(lt['Act'].unique()))


In [218]:
def add_wait_times(txns):
    # sort by session, then by log time within each session
    txng = txns.set_index([sess_col, logtime_col]).sort_index()

    # remove the index so we can compute on log time values
    txng.reset_index(inplace=True)

    # subtract the previous row's log time from this row's log time, in seconds
    # (.dt.total_seconds() returns floats on both old and new pandas versions)
    txng['Wait'] = txng[logtime_col].diff().dt.total_seconds()

    # add the session back to the index so that we can flag session starts;
    # the rows are already in session order, so no second sort is needed
    txng = txng.set_index([sess_col])

    # the first row of a session has no predecessor in that session: set its wait to 0
    s = pd.Series(txng.index)
    session_starts = (s != s.shift()).values
    txng.loc[session_starts, 'Wait'] = 0

    return txng

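To see what the wait-time calculation produces, here is a small self-contained sketch on made-up rows (the session numbers and timestamps are invented for illustration). It uses `groupby(...).diff()`, a different route to the same per-session gaps than `add_wait_times`, so it is best read as a cross-check rather than a replacement.

```python
import pandas as pd

# Toy log: two interleaved sessions (values are made up).
toy = pd.DataFrame({
    "SessionNo": [7, 9, 7, 9, 7],
    "LogTime": pd.to_datetime([
        "2019-05-01 10:00:00",
        "2019-05-01 10:00:05",
        "2019-05-01 10:00:13",
        "2019-05-01 10:00:06",
        "2019-05-01 10:01:36",
    ]),
})

# Order rows by session, then by time within the session.
toy = toy.sort_values(["SessionNo", "LogTime"]).reset_index(drop=True)

# Gap to the previous row of the same session, in seconds.
# The first row of each session has no predecessor, so its wait is 0.
toy["Wait"] = (
    toy.groupby("SessionNo")["LogTime"].diff().dt.total_seconds().fillna(0)
)

print(toy)
```
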

In [219]:
txnw  = add_wait_times(txn)

In [220]:
# pd.to_timedelta(txnw['Wait']).astype('timedelta64[s]')
txnw

SessionNo,LogTime,CustID,GroupID,ProfID,Act,BadActor,Wait
-2147481927,2019-05-01 22:09:42,s9003448,main,eds,111,0,0.0
-2147360137,2019-04-23 00:41:39,s8986718,main,iconnweb,111,1,0.0
-2147360137,2019-04-23 00:41:52,s8986718,main,iconnweb,112,1,13.0
-2147360137,2019-04-23 00:41:52,s8986718,main,iconnweb,311,1,0.0
-2147317281,2019-05-01 23:54:56,s8989685,main,edsapi,111,0,0.0
-2147317281,2019-05-01 23:54:59,s8989685,main,edsapi,121,0,3.0
-2147317281,2019-05-01 23:55:06,s8989685,main,edsapi,121,0,7.0
-2147317281,2019-05-01 23:56:29,s8989685,main,edsapi,121,0,83.0
-2147002735,2019-05-04 19:52:00,s8989984,main,ehost,111,0,0.0
-2147002735,2019-05-04 19:52:22,s8989984,main,ehost,121,0,22.0


In [242]:
def compute_wait_stats(txnw):
    # per-session summary of the wait times; named aggregation keeps the
    # column names stable across numpy/pandas versions
    txnw_g = txnw.reset_index().groupby(sess_col)['Wait']
    txnw_stat = txnw_g.agg(mean='mean', std='std', sum='sum',
                           amin='min', amax='max', len='size')
    # std is NaN for single-row sessions; report 0 instead
    return txnw_stat.fillna(0)

compute_wait_stats(txnw)

SessionNo,mean,std,sum,amin,amax,len
-2147481927,0.000000,0.000000,0.0,0.0,0.0,1.0
-2147360137,4.333333,7.505553,13.0,0.0,13.0,3.0
-2147317281,23.250000,39.936408,93.0,0.0,83.0,4.0
-2147002735,30.833333,33.017672,185.0,0.0,86.0,6.0
-2146953899,150.891892,360.721098,5583.0,0.0,2121.0,37.0
-2146926264,0.333333,0.577350,1.0,0.0,1.0,3.0
-2146915841,3.000000,4.242641,6.0,0.0,6.0,2.0
-2146723372,0.333333,0.577350,1.0,0.0,1.0,3.0
-2146089473,1.250000,1.892969,5.0,0.0,4.0,4.0
-2145757832,0.000000,0.000000,0.0,0.0,0.0,1.0


In [240]:
def compute_session_stats( txn ):
    # first and last log time per session; named aggregation fixes the
    # column names regardless of the installed numpy/pandas versions
    txn_sess = txn.groupby(sess_col)[logtime_col].agg(amin='min', amax='max')
    # session length as a timedelta
    txn_sess['length'] = txn_sess['amax'] - txn_sess['amin']
    return txn_sess

compute_session_stats( txn )


We drop most of this information, including the temporal sequence of the log entries, and convert each session into a single row of data. Almost all of the columns go away, replaced by counts of transaction types in the session.

|sessionno|BadActor|111|112|113|...|301|302|...|
|---------|--------|---|---|---|---|---|---|---|
|1240|0|1|1|0|...|0|0|...|
|2993|1|0|0|0|...|1|0|...|

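As a minimal illustration of this flattening, the sketch below builds the same kind of count table from a made-up log. It uses `pd.crosstab`, which is an alternative to the `pivot_table` call in `flatten_txns`, not the notebook's own method; the toy values are invented.

```python
import pandas as pd

# Toy transaction log (made-up values, same column names as the notebook).
toy_log = pd.DataFrame({
    "SessionNo": [1240, 1240, 1240, 2993, 2993],
    "Act":       [111, 112, 111, 301, 301],
    "BadActor":  [0, 0, 0, 1, 1],
})

# One row per (session, label), one column per transaction type; cells are counts.
counts = pd.crosstab(
    index=[toy_log["SessionNo"], toy_log["BadActor"]],
    columns=toy_log["Act"],
)

# Flatten the index so SessionNo and BadActor become ordinary columns.
toy_flat = counts.rename_axis(None, axis=1).reset_index()
print(toy_flat)
```

Transaction types that never occur in a session show up as 0, which is what the `fillna(0)` in `flatten_txns` achieves.
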
In [5]:
def flatten_txns( txn_log ):
    txn_narrow = txn_log[[sess_col, txn_col,bad_col]]
    txn_pivot = pd.pivot_table(txn_narrow, index=[sess_col,bad_col], columns = [txn_col],aggfunc=[len]).fillna(0)
    txn_pivot.columns = txn_pivot.columns.droplevel(0)           # the pivot table has two-level column labels ('len', Act)
    txn_flat = txn_pivot.rename_axis(None, axis=1).reset_index() # these two lines flatten them so we have a simple table
    return txn_flat

In [6]:
flat = flatten_txns( txn )

In [7]:
flat

Unnamed: 0,SessionNo,BadActor,111,112,114,115,116,117,118,119,...,403,404,406,407,410,411,511,513,601,607
0,-2147481927,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-2147360137,1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-2147317281,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-2147002735,0,1.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-2146953899,0,0.0,1.0,0.0,20.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,-2146926264,0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,-2146915841,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,-2146723372,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,-2146089473,0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,-2145757832,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
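# This cell needs an S3 client and an existing output directory
s3_client = boto3.client('s3')
os.makedirs('out', exist_ok=True)

b.download_file( "data/MinimalLogs/Minimal_May08.rpt", 'data/may8.tsv')
may8 = pd.read_csv('data/may8.tsv',sep='\t')

# Note: this flattens the combined `txn` log, not the `may8` frame loaded above
flat_may8 = flatten_txns( txn )
file = "out/may8.csv"

flat_may8.to_csv(file)
s3_client.upload_file(file, bucket, file)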

## Producing pools of training and testing data

We will divide the combined good and bad data pools as follows:
- a training set that the model iterates over during the learning process
- a validation set that the training job evaluates against during training (the `validation` channel of the training job)
- a test set that is kept separate to score the model after training is complete. We need separate validation and test pools to make sure that we are not overfitting the model to a single set of evaluation data.

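The three-way division can be sketched on toy data as follows. This is an illustrative variant, not the notebook's implementation: it shuffles each class before slicing, which `split_frame` does not do, and the function name `stratified_three_way` and the fixed seed are assumptions made for the example.

```python
import pandas as pd

def stratified_three_way(df, label_col, train_frac, seed=0):
    """Split df into train/test/val pools, keeping class proportions.

    The rows left over after train_frac are shared equally between
    the test and val pools.
    """
    parts = {"train": [], "test": [], "val": []}
    for _, group in df.groupby(label_col):
        group = group.sample(frac=1, random_state=seed)  # shuffle within the class
        n = len(group)
        tr = int(train_frac * n)
        te = tr + int((n - tr) / 2)
        parts["train"].append(group.iloc[:tr])
        parts["test"].append(group.iloc[tr:te])
        parts["val"].append(group.iloc[te:])
    return {name: pd.concat(chunks) for name, chunks in parts.items()}

# Toy data: 20 innocent rows and 10 bad rows (made up).
toy = pd.DataFrame({"BadActor": [0] * 20 + [1] * 10, "x": range(30)})
pools = stratified_three_way(toy, "BadActor", train_frac=0.8)
print({name: len(pool) for name, pool in pools.items()})
```

Splitting each class separately keeps the BadActor proportion roughly equal across the three pools, which is the same reason `train_split` below handles the good and bad rows separately.
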
In [19]:
def split_frame( df, train_frac ):
    l = len(df)
    test_frac = (1-train_frac)/2
    tr = int(train_frac * l)
    te = int(tr + test_frac * l)
    
    train = df[:tr]
    test = df[tr:te]
    val = df[te:]
    return [train, test, val]

In [20]:
def train_split( flat, bad_split=.8 ):
    bad = flat[flat[bad_col]==1]
    good = flat[flat[bad_col]==0]
    
    bads = split_frame(bad, bad_split)
    goods = split_frame(good, bad_split)
    
    dfs = []
    for i in range(3):
        # Dropping the session # because we don't want to train on it.
        # Also leaves our label - BadActor - in the 0 column, as XGBoost requires for CSV
        df = pd.concat([bads[i], goods[i]]).drop(sess_col,axis=1).sample(frac=1)
        dfs.append( df )
    
    return dfs
    


# Split the data and upload to S3
Break the set into train, test, and validation collections and output CSVs.
As SageMaker requires, leave out row indices and column headers.

In [56]:
dfs = train_split(flat_may8, .2)

!mkdir out

s3_client = boto3.client('s3')
bucket = "sagemaker-mlai-harvesting"

for i, df in enumerate(dfs):
    files = ["train","test","validate"]
    file = "out/{}.csv".format(files[i])
    df.to_csv(path_or_buf= file, header=False, index=False  )

    print("Uploading {} to {}".format(file, bucket))

    response = s3_client.upload_file(file, bucket, file)
    print(response)
    
    
    


mkdir: cannot create directory ‘out’: File exists
Uploading out/train.csv to sagemaker-mlai-harvesting
None
Uploading out/test.csv to sagemaker-mlai-harvesting
None
Uploading out/validate.csv to sagemaker-mlai-harvesting
None


# Prepare and train a model
Boilerplate code mostly copied from Amazon sample code at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb, with ample room for improvement.

In [33]:
%%time
region = 'us-east-1'
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost')


from time import gmtime, strftime

job_name = 'harvesting-xgboost-binary-classification' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

#Ensure that the training and validation data folders generated above are reflected in the "InputDataConfig" parameter below.

create_training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": container,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": os.path.join("s3://",bucket, "out", "xgb-class") 
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m4.4xlarge",
        "VolumeSizeInGB": 5
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"binary:logistic",
        "num_round":"50"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://sagemaker-mlai-harvesting/out/train.csv" , 
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "text/csv",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://sagemaker-mlai-harvesting/out/validate.csv" ,
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "text/csv",
            "CompressionType": "None"
        }
    ]
}


client = boto3.client('sagemaker', region_name=region)
client.create_training_job(**create_training_params)

import time

status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print(status)
while status !='Completed' and status!='Failed':
    time.sleep(60)
    status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)

Training job harvesting-xgboost-binary-classification2019-06-13-16-48-30
InProgress
InProgress
InProgress
Completed
CPU times: user 67.9 ms, sys: 3.89 ms, total: 71.8 ms
Wall time: 3min


In [35]:
%%time
import boto3
from time import gmtime, strftime

model_name="harvesting-xgboost-binary-cl2019-06-13-16-48-30"+ '-model'
print(model_name)

info = client.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_data
}

create_model_response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

harvesting-xgboost-binary-cl2019-06-13-16-48-30-model
s3://sagemaker-mlai-harvesting/out/xgb-class/harvesting-xgboost-binary-classification2019-06-13-16-48-30/output/model.tar.gz
arn:aws:sagemaker:us-east-1:872344130825:model/harvesting-xgboost-binary-cl2019-06-13-16-48-30-model
CPU times: user 17.5 ms, sys: 0 ns, total: 17.5 ms
Wall time: 270 ms


In [36]:
from time import gmtime, strftime

endpoint_config_name = 'Harvest-XGBoostEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialVariantWeight':1,
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])


Harvest-XGBoostEndpointConfig-2019-06-13-16-53-57
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:872344130825:endpoint-config/harvest-xgboostendpointconfig-2019-06-13-16-53-57


# Launch an endpoint

In [37]:
%%time
import time

endpoint_name = 'Harvest-XGBoostEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)



Harvest-XGBoostEndpoint-2019-06-13-16-54-00
arn:aws:sagemaker:us-east-1:872344130825:endpoint/harvest-xgboostendpoint-2019-06-13-16-54-00
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:872344130825:endpoint/harvest-xgboostendpoint-2019-06-13-16-54-00
Status: InService
CPU times: user 134 ms, sys: 16.6 ms, total: 151 ms
Wall time: 9min 1s


# Test the model
Currently, we launch an endpoint to test the model. This endpoint includes a simple web service that takes a POST request with rows of our model's X values (the columns other than BadActor) and returns a corresponding list of Y values (the BadActor predictions).

The endpoint approach is most suitable to interactive use, such as possibly using the model to blacklist a harvesting session as soon as it is identified. For offline analysis, this should be reconfigured to run batch transform jobs instead, which are cheaper to run and more streamlined to invoke.

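For the batch alternative mentioned above, a request for the boto3 SageMaker client's `create_transform_job` call could be assembled roughly as follows. This is a sketch only: the helper name `make_transform_job_params`, the job name, the model name, and the S3 paths are all hypothetical, and the input CSV is assumed to contain feature columns only (no BadActor label).

```python
def make_transform_job_params(job_name, model_name, input_s3_uri, output_s3_uri,
                              instance_type="ml.m4.xlarge"):
    """Build a request dict for SageMaker's CreateTransformJob API."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each CSV row as one record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }

params = make_transform_job_params(
    job_name="harvest-batch-example",                      # hypothetical name
    model_name="example-model",                            # hypothetical; use a real model name
    input_s3_uri="s3://example-bucket/out/unlabeled.csv",  # hypothetical path
    output_s3_uri="s3://example-bucket/out/predictions",   # hypothetical path
)
print(params["TransformJobName"])

# With AWS credentials and a real model, the job would be submitted with:
# boto3.client("sagemaker").create_transform_job(**params)
```
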
In [57]:
runtime_client = boto3.client('runtime.sagemaker', region_name=region)

import json
from itertools import islice
import math
import struct

!head -10000 out/test.csv > out/single-test.csv

file_name = 'out/single-test.csv' 

# file_name = "out/may8.csv"


csv = pd.read_csv(file_name, header=None)
csv.columns
label = csv[0]
csv = csv.drop(0,axis=1)

single = "out/single.csv"

csv.to_csv(path_or_buf=single, header=False, index=False)

with open(single, 'r') as f:
    payload = f.read().strip()

In [58]:
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
result = response['Body'].read()
result = result.decode("utf-8")
result = result.split(',')
result = [round(float(i)) for i in result]


# Compute the confusion matrix

A confusion matrix tabulates the counts of true and false positives and negatives; accuracy, precision, and recall are derived from those counts.

In [59]:
comp = pd.concat( [label, pd.DataFrame(result)], axis = 1)
comp.columns =["label",'prediction']

label_positive = comp['label'] == 1
predict_positive = comp['prediction'] == 1

tp = len( comp[label_positive & predict_positive])
fp = len( comp[~label_positive & predict_positive])
tn = len( comp[~label_positive & ~predict_positive])
fn = len( comp[label_positive & ~predict_positive])
m = len(comp)

accuracy = (tp+tn)/m
precision = tp/(tp+fp)
recall = tp/(tp+fn)

print("accuracy: {} precision: {} recall {}".format(accuracy, precision,recall))

accuracy: 0.9657853810264385 precision: 1.0 recall 0.8360655737704918


In [60]:
tp,fp,tn,fn, len(comp)

(1683, 0, 7632, 330, 9645)

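The reported metrics can be re-derived from the four counts above. The helper below is a small sketch; the function name `confusion_metrics` is introduced here, and the F1 score is an extra metric that the notebook itself does not report.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts reported above: (tp, fp, tn, fn) = (1683, 0, 7632, 330)
metrics = confusion_metrics(1683, 0, 7632, 330)
for name, value in metrics.items():
    print("{}: {:.4f}".format(name, value))
```
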
The very first time we ran the model, we achieved strikingly successful rates of harvesting identification.
The most significant number here is the recall of 84%: we correctly identified 84% of the harvesting sessions in the test sample (1683 of 2013) by looking only at counts of transaction types, with no false positives.

This approach appears promising!

# Next steps

## Further investigating the data 

We had additional ideas for modeling the data while staying within this bag-of-transactions technique.
1. Try some hyperparameter tuning to see whether the success rates can be trivially improved.
1. Enrich the training data set in various ways: add columns to summarize total session time, average time per request, and so on.
1. Perform some clustering analysis to try to identify common patterns of behavior other than LT. This may reveal the presence of other kinds of harvesting.

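The enrichment idea in the list above could be sketched as a join between the per-session count table and a per-session timing summary. The toy frames below are made up; they are shaped like `flat` and like the output of `compute_wait_stats`, and the `wait_` prefix is an assumption chosen to keep the new columns distinct from transaction codes.

```python
import pandas as pd

# Toy per-session transaction counts (made up), shaped like `flat`.
toy_flat = pd.DataFrame({
    "SessionNo": [1, 2],
    "BadActor": [0, 1],
    111: [1, 3],
    121: [2, 0],
})

# Toy per-session timing summary (made up), indexed by session.
toy_wait = pd.DataFrame(
    {"mean": [12.5, 0.4], "sum": [25.0, 1.2]},
    index=pd.Index([1, 2], name="SessionNo"),
)

# Prefix the timing columns so they cannot collide with transaction codes.
timing = toy_wait.add_prefix("wait_").reset_index()

# A left join keeps every session that appears in the count table.
enriched = toy_flat.merge(timing, on="SessionNo", how="left")
print(enriched)
```

Because the timing columns are appended on the right, BadActor is still the first column once SessionNo is dropped, as the CSV training format used earlier expects.
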
## Qualifying the approach
Can we use this approach to identify and blacklist harvesting sessions as they occur? Some notes:
1. The approach must be resilient to easy efforts to evade. Does the accuracy of the identification drop if the attacker makes minor changes to his workflow?
1. How long does it take to identify an attacker in real time? 
    1. Do we gain certainty soon enough to stop an attacker before he's done what he came to do?
    2. Can we tag sessions accurately after the first N log entries, for instance?
    
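One way to probe the "first N log entries" question is to truncate every session before flattening it. The sketch below shows only the truncation step on made-up rows; the helper name `first_n_entries` is hypothetical.

```python
import pandas as pd

def first_n_entries(txn_log, n, sess_col="SessionNo", time_col="LogTime"):
    """Keep only the first n log entries of every session, in time order."""
    ordered = txn_log.sort_values([sess_col, time_col])
    return ordered.groupby(sess_col).head(n)

# Toy log with out-of-order rows (values are made up).
toy_log = pd.DataFrame({
    "SessionNo": [1, 1, 1, 2, 2],
    "LogTime": pd.to_datetime([
        "2019-05-01 10:00:03", "2019-05-01 10:00:01", "2019-05-01 10:00:02",
        "2019-05-01 11:00:00", "2019-05-01 11:00:09",
    ]),
    "Act": [121, 111, 112, 111, 201],
})

partial = first_n_entries(toy_log, n=2)
print(partial)
```

Flattening `partial` in the same way as full sessions and scoring it for several values of N would show how quickly recall approaches the full-session figure.
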
## Designing an implementation
Design an architecture for identifying and intercepting harvesting activity in real time. Confirm data sources, manage impact to usage latency, model costs and ROI.

In today's world, it would be less effective to perform real-time analysis on AWS, since all of our current content usage is on-prem. The algorithm used here, XGBoost, is performant on commodity hardware, so we may be able to run on standard VMs.

In real-time analysis, we will face a stream of events from interleaved sessions. We will have to demultiplex these into individual event streams both for training and for prediction, implying some kind of windowing to capture and send sets of log entries as partial sessions. It's not clear how big the impact of this windowing will be on the accuracy of the models.

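The demultiplexing and windowing described above might look roughly like this in memory. It is a sketch under simple assumptions: events arrive as `(session id, transaction type)` pairs, a window is a fixed number of events per session, and each emitted snapshot is cumulative for that session. The class name `SessionWindows` and all values are invented for illustration.

```python
from collections import Counter, defaultdict

class SessionWindows:
    """Accumulate per-session transaction counts from an interleaved event stream."""

    def __init__(self, window_size):
        self.window_size = window_size      # events per session between snapshots
        self.counts = defaultdict(Counter)  # session id -> counts per transaction type
        self.seen = Counter()               # session id -> events seen so far

    def add(self, session_id, txn_type):
        """Record one event; return a snapshot of counts when a window fills, else None."""
        self.counts[session_id][txn_type] += 1
        self.seen[session_id] += 1
        if self.seen[session_id] % self.window_size == 0:
            return dict(self.counts[session_id])
        return None

# Interleaved toy stream of (session id, transaction type); values are made up.
stream = [(1, 111), (2, 111), (1, 121), (2, 201), (1, 121), (1, 124)]

windows = SessionWindows(window_size=2)
for session_id, txn_type in stream:
    snapshot = windows.add(session_id, txn_type)
    if snapshot is not None:
        print(session_id, snapshot)
```

Each snapshot is a partial-session row of transaction counts, so it could be scored by the same model, which is where the effect of windowing on accuracy would need to be measured.
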
# Other analytical techniques
While this algorithm seems promising, we're throwing away a huge amount of intelligence before we start training, in the name of simplicity. We can evaluate what kind of gains we could achieve through more advanced techniques:
- Sequence-aware models such as LSTMs or CNNs
- more 
