# Harvesting @ MLAI Training - Third Round Overview

At this point in the series of experiments, we're going to try some divergent techniques. I'd like to try some tactics that will improve the fit of the model to continuous use in the live application flow. Right now, the model trains on retrospective data sets and more or less expects to see full sessions. In this experiment, I'm going to try several approaches.

## Roadmap
### Feature engineering and ML algorithms

1. Instead of counting full sessions in the training datasets, we will only take sequences of a maximum N, to see if there's a useful threshold where we could evaluate sliding windows of session history. I'll compare recall at 5, 10, 15, 20 element session subsequences, offset randomly from session start. 
    - Where do the metrics drop off? 
    - Does the approach essentially require full sessions?
    - Can we query partial sessions against a model trained on full sessions?
    - Can we query partial sessions against a model trained on partial sessions?
    - Randomly drop out D% of transactions from every session.
1. Instead of using bag-of-txns, I will try a vector that contains the last N sessions by number, indexed from a transaction dictionary. This will force sequence into the model.
    - Training on the first N transactions:
        - Count N transactions from the beginning of a session.
    - Train on N-gram tiles
    - Train on N-gram shingles
1. Instead of training an XGBoost model, train neural network models that are designed to remember sequences
    - LSTM
    - Convolutional (?)
    
### Software engineering
To this point, I've mostly worked out of a some-what scattered Jupyter notebook pile of global variables, tuning up a few functions here and there to keep scopes clean. In particular, the Sagemaker code is terrible cut-and paste. As I work through the feature engineering & algorithmic roadmap, I also want to drive toward the following engineering goals:

- Establish a consistent data manipulation pipeline that is easily customizable and reentrant, with no global variables
- a set of wrapper functions for the Sagemaker API, so that I can have a simple pipeline where training jobs -> models -> batch transformations or endpoints, without giant string constant parameters that are impossible to edit or sight-check
- Automatically engage a hyperparameter tuning job as desired without changing how the model is implemented or trained.
- Easily training many models at once
- Easily launch many batch transform jobs/endpoints at once.
- Easily display the results of the many jobs at once.
- Particular feature engineering processes glide smoothly from experimental tinkering to code that can be put into production, without throwing everything away. Some of this just falls out of doing everything else better.

# Training Data
The training dataset consists of several hours of raw transaction logs containing activity from all users, with the full collection of harvesting activity from the LiquidTension harvester across two years. All LiquidTension(LT) activity is labeled as 'BadActor' = 1, while all other traffic is assumed to be innocent and labeled as 'BadActor' = 0. Since LiquidTension is currently our only easily-identified single harvester, we need his full range of activity to have a BadActor sessions in proportion to innocent sessions for training to work properly.

The training set includes the following files:

|File       |Contents                             |Rows|
------------|-------------------------------------|----|
|may1.tsv|raw transactions|119474|
|may2.tsv|raw transactions|43608|
|may3.tsv|raw transactions|30844|
|lt-only.tsv|raw transactions for a known attacker|61917|

In all, we have 193926 transactions from "innocent" sessions and 61917 LT sessions. Approximately 32% of transactions are labeled BadActor = 1, giving us a reasonable proportion in both classes.

# Sequence testing phase 1. 

## Warm up
Import standard libraries and prepare the environment.

In [1]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 

import boto3
import sagemaker
from sagemaker import get_execution_role

import toolz

%matplotlib inline

!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [20]:
# S3 bucket name
bucket = 'sagemaker-mlai-harvesting'

# common column names
bad_col='BadActor'
sess_col='SessionNo'
txn_col='Act'
logtime_col = 'LogTime'

# paths
csv_path = "out"

## Download
Retrieve the datafiles from the project's designated S3 bucket.

In [3]:
s3 = boto3.resource('s3')
b = s3.Bucket('sagemaker-mlai-harvesting')

# b.download_file( 'data/MLAI_ParsedDataSet.tsv', 'data/data.tsv')
b.download_file( "data/MinimalLogs/Minimal_May01.rpt", 'data/may1.tsv')
b.download_file( "data/MinimalLogs/Minimal_May02.rpt", 'data/may2.tsv')
b.download_file( "data/MinimalLogs/Minimal_May03.rpt", 'data/may3.tsv')
b.download_file( "data/MinimalLogs/Minimal_OnlyLT.rpt", 'data/lt-only.tsv')


may1 = pd.read_csv('data/may1.tsv',sep='\t')
may2 = pd.read_csv('data/may2.tsv',sep='\t')
may3 = pd.read_csv('data/may3.tsv',sep='\t')
lt = pd.read_csv('data/lt-only.tsv',sep='\t')



txn = may1.append([may2, may3, lt])
txn[logtime_col] = pd.to_datetime(txn[logtime_col])
# txn[txn[bad_col]==1]

## Data conversion and feature engineering
In real life, a session consists of a series of rows of transactions of different types, and each transaction type records a variable number of additional metadata attributes describing a logged event, for a total of over 30 columns of extracted data. In addition, our tagging process has given each row a BadActor label.

|sessionno|txn id|BadActor|parm1|parm2|...|
|---------|------|--------|-----|-----|---|
|1240|111|0|query string|...|...|
|1240|112|0|meta|...|...|
|2993|301|1|meta|...|...|


In [4]:
# 'Innocent' log entries
txns = pd.DataFrame(np.sort(txn['Act'].unique()))

# Harvesting log entries
lt_txns = pd.DataFrame(np.sort(lt['Act'].unique()))



We drop most of this information, including the temporal sequence of the log entries, and convert each session into a single row of data. Almost all of the columns go away, replaced by counts of transaction types in the session.

|sessionno|BadActor|111|112|113|...|301|302|...|
|---------|--------|---|---|---|---|---|---|---|
|1240|0|1|1|0|...|0|0|...|
|2993|1|0|0|0|...|1|0|...|

# Truncating sessions
Before we flatten the sessions, we're going to truncate them. This may be a better match for the real world, in which at best we will be able to scan sliding windows of transactions with a scaling resumption that we may not scan every event.

We'll try several approaches at once:
- dropping out D% of transactions from every session
- taking only the first N sessions from every session
- taking on the last N transactions from every session
- taking N consecutive transactions from the middle of every session. 
- choosing N transactions as above, but dropping every session without at least N transactions.

# DOOF!

## **Truncating the sessions reduces the total population of txn types, which reduces the number of columns in the output datasets. Need to make sure we force the same columns across all experiments, or we can't test when we get to the end.**

## It also ought to be a lot easier to run the whole thing from beginning to end.

In [5]:
def drop_column_groups( txn_c ):
    txn_c.columns = txn_c.columns.droplevel(0)  
    txn_c.rename_axis(None, axis=1).reset_index()
    return txn_c

def get_session_groups( txn ):
    txn_g = txn.groupby(sess_col)
    return txn_g

txng = get_session_groups(txn)
txng

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd6ba1db5c0>

In [6]:
%%time
def drop_pct( df, n = .1 ):
    return df.sample(frac= 1-n)

def first_n( df, n = 5):
    return df.head( n )
    
def last_n( df, n = 5):
    return df.tail( n )


CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.63 µs


In [7]:
def grappl( fun, parm ):
    return lambda df: df.apply(fun, parm).drop(sess_col, axis=1)


In [8]:
def flatten_txns( txn_log ):
    '''
    On a flat list of sessions, run a pivot table on transaction type counts by session, and eliminate extraneous columns.
    Flatten the pivot table and simplify the index.
    '''
    txn_narrow = txn_log[[sess_col, txn_col,bad_col]]
    txn_pivot = pd.pivot_table(txn_narrow, index=[sess_col,bad_col], columns = [txn_col],aggfunc=[len]).fillna(0)
    txn_pivot.columns = txn_pivot.columns.droplevel(0)           # the pivot table has a two-level index
    txn_flat = txn_pivot.rename_axis(None, axis=1).reset_index() # these two lines get rid of it so we have a simple table
    return txn_flat

In [9]:
def flatten_groups( txn_log ):
    '''
    On a set of session groups, run a pivot table on transaction type counts by session, and eliminate extraneous columns.
    Flatten all groups into one table and simplify the index.
    '''
    txn_narrow = txn_log[[txn_col,bad_col]] # for groups, don't need to drop the session column because it's already an index column.
    txn_pivot = pd.pivot_table(txn_narrow, index=[sess_col,bad_col], columns = [txn_col],aggfunc=[len]).fillna(0)
    txn_pivot.columns = txn_pivot.columns.droplevel(0)           # the pivot table has a two-level index
    txn_flat = txn_pivot.rename_axis(None, axis=1).reset_index() # these two lines get rid of it so we have a simple table
    return txn_flat

In [29]:
jobs = [['drop10', [drop_pct, .1]], 
        ['first5', [first_n, 5]],
        ['last5', [last_n, 5]]
       ]
job_names = [job[0] for job in jobs]
[(name, func, parm) for [name, [func, parm]] in jobs]

[('drop10', <function __main__.drop_pct(df, n=0.1)>, 0.1),
 ('first5', <function __main__.first_n(df, n=5)>, 5),
 ('last5', <function __main__.last_n(df, n=5)>, 5)]

In [11]:
def prep_jobs(df, jobs):
    groups = [[name, grappl(fun, parm)(df)] for [name, [fun, parm]] in jobs]

    flats = [[name,flatten_groups( txn )] for [name,txn] in groups] #.reset_index()
    return flats

In [12]:
%%time

gs = prep_jobs(txng, jobs )


CPU times: user 1min 40s, sys: 535 ms, total: 1min 41s
Wall time: 1min 40s


In [257]:
txngs= gs
[(name,len(df), len(df.columns)) for [name,df] in txngs]
# gs[0]

[('drop10', 24114, 38), ('first5', 24112, 35), ('last5', 24112, 37)]

## Producing pools of training and testing data

In order to support simultaneous execution of multiple jobs, this notebook introduces a new scheme for piping data through to models.

The normal flow runs as follows:

input data(s3) -> df's on notebook instance -> train.csv, test.csv, validate.csv on notebook instance -> s3/out/ -> Sagemaker instances

Hardcoding these filenames is fine for playing around in a notebook, but it limits us to one job at a time.

In this approach, every job has a base Name. This name will carry through from S3 into the Sagemaker instances.
Training data files will reside in `<s3bucket>/<key>/out/Name`.

As before, in each `Name` subfolder, we will divide the combined good and bad data pools as follows:
- a training set that the model iterates over during the learning process
- a test set that is used to evaluate the model during training
- a validation set that is kept separate to test the model after training is complete. We need separate test and validate pools in order to make sure that we're not overfitting the model to a single set of test data.

All of these functions are in the FrameSplitter class. 


In [57]:
from importlib import reload
import lib.JupHelper.JupHelper as jh
reload(jh)

<module 'lib.JupHelper.JupHelper' from '/home/ec2-user/SageMaker/mlai-harvesting/lib/JupHelper/JupHelper.py'>

In [58]:
import lib.JupHelper.JupHelper as jh

csv = jh.FrameSplitter( bad_col, [sess_col]) # FrameSplitter only holds onto the definition of the y_col and the x_cols - everything else is passed in

In [61]:
job_csvs = csv.make_all_csvs( gs )           # Drives the whole conversion process - look in the class for other helper methods

In [17]:
csvs = csv.get_all_csv_names(gs)

# Upload to S3 and move into SageMaker

Move all of the current csv's up into S3 for SageMaker, then start configuring jobs.

In [57]:
from importlib import reload
import lib.JupHelper.JupHelper as jh
reload(jh)

<module 'lib.JupHelper.JupHelper' from '/home/ec2-user/SageMaker/mlai-harvesting/lib/JupHelper/JupHelper.py'>

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

import boto3
from time import gmtime, strftime
import time

class XgbHyper:
    def __init__(self):
        self.params = {
            "max_depth":"5",
            "eta":"0.2",
            "gamma":"4",
            "min_child_weight":"6",
            "subsample":"0.7",
            "silent":"0",
            "objective":"binary:logistic",
            "num_round":"50"
        }

class TrainRunner:
    def __init__( self, prefix, region='us-east-1' ):
        '''
        Wrapper object to run multiple 
        '''
        self.prefix = prefix
        self.region = region

        self.running_jobs = []
        self.models = []
        self.sg_client = boto3.client('sagemaker', region_name=region )
        # sagemaker session, role
        self.sagemaker_session = sagemaker.Session()
        self.role = sagemaker.get_execution_role()
    
    def start_jobs(self, jobs, bucket ):
        for (name, csvs) in jobs:
            job_name = self.make_job_name( name )
            print( job_name )
            self.start_training_job( bucket, job_name, csvs )
            self.running_jobs.append([name,job_name])
        return self.running_jobs
    
    def clear_jobs(self):
        self.running_jobs = []
        self.models = []
        
        #delete endpoints
        #delete configs
    
    def start_training_job(self, bucket, job_name, s3_inputs, image='xgboost', instance = [1, "ml.m4.4xlarge", 5 ], hyper=XgbHyper(), verbose=True):
        self.image = image
        self.s3_input = s3_inputs
        self.instance = instance
        self.hyper = hyper
        self.bucket = bucket

        self.container = get_image_uri( self.region, 'xgboost' )
        job_config = self.make_job_config( job_name, s3_inputs[0], s3_inputs[1], s3_inputs[2])
        res = self.launch_training_job( job_config )
        if verbose:
            print( "Started {}. Response: {}".format( job_name, res))
            
        return res
    
    def make_s3_url(self, file):
        return "s3://{}/{}".format(self.bucket, file)
    
    def get_job_status(self, job):
        [name, job_name] = job
        return self.sg_client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    
    def check_jobs_still_running( self ):
        still_running = False
        for job in self.running_jobs:
            status = get_job_status( job )
            if status !='Completed' and status !='Failed':
                still_running = True
                break
        
        return still_running
    
    def map_jobs( self, func ):
        for job in self.running_jobs:
            func( job )
    
    def print_job_status( self, job ):
        [name, job_name] = job
        
        print("{}: {}".format(job, self.get_job_status(job) )  )
        
    def trace_jobs( self ):
        self.map_jobs( self.print_job_status )
                    
    def make_job_name( self, job_name ):
        name = "{}-{}-{}".format( self.prefix, job_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
        return name
            
    def launch_training_job( self, job_config ):
        self.sg_client.create_training_job( **job_config )
    
    def wait_for_jobs( self ):
        while True:
            print("Checking job statuses:")
            self.trace_jobs()
            if not self.check_jobs_still_running():
                break
            time.sleep(15)

    def get_model_name(self, name):
        return name + "-model"
    
    def get_endpoint_config_name(self, name):
        return name + "-config"
     
    def get_endpoint_name(self, name):
        return name + "-endpoint"

    def create_models( self ):
        for [name, job_name] in self.running_jobs:
            info = self.sg_client.describe_training_job(TrainingJobName=job_name)
            model_data = info['ModelArtifacts']['S3ModelArtifacts']
            primary_container = {
                'Image': self.container,
                'ModelDataUrl': model_data
            }
            res = self.sg_client.create_model(
                ModelName = self. get_model_name(job_name),
                ExecutionRoleArn = role,
                PrimaryContainer = primary_container)
            print(res['ModelArn'])
            self.models.append( res )
            
    def create_endpoint_configs(self, job_name ):
        endpoint_config_name = self.get_endpoint_config_name(job_name)
        model_name = self.get_model_name( job_name )
        print(endpoint_config_name)
        res = self.sg_client.create_endpoint_config(
            EndpointConfigName = endpoint_config_name,
            ProductionVariants=[{
                'InstanceType':'ml.m4.xlarge',
                'InitialVariantWeight':1,
                'InitialInstanceCount':1,
                'ModelName':model_name,
                'VariantName':'AllTraffic'}])
        print("Endpoint Config Arn: " + res['EndpointConfigArn'])
        
    def create_endpoints(self):
        for [name, job_name] in self.running_jobs:
            self.create_endpoint_configs(job_name)
            endpoint_name = self.get_endpoint_name(job_name)
            print(endpoint_name)
            res = self.sg_client.create_endpoint(
                EndpointName=endpoint_name,
                EndpointConfigName=self.get_endpoint_config_name( job_name ))
            print(res['EndpointArn'])

    def wait_for_endpoints(self):
        still_creating = False
            
        while still_creating:
            still_creating = False

            print("Checking endpoint statuses:")

            for [name, job_name] in self.running_jobs:
                resp = self.sg_client.describe_endpoint(EndpointName=get_endpoint_name( job_name ))
                status = resp['EndpointStatus']
                if status == 'Creating':
                    still_creating = True
                else:
                    print("Arn: " + resp['EndpointArn'])
                
                print( "{}: {}".format( job_name, status))
                    
            time.sleep(30)
        
        print( "All endpoints created.")
        
    def test_model(self, job_name, csv):
        '''
        job_name - name of a job that has run all the way through to an endpoint
        csv - a csv with the same y_col and x_col structure as the training data
        returns - a single-column dataframe with the predictions from the model.
        '''
        endpoint_name = self.get_endpoint_name( job_name )
        with open(csv, 'r') as f:
            payload = f.read().strip()
        response = self.sg_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
        result = response['Body'].read()
        result = result.decode("utf-8")
        result = result.split(',')
        result = [round(float(i)) for i in result]
        return pd.DataFrame( result )
            
    def make_job_config(self, job_name, train, test, val ):
        return {
        "AlgorithmSpecification": {
            "TrainingImage": self.container,
            "TrainingInputMode": "File"
        },
        "RoleArn": self.role,
        "OutputDataConfig": {
            "S3OutputPath": os.path.join("s3://", self.bucket, "out", "xgb-class") 
        },
        "ResourceConfig": {
            "InstanceCount": self.instance[0],
            "InstanceType": self.instance[1],
            "VolumeSizeInGB": self.instance[2]
        },
        "TrainingJobName": job_name,
        "HyperParameters": self.hyper.params,        
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 3600
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( train ), # "s3://sagemaker-mlai-harvesting/out/train.csv" , 
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            },
            {
                "ChannelName": "test",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( test ), # "s3://sagemaker-mlai-harvesting/out/test.csv" , 
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            },
            {
                "ChannelName": "validation",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": self.make_s3_url( val ), # "s3://sagemaker-mlai-harvesting/out/validate.csv" ,
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "ContentType": "text/csv",
                "CompressionType": "None"
            }
        ]
    }


class SageHelper:
    def __init__(self, bucket, image, region='us-east-1'):
        self.bucket = bucket
        self.image = image
        self.container = get_image_uri(region, self.image)
        self.s3_client = boto3.client('s3')
    
    def s3_upload(self, file, s3_path="", verbose = True):
        target = os.path.join(s3_path, file) 
        if verbose:
            print( "Uploading {} to s3://{}/{}".format(file, bucket, target))
        response = self.s3_client.upload_file( file, bucket, target )
        if verbose:
            print( response )
        return response
    
        
   

In [149]:
sh = SageHelper( bucket, 'xgboost' )

# Prepare and train a model
Boilerplate code mostly copied from Amazon sample code at https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb, with ample room for improvement.

In [231]:
tr= TrainRunner("harvest-xgb-bc")

In [232]:
%%time

running_jobs = tr.start_jobs(job_csvs, bucket)

tr.wait_for_jobs()

harvest-xgb-bc-drop10-2019-07-09-19-53-25
Started harvest-xgb-bc-drop10-2019-07-09-19-53-25. Response: None
harvest-xgb-bc-first5-2019-07-09-19-53-25
Started harvest-xgb-bc-first5-2019-07-09-19-53-25. Response: None
harvest-xgb-bc-last5-2019-07-09-19-53-25
Started harvest-xgb-bc-last5-2019-07-09-19-53-25. Response: None
Checking job statuses:
['drop10', 'harvest-xgb-bc-drop10-2019-07-09-19-53-25']: InProgress
['first5', 'harvest-xgb-bc-first5-2019-07-09-19-53-25']: InProgress
['last5', 'harvest-xgb-bc-last5-2019-07-09-19-53-25']: InProgress
Checking job statuses:
['drop10', 'harvest-xgb-bc-drop10-2019-07-09-19-53-25']: InProgress
['first5', 'harvest-xgb-bc-first5-2019-07-09-19-53-25']: InProgress
['last5', 'harvest-xgb-bc-last5-2019-07-09-19-53-25']: InProgress
Checking job statuses:
['drop10', 'harvest-xgb-bc-drop10-2019-07-09-19-53-25']: InProgress
['first5', 'harvest-xgb-bc-first5-2019-07-09-19-53-25']: InProgress
['last5', 'harvest-xgb-bc-last5-2019-07-09-19-53-25']: InProgress
Che

# Launching endpoints for trained models.

In a straight-through pipeline, launch endpoints, run tests, collect output, and shut endpoints down.

In [233]:
tr.create_models()

arn:aws:sagemaker:us-east-1:872344130825:model/harvest-xgb-bc-drop10-2019-07-09-19-53-25-model
arn:aws:sagemaker:us-east-1:872344130825:model/harvest-xgb-bc-first5-2019-07-09-19-53-25-model
arn:aws:sagemaker:us-east-1:872344130825:model/harvest-xgb-bc-last5-2019-07-09-19-53-25-model


In [234]:
tr.create_endpoints()

harvest-xgb-bc-drop10-2019-07-09-19-53-25-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:872344130825:endpoint-config/harvest-xgb-bc-drop10-2019-07-09-19-53-25-config
harvest-xgb-bc-drop10-2019-07-09-19-53-25-endpoint
arn:aws:sagemaker:us-east-1:872344130825:endpoint/harvest-xgb-bc-drop10-2019-07-09-19-53-25-endpoint
harvest-xgb-bc-first5-2019-07-09-19-53-25-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:872344130825:endpoint-config/harvest-xgb-bc-first5-2019-07-09-19-53-25-config
harvest-xgb-bc-first5-2019-07-09-19-53-25-endpoint
arn:aws:sagemaker:us-east-1:872344130825:endpoint/harvest-xgb-bc-first5-2019-07-09-19-53-25-endpoint
harvest-xgb-bc-last5-2019-07-09-19-53-25-config
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:872344130825:endpoint-config/harvest-xgb-bc-last5-2019-07-09-19-53-25-config
harvest-xgb-bc-last5-2019-07-09-19-53-25-endpoint
arn:aws:sagemaker:us-east-1:872344130825:endpoint/harvest-xgb-bc-last5-2019-07-09-19-53-25-endpoint


In [235]:
tr.wait_for_endpoints()

All endpoints created.


# Test the model
Currently, we launch an endpoint to test the model. This endpoint includes a simple web service that takes POST request with rows of or model's X values - columns other than BadActor - and returns a corresponding list of Y values - BadActor predictions.

The endpoint approach is most suitable to interactive use, such as possibly using the model to blacklist a harvesting session as soon as it is identified. For offline analysis, this should be reconfigured to run batch transform jobs instead, which are cheaper to run and more streamlined to invoke.

In [245]:
region = 'us-east-1'
sg_client = boto3.client('runtime.sagemaker', region_name=region)

import json
from itertools import islice
import math
import struct

!head -10000 out/test.csv > out/single-test.csv

file_name = 'out/single-test.csv' 

# file_name = "out/may8.csv"

csv = pd.read_csv(file_name, header=None)
csv.columns
label = csv[0]
csv = csv.drop(0,axis=1)

single = "out/single.csv"

csv.to_csv(path_or_buf=single, header=False, index=False)

with open(single, 'r') as f:
    payload = f.read().strip()
    
# csv
# drop10 = pd.read_csv("out/drop10_train.csv", header=None)
# drop10, csv

In [251]:
drop10 = pd.read_csv("out/drop10_train.csv", header = None)
first5 = pd.read_csv("out/first5_train.csv", header = None)
last5 = pd.read_csv("out/last5_train.csv", header = None)

In [253]:
(len(drop10.columns),len(first5.columns),len(last5.columns),len(csv.columns))

(37, 34, 36, 36)

In [256]:
tr.running_jobs

[['drop10', 'harvest-xgb-bc-drop10-2019-07-09-19-53-25'],
 ['first5', 'harvest-xgb-bc-first5-2019-07-09-19-53-25'],
 ['last5', 'harvest-xgb-bc-last5-2019-07-09-19-53-25']]

In [248]:
endpoint_name = 'harvest-xgb-bc-drop10-2019-07-09-19-53-25-endpoint'
endpoints = [get_endpoint_name(ep) for ep in tr.running_jobs]

response = sg_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='text/csv', 
                                   Body=payload)
result = response['Body'].read()
result = result.decode("utf-8")
result = result.split(',')
result = [round(float(i)) for i in result]


In [268]:
def test_model(job_name, csv):
    '''
    job_name - name of a job that has run all the way through to an endpoint
    csv - a csv with the same y_col and x_col structure as the training data
    returns - a single-column dataframe with the predictions from the model.
    '''
    endpoint_name = tr.get_endpoint_name( job_name )
    with open(csv, 'r') as f:
        payload = f.read().strip()
    response = sg_client.invoke_endpoint(EndpointName=endpoint_name, 
                               ContentType='text/csv', 
                               Body=payload)
    result = response['Body'].read()
    result = result.decode("utf-8")
    result = result.split(',')
    result = [round(float(i)) for i in result]
    return pd.DataFrame( result )


In [270]:
job_name=tr.running_jobs[0][1]
result = test_model( job_name, single)

# Compute the confusion metrics

A confusion matrix describes the proportions of true and false positives and negatives, together with some derived metrics.

In [274]:
class EvalHelper:
    def __init__(self):
        pass
    
    def compute_confusion(self, reference, test):
        '''
        reference - single-column dataframe of expected y-values - labels
        test - single-column dataframe of computed y-values for comparison
        '''
        comp = pd.concat( [reference, test], axis = 1)
        comp.columns =["label",'prediction']
        label_positive = comp['label'] == 1
        predict_positive = comp['prediction'] == 1
        tp = len( comp[label_positive & predict_positive])
        fp = len( comp[~label_positive & predict_positive])
        tn = len( comp[~label_positive & ~predict_positive])
        fn = len( comp[label_positive & ~predict_positive])
        m = len(comp)

        accuracy = (tp+tn)/m
        precision = tp/(tp+fp)
        recall = tp/(tp+fn)

        print("accuracy: {} precision: {} recall {}".format(accuracy, precision,recall))
        return (accuracy, precision, recall, tp,fp,tn,fn)
    

In [275]:
eh = EvalHelper()
eh.compute_confusion(label, result)

accuracy: 0.9654743390357698 precision: 0.998220640569395 recall 0.8360655737704918


(0.9654743390357698, 0.998220640569395, 0.8360655737704918, 1683, 3, 7629, 330)