# Lab 4, Exercise 3

In [1]:
import numpy as np
import sys
import os

## Load data 

The data is split across three folders: Attack_Data_Master, Training_Data_Master, and Validation_Data_Master.
These can be found here:
data/exercise3/Training_Data_Master
data/exercise3/Validation_Data_Master
data/exercise3/Attack_Data_Master

All of the data in Training_Data_Master and Validation_Data_Master is normal, 
and all of the data in Attack_Data_Master is malicious.

For the purpose of this exercise, ignore the predefined training/validation split and simply use Training_Data_Master
and Validation_Data_Master as a single pool of normal data.

As mentioned, each system call trace is stored as a single file.  Treat each system call trace as a separate datapoint.
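One way to treat every file as a separate datapoint is a small recursive loader. The sketch below is only an illustration: `load_traces` is a hypothetical helper, and the demo directory and file contents are made up so the example runs without the lab data.

```python
import tempfile
from pathlib import Path

def load_traces(root):
    """Return the text of every file under root (searched recursively), in sorted path order."""
    return [p.read_text() for p in sorted(Path(root).rglob('*')) if p.is_file()]

# Demo on a throwaway directory; the file names and contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'sub').mkdir()
    (Path(tmp) / 'a.txt').write_text('6 6 63 6 42')
    (Path(tmp) / 'sub' / 'b.txt').write_text('3 3 5')
    demo_traces = load_traces(tmp)

print(demo_traces)
```

Sorting the paths keeps the order of the datapoints stable between runs, which `os.listdir` does not guarantee.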

In [2]:
# Load all the normal system call traces (i.e., everything in Training_Data_Master and Validation_Data_Master)

# CODE HERE
normal_data = []
paths = [
    'data/exercise3/Training_Data_Master',
    'data/exercise3/Validation_Data_Master',
    'data/exercise3/Attack_Data_Master'
]
# Training and validation traces both count as normal data
for path in paths[:2]:
    for f_name in os.listdir(path):
        with open(os.path.join(path, f_name), 'r') as f:
            normal_data.append(f.read())

# Load all the malicious system call traces (i.e., everything in Attack_Data_Master)

# CODE HERE
malicious_data = []
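# Attack traces sit one directory level deeper than the normal traces
for dir_name in os.listdir(paths[2]):
    attack_dir = os.path.join(paths[2], dir_name)
    for f_name in os.listdir(attack_dir):
        with open(os.path.join(attack_dir, f_name), 'r') as f:
            malicious_data.append(f.read())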

# Hint: A useful way to load this is as one or two Python lists, where each entry in the list corresponds to the text string
#       of system call IDs; feel free to use a single list for all the data, or separate lists for malicious versus normal
#       data

## Feature extraction

Tokenize and create a dataset where each datapoint corresponds to (normalized) counts of 
system call n-grams. Try various sizes of n-grams.

Reminder: A sequence of system call IDs that looks like this:
'6 6 63 6 42'

contains the following 3-grams:
'6 6 63'
'6 63 6'
'63 6 42'
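The sliding-window idea in the reminder can be sketched in a few lines of plain Python. The helper name `ngrams` is an assumption made for this illustration and is not part of the lab code.

```python
def ngrams(trace, n):
    """Return all contiguous n-grams of a whitespace-separated trace, as strings."""
    tokens = trace.split()
    # One window starts at every position that leaves room for n tokens
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

three_grams = ngrams('6 6 63 6 42', 3)
print(three_grams)  # ['6 6 63', '6 63 6', '63 6 42']
```

A trace with fewer than `n` tokens yields an empty list.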

Note: There are a number of ways you could code this up, but if you loaded the data
as lists of strings, you could consider using some of the feature extraction methods in 
sklearn.feature_extraction.text
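One detail worth checking when using `CountVectorizer` on system call IDs: its documented default `token_pattern` only matches tokens of two or more characters, so single-digit IDs would be skipped. The sketch below applies that regular expression directly with `re` to the example trace, next to a variant that also keeps one-character tokens. The variable names are made up for this illustration.

```python
import re

trace = '6 6 63 6 42'
default_pattern = r'(?u)\b\w\w+\b'   # CountVectorizer's documented default
keep_all_pattern = r'(?u)\b\w+\b'    # variant that also keeps one-character tokens

default_tokens = re.findall(default_pattern, trace)
all_tokens = re.findall(keep_all_pattern, trace)

print(default_tokens)  # ['63', '42']
print(all_tokens)      # ['6', '6', '63', '6', '42']
```

If single-digit IDs should count as tokens, the second pattern can be passed as `token_pattern` when constructing the vectorizer.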

In [3]:
# Look at the classdemo notebook for an example of doing this
# CODE HERE
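from sklearn.feature_extraction.text import CountVectorizer
# Count word-level 3-grams. Note: the default token_pattern only keeps tokens of two or
# more characters, so single-digit syscall IDs such as '6' are dropped with this setup.
cnt_vec = CountVectorizer(analyzer='word', ngram_range=(3,3))
raw_cnts = cnt_vec.fit_transform(normal_data + malicious_data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = cnt_vec.get_feature_names_out()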

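from sklearn.feature_extraction.text import TfidfTransformer
# With use_idf=False this only normalizes each row of counts (L2 norm by default)
tf_transformer = TfidfTransformer(use_idf=False)
all_data = tf_transformer.fit_transform(raw_cnts)
# Labels: 0 = normal, 1 = malicious, in the same order as the rows of all_data
raw_labels = [0]*len(normal_data) + [1]*len(malicious_data)
all_labels = np.asarray(raw_labels)
# Row indices, split alongside the data so test rows can be traced back to their source
indices = np.arange(len(all_labels))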

## Create train/test split

In [4]:
# Use 50% of the data for the training set and the rest for the test set
# CODE HERE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(all_data, all_labels, indices, test_size=0.5, random_state=42)

## Train a classifier

In [5]:
# Please use Logistic Regression for this exercise
# Feel free to experiment with the various hyperparameters available to you in sklearn
# CODE HERE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42).fit(X_train, y_train)



## Inference and results

In [6]:
# Run inference on the test data and predict labels for each data point in the test data
# CODE HERE
preds = model.predict(X_test)

# Calculate and print the following metrics: precision, recall, f1-measure, and accuracy
# CODE HERE
from sklearn import metrics
precision = metrics.precision_score(y_test, preds)
recall = metrics.recall_score(y_test, preds)
f1measure = metrics.f1_score(y_test, preds)
accuracy = metrics.accuracy_score(y_test, preds)
print('precision = ' + str(precision))
print('   recall = ' + str(recall))
print('f1measure = ' + str(f1measure))
print(' accuracy = ' + str(accuracy))

precision = 0.7789473684210526
   recall = 0.6132596685082873
f1measure = 0.686244204018547
 accuracy = 0.9317876344086021
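For reference, the four metrics can be computed by hand from the confusion counts. The sketch below uses made-up counts (they are not the counts behind the output above) and a hypothetical helper name, purely to show the definitions.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute precision, recall, f1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)          # fraction of predicted attacks that are real
    recall = tp / (tp + fn)             # fraction of real attacks that are caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts for an imbalanced test set (100 attacks, 900 normal traces)
precision, recall, f1, accuracy = metrics_from_counts(tp=60, fp=20, fn=40, tn=880)
print(precision, recall, f1, accuracy)
```

With counts like these, accuracy is high even though 40% of the attacks are missed, because the normal class dominates the total.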


# Part 2: Varying class priors

Create several new test datasets where you have randomly subsampled the number of 
attack datapoints.

In particular, create the following datasets:
- 10 datasets where 25% of the attack datapoints are removed from the original test set
- 10 datasets where 50% of the attack datapoints are removed from the original test set
- 10 datasets where 75% of the attack datapoints are removed from the original test set
- 10 datasets where 90% of the attack datapoints are removed from the original test set
- 10 datasets where 95% of the attack datapoints are removed from the original test set

Report five sets of precision, recall, f1-measure, and accuracy corresponding to the following:
- Average precision, recall, f1-measure, accuracy for datasets where 25% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 50% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 75% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 90% of attack datapoints removed
- Average precision, recall, f1-measure, accuracy for datasets where 95% of attack datapoints removed

Note: You will use the same model trained in part 1 for all of these datasets.  
All you are varying is the class priors during the inference stage.
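There is more than one way to do the subsampling. Dropping each attack point independently with probability X removes X% only on average. The sketch below is an alternative that removes an exact count using `random.sample`. The helper name `drop_attack_fraction` and the toy labels are assumptions for illustration.

```python
import random

def drop_attack_fraction(labels, fraction, rng):
    """Return the indices to keep after removing `fraction` of the label-1 points."""
    attack_ids = [i for i, y in enumerate(labels) if y == 1]
    n_drop = round(fraction * len(attack_ids))
    dropped = set(rng.sample(attack_ids, n_drop))
    return [i for i in range(len(labels)) if i not in dropped]

# Toy test set: 80 normal points followed by 20 attack points
labels = [0] * 80 + [1] * 20
rng = random.Random(42)
kept = drop_attack_fraction(labels, 0.25, rng)

n_attack_kept = sum(labels[i] for i in kept)
print(len(kept), n_attack_kept)  # 95 15
```

The kept indices can then be used to select the matching rows of the test features and labels before calling `predict`.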

In [7]:
# Create subsets of the test set by randomly discarding X% of points with label 1 (attack)
# CODE HERE
import random
random.seed(42)
probs = [0.25, 0.5, 0.75, 0.9, 0.95]
ids_all = range(len(y_test))
for p in probs:
    avg_prec = 0
    avg_recall = 0
    avg_f1 = 0
    avg_acc = 0
    for _ in range(10):
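        # Drop each attack point independently with probability p,
        # so the removed fraction is p on average rather than exactly p
        ids_drop = set()
        for i in range(len(y_test)):
            if y_test[i] == 1 and random.random() < p:
                ids_drop.add(i)
        # A set makes the membership test below constant-time
        ids_keep = [x for x in ids_all if x not in ids_drop]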
        X_test_new = X_test[ids_keep,:]
        y_test_new = y_test[ids_keep]
        # predict and metrics
        preds_new = model.predict(X_test_new)
        avg_prec += metrics.precision_score(y_test_new, preds_new)
        avg_recall += metrics.recall_score(y_test_new, preds_new)
        avg_f1 += metrics.f1_score(y_test_new, preds_new)
        avg_acc += metrics.accuracy_score(y_test_new, preds_new)
    avg_prec /= 10
    avg_recall /= 10
    avg_f1 /= 10
    avg_acc /= 10
    print('AVG over 10 rounds of {}% attack datapoints removed'.format(p*100))
    print('\tprecision = ' + str(avg_prec))
    print('\t   recall = ' + str(avg_recall))
    print('\tf1measure = ' + str(avg_f1))
    print('\t accuracy = ' + str(avg_acc))

AVG over 10 rounds of 25.0% attack datapoints removed
	precision = 0.7256045419634172
	   recall = 0.6090142018895736
	f1measure = 0.6621297346369859
	 accuracy = 0.9411021754034359
AVG over 10 rounds of 50.0% attack datapoints removed
	precision = 0.6336376951212832
	   recall = 0.6070608641593191
	f1measure = 0.6198168120078908
	 accuracy = 0.9521150519264789
AVG over 10 rounds of 75.0% attack datapoints removed
	precision = 0.45560403781349523
	   recall = 0.6011991178824132
	f1measure = 0.5180762242513929
	 accuracy = 0.9636977955454524
AVG over 10 rounds of 90.0% attack datapoints removed
	precision = 0.2610950707613507
	   recall = 0.5646104344963792
	f1measure = 0.3556190278733884
	 accuracy = 0.9697453360664197
AVG over 10 rounds of 95.0% attack datapoints removed
	precision = 0.13779985983831045
	   recall = 0.6178381001523416
	f1measure = 0.22397364178211648
	 accuracy = 0.9735443529395351


# Questions

1) In Part 1, what size of ngrams gives the best performance? What are the tradeoffs as you change the size?

Using only 3-grams gives the best performance of the sizes tried. Larger n-grams produce many more distinct features, yet all four metrics drop. A likely cause is overfitting: longer n-grams are rarer, so the model can latch onto sequences in the training data that do not recur in the test data.

2) In Part 1, how does performance change if we use simple counts as features (i.e., 1-grams) as opposed to counts of 2-grams? What does this tell you about the role of sequences in prediction for this dataset?

Performance is worse with 1-grams than with 2-grams, and both are worse than 3-grams. This suggests that the order of system calls carries signal: specific short sequences of syscalls are better predictors of an attack than the frequencies of individual calls.

3) How does performance change as a function of class prior in Part 2?

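Accuracy increases as more attack datapoints are removed (from about 0.94 at 25% removed to about 0.97 at 95% removed), because the test set becomes increasingly dominated by normal traces. Recall stays roughly constant (between about 0.56 and 0.62), since it depends only on the attack points that remain. Precision drops sharply (from about 0.73 to about 0.14): the false positives come from the normal traces, which are unchanged, while the number of true positives shrinks. The f1-measure falls along with precision (from about 0.66 to about 0.22).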
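The dependence of precision on the class prior can be made explicit. If the recall and the false positive rate of a fixed model stay the same, precision at attack prior π is recall·π / (recall·π + FPR·(1 − π)). The sketch below evaluates this for a few priors; the recall and false positive rate values are illustrative assumptions, not measurements from this notebook.

```python
def precision_at_prior(recall, fpr, prior):
    """Expected precision of a fixed classifier when a fraction `prior` of points are attacks."""
    true_pos = recall * prior        # attacks that are flagged
    false_pos = fpr * (1 - prior)    # normal traces that are flagged
    return true_pos / (true_pos + false_pos)

recall, fpr = 0.6, 0.02              # assumed values, for illustration only
priors = [0.10, 0.05, 0.01]
precisions = [precision_at_prior(recall, fpr, pi) for pi in priors]

for pi, prec in zip(priors, precisions):
    print('prior = {:.2f}  precision = {:.3f}'.format(pi, prec))
```

Recall and the false positive rate do not appear to change with the prior, so only the mix of true and false positives among the flagged points shifts, which is the pattern seen in Part 2.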