This testing notebook walks through the manual and NLP feature extraction techniques and runs the resulting features through five sets of MLP neural network tests. These five tests correspond to the MLP results section in the final paper. The two manual feature functions and three NLP feature functions are called from their respective .py files in the codebase.

**References**: any code used from a source has been referenced via a comment directly next to that specific line of code. All coding references used for the manual and NLP feature extraction techniques can be found directly in their respective .py files in the codebase.

**Change 1**: some of this code has been altered from its original form in our project so that it works with the smaller sample dataset provided in this codebase. For example, all neural network epoch counts have been reduced, as this notebook is intended to run quickly through the sample dataset. The neural network code itself is unchanged; only some of the hyperparameters have been changed to make the code easier to test.

**Change 2**: For the purposes of testing this code, the sample dataset is split 70/30 into training and test sets for the neural network models (`test_size=0.3` in the cells below). This was done because only one sample dataset is provided in the codebase. Note that all results reported in the final report were trained and tested on the 5 testing scenarios created for this project; all train/test data files are provided via the supplemental material link in the GitHub repository.

**Change 3**: In the final project, the Hugging Face DistilBERT NLP transformer was fine-tuned on a much larger dataset from training scenarios 1-4, and those model parameters were saved and used for feature extraction. Instead of loading those existing parameters, this notebook fine-tunes the model directly on the sample dataset provided in the codebase and then uses the resulting parameters to extract the features. This is intended to show the user how the model is fine-tuned and the code used to do it. Please note that the actual model parameters used with the NLP in the project are provided separately via the supplemental material link in the GitHub repository.

# Table of Contents

1. Import Sample Dataset

1. Run Files Containing Functions

1. Sorting Data with SQL for Batched Features

1. Generating Manual Features

1. Generating Batched Manual Features

1. Fine Tune Distilbert NLP

1. Generate NLP Features

1. Generate Batched NLP Features

1. Generate AE Features

1. Consolidate All Features and Labels For Testing

1. Test 1: MLP w/o Dropout

1. Test 2: MLP w/ Dropout

1. Test 3: K Means

1. Test 4: MLP and K Means Averaging

1. Test 5: MLP using K Means

# Import Sample Dataset

Start by mounting your Google Drive in Google Colab and reading in the "Codebase_Sample_Dataset.csv" and "Codebase_Sample_Dataset_Labels.csv" sample dataset files from where you have stored them on your Google Drive.

Note: you will have to change the sample dataset file location based on where you store the two files.
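
One way to make the path change in a single place is to define the folder once and build both file paths from it. This is only a sketch: `DATA_DIR` and `sample_paths` are hypothetical names that are not part of the codebase, and the folder shown is a placeholder.

```python
from pathlib import Path

# Hypothetical base folder; replace with wherever the two CSV files are stored.
DATA_DIR = Path('/content/drive/My Drive/my_project/sample_dataset')

def sample_paths(data_dir):
    """Return the paths of the sample data file and its label file."""
    data_dir = Path(data_dir)
    return (data_dir / 'Codebase_Sample_Dataset.csv',
            data_dir / 'Codebase_Sample_Dataset_Labels.csv')

data_path, labels_path = sample_paths(DATA_DIR)
print(data_path)
print(labels_path)
```

The two resulting paths could then be passed to `pd.read_csv` in place of the hard-coded strings used below.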

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from keras.layers import Dropout

In [5]:
#import sample data file and labels

data_df = pd.read_csv('/content/drive/My Drive/ECE 697 Project/10. Final Presentation/codebase sample dataset/Codebase_Sample_Dataset.csv')
labels = pd.read_csv('/content/drive/My Drive/ECE 697 Project/10. Final Presentation/codebase sample dataset/Codebase_Sample_Dataset_Labels.csv', header = None)
data_df.columns =['No','Time','Source','Destination','Protocol','Length','Info']
labels.columns = ['label']

In [6]:
data_df

Unnamed: 0,No,Time,Source,Destination,Protocol,Length,Info
0,1,0.000000,192.168.1.6,192.168.0.1,TCP,54,52531 > 80 [ACK] Seq=1 Ack=1 Win=253 Len=0
1,2,0.011750,192.168.1.6,192.168.0.1,TCP,54,"52531 > 80 [FIN, ACK] Seq=1 Ack=1 Win=253 Len=0"
2,3,0.011760,192.168.0.1,192.168.1.6,TCP,54,80 > 52531 [ACK] Seq=1 Ack=2 Win=237 Len=0
3,4,0.325331,192.168.0.8,192.168.0.1,TCP,74,54236 > 80 [SYN] Seq=0 Win=29200 Len=0 MSS=1...
4,5,0.325364,192.168.0.1,192.168.0.8,TCP,74,"80 > 54236 [SYN, ACK] Seq=0 Ack=1 Win=28960 ..."
...,...,...,...,...,...,...,...
495,496,1454.171676,10.128.0.88,192.168.1.9,TCP,54,50948 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0
496,497,1454.171685,10.128.0.82,192.168.1.9,TCP,54,44318 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0
497,498,1454.171686,10.128.0.88,192.168.1.9,TCP,54,33222 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0
498,499,1454.171691,10.128.0.82,192.168.1.9,TCP,54,51034 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0


In [7]:
labels

Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0
...,...
495,1
496,1
497,1
498,1


# Run Files Containing Functions

The next set of code installs the packages needed for the NLP fine-tuning that is performed later. It also runs the two .py files that define the manual and NLP feature extraction functions.

In [31]:
  !pip install torch-summary 
  !pip install datasets 
  !pip install transformers
  !pip install scapy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scapy
  Downloading scapy-2.4.5.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 7.6 MB/s 
Building wheels for collected packages: scapy
  Building wheel for scapy (setup.py) ... done
  Created wheel for scapy: filename=scapy-2.4.5-py2.py3-none-any.whl size=1261555 sha256=270c033a42544fadb336fd9e3bfd8b5a87a2055aa2c3555895717e67006448eb
  Stored in directory: /root/.cache/pip/wheels/b9/6e/c0/0157e466a5e02d3ff28fc7587dff329b4a967a23b3f9b11385
Successfully built scapy
Installing collected packages: scapy
Successfully installed scapy-2.4.5


Note: make sure to change the file locations to wherever you have stored these files, as you did for the sample dataset.

In [9]:
#File containing manual feature extraction function and batched manual feature extraction function
%run '/content/drive/My Drive/ECE 697 Project/8. Classification Testing/NN and Clustering using Manual FE/Final Evaluation/manual_feature_data_generation_functions.py'

#File containing NLP fine tuning function, NLP feature extraction function, and batched NLP feature extraction function
%run '/content/drive/My Drive/ECE 697 Project/8. Classification Testing/NN and Clustering using Manual FE/Final Evaluation/nlp_feature_data_generation_functions.py'

# Sorting Data via SQL for Batched Features

SQL is used to sort the datasets by IP Address. This is required for only the batched manual feature extraction process and the batched NLP feature extraction process.
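
The sorting idea can be sketched on a toy table with Python's built-in `sqlite3` module. The rows are made up for illustration, and the `"No"` tie-breaker in the `ORDER BY` clause is an addition that the notebook's own query does not use; it keeps each source's packets in capture order.

```python
import sqlite3

# Toy rows standing in for the packet table: (No, Source, label).
rows = [
    (1, '192.168.1.6', 0),
    (2, '10.128.0.50', 1),
    (3, '192.168.0.1', 0),
    (4, '10.128.0.50', 1),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sample_data ("No" INTEGER, Source TEXT, label INTEGER)')
conn.executemany('INSERT INTO sample_data VALUES (?, ?, ?)', rows)

# Source is stored as text, so ORDER BY compares the addresses as strings.
# "No" is added as a tie-breaker (not in the notebook's query) to keep
# packets from the same source in their original order.
grouped = conn.execute(
    'SELECT "No", Source, label FROM sample_data ORDER BY Source ASC, "No" ASC'
).fetchall()
conn.close()

for row in grouped:
    print(row)
```

Because the comparison is textual, the addresses are ordered as strings rather than numerically. That is enough here, since the goal is only to place rows with the same source next to each other.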

In [10]:
#start with the sample data and labels
data_with_labels = pd.concat([data_df,labels],axis = 1)

In [11]:
np.shape(data_with_labels)

(500, 8)

In [12]:
#create a database using SQLite, and put the sample data in the database

connection_to_db = create_engine('sqlite:///sample_database.db')  #https://uwmadison.app.box.com/s/05pdwm1sebn77ge4se8od25q7ws79k1z
data_with_labels.to_sql('sample_data', con=connection_to_db, if_exists='replace')  #https://uwmadison.app.box.com/s/05pdwm1sebn77ge4se8od25q7ws79k1z

In [14]:
%load_ext sql
%sql sqlite:///sample_database.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: @sample_database.db'

In [15]:
#https://uwmadison.app.box.com/s/05pdwm1sebn77ge4se8od25q7ws79k1z
#create new table that contains the sample data elements, ordered by Source in ascending order
%%sql

DROP TABLE IF EXISTS data_grouped;

CREATE TABLE data_grouped AS 
SELECT No,Time, Source, Destination, Protocol, Length, Info, label FROM sample_data ORDER BY Source ASC

 * sqlite:///sample_database.db
Done.
Done.


[]

In [16]:
#https://uwmadison.app.box.com/s/05pdwm1sebn77ge4se8od25q7ws79k1z
#export results
result = %sql SELECT * FROM data_grouped
data_df_grouped = result.DataFrame()

 * sqlite:///sample_database.db
Done.


In [17]:
data_df_grouped

Unnamed: 0,No,Time,Source,Destination,Protocol,Length,Info,label
0,248,1454.124546,10.128.0.50,192.168.1.9,TCP,74,54704 > 80 [SYN] Seq=0 Win=29200 Len=0 MSS=1...,1
1,255,1454.124587,10.128.0.50,192.168.1.9,TCP,74,48260 > 80 [SYN] Seq=0 Win=29200 Len=0 MSS=1...,1
2,257,1454.124607,10.128.0.50,192.168.1.9,TCP,54,54704 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0,1
3,258,1454.124665,10.128.0.50,192.168.1.9,HTTP,254,GET /bcrypt.php HTTP/1.1,1
4,260,1454.124669,10.128.0.50,192.168.1.9,TCP,54,48260 > 80 [ACK] Seq=1 Ack=1 Win=29312 Len=0,1
...,...,...,...,...,...,...,...,...
495,485,1454.168024,192.168.1.9,10.128.0.57,TCP,66,"80 > 36659 [SYN, ACK] Seq=0 Ack=1 Win=42340 ...",0
496,492,1454.171596,192.168.1.9,10.128.0.88,TCP,66,"80 > 50948 [SYN, ACK] Seq=0 Ack=1 Win=42340 ...",0
497,493,1454.171596,192.168.1.9,10.128.0.88,TCP,66,"80 > 33222 [SYN, ACK] Seq=0 Ack=1 Win=42340 ...",0
498,494,1454.171610,192.168.1.9,10.128.0.82,TCP,66,"80 > 44318 [SYN, ACK] Seq=0 Ack=1 Win=42340 ...",0


# Generating Manual Features

Call the manual feature extraction function. It outputs an n x 16 feature matrix.

In [18]:
manual_features = MF(data_df)

In [19]:
np.shape(manual_features)

(500, 16)

# Generating Batched Manual Features

Generate the manual features again, this time from the sorted file, and concatenate them with the sorted dataset so that the IP addresses are still present. The batching function is then run on the result.

In [20]:
#run manual feature extraction on sorted file
manual_features_sorted = MF(data_df_grouped)
manual_features_sorted = pd.DataFrame(manual_features_sorted)

#concatenate with original IP addresses
manual_features_sorted = pd.concat([data_df_grouped,manual_features_sorted],axis = 1)

#run batched manual feature function
manual_features_batched = MF_batching(manual_features_sorted,10,500)

In [21]:
np.shape(manual_features_batched)

(75, 23)

In [22]:
#the labels are included in the last column
manual_features_batched[:,22]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])
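
`MF_batching` is defined in the .py file and its internals are not shown in this notebook. The sketch below is therefore not its implementation, only a hypothetical illustration of the general idea of batching sorted rows by source IP. The function name `batch_by_source`, the mean aggregation, and the labeling rule are all assumptions.

```python
from itertools import groupby
from statistics import mean

def batch_by_source(rows, batch_size):
    """Group consecutive rows with the same source into batches of at most
    batch_size rows and average their numeric feature.

    rows: list of (source, feature_value, label), already sorted by source.
    Returns one (source, mean_feature, label) tuple per batch.
    """
    batches = []
    for source, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        for start in range(0, len(group), batch_size):
            chunk = group[start:start + batch_size]
            mean_feature = mean(r[1] for r in chunk)
            # Hypothetical labeling rule: a batch is labeled 1 if any row in it is.
            label = max(r[2] for r in chunk)
            batches.append((source, mean_feature, label))
    return batches

rows = [('10.0.0.1', 54, 1)] * 3 + [('10.0.0.2', 74, 0)] * 2
result = batch_by_source(rows, batch_size=2)
print(result)
```

With a batch size of 2, the three rows from the first source produce two batches and the two rows from the second source produce one, which shows why the batched matrix has fewer rows than the per-packet matrix.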

# Fine Tune Distilbert NLP

The sample dataset is used to fine tune the NLP model below.

Note: In the final project, the Hugging Face DistilBERT NLP transformer was fine-tuned on a much larger dataset from training scenarios 1-4, and those model parameters were saved and used for feature extraction. Instead of loading those existing parameters, this notebook fine-tunes the model directly on the sample dataset provided in the codebase and then uses the resulting parameters to extract the features. This is intended to show the user how the model is fine-tuned and the code used to do it. Please note that the actual model parameters used with the NLP in the project are provided separately via the supplemental material link in the GitHub repository.

In [23]:
#Fine tune the NLP model:
FTmodel = FT_NLP(data_df,labels)

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Downloading pytorch_model.bin:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




Note: if you do not want to use the fine-tuning function and just want to use the Hugging Face DistilBERT pretrained weights, run the following code instead (uncomment both lines and run):

In [None]:
#model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased') #https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb#scrollTo=q1InADgf5xm2
#FTmodel = model_class.from_pretrained(pretrained_weights) #https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb#scrollTo=q1InADgf5xm2

# Generate NLP Features

NLP features are generated using the function below.

In [24]:
#generate NLP features:
NLP_features = NLP_Features(data_df,labels,FTmodel)

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d34ce94ee79ba486c.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/8c8624b8ac8aa99c60c912161f8332de003484428c47906d7ff7eb7f73eecdbb.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf

(150, 768)
(300, 768)
(450, 768)
(500, 768)


In [25]:
np.shape(NLP_features)

(500, 768)
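
The shapes printed during extraction grow in steps of 150 rows, which is consistent with the rows being processed in chunks and stacked. This is an inference from the output, not something confirmed from the source of `NLP_Features`. Under that assumption, the chunk boundaries for 500 rows could be computed as in this sketch, where `chunk_bounds` is a hypothetical helper:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) index pairs covering n_rows in steps of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(500, 150))
print(bounds)
```

The last chunk is shorter than the others (50 rows), matching the final jump from 450 to 500 rows in the printed shapes.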

# Generate Batched NLP Features

Similar to the batched manual features, the NLP features (generated from the sorted dataset) are used here to generate the batched NLP features. Before running the batching function, the NLP features need to be concatenated back with the original IP addresses.

In [27]:
#need to use the grouped features and labels for this portion. Get the labels ready:
data_df_grouped.iloc[:,0:7]
labels_sorted = pd.DataFrame(data_df_grouped.iloc[:,7], columns = ['label'])
#labels_sorted

In [28]:
#run NLP feature extraction on sorted file (this reassigns NLP_features, replacing the unsorted features generated in the previous section)
NLP_features = NLP_Features(data_df_grouped.iloc[:,0:7],labels_sorted,FTmodel)
NLP_features = pd.DataFrame(NLP_features)

#concatenate NLP features with IP addresses:
NLP_features_for_batching = pd.concat([data_df_grouped.iloc[:,0:7],NLP_features,labels_sorted], axis = 1)
NLP_features_for_batching=NLP_features_for_batching.drop(columns=['No'])

#run batching function:
NLP_batched_features = NLP_batching(NLP_features_for_batching,10,500)
np.shape(NLP_batched_features)

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d34ce94ee79ba486c.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/distilbert-base-uncased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/8c8624b8ac8aa99c60c912161f8332de003484428c47906d7ff7eb7f73eecdbb.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf

(150, 768)
(300, 768)
(450, 768)
(500, 768)


(75, 769)

In [None]:
#this is how you would run the function on the unsorted data - just keeping this for reference.

#convert the NLP features generated from the unsorted data to a DataFrame
#NLP_features = pd.DataFrame(NLP_features)

#concatenate NLP features with IP addresses:
#NLP_features_for_batching = pd.concat([data_df,NLP_features,labels], axis = 1)
#NLP_features_for_batching=NLP_features_for_batching.drop(columns=['No'])

#run batching function:
#NLP_batched_features = NLP_batching(NLP_features_for_batching,10,500)
#np.shape(NLP_batched_features)

(318, 769)

# Generate AE Features

When run, the script below prompts for the locations of the encoder file and the pcap file from the codebase repository. Example locations:

/content/drive/My Drive/ECE 697 Project/8. Classification Testing/NN and Clustering using Manual FE/Final Evaluation/encoder/

/content/drive/My Drive/ECE 697 Project/10. Final Presentation/codebase sample dataset/codebase_sample_dataset_SUEE1_98900_to_99399.pcap

In [32]:
%run '/content/drive/My Drive/ECE 697 Project/8. Classification Testing/NN and Clustering using Manual FE/Final Evaluation/autoencoder_feature_generator.py'


Enter pcap file path:/content/drive/My Drive/ECE 697 Project/10. Final Presentation/codebase sample dataset/Codebase_Sample_Dataset.pcap
Enter encoder file path:/content/drive/My Drive/ECE 697 Project/8. Classification Testing/NN and Clustering using Manual FE/Final Evaluation/encoder/


In [33]:
ae_features = np.array(pd.read_csv('Generated_AE_Features.csv', header=None))

# Consolidate All Features and Labels for Testing

This set of code consolidates the manual features, batched manual features, NLP features, batched NLP features, and AE features into lists. Those lists are used in the testing scripts below to run all feature types at once.

In [34]:
feature_list = [manual_features,manual_features_batched[:,0:22],NLP_features,NLP_batched_features[:,0:768],ae_features]
label_list = [labels,manual_features_batched[:,22],labels,NLP_batched_features[:,768],labels]
feature_names = ['manual features', 'batched manual features', 'NLP features', 'batched NLP features', 'AE features']
input_count = [16,22,768,768,128]
dense_count = [12,18,576,576,96]
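
Before running the tests, it can help to confirm that each feature set lines up with its labels and with the declared input size. The following is a minimal sketch on toy nested lists; `check_alignment` is a hypothetical helper that is not part of the codebase.

```python
def check_alignment(names, features, labels, input_counts):
    """Return a list of human-readable problems; empty if everything lines up."""
    problems = []
    for name, X, y, n_in in zip(names, features, labels, input_counts):
        if len(X) != len(y):
            problems.append(f'{name}: {len(X)} feature rows but {len(y)} labels')
        if len(X) and len(X[0]) != n_in:
            problems.append(f'{name}: {len(X[0])} columns but input_count is {n_in}')
    return problems

# Toy stand-ins: set 'a' is consistent, set 'b' has one row but two labels.
toy_features = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0, 3.0]]]
toy_labels = [[0, 1], [1, 0]]
problems = check_alignment(['a', 'b'], toy_features, toy_labels, [2, 3])
print(problems)
```

The helper indexes `X[0]` as the first row, so pandas DataFrames would need to be converted with `np.asarray` first. It also only compares sizes: it cannot detect features and labels that have the same length but a different row order, for example features computed from the sorted table paired with labels from the unsorted one.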

# Test 1: MLP w/o dropout

In [36]:
for i in range(0,5):

  #split sample dataset into test and train sets
  X_train, X_test, y_train, y_test = train_test_split(feature_list[i], label_list[i], test_size=0.3, random_state=42)

  #standard scaler on features
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  #change type for keras model
  X_train = X_train.astype(np.float64)
  X_test = X_test.astype(np.float64)

  #sequential keras model
  model = Sequential()
  model.add(Dense(dense_count[i], input_dim=input_count[i], activation='relu'))
  model.add(Dense(dense_count[i], activation='relu'))
  model.add(Dense(1, activation='sigmoid'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) #https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
  model.fit(X_train, y_train.astype(np.float64), epochs=20, batch_size=10)

  #predict X_test labels
  pred = model.predict(X_test) 
  y_pred = np.rint(pred)

  #print confusion matrices for each set of features:
  print(feature_names[i], ' confusion matrix results: ')
  print(confusion_matrix(y_test.astype(np.float64), y_pred))
  print()

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20




manual features  confusion matrix results: 
[[36 38]
 [ 4 72]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
batched manual features  confusion matrix results: 
[[ 7  0]
 [ 0 16]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20




NLP features  confusion matrix results: 
[[33 41]
 [34 42]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
batched NLP features  confusion matrix results: 
[[ 7  0]
 [ 0 16]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
AE features  confusion matrix results: 
[[74  0]
 [53 23]]
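
For binary labels 0 and 1, scikit-learn's `confusion_matrix(y_true, y_pred)` puts the true classes in rows and the predicted classes in columns, so a 2 x 2 result reads `[[tn, fp], [fn, tp]]`. The sketch below turns such a matrix into accuracy, precision, and recall, using the manual features matrix printed above as input. `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, and recall from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Manual features matrix from the output above.
acc, prec, rec = binary_metrics([[36, 38], [4, 72]])
print(f'accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}')
```

The four entries sum to 150, which is the 30% test share of the 500 sample rows.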



# Test 2: MLP w/ dropout

In [37]:
for i in range(0,5):

  #split sample dataset into test and train sets
  X_train, X_test, y_train, y_test = train_test_split(feature_list[i], label_list[i], test_size=0.3, random_state=42)

  #standard scaler on features
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  #change type for keras model
  X_train = X_train.astype(np.float64)
  X_test = X_test.astype(np.float64)

  #sequential keras model with dropout layers
  model = Sequential()
  model.add(Dense(dense_count[i], input_dim=input_count[i], activation='relu'))
  model.add(Dropout(0.2))
  model.add(Dense(dense_count[i], activation='relu'))
  model.add(Dropout(0.2))
  model.add(Dense(dense_count[i], activation='relu'))
  model.add(Dropout(0.2))
  model.add(Dense(1, activation='sigmoid'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) #https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
  model.fit(X_train, y_train.astype(np.float64), epochs=20, batch_size=10)

  #predict X_test labels
  pred = model.predict(X_test) 
  y_pred = np.rint(pred)

  #print confusion matrices for each feature type
  print(feature_names[i], ' confusion matrix results: ')
  print(confusion_matrix(y_test.astype(np.float64), y_pred))
  print()


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
manual features  confusion matrix results: 
[[72  2]
 [53 23]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
batched manual features  confusion matrix results: 
[[ 7  0]
 [ 0 16]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
NLP features  confusion matrix results: 
[[37 37]
 [33 43]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20

# Test 3: K Means

In [38]:
for i in range(0,5):

  #split sample dataset into test and train sets
  X_train, X_test, y_train, y_test = train_test_split(feature_list[i], label_list[i], test_size=0.3, random_state=42)

  #standard scaler on features
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  X_train = X_train.astype(np.float64)
  X_test = X_test.astype(np.float64)

  #perform K Means clustering, 2 clusters
  kmeans = KMeans(n_clusters=2, random_state=0).fit(X_train)

  #predict clusters for X_test
  cluster_pred = kmeans.predict(X_test)

  #print confusion matrices for each feature type
  print(feature_names[i], ' confusion matrix results: ')
  print(confusion_matrix(y_test.astype(np.float64), cluster_pred))
  print()

manual features  confusion matrix results: 
[[13 61]
 [12 64]]

batched manual features  confusion matrix results: 
[[ 1  6]
 [ 0 16]]

NLP features  confusion matrix results: 
[[66  8]
 [71  5]]

batched NLP features  confusion matrix results: 
[[ 0  7]
 [16  0]]

AE features  confusion matrix results: 
[[73  1]
 [53 23]]
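
K Means cluster ids are arbitrary: cluster 0 does not necessarily correspond to label 0. The batched NLP matrix above, `[[0 7], [16 0]]`, separates the two classes perfectly but with the ids swapped. A minimal sketch of aligning two cluster ids with the class labels follows; `align_clusters` is a hypothetical helper. In practice the mapping would be chosen on the training set rather than on the test labels used here for illustration.

```python
def align_clusters(y_true, cluster_ids):
    """Map two cluster ids onto labels 0/1, keeping whichever mapping agrees more."""
    same = sum(int(t == c) for t, c in zip(y_true, cluster_ids))
    flipped = len(y_true) - same
    if flipped > same:
        return [1 - c for c in cluster_ids]
    return list(cluster_ids)

# Mirrors the batched NLP case: 7 true zeros in cluster 1, 16 true ones in cluster 0.
y_true = [0] * 7 + [1] * 16
cluster_ids = [1] * 7 + [0] * 16
aligned = align_clusters(y_true, cluster_ids)
agree = sum(int(t == a) for t, a in zip(y_true, aligned))
print(agree, 'of', len(y_true), 'agree')
```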



# Test 4: MLP and K Means Averaging

In [39]:
for i in range(0,5):

  #split sample dataset into test and train sets
  X_train, X_test, y_train, y_test = train_test_split(feature_list[i], label_list[i], test_size=0.3, random_state=42)

  #standard scaler on features
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  #change type for keras
  X_train = X_train.astype(np.float64)
  X_test = X_test.astype(np.float64)

  #sequential keras model
  model = Sequential()
  model.add(Dense(dense_count[i], input_dim=input_count[i], activation='relu'))
  model.add(Dense(dense_count[i], activation='relu'))
  model.add(Dense(1, activation='sigmoid'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) #https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
  model.fit(X_train, y_train.astype(np.float64), epochs=20, batch_size=10)

  #predict X_test labels
  pred = model.predict(X_test) 
  
  #perform K Means on train set, then predict test set
  kmeans = KMeans(n_clusters=2, random_state=0).fit(X_train)
  cluster_pred = kmeans.predict(X_test)

  a,b = np.shape(X_test)

  #average prediction from two models
  for j in range(0,a):
    pred[j] = (pred[j]+cluster_pred[j])/2

  y_pred = np.rint(pred)

  #print confusion matrix for each feature
  print(feature_names[i], ' confusion matrix results: ')
  print(confusion_matrix(y_test.astype(np.float64), y_pred))
  print()

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
manual features  confusion matrix results: 
[[13 61]
 [12 64]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
batched manual features  confusion matrix results: 
[[ 1  6]
 [ 0 16]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
NLP features  confusion matrix results: 
[[66  8]
 [71  5]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
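
It is worth checking what the averaging step does numerically. The K Means output is a hard 0 or 1, while the sigmoid output lies between 0 and 1, so their average falls on the cluster's side of 0.5 and rounding returns the cluster id (as long as the MLP output is not saturated at exactly 0 or 1). This is consistent with the matrices visible above being identical to the Test 3 ones. The sketch uses Python's built-in `round` as a stand-in for `np.rint`; both round halves to the nearest even integer.

```python
def averaged_label(p, cluster_id):
    """Average a sigmoid output p with a 0/1 cluster id and round to an integer."""
    return round((p + cluster_id) / 2)

results = []
for p in (0.01, 0.3, 0.7, 0.99):
    for cluster_id in (0, 1):
        label = averaged_label(p, cluster_id)
        results.append((p, cluster_id, label))
        print(p, cluster_id, label)
```

If the two models are meant to contribute equally, one option would be to average the MLP probability with a soft K Means score (for example, one derived from distances to the two centroids) instead of the hard cluster id. That is a suggestion, not something done in the project code.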

# Test 5: MLP using K Means 

In [40]:
for i in range(0,5):

  #split sample dataset into test and train sets
  X_train, X_test, y_train, y_test = train_test_split(feature_list[i], label_list[i], test_size=0.3, random_state=42)

  #standard scaler on features
  scaler = preprocessing.StandardScaler().fit(X_train)
  X_train = scaler.transform(X_train)
  X_test = scaler.transform(X_test)

  #change type for keras
  X_train = X_train.astype(np.float64)
  X_test = X_test.astype(np.float64)

  #this time, run K Means first, then concatenate the results with the existing dataset used for model training
  kmeans = KMeans(n_clusters=2, random_state=0).fit(X_train)
  cluster_pred = kmeans.predict(X_train)
  cluster_pred = cluster_pred.reshape(-1,1)
  X_train = np.concatenate((X_train,cluster_pred),axis=1)

  cluster_pred = kmeans.predict(X_test)
  cluster_pred = cluster_pred.reshape(-1,1)
  X_test = np.concatenate((X_test,cluster_pred),axis=1)

  #sequential keras model
  model = Sequential()
  model.add(Dense(dense_count[i], input_dim=input_count[i]+1, activation='relu'))
  model.add(Dense(dense_count[i], activation='relu'))
  model.add(Dense(1, activation='sigmoid'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) #https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/
  model.fit(X_train, y_train.astype(np.float64), epochs=20, batch_size=10)

  #predict new labels
  pred = model.predict(X_test) 
  y_pred = np.rint(pred)

  #print confusion matrix for each feature type
  print(feature_names[i], ' confusion matrix results: ')
  print(confusion_matrix(y_test.astype(np.float64), y_pred))
  print()

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
manual features  confusion matrix results: 
[[67  7]
 [23 53]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
batched manual features  confusion matrix results: 
[[ 7  0]
 [ 0 16]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
NLP features  confusion matrix results: 
[[36 38]
 [35 41]]

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20