# About

The paper we compare against uses a dataset with 8 malware classes. The authors use this dataset in two ways:

## 1) Multiple binary classifications

Let's use adware as an example. Adware is a class label in the dataset. The researchers frame this as a binary classification case where they set all non-adware instances to have a label of 0, with adware instances having a label of 1. They then train a classifier to predict adware or not.

They do this for all classes. This results in 8 separate classifiers, each trained to determine one class.
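
As a minimal sketch of this one-vs-rest relabelling (not the paper's code), the snippet below builds one binary label vector per class. The toy rows and the helper name `one_vs_rest_labels` are assumptions made for illustration; only the column names mirror this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "API_Calls": ["1 2 3", "4 5 6", "7 8 9", "1 1 2"],
    "API_Labels": ["Adware", "Trojan", "Adware", "Worms"],
})


def one_vs_rest_labels(labels: pd.Series, target: str) -> pd.Series:
    """Return 1 where the label equals `target`, and 0 everywhere else."""
    return (labels == target).astype(int)


# One binary label vector per class, i.e. one classification problem per class.
binary_targets = {
    cls: one_vs_rest_labels(toy_df["API_Labels"], cls)
    for cls in toy_df["API_Labels"].unique()
}

print(binary_targets["Adware"].tolist())  # prints [1, 0, 1, 0]
```

Each entry of `binary_targets` would then be paired with the same features to train one classifier.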

## 2) Multi-class classification

The authors also develop a multi-class classification model, which is tasked with predicting the class of a malware sample out of the 8 possible classes.
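
For the multi-class case, the string labels have to become integer class ids. The sketch below shows one way to do that with pandas categorical codes, which is the approach this notebook takes further down; the toy labels are made up for illustration.

```python
import pandas as pd

# Toy labels, made up for illustration.
toy_labels = pd.Series(["Trojan", "Adware", "Worms", "Adware"])

categorical = toy_labels.astype("category")
codes = categorical.cat.codes  # integer class ids, assigned in sorted category order
id_to_name = dict(enumerate(categorical.cat.categories))

print(codes.tolist())  # prints [1, 0, 2, 0]
print(id_to_name)      # prints {0: 'Adware', 1: 'Trojan', 2: 'Worms'}
```

Keeping `id_to_name` around makes it possible to turn predicted ids back into class names.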


# This notebook

In this notebook, we pre-process the data into the proper form for binary classification (outputs 8 dataframes, one for each problem, with 2 labels [0 or 1] for each), as well as the proper form for multi-class classification (one dataframe with 8 possible labels).

The original authors do not use a validation set. We add one, since that is standard practice, but keep the test split the same as theirs (15% of the data) to allow for a fair comparison, even though we are running the same models that they did.
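
A random split is only repeatable if its seed is fixed. The sketch below, which uses toy arrays and an arbitrary `random_state=0` (both assumptions, not values from the paper), shows that two calls to `train_test_split` with the same seed select the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels, made up for illustration.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# The same seed gives the same partition on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.15, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.15, random_state=0)

# Index 1 of the returned tuple is the test portion of X.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)  # prints True
```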

For the binary classification case, the dataset is unbalanced. We randomly oversample the training set to mitigate this, but leave the test set alone.
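
Random oversampling can be sketched as follows: every class smaller than the largest one is resampled with replacement until the classes have equal size. The helper name `oversample_minority`, the seed, and the toy dataframe are assumptions for illustration only.

```python
import pandas as pd


def oversample_minority(df, label_col="label", seed=0):
    """Resample every smaller class with replacement up to the majority size."""
    target_size = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target_size:
            group = group.sample(n=target_size, replace=True, random_state=seed)
        parts.append(group)
    # Shuffle so that the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy training set with 6 negatives and 2 positives.
toy_train = pd.DataFrame({
    "label": [0] * 6 + [1] * 2,
    "calls": [f"{i} {i + 1}" for i in range(8)],
})

balanced = oversample_minority(toy_train)
print(balanced["label"].value_counts().to_dict())
```

After resampling, both classes in `balanced` contain 6 rows, and the minority rows are repeated.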

To create a validation set, we simply split the shuffled training set in two: the first 4,000 rows are kept for training and the remainder becomes the validation set. This could be refined, but it is likely sufficient here.
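
One possible refinement, sketched under assumptions rather than taken from the paper or this notebook, is to hold out the validation rows with a stratified split before any oversampling, so that duplicated minority rows cannot appear on both sides. The toy dataframe and the 25% validation fraction are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, unbalanced training set: 16 negatives and 4 positives.
toy_train = pd.DataFrame({
    "label": [0] * 16 + [1] * 4,
    "calls": [f"{i} {i + 1}" for i in range(20)],
})

# Stratifying keeps the class ratio the same in both parts.
train_part, valid_part = train_test_split(
    toy_train,
    test_size=0.25,
    stratify=toy_train["label"],
    random_state=42,
)

print(len(train_part), len(valid_part))  # prints 15 5
print(int(valid_part["label"].sum()))    # prints 1
```

Oversampling would then be applied to `train_part` only.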

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


# Set data path, and set parameters for preprocessing

In [None]:
raw_data_path = '/content/drive/My Drive/Research/CyberBERT/data'
destination_folder = '/content/drive/My Drive/Research/CyberBERT/model'

# train_test_ratio = 0.10
# train_valid_ratio = 0.80

# first_n_words = 200

# Imports

In [None]:
import importlib
import pickle
import sys
from itertools import chain

import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

from keras import backend as K
from keras.layers import LSTM, Activation, Dense, Dropout, Embedding, Flatten
from keras.models import Sequential, load_model
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Preprocessing

In [None]:
# Read in data
## Calls: the API calls made by the malware
malware_calls_df = pd.read_csv(f"{raw_data_path}/calls.zip", compression="zip",
                               sep="\t", names=["API_Calls"])
## Labels (types of malware)
malware_labels_df = pd.read_csv(f"{raw_data_path}/types.zip", compression="zip",
                               sep="\t", names=["API_Labels"])

In [None]:
malware_calls_df["API_Labels"] = malware_labels_df.API_Labels
malware_calls_df["API_Calls"] = malware_calls_df.API_Calls.apply(lambda x: " ".join(x.split(",")))


In [None]:
malware_calls_df.head()

   API_Calls                                           API_Labels
0  292 291 292 291 291 291 291 291 291 291 291 29...  Trojan
1  278 192 199 192 290 291 291 291 291 290 291 29...  Trojan
2  290 291 51 34 232 238 220 221 220 69 69 66 80 ...  Backdoor
3  292 291 292 291 291 291 291 291 291 291 291 29...  Backdoor
4  292 291 291 291 291 291 291 291 291 291 291 29...  Trojan


In [None]:
labels = malware_calls_df.API_Labels.unique()

In [None]:
def preprocess_binary_data(input_df, class_label):
  """Preprocess data for binary classification and return train and test dataframes.

  Given a class label and an input dataframe, label every instance with 1
  if it belongs to the class_label target, and 0 otherwise. The test
  dataframe is also written to disk.

  Parameters
  ----------
  input_df: pd.DataFrame
    the input dataframe, contains all data (not split into train/valid/test)
  class_label: str
    the class label for which we're creating the train and test set.
    e.g., if class label is "Adware" then we're creating a dataset where
    the only instances with a 1 label are those that correspond to adware.

  Returns
  -------
  train_df, test_df: pd.DataFrame
    dataframes with a "label" column (0 or 1) and a "calls" column holding
    the padded token-id sequence as a space-separated string.
  """

  df = input_df.copy()
  print(f"Labelling {class_label} df")
  df["API_Labels"] = df.API_Labels.apply(lambda x: 1 if x == class_label else 0)
  max_words = 800
  max_len = 100

  X = df.API_Calls
  Y = df.API_Labels.astype('category').cat.codes

  tok = Tokenizer(num_words=max_words)
  tok.fit_on_texts(X)
  print('Found %s unique tokens.' % len(tok.word_index))
  X = tok.texts_to_sequences(X.values)
  X = sequence.pad_sequences(X, maxlen=max_len)
  print('Shape of data tensor:', X.shape)

  # Note: this test_size is set as 0.15 since the original paper
  # uses a split like this for the test set
  # the original paper's code also does not have a validation set,
  # so we'll manually create that ourselves
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.15)
  
  data_dict = {
    "calls": list(X_train),
    "label": list(Y_train)
  }

  train_df = pd.DataFrame(data_dict, columns=['label', "calls"])
  train_df["calls"] = train_df['calls'].apply(lambda x: ' '.join(map(str, x)))
  data_dict = {
      "calls": list(X_test),
      "label": list(Y_test)
  }

  test_df = pd.DataFrame(data_dict, columns=['label', "calls"])
  test_df["calls"] = test_df['calls'].apply(lambda x: ' '.join(map(str, x)))  
  
  print(f"Writing {class_label} df to disk")

  # Write test dataframe
  test_df.to_csv(f"{raw_data_path}/test_{class_label}.csv", index=False)

  return train_df, test_df

def balance_binary_data(train_df, class_label):
  """Balance an input dataframe.

  Parameters
  ----------
  train_df: pd.DataFrame
    the training dataframe that needs to be balanced.
  class_label: str
    the class label for which we're balancing the data
  Note
  ----
  The dataset is very unbalanced when we consider the binary
  classification case only. For the multi-class classification case,
  we will not rebalance.
  """

  print(f"balancing {class_label} df")
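  # Look up the class 0 count by label: value_counts() sorts by frequency,
  # so positional unpacking would swap the counts if class 1 were larger
  count_class_0 = train_df.label.value_counts().loc[0]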

  # Divide by class
  df_class_0 = train_df[train_df['label'] == 0]
  df_class_1 = train_df[train_df['label'] == 1]

  df_class_1_over = df_class_1.sample(count_class_0, replace=True)
  df_train_over = pd.concat([df_class_0, df_class_1_over], axis=0)

  print('Random over-sampling:')
  print(df_train_over.label.value_counts())

  print(f"writing over sampled {class_label} df")

  # Shuffle, then keep the first 4000 rows for training and
  # use the remaining rows as the validation set
  df_train_over = df_train_over.sample(frac=1, random_state=42)
  df_train_over[:4000].to_csv(f"{raw_data_path}/train_over_{class_label}.csv", index=False)
  df_train_over[4000:].to_csv(f"{raw_data_path}/valid_over_{class_label}.csv", index=False)


In [None]:
# We process each dataset separately. This is not the most efficient approach, but it fits the current workflow better.
def preprocess_and_balance_binary_data(input_df, class_label):

  train_df, test_df = preprocess_binary_data(input_df, class_label)

  balance_binary_data(train_df, class_label)


In [None]:
for class_label in labels:
  preprocess_and_balance_binary_data(malware_calls_df, class_label)

Labelling Trojan df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writing Trojan df to disk
balancing Trojan df
Random over-sampling:
1    5195
0    5195
Name: label, dtype: int64
writing over sampled Trojan df
Labelling Backdoor df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writing Backdoor df to disk
balancing Backdoor df
Random over-sampling:
1    5184
0    5184
Name: label, dtype: int64
writing over sampled Backdoor df
Labelling Downloader df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writing Downloader df to disk
balancing Downloader df
Random over-sampling:
1    5201
0    5201
Name: label, dtype: int64
writing over sampled Downloader df
Labelling Worms df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writing Worms df to disk
balancing Worms df
Random over-sampling:
1    5192
0    5192
Name: label, dtype: int64
writing over sampled Worms df
Labelling Spyware df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writ

In [None]:
def preprocess_multiclass_data(input_df):
  """Preprocess data for multi-class classification and write train, valid and test CSVs.

  The classes are not oversampled here. String labels are encoded as integer
  category codes, and the code-to-name mapping is saved to category_dict.json.
  """
  import json

  df = input_df.copy()
  print("Labelling multiclass df")

  max_words = 800
  max_len = 100

  X = df.API_Calls

  Y = df.API_Labels.astype('category').cat.codes
  category_list = df.API_Labels.astype('category').cat.categories
  # Map each integer code back to its category name
  category_dict = dict(enumerate(category_list))

  # Write category dict so we can map labels back to categories later
  with open('category_dict.json', 'w') as f:
      json.dump(category_dict, f)

  tok = Tokenizer(num_words=max_words)
  tok.fit_on_texts(X)
  print('Found %s unique tokens.' % len(tok.word_index))
  X = tok.texts_to_sequences(X.values)
  X = sequence.pad_sequences(X, maxlen=max_len)
  print('Shape of data tensor:', X.shape)

  X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.15)
  
  data_dict = {
    "calls": list(X_train),
    "label": list(Y_train)
  }

  train_df = pd.DataFrame(data_dict, columns=['label', "calls"])
  train_df["calls"] = train_df['calls'].apply(lambda x: ' '.join(map(str, x)))
  data_dict = {
      "calls": list(X_test),
      "label": list(Y_test)
  }

  test_df = pd.DataFrame(data_dict, columns=['label', "calls"])
  test_df["calls"] = test_df['calls'].apply(lambda x: ' '.join(map(str, x)))  
  
  print("Writing multiclass df to disk")

  # Keep the first 4000 training rows for training and use the rest for validation
  train_df[:4000].to_csv(f"{raw_data_path}/train_multiclass.csv", index=False)

  train_df[4000:].to_csv(f"{raw_data_path}/valid_multiclass.csv", index=False)

  # Write test dataframe
  test_df.to_csv(f"{raw_data_path}/test_multiclass.csv", index=False)



In [None]:
preprocess_multiclass_data(malware_calls_df)

Labelling multiclass df
Found 278 unique tokens.
Shape of data tensor: (7107, 100)
Writing multiclass df to disk


# Create vocab list

A vocabulary list is needed to train the BERT model.

The vocabulary is the same for every dataset variant, so we only need to build it once.

In [None]:
X = malware_calls_df.API_Calls

In [None]:
vocab_list = []
seen_tokens = set()
for call_string in X:
  # Keep only the tokens not seen in earlier samples; vocab_list stays a
  # list of lists and is flattened in the next cell
  new_tokens = set(call_string.split(" ")) - seen_tokens
  seen_tokens.update(new_tokens)
  vocab_list.append(list(new_tokens))
# Add special tokens to vocab
vocab_list.append(['SEP', 'PAD', 'UNK', 'MASK', 'CLS'])


In [None]:
from itertools import chain

# Flatten and de-duplicate; sorting keeps the token order reproducible across runs
vocab_list = sorted(set(chain.from_iterable(vocab_list)))

In [None]:
# There are 283 tokens, including the special tokens
len(vocab_list)

283

In [None]:
with open('/content/drive/My Drive/Research/CyberBERT/data/vocab.txt', 'w') as f:
    for item in vocab_list:
        f.write(f"[{item}]\n")
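
As a rough sketch of how such a file might be consumed later, the snippet below reads a vocab file back into a token-to-id mapping. It assumes that ids follow line order, a common convention for BERT-style vocab files that is not confirmed by this notebook. The helper `load_vocab` and the toy file contents are hypothetical.

```python
import os
import tempfile


def load_vocab(path):
    """Map each non-empty line of a vocab file to its row index."""
    with open(path) as f:
        tokens = [line.rstrip("\n") for line in f if line.strip()]
    return {token: idx for idx, token in enumerate(tokens)}


# Write a tiny vocab file in the same bracketed format, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "vocab.txt")
    with open(toy_path, "w") as f:
        f.write("[PAD]\n[UNK]\n[291]\n[292]\n")
    token_to_id = load_vocab(toy_path)

print(token_to_id)  # prints {'[PAD]': 0, '[UNK]': 1, '[291]': 2, '[292]': 3}
```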