<h1 align='center'>Cybersecurity Threat Detection Using Deep Learning Architectures</h1>

### Types of Attacks

- *Denial-of-service (DoS) attack*: the service is frozen or stopped, temporarily or permanently, by flooding it with a large amount of traffic
- *Remote-to-local attack*: an attacker gains unauthorized local access by sending packets to the system over the network
- *Probing*: information and data are collected by scanning and mapping the network
- *User-to-root attack*: an attacker starts from a normal user account (for example, by tracing the user's password) and escalates to root privileges
- *Adversarial attacks*: deep neural networks are targeted by injecting noise into the training data
- *Integrity attacks*: system data is corrupted or encrypted
- *Causative attacks*: the neural network's decision-making algorithm is attacked, leading to misclassification

### USTC-TK2016 Dataset

USTC-TK2016 consists of a set of pcap files containing raw network traffic from 10 benign and 10 malware applications, as shown in the table below:<br>
![USTC-TK2016](data/img/USTC-TK2016.png)<br>

### Approach

- *CNN*: pcap files are transformed into MNIST-like images, which are fed to a CNN
- *DNN~LSTM*: the Argus API is used to extract features from the pcap files

#### Malware Traffic Classification Using CNN

##### Data preprocessing

- Step 1: Install pre-requisites (DO NOT RUN)

In [None]:
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')

# Update the list of packages
!sudo apt-get update
# Install pre-requisite packages.
!sudo apt-get install -y wget apt-transport-https software-properties-common
# Download the Microsoft repository GPG keys
!wget -q https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb
# Register the Microsoft repository GPG keys
!sudo dpkg -i packages-microsoft-prod.deb
# Update the list of packages after we added packages.microsoft.com
!sudo apt-get update
# Install PowerShell
!sudo apt-get install -y powershell
# Install SplitCap pre-requisite (-y avoids the interactive confirmation prompt)
!sudo apt-get install -y mono-runtime
# Install fdupes (duplicate file finder)
!sudo apt-get install -y fdupes

%cd drive/MyDrive/UNIPI/DL_Cybersecurity/
# Clone the repository on "ubuntu" branch
!sudo git clone -b ubuntu https://github.com/yungshenglu/USTC-TK2016 USTC-TK2016
# Install the required packages
!pip3 install -r requirements.txt
# Download the traffic dataset
%cd USTC-TK2016/1_Pcap/
!sudo git clone -b master https://github.com/yungshenglu/USTC-TFC2016
# Grant execute permission to the executable files
%cd ../
!chmod 777 0_Tool/SplitCap_2-1/SplitCap.exe
!chmod 777 1_Pcap2Session.ps1
!chmod 777 2_ProcessSession.ps1


- Step 2: Split the PCAP files by each session (DO NOT RUN)


In [None]:
!pwsh -File ./1_Pcap2Session.ps1

- Step 3: Process Sessions  (DO NOT RUN)

The 60000 largest PCAP files are selected, trimmed, and randomly distributed into the training and test sets.

In [None]:
!pwsh -File ./2_ProcessSession.ps1
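
The selection and random split described in Step 3 can be sketched in a few lines. This is only an illustration under stated assumptions, not the logic of `2_ProcessSession.ps1`: the function name `select_and_split`, the `test_fraction` parameter, and the fixed seed are hypothetical, and the file sizes are passed in as a plain dictionary instead of being read from disk.

```python
import random


def select_and_split(sizes, top_n, test_fraction, seed=0):
    """Keep the top_n largest files and split them randomly into train and test.

    sizes maps a file name to its size in bytes (hypothetical input format).
    """
    # Sort file names by size, largest first, and keep the first top_n
    largest = sorted(sizes, key=sizes.get, reverse=True)[:top_n]
    # Shuffle with a fixed seed so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(largest)
    n_test = int(len(largest) * test_fraction)
    return largest[n_test:], largest[:n_test]


sizes = {"a.pcap": 100, "b.pcap": 900, "c.pcap": 500, "d.pcap": 700, "e.pcap": 50}
train_files, test_files = select_and_split(sizes, top_n=4, test_fraction=0.5)
print(sorted(train_files + test_files))
```

The smallest file (`e.pcap`) is dropped before the split, and the remaining four files are divided evenly between the two sets.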

- Step 4: PCAP files converted to images (DO NOT RUN)

Each PCAP file is trimmed to 784 bytes (28 x 28); if a file is shorter than 784 bytes, 0x00 bytes are appended.

In [None]:
!python3 3_Session2Png.py
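
The trimming and zero-padding described in Step 4 can be sketched as follows. This is a minimal illustration rather than the code in `3_Session2Png.py`; the function name `session_bytes_to_image` is hypothetical.

```python
import numpy as np

IMAGE_SIZE = 28  # 28 x 28 = 784 bytes per image


def session_bytes_to_image(raw, size=IMAGE_SIZE):
    """Trim or zero-pad raw session bytes to size*size and reshape to a square image."""
    n = size * size
    trimmed = raw[:n]                    # keep only the first 784 bytes
    padded = trimmed.ljust(n, b"\x00")   # append 0x00 bytes if the session is shorter
    return np.frombuffer(padded, dtype=np.uint8).reshape(size, size)


short_img = session_bytes_to_image(b"\x01\x02\x03")         # padded to 784 bytes
long_img = session_bytes_to_image(bytes(range(256)) * 4)    # 1024 bytes, trimmed to 784
print(short_img.shape, long_img.shape)
```

Each byte becomes one grayscale pixel, so sessions of any length end up as images of the same shape.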

- Step 5: PNG files are labeled and converted to IDX files (DO NOT RUN)

In [None]:
!python3 4_Png2Mnist.py
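
The IDX format used by MNIST stores a short big-endian header before the raw bytes: 16 bytes for image files (magic number, image count, rows, columns) and 8 bytes for label files (magic number, label count). The sketch below writes and reads such headers in memory to illustrate the layout. It is not the code in `4_Png2Mnist.py`, and the helper names are hypothetical.

```python
import gzip
import io
import struct


def write_idx_images(images, rows, cols):
    """Serialize equally sized grayscale images (bytes objects) into a gzipped IDX3 blob."""
    # Magic 0x00000803: unsigned-byte data, 3 dimensions (count, rows, cols)
    header = struct.pack(">IIII", 0x00000803, len(images), rows, cols)
    return gzip.compress(header + b"".join(images))


def write_idx_labels(labels):
    """Serialize integer labels (0-255) into a gzipped IDX1 blob."""
    # Magic 0x00000801: unsigned-byte data, 1 dimension (count)
    header = struct.pack(">II", 0x00000801, len(labels))
    return gzip.compress(header + bytes(labels))


def read_idx_header(blob):
    """Return (magic, dim_0, dim_1, ...) read from a gzipped IDX blob."""
    with gzip.open(io.BytesIO(blob)) as stream:
        (magic,) = struct.unpack(">I", stream.read(4))
        ndim = magic & 0xFF  # the last magic byte holds the number of dimensions
        dims = struct.unpack(">" + "I" * ndim, stream.read(4 * ndim))
    return (magic,) + dims


images_blob = write_idx_images([bytes(784), bytes(784)], 28, 28)
labels_blob = write_idx_labels([0, 10])
print(read_idx_header(images_blob), read_idx_header(labels_blob))
```

A loader that only needs the payload can skip these headers by reading and discarding the first 16 or 8 bytes.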

##### Training and Test

Imports

In [2]:
import gzip
import numpy as np
from plotly.subplots import make_subplots
import plotly.express as px

IMAGE_SIZE = 28
DATA_DIR = 'drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/'
dict_2class = {0:'Benign',1:'Malware'}
dict_20class = {0:'BitTorrent',1:'Facetime',2:'FTP',3:'Gmail',4:'MySQL',5:'Outlook',6:'Skype',7:'SMB',8:'Weibo',9:'WorldOfWarcraft',10:'Cridex',11:'Geodo',12:'Htbot',13:'Miuref',14:'Neris',15:'Nsis-ay',16:'Shifu',17:'Tinba',18:'Virut',19:'Zeus'}



def extract_data(filename, num_images):
  """Extract the images into a 3D array [image index, y, x].
  Pixel values are returned as float32 in [0, 255]; the caller rescales them.
  """
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    # Skip the 16-byte IDX header (magic number, image count, rows, columns)
    bytestream.read(16)
    buf = bytestream.read(IMAGE_SIZE * IMAGE_SIZE * num_images)
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    data = data.reshape(num_images, IMAGE_SIZE, IMAGE_SIZE)
    return data


def extract_labels(filename, num_images):
  """Extract the labels into a vector of int64 label IDs."""
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    # Skip the 8-byte IDX header (magic number, label count)
    bytestream.read(8)
    buf = bytestream.read(1 * num_images)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
  return labels


Extract images

In [None]:
# Extract the data into NumPy arrays and rescale pixel values from [0, 255] to [0, 1]
train_data = extract_data(DATA_DIR + 'train-images-idx3-ubyte.gz', 60000) / 255.0
train_labels = extract_labels(DATA_DIR + 'train-labels-idx1-ubyte.gz', 60000)
test_data = extract_data(DATA_DIR + 't10k-images-idx3-ubyte.gz', 10000) / 255.0
test_labels = extract_labels(DATA_DIR + 't10k-labels-idx1-ubyte.gz', 10000)

# summarize loaded dataset
print('Train: X=%s, y=%s' % (train_data.shape, train_labels.shape))
print('Test: X=%s, y=%s' % (test_data.shape, test_labels.shape))

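# First 9 training images of BitTorrent (label 0) and Cridex (label 10, see dict_20class)
bitTorrentSamples = [train_data[i] for i in np.where(train_labels == 0)[0][:9]]
cridexSamples = [train_data[i] for i in np.where(train_labels == 10)[0][:9]]
# One example image per class, in label order
samples = [train_data[i] for i in [np.where(train_labels == j)[0][0] for j in range(20)]]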
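
Picking the first image of every class with `np.where(...)[0][0]` raises an `IndexError` if a class has no samples, so a quick look at the class distribution can be useful before plotting. The sketch below uses a small synthetic label vector as a stand-in for `train_labels`; the variable names are illustrative only.

```python
import numpy as np

NUM_CLASSES = 20

# Synthetic stand-in for train_labels
labels = np.array([0, 0, 3, 10, 10, 10, 19])

# Number of samples per class; minlength keeps empty classes in the result
counts = np.bincount(labels, minlength=NUM_CLASSES)

# Classes without any sample would break the "first image per class" lookup
missing = [c for c in range(NUM_CLASSES) if counts[c] == 0]
print(counts, missing)
```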

Plot BitTorrent (Benign) images

In [None]:
bitTorrentFig = make_subplots(
    rows=3, cols=3
)

for i in range(9):
  bitTorrentFig.add_trace(px.imshow(bitTorrentSamples[i], color_continuous_scale='gray').data[0], row=i // 3 + 1, col=i % 3 + 1)

bitTorrentFig.update_layout(title_text="BitTorrent (Benign) images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(bitTorrentSamples[0], color_continuous_scale='gray').layout.coloraxis
bitTorrentFig.layout.coloraxis = coloraxis
bitTorrentFig.show()

Plot Cridex (Malware) images

In [None]:
cridexFig = make_subplots(
    rows=3, cols=3
)

for i in range(9):
  cridexFig.add_trace(px.imshow(cridexSamples[i], color_continuous_scale='gray').data[0], row=i // 3 + 1, col=i % 3 + 1)

cridexFig.update_layout(title_text="Cridex (Malware) images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(cridexSamples[0], color_continuous_scale='gray').layout.coloraxis
cridexFig.layout.coloraxis = coloraxis
cridexFig.show()

Compare Benign and Malware images

In [None]:
benignFig = make_subplots(
    rows=2, cols=5,
    subplot_titles=([dict_20class.get(key) for key in range(10)]))

malwareFig = make_subplots(
    rows=2, cols=5,
    subplot_titles=([dict_20class.get(key) for key in range(10,20)]))

for i in range(10):
  benignFig.add_trace(px.imshow(samples[i], color_continuous_scale='gray').data[0], row=i // 5 + 1, col=i % 5 + 1)
  malwareFig.add_trace(px.imshow(samples[i+10], color_continuous_scale='gray').data[0], row=i // 5 + 1, col=i % 5 + 1)

benignFig.update_layout(title_text="Benign images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
malwareFig.update_layout(title_text="Malware images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(samples[0], color_continuous_scale='gray').layout.coloraxis
benignFig.layout.coloraxis = coloraxis
malwareFig.layout.coloraxis = coloraxis
benignFig.show()
malwareFig.show()

Build Model

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Add a channels dimension
#train_data = train_data[..., tf.newaxis].astype("float32")
#test_data = test_data[..., tf.newaxis].astype("float32")

# Map the 20 class labels to 2: labels 0-9 are benign (0), labels 10-19 are malware (1)
binary_train_labels = (train_labels >= 10).astype(np.int64)
binary_test_labels = (test_labels >= 10).astype(np.int64)

model = models.Sequential([
    layers.Conv2D(32, 5, activation='relu', input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, activation='relu', padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.35),
    layers.Dense(1, activation='sigmoid')
])

model.summary()
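
The spatial sizes reported by `model.summary()` can be checked by hand. The sketch below reproduces the arithmetic for the three convolution and pooling stages, assuming the Keras defaults used above (stride 1 for `Conv2D`, a 2 x 2 window with stride 2 for `MaxPooling2D`). The helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel, padding="valid"):
    """Output width of a stride-1 convolution along one spatial axis."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool=2):
    """Output width of a max-pooling layer whose stride equals its window size."""
    return size // pool


size = 28
size = pool_out(conv_out(size, 5))           # 28 -> 24 -> 12
size = pool_out(conv_out(size, 5))           # 12 -> 8 -> 4
size = pool_out(conv_out(size, 5, "same"))   # 4 -> 4 -> 2

flat_features = size * size * 64             # input width of the first Dense layer
conv1_params = (5 * 5 * 1 + 1) * 32          # weights plus one bias per filter
print(size, flat_features, conv1_params)
```

The third convolution uses `padding="same"` because a 5 x 5 kernel would not fit into the 4 x 4 feature map with valid padding.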

Train model

In [None]:
import pickle

BATCH_SIZE = 64
EPOCHS = [10, 20, 40]
LEARNING_RATE = [0.001, 0.0001]

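loss = tf.keras.losses.BinaryCrossentropy()

for epoch in EPOCHS:
  for lr in LEARNING_RATE:
    # Start every configuration from freshly initialized weights; otherwise each
    # run would continue training the weights left by the previous one
    run_model = tf.keras.models.clone_model(model)

    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    # Create new metric objects per run; fixed names keep the history keys stable
    metrics = ['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')]

    run_model.compile(loss=loss, optimizer=optimizer, metrics=metrics)

    # Train for the number of epochs of the current configuration
    results = run_model.fit(train_data, binary_train_labels, epochs=epoch, batch_size=BATCH_SIZE, validation_data=(test_data, binary_test_labels), validation_batch_size=BATCH_SIZE)

    model_name = 'model_' + str(epoch) + '_' + str(lr)

    run_model.save('drive/MyDrive/UNIPI/DL_Cybersecurity/saved_model/' + model_name)

    # Store the per-epoch metrics so the runs can be compared later
    with open('drive/MyDrive/UNIPI/DL_Cybersecurity/saved_history/' + model_name, 'wb') as file_pi:
      pickle.dump(results.history, file_pi)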
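
Once the histories are pickled, the runs can be compared by loading each file and looking at a validation metric. The sketch below is one possible way to do this. The function name `best_run` is hypothetical, the histories are synthetic and written to a temporary directory, and `val_accuracy` is assumed to be the key Keras records for validation accuracy.

```python
import pickle
import tempfile
from pathlib import Path


def best_run(history_dir, metric="val_accuracy"):
    """Return (file name, best value) for the run with the highest value of metric."""
    scores = {}
    for path in Path(history_dir).iterdir():
        with open(path, "rb") as fh:
            history = pickle.load(fh)
        scores[path.name] = max(history[metric])  # best epoch of this run
    name = max(scores, key=scores.get)
    return name, scores[name]


# Synthetic histories standing in for the files written by the training loop
fake_histories = {
    "model_10_0.001": {"val_accuracy": [0.90, 0.95]},
    "model_10_0.0001": {"val_accuracy": [0.80, 0.85]},
}

with tempfile.TemporaryDirectory() as tmp:
    for name, history in fake_histories.items():
        with open(Path(tmp) / name, "wb") as fh:
            pickle.dump(history, fh)
    best_name, best_value = best_run(tmp)

print(best_name, best_value)
```

Selecting a configuration by its score on the test set makes that score an optimistic estimate, so a separate validation split may be preferable when reporting final results.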
