<a href="https://colab.research.google.com/github/teogoulas/cybersecurity/blob/main/Cybersecurity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align='center'>Cybersecurity threats detection using Deep Learning Architectures</h1>

### Types of Attacks

- *Denial-of-service (DoS) attack*: a service is made temporarily or permanently unavailable by flooding it with traffic
- *Remote-to-local attack*: an attacker gains unauthorized access to a system by sending packets to it over the network
- *Probing*: the network is scanned and mapped to collect information and data about it
- *User-to-root attack*: an attacker with access to a normal user account (for example through a traced password) escalates to root privileges
- *Adversarial attacks*: deep neural networks are targeted by injecting noise into the training data
- *Integrity attacks*: system data is corrupted or encrypted
- *Causative attacks*: the decision-making algorithm of a neural network is attacked, leading to misclassification

### USTC-TK2016 Dataset

USTC-TK2016 consists of a set of pcap files containing raw network traffic from 10 benign and 10 malware applications, as shown in the table below:<br>
![USTC-TK2016](https://github.com/teogoulas/cybersecurity/blob/main/data/img/USTC-TK2016.png?raw=1)<br>

### Approach

- *CNN*: pcap files are transformed into MNIST-style images, which are then fed to a CNN

#### Malware Traffic Classification Using CNN

##### Data preprocessing

- Step 1: Install pre-requisites (DO NOT RUN)

In [None]:
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')

# Update the list of packages
!sudo apt-get update
# Install pre-requisite packages.
!sudo apt-get install -y wget apt-transport-https software-properties-common
# Download the Microsoft repository GPG keys
!wget -q https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb
# Register the Microsoft repository GPG keys
!sudo dpkg -i packages-microsoft-prod.deb
# Update the list of packages after we added packages.microsoft.com
!sudo apt-get update
# Install PowerShell
!sudo apt-get install -y powershell
# Install the SplitCap prerequisite (Mono runtime); -y avoids the interactive prompt
!sudo apt-get install -y mono-runtime
# Install fdupes to find duplicate files
!sudo apt-get install -y fdupes

%cd drive/MyDrive/UNIPI/DL_Cybersecurity/
# Clone the repository on "ubuntu" branch
!sudo git clone -b ubuntu https://github.com/yungshenglu/USTC-TK2016 USTC-TK2016
# Install the required packages
!pip3 install -r requirements.txt
# Download the traffic dataset
%cd USTC-TK2016/1_Pcap/
!sudo git clone -b master https://github.com/yungshenglu/USTC-TFC2016
# Grant execute permission to the executable files
%cd ../
!chmod 777 0_Tool/SplitCap_2-1/SplitCap.exe
!chmod 777 1_Pcap2Session.ps1
!chmod 777 2_ProcessSession.ps1


- Step 2: Split the PCAP files by each session (DO NOT RUN)


In [None]:
!pwsh -File ./1_Pcap2Session.ps1

- Step 3: Process Sessions  (DO NOT RUN)

The 60000 largest PCAP files are selected, trimmed, and randomly split into training and test sets.

In [None]:
!pwsh -File ./2_ProcessSession.ps1

- Step 4: PCAP files converted to images (DO NOT RUN)

Each PCAP file is trimmed to 784 bytes (28 x 28); files shorter than 784 bytes are padded with 0x00 bytes.

In [None]:
!python3 3_Session2Png.py
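
The trimming and zero-padding described in Step 4 can be illustrated with a minimal sketch. This is not the code in `3_Session2Png.py`; the helper name `session_bytes_to_image` is hypothetical, and nested lists are used instead of an image library so that the example runs without dependencies.

```python
IMAGE_SIZE = 28
TARGET_LENGTH = IMAGE_SIZE * IMAGE_SIZE  # 784 bytes per session


def session_bytes_to_image(raw: bytes) -> list:
    """Trim or zero-pad raw session bytes to 784 bytes and reshape to 28 x 28."""
    trimmed = raw[:TARGET_LENGTH]
    padded = trimmed + b"\x00" * (TARGET_LENGTH - len(trimmed))
    # Each byte becomes one grayscale pixel value in [0, 255].
    return [
        list(padded[row * IMAGE_SIZE:(row + 1) * IMAGE_SIZE])
        for row in range(IMAGE_SIZE)
    ]


# A 3-byte session is padded with zeros; a 1024-byte session is cut to 784 bytes.
short_image = session_bytes_to_image(bytes([255, 128, 7]))
long_image = session_bytes_to_image(bytes(range(256)) * 4)
print(len(short_image), len(short_image[0]), short_image[0][:4])
```

Under this scheme only the first 784 bytes of a session contribute to the image, so two sessions that differ only after that point produce identical images.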

- Step 5: PNG files are labeled and converted to IDX files (DO NOT RUN)

In [None]:
!python3 4_Png2Mnist.py
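
Step 5 produces gzip-compressed IDX files, the layout used by MNIST. As a rough sketch of that layout (not the code in `4_Png2Mnist.py`; all function names here are hypothetical): an image file starts with a 16-byte big-endian header (magic number 2051, image count, rows, columns), a label file with an 8-byte header (magic number 2049, label count), and each header is followed by one unsigned byte per pixel or label. These are the 16 and 8 bytes a reader has to skip before the payload.

```python
import gzip
import struct


def encode_idx_images(images, rows=28, cols=28):
    """Pack flat images (sequences of 0-255 ints) as a gzipped IDX3 blob."""
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    payload = b"".join(bytes(image) for image in images)
    return gzip.compress(header + payload)


def encode_idx_labels(labels):
    """Pack labels (0-255 ints) as a gzipped IDX1 blob."""
    header = struct.pack(">II", 2049, len(labels))
    return gzip.compress(header + bytes(labels))


def decode_idx_images(blob):
    """Read a gzipped IDX3 blob back into a list of flat images."""
    data = gzip.decompress(blob)
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX3 image file")
    size = rows * cols
    return [list(data[16 + i * size:16 + (i + 1) * size]) for i in range(count)]


# Round trip with two tiny 2 x 2 "images" and their labels.
images = [[0, 1, 2, 3], [255, 254, 253, 252]]
decoded = decode_idx_images(encode_idx_images(images, rows=2, cols=2))
label_blob = encode_idx_labels([0, 10])
print(decoded == images)
```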

##### Training and Test

###### Extract data

In [4]:
import gzip
import time
import sys
import numpy as np
import os

IMAGE_SIZE = 28
DATA_DIR = 'drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/'
dict_2class = {0:'Benign',1:'Malware'}
dict_20class = {0:'BitTorrent',1:'Facetime',2:'FTP',3:'Gmail',4:'MySQL',5:'Outlook',6:'Skype',7:'SMB',8:'Weibo',9:'WorldOfWarcraft',10:'Cridex',11:'Geodo',12:'Htbot',13:'Miuref',14:'Neris',15:'Nsis-ay',16:'Shifu',17:'Tinba',18:'Virut',19:'Zeus'}



def extract_data(filename, num_images):
  """Extract the images into a 3D array [image index, y, x].

  Pixel values are returned as float32 in [0, 255]; scaling is left to the caller.
  """
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    # Skip the 16-byte IDX header (magic number, image count, rows, columns).
    bytestream.read(16)
    buf = bytestream.read(IMAGE_SIZE * IMAGE_SIZE * num_images)
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    data = data.reshape(num_images, IMAGE_SIZE, IMAGE_SIZE)
  return data


def extract_labels(filename, num_images):
  """Extract the labels into a vector of int64 label IDs."""
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    # Skip the 8-byte IDX header (magic number, label count).
    bytestream.read(8)
    buf = bytestream.read(1 * num_images)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
  return labels

# Extract the data into NumPy arrays and scale pixel values from [0, 255] to [0, 1].
train_data = extract_data(DATA_DIR + 'train-images-idx3-ubyte.gz', 60000) / 255.0
train_labels = extract_labels(DATA_DIR + 'train-labels-idx1-ubyte.gz', 60000)
test_data = extract_data(DATA_DIR + 't10k-images-idx3-ubyte.gz', 10000) / 255.0
test_labels = extract_labels(DATA_DIR + 't10k-labels-idx1-ubyte.gz', 10000)

# summarize loaded dataset
print('Train: X=%s, y=%s' % (train_data.shape, train_labels.shape))
print('Test: X=%s, y=%s' % (test_data.shape, test_labels.shape))

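# First nine samples of BitTorrent (label 0) and Cridex (label 10) for plotting.
bitTorrentSamples = [train_data[i] for i in np.where(train_labels == 0)[0][:9]]
cridexSamples = [train_data[i] for i in np.where(train_labels == 10)[0][:9]]
# First sample of each of the 20 classes: indices 0-9 are benign, 10-19 are malware.
benignSamples = [train_data[i] for i in [np.where(train_labels == j)[0][0] for j in range(20)]]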


Extracting drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/train-images-idx3-ubyte.gz
Extracting drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/train-labels-idx1-ubyte.gz
Extracting drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/t10k-images-idx3-ubyte.gz
Extracting drive/MyDrive/UNIPI/DL_Cybersecurity/USTC-TK2016/5_Mnist/t10k-labels-idx1-ubyte.gz
Train: X=(60000, 28, 28), y=(60000,)
Test: X=(10000, 28, 28), y=(10000,)


###### Plot Benign & Malware Images

In [None]:
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go

bitTorrentFig = make_subplots(
    rows=3, cols=3
)

for i in range(9):
  bitTorrentFig.add_trace(px.imshow(bitTorrentSamples[i], color_continuous_scale='gray').data[0], row=int(i/3)+1, col=i%3+1)

bitTorrentFig.update_layout(title_text="BitTorrent (Benign) images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(bitTorrentSamples[0], color_continuous_scale='gray').layout.coloraxis
bitTorrentFig.layout.coloraxis = coloraxis
bitTorrentFig.show()


In [None]:
cridexFig = make_subplots(
    rows=3, cols=3
)

for i in range(9):
  cridexFig.add_trace(px.imshow(cridexSamples[i], color_continuous_scale='gray').data[0], row=int(i/3)+1, col=i%3+1)

cridexFig.update_layout(title_text="Cridex (Malware) images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(cridexSamples[0], color_continuous_scale='gray').layout.coloraxis
cridexFig.layout.coloraxis = coloraxis
cridexFig.show()

In [None]:
benignFig = make_subplots(
    rows=2, cols=5,
    subplot_titles=([dict_20class.get(key) for key in range(10)]))

malwareFig = make_subplots(
    rows=2, cols=5,
    subplot_titles=([dict_20class.get(key) for key in range(10,20)]))

for i in range(10):
  benignFig.add_trace(px.imshow(benignSamples[i], color_continuous_scale='gray').data[0], row=int(i/5)+1, col=i%5+1)
  malwareFig.add_trace(px.imshow(benignSamples[i+10], color_continuous_scale='gray').data[0], row=int(i/5)+1, col=i%5+1)

benignFig.update_layout(title_text="Benign images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
malwareFig.update_layout(title_text="Malware images", title_xanchor="center", title_yanchor="top", title_x=0.5, title_y=0.9, height=700)
coloraxis = px.imshow(benignSamples[0], color_continuous_scale='gray').layout.coloraxis
benignFig.layout.coloraxis = coloraxis
malwareFig.layout.coloraxis = coloraxis
benignFig.show()
malwareFig.show()


###### Build DL model

In [6]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Add a channels dimension
train_data = train_data[..., tf.newaxis].astype("float32")
test_data = test_data[..., tf.newaxis].astype("float32")

# Map the 20 class labels to 2: benign (0-9) -> 0, malware (10-19) -> 1
binary_train_labels = (train_labels >= 10).astype(np.int64)
binary_test_labels = (test_labels >= 10).astype(np.int64)

model = models.Sequential([
    layers.Conv2D(32, 5, activation='relu', input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, activation='relu', padding="same"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.35),
    layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 24, 24, 32)        832       
_________________________________________________________________
batch_normalization (BatchNo (None, 24, 24, 32)        128       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 8, 8, 64)          51264     
_________________________________________________________________
batch_normalization_1 (Batch (None, 8, 8, 64)          256       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 4, 4, 64)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 4, 4, 64)          1
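
The output shapes and the first two parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model definition: stride 1 and `'valid'` padding for `Conv2D` unless `padding="same"` is given, and a 2 x 2 window with stride 2 for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return -(-size // stride)  # ceiling division
    return (size - kernel) // stride + 1


def pool_output_size(size, pool=2):
    """Spatial output size of non-overlapping max pooling along one dimension."""
    return size // pool


def conv_param_count(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size = 28
shapes = []
for kernel, padding in [(5, "valid"), (5, "valid"), (5, "same")]:
    size = conv_output_size(size, kernel, padding)
    shapes.append(("conv", size))
    size = pool_output_size(size)
    shapes.append(("pool", size))

print(shapes)
# [('conv', 24), ('pool', 12), ('conv', 8), ('pool', 4), ('conv', 4), ('pool', 2)]
print(conv_param_count(5, 1, 32), conv_param_count(5, 32, 64))
# 832 51264
```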

###### Train DL model

In [None]:
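import pickle

BATCH_SIZE = 64
EPOCHS = [20, 40]
LEARNING_RATE = [0.001, 0.0001]

for epoch in EPOCHS:
  for lr in LEARNING_RATE:
    # Clone the architecture so that every configuration starts from freshly
    # initialised weights instead of continuing from the previous run.
    run_model = tf.keras.models.clone_model(model)

    loss = tf.keras.losses.BinaryCrossentropy()
    # Fixed metric names keep the history keys identical across runs.
    metrics = ['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')]
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

    run_model.compile(loss=loss, optimizer=optimizer, metrics=metrics)

    # Note: the test set doubles as the validation set here.
    results = run_model.fit(train_data, binary_train_labels, epochs=epoch, batch_size=BATCH_SIZE, validation_data=(test_data, binary_test_labels), validation_batch_size=BATCH_SIZE)

    model_name = 'model_' + str(epoch) + '_' + str(lr)

    run_model.save('drive/MyDrive/UNIPI/DL_Cybersecurity/saved_model/' + model_name)

    # Store the per-epoch metrics next to the saved model.
    with open('drive/MyDrive/UNIPI/DL_Cybersecurity/saved_history/' + model_name, 'wb') as file_pi:
      pickle.dump(results.history, file_pi)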



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
INFO:tensorflow:Assets written to: drive/MyDrive/UNIPI/DL_Cybersecurity/saved_model/model_20_0.001/assets
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
INFO:tensorflow:Assets written to: drive/MyDrive/UNIPI/DL_Cybersecurity/saved_model/model_20_0.0001/assets
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/4
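
Each run pickles `results.history`, a dictionary that maps metric names to per-epoch lists. A minimal sketch of how such a dictionary could be inspected afterwards is given below. The helper `best_epoch` is hypothetical, the numbers are placeholders rather than results from this notebook, and the key `val_accuracy` assumes the `'accuracy'` metric passed to `compile` together with validation data.

```python
import pickle


def best_epoch(history, key="val_accuracy"):
    """Return (1-based epoch number, value) of the highest value stored under key."""
    values = history[key]
    index = max(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]


# Placeholder history with made-up numbers, standing in for results.history.
example_history = {
    "accuracy": [0.90, 0.95, 0.97],
    "val_accuracy": [0.91, 0.96, 0.94],
}

# Round trip through pickle, as the training loop does when saving to disk.
restored = pickle.loads(pickle.dumps(example_history))
epoch_number, value = best_epoch(restored)
print(epoch_number, value)  # 2 0.96
```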

###### Build ML classifier (SVM)

In [None]:
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split


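# Flatten each 28 x 28 x 1 image into a 784-element feature vector.
flattened_train_data = train_data.reshape(len(train_data), -1)
flattened_test_data = test_data.reshape(len(test_data), -1)

# Fit the scaler on the training data only and reuse its statistics for the test data.
scaler = preprocessing.StandardScaler()
train_data_scaled = scaler.fit_transform(flattened_train_data)
test_data_scaled = scaler.transform(flattened_test_data)

# Tune on a random 10% subset of the training data to keep the grid search tractable.
_, X_train, _, y_train = train_test_split(train_data_scaled, binary_train_labels, test_size=0.10, random_state=42)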

grid_params = {
        'C': [0.1, 1, 10],
        'gamma': [0.1, 0.01, 0.001],
        'kernel': ['rbf', 'linear', 'sigmoid']
    }

clf = SVC(random_state=45, probability=True)

gs = GridSearchCV(
        clf,
        grid_params,
        verbose=3,
        cv=5,
        return_train_score=True
    )

gs.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  C=0.1, gamma=0.1, kernel=rbf, score=(train=0.805, test=0.812), total= 1.7min
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.0min remaining:    0.0s


[CV]  C=0.1, gamma=0.1, kernel=rbf, score=(train=0.809, test=0.800), total= 1.7min
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.9min remaining:    0.0s


[CV]  C=0.1, gamma=0.1, kernel=rbf, score=(train=0.812, test=0.787), total= 1.6min
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=(train=0.802, test=0.821), total= 1.7min
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=(train=0.811, test=0.797), total= 1.7min
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=(train=0.995, test=0.927), total=  34.5s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=(train=0.996, test=0.932), total=  31.1s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=(train=0.995, test=0.924), total=  32.7s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=(train=0.994, test=0.9

[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed: 159.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=True, random_state=45, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001],
                         'kernel': ['rbf', 'linear', 'sigmoid']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=None, verbose=3)
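
The counts reported in the log (27 candidates, 135 fits) follow from the grid: 3 values of `C` times 3 values of `gamma` times 3 kernels, each evaluated on 5 folds. The sketch below enumerates the same grid with the standard library only. It also counts how many candidates are effectively distinct, on the assumption that `SVC` ignores `gamma` for the linear kernel, as the scikit-learn documentation describes.

```python
from itertools import product

grid_params = {
    "C": [0.1, 1, 10],
    "gamma": [0.1, 0.01, 0.001],
    "kernel": ["rbf", "linear", "sigmoid"],
}
CV_FOLDS = 5

# Every combination of one value per parameter is one candidate.
names = sorted(grid_params)
candidates = [
    dict(zip(names, values))
    for values in product(*(grid_params[name] for name in names))
]
total_fits = len(candidates) * CV_FOLDS

# gamma does not affect the linear kernel, so those candidates collapse per C.
distinct = {
    (c["C"], c["kernel"], None if c["kernel"] == "linear" else c["gamma"])
    for c in candidates
}

print(len(candidates), total_fits, len(distinct))  # 27 135 21
```

Under that assumption, dropping `gamma` for the linear kernel (for example by passing a list of two parameter grids) would save 6 of the 27 candidates, or 30 of the 135 fits.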

In [None]:
gs.best_params_

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

###### Training ML classifier

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score 

clf = SVC(random_state=45, probability=True, C=10, gamma=0.001, kernel='rbf')
clf.fit(train_data_scaled, binary_train_labels)
y_predicted = clf.predict(test_data_scaled)

# Evaluate on the test set; precision and recall are macro-averaged over both classes
acc_score = accuracy_score(binary_test_labels, y_predicted)
prec_score = precision_score(binary_test_labels, y_predicted, average='macro')
rec_score = recall_score(binary_test_labels, y_predicted, average='macro')

print('Accuracy: {}'.format(acc_score))
print('Precision: {}'.format(prec_score))
print('Recall: {}'.format(rec_score))
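
For reference, `average='macro'` computes precision and recall separately for each class and then takes their unweighted mean. A dependency-free sketch of that calculation on a toy label vector follows; the labels are illustrative, not results from this notebook, and `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label sequences."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for cls in classes:
        true_positives = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        actual = sum(t == cls for t in y_true)
        # A class that is never predicted (or never present) scores 0 here.
        precisions.append(true_positives / predicted if predicted else 0.0)
        recalls.append(true_positives / actual if actual else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy example: 0 = benign, 1 = malware.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
accuracy, precision, recall = macro_scores(y_true, y_pred)
print(accuracy, round(precision, 4), round(recall, 4))  # 0.75 0.7333 0.7333
```

Because both classes carry equal weight in the mean, macro averaging does not let the larger class dominate the score.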

In [8]:
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split


flattened_train_data = [d[:, :, 0].flatten() for d in train_data]
flattened_test_data = [d[:, :, 0].flatten() for d in test_data]

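# Standardise both sets with statistics computed on the training data only.
scaler = preprocessing.StandardScaler().fit(flattened_train_data)
train_data_scaled = scaler.transform(flattened_train_data)
test_data_scaled = scaler.transform(flattened_test_data)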

In [2]:
# Connect to Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
