<div id="singlestore-header" style="display: flex; background-color: rgba(235, 249, 245, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/browser.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">IT Threat Detection, Part 1</h1>
    </div>
</div>

This notebook demonstrates how to use SingleStoreDB's similarity search to build a system that identifies rare events, a common requirement in fields such as cybersecurity and fraud detection, where only a small percentage of events are potentially malicious.

In this instance, we aim to construct a network intrusion detection system. These systems continuously monitor incoming and outgoing network traffic, generating alerts when potential threats are detected. We'll utilize a combination of a deep learning model and similarity search to identify and classify network intrusion traffic.

Our first step is to take a dataset of labeled traffic events, each marked as benign or malicious, and transform the events into vector embeddings. These embeddings are numerical representations of network traffic events, and SingleStoreDB's built-in similarity-search functions let us measure how similar two events are. To generate the embeddings, we'll use a deep learning model based on recent academic research.

Next, we'll search this dataset for the closest matches to new, unseen network events and retrieve those matches along with their labels. Propagating the labels of the matched events lets us classify each unseen event as either **benign** or **malicious**. Intrusion detection is a hard classification task, primarily because malicious events occur infrequently. Similarity search helps by surfacing relevant historical labeled events, which makes it possible to identify these rare events while keeping the false-alarm rate low.
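
The label-propagation idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook's implementation: the toy two-dimensional vectors, the `classify_by_neighbors` helper, the choice of `k = 3`, and the use of Euclidean distance are all made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_neighbors(query, labeled_vectors, k=3):
    """Return the majority label among the k stored vectors closest to query."""
    ranked = sorted(labeled_vectors, key=lambda item: euclidean(query, item[1]))
    top_labels = [label for label, _ in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical labeled history: (label, embedding) pairs.
history = [
    ("benign", [0.0, 0.1]),
    ("benign", [0.1, 0.0]),
    ("benign", [0.2, 0.1]),
    ("malicious", [5.0, 5.1]),
    ("malicious", [5.2, 4.9]),
    ("malicious", [4.9, 5.0]),
]

print(classify_by_neighbors([0.1, 0.1], history))
print(classify_by_neighbors([5.0, 5.0], history))
```

In the notebook itself, the nearest-neighbor lookup is delegated to the database rather than done in Python.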

## Install Dependencies

In [4]:
!pip3 install tensorflow keras pandas --upgrade --quiet

In [5]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import pandas as pd
import tensorflow.keras.backend as K
from tensorflow import keras
from tensorflow.keras.models import Model

We'll define a Python context manager called `clear_memory()` using the **contextlib** module. This context manager will be used to clear memory by running Python's garbage collector (`gc.collect()`) after a block of code is executed.

In [6]:
import contextlib
import gc

@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()
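
Because the collection happens in a `finally` clause, it runs even when the wrapped block raises. The following sketch checks this on CPython by registering a hypothetical `record_collection` hook in `gc.callbacks`; the hook and the simulated error are assumptions added purely for illustration.

```python
import contextlib
import gc

@contextlib.contextmanager
def clear_memory():
    try:
        yield
    finally:
        gc.collect()

# Record every full (generation 2) collection the interpreter completes.
full_collections = []

def record_collection(phase, info):
    if phase == "stop" and info["generation"] == 2:
        full_collections.append(info)

gc.callbacks.append(record_collection)
try:
    with clear_memory():
        raise ValueError("simulated failure inside the block")
except ValueError as exc:
    # The error still propagates to the caller after the cleanup has run.
    error_message = str(exc)
finally:
    gc.callbacks.remove(record_collection)

print(error_message)
print(len(full_collections) >= 1)
```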

We'll incorporate portions of code from this [research work](https://github.com/Colorado-Mesa-University-Cybersecurity/DeepLearning-IDS). To begin, we'll clone the repository required for data preparation.

In [7]:
!git clone -q https://github.com/Colorado-Mesa-University-Cybersecurity/DeepLearning-IDS.git

## Data Preparation

The datasets we'll utilize comprise two types of network traffic:

1. Benign (normal)
2. Malicious (attack)

The malicious traffic stems from various network attacks. Our focus will be solely on web-based attacks, which fall into three common categories:

1. Cross-site scripting (BruteForce-XSS)
2. SQL-Injection (SQL-Injection)
3. Brute force attempts on administrative and user passwords (BruteForce-Web)

The original data was collected over a span of two days.

### Download Data

We'll proceed by downloading data for two specific dates:

1. February 22, 2018
2. February 23, 2018

These files will be retrieved and saved to the current directory. Our intention is to use one of these dates for training and generating vectors, while the other will be reserved for testing purposes.

In [8]:
!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress
!wget "https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" -q --show-progress



### Review Data

In [14]:
with clear_memory():
    data = pd.read_csv('Friday-23-02-2018_TrafficForML_CICFlowMeter.csv')

data.Label.value_counts()

Label
Benign              1048009
Brute Force -Web        362
Brute Force -XSS        151
SQL Injection            53
Name: count, dtype: int64
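
To make the class imbalance concrete, the share of attack rows can be computed directly from the counts printed above. This is a small arithmetic sketch; the `label_counts` dictionary simply restates that output.

```python
# Counts copied from the value_counts() output for the February 23 file.
label_counts = {
    "Benign": 1048009,
    "Brute Force -Web": 362,
    "Brute Force -XSS": 151,
    "SQL Injection": 53,
}

total = sum(label_counts.values())
attacks = total - label_counts["Benign"]
attack_share = attacks / total

print(f"{attacks} attack rows out of {total} ({attack_share:.4%})")
```

Attack traffic makes up roughly 0.05% of the rows, which is why a classifier that always answers "benign" would look accurate while detecting nothing.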

### Clean Data

We'll run a cleanup script from the previously downloaded GitHub repo.

In [17]:
!python DeepLearning-IDS/data_cleanup.py "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv" "result23022018"

cleaning Friday-23-02-2018_TrafficForML_CICFlowMeter.csv
total rows read = 1048576
all done writing 1042868 rows; dropped 5708 rows


We'll now review the cleaned data from the previous step.

In [18]:
with clear_memory():
    data_23_cleaned = pd.read_csv('result23022018.csv')

data_23_cleaned.head()

Unnamed: 0,Dst Port,Protocol,Timestamp,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,22,6,1519374000.0,1532698,11,11,1179,1969,648,0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,Benign
1,500,17,1519374000.0,117573855,3,0,1500,0,500,500,...,8,0.0,0.0,0,0,58786927.5,23753240.0,75583006,41990849,Benign
2,500,17,1519374000.0,117573848,3,0,1500,0,500,500,...,8,0.0,0.0,0,0,58786924.0,23753250.0,75583007,41990841,Benign
3,22,6,1519374000.0,1745392,11,11,1179,1969,648,0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,Benign
4,500,17,1519374000.0,89483474,6,0,3000,0,500,500,...,8,4000364.0,0.0,4000364,4000364,21370777.5,15280920.0,41989576,7200485,Benign


In [19]:
data_23_cleaned.Label.value_counts()

Label
Benign              1042301
Brute Force -Web        362
Brute Force -XSS        151
SQL Injection            53
Name: count, dtype: int64

## Load Model

In this section, we'll load a model that was pre-trained on data collected on the same date.

The original model has been modified slightly: the number of output classes was changed. Initially, the model was designed to classify into four categories:

1. Benign
2. BruteForce-Web
3. BruteForce-XSS
4. SQL-Injection

Our modified model has been adjusted to classify into just two categories:

1. Benign
2. Attack
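
Collapsing the four categories into two amounts to a simple label mapping. The sketch below is an assumption about what such a mapping could look like: the `to_binary_label` helper is hypothetical, and the notebook does not show how the labels were actually prepared when the model was retrained. The label strings are the ones that appear in the `Label` column of this dataset.

```python
def to_binary_label(label: str) -> int:
    """Map a dataset label to 0 (benign) or 1 (attack)."""
    return 0 if label.strip() == "Benign" else 1

dataset_labels = ["Benign", "Brute Force -Web", "Brute Force -XSS", "SQL Injection"]
binary_labels = {label: to_binary_label(label) for label in dataset_labels}

print(binary_labels)
```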

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Action Required</b></p>
        <p>The ZIP file is hosted on Google Drive.</p>
        <p>Using the <b>Edit Firewall</b> button in the top right, add the following to the SingleStoreDB Cloud notebook firewall, one-by-one:
            <ul style="list-style: none;">
                <li><b>drive.google.com</b></li>
                <li><b>*.googleapis.com</b></li>
                <li><b>*.googleusercontent.com</b></li>
            </ul>
        </p>
    </div>
</div>

In [20]:
!wget -q -O it_threat_model.zip "https://drive.google.com/uc?export=download&id=1ahr5dYlhuxS56M6helUFI0yIxxIoFk9o" 
!unzip -q it_threat_model.zip

In [21]:
with clear_memory():
    model = keras.models.load_model('it_threat_model')

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               10240     
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 18561 (72.50 KB)
Trainable params: 18561 (72.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
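
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The input width of 79 used below is not stated in the summary; it is inferred from the first layer's 10,240 parameters, so treat it as an assumption.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for dense, dense_1, dense_2; 79 is inferred, not documented.
layers = [(79, 128), (128, 64), (64, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]

print(counts, sum(counts))
```

The 128-unit output of the first layer is what the notebook stores as the embedding for each event.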


In [22]:
with clear_memory():
    # Use the first layer
    layer_name = 'dense'
    intermediate_layer_model = Model(
        inputs = model.input,
        outputs = model.get_layer(layer_name).output
    )

## Upload Data to SingleStoreDB

### Prepare Data
We'll define each item ID so that it encodes the event's label: the first three characters of the label, followed by an underscore and the row index (for example, `Ben_0`).

In [23]:
from tqdm import tqdm
items_to_upload = []

with clear_memory():
    model_res = intermediate_layer_model.predict(K.constant(data_23_cleaned.iloc[:,:-1]))
    
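    # Pair each DataFrame row with the embedding the model produced for it
    for (row_index, row), res in tqdm(zip(data_23_cleaned.iterrows(), model_res), total = len(model_res)):
        # The first three characters of the label give 'Ben', 'Bru', or 'SQL'
        benign_or_attack = row['Label'][:3]
        items_to_upload.append((benign_or_attack + '_' + str(row_index), res.tolist()))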



100%|██████████| 1042867/1042867 [00:50<00:00, 20810.99it/s]
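
The ID scheme can be illustrated in isolation. Both helpers below are hypothetical names introduced for this sketch; `is_attack` shows one way the label could later be recovered from a stored ID, assuming only benign events produce the `Ben` prefix.

```python
def make_item_id(label: str, row_index: int) -> str:
    """Build an ID from the first three characters of the label and the row index."""
    return f"{label[:3]}_{row_index}"

def is_attack(item_id: str) -> bool:
    """Treat every ID that does not start with 'Ben' as an attack."""
    return not item_id.startswith("Ben")

ids = [
    make_item_id("Benign", 0),
    make_item_id("Brute Force -Web", 17),
    make_item_id("SQL Injection", 42),
]

print(ids)
print([is_attack(item_id) for item_id in ids])
```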


We'll store the data in a Pandas DataFrame.

In [29]:
with clear_memory():
    df = pd.DataFrame(items_to_upload, columns=['ID', 'Model_Results'])

df.head()

Unnamed: 0,ID,Model_Results
0,Ben_0,"[0.0, 0.0, 0.0, 125628656.0, 0.0, 0.0, 5421442..."
1,Ben_1,"[0.0, 0.0, 0.0, 356751744.0, 1190461440.0, 0.0..."
2,Ben_2,"[0.0, 0.0, 0.0, 356751680.0, 1190461440.0, 0.0..."
3,Ben_3,"[0.0, 0.0, 0.0, 125515856.0, 0.0, 0.0, 5432884..."
4,Ben_4,"[0.0, 0.0, 0.0, 26214912.0, 698683840.0, 0.0, ..."


Now we'll convert the vectors to a binary format, ready to store in SingleStoreDB.

In [30]:
import struct

def data_to_binary(data: list[float]):
    format_string = 'f' * len(data)
    return struct.pack(format_string, *data)

with clear_memory():
    df['Model_Results'] = df['Model_Results'].apply(data_to_binary)
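
As a quick sanity check of the packing step, the sketch below round-trips a short vector through `struct`. The `binary_to_data` helper and the sample values are assumptions added for illustration; the values were chosen because they are exactly representable as 32-bit floats, so the round trip is lossless.

```python
import struct

def data_to_binary(data):
    """Pack floats as consecutive 32-bit values in native byte order."""
    return struct.pack('f' * len(data), *data)

def binary_to_data(blob):
    """Inverse of data_to_binary (hypothetical helper for this sketch)."""
    count = len(blob) // struct.calcsize('f')
    return list(struct.unpack('f' * count, blob))

vector = [0.0, 1.5, -2.25, 125628656.0]
blob = data_to_binary(vector)

print(len(blob))
print(binary_to_data(blob) == vector)
```

Each value occupies 4 bytes, so a 128-dimensional embedding becomes a 512-byte blob.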

We'll check the DataFrame.

In [31]:
df.head()

Unnamed: 0,ID,Model_Results
0,Ben_0,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
1,Ben_1,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
2,Ben_2,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
3,Ben_3,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
4,Ben_4,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...


### Create Database and Table

In [34]:
%%sql
DROP DATABASE IF EXISTS siem_log_kafka_demo;

CREATE DATABASE IF NOT EXISTS siem_log_kafka_demo;

USE siem_log_kafka_demo;

DROP TABLE IF EXISTS model_results;

CREATE TABLE IF NOT EXISTS model_results (
    id TEXT,
    Model_Results BLOB
);

### Get Connection Details

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Action Required</b></p>
        <p>Select the database from the drop-down menu at the top of this notebook. It updates the <b>connection_url</b> which is used by SQLAlchemy to make connections to the selected database.</p>
    </div>
</div>

In [36]:
from sqlalchemy import create_engine

db_connection = create_engine(connection_url)

### Store DataFrame

In [37]:
with clear_memory():
    df.to_sql(
        'model_results',
        con = db_connection,
        if_exists = 'append',
        index = False,
        chunksize = 1000
    )
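
With `chunksize = 1000`, pandas writes the rows in batches of 1,000 rather than all at once. The sketch below estimates how many batches that implies for this DataFrame; the `iter_chunks` helper is a hypothetical stand-in for the batching idea, not pandas' internal code, and the row count is taken from the progress bar shown earlier.

```python
import math

def iter_chunks(rows, chunksize):
    """Yield consecutive slices of at most chunksize rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]

n_rows = 1042867  # rows processed when the embeddings were generated
n_batches = math.ceil(n_rows / 1000)

# Small demonstration of the batching pattern itself.
demo_batches = list(iter_chunks(list(range(7)), 3))

print(n_batches)
print(demo_batches)
```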

### Check Stored Data

In [47]:
%%sql
USE siem_log_kafka_demo;

SELECT ID, JSON_ARRAY_UNPACK(Model_Results) AS Model_Results
FROM model_results
LIMIT 1;

ID,Model_Results
Ben_764632,"[0, 0, 0, 161398336, 0, 0, 91465440, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 186428320, 0, 0, 0, 167306864, 0, 0, 277207904, 0, 92328576, 73124928, 0, 0, 0, 95751136, 0, 0, 0, 0, 0, 0, 0, 0, 230162768, 273622432, 511405120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 145775152, 106490400, 373456928, 0, 0, 0, 211604256, 30848250, 0, 0, 0, 0, 326004800, 0, 0, 0, 0, 13625428, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 248507264, 0, 121489904, 196521904, 0, 2331058, 0, 0, 234076784, 247954704, 0, 0, 16321682, 0, 0, 0, 343808992, 0, 0, 0, 74993352, 0, 0, 59710728, 0, 0, 89274704, 0, 174431776, 107296112, 0, 0, 134864096, 0, 0]"
