# Introduction
Our group's project addresses the problem of the <b>"Intrusion Detection System (IDS)"</b>. An IDS monitors network traffic for malicious activity, the kind of threat that people working in networking would rather never encounter. IDSs can be categorized into three groups:
- Signature-Based IDS
- Anomaly-Based IDS
- Hybrid IDS


A Signature-Based IDS matches traffic against attack patterns (signatures) that have already been identified and labeled. As its name suggests, it detects known attacks efficiently, but it performs poorly on attacks that have no recorded pattern.

An Anomaly-Based IDS addresses this drawback, which troubles many network programmers. Rather than relying on known signatures, it flags traffic that deviates from normal behavior, so it plays an important role for attacks without labels and can significantly reduce both training and testing errors on them. However, it does not perform as well on attacks that already have specific signatures. A more comprehensive approach is therefore needed so that both training and testing errors are kept low for both kinds of attack.

Finally, a Hybrid IDS integrates the Signature-Based and Anomaly-Based approaches so that both labeled (known) and unlabeled (unknown) attacks can be detected and tested rigorously.
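The difference between the three categories can be illustrated with a toy example. The sketch below is not part of the project code: the payload signatures, the byte-count baseline, and the three-standard-deviation threshold are all hypothetical choices made only to show how the two detectors complement each other.

```python
from statistics import mean, stdev

# Hypothetical signatures: payload fragments assumed to be known-malicious.
KNOWN_SIGNATURES = ("' OR 1=1", "<script>")

def signature_detect(flow):
    """Flag a flow whose payload contains any known signature."""
    return any(sig in flow["payload"] for sig in KNOWN_SIGNATURES)

def fit_baseline(normal_flows):
    """Learn the mean and standard deviation of byte counts in normal traffic."""
    sizes = [f["bytes"] for f in normal_flows]
    return mean(sizes), stdev(sizes)

def anomaly_detect(flow, baseline, threshold=3.0):
    """Flag a flow whose byte count is more than `threshold` std devs from the mean."""
    mu, sigma = baseline
    return abs(flow["bytes"] - mu) > threshold * sigma

def hybrid_detect(flow, baseline):
    """Hybrid rule: raise an alert if either detector fires."""
    return signature_detect(flow) or anomaly_detect(flow, baseline)

# Made-up "normal" traffic used to fit the baseline.
normal = [{"payload": "GET /", "bytes": b} for b in (400, 420, 410, 390, 405, 415)]
baseline = fit_baseline(normal)

known_attack = {"payload": "id=1' OR 1=1 --", "bytes": 408}  # has a signature, normal size
novel_attack = {"payload": "AAAA", "bytes": 50000}           # no signature, abnormal size
benign = {"payload": "GET /index.html", "bytes": 410}

for name, flow in [("known", known_attack), ("novel", novel_attack), ("benign", benign)]:
    print(name, signature_detect(flow), anomaly_detect(flow, baseline),
          hybrid_detect(flow, baseline))
```

In this toy setting the signature detector misses the novel attack, the anomaly detector misses the known attack whose size looks normal, and the hybrid rule catches both while leaving the benign flow alone.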




# Purpose
Building on the background above, the main purpose of this research is to simulate and evaluate intrusion detection on the NF-UQ-NIDS-v2 dataset from Kaggle (https://www.kaggle.com/datasets/aryashah2k/nfuqnidsv2-network-intrusion-detection-dataset) using three machine learning models: Decision Tree, Naive Bayes, and K-Nearest Neighbor.
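As a rough picture of what such a comparison could look like, the sketch below trains the three model families with scikit-learn on synthetic data. It is only an illustration under assumptions: the data comes from `make_classification` rather than the Kaggle file, `GaussianNB` is assumed as the Naive Bayes variant, and the split ratio and hyperparameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flow features; this is NOT the Kaggle data.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Hold out 30% of the rows for testing, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),  # assumed variant
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Keeping the models in one dictionary means all three are trained and scored on exactly the same split, which keeps the comparison fair.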

In [47]:
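import pandas as pd
import numpy as np

import warnings
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Load only the first 500,000 rows of the dataset
network = pd.read_csv('NF-UQ-NIDS-v2.csv', nrows=500000)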

In [48]:
print("The dimension of the dataset extracted is: ", network.shape)
print("How many rows does this dataset contain: ", network.shape[0])
print("How many columns does this dataset contain: ", network.shape[1])




The dimension of the dataset extracted is:  (500000, 46)
How many rows does this dataset contain:  500000
How many columns does this dataset contain:  46


In [49]:
print(network.isnull().any())

IPV4_SRC_ADDR                  False
L4_SRC_PORT                    False
IPV4_DST_ADDR                  False
L4_DST_PORT                    False
PROTOCOL                       False
L7_PROTO                       False
IN_BYTES                       False
IN_PKTS                        False
OUT_BYTES                      False
OUT_PKTS                       False
TCP_FLAGS                      False
CLIENT_TCP_FLAGS               False
SERVER_TCP_FLAGS               False
FLOW_DURATION_MILLISECONDS     False
DURATION_IN                    False
DURATION_OUT                   False
MIN_TTL                        False
MAX_TTL                        False
LONGEST_FLOW_PKT               False
SHORTEST_FLOW_PKT              False
MIN_IP_PKT_LEN                 False
MAX_IP_PKT_LEN                 False
SRC_TO_DST_SECOND_BYTES        False
DST_TO_SRC_SECOND_BYTES        False
RETRANSMITTED_IN_BYTES         False
RETRANSMITTED_IN_PKTS          False
RETRANSMITTED_OUT_BYTES        False
R
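Because the per-column listing is long, a single total and a list of affected columns can be easier to scan. The sketch below shows this on a small made-up frame; the column names are borrowed from the dataset, but the values (including the missing one) are invented for illustration.

```python
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
toy = pd.DataFrame({"IN_BYTES": [420, 280, None], "IN_PKTS": [3, 2, 1]})

# Total number of missing cells across the whole frame.
total_missing = int(toy.isnull().sum().sum())

# Names of the columns that contain at least one missing value.
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

print(total_missing)       # 1
print(cols_with_missing)   # ['IN_BYTES']
```

Applied to `network`, a total of 0 would confirm in one line that no column has missing values.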

In [50]:
print("What are the numbers and distributions of the various types of attacks in the dataset:\n", network['Attack'].value_counts())
print("\n")
print("\n")
print("How many sources does this dataset contain:\n", network['Dataset'].value_counts())

What are the numbers and distributions of the various types of attacks in the dataset:
 Benign            165203
DDoS              143262
DoS               117999
scanning           25001
Reconnaissance     17186
xss                16081
password            7562
injection           4500
Bot                  935
Brute Force          809
Infilteration        786
Exploits             206
Fuzzers              135
Backdoor             122
Generic               95
mitm                  50
ransomware            28
Analysis              18
Theft                 11
Shellcode              9
Worms                  2
Name: Attack, dtype: int64




How many sources does this dataset contain:
 NF-BoT-IoT-v2            248848
NF-CSE-CIC-IDS2018-v2    124029
NF-ToN-IoT-v2            111418
NF-UNSW-NB15-v2           15705
Name: Dataset, dtype: int64
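The raw counts are easier to compare as percentages of the 500,000 loaded rows. The sketch below recomputes shares for a few of the classes using counts copied from the output above; the cutoff of 100 rows for calling a class "rare" is a hypothetical choice.

```python
import pandas as pd

# Counts copied from the value_counts() output above (a subset of the classes).
counts = pd.Series({
    "Benign": 165203, "DDoS": 143262, "DoS": 117999,
    "Theft": 11, "Shellcode": 9, "Worms": 2,
})
total_rows = 500000  # number of rows loaded with nrows=500000

# Percentage of all loaded rows that each class represents.
share = (counts / total_rows * 100).round(4)
print(share)

# Classes with fewer than 100 rows (hypothetical cutoff for "rare").
rare = counts[counts < 100].index.tolist()
print(rare)

# On the full frame, the shares can be obtained directly with:
#   network['Attack'].value_counts(normalize=True)
```

The three largest classes account for most of the rows, while classes such as Worms have only a couple of examples each.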


In [51]:
print(network.columns)

Index(['IPV4_SRC_ADDR', 'L4_SRC_PORT', 'IPV4_DST_ADDR', 'L4_DST_PORT',
       'PROTOCOL', 'L7_PROTO', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS',
       'TCP_FLAGS', 'CLIENT_TCP_FLAGS', 'SERVER_TCP_FLAGS',
       'FLOW_DURATION_MILLISECONDS', 'DURATION_IN', 'DURATION_OUT', 'MIN_TTL',
       'MAX_TTL', 'LONGEST_FLOW_PKT', 'SHORTEST_FLOW_PKT', 'MIN_IP_PKT_LEN',
       'MAX_IP_PKT_LEN', 'SRC_TO_DST_SECOND_BYTES', 'DST_TO_SRC_SECOND_BYTES',
       'RETRANSMITTED_IN_BYTES', 'RETRANSMITTED_IN_PKTS',
       'RETRANSMITTED_OUT_BYTES', 'RETRANSMITTED_OUT_PKTS',
       'SRC_TO_DST_AVG_THROUGHPUT', 'DST_TO_SRC_AVG_THROUGHPUT',
       'NUM_PKTS_UP_TO_128_BYTES', 'NUM_PKTS_128_TO_256_BYTES',
       'NUM_PKTS_256_TO_512_BYTES', 'NUM_PKTS_512_TO_1024_BYTES',
       'NUM_PKTS_1024_TO_1514_BYTES', 'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT',
       'ICMP_TYPE', 'ICMP_IPV4_TYPE', 'DNS_QUERY_ID', 'DNS_QUERY_TYPE',
       'DNS_TTL_ANSWER', 'FTP_COMMAND_RET_CODE', 'Label', 'Attack', 'Dataset'],
      dtype='ob

In [52]:
# Drop the 27 columns that are not used in this analysis (including the binary "Label");
# 19 columns remain
network = network.drop(["PROTOCOL", "L7_PROTO", "TCP_FLAGS", "CLIENT_TCP_FLAGS", "SERVER_TCP_FLAGS",
                        "DURATION_IN", "DURATION_OUT", "SRC_TO_DST_SECOND_BYTES", "DST_TO_SRC_SECOND_BYTES",
                        "RETRANSMITTED_IN_BYTES", "RETRANSMITTED_IN_PKTS", "RETRANSMITTED_OUT_BYTES",
                        "RETRANSMITTED_OUT_PKTS", "NUM_PKTS_UP_TO_128_BYTES", "NUM_PKTS_128_TO_256_BYTES",
                        "NUM_PKTS_256_TO_512_BYTES", "NUM_PKTS_512_TO_1024_BYTES", "NUM_PKTS_1024_TO_1514_BYTES",
                        "TCP_WIN_MAX_IN", "TCP_WIN_MAX_OUT", "ICMP_TYPE", "ICMP_IPV4_TYPE", "DNS_QUERY_ID",
                        "DNS_QUERY_TYPE", "DNS_TTL_ANSWER", "FTP_COMMAND_RET_CODE", "Label"], axis=1)
network

| | IPV4_SRC_ADDR | L4_SRC_PORT | IPV4_DST_ADDR | L4_DST_PORT | IN_BYTES | IN_PKTS | OUT_BYTES | OUT_PKTS | FLOW_DURATION_MILLISECONDS | MIN_TTL | MAX_TTL | LONGEST_FLOW_PKT | SHORTEST_FLOW_PKT | MIN_IP_PKT_LEN | MAX_IP_PKT_LEN | SRC_TO_DST_AVG_THROUGHPUT | DST_TO_SRC_AVG_THROUGHPUT | Attack | Dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 192.168.100.148 | 65389 | 192.168.100.7 | 80 | 420 | 3 | 0 | 0 | 4293092 | 64 | 64 | 140 | 140 | 0 | 140 | 1120000 | 0 | DoS | NF-BoT-IoT-v2 |
| 1 | 192.168.100.148 | 11154 | 192.168.100.5 | 80 | 280 | 2 | 40 | 1 | 4294499 | 64 | 64 | 140 | 40 | 40 | 140 | 0 | 320000 | DoS | NF-BoT-IoT-v2 |
| 2 | 192.168.1.31 | 42062 | 192.168.1.79 | 1041 | 44 | 1 | 40 | 1 | 0 | 0 | 0 | 44 | 40 | 40 | 44 | 352000 | 320000 | Benign | NF-ToN-IoT-v2 |
| 3 | 192.168.1.34 | 46849 | 192.168.1.79 | 9110 | 44 | 1 | 40 | 1 | 0 | 0 | 0 | 44 | 40 | 40 | 44 | 352000 | 320000 | Benign | NF-ToN-IoT-v2 |
| 4 | 192.168.1.30 | 50360 | 192.168.1.152 | 1084 | 44 | 1 | 40 | 1 | 0 | 0 | 0 | 44 | 40 | 40 | 44 | 352000 | 320000 | Benign | NF-ToN-IoT-v2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 499995 | 172.31.66.53 | 57700 | 172.31.0.2 | 53 | 70 | 1 | 183 | 1 | 0 | 0 | 0 | 183 | 70 | 70 | 183 | 560000 | 1464000 | Benign | NF-CSE-CIC-IDS2018-v2 |
| 499996 | 192.168.100.147 | 58270 | 192.168.100.6 | 80 | 56 | 2 | 0 | 0 | 4294248 | 64 | 64 | 28 | 28 | 0 | 28 | 224000 | 0 | DDoS | NF-BoT-IoT-v2 |
| 499997 | 172.31.64.111 | 59275 | 172.31.0.2 | 53 | 66 | 1 | 124 | 1 | 0 | 0 | 0 | 124 | 66 | 66 | 124 | 528000 | 992000 | Benign | NF-CSE-CIC-IDS2018-v2 |
| 499998 | 192.168.100.148 | 2327 | 192.168.100.5 | 80 | 280 | 2 | 0 | 0 | 4294686 | 64 | 64 | 140 | 140 | 0 | 140 | 0 | 0 | DoS | NF-BoT-IoT-v2 |


(500000, 19)
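Since fewer columns are kept (19) than dropped (27), the same result could also be expressed as a keep-list. The sketch below demonstrates the equivalence on a small made-up frame; the column names come from the dataset, but the rows and the particular keep/drop split are illustrative only.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names (values are made up).
toy = pd.DataFrame({
    "IN_BYTES": [420, 280],
    "PROTOCOL": [6, 6],
    "Label": [1, 1],
    "Attack": ["DoS", "DoS"],
})

# Option 1: name the columns to remove.
dropped = toy.drop(columns=["PROTOCOL", "Label"])

# Option 2: name the columns to keep (selection preserves the listed order).
keep = ["IN_BYTES", "Attack"]
kept = toy[keep]

print(list(kept.columns))     # ['IN_BYTES', 'Attack']
print(kept.equals(dropped))   # True
```

A keep-list also documents explicitly which features the later models will see.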