# Assignment 3

## Overview
In this assignment, we use the RT-IoT2022 cybersecurity dataset and examine its features. The dataset, introduced by Sharmila et al. (2023), captures adversarial and normal network behaviors using more than 80 variables derived from IoT devices such as Amazon Alexa and Raspberry Pi, as well as simulated attacks like DDoS and ARP poisoning. These features include metrics like packet counts, flow durations, payload sizes, and TCP/IP-specific flags.

For each "bucket" of features, we outline:

* What the feature represents
* What datatype the feature is (e.g., categorical, numerical, float, integer)
* A judgement as to whether the feature might be an important predictor of the target variable
* Some characterization of the range of observed values (e.g., mean and standard deviation for numerical variables, list or description of the levels for categorical variables)
* Whether any transformation (e.g., log transform) or encoding (e.g., one-hot encoding) might be needed to use the feature in a predictive model

### Dataset Loading

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder
pd.set_option('display.max_rows',100)
from IPython.display import display

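# Load the dataset from a local path (adjust for your environment)
file_path = '/Users/tejaleburu/Desktop/RT_IOT2022.csv'
data = pd.read_csv(file_path)
# Drop the two port columns and the leftover CSV index column;
# errors='ignore' skips any of these that are not present
data.drop(columns=['id.orig_p', 'id.resp_p', 'Unnamed: 0'], inplace=True, errors='ignore')
# Remove rows with missing values
data.dropna(inplace=True)
data.head()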

| | proto | service | flow_duration | fwd_pkts_tot | bwd_pkts_tot | fwd_data_pkts_tot | bwd_data_pkts_tot | fwd_pkts_per_sec | bwd_pkts_per_sec | flow_pkts_per_sec | ... | active.std | idle.min | idle.max | idle.tot | idle.avg | idle.std | fwd_init_window_size | bwd_init_window_size | fwd_last_window_size | Attack_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tcp | mqtt | 32.011598 | 9 | 5 | 3 | 3 | 0.281148 | 0.156193 | 0.437341 | ... | 0.0 | 29729180.0 | 29729180.0 | 29729180.0 | 29729180.0 | 0.0 | 64240 | 26847 | 502 | MQTT_Publish |
| 1 | tcp | mqtt | 31.883584 | 9 | 5 | 3 | 3 | 0.282277 | 0.156821 | 0.439097 | ... | 0.0 | 29855280.0 | 29855280.0 | 29855280.0 | 29855280.0 | 0.0 | 64240 | 26847 | 502 | MQTT_Publish |
| 2 | tcp | mqtt | 32.124053 | 9 | 5 | 3 | 3 | 0.280164 | 0.155647 | 0.435811 | ... | 0.0 | 29842150.0 | 29842150.0 | 29842150.0 | 29842150.0 | 0.0 | 64240 | 26847 | 502 | MQTT_Publish |
| 3 | tcp | mqtt | 31.961063 | 9 | 5 | 3 | 3 | 0.281593 | 0.15644 | 0.438033 | ... | 0.0 | 29913770.0 | 29913770.0 | 29913770.0 | 29913770.0 | 0.0 | 64240 | 26847 | 502 | MQTT_Publish |
| 4 | tcp | mqtt | 31.902362 | 9 | 5 | 3 | 3 | 0.282111 | 0.156728 | 0.438839 | ... | 0.0 | 29814700.0 | 29814700.0 | 29814700.0 | 29814700.0 | 0.0 | 64240 | 26847 | 502 | MQTT_Publish |


### Categorical Features

In [24]:
proto_counts = data['proto'].value_counts()
service_counts = data['service'].value_counts()
attack_type_counts = data['Attack_type'].value_counts()

print('Proto Counts')
print('')
display(proto_counts)
print('Service Counts')
print('')
display(service_counts)
print('Attack Type Counts')
print('')
display(attack_type_counts)

Proto Counts



tcp     110427
udp      12633
icmp        57
Name: proto, dtype: int64

Service Counts



-         102861
dns         9753
mqtt        4132
http        3464
ssl         2663
ntp          121
dhcp          50
irc           43
ssh           28
radius         2
Name: service, dtype: int64

Attack Type Counts



DOS_SYN_Hping                 94659
Thing_Speak                    8108
ARP_poisioning                 7750
MQTT_Publish                   4146
NMAP_UDP_SCAN                  2590
NMAP_XMAS_TREE_SCAN            2010
NMAP_OS_DETECTION              2000
NMAP_TCP_scan                  1002
DDOS_Slowloris                  534
Wipro_bulb                      253
Metasploit_Brute_Force_SSH       37
NMAP_FIN_SCAN                    28
Name: Attack_type, dtype: int64

#### (a) proto
Meaning: Protocol used in the communication flow (e.g., TCP, UDP, ICMP)

Data Type: Categorical

Unique Values: 3 (tcp, udp, icmp)

Potential Importance: Likely relevant since different attack types may be protocol-specific.

Transformation Required: One-hot encoding, which converts this categorical variable into one binary indicator column per level.

#### (b) service

Meaning: Network service involved (e.g., HTTP, DNS, MQTT, SSL, NTP, DHCP, IRC, SSH, RADIUS, or unknown, which is recorded as "-")

Data Type: Categorical

Unique Values: 10 (http, dns, mqtt, ssl, ntp, dhcp, irc, ssh, radius, and "-" for unknown)

Potential Importance: This could be important as different attacks target different services.

Transformation Required: One-hot encoding, which converts this categorical variable into one binary indicator column per level.
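
The one-hot encoding described for proto and service can be sketched with `pandas.get_dummies`. The `sample` frame below is a small made-up stand-in for the real dataset (only the level names are taken from the counts above), so this is an illustration rather than the exact preprocessing used later.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the level names mirror those
# seen in the value counts above, the numeric values are made up.
sample = pd.DataFrame({
    "proto": ["tcp", "udp", "icmp", "tcp"],
    "service": ["mqtt", "dns", "-", "http"],
    "flow_duration": [32.0, 0.5, 0.0, 1.2],
})

# Replace each categorical column with one 0/1 indicator column per level.
encoded = pd.get_dummies(sample, columns=["proto", "service"], dtype=int)

print(encoded.columns.tolist())
```

Each row ends up with exactly one 1 among the proto columns and one among the service columns. Very rare levels (radius appears only twice above) produce indicator columns that are almost always zero, which may be worth keeping in mind.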

#### (c) Attack_type (target variable)

Meaning: Type of attack detected in the network flow.

Data Type: Categorical

Unique Values: 12 (e.g., DOS_SYN_Hping, Thing_Speak, ARP_poisioning, MQTT_Publish); see the counts above.

Potential Importance: Target variable. It's what we are trying to predict.

Transformation Required: We will likely need one-hot encoding or another encoding method (for example, label encoding) that captures the fact that the response variable is categorical, with no ordering or linear relationship between its levels.
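
As one possible way to encode the target, the `LabelEncoder` already imported in the first cell maps each class name to an integer code. The sketch below uses a few made-up rows with class names taken from the counts above; the integer codes are arbitrary identifiers and should not be read as an ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A handful of illustrative labels (class names taken from the counts above).
target = pd.Series(
    ["MQTT_Publish", "DOS_SYN_Hping", "Thing_Speak", "DOS_SYN_Hping"],
    name="Attack_type",
)

encoder = LabelEncoder()
y = encoder.fit_transform(target)        # integer code per class
decoded = encoder.inverse_transform(y)   # map codes back to class names

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

Many scikit-learn classifiers also accept string labels directly, so this step is mainly useful when a library expects numeric targets.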

### Numerical Features 

The mean and standard deviation of the numerical variables are displayed below. Note that these statistics are not meaningful for variables that are stored as numbers but act as identifiers with no linear relationship to the target, such as port numbers.

In [26]:


f = data.describe().T[['mean', 'std']]
display(f)

| | mean | std |
|---|---|---|
| flow_duration | 3.809566 | 130.0054 |
| fwd_pkts_tot | 2.268826 | 22.33656 |
| bwd_pkts_tot | 1.909509 | 33.01831 |
| fwd_data_pkts_tot | 1.471218 | 19.6352 |
| bwd_data_pkts_tot | 0.8202604 | 32.29395 |
| fwd_pkts_per_sec | 351806.3 | 370764.5 |
| bwd_pkts_per_sec | 351762.0 | 370801.5 |
| flow_pkts_per_sec | 703568.3 | 741563.4 |
| down_up_ratio | 0.8545706 | 0.3376403 |
| fwd_header_size_tot | 53.89238 | 393.0272 |


#### (a) Time-Based Features
* flow_duration
* fwd_iat.min, fwd_iat.max, fwd_iat.tot, fwd_iat.avg, fwd_iat.std
* bwd_iat.min, bwd_iat.max, bwd_iat.tot, bwd_iat.avg, bwd_iat.std
* flow_iat.min, flow_iat.max, flow_iat.tot, flow_iat.avg, flow_iat.std
* active.min, active.max, active.tot, active.avg, active.std
* idle.min, idle.max, idle.tot, idle.avg, idle.std

Description: These features describe packet interarrival times and active/idle durations, which help identify irregular behavior like botnets or DDoS attacks.

Datatype: Floats/Integers 

Importance: Significant variations in these variables can indicate network congestion, delays, and coordinated attacks.

Transformation: A log transform can reduce skewness, and normalization can put the features on a comparable scale for the models.
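
A minimal sketch of the log transform followed by standardization, assuming the time-based features are non-negative. The values in `time_features` are made up to mimic the heavy right skew suggested by the summary statistics; `log1p` is used instead of a plain log so that zeros are handled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up, heavily right-skewed values for two time-based features.
time_features = pd.DataFrame({
    "flow_duration": [0.0, 0.002, 0.5, 32.0, 1200.0],
    "idle.max": [0.0, 0.0, 1.5e4, 2.97e7, 3.0e7],
})

# log1p computes log(1 + x), which is defined at x = 0.
# Assumption: all values are non-negative.
logged = np.log1p(time_features)

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(logged), columns=time_features.columns
)

print(scaled.round(3))
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test-set statistics.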

#### (b) Packet Count & Flow Features
* fwd_pkts_tot, bwd_pkts_tot, flow_pkts_per_sec
* fwd_data_pkts_tot, bwd_data_pkts_tot
* fwd_subflow_pkts, bwd_subflow_pkts


Description: These indicate total packets per flow and per second, useful for detecting flood attacks or unusually high traffic patterns.

Datatype: Floats/Integers 

Importance: Abnormal packet volume often signifies DoS/DDoS attacks or excessive data transfer.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.
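
One way to judge whether outliers are "excessive" is to measure the share of values outside the usual 1.5 × IQR fences, together with the skewness. This is only a rule-of-thumb sketch on made-up packet counts; the 1.5 multiplier is a convention, not something prescribed by the dataset.

```python
import pandas as pd

# Made-up packet counts: mostly small flows plus one very large one.
pkts = pd.Series([1, 2, 2, 3, 2, 1, 2, 3, 2, 500], name="fwd_pkts_tot")

q1, q3 = pkts.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Fraction of observations outside the IQR fences.
outlier_share = ((pkts < lower) | (pkts > upper)).mean()

# Positive skewness indicates a long right tail.
skewness = pkts.skew()

print(f"outlier share: {outlier_share:.2f}, skewness: {skewness:.2f}")
```

A large outlier share or strongly positive skewness would support applying the log transform before normalization.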


#### (c) Packet Size & Header Features
* fwd_header_size_tot, fwd_header_size_min, fwd_header_size_max
* bwd_header_size_tot, bwd_header_size_min, bwd_header_size_max


Description: These features capture the header sizes of forward and backward packets, which can reveal anomalies in protocol behavior.

Datatype: Floats/Integers 

Importance: Header manipulations can indicate evasion techniques or protocol misuse.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.

#### (d) Payload Features
* fwd_pkts_payload.min, fwd_pkts_payload.max, fwd_pkts_payload.tot
* fwd_pkts_payload.avg, fwd_pkts_payload.std
* bwd_pkts_payload.min, bwd_pkts_payload.max, bwd_pkts_payload.tot
* bwd_pkts_payload.avg, bwd_pkts_payload.std
* flow_pkts_payload.min, flow_pkts_payload.max, flow_pkts_payload.tot
* flow_pkts_payload.avg, flow_pkts_payload.std
* payload_bytes_per_second


Description: These features measure packet payload sizes, which may reveal unusual data transmissions such as injection attacks or data exfiltration.

Datatype: Floats/Integers 

Importance: Payload-based attacks exploit the content of data packets to inject malicious code or commands. This lets attackers alter the normal flow of information and potentially steal data, disrupt operations, or gain unauthorized access to a system.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.

#### (e) Bulk Transfer Features
* fwd_bulk_bytes, bwd_bulk_bytes
* fwd_bulk_packets, bwd_bulk_packets
* fwd_bulk_rate, bwd_bulk_rate


Description: These features represent bulk data transfer rates, indicating large file transfers or exfiltration attempts.

Datatype: Floats/Integers 

Importance: Large transfers may indicate unauthorized data movement, especially if the data is being copied to new or unconventional locations.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.

#### (f) Flow Flag Counts (TCP/IP Behavior)
* flow_FIN_flag_count, flow_SYN_flag_count, flow_RST_flag_count
* fwd_PSH_flag_count, bwd_PSH_flag_count
* flow_ACK_flag_count
* fwd_URG_flag_count, bwd_URG_flag_count
* flow_CWR_flag_count, flow_ECE_flag_count


Description: Counts of the TCP flags used for connection establishment, teardown, and control, which help detect suspicious patterns like SYN floods.

Datatype: Integers 

Importance: When a TCP connection is established, the primary flag is SYN (synchronize), which is exchanged in a three-way handshake. By monitoring the count of SYN packets received within a short time frame, we can identify suspicious patterns such as a SYN flood, where an attacker overwhelms a server with connection initiation requests until it becomes unresponsive to legitimate traffic.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.
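
To illustrate how the SYN and ACK counts might be combined, the sketch below derives a hypothetical indicator, `syn_without_ack`, that marks flows where a SYN was seen but no ACK followed (a handshake that never completed). The column name and the rule are assumptions made for illustration, and the rows are made up; whether this separates DOS_SYN_Hping flows in the real data has not been checked here.

```python
import pandas as pd

# Made-up per-flow flag counts using the dataset's column names.
flows = pd.DataFrame({
    "flow_SYN_flag_count": [1, 2, 1, 0],
    "flow_ACK_flag_count": [0, 8, 0, 5],
})

# Hypothetical derived feature: 1 if the flow contains at least one SYN
# but no ACK at all, 0 otherwise.
flows["syn_without_ack"] = (
    (flows["flow_SYN_flag_count"] > 0) & (flows["flow_ACK_flag_count"] == 0)
).astype(int)

print(flows)
```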


#### (g) Window Sizes (TCP Flow Control)
* fwd_init_window_size, bwd_init_window_size
* fwd_last_window_size


Description: Window size measurements for TCP flow control, which can indicate congestion control anomalies.

Datatype: Integers 

Importance: Window sizes can be abused in several ways. Very small window sizes combined with SYN packets can be used to overwhelm a server's connection queue, window-size probing can reveal information about the operating system, and manipulating the window size itself can cause congestion.

Transformation: A log transform (if outliers are excessive) can reduce skewness, and normalization can put the features on a comparable scale for the models.


#### (h) Port Numbers (Source & Destination)
* id.orig_p (Originating port)
* id.resp_p (Responding port)


Description: The originating and responding port numbers in network flows, commonly targeted in attacks (e.g., SSH-22, HTTP-80).

Datatype: Integers (identifiers, not linearly related quantities)

Importance: Malicious activity often involves very specific ports.

Transformation: Bin encoding that groups common versus uncommon ports may be the best approach, since that distinction is what we are really trying to capture from the data. Note that the loading cell above drops both port columns, so they would need to be retained to apply this encoding.
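
One possible binning scheme, sketched below, uses the standard IANA port ranges: well-known (0 to 1023), registered (1024 to 49151), and dynamic (49152 to 65535). This is just one choice of bins, and it assumes the port columns are kept rather than dropped; the port values shown are made up.

```python
import pandas as pd

# Made-up responding ports; assumes the id.resp_p column was retained.
ports = pd.Series([22, 80, 1883, 8080, 51234, 0], name="id.resp_p")

# Bin edges follow the IANA ranges; pd.cut intervals are right-inclusive,
# so -1 is used as the lower edge to include port 0.
bins = [-1, 1023, 49151, 65535]
labels = ["well_known", "registered", "dynamic"]

port_bin = pd.cut(ports, bins=bins, labels=labels)

print(pd.concat([ports, port_bin.rename("port_bin")], axis=1))
```

The resulting categorical column could then be one-hot encoded in the same way as proto and service.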
