In [1]:
import pandas as pd 
import numpy as np 


### Here is a complete overview of the dataset: its features, the type of problem we are going to solve, the columns we will use as independent and dependent variables, and the details of each.

In [2]:
data = pd.read_csv('cybersecurity_attacks.csv')
data.head(5)

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,...,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source
0,2023-05-30 06:33:58,103.216.15.12,84.9.164.252,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,...,Logged,Low,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server
1,2020-08-26 07:08:30,78.199.217.198,66.191.137.154,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,...,Blocked,Low,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall
2,2022-11-13 08:23:25,63.79.210.48,198.219.82.17,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,...,Ignored,Low,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall
3,2023-07-02 10:38:46,163.42.196.10,101.228.192.255,20018,32534,UDP,385,Data,HTTP,Totam maxime beatae expedita explicabo porro l...,...,Blocked,Medium,Fateh Kibe,Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...,Segment B,"Jaunpur, Rajasthan",,,Alert Data,Firewall
4,2023-07-16 13:11:07,71.166.185.76,189.243.174.238,6131,26646,TCP,1462,Data,DNS,Odit nesciunt dolorem nisi iste iusto. Animi v...,...,Blocked,Low,Dhanush Chad,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,Segment C,"Anantapur, Tripura",149.6.110.119,,Alert Data,Firewall


In [3]:
## Size of the data: the number of samples and the number of features
print(f"The shape of the data is {data.shape}, the number of samples in the dataset is {data.shape[0]}, and the number of features are {data.shape[1]}.")

The shape of the data is (40000, 25), the number of samples in the dataset is 40000, and the number of features are 25.


### Let's define each feature in the dataset and see what it tells us about the data.

### 1. Malware Indicators: This column has two values: 'IoC Detected', meaning the record is malicious or contains malware, and NaN, which we treat as benign.

### About Malware Indicators: Malware indicators are signals that a given file or event may be malicious. In cybersecurity they are also called IoCs (Indicators of Compromise).
### In this dataset, the model will try to learn from the other features to predict whether a given record is malicious or not.
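### A minimal sketch of the target encoding on a toy series, assuming (as this notebook does) that a missing value means benign. The sample values are illustrative, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Malware Indicators' column (illustrative values only).
indicators = pd.Series(["IoC Detected", np.nan, "IoC Detected", np.nan])

# Missing values are treated as benign, then both labels are mapped to 1/0.
target = indicators.fillna("No IoC").map({"IoC Detected": 1, "No IoC": 0})

print(target.tolist())
# Class balance is worth checking before training a classifier.
print(target.value_counts().to_dict())
```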

In [4]:
data['Malware Indicators'] = data['Malware Indicators'].fillna('No IoC') ## Fill NaN with 'No IoC' so that every row has an explicit label
data['Malware Indicators'] = data['Malware Indicators'].map({'IoC Detected':1, 'No IoC': 0}) ## Encode the column as a binary target: 1 = malicious, 0 = benign


## Features that we will be using !

---

### 1. Timestamp: The time when the network packet was sent or received. Unusual timing can hint that a given packet is malicious, so the timestamp is one of the most informative columns in the dataset.

### To understand the effect of the timestamp on detecting malicious traffic, we will decompose it into the month, the hour of the day, the day of the week, and whether the day falls on a weekend.

In [5]:
## The first step is to convert the column to datetime, because it is loaded as the generic object dtype
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
data['Month_of_received'] = data['Timestamp'].dt.month
data['hour_of_the_day'] = data['Timestamp'].dt.hour
data['day_of_the_week'] = data['Timestamp'].dt.weekday
data['is_weekend'] = data['day_of_the_week'].isin([5, 6]).astype(int) ## If the given date is weekend or not 
data = data.drop(columns=['Timestamp'])

### Here we applied feature decomposition to the 'Timestamp' column. The raw column was stored as text, which a model cannot use directly. We extracted four columns that will help us judge whether traffic at a given date and time looks suspicious.
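### As an illustration, the same decomposition can be traced on two timestamps taken from the preview above. The frame name `parts` is only used for this sketch.

```python
import pandas as pd

# Two timestamps copied from the data.head() preview.
ts = pd.to_datetime(pd.Series(["2023-05-30 06:33:58", "2023-07-16 13:11:07"]))

parts = pd.DataFrame({
    "month": ts.dt.month,
    "hour": ts.dt.hour,
    "weekday": ts.dt.weekday,  # pandas convention: Monday=0 ... Sunday=6
})
# Saturday (5) and Sunday (6) count as the weekend.
parts["is_weekend"] = parts["weekday"].isin([5, 6]).astype(int)

print(parts)
```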

---

### Feature Number 2: Source IP and Destination IP Address:

### The Source IP is the IP address of the sender, and the Destination IP is the address of the receiver. Every network exchange has a sender and a receiver.

### We will split each IP address into its four octets. For example, 103.216.15.12 is not a single number but four octets, each in the range 0-255 and stored in 8 bits. So 103.216.15.12 can be written as 103|216|15|12, where each of the four parts is one octet.

### As a rough simplification, the parts can be read as: Network -> Sub Network -> Device Group -> Specific Device

### For example, 103.216.15.12 and 103.216.15.99 are two different IP addresses, but they differ only in the last octet (the device); the network and sub network parts are the same.

### Why does this matter in Cyber Security?

### Attackers often use different devices on the same network, so addresses that share their leading octets may be related. A machine learning model can learn such patterns from the octets when evaluating whether traffic is malicious.

In [6]:
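def splitting(ip):
    """Split a dotted IPv4 string into four integer octets; return zeros if it is malformed."""
    try:
        octets = list(map(int, ip.split('.')))
    except (AttributeError, ValueError):  # non-string input or non-numeric part
        return [0, 0, 0, 0]
    return octets if len(octets) == 4 else [0, 0, 0, 0]

data[['src_ip_1', 'src_ip_2', 'src_ip_3', 'src_ip_4']] = data['Source IP Address'].apply(lambda x: pd.Series(splitting(x)))
data[['dest_ip_1', 'dest_ip_2', 'dest_ip_3', 'dest_ip_4']] = data['Destination IP Address'].apply(lambda x: pd.Series(splitting(x)))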

data = data.drop(columns=['Destination IP Address'])

### Here we have created four columns each for the source and the destination address. Columns 1-4 roughly represent: Network, Sub-Network, Device Group, Specific Device.
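### A small sketch of the octet idea on the two example addresses from the text. It uses the vectorized `str.split(expand=True)` instead of a row-wise `apply`, which is an alternative to the approach in this notebook and assumes every value is a well-formed IPv4 string.

```python
import pandas as pd

ips = pd.Series(["103.216.15.12", "103.216.15.99"])

# One column per octet, converted from text to integers.
octets = ips.str.split(".", expand=True).astype(int)
octets.columns = ["ip_1", "ip_2", "ip_3", "ip_4"]

# The two addresses share the first three octets and differ only in the last.
same_prefix = bool((octets.iloc[0, :3] == octets.iloc[1, :3]).all())

print(octets)
print("same first three octets:", same_prefix)
```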

--- 
### Private IP Address vs Public IP Address:

### Whether an address is public or private can be a useful signal when judging if traffic is malicious.

### Public IP address: the address that is visible on the internet, assigned by the Internet Service Provider.
### Private IP address: the address used inside your local network, assigned by the router and hidden from the internet.

### A private IP address is like a room number: two different buildings can both have a room 12, but within one building each room number is unique. A public IP address is like the street address of the building: it is unique, and anyone can deliver mail to it. Public Wi-Fi works the same way: every connected device gets a private address from the router, and the translation between private and public addresses is done by NAT (Network Address Translation).

### The private IP address ranges are 10.x.x.x, 172.16.x.x through 172.31.x.x, and 192.168.x.x. Addresses outside these ranges are treated as public here.

### So if the first octet is 10, the address is private; if the first octet is 172 and the second is between 16 and 31, it is private; and if the first octet is 192 and the second is 168, it is also private. Everything else is public.
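### As a cross-check on the rule above, the three ranges can be written as CIDR blocks and tested with Python's standard `ipaddress` module. The helper name `in_private_range` is hypothetical, and returning 0 for malformed input is an assumption of this sketch.

```python
import ipaddress

# The three private IPv4 blocks described above, in CIDR notation.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.x.x through 172.31.x.x
    ipaddress.ip_network("192.168.0.0/16"),
]

def in_private_range(ip):
    """Return 1 if ip falls in one of the private blocks, else 0."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return 0  # assumption: malformed input is treated as not private
    return int(any(addr in net for net in PRIVATE_NETWORKS))

for sample in ["10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.1.1", "103.216.15.12"]:
    print(sample, in_private_range(sample))
```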



In [7]:
## Function to check whether a given IPv4 address is private (1) or public (0)
def is_private(ip):
    first, second, _, _ = ip.split('.')
    first = int(first)
    second = int(second)
    if first == 10:  # 10.0.0.0 - 10.255.255.255
        return 1
    elif first == 172 and 16 <= second <= 31:  # 172.16.0.0 - 172.31.255.255
        return 1
    elif first == 192 and second == 168:  # 192.168.0.0 - 192.168.255.255
        return 1
    else:
        return 0

data['Is_Private'] = data['Source IP Address'].apply(is_private)

data = data.drop(columns=['Source IP Address'])



---

### Ports: Ports are like the doors of a computer system. Each port has a unique number and is typically associated with a specific kind of traffic.

### For example, port 80 is usually used for web traffic (HTTP) and port 443 for secure web traffic (HTTPS).

### The source port is usually a random port chosen by the sender to send the file or request.
### The destination port is chosen based on the type of request. File transfers over FTP usually go to port 21, and web browsing usually goes to port 80 or port 443.

### A port on its own does not tell us whether a sender is legitimate or malicious, but patterns in port usage can.
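### One way such port patterns could be turned into features is sketched below. The lookup table covers only the ports mentioned above, and the column names `dest_service` and `dest_is_well_known` are hypothetical; the notebook itself keeps the raw port numbers.

```python
import pandas as pd

# Hypothetical lookup limited to the ports named in the text.
KNOWN_SERVICES = {21: "ftp", 80: "http", 443: "https"}

ports = pd.DataFrame({"Destination Port": [80, 443, 21, 17616]})

# Label the service where it is known, otherwise fall back to "other".
ports["dest_service"] = ports["Destination Port"].map(KNOWN_SERVICES).fillna("other")

# Ports 0-1023 are the IANA "well-known" range.
ports["dest_is_well_known"] = (ports["Destination Port"] < 1024).astype(int)

print(ports)
```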
---



### Protocol:
### A protocol is a set of rules for communicating over the internet, much like etiquette in a conversation: which language to use, in what order to speak, and what to do when someone does not reply. For computers, a protocol defines how to send data, how to receive it, and what to do if something goes wrong.


### Types of Protocol:

### 1. TCP (Transmission Control Protocol):
### TCP is a reliable, ordered protocol that is slower but safer. Web browsing, file transfer, and email are a few of the areas where TCP is used.

### 2. UDP (User Datagram Protocol):
### UDP is a faster way of transferring data, but it offers no delivery guarantees and no confirmation. UDP is used for gaming, streaming, and similar applications.

### 3. ICMP (Internet Control Message Protocol):
### ICMP is used for network health checks. It is not meant to carry application data; tools such as 'ping' use it to test whether a host is reachable and how long a round trip takes.
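### A minimal sketch of how one-hot encoding turns the three protocol values into indicator columns, shown on a toy frame rather than the real data.

```python
import pandas as pd

toy = pd.DataFrame({"Protocol": ["ICMP", "UDP", "TCP", "UDP"]})

# Each distinct protocol becomes its own 0/1 column; the original column is removed.
encoded = pd.get_dummies(toy, columns=["Protocol"], prefix="protocol", dtype=int)

print(encoded)
```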


In [8]:
data = pd.get_dummies(data, columns=['Protocol'], prefix='protocol', dtype = int)

### Packets, Their Types, and Packet Length

### Packet: A packet is a small unit of data transmitted across the network. Large files and communications are broken down into multiple packets to ensure reliable delivery. Packet length is the size of a packet and gives insight into network behaviour. Abnormal packet sizes or unusual packet patterns can be signs of malicious activity.

### Extremely small packets repeated thousands of times, or abnormally large packets sent to uncommon ports, are examples of how packets can indicate malicious traffic.

### The types of packets are Data packets (which carry content) and Control packets (which manage the connection):


In [9]:
data['Packet Type'] = data['Packet Type'].map({"Data":1, "Control":0})


---
### Traffic Type:

### Traffic Type is the kind of application traffic flowing over the network.
### HTTP -> web browsing traffic, DNS -> domain name lookups, FTP -> file transfer.
### Even if an individual packet looks normal, the traffic type combined with its pattern can be a strong signal of malicious behaviour.
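### As a rough illustration, the share of malicious records per traffic type can be compared with a group-by. The toy values below are made up for the sketch and say nothing about the real dataset.

```python
import pandas as pd

# Made-up toy records: traffic type and the binary target.
toy = pd.DataFrame({
    "Traffic Type": ["HTTP", "HTTP", "DNS", "DNS", "FTP", "FTP"],
    "Malware Indicators": [1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows flagged as malicious.
malicious_rate = toy.groupby("Traffic Type")["Malware Indicators"].mean()

print(malicious_rate)
```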



In [10]:
data = pd.get_dummies(data, columns = ['Traffic Type'], prefix = "traffic", dtype = int)


### Packet type and packet length, and how they relate to malicious traffic

### The packet type (control or data) tells us whether a packet carries content or manages the connection. Control packets sent to a large number of ports, for example, can indicate malicious activity.

### Similarly, the size of a packet matters: an attacker might send very small packets in order to avoid detection, whereas a large packet may look legitimate and still be malicious.
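### One possible way to mark unusually small or large packets is to compare each length against percentiles of the column. The 10th and 90th percentile cut-offs and the flag names below are assumptions for this sketch, not values used by the notebook.

```python
import pandas as pd

# Toy packet lengths in bytes (illustrative values).
lengths = pd.Series([64, 503, 1174, 306, 385, 1462, 70, 1500], name="Packet Length")

# Hypothetical cut-offs: bottom and top 10 percent of observed lengths.
low, high = lengths.quantile([0.10, 0.90])

flags = pd.DataFrame({
    "Packet Length": lengths,
    "is_small": (lengths < low).astype(int),
    "is_large": (lengths > high).astype(int),
})

print(flags)
```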

---

### Payload data
### Payload data is the actual content carried inside the packet.

### Anomaly Scores:
### An anomaly score is a numerical value representing how unusual a network packet is.
### The higher the score, the more unusual the packet and the more likely it is to be malicious.

### Alerts/Warnings

### These are the notifications generated by security systems such as firewalls, IDS/IPS, or antivirus software whenever they detect suspicious or potentially harmful activity.



In [11]:
data['Alerts/Warnings'] = data['Alerts/Warnings'].fillna('Safe') ## A missing value means no alert was raised
data['Alerts/Warnings'] = data['Alerts/Warnings'].map({'Alert Triggered': 0, 'Safe':1}) ## Note the encoding: 0 = alert triggered, 1 = safe

---
### Attack Type

### Attack Type describes what kind of attack a malicious (IoC detected) record might be.

### Common examples of attack types include:
### 1. Malware: Software intentionally designed to harm or exploit a system
### 2. DDoS: An attack that floods a server with massive amounts of traffic
### 3. Intrusion: When someone gains unauthorized access to a computer

In [12]:
data = pd.get_dummies(data, columns=['Attack Type'], prefix='Attack_Type', dtype=int)


### Attack Signature:
### An attack signature is a specific pattern that matches a previously identified attack (for example Malware, Intrusion, or a DDoS attack).


In [13]:
data['Attack Signature'] = data['Attack Signature'].map({'Known Pattern A':1, 'Known Pattern B': 0})

### Severity Level
### It indicates how critical the detected attack is.
### Attacks are grouped by how much damage they could cause, how likely they are to succeed, and whether they could affect multiple devices.


In [14]:
data = pd.get_dummies(data,columns=['Severity Level'], prefix= 'Severity', dtype = int)


### Network Segment:

### A network segment is a subdivision of a larger computer network. It groups devices that are physically or logically connected to each other.

In [15]:
data = pd.get_dummies(data,columns=['Network Segment'], prefix = 'Segments', dtype = int)

In [16]:
### Columns that will not be used as features

data = data.drop(columns=['Payload Data', 'Action Taken', 'User Information', 'Device Information', 'Geo-location Data', 'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'])


### This is the final dataset, containing all the columns we will use, along with a detailed explanation of what each column is and what information it gives the model.
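### Before saving, it can help to confirm that every remaining column is numeric and has no missing values. The helper `summarize_readiness` below is a hypothetical sketch, shown on a toy frame rather than on the real data.

```python
import numpy as np
import pandas as pd

def summarize_readiness(df):
    """Return the columns that are non-numeric or still contain missing values."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "is_numeric": [pd.api.types.is_numeric_dtype(t) for t in df.dtypes],
    })
    return report[(report["missing"] > 0) | (~report["is_numeric"])]

# Toy frame: one clean column, one with a missing value, one still stored as text.
toy = pd.DataFrame({
    "Packet Length": [503, 1174],
    "Attack Signature": [1.0, np.nan],
    "Payload Data": ["a", "b"],
})

problems = summarize_readiness(toy)
print(problems)
```

### An empty result would suggest the frame is ready for a model that expects purely numeric input.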


In [17]:
data.to_csv('../Datasets/required.csv', index=False)  ## index=False avoids writing the row index as an extra unnamed column