In [45]:
import pandas as pd 
import numpy as np 


### This notebook gives an overview of the dataset: its features, the type of problem we are going to solve, and the columns we will use as independent and dependent variables, with details on each.

In [46]:
data = pd.read_csv('cybersecurity_attacks.csv')
data.head(5)

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,...,Action Taken,Severity Level,User Information,Device Information,Network Segment,Geo-location Data,Proxy Information,Firewall Logs,IDS/IPS Alerts,Log Source
0,2023-05-30 06:33:58,103.216.15.12,84.9.164.252,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,...,Logged,Low,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment A,"Jamshedpur, Sikkim",150.9.97.135,Log Data,,Server
1,2020-08-26 07:08:30,78.199.217.198,66.191.137.154,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,...,Blocked,Low,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,Segment B,"Bilaspur, Nagaland",,Log Data,,Firewall
2,2022-11-13 08:23:25,63.79.210.48,198.219.82.17,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,...,Ignored,Low,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,Segment C,"Bokaro, Rajasthan",114.133.48.179,Log Data,Alert Data,Firewall
3,2023-07-02 10:38:46,163.42.196.10,101.228.192.255,20018,32534,UDP,385,Data,HTTP,Totam maxime beatae expedita explicabo porro l...,...,Blocked,Medium,Fateh Kibe,Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...,Segment B,"Jaunpur, Rajasthan",,,Alert Data,Firewall
4,2023-07-16 13:11:07,71.166.185.76,189.243.174.238,6131,26646,TCP,1462,Data,DNS,Odit nesciunt dolorem nisi iste iusto. Animi v...,...,Blocked,Low,Dhanush Chad,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,Segment C,"Anantapur, Tripura",149.6.110.119,,Alert Data,Firewall


In [47]:
## Size of the data: the number of samples and the number of features
print(f"The shape of the data is {data.shape}, the number of samples in the dataset is {data.shape[0]}, and the number of features are {data.shape[1]}.")

The shape of the data is (40000, 25), the number of samples in the dataset is 40000, and the number of features are 25.


### Let's define each feature in the dataset and see what it tells us about the data.

### 1. Malware Indicators: This column has two values: `IoC Detected`, meaning malicious (malware present), and NaN, which we treat as benign.

### About Malware Indicators: Malware indicators are signals that something may be malicious. In cybersecurity they are also called IoCs (Indicators of Compromise).
### In this dataset, the model will try to learn from the other features to predict whether a given record is malicious or not.
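
### As a minimal sketch of this encoding, the toy column below stands in for `data['Malware Indicators']` (the four sample values are illustrative, not taken from the dataset). It maps `IoC Detected` to 1 and missing values to 0 in a single step.

```python
import numpy as np
import pandas as pd

# Toy column standing in for data['Malware Indicators'] (illustrative values)
sample = pd.Series(['IoC Detected', np.nan, 'IoC Detected', np.nan])

# Comparing against the label gives False for NaN, so missing values become 0
encoded = sample.eq('IoC Detected').astype(int)
print(encoded.tolist())
```

### Like the fill-then-map approach, this should be run only once on the raw column, because a second run would no longer find the original string labels.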

In [48]:
## Fill the NaN values with an explicit 'No IoC' label so every row has a readable class
data['Malware Indicators'] = data['Malware Indicators'].fillna('No IoC')
## Encode the two labels as 1 / 0 so the column becomes a binary classification target
data['Malware Indicators'] = data['Malware Indicators'].map({'IoC Detected': 1, 'No IoC': 0})


## Features that we will be using

---

### 1. Timestamp: The time when the network packet was sent or received. Unusual timing can be a hint that a packet is malicious, which makes the timestamp one of the most important features in the dataset.

### To understand how the timestamp relates to malicious activity, we will extract the month, the hour of the day, and the day of the week from it, and also flag whether the day falls on a weekend.

In [49]:
## The column is stored as object (string) dtype, so the first step is to convert it to datetime
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
data['Month_of_received'] = data['Timestamp'].dt.month
data['hour_of_the_day'] = data['Timestamp'].dt.hour
data['day_of_the_week'] = data['Timestamp'].dt.weekday  ## Monday = 0, Sunday = 6
data['is_weekend'] = data['day_of_the_week'].isin([5, 6]).astype(int)  ## 1 for Saturday or Sunday, else 0
data = data.drop(columns=['Timestamp'])

### Here we applied feature decomposition to the `Timestamp` column. The raw column was stored as a string, which a model cannot use directly. We extracted four columns from it (month, hour of the day, day of the week, and a weekend flag) that will help us judge whether activity at a given date and time looks suspicious.
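
### The decomposition can be checked on a small example. The sketch below uses two timestamps from the first rows of the dataset and builds the same four columns in a separate `features` frame (a hypothetical name used only for this illustration).

```python
import pandas as pd

# Two timestamps copied from the first rows of the dataset
sample = pd.DataFrame({'Timestamp': ['2023-05-30 06:33:58', '2023-07-02 10:38:46']})
ts = pd.to_datetime(sample['Timestamp'])

features = pd.DataFrame({
    'Month_of_received': ts.dt.month,
    'hour_of_the_day': ts.dt.hour,
    'day_of_the_week': ts.dt.weekday,  # Monday = 0, Sunday = 6
})
# Saturday (5) and Sunday (6) count as the weekend
features['is_weekend'] = features['day_of_the_week'].isin([5, 6]).astype(int)
print(features)
```

### 2023-05-30 is a Tuesday and 2023-07-02 is a Sunday, so only the second row is flagged as a weekend.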

---

### 2. Source IP and Destination IP Address:

### The Source IP is the IP address of the sender, and the Destination IP is the IP address of the receiver. Every network exchange has a sender and a receiver.

### We will split each IP address into its four octets. For example, 103.216.15.12 is not a single number but four octets, each an 8-bit value in the range 0-255. So 103.216.15.12 can be written as 103|216|15|12, where each of the four parts is one octet.

### As a rough simplification, the parts can be read from left to right as: Network -> Sub Network -> Device Group -> Specific Device

### For example, 103.216.15.12 and 103.216.15.99 are two different IP addresses, but they differ only in the last octet (the device). The network and sub network parts are the same.

### Why does this matter in cybersecurity?

### Attackers may use different devices on the same network to launch attacks. With the octets as separate features, the machine learning model can learn patterns in the IP addresses and use them when evaluating whether a record is malicious.

In [50]:
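def splitting(ip):
    """Split a dotted IPv4 string into four integer octets; return zeros if it cannot be parsed."""
    try:
        octets = list(map(int, ip.split('.')))
    except (AttributeError, ValueError):
        # Non-string values (e.g. NaN) or non-numeric parts
        return [0, 0, 0, 0]
    # Guard against addresses that do not have exactly four parts
    return octets if len(octets) == 4 else [0, 0, 0, 0]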
    
data[['src_ip_1', 'src_ip_2', 'src_ip_3', 'src_ip_4']] = data['Source IP Address'].apply(lambda x: pd.Series(splitting(x)))
data[['dest_ip_1', 'dest_ip_2', 'dest_ip_3', 'dest_ip_4']] = data['Destination IP Address'].apply(lambda x: pd.Series(splitting(x)))

data = data.drop(columns=['Destination IP Address'])

### Here we have created four columns each for the source and the destination address. Columns 1 to 4 hold the four octets, which we read as: Network, Sub Network, Device Group, Specific Device.
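
### Applying a Python function row by row can be slow on 40,000 rows. As an alternative sketch, the octets can be split with pandas string methods. The helper name `split_octets` and the sample addresses are assumptions made for this illustration. Rows that cannot be parsed are set to all zeros.

```python
import pandas as pd

def split_octets(ips: pd.Series, prefix: str) -> pd.DataFrame:
    """Split dotted IPv4 strings into four integer columns named <prefix>_1..<prefix>_4."""
    parts = ips.str.split('.', expand=True)
    # Keep exactly four columns, even if some rows have fewer or more parts
    parts = parts.reindex(columns=range(4))
    # Non-numeric or missing parts become NaN
    parts = parts.apply(pd.to_numeric, errors='coerce')
    # Any row with an unparseable part is replaced by zeros
    bad_rows = parts.isna().any(axis=1)
    parts.loc[bad_rows] = 0
    parts = parts.astype(int)
    parts.columns = [f'{prefix}_{i}' for i in range(1, 5)]
    return parts

sample = pd.Series(['103.216.15.12', 'bad-ip', '10.0.0.1'])
octets = split_octets(sample, 'src_ip')
print(octets)
```

### The result can be attached to the main frame with `pd.concat([data, octets], axis=1)`, since the index of the input Series is preserved.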

--- 
### Private IP Address vs Public IP Address:

### Whether an IP address is public or private can be a useful signal when deciding if a record is malicious.

### Public IP address: The address that is visible on the internet, assigned by the Internet Service Provider.
### Private IP address: The address inside your local network, assigned by the router and hidden from the internet.

### A private IP address is like a room number: the same number can exist in two different buildings, but within one building it is unique. A public IP address is like a house number: it is unique, and anyone can deliver mail to it. Think of public Wi-Fi: the router assigns a private address to every connected device, and the translation between private and public addresses is done by NAT (Network Address Translation). Billions of devices use private addresses assigned this way.

### The private IP address ranges are 10.x.x.x, 172.16.x.x through 172.31.x.x, and 192.168.x.x. We treat every address outside these ranges as public.

### So if the first octet is 10, the address is private. If the first octet is 172 and the second octet is between 16 and 31 (inclusive), it is private. If the first octet is 192 and the second octet is 168, it is also private. Everything else is public.
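
### The three ranges above can also be checked with Python's standard `ipaddress` module. This is a minimal sketch: the function name `is_private_rfc1918` is an assumption for this illustration, and unparseable values are treated as public (0).

```python
import ipaddress

# The three private ranges described above, in CIDR notation
PRIVATE_NETWORKS = [
    ipaddress.ip_network('10.0.0.0/8'),      # 10.x.x.x
    ipaddress.ip_network('172.16.0.0/12'),   # 172.16.x.x through 172.31.x.x
    ipaddress.ip_network('192.168.0.0/16'),  # 192.168.x.x
]

def is_private_rfc1918(ip) -> int:
    """Return 1 if ip falls in one of the private ranges, else 0."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP address (assumption: count it as public)
        return 0
    return int(any(addr in net for net in PRIVATE_NETWORKS))

for ip in ['10.1.2.3', '172.16.0.1', '172.32.0.1', '192.168.1.1', '103.216.15.12']:
    print(ip, is_private_rfc1918(ip))
```

### Explicit networks are used here instead of the `is_private` attribute of the address object, because that attribute also covers other reserved ranges such as loopback and link-local addresses.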



In [None]:
## Function to check whether the given IP address is private (1) or public (0)
def is_private(ip):
    first, second, _, _ = ip.split('.')
    first = int(first)
    second = int(second)
    if first == 10:
        return 1
    elif first == 172 and 16 <= second <= 31:
        return 1
    elif first == 192 and second == 168:
        return 1
    else:
        return 0

data['Is_Private'] = data['Source IP Address'].apply(is_private)

data = data.drop(columns=['Source IP Address'])



Unnamed: 0,Source Port,Destination Port,Protocol,Packet Length,Packet Type,Traffic Type,Payload Data,Malware Indicators,Anomaly Scores,Alerts/Warnings,...,is_weekend,src_ip_1,src_ip_2,src_ip_3,src_ip_4,dest_ip_1,dest_ip_2,dest_ip_3,dest_ip_4,Is_Private
0,31225,17616,ICMP,503,Data,HTTP,Qui natus odio asperiores nam. Optio nobis ius...,1,28.67,,...,0,103,216,15,12,84,9,164,252,0
1,17245,48166,ICMP,1174,Data,HTTP,Aperiam quos modi officiis veritatis rem. Omni...,1,51.5,,...,0,78,199,217,198,66,191,137,154,0
2,16811,53600,UDP,306,Control,HTTP,Perferendis sapiente vitae soluta. Hic delectu...,1,87.42,Alert Triggered,...,1,63,79,210,48,198,219,82,17,0
3,20018,32534,UDP,385,Data,HTTP,Totam maxime beatae expedita explicabo porro l...,0,15.79,Alert Triggered,...,1,163,42,196,10,101,228,192,255,0
4,6131,26646,TCP,1462,Data,DNS,Odit nesciunt dolorem nisi iste iusto. Animi v...,0,0.52,Alert Triggered,...,1,71,166,185,76,189,243,174,238,0
5,17430,52805,UDP,1423,Data,HTTP,Repellat quas illum harum fugit incidunt exerc...,0,5.76,,...,0,198,102,5,160,147,190,155,133,0
6,26562,17416,TCP,379,Data,DNS,Qui numquam inventore repellat ratione fugit o...,0,31.55,,...,0,97,253,103,59,77,16,101,53,0
7,34489,20396,ICMP,1022,Data,DNS,Amet libero optio quidem praesentium libero. E...,1,54.05,Alert Triggered,...,1,11,48,99,245,178,157,14,116,0
8,56296,20857,TCP,1281,Control,FTP,Veritatis nihil amet atque molestias aperiam m...,1,56.34,Alert Triggered,...,0,49,32,208,167,72,202,237,9,0
9,37918,50039,UDP,224,Data,HTTP,Consequatur ipsum autem reprehenderit quae. Do...,0,16.51,Alert Triggered,...,1,114,109,149,113,160,88,194,172,0
