# Analysis of DDoS Attacks & Providing a Secure Network Using ML for VANET

**Author Name**: Saqib Shah\
**Email**: fa22mscs0033@maju.edu.pk\
**GitHub**: [xaqibshah](https://www.github.com/xaqibshah)


This data was collected from the [CICDDoS2019 dataset page](https://www.unb.ca/cic/datasets/ddos-2019.html).


# The dataset collected from the source has the following description:
## About Dataset
Distributed Denial of Service (DDoS) attack is a menace to network security that aims at exhausting the target networks with malicious traffic. Although many statistical methods have been designed for DDoS attack detection, designing a real-time detector with low computational overhead is still one of the main concerns. On the other hand, the evaluation of new detection algorithms and techniques heavily relies on the existence of well-designed datasets.

In this paper, we first review the existing datasets comprehensively and propose a new taxonomy for DDoS attacks. Secondly, we generate a new dataset, namely CICDDoS2019, which remedies all current shortcomings. Thirdly, using the generated dataset, we propose a new detection and family classification approach based on a set of network flow features. Finally, we provide the most important feature sets to detect different types of DDoS attacks with their corresponding weights.

**Introduction**

There are a number of survey studies that have proposed taxonomies with respect to DDoS attacks. Although all have done a commendable job in proposing new taxonomies, the scope of attacks has so far been limited. There is a need to identify new attacks and come up with new taxonomies. Hence, we have analyzed new attacks that can be carried out using TCP/UDP-based protocols at the application layer and proposed a new taxonomy. The rest of this subsection explains the detailed taxonomy of DDoS attacks, illustrated in Figure 1, in terms of reflection-based and exploitation-based attacks.

Reflection-based DDoS: These are attacks in which the identity of the attacker remains hidden by utilizing a legitimate third-party component. Attackers send packets to reflector servers with the source IP address set to the target victim's IP address, so that the victim is overwhelmed with response packets. These attacks can be carried out through application layer protocols using transport layer protocols, i.e., Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or a combination of both. As Figure 1 shows, in this category, TCP-based attacks include MSSQL and SSDP, whereas UDP-based attacks include CharGen, NTP and TFTP. Certain attacks can be carried out using either TCP or UDP, such as DNS, LDAP, NETBIOS and SNMP.

Exploitation-based attacks: These are attacks in which the identity of the attacker remains hidden by utilizing a legitimate third-party component. The packets are sent to reflector servers by attackers with the source IP address set to the target victim's IP address to overwhelm the victim with response packets. These attacks can also be carried out through application layer protocols using transport layer protocols, i.e., TCP and UDP. TCP-based exploitation attacks include SYN flood, and UDP-based attacks include UDP flood and UDP-Lag. A UDP flood attack is initiated on the remote host by sending a large number of UDP packets.

These UDP packets are sent to random ports on the target machine at a very high rate. As a result, the available bandwidth of the network is exhausted, the system crashes, and performance degrades. The SYN flood, on the other hand, consumes server resources by exploiting the TCP three-way handshake. This attack is initiated by sending repeated SYN packets to the target machine until the server crashes or malfunctions. The UDP-Lag attack disrupts the connection between the client and the server. It is mostly used in online gaming, where players want to slow down or interrupt the movement of other players to outmaneuver them. This attack can be carried out in two ways: using a hardware switch known as a lag switch, or using a software program that runs on the network and hogs the bandwidth of other users.
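
The taxonomy described above can be written down as a small lookup table. The sketch below is only an illustration: the dictionary layout and the `categorize` helper are hypothetical names introduced here, and the entries are copied from the description rather than from any official schema.

```python
# Taxonomy as listed in the dataset description; layout and names are illustrative.
DDOS_TAXONOMY = {
    "reflection": {
        "TCP": ["MSSQL", "SSDP"],
        "UDP": ["CharGen", "NTP", "TFTP"],
        "TCP/UDP": ["DNS", "LDAP", "NETBIOS", "SNMP"],
    },
    "exploitation": {
        "TCP": ["SYN flood"],
        "UDP": ["UDP flood", "UDP-Lag"],
    },
}


def categorize(attack_name):
    """Return (category, transport) for an attack name, or None if unknown."""
    wanted = attack_name.lower()
    for category, transports in DDOS_TAXONOMY.items():
        for transport, attacks in transports.items():
            if wanted in (a.lower() for a in attacks):
                return category, transport
    return None


print(categorize("NTP"))
print(categorize("UDP-Lag"))
```

A lookup like this could be used to group the per-attack CSV files of the dataset by family before analysis.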

**Features**

**Data Source**
https://www.unb.ca/cic/datasets/ddos-2019.html

**License**

You may redistribute, republish and mirror the CICDDoS2019 dataset in any form. However, any use or redistribution of the data must include a citation to the CICDDoS2019 dataset and the related published paper. A research paper outlining the details of analyzing a similar IDS/IPS dataset and related principles:

Iman Sharafaldin, Arash Habibi Lashkari, Saqib Hakak, and Ali A. Ghorbani, "Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy", IEEE 53rd International Carnahan Conference on Security Technology, Chennai, India, 2019.

## Purpose of this analysis:



In [None]:
!pip install scikit-learn

# Import the Required Libraries

In [2]:
# Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# Import the Dataset

In [3]:
# Import the dataset (other CSV files from the same collection are left commented out)
# df = pd.read_csv("CSV-01-12/SSH.csv")
df = pd.read_csv("DrDoS_UDP.csv")
# df = pd.read_csv("CSV-01-12/HTTP.csv")
# df = pd.read_csv("CSV-01-12/TFTP.csv")
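
Most column names in this CSV carry a leading space (for example `' Label'` and `' Source IP'`), which is why the rest of the notebook has to spell them with that space. A minimal sketch of normalizing the names at load time follows; `load_flows` is a hypothetical helper and the in-memory CSV is made-up data standing in for `DrDoS_UDP.csv`. The notebook itself keeps the space-prefixed names, so this is optional.

```python
import io

import pandas as pd


def load_flows(source):
    """Read a CSV and strip whitespace around its column names.

    `load_flows` is a hypothetical helper, not part of the original notebook.
    """
    df = pd.read_csv(source, low_memory=False)
    df.columns = df.columns.str.strip()
    return df


# Tiny in-memory stand-in for DrDoS_UDP.csv (values are made up for illustration)
sample_csv = io.StringIO(
    "Flow ID, Source IP, Flow Duration, Label\n"
    "a,172.16.0.5,107716,DrDoS_UDP\n"
    "b,192.168.50.7,12,BENIGN\n"
)
flows = load_flows(sample_csv)
print(list(flows.columns))
```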


# 1.   Composition & Data Exploration

In [4]:
df.shape

(3136802, 88)

In [5]:
df = df.dropna()

In [6]:
df.shape

(3136794, 88)
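
Note that `dropna()` removes rows with `NaN` but leaves infinite values in place, and the `describe()` output further down reports `inf` for the mean and max of `Flow Bytes/s` and ` Flow Packets/s`. The sketch below shows one possible way to drop such rows as well; `drop_non_finite` is a hypothetical helper and the demo frame holds made-up values.

```python
import numpy as np
import pandas as pd


def drop_non_finite(df):
    """Drop rows containing NaN or +/-inf in any numeric column."""
    numeric = df.select_dtypes(include="number")
    finite_rows = np.isfinite(numeric).all(axis=1)
    return df.loc[finite_rows]


demo = pd.DataFrame(
    {
        "Flow Bytes/s": [12978.57, np.inf, np.nan, 750000000.0],
        " Flow Packets/s": [37.13, np.inf, 2.0, 2000000.0],
        " Label": ["DrDoS_UDP", "DrDoS_UDP", "BENIGN", "DrDoS_UDP"],
    }
)
clean = drop_non_finite(demo)
print(len(demo), "->", len(clean))
```

Whether to drop or cap infinite rates is a modelling choice; dropping is simply the smallest change relative to the `dropna()` call above.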

# Split the Dataset into two equal parts randomly for easy processing

In [7]:
df1, df2 = train_test_split(df, test_size=0.5, random_state=42)

In [8]:
df1.shape

(1568397, 88)
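
The split above is purely random, so the rare `BENIGN` rows are not guaranteed to be shared evenly between the two halves. One option is to pass `stratify=df[' Label']` to `train_test_split`; the sketch below shows a pandas-only alternative. `split_in_half` is a hypothetical helper, the demo data is synthetic, and the approach assumes the index is unique.

```python
import pandas as pd


def split_in_half(df, label_col=" Label", seed=42):
    """Split df into two halves that each keep the label proportions.

    Assumes a unique index, because the second half is built with drop().
    """
    first = df.groupby(label_col, group_keys=False).sample(frac=0.5, random_state=seed)
    second = df.drop(first.index)
    return first, second


demo = pd.DataFrame(
    {
        " Flow Duration": range(20),
        " Label": ["BENIGN"] * 4 + ["DrDoS_UDP"] * 16,
    }
)
half_a, half_b = split_in_half(demo)
print(half_a[" Label"].value_counts().to_dict())
```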

In [9]:
# Explore the data (composition)
df1.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1568397 entries, 2324042 to 2219117
Data columns (total 88 columns):
 #   Column                        Non-Null Count    Dtype  
---  ------                        --------------    -----  
 0   Unnamed: 0                    1568397 non-null  int64  
 1   Flow ID                       1568397 non-null  object 
 2    Source IP                    1568397 non-null  object 
 3    Source Port                  1568397 non-null  int64  
 4    Destination IP               1568397 non-null  object 
 5    Destination Port             1568397 non-null  int64  
 6    Protocol                     1568397 non-null  int64  
 7    Timestamp                    1568397 non-null  object 
 8    Flow Duration                1568397 non-null  int64  
 9    Total Fwd Packets            1568397 non-null  int64  
 10   Total Backward Packets       1568397 non-null  int64  
 11  Total Length of Fwd Packets   1568397 non-null  float64
 12   Total Length of Bwd Packet

In [10]:
df1.head()


| | Unnamed: 0 | Flow ID | Source IP | Source Port | Destination IP | Destination Port | Protocol | Timestamp | Flow Duration | Total Fwd Packets | ... | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | SimillarHTTP | Inbound | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2324042 | 67683 | 172.16.0.5-192.168.50.1-53817-63762-17 | 172.16.0.5 | 53817 | 192.168.50.1 | 63762 | 17 | 2018-12-01 13:02:46.865922 | 107716 | 4 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | DrDoS_UDP |
| 2904739 | 119129 | 172.16.0.5-192.168.50.1-37308-42816-17 | 172.16.0.5 | 37308 | 192.168.50.1 | 42816 | 17 | 2018-12-01 13:04:17.661306 | 215415 | 6 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | DrDoS_UDP |
| 851316 | 18260 | 172.16.0.5-192.168.50.1-57830-51380-17 | 172.16.0.5 | 57830 | 192.168.50.1 | 51380 | 17 | 2018-12-01 12:57:36.331864 | 1 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | DrDoS_UDP |
| 2343373 | 63352 | 172.16.0.5-192.168.50.1-58892-38747-17 | 172.16.0.5 | 58892 | 192.168.50.1 | 38747 | 17 | 2018-12-01 13:02:51.112785 | 1 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | DrDoS_UDP |
| 835715 | 75398 | 172.16.0.5-192.168.50.1-41044-2177-17 | 172.16.0.5 | 41044 | 192.168.50.1 | 2177 | 17 | 2018-12-01 12:57:33.409184 | 1 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | DrDoS_UDP |


In [11]:
df[' Label'].unique()

array(['DrDoS_UDP', 'BENIGN'], dtype=object)

In [15]:
# count the total values in label column
df1[' Label'].value_counts()

 Label
DrDoS_UDP    1567300
BENIGN          1097
Name: count, dtype: int64
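
These counts show a severe class imbalance. As a worked example, the sketch below computes the imbalance ratio and per-class weights using the heuristic `n_samples / (n_classes * count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The counts are copied from the output above; the variable names are introduced here for illustration.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
counts = pd.Series({"DrDoS_UDP": 1567300, "BENIGN": 1097})

n_samples = counts.sum()
n_classes = len(counts)

# "Balanced" heuristic: n_samples / (n_classes * count_per_class)
class_weights = n_samples / (n_classes * counts)

print("imbalance ratio:", round(counts.max() / counts.min(), 1))
print(class_weights.round(3).to_dict())
```

With weights like these, each misclassified `BENIGN` flow counts far more than a misclassified attack flow, which is one common way to keep the minority class from being ignored.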

In [14]:
# max column printing option
pd.set_option('display.max_columns', None)

In [15]:
df1.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
2324042,67683,172.16.0.5-192.168.50.1-53817-63762-17,172.16.0.5,53817,192.168.50.1,63762,17,2018-12-01 13:02:46.865922,107716,4,0,1398.0,0.0,369.0,330.0,349.5,22.51666,0.0,0.0,0.0,0.0,12978.57,37.13469,35905.333333,62187.263522,107713.0,1.0,107716.0,35905.333333,62187.263522,107713.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,80,0,37.13469,0.0,330.0,369.0,345.6,21.36118,456.3,0,0,0,0,0,0,0,0,0.0,432.0,349.5,0.0,80,0,0,0,0,0,0,4,1398,0,0,-1,-1,3,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
2904739,119129,172.16.0.5-192.168.50.1-37308-42816-17,172.16.0.5,37308,192.168.50.1,42816,17,2018-12-01 13:04:17.661306,215415,6,0,2088.0,0.0,393.0,321.0,348.0,35.08846,0.0,0.0,0.0,0.0,9692.918,27.85321,43083.0,59019.54109,110255.0,1.0,215415.0,43083.0,59019.54109,110255.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,108,0,27.85321,0.0,321.0,393.0,344.142857,33.617597,1130.142857,0,0,0,0,0,0,0,0,0.0,401.5,348.0,0.0,108,0,0,0,0,0,0,6,2088,0,0,-1,-1,5,14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
851316,18260,172.16.0.5-192.168.50.1-57830-51380-17,172.16.0.5,57830,192.168.50.1,51380,17,2018-12-01 12:57:36.331864,1,2,0,750.0,0.0,375.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,750000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,2000000.0,0.0,375.0,375.0,375.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,562.5,375.0,0.0,0,0,0,0,0,0,0,2,750,0,0,-1,-1,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
2343373,63352,172.16.0.5-192.168.50.1-58892-38747-17,172.16.0.5,58892,192.168.50.1,38747,17,2018-12-01 13:02:51.112785,1,2,0,750.0,0.0,375.0,375.0,375.0,0.0,0.0,0.0,0.0,0.0,750000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,2000000.0,0.0,375.0,375.0,375.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,562.5,375.0,0.0,0,0,0,0,0,0,0,2,750,0,0,-1,-1,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
835715,75398,172.16.0.5-192.168.50.1-41044-2177-17,172.16.0.5,41044,192.168.50.1,2177,17,2018-12-01 12:57:33.409184,1,2,0,802.0,0.0,401.0,401.0,401.0,0.0,0.0,0.0,0.0,0.0,802000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,2000000.0,0.0,401.0,401.0,401.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,601.5,401.0,0.0,0,0,0,0,0,0,0,2,802,0,0,-1,-1,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP


In [16]:
# let's see all column names at once
df1.columns

Index(['Unnamed: 0', 'Flow ID', ' Source IP', ' Source Port',
       ' Destination IP', ' Destination Port', ' Protocol', ' Timestamp',
       ' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Len

In [17]:
# let's have a look on dtypes
df1.dtypes

Unnamed: 0           int64
Flow ID             object
 Source IP          object
 Source Port         int64
 Destination IP     object
                    ...   
 Idle Max          float64
 Idle Min          float64
SimillarHTTP        object
 Inbound             int64
 Label              object
Length: 88, dtype: object
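
Several columns are stored as `object` (for instance `Flow ID`, the IP addresses, `SimillarHTTP` and ` Label`), so they cannot be passed to a scikit-learn estimator as they are. A small sketch of separating numeric from non-numeric columns with `select_dtypes` follows; the one-row demo frame is made up and only mirrors a few of the real column names.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Flow ID": ["172.16.0.5-192.168.50.1-53817-63762-17"],
        " Source IP": ["172.16.0.5"],
        " Flow Duration": [107716],
        "Flow Bytes/s": [12978.57],
        "SimillarHTTP": ["0"],
        " Label": ["DrDoS_UDP"],
    }
)

# Columns usable as numeric features vs. columns needing encoding or removal
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("other:  ", other_cols)
```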

In [18]:
# summary statistics
df1.describe().T



| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1568397.0 | 62822.641704 | 36386.590278 | 0.0 | 31381.0 | 62698.0 | 94050.0 | 1.325930e+05 |
| Source Port | 1568397.0 | 46788.463617 | 8463.673138 | 0.0 | 39812.0 | 46897.0 | 53900.0 | 6.553000e+04 |
| Destination Port | 1568397.0 | 33266.912724 | 18653.183895 | 0.0 | 17154.0 | 33284.0 | 49423.0 | 6.553500e+04 |
| Protocol | 1568397.0 | 16.990930 | 0.321962 | 0.0 | 17.0 | 17.0 | 17.0 | 1.700000e+01 |
| Flow Duration | 1568397.0 | 94102.505391 | 643403.650087 | 0.0 | 1.0 | 179.0 | 108845.0 | 1.199999e+08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Idle Mean | 1568397.0 | 2321.298898 | 316944.551618 | 0.0 | 0.0 | 0.0 | 0.0 | 1.166865e+08 |
| Idle Std | 1568397.0 | 124.265439 | 43610.675057 | 0.0 | 0.0 | 0.0 | 0.0 | 3.551041e+07 |
| Idle Max | 1568397.0 | 2431.974765 | 327914.235679 | 0.0 | 0.0 | 0.0 | 0.0 | 1.166865e+08 |
| Idle Min | 1568397.0 | 2226.845826 | 310535.555550 | 0.0 | 0.0 | 0.0 | 0.0 | 1.166865e+08 |


In [19]:
df.describe()



Unnamed: 0.1,Unnamed: 0,Source Port,Destination Port,Protocol,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Inbound
count,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0,3136794.0
mean,62847.41,46784.73,33267.81,16.99076,94546.83,3.769511,0.005931853,1370.534,1.257006,388.5625,359.5046,372.6106,15.29711,0.2544541,0.02457382,0.08196504,0.08033872,inf,inf,19614.3,30937.46,56433.52,25.07007,94207.45,19743.77,31027.66,56217.77,26.16895,3612.934,322.1647,596.4163,1545.011,0.002078555,0.0001125353,0.0,0.0,0.0,-115955700.0,-12196.37,798817.4,4.841729,359.5042,388.718,370.4443,14.66433,478.563,0.0,3.187968e-05,0.0001125353,0.0,0.000282454,0.0004998734,0.0002661954,0.0,0.0009359875,504.9256,372.6106,0.08196504,-115955700.0,0.0,0.0,0.0,0.0,0.0,0.0,3.769511,1370.534,0.005931853,1.257006,10.60404,1.817273,2.763057,-39799840.0,144.1217,50.98776,189.1839,106.1239,2348.639,103.0999,2454.051,2266.857,0.9988208
std,36407.39,8463.085,18651.67,0.3249252,682539.5,3.745691,0.6018703,1120.631,243.1258,49.50445,57.96194,51.89682,16.17661,22.16357,1.7823,6.271603,7.431685,,,89219.26,155325.7,335301.8,10744.22,680081.9,93369.95,160645.0,333751.7,11423.29,491212.9,32440.15,62825.53,191024.0,0.2047236,0.01060767,0.0,0.0,0.0,726655200.0,5091439.0,945248.0,1189.377,57.96353,52.34156,52.62405,16.23331,4702.979,0.0,0.005646121,0.01060767,0.0,0.016804,0.02235226,0.01631333,0.0,0.04132645,109.5564,51.89682,6.271603,726655200.0,0.0,0.0,0.0,0.0,0.0,0.0,3.745691,1120.631,0.6018703,243.1258,768.1697,300.1419,3.142463,201772300.0,52064.86,31491.79,68798.53,43393.6,316889.4,36674.95,326491.8,311768.6,0.03431966
min,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0342799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-59512260000.0,-2125438000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-59512260000.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-1.0,-1.0,0.0,-1062719000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,31368.0,39806.0,17159.0,17.0,1.0,2.0,0.0,766.0,0.0,375.0,330.0,349.5,0.0,0.0,0.0,0.0,0.0,12964.61,36.7539,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,36.72757,0.0,330.0,375.0,345.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,432.0,349.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,766.0,0.0,0.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,62736.0,46894.0,33294.0,17.0,132.0,4.0,0.0,1398.0,0.0,389.0,366.0,374.5,21.85406,0.0,0.0,0.0,0.0,12081820.0,31250.0,52.0,48.65336,108.0,1.0,99.0,51.0,30.7614,78.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,0.0,39.42965,0.0,366.0,389.0,373.0,21.36118,456.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,509.1667,374.5,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1398.0,0.0,0.0,-1.0,-1.0,3.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,94104.0,53896.0,49426.0,17.0,108844.0,4.0,0.0,1438.0,0.0,393.0,383.0,383.0,34.06367,0.0,0.0,0.0,0.0,766000000.0,2000000.0,36277.33,61277.36,108815.0,1.0,108844.0,36277.0,61276.49,108815.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,80.0,0.0,2000000.0,0.0,383.0,393.0,383.0,32.31563,1044.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,574.5,383.0,0.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1438.0,0.0,0.0,-1.0,-1.0,3.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,132594.0,65531.0,65535.0,17.0,120000000.0,2232.0,361.0,59076.0,205790.0,3547.0,1472.0,1472.0,1271.117,3522.0,488.0,1796.923,1398.644,inf,inf,38895490.0,67368950.0,116686500.0,10047940.0,120000000.0,38895490.0,67368950.0,116686500.0,10047940.0,119999900.0,11800800.0,26387390.0,59185540.0,100.0,1.0,0.0,0.0,0.0,30716.0,4724.0,4000000.0,1000000.0,1472.0,3547.0,1557.333,1308.654,1712575.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,8.0,2208.0,1472.0,1796.923,30716.0,0.0,0.0,0.0,0.0,0.0,0.0,2232.0,59076.0,361.0,205790.0,65535.0,65535.0,167.0,1472.0,61512890.0,48680470.0,72868430.0,61512890.0,116686500.0,35510410.0,116686500.0,116686500.0,1.0
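
In the summary above, some columns (for example `Bwd PSH Flags`, `Fwd URG Flags` and `Bwd URG Flags`) are 0 in every statistic, so they carry no information for a classifier. The sketch below shows one way to find such constant columns; `constant_columns` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd


def constant_columns(df):
    """Return names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


demo = pd.DataFrame(
    {
        " Bwd PSH Flags": [0, 0, 0],
        " Fwd URG Flags": [0, 0, 0],
        " Flow Duration": [107716, 215415, 1],
    }
)
to_drop = constant_columns(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
```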


In [20]:
# how many missing values are there per column
df1.isnull().sum()
# df.isna().sum()

Unnamed: 0         0
Flow ID            0
 Source IP         0
 Source Port       0
 Destination IP    0
                  ..
 Idle Max          0
 Idle Min          0
SimillarHTTP       0
 Inbound           0
 Label             0
Length: 88, dtype: int64

In [21]:
df.columns

Index(['Unnamed: 0', 'Flow ID', ' Source IP', ' Source Port',
       ' Destination IP', ' Destination Port', ' Protocol', ' Timestamp',
       ' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets',
       'Total Length of Fwd Packets', ' Total Length of Bwd Packets',
       ' Fwd Packet Length Max', ' Fwd Packet Length Min',
       ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
       'Bwd Packet Length Max', ' Bwd Packet Length Min',
       ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
       ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max',
       ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std',
       ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean',
       ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags',
       ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags',
       ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s',
       ' Bwd Packets/s', ' Min Packet Len

In [22]:
len(df1)

1568397

In [24]:
# Show all rows instead of truncating long outputs
pd.set_option('display.max_rows', None)



In [25]:
df1.describe().T 



| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1568397.0 | 62822.64 | 36386.59 | 0.0 | 31381.0 | 62698.0 | 94050.0 | 132593.0 |
| Source Port | 1568397.0 | 46788.46 | 8463.673 | 0.0 | 39812.0 | 46897.0 | 53900.0 | 65530.0 |
| Destination Port | 1568397.0 | 33266.91 | 18653.18 | 0.0 | 17154.0 | 33284.0 | 49423.0 | 65535.0 |
| Protocol | 1568397.0 | 16.99093 | 0.3219618 | 0.0 | 17.0 | 17.0 | 17.0 | 17.0 |
| Flow Duration | 1568397.0 | 94102.51 | 643403.7 | 0.0 | 1.0 | 179.0 | 108845.0 | 119999900.0 |
| Total Fwd Packets | 1568397.0 | 3.769862 | 3.44598 | 1.0 | 2.0 | 4.0 | 4.0 | 1006.0 |
| Total Backward Packets | 1568397.0 | 0.005633778 | 0.6059067 | 0.0 | 0.0 | 0.0 | 0.0 | 361.0 |
| Total Length of Fwd Packets | 1568397.0 | 1371.036 | 1122.027 | 0.0 | 766.0 | 1398.0 | 1438.0 | 59076.0 |
| Total Length of Bwd Packets | 1568397.0 | 1.23557 | 251.4439 | 0.0 | 0.0 | 0.0 | 0.0 | 173232.0 |
| Fwd Packet Length Max | 1568397.0 | 388.547 | 49.29893 | 0.0 | 375.0 | 389.0 | 393.0 | 3547.0 |


In [26]:
df.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Min Packet Length,Max Packet Length,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Avg Fwd Segment Size,Avg Bwd Segment Size,Fwd Header Length.1,Fwd Avg Bytes/Bulk,Fwd Avg Packets/Bulk,Fwd Avg Bulk Rate,Bwd Avg Bytes/Bulk,Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,101418,172.16.0.5-192.168.50.1-43443-6652-17,172.16.0.5,43443,192.168.50.1,6652,17,2018-12-01 12:36:57.628026,218395,6,0,2088.0,0.0,393.0,321.0,348.0,35.08846,0.0,0.0,0.0,0.0,9560.658,27.47316,43679.0,59812.246961,110075.0,0.0,218395.0,43679.0,59812.246961,110075.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-6,0,27.47316,0.0,321.0,393.0,344.142857,33.617597,1130.142857,0,0,0,0,0,0,0,0,0.0,401.5,348.0,0.0,-6,0,0,0,0,0,0,6,2088,0,0,-1,-1,5,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
1,21564,172.16.0.5-192.168.50.1-54741-9712-17,172.16.0.5,54741,192.168.50.1,9712,17,2018-12-01 12:36:57.628076,108219,4,0,1398.0,0.0,369.0,330.0,349.5,22.51666,0.0,0.0,0.0,0.0,12918.25,36.96209,36073.0,62478.536731,108217.0,1.0,108219.0,36073.0,62478.536731,108217.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-4,0,36.96209,0.0,330.0,369.0,345.6,21.36118,456.3,0,0,0,0,0,0,0,0,0.0,432.0,349.5,0.0,-4,0,0,0,0,0,0,4,1398,0,0,-1,-1,3,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
2,23389,172.16.0.5-192.168.50.1-56589-4680-17,172.16.0.5,56589,192.168.50.1,4680,17,2018-12-01 12:36:57.628164,104579,4,0,1438.0,0.0,389.0,330.0,359.5,34.063666,0.0,0.0,0.0,0.0,13750.37,38.2486,34859.666667,60376.981751,104577.0,1.0,104579.0,34859.666667,60376.981751,104577.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-4,0,38.2486,0.0,330.0,389.0,353.6,32.315631,1044.3,0,0,0,0,0,0,0,0,0.0,442.0,359.5,0.0,-4,0,0,0,0,0,0,4,1438,0,0,-1,-1,3,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
3,48872,172.16.0.5-192.168.50.1-40233-2644-17,172.16.0.5,40233,192.168.50.1,2644,17,2018-12-01 12:36:57.628166,110967,4,0,1544.0,0.0,389.0,383.0,386.0,3.464102,0.0,0.0,0.0,0.0,13914.05,36.04675,36989.0,64065.09527,110965.0,1.0,110967.0,36989.0,64065.09527,110965.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-4,0,36.04675,0.0,383.0,389.0,386.6,3.286335,10.8,0,0,0,0,0,0,0,0,0.0,483.25,386.0,0.0,-4,0,0,0,0,0,0,4,1544,0,0,-1,-1,3,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
4,80354,172.16.0.5-192.168.50.1-33989-16901-17,172.16.0.5,33989,192.168.50.1,16901,17,2018-12-01 12:36:57.628217,1,2,0,766.0,0.0,383.0,383.0,383.0,0.0,0.0,0.0,0.0,0.0,766000000.0,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,-2,0,2000000.0,0.0,383.0,383.0,383.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,574.5,383.0,0.0,-2,0,0,0,0,0,0,2,766,0,0,-1,-1,1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP


In [36]:
# select  Flow Duration, Flow Bytes/s, Total Length of Fwd Packets and  URG Flag Count column and show its data type
df[[' Flow Duration', 'Flow Bytes/s', 'Total Length of Fwd Packets', ' URG Flag Count']].dtypes


 Flow Duration                   int64
Flow Bytes/s                   float64
Total Length of Fwd Packets    float64
 URG Flag Count                  int64
dtype: object

In [28]:
df1.shape

(1568397, 88)

In [29]:
# Keep a subset of flow-level features plus the label
features = [' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets',
            'Total Length of Fwd Packets', ' Total Length of Bwd Packets', ' Fwd Packet Length Max',
            ' Fwd Packet Length Min', ' Fwd Packet Length Mean', ' Fwd Packet Length Std',
            'Bwd Packet Length Max', ' Bwd Packet Length Min', ' Bwd Packet Length Mean',
            ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s', ' Label']

# .copy() avoids pandas' SettingWithCopyWarning when the label column is overwritten below
df1 = df1[features].copy()

# Encode labels (category codes follow sorted order: BENIGN -> 0, DrDoS_UDP -> 1)
df1[' Label'] = df1[' Label'].astype('category').cat.codes

# Define X and y
X = df1.drop(' Label', axis=1)
y = df1[' Label']
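
Because `cat.codes` assigns integers by sorted category order, the meaning of `0` and `1` is implicit and would shift if another label appeared in the data. A sketch of an explicit mapping is shown below; `LABEL_MAP` is a name introduced here for illustration and the three-row series is made up.

```python
import pandas as pd

labels = pd.Series(["DrDoS_UDP", "BENIGN", "DrDoS_UDP"], name=" Label")

# What the notebook does: codes follow the sorted category order
codes = labels.astype("category").cat.codes

# Explicit alternative: the mapping is visible and fails loudly on unknown labels
LABEL_MAP = {"BENIGN": 0, "DrDoS_UDP": 1}
encoded = labels.map(LABEL_MAP)
if encoded.isna().any():
    raise ValueError("unexpected label value")
encoded = encoded.astype("int8")

print(codes.tolist(), encoded.tolist())
```

Both approaches give the same codes for this file; the explicit map mainly documents that class `1` is the attack class in the evaluation output.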

In [None]:
# Summary statistics
print(X.describe())

# Data visualization
sns.pairplot(df1, hue=' Label')
plt.show()
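
A pair plot over roughly 1.5 million rows and 15 features is expensive to draw, and the cell above has no recorded execution count. One way to make it tractable is to plot a capped random sample per class. The sketch below assumes a hypothetical helper `sample_per_class`; the cap of 10 rows and the demo data are arbitrary.

```python
import pandas as pd


def sample_per_class(df, label_col=" Label", max_rows=1000, seed=42):
    """Keep at most `max_rows` randomly chosen rows from each class."""
    parts = [
        group.sample(n=min(len(group), max_rows), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


demo = pd.DataFrame(
    {
        " Flow Duration": range(50),
        " Label": [0] * 5 + [1] * 45,
    }
)
plot_df = sample_per_class(demo, max_rows=10)
print(plot_df[" Label"].value_counts().to_dict())
# plot_df could then be passed to sns.pairplot(plot_df, hue=" Label")
```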

In [31]:
scaler = StandardScaler()  # `scaler` must be instantiated before use
X_scaled = scaler.fit_transform(np.nan_to_num(X))
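
Two caveats apply to this step. `np.nan_to_num` replaces `inf` with the largest representable float, which can overflow inside `StandardScaler`, and fitting the scaler on all rows before the train/test split lets test-set statistics leak into training. The sketch below shows one possible alternative on synthetic data: infinities are capped at the largest finite value per column, the split is stratified, and the scaler sits inside a pipeline so it is fitted on the training split only. All names and the synthetic data are assumptions; tree-based models such as random forests are also largely insensitive to feature scaling, so the scaler could arguably be dropped.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (the real features come from df1)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] > 1.0).astype(int)  # minority class, roughly 16%
X_demo[0, 1] = np.inf  # mimic an infinite rate value

# Cap +inf at the largest finite value in each column instead of ~1.8e308
finite_max = np.nanmax(np.where(np.isfinite(X_demo), X_demo, np.nan), axis=0)
X_demo = np.where(np.isposinf(X_demo), finite_max, X_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, inside the pipeline
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42),
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```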




In [42]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred = rf_model.predict(X_test)
y_train_predict = rf_model.predict(X_train)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Accuracy on Training Data:", accuracy_score(y_train, y_train_predict))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9999362407549095
Accuracy on Training Data: 0.9999799613253579
Confusion Matrix:
 [[   322     13]
 [    17 470168]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.96      0.96       335
           1       1.00      1.00      1.00    470185

    accuracy                           1.00    470520
   macro avg       0.97      0.98      0.98    470520
weighted avg       1.00      1.00      1.00    470520
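
With 335 `BENIGN` flows against 470185 attack flows in the test split, always predicting the attack class would already reach about 99.93% accuracy, so overall accuracy says little on its own. As a worked example, the sketch below recomputes the per-class figures and the balanced accuracy directly from the confusion matrix printed above (rows are true classes, columns are predicted classes); the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted)
cm = np.array([[322, 13], [17, 470168]])

support = cm.sum(axis=1)                  # true samples per class
recall = np.diag(cm) / support            # per-class recall
precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision

accuracy = np.trace(cm) / cm.sum()
balanced_accuracy = recall.mean()
majority_baseline = support.max() / cm.sum()  # always predicting class 1

print(f"accuracy          {accuracy:.6f}")
print(f"balanced accuracy {balanced_accuracy:.4f}")
print(f"majority baseline {majority_baseline:.6f}")
print(f"class 0 precision {precision[0]:.4f}, recall {recall[0]:.4f}")
```

The class 0 precision and recall here are the more informative numbers for judging how well benign traffic is separated from the attack.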

