## Problem Statement

Cyber-adversaries are becoming more sophisticated in their efforts to avoid detection, and many modern malware tools already incorporate new ways to bypass antivirus and other threat detection measures. Intrusion detection software protects a computer network from unauthorized users, possibly including insiders, so new methods are needed to detect attacks.

This is a multi-class classification problem. The given dataset contains 41 feature columns and 1 label column (plus an `Id` column). The task is to classify each record into one of 5 classes (4 types of network attack plus normal traffic), namely:
<br /><br />

ipsweep (probe) <br />
back (DoS) <br />
satan (probe) <br />
portsweep (probe) <br />
normal <br />

ipsweep attack: An IP sweep, also known as an ICMP sweep, occurs when the attacker sends ICMP (Internet Control Message Protocol) echo requests to multiple destination addresses. If a target host replies to these requests, the reply reveals the target's IP address to the attacker. An IP sweep aims to determine which addresses in a range map to live hosts. <br /> <br />

satan probe: SATAN (Security Administrator Tool for Analyzing Networks) is a historical vulnerability scanner developed in the 1990s by Dan Farmer and Wietse Venema. It was designed to assess the security of networked systems by scanning for known vulnerabilities and misconfigurations. SATAN provided a web-based interface to display the results of its scans and offered recommendations for remediation. The same scans can be used by an attacker for reconnaissance, which is why this traffic is labelled as a probe.<br /> <br />

portsweep probe: A portsweep attack is a network reconnaissance technique used by malicious actors to scan a range of network ports on multiple target systems. The objective is to identify which ports are open and potentially vulnerable to exploitation.<br /> <br />
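
The class list above pairs each label with a coarser category (probe, DoS, or normal). A minimal sketch of that pairing as a lookup table is shown here; the names `ATTACK_CATEGORY` and `category_of` are illustrative and not part of the original notebook. The raw labels in the dataset carry a trailing dot (for example `satan.`), so the helper strips it before the lookup.

```python
# Label -> category pairs taken from the class list in the problem statement.
ATTACK_CATEGORY = {
    "ipsweep": "probe",
    "back": "dos",
    "satan": "probe",
    "portsweep": "probe",
    "normal": "normal",
}


def category_of(label: str) -> str:
    """Return the coarse category for a raw label such as 'satan.'."""
    return ATTACK_CATEGORY[label.rstrip(".")]


for raw in ["satan.", "back.", "normal."]:
    print(raw, "->", category_of(raw))
```

A lookup like this could be used to report results per category as well as per label, if that view is useful.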


## Processing the input

In [20]:
import pandas as pd
import numpy as np

# Visualisation libraries
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.pyplot as plt

In [22]:
# Reading the input

input_df = pd.read_csv("train.csv")
input_df.head()

Unnamed: 0,Id,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,type_of_attack
0,1,0,tcp,other,REJ,0,0,0,0,0,...,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,satan.
1,2,0,tcp,http,SF,54540,8314,0,0,0,...,178,1.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,back.
2,3,0,tcp,other,REJ,0,0,0,0,0,...,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,satan.
3,4,0,icmp,eco_i,SF,8,0,0,0,0,...,86,1.0,0.0,1.0,0.51,0.0,0.0,0.0,0.0,ipsweep.
4,5,0,tcp,other,REJ,0,0,0,0,0,...,1,0.0,1.0,0.0,0.0,0.07,0.0,0.93,1.0,satan.


In [4]:
input_df.shape

(12989, 43)

In [5]:
input_df.columns

Index(['Id', 'duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate',
       'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate', 'type_of_attack'],
      dtype='object')

In [7]:
input_df.dtypes

Id                               int64
duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate              

In [24]:
ids = input_df['Id']
X = input_df.drop(columns=["type_of_attack","Id"])
Y = input_df["type_of_attack"]
del input_df

## Exploratory Data Analysis

### Distribution of target variable

In [12]:
target_distrib = Y.value_counts()
target_distrib

ipsweep.      5318
satan.        3711
portsweep.    1999
back.         1400
normal.        561
Name: type_of_attack, dtype: int64
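
The counts above show a noticeable class imbalance. As a rough sketch, the share of each class and the ratio between the largest and smallest class can be computed as follows. The Series is rebuilt here from the printed counts so the snippet runs on its own; in the notebook, `target_distrib` already holds these values. The names `class_share` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
target_distrib = pd.Series(
    {"ipsweep.": 5318, "satan.": 3711, "portsweep.": 1999, "back.": 1400, "normal.": 561},
    name="type_of_attack",
)

# Percentage of records per class.
class_share = (target_distrib / target_distrib.sum() * 100).round(2)

# How many times larger the biggest class is than the smallest.
imbalance_ratio = target_distrib.max() / target_distrib.min()

print(class_share)
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

An imbalance of this size is one reason to stratify the train/test split and to consider class weighting when fitting a model.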

In [13]:
labels = target_distrib.index
values = list(target_distrib)

fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()

### Distribution of other categorical variables

In [14]:
X["protocol_type"].value_counts()

tcp     8052
icmp    4665
udp      272
Name: protocol_type, dtype: int64

In [15]:
X["service"].value_counts() 

eco_i        4663
other        3400
private      2592
http         1971
ftp_data       33
             ... 
efs             1
pop_3           1
http_443        1
http_2784       1
iso_tsap        1
Name: service, Length: 61, dtype: int64

In [16]:
X["flag"].value_counts()  

SF        6925
REJ       3831
RSTR      1793
S0         317
RSTOS0      76
RSTO        36
OTH          5
S2           4
S1           2
Name: flag, dtype: int64

In [47]:
for col in X.columns:
    print(X[col].value_counts())

0     12710
1       133
6        32
5        20
25       12
27       12
2        10
4         6
10        5
7         5
21        4
8         4
13        3
24        3
15        3
17        3
28        3
22        3
3         3
20        2
29        2
14        2
23        2
19        2
9         1
78        1
12        1
60        1
11        1
Name: duration, dtype: int64
0        6059
8        4654
54540    1372
1         271
6          22
         ... 
250         1
246         1
160         1
292         1
349         1
Name: src_bytes, Length: 179, dtype: int64
0         10953
8314       1365
424          31
15           23
7300         18
          ...  
1377          1
2611          1
191273        1
1273          1
920           1
Name: dst_bytes, Length: 260, dtype: int64
0    11586
2     1361
1       42
Name: hot, dtype: int64
0    10993
1     1996
Name: logged_in, dtype: int64
0    11626
1     1363
Name: num_compromised, dtype: int64
0    12987
1        2
Name: num_root, dt

### Checking for null values

In [25]:
X.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

In [28]:
# Dropping columns that have a constant value (Here 0)
const_value_cols = ['land', 'wrong_fragment', 'urgent', 'num_failed_logins', 'root_shell', 'su_attempted', 'num_file_creations', 'num_outbound_cmds', 'is_host_login', 'is_guest_login']

X.drop(columns=const_value_cols, inplace=True)
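
The list of constant columns above is written out by hand. One possible way to derive such a list from the data is sketched below, using `DataFrame.nunique`. The helper name `constant_columns` and the toy frame `demo` are assumptions made for illustration; in the notebook the function would be applied to `X`.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame standing in for X (values are illustrative only).
demo = pd.DataFrame({"land": [0, 0, 0], "duration": [0, 1, 6], "urgent": [0, 0, 0]})

const_value_cols = constant_columns(demo)
demo = demo.drop(columns=const_value_cols)

print(const_value_cols)
print(list(demo.columns))
```

Deriving the list this way avoids missing a column or dropping one that is not actually constant when the input file changes.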

In [29]:
X.columns

Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'hot', 'logged_in', 'num_compromised', 'num_root',
       'num_shells', 'num_access_files', 'count', 'srv_count', 'serror_rate',
       'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate'],
      dtype='object')

In [30]:
#Converting object dtypes to category
cat_attr=X.select_dtypes(include ='object').columns
X[cat_attr] = X[cat_attr].astype('category')

In [33]:
#Applying one hot encoding on the categorical attributes
X = pd.get_dummies(columns=cat_attr, data = X, prefix=cat_attr, prefix_sep="_", drop_first=True)
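
One caveat with `pd.get_dummies` is that the resulting columns depend on which categories appear in the frame being encoded. If new data is encoded later (for example, a separate test file, which is an assumption here and not part of this notebook), its columns may not match those seen during training. A minimal sketch of aligning them with `reindex` follows; the frames `train` and `new` are made-up examples.

```python
import pandas as pd

# Made-up frames: `new` contains only one of the three protocol types.
train = pd.DataFrame({"protocol_type": ["tcp", "icmp", "udp"], "count": [1, 2, 3]})
new = pd.DataFrame({"protocol_type": ["tcp", "tcp"], "count": [5, 6]})

# Training encoding, with the first category (icmp) dropped as the baseline.
train_enc = pd.get_dummies(train, columns=["protocol_type"], drop_first=True)

# Encode new data WITHOUT drop_first: with a single category present,
# drop_first would discard the tcp column and lose the information.
new_enc = pd.get_dummies(new, columns=["protocol_type"])

# Align to the training layout: add missing columns as 0, drop extras.
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(new_enc)
```

An alternative with the same goal is scikit-learn's `OneHotEncoder` fitted on the training data, which remembers the categories it has seen.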

In [34]:
#Label encoding target variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(Y)
Y = le.transform(Y)
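
`LabelEncoder` assigns integer codes in sorted order of the class names, not in order of appearance or frequency. This matters whenever the codes are turned back into names, for instance when labelling a report. The sketch below uses a short made-up list of labels to show the mapping; `mapping` and `decoded` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Short made-up sample containing each of the five labels.
labels = ["satan.", "back.", "ipsweep.", "normal.", "portsweep.", "satan."]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted; the position of each name is its integer code.
mapping = {name: code for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original string labels from the codes.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Reading the names from `le.classes_` rather than typing them by hand keeps codes and names in step.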

In [50]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [51]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, class_weight='balanced', multi_class='ovr', solver='liblinear').fit(X_train, y_train)

In [52]:
predictions = clf.predict(X_test)

In [53]:
# Accuracy on the training set
score = clf.score(X_train, y_train)
score*100

100.0

In [54]:
# Accuracy on the held-out test set
score = clf.score(X_test, y_test)
score*100

99.96150885296382

In [55]:
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

[[ 280    0    0    0    0]
 [   0 1064    0    0    0]
 [   0    0  112    0    0]
 [   0    0    0  400    0]
 [   0    0    0    1  741]]
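
The overall accuracy and the per-class precision and recall can be read directly off the confusion matrix above. The sketch below recomputes them with NumPy from the printed values. Rows are true classes and columns are predicted classes. The class order is an assumption based on `LabelEncoder` sorting the labels alphabetically; it is consistent with the row totals (280, 1064, 112, 400, 742), which match 20% of the counts for `back.`, `ipsweep.`, `normal.`, `portsweep.` and `satan.` respectively.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [280, 0, 0, 0, 0],
    [0, 1064, 0, 0, 0],
    [0, 0, 112, 0, 0],
    [0, 0, 0, 400, 0],
    [0, 0, 0, 1, 741],
])

# Assumed order: LabelEncoder's alphabetical sorting of the labels.
class_names = ["back.", "ipsweep.", "normal.", "portsweep.", "satan."]

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)

print(f"accuracy: {accuracy:.6f}")
for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<11} precision={p:.4f} recall={r:.4f}")
```

Under that ordering, the single error is one `satan.` record predicted as `portsweep.`, which lowers recall for the former and precision for the latter very slightly.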


In [56]:
from sklearn.metrics import classification_report

# LabelEncoder sorts the classes alphabetically, so take the names from le.classes_
print(classification_report(y_test, predictions, target_names=le.classes_))

              precision    recall  f1-score   support

       back.       1.00      1.00      1.00       280
    ipsweep.       1.00      1.00      1.00      1064
     normal.       1.00      1.00      1.00       112
  portsweep.       1.00      1.00      1.00       400
      satan.       1.00      1.00      1.00       742

    accuracy                           1.00      2598
   macro avg       1.00      1.00      1.00      2598
weighted avg       1.00      1.00      1.00      2598

