# **Logistic Regression model on Network Intrusion Detection Dataset**

In this kernel I study the Network Intrusion Detection dataset with Logistic Regression, classifying network behavior as normal or anomalous and reporting the resulting accuracy score.

First, I implement the algorithm without the Logistic Regression library in order to understand the mathematics behind it. Then I repeat the analysis using the library.

**Network Intrusion Detection Dataset**

The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. The TCP/IP dump data was acquired by simulating a typical US Air Force LAN. For each TCP/IP connection, 41 features (3 qualitative and 38 quantitative) are obtained from normal and attack data. The class variable has two categories:
* Normal
* Anomalous

Since we need a binary result for this study, we will classify it according to whether it is normal or not.


First we need to import necessary libraries to get dataset which will be used. Kaggle does it automatically for us.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Next, we read the .csv file with pandas and assign it to a dataframe variable. We also import the matplotlib.pyplot library to visualize the relation between the *cost* and *iteration* values. Because we want to measure accuracy, *Train_data* will later be split into train and test parts, which lets us check how well the algorithm fits data it has not seen.

In [None]:
import matplotlib.pyplot as plt

df_train = pd.read_csv("/kaggle/input/network-intrusion-detection/Train_data.csv")

print(df_train.info())

Since we will do matrix operations, the qualitative columns of type *object* must be converted into quantitative ones. We can use the *get_dummies* method for this. With *drop_first=True*, a column that has k distinct values is replaced by k-1 indicator columns.
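
As a small illustration of the k-1 rule, here is a sketch on a made-up dataframe (the values below are invented, not taken from the dataset). The toy *protocol_type* column has k = 3 distinct values, so two indicator columns are expected:

```python
import pandas as pd

# made-up rows, only to show the effect of drop_first=True
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "duration": [0, 2, 1, 5],
})

encoded = pd.get_dummies(toy, columns=["protocol_type"], drop_first=True)

# "icmp" comes first alphabetically, so its indicator column is the one dropped
print(encoded.columns.tolist())
```

A row whose remaining indicator columns are all 0 therefore belongs to the dropped category.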

In [None]:
data_train = pd.get_dummies(df_train,columns = ["protocol_type", "flag", "service", "class"], drop_first=True)

data_train.head()

In this scenario the "class" feature is the target. After encoding, the *class_normal* column is 1 if the behaviour is normal and 0 otherwise. Now we specify the x and y data: x holds the features and y holds the target values.

In [None]:
y_data = data_train.class_normal.values
x_data = data_train.drop(["class_normal"], axis=1)
y_data = y_data.astype('int64') # cast the dummy column (bool or uint8, depending on the pandas version) to int64 so that -y and 1-y in the cost formula are ordinary integer arithmetic

print("Y datas : ", y_data) #if it is normal 1 else 0.
print("X datas : \n", x_data.head())


To prevent features with large values from dominating features with small values, we normalize every column to the range [0, 1].
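
A minimal sketch of min-max normalization on a single made-up column follows. The helper name *min_max_scale* and the numbers are assumptions for illustration only:

```python
import numpy as np

def min_max_scale(column):
    # hypothetical helper: map the smallest value to 0 and the largest to 1
    column = np.asarray(column, dtype=float)
    return (column - column.min()) / (column.max() - column.min())

made_up_bytes = [0, 250, 1000]   # invented values on a large scale
scaled = min_max_scale(made_up_bytes)
print(scaled)
```

After scaling, a byte count in the thousands and a 0/1 flag live on the same scale, so neither dominates the dot product with the weights.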

In [None]:
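x_data = x_data.astype("float64")  # dummy columns may be bool or uint8; cast so that subtraction is well defined
# min-max normalization, column by column
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())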

x.head(n=10)

After normalization, the *num_outbound_cmds* and *is_host_login* columns contain NaN values: when a column holds a single constant value, max - min is 0 and the division becomes 0/0. We need to drop these columns.
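
The effect can be reproduced on a made-up dataframe. This is only a sketch with invented values; it also shows one possible way to find such columns automatically instead of naming them by hand:

```python
import pandas as pd

# invented data: the second column is constant, like a feature that never changes
toy = pd.DataFrame({
    "duration": [0.0, 2.0, 4.0],
    "num_outbound_cmds": [0.0, 0.0, 0.0],
})

scaled = (toy - toy.min()) / (toy.max() - toy.min())

# columns that became NaN in every row are the constant ones
constant_cols = scaled.columns[scaled.isna().all()].tolist()
print(constant_cols)
```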

In [None]:
x[['num_outbound_cmds','is_host_login']].head()

In [None]:
x.drop(["num_outbound_cmds", "is_host_login"], axis = 1, inplace=True)
x.head()

As you can see, we now have 113 columns. We will use 80 percent of the normalized *x* and *y_data* for learning and 20 percent for testing, so we split them with the *train_test_split* method. Afterwards, we transpose the results so that each column of *x_train* and *x_test* is one sample, which is the layout the matrix operations below expect.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y_data, test_size=0.2, random_state=42)

x_train = x_train.T
x_test = x_test.T
y_train = y_train.T
y_test = y_test.T

print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)

Learning in logistic regression consists of two stages: forward and backward propagation. The parameters it learns are called *weight* and *bias*. Over the iterations, the algorithm adjusts them towards the values that fit the data best.

Forward propagation : 
* z = np.dot(w.T, x_train) + b
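
The shapes in this product can be checked on made-up data. The sketch below assumes 3 features and 4 samples (arbitrary numbers, not the real dataset):

```python
import numpy as np

# made-up data: 3 features (rows) x 4 samples (columns), as after the transpose
x_toy = np.array([[0.0, 0.5, 1.0, 0.2],
                  [1.0, 0.0, 0.3, 0.7],
                  [0.4, 0.9, 0.1, 0.6]])
w = np.full((3, 1), 0.01)   # one weight per feature
b = 0.0                     # a single bias shared by all samples

z = np.dot(w.T, x_toy) + b  # (1, 3) dot (3, 4) -> (1, 4): one z value per sample
print(z.shape)
```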

In forward propagation, after calculating *z*, we pass it through the *sigmoid* function, which gives the *y_head* values: probabilities between 0 and 1. A value above 0.5 is read as class 1 (normal), otherwise as class 0 (anomalous).
For backward propagation we need the derivatives of the cost with respect to *weight* and *bias*.
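
Written out, the quantities used in this kernel are as follows. This is a sketch in assumed notation: m is the number of training samples (the columns of *x_train*), and the hatted y stands for *y_head*.

```latex
\begin{aligned}
z &= w^{T} x + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} \\
\text{cost} &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] \\
\frac{\partial\, \text{cost}}{\partial w} &= \frac{1}{m}\, x\, (\hat{y} - y)^{T}, \qquad
\frac{\partial\, \text{cost}}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
\end{aligned}
```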

Of course, *weight* and *bias* need initial values. That's why we define the functions *initialize_weigths_and_bias* and *sigmoid*.

In [None]:
def initialize_weigths_and_bias(dimension):
    w = np.full((dimension,1),0.01)
    b = 0.0
    return w,b

def sigmoid(z):
    y_head = 1/(1 + np.exp(-z))
    return y_head

Now we can define the *forward_backward_propagation* function. We divide by *x_train.shape[1]*, the number of samples, so that the cost and the gradients are averages over the training set. This function calculates the cost value as well.

In [None]:
def forward_backward_propagation(w,b,x_train,y_train):
    #forward propagation
    z = np.dot(w.T, x_train) + b
    y_head = sigmoid(z)
    loss = -y_train*np.log(y_head) - (1-y_train)*np.log(1-y_head)
    cost = (np.sum(loss)) / x_train.shape[1]
    
    #backward propagation 
    derivative_weight = (np.dot(x_train,((y_head-y_train).T))) / x_train.shape[1]    
    derivative_bias = np.sum(y_head-y_train) / x_train.shape[1]    
    gradients = {"derivative_weight" : derivative_weight, "derivative_bias" : derivative_bias}
    return cost, gradients
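
One way to gain confidence in the backward propagation formulas is a numerical gradient check. It is not part of the original workflow. The sketch below uses made-up data and a hypothetical helper, *cost_and_gradients*, that repeats the same formulas, and compares the analytic gradients with central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_gradients(w, b, x, y):
    # same formulas as forward_backward_propagation; x has shape (features, samples)
    m = x.shape[1]
    y_head = sigmoid(np.dot(w.T, x) + b)
    cost = np.sum(-y * np.log(y_head) - (1 - y) * np.log(1 - y_head)) / m
    dw = np.dot(x, (y_head - y).T) / m
    db = np.sum(y_head - y) / m
    return cost, dw, db

rng = np.random.default_rng(0)
x = rng.random((3, 5))              # 3 made-up features, 5 made-up samples
y = np.array([1, 0, 1, 1, 0])
w = np.full((3, 1), 0.01)
b = 0.0

_, dw, db = cost_and_gradients(w, b, x, y)

# central finite differences: nudge one parameter at a time
eps = 1e-6
numeric_dw = np.zeros_like(w)
for i in range(w.shape[0]):
    step = np.zeros_like(w)
    step[i, 0] = eps
    cost_plus, _, _ = cost_and_gradients(w + step, b, x, y)
    cost_minus, _, _ = cost_and_gradients(w - step, b, x, y)
    numeric_dw[i, 0] = (cost_plus - cost_minus) / (2 * eps)

cost_plus, _, _ = cost_and_gradients(w, b + eps, x, y)
cost_minus, _, _ = cost_and_gradients(w, b - eps, x, y)
numeric_db = (cost_plus - cost_minus) / (2 * eps)

print(np.allclose(dw, numeric_dw, atol=1e-6), np.isclose(db, numeric_db, atol=1e-6))
```

If the two agree, the derivative expressions match the cost function they are supposed to differentiate.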

The values of *weight* and *bias* are updated once per iteration, for the given number of iterations, which moves them towards more accurate predictions. The *learning_rate* parameter sets the size of each update step; it is a hyperparameter that we choose manually. For these reasons, the *update* function must be defined.
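
Why the step size matters can be seen on a much simpler problem than ours. The sketch below is a toy example, not the kernel's model: it minimises f(w) = (w - 3)^2 with the same update rule and two arbitrary learning rates.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    # minimise f(w) = (w - 3)^2, whose derivative is 2 * (w - 3)
    for _ in range(steps):
        w = w - learning_rate * 2 * (w - 3)
    return w

w_small = gradient_descent(0.1)   # small steps: approaches the minimum at w = 3
w_large = gradient_descent(1.1)   # steps too large: overshoots further on every update
print(w_small, w_large)
```

A suitable value depends on the problem, which is why the learning rate is tuned by hand here.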

In [None]:
def update(w,b,x_train,y_train,learning_rate,number_of_iteration):
    cost_list = []
    cost_list2 = []
    index = []
    
    # update (learn) the parameters number_of_iteration times
    for i in range(number_of_iteration):
        # run forward and backward propagation to find the cost and gradients
        cost, gradients = forward_backward_propagation(w, b, x_train, y_train)
        cost_list.append(cost)
        # gradient descent step
        w = w - learning_rate * gradients["derivative_weight"]
        b = b - learning_rate * gradients["derivative_bias"]
        if i % 3 == 0 :
            cost_list2.append(cost)
            index.append(i)
            print("cost after iteration %i : %f" %(i,cost))
            
    #store updated parameters in a parameters dictionary
    parameters = {"weight":w,"bias":b}
    plt.plot(index,cost_list2)
    plt.xticks(index,rotation='vertical')
    plt.xlabel("Number of iteration")
    plt.ylabel("Cost")
    plt.show()
    return parameters, gradients, cost_list

Now it's time to predict the *y_prediction* values for the split-off *x_test*, for which we also have the real y values. This lets us calculate the accuracy score from the real and predicted y values. In this step we only need *forward propagation*, since we already have the trained parameters (*weight*, *bias*).

In [None]:
def predict(w,b,x_test):
    # forward propagation with the trained parameters
    y_head = sigmoid(np.dot(w.T,x_test)+b)
    # class 1 if y_head is greater than 0.5, else class 0; shape is (1, number of samples)
    Y_prediction = (y_head > 0.5).astype(float)
    return Y_prediction


Now that all the necessary functions have been defined, we need to combine them all together under the *logistic_regression* function and calculate the accuracy score here.

In [None]:
def logistic_regression(x_train,y_train,x_test,y_test,learning_rate,num_iterations):
    dimension = x_train.shape[0]
    w,b = initialize_weigths_and_bias(dimension)
    
    #run learning algorithm
    parameters, gradients, cost_list = update(w,b,x_train,y_train,learning_rate,num_iterations)
    
    #We get prediction values with the test data and the trained parameters we give.
    y_prediction_test = predict(parameters["weight"],parameters["bias"], x_test) 
    
    error_value = np.mean(np.abs(y_prediction_test - y_test)) * 100    
    print("test accuracy: {} %".format(100 - error_value ))
    
    return  y_prediction_test, y_test
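
The accuracy formula works because both arrays contain only 0 and 1, so the mean absolute difference is the share of wrong predictions. A quick check on made-up labels (a sketch, not real results):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])                # invented real labels
y_pred = np.array([[1.0, 0.0, 0.0, 1.0, 0.0]])    # shape (1, n), like the output of predict

error_value = np.mean(np.abs(y_pred - y_true)) * 100
accuracy_from_error = 100 - error_value

# the same number, computed directly as the share of matching labels
accuracy_direct = np.mean(y_pred[0] == y_true) * 100
print(accuracy_from_error, accuracy_direct)
```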

Let's try our Logistic Regression algorithm.

In [None]:
y_prediction_test, y_test = logistic_regression(x_train, y_train, x_test, y_test, learning_rate=3, num_iterations=100)

We should observe that the *cost* value decreases as the number of iterations increases. In addition, *cost* and *test accuracy* tend to move in opposite directions: as the cost decreases, the accuracy score is expected to increase. Looking at the test accuracy score, we see that the predictions are close to 100 percent correct, which is the result we were hoping for.

Let's see the real and predicted values of *y*. We print the first 10 elements of *y_test* and *y_prediction_test* together. 1 is normal behaviour, 0 is anomalous.

In [None]:
y_test_values = np.zeros((1,x_test.shape[1]))
y_test_values[:1,:10] = y_test[:10] 

print("real : ", y_test_values[:1,:10])
print("pred : ", y_prediction_test[:1,:10])


Of course, Python already has libraries for Logistic Regression. With them, the accuracy score can be obtained in 3 lines. To do this, we import the *LogisticRegression* class from scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=200)
lr.fit(x_train.T,y_train.T)
print("test accuracy : {}".format(lr.score(x_test.T,y_test.T)))

As you can see, the library implementation, which exposes many more parameters, also yields an accuracy of almost 100 percent.