# **MACHINE LEARNING PROJECT IMPLEMENTATION**

## **INTERNET FIREWALL DATA**

## **Presented By: Engr. Saad**
## **Roll Number:**  **-----**


## **IMPORTING LIBRARIES**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import imblearn
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline


## **DATA ACQUISITION**

Data acquisition is the process of finding and fetching a dataset from an available source. Here we use the UCI repository, which lets users find and publish datasets. We downloaded the Internet Firewall Data dataset from this site (http://archive.ics.uci.edu/ml//datasets/Internet+Firewall+Data). I then uploaded it to my Google Drive.

### **Loading Dataset**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
csv_path="/content/drive/MyDrive/Colab Projects/Colab Notebooks/Project-01/log2.csv"
df=pd.read_csv(csv_path)   # Original dataset
df2=pd.read_csv(csv_path)  # Copy whose output column is converted to numerical
df3=pd.read_csv(csv_path)  # Spare copy, in case the original data is needed anywhere
df4=pd.read_csv(csv_path)  # Spare copy, in case the original data is needed anywhere

df is the original data.
df2 holds the data with the output label converted from categorical to numerical.
df3 is reserved for SMOTE.

In [4]:
dataMapping={"allow":3,
            "deny":2,
            "drop":1,
            "reset-both":0}
df2["Action"]=df2["Action"].map(dataMapping) # Action column was categorical; it is now numerical

## **Data Exploration and Analysis**

### **Displaying Dataframe**

In [5]:
df.head()

Unnamed: 0,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Action,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
0,57222,53,54587,53,allow,177,94,83,2,30,1,1
1,56258,3389,56258,3389,allow,4768,1600,3168,19,17,10,9
2,6881,50321,43265,50321,allow,238,118,120,2,1199,1,1
3,50553,3389,50553,3389,allow,3327,1438,1889,15,17,8,7
4,50002,443,45848,443,allow,25358,6778,18580,31,16,13,18


In [6]:
df2.head(5)

Unnamed: 0,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Action,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
0,57222,53,54587,53,3,177,94,83,2,30,1,1
1,56258,3389,56258,3389,3,4768,1600,3168,19,17,10,9
2,6881,50321,43265,50321,3,238,118,120,2,1199,1,1
3,50553,3389,50553,3389,3,3327,1438,1889,15,17,8,7
4,50002,443,45848,443,3,25358,6778,18580,31,16,13,18


**Dataset Dimension**

In [7]:
df.shape

(65532, 12)

In [8]:
df2.shape

(65532, 12)

**Dataset Information**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65532 entries, 0 to 65531
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Source Port           65532 non-null  int64 
 1   Destination Port      65532 non-null  int64 
 2   NAT Source Port       65532 non-null  int64 
 3   NAT Destination Port  65532 non-null  int64 
 4   Action                65532 non-null  object
 5   Bytes                 65532 non-null  int64 
 6   Bytes Sent            65532 non-null  int64 
 7   Bytes Received        65532 non-null  int64 
 8   Packets               65532 non-null  int64 
 9   Elapsed Time (sec)    65532 non-null  int64 
 10  pkts_sent             65532 non-null  int64 
 11  pkts_received         65532 non-null  int64 
dtypes: int64(11), object(1)
memory usage: 6.0+ MB


The output above shows that there are no missing values, since all 65532 entries are non-null in every column. If there had been missing values, we could have handled them in one of the following ways:
1) Remove the rows that contain missing values. This is not appropriate if too many values are missing or if the dataset is small.
2) Remove the whole attribute (feature/column). This is only reasonable if the attribute does not affect the output; if it is highly correlated with the output, this is not a good option either.
3) Fill the missing values with a substitute value, such as 0, the mean, or the mode.
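
The three options above can be sketched as follows. This is a minimal standard-library illustration on a made-up table: the `rows` data is hypothetical and is not taken from log2.csv. With pandas, the corresponding calls would be `dropna`, `drop(columns=...)`, and `fillna`.

```python
from statistics import mean

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"Bytes": 177, "Packets": 2},
    {"Bytes": None, "Packets": 19},
    {"Bytes": 238, "Packets": 2},
]

# Option 1: drop every row that contains a missing value.
rows_dropped = [r for r in rows if None not in r.values()]

# Option 2: drop every column that contains a missing value.
bad_cols = {k for r in rows for k, v in r.items() if v is None}
cols_dropped = [{k: v for k, v in r.items() if k not in bad_cols} for r in rows]

# Option 3: fill missing values with the mean of the observed values.
bytes_mean = mean(r["Bytes"] for r in rows if r["Bytes"] is not None)
rows_filled = [
    {**r, "Bytes": bytes_mean if r["Bytes"] is None else r["Bytes"]} for r in rows
]

print(len(rows_dropped))         # rows left after option 1
print(cols_dropped[0])           # columns left after option 2
print(rows_filled[1]["Bytes"])   # imputed value after option 3
```

Which option fits best depends on how many values are missing and how informative the affected column is.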

**Dataset Correlation Matrix:**

In [10]:
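cor1=df.select_dtypes(include="number").corr()  # skip the text Action column; newer pandas versions raise an error otherwise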
print("Correlation Between Different Features")
print(cor1)
cor2=df2.corr()
print("Correlation Between Different Features when output is also numerical")
print(cor2)

Correlation Between Different Features
                      Source Port  Destination Port  NAT Source Port  \
Source Port              1.000000         -0.332246         0.145391   
Destination Port        -0.332246          1.000000        -0.281676   
NAT Source Port          0.145391         -0.281676         1.000000   
NAT Destination Port    -0.024843          0.410042         0.178435   
Bytes                    0.000221         -0.005297         0.010659   
Bytes Sent              -0.000931          0.001675         0.002242   
Bytes Received           0.001950         -0.014684         0.020827   
Packets                 -0.001742         -0.006063         0.012633   
Elapsed Time (sec)      -0.046515          0.023537         0.141485   
pkts_sent               -0.001422         -0.002134         0.007180   
pkts_received           -0.001962         -0.010909         0.018772   

                      NAT Destination Port     Bytes  Bytes Sent  \
Source Port                 

**Describing Dataset**

In [11]:
df.describe()

Unnamed: 0,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
count,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0
mean,49391.969343,10577.385812,19282.972761,2671.04993,97123.95,22385.8,74738.15,102.866,65.833577,41.39953,61.466505
std,15255.712537,18466.027039,21970.689669,9739.162278,5618439.0,3828139.0,2463208.0,5133.002,302.461762,3218.871288,2223.332271
min,0.0,0.0,0.0,0.0,60.0,60.0,0.0,1.0,0.0,1.0,0.0
25%,49183.0,80.0,0.0,0.0,66.0,66.0,0.0,1.0,0.0,1.0,0.0
50%,53776.5,445.0,8820.5,53.0,168.0,90.0,79.0,2.0,15.0,1.0,1.0
75%,58638.0,15000.0,38366.25,443.0,752.25,210.0,449.0,6.0,30.0,3.0,2.0
max,65534.0,65535.0,65535.0,65535.0,1269359000.0,948477200.0,320881800.0,1036116.0,10824.0,747520.0,327208.0


In [12]:
df2.describe()

Unnamed: 0,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Action,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
count,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0,65532.0
mean,49391.969343,10577.385812,19282.972761,2671.04993,2.376625,97123.95,22385.8,74738.15,102.866,65.833577,41.39953,61.466505
std,15255.712537,18466.027039,21970.689669,9739.162278,0.794945,5618439.0,3828139.0,2463208.0,5133.002,302.461762,3218.871288,2223.332271
min,0.0,0.0,0.0,0.0,0.0,60.0,60.0,0.0,1.0,0.0,1.0,0.0
25%,49183.0,80.0,0.0,0.0,2.0,66.0,66.0,0.0,1.0,0.0,1.0,0.0
50%,53776.5,445.0,8820.5,53.0,3.0,168.0,90.0,79.0,2.0,15.0,1.0,1.0
75%,58638.0,15000.0,38366.25,443.0,3.0,752.25,210.0,449.0,6.0,30.0,3.0,2.0
max,65534.0,65535.0,65535.0,65535.0,3.0,1269359000.0,948477200.0,320881800.0,1036116.0,10824.0,747520.0,327208.0


From the summaries above, we can see that the features span very different ranges and are not normalized. So, we will scale the data to bring all features into a comparable range.

### **Segregating Dependent & Independent Variables**

In [13]:
x=df.drop(["Action"], axis=1)
print(x.shape)
y=df["Action"]
print(y.shape)

(65532, 11)
(65532,)


In [14]:
x2=df.drop(["Action"], axis=1)
print(x2.shape)
y2=df["Action"]
print(y2.shape)

(65532, 11)
(65532,)


In [15]:
x3=df.drop(["Action"], axis=1)
print(x3.shape)
y3=df["Action"]
print(y3.shape)

(65532, 11)
(65532,)


In [16]:
counter = Counter(y)
print("Values of y are:")
print(counter)
print("Values of y2 are:")
print(y2.value_counts())

Values of y are:
Counter({'allow': 37640, 'deny': 14987, 'drop': 12851, 'reset-both': 54})
Values of y2 are:
allow         37640
deny          14987
drop          12851
reset-both       54
Name: Action, dtype: int64


In [17]:
oversample = SMOTE()
x2, y2 = oversample.fit_resample(x2, y2)

In [18]:
print("Values of y2 by using Smote are:")
print(y2.value_counts())

Values of y2 by using Smote are:
allow         37640
drop          37640
deny          37640
reset-both    37640
Name: Action, dtype: int64
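
As a rough illustration of how SMOTE produces the extra minority samples counted above, the sketch below creates one synthetic point by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified standard-library sketch, not imblearn's implementation; the helper name `smote_sample` and the toy points are assumptions made for illustration.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Return one synthetic point interpolated between a random minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical 2-D minority-class points
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because every synthetic row is derived from existing rows, resampling before the train/test split lets information from rows that end up in the test set reach the training set. Resampling only the training split avoids this.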


In [19]:
#over = SMOTE(sampling_strategy=0.1)
#under = RandomUnderSampler(sampling_strategy=1)
#steps = [('o', over), ('u', under)]
#pipeline = Pipeline(steps=steps)
#x3, y3 = pipeline.fit_resample(x3, y3)

### **Splitting Training Data and Testing Data**

In [20]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=21)
print("Split data shapes: (x_train), (x_test), (y_train), (y_test):")
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print("Class counts in y_train:")
print(y_train.value_counts())
print("Class counts in y_test:")
print(y_test.value_counts())

Split data shapes: (x_train), (x_test), (y_train), (y_test):
(52425, 11) (13107, 11) (52425,) (13107,)
Class counts in y_train:
allow         30145
deny          11995
drop          10244
reset-both       41
Name: Action, dtype: int64
Class counts in y_test:
allow         7495
deny          2992
drop          2607
reset-both      13
Name: Action, dtype: int64


In [21]:
from sklearn.model_selection import train_test_split
x2_train, x2_test, y2_train, y2_test=train_test_split(x2,y2,test_size=0.2, random_state=21)
print("Split data shapes: (x2_train), (x2_test), (y2_train), (y2_test):")
print(x2_train.shape, x2_test.shape, y2_train.shape, y2_test.shape)
print("Class counts in y2_train:")
print(y2_train.value_counts())
print("Class counts in y2_test:")
print(y2_test.value_counts())

Split data shapes: (x2_train), (x2_test), (y2_train), (y2_test):
(120448, 11) (30112, 11) (120448,) (30112,)
Class counts in y2_train:
drop          30147
deny          30125
reset-both    30110
allow         30066
Name: Action, dtype: int64
Class counts in y2_test:
allow         7574
reset-both    7530
deny          7515
drop          7493
Name: Action, dtype: int64


In [22]:
y2_train.value_counts()

drop          30147
deny          30125
reset-both    30110
allow         30066
Name: Action, dtype: int64

In [23]:
#from sklearn.model_selection import StratifiedShuffleSplit

We could have used StratifiedShuffleSplit from sklearn.model_selection to keep the class proportions the same in the training and testing data. Suppose the dataset had only 25 deny entries and a random split placed all of them in the testing data. The model would then never see a deny example during training, so it could not learn to predict that class and would misclassify the deny entries.
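
The idea of a stratified split can be sketched with the standard library alone. The helper `stratified_split` and the toy labels below are hypothetical and assume every class has at least two rows; in scikit-learn, passing `stratify=y` to `train_test_split` gives a comparable result.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=21):
    """Split row indices so that each class keeps roughly the same
    share in the train and test sets (simplified stratified sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # at least one test row per class
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 50 "allow" rows and 5 "reset-both" rows
labels = ["allow"] * 50 + ["reset-both"] * 5
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With this split, the rare class appears in both sets in the same 80/20 proportion as the common class.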


In [24]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65532 entries, 0 to 65531
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Source Port           65532 non-null  int64
 1   Destination Port      65532 non-null  int64
 2   NAT Source Port       65532 non-null  int64
 3   NAT Destination Port  65532 non-null  int64
 4   Bytes                 65532 non-null  int64
 5   Bytes Sent            65532 non-null  int64
 6   Bytes Received        65532 non-null  int64
 7   Packets               65532 non-null  int64
 8   Elapsed Time (sec)    65532 non-null  int64
 9   pkts_sent             65532 non-null  int64
 10  pkts_received         65532 non-null  int64
dtypes: int64(11)
memory usage: 5.5 MB


## **FEATURE SCALING**

Two feature scaling methods from the Sklearn library are considered here.

1.   MinMaxScaler (Normalization)
Formula: (Value - Min) / (Max - Min)

2.   StandardScaler (Standardization)
Formula: (Value - Mean) / Std

We could have scaled the data before splitting as well. But if the original data without scaling is needed later, a new variable has to be introduced. So, new variables are introduced for the scaled data to avoid any inconvenience. Fitting the scaler on the training data only, as done below, also keeps test-set statistics out of the preprocessing.
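
The two formulas can be checked by hand on a few numbers. The sketch below uses only the standard library and hypothetical byte counts; it illustrates the formulas and is not a replacement for the scikit-learn scalers.

```python
from statistics import mean, pstdev

values = [60.0, 66.0, 168.0, 752.0]  # hypothetical byte counts

# Min-max normalization: (value - min) / (max - min), result lies in [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Standardization: (value - mean) / std, result has mean 0 and std 1.
# pstdev is the population standard deviation, which StandardScaler also uses.
mu, sigma = mean(values), pstdev(values)
standard = [(v - mu) / sigma for v in values]

print(minmax)
print(standard)
```

Min-max scaling is bounded but sensitive to extreme values such as the very large maxima in the Bytes columns, while standardization is unbounded and centres each feature at zero.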



In [25]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train_sc=scaler.transform(x_train)
x_test_sc=scaler.transform(x_test)

In [26]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x2_train)
x2_train_sc=scaler.transform(x2_train)
x2_test_sc=scaler.transform(x2_test)

In [27]:
x_train_sc

array([[ 0.92143407, -0.5492364 , -0.87816149, ..., -0.22065615,
        -0.01987704, -0.02876705],
       [-2.42006536,  0.04407232, -0.87816149, ..., -0.22065615,
        -0.01987704, -0.02876705],
       [ 0.00383689,  0.89314278,  0.82179396, ..., -0.1209097 ,
        -0.01987704, -0.02824988],
       ...,
       [ 0.56086279, -0.56901877,  0.27465411, ..., -0.14085899,
        -0.01506342, -0.02566404],
       [ 0.04234068, -0.56901877,  1.09575108, ...,  0.06528368,
        -0.01643874, -0.02669838],
       [ 0.08123804, -0.56901877,  1.42674085, ..., -0.13088434,
        -0.00681151, -0.02049237]])

In [28]:
x2_train_sc

array([[-1.75051984,  2.06169413, -0.54806817, ..., -0.14192491,
        -0.00777264, -0.02073311],
       [ 0.76785946, -0.79434991,  0.38284615, ..., -0.05781094,
        -0.00458889, -0.01557868],
       [-1.2697176 ,  0.10949173, -0.54806817, ..., -0.14192491,
        -0.00777264, -0.02073311],
       ...,
       [-0.9038019 , -0.62967576,  0.87543186, ..., -0.14192491,
        -0.006863  , -0.01987404],
       [-0.0672454 ,  0.45446642, -0.54806817, ..., -0.14192491,
        -0.00777264, -0.02073311],
       [ 0.89492937,  1.09101738, -0.54806817, ..., -0.14192491,
        -0.00777264, -0.02073311]])

## **IMPLEMENTING MODELS:**

## **K-Nearest Neighbors Model**

The model is implemented on four variants of the data: the original data, the original data balanced with SMOTE, the scaled data, and the scaled data balanced with SMOTE.
Hyperparameter tuning was done by iterating over K values only.
We could also choose a different distance formula to calculate the distance, such as Euclidean, Hamming, or Manhattan distance.
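
To make the distance options concrete, here is a minimal standard-library sketch of the three metrics together with a tiny majority-vote KNN. The function names and the toy data are hypothetical; in scikit-learn the metric is selected through the `metric` parameter of `KNeighborsClassifier`.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # fraction of positions that differ (mostly used for categorical features)
    return sum(x != y for x, y in zip(a, b)) / len(a)

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: distance(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["allow", "allow", "allow", "deny", "deny", "deny"]

print(knn_predict(train_x, train_y, [0.5, 0.5], k=3))
print(knn_predict(train_x, train_y, [5.5, 5.5], k=3, distance=manhattan))
```

Because all three metrics sum over raw feature differences, features with large ranges dominate the distance, which is why the scaled variants of the data are also evaluated.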

### **KNN Model Using Original Data**

In [29]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
i=1
while i<=10:
  classifier1=KNeighborsClassifier(n_neighbors=i)
  classifier1.fit(x_train, y_train)
  prediction1=classifier1.predict(x_test)
  print("Prediction Shape when, k=",i)
  print(prediction1.shape)
  print("Classification Report when, k=",i)
  print(classification_report(y_test, prediction1))
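  # Accuracy on the training data
  print("Classifier Score when, k=",i)
  print(classifier1.score(x_train,y_train))
  # NOTE: this compares the predictions with themselves, so it is always 1.0;
  # classifier1.score(x_test, y_test) would give the test accuracy.
  print("Classifier Score with Predictions, when k=",i)
  print(classifier1.score(x_test, prediction1))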
  i=i+1


Prediction Shape when, k= 1
(13107,)
Classification Report when, k= 1
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       1.00      0.23      0.38        13

    accuracy                           0.99     13107
   macro avg       0.99      0.80      0.84     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 1
0.9997329518359561
Classifier Score with Predictions, when k= 1
1.0
Prediction Shape when, k= 2
(13107,)
Classification Report when, k= 2
              precision    recall  f1-score   support

       allow       0.99      1.00      1.00      7495
        deny       0.98      0.98      0.98      2992
        drop       1.00      0.99      0.99      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.74      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 2
0.9968717215069146
Classifier Score with Predictions, when k= 2
1.0
Prediction Shape when, k= 3
(13107,)
Classification Report when, k= 3
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.75     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 3
0.9956318550309967
Classifier Score with Predictions, when k= 3
1.0
Prediction Shape when, k= 4
(13107,)
Classification Report when, k= 4
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 4
0.99517405817835
Classifier Score with Predictions, when k= 4
1.0
Prediction Shape when, k= 5
(13107,)
Classification Report when, k= 5
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.75     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 5
0.9950596089651884
Classifier Score with Predictions, when k= 5
1.0
Prediction Shape when, k= 6
(13107,)
Classification Report when, k= 6
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.75     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 6
0.9948688602765856
Classifier Score with Predictions, when k= 6
1.0
Prediction Shape when, k= 7
(13107,)
Classification Report when, k= 7
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 7
0.9943919885550787
Classifier Score with Predictions, when k= 7
1.0
Prediction Shape when, k= 8
(13107,)
Classification Report when, k= 8
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 8
0.994067715784454
Classifier Score with Predictions, when k= 8
1.0
Prediction Shape when, k= 9
(13107,)
Classification Report when, k= 9
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.99      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 9
0.9938960419647115
Classifier Score with Predictions, when k= 9
1.0
Prediction Shape when, k= 10
(13107,)
Classification Report when, k= 10
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.98      0.99      0.98      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.75      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when, k= 10
0.9937815927515499
Classifier Score with Predictions, when k= 10
1.0
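
The "Classifier Score with Predictions" value above is 1.0 for every k because `score(x_test, prediction1)` compares the model's predictions with themselves. The sketch below reproduces this effect with a hypothetical `accuracy` helper and made-up labels; measuring accuracy on held-out data requires the true labels, that is, `classifier1.score(x_test, y_test)`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true labels and model predictions
y_true = ["allow", "deny", "drop", "reset-both"]
y_pred = ["allow", "deny", "drop", "deny"]

print(accuracy(y_pred, y_pred))  # predictions vs. themselves -> 1.0
print(accuracy(y_true, y_pred))  # predictions vs. true labels -> 0.75
```

The test accuracy for each k is already visible in the accuracy row of the classification reports, which are computed against `y_test`.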


### **KNN Model When SMOTE Is Used on Original Data**

In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
i=1
while i<=10:
  classifier2=KNeighborsClassifier(n_neighbors=i)
  classifier2.fit(x2_train, y2_train)
  prediction2=classifier2.predict(x2_test)
  print("Prediction Shape when, k=",i)
  print(prediction2.shape)
  print("Classification Report when, k=",i)
  print(classification_report(y2_test, prediction2))
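  # Accuracy on the training data
  print("Classifier Score when, k=",i)
  print(classifier2.score(x2_train,y2_train))
  # NOTE: this compares the predictions with themselves, so it is always 1.0;
  # classifier2.score(x2_test, y2_test) would give the test accuracy.
  print("Classifier Score with Predictions, when k=",i)
  print(classifier2.score(x2_test, prediction2))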
  i=i+1

Prediction Shape when, k= 1
(30112,)
Classification Report when, k= 1
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7574
        deny       0.99      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      1.00      0.99      7530

    accuracy                           1.00     30112
   macro avg       1.00      1.00      1.00     30112
weighted avg       1.00      1.00      1.00     30112

Classifier Score when, k= 1
0.9996845111583422
Classifier Score with Predictions, when k= 1
1.0
Prediction Shape when, k= 2
(30112,)
Classification Report when, k= 2
              precision    recall  f1-score   support

       allow       0.99      1.00      1.00      7574
        deny       0.98      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      0.99      0.99      7530

    accuracy                           0.99     30112
   

### **KNN Model Using Original Data After Scaling**

In [31]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
i=1
while i<=10:
  classifier3=KNeighborsClassifier(n_neighbors=i)
  classifier3.fit(x_train_sc, y_train)
  prediction3=classifier3.predict(x_test_sc)
  print("Prediction Shape when, k=",i)
  print(prediction3.shape)
  print("Classification Report when, k=",i)
  print(classification_report(y_test, prediction3))
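  # Accuracy on the training data
  print("Classifier Score when, k=",i)
  print(classifier3.score(x_train_sc,y_train))
  # NOTE: this compares the predictions with themselves, so it is always 1.0;
  # classifier3.score(x_test_sc, y_test) would give the test accuracy.
  print("Classifier3 Score with Predictions3, when k=",i)
  print(classifier3.score(x_test_sc, prediction3))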
  i=i+1

Prediction Shape when, k= 1
(13107,)
Classification Report when, k= 1
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       1.00      0.23      0.38        13

    accuracy                           1.00     13107
   macro avg       1.00      0.81      0.84     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 1
0.9997329518359561
Classifier3 Score with Predictions3, when k= 1
1.0
Prediction Shape when, k= 2
(13107,)
Classification Report when, k= 2
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      0.99      0.99      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       0.99      1.00      1.00     13107

Classifier Score when, k= 2
0.9981688125894135
Classifier3 Score with Predictions3, when k= 2
1.0
Prediction Shape when, k= 3
(13107,)
Classification Report when, k= 3
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 3
0.9979017644253696
Classifier3 Score with Predictions3, when k= 3
1.0
Prediction Shape when, k= 4
(13107,)
Classification Report when, k= 4
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 4
0.9978063900810682
Classifier3 Score with Predictions3, when k= 4
1.0
Prediction Shape when, k= 5
(13107,)
Classification Report when, k= 5
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 5
0.9974821173104435
Classifier3 Score with Predictions3, when k= 5
1.0
Prediction Shape when, k= 6
(13107,)
Classification Report when, k= 6
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 6
0.9973867429661422
Classifier3 Score with Predictions3, when k= 6
1.0
Prediction Shape when, k= 7
(13107,)
Classification Report when, k= 7
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 7
0.9969289461134955
Classifier3 Score with Predictions3, when k= 7
1.0
Prediction Shape when, k= 8
(13107,)
Classification Report when, k= 8
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 8
0.9969098712446351
Classifier3 Score with Predictions3, when k= 8
1.0
Prediction Shape when, k= 9
(13107,)
Classification Report when, k= 9
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 9
0.9963567000476872
Classifier3 Score with Predictions3, when k= 9
1.0
Prediction Shape when, k= 10
(13107,)
Classification Report when, k= 10
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.99      0.99      2992
        drop       0.99      1.00      1.00      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           1.00     13107
   macro avg       0.75      0.75      0.75     13107
weighted avg       1.00      1.00      1.00     13107

Classifier Score when, k= 10
0.9963376251788268
Classifier3 Score with Predictions3, when k= 10
1.0


### **KNN Model when Data Is Balanced with SMOTE and Normalized**

In [32]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
for i in range(1, 11):  # try k = 1..10
  classifier4=KNeighborsClassifier(n_neighbors=i)
  classifier4.fit(x2_train_sc, y2_train)
  prediction4=classifier4.predict(x2_test_sc)
  print("Prediction Shape when, k=",i)
  print(prediction4.shape)
  print("Classification Report when, k=",i)
  print(classification_report(y2_test, prediction4))
  print("Classifier Score when, k=",i)
  # Accuracy on the training data.
  print(classifier4.score(x2_train_sc,y2_train))
  print("Classifier Score with Predictions, when k=",i)
  # Always 1.0: the predictions are scored against themselves; pass y2_test for the true test accuracy.
  print(classifier4.score(x2_test_sc, prediction4))

Prediction Shape when, k= 1
(30112,)
Classification Report when, k= 1
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7574
        deny       0.99      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      0.99      0.99      7530

    accuracy                           1.00     30112
   macro avg       1.00      1.00      1.00     30112
weighted avg       1.00      1.00      1.00     30112

Classifier Score when, k= 1
0.999701115834219
Classifier Score with Predictions, when k= 1
1.0
Prediction Shape when, k= 2
(30112,)
Classification Report when, k= 2
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7574
        deny       0.98      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      0.99      0.99      7530

    accuracy                           0.99     30112
   m

## **Logistic Regression Model**

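The model is fitted on four variants of the data: the original data, the original data balanced with SMOTE, the scaled data, and the scaled data balanced with SMOTE. Hyperparameter tuning was done by setting the penalty and the solver. I used l2 as the penalty and newton-cg as the solver. Other penalties (l1, elasticnet, none) and solvers (saga, sag, liblinear, etc.) are available, but only l2 with newton-cg was used here to see whether the result improves. Note that scikit-learn's LogisticRegression() already applies an l2 penalty by default (with the lbfgs solver), so the cells labelled as having no penalty use the default settings rather than an unpenalized model.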
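
The convergence warnings that appear in several cells of this section usually go away once the features are standardized and max_iter is raised. The following is a minimal sketch on synthetic data, not the firewall data: the dataset from make_classification, the value max_iter=1000, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the firewall features (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=21)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo)

# The scaler inside the pipeline is fitted on the training split only.
# LogisticRegression keeps its default L2 penalty; only solver and max_iter are set.
logreg_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="newton-cg", max_iter=1000))
logreg_demo.fit(Xd_train, yd_train)

# Score against the true labels, not against the model's own predictions.
train_acc = logreg_demo.score(Xd_train, yd_train)
test_acc = logreg_demo.score(Xd_test, yd_test)
print(f"train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

Putting the scaler in a pipeline keeps the test split out of the scaling statistics, which is one reasonable way to avoid leakage when comparing scaled and unscaled variants.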

### **Logistic Regression Model on Original Data Without Penalty**

In [33]:
from sklearn.linear_model import LogisticRegression
model5=LogisticRegression()  # defaults: penalty='l2', solver='lbfgs', max_iter=100
print(x_train.shape, y_train.shape)
model5.fit(x_train, y_train)
prediction5=model5.predict(x_test)
print(prediction5.shape)
print("Classification Report when Penality is not applied")
print(classification_report(y_test,prediction5))
print("Classifier Score when when Penality is not Applied")
print(model5.score(x_train,y_train))  # accuracy on the training data
print("Classifier Score with Predictions when Penlity is not applied")
# Always 1.0: the predictions are scored against themselves; pass y_test for the true test accuracy.
print(model5.score(x_test, prediction5))

(52425, 11) (52425,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(13107,)
Classification Report when Penality is not applied
              precision    recall  f1-score   support

       allow       0.99      0.99      0.99      7495
        deny       0.99      0.94      0.97      2992
        drop       0.93      1.00      0.96      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.98     13107
   macro avg       0.73      0.73      0.73     13107
weighted avg       0.98      0.98      0.98     13107

Classifier Score when when Penality is not Applied
0.9792274678111588
Classifier Score with Predictions when Penlity is not applied
1.0



### **Logistic Regression Model on Original Data With Penalty**

In [34]:
from sklearn.linear_model import LogisticRegression
model6_pen=LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')
model6_pen.fit(x_train, y_train)
prediction6=model6_pen.predict(x_test)
print(prediction6.shape)
print("Classification Report when Penality is applied")
print(classification_report(y_test,prediction6))
print("Classifier Score when when Penality is Applied")
print(model6_pen.score(x_train,y_train))
print("Classifier Score with Predictions when Penlity is applied")
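# Always 1.0: the predictions are scored against themselves; pass y_test for the true test accuracy.
print(model6_pen.score(x_test, prediction6))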


(13107,)
Classification Report when Penality is applied
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      0.97      0.98      2992
        drop       0.96      1.00      0.98      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.74      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when when Penality is Applied
0.9898712446351932
Classifier Score with Predictions when Penlity is applied
1.0



### **Logistic Regression Model on SMOTE Data Without Penalty**

In [35]:
from sklearn.linear_model import LogisticRegression
model7=LogisticRegression()
print(x2_train.shape, y2_train.shape)
model7.fit(x2_train, y2_train)
prediction7=model7.predict(x2_test)
print(prediction7.shape)
print("Classification Report when Penality is not applied")
print(classification_report(y2_test,prediction7))
print("Classifier Score when when Penality is not Applied")
print(model7.score(x2_train,y2_train))
print("Classifier Score with Predictions when Penlity is not applied")
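# Always 1.0: the predictions are scored against themselves; pass y2_test for the true test accuracy.
print(model7.score(x2_test, prediction7))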

(120448, 11) (120448,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(30112,)
Classification Report when Penality is not applied
              precision    recall  f1-score   support

       allow       0.82      0.99      0.89      7574
        deny       0.65      0.76      0.70      7515
        drop       0.95      1.00      0.97      7493
  reset-both       0.65      0.37      0.47      7530

    accuracy                           0.78     30112
   macro avg       0.77      0.78      0.76     30112
weighted avg       0.77      0.78      0.76     30112

Classifier Score when when Penality is not Applied
0.7739024309245484
Classifier Score with Predictions when Penlity is not applied
1.0


### **Logistic Regression Model on SMOTE Data With Penalty**

In [36]:
from sklearn.linear_model import LogisticRegression
model8_pen=LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')
model8_pen.fit(x2_train, y2_train)
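# NOTE: this predicts with model6_pen (fitted on the original data), not model8_pen,
# so the report below does not evaluate model8_pen.
prediction8=model6_pen.predict(x2_test)
print(prediction8.shape)
print("Classification Report when Penality is applied")
print(classification_report(y2_test,prediction8))
print("Classifier Score when when Penality is Applied")
print(model8_pen.score(x2_train,y2_train))
print("Classifier Score with Predictions when Penlity is applied")
# Agreement between model8_pen and the model6_pen predictions, not a test accuracy; pass y2_test for that.
print(model8_pen.score(x2_test, prediction8))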



(30112,)
Classification Report when Penality is applied


              precision    recall  f1-score   support

       allow       0.81      1.00      0.90      7574
        deny       0.55      0.97      0.70      7515
        drop       0.97      1.00      0.98      7493
  reset-both       0.00      0.00      0.00      7530

    accuracy                           0.74     30112
   macro avg       0.58      0.74      0.65     30112
weighted avg       0.58      0.74      0.65     30112

Classifier Score when when Penality is Applied
0.8638001461211477
Classifier Score with Predictions when Penlity is applied
0.7596307120085016


### **Logistic Regression Model on Scaled Original Data Without Penalty**

In [37]:
from sklearn.linear_model import LogisticRegression
model9=LogisticRegression()
print(x_train_sc.shape, y_train.shape)  # shape of the scaled data that is actually fitted
model9.fit(x_train_sc, y_train)
prediction9=model9.predict(x_test_sc)
print(prediction9.shape)
print("Classification Report when Penality is not applied")
print(classification_report(y_test,prediction9))
print("Classifier Score when when Penality is not Applied")
print(model9.score(x_train_sc,y_train))
print("Classifier Score with Predictions when Penlity is not applied")
# Always 1.0: the predictions are scored against themselves; pass y_test for the true test accuracy.
print(model9.score(x_test_sc, prediction9))

(52425, 11) (52425,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(13107,)
Classification Report when Penality is not applied
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7495
        deny       0.99      0.96      0.98      2992
        drop       0.95      1.00      0.98      2607
  reset-both       0.00      0.00      0.00        13

    accuracy                           0.99     13107
   macro avg       0.74      0.74      0.74     13107
weighted avg       0.99      0.99      0.99     13107

Classifier Score when when Penality is not Applied
0.9862470195517405
Classifier Score with Predictions when Penlity is not applied
1.0



### **Logistic Regression Model on Scaled Original Data With Penalty**

In [38]:
from sklearn.linear_model import LogisticRegression
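model10_pen=LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')
# NOTE: this fits on the scaled SMOTE data (x2_train_sc), not the scaled original data (x_train_sc).
model10_pen.fit(x2_train_sc, y2_train)
# NOTE: this predicts with model6_pen (fitted on the unscaled original data), not model10_pen,
# so the report below does not evaluate model10_pen.
prediction10=model6_pen.predict(x2_test_sc)
print(prediction10.shape)
print("Classification Report when Penality is applied")
print(classification_report(y2_test,prediction10))
print("Classifier Score when when Penality is Applied")
print(model10_pen.score(x2_train_sc,y2_train))
print("Classifier Score with Predictions when Penlity is applied")
# NOTE: this scores the unscaled x2_test against the model6_pen predictions;
# pass x2_test_sc and y2_test for the true test accuracy.
print(model10_pen.score(x2_test, prediction10))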



(30112,)
Classification Report when Penality is applied


              precision    recall  f1-score   support

       allow       1.00      0.23      0.37      7574
        deny       0.26      1.00      0.42      7515
        drop       0.00      0.00      0.00      7493
  reset-both       0.00      0.00      0.00      7530

    accuracy                           0.31     30112
   macro avg       0.32      0.31      0.20     30112
weighted avg       0.32      0.31      0.20     30112

Classifier Score when when Penality is Applied
0.826663788522848
Classifier Score with Predictions when Penlity is applied
0.44982066950053134




### **Logistic Regression Model on Scaled SMOTE Data Without Penalty**

In [39]:
from sklearn.linear_model import LogisticRegression
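model11=LogisticRegression()
print(x2_train.shape, y2_train.shape)
# NOTE: this fits on the unscaled SMOTE data; the scaled version is x2_train_sc.
model11.fit(x2_train, y2_train)
# NOTE: this predicts with model7, not model11, so the report below repeats the earlier model7 result.
prediction11=model7.predict(x2_test)
print(prediction11.shape)
print("Classification Report when Penality is not applied")
print(classification_report(y2_test,prediction11))
print("Classifier Score when when Penality is not Applied")
print(model11.score(x2_train,y2_train))
print("Classifier Score with Predictions when Penlity is not applied")
# Compares model11 with the model7 predictions; pass y2_test for the true test accuracy.
print(model11.score(x2_test, prediction11))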

(120448, 11) (120448,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(30112,)
Classification Report when Penality is not applied
              precision    recall  f1-score   support

       allow       0.82      0.99      0.89      7574
        deny       0.65      0.76      0.70      7515
        drop       0.95      1.00      0.97      7493
  reset-both       0.65      0.37      0.47      7530

    accuracy                           0.78     30112
   macro avg       0.77      0.78      0.76     30112
weighted avg       0.77      0.78      0.76     30112

Classifier Score when when Penality is not Applied
0.7739024309245484
Classifier Score with Predictions when Penlity is not applied
1.0


### **Logistic Regression Model on Scaled SMOTE Data With Penalty**

In [40]:
from sklearn.linear_model import LogisticRegression
model12=LogisticRegression()  # default settings (l2 penalty, lbfgs solver); newton-cg is not set here
print(x2_train_sc.shape, y2_train.shape)
model12.fit(x2_train_sc, y2_train)
prediction12=model12.predict(x2_test_sc)
print(prediction12.shape)
print("Classification Report when Penality is not applied")
print(classification_report(y2_test,prediction12))
print("Classifier Score when when Penality is not Applied")
print(model12.score(x2_train_sc,y2_train))
print("Classifier Score with Predictions when Penlity is not applied")
# Always 1.0: the predictions are scored against themselves; pass y2_test for the true test accuracy.
print(model12.score(x2_test_sc, prediction12))

(120448, 11) (120448,)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


(30112,)
Classification Report when Penality is not applied
              precision    recall  f1-score   support

       allow       1.00      0.98      0.99      7574
        deny       0.66      0.68      0.67      7515
        drop       0.96      1.00      0.98      7493
  reset-both       0.69      0.65      0.67      7530

    accuracy                           0.83     30112
   macro avg       0.83      0.83      0.83     30112
weighted avg       0.83      0.83      0.83     30112

Classifier Score when when Penality is not Applied
0.82675511424017
Classifier Score with Predictions when Penlity is not applied
1.0


## **Linear Regression Model**

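Plain LinearRegression has no regularization hyperparameters to tune. To tune a linear regression model, use a regularized variant such as Lasso, Ridge, or ElasticNet; SGDRegressor can also be used, since it exposes the penalty, alpha, and learning-rate settings. Here the target is the label-encoded Action column, so the regression treats the four classes as ordered numeric values.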
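
As an illustration of tuning a regularized linear model, the sketch below searches over the Ridge alpha value with cross-validation. It runs on synthetic regression data, not the firewall data; the data, the alpha grid, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption): y depends linearly on 3 of 5 features.
rng = np.random.default_rng(0)
X_reg = rng.normal(size=(300, 5))
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y_reg = X_reg @ true_coef + rng.normal(scale=0.1, size=300)

# Candidate regularization strengths; the grid itself is illustrative.
alpha_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
ridge_search = GridSearchCV(Ridge(), alpha_grid, cv=5, scoring="r2")
ridge_search.fit(X_reg, y_reg)

best_alpha = ridge_search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("cross-validated R^2:", round(ridge_search.best_score_, 3))
```

The same pattern should work with Lasso or ElasticNet by swapping the estimator and adjusting the parameter grid.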

### **LABELING DATA FROM CATEGORICAL DATA TO NUMERICAL DATA**

In [41]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
lab_enc=preprocessing.LabelEncoder()
# Fit the encoder once on the full label column, then reuse the same mapping for the splits.
encoded_y_df=lab_enc.fit_transform(y)
encoded_y=lab_enc.transform(y_train)
encoded_yt=lab_enc.transform(y_test)
oversample2 = SMOTE()
x5, y5 = oversample2.fit_resample(x, encoded_y_df)
print("Values of encoded_y2 by using Smote are:")
x5_train, x5_test, y5_train, y5_test=train_test_split(x5,y5, test_size=0.2, random_state=21)
# This counts y2 (the string-labelled SMOTE target from earlier); use Counter(y5) for the encoded labels.
counter2=Counter(y2)
print(counter2)

Values of encoded_y2 by using Smote are:
Counter({'allow': 37640, 'drop': 37640, 'deny': 37640, 'reset-both': 37640})


### **Linear Regression Model on Original Data**

In [42]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report
import seaborn as sns
from sklearn.metrics import r2_score
#sns.pairplot(df)
#sns.heatmap(df.corr())
#sns.heatmap(df2, annot=True)
LR1=LinearRegression()
LR1.fit(x_train, encoded_y)
print("LR1 Intercept is")
print(LR1.intercept_)
predictionx1=LR1.predict(x_test)
print(predictionx1.shape)
print(encoded_yt.shape)
print(x_test.shape)
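# Percentage error relative to the encoded label; labels encoded as 0 give inf/-inf (division by zero).
Accuracy1=(((predictionx1-encoded_yt)/encoded_yt)*100)
print(Accuracy1)
# Set after the division above, so it does not suppress that warning.
np.seterr(divide='ignore', invalid='ignore')
print("Model Score when on Original Data")
# For LinearRegression, score() returns R^2 (here on the training data).
print(LR1.score(x_train,encoded_y))
print("Model Score on Original Data with Prediction")
# Always 1.0: the predictions are scored against themselves; pass encoded_yt for the test R^2.
print(LR1.score(x_test, predictionx1))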

LR1 Intercept is
0.8223632235034649
(13107,)
(13107,)
(13107, 11)
[  2.91617975         -inf          inf ... -22.33083891 -38.53338038
   9.74859182]
Model Score when on Original Data
0.5015390773410934
Model Score on Original Data with Prediction
1.0



### **Linear Regression Model on SMOTE Data**

In [43]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report
import seaborn as sns
from sklearn.metrics import r2_score
#sns.pairplot(df)
#sns.heatmap(df.corr())
#sns.heatmap(df2, annot=True)
LR2=LinearRegression()
LR2.fit(x5_train, y5_train)
print("LR2 Intercept is")
print(LR2.intercept_)
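# NOTE: this predicts on x2_test (from the earlier SMOTE split) rather than x5_test,
# so predictionx2 does not correspond to x5_test or y5_test.
predictionx2=LR2.predict(x2_test)
print(predictionx2.shape)
print(y5_test.shape)
print(x5_test.shape)
# Percentage error relative to the encoded label; labels encoded as 0 give inf/-inf.
Accuracy2=(((predictionx2-y5_test)/y5_test)*100)
#print(Accuracy2)
np.seterr(divide='ignore', invalid='ignore')
r2_score1=r2_score(y5_test, predictionx2)
print("R2 Score When Smote is applied")
# NOTE: this prints the r2_score function object; print(r2_score1) would show the value.
print(r2_score)
#print("Classification Report ")
#print(classification_report(y5_test,predictionx2))
print("LR2 Score when when X5_train and y5_train")
print(LR2.score(x5_train,y5_train))
print("LR2 Score when when X5_test and Y5_test")
print(LR2.score(x5_test, y5_test))
print("LR2 Score when when X5_test and prediction")
# Not 1.0 here because predictionx2 was computed from x2_test, not x5_test.
print(LR2.score(x5_test, predictionx2))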

LR2 Intercept is
2.6204671218537
(30112,)
(30112,)
(30112, 11)
R2 Score When Smote is applied
<function r2_score at 0x7f685ced2040>
LR2 Score when when X5_train and y5_train
0.3460200558549067
LR2 Score when when X5_test and Y5_test
0.3519549008782137
LR2 Score when when X5_test and prediction
0.6036998767492849


## **DECISION TREE CLASSIFICATION MODEL**

### **Decision Tree Classification Model on Original Data**

In [44]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier()
classifier_tr1.fit(x_train, y_train)
prediction_tr1=classifier_tr1.predict(x_test)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y_test, prediction_tr1))
print("Claasifier Score of xtrain and ytrain is")
print(classifier_tr1.score(x_train,y_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x_test,y_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x_test, prediction_tr1))

from sklearn.tree import plot_tree, export_text
plt.figure(figsize =(80,20))
plot_tree(classifier_tr1, feature_names=x_train.columns, max_depth=2, filled=True);


Prediction Shape is:
(13107,)
F1 Score is
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.50      0.23      0.32        13

    accuracy                           1.00     13107
   macro avg       0.87      0.81      0.83     13107
weighted avg       1.00      1.00      1.00     13107

Claasifier Score of xtrain and ytrain is
0.9997329518359561
Claasifier Score of xtest and ytest is
0.9976348516060121
Claasifier Score of xtest and prediction is
1.0


### **Decision Tree Classification Model on SMOTE Data**

In [45]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier()
classifier_tr1.fit(x2_train, y2_train)
prediction_tr1=classifier_tr1.predict(x2_test)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y2_test, prediction_tr1))
print("Claasifier Score of x2train and y2train is")
print(classifier_tr1.score(x2_train,y2_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x2_test,y2_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x2_test, prediction_tr1))

Prediction Shape is:
(30112,)
F1 Score is
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7574
        deny       0.99      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      0.99      0.99      7530

    accuracy                           1.00     30112
   macro avg       1.00      1.00      1.00     30112
weighted avg       1.00      1.00      1.00     30112

Claasifier Score of x2train and y2train is
0.9997177205100957
Claasifier Score of xtest and ytest is
0.996048087141339
Claasifier Score of xtest and prediction is
1.0


### **Decision Tree Classification Model on Original Scaled Data**

In [46]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier()
classifier_tr1.fit(x_train_sc, y_train)
prediction_tr1=classifier_tr1.predict(x_test_sc)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y_test, prediction_tr1))
print("Claasifier Score of xtrain and ytrain is")
print(classifier_tr1.score(x_train_sc,y_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x_test_sc,y_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x_test_sc, prediction_tr1))

Prediction Shape is:
(13107,)
F1 Score is
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7495
        deny       0.99      1.00      0.99      2992
        drop       1.00      1.00      1.00      2607
  reset-both       0.50      0.23      0.32        13

    accuracy                           1.00     13107
   macro avg       0.87      0.81      0.83     13107
weighted avg       1.00      1.00      1.00     13107

Claasifier Score of xtrain and ytrain is
0.9997329518359561
Claasifier Score of xtest and ytest is
0.9976348516060121
Claasifier Score of xtest and prediction is
1.0


### **Decision Tree Classification Model on SMOTE Scaled Data**

In [47]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier()
classifier_tr1.fit(x2_train_sc, y2_train)
prediction_tr1=classifier_tr1.predict(x2_test_sc)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y2_test, prediction_tr1))
print("Claasifier Score of x2train and y2train is")
print(classifier_tr1.score(x2_train_sc,y2_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x2_test_sc,y2_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x2_test_sc, prediction_tr1))

Prediction Shape is:
(30112,)
F1 Score is
              precision    recall  f1-score   support

       allow       1.00      1.00      1.00      7574
        deny       0.99      0.99      0.99      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.99      0.99      0.99      7530

    accuracy                           1.00     30112
   macro avg       1.00      1.00      1.00     30112
weighted avg       1.00      1.00      1.00     30112

Claasifier Score of x2train and y2train is
0.9997177205100957
Claasifier Score of xtest and ytest is
0.9961477151965994
Claasifier Score of xtest and prediction is
1.0


### **Decision Tree Classification Model (Hyperparameter Tuning, Setting Max_Depth)**

Here max_depth is restricted to show how it changes the result; the cell below uses max_depth=3. Depths of 20, 40, 60, 80, and 90 were also tested, and the result improves as the depth increases. random_state can also be changed as required.
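
The effect of max_depth can be checked by fitting the same tree at several depths and comparing training and test accuracy. The following is a minimal sketch on synthetic data, not the firewall data; the dataset from make_classification, the depth values, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the firewall features (assumption).
X_dt, y_dt = make_classification(
    n_samples=2000, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=25)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_dt, y_dt, test_size=0.2, random_state=25)

depth_scores = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow without a depth limit
    depth_clf = DecisionTreeClassifier(max_depth=depth, random_state=25)
    depth_clf.fit(Xt_train, yt_train)
    # Store (training accuracy, test accuracy), both against the true labels.
    depth_scores[depth] = (depth_clf.score(Xt_train, yt_train),
                           depth_clf.score(Xt_test, yt_test))
    print(f"max_depth={depth}: train={depth_scores[depth][0]:.3f} "
          f"test={depth_scores[depth][1]:.3f}")
```

Comparing the two numbers per depth is what matters: a training accuracy that keeps rising while the test accuracy stalls suggests the extra depth is only memorizing the training data.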

In [48]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier(max_depth=3, random_state=25)
classifier_tr1.fit(x2_train_sc, y2_train)
prediction_tr1=classifier_tr1.predict(x2_test_sc)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y2_test, prediction_tr1))
print("Claasifier Score of x2train and y2train is")
print(classifier_tr1.score(x2_train_sc,y2_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x2_test_sc,y2_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x2_test_sc, prediction_tr1))

Prediction Shape is:
(30112,)
F1 Score is
              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7574
        deny       0.82      0.84      0.83      7515
        drop       1.00      1.00      1.00      7493
  reset-both       0.84      0.83      0.83      7530

    accuracy                           0.91     30112
   macro avg       0.91      0.91      0.91     30112
weighted avg       0.91      0.91      0.91     30112

Claasifier Score of x2train and y2train is
0.9107996811902231
Claasifier Score of xtest and ytest is
0.91451912858661
Claasifier Score of xtest and prediction is
1.0


### **Decision Tree Classification Model (Hyperparameter Tuning, Setting Max_Leaf_Nodes)**

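Here max_leaf_nodes is set to a low value to show how it changes the result. Different values were tested, and the result improves with more leaf nodes. With max_leaf_nodes=3 the tree has at most three leaves, so it can predict at most three of the four classes. random_state can also be changed as required.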
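
The link between the number of leaves and the number of classes a tree can predict can be checked directly. The following is a minimal sketch on synthetic data, not the firewall data; the dataset, the leaf counts, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the firewall features (assumption).
X_leaf, y_leaf = make_classification(
    n_samples=2000, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=5)

leaf_results = {}
for n_leaves in [3, 8, 50]:
    leaf_clf = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=5)
    leaf_clf.fit(X_leaf, y_leaf)
    # Each leaf predicts one class, so the number of distinct predicted
    # classes cannot exceed the number of leaves.
    predicted_classes = np.unique(leaf_clf.predict(X_leaf))
    leaf_results[n_leaves] = (leaf_clf.get_n_leaves(), len(predicted_classes))
    print(f"max_leaf_nodes={n_leaves}: leaves={leaf_results[n_leaves][0]} "
          f"distinct predicted classes={leaf_results[n_leaves][1]}")
```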

In [49]:
from sklearn import tree
from sklearn.metrics import classification_report
classifier_tr1=tree.DecisionTreeClassifier(max_leaf_nodes=3, random_state=5)
classifier_tr1.fit(x2_train_sc, y2_train)
prediction_tr1=classifier_tr1.predict(x2_test_sc)
print("Prediction Shape is:")
print(prediction_tr1.shape)
print("F1 Score is")
print(classification_report(y2_test, prediction_tr1))
print("Claasifier Score of x2train and y2train is")
print(classifier_tr1.score(x2_train_sc,y2_train))
print("Claasifier Score of xtest and ytest is")
print(classifier_tr1.score(x2_test_sc,y2_test))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(classifier_tr1.score(x2_test_sc, prediction_tr1))

Prediction Shape is:
(30112,)
F1 Score is


              precision    recall  f1-score   support

       allow       1.00      0.99      1.00      7574
        deny       0.00      0.00      0.00      7515
        drop       0.94      1.00      0.97      7493
  reset-both       0.51      1.00      0.68      7530

    accuracy                           0.75     30112
   macro avg       0.61      0.75      0.66     30112
weighted avg       0.61      0.75      0.66     30112

Claasifier Score of x2train and y2train is
0.7465213204038257
Claasifier Score of xtest and ytest is
0.747442879914984
Claasifier Score of xtest and prediction is
1.0


## **RANDOM FOREST MODEL**

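Hyperparameter tuning of a random forest involves changing the number of estimators (n_estimators, 100 by default) and settings such as max_features, min_samples_split, min_samples_leaf, and min_impurity_decrease. I will use only a few of these. The cells in this section fit a RandomForestRegressor on the label-encoded target, so the reported scores are R-squared values.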
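
Since the task is classification, the same comparison can also be made with RandomForestClassifier, which reports accuracy instead of R-squared. The following is a minimal sketch on synthetic data; note that it swaps in the classifier rather than the RandomForestRegressor used in this notebook, and the dataset, the parameter values, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the firewall features (assumption).
X_rf, y_rf = make_classification(
    n_samples=1500, n_features=11, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=0, stratify=y_rf)

rf_scores = {}
for n_trees in [10, 100]:
    rf_demo = RandomForestClassifier(
        n_estimators=n_trees,   # number of trees in the forest
        max_features="sqrt",    # features considered at each split
        min_samples_leaf=2,     # minimum samples required in a leaf
        random_state=0)
    rf_demo.fit(Xr_train, yr_train)
    # Test accuracy against the true labels.
    rf_scores[n_trees] = rf_demo.score(Xr_test, yr_test)
    print(f"n_estimators={n_trees}: test accuracy={rf_scores[n_trees]:.3f}")
```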

### **Random Forest Model on Original Data**

In [50]:
from sklearn.ensemble import RandomForestRegressor
RF_model=RandomForestRegressor(n_estimators=100)
RF_model.fit(x_train, encoded_y)
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
predictions1=RF_model.predict(x_test)
error=metrics.r2_score(encoded_yt, predictions1)
print("R Squared Error:" , error)
encoded_yt.shape
predictions1.shape
#y3=np.encoded_yt
from sklearn.metrics import r2_score
print(r2_score(encoded_yt, predictions1))
print("F1 Score is")
#print(classification_report(encoded_yt, predictions1))
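r2_score1=r2_score(encoded_yt, predictions1)
# NOTE: this prints the r2_score function object; print(r2_score1) would show the value.
print(r2_score)
print("Claasifier Score of xtrain and ytrain is")
# RandomForestRegressor.score() returns R^2, not classification accuracy.
print(RF_model.score(x_train,encoded_y))
print("Claasifier Score of xtest and ytest is")
print(RF_model.score(x_test,encoded_yt))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(RF_model.score(x_test, predictions1))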

R Squared Error: 0.995275764767684
0.995275764767684
F1 Score is
<function r2_score at 0x7f685ced2040>
Claasifier Score of xtrain and ytrain is
0.9991030272209213
Claasifier Score of xtest and ytest is
0.995275764767684
Claasifier Score of xtest and prediction is
1.0


### **Random Forest Model on Original Data (Setting N_Estimators=10)**

In [51]:
from sklearn.ensemble import RandomForestRegressor
RF_model=RandomForestRegressor(n_estimators=10)
RF_model.fit(x_train, encoded_y)
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
predictions1=RF_model.predict(x_test)
error=metrics.r2_score(encoded_yt, predictions1)
print("R Squared Error:" , error)
encoded_yt.shape
predictions1.shape
#y3=np.encoded_yt
from sklearn.metrics import r2_score
print(r2_score(encoded_yt, predictions1))
print("F1 Score is")
#print(classification_report(encoded_yt, predictions1))
r2_score1=r2_score(encoded_yt, predictions1)
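# NOTE: this prints the r2_score function object; print(r2_score1) would show the value.
print(r2_score)
print("Claasifier Score of xtrain and ytrain is")
print(RF_model.score(x_train,encoded_y))
print("Claasifier Score of xtest and ytest is")
print(RF_model.score(x_test,encoded_yt))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(RF_model.score(x_test, predictions1))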

R Squared Error: 0.9945512209977694
0.9945512209977694
F1 Score is
<function r2_score at 0x7f685ced2040>
Claasifier Score of xtrain and ytrain is
0.9989566489841954
Claasifier Score of xtest and ytest is
0.9945512209977694
Claasifier Score of xtest and prediction is
1.0


### **Random Forest Model on Original Data (Setting N_Estimators=1000)**

In [52]:
from sklearn.ensemble import RandomForestRegressor
RF_model=RandomForestRegressor(n_estimators=1000)
RF_model.fit(x_train, encoded_y)
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
predictions1=RF_model.predict(x_test)
error=metrics.r2_score(encoded_yt, predictions1)
print("R Squared Error:" , error)
encoded_yt.shape
predictions1.shape
#y3=np.encoded_yt
from sklearn.metrics import r2_score
print(r2_score(encoded_yt, predictions1))
print("F1 Score is")
#print(classification_report(encoded_yt, predictions1))
r2_score1=r2_score(encoded_yt, predictions1)
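# NOTE: this prints the r2_score function object; print(r2_score1) would show the value.
print(r2_score)
print("Claasifier Score of xtrain and ytrain is")
print(RF_model.score(x_train,encoded_y))
print("Claasifier Score of xtest and ytest is")
print(RF_model.score(x_test,encoded_yt))
print("Claasifier Score of xtest and prediction is")
# Always 1.0: the predictions are scored against themselves.
print(RF_model.score(x_test, predictions1))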

R Squared Error: 0.9952620681810246
0.9952620681810246
F1 Score is
<function r2_score at 0x7f685ced2040>
Claasifier Score of xtrain and ytrain is
0.9991621087076712
Claasifier Score of xtest and ytest is
0.9952620681810246
Claasifier Score of xtest and prediction is
1.0


### **Random Forest Model on Original Scaled Data (Setting N_Estimators=10)**

In [53]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import r2_score

# Regressor fitted on the scaled inputs, so every score below is R2, not accuracy or F1
RF_model=RandomForestRegressor(n_estimators=10)
RF_model.fit(x_train_sc, encoded_y)
predictions1=RF_model.predict(x_test_sc)
error=metrics.r2_score(encoded_yt, predictions1)
print("R Squared Error:" , error)
print(r2_score(encoded_yt, predictions1))
print("R2 Score is")
r2_score1=r2_score(encoded_yt, predictions1)
print(r2_score1)  # print the computed value, not the r2_score function object
print("R2 Score of xtrain and ytrain is")
print(RF_model.score(x_train_sc,encoded_y))
print("R2 Score of xtest and ytest is")
print(RF_model.score(x_test_sc,encoded_yt))
# Scoring the predictions against themselves is 1.0 by construction
print("R2 Score of xtest and prediction is")
print(RF_model.score(x_test_sc, predictions1))

R Squared Error: 0.993774056162416
0.993774056162416
R2 Score is
0.993774056162416
R2 Score of xtrain and ytrain is
0.9989886675176806
R2 Score of xtest and ytest is
0.993774056162416
R2 Score of xtest and prediction is
1.0
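
The random forest here is a `RandomForestRegressor`, so its `.score()` is the coefficient of determination (R2), not classification accuracy. The sketch below spells out that formula in plain Python to show why scoring predictions against themselves always returns 1.0. The label and prediction values are made up for illustration and are not taken from the firewall dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical encoded labels and continuous regressor outputs
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0.1, 0.9, 2.0, 0.0, 1.2, 1.8]

score_vs_truth = r2(y_true, y_pred)   # how well the predictions explain the labels
score_vs_self = r2(y_pred, y_pred)    # residuals are all zero, so this is exactly 1.0

print("R2 against true labels:", score_vs_truth)
print("R2 against own predictions:", score_vs_self)
```

Only the score against the true test labels says anything about model quality. The self-comparison is at most a sanity check.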


## **SUPPORT VECTOR MACHINE MODEL**

Hyperparameter tuning of an SVM involves changing the kernel, the C value, the gamma value, and so on. The kernel can be set to linear, poly, rbf, or sigmoid; the default is rbf. C is a positive float that controls the regularization strength; its default is 1.0. Gamma defaults to scale and can also be set to auto or to a float.
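
To make the kernel choices concrete, here is a minimal plain-Python sketch of the four kernel functions as scikit-learn documents them. The helper names and the two sample vectors are hypothetical; `degree=3` and `coef0=0.0` mirror the `SVC` defaults. It only evaluates the kernels and does not train an SVM.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def poly_kernel(x, y, gamma, coef0=0.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

# Hypothetical two-feature samples
x = [1.0, 2.0]
y = [0.5, -1.0]
gamma = 1.0 / len(x)  # the value gamma='auto' would use: 1 / n_features

k_linear = linear_kernel(x, y)
k_poly = poly_kernel(x, y, gamma)
k_rbf = rbf_kernel(x, y, gamma)
k_sigmoid = sigmoid_kernel(x, y, gamma)
print(k_linear, k_poly, k_rbf, k_sigmoid)
```

Every kernel except the linear one depends on gamma, which is one reason feature scaling matters for poly, rbf, and sigmoid.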

### **SVM Model on Original Data**

In [54]:
from sklearn import svm
classifier_sv=svm.SVC(kernel='linear')
classifier_sv.fit(x_train, y_train)

In [None]:
prediction_sv1=classifier_sv.predict(x_test)
y_test.shape
prediction_sv1.shape
print("F1 Score is")
print(classification_report(y_test, prediction_sv1))
print("Classifier Score of xtrain and ytrain is")
print(classifier_sv.score(x_train,y_train))
print("Classifier Score of xtest and ytest is")
print(classifier_sv.score(x_test,y_test))
# Scoring the predictions against themselves is 1.0 by construction
print("Classifier Score of xtest and prediction is")
print(classifier_sv.score(x_test, prediction_sv1))

### **SVM Model on SMOTE Data**

In [None]:
from sklearn import svm
classifier_sv=svm.SVC(kernel='linear')
classifier_sv.fit(x2_train, y2_train)

In [None]:
prediction_sv1=classifier_sv.predict(x2_test)
y2_test.shape
prediction_sv1.shape
print("F1 Score is")
print(classification_report(y2_test, prediction_sv1))
print("Classifier Score of xtrain and ytrain is")
print(classifier_sv.score(x2_train,y2_train))
print("Classifier Score of xtest and ytest is")
print(classifier_sv.score(x2_test,y2_test))
# Scoring the predictions against themselves is 1.0 by construction
print("Classifier Score of xtest and prediction is")
print(classifier_sv.score(x2_test, prediction_sv1))

### **SVM Model on Original Scaled Data**

In [None]:
from sklearn import svm
classifier_sv=svm.SVC(kernel='linear')
classifier_sv.fit(x_train_sc, y_train)

In [None]:
prediction_sv1=classifier_sv.predict(x_test_sc)
y_test.shape
prediction_sv1.shape
print("F1 Score is")
print(classification_report(y_test, prediction_sv1))
print("Classifier Score of xtrain and ytrain is")
print(classifier_sv.score(x_train_sc,y_train))
print("Classifier Score of xtest and ytest is")
print(classifier_sv.score(x_test_sc,y_test))
# Scoring the predictions against themselves is 1.0 by construction
print("Classifier Score of xtest and prediction is")
print(classifier_sv.score(x_test_sc, prediction_sv1))

### **SVM Model on SMOTE Data and setting Kernel=sigmoid**

In [None]:
from sklearn import svm
classifier_sv=svm.SVC(kernel='sigmoid')
classifier_sv.fit(x2_train, y2_train)  # SMOTE split, to match the section heading

In [None]:
prediction_sv1=classifier_sv.predict(x2_test)
y2_test.shape
prediction_sv1.shape
print("F1 Score is")
print(classification_report(y2_test, prediction_sv1))
print("Classifier Score of xtrain and ytrain is")
print(classifier_sv.score(x2_train,y2_train))
print("Classifier Score of xtest and ytest is")
print(classifier_sv.score(x2_test,y2_test))
# Scoring the predictions against themselves is 1.0 by construction
print("Classifier Score of xtest and prediction is")
print(classifier_sv.score(x2_test, prediction_sv1))

### **SVM Model on SMOTE Data and setting Kernel=poly**

In [None]:
from sklearn import svm
classifier_sv=svm.SVC(kernel='poly')
classifier_sv.fit(x2_train, y2_train)  # SMOTE split, to match the section heading

In [None]:
prediction_sv1=classifier_sv.predict(x2_test)
y2_test.shape
prediction_sv1.shape
print("F1 Score is")
print(classification_report(y2_test, prediction_sv1))
print("Classifier Score of xtrain and ytrain is")
print(classifier_sv.score(x2_train,y2_train))
print("Classifier Score of xtest and ytest is")
print(classifier_sv.score(x2_test,y2_test))
# Scoring the predictions against themselves is 1.0 by construction
print("Classifier Score of xtest and prediction is")
print(classifier_sv.score(x2_test, prediction_sv1))

### **EXAMPLE FOR ALLOW DATA**

In [None]:
example=[[60513, 47094, 45469, 47094, 320, 140, 180, 6, 7,3,3]]
scaled_example=scaler.transform(example)  # use this for any model that was fitted on scaled inputs
print("Prediction by KNN Model:")
print(classifier3.predict(example)[0])
print("Prediction by Logistic Regression Model:")
print(model7.predict(example)[0])
print("Prediction by Linear Regression Model:")
print(LR2.predict(example)[0])
print("Prediction by Decision Tree Model:")
print(classifier_tr1.predict(example)[0])
print("Prediction by Random Forest Model:")
print(RF_model.predict(scaled_example)[0])  # RF_model was last fitted on x_train_sc
print("Prediction by SVM Model:")
print(classifier_sv.predict(example)[0])

In [None]:
example2=[[60513, 47, 0, 0, 320, 140, 1, 0, 1,3,1]]
scaled_example2=scaler.transform(example2)  # use this for any model that was fitted on scaled inputs
print("Prediction by KNN Model:")
print(classifier3.predict(example2)[0])
print("Prediction by Logistic Regression Model:")
print(model7.predict(example2)[0])
print("Prediction by Linear Regression Model:")
print(LR2.predict(example2)[0])
print("Prediction by Decision Tree Model:")
print(classifier_tr1.predict(example2)[0])
print("Prediction by Random Forest Model:")
print(RF_model.predict(scaled_example2)[0])  # RF_model was last fitted on x_train_sc
print("Prediction by SVM Model:")
print(classifier_sv.predict(example2)[0])

# **IMPLEMENTING PIPELINE:**

## **WARNING FILTER**

In [None]:
#Warning Filter,
import warnings
warnings.filterwarnings("ignore")

## **IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
# Pipeline is imported from imblearn below; sklearn's Pipeline cannot hold a SMOTE sampler step
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import imblearn
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.svm import SVC
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn import preprocessing
from sklearn import utils

## **SEGREGATING INPUT & OUTPUT (LABELS)**

In [None]:
x9=df.drop(["Action"], axis=1)

In [None]:
y9=df["Action"]

## **SPLITTING DATA INTO TRAIN AND TEST PART**

In [None]:
x9_train,x9_test, y9_train, y9_test=train_test_split(x9,y9,test_size=0.2, random_state=21)

## **LABELING CATEGORICAL DATA INTO NUMERICAL**

In [None]:
lab_enc=preprocessing.LabelEncoder()
lab_enc.fit(y9)  # fit once so every split shares the same label-to-integer mapping
encoded_y9_df=lab_enc.transform(y9)
encoded_y9=lab_enc.transform(y9_train)
encoded_y9t=lab_enc.transform(y9_test)
np.set_printoptions(threshold=np.inf)
print(encoded_y9)

[2 1 0 0 0 2 1 0 1 2 0 1 0 0 0 1 2 0 0 0 1 0 1 1 1 1 1 0 0 1 1 0 2 0 0 0 2
 0 0 0 0 0 0 1 2 2 0 2 0 0 0 2 2 2 0 2 1 1 2 2 2 0 1 2 0 0 0 2 0 2 0 2 0 0
 0 0 2 0 2 2 0 1 1 1 0 0 1 0 1 1 0 0 2 0 2 2 2 2 0 0 0 0 2 1 0 0 1 0 0 2 0
 0 1 0 0 0 0 2 0 2 0 0 2 0 2 1 0 2 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 2 0 0
 0 1 0 0 0 2 0 1 2 0 0 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 2 1 0 1 0 2 0 1 0 0 2
 2 1 0 0 2 1 0 0 0 0 0 0 2 2 2 0 2 0 2 0 0 0 1 0 1 2 0 0 0 0 0 1 1 0 0 2 0
 0 1 0 2 0 0 2 1 2 0 0 0 2 2 2 1 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 1 2 0 0 2 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 0 2 0
 1 0 1 0 0 0 0 2 1 0 1 0 2 1 1 1 0 0 1 1 1 2 1 0 0 1 0 0 1 0 0 2 1 0 0 2 0
 1 0 2 0 1 0 2 0 1 0 1 0 0 1 1 2 2 0 1 2 0 0 2 0 0 0 1 1 0 0 2 1 1 0 1 0 0
 0 2 0 2 1 2 0 0 0 1 0 2 0 0 1 0 2 2 2 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 1
 0 1 1 0 1 2 2 0 0 2 0 0 0 1 2 0 2 2 0 2 0 2 0 2 0 0 1 0 2 1 0 1 0 0 0 0 0
 0 2 0 0 0 0 0 1 1 0 0 0 0 0 0 2 1 0 0 1 2 2 0 0 0 0 2 0 1 2 2 0 0 0 0 2 0
 0 2 1 1 0 2 2 0 1 1 2 1 
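
Integer codes are only comparable between the train and test splits when both are encoded with one shared mapping. The sketch below is a plain-Python stand-in for the sorted-classes behaviour of `LabelEncoder`, not the scikit-learn class itself. The function names and the tiny label lists are hypothetical, and the test list deliberately has no "deny" rows.

```python
def fit_classes(labels):
    """Mimic LabelEncoder.fit: the sorted unique labels define the codes."""
    return sorted(set(labels))

def transform(labels, classes):
    """Map each label to its position in the fitted class list."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels]

train = ["allow", "deny", "drop", "reset-both", "allow"]
test = ["allow", "drop", "drop"]  # hypothetical split that happens to lack "deny"

shared = transform(test, fit_classes(train))  # codes from the training mapping
refit = transform(test, fit_classes(test))    # codes from a mapping refitted on the test labels

print("shared mapping:", shared)
print("refitted mapping:", refit)
```

With the shared mapping "drop" keeps the code it has in training. After refitting on the test labels it silently takes the code that means "deny" in training, so scores computed against those labels would be wrong.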

## **DEFINING PIPELINE FOR KNN MODEL:**

In [None]:
KNNPipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", KNeighborsClassifier())])

## **DEFINING PIPELINE FOR LOGISTIC REGRESSION MODEL:**

In [None]:
LogisticRegressionPipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", LogisticRegression())])

## **DEFINING PIPELINE FOR LINEAR REGRESSION MODEL:**

In [None]:
LinearRegressionPipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", LinearRegression())])

## **DEFINING PIPELINE FOR DECISION TREE MODEL:**

In [None]:
TreePipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", DecisionTreeClassifier())])

## **DEFINING PIPELINE FOR RANDOM FOREST REGRESSION MODEL:**


In [None]:
RandomForestPipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", RandomForestRegressor())])

## **DEFINING PIPELINE FOR SVM MODEL:**


In [None]:
SVMPipeline=Pipeline([("myscaler", StandardScaler()),
                    ("mysmote", SMOTE()),
                    ("myModel", SVC())])

## **DEFINING MYPIPELINE AND ADDING ALL MODELS IN IT:**


In [None]:
mypipeline=[KNNPipeline, LogisticRegressionPipeline, LinearRegressionPipeline, TreePipeline, RandomForestPipeline, SVMPipeline]

## **DEFINING INITIAL SCORE AS ZERO**


In [None]:
accuracy=0.0
classifier=0
pipeline=""

## **ASSIGNING NUMERICAL VALUE TO EACH MODEL VARIABLE, AND USING FOR LOOP TO FIT PIPELINE TO ALL MODELS:**

In [None]:
PipelineDict={0: "KNNPipeline", 1:"LogisticRegressionPipeline", 2:"LinearRegressionPipeline", 3:"TreePipeline", 4: "RandomForestPipeline", 5: "SVMPipeline"}
for mypipe in mypipeline:
  mypipe.fit(x9_train, encoded_y9)

## **USING FOR LOOP TO FIND SCORE OF ALL MODELS:**

In [None]:
for i, model in enumerate(mypipeline):
  # .score() is accuracy for the classifiers but R2 for LinearRegression and RandomForestRegressor
  print("{} Test Score: {}".format(PipelineDict[i], model.score(x9_test, encoded_y9t)))

KNNPipeline Test Score: 0.9861905851834898
LogisticRegressionPipeline Test Score: 0.9205767910276951
LinearRegressionPipeline Test Score: -0.3302133370851612
TreePipeline Test Score: 0.9970244907301442
RandomForestPipeline Test Score: 0.9900361603117543
SVMPipeline Test Score: 0.9458304722667277
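
The scores in this list are not all the same metric: the four classifier pipelines report accuracy, while the LinearRegression and RandomForestRegressor pipelines report R2, which is why one value can be negative. One possible way to put a regressor on the accuracy scale is to round its continuous output to the nearest class code first. The sketch below assumes that approach, with hypothetical helper names and made-up numbers, and it treats the label codes as ordered, which is itself an assumption.

```python
def to_class(value, n_classes):
    """Round a continuous prediction and clip it to a valid class code."""
    return min(max(int(round(value)), 0), n_classes - 1)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Hypothetical encoded labels and raw regressor outputs
y_true = [0, 1, 2, 3, 0, 1]
raw = [0.2, 1.4, 1.3, 3.7, -0.3, 0.8]

rounded = [to_class(v, 4) for v in raw]
acc = accuracy(y_true, rounded)

print("rounded predictions:", rounded)
print("accuracy after rounding:", acc)
```

After this conversion the regressors can be ranked against the classifiers on a single metric.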


## **DISPLAYING SHAPE OF TRAIN AND TEST INPUT & OUTPUT**

In [None]:
x9_train.shape, x9_test.shape,encoded_y9.shape, encoded_y9t.shape

((52425, 11), (13107, 11), (52425,), (13107,))

## **USING FOR LOOP TO SELECT BEST MODEL IN PIPELINE**

In [None]:
for i, model in enumerate(mypipeline):
  score=model.score(x9_test, encoded_y9t)  # compute once; R2 for the two regressors, accuracy otherwise
  if score>accuracy:
    accuracy=score
    pipeline=model
    classifier=i
print("Best Model in Pipeline is:{}".format(PipelineDict[classifier]))

Best Model in Pipeline is:TreePipeline
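
The same selection can be written without the running `accuracy` and `classifier` variables. The sketch below reuses the test scores reported earlier in this section and picks the largest with `max`; the `scores` dictionary is an illustrative stand-in for calling `.score()` on each fitted pipeline.

```python
# Test scores as reported earlier in this section
# (the LinearRegression and RandomForest rows are R2, the others accuracy)
scores = {
    "KNNPipeline": 0.9861905851834898,
    "LogisticRegressionPipeline": 0.9205767910276951,
    "LinearRegressionPipeline": -0.3302133370851612,
    "TreePipeline": 0.9970244907301442,
    "RandomForestPipeline": 0.9900361603117543,
    "SVMPipeline": 0.9458304722667277,
}

# Pick the name whose score is highest
best_name = max(scores, key=scores.get)
print("Best Model in Pipeline is:{}".format(best_name))
```

Because the winner is chosen on the test split, its test score is a slightly optimistic estimate. Choosing on a validation split or with cross-validation on the training data would avoid that.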
