In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

# Business Understanding
<blockquote>
Firewall Rules and their actions:

* Allow: Explicitly allows traffic that matches the rule to pass, and then implicitly denies everything else.
* Bypass: Allows traffic to bypass both firewall and Intrusion Prevention analysis. Use this setting only for media-intensive protocols. A Bypass Rule can be based on IP, port, traffic direction, and protocol.
* Deny: Explicitly blocks traffic that matches the rule.
* Force Allow: Forcibly allows traffic that would otherwise be denied by other rules.
  * Traffic permitted by a Force Allow Rule will still be subject to analysis by the Intrusion Prevention module.
* Log only: Traffic will only be logged. No other action will be taken.

Allow rules have two functions:

* Permit traffic that is explicitly allowed.
* Implicitly deny all other traffic.
 
**Note:** Traffic that is not explicitly allowed by an Allow rule is dropped, and gets recorded as an Out of "allowed" Policy Firewall Event.
</blockquote>

-- [TrendMicro Firewall Rule Actions and Priorities](https://help.deepsecurity.trendmicro.com/10/0/Protection-Modules/Firewall/firewall-rule-action-priority.html)

# EDA

In [2]:
df = pd.read_csv("log2.csv")
df.head()

Unnamed: 0,Source Port,Destination Port,NAT Source Port,NAT Destination Port,Action,Bytes,Bytes Sent,Bytes Received,Packets,Elapsed Time (sec),pkts_sent,pkts_received
0,57222,53,54587,53,allow,177,94,83,2,30,1,1
1,56258,3389,56258,3389,allow,4768,1600,3168,19,17,10,9
2,6881,50321,43265,50321,allow,238,118,120,2,1199,1,1
3,50553,3389,50553,3389,allow,3327,1438,1889,15,17,8,7
4,50002,443,45848,443,allow,25358,6778,18580,31,16,13,18


In [3]:
profile = ProfileReport(df)
profile.to_widgets()


The profile report shows correlation between several pairs of fields, including:
* Bytes Sent & Bytes
* pkts_received & Bytes Received
* pkts_received & Packets
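The pairs listed above can also be pulled out programmatically instead of being read off the report. The following is a minimal sketch, assuming Pearson correlation and a cutoff of 0.5 (both are my choices, not something the profile report prescribes). The `toy` frame and the `correlated_pairs` helper are hypothetical stand-ins; in the notebook the helper would be called on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the firewall log; the notebook would pass `df` instead.
rng = np.random.default_rng(0)
pkts_sent = rng.integers(1, 50, size=200)
pkts_received = rng.integers(1, 50, size=200)
toy = pd.DataFrame({
    "pkts_sent": pkts_sent,
    "pkts_received": pkts_received,
    "Packets": pkts_sent + pkts_received,          # sum of the two columns above
    "Elapsed Time (sec)": rng.integers(0, 1200, size=200),
})

def correlated_pairs(frame, threshold=0.5):
    """Return (col_a, col_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = frame.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):          # upper triangle only, no self-pairs
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

pairs = correlated_pairs(toy)
for col_a, col_b, r in pairs:
    print(f"{col_a} & {col_b}: r = {r}")
```

Strongly correlated features carry overlapping information, so a list like this is a reasonable starting point when deciding whether any columns can be dropped before modeling.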

Additionally, an extremely small share of rows has the action "reset-both" (54 rows, i.e., less than 1% of the data). I will drop these rows, since so few examples relative to the rest are unlikely to represent the behavior of that class well enough for prediction.

Next, as the TrendMicro background above indicates, "drop" traffic does not come from an explicit "deny" rule; it is dropped because no rule exists to handle it. We need to decide whether to merge "drop" into "deny" or to remove those rows altogether. This decision affects whether the SVM's decision function is set up as a one-vs-one (ovo) or one-vs-rest (ovr) approach.
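To make the labeling choice concrete, here is a small sketch of two possible encodings: keeping "drop" as its own class, or folding it into "deny". The `actions` series is a hypothetical stand-in for the `Action` column, and the integer codes are arbitrary.

```python
import pandas as pd

# Hypothetical sample of the Action column
actions = pd.Series(["allow", "deny", "drop", "allow", "drop"], name="Action")

# Option 1: keep "drop" as a separate class (three-class problem)
three_class = actions.map({"allow": 0, "deny": 1, "drop": 2})

# Option 2: fold "drop" into "deny" (binary problem: allowed vs. blocked)
two_class = actions.map({"allow": 0, "deny": 1, "drop": 1})

print(three_class.nunique(), two_class.nunique())
```

As far as I can tell from the scikit-learn documentation, `SVC` ignores `decision_function_shape` for binary classification, so the ovo/ovr setting only matters if the three-class encoding is kept.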

The duplicates should not be a concern, as duplicated rows carry the same "Action" label. For now, I will not drop them, since they are negligible relative to the size of the dataset (the most frequent duplicate accounts for less than 0.1% of the rows; duplicates total 0.16%).
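The duplicate figures above come from the profile report. A rough way to re-check them with plain pandas is sketched below, on a hypothetical `toy` frame rather than the real log. It also checks the stronger condition that no combination of feature values appears with more than one label, which is the case that would actually hurt a classifier.

```python
import pandas as pd

# Hypothetical miniature of the log: rows 0, 1 and 4 are exact copies
toy = pd.DataFrame({
    "Destination Port": [53, 53, 443, 3389, 53],
    "Bytes": [177, 177, 25358, 4768, 177],
    "Action": ["allow", "allow", "allow", "deny", "allow"],
})

# Share of rows that repeat an earlier row exactly (first occurrences are not counted)
duplicate_share = toy.duplicated().mean()

# Number of feature combinations that occur with more than one distinct label
feature_cols = [c for c in toy.columns if c != "Action"]
labels_per_combo = toy.groupby(feature_cols)["Action"].nunique()
conflicting = int((labels_per_combo > 1).sum())

print(f"duplicate share: {duplicate_share:.2%}, conflicting combinations: {conflicting}")
```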

# Preprocessing

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

Drop "reset-both" Action rows

In [6]:
df_drop = df[df.Action != "reset-both"]
df_drop.Action.value_counts()

allow    37640
deny     14987
drop     12851
Name: Action, dtype: int64

In [10]:
# get the target and numeric column names
num_cols = df.select_dtypes(exclude=['object']).columns.tolist()  # Gets all numerical columns in list
cat_cols = df.select_dtypes(include=['object']).columns.tolist()  # Gets Action

In [11]:
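# Build features and labels from df_drop so the "reset-both" rows stay excluded
X = df_drop.loc[:, df_drop.columns != "Action"]

# Encode the remaining actions as integers; y is a 1-D Series aligned with X
y = df_drop["Action"].map({'allow': 0, 'deny': 1, 'drop': 2})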

In [15]:
stand_scaler = StandardScaler()

X_stand = stand_scaler.fit_transform(X) 
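Fitting the scaler on the full feature matrix means that rows later used for testing also influence the scaling statistics. One common way around this is to split first and put the scaler inside a `Pipeline`, so it is fitted on the training rows only. The sketch below uses synthetic data (`X_toy`, `y_toy`) and a linear kernel purely for illustration; the split ratio and step names are my assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y, with features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4)) * np.array([1, 10, 100, 1000])
y_toy = (X_toy[:, 0] > 0).astype(int)

# Split before any scaling; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# The scaler is fitted on X_train only when the pipeline is fitted
model = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="linear")),
])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be passed to `GridSearchCV` as the estimator, in which case the scaler is refitted inside every cross-validation fold and parameter names take the step prefix (for example `svm__C`).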

# SVM Model

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

Kernel types for SVM:
* Radial Basis Function (RBF)
* Linear
* Polynomial
* Sigmoid

To quote SKLearn on RBF:
<blockquote>
When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

Proper choice of C and gamma is critical to the SVM’s performance. One is advised to use GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
</blockquote>
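The advice in the quote, searching `C` and `gamma` on an exponentially spaced grid, might look like the sketch below. The data comes from `make_classification` as a placeholder for the firewall features, and the grid bounds and `cv=3` are assumptions chosen to keep the example fast rather than tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data; the notebook would use the scaled firewall features instead
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Exponentially spaced candidates: each value is 10x the previous one
param_grid = {
    "C": np.logspace(-2, 2, 5),       # 0.01, 0.1, 1, 10, 100
    "gamma": np.logspace(-3, 1, 5),   # 0.001, 0.01, 0.1, 1, 10
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

best = search.best_params_
print(best, round(search.best_score_, 3))
```

If the best value lands on the edge of the grid, that is usually a hint to extend the range in that direction and search again.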

To quote SKLearn on Linear Kernel:

In [None]:
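# 'poly' is scikit-learn's name for the polynomial kernel ('polynomial' is not accepted).
# gamma=0 would remove the kernel's dependence on the data, so 0.01 is used as the smallest value instead.
parameters_svm = {'C':[0.9,0.01],'kernel':['rbf','linear','poly','sigmoid'], 'gamma':[0.01,0.1,'auto'], 'probability':[True,False],
                  'random_state':[0,7,16],'decision_function_shape':['ovo','ovr'],'degree':[3,4,10]}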

clf_svm = SVC()

grid_svm = GridSearchCV(estimator = clf_svm, param_grid = parameters_svm, cv = 10, 
                        scoring = 'accuracy')
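Before fitting a search of this shape, it can help to count how many models it will train. The sketch below uses `ParameterGrid` with lists of the same lengths as the grid above (the values are illustrative; only the lengths matter for the count). It then compares that with a hypothetical list-of-dicts grid, `compact`, that pairs `degree` only with the polynomial kernel and leaves out `probability`, `random_state` and `decision_function_shape`, which, as I read the `SVC` documentation, do not change the predicted labels and therefore should not change accuracy.

```python
from sklearn.model_selection import ParameterGrid

# Same list lengths as the grid above; every combination is a separate candidate
full = {
    "C": [0.9, 0.01],
    "kernel": ["rbf", "linear", "poly", "sigmoid"],
    "gamma": [0.01, 0.1, "auto"],
    "probability": [True, False],
    "random_state": [0, 7, 16],
    "decision_function_shape": ["ovo", "ovr"],
    "degree": [3, 4, 10],
}
n_candidates = len(ParameterGrid(full))
n_fits = n_candidates * 10            # cv=10 trains each candidate ten times

# Hypothetical alternative: one dict per kernel family, with only the
# parameters that kernel actually uses
compact = [
    {"kernel": ["linear"], "C": [0.9, 0.01]},
    {"kernel": ["rbf", "sigmoid"], "C": [0.9, 0.01], "gamma": [0.01, 0.1, "auto"]},
    {"kernel": ["poly"], "C": [0.9, 0.01], "gamma": [0.01, 0.1, "auto"],
     "degree": [3, 4, 10]},
]
n_compact = len(ParameterGrid(compact))

print(n_candidates, n_fits, n_compact)
```

`GridSearchCV` accepts the list-of-dicts form directly as `param_grid`, so a grid like `compact` could be swapped in without other changes.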
