# Supervised Machine Learning - Data Classification Demo

> This project analyzes and classifies a real network traffic dataset to separate malicious from benign traffic records. It compares the performance of several machine learning algorithms and tunes the best-performing model for the highest accuracy and the fewest false negatives.

> The dataset used in this demo is [CTU-IoT-Malware-Capture-34-1](https://mcfp.felk.cvut.cz/publicDatasets/IoT-23-Dataset/IndividualScenarios/CTU-IoT-Malware-Capture-34-1/bro/), part of the [Aposemat IoT-23 dataset](https://www.stratosphereips.org/datasets-iot23), a labeled dataset of malicious and benign IoT network traffic.

## 1. Initialization

In [63]:
#importing required libraries and modules
import pandas as pd

#set display options
pd.set_option('display.max_columns', None)

## 2. Data Acquisition

### Reading dataset file into DataFrame

In [64]:
#initialize required variables to read from the data file
data_file_location = "data\\raw\\"
data_file_name = "conn.log.labeled"

#column names are included in the file in a commented line, so we need to read the corresponding line separately and remove the first description word.
data_columns = pd.read_csv(data_file_location + data_file_name, sep='\t', skiprows=6, nrows=1, header=None).iloc[0][1:]

#read the actual dataset
data_df = pd.read_csv(data_file_location + data_file_name, sep='\t', comment="#", header=None)

#set column names
data_df.columns = data_columns
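
The cell above relies on the column names sitting on the seventh line of the file (`skiprows=6`). As a minimal sketch of an alternative, the header can be located by its `#fields` prefix instead of by position. The helper name `read_zeek_log` and the in-memory sample log are assumptions made for illustration; they are not part of the original notebook or of pandas.

```python
import io

import pandas as pd

# Toy stand-in for a Zeek/Bro log; the real file has more metadata lines and columns.
sample_log = (
    "#separator \\x09\n"
    "#path\tconn\n"
    "#fields\tts\tuid\tproto\n"
    "#types\ttime\tstring\tenum\n"
    "1545404000.0\tCrDn63WjJEmrWGjqf\ttcp\n"
    "1545404001.0\tCY9lJW3gh1Eje4usP6\ttcp\n"
    "#close\t2019-01-01-00-00-00\n"
)


def read_zeek_log(handle):
    """Hypothetical helper: read a Zeek-style log, taking column names from its '#fields' line."""
    column_names = None
    data_lines = []
    for line in handle:
        if line.startswith("#fields"):
            # Drop the leading '#fields' token and keep the tab-separated names.
            column_names = line.rstrip("\n").split("\t")[1:]
        elif not line.startswith("#"):
            # Anything that is not a metadata line is a data record.
            data_lines.append(line)
    if column_names is None:
        raise ValueError("no '#fields' line found")
    return pd.read_csv(
        io.StringIO("".join(data_lines)), sep="\t", header=None, names=column_names
    )


sample_df = read_zeek_log(io.StringIO(sample_log))
print(sample_df.columns.tolist())
```

With a real file, the same helper could be called as `read_zeek_log(open(path))`. It reads the file line by line in Python, so it trades some speed for not depending on a fixed header position.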

### Exploring dataset summary and statistics

In [65]:
#check dataset shape
data_df.shape

(23145, 21)

In [66]:
#check dataset head
data_df.head()

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | resp_bytes | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents label detailed-label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1545404000.0 | CrDn63WjJEmrWGjqf | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | 3.139211 | 0 | 0 | S0 | - | - | 0 | S | 3 | 180 | 0 | 0 | - Benign - |
| 1 | 1545404000.0 | CY9lJW3gh1Eje4usP6 | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | - | - | - | S0 | - | - | 0 | S | 1 | 60 | 0 | 0 | - Benign - |
| 2 | 1545404000.0 | CcFXLynukEDnUlvgl | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | - | - | - | S0 | - | - | 0 | S | 1 | 60 | 0 | 0 | - Benign - |
| 3 | 1545404000.0 | CDrkrSobGYxHhYfth | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | http | 1.477656 | 149 | 128252 | SF | - | - | 2896 | ShADadttcfF | 94 | 5525 | 96 | 139044 | - Benign - |
| 4 | 1545404000.0 | CTWZQf2oJSvq6zmPAc | 192.168.1.195 | 41042 | 185.244.25.235 | 80 | tcp | - | 3.147116 | 0 | 0 | S0 | - | - | 0 | S | 3 | 180 | 0 | 0 | - Benign - |


In [67]:
#check dataset summary
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23145 entries, 0 to 23144
Data columns (total 21 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   ts                                       23145 non-null  float64
 1   uid                                      23145 non-null  object 
 2   id.orig_h                                23145 non-null  object 
 3   id.orig_p                                23145 non-null  int64  
 4   id.resp_h                                23145 non-null  object 
 5   id.resp_p                                23145 non-null  int64  
 6   proto                                    23145 non-null  object 
 7   service                                  23145 non-null  object 
 8   duration                                 23145 non-null  object 
 9   orig_bytes                               23145 non-null  object 
 10  resp_bytes                               23145
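
The summary lists `duration` and `orig_bytes` as `object` rather than a numeric dtype, which is consistent with the `-` placeholders visible in the head output. The following is a minimal sketch, on a toy frame, of one way such columns might be converted to numbers. The list of columns to convert is an assumption, and this step is not taken from the original notebook.

```python
import pandas as pd

# Toy frame mimicking a few conn.log columns, with values copied from the head output.
toy_df = pd.DataFrame(
    {
        "duration": ["3.139211", "-", "1.477656"],
        "orig_bytes": ["0", "-", "149"],
        "proto": ["tcp", "tcp", "tcp"],
    }
)

# Assumed list of columns that should be numeric; adjust it to the real dataset.
numeric_columns = ["duration", "orig_bytes"]

for column in numeric_columns:
    # errors="coerce" turns anything unparseable, including '-', into NaN.
    toy_df[column] = pd.to_numeric(toy_df[column], errors="coerce")

print(toy_df.dtypes)
print(toy_df.isna().sum())
```

Coercing to `NaN` keeps the missing values explicit, so a later step can decide whether to impute or drop them.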

## 3. Data Cleaning

In [68]:
#fix the last column: its three fields are not tab-separated in the original file, so they were read as a single column
#split the combined column into three separate columns
tunnel_parents_column = data_df.iloc[:,-1].apply(lambda x: x.split()[0])
label_column = data_df.iloc[:,-1].apply(lambda x: x.split()[1])
detailed_label_column = data_df.iloc[:,-1].apply(lambda x: x.split()[2])

#drop the combined column
data_df.drop(["tunnel_parents   label   detailed-label"], axis=1, inplace=True)

#add new individual columns to the dataset
data_df["tunnel_parents"] = tunnel_parents_column
data_df["label"] = label_column
data_df["detailed_label"] = detailed_label_column
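
The cell above splits the combined column three times, once per new column. As a hedged alternative sketch, `Series.str.split(expand=True)` can do the split in a single pass and return one column per token. The toy series below is an assumption: its first row mirrors the head output, and the second row is invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the combined last column.
combined = pd.Series(
    [
        "-   Benign   -",
        "-   Malicious   SomeDetailedLabel",  # invented value, for illustration only
    ]
)

# With no pattern given, str.split splits on any run of whitespace;
# expand=True returns a DataFrame with one column per token.
parts = combined.str.split(expand=True)

# Guard: every row should yield exactly three tokens.
if parts.shape[1] != 3 or parts.isna().any().any():
    raise ValueError("expected exactly three whitespace-separated fields per row")

parts.columns = ["tunnel_parents", "label", "detailed_label"]
print(parts)
```

The explicit guard matters because, unlike indexing into `x.split()`, the expanded form pads short rows with missing values instead of raising an error.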