# Supervised Machine Learning - Data Classification Demo

> This project analyzes a real network traffic dataset and classifies each traffic record as malicious or benign. It compares the performance of several machine learning algorithms and tunes the best-performing model to maximize accuracy while minimizing false negatives.

> The dataset used in this demo is [CTU-IoT-Malware-Capture-34-1](https://mcfp.felk.cvut.cz/publicDatasets/IoT-23-Dataset/IndividualScenarios/CTU-IoT-Malware-Capture-34-1/bro/). It is part of the [Aposemat IoT-23 dataset](https://www.stratosphereips.org/datasets-iot23), a labeled dataset of malicious and benign IoT network traffic.

## 1. Data Acquisition

### Importing required libraries and modules

In [1]:
import pandas as pd

### Reading the dataset file into DataFrame

In [2]:
# Location and name of the raw data file (forward slashes work on Windows, Linux, and macOS)
data_file_location = "data/raw/"
data_file_name = "conn.log.labeled"

# The column names are stored in a commented header line (the 7th line of the file),
# so read that line separately and drop its first descriptor token.
data_columns = pd.read_csv(data_file_location + data_file_name, sep='\t', skiprows=6, nrows=1, header=None).iloc[0][1:]

# Read the actual records; lines starting with "#" are metadata and are skipped
data_df = pd.read_csv(data_file_location + data_file_name, sep='\t', comment="#", header=None)

# Set the column names
data_df.columns = data_columns
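
To make the two-pass read above easier to follow, here is a minimal, self-contained sketch that applies the same idea to a tiny in-memory log. The sample text, the three column names, and the variable names (`sample_log`, `sample_columns`, `sample_df`) are assumptions made for illustration only; the real file has more metadata lines and more columns, which is why the cell above uses `skiprows=6` rather than `skiprows=1`.

```python
import io

import pandas as pd

# Hypothetical miniature of a Zeek/Bro-style log: metadata lines start with "#",
# and the "#fields" line carries the tab-separated column names.
sample_log = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tproto\n"
    "#types\ttime\tstring\tenum\n"
    "1545404000.0\tCrDn63\ttcp\n"
    "1545404001.0\tCY9lJW\tudp\n"
)

# Pass 1: read only the "#fields" line (the 2nd line here) and drop the "#fields" token.
sample_columns = pd.read_csv(
    io.StringIO(sample_log), sep="\t", skiprows=1, nrows=1, header=None
).iloc[0, 1:].tolist()

# Pass 2: read the records, skipping every line that starts with "#".
sample_df = pd.read_csv(io.StringIO(sample_log), sep="\t", comment="#", header=None)
sample_df.columns = sample_columns

print(sample_df)
```

Converting the header row to a plain list with `.tolist()` is a small design choice in this sketch: it keeps the column index free of the name that a pandas Series would otherwise carry over.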

In [3]:
data_df.shape

(23145, 21)

In [4]:
data_df.head()

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | ... | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents label detailed-label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1545404000.0 | CrDn63WjJEmrWGjqf | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | 3.139211 | 0 | ... | S0 | - | - | 0 | S | 3 | 180 | 0 | 0 | - Benign - |
| 1 | 1545404000.0 | CY9lJW3gh1Eje4usP6 | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | - | - | ... | S0 | - | - | 0 | S | 1 | 60 | 0 | 0 | - Benign - |
| 2 | 1545404000.0 | CcFXLynukEDnUlvgl | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | - | - | - | ... | S0 | - | - | 0 | S | 1 | 60 | 0 | 0 | - Benign - |
| 3 | 1545404000.0 | CDrkrSobGYxHhYfth | 192.168.1.195 | 41040 | 185.244.25.235 | 80 | tcp | http | 1.477656 | 149 | ... | SF | - | - | 2896 | ShADadttcfF | 94 | 5525 | 96 | 139044 | - Benign - |
| 4 | 1545404000.0 | CTWZQf2oJSvq6zmPAc | 192.168.1.195 | 41042 | 185.244.25.235 | 80 | tcp | - | 3.147116 | 0 | ... | S0 | - | - | 0 | S | 3 | 180 | 0 | 0 | - Benign - |


In [5]:
data_df.tail()

| | ts | uid | id.orig_h | id.orig_p | id.resp_h | id.resp_p | proto | service | duration | orig_bytes | ... | conn_state | local_orig | local_resp | missed_bytes | history | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | tunnel_parents label detailed-label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23140 | 1545490000.0 | C2F17zSUnGOcWzBa7 | 192.168.1.195 | 57110 | 185.244.25.235 | 6667 | tcp | irc | 32.840994 | 62 | ... | S3 | - | - | 0 | ShAdDaf | 7 | 434 | 6 | 589 | - Malicious C&C |
| 23141 | 1545490000.0 | C93P4z4k5IRJD1rXJg | 192.168.1.195 | 57092 | 185.244.25.235 | 6667 | tcp | irc | 36.290833 | 62 | ... | S3 | - | - | 0 | ShAdDaf | 10 | 606 | 7 | 632 | - Malicious C&C |
| 23142 | 1545490000.0 | CXLZ3A2QY5E8weqpDk | 192.168.1.195 | 123 | 147.251.48.140 | 123 | udp | - | - | - | ... | S0 | - | - | 0 | D | 1 | 76 | 0 | 0 | - Benign - |
| 23143 | 1545490000.0 | CuXpFN3fWesWBXUhq1 | 192.168.1.195 | 123 | 82.113.53.40 | 123 | udp | - | - | - | ... | S0 | - | - | 0 | D | 1 | 76 | 0 | 0 | - Benign - |
| 23144 | 1545490000.0 | Ct2Yhy4d33oL3yyZY9 | 192.168.1.195 | 123 | 89.221.210.188 | 123 | udp | - | - | - | ... | S0 | - | - | 0 | D | 1 | 76 | 0 | 0 | - Benign - |
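
Note that the last column in the output above is a single field named `tunnel_parents label detailed-label`, which holds three values per row, and that 21 columns were reported for the DataFrame. One possible way to separate them is sketched below on a small hypothetical frame. It assumes the three values are separated by whitespace and that only the third value could itself contain spaces; `demo_df`, `fused`, and `parts` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Name of the fused column exactly as it appears in the output above.
fused = "tunnel_parents label detailed-label"

# Hypothetical three-row frame mimicking the fused last column.
demo_df = pd.DataFrame(
    {
        "proto": ["tcp", "tcp", "udp"],
        fused: ["-   Benign   -", "-   Malicious   C&C", "-   Benign   -"],
    }
)

# Split on runs of whitespace, at most twice, so each row yields exactly three parts.
parts = demo_df[fused].str.split(n=2, expand=True)
parts.columns = fused.split()  # ["tunnel_parents", "label", "detailed-label"]

# Replace the fused column with the three separate columns.
demo_df = pd.concat([demo_df.drop(columns=[fused]), parts], axis=1)

print(demo_df)
```

If the real file uses a different separator inside this field, the `str.split` call would need to be adjusted accordingly.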
