### ECE 572 Course Project - Cool Fancy-Dancy Name Here

Project Description:
A Python-based program that extracts and visualizes network traffic information from a Wireshark-generated .csv file.

Team Members:

Devang Sharma - V00931210 - devsharma@uvic.ca

Alex Spurgeon - V00818626 - aespurge@uvic.ca

Aditya Naren Yerramilli - V00938179 - naren1@uvic.ca


#### Useful Resources for this project:

General References:

Pandas: https://nbviewer.jupyter.org/gist/manujeevanprakash/996d18985be612072ee0
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html

Mix of everything: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

Examples of analysis:
https://null-byte.wonderhowto.com/how-to/analyze-wi-fi-data-captures-with-jupyter-notebook-0201490/

https://github.com/skickar/Research/blob/master/RedLineResearch.ipynb

https://medium.com/hackervalleystudio/learning-packet-analysis-with-data-science-5356a3340d4e

https://github.com/secdevopsai/Packet-Analytics/blob/master/Packet-Analytics.ipynb

(Some minor examples here) https://www.python4networkengineers.com/posts/wireshark/analyzing_wireshark_data_with_pandas/

In [1]:
# Imports and essentials
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime

In [4]:
# Import data from CSV files: ECE572Test.csv, ECE572Benign.csv, and ECE572DoS.csv
# header=0: the column labels are all in the first row
test_df = pd.read_csv('ECE572Test.csv', delimiter=',', encoding='latin-1', header=0)
normal_df = pd.read_csv('ECE572Benign.csv', delimiter=',', encoding='latin-1', header=0)
dos_df = pd.read_csv('ECE572DoS.csv', delimiter=',', encoding='latin-1', header=0)
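
The `Time` column is read in as plain strings, so any time-based analysis would first need it converted to datetimes. The following is a minimal sketch of one way to do that, not necessarily how the project does it. The helper name `load_capture` and the two-row inline sample are hypothetical and only mimic the layout of the exported CSV; the second row's protocol and length are illustrative values.

```python
import io
import pandas as pd

# Hypothetical two-row sample shaped like the Wireshark export (values are illustrative).
SAMPLE_CSV = (
    "No.,Time,Source,Destination,Protocol,Source Port,Dest Port,Length\n"
    "1,2020-07-12 20:54:38.115614448,192.168.56.102,192.168.56.255,NBNS,137,137,92\n"
    "2,2020-07-12 20:54:43.120683917,PcsCompu_ae:93:92,PcsCompu_af:7b:18,ARP,,,60\n"
)

def load_capture(source):
    """Read a capture CSV and convert the Time column to pandas datetimes."""
    df = pd.read_csv(source, encoding='latin-1', header=0)
    df['Time'] = pd.to_datetime(df['Time'])
    return df

# A bytes buffer stands in for a file on disk so the sketch is self-contained.
sample_df = load_capture(io.BytesIO(SAMPLE_CSV.encode('latin-1')))

# With real datetimes, the gap between consecutive packets is a simple difference.
sample_df['Delta'] = sample_df['Time'].diff()
print(sample_df[['Time', 'Delta']])
```

The same helper could be pointed at `'ECE572Test.csv'` and the other two files, assuming their `Time` columns use the same timestamp format.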

In [16]:
print(test_df.head(10))
print(test_df.shape)

   No.                           Time             Source        Destination  \
0    1  2020-07-12 20:54:38.115614448     192.168.56.102     192.168.56.255   
1    2  2020-07-12 20:54:38.115854052     192.168.56.101     192.168.56.102   
2    3  2020-07-12 20:54:38.115905970       192.168.56.1    255.255.255.255   
3    4  2020-07-12 20:54:38.116133751     192.168.56.101       192.168.56.1   
4    5  2020-07-12 20:54:43.120683917  PcsCompu_ae:93:92  PcsCompu_af:7b:18   
5    6  2020-07-12 20:54:43.120710417  PcsCompu_ae:93:92  0a:00:27:00:00:1a   
6    7  2020-07-12 20:54:43.120711737  0a:00:27:00:00:1a  PcsCompu_ae:93:92   
7    8  2020-07-12 20:54:43.121003898  PcsCompu_af:7b:18  PcsCompu_ae:93:92   
8    9  2020-07-12 20:54:51.116719776     192.168.56.102     192.168.56.101   
9   10  2020-07-12 20:54:51.116968323     192.168.56.101     192.168.56.102   

  Protocol  Source Port  Dest Port  Length  \
0     NBNS        137.0      137.0      92   
1     NBNS        137.0      137.0    

In [21]:
# Select a single column by its label, e.g. Source Port
print(test_df['Source Port'].head(5))

# Select two or more columns by passing a list of labels, e.g. Source Port and Dest Port
print(test_df[['Source Port','Dest Port']].head(10))

0      137.0
1      137.0
2    54926.0
3      137.0
4        NaN
Name: Source Port, dtype: float64
   Source Port  Dest Port
0        137.0      137.0
1        137.0      137.0
2      54926.0      137.0
3        137.0    54926.0
4          NaN        NaN
5          NaN        NaN
6          NaN        NaN
7          NaN        NaN
8      60348.0       22.0
9         22.0    60348.0
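
The ports print as floats (`137.0`) because a NumPy integer column cannot hold `NaN`, so pandas falls back to `float64` when values are missing. As a hedged alternative, pandas' nullable `Int64` dtype keeps ports as integers while leaving the gaps marked as missing. The small frame below is a hypothetical stand-in for `test_df`, using a few of the port values shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two port columns of test_df.
ports = pd.DataFrame({
    'Source Port': [137.0, 54926.0, np.nan],
    'Dest Port': [137.0, 137.0, np.nan],
})

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
ports_int = ports.astype('Int64')
print(ports_int.dtypes)
print(ports_int)
```

This keeps "no port" distinguishable from a real port number, which matters if rows without ports (e.g. ARP) are to be filtered out later.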


In [26]:
# Preprocessing - replace missing values with 0 in all three data sets
# Missing data is mostly in the Source/Destination Ports of non-TCP/UDP traffic (e.g. ARP and ICMP)
# Note: fillna(0) fills NaN in every column, not only in the port columns
test_df = test_df.fillna(0)
normal_df = normal_df.fillna(0)
dos_df = dos_df.fillna(0)
print(test_df[['Source Port','Dest Port']].head(10))

   Source Port  Dest Port
0        137.0      137.0
1        137.0      137.0
2      54926.0      137.0
3        137.0    54926.0
4          0.0        0.0
5          0.0        0.0
6          0.0        0.0
7          0.0        0.0
8      60348.0       22.0
9         22.0    60348.0
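
Calling `fillna(0)` on the whole frame also writes `0` into any non-port column that happens to contain `NaN`. A narrower variant, sketched below under assumptions, fills only the two port columns and casts them to integers. The helper name `fill_missing_ports`, the `sentinel` parameter, and the `Info` column in the sample are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

PORT_COLUMNS = ['Source Port', 'Dest Port']

def fill_missing_ports(df, sentinel=0):
    """Return a copy of df with missing ports replaced by sentinel and cast to int."""
    out = df.copy()
    for col in PORT_COLUMNS:
        out[col] = out[col].fillna(sentinel).astype('int64')
    return out

# Hypothetical sample: one row with ports, one without (e.g. ARP).
sample = pd.DataFrame({
    'Protocol': ['NBNS', 'ARP'],
    'Source Port': [137.0, np.nan],
    'Dest Port': [137.0, np.nan],
    'Info': ['query', np.nan],  # hypothetical non-port column with a gap
})

cleaned = fill_missing_ports(sample)
print(cleaned)
```

Only the port columns change here; the gap in `Info` stays `NaN`. Using `0` as the sentinel assumes no real traffic in the capture uses port 0.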
