# Dataset Preprocessing
The goal of this notebook is to process the dataset so that it is ready to be used for training. The following steps are performed:
1. Combine the CSV files of the dataset into one dataset
2. Preliminary analysis of the dataset
3. Dataset cleaning - handle missing values, duplicates, etc.
4. Further processing, including addressing the class imbalance problem and renaming the columns (if necessary)
5. Save the processed dataset

Note: The notebook is separated into two sections, in which the training dataset and the testing dataset are processed separately. The same steps are applied to both datasets.

In [1]:
# import the packages to be used
import os       # to create directories and remove files
import numpy as np
import pandas as pd
import random

# set the random seed to ensure the result is reproducible
random.seed(10)
np.random.seed(10)

# LUFlow Dataset (2020)

For the implementation using the LUFlow dataset, the data collected in July 2020 is used as the training dataset.

## Step 1. Combine the files of the dataset

In [2]:
luflow_csv_files = ('2020.07.01',
                    '2020.07.02',
                    '2020.07.03',
                    '2020.07.04',
                    '2020.07.05',
                    '2020.07.06',
                    '2020.07.07',
                    '2020.07.08',
                    '2020.07.09',
                    '2020.07.10',
                    '2020.07.11',
                    '2020.07.12',
                    '2020.07.13',
                    '2020.07.14',
                    '2020.07.15',
                    '2020.07.16',
                    '2020.07.17',
                    '2020.07.18',
                    '2020.07.19',
                    '2020.07.20',
                    '2020.07.21',
                    '2020.07.22',
                    '2020.07.23',
                    '2020.07.24',
                    '2020.07.25',
                    '2020.07.26',
                    '2020.07.27',
                    '2020.07.28',
                    '2020.07.29',
                    '2020.07.30',
                    '2020.07.31',
                    )

In [3]:
# Define the function to combine separate files of a dataset
# Besides combining the files, it has two additional functions:
# 1. Replace the replacement character (\uFFFD) with '-', which is needed for the CIC-IDS2017 dataset
# 2. Downsample the dataset. If the reduce_sample_size parameter is set to True,
#    each data row is kept with probability 1/10, so roughly 10% of the dataset is saved.
def combine_csv_files(dataset: str, file_names: tuple, reduce_sample_size: bool = False):

    # create a new directory to place the combined dataset
    os.makedirs('./Dataset/dataset_combined', exist_ok=True)

    # remove the combined file if it already exists (the loop below appends to it)
    merged_dataset_directory = f'Dataset/dataset_combined/{dataset}.csv'
    if os.path.isfile(merged_dataset_directory):
        os.remove(merged_dataset_directory)
        print(f'original file({merged_dataset_directory}) has been removed')

    for (file_index, file_name) in enumerate(file_names):
        with open(f"Dataset/{dataset}/{file_name}.csv", 'r') as file, open(merged_dataset_directory, 'a') as out_file:
            for (line_index, line) in enumerate(file):
                
                # only the header of the first file will be taken
                if 'Label' in line or 'label' in line:
                    if file_index != 0 or line_index != 0:
                        continue
                elif reduce_sample_size:
                    if random.randint(1, 10) > 1:
                        continue
                # replace the replacement character (\uFFFD) with '-' if it exists
                out_file.write(line.replace(' \ufffd ', '-'))
                

In [4]:
combine_csv_files(dataset='LUFlow', 
                    file_names=luflow_csv_files, 
                    reduce_sample_size=True)

original file(Dataset/dataset_combined/LUFlow.csv) has been removed
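
The sampling rule inside `combine_csv_files` keeps each data line independently with probability 1/10, so the combined file holds roughly, not exactly, 10% of the rows. A minimal sketch of that rule on in-memory text follows. The helper `sample_lines` and the local `random.Random` generator are assumptions made for this example and are not part of the notebook:

```python
import io
import random


def sample_lines(lines, keep_one_in=10, seed=10):
    """Yield the header, then each data line with probability 1/keep_one_in.

    Hypothetical helper, used only to illustrate the sampling rule.
    """
    rng = random.Random(seed)  # local generator, independent of the global seed
    for index, line in enumerate(lines):
        if index == 0:
            yield line  # always keep the header line
        elif rng.randint(1, keep_one_in) == 1:
            yield line


# Build a small in-memory CSV: one header line plus 10,000 data lines.
source = io.StringIO(
    "value,label\n" + "".join(f"{i},benign\n" for i in range(10_000))
)
kept = list(sample_lines(source))

print(kept[0].strip())  # the header is always kept
print(len(kept) - 1)    # close to 1,000 data lines, but not exactly
```

Because every line is drawn independently, the size of the combined file varies slightly with the seed.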


## Step 2. Preliminary analysis

In [5]:
# read the dataset
luflow2020 = pd.read_csv('Dataset/dataset_combined/LUFlow.csv')
luflow2020.head()

| | avg_ipt | bytes_in | bytes_out | dest_ip | dest_port | entropy | num_pkts_out | num_pkts_in | proto | src_ip | src_port | time_end | time_start | total_entropy | label | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 2896 | 786 | 9200.0 | 4.263922 | 2 | 0 | 6 | 786 | 32862.0 | 1593574200046275 | 1593574200046255 | 12348.318 | benign | 2e-05 |
| 1 | 0.285714 | 0 | 10136 | 786 | 9200.0 | 2.710831 | 7 | 3 | 6 | 786 | 32898.0 | 1593574199053693 | 1593574199051282 | 27476.982 | benign | 0.002411 |
| 2 | 0.0 | 0 | 6074 | 786 | 9200.0 | 3.692507 | 5 | 0 | 6 | 786 | 32902.0 | 1593574200047203 | 159357420004712 | 22428.287 | benign | 8.3e-05 |
| 3 | 0.0 | 0 | 2896 | 786 | 9200.0 | 4.803335 | 2 | 0 | 6 | 786 | 32902.0 | 1593574201053831 | 1593574201053819 | 13910.459 | benign | 1.2e-05 |
| 4 | 0.0 | 0 | 0 | 786 | 32898.0 | 0.0 | 1 | 0 | 6 | 786 | 9200.0 | 1593574202069599 | 1593574202069599 | 0.0 | benign | 0.0 |


In [6]:
print(f"Number of rows: {luflow2020.shape[0]}")
print(f"Number of columns: {luflow2020.shape[1]}")

Number of rows: 2506350
Number of columns: 16


In [7]:
print("Columns in the dataset:")
luflow2020.columns

Columns in the dataset:


Index(['avg_ipt', 'bytes_in', 'bytes_out', 'dest_ip', 'dest_port', 'entropy',
       'num_pkts_out', 'num_pkts_in', 'proto', 'src_ip', 'src_port',
       'time_end', 'time_start', 'total_entropy', 'label', 'duration'],
      dtype='object')

In [8]:
print('Class distribution:')
luflow2020['label'].value_counts()

Class distribution:


benign       1396168
malicious     905395
outlier       204787
Name: label, dtype: int64

From the class distribution, we can see that the LUFlow 2020 dataset is slightly imbalanced. The benign samples account for about 56% of the total samples, while malicious samples account for only about 36%.

In [9]:
print('Class distribution (normalized):')
luflow2020['label'].value_counts()/luflow2020.shape[0]*100

Class distribution (normalized):


benign       55.705229
malicious    36.124045
outlier       8.170726
Name: label, dtype: float64
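
The same percentages can also be obtained with the `normalize` argument of `value_counts`. A small sketch on a toy label column (the data below is invented for illustration):

```python
import pandas as pd

# Toy labels, invented for this example only.
labels = pd.Series(
    ["benign"] * 6 + ["malicious"] * 3 + ["outlier"] * 1, name="label"
)

# normalize=True returns fractions that sum to 1; multiply by 100 for percent.
percentages = labels.value_counts(normalize=True) * 100
print(percentages.round(1))
```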

### Check for null value

Count the number of null values in each column.

In [10]:
luflow2020_null_count = luflow2020.isnull().sum()
luflow2020_null_count = luflow2020_null_count[luflow2020_null_count > 0]
print(f"Rows contain null value: \n{luflow2020_null_count}\n")

luflow2020_null_count = luflow2020_null_count / luflow2020.shape[0] * 100
print(f"Rows contain null value (percentage): \n{luflow2020_null_count}\n")

Rows contain null value: 
dest_port    29268
src_port     29268
dtype: int64

Rows contain null value (percentage): 
dest_port    1.167754
src_port     1.167754
dtype: float64
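
Both port columns report the same number of nulls, which suggests (but does not prove) that the two values are missing on the same rows. A sketch of how this could be checked, using a toy frame whose values are invented and whose column names follow the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "proto": [6, 17, 1, 1, 6],
    "src_port": [32862.0, 53.0, np.nan, np.nan, 443.0],
    "dest_port": [9200.0, 53.0, np.nan, np.nan, 51000.0],
})

src_missing = toy["src_port"].isnull()
dest_missing = toy["dest_port"].isnull()

# True when every row missing one port is also missing the other.
same_rows = bool((src_missing == dest_missing).all())
print(same_rows)

# Protocol numbers of the rows without ports.
missing_protos = toy.loc[src_missing, "proto"].value_counts()
print(missing_protos)
```

If the nulls do co-occur, dropping rows with any null value removes each affected row once, which matches the row count observed after cleaning.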



### Check for infinity value

Count the number of rows that contain an infinite value.

In [11]:
print('Number of samples contains infinity value:')
np.isinf(luflow2020.iloc[:, :-2]).any(axis=1).sum()

Number of samples contains infinity value:


0
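
Note that `iloc[:, :-2]` skips the last two columns, so `duration` is not covered by the check above. One way to cover every numeric column regardless of its position is `select_dtypes`, sketched here on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bytes_in": [0, 10, 20],
    "label": ["benign", "malicious", "benign"],
    "duration": [0.5, np.inf, 1.0],
})

# Select numeric columns by dtype so string columns such as 'label' are skipped.
numeric = toy.select_dtypes(include="number")
rows_with_inf = int(np.isinf(numeric).any(axis=1).sum())
print(rows_with_inf)  # 1
```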

### Check for columns that contain string values

Check for columns of type `object`, which indicates that a column contains string values. The aim is to find any column that mixes numeric and alphabetic values. Such columns often use a string like '?' to indicate a missing value, and therefore need to be cleaned.

In [12]:
luflow2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2506350 entries, 0 to 2506349
Data columns (total 16 columns):
 #   Column         Dtype  
---  ------         -----  
 0   avg_ipt        float64
 1   bytes_in       int64  
 2   bytes_out      int64  
 3   dest_ip        int64  
 4   dest_port      float64
 5   entropy        float64
 6   num_pkts_out   int64  
 7   num_pkts_in    int64  
 8   proto          int64  
 9   src_ip         int64  
 10  src_port       float64
 11  time_end       int64  
 12  time_start     int64  
 13  total_entropy  float64
 14  label          object 
 15  duration       float64
dtypes: float64(6), int64(9), object(1)
memory usage: 306.0+ MB


From here, we can see that only the 'label' column is of type `object`. Since the 'label' column stores the label of each sample as a string, this is expected and the column needs no cleaning.
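
Had a numeric column shown up as `object`, the offending placeholder strings could be located with `pd.to_numeric`. A minimal sketch on a toy column (the '?' placeholder and the values are invented for illustration):

```python
import pandas as pd

# Toy column where a '?' placeholder forces the whole column to hold strings.
toy = pd.DataFrame({"src_port": ["80", "443", "?", "8080"]})

# errors='coerce' turns anything non-numeric into NaN instead of raising.
converted = pd.to_numeric(toy["src_port"], errors="coerce")

# Rows that had a value before conversion but are NaN afterwards.
mask = converted.isnull() & toy["src_port"].notnull()
placeholders = toy.loc[mask, "src_port"].tolist()
print(placeholders)  # ['?']
```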

### Check for duplicates

In [13]:
# check for duplicated column names
luflow2020.columns[luflow2020.columns.duplicated()]

Index([], dtype='object')

In [14]:
luflow_duplicates = luflow2020[luflow2020.duplicated()]
print(f"{luflow_duplicates.shape[0]} rows are duplicates")
print(f"{luflow_duplicates.shape[0]/luflow2020.shape[0]*100:.2f}% of rows are duplicates")

508 rows are duplicates
0.02% of rows are duplicates
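
By default, `duplicated()` does not flag the first occurrence of a repeated row, so the count above is the number of extra copies, which is also the number of rows `drop_duplicates()` would remove. A small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 1, 2], "b": ["x", "x", "x", "y"]})

extra_copies = int(toy.duplicated().sum())          # first occurrence not counted
all_copies = int(toy.duplicated(keep=False).sum())  # every member of a duplicate group
remaining = len(toy.drop_duplicates())

print(extra_copies, all_copies, remaining)  # 2 3 2
```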


## Step 3. Dataset cleaning

The analysis in Step 2 shows that the dataset contains a small number of missing values and duplicates. Since these problematic rows account for only a small portion of the entire dataset (about 1.17% and 0.02% of rows respectively), they are simply removed.

In [15]:
# remove rows that contain missing values
luflow2020 = luflow2020.dropna(how='any')
luflow2020.shape

(2477082, 16)

In [16]:
luflow2020 = luflow2020.drop_duplicates()
luflow2020.shape

(2476585, 16)
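
Dropping duplicates after `dropna` removes 497 rows (2477082 - 2476585), fewer than the 508 duplicates counted in Step 2. One plausible explanation, not verified against the data here, is that some duplicated rows also contained null values and had already been removed by `dropna`. The toy frame below, with invented values, shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "src_port": [80.0, 80.0, np.nan, np.nan, 443.0],
    "label": ["benign", "benign", "benign", "benign", "malicious"],
})

# Before cleaning, the pair of rows with NaN also counts as a duplicate.
dupes_before = int(toy.duplicated().sum())

# After dropping rows with nulls, only the complete duplicate remains.
after_dropna = toy.dropna(how="any")
dupes_after = int(after_dropna.duplicated().sum())

print(dupes_before, dupes_after)  # 2 1
```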

## Step 4. Dataset preparation

In this step, the main goal is to prepare the dataset for training. For the LUFlow dataset, there are two tasks to complete:
* Remove outliers - The LUFlow dataset contains an "outlier" class, whose samples may be malicious but could also be benign. The outliers are therefore not meaningful for training, as they sit in the grey area between the malicious and benign classes.
* Balance the class distribution - There are somewhat more benign samples than malicious samples, which could bias the models towards the benign class. Since we have a large number of samples, we downsample the benign samples so that the ratio between benign and malicious samples is 1:1.

In [17]:
attack = luflow2020[luflow2020['label']=='malicious']
benign = luflow2020[luflow2020['label']=='benign'].sample(n=len(attack)).reset_index(drop=True)

luflow2020_exclude_outlier = pd.concat([attack, benign])
del attack
del benign

luflow2020_exclude_outlier['label'].value_counts()

benign       879740
malicious    879740
Name: label, dtype: int64
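
The two tasks can be sketched on a toy frame as follows. Passing `random_state` to `sample` and using `ignore_index=True` are choices made for this example; the notebook cell above does neither. The data is invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "bytes_in": range(10),
    "label": ["benign"] * 6 + ["malicious"] * 3 + ["outlier"],
})

attack = toy[toy["label"] == "malicious"]
# Draw as many benign rows as there are malicious rows. Rows labelled
# 'outlier' are never selected, so they drop out of the result.
benign = toy[toy["label"] == "benign"].sample(n=len(attack), random_state=10)

balanced = pd.concat([attack, benign], ignore_index=True)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

Selecting only the two classes of interest removes the outliers implicitly, so no separate filtering step is needed.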

## Step 5. Save the dataset

In [18]:
# function to save the cleaned dataset
def save_cleaned_dataset(dataframe: pd.DataFrame, dataset: str, tag: str = ""):
    # create a new directory to save the cleaned dataset
    os.makedirs('./Dataset/dataset_cleaned', exist_ok=True)

    # prefix a non-empty tag with an underscore so it reads as a file-name suffix
    if tag:
        tag = "_" + tag

    dataframe.to_csv(f'Dataset/dataset_cleaned/{dataset}{tag}.csv', index=False)

In [19]:
save_cleaned_dataset(dataframe=luflow2020_exclude_outlier, dataset='LUFlow')
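
Writing with `index=False` keeps the DataFrame index out of the file, so reading the CSV back yields the original columns without an extra index column. A small round-trip sketch, using a temporary directory and invented data rather than the notebook's `Dataset/` folder:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"bytes_in": [0, 10], "label": ["benign", "malicious"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)  # index=False: no index column on disk
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['bytes_in', 'label']
print(reloaded.shape)             # (2, 2)
```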

# LUFlow Dataset (2021)

For the implementation using the LUFlow dataset, the data collected in January 2021 is used as the testing dataset.

## Step 1. Combine the files of the dataset

In [20]:
luflow2021_csv_files = ('2021.01.01',
                        '2021.01.02',
                        '2021.01.03',
                        '2021.01.04',
                        '2021.01.05',
                        '2021.01.06',
                        '2021.01.07',
                        '2021.01.08',
                        '2021.01.09',
                        '2021.01.10',
                        '2021.01.11',
                        '2021.01.12',
                        '2021.01.13',
                        '2021.01.14',
                        '2021.01.15',
                        '2021.01.17',
                        '2021.01.18',
                        '2021.01.19',
                        '2021.01.20',
                        '2021.01.22',
                        '2021.01.23',
                        '2021.01.24',
                        '2021.01.25',
                        '2021.01.26',
                        '2021.01.27',
                        '2021.01.28',
                        '2021.01.29',
                        '2021.01.30',
                        '2021.01.31',
                        )

In [21]:
combine_csv_files(dataset='LUFlow_2021', 
                    file_names=luflow2021_csv_files,
                    reduce_sample_size=True)

original file(Dataset/dataset_combined/LUFlow_2021.csv) has been removed


## Step 2. Preliminary analysis

In [22]:
luflow2021 = pd.read_csv('Dataset/dataset_combined/LUFlow_2021.csv')
luflow2021.head()

| | avg_ipt | bytes_in | bytes_out | dest_ip | dest_port | entropy | num_pkts_out | num_pkts_in | proto | src_ip | src_port | time_end | time_start | total_entropy | label | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 907.3125 | 754 | 32742 | 786 | 9092.0 | 1.111004 | 32 | 20 | 6 | 786 | 57148.0 | 1609467701725488 | 1609467672683006 | 37214.2 | benign | 29.042482 |
| 1 | 0.0 | 0 | 0 | 786 | 40010.0 | 0.0 | 1 | 1 | 6 | 786 | 9300.0 | 1609467690746479 | 1609467690746443 | 0.0 | benign | 3.6e-05 |
| 2 | 0.0 | 174 | 29008 | 786 | 9092.0 | 1.127009 | 7 | 5 | 6 | 786 | 57148.0 | 1609467711729621 | 160946771172828 | 32888.37 | benign | 0.001341 |
| 3 | 0.0 | 0 | 0 | 786 | 40012.0 | 0.0 | 1 | 1 | 6 | 786 | 9300.0 | 1609467690746484 | 1609467690746469 | 0.0 | benign | 1.5e-05 |
| 4 | 1502.333333 | 664 | 13772 | 786 | 9200.0 | 2.341442 | 6 | 5 | 6 | 786 | 56330.0 | 1609467732753774 | 160946772373801 | 33801.062 | benign | 9.015764 |


In [23]:
print(f"Number of rows: {luflow2021.shape[0]}")
print(f"Number of columns: {luflow2021.shape[1]}")

Number of rows: 2699669
Number of columns: 16


In [24]:
print("Columns in the dataset:")
luflow2021.columns

Columns in the dataset:


Index(['avg_ipt', 'bytes_in', 'bytes_out', 'dest_ip', 'dest_port', 'entropy',
       'num_pkts_out', 'num_pkts_in', 'proto', 'src_ip', 'src_port',
       'time_end', 'time_start', 'total_entropy', 'label', 'duration'],
      dtype='object')

In [25]:
print('Class distribution')
luflow2021['label'].value_counts()

Class distribution


benign       1638952
malicious     591372
outlier       469345
Name: label, dtype: int64

From the class distribution, we can see that the LUFlow 2021 dataset is also imbalanced. The benign samples account for about 61% of the total samples, while malicious samples account for only about 22% and outliers for about 17%.

In [26]:
print('Class distribution (normalized):')
luflow2021['label'].value_counts()/luflow2021.shape[0]*100

Class distribution (normalized):


benign       60.709368
malicious    21.905352
outlier      17.385279
Name: label, dtype: float64

### Check for null value

In [27]:
luflow2021_null_count = luflow2021.isnull().sum()
luflow2021_null_count = luflow2021_null_count[luflow2021_null_count > 0]
print(f"Rows contain null value: \n{luflow2021_null_count}\n")

luflow2021_null_count = luflow2021_null_count / luflow2021.shape[0] * 100
print(f"Rows contain null value (percentage): \n{luflow2021_null_count}\n")

Rows contain null value: 
dest_port    33810
src_port     33810
dtype: int64

Rows contain null value (percentage): 
dest_port    1.252376
src_port     1.252376
dtype: float64



### Check for infinity value

In [28]:
print('Number of samples contains infinity value:')
np.isinf(luflow2021.iloc[:, :-2]).any(axis=1).sum()

Number of samples contains infinity value:


0

### Check for columns that contain string values

In [29]:
luflow2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2699669 entries, 0 to 2699668
Data columns (total 16 columns):
 #   Column         Dtype  
---  ------         -----  
 0   avg_ipt        float64
 1   bytes_in       int64  
 2   bytes_out      int64  
 3   dest_ip        int64  
 4   dest_port      float64
 5   entropy        float64
 6   num_pkts_out   int64  
 7   num_pkts_in    int64  
 8   proto          int64  
 9   src_ip         int64  
 10  src_port       float64
 11  time_end       int64  
 12  time_start     int64  
 13  total_entropy  float64
 14  label          object 
 15  duration       float64
dtypes: float64(6), int64(9), object(1)
memory usage: 329.5+ MB


Just like the LUFlow 2020 dataset, only the 'label' column is of type `object`, so no cleaning is needed.

### Check for duplicates

In [30]:
# check for duplicated column names
luflow2021.columns[luflow2021.columns.duplicated()]

Index([], dtype='object')

In [31]:
luflow2021_duplicates = luflow2021[luflow2021.duplicated()]
print(f"{luflow2021_duplicates.shape[0]} rows are duplicates")
print(f"{luflow2021_duplicates.shape[0]/luflow2021.shape[0]*100:.2f}% of rows are duplicates")


290 rows are duplicates
0.01% of rows are duplicates
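
Since the same Step 2 checks are run on both datasets, they could be collected into one function that takes the frame as an argument. The sketch below is one possible shape. `quality_report` is a hypothetical helper, not part of the notebook, and the toy data is invented:

```python
import numpy as np
import pandas as pd


def quality_report(frame):
    """Hypothetical helper: summarise the Step 2 checks in one dictionary."""
    numeric = frame.select_dtypes(include="number")
    return {
        "rows": len(frame),
        "rows_with_null": int(frame.isnull().any(axis=1).sum()),
        "rows_with_inf": int(np.isinf(numeric).any(axis=1).sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }


toy = pd.DataFrame({
    "src_port": [80.0, 80.0, np.nan, 443.0],
    "duration": [0.1, 0.1, 0.2, np.inf],
    "label": ["benign", "benign", "benign", "malicious"],
})
report = quality_report(toy)
print(report)
```

Passing the frame explicitly means each check always runs against the dataset being inspected.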


## Step 3. Dataset cleaning

Just like the LUFlow 2020 dataset, the LUFlow 2021 dataset contains a small number of missing values and duplicated rows. These problematic rows are also removed.

In [32]:
luflow2021 = luflow2021.dropna(how='any')
luflow2021.shape

(2665859, 16)

In [33]:
luflow2021 = luflow2021.drop_duplicates()
luflow2021.shape

(2665569, 16)

## Step 4. Dataset preparation

As before, the outliers are removed and the class distribution is balanced in this step.

In [34]:
attack = luflow2021[luflow2021['label'] == 'malicious']
benign = luflow2021[luflow2021['label'] == 'benign'].sample(n=len(attack)).reset_index(drop=True)

luflow2021_exclude_outlier = pd.concat([attack, benign])
del attack
del benign

luflow2021_exclude_outlier['label'].value_counts()

benign       569003
malicious    569003
Name: label, dtype: int64

## Step 5. Save the dataset

In [35]:
save_cleaned_dataset(dataframe=luflow2021_exclude_outlier, dataset='LUFlow2021')