# Unraveled Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Following the documentation on GitHub, the dataset was downloaded with:

`aws --no-sign-request --recursive --region eu-central-1 s3 cp s3://dapt2021 .`

In [4]:
import pandas as pd
import pathlib
import numpy as np
import os

In [2]:
DATA_FOLDER_FLOWS = '/data/hypermnt/processed/network-flows'

In [27]:
# Load all available flow files; some of them fail to parse
all_data = pd.DataFrame()
all_files = pathlib.Path(DATA_FOLDER_FLOWS).rglob('*.csv')
corrupted_files = 0
fine_files = 0

for csv_file in all_files:
    try:
        this_data = pd.read_csv(csv_file)
        all_data = pd.concat([all_data, this_data])
        fine_files += 1
    except Exception:
        # Report the file that failed to load and count it as corrupted
        print(csv_file)
        corrupted_files += 1
        # WARNING: this permanently deletes the file from disk
        os.remove(csv_file)

/data/hypermnt/processed/network-flows/Week5_Day4_06242021/net1013x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week5_Day4_06242021/net1015x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week5_Day4_06242021/netgw_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week5_Day4_06242021/net1014x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/net1012x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/net1013x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/net1015x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/netgw_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/net1011x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week3_Day5-7_06112021-06132021/net1014x_Flow_labeled.csv
/data/hypermnt/processed/network-flows/Week1_Day3_05282021/net1012x_Flow_labeled.csv
/data

In [28]:
display(fine_files)
display(corrupted_files)

4

169

Only 4 files load while 169 fail, so the problem must lie elsewhere than in a few corrupted files. Investigate.

In [29]:
# What's the precise error?
pd.read_csv('/data/hypermnt/processed/network-flows/Week5_Day4_06242021/net1013x_Flow_labeled.csv')

ParserError: Error tokenizing data. C error: Expected 89 fields in line 924, saw 104
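The error says that some rows have more fields than the header. A quick way to see how widespread this is would be to count the fields of every row and look at the distribution. The following is a minimal sketch using only the standard library; the helper name `field_count_histogram` and the in-memory sample are assumptions made for illustration, not part of the original notebook.

```python
import collections
import csv
import io


def field_count_histogram(lines):
    """Return a Counter mapping number-of-fields -> number of rows."""
    reader = csv.reader(lines)
    return collections.Counter(len(row) for row in reader)


# Tiny in-memory stand-in for one flow file (hypothetical content):
# three rows with 3 fields and one malformed row with 5 fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
histogram = field_count_histogram(sample)
print(histogram)
```

For a real file, the same function could be called on `open(path, newline='', encoding='utf8')`; any key other than the header's field count then points at malformed rows.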


In [24]:
# Work with what we have
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2013 entries, 0 to 902
Data columns (total 89 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            2013 non-null   int64  
 1   expiration_id                 2013 non-null   int64  
 2   src_ip                        2013 non-null   object 
 3   src_mac                       2013 non-null   object 
 4   src_oui                       2013 non-null   object 
 5   src_port                      2013 non-null   int64  
 6   dst_ip                        2013 non-null   object 
 7   dst_mac                       2013 non-null   object 
 8   dst_oui                       2013 non-null   object 
 9   dst_port                      2013 non-null   int64  
 10  protocol                      2013 non-null   int64  
 11  ip_version                    2013 non-null   int64  
 12  vlan_id                       2013 non-null   int64  
 13  tunnel_id

In [None]:
all_data['Activity'].value_counts()

Activity
Normal    2013
Name: count, dtype: int64

Ok, we cannot work with this: the loaded subset contains only `Normal` traffic, so the attack classes cannot even be recognized.

## Fixing the Files

In [58]:
# Instead of deleting the whole file, attempt to delete only the malformed lines:
# keep only lines containing exactly 88 commas (i.e., 89 columns in total)
all_files = pathlib.Path(DATA_FOLDER_FLOWS).rglob('*.csv')

for csv_file in all_files:
    # Read the contents of the file
    with open(csv_file, 'r', encoding='utf8') as rfhandle:
        in_text = rfhandle.readlines()

    # Keep only the lines with exactly 89 columns
    out_text = []

    for line in in_text:
        if line.count(',') == 88:
            out_text.append(line)

    # Overwrite the file with the corrected contents
    with open(csv_file, 'w', encoding='utf8') as wfhandle:
        wfhandle.writelines(out_text)
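Counting raw commas treats a comma inside a quoted field as a separator, and overwriting the files in place makes the fix irreversible. As a hedged alternative, the sketch below parses each row with the standard `csv` module and writes the valid rows to a separate file. The function name `filter_rows`, the `expected_fields` parameter, and the temporary sample files are assumptions for illustration; the value 89 is taken from the `ParserError` message.

```python
import csv
import pathlib
import tempfile

EXPECTED_FIELDS = 89  # field count reported by the ParserError message


def filter_rows(src_path, dst_path, expected_fields=EXPECTED_FIELDS):
    """Copy rows with the expected field count; return (kept, dropped)."""
    kept = dropped = 0
    with open(src_path, newline='', encoding='utf8') as src, \
         open(dst_path, 'w', newline='', encoding='utf8') as dst:
        writer = csv.writer(dst, lineterminator='\n')
        for row in csv.reader(src):
            if len(row) == expected_fields:
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Demonstration on a small hypothetical 3-column file: the quoted field
# contains a comma but is still one field, while the last row has 4 fields.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / 'in.csv'
    dst = pathlib.Path(tmp) / 'out.csv'
    src.write_text('a,b,c\n1,"x,y",3\n4,5,6,7\n', encoding='utf8')
    kept, dropped = filter_rows(src, dst, expected_fields=3)
    cleaned = dst.read_text(encoding='utf8')

print(kept, dropped)
```

Returning the number of dropped rows also makes it possible to report how much data the cleanup discards.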

In [59]:
# After the fix, read the files again...
all_data = pd.DataFrame()
all_files = pathlib.Path(DATA_FOLDER_FLOWS).rglob('*.csv')

for csv_file in all_files:
    all_data = pd.concat([all_data, pd.read_csv(csv_file)])
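Concatenating inside the loop copies all previously loaded rows on every iteration. A common alternative is to collect the frames in a list and concatenate once. The sketch below assumes a hypothetical helper `load_flows` and uses in-memory stand-ins instead of the real flow files; `ignore_index=True` is an optional choice that replaces the repeated per-file row numbers with one unique index.

```python
import io

import pandas as pd


def load_flows(sources):
    """Read every CSV source and concatenate them in a single call."""
    frames = [pd.read_csv(source) for source in sources]
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two flow files (hypothetical content).
sources = [
    io.StringIO("id,Activity\n0,Normal\n1,Normal\n"),
    io.StringIO("id,Activity\n0,Normal\n"),
]
flows = load_flows(sources)
print(len(flows), flows.index.is_unique)
```

In this notebook the call would presumably look like `load_flows(pathlib.Path(DATA_FOLDER_FLOWS).rglob('*.csv'))`.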


In [60]:
all_data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 6863988 entries, 0 to 33816
Data columns (total 89 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   id                            int64  
 1   expiration_id                 int64  
 2   src_ip                        object 
 3   src_mac                       object 
 4   src_oui                       object 
 5   src_port                      int64  
 6   dst_ip                        object 
 7   dst_mac                       object 
 8   dst_oui                       object 
 9   dst_port                      int64  
 10  protocol                      int64  
 11  ip_version                    int64  
 12  vlan_id                       int64  
 13  tunnel_id                     int64  
 14  bidirectional_first_seen_ms   int64  
 15  bidirectional_last_seen_ms    int64  
 16  bidirectional_duration_ms     int64  
 17  bidirectional_packets         int64  
 18  bidirectional_bytes          

In [61]:
len(all_data)

6863988

In [62]:
all_data['Activity'].value_counts()

Activity
Normal                                                                   6766747
Maintain Access                                                            27128
Encrypted Channel: Symmetric Cryptography                                  27111
Bruteforce: Password Guessing                                              22654
Active Scanning: Scanning IP Blocks                                        11450
Data Transfer Size Limits                                                   6988
Remote System Discovery                                                     1002
Exfiltration over C2 channel                                                 446
Remove Traces                                                                362
Unsecured Credentials                                                         88
Hijack Execution Flow: Path Interception by PATH Environment Variable          7
Active Scanning: Vulnerability Scanning                                        5
Name: count, dtype:

In [63]:
all_data['Stage'].value_counts()

Stage
Benign                6766747
Reconnaissance          34794
Lateral Movement        27445
Establish Foothold      27118
Data Exfiltration        7522
Cover up                  362
Name: count, dtype: int64
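The stage counts show a strong class imbalance. As a small worked example, the shares can be computed directly from the numbers printed above; the dictionary below simply restates that output, and the variable names are chosen for illustration.

```python
# Stage counts copied from the value_counts() output above.
stage_counts = {
    'Benign': 6766747,
    'Reconnaissance': 34794,
    'Lateral Movement': 27445,
    'Establish Foothold': 27118,
    'Data Exfiltration': 7522,
    'Cover up': 362,
}

total = sum(stage_counts.values())
attack = total - stage_counts['Benign']
shares = {stage: count / total for stage, count in stage_counts.items()}

for stage, share in shares.items():
    print(f'{stage:<20}{share:9.4%}')
print(f'attack flows: {attack} of {total}')
```

On the loaded DataFrame, `all_data['Stage'].value_counts(normalize=True)` should give the same shares without copying the numbers by hand.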

In [64]:
all_data['DefenderResponse'].value_counts()

DefenderResponse
Benign      6829194
Detected      34794
Name: count, dtype: int64

In [65]:
all_data['Signature'].value_counts()

Signature
APT    63125
AA     34109
Name: count, dtype: int64

In [66]:
# Compute the capture span (converted from ms to s); PCAP analysis will provide more precise info:
start = all_data['bidirectional_first_seen_ms'].min() / 1000
end   = all_data['bidirectional_last_seen_ms'].max() / 1000

In [68]:
# Convert the span in seconds to a timedelta
import datetime

datetime.timedelta(seconds=end - start)

datetime.timedelta(days=40, seconds=45564, microseconds=322000)

In [69]:
45564 / 60 / 60

12.656666666666666

So the capture spans roughly 40 days and 13 hours. However, this cannot be determined precisely, as some data lines were removed. Further analysis will be performed on the PCAP files in the console, outside this notebook. Apart from the flow-level first-seen and last-seen fields, this data provides no timestamps, so finer temporal analysis is not possible via flow data either.
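The manual division by 60 twice can also be written with `divmod`, which splits the `timedelta` into whole hours, minutes, and seconds. This is only a sketch of the arithmetic; the `timedelta` value is copied from the cell output above.

```python
import datetime

# Span copied from the output of the timedelta cell.
span = datetime.timedelta(days=40, seconds=45564, microseconds=322000)

# timedelta.seconds holds the part of the span below one full day.
hours, remainder = divmod(span.seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f'{span.days} d {hours} h {minutes} min {seconds} s')
```

This gives 40 d 12 h 39 min 24 s, which rounds to the 13 hours quoted above.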