# Unified Host and Network Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

In [2]:
FILE_SAMPLE_PATH = '/data/disk2/unified/netflow_day-48'

In [3]:
# Specify file header
FLOWS_FEATURES = [
    'Time',
    'Duration',
    'SrcDevice',
    'DstDevice',
    'Protocol',
    'SrcPort',
    'DstPort',
    'SrcPackets',
    'DstPackets',
    'SrcBytes',
    'DstBytes',
]

In [4]:
!wc -l $FILE_SAMPLE_PATH

174830805 /data/disk2/unified/netflow_day-48


In [5]:
# The file is large (~175M rows), so load only the first 25M rows
data = pd.read_csv(FILE_SAMPLE_PATH, names=FLOWS_FEATURES, nrows=25000000)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000000 entries, 0 to 24999999
Data columns (total 11 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   Time        int64 
 1   Duration    int64 
 2   SrcDevice   object
 3   DstDevice   object
 4   Protocol    int64 
 5   SrcPort     object
 6   DstPort     object
 7   SrcPackets  int64 
 8   DstPackets  int64 
 9   SrcBytes    int64 
 10  DstBytes    int64 
dtypes: int64(7), object(4)
memory usage: 2.0+ GB
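The four `object` columns account for much of the memory footprint reported above. As a minimal sketch, assuming the repeated device and port labels compress well as categoricals, the load could pass explicit dtypes to `pd.read_csv`. The names `FLOWS_DTYPES` and `load_flows` are hypothetical helpers, not part of the original notebook, and the two sample rows are copied from the first rows of the loaded data.

```python
import io

import pandas as pd

FLOWS_FEATURES = [
    'Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'SrcPort',
    'DstPort', 'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes',
]

# Assumption: device and port labels repeat often enough that a categorical
# encoding is smaller than Python strings. IP protocol numbers fit in 8 bits.
FLOWS_DTYPES = {
    'SrcDevice': 'category',
    'DstDevice': 'category',
    'Protocol': 'uint8',
    'SrcPort': 'category',
    'DstPort': 'category',
}


def load_flows(source, nrows=None):
    """Load headerless flow records with compact dtypes (hypothetical helper)."""
    return pd.read_csv(source, names=FLOWS_FEATURES, dtype=FLOWS_DTYPES, nrows=nrows)


# Tiny in-memory sample standing in for the real file
sample = io.StringIO(
    "4060800,0,Comp546437,Comp576843,6,Port67184,443,11,0,2454,0\n"
    "4060800,0,Comp621354,Comp275646,17,Port47461,53,1,0,63,0\n"
)
flows = load_flows(sample)
print(flows.dtypes)
```

Calling `flows.memory_usage(deep=True)` on the real data would show whether the saving is worthwhile; the actual gain depends on how many distinct labels the file contains.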


In [7]:
data.describe()

,Time,Duration,Protocol,SrcPackets,DstPackets,SrcBytes,DstBytes
count,25000000.0,25000000.0,25000000.0,25000000.0,25000000.0,25000000.0,25000000.0
mean,4067552.0,724.1933,8.951324,96.98847,64.01048,30770.75,29701.77
std,3901.965,15280.02,4.876554,87715.16,31274.44,11443530.0,6214358.0
min,4060800.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,4064132.0,1.0,6.0,0.0,0.0,0.0,0.0
50%,4067351.0,1.0,6.0,2.0,1.0,197.0,46.0
75%,4071134.0,1.0,17.0,5.0,4.0,615.0,465.0
max,4074591.0,1222324.0,17.0,294123600.0,57911010.0,35773290000.0,11378840000.0


In [8]:
data.head(5)

,Time,Duration,SrcDevice,DstDevice,Protocol,SrcPort,DstPort,SrcPackets,DstPackets,SrcBytes,DstBytes
0,4060800,0,Comp546437,Comp576843,6,Port67184,443,11,0,2454,0
1,4060800,0,Comp621354,Comp275646,17,Port47461,53,1,0,63,0
2,4060800,0,Comp621354,Comp275646,17,Port03657,53,1,0,63,0
3,4060800,0,Comp621354,ActiveDirectory,17,Port31492,389,1,0,171,0
4,4060800,0,Comp621354,ActiveDirectory,17,Port53645,389,1,0,171,0


In [9]:
data['SrcDevice'].value_counts()

SrcDevice
Comp030334    1381218
Comp257274    1380289
Comp866402    1379894
Comp965575    1379411
Comp844043     848297
               ...   
Comp455916          1
Comp269470          1
Comp530442          1
Comp023028          1
Comp645475          1
Name: count, Length: 21414, dtype: int64

In [10]:
data['DstDevice'].value_counts()

DstDevice
Comp275646             3365418
ActiveDirectory        1565397
EnterpriseAppServer    1523014
Comp576843             1481947
Comp576031             1370069
                        ...   
IP696748                     1
IP733202                     1
IP530273                     1
IP452958                     1
IP407296                     1
Name: count, Length: 46620, dtype: int64

In [11]:
data['Protocol'].value_counts()

Protocol
6     18280388
17     6711322
1         8290
Name: count, dtype: int64

In [12]:
data['SrcPort'].value_counts()

SrcPort
Port87103    44726
Port95765    41923
Port59844    21412
137          18850
Port20995    18554
             ...  
Port34251        1
Port29457        1
515              1
Port71531        1
533              1
Name: count, Length: 64962, dtype: int64

In [13]:
data['DstPort'].value_counts().head(10)

DstPort
53           3671860
80           3114892
443          3051519
514          1579977
389           764749
427           539379
88            486279
Port92667     426792
445           417521
Port95765     413952
Name: count, dtype: int64

## Time Continuity Analysis

In [14]:
data['Time'].iloc[-1] - data['Time'].iloc[0]

13791

In [15]:
# Compute the fraction of the whole file covered by the loaded rows
25000000 / 174830805

0.14299539489050572

So, the loaded 14.3% of the rows covers 13791 seconds (~3 h 50 min).
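The same figures can be turned into a rough flow rate and a naive extrapolation to the whole file. This is only a sketch: it assumes the flow rate stays constant over the file, which real traffic rarely does, so the extrapolated span is an order-of-magnitude estimate rather than a measurement. The variable names are illustrative.

```python
rows_loaded = 25_000_000
rows_total = 174_830_805   # from `wc -l` above
seconds_covered = 13_791   # last minus first timestamp of the loaded part

fraction_loaded = rows_loaded / rows_total
flows_per_second = rows_loaded / seconds_covered

# Naive extrapolation: assumes the same average flow rate for the rest of the file
est_total_seconds = seconds_covered / fraction_loaded
est_total_hours = est_total_seconds / 3600

print(f"Loaded fraction:      {fraction_loaded:.3f}")
print(f"Average flow rate:    {flows_per_second:.0f} flows/s")
print(f"Extrapolated span:    {est_total_hours:.1f} h")
```

If the extrapolated span differs noticeably from the span the file is expected to cover, that points to a flow rate that varies over time rather than to missing data.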

In [16]:
data_sorted = data.sort_values(by='Time', ascending=False)

In [17]:
data_sorted['tstamp'] = pd.to_datetime(data_sorted['Time'], unit='s')

In [18]:
def measure_real_capture_dur(tstamps: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Measure the covered timespan of a capture, ignoring gaps longer than `gap_max_secs`.

    Expects timestamps sorted in descending order. Returns the total covered duration,
    a flag telling whether no such gap was found, and the durations of the contiguous
    segments that were closed by a gap (longest first).
    """
    max_gap = pd.Timedelta(seconds=gap_max_secs)
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = tstamps.iloc[0]
    contiguous = True

    # Walk through the timestamps and look for gaps
    for cur_tstamp in tstamps:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= max_gap:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # Gap found: close the current segment and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations


In [19]:
measure_real_capture_dur(data_sorted['tstamp'], 138)

(Timedelta('0 days 03:49:51'), True, [])
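Iterating over 25M timestamps in a Python loop is slow. A vectorized variant of the same idea is sketched below, under the assumption that working on the integer `Time` column (seconds) is acceptable instead of `Timestamp` objects. The function name `capture_segments` is hypothetical. Unlike the loop above, it also reports the final segment, so a gap-free capture yields a one-element list of segment durations.

```python
import pandas as pd


def capture_segments(times: pd.Series, gap_max_secs: int = 300):
    """Split integer-second timestamps into contiguous segments (hypothetical helper).

    Returns the total covered seconds, a flag telling whether no gap longer than
    `gap_max_secs` exists, and the segment durations in seconds (longest first).
    """
    t = times.sort_values().reset_index(drop=True)

    # A gap longer than the threshold starts a new segment
    is_break = t.diff() > gap_max_secs
    segment_id = is_break.cumsum()

    # First and last timestamp of every segment
    bounds = t.groupby(segment_id).agg(['min', 'max'])
    durations = (bounds['max'] - bounds['min']).sort_values(ascending=False).tolist()

    return sum(durations), not is_break.any(), durations


# Toy example with one large gap between 2 and 1000
total, contiguous, durations = capture_segments(pd.Series([1001, 0, 2, 1000, 1]))
print(total, contiguous, durations)
```

On the loaded data the call would be `capture_segments(data['Time'], 138)`, which needs neither the descending sort nor the `tstamp` conversion.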

The loaded part of the capture is continuous: no gap between consecutive flow timestamps exceeds the 138-second threshold. Since analyzing the whole file is infeasible and we trust the documentation, we consider the whole capture continuous.
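Should a full-file check ever be needed, one lower-memory option is to stream only the `Time` column in chunks and keep the set of distinct timestamps, which stays small because timestamps have one-second resolution. The sketch below assumes the file is a headerless CSV with the timestamp in the first column, as in the loaded sample. The function name `find_time_gaps` is hypothetical and the approach was not run on the full file here.

```python
import io

import pandas as pd


def find_time_gaps(source, gap_max_secs=300, chunksize=1_000_000):
    """Return the gaps (in seconds) longer than `gap_max_secs` (hypothetical helper).

    Reads only the first column chunk by chunk, so the row order in the file
    does not matter and memory use is bounded by the number of distinct seconds.
    """
    seen = set()
    for chunk in pd.read_csv(source, header=None, usecols=[0], chunksize=chunksize):
        seen.update(chunk.iloc[:, 0].unique().tolist())

    times = pd.Series(sorted(seen), dtype='int64')
    gaps = times.diff().dropna()
    return gaps[gaps > gap_max_secs]


# Tiny in-memory example: one gap of 395 seconds between 105 and 500
sample = io.StringIO("100,0,a\n101,0,b\n105,0,c\n500,0,d\n501,0,e\n")
gaps = find_time_gaps(sample, gap_max_secs=300, chunksize=2)
print(gaps.tolist())
```

An empty result would mean that no gap above the threshold exists anywhere in the file.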