# Westermo Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Data source: [https://github.com/westermo/network-traffic-dataset](https://github.com/westermo/network-traffic-dataset)

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
ROOT_DIR = '/data/data_surv/westermo/data'

DATA_PCAPS_DIR = os.path.join(ROOT_DIR, 'extended/pcaps')
DATA_CSVS_DIR = os.path.join(ROOT_DIR, 'extended/flows')

## PCAP Analysis

Merge the three PCAPs (left, bottom, right) into one. According to the documentation, they were captured over the same period on different devices, so merging them is appropriate for this analysis.

In [3]:
pcap_file_merged = os.path.join(DATA_PCAPS_DIR, 'merged.pcap')
# Note: os.listdir() also returns merged.pcap if it already exists from a previous run
pcap_files_all = [os.path.join(DATA_PCAPS_DIR, file) for file in os.listdir(DATA_PCAPS_DIR)]
pcap_files_str = ' '.join(pcap_files_all)
pcap_files_str

'/data/data_surv/westermo/data/extended/pcaps/left.pcap /data/data_surv/westermo/data/extended/pcaps/bottom.pcap /data/data_surv/westermo/data/extended/pcaps/merged.pcap /data/data_surv/westermo/data/extended/pcaps/right.pcap'

In [4]:
# Merge the PCAP into one file for duration and number of packets analysis
!mergecap -w $pcap_file_merged $pcap_files_str

mergecap: The file "/data/data_surv/westermo/data/extended/pcaps/merged.pcap" appears to be damaged or corrupt.
(pcapng: total block lengths (first 112 and second 1093339656) don't match)
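
The error above is consistent with `merged.pcap` appearing in its own input list. A minimal sketch of building the input list without the output file follows; the helper name `list_source_pcaps` and its `merged_name` parameter are hypothetical and not part of the original notebook.

```python
import os

def list_source_pcaps(pcap_dir: str, merged_name: str = 'merged.pcap') -> list:
    """Return the .pcap files in pcap_dir, skipping the merge output itself."""
    return sorted(
        os.path.join(pcap_dir, name)
        for name in os.listdir(pcap_dir)
        # Assumption: every source capture ends with .pcap and only the
        # merge output is called merged_name
        if name.endswith('.pcap') and name != merged_name
    )
```

The resulting list can be joined with spaces and passed to `mergecap -w` in the same way as `pcap_files_str`.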


In [5]:
# Analyze the merged file via capinfos
!capinfos $pcap_file_merged

File name:           /data/data_surv/westermo/data/extended/pcaps/merged.pcap
File type:           Wireshark/... - pcapng
File encapsulation:  Ethernet
File timestamp precision:  microseconds (6)
Packet size limit:   file hdr: (not set)
Number of packets:   2,245
File size:           261 kB
Data size:           185 kB
Capture duration:    1.375091 seconds
First packet time:   2023-03-10 10:29:34.078823
Last packet time:    2023-03-10 10:29:35.453914
Data byte rate:      135 kBps
Data bit rate:       1,081 kbps
Average packet size: 82.80 bytes
Average packet rate: 1,632 packets/s
SHA256:              4eff608e75e643baea8a509d6d0daeb882fae1479d149c68077d38960ced4d15
RIPEMD160:           a4bc98c33ce39393d9aba2911623cc2430576e2a
SHA1:                f4a455271fbcef078fbf4655744c19bc69d725cc
Strict time order:   True
Capture oper-sys:    Linux 5.15.0-133-generic
Capture application: Mergecap (Wireshark) 3.6.2 (Git v3.6.2 packaged as 3.6.2-2)
Number of interfaces in file: 1
Interface #0 info:


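The first packet time matches the start of the flow records analyzed below. However, the merged file shown here covers only about 1.4 seconds rather than the expected span of ~1h 30m, most likely because `mergecap` aborted on the error above after `merged.pcap` was listed among its own inputs.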

## CSV Analysis

In [6]:
os.listdir(DATA_CSVS_DIR)

['output_left.csv', 'output_right.csv', 'output_bottom.csv']

In [7]:
# Load all CSV files into one DataFrame (similar handling as PCAPs)
csv_files_all = [os.path.join(DATA_CSVS_DIR, file) for file in os.listdir(DATA_CSVS_DIR)]
# Concatenate in a single call; each file keeps its original row index
data = pd.concat([pd.read_csv(filepath) for filepath in csv_files_all])

In [8]:
data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 68729 entries, 0 to 55424
Data columns (total 64 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sAddress         68729 non-null  object 
 1   rAddress         68729 non-null  object 
 2   sMACs            68729 non-null  object 
 3   rMACs            68729 non-null  object 
 4   sIPs             57887 non-null  object 
 5   rIPs             57887 non-null  object 
 6   protocol         68729 non-null  object 
 7   startDate        68729 non-null  object 
 8   endDate          68729 non-null  object 
 9   start            68729 non-null  float64
 10  end              68729 non-null  float64
 11  startOffset      68729 non-null  float64
 12  endOffset        68729 non-null  float64
 13  duration         68729 non-null  float64
 14  sPackets         68729 non-null  int64  
 15  rPackets         68729 non-null  int64  
 16  sBytesSum        68729 non-null  int64  
 17  rBytesSum        

In [9]:
data.head()

Unnamed: 0,sAddress,rAddress,sMACs,rMACs,sIPs,rIPs,protocol,startDate,endDate,start,end,startOffset,endOffset,duration,sPackets,rPackets,sBytesSum,rBytesSum,sBytesMax,rBytesMax,sBytesMin,rBytesMin,sBytesAvg,rBytesAvg,sLoad,rLoad,sPayloadSum,rPayloadSum,sPayloadMax,rPayloadMax,sPayloadMin,rPayloadMin,sPayloadAvg,rPayloadAvg,sInterPacketAvg,rInterPacketAvg,sttl,rttl,sAckRate,rAckRate,sUrgRate,rUrgRate,sFinRate,rFinRate,sPshRate,rPshRate,sSynRate,rSynRate,sRstRate,rRstRate,sWinTCP,rWinTCP,sFragmentRate,rFragmentRate,sAckDelayMax,rAckDelayMax,sAckDelayMin,rAckDelayMin,sAckDelayAvg,rAckDelayAvg,IT_B_Label,IT_M_Label,NST_B_Label,NST_M_Label
0,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,,,ARP,2023-03-10 10:29:34.078825,2023-03-10 10:29:34.078829,1678441000.0,1678441000.0,0.0,4e-06,5e-06,4,0,240,0,60.0,,60.0,,60.0,,379963600.0,0.0,0,0,0.0,,0.0,,0.0,,1e-06,,,,,,,,,,,,,,,,,,,,,,,,,,0,Normal,0,Normal
1,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,,,ARP,2023-03-10 10:29:35.102354,2023-03-10 10:29:35.102358,1678441000.0,1678441000.0,1.023529,1.023533,5e-06,4,0,240,0,60.0,,60.0,,60.0,,379963600.0,0.0,0,0,0.0,,0.0,,0.0,,1e-06,,,,,,,,,,,,,,,,,,,,,,,,,,0,Normal,0,Normal
2,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,,,ARP,2023-03-10 10:29:36.126359,2023-03-10 10:29:36.126364,1678441000.0,1678441000.0,2.047534,2.047539,6e-06,4,0,240,0,60.0,,60.0,,60.0,,319638300.0,0.0,0,0,0.0,,0.0,,0.0,,2e-06,,,,,,,,,,,,,,,,,,,,,,,,,,0,Normal,0,Normal
3,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,,,ARP,2023-03-10 10:29:40.222634,2023-03-10 10:29:40.222638,1678441000.0,1678441000.0,6.143809,6.143813,5e-06,4,0,240,0,60.0,,60.0,,60.0,,398779000.0,0.0,0,0,0.0,,0.0,,0.0,,1e-06,,,,,,,,,,,,,,,,,,,,,,,,,,0,Normal,0,Normal
4,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,b8:27:eb:15:88:9c,ff:ff:ff:ff:ff:ff,,,ARP,2023-03-10 10:29:41.246264,2023-03-10 10:29:41.246268,1678441000.0,1678441000.0,7.167439,7.167443,5e-06,4,0,240,0,60.0,,60.0,,60.0,,379963600.0,0.0,0,0,0.0,,0.0,,0.0,,1e-06,,,,,,,,,,,,,,,,,,,,,,,,,,0,Normal,0,Normal


In [10]:
data.describe()

Unnamed: 0,start,end,startOffset,endOffset,duration,sPackets,rPackets,sBytesSum,rBytesSum,sBytesMax,rBytesMax,sBytesMin,rBytesMin,sBytesAvg,rBytesAvg,sLoad,rLoad,sPayloadSum,rPayloadSum,sPayloadMax,rPayloadMax,sPayloadMin,rPayloadMin,sPayloadAvg,rPayloadAvg,sInterPacketAvg,rInterPacketAvg,sttl,rttl,sAckRate,rAckRate,sUrgRate,rUrgRate,sFinRate,rFinRate,sPshRate,rPshRate,sSynRate,rSynRate,sRstRate,rRstRate,sWinTCP,rWinTCP,sFragmentRate,rFragmentRate,sAckDelayMax,rAckDelayMax,sAckDelayMin,rAckDelayMin,sAckDelayAvg,rAckDelayAvg,IT_B_Label,NST_B_Label
count,68729.0,68729.0,68729.0,68729.0,68729.0,68729.0,68729.0,68729.0,68729.0,67298.0,52681.0,67298.0,52681.0,67298.0,52681.0,68729.0,68729.0,68729.0,68729.0,67298.0,52681.0,67298.0,52681.0,67298.0,52681.0,62892.0,52620.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,50827.0,52111.0,56523.0,52172.0,49844.0,50804.0,49844.0,50804.0,49844.0,50804.0,68729.0,68729.0
mean,1678443000.0,1678443000.0,2747.186999,2747.501637,0.314639,38.466106,29.529762,2673.185686,2042.684136,89.62776,111.860443,61.976686,54.541922,70.30795,68.887781,98588370.0,1802648.0,671.189236,529.234268,36.918393,60.152161,9.545663,2.906456,17.661093,17.112158,0.011273,0.013121,64.0,63.318948,0.996968,0.94784,0.0,0.0,0.007087,0.003846,0.689621,0.71774,0.02829,0.051338,0.007313,0.000822,1236.08322,1097.279905,0.0,0.0,0.111871,0.174028,0.000101,0.000456,0.011206,0.01851,0.652956,0.213898
std,1574.062,1574.066,1574.062202,1574.065929,0.211489,41.413802,27.212631,2953.707989,1791.675789,84.975692,201.514537,33.951491,6.240214,34.792253,21.376537,304631300.0,43791690.0,866.928168,717.767704,86.076003,201.434117,35.983947,6.13187,36.976179,21.084973,0.019649,0.009113,0.0,3.20834,0.023996,0.206993,0.0,0.0,0.043759,0.035249,0.268044,0.269736,0.143077,0.203571,0.045076,0.012385,2660.224975,3036.195141,0.0,0.0,0.096143,0.207029,0.000663,0.006664,0.009457,0.022371,0.476033,0.410059
min,1678441000.0,1678441000.0,0.0,2e-06,1e-06,0.0,0.0,0.0,0.0,36.0,36.0,36.0,36.0,36.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64.0,39.0,0.571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,502.0,492.862,0.0,0.0,0.0,0.0,0.0,0.0,0.000132,0.000113,0.0,0.0
25%,1678442000.0,1678442000.0,1394.870249,1394.955187,0.005006,4.0,8.0,380.0,488.0,65.0,64.0,52.0,52.0,60.0,58.0,25195.67,15240.26,16.0,0.0,13.0,12.0,0.0,0.0,7.0,6.0,0.00049,0.007092,64.0,64.0,1.0,1.0,0.0,0.0,0.0,0.0,0.522,0.5,0.0,0.0,0.0,0.0,502.0,502.0,0.0,0.0,0.001,0.001,0.0,0.0,0.000531,0.000477,0.0,0.0
50%,1678443000.0,1678443000.0,2743.821532,2744.232317,0.419887,16.0,30.0,1040.0,1856.0,65.0,65.0,52.0,52.0,65.0,62.2,127281.3,31974.24,222.0,192.0,13.0,13.0,0.0,0.0,13.0,10.2,0.00433,0.009277,64.0,64.0,1.0,1.0,0.0,0.0,0.0,0.0,0.571,0.8,0.0,0.0,0.0,0.0,503.05,506.2,0.0,0.0,0.189,0.04,0.0,0.0,0.015272,0.003247,1.0,0.0
75%,1678445000.0,1678445000.0,4125.209604,4125.51895,0.473531,92.0,48.0,6560.0,4020.0,97.0,109.0,65.0,60.0,70.5,73.867,938626.2,77275.98,1776.0,1296.0,45.0,57.0,13.0,8.0,18.5,21.867,0.011682,0.016308,64.0,64.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,509.0,509.0,0.0,0.0,0.194,0.459,0.0,0.0,0.019493,0.031092,1.0,0.0
max,1678446000.0,1678446000.0,5443.588241,5443.588243,0.5,132.0,774.0,31384.0,34048.0,1500.0,1412.0,328.0,233.0,627.68,294.833,7872000000.0,4036668000.0,28784.0,7376.0,1448.0,1360.0,300.0,205.0,575.68,241.5,0.498549,0.166539,64.0,64.0,1.0,1.0,0.0,0.0,0.5,0.5,1.0,1.0,1.0,1.0,0.429,0.5,14600.0,64240.0,0.0,0.0,0.496,0.499,0.039,0.146,0.10907,0.249889,1.0,1.0


In [11]:
# Addresses breakdown
data['sAddress'].value_counts()

sAddress
198.18.134.11        28264
198.18.134.12        18753
b8:27:eb:15:88:9c     7975
198.18.134.1          2958
198.18.134.49         1057
0.0.0.0                924
198.18.134.47          775
198.18.134.42          631
198.18.134.3           513
198.18.134.31          430
198.18.134.43          426
198.18.134.99          415
198.18.134.41          392
198.18.134.48          389
198.18.134.6           376
00:07:7c:88:6e:83      363
00:07:7c:88:6e:80      362
00:07:7c:29:de:60      362
00:07:7c:88:6e:60      362
00:07:7c:8c:43:60      362
198.18.134.38          326
b8:27:eb:d1:b7:ef      288
198.18.134.45          251
198.18.134.46          209
198.18.134.39          194
198.18.134.40          183
00:07:7c:29:de:40      181
b8:27:eb:6d:4f:4b      112
198.18.134.5           107
00:07:7c:8c:43:63      104
00:07:7c:29:de:61       93
198.18.134.4            77
198.18.134.37           76
00:07:7c:88:6e:63       74
198.18.134.50           71
00:24:9b:6d:b8:89       69
00:07:7c:8c:43:83  

In [12]:
data['rAddress'].value_counts()

rAddress
198.18.134.15        19543
198.18.134.31        19283
198.18.134.12         9332
ff:ff:ff:ff:ff:ff     8647
198.18.134.99         5160
224.0.0.22            2341
01:80:c2:00:00:0e     1629
224.0.0.1             1277
255.255.255.255        924
b8:27:eb:3e:5d:96      293
00:24:9b:6d:b8:89      154
b8:27:eb:d1:b7:ef      112
198.18.134.6             9
224.0.0.251              6
198.18.134.45            5
198.18.134.51            5
00:07:7c:8c:43:63        4
198.18.134.48            2
00:07:7c:8c:43:83        1
00:07:7c:88:6e:83        1
00:07:7c:29:de:61        1
Name: count, dtype: int64

Interestingly, the dataset mixes IP addresses and MAC addresses in a single feature. Apparently, when a flow uses a protocol without an IP layer (e.g., ARP), the field holds the MAC address, while flows with an IP layer store the IP address instead.
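
To separate the two kinds of values, one option is to try parsing each entry as an IP address. The sketch below uses the standard-library `ipaddress` module on a few values taken from the outputs above; the helper `address_kind` is hypothetical, and treating every non-IP value as a MAC address is an assumption.

```python
import ipaddress

import pandas as pd

def address_kind(value: str) -> str:
    """Classify an address string as 'ip' or 'mac'."""
    try:
        ipaddress.ip_address(value)
        return 'ip'
    except ValueError:
        # Assumption: anything that is not a valid IP address is a MAC address
        return 'mac'

# Sample values copied from the sAddress and rAddress breakdowns
sample = pd.Series(['198.18.134.11', 'b8:27:eb:15:88:9c', '0.0.0.0', 'ff:ff:ff:ff:ff:ff'])
kinds = sample.map(address_kind)
```

Applied to the full column, `data['sAddress'].map(address_kind).value_counts()` would give the split between IP-based and MAC-only flows.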

In [13]:
# According to the documentation, the NST labeling is more precise, so break down its labels
data['NST_M_Label'].value_counts()

NST_M_Label
Normal                     54028
BAD-MISCONF-DUPLICATION     4611
BAD-MISCONF                 3889
BAD-SSH                     2968
BAD-PORTSCAN2               2179
BAD-MITM                     708
BAD-PORTSCAN1                267
GOOD-SSH                      79
Name: count, dtype: int64

### Temporal and Data Continuity Analysis

In [14]:
pd.set_option('display.float_format', lambda x: f'{x:.24f}')

In [15]:
# What's the relation between 'startDate' and 'start' feature values?
data[['startDate', 'start', 'endDate', 'end']]

Unnamed: 0,startDate,start,endDate,end
0,2023-03-10 10:29:34.078825,1678440574.078824996948242187500000,2023-03-10 10:29:34.078829,1678440574.078829050064086914062500
1,2023-03-10 10:29:35.102354,1678440575.102354049682617187500000,2023-03-10 10:29:35.102358,1678440575.102358102798461914062500
2,2023-03-10 10:29:36.126359,1678440576.126358985900878906250000,2023-03-10 10:29:36.126364,1678440576.126363992691040039062500
3,2023-03-10 10:29:40.222634,1678440580.222634077072143554687500,2023-03-10 10:29:40.222638,1678440580.222637891769409179687500
4,2023-03-10 10:29:41.246264,1678440581.246263980865478515625000,2023-03-10 10:29:41.246268,1678440581.246268033981323242187500
...,...,...,...,...
55420,2023-03-10 11:57:10.821072,1678445830.821072101593017578125000,2023-03-10 11:57:10.821072,1678445830.821072101593017578125000
55421,2023-03-10 11:59:29.972545,1678445969.972544908523559570312500,2023-03-10 11:59:29.973469,1678445969.973469018936157226562500
55422,2023-03-10 11:50:21.568717,1678445421.568717002868652343750000,2023-03-10 11:50:21.568717,1678445421.568717002868652343750000
55423,2023-03-10 11:57:40.766799,1678445860.766798973083496093750000,2023-03-10 11:57:40.766800,1678445860.766799926757812500000000


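Apparently, they encode the same instant: `startDate` is a formatted date string, while `start` is a Unix timestamp. Note that `pd.to_datetime` interprets the Unix timestamp as UTC, so the converted values below appear one hour earlier than the date strings, which are presumably in local time (UTC+1). Let's use `end` to estimate the duration and continuity.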
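
The relation between the two representations can be checked on a single row. This sketch uses the values from the first row of the output above and measures the offset between the date string and the converted Unix timestamp; the interpretation of that offset as a local-time shift is an assumption.

```python
import pandas as pd

# Sample values copied from the first row of the output above
sample = pd.DataFrame({
    'startDate': ['2023-03-10 10:29:34.078825'],
    'start': [1678440574.078825],
})

as_date = pd.to_datetime(sample['startDate'])
# Unix timestamps are converted to timezone-naive values in UTC
as_epoch_utc = pd.to_datetime(sample['start'], unit='s')

# Offset between both representations, in hours
offset = as_date - as_epoch_utc
hours = (offset.dt.total_seconds() / 3600).round(3)
```

A constant offset does not affect the span or the gaps between timestamps, so either column can serve the continuity analysis.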

In [16]:
timestamps = pd.to_datetime(data['end'], unit='s', origin='unix')
timestamps = timestamps.sort_values(ascending=False)
timestamps

4719    2023-03-10 11:00:17.667068958
55368   2023-03-10 11:00:16.495696068
55367   2023-03-10 11:00:16.482666016
55371   2023-03-10 11:00:16.470124006
55370   2023-03-10 11:00:16.466128111
                     ...             
0       2023-03-10 09:29:34.565763950
1       2023-03-10 09:29:34.500118971
0       2023-03-10 09:29:34.078829050
0       2023-03-10 09:29:34.078828096
5       2023-03-10 09:29:34.078826904
Name: end, Length: 68729, dtype: datetime64[ns]

In [17]:
# Compute the span
capture_span = timestamps.iloc[0] - timestamps.iloc[-1]
capture_span

Timedelta('0 days 01:30:43.588242054')

In [18]:
# Compute 1% of the span to use as the maximum gap for the continuity estimation
capture_span * 0.01

Timedelta('0 days 00:00:54.435882420')

In [19]:
# Measure capture duration and continuity
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Computes the capture duration, excluding gaps longer than `gap_max_secs`.

    Expects a Series of timestamps sorted in descending order. Returns a tuple of
    (total duration, contiguity flag, durations of the contiguous blocks).
    """
    gap_max = pd.Timedelta(seconds=gap_max_secs)
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the timestamps to find gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= gap_max:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # Gap too long: close the current block and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    # Log the final continuous block and sort the blocks by length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

In [20]:
measure_real_capture_dur(timestamps, int(capture_span.total_seconds() * 0.01))

(Timedelta('0 days 01:30:43.588242054'),
 True,
 [Timedelta('0 days 01:30:43.588242054')])

The flow records show a continuous capture: no gap between consecutive flow end times exceeds 54 seconds (1% of the span), and the capture lasts approximately 1.5 hours.
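
As a cross-check on the loop-based function, the same gap-based split can be sketched with vectorized pandas operations. This is an alternative formulation and not the function used in this notebook; `split_contiguous_blocks` and the sample timestamps are hypothetical.

```python
import pandas as pd

def split_contiguous_blocks(end_times: pd.Series, gap_max: pd.Timedelta) -> list:
    """Return the durations of contiguous blocks, longest first."""
    ordered = end_times.sort_values().reset_index(drop=True)
    # Time difference to the previous timestamp (NaT for the first one)
    gaps = ordered.diff()
    # A new block starts whenever the gap exceeds the threshold
    block_ids = (gaps > gap_max).cumsum()
    grouped = ordered.groupby(block_ids)
    spans = grouped.max() - grouped.min()
    return sorted(spans, reverse=True)

# Hypothetical timestamps: three close together, a long gap, then two more
sample = pd.Series(pd.to_datetime([0, 10, 20, 500, 505], unit='s'))
blocks = split_contiguous_blocks(sample, pd.Timedelta(seconds=60))
```

A single returned block means that the capture is contiguous under the chosen threshold.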