# AIT-LDSv2.0 Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

We downloaded the classical log dataset (https://zenodo.org/records/5789064) and merged it with the NetFlow data (https://zenodo.org/records/13168643), placing the flows of each scenario under that scenario's `flow` directory.

In [1]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_columns', None)

In [2]:
DATA_ROOT = '/data/surv/ait/scenarios/'

In [3]:
!ls $DATA_ROOT

fox  harrison  russellmitchell	santos	shaw  wardbeck	wheeler  wilson


In [4]:
# Let's look at the first scenario to get an idea of its structure
SCENARIO_FOX = os.path.join(DATA_ROOT, 'fox')

!ls $SCENARIO_FOX

dataset.yaml  environment  flow  fox.zip  gather  labels  processing  rules


In [5]:
# Logs are supposed to be under the 'gather' directory,
# whereas the labeling is in the 'labels' directory
!ls $SCENARIO_FOX/labels
!echo ""
!ls $SCENARIO_FOX/labels/inet-firewall
!echo ""
!ls $SCENARIO_FOX/labels/inet-firewall/logs

inet-firewall  internal_share  intranet_server	monitoring  vpn

logs

dnsmasq.log


In [6]:
# What does a typical labeling look like?
!cat $SCENARIO_FOX/labels/inet-firewall/logs/dnsmasq.log | head -n 10

{"line": 1, "labels": ["dnsteal", "attacker", "dnsteal-received"], "rules": {"dnsteal": ["dnsteal.domain.match"], "attacker": ["dnsteal.domain.match"], "dnsteal-received": ["dnsteal.domain.received"]}}
{"line": 2, "labels": ["dnsteal", "attacker", "dnsteal-received"], "rules": {"dnsteal": ["dnsteal.domain.match"], "attacker": ["dnsteal.domain.match"], "dnsteal-received": ["dnsteal.domain.received"]}}
{"line": 3, "labels": ["dnsteal", "attacker", "dnsteal-received"], "rules": {"dnsteal": ["dnsteal.domain.match"], "attacker": ["dnsteal.domain.match"], "dnsteal-received": ["dnsteal.domain.received"]}}
{"line": 4, "labels": ["dnsteal", "attacker", "dnsteal-received"], "rules": {"dnsteal": ["dnsteal.domain.match"], "attacker": ["dnsteal.domain.match"], "dnsteal-received": ["dnsteal.domain.received"]}}
{"line": 5, "labels": ["dnsteal", "attacker", "dnsteal-received"], "rules": {"dnsteal": ["dnsteal.domain.match"], "attacker": ["dnsteal.domain.match"], "dnsteal-received": ["dnsteal.domain.rec
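The label files appear to be in JSON Lines format: one JSON object per line with the keys `line`, `labels`, and `rules`. A minimal sketch of reading such records follows. The helper name `parse_label_lines` is hypothetical, and treating `line` as the line number of the corresponding log file under `gather` is an assumption, not something the output above confirms.

```python
import json

# One label record copied from the output above
sample = (
    '{"line": 1, "labels": ["dnsteal", "attacker", "dnsteal-received"], '
    '"rules": {"dnsteal": ["dnsteal.domain.match"], '
    '"attacker": ["dnsteal.domain.match"], '
    '"dnsteal-received": ["dnsteal.domain.received"]}}'
)


def parse_label_lines(lines):
    """Map each labeled line number to its list of labels (hypothetical helper)."""
    labels_by_line = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip empty lines
        record = json.loads(raw)
        labels_by_line[record["line"]] = record["labels"]
    return labels_by_line


labels_by_line = parse_label_lines([sample])
print(labels_by_line)
```

On a real file, the same helper could be fed an open file handle instead of the one-element list.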

In [7]:
# What does a log look like?
!ls $SCENARIO_FOX/gather
!echo ""
!ls $SCENARIO_FOX/gather/inet-firewall
!echo ""
!ls $SCENARIO_FOX/gather/inet-firewall/logs
!echo ""
!ls $SCENARIO_FOX/gather/inet-firewall/logs/dnsmasq.log

attacker_0   ext_user_6		  internal_share     remote_employee_2
cloud_share  inet-dns		  intranet_server    remote_employee_3
ext_user_0   inet-firewall	  jonesmorgan_mail   taylorcruz_mail
ext_user_1   internal_employee_0  mail		     vpn
ext_user_2   internal_employee_1  miller_mail	     webserver
ext_user_3   internal_employee_2  monitoring	     whittaker_mail
ext_user_4   internal_employee_3  remote_employee_0
ext_user_5   internal_employee_4  remote_employee_1

configs  facts.json  logs

audit	    dnsmasq.log  kern.log.1	       suricata  syslog.3
auth.log    journal	 shorewall-init.log    syslog.1  syslog.4
auth.log.1  kern.log	 shorewall-init.log.1  syslog.2  syslog.5

/data/surv/ait/scenarios/fox/gather/inet-firewall/logs/dnsmasq.log


In [8]:
!cat $SCENARIO_FOX/gather/inet-firewall/logs/dnsmasq.log | head -n 10

Jan 15 00:00:03 dnsmasq[14522]: query[A] 3x6-.546-.2PoxC1PkS*qtk0p2kKZGSYsWe2X*u678tHnPA6vJb6cp7itF6Qlb7/ZNOUZ*-.tO4afCcp4TpC6S0KJF27aqpRaGLcHzZCkPnUWPug2PpcImBWfcLFKlm5p5r3-.Ewvg4xYu8FqM2a/lO4V8qfcNr2i1bRY/u8wZM19IvDh7deB7cBxUezv5CAKT-.customers_2018.xlsx.ycgjslfptkev.com from 10.35.33.111
Jan 15 00:00:03 dnsmasq[14522]: forwarded 3x6-.546-.2PoxC1PkS*qtk0p2kKZGSYsWe2X*u678tHnPA6vJb6cp7itF6Qlb7/ZNOUZ*-.tO4afCcp4TpC6S0KJF27aqpRaGLcHzZCkPnUWPug2PpcImBWfcLFKlm5p5r3-.Ewvg4xYu8FqM2a/lO4V8qfcNr2i1bRY/u8wZM19IvDh7deB7cBxUezv5CAKT-.customers_2018.xlsx.ycgjslfptkev.com to 192.168.255.254
Jan 15 00:00:03 dnsmasq[14522]: reply 3x6-.546-.2PoxC1PkS*qtk0p2kKZGSYsWe2X*u678tHnPA6vJb6cp7itF6Qlb7/ZNOUZ*-.tO4afCcp4TpC6S0KJF27aqpRaGLcHzZCkPnUWPug2PpcImBWfcLFKlm5p5r3-.Ewvg4xYu8FqM2a/lO4V8qfcNr2i1bRY/u8wZM19IvDh7deB7cBxUezv5CAKT-.customers_2018.xlsx.ycgjslfptkev.com is 195.128.194.168
Jan 15 00:00:20 dnsmasq[14522]: query[A] 3x6-.547-.WharXpRiFOnbAvznOFBIiR4EDr2FH97sAZEw0PT77TTzCvi*vLTdNSfYJ*mB-.Mro0pNd/COk

In [9]:
!ls $SCENARIO_FOX/gather/inet-firewall/logs/suricata

certs	  files		       log.pcap.1642405636  suricata.log
core	  log.pcap.1642159247  log.pcap.1642508536  suricata-start.log
eve.json  log.pcap.1642162264  log.pcap.1642592664
fast.log  log.pcap.1642238378  stats.log


In [10]:
!cat $SCENARIO_FOX/gather/inet-firewall/logs/suricata/stats.log | head -n 20

------------------------------------------------------------------------------------
Date: 1/14/2022 -- 11:20:54 (uptime: 0d, 00h 00m 40s)
------------------------------------------------------------------------------------
Counter                                       | TM Name                   | Value
------------------------------------------------------------------------------------
capture.kernel_packets                        | Total                     | 2
decoder.pkts                                  | Total                     | 2
decoder.bytes                                 | Total                     | 596
decoder.ipv4                                  | Total                     | 2
decoder.ethernet                              | Total                     | 2
decoder.udp                                   | Total                     | 2
decoder.avg_pkt_size                          | Total                     | 298
decoder.max_pkt_size                          | Total      

### Log Line and Packet Counts

In [11]:
import glob

# Global counters
logs_lines_total = 0
pcaps_pkts_total = 0

# Collect the paths of all scenario directories
scenarios_paths = []
for scenario in os.scandir(DATA_ROOT):
    if scenario.is_dir():
        scenarios_paths.append(scenario.path)

scenarios_paths.sort()

for scenario_path in scenarios_paths:
    scenario_logs_path = os.path.join(scenario_path, 'gather')
    logs_lines_sc = 0
    pcaps_pkts_sc = 0

    # Count the number of lines in logfiles
    # NOTE: '*log*' also matches the 'log.pcap.*' captures, so their newline bytes are counted too
    for logfile in glob.iglob('*/logs/**/*log*', root_dir=scenario_logs_path, recursive=True):
        logfile_fullpath = os.path.join(scenario_logs_path, logfile)

        if not os.path.isdir(logfile_fullpath):
            current_log_lines = !cat $logfile_fullpath | wc -l
            logs_lines_sc += int(current_log_lines[0])

    # Count the number of packets in PCAPs
    for pcapfile in glob.iglob('*/logs/**/*pcap*', root_dir=scenario_logs_path, recursive=True):
        pcapfile_fullpath = os.path.join(scenario_logs_path, pcapfile)

        if not os.path.isdir(pcapfile_fullpath):
            #current_pcapfile_pkts = !tshark -r $pcapfile_fullpath 2>/dev/null | wc -l
            current_pcapfile_pkts = !capinfos -M -c "$pcapfile_fullpath" 2> /dev/null | tail -n 1 | cut -f 6 -d ' '
            pcaps_pkts_sc += int(current_pcapfile_pkts[0]) if len(current_pcapfile_pkts) == 1 else 0

    # Update the counters and print the scenario info
    logs_lines_total += logs_lines_sc
    pcaps_pkts_total += pcaps_pkts_sc

    print(scenario_path)
    print(logs_lines_sc, pcaps_pkts_sc)
    print()

logs_lines_total, pcaps_pkts_total

/data/surv/ait/scenarios/fox
114900689 27139809

/data/surv/ait/scenarios/harrison
130883155 31307113

/data/surv/ait/scenarios/russellmitchell
64946994 15501022

/data/surv/ait/scenarios/santos
85476285 16349076

/data/surv/ait/scenarios/shaw
137837285 27109300

/data/surv/ait/scenarios/wardbeck
137717718 27047677

/data/surv/ait/scenarios/wheeler
150133056 34622374

/data/surv/ait/scenarios/wilson
203805076 45098276



(1025700258, 224174647)
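The line counts above rely on the shell (`cat ... | wc -l`). As a hedged alternative, the same quantity (the number of newline bytes) can be computed in pure Python. The names `count_lines` and `looks_like_pcap` are hypothetical. The filter is included because the `*log*` pattern also matches files named `log.pcap.*`, which one may prefer to exclude from a text-log count.

```python
import os
import tempfile


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in binary chunks (like `wc -l`)."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += chunk.count(b"\n")
    return total


def looks_like_pcap(filename):
    """Heuristic (assumption): treat any file with 'pcap' in its name as a capture."""
    return "pcap" in os.path.basename(filename)


# Tiny self-check on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    tmp_path = tmp.name

demo_count = count_lines(tmp_path)
os.remove(tmp_path)
print(demo_count, looks_like_pcap("log.pcap.1642405636"), looks_like_pcap("dnsmasq.log"))
```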

### Timespan Analysis

In [31]:
# Official declaration of start and ends of individual scenarios
INFO_FILES = !find $DATA_ROOT -name 'dataset.yaml'

for file in sorted(INFO_FILES):
    !cat $file
    !echo ''

name: processed_fox_scenario
start: '2022-01-15T00:00:00'
end: '2022-01-20T00:00:00'

name: processed_harrison_scenario
start: '2022-02-04T00:00:00'
end: '2022-02-09T00:00:00'

name: processed_russellmitchell_scenario
start: '2022-01-21T00:00:00'
end: '2022-01-25T00:00:00'

name: processed_santos_scenario
start: '2022-01-14T00:00:00'
end: '2022-01-18T00:00:00'

name: processed_shaw_scenario
start: '2022-01-25T00:00:00'
end: '2022-01-31T00:00:00'

name: processed_wardbeck_scenario
start: '2022-01-19T00:00:00'
end: '2022-01-24T00:00:00'

name: processed_wheeler_scenario
start: '2022-01-26T00:00:00'
end: '2022-01-31T00:00:00'

name: processed_wilson_scenario
start: '2022-02-03T00:00:00'
end: '2022-02-09T00:00:00'
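The declared scenario lengths follow directly from these start and end timestamps. A small sketch of the arithmetic, with the values copied from the `dataset.yaml` outputs above (the variable names are illustrative):

```python
from datetime import datetime, timedelta

# Start/end pairs copied from the dataset.yaml files shown above
declared = {
    "fox": ("2022-01-15T00:00:00", "2022-01-20T00:00:00"),
    "harrison": ("2022-02-04T00:00:00", "2022-02-09T00:00:00"),
    "russellmitchell": ("2022-01-21T00:00:00", "2022-01-25T00:00:00"),
    "santos": ("2022-01-14T00:00:00", "2022-01-18T00:00:00"),
    "shaw": ("2022-01-25T00:00:00", "2022-01-31T00:00:00"),
    "wardbeck": ("2022-01-19T00:00:00", "2022-01-24T00:00:00"),
    "wheeler": ("2022-01-26T00:00:00", "2022-01-31T00:00:00"),
    "wilson": ("2022-02-03T00:00:00", "2022-02-09T00:00:00"),
}

# Declared duration of each scenario
declared_durs = {
    name: datetime.fromisoformat(end) - datetime.fromisoformat(start)
    for name, (start, end) in declared.items()
}
total_declared = sum(declared_durs.values(), timedelta())

for name, dur in declared_durs.items():
    print(f"{name:16s} {dur.days} days")
print("total:", total_declared.days, "days")
```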



## NetFlow Data Analysis

Since all scenarios are very similar and differ only in their parameters, we take a closer look at just one of them: the biggest one, `wilson`.

In [36]:
import glob

NETFLOW_WILSON_FOLDER_PATH = '/data/surv/ait/scenarios/wilson/flow'
NETFLOW_WILSON_FILES = glob.glob('*.csv', root_dir=NETFLOW_WILSON_FOLDER_PATH)
NETFLOW_WILSON_FILES

['udp_complete.csv', 'tcp_nocomplete.csv', 'tcp_complete.csv']

In [37]:
wilson_flow_data_all = pd.DataFrame()

for file in NETFLOW_WILSON_FILES:
    wilson_flow_part = pd.read_csv(os.path.join(NETFLOW_WILSON_FOLDER_PATH, file))
    wilson_flow_data_all = pd.concat([wilson_flow_data_all, wilson_flow_part])

In [42]:
wilson_flow_data_all.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 629553 entries, 0 to 417671
Data columns (total 157 columns):
 #    Column                 Non-Null Count   Dtype  
---   ------                 --------------   -----  
 0    #c_ip:1                181012 non-null  object 
 1    c_port:2               629553 non-null  int64  
 2    c_first_abs:3          181012 non-null  float64
 3    c_durat:4              181012 non-null  float64
 4    c_bytes_all:5          181012 non-null  float64
 5    c_pkts_all:6           181012 non-null  float64
 6    c_isint:7              181012 non-null  float64
 7    c_iscrypto:8           181012 non-null  float64
 8    c_type:9               181012 non-null  float64
 9    s_ip:10                181012 non-null  object 
 10   s_port:11              181012 non-null  float64
 11   s_first_abs:12         181012 non-null  float64
 12   s_durat:13             181012 non-null  float64
 13   s_bytes_all:14         181012 non-null  float64
 14   s_pkts_all:15          

In [40]:
len(wilson_flow_data_all)

629553

In [41]:
wilson_flow_data_all.head()

Unnamed: 0,#c_ip:1,c_port:2,c_first_abs:3,c_durat:4,c_bytes_all:5,c_pkts_all:6,c_isint:7,c_iscrypto:8,c_type:9,s_ip:10,s_port:11,s_first_abs:12,s_durat:13,s_bytes_all:14,s_pkts_all:15,s_isint:16,s_iscrypto:17,s_type:18,fqdn:19,timestamp,rounded_timestamp,role_cli,ipv4_address_cli,network_cli,role_serv,ipv4_address_serv,network_serv,label,#15#c_ip:1,c_pkts_all:3,c_rst_cnt:4,c_ack_cnt:5,c_ack_cnt_p:6,c_bytes_uniq:7,c_pkts_data:8,c_bytes_all:9,c_pkts_retx:10,c_bytes_retx:11,c_pkts_ooo:12,c_syn_cnt:13,c_fin_cnt:14,s_ip:15,s_port:16,s_pkts_all:17,s_rst_cnt:18,s_ack_cnt:19,s_ack_cnt_p:20,s_bytes_uniq:21,s_pkts_data:22,s_bytes_all:23,s_pkts_retx:24,s_bytes_retx:25,s_pkts_ooo:26,s_syn_cnt:27,s_fin_cnt:28,first:29,last:30,durat:31,c_first:32,s_first:33,c_last:34,s_last:35,c_first_ack:36,s_first_ack:37,c_isint:38,s_isint:39,c_iscrypto:40,s_iscrypto:41,con_t:42,p2p_t:43,http_t:44,c_rtt_avg:45,c_rtt_min:46,c_rtt_max:47,c_rtt_std:48,c_rtt_cnt:49,c_ttl_min:50,c_ttl_max:51,s_rtt_avg:52,s_rtt_min:53,s_rtt_max:54,s_rtt_std:55,s_rtt_cnt:56,s_ttl_min:57,s_ttl_max:58,p2p_st:59,ed2k_data:60,ed2k_sig:61,ed2k_c2s:62,ed2k_c2c:63,ed2k_chat:64,c_f1323_opt:65,c_tm_opt:66,c_win_scl:67,c_sack_opt:68,c_sack_cnt:69,c_mss:70,c_mss_max:71,c_mss_min:72,c_win_max:73,c_win_min:74,c_win_0:75,c_cwin_max:76,c_cwin_min:77,c_cwin_ini:78,c_pkts_rto:79,c_pkts_fs:80,c_pkts_reor:81,c_pkts_dup:82,c_pkts_unk:83,c_pkts_fc:84,c_pkts_unrto:85,c_pkts_unfs:86,c_syn_retx:87,s_f1323_opt:88,s_tm_opt:89,s_win_scl:90,s_sack_opt:91,s_sack_cnt:92,s_mss:93,s_mss_max:94,s_mss_min:95,s_win_max:96,s_win_min:97,s_win_0:98,s_cwin_max:99,s_cwin_min:100,s_cwin_ini:101,s_pkts_rto:102,s_pkts_fs:103,s_pkts_reor:104,s_pkts_dup:105,s_pkts_unk:106,s_pkts_fc:107,s_pkts_unrto:108,s_pkts_unfs:109,s_syn_retx:110,http_req_cnt:111,http_res_cnt:112,http_res:113,c_pkts_push:114,s_pkts_push:115,c_tls_SNI:116,s_tls_SCN:117,c_npnalpn:118,s_npnalpn:119,c_tls_sesid:120,c_last_handshakeT:121,s_last_handshakeT:122,c_appdataT:123,s_appdataT:124,c_appdataB:125,s_appdataB:126,fqdn:127,dns_rslv:128,req_tm:129,res_tm:130
0,192.168.239.254,12298,1643846000000.0,0.0,228.0,1.0,1.0,0.0,19.0,192.168.226.71,53.0,1643846000000.0,0.0,247.0,1.0,1.0,0.0,19.0,-,2022-02-03 00:00:04.193519104,2022-02-03 00:00:04.000000000,inet-dns,192.168.239.254,internet,attacker_0,192.168.226.71,internet,data exfiltration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,192.168.239.254,31326,1643846000000.0,0.0,228.0,1.0,1.0,0.0,19.0,192.168.226.71,53.0,1643846000000.0,0.0,247.0,1.0,1.0,0.0,19.0,-,2022-02-03 00:00:20.246249984,2022-02-03 00:00:20.000000000,inet-dns,192.168.239.254,internet,attacker_0,192.168.226.71,internet,data exfiltration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,192.168.239.254,22595,1643846000000.0,0.0,228.0,1.0,1.0,0.0,19.0,192.168.226.71,53.0,1643846000000.0,0.0,247.0,1.0,1.0,0.0,19.0,-,2022-02-03 00:00:35.347027200,2022-02-03 00:00:35.000000000,inet-dns,192.168.239.254,internet,attacker_0,192.168.226.71,internet,data exfiltration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,192.168.239.254,41224,1643846000000.0,0.0,228.0,1.0,1.0,0.0,19.0,192.168.226.71,53.0,1643846000000.0,0.0,247.0,1.0,1.0,0.0,19.0,-,2022-02-03 00:00:46.414992896,2022-02-03 00:00:46.000000000,inet-dns,192.168.239.254,internet,attacker_0,192.168.226.71,internet,data exfiltration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,192.168.239.254,58001,1643846000000.0,0.0,228.0,1.0,1.0,0.0,19.0,192.168.226.71,53.0,1643846000000.0,0.0,247.0,1.0,1.0,0.0,19.0,-,2022-02-03 00:01:02.463830016,2022-02-03 00:01:02.000000000,inet-dns,192.168.239.254,internet,attacker_0,192.168.226.71,internet,data exfiltration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


This is interesting, as apparently, different flow files have different features, creating many NaN values in the process.

In [63]:
# Only the following features seem to be present (non-null) in all 3 files
list(wilson_flow_data_all.columns[~wilson_flow_data_all.isnull().any()])

['c_port:2',
 'timestamp',
 'role_cli',
 'ipv4_address_cli',
 'network_cli',
 'role_serv',
 'ipv4_address_serv',
 'network_serv',
 'label']

Therefore, we can still reason about the capture timestamp and the label across all flows, but ML on the given NetFlows needs to be done per file type: a single model would not suit all the data, since most features exist in only one file type.
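To illustrate why the merged frame is full of NaN values, here is a toy sketch with two made-up frames standing in for a TCP and a UDP flow file. The values are invented, and keeping a `flow_type` index level via `keys=` is just one possible way to split the data per file type again later.

```python
import pandas as pd

# Toy stand-ins for two flow files with partly different columns (made-up values)
tcp_like = pd.DataFrame({"c_port:2": [443], "c_rst_cnt:4": [0], "label": ["HTTPS"]})
udp_like = pd.DataFrame({"c_port:2": [53], "c_durat:4": [0.0], "label": ["benign DNS"]})

# concat takes the union of columns; cells missing in one source become NaN
merged = pd.concat([tcp_like, udp_like], keys=["tcp", "udp"], names=["flow_type", "row"])

# Columns without any NaN are the ones shared by all sources
fully_populated = list(merged.columns[~merged.isnull().any()])
print(fully_populated)

# Per file type, drop the columns that are entirely NaN to recover its own features
per_type_columns = {
    flow_type: list(part.dropna(axis=1, how="all").columns)
    for flow_type, part in merged.groupby(level="flow_type")
}
print(per_type_columns)
```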

In [64]:
# Show the label distribution
wilson_flow_data_all['label'].value_counts()

label
proxy                                       180799
benign DNS                                  124476
HTTPS                                        88281
browsing/update                              56403
HTTP(S) DMZ                                  55511
data exfiltration                            21219
host_discover_local                          16380
service_scan                                 12293
NTP                                          12039
dirb_scan                                     8863
benign_share                                  8696
mail                                          7194
HTTP(S) intra                                 6786
mDNS                                          5611
NetBIOS                                       5576
QUIC                                          5547
mDNSv6                                        4263
HTTP                                          3010
dns_brute_force_start                         1817
bootp                    

In [66]:
len(wilson_flow_data_all['label'].value_counts())

38

However, not all of these labels are malicious; one needs to consult the documentation to determine which are malicious and which are benign.

In [67]:
!cat /data/surv/ait/flows_info/label_info.txt

Benign labels in TCP:
browsing/update
benign_share
broken flow - benign
mail
monitoring
HTTP(S) intra
HTTP(S) DMZ
HTTP
HTTPS
DNS
SSH
proxy
update/command on unassigned port

Malicious labels in TCP:
service_scan
online_cracking
host_discover_dmz
host_discover_local
wpscan
dirb_scan
upload_rce_shell
check_user_id
check_netstat_t
read_resolv
check_network_config
check_ps_a
check_release
read_group
read_passwd
check_date
list_web_dir
check_wp_config
dump_wp_users
read_profile
dns_brute_force_start
list_www
check_who
clear
check_last
check_id
vpn_connect
check_whoami
check_uname_r
check_meminfo
check_uname_a
check_df
check_netstat_nat
list_home
check_netstat_l
check_cpuinfo
check_uptime
check_pwd
list_l

Benign labels in UDP:
mDNSv6
mDNS
bootp
NTP
systemd-resolver (local)
benign DNS - attacker requesting updates
benign DNS
OpenVPN
QUIC
STUN VoIP
NetBIOS
GoogleDNS

Malicious labels in UDP:
data exfiltration

Only 14 attack labels were found in this scenario. It may therefore be sensible to merge the classes from all scenarios and check whether the other scenarios contain additional traffic classes, although, according to the docs, the scenarios should be the same.
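Rather than copying the label lists by hand, the listing could be parsed into sections. The sketch below assumes the layout seen in the output above (a header ending in `:` followed by one label per line); `parse_label_info` is a hypothetical helper, and the excerpt is a shortened copy of the file content.

```python
def parse_label_info(text):
    """Split the label listing into {section header: [labels]} (hypothetical helper)."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()  # also removes stray trailing spaces from labels
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Shortened excerpt of the listing shown above
excerpt = """Benign labels in UDP:
mDNSv6
mDNS

Malicious labels in UDP:
data exfiltration
"""

sections = parse_label_info(excerpt)
malicious = {
    label
    for header, labels in sections.items()
    if header.startswith("Malicious")
    for label in labels
}
print(sections)
print(malicious)
```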

### All NetFlow Files Analysis

In [108]:
# Each scenario capture lasts 4 to 6 days. Therefore, use a static maximum gap of 15 minutes
def measure_real_capture_dur(data: pd.Series, gap_minutes: int = 15) -> tuple[pd.Timedelta, bool]:
    """Computes the total timespan of the capture, skipping gaps of gap_minutes or more.

    Expects a Series of timestamps sorted in descending order.
    Returns the summed duration and whether the capture was contiguous (no such gap)."""
    gap_max = pd.Timedelta(minutes=gap_minutes)
    total_dur = pd.Timedelta(seconds=0)
    last_tstamp = data.iloc[0]
    contiguous = True

    # Iterate through the dataframe to find out gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap < gap_max:
            total_dur += dur_gap
        elif contiguous:
            contiguous = False

        last_tstamp = cur_tstamp

    return total_dur, contiguous
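The same measurement can be expressed without an explicit Python loop. The following is a sketch of a vectorized variant, not the function used for the results in this notebook; `capture_duration` is a hypothetical name, and it sorts the timestamps itself, so the input order does not matter.

```python
import pandas as pd


def capture_duration(timestamps: pd.Series, gap_minutes: int = 15):
    """Sum the gaps between consecutive timestamps that are shorter than the limit.

    Returns (total duration, True if no gap reached the limit).
    """
    gap_max = pd.Timedelta(minutes=gap_minutes)
    gaps = timestamps.sort_values().diff().dropna()
    short = gaps < gap_max
    return gaps[short].sum(), bool(short.all())


# Made-up timestamps: three flows 5 minutes apart, a 50-minute pause, one more flow
demo = pd.Series(pd.to_datetime([
    "2022-01-15 00:00:00",
    "2022-01-15 00:05:00",
    "2022-01-15 00:10:00",
    "2022-01-15 01:00:00",
    "2022-01-15 01:05:00",
]))

demo_total, demo_contiguous = capture_duration(demo)
print(demo_total, demo_contiguous)
```

The 50-minute pause is excluded from the total and marks the capture as non-contiguous, which mirrors the behavior of the loop-based function.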

In [126]:
# 3 things are going to be analyzed at once:
# a) traffic classes
# b) capture duration of NetFlow data
# c) amount of flows

scenarios_classes = pd.DataFrame()
scenarios_flows = pd.Series()
scenarios_durs = pd.Series()

# Collect the paths of all scenario directories
scenarios_paths = []
for scenario in os.scandir(DATA_ROOT):
    if scenario.is_dir():
        scenarios_paths.append(scenario.path)

scenarios_paths

['/data/surv/ait/scenarios/fox',
 '/data/surv/ait/scenarios/wheeler',
 '/data/surv/ait/scenarios/shaw',
 '/data/surv/ait/scenarios/russellmitchell',
 '/data/surv/ait/scenarios/harrison',
 '/data/surv/ait/scenarios/santos',
 '/data/surv/ait/scenarios/wilson',
 '/data/surv/ait/scenarios/wardbeck']

In [130]:
import glob

# Go through the flow files
for scenario in scenarios_paths:
    scenario_name = scenario.split('/')[-1]
    scenario_flows_dir = os.path.join(scenario, 'flow')
    scenario_df = pd.DataFrame()

    # Get all flow files (CSVs) and load them under one DataFrame
    flow_files = glob.glob('*.csv', root_dir=scenario_flows_dir)

    for flow_file in flow_files:
        tmp_df = pd.read_csv(os.path.join(scenario_flows_dir, flow_file))
        scenario_df = pd.concat([scenario_df, tmp_df])

    # Determine the number of flow entries
    scenarios_flows[scenario_name] = len(scenario_df)

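    # Measure class distribution and merge it into the final DataFrame
    # NOTE: column assignment aligns on the existing index, so labels that are
    # absent from the first processed scenario are silently dropped here
    scenarios_classes[scenario_name] = scenario_df['label'].value_counts()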

    # Measure the total capture duration
    scenario_df['timestamp'] = pd.to_datetime(scenario_df['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
    scenario_df = scenario_df.sort_values(by='timestamp', ascending=False)
    dur, _ = measure_real_capture_dur(scenario_df['timestamp'])

    scenarios_durs[scenario_name] = dur

# Fill missing counts with 0 and sort the columns (scenarios) by name
scenarios_classes = scenarios_classes.replace(np.nan, 0).astype(np.int32)
scenarios_classes = scenarios_classes.T.sort_index().T
scenarios_flows = scenarios_flows.sort_index()
scenarios_durs = scenarios_durs.sort_index()
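One pandas detail is worth keeping in mind for the class table: assigning a Series as a new column aligns it to the frame's existing index, so index entries that are not already present get discarded. The toy sketch below (made-up counts) contrasts this with building the frame from a dict of Series, which uses the union of all labels. It illustrates the mechanism only and does not reproduce the notebook's results.

```python
import pandas as pd

# Made-up label counts for two scenarios with partly different labels
counts_a = pd.Series({"proxy": 3, "mail": 1})
counts_b = pd.Series({"proxy": 2, "wpscan": 5})

# Column-by-column assignment: the index is fixed by the first column
aligned = pd.DataFrame()
aligned["a"] = counts_a
aligned["b"] = counts_b  # 'wpscan' is not in the index, so it is dropped

# Building from a dict of Series: the index is the union of all labels
unioned = pd.DataFrame({"a": counts_a, "b": counts_b}).fillna(0).astype(int)

print(aligned)
print(unioned)
```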

In [135]:
# The amount of flows
display(scenarios_flows)

scenarios_flows.sum()

fox                351908
harrison           437659
russellmitchell    377704
santos             288547
shaw               453541
wardbeck           390224
wheeler            536206
wilson             629553
dtype: int64

3465342

In [136]:
# The "real" duration of the capture based on flows
display(scenarios_durs)

scenarios_durs.sum()

fox               4 days 23:59:58.185704704
wheeler           4 days 23:38:53.720544256
shaw              5 days 23:58:28.333856768
russellmitchell   3 days 23:59:55.115293440
harrison          4 days 23:47:40.523327488
santos            3 days 23:59:51.393707008
wilson            5 days 23:59:49.704099328
wardbeck          4 days 23:59:13.212267776
dtype: timedelta64[ns]

Timedelta('39 days 23:23:50.188800768')

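This approximately corresponds to the documentation. Only `wheeler` and `harrison` are notably shorter than declared, by about 21 and 12 minutes, respectively. However, it is possible that no network traffic took place at the very start or end of these scenarios, or that they contain pauses of 15 minutes or more (which the function excludes), making the measured duration shorter.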
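The size of the two shortfalls can be checked with a few lines of arithmetic. The measured values are copied from the output above, and the declared length of 5 days for both scenarios comes from their `dataset.yaml` start and end dates.

```python
import pandas as pd

# Declared durations (from dataset.yaml) and measured durations (from the output above)
declared = {
    "wheeler": pd.Timedelta(days=5),
    "harrison": pd.Timedelta(days=5),
}
measured = {
    "wheeler": pd.Timedelta("4 days 23:38:53.720544256"),
    "harrison": pd.Timedelta("4 days 23:47:40.523327488"),
}

# Whole minutes by which the measured capture falls short of the declaration
shortfall_minutes = {
    name: int((declared[name] - measured[name]).total_seconds() // 60)
    for name in declared
}
print(shortfall_minutes)
```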

In [140]:
# Take a look at the classes
scenarios_classes

label,fox,harrison,russellmitchell,santos,shaw,wardbeck,wheeler,wilson
benign DNS,79043,117445,69211,70915,80423,91748,122186,124476
proxy,64434,66164,48468,84132,0,94926,107778,180799
HTTPS,34918,61948,26261,15877,5181,66154,67838,88281
host_discover_local,32788,65668,131068,1000,1010,2027,65541,16380
browsing/update,27876,37794,25822,29515,273059,37687,45836,56403
benign_share,23074,10810,6208,3940,0,8024,15306,8696
HTTP(S) DMZ,18427,7832,12622,38200,0,17526,24245,55511
data exfiltration,14668,23457,15778,12848,60898,15076,19177,21219
service_scan,13901,1309,23112,4037,432,1234,9445,12293
NTP,10084,9538,6213,7122,12331,10932,10977,12039


In [142]:
# See the total number of flows regardless of the scenario
scenarios_classes.sum(axis=1)

label
benign DNS                                  755447
proxy                                       646701
HTTPS                                       366458
host_discover_local                         315482
browsing/update                             533992
benign_share                                 76058
HTTP(S) DMZ                                 174363
data exfiltration                           183121
service_scan                                 65763
NTP                                          79236
dirb_scan                                    38671
NetBIOS                                      35982
mDNS                                         31134
QUIC                                         26058
mail                                         26268
mDNSv6                                       23747
HTTP(S) intra                                32765
bootp                                         9549
dns_brute_force_start                         9450
HTTP                     

In [158]:
# Determine the number of benign/malicious classes
CLASSES_MALICIOUS = [
    'service_scan',
    'online_cracking',
    'host_discover_dmz',
    'host_discover_local',
    'wpscan',
    'dirb_scan',
    'upload_rce_shell',
    'check_user_id',
    'check_netstat_t',
    'read_resolv',
    'check_network_config',
    'check_ps_a',
    'check_release',
    'read_group',
    'read_passwd',
    'check_date',
    'list_web_dir',
    'check_wp_config',
    'dump_wp_users',
    'read_profile',
    'dns_brute_force_start',
    'list_www',
    'check_who',
    'clear',
    'check_last',
    'check_id',
    'vpn_connect',
    'check_whoami',
    'check_uname_r',
    'check_meminfo',
    'check_uname_a',
    'check_df',
    'check_netstat_nat',
    'list_home',
    'check_netstat_l',
    'check_cpuinfo',
    'check_uptime',
    'check_pwd',
    'list_l ',
    'data exfiltration'
]

CLASSES_BENIGN = [
    'browsing/update',
    'benign_share',
    'broken flow - benign',
    'mail',
    'monitoring',
    'HTTP(S) intra',
    'HTTP(S) DMZ',
    'HTTP',
    'HTTPS',
    'DNS',
    'SSH',
    'proxy',
    'update/command on unassigned port',
    'mDNSv6',
    'mDNS',
    'bootp',
    'NTP',
    'systemd-resolver (local)',
    'benign DNS - attacker requesting updates',
    'benign DNS',
    'OpenVPN',
    'QUIC',
    'STUN VoIP',
    'NetBIOS',
    'GoogleDNS',
]

In [167]:
# Make the counts
cnt_malicious_classes = 0
cnt_malicious_flows = 0
cnt_benign_classes = 0
cnt_benign_flows = 0
cnt_uncategorized = 0

In [168]:
for class_name in scenarios_classes.index:
    if class_name in CLASSES_BENIGN:
        cnt_benign_classes += 1
        cnt_benign_flows += scenarios_classes.loc[class_name].sum()
    elif class_name in CLASSES_MALICIOUS:
        cnt_malicious_classes += 1
        cnt_malicious_flows += scenarios_classes.loc[class_name].sum()
    else:
        cnt_uncategorized += 1
        print(class_name)

In [170]:
cnt_benign_classes, cnt_benign_flows

(24, 2845487)

In [169]:
cnt_malicious_classes, cnt_malicious_flows

(22, 618736)
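As a sanity check, the categorized flows can be compared with the total number of flows reported earlier. The sketch below only redoes the arithmetic on numbers copied from the outputs above. It leaves a small remainder of flows that fall into neither category; the cause is not verified here, though labels lost while assembling `scenarios_classes` would be one candidate.

```python
# Flow counts per scenario, copied from the output above
flows_per_scenario = {
    "fox": 351908,
    "harrison": 437659,
    "russellmitchell": 377704,
    "santos": 288547,
    "shaw": 453541,
    "wardbeck": 390224,
    "wheeler": 536206,
    "wilson": 629553,
}
total_flows = sum(flows_per_scenario.values())

# Flow totals per category, copied from the two cells above
benign_flows = 2845487
malicious_flows = 618736

categorized = benign_flows + malicious_flows
unaccounted = total_flows - categorized
malicious_share = malicious_flows / categorized

print(total_flows, categorized, unaccounted)
print(f"malicious share of categorized flows: {malicious_share:.1%}")
```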

### Revising Features Across Different NetFlow Files
As shown previously, there were a lot of NULL values when we loaded all three files into one DataFrame. Let's load them one by one and see how the features differ across files.

In [172]:
# Let's take the "santos" scenario as an example
SANTOS_FLOWS_DIR = '/data/surv/ait/scenarios/santos/flow'
santos_flow_files = glob.glob('*.csv', root_dir=SANTOS_FLOWS_DIR)
santos_flow_files = [os.path.join(SANTOS_FLOWS_DIR, file) for file in santos_flow_files]
santos_flow_files.sort()

santos_flow_files

['/data/surv/ait/scenarios/santos/flow/tcp_complete.csv',
 '/data/surv/ait/scenarios/santos/flow/tcp_nocomplete.csv',
 '/data/surv/ait/scenarios/santos/flow/udp_complete.csv']

In [173]:
santos_tcp_complete = pd.read_csv(santos_flow_files[0])
santos_tcp_complete.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178744 entries, 0 to 178743
Data columns (total 138 columns):
 #    Column                 Non-Null Count   Dtype  
---   ------                 --------------   -----  
 0    #15#c_ip:1             178744 non-null  object 
 1    c_port:2               178744 non-null  int64  
 2    c_pkts_all:3           178744 non-null  int64  
 3    c_rst_cnt:4            178744 non-null  int64  
 4    c_ack_cnt:5            178744 non-null  int64  
 5    c_ack_cnt_p:6          178744 non-null  int64  
 6    c_bytes_uniq:7         178744 non-null  int64  
 7    c_pkts_data:8          178744 non-null  int64  
 8    c_bytes_all:9          178744 non-null  int64  
 9    c_pkts_retx:10         178744 non-null  int64  
 10   c_bytes_retx:11        178744 non-null  int64  
 11   c_pkts_ooo:12          178744 non-null  int64  
 12   c_syn_cnt:13           178744 non-null  int64  
 13   c_fin_cnt:14           178744 non-null  int64  
 14   s_ip:15           

In [174]:
santos_tcp_nocomplete = pd.read_csv(santos_flow_files[1])
santos_tcp_nocomplete.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5778 entries, 0 to 5777
Data columns (total 52 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   #15#c_ip:1         5778 non-null   object 
 1   c_port:2           5778 non-null   int64  
 2   c_pkts_all:3       5778 non-null   int64  
 3   c_rst_cnt:4        5778 non-null   int64  
 4   c_ack_cnt:5        5778 non-null   int64  
 5   c_ack_cnt_p:6      5778 non-null   int64  
 6   c_bytes_uniq:7     5778 non-null   int64  
 7   c_pkts_data:8      5778 non-null   int64  
 8   c_bytes_all:9      5778 non-null   int64  
 9   c_pkts_retx:10     5778 non-null   int64  
 10  c_bytes_retx:11    5778 non-null   int64  
 11  c_pkts_ooo:12      5778 non-null   int64  
 12  c_syn_cnt:13       5778 non-null   int64  
 13  c_fin_cnt:14       5778 non-null   int64  
 14  s_ip:15            5778 non-null   object 
 15  s_port:16          5778 non-null   int64  
 16  s_pkts_all:17      5778 

In [178]:
# Are the TCP no-complete features a subset of the TCP complete features?
display(len(santos_tcp_complete.columns))
len(pd.concat([santos_tcp_complete, santos_tcp_nocomplete]).columns)

138

138

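Yes: concatenating both files still yields 138 columns, so the TCP no-complete features are a subset of the TCP complete ones. Let's look at the UDP features now.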
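Comparing column counts before and after concatenation works, but the subset relation can also be tested directly with set operations. A toy sketch with made-up, shortened column lists (the frame names are illustrative):

```python
import pandas as pd

# Empty toy frames that only carry column names
complete_like = pd.DataFrame(columns=["c_port:2", "c_rst_cnt:4", "c_rtt_avg:45"])
nocomplete_like = pd.DataFrame(columns=["c_port:2", "c_rst_cnt:4"])

# True if every no-complete column also exists in the complete frame
is_subset = set(nocomplete_like.columns) <= set(complete_like.columns)

# Columns that only the complete frame provides
extra_in_complete = sorted(set(complete_like.columns) - set(nocomplete_like.columns))

print(is_subset, extra_in_complete)
```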

In [179]:
santos_udp_complete = pd.read_csv(santos_flow_files[2])
santos_udp_complete.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104025 entries, 0 to 104024
Data columns (total 28 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   #c_ip:1            104025 non-null  object 
 1   c_port:2           104025 non-null  int64  
 2   c_first_abs:3      104025 non-null  float64
 3   c_durat:4          104025 non-null  float64
 4   c_bytes_all:5      104025 non-null  int64  
 5   c_pkts_all:6       104025 non-null  int64  
 6   c_isint:7          104025 non-null  int64  
 7   c_iscrypto:8       104025 non-null  int64  
 8   c_type:9           104025 non-null  int64  
 9   s_ip:10            104025 non-null  object 
 10  s_port:11          104025 non-null  int64  
 11  s_first_abs:12     104025 non-null  float64
 12  s_durat:13         104025 non-null  float64
 13  s_bytes_all:14     104025 non-null  int64  
 14  s_pkts_all:15      104025 non-null  int64  
 15  s_isint:16         104025 non-null  int64  
 16  s_

In [180]:
# Now, see how much the features differ
len(pd.concat([santos_tcp_complete, santos_udp_complete]).columns)

157

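OK, many feature names are the same, but each carries a numeric suffix marking its position in the file (for example, `c_bytes_all:9` in the TCP files versus `c_bytes_all:5` in the UDP file), so pandas treats them as different columns. The TCP complete and UDP files have 138 + 28 = 166 columns in total, yet their concatenation has 157, so only 9 column names match exactly. If we drop the position markers and merge the data afterwards, we obtain the total number of **unique** features.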

In [200]:
rename_columns = lambda name: ''.join([ch for ch in name if not ch.isdigit()]).replace('#', '').replace(':', '')
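Note that removing every digit also alters names that legitimately contain digits: `ed2k_data:60` becomes `edk_data`, for instance. A stricter variant, sketched below as an alternative rather than as the method used in this notebook, strips only the leading `#` or `#15#` marker and the trailing `:<number>` suffix. The helper name `strip_position_markers` is hypothetical.

```python
import re


def strip_position_markers(name: str) -> str:
    """Remove a leading '#'/'#<n>#' marker and a trailing ':<n>' suffix only."""
    return re.sub(r"^#(?:\d+#)?|:\d+$", "", name)


# Column names taken from the outputs above
examples = ["#15#c_ip:1", "#c_ip:1", "ed2k_data:60", "ipv4_address_cli", "c_f1323_opt:65"]
cleaned = [strip_position_markers(name) for name in examples]
print(cleaned)
```

It can be passed to `DataFrame.rename(columns=...)` in the same way as the lambda.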

In [202]:
santos_tcp_complete.rename(columns=rename_columns, inplace=True)
santos_tcp_nocomplete.rename(columns=rename_columns, inplace=True)
santos_udp_complete.rename(columns=rename_columns, inplace=True)

In [203]:
# With the renamed columns, merge and see the number of unique ones
santos_all = pd.concat([santos_tcp_complete, santos_tcp_nocomplete, santos_udp_complete])
len(santos_all.columns)

145

In [205]:
santos_all.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Index: 288547 entries, 0 to 104024
Data columns (total 145 columns):
 #    Column             Non-Null Count   Dtype  
---   ------             --------------   -----  
 0    c_ip               288547 non-null  object 
 1    c_port             288547 non-null  int64  
 2    c_pkts_all         288547 non-null  int64  
 3    c_rst_cnt          184522 non-null  float64
 4    c_ack_cnt          184522 non-null  float64
 5    c_ack_cnt_p        184522 non-null  float64
 6    c_bytes_uniq       184522 non-null  float64
 7    c_pkts_data        184522 non-null  float64
 8    c_bytes_all        288547 non-null  int64  
 9    c_pkts_retx        184522 non-null  float64
 10   c_bytes_retx       184522 non-null  float64
 11   c_pkts_ooo         184522 non-null  float64
 12   c_syn_cnt          184522 non-null  float64
 13   c_fin_cnt          184522 non-null  float64
 14   s_ip               288547 non-null  object 
 15   s_port             288547 non-null  i