# CIC-IoT Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re

In [2]:
DATA_PATH = '/data/cic-iot22'

In [3]:
# Determine folders with PCAP files: those whose names start with a digit
pcap_dirs = sorted([os.path.join(DATA_PATH, obj) for obj in os.listdir(DATA_PATH) if re.match(r'^\d.*', obj)])
pcap_dirs

['/data/cic-iot22/1_power',
 '/data/cic-iot22/2_idle',
 '/data/cic-iot22/3_interactions',
 '/data/cic-iot22/4_scenarios',
 '/data/cic-iot22/5_active',
 '/data/cic-iot22/6_attacks']

In [15]:
def find_pcaps(dir):
    pcap_files = []

    # Find all PCAP files within the directory and count them for each directory
    for root, _, files in os.walk(dir):
        for file in files:
            if file.endswith(".pcap"):
                pcap_files.append(os.path.join(root, file))

    return sorted(pcap_files)

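def count_packets(file):
    # tshark prints one summary line per packet, so the line count is the packet count
    pkts = !tshark -r "$file" 2> /dev/null | wc -l

    return int(pkts[0])

def measure_dur(file):
    # capinfos -u reports the capture duration; take the numeric field from the last output line
    dur = !capinfos -u "$file" | tail -n 1 | cut -f 6 -d ' '
    dur = dur[0]

    # Fall back to 0 when the field is not a plain number (e.g. for unreadable files)
    return float(dur) if dur.replace('.', '', 1).isdigit() else 0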
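
The two helpers above depend on IPython's `!` shell capture and on `tshark` and `capinfos` being installed. As a rough alternative, the sketch below counts packets and measures the capture duration by reading the classic libpcap file format directly. It is not what this notebook used: `pcap_stats` is a hypothetical helper, it assumes classic `.pcap` files (not pcapng), and it takes the duration as the difference between the earliest and latest record timestamps, which may differ slightly from what `capinfos` reports.

```python
import struct

# Magic numbers of the classic libpcap format (microsecond / nanosecond timestamps)
_MAGIC_US = 0xA1B2C3D4
_MAGIC_NS = 0xA1B23C4D

def pcap_stats(path):
    """Return (packet_count, duration_seconds) for a classic pcap file.

    Returns (0, 0.0) if the file does not start with a classic pcap header.
    """
    with open(path, 'rb') as fh:
        header = fh.read(24)  # the global header is 24 bytes long
        if len(header) < 24:
            return 0, 0.0

        # The byte order is detected from the magic number
        for endian in ('<', '>'):
            magic = struct.unpack(endian + 'I', header[:4])[0]
            if magic in (_MAGIC_US, _MAGIC_NS):
                break
        else:
            return 0, 0.0

        frac_div = 1e6 if magic == _MAGIC_US else 1e9
        count, first_ts, last_ts = 0, None, None

        while True:
            rec = fh.read(16)  # per-packet record header
            if len(rec) < 16:
                break
            ts_sec, ts_frac, incl_len, _orig_len = struct.unpack(endian + 'IIII', rec)
            ts = ts_sec + ts_frac / frac_div
            first_ts = ts if first_ts is None else min(first_ts, ts)
            last_ts = ts if last_ts is None else max(last_ts, ts)
            count += 1
            fh.seek(incl_len, 1)  # skip the captured packet bytes

    duration = (last_ts - first_ts) if count else 0.0
    return count, duration
```

Files that are not valid captures simply yield `(0, 0.0)`, so they do not need special handling in the summing loops.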

In [5]:
# Test the function
power_pcaps = find_pcaps(pcap_dirs[0])
power_pcaps[:7]

['/data/cic-iot22/1_power/Audio/Amazon Echo Dot 1/POWER/echodot1POWER_1.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Dot 1/POWER/echodot1POWER_2.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Dot 1/POWER/echodot1POWER_3.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Dot 2/POWER/echodot2POWER_1.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Dot 2/POWER/echodot2POWER_2.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Dot 2/POWER/echodot2POWER_3.pcap',
 '/data/cic-iot22/1_power/Audio/Amazon Echo Spot/POWER/._echospotPOWER_1.pcap']
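
The last entry, `._echospotPOWER_1.pcap`, looks like a macOS AppleDouble metadata file rather than a real capture; such files carry the `._` prefix followed by the name of the file they accompany. A minimal sketch of a stricter finder follows, assuming every `._`-prefixed file can be ignored. `find_real_pcaps` is a hypothetical name and is not the function used for the counts in this notebook.

```python
import os

def find_real_pcaps(top_dir):
    """Like find_pcaps, but skips '._*' files (assumed to be macOS metadata)."""
    pcap_files = []

    for root, _, files in os.walk(top_dir):
        for name in files:
            if name.endswith('.pcap') and not name.startswith('._'):
                pcap_files.append(os.path.join(root, name))

    return sorted(pcap_files)
```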

In [6]:
count_packets(power_pcaps[0])

762

In [11]:
# Combine the two helpers to count the packet volume across all directories
pkts_total = 0

for pdir in pcap_dirs:
    pfiles    = find_pcaps(pdir)
    pkts_iter = 0

    for pfile in pfiles:
        pkts_iter += count_packets(pfile)

    pkts_total += pkts_iter

    print(f'{os.path.basename(pdir):<10} : {pkts_iter}')

print(f'\nTotal packets: {pkts_total}')

1_power    : 464997
2_idle     : 24606490
3_interactions : 139589
4_scenarios : 505852
^C


IndexError: list index out of range

Results for directories 1-4; directory 5 was counted separately because it contains a very large file:
```
1_power    : 464997
2_idle     : 24606490
3_interactions : 139589
4_scenarios : 505852
```

Counting the remaining two directories manually gives:
```
5_active  : 63579702
6_attacks : 30330507
```
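
Because the counting loop was interrupted partway through, the per-directory counts had to be pieced together from separate runs. One way to make such a long-running loop resumable is to persist each directory's result as soon as it is available. The sketch below assumes a hypothetical `count_dir` callable that returns the packet count for one directory and a hypothetical JSON cache file; neither is part of the original analysis.

```python
import json
import os

def count_with_checkpoint(dirs, count_dir, cache_path):
    """Count packets per directory, skipping directories already in the cache."""
    results = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            results = json.load(fh)

    for d in dirs:
        name = os.path.basename(d)
        if name in results:
            continue  # already counted in an earlier run

        results[name] = count_dir(d)

        # Save after every directory so an interrupt loses at most one result
        with open(cache_path, 'w') as fh:
            json.dump(results, fh, indent=2)

    return results
```

Rerunning the function after an interrupt then only processes the directories that are still missing from the cache.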

In [7]:
# Which gives the total packet count
464997 + 24606490 + 139589 + 505852 + 63579702 + 30330507

119627137
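
To see how the total is distributed, the same six counts can be kept in a dictionary and summarized in one pass. This is only an illustrative sketch; the numbers are copied from the results reported above, and the variable name `pkt_counts` is an assumption.

```python
# Per-directory packet counts copied from the results above
pkt_counts = {
    '1_power':        464997,
    '2_idle':         24606490,
    '3_interactions': 139589,
    '4_scenarios':    505852,
    '5_active':       63579702,
    '6_attacks':      30330507,
}

pkts_total = sum(pkt_counts.values())

# Print each directory's count together with its share of the total
for name, pkts in pkt_counts.items():
    print(f'{name:<15} : {pkts:>11,} ({pkts / pkts_total:6.2%})')

print(f'{"Total":<15} : {pkts_total:>11,}')
```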

In [17]:
# Measure the total time of the capture
dur_total = 0

for pdir in pcap_dirs:
    pfiles = find_pcaps(pdir)
    dur_iter = 0

    for pfile in pfiles:
        dur_iter += measure_dur(pfile)

    dur_total += dur_iter

    print(f'{os.path.basename(pdir):<10} : {dur_iter}')

print(f'\nTotal duration: {dur_total}')

1_power    : 30133.343356999994
2_idle     : 1188000.8726869998
3_interactions : 12958.47629900001
4_scenarios : 1071.3074860000002
5_active   : 701236.2726530001
6_attacks  : 15880.862340000003

Total duration: 1949281.1348219998


In [18]:
# Convert retrieved seconds to hours
1949281 / 3600

541.4669444444445

In [20]:
# And to days
1949281 / 3600 / 24

22.561122685185186
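
The two conversions can also be done in a single step with `datetime.timedelta`, which splits the seconds into days and a remainder. A small sketch, using the summed duration printed earlier (the variable names are assumptions):

```python
from datetime import timedelta

# Summed capture duration in seconds, copied from the output above
dur_total_s = 1949281.1348219998

td = timedelta(seconds=dur_total_s)

# td.seconds holds the part of the duration that does not fill a whole day
hours, rem = divmod(td.seconds, 3600)
minutes, seconds = divmod(rem, 60)

print(f'{td.days} d {hours} h {minutes} min {seconds} s')
```

Note that this is the sum of per-file capture durations, not the wall-clock span between the first and the last capture.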

## CSV Analysis

In [21]:
CSVS_PATH = '/data/cic-iot22/csv'

In [22]:
# Find all CSV files
csv_files = []

for root, _, files in os.walk(CSVS_PATH):
    for file in files:
        if file.endswith(".csv"):
            csv_files.append(os.path.join(root, file))

csv_files.sort()
csv_files

['/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAOFF_1.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAOFF_2.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAOFF_3.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAON_1.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAON_2.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugALEXAON_3.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANOFF_1.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANOFF_2.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANOFF_3.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANON_1.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANON_2.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLANON_3.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLOCALOFF_1.csv',
 '/data/cic-iot22/csv/28 devices/amazonplug/amazonplugLOCALOFF_2.csv',
 '/data/cic-iot22/csv/28

In [27]:
# Load all files into a single DataFrame
data_df = pd.concat([pd.read_csv(csv_path) for csv_path in csv_files])

# Drop the first (index) column and reset pandas indexing after concatenation
data_df = data_df.drop(columns=data_df.columns[0]).reset_index(drop=True)
data_df

Unnamed: 0,L4_tcp,L4_udp,L7_http,L7_https,port_class_src,port_class_dst,pck_size,ip_dst_new,ethernet_frame_size,ttl,...,average,skew_e,kurt_e,var_e,q3_e,q1_e,iqr_e,epoch_timestamp,inter_arrival_time,time_since_previously_displayed_frame
0,1,0,0,1,1,3,412,3232270669,446,229,...,412.000000,0.000000,-3.000000,0.000000,412.00,412.00,0.00,1.633441e+09,0.000000,0.000000
1,1,0,0,1,3,1,53,911297469,87,255,...,232.500000,0.000000,-2.000000,32220.250000,322.25,142.75,179.50,1.633441e+09,0.013303,0.013303
2,1,0,0,1,1,3,385,3232270669,419,229,...,283.333333,-0.692626,-1.500000,26648.222222,398.50,219.00,179.50,1.633441e+09,0.043204,0.056507
3,1,0,0,1,3,1,53,911297469,87,255,...,225.750000,0.009119,-1.987851,29933.687500,391.75,53.00,338.75,1.633441e+09,0.010916,0.067423
4,1,0,0,1,1,3,20,3232270669,54,229,...,184.600000,0.400763,-1.808827,30720.240000,385.00,53.00,332.00,1.633441e+09,0.069090,0.136513
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
466078,1,0,0,0,2,3,89,3232270601,123,222,...,136.100000,-0.644239,-1.236481,4981.290000,195.75,89.00,106.75,1.633374e+09,0.087505,0.249931
466079,1,0,0,0,3,2,20,1679086845,54,255,...,125.545455,-0.397172,-1.573762,5642.429752,190.50,54.50,136.00,1.633374e+09,0.193948,0.443879
466080,0,1,0,0,3,2,180,4294967295,214,255,...,130.083333,-0.540594,-1.436139,5398.743056,185.25,71.75,113.50,1.633374e+09,3.018225,3.462104
466081,0,1,0,0,3,2,180,4294967295,214,255,...,133.923077,-0.668680,-1.276244,5160.378698,180.00,89.00,91.00,1.633374e+09,5.877368,9.339472


OK, this is neither a packet-based nor a flow-based format: it is essentially per-packet statistics plus some packet fields extracted into a feature vector, so it definitely falls into the "Other" category. Also, the CSV data are not directly labeled.
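
Although the CSV files carry no label column, the paths listed earlier suggest that the device and the activity are encoded in the directory and file names, for example `amazonplug/amazonplugALEXAOFF_1.csv`. The sketch below derives weak labels under that assumption. The pattern `<device><ACTIVITY>_<run>.csv` and the helper `labels_from_path` are read off the listing and are hypothetical; files that do not follow the pattern return `None`.

```python
import os
import re

def labels_from_path(csv_path):
    """Split '<device><ACTIVITY>_<run>.csv' into (device, activity, run).

    The device name is taken from the parent directory. Returns None if the
    file name does not follow the assumed pattern.
    """
    device = os.path.basename(os.path.dirname(csv_path))
    stem = os.path.splitext(os.path.basename(csv_path))[0]

    match = re.fullmatch(re.escape(device) + r'(?P<activity>.+)_(?P<run>\d+)', stem)
    if match is None:
        return None

    return device, match['activity'], int(match['run'])
```

Such labels describe the recorded activity only; whether they are suitable as ground truth for intrusion detection would need to be checked against the dataset documentation.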

Unfortunately, most of the data analysis (time measurement, etc.) was performed in the terminal.