# SDN Dataset (2020) Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_columns', None)

Attack category descriptions:

| **ID** | **Category**   |
| -- | ------------------ |
| 0 | Normal              |
| 1 | DoS                 |
| 2 | DDoS                |
| 3 | Port Scanning       |
| 4 | OS Fingerprinting   |
| 5 | Fuzzing             |


In [3]:
DATA_5_PATH  = '/data/sdn-dataset/SDN-Dataset-master/5-iot/data'
DATA_10_PATH = '/data/sdn-dataset/SDN-Dataset-master/10-iot/data'

In [9]:
# Peek at the data
sample = pd.read_csv(os.path.join(DATA_5_PATH, 'ddos1.csv'))

In [10]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35392 entries, 0 to 35391
Data columns (total 33 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   srcMAC             35392 non-null  object 
 1   dstMAC             35392 non-null  object 
 2   srcIP              35392 non-null  object 
 3   dstIP              35392 non-null  object 
 4   srcPort            35392 non-null  int64  
 5   dstPort            35392 non-null  int64  
 6   last_seen          35392 non-null  int64  
 7   Protocol           35392 non-null  object 
 8   proto_number       35392 non-null  int64  
 9   Dur                35392 non-null  float64
 10  Mean               35392 non-null  float64
 11  Stddev             35391 non-null  float64
 12  Min                35391 non-null  float64
 13  Max                35391 non-null  float64
 14  Pkts               35391 non-null  float64
 15  Bytes              35391 non-null  float64
 16  Spkts              353

In [15]:
sample.head()

Unnamed: 0,srcMAC,dstMAC,srcIP,dstIP,srcPort,dstPort,last_seen,Protocol,proto_number,Dur,Mean,Stddev,Min,Max,Pkts,Bytes,Spkts,Dpkts,Sbytes,Dbytes,Srate,Drate,Sum,TnBPSrcIP,TnBPDstIP,TnP_PSrcIP,TnP_PDstIP,TnP_PerProto,TnP_Per_Dport,N_IN_Conn_P_DstIP,N_IN_Conn_P_SrcIP,Attack,Category
0,0E:DF:76:42:F6:E2,46:EE:66:73:81:C3,10.0.0.10/32,10.0.0.11/32,33078,3000,1589568334232,udp,17,10.0,6.545455,4.274661,2.0,14.0,2439.0,1205261.0,444.0,0.0,22200.0,0.0,44.4,0.0,72.0,172180.142857,200876.833333,348.428571,406.5,1219.5,304.875,1.833333,1.833333,0.0,0.0
1,46:EE:66:73:81:C3,9A:11:C6:40:82:03,10.0.0.11/32,10.0.0.7/32,2000,48378,1589568334232,tcp,6,6.0,6.545455,4.274661,2.0,14.0,2439.0,1205261.0,3.0,5.0,198.0,398.0,0.5,0.833333,72.0,172180.142857,200876.833333,348.428571,406.5,1219.5,304.875,1.833333,1.833333,0.0,0.0
2,46:EE:66:73:81:C3,EA:8E:F0:07:FE:84,10.0.0.11/32,10.0.0.12/32,12345,50044,1589568334232,tcp,6,14.0,6.545455,4.274661,2.0,14.0,2439.0,1205261.0,841.0,1118.0,55506.0,1124885.0,60.071429,79.857143,72.0,172180.142857,200876.833333,348.428571,406.5,1219.5,304.875,1.833333,1.833333,0.0,0.0
3,9A:11:C6:40:82:03,46:EE:66:73:81:C3,10.0.0.7/32,10.0.0.11/32,48378,2000,1589568334232,tcp,6,6.0,6.545455,4.274661,2.0,14.0,2439.0,1205261.0,5.0,3.0,398.0,198.0,0.833333,0.5,72.0,172180.142857,200876.833333,348.428571,406.5,1219.5,304.875,1.833333,1.833333,0.0,0.0
4,4E:69:C4:82:F0:28,46:EE:66:73:81:C3,10.0.0.5/32,10.0.0.11/32,48994,2000,1589568334232,tcp,6,4.0,6.545455,4.274661,2.0,14.0,2439.0,1205261.0,4.0,2.0,298.0,132.0,1.0,0.5,72.0,172180.142857,200876.833333,348.428571,406.5,1219.5,304.875,1.833333,1.833333,0.0,0.0


In [12]:
sample['Attack'].value_counts()

Attack
0.0    18914
1.0    16477
Name: count, dtype: int64

In [13]:
sample['Category'].value_counts()

Category
0.0    18914
2.0    16477
Name: count, dtype: int64

In [42]:
# Supposing the duration is in seconds
sample['Dur'].head()

0    10.0
1     6.0
2    14.0
3     6.0
4     4.0
Name: Dur, dtype: float64

In [43]:
sample['last_seen'].head()

0    1589568334232
1    1589568334232
2    1589568334232
3    1589568334232
4    1589568334232
Name: last_seen, dtype: int64
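The magnitude of `last_seen` suggests a Unix epoch timestamp in milliseconds. This is an assumption rather than something documented in this notebook; a minimal sketch to sanity-check it on the value printed above:

```python
import pandas as pd

# Value copied from the `last_seen` output above.
last_seen_ms = 1589568334232

# Assumption: the value counts milliseconds since the Unix epoch.
ts = pd.to_datetime(last_seen_ms, unit='ms')
print(ts)
```

Under that assumption the value falls in mid-May 2020, which is consistent with the dataset's year.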

Looks good so far. Let's load all the data.

In [25]:
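def load_folder(path):
    # Read every file in the folder and concatenate once at the end;
    # growing a DataFrame inside the loop copies the data on each iteration.
    # The per-file index is kept, so labels repeat in the merged frame.
    parts = [
        pd.read_csv(os.path.join(path, file))
        for file in sorted(os.listdir(path))
    ]

    return pd.concat(parts)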

In [26]:
data5 = load_folder(DATA_5_PATH)
data10 = load_folder(DATA_10_PATH)
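Since the files are merged with their original per-file index, the same index label can appear many times in the merged frame. A possible variant, sketched here with a hypothetical `load_folder_tagged` helper and a hypothetical `source_file` column (neither is part of the original analysis), records where each row came from and gives every row a unique label:

```python
import os
import tempfile

import pandas as pd


def load_folder_tagged(path):
    """Hypothetical variant of load_folder that tags rows with their source file."""
    frames = []
    for file in sorted(os.listdir(path)):
        part = pd.read_csv(os.path.join(path, file))
        part['source_file'] = file  # hypothetical helper column
        frames.append(part)
    # ignore_index=True renumbers the rows, so every label is unique.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration on two made-up CSV files.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'last_seen': [1, 2]}).to_csv(os.path.join(tmp, 'a.csv'), index=False)
    pd.DataFrame({'last_seen': [3]}).to_csv(os.path.join(tmp, 'b.csv'), index=False)
    demo = load_folder_tagged(tmp)

print(demo)
```

The rest of this notebook keeps the plain `load_folder`, so the outputs below still show repeated index labels.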

In [28]:
data5.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 27926896 entries, 0 to 1859984
Data columns (total 33 columns):
 #   Column             Dtype  
---  ------             -----  
 0   srcMAC             object 
 1   dstMAC             object 
 2   srcIP              object 
 3   dstIP              object 
 4   srcPort            float64
 5   dstPort            float64
 6   last_seen          float64
 7   Protocol           object 
 8   proto_number       float64
 9   Dur                float64
 10  Mean               float64
 11  Stddev             float64
 12  Min                float64
 13  Max                float64
 14  Pkts               float64
 15  Bytes              float64
 16  Spkts              float64
 17  Dpkts              float64
 18  Sbytes             float64
 19  Dbytes             float64
 20  Srate              float64
 21  Drate              float64
 22  Sum                float64
 23  TnBPSrcIP          float64
 24  TnBPDstIP          float64
 25  TnP_PSrcIP         flo

In [31]:
data5['Attack'].value_counts()

Attack
1.0    26243324
0.0     1683562
Name: count, dtype: int64

In [32]:
data5['Category'].value_counts()

Category
3.0    20681185
4.0     3392920
0.0     1683562
5.0     1183807
1.0      793646
2.0      191766
Name: count, dtype: int64

In [30]:
data10.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30101899 entries, 0 to 1657614
Data columns (total 33 columns):
 #   Column             Dtype  
---  ------             -----  
 0   srcMAC             object 
 1   dstMAC             object 
 2   srcIP              object 
 3   dstIP              object 
 4   srcPort            float64
 5   dstPort            float64
 6   last_seen          float64
 7   Protocol           object 
 8   proto_number       float64
 9   Dur                float64
 10  Mean               float64
 11  Stddev             float64
 12  Min                float64
 13  Max                float64
 14  Pkts               float64
 15  Bytes              float64
 16  Spkts              float64
 17  Dpkts              float64
 18  Sbytes             float64
 19  Dbytes             float64
 20  Srate              float64
 21  Drate              float64
 22  Sum                float64
 23  TnBPSrcIP          float64
 24  TnBPDstIP          float64
 25  TnP_PSrcIP         flo

In [33]:
data10['Category'].value_counts()

Category
3.0    22445217
4.0     3387049
0.0     2603809
5.0     1053009
1.0      430116
2.0      182690
Name: count, dtype: int64

## Capture Length Determination
At this point in the analysis, all we need is the duration of the data5 and data10 captures.

In [35]:
# Trivially sort by last_seen
data5_tstamps_sorted = data5['last_seen'].sort_values()

In [36]:
data5_tstamps_sorted.iloc[-1] - data5_tstamps_sorted.iloc[0]

nan

In [37]:
data5_tstamps_sorted.head()

12    1.589114e+12
1     1.589114e+12
0     1.589114e+12
2     1.589114e+12
3     1.589114e+12
Name: last_seen, dtype: float64

In [None]:
data5_tstamps_sorted.tail()

34691    1.589630e+12
34690    1.589630e+12
34681    1.589630e+12
34689    1.589630e+12
28543             NaN
Name: last_seen, dtype: float64

OK, the last value is NaN (sorting places NaN at the end), so let's use the second-to-last element instead.

In [None]:
# Convert the span from milliseconds to seconds to minutes
time_mins = (data5_tstamps_sorted.iloc[-2] - data5_tstamps_sorted.iloc[0]) / 1000 / 60
time_hrs = time_mins / 60

display(time_mins)
display(time_hrs)

8591.973883333334

143.19956472222222
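Indexing the sorted Series at `-2` works only when exactly one NaN is present. A small sketch of an alternative that does not depend on the number of NaN values, using toy timestamps rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy timestamps in milliseconds; the values are made up for illustration.
tstamps = pd.Series([1_000, 61_000, np.nan, 121_000])

# Series.min() and Series.max() skip NaN by default (skipna=True).
span_ms = tstamps.max() - tstamps.min()

# Milliseconds to seconds to minutes.
span_mins = span_ms / 1000 / 60
print(span_mins)
```

This also avoids sorting tens of millions of values just to read the two extremes.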

This is hellishly long... Let's break it down file by file.

In [54]:
def count_dur(path):
    dur_total_mins = 0

    for file in os.listdir(path):
        dframe = pd.read_csv(os.path.join(path, file))
        dframe['first_seen'] = dframe['last_seen'] - (dframe['Dur'] * 1000)

        tstamp_start = dframe['first_seen'].sort_values().iloc[0]

        # NaN sorts last, so take the second-to-last value; this assumes a
        # single trailing NaN and avoids dropping whole rows
        tstamp_end   = dframe['last_seen'].sort_values().iloc[-2]

        # Milliseconds to seconds to minutes
        dframe_dur   = (tstamp_end - tstamp_start) / 1000 / 60

        dur_total_mins += dframe_dur

        print(f'{file} : {dframe_dur}')

    return dur_total_mins

In [55]:
count_dur(DATA_5_PATH)

only_normal_traffic1.csv : 30.483349999999998
only_normal_traffic12.csv : 30.300016666666668
ddos6.csv : 30.216666666666665
port_scan3.csv : 32.150016666666666
dos2.csv : 56.37408333333334
only_normal_traffic15.csv : 30.250016666666667
port_scan6.csv : 31.401166666666665
dos6.csv : 30.283333333333335
port_scan2.csv : 32.48336666666667
dos5.csv : 30.550016666666668
ddos8.csv : 23.72301666666667
only_normal_traffic17.csv : 30.28441666666667
ddos12.csv : 30.333383333333334
only_normal_traffic2.csv : 30.38335
only_normal_traffic8.csv : 30.283333333333335
only_normal_traffic9.csv : 30.3106
fuzzer6.csv : 30.566683333333334
only_normal_traffic4.csv : 30.333366666666667
only_normal_traffic19.csv : 30.28335
port_scan9.csv : 31.30325
os_service_detection2.csv : 62.849916666666665
only_normal_traffic7.csv : 30.333366666666667
dos3.csv : 98.01461666666667
dos11.csv : 63.18333333333333
dos10.csv : 54.348499999999994
fuzzer5.csv : 30.666716666666666
fuzzer8.csv : 30.35
only_normal_traffic10.csv : 30

2631.8589833333344

In [56]:
# Mins to hours
2631.8589833333344 / 60

43.86431638888891

In [59]:
# The merged DataFrame spans about 143 h, while the per-file durations sum to
# only about 44 h, so there are gaps between the captures.
# The capture is thus non-contiguous.

# Let's do the same for the 10-IoT variant.
count_dur(DATA_10_PATH)

only_normal_traffic1.csv : 30.38335
only_normal_traffic12.csv : 30.25003333333333
ddos6.csv : 30.239833333333333
port_scan3.csv : 33.2138
dos2.csv : 54.28888333333334
only_normal_traffic15.csv : 30.233349999999998
port_scan6.csv : 31.446216666666665
dos6.csv : 30.45925
port_scan2.csv : 32.533366666666666
dos5.csv : 53.2961
ddos8.csv : 60.35896666666667
only_normal_traffic17.csv : 30.316666666666666
ddos12.csv : 30.291083333333333
only_normal_traffic2.csv : 30.2953
only_normal_traffic8.csv : 30.3
only_normal_traffic9.csv : 30.283333333333335
fuzzer6.csv : 30.350016666666665
only_normal_traffic4.csv : 30.33335
only_normal_traffic19.csv : 30.25
port_scan9.csv : 31.41675
os_service_detection2.csv : 30.483366666666665
only_normal_traffic7.csv : 30.304883333333333
dos3.csv : 43.903800000000004
dos11.csv : 30.266683333333333
fuzzer5.csv : 44.21071666666667
fuzzer8.csv : 30.33335
only_normal_traffic10.csv : 30.233333333333334
os_service_detection10.csv : 30.487266666666667
ddos1.csv : 31.51159

2495.84825

In [70]:
2495.84825 / 60

41.59747083333333

In [69]:
data10_tstamps_sorted = data10['last_seen'].sort_values().dropna().astype(np.uint64)
data10_tstamps_sorted

37268      15899150446
4        1589638180300
9        1589638180300
13       1589638180300
20       1589638180301
             ...      
27709    1591128165714
27708    1591128165714
27707    1591128165714
27705    1591128165714
27702    1591128165714
Name: last_seen, Length: 30101898, dtype: uint64

In [72]:
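# Drop the malformed first timestamp. Note that index labels repeat across
# files, so drop(37268) removes every row labelled 37268 (46 rows here)
data10_tstamps_sorted = data10_tstamps_sorted.drop(37268)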
data10_tstamps_sorted

4        1589638180300
9        1589638180300
13       1589638180300
20       1589638180301
19       1589638180301
             ...      
27709    1591128165714
27708    1591128165714
27707    1591128165714
27705    1591128165714
27702    1591128165714
Name: last_seen, Length: 30101852, dtype: uint64
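Note that the length fell from 30101898 to 30101852: because index labels repeat across the concatenated files, dropping by label removes every row that shares the label. A sketch of filtering by value instead, on toy data and with a hypothetical plausibility bound (timestamps from 2020 onwards):

```python
import pandas as pd

# Toy Series with a repeated index label, as produced by concatenating files
# without resetting the index. The index labels are made up for illustration.
tstamps = pd.Series(
    [15_899_150_446, 1_589_638_180_300, 1_589_638_180_301, 1_591_128_165_714],
    index=[7, 0, 7, 1],
)

# Dropping by label removes every row labelled 7, including a valid one.
by_label = tstamps.drop(7)

# Hypothetical bound: 2020-01-01 expressed in epoch milliseconds
# (Timestamp.value is in nanoseconds).
lower_ms = pd.Timestamp('2020-01-01').value // 1_000_000
by_value = tstamps[tstamps >= lower_ms]

print(len(by_label), len(by_value))
```

The few extra rows lost by the label-based drop do not change the estimate in any meaningful way here, but the value-based filter removes only the malformed entry.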

In [73]:
# And the merged variant
# NaN values were already dropped, so the last element is the maximum
# Duration from milliseconds to seconds to minutes
time_mins = (data10_tstamps_sorted.iloc[-1] - data10_tstamps_sorted.iloc[0]) / 1000 / 60
time_hrs = time_mins / 60

display(time_mins)
display(time_hrs)

24833.090233333336

413.88483722222225

We are done here. The figures are not exact, but they give a reasonable estimate of the capture duration and confirm that the capture is non-contiguous.
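The discontinuity could also be made explicit by listing the gaps between consecutive files. A minimal sketch, assuming per-file start and end timestamps have already been collected into a table; the file names and values below are made up:

```python
import pandas as pd

# Hypothetical per-file capture intervals in epoch milliseconds (made-up values).
intervals = pd.DataFrame({
    'file':  ['a.csv', 'b.csv', 'c.csv'],
    'start': [0, 3_600_000, 10_800_000],
    'end':   [1_800_000, 5_400_000, 12_600_000],
}).sort_values('start').reset_index(drop=True)

# Gap between the end of one file and the start of the next, in minutes.
intervals['gap_mins'] = (intervals['start'] - intervals['end'].shift()) / 1000 / 60

# Time actually covered by the files versus the overall span, in minutes.
covered_mins = ((intervals['end'] - intervals['start']) / 1000 / 60).sum()
span_mins = (intervals['end'].max() - intervals['start'].min()) / 1000 / 60

print(intervals[['file', 'gap_mins']])
print(covered_mins, span_mins)
```

A covered time much smaller than the overall span, as in the toy data, is the same signal as the 43 h versus 143 h difference seen for the 5-IoT capture.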