# CUPID Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

Downloaded from: [https://www.cupid.directory/](https://www.cupid.directory/).

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_columns', None)

In [2]:
ROOTDIR = 'cupid'

In [3]:
# There are 3 CSV files, but the documentation does not specify if those are different scenarios or the same traffic labeled differently
os.listdir(ROOTDIR)

['CUPID-Auto-Labeled.csv',
 'CUPID-Human-Labeled.csv',
 'CUPID-Baselines-Labeled.csv']

In [4]:
!wc -l $ROOTDIR/*

   189569 cupid/CUPID-Auto-Labeled.csv
  1027241 cupid/CUPID-Baselines-Labeled.csv
   247383 cupid/CUPID-Human-Labeled.csv
  1464193 total


The files differ in length. Since the documentation does not say how they relate, we treat them as different scenarios and analyze them as a whole.

In [5]:
# Read every file in the directory and concatenate the frames in one step
frames = [pd.read_csv(os.path.join(ROOTDIR, file)) for file in os.listdir(ROOTDIR)]
data_all = pd.concat(frames)

In [6]:
data_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1464190 entries, 0 to 1027239
Data columns (total 85 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   Unnamed: 0                  1464190 non-null  int64  
 1   Flow ID                     1464190 non-null  object 
 2   Src IP                      1464190 non-null  object 
 3   Src Port                    1464190 non-null  int64  
 4   Dst IP                      1464190 non-null  object 
 5   Dst Port                    1464190 non-null  int64  
 6   Protocol                    1464190 non-null  int64  
 7   Timestamp                   1464190 non-null  object 
 8   Flow Duration               1464190 non-null  int64  
 9   Total Fwd Packet            1464190 non-null  int64  
 10  Total Bwd packets           1464190 non-null  int64  
 11  Total Length of Fwd Packet  1464190 non-null  float64
 12  Total Length of Bwd Packet  1464190 non-null  float64
 1

In [7]:
len(data_all)

1464190

In [8]:
data_all.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,0,192.168.1.40-192.168.1.13-52947-80-6,192.168.1.40,52947,192.168.1.13,80,6,05/06/2019 12:33:59 PM,322238,7,6,268.0,1042.0,131.0,0.0,38.285714,63.373797,518.0,0.0,173.666667,266.729576,4065.318,40.342852,26853.166667,69312.185628,239406.0,0.0,322238.0,53706.333333,130946.981219,320999.0,0.0,301564.0,60312.8,103669.471836,239406.0,1.0,0,0,0,0,164,144,21.723074,18.619778,0.0,518.0,93.571429,185.734183,34497.19,1,4,0,4,11,0,0,0,0.0,100.769231,38.285714,173.666667,0,0,0,0,6,4353,0,20,0,80,8192,256,3,20,0,0,0,0,1559763000000000.0,94487.568951,1559763000000000.0,1559763000000000.0,0
1,1,192.168.1.40-192.168.1.13-52947-80-6,192.168.1.40,52947,192.168.1.13,80,6,05/06/2019 12:33:59 PM,193,1,3,6.0,6.0,6.0,6.0,6.0,0.0,6.0,0.0,2.0,3.464102,62176.17,20725.388601,64.333333,72.885755,144.0,1.0,0.0,0.0,0.0,0.0,0.0,49.0,24.5,33.234019,48.0,1.0,0,0,0,0,20,60,5181.34715,15544.041451,0.0,6.0,3.6,3.286335,10.8,2,0,0,0,4,0,0,0,3.0,4.5,6.0,2.0,0,0,0,0,0,0,0,1,0,1,254,256,0,20,0,0,0,0,1559763000000000.0,0.0,1559763000000000.0,1559763000000000.0,0
2,2,192.168.1.31-192.168.1.7-52526-389-6,192.168.1.31,52526,192.168.1.7,389,6,05/06/2019 12:34:06 PM,5385,15,10,4292.0,1018.0,1460.0,0.0,286.133333,487.742741,212.0,0.0,101.8,89.915763,986072.4,4642.525534,224.375,522.505404,2112.0,1.0,5385.0,384.642857,690.836239,2202.0,1.0,3188.0,354.222222,604.468109,1839.0,1.0,0,0,0,0,324,224,2785.51532,1857.010214,0.0,1460.0,204.230769,382.126503,146020.7,1,4,0,14,23,0,0,0,0.0,212.4,286.133333,101.8,0,0,0,0,18,1050445,0,171,0,40,8192,2051,11,20,0,0,0,0,1559763000000000.0,1325.551249,1559763000000000.0,1559763000000000.0,0
3,3,192.168.1.31-192.168.1.7-52527-389-6,192.168.1.31,52527,192.168.1.7,389,6,05/06/2019 12:34:06 PM,14480,21,17,4230.0,22682.0,1460.0,0.0,201.428571,428.415403,5107.0,0.0,1334.235294,1837.703136,1858564.0,2624.309392,391.351351,1635.480837,9951.0,1.0,14480.0,724.0,2209.661608,9951.0,1.0,14447.0,902.9375,2540.249213,10236.0,1.0,0,0,0,0,444,364,1450.276243,1174.033149,0.0,5107.0,690.051282,1359.660831,1848678.0,1,4,0,14,36,0,0,0,0.0,708.210526,201.428571,1334.235294,0,0,0,0,27,1918312,0,111,0,596,8192,2051,14,20,0,0,0,0,1559763000000000.0,3691.006761,1559763000000000.0,1559763000000000.0,0
4,4,192.168.1.31-192.168.1.7-52528-389-6,192.168.1.31,52528,192.168.1.7,389,6,05/06/2019 12:34:07 PM,10463,17,16,4218.0,17808.0,1460.0,0.0,248.117647,466.114643,2920.0,0.0,1113.0,1156.476891,2105132.0,3153.971136,326.96875,1065.631465,5992.0,1.0,10463.0,653.9375,1489.906729,5992.0,1.0,3882.0,258.8,471.671526,1426.0,1.0,0,0,0,0,364,344,1624.77301,1529.198127,0.0,2920.0,647.823529,955.616427,913202.8,1,4,0,14,31,0,0,0,0.0,667.454545,248.117647,1113.0,0,0,0,0,25,2198861,0,127,0,539,8192,2051,12,20,0,0,0,0,1559763000000000.0,2051.850016,1559763000000000.0,1559763000000000.0,0


In [9]:
data_all.describe()

Unnamed: 0.1,Unnamed: 0,Src Port,Dst Port,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
count,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464158.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0,1464190.0
mean,50313.27,44879.28,14827.74,9.080236,49198990.0,15.68641,15.15098,7871.026,17418.73,394.9073,31.69453,127.1313,116.0027,519.3608,46.14171,187.2401,152.8377,inf,inf,4145972.0,11275360.0,34250650.0,4360.489,48866380.0,9244727.0,16688710.0,33955740.0,4580.647,43020000.0,9077890.0,16100690.0,31743440.0,31278.02,0.04216598,0.0,0.0,0.0,298.0946,293.2953,73376.95,47693.27,25.84992,623.5087,153.6355,165.5896,176754.8,0.6571114,1.143097,0.4007465,12.68738,26.97669,0.0,0.3600004,0.5757887,1.233336,166.515,127.1313,187.2401,0.0,0.0,0.0,0.0,11.93029,1837203.0,0.06487614,65.7829,0.0,89.98588,3612.193,1047.643,10.48458,17.80274,0.0,0.0,0.0,0.0,1382656000000000.0,7164847000000.0,1384183000000000.0,1346494000000000.0,0.116407
std,54841.06,20577.5,22853.28,5.004358,577022500.0,423.4293,441.3805,228183.5,779787.4,991.2863,81.74477,279.3301,291.8609,1489.155,107.5873,425.6189,431.2355,,,62148420.0,164244100.0,455872800.0,149855.7,576347300.0,142895500.0,248702700.0,455210500.0,220438.0,538161500.0,142554200.0,247543800.0,448151300.0,1354805.0,0.2009677,0.0,0.0,0.0,8771.204,8685.979,279925.5,185031.4,54.53025,1588.779,305.7668,386.4389,806315.1,0.6818067,1.873106,1.146187,724.0536,842.5557,0.0,1.625766,1.40108,1.360376,312.5988,279.3301,425.6189,0.0,0.0,0.0,0.0,679.1144,4807590.0,0.2463073,137.5531,0.0,204.6221,6161.29,2451.803,381.1615,7.333025,0.0,0.0,0.0,0.0,495887500000000.0,48304380000000.0,496287100000000.0,538897900000000.0,0.3207125
min,0.0,0.0,0.0,0.0,-14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-12000000.0,-2000000.0,-14.0,0.0,-14.0,-24.0,-2.0,-2.0,0.0,-2.0,-15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12622.0,40510.0,88.0,6.0,71.0,1.0,2.0,6.0,6.0,6.0,0.0,6.0,0.0,6.0,0.0,3.0,0.0,1598.539,50.45488,24.66667,26.83747,56.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,16.0,16.0,24.57094,7.069239,0.0,6.0,4.0,3.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.8,6.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,1556019000000000.0,0.0,1556031000000000.0,1556019000000000.0,0.0
50%,25443.0,51820.0,389.0,6.0,960.0,2.0,2.0,86.0,96.0,42.0,6.0,35.0,0.0,48.0,0.0,48.0,3.286335,117647.1,6893.481,183.0,192.333,746.0,1.0,4.0,4.0,0.0,4.0,1.0,5.0,5.0,0.0,5.0,1.0,0.0,0.0,0.0,0.0,20.0,40.0,2923.977,1390.821,0.0,63.0,48.0,8.215838,67.5,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,57.75,35.0,48.0,0.0,0.0,0.0,0.0,4.0,4563.0,0.0,17.0,0.0,23.0,2049.0,254.0,1.0,20.0,0.0,0.0,0.0,0.0,1556161000000000.0,73.53911,1556166000000000.0,1556161000000000.0,0.0
75%,74507.0,59479.75,35358.75,17.0,124105.2,7.0,6.0,456.0,537.75,223.0,35.0,90.0,44.92661,211.0,48.0,180.0,50.80682,1928571.0,54545.45,28565.0,45310.95,103754.8,4.0,29086.0,3579.95,5500.279,23990.5,3.0,7405.0,919.3017,2044.293,5698.0,4.0,0.0,0.0,0.0,0.0,164.0,120.0,21276.6,23411.37,35.0,265.0,176.4,94.66231,8960.952,1.0,4.0,0.0,4.0,9.0,0.0,0.0,0.0,2.0,219.8806,90.0,180.0,0.0,0.0,0.0,0.0,5.0,1582665.0,0.0,53.0,0.0,82.0,8192.0,2051.0,3.0,20.0,0.0,0.0,0.0,0.0,1571415000000000.0,29933.24,1571418000000000.0,1571415000000000.0,0.0
max,227127.0,65535.0,65534.0,17.0,11999820000.0,109274.0,161751.0,85337780.0,502301200.0,20700.0,11637.0,11637.0,7138.95,37437.0,9778.0,9778.0,7655.258,inf,inf,1910955000.0,4534405000.0,11996930000.0,118991800.0,11999820000.0,3999001000.0,6926472000.0,11997000000.0,118991800.0,11999820000.0,3998994000.0,6926460000.0,11996980000.0,113457900.0,1.0,0.0,0.0,0.0,2545988.0,3240636.0,4000000.0,4000000.0,1208.0,37437.0,7758.667,7020.624,49289160.0,2.0,14.0,14.0,207440.0,271025.0,0.0,256.0,6.0,24.0,11638.0,11637.0,9778.0,0.0,0.0,0.0,0.0,248464.0,934545500.0,1.0,6118.0,0.0,5305.0,65535.0,65535.0,100527.0,40.0,0.0,0.0,0.0,0.0,1572031000000000.0,1111549000000000.0,1572031000000000.0,1572031000000000.0,1.0
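
The summary above shows `inf` as the mean of `Flow Bytes/s` and `Flow Packets/s`, one column with a count of 1464158 instead of 1464190, and negative minima (for example -14 for `Flow Duration`). A minimal sketch for counting such values per column could look as follows; `summarize_anomalies` is a hypothetical helper, and the demo frame is synthetic rather than taken from CUPID.

```python
import numpy as np
import pandas as pd


def summarize_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN, infinite, and negative values in every numeric column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_nan": numeric.isna().sum(),
        "n_inf": numeric.isin([np.inf, -np.inf]).sum(),
        "n_negative": (numeric < 0).sum(),
    })


# Synthetic demo data, not real CUPID rows
demo = pd.DataFrame({
    "Flow Duration": [10, -14, 0],
    "Flow Bytes/s": [1.0, np.inf, np.nan],
})
report = summarize_anomalies(demo)
```

Calling `summarize_anomalies(data_all)` and filtering for rows with any nonzero count would list the affected features.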


In [10]:
# Label distribution
data_all['Label'].value_counts()

0    1293748
1     170442
Name: Label, dtype: int64

In [11]:
# Number and distribution of source IPs
data_all['Src IP'].value_counts().head(10)

192.168.1.3      264955
192.168.1.9      244362
192.168.1.7      242763
10.10.10.1        67867
192.168.1.11      64671
10.10.10.19       59337
10.10.10.13       55398
192.168.1.127     54131
192.168.1.5       28412
13.89.172.0       25180
Name: Src IP, dtype: int64

In [12]:
data_all['Src IP'].nunique()

561

In [13]:
# Destination port breakdown
data_all['Dst Port'].value_counts()

389      223181
53       212658
88       157389
80        89539
49666     81637
          ...  
38649         1
24660         1
38235         1
28315         1
47941         1
Name: Dst Port, Length: 56354, dtype: int64

In [14]:
# Protocols distribution
data_all['Protocol'].value_counts()

6     1039791
17     415085
0        9314
Name: Protocol, dtype: int64

## Time and Continuity Analysis

In [15]:
data_all['tstamp'] = pd.to_datetime(data_all['Timestamp'])
data_all['tstamp'].head()

0   2019-05-06 12:33:59
1   2019-05-06 12:33:59
2   2019-05-06 12:34:06
3   2019-05-06 12:34:06
4   2019-05-06 12:34:07
Name: tstamp, dtype: datetime64[ns]
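
Strings such as `05/06/2019 12:33:59 PM` are ambiguous, and `pd.to_datetime` reads them month-first by default, as the output above shows. The sketch below parses the same string with both explicit formats and compares the results with the `Idle Mean` value from the first row of `head()`. Reading that value as a Unix timestamp in microseconds is an assumption, not something the dataset documentation confirms.

```python
import pandas as pd

sample = pd.Series(["05/06/2019 12:33:59 PM", "05/06/2019 12:34:06 PM"])

# Explicit formats remove the ambiguity of the default parser
month_first = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p")
day_first = pd.to_datetime(sample, format="%d/%m/%Y %I:%M:%S %p")

# Idle Mean of the first row as printed by head() (rounded in the display);
# ASSUMPTION: the value is microseconds since the Unix epoch, in UTC
idle_us = 1559763000000000
idle_ts = pd.to_datetime(idle_us, unit="us")

print(month_first.iloc[0], day_first.iloc[0], idle_ts)
```

If the assumption holds, the idle value falls on 5 June 2019, which matches the day-first reading rather than the month-first one. In that case the timespan and gap figures computed from `tstamp` would be worth re-checking with an explicit `format`.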

In [16]:
data_sorted = data_all.sort_values(by='tstamp', ascending=False)

In [17]:
timespan = data_sorted.iloc[0]['tstamp'] - data_sorted.iloc[-1]['tstamp']
timespan

Timedelta('275 days 22:41:59')

In [18]:
# Are there gaps in the data (was the capture interrupted?)
def measure_real_capture_dur(data: pd.Series, gap_max_secs: int = 300) -> tuple:
    """Measure the capture time actually covered by the data.

    Expects a Series of timestamps sorted in descending order. Returns a tuple
    (total_dur, contiguous, cont_durations): the summed duration excluding gaps
    longer than gap_max_secs, whether no such gap was found, and the lengths of
    the contiguous blocks sorted from longest to shortest.
    """
    total_dur = pd.Timedelta(seconds=0)
    current_dur = pd.Timedelta(seconds=0)
    cont_durations = []
    last_tstamp = data.iloc[0]
    contiguous = True
    max_gap = pd.Timedelta(seconds=gap_max_secs)

    # Walk the timestamps from newest to oldest and look for gaps
    for cur_tstamp in data:
        dur_gap = last_tstamp - cur_tstamp

        if dur_gap <= max_gap:
            total_dur += dur_gap
            current_dur += dur_gap
        else:
            # Gap found: close the current contiguous block and start a new one
            cont_durations.append(current_dur)
            current_dur = pd.Timedelta(seconds=0)
            contiguous = False

        last_tstamp = cur_tstamp

    # Record the final contiguous block and sort the blocks by length
    cont_durations.append(current_dur)
    cont_durations.sort(reverse=True)

    return total_dur, contiguous, cont_durations

In [19]:
measure_real_capture_dur(data_sorted['tstamp'], int(timespan.total_seconds() * 0.01))

(Timedelta('7 days 15:46:26'),
 False,
 [Timedelta('2 days 23:58:54'),
  Timedelta('2 days 06:28:30'),
  Timedelta('2 days 03:02:42'),
  Timedelta('0 days 02:23:34'),
  Timedelta('0 days 01:29:00'),
  Timedelta('0 days 01:27:36'),
  Timedelta('0 days 00:46:55'),
  Timedelta('0 days 00:09:15')])

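The CUPID data is discontinuous. With the gap threshold set to 1% of the total timespan, it splits into 8 contiguous blocks that together cover only about 7.7 days of the roughly 276-day span, so there are more gaps than there are files.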
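
The function above reports how long the contiguous blocks are, but not where the gaps lie. A vectorized sketch that lists each gap with its start and end is shown below; `find_gaps` is a hypothetical helper, and the demo timestamps are synthetic.

```python
import pandas as pd


def find_gaps(tstamps: pd.Series, gap_max_secs: int = 300) -> pd.DataFrame:
    """Return start, end, and length of every gap longer than gap_max_secs."""
    ts = tstamps.sort_values().reset_index(drop=True)  # ascending order
    deltas = ts.diff()  # time elapsed since the previous timestamp
    mask = deltas > pd.Timedelta(seconds=gap_max_secs)
    return pd.DataFrame({
        "gap_start": ts.shift(1)[mask],
        "gap_end": ts[mask],
        "gap_length": deltas[mask],
    }).reset_index(drop=True)


# Synthetic demo: two gaps longer than 5 minutes
demo = pd.Series(pd.to_datetime([
    "2019-01-01 00:00:00",
    "2019-01-01 00:01:00",
    "2019-01-01 00:20:00",
    "2019-01-01 00:21:00",
    "2019-01-02 00:00:00",
]))
gaps = find_gaps(demo, gap_max_secs=300)
```

Applied to `data_all['tstamp']` with the same threshold as in cell 19, this would show which date ranges separate the contiguous blocks. Unlike the loop above, it sorts the timestamps in ascending order itself.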