# NGIDS-DS Dataset Analysis

* **Author:** Patrik Goldschmidt (igoldschmidt@fit.vut.cz)
* **Project:** Network Intrusion Datasets: A Survey, Limitations, and Recommendations
* **Date:** 2024

In [1]:
import pandas as pd
import numpy as np

In [3]:
PATH = '/data/kinit/gold_data/ngids-ds/NGIDS/NGIDS-DS-v1/'
PATH_PCAP = PATH + 'NGIDS.pcap'
PATH_GT = PATH + 'ground_truth.csv'
PATH_FTRS = PATH + 'feature_descr.csv'

In [6]:
GT_HEADER = ['date', 'time', 'event', 'batch', 'desc', 'details', 'conn']

In [8]:
# header=0 treats the first row of the file as a header and replaces it with GT_HEADER.
gt = pd.read_csv(PATH_GT, header=0, names=GT_HEADER)

In [9]:
gt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 313925 entries, 0 to 313924
Data columns (total 7 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   date     313925 non-null  object
 1   time     313925 non-null  object
 2   event    313925 non-null  object
 3   batch    313925 non-null  object
 4   desc     313925 non-null  object
 5   details  313925 non-null  object
 6   conn     313925 non-null  object
dtypes: object(7)
memory usage: 16.8+ MB
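
Both `date` and `time` load as `object` columns, so any time-based analysis needs them parsed first. The following is a minimal sketch, assuming the day-first `dd/mm/yyyy` date and `h:mm:ss` time layout holds for every row; the `timestamp` column name and the two-row `sample` frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for the ground-truth frame (two rows in the dataset's format).
sample = pd.DataFrame({
    'date': ['14/03/2016', '15/03/2016'],
    'time': ['12:14:24', '4:19:12'],
})

# Join both columns and parse them with an explicit day-first format.
# errors='coerce' turns rows that do not match the format into NaT instead of raising.
sample['timestamp'] = pd.to_datetime(
    sample['date'].str.strip() + ' ' + sample['time'].str.strip(),
    format='%d/%m/%Y %H:%M:%S',
    errors='coerce',
)

print(sample['timestamp'])
```

Counting the `NaT` values afterwards (`sample['timestamp'].isna().sum()`) gives a quick estimate of how many rows deviate from the assumed layout.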


In [10]:
gt.head()

| | date | time | event | batch | desc | details | conn |
|---|---|---|---|---|---|---|---|
| 0 | 14/03/2016 | 12:14:24 | Backdoors | All Batch | BlackEnergy Botnet Command and Control Communi... | http://atlas-public.ec2.arbor.net/docs/BlackEn... | 175.45.176.1:3495->10.40.85.32:58782 |
| 1 | 15/03/2016 | 4:19:12 | Backdoors | All Batch | Backdoor: Cisco Prime LAN Management (https://... | CVE 2012-6392 (http://cve.mitre.org/cgi-bin/cv... | 175.45.176.0:13177->10.40.85.32:514 |
| 2 | 14/03/2016 | 9:36:00 | Backdoors | All Batch | phpmyadmin 3.5.2.2 Backdoor Access and Code Ex... | CVE 2012-5159 (http://cve.mitre.org/cgi-bin/cv... | IP 175.45.176.2:0->10.40.85.32:0 175.45.176.2:... |
| 3 | 15/03/2016 | 11:45:36 | Backdoors | All Batch | Android AndroidKungFu Malware Command and Cont... | http://about-threats.trendmicro.com/malware.as... | 175.45.176.3:61508->10.40.85.32:7500 |
| 4 | 11/03/2016 | 9:21:36 | Backdoors | All Batch | Cisco Network Registrar Default Credentials Ba... | CVE 2011-2024 (http://cve.mitre.org/cgi-bin/cv... | IP 175.45.176.0:0->10.40.85.32:0 175.45.176.0:... |


In [11]:
gt['event'].value_counts()

 Exploits                              158316
Exploits                                73301
Malware                                 35903
 Denial of Service                      18702
Generic                                 11300
                                        ...  
 175.45.176.1:48788->10.40.85.32:80         1
 175.45.176.2:31611->10.40.85.32:80         1
 175.45.176.1:21911->10.40.85.32:80         1
 175.45.176.1:15918->10.40.85.32:80         1
 175.45.176.2:46146->10.40.85.32:80         1
Name: event, Length: 357, dtype: int64
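
The counts above show the same category under two spellings (`' Exploits'` and `'Exploits'`) and connection strings appearing as `event` values. The sketch below is one possible way to normalize the labels and flag the suspicious rows. It assumes that leading whitespace is the only difference between the duplicated labels and that an `event` value shaped like `ip:port->ip:port` indicates a misaligned record; neither assumption was verified against the full file. The `event_clean` and `misaligned` column names are hypothetical.

```python
import pandas as pd

# Synthetic stand-in reproducing the two problems visible in the counts:
# padded duplicates of a category and a connection string in the `event` column.
sample = pd.DataFrame({'event': [
    ' Exploits', 'Exploits', ' Denial of Service', 'Malware',
    ' 175.45.176.1:48788->10.40.85.32:80',
]})

# Strip surrounding whitespace so ' Exploits' and 'Exploits' collapse into one label.
sample['event_clean'] = sample['event'].str.strip()

# Flag rows whose "event" looks like an ip:port->ip:port connection string.
sample['misaligned'] = sample['event_clean'].str.contains(
    r'\d+\.\d+\.\d+\.\d+:\d+->', regex=True
)

# Category counts over the rows that were not flagged.
counts = sample.loc[~sample['misaligned'], 'event_clean'].value_counts()
print(counts)
```

Keeping the cleaned label in a separate column leaves the raw `event` values available for inspecting how the flagged rows were shifted.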

In [12]:
gt[gt['event'] == ' Denial of Service'].head()

| | date | time | event | batch | desc | details | conn |
|---|---|---|---|---|---|---|---|
| 334 | 11/03/2016 | 4:48:00 | Denial of Service | Miscellaneous Batch | Music Bee 2.0.4663 File Denial of Service (htt... | OSVDB 94534 (http://www.osvdb.org/94534)CVSS-M... | TCP 175.45.176.1:5713->10.40.85.32:25 |
| 463 | 15/03/2016 | 3:36:00 | Denial of Service | Asterisk | Digium Asterisk Maximum HTTP Connections Denia... | CVE 2014-4047 (http://cve.mitre.org/cgi-bin/cv... | IP 175.45.176.2:0->10.40.85.32:0IP 175.45.176.... |
| 464 | 15/03/2016 | 11:45:36 | Denial of Service | Asterisk | Digium Asterisk Cookie Stack Overflow CVE-2014... | CVE 2014-2286 (http://cve.mitre.org/cgi-bin/cv... | 175.45.176.0:36694->10.40.85.32:8088 |
| 465 | 15/03/2016 | 11:45:36 | Denial of Service | Asterisk | Digium Asterisk SIP INVITE Session Expiration ... | CVE 2014-2287 (http://cve.mitre.org/cgi-bin/cv... | IP 175.45.176.3:0->10.40.85.32:0UDP 175.45.176... |
| 466 | 15/03/2016 | 4:19:12 | Denial of Service | Asterisk | Digium Asterisk Maximum HTTP Connections Denia... | CVE 2014-4047 (http://cve.mitre.org/cgi-bin/cv... | IP 175.45.176.2:0->10.40.85.32:0IP 175.45.176.... |
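
The `conn` column mixes several layouts: a bare `src:port->dst:port` pair, a pair prefixed with a protocol tag such as `TCP`, and several flows concatenated in one string. A minimal sketch of splitting such strings into individual flows follows. The `parse_conn` helper and the regular expression are illustrative assumptions based only on the rows displayed here; the third test string is synthetic, since the real multi-flow values are truncated in the output.

```python
import re

# Optional upper-case protocol tag, then src_ip:src_port->dst_ip:dst_port.
CONN_RE = re.compile(
    r'(?:(?P<proto>[A-Z]+)\s+)?'
    r'(?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<src_port>\d+)->'
    r'(?P<dst_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dst_port>\d+)'
)

def parse_conn(conn):
    """Return one dict per flow found in a `conn` string (hypothetical helper)."""
    flows = []
    for match in CONN_RE.finditer(conn):
        flow = match.groupdict()
        # Ports are captured as text; convert them to integers.
        flow['src_port'] = int(flow['src_port'])
        flow['dst_port'] = int(flow['dst_port'])
        flows.append(flow)
    return flows

single = parse_conn('175.45.176.0:36694->10.40.85.32:8088')
tagged = parse_conn('TCP 175.45.176.1:5713->10.40.85.32:25')
# Synthetic multi-flow string imitating the concatenated layout.
multi = parse_conn('IP 175.45.176.2:0->10.40.85.32:0TCP 175.45.176.2:1234->10.40.85.32:80')

print(single, tagged, multi, sep='\n')
```

Flows without a protocol tag get `proto=None`. Strings in which two untagged flows are concatenated without any separator would be ambiguous under this pattern and need separate handling.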


## Version 2

In [24]:
DATA_COLS = [
    'date',
    'time',
    'pid',
    'path',
    'syscalls',
    'eventid',
    'attack_cat',
    'attack_type',
    'label',
    'c10',
    'c11',
    'joint'
]

In [25]:
TRAIN_CSV_PATH = '/data/kinit/gold_data/ngids-ds/NGIDS/NGIDS-DS-v2/Traing data/0_1.csv'

In [26]:
# header=0 treats the first row of the file as a header and replaces it with DATA_COLS.
data_train1 = pd.read_csv(TRAIN_CSV_PATH, header=0, names=DATA_COLS)

In [27]:
data_train1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3329730 entries, 0 to 3329729
Data columns (total 12 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   date         object
 1   time         object
 2   pid          int64 
 3   path         object
 4   syscalls     int64 
 5   eventid      int64 
 6   attack_cat   object
 7   attack_type  object
 8   label        int64 
 9   c10          int64 
 10  c11          int64 
 11  joint        int64 
dtypes: int64(7), object(5)
memory usage: 304.8+ MB
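
The `+` in the reported memory usage means the string content of the `object` columns is not included, so the real footprint of this single file is higher. One common way to reduce it is sketched below on a synthetic frame. The sketch assumes that columns such as `path` and `attack_cat` contain few distinct values relative to the row count, which was not checked here; the `sample` and `compact` names are illustrative.

```python
import pandas as pd

# Synthetic stand-in: a few distinct strings repeated many times.
n = 10_000
sample = pd.DataFrame({
    'path': ['/usr/sbin/apache2', '/usr/bin/compiz'] * (n // 2),
    'attack_cat': ['normal'] * n,
    'label': [0] * n,
})

# deep=True also counts the string payload of object columns.
before = sample.memory_usage(deep=True).sum()

# Categoricals store each distinct string once plus small integer codes.
compact = sample.astype({'path': 'category', 'attack_cat': 'category'})
# Downcast the integer label to the smallest integer type that holds its values.
compact['label'] = pd.to_numeric(compact['label'], downcast='integer')

after = compact.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')
```

The same dtypes can also be passed to `pd.read_csv` through its `dtype` parameter, so the larger representation is never materialized.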


In [28]:
data_train1.head()

| | date | time | pid | path | syscalls | eventid | attack_cat | attack_type | label | c10 | c11 | joint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15/03/2016 | 17:49:36 | 11378 | /usr/sbin/apache2 | 78 | 90416105 | normal | normal | 0 | 350867 | 1 | 52 |
| 1 | 15/03/2016 | 17:49:36 | 11378 | /usr/sbin/apache2 | 78 | 90416131 | normal | normal | 0 | 350868 | 1 | 52 |
| 2 | 15/03/2016 | 17:49:36 | 2110 | /usr/bin/compiz | 102 | 90416222 | normal | normal | 0 | 350869 | 1 | 69 |
| 3 | 15/03/2016 | 17:49:36 | 2110 | /usr/bin/compiz | 102 | 90416246 | normal | normal | 0 | 350870 | 1 | 69 |
| 4 | 15/03/2016 | 17:49:36 | 2110 | /usr/bin/compiz | 265 | 90416060 | normal | normal | 0 | 350871 | 1 | 171 |


The duration and continuity analysis was performed directly on the PCAP files, outside this Jupyter notebook.