## Query Output Preprocessing

The graph database Dgraph returns query results in JSON format. The queries retrieve all `originated` and all `responded` connections of a specified host. The `query_handler` tool converts these JSON outputs to CSV files: two files per host IP address, one for each connection direction (`originated`, `responded`).

This Jupyter notebook is used to:

1. Compute the neighbourhoods of these hosts. *(For each connection compute its neighbourhood from connections in a given time interval.)*
2. Assign labels. 
3. Concatenate the DataFrames into one final `df` and write the result to a single file, ready for ML preprocessing (feature engineering).

### IDS commands:

```
capinfos -T -r -a -e Tuesday-WorkingHours.pcap
```

```
$ editcap -F libpcap -A "2017-07-04 13:45:00" -B "2017-07-04 14:15:00" Tuesday-WorkingHours.pcap tuesday_smaller.pcap

# (trying time shift:)
$ editcap -F libpcap -A "2017-07-04 17:45:00" -B "2017-07-04 18:15:00" Tuesday-WorkingHours.pcap tuesday_smaller.pcap 
# (ends up in 15, 16)

# (after the actual time shift:)
$ editcap -F libpcap -A "2017-07-04 19:45:00" -B "2017-07-04 20:15:00" Tuesday-WorkingHours.pcap tuesday_smaller.pcap
```

```
$ capinfos -T -r -a -e tuesday_smaller.pcap
```

### Tool commands:
```
$ python3 granef.py -o extraction -d /home/sramkova/diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4 -i tuesday_smaller.pcap -t run
```

```
$ cd diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4/
$ mkdir originated
$ mkdir responded
$ python3 query_handler.py -im -od /home/sramkova/diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4/
$ python3 query_handler.py -cm -od /home/sramkova/diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4/ --ips_csv /home/sramkova/diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4/output.csv
$ mv output-o-* originated/.
$ mv output-r-* responded/.
```



## Neighbourhood Computation

### 1. Load the data

In [1]:
import pandas as pd
import numpy as np
import os

PREFIX = '/home/sramkova/diploma_thesis_data/cicids2017/granef/tuesday/ssh_and_normal4/'
DIR_PATH_ORIG = PREFIX + 'originated'
DIR_PATH_RESP = PREFIX + 'responded'


file_list_orig = []
file_list_resp = []

def get_file_names(file_list, dir_path):
    for filename in os.listdir(dir_path):
        # only IPv4: 
        if 'f' not in filename and filename.endswith('.csv'):
            # (if there is an 'f' present in the name of the file, it means that the file contains 
            # connections of a host with IPv6 address)
            file_list.append(filename)

# load filenames to lists:
get_file_names(file_list_orig, DIR_PATH_ORIG)
get_file_names(file_list_resp, DIR_PATH_RESP)

print(len(file_list_orig))
print(len(file_list_resp))

17
1354
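
The `'f' not in filename` test is a heuristic: an IPv6 address that happens to contain no hex digit `f` would pass it. A minimal sketch of a stricter check follows, assuming the files are named `output-o-<ip>.csv` and `output-r-<ip>.csv` as in the `mv` commands above. The helper names `ip_from_filename` and `is_ipv4_file` and the sample file names are hypothetical.

```python
import ipaddress

def ip_from_filename(filename, prefix):
    """Strip the 'output-<prefix>' head and the '.csv' tail; return what is left."""
    head = 'output-' + prefix
    if not (filename.startswith(head) and filename.endswith('.csv')):
        return None
    return filename[len(head):-len('.csv')]

def is_ipv4_file(filename, prefix):
    """True only if the address embedded in the file name parses as IPv4."""
    candidate = ip_from_filename(filename, prefix)
    if candidate is None:
        return False
    try:
        return ipaddress.ip_address(candidate).version == 4
    except ValueError:
        return False

# Hypothetical directory listing (assumed naming scheme):
names = ['output-o-192.168.10.25.csv',
         'output-o-fe80::1.csv',
         'output-o-2001:db8::1.csv',   # IPv6 without an 'f'
         'notes.txt']
ipv4_only = [n for n in names if is_ipv4_file(n, 'o-')]
```

With this check, the IP string returned by `ip_from_filename` could also serve as the dictionary key when the files are loaded.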


In [2]:
# load as dataframes to a dictionary for easier processing:

# elements of the dictionary have the form: { host.ip -> df with connections of the corresponding host }
dfs_orig = {}
dfs_resp = {}

def load_files_to_dfs(dfs_dict, file_list, dir_path, prefix):
    prefix_name = 'output-' + prefix
    for filename in file_list:
        file_ip = filename
        file_ip = file_ip.replace(prefix_name, '').replace('.csv', '')
        df_conns = pd.read_csv(dir_path + '/' + filename)

        df_conns['connection.time'] = pd.to_datetime(df_conns['connection.ts'])
        
        # a missing connection.service value means that Zeek wasn't able to extract the service => nulls can 
        # be treated as a new category (assignment avoids in-place fillna on a column selection)
        df_conns['connection.service'] = df_conns['connection.service'].fillna('none')

        dfs_dict[file_ip] = df_conns

load_files_to_dfs(dfs_orig, file_list_orig, DIR_PATH_ORIG, 'o-')
load_files_to_dfs(dfs_resp, file_list_resp, DIR_PATH_RESP, 'r-')

print(len(dfs_orig))
print(len(dfs_resp))

17
1354


Dataset information (source: https://www.unb.ca/cic/datasets/ids-2017.html):

Tuesday, July 4, 2017

*2017-07-04 13:45:00*

FTP-Patator (**9:20 – 10:20 a.m.**)

SSH-Patator (**`14:00 – 15:00 p.m.`**)

Attacker: Kali, **`205.174.165.73`** (205.174.165.73 -> 205.174.165.80 (Valid IP of the Firewall) -> 172.16.0.1 -> 192.168.10.50)

Victim: WebServer Ubuntu, **`205.174.165.68`** (Local IP: **`192.168.10.50`**) (192.168.10.50 -> 172.16.0.1 -> 205.174.165.80 -> 205.174.165.73)

In [3]:
# dfs_orig

In [4]:
o_max = dfs_orig['192.168.10.25']['connection.time'][0]
o_min = dfs_orig['192.168.10.25']['connection.time'][0]

for o_ip in dfs_orig:
    o_df = dfs_orig[o_ip]
    cur_max = o_df['connection.time'].max()
    cur_min = o_df['connection.time'].min()
    if cur_max > o_max:
        o_max = cur_max
        # print(o_ip)
    if cur_min < o_min:
        o_min = cur_min
        # print(o_ip)

print(o_min)
print(o_max)

2017-07-04 17:45:00.041825+00:00
2017-07-04 18:14:59.691410+00:00
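
The same bounds can be computed without seeding the loop with one specific host. A minimal sketch, assuming a dictionary of DataFrames shaped like `dfs_orig`; the function `time_bounds` and the toy data are illustrative only.

```python
import pandas as pd

def time_bounds(dfs, column='connection.time'):
    """Earliest and latest timestamp across all per-host DataFrames."""
    series = [df[column] for df in dfs.values() if len(df) > 0]
    return min(s.min() for s in series), max(s.max() for s in series)

# Toy stand-in for dfs_orig (two hosts, timezone-aware timestamps):
toy = {
    '10.0.0.1': pd.DataFrame({'connection.time': pd.to_datetime(
        ['2017-07-04 17:45:00', '2017-07-04 17:50:00'], utc=True)}),
    '10.0.0.2': pd.DataFrame({'connection.time': pd.to_datetime(
        ['2017-07-04 18:10:00'], utc=True)}),
}
first, last = time_bounds(toy)
```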


### 2. Compute neighbourhoods for each row based on a time interval

(e.g. time interval: +- 5 minutes)
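
As a small illustration of the time interval, the sketch below selects the rows that fall within a window around one connection. It uses the same half-open bounds as the mask in `compute_time_neighbourhood` (start excluded, end included); the helper name `time_window` and the toy data are assumptions made for this example.

```python
import pandas as pd

def time_window(df, centre, minutes, column='connection.time'):
    """Rows with centre - minutes < t <= centre + minutes."""
    delta = pd.Timedelta(minutes=minutes)
    mask = (df[column] > centre - delta) & (df[column] <= centre + delta)
    return df.loc[mask]

toy = pd.DataFrame({
    'connection.uid': ['a', 'b', 'c', 'd'],
    'connection.time': pd.to_datetime(['2017-07-04 17:50:00',
                                       '2017-07-04 17:54:00',
                                       '2017-07-04 17:59:00',
                                       '2017-07-04 18:05:00']),
})
centre = toy.loc[1, 'connection.time']          # connection 'b' at 17:54
neighbours = time_window(toy, centre, minutes=5)
```

Here the window is (17:49, 17:59], so connections `a`, `b` and `c` are selected and `d` is not. The centre connection itself is part of the window, which is why the similarity functions remove it by `connection.uid` afterwards.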

In [5]:
# various stat functions on attributes from neighbourhood:

def get_counts(df, prefix):
    # counts (overall + counts of different protocols): 
    proto_tcp_count = 0
    proto_udp_count = 0
    proto_icmp_count = 0
            
    if 'connection.proto' in df:
        proto_counts = df['connection.proto'].value_counts()
        proto_tcp_count = proto_counts['tcp'] if 'tcp' in proto_counts else 0
        proto_udp_count = proto_counts['udp'] if 'udp' in proto_counts else 0
        proto_icmp_count = proto_counts['icmp'] if 'icmp' in proto_counts else 0
    
    return {prefix + '_total': len(df.index),
            prefix + '_proto_tcp_count': proto_tcp_count,
            prefix + '_proto_udp_count': proto_udp_count,
            prefix + '_proto_icmp_count': proto_icmp_count
           }

def get_modes(df, prefix):
    # .mode()[0] returns the most frequent value of a categorical variable
    return {prefix + '_connection.protocol_mode': df['connection.proto'].mode()[0] if 'connection.proto' in df else '-',
            prefix + '_connection.service_mode': df['connection.service'].mode()[0] if 'connection.service' in df else '-',
            prefix + '_connection.conn_state_mode': df['connection.conn_state'].mode()[0] if 'connection.conn_state' in df else '-'
           }

def get_means(df, prefix):
    # .mean() returns the mean of the corresponding numerical attribute values
    # (fallback for a missing time column is pd.NaT: cur_time is not defined in this scope)
    return {prefix + '_connection.time_mean': df['connection.time'].mean() if 'connection.time' in df else pd.NaT,
            prefix + '_connection.duration_mean': df['connection.duration'].mean() if 'connection.duration' in df else 0, 
            # prefix + '_connection.orig_p_mean': df['connection.orig_p'].mean() if 'connection.orig_p' in df else 0, 
            prefix + '_connection.orig_bytes_mean': df['connection.orig_bytes'].mean() if 'connection.orig_bytes' in df else 0,
            prefix + '_connection.orig_pkts_mean': df['connection.orig_pkts'].mean() if 'connection.orig_pkts' in df else 0, 
            # prefix + '_connection.resp_p_mean': df['connection.resp_p'].mean() if 'connection.resp_p' in df else 0,
            prefix + '_connection.resp_bytes_mean': df['connection.resp_bytes'].mean() if 'connection.resp_bytes' in df else 0,
            prefix + '_connection.resp_pkts_mean': df['connection.resp_pkts'].mean() if 'connection.resp_pkts' in df else 0
           }

def get_stats_means(df, prefix):
    # .mean() returns mean of the corresponding numerical attribute variable values
    return {prefix + '_dns_count_mean': df['dns_count'].mean() if 'dns_count' in df else 0,
            prefix + '_ssh_count_mean': df['ssh_count'].mean() if 'ssh_count' in df else 0, 
            prefix + '_http_count_mean': df['http_count'].mean() if 'http_count' in df else 0,
            prefix + '_ssl_count_mean': df['ssl_count'].mean() if 'ssl_count' in df else 0,
            prefix + '_files_count_mean': df['files_count'].mean() if 'files_count' in df else 0
           }

def get_medians(df, prefix):
    # .median() returns the median of the corresponding numerical attribute values
    # (fallback for a missing time column is pd.NaT: cur_time is not defined in this scope)
    return {prefix + '_connection.time_median': df['connection.time'].median() if 'connection.time' in df else pd.NaT,
            prefix + '_connection.duration_median': df['connection.duration'].median() if 'connection.duration' in df else 0, 
            # prefix + '_connection.orig_p_median': df['connection.orig_p'].median() if 'connection.orig_p' in df else 0,
            prefix + '_connection.orig_bytes_median': df['connection.orig_bytes'].median() if 'connection.orig_bytes' in df else 0,
            prefix + '_connection.orig_pkts_median': df['connection.orig_pkts'].median() if 'connection.orig_pkts' in df else 0, 
            # prefix + '_connection.resp_p_median': df['connection.resp_p'].median() if 'connection.resp_p' in df else 0,
            prefix + '_connection.resp_bytes_median': df['connection.resp_bytes'].median() if 'connection.resp_bytes' in df else 0,
            prefix + '_connection.resp_pkts_median': df['connection.resp_pkts'].median() if 'connection.resp_pkts' in df else 0
           }

def get_orig_ports(df, prefix):
    # count orig_p categories:
    orig_well_known_count = 0
    orig_reg_or_dyn_count = 0
    unique_orig_p_list = df['connection.orig_p'].unique().tolist()
    values_orig_p = df['connection.orig_p'].value_counts()
    
    for uniq_p in unique_orig_p_list:
        if uniq_p < 1024:
            orig_well_known_count += values_orig_p[uniq_p]
        else:
            orig_reg_or_dyn_count += values_orig_p[uniq_p]
            
    return {prefix + '_orig_p_well_known_count': orig_well_known_count,
            prefix + '_orig_p_reg_or_dyn_count': orig_reg_or_dyn_count}

def get_resp_ports(df, prefix):
    # count resp_p categories:
    common_ports = {21: 0, 
                    22: 0, 
                    53: 0, 
                    80: 0, 
                    123: 0, 
                    443: 0, 
                    3389: 0}
    resp_well_known = 0
    resp_reg = 0
    resp_dyn = 0
    unique_resp_p_list = df['connection.resp_p'].unique().tolist()
    values_resp_p = df['connection.resp_p'].value_counts()
    
    for uniq_p in unique_resp_p_list:
        if uniq_p in common_ports.keys():
            common_ports[uniq_p] += values_resp_p[uniq_p]
        elif uniq_p < 1024:
            resp_well_known += values_resp_p[uniq_p]
        elif uniq_p < 49152:
            resp_reg += values_resp_p[uniq_p]
        else:
            resp_dyn += values_resp_p[uniq_p]
            
    return {prefix + '_resp_p_21_count': common_ports[21],
            prefix + '_resp_p_22_count': common_ports[22],
            prefix + '_resp_p_53_count': common_ports[53], 
            prefix + '_resp_p_80_count': common_ports[80],
            prefix + '_resp_p_123_count': common_ports[123],
            prefix + '_resp_p_443_count': common_ports[443],
            prefix + '_resp_p_3389_count': common_ports[3389],
            prefix + '_resp_p_well_known_count': resp_well_known,
            prefix + '_resp_p_reg_count': resp_reg,
            prefix + '_resp_p_dyn_count': resp_dyn}
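
The port categories follow the IANA ranges: well-known ports 0 to 1023, registered ports 1024 to 49151, and dynamic ports 49152 to 65535. A vectorised sketch of the same bucketing with `pd.cut` is shown below. It is not the notebook's implementation: it does not single out the common ports (21, 22, 53, ...) the way `get_resp_ports` does, and `port_category_counts` is a hypothetical name.

```python
import pandas as pd

def port_category_counts(ports):
    """Count ports per IANA range without an explicit Python loop."""
    bins = [-1, 1023, 49151, 65535]              # right-inclusive bin edges
    labels = ['well_known', 'registered', 'dynamic']
    categories = pd.cut(ports, bins=bins, labels=labels).astype(str)
    counts = categories.value_counts()
    # .get() supplies 0 for a category that does not occur in the data
    return {label: int(counts.get(label, 0)) for label in labels}

toy_ports = pd.Series([22, 80, 443, 3389, 8080, 52212, 61489])
counts = port_category_counts(toy_ports)
```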

In [6]:
def generate_duration_filter(duration_val):
    # based on constants from data_exploration.ipynb
    if duration_val <= 0.0:
        # (lower, upper): zero-length connections are similar to other (near-)zero durations
        return None, 0.000001
    elif duration_val <= 0.0001:
        return 0.000001, 0.001
    elif duration_val <= 0.009:
        return 0.001, 0.05
    elif duration_val <= 0.5:
        return 0.05, 1.5
    elif duration_val <= 5:
        return 1.5, 10
    elif duration_val <= 15:
        return 10, 20
    elif duration_val <= 30:
        return 20, 40
    elif duration_val <= 50:
        return 40, 60
    elif duration_val <= 75:
        return 60, 90
    elif duration_val <= 100:
        return 75, 110
    # (lower, upper): durations above 100 are compared with other long connections
    return 100, None

def generate_bytes_filter(bytes_val):
    if bytes_val == 0:
        return 0, 0
    elif bytes_val <= 1450:
        return bytes_val - 50, bytes_val + 50
    elif bytes_val <= 35000:
        return bytes_val - 500, bytes_val + 500
    else:
        # (lower, upper): large transfers only get a lower bound
        return bytes_val - 1000, None

In [7]:
def get_similar_count(df, row, prefix):
    # protocol filter
    mask = (df['connection.proto'] == row['connection.proto'])
    df_filtered = df.loc[mask]
    
    # service filter
    mask = (df_filtered['connection.service'] == row['connection.service'])
    df_filtered = df_filtered.loc[mask]
    
    # conn_state filter
    mask = (df_filtered['connection.conn_state'] == row['connection.conn_state'])
    df_filtered = df_filtered.loc[mask]
    
    # duration filter
    lower, upper = generate_duration_filter(row['connection.duration'])
    if lower:
        mask = df_filtered['connection.duration'] >= lower
        df_filtered = df_filtered.loc[mask]
    if upper:
        mask = df_filtered['connection.duration'] <= upper
        df_filtered = df_filtered.loc[mask]
        
    # _bytes filter (bounds come from generate_bytes_filter; 'is not None' keeps a bound of 0 active)
    lower, upper = generate_bytes_filter(row['connection.orig_bytes'])
    if lower is not None:
        mask = df_filtered['connection.orig_bytes'] >= lower
        df_filtered = df_filtered.loc[mask]
    if upper is not None:
        mask = df_filtered['connection.orig_bytes'] <= upper
        df_filtered = df_filtered.loc[mask]
        
    lower, upper = generate_bytes_filter(row['connection.resp_bytes'])
    if lower is not None:
        mask = df_filtered['connection.resp_bytes'] >= lower
        df_filtered = df_filtered.loc[mask]
    if upper is not None:
        mask = df_filtered['connection.resp_bytes'] <= upper
        df_filtered = df_filtered.loc[mask]
    
    # _ip_bytes filter
    mask = (df_filtered['connection.orig_ip_bytes'] >= row['connection.orig_ip_bytes'] - 50) & (df_filtered['connection.orig_ip_bytes'] <= row['connection.orig_ip_bytes'] + 50)
    df_filtered = df_filtered.loc[mask]
    mask = (df_filtered['connection.resp_ip_bytes'] >= row['connection.resp_ip_bytes'] - 50) & (df_filtered['connection.resp_ip_bytes'] <= row['connection.resp_ip_bytes'] + 50)
    df_filtered = df_filtered.loc[mask]
    
    # remove original connection from neighbourhood (empty will have size 0 instead of 1)
    mask = (df_filtered['connection.uid'] != row['connection.uid'])
    df_filtered = df_filtered.loc[mask]

    return {prefix + '_similar_conns_count': df_filtered.shape[0]}
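
The repeated lower/upper filtering can be factored into one helper. The sketch below is an illustration under assumptions, not part of the original notebook: `apply_range` is a hypothetical name, and it treats `None` as "no bound" while keeping a bound of `0` active, which a plain truthiness check would skip.

```python
import pandas as pd

def apply_range(df, column, lower=None, upper=None):
    """Keep rows with lower <= df[column] <= upper; a bound of None is ignored."""
    if lower is not None:
        df = df.loc[df[column] >= lower]
    if upper is not None:
        df = df.loc[df[column] <= upper]
    return df

toy = pd.DataFrame({'connection.orig_bytes': [0, 40, 900, 1300]})
only_zero = apply_range(toy, 'connection.orig_bytes', lower=0, upper=0)
open_ended = apply_range(toy, 'connection.orig_bytes', lower=900)
```

With `lower=0, upper=0` only the zero-byte row remains, whereas `if lower:` and `if upper:` would both evaluate to false and leave the DataFrame unfiltered.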

In [8]:
def check_attr_value(x, attr_str, row_attr_vals_list):
    if isinstance(x, float) and np.isnan(x):
        return False
    
    if isinstance(x, list) and len(x) < 1:
        return False
    
    if isinstance(x, str) and x == '[]':
        return False
    
    if isinstance(row_attr_vals_list, list) and len(row_attr_vals_list) > 0:
        for attribute in x:
            if attribute in row_attr_vals_list:
                return True
    return False

def get_similar_attributes_count(df, row, prefix):
    neighbourhood_attributes_dict = {}
    attributes = ['dns_qtype', 'dns_rcode', 'ssh_auth_attempts', 'ssh_host_key', 'http_method', 'http_status_code', 
                  'http_user_agent', 'ssl_version', 'ssl_cipher', 'ssl_curve', 'ssl_validation_status', 'files_source',
                  'file_md5']
    
    for attr in attributes:
        if not row[attr]:
            # attribute value list is empty, no similarity is counted
            attr_dict = {prefix + '_similar_' + attr + '_count': 0}
            neighbourhood_attributes_dict.update(attr_dict)
        else:
            # filter
            mask = df[attr].apply(lambda x: check_attr_value(x, attr, row[attr]))
            df_filtered = df.loc[mask]

            # remove original connection from neighbourhood (empty will have size 0 instead of 1)
            mask = (df_filtered['connection.uid'] != row['connection.uid'])
            df_filtered = df_filtered.loc[mask]

            # add attribute count to dictionary that contains all counts
            attr_dict = {prefix + '_similar_' + attr + '_count': df_filtered.shape[0]}
            neighbourhood_attributes_dict.update(attr_dict)
    
    return neighbourhood_attributes_dict
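
`pd.read_csv` does not turn list-valued cells back into Python lists, so after loading they are plain strings such as `"['A', 'AAAA']"`; the `x == '[]'` check above suggests this is the case here. Under that assumption, the sketch below parses such cells before comparing them. `parse_list_cell` and `shares_value` are hypothetical helpers, not functions from the notebook.

```python
import ast

def parse_list_cell(value):
    """Return a list for a cell holding a list, its string form, or a missing value."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []
        return list(parsed) if isinstance(parsed, (list, tuple, set)) else []
    return []  # NaN, None or any other scalar

def shares_value(cell, reference):
    """True if the two list-like cells have at least one element in common."""
    return bool(set(parse_list_cell(cell)) & set(parse_list_cell(reference)))
```

Parsing first means the comparison works on whole attribute values rather than on the individual characters of the string representation.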

In [9]:
def compute_time_neighbourhood(host_ip, dfs_list, time_col_name, cur_time, time_start, time_end, row, prefix):
    if host_ip in dfs_list:
        ip_df = dfs_list[host_ip]
        mask = (ip_df[time_col_name] > time_start) & (ip_df[time_col_name] <= time_end)
        df = ip_df.loc[mask]

        if len(df) > 0:
            neighbourhood_dict = {}

            neighbourhood_counts = get_counts(df, prefix)
            neighbourhood_modes = get_modes(df, prefix)
            neighbourhood_means = get_means(df, prefix)
            # neighbourhood_medians = get_medians(df, prefix)
            neighbourhood_orig_ports = get_orig_ports(df, prefix)
            neighbourhood_resp_ports = get_resp_ports(df, prefix)
            neighbourhood_stats_means = get_stats_means(df, prefix)
            neighbourhood_similar_count = get_similar_count(df, row, prefix)
            neighbourhood_similar_attributes_count = get_similar_attributes_count(df, row, prefix)
            
            neighbourhood_dict.update(neighbourhood_counts)
            neighbourhood_dict.update(neighbourhood_modes)
            neighbourhood_dict.update(neighbourhood_means)
            # neighbourhood_dict.update(neighbourhood_medians)
            neighbourhood_dict.update(neighbourhood_orig_ports)
            neighbourhood_dict.update(neighbourhood_resp_ports)
            neighbourhood_dict.update(neighbourhood_stats_means)
            neighbourhood_dict.update(neighbourhood_similar_count)
            neighbourhood_dict.update(neighbourhood_similar_attributes_count)
            
            return neighbourhood_dict

    return {prefix + '_total': 0,
            prefix + '_proto_tcp_count': 0,
            prefix + '_proto_udp_count': 0,
            prefix + '_proto_icmp_count': 0,
            prefix + '_connection.protocol_mode': '-',
            prefix + '_connection.service_mode': '-',
            prefix + '_connection.conn_state_mode': '-',
            prefix + '_connection.time_mean': cur_time, # time_mean: 0 could not be here => problem later with time conversion (missing year) 
                                                        # (but does it make sense as a default value?)
            prefix + '_connection.duration_mean': 0, 
            prefix + '_connection.orig_bytes_mean': 0,
            prefix + '_connection.orig_pkts_mean': 0,
            prefix + '_connection.resp_bytes_mean': 0,
            prefix + '_connection.resp_pkts_mean': 0,
            prefix + '_orig_p_well_known_count': 0,
            prefix + '_orig_p_reg_or_dyn_count': 0,
            prefix + '_resp_p_21_count': 0,
            prefix + '_resp_p_22_count': 0,
            prefix + '_resp_p_53_count': 0, 
            prefix + '_resp_p_80_count': 0,
            prefix + '_resp_p_123_count': 0,
            prefix + '_resp_p_443_count': 0,
            prefix + '_resp_p_3389_count': 0,
            prefix + '_resp_p_well_known_count': 0,
            prefix + '_resp_p_reg_count': 0,
            prefix + '_resp_p_dyn_count': 0,
            prefix + '_dns_count_mean': 0,
            prefix + '_ssh_count_mean': 0,
            prefix + '_http_count_mean': 0,
            prefix + '_ssl_count_mean': 0,
            prefix + '_files_count_mean': 0,
            prefix + '_similar_conns_count': 0,
            prefix + '_similar_dns_qtype_count': 0,
            prefix + '_similar_dns_rcode_count': 0,
            prefix + '_similar_ssh_auth_attempts_count': 0,
            prefix + '_similar_ssh_host_key_count': 0,
            prefix + '_similar_http_method_count': 0,
            prefix + '_similar_http_status_code_count': 0,
            prefix + '_similar_http_user_agent_count': 0,
            prefix + '_similar_ssl_version_count': 0,
            prefix + '_similar_ssl_cipher_count': 0,
            prefix + '_similar_ssl_curve_count': 0,
            prefix + '_similar_ssl_validation_status_count': 0,
            prefix + '_similar_files_source_count': 0,
            prefix + '_similar_file_md5_count': 0
           }

In [10]:
from datetime import datetime  # used for the progress message in compute_neighbourhoods

NEIGHBOURHOOD_TIME_WINDOW_MINUTES_ORIG_DIRECTION = 5
NEIGHBOURHOOD_TIME_WINDOW_MINUTES_RESP_DIRECTION = 2

def compute_neighbourhoods(cur_orig_ip, dfs_list_orig, dfs_list_resp):
    df_result = pd.DataFrame()
    print('[{}]: Computing neighbourhood for connections of originator {:15} ({})'.format(datetime.now().strftime("%H:%M:%S"), cur_orig_ip, str(len(dfs_list_orig[cur_orig_ip]))))
    # iterate over rows in originated connections df of host with cur_orig_ip IP address:
    for index, row in dfs_list_orig[cur_orig_ip].iterrows():
        cur_row_dict = row.to_dict()
        cur_time = row['connection.time']
        
        time_start_orig = cur_time - pd.Timedelta(minutes=NEIGHBOURHOOD_TIME_WINDOW_MINUTES_ORIG_DIRECTION)
        time_end_orig = cur_time + pd.Timedelta(minutes=NEIGHBOURHOOD_TIME_WINDOW_MINUTES_ORIG_DIRECTION)
        time_start_resp = cur_time - pd.Timedelta(minutes=NEIGHBOURHOOD_TIME_WINDOW_MINUTES_RESP_DIRECTION)
        time_end_resp = cur_time + pd.Timedelta(minutes=NEIGHBOURHOOD_TIME_WINDOW_MINUTES_RESP_DIRECTION)
        ip_responder = row['responded_ip']
        try:
            # compute neighbourhoods (from originated connections for originator, from responded connections for responder):
            originator_neighbourhood = compute_time_neighbourhood(cur_orig_ip, dfs_list_orig, 'connection.time', cur_time, time_start_orig, time_end_orig, row, 'orig_orig')
            #originator_neighbourhood2 = compute_time_neighbourhood(cur_orig_ip, dfs_list_resp, 'connection.time', cur_time, time_start_resp, time_end_resp, row, 'orig_resp')
            #responder_neighbourhood = compute_time_neighbourhood(ip_responder, dfs_list_orig, 'connection.time', cur_time, time_start_orig, time_end_orig, row, 'resp_orig')
            #responder_neighbourhood2 = compute_time_neighbourhood(ip_responder, dfs_list_resp, 'connection.time', cur_time, time_start_resp, time_end_resp, row, 'resp_resp')

            cur_row_dict.update(originator_neighbourhood)
            #cur_row_dict.update(originator_neighbourhood2)
            #cur_row_dict.update(responder_neighbourhood)
            #cur_row_dict.update(responder_neighbourhood2)
            
            # concat to one long row and to df_result:
            row_df = pd.DataFrame([cur_row_dict])
            df_result = pd.concat([df_result, row_df], axis=0, ignore_index=True)
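        except Exception as e:
            # report the failing connection together with the error and continue with the next row
            print('Problem with originator {} and responder {} ({}): {}'.format(cur_orig_ip, ip_responder, row['connection.uid'], e))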
    return df_result
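
Calling `pd.concat` once per row copies the growing result on every iteration. A common alternative is to collect the row dictionaries in a list and build the DataFrame once at the end. The sketch below shows the pattern on toy data; `build_rows` and the `extra_for_row` callback are hypothetical stand-ins for the neighbourhood computation.

```python
import pandas as pd

def build_rows(df, extra_for_row):
    """Extend every row with extra columns and assemble the result in one step."""
    rows = []
    for _, row in df.iterrows():
        record = row.to_dict()
        record.update(extra_for_row(row))   # e.g. the neighbourhood features
        rows.append(record)
    return pd.DataFrame(rows)

toy = pd.DataFrame({'connection.uid': ['a', 'b'],
                    'connection.duration': [0.5, 2.0]})
result = build_rows(toy, lambda row: {'orig_orig_total': 1})
```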

In [11]:
from datetime import datetime
import multiprocessing
from multiprocessing import Pool
from functools import partial
from contextlib import contextmanager

@contextmanager
def poolcontext(*args, **kwargs):
    pool = multiprocessing.Pool(*args, **kwargs)
    try:
        yield pool
    finally:
        # always release the worker processes, even if the body raises
        pool.terminate()

# compute neighbourhoods using multiple processes (time optimization):
print('Start at ' + datetime.now().strftime("%H:%M:%S") + '.')
with poolcontext(processes=32) as pool:
    
    dfs_with_neighbourhoods = pool.map(
        partial(compute_neighbourhoods, 
                dfs_list_orig=dfs_orig, 
                dfs_list_resp=dfs_resp), 
        dfs_orig.keys())

print('Done at ' + datetime.now().strftime("%H:%M:%S") + '.')

Start at 08:17:43.
[08:17:45]: Computing neighbourhood for connections of originator 192.168.10.25   (2090)
[08:17:46]: Computing neighbourhood for connections of originator 185.49.84.72    (1)
[08:17:46]: Computing neighbourhood for connections of originator 192.168.10.8    (1287)
[08:17:47]: Computing neighbourhood for connections of originator 192.168.10.51   (1022)
[08:17:47]: Computing neighbourhood for connections of originator 192.168.10.50   (436)
[08:17:48]: Computing neighbourhood for connections of originator 192.168.10.12   (1286)
[08:17:48]: Computing neighbourhood for connections of originator 192.168.10.17   (763)
[08:17:49]: Computing neighbourhood for connections of originator 192.168.10.3    (3615)
[08:17:50]: Computing neighbourhood for connections of originator 192.168.10.16   (546)
[08:17:51]: Computing neighbourhood for connections of originator 192.168.10.19   (1949)
[08:17:51]: Computing neighbourhood for connections of originator 123.130.127.12  (1)
[08:17:52]:
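
`multiprocessing.Pool` is itself a context manager whose exit calls `terminate()`, so the per-host fan-out can also be written without a custom wrapper. The sketch below assumes a dictionary that maps host IPs to their data; `count_rows` stands in for `compute_neighbourhoods`, and both function names are hypothetical.

```python
import multiprocessing
from functools import partial

def count_rows(host_ip, frames):
    """Worker: defined at module level so that it can be pickled for the pool."""
    return host_ip, len(frames[host_ip])

def run_parallel(frames, processes=4):
    """Map the worker over all hosts; one result per host, in input order."""
    worker = partial(count_rows, frames=frames)
    # Leaving the with-block terminates the worker processes.
    with multiprocessing.Pool(processes=processes) as pool:
        return dict(pool.map(worker, list(frames)))
```

In a script, `run_parallel` would typically be called from inside an `if __name__ == '__main__':` block. Note that `partial` ships the whole `frames` dictionary to the workers, which is convenient but can be costly for large inputs.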

In [12]:
# print(type(dfs_with_neighbourhoods))
print(type(dfs_with_neighbourhoods))

<class 'list'>


In [13]:
# print(len(dfs_with_neighbourhoods))
print(len(dfs_with_neighbourhoods))

17


In [14]:
dfs_with_neighbourhoods[0].head()

Unnamed: 0,originated_ip,uid,connection.uid,connection.conn_state,connection.duration,connection.orig_bytes,connection.orig_ip_bytes,connection.orig_p,connection.orig_pkts,connection.proto,...,orig_orig_similar_ssh_host_key_count,orig_orig_similar_http_method_count,orig_orig_similar_http_status_code_count,orig_orig_similar_http_user_agent_count,orig_orig_similar_ssl_version_count,orig_orig_similar_ssl_cipher_count,orig_orig_similar_ssl_curve_count,orig_orig_similar_ssl_validation_status_count,orig_orig_similar_files_source_count,orig_orig_similar_file_md5_count
0,192.168.10.25,0x4,CZwVtlqQKE8nc0ITg,SF,0.085332,1244,5216,61489,52,tcp,...,0,0,0,0,0,0,0,0,0,0
1,192.168.10.25,0x2a,C5aJQ828iB8p7jwU8j,SF,0.580104,887,3067,60485,38,tcp,...,0,0,0,0,0,0,0,0,0,0
2,192.168.10.25,0x5e,CWq1uR2i0A21GEy28,SF,0.630689,1710,2374,60505,13,tcp,...,0,0,0,0,0,0,0,0,0,0
3,192.168.10.25,0x6b,CQqwzL2ng4IYRArQHj,SF,0.186263,1100,1424,60663,6,tcp,...,0,0,0,0,0,0,0,0,0,0
4,192.168.10.25,0x71,CtSFld1BCLZHAVqODh,SF,0.306211,647,1023,60672,7,tcp,...,0,0,0,0,0,0,0,0,0,0


### 3. Concat

In [15]:
def concat_dfs(df_neighbourhoods):
    # DataFrame.append was removed in pandas 2.0; pd.concat joins all frames in one call
    # (the per-host row index is kept, as it was with append)
    return pd.concat(df_neighbourhoods)

df_result = concat_dfs(dfs_with_neighbourhoods)

In [16]:
df_result

Unnamed: 0,originated_ip,uid,connection.uid,connection.conn_state,connection.duration,connection.orig_bytes,connection.orig_ip_bytes,connection.orig_p,connection.orig_pkts,connection.proto,...,orig_orig_similar_ssh_host_key_count,orig_orig_similar_http_method_count,orig_orig_similar_http_status_code_count,orig_orig_similar_http_user_agent_count,orig_orig_similar_ssl_version_count,orig_orig_similar_ssl_cipher_count,orig_orig_similar_ssl_curve_count,orig_orig_similar_ssl_validation_status_count,orig_orig_similar_files_source_count,orig_orig_similar_file_md5_count
0,192.168.10.25,0x4,CZwVtlqQKE8nc0ITg,SF,0.085332,1244,5216,61489,52,tcp,...,0,0,0,0,0,0,0,0,0,0
1,192.168.10.25,0x2a,C5aJQ828iB8p7jwU8j,SF,0.580104,887,3067,60485,38,tcp,...,0,0,0,0,0,0,0,0,0,0
2,192.168.10.25,0x5e,CWq1uR2i0A21GEy28,SF,0.630689,1710,2374,60505,13,tcp,...,0,0,0,0,0,0,0,0,0,0
3,192.168.10.25,0x6b,CQqwzL2ng4IYRArQHj,SF,0.186263,1100,1424,60663,6,tcp,...,0,0,0,0,0,0,0,0,0,0
4,192.168.10.25,0x71,CtSFld1BCLZHAVqODh,SF,0.306211,647,1023,60672,7,tcp,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1255,172.16.0.1,0x2f4e0b,CF27Qq1cD7qEwMY66e,SF,12.186034,2008,3212,52212,23,tcp,...,0,0,0,0,0,0,0,0,0,0
1256,172.16.0.1,0x2f4e0d,CnzbMg3yCWOBGZl377,SF,11.466854,2008,3212,52256,23,tcp,...,0,0,0,0,0,0,0,0,0,0
1257,172.16.0.1,0x2f4e13,CdUT4l1iM98KvsDTXc,SF,12.396892,2008,3276,52302,24,tcp,...,0,0,0,0,0,0,0,0,0,0
1258,172.16.0.1,0x2f4e17,CUxRMX2llByCXs4BT2,SF,11.535562,2008,3108,52322,21,tcp,...,0,0,0,0,0,0,0,0,0,0


In [17]:
df_result.to_csv('/home/sramkova/diploma_thesis_data/neighbourhood_test_ssh_guess_1_11_2021_checkpoint.csv', index=False, header=True)

# from datetime import datetime
# import pandas as pd
# df_result = pd.read_csv('/home/sramkova/diploma_thesis_data/neighbourhood_test_ssh_guess_20_10_2021_checkpoint_r.csv')
# df_result['connection.time'] = pd.to_datetime(df_result['connection.time'])

In [18]:
df_result

Unnamed: 0,originated_ip,uid,connection.uid,connection.conn_state,connection.duration,connection.orig_bytes,connection.orig_ip_bytes,connection.orig_p,connection.orig_pkts,connection.proto,...,orig_orig_similar_ssh_host_key_count,orig_orig_similar_http_method_count,orig_orig_similar_http_status_code_count,orig_orig_similar_http_user_agent_count,orig_orig_similar_ssl_version_count,orig_orig_similar_ssl_cipher_count,orig_orig_similar_ssl_curve_count,orig_orig_similar_ssl_validation_status_count,orig_orig_similar_files_source_count,orig_orig_similar_file_md5_count
0,192.168.10.25,0x4,CZwVtlqQKE8nc0ITg,SF,0.085332,1244,5216,61489,52,tcp,...,0,0,0,0,0,0,0,0,0,0
1,192.168.10.25,0x2a,C5aJQ828iB8p7jwU8j,SF,0.580104,887,3067,60485,38,tcp,...,0,0,0,0,0,0,0,0,0,0
2,192.168.10.25,0x5e,CWq1uR2i0A21GEy28,SF,0.630689,1710,2374,60505,13,tcp,...,0,0,0,0,0,0,0,0,0,0
3,192.168.10.25,0x6b,CQqwzL2ng4IYRArQHj,SF,0.186263,1100,1424,60663,6,tcp,...,0,0,0,0,0,0,0,0,0,0
4,192.168.10.25,0x71,CtSFld1BCLZHAVqODh,SF,0.306211,647,1023,60672,7,tcp,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1255,172.16.0.1,0x2f4e0b,CF27Qq1cD7qEwMY66e,SF,12.186034,2008,3212,52212,23,tcp,...,0,0,0,0,0,0,0,0,0,0
1256,172.16.0.1,0x2f4e0d,CnzbMg3yCWOBGZl377,SF,11.466854,2008,3212,52256,23,tcp,...,0,0,0,0,0,0,0,0,0,0
1257,172.16.0.1,0x2f4e13,CdUT4l1iM98KvsDTXc,SF,12.396892,2008,3276,52302,24,tcp,...,0,0,0,0,0,0,0,0,0,0
1258,172.16.0.1,0x2f4e17,CUxRMX2llByCXs4BT2,SF,11.535562,2008,3108,52322,21,tcp,...,0,0,0,0,0,0,0,0,0,0


## Assign attacker and victim labels

In [19]:
df_result['attacker_label'] = 'No'
df_result['victim_label'] = 'No'

In [38]:
# Red Team CIDR ranges:
ATTACKER_IPS = ["205.174.165.73/32", "205.174.165.80/32", "205.174.165.68/32", "172.16.0.1/32"]

VICTIM_IPS = ["192.168.10.50/32"]

# from netaddr import IPNetwork, IPAddress
import ipaddress

ips_cache = {} # optimization: (ip_address, networks) -> bool

def is_specified_ip(ip_address, networks=VICTIM_IPS):
    # networks defaults to VICTIM_IPS; pass ATTACKER_IPS to test for attacker addresses.
    # The cache key includes the networks, so results of the two checks cannot mix.
    key = (ip_address, tuple(networks))
    if key in ips_cache:
        return ips_cache[key]
    result = False
    try:
        parsed_ip = ipaddress.ip_address(ip_address)
        result = any(parsed_ip in ipaddress.ip_network(network) for network in networks)
    except (ValueError, TypeError):
        pass # value is not a parsable IP address
    ips_cache[key] = result
    return result

In [35]:
# assign labels to input data ('No': not from/to an attacker, 'Yes': originated from/responded to an attacker):
df_result.loc[df_result['responded_ip'].apply(lambda ip: is_specified_ip(ip, ATTACKER_IPS)), 'attacker_label'] = 'Yes'
df_result.loc[df_result['originated_ip'].apply(lambda ip: is_specified_ip(ip, ATTACKER_IPS)), 'attacker_label'] = 'Yes'
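
Both labels follow the same pattern: a connection is marked `'Yes'` if either endpoint lies in a given list of networks. The sketch below wraps this in one function with a separate cache per network list, so attacker and victim lookups stay independent. `make_matcher`, `label_by_ip` and the toy rows are assumptions made for this example.

```python
import ipaddress
from functools import lru_cache

import pandas as pd

def make_matcher(networks):
    """Build a cached membership test for one list of CIDR networks."""
    nets = tuple(ipaddress.ip_network(n) for n in networks)

    @lru_cache(maxsize=None)
    def matches(value):
        try:
            addr = ipaddress.ip_address(value)
        except ValueError:          # missing or malformed address
            return False
        return any(addr in net for net in nets)

    return matches

def label_by_ip(df, networks, columns=('originated_ip', 'responded_ip')):
    """'Yes' where any of the given columns falls inside the networks, else 'No'."""
    matches = make_matcher(networks)
    hit = pd.Series(False, index=df.index)
    for col in columns:
        hit = hit | df[col].apply(matches).astype(bool)
    return hit.map({True: 'Yes', False: 'No'})

toy = pd.DataFrame({
    'originated_ip': ['192.168.10.25', '172.16.0.1', '192.168.10.8'],
    'responded_ip': ['192.168.10.3', '192.168.10.50', '192.168.10.50'],
})
toy['attacker_label'] = label_by_ip(toy, ['172.16.0.1/32'])
toy['victim_label'] = label_by_ip(toy, ['192.168.10.50/32'])
```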

In [36]:
df_result

Unnamed: 0,originated_ip,uid,connection.uid,connection.conn_state,connection.duration,connection.orig_bytes,connection.orig_ip_bytes,connection.orig_p,connection.orig_pkts,connection.proto,...,orig_orig_similar_http_status_code_count,orig_orig_similar_http_user_agent_count,orig_orig_similar_ssl_version_count,orig_orig_similar_ssl_cipher_count,orig_orig_similar_ssl_curve_count,orig_orig_similar_ssl_validation_status_count,orig_orig_similar_files_source_count,orig_orig_similar_file_md5_count,attacker_label,victim_label
0,192.168.10.25,0x4,CZwVtlqQKE8nc0ITg,SF,0.085332,1244,5216,61489,52,tcp,...,0,0,0,0,0,0,0,0,No,No
1,192.168.10.25,0x2a,C5aJQ828iB8p7jwU8j,SF,0.580104,887,3067,60485,38,tcp,...,0,0,0,0,0,0,0,0,No,No
2,192.168.10.25,0x5e,CWq1uR2i0A21GEy28,SF,0.630689,1710,2374,60505,13,tcp,...,0,0,0,0,0,0,0,0,No,No
3,192.168.10.25,0x6b,CQqwzL2ng4IYRArQHj,SF,0.186263,1100,1424,60663,6,tcp,...,0,0,0,0,0,0,0,0,No,No
4,192.168.10.25,0x71,CtSFld1BCLZHAVqODh,SF,0.306211,647,1023,60672,7,tcp,...,0,0,0,0,0,0,0,0,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1255,172.16.0.1,0x2f4e0b,CF27Qq1cD7qEwMY66e,SF,12.186034,2008,3212,52212,23,tcp,...,0,0,0,0,0,0,0,0,Yes,No
1256,172.16.0.1,0x2f4e0d,CnzbMg3yCWOBGZl377,SF,11.466854,2008,3212,52256,23,tcp,...,0,0,0,0,0,0,0,0,Yes,No
1257,172.16.0.1,0x2f4e13,CdUT4l1iM98KvsDTXc,SF,12.396892,2008,3276,52302,24,tcp,...,0,0,0,0,0,0,0,0,Yes,No
1258,172.16.0.1,0x2f4e17,CUxRMX2llByCXs4BT2,SF,11.535562,2008,3108,52322,21,tcp,...,0,0,0,0,0,0,0,0,Yes,No


In [37]:
df_result['attacker_label'].value_counts()

No     17228
Yes     1374
Name: attacker_label, dtype: int64

In [39]:
df_result.loc[df_result['responded_ip'].apply(is_specified_ip),'victim_label'] = 'Yes'
df_result.loc[df_result['originated_ip'].apply(is_specified_ip),'victim_label'] = 'Yes'

In [40]:
df_result['victim_label'].value_counts()

No     16677
Yes     1925
Name: victim_label, dtype: int64

## Concat to one DF and write to file

In [41]:
print(len(df_result))

18602


In [42]:
df_result.to_csv('/home/sramkova/diploma_thesis_data/neighbourhood_both_days_ssh_guess4.csv', index=False, header=True)

In [43]:
for col in df_result.columns:
    print(col)

originated_ip
uid
connection.uid
connection.conn_state
connection.duration
connection.orig_bytes
connection.orig_ip_bytes
connection.orig_p
connection.orig_pkts
connection.proto
connection.resp_bytes
connection.resp_ip_bytes
connection.resp_p
connection.resp_pkts
connection.service
connection.ts
responded_ip
dns_count
ssh_count
http_count
ssl_count
files_count
dns_qtype
dns_rcode
ssh_auth_attempts
ssh_host_key
http_method
http_status_code
http_user_agent
ssl_version
ssl_cipher
ssl_curve
ssl_validation_status
files_source
file_md5
dns_dicts
ssh_dicts
http_dicts
ssl_dicts
files_dicts
connection.time
orig_orig_total
orig_orig_proto_tcp_count
orig_orig_proto_udp_count
orig_orig_proto_icmp_count
orig_orig_connection.protocol_mode
orig_orig_connection.service_mode
orig_orig_connection.conn_state_mode
orig_orig_connection.time_mean
orig_orig_connection.duration_mean
orig_orig_connection.orig_bytes_mean
orig_orig_connection.orig_pkts_mean
orig_orig_connection.resp_bytes_mean
orig_orig_connecti