# Flash Evaluation on the DARPA OpTC Dataset

This notebook evaluates Flash on the DARPA OpTC dataset. OpTC is a node-level dataset, so Flash runs in its node-level setting. Because the dataset carries node attributes, Flash can run in a decoupled manner: GNN embeddings are computed offline and a downstream classifier consumes them. Flash first builds word2vec embeddings that serve as the input feature vectors for the GNN. The resulting GNN embeddings are saved to a datastore and later combined with the word2vec features as input to the downstream model, which produces the detection results.

## Accessing the Dataset:
- The OpTC dataset can be accessed via this link: [OpTC Dataset](https://drive.google.com/drive/u/0/folders/1n3kkS3KR31KUegn42yk3-e6JkZvf0Caa).
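- The evaluation files are meant to be downloaded by the script. In this copy the `gdown` calls in `prepare_test_set` are commented out, so either enable them or place the files next to the notebook manually.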
- While we provide pre-trained weights, you also have the option to download benign data files for training the models from the ground up.

## Data Parsing and Execution:
- The script parses the downloaded data files automatically.
- To produce the evaluation results, run all cells in this notebook.

## Model Training and Execution Options:
- By default, the notebook utilizes pre-trained model weights.
- It also offers settings to independently train the Graph Neural Network (GNN), word2vec, and XGBoost models.
- These independently trained models can then be deployed for an evaluation of the system.

Following these guidelines will ensure a thorough and effective analysis of the OpTC dataset using Flash.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import torch
from torch_geometric.data import Data
import os
import torch.nn.functional as F
import pickle
import json
import warnings
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
warnings.filterwarnings('ignore')
from torch_geometric.loader import NeighborLoader

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

import subprocess

if torch.cuda.is_available():
    # Query free memory per GPU and select the device with the most headroom.
    gpu_mem = {int(x.split(',')[0]): int(x.split(',')[1]) for x in subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        encoding='utf-8').strip().split('\n')}
    best_gpu = max(gpu_mem.items(), key=lambda x: x[1])[0]
    device = torch.device(f'cuda:{best_gpu}')
else:
    # nvidia-smi may be missing on CPU-only machines, so skip the query entirely.
    device = torch.device('cpu')
print(device)

%matplotlib inline

cuda:0


In [2]:
# gnn_weights = "trained_weights/optc/gnn_temp.pth"
gnn_weights = "gnn_temp.pth"
# xgboost_weights = "trained_weights/optc/xgb.pkl"
xgboost_weights = "xgb.pkl"
word2vec_weights = 'w2v_optc.model'
create_store = True
gnnTrain = True
xgbTrain = True
w2vTrain = True
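The cell above sets every training flag to `True`, so this run retrains all three models and rebuilds the embedding store. A minimal sketch of the settings for reusing pre-trained artifacts instead is shown below. It assumes the weight files exist at the paths that are commented out above and that `emb_store.json` has already been created; neither is verified here.

```python
# Hypothetical configuration for a run that reuses pre-trained artifacts.
# The "trained_weights/optc/..." paths are the ones commented out in the cell above;
# whether those files exist locally is an assumption.
gnn_weights = "trained_weights/optc/gnn_temp.pth"
xgboost_weights = "trained_weights/optc/xgb.pkl"
word2vec_weights = "w2v_optc.model"

create_store = False  # assumes emb_store.json is already on disk
gnnTrain = False      # load GNN weights instead of training
xgbTrain = False      # load the pickled XGBoost model instead of fitting
w2vTrain = False      # load the saved word2vec model instead of training
```

With all four flags set to `False`, the later cells only load models and the stored embeddings.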

In [3]:
from pprint import pprint
import gzip
from sklearn.manifold import TSNE
import json
import copy
import os
import xgboost as xgb

import gensim
from gensim.models import Word2Vec
from multiprocessing import Pool
from itertools import compress
from tqdm import tqdm
import time

import multiprocessing

In [4]:
# gensim, Word2Vec, Pool, compress, tqdm, and time are already imported in the previous cell.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from collections import Counter

In [5]:
import gzip

def extract_logs(filepath, hostid):
    """Append every record that mentions the given host to that host's log file."""
    search_pattern = f'SysClient{hostid}'
    output_filename = f'SysClient{hostid}.systemia.com.txt'

    # Append mode: several archives can contribute to the same per-host file.
    with gzip.open(filepath, 'rt', encoding='utf-8') as fin, \
         open(output_filename, 'a', encoding='utf-8', newline='') as out:
        for line in fin:
            if search_pattern in line:
                out.write(line)
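The filtering step can be tried on a toy archive before running it on the multi-gigabyte OpTC files. The sketch below is a standalone restatement of the same idea, not the notebook's function: `filter_host_lines`, the temporary paths, and the three JSON records are made up for illustration.

```python
import gzip
import os
import tempfile

def filter_host_lines(gz_path, host_id, out_path):
    """Append every line mentioning SysClient<host_id> to out_path; return the count."""
    pattern = f"SysClient{host_id}"
    kept = 0
    with gzip.open(gz_path, "rt", encoding="utf-8") as fin, \
         open(out_path, "a", encoding="utf-8") as fout:
        for line in fin:
            if pattern in line:
                fout.write(line)
                kept += 1
    return kept

# Toy archive with made-up records (not real OpTC data).
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "toy.json.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write('{"hostname": "SysClient0201.systemia.com", "action": "READ"}\n')
    f.write('{"hostname": "SysClient0305.systemia.com", "action": "WRITE"}\n')
    f.write('{"hostname": "SysClient0201.systemia.com", "action": "OPEN"}\n')

out_path = os.path.join(workdir, "SysClient0201.systemia.com.txt")
kept = filter_host_lines(gz_path, "0201", out_path)
print(kept)  # 2
```

Because the output file is opened in append mode, running the extraction twice duplicates every record, which is why stale per-host files should be removed before a fresh extraction.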

In [6]:
# import gdown
from tqdm import tqdm
    
def prepare_test_set():
    # urls = [
    #     "https://drive.google.com/file/d/1HFSyvmgH0jvdnnnTdKfWRjZYOrLWoIkv/view?usp=drive_link",
    #     "https://drive.google.com/file/d/1pJLxJsDV8sngiedbfVajMetczIgM3PQd/view?usp=drive_link",
    #     "https://drive.google.com/file/d/1fRQqc68r8-z5BL7H_eAKIDOeHp7okDuM/view?usp=drive_link",
    #     "https://drive.google.com/file/d/1VfyGr8wfSe8LBIHBWuYBlU8c2CyEgO5C/view?usp=drive_link",
    #     "https://drive.google.com/file/d/10N9ZPolq_L8HivBqzf_jFKbwjSxddsZp/view?usp=drive_link",
    #     "https://drive.google.com/file/d/1xIr8gw-4zc8ESjUpYtrFsbOwhPGUSd15/view?usp=drive_link",
    #     "https://drive.google.com/file/d/1PvlCp2oQaxEBEFGSQWfcFVj19zLOe7yH/view?usp=drive_link"
    # ]

    # for url in urls:
    #     gdown.download(url, quiet=False, use_cookies=False, fuzzy=True)

    log_files = [
        ("AIA-201-225.ecar-2019-12-08T11-05-10.046.json.gz", "0201"),
        ("AIA-201-225.ecar-last.json.gz", "0201"),
        ("AIA-501-525.ecar-2019-11-17T04-01-58.625.json.gz", "0501"),
        ("AIA-501-525.ecar-last.json.gz", "0501"),
        ("AIA-51-75.ecar-last.json.gz", "0051")
    ]
    
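    # Remove stale per-host outputs first, because extract_logs appends to existing files.
    # The names must match the files extract_logs writes (SysClient<id>.systemia.com.txt).
    for code in ("0201", "0501", "0051"):
        stale = f"SysClient{code}.systemia.com.txt"
        if os.path.exists(stale):
            os.remove(stale)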

    for file, code in tqdm(log_files, desc="Extracting logs", unit="file"):
        extract_logs(file, code)

# prepare_test_set()

In [7]:
def is_valid_entry(entry):
    valid_objects = {'PROCESS', 'FILE', 'FLOW', 'MODULE'}
    invalid_actions = {'START', 'TERMINATE'}

    object_valid = entry['object'] in valid_objects
    action_valid = entry['action'] not in invalid_actions
    actor_object_different = entry['actorID'] != entry['objectID']

    return object_valid and action_valid and actor_object_different

def Traversal_Rules(data):
    filtered_data = {}

    for entry in data:
        if is_valid_entry(entry):
            key = (
                entry['action'], 
                entry['actorID'], 
                entry['objectID'], 
                entry['object'], 
                entry['pid'], 
                entry['ppid']
            )
            filtered_data[key] = entry

    return list(filtered_data.values())
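The effect of the filter and the key-based deduplication is easiest to see on a handful of toy events. The sketch below restates the two functions above in compact form; `dedupe_events` and the `seq` field are hypothetical names added only to make the surviving events visible.

```python
def dedupe_events(events):
    """Keep the last event seen for each (action, actorID, objectID, object, pid, ppid) key."""
    valid_objects = {"PROCESS", "FILE", "FLOW", "MODULE"}
    skipped_actions = {"START", "TERMINATE"}
    latest = {}
    for e in events:
        if (e["object"] in valid_objects
                and e["action"] not in skipped_actions
                and e["actorID"] != e["objectID"]):
            key = (e["action"], e["actorID"], e["objectID"], e["object"], e["pid"], e["ppid"])
            latest[key] = e  # a later event with the same key replaces the earlier one
    return list(latest.values())

base = {"actorID": "a1", "pid": 10, "ppid": 1}
events = [
    {**base, "action": "READ", "objectID": "f1", "object": "FILE", "seq": 0},
    {**base, "action": "READ", "objectID": "f1", "object": "FILE", "seq": 1},      # same key as seq 0
    {**base, "action": "START", "objectID": "p2", "object": "PROCESS", "seq": 2},  # skipped action
    {**base, "action": "OPEN", "objectID": "a1", "object": "PROCESS", "seq": 3},   # actor == object
    {**base, "action": "LOAD", "objectID": "m1", "object": "MODULE", "seq": 4},
]
kept = dedupe_events(events)
print([e["seq"] for e in kept])  # [1, 4]
```

Only the most recent copy of a repeated event survives, so the graph keeps one edge per distinct interaction rather than one per log record.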

In [8]:
def Sentence_Construction(entry):
    action = entry["action"]
    properties = entry['properties']
    object_type = entry['object']

    format_strings = {
        'PROCESS': "{parent_image_path} {action} {image_path} {command_line}",
        'FILE': "{image_path} {action} {file_path}",
        'FLOW': "{image_path} {action} {src_ip} {src_port} {dest_ip} {dest_port} {direction}",
        'MODULE': "{image_path} {action} {module_path}"
    }

    default_format = "{image_path} {action} {module_path}"

    try:
        format_str = format_strings.get(object_type, default_format)
        phrase = format_str.format(action=action, **properties)
    except KeyError:
        phrase = ''

    return phrase.split(' ')
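As a quick illustration of the templates, the sketch below builds phrases for two toy events. `build_phrase` is a compact restatement of `Sentence_Construction`, and the property values are invented; real OpTC records carry full Windows paths.

```python
FORMATS = {
    "PROCESS": "{parent_image_path} {action} {image_path} {command_line}",
    "FILE": "{image_path} {action} {file_path}",
    "FLOW": "{image_path} {action} {src_ip} {src_port} {dest_ip} {dest_port} {direction}",
    "MODULE": "{image_path} {action} {module_path}",
}

def build_phrase(entry):
    """Fill the template for the event's object type and split it into tokens."""
    fmt = FORMATS.get(entry["object"], FORMATS["MODULE"])
    try:
        phrase = fmt.format(action=entry["action"], **entry["properties"])
    except KeyError:
        phrase = ""  # a required property is missing
    return phrase.split(" ")

file_event = {
    "object": "FILE",
    "action": "READ",
    "properties": {"image_path": "explorer.exe", "file_path": "notes.txt"},
}
print(build_phrase(file_event))  # ['explorer.exe', 'READ', 'notes.txt']

incomplete_flow = {
    "object": "FLOW",
    "action": "MESSAGE",
    "properties": {"image_path": "svchost.exe"},  # no IPs or ports
}
print(build_phrase(incomplete_flow))  # ['']
```

Two details follow from the `split(' ')` call: an event with missing properties yields `['']` rather than an empty list, and any path or command line that contains spaces is broken into several tokens.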

In [9]:
import pandas as pd
import json

def Extract_Semantic_Info(event):
    object_type = event['object']
    properties = event['properties']

    label_mapping = {
        "PROCESS": ('parent_image_path', 'image_path'),
        "FILE": ('image_path', 'file_path'),
        "MODULE": ('image_path', 'module_path'),
        "FLOW": ('image_path', 'dest_ip', 'dest_port')
    }

    label_keys = label_mapping.get(object_type, None)
    if label_keys:
        labels = [properties.get(key) for key in label_keys]
        if all(labels):
            event["actorname"], event["objectname"] = labels[0], ' '.join(labels[1:])
            return event
    return None

def transform(text):
    labeled_data = [event for event in (Extract_Semantic_Info(x) for x in text) if event]
    data = Traversal_Rules(labeled_data)

    # Sentence_Construction always returns a non-empty list ([''] on failure),
    # so every event receives a phrase; build each one exactly once.
    for datum in data:
        datum['phrase'] = Sentence_Construction(datum)

    df = pd.DataFrame(data)
    df['timestamp'] = pd.to_datetime(df['timestamp'].str[:-6], infer_datetime_format=True)
    df.sort_values(by='timestamp', inplace=True)

    return df

def load_data(file_path):
    with open(file_path, 'r') as file:
        content = [json.loads(line) for line in file]
    
    return Featurize(transform(content))

In [10]:
import numpy as np

def Featurize(df):
    dummies = {'PROCESS': 0, 'FLOW': 1, 'FILE': 2, 'MODULE': 3}

    nodes = {}
    labels = {}
    lblmap = {}
    neimap = {}
    edges = []

    for index, row in df.iterrows():
        actor_id, object_id = row['actorID'], row["objectID"]
        object_type = row['object']

        nodes.setdefault(actor_id, []).extend(row['phrase'])
        nodes.setdefault(object_id, []).extend(row['phrase'])

        labels[actor_id] = dummies.get('PROCESS', -1)
        labels[object_id] = dummies.get(object_type, -1)

        lblmap[actor_id] = row['actorname']
        lblmap[object_id] = row['objectname']

        neimap.setdefault(actor_id, set()).add(row['objectname'])
        neimap.setdefault(object_id, set()).add(row['actorname'])

        edge_type = row['properties']['direction'] if object_type == 'FLOW' else row['action']
        edges.append((actor_id, object_id, edge_type))

    features, feat_labels, edge_index = [], [], [[], []]
    node_index = {}

    for node, phrases in nodes.items():
        if not (len(phrases) == 1 and phrases[0] == 'DELETE'):
            features.append(infer(phrases))
            feat_labels.append(labels[node])
            node_index[node] = len(features) - 1

    for src, dst, _ in edges:
        edge_index[0].append(node_index[src])
        edge_index[1].append(node_index[dst])

    mapp = list(node_index.keys())

    return features, np.array(feat_labels), edge_index, mapp, lblmap, neimap

In [11]:
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        model.save('w2v_optc.model')
        self.epoch += 1

In [12]:
class EpochLogger(CallbackAny2Vec):

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print("Epoch #{} start".format(self.epoch))

    def on_epoch_end(self, model):
        print("Epoch #{} end".format(self.epoch))
        self.epoch += 1

In [13]:
import json
from gensim.models import Word2Vec

def prepare_sentences(df):
    nodes = {}
    for index, row in df.iterrows():
        for key in ['actorID', 'objectID']:
            node_id = row[key]
            nodes.setdefault(node_id, []).extend(row['phrase'])
    return list(nodes.values())

def train_word2vec_model(train_file_path):
    with open(train_file_path, 'r') as file:
        content = [json.loads(line) for line in file]

    events = transform(content)
    phrases = prepare_sentences(events)

    logger = EpochLogger()
    saver = EpochSaver()
    # word2vec = Word2Vec(sentences=phrases, vector_size=20, window=5, min_count=1, workers=8, epochs=300, callbacks=[saver, logger])
    word2vec = Word2Vec(sentences=phrases, vector_size=10, window=5, min_count=1, workers=8, epochs=300, callbacks=[saver, logger])

    return word2vec

In [14]:
import math
import torch
import numpy as np
from gensim.models import Word2Vec

class PositionalEncoder:

    def __init__(self, d_model, max_len=100000):
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        self.pe = torch.zeros(max_len, d_model)
        self.pe[:, 0::2] = torch.sin(position * div_term)
        self.pe[:, 1::2] = torch.cos(position * div_term)

    def embed(self, x):
        return x + self.pe[:x.size(0)]


def infer(document):
    word_embeddings = [w2vmodel.wv[word] for word in document if word in  w2vmodel.wv]
    
    if not word_embeddings:
        return np.zeros(10)

    output_embedding = torch.tensor(word_embeddings, dtype=torch.float)
    if len(document) < 100000:
        output_embedding = encoder.embed(output_embedding)

    output_embedding = output_embedding.detach().cpu().numpy()
    return np.mean(output_embedding, axis=0)

# encoder = PositionalEncoder(20)
encoder = PositionalEncoder(10)
if w2vTrain:
    file_path = '../SysClient0051.systemia.com.txt'
    # Retrain word2vec
    w2vmodel = train_word2vec_model(file_path)
else:
    w2vmodel = Word2Vec.load(word2vec_weights)


Epoch #0 start
Epoch #0 end
Epoch #1 start
Epoch #1 end
[output for epochs 2 through 298 omitted: one "start" and one "end" line per epoch]
Epoch #299 start
Epoch #299 end
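The `infer` function adds a sinusoidal positional encoding to each token vector and then averages over tokens. The sketch below reproduces that arithmetic in plain Python on toy 4-dimensional vectors. The helper names `positional_encoding` and `pool` and the vectors `a` and `b` are illustrative assumptions; the notebook itself uses torch tensors and 10-dimensional word2vec vectors.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position: sin on even indices, cos on odd ones."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = position * math.exp(i * (-math.log(10000.0) / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

def pool(token_vectors):
    """Add each token's positional encoding, then average over tokens."""
    d_model = len(token_vectors[0])
    shifted = [
        [v + p for v, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]
    return [sum(column) / len(shifted) for column in zip(*shifted)]

# Toy 4-dimensional "word vectors" (made-up numbers, not word2vec output).
a = [1.0, 0.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0, 0.0]

forward = pool([a, b])
backward = pool([b, a])
# Averaging makes the pooled vector independent of token order ...
print(all(math.isclose(x, y) for x, y in zip(forward, backward)))  # True
# ... but dependent on how many tokens were pooled.
print(pool([a]) == pool([a, a]))  # False
```

Since the mean of the sums equals the mean of the vectors plus the mean of the encodings, the positional term in the pooled feature depends on the number of in-vocabulary tokens rather than on their order.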


In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        # self.conv1 = SAGEConv(20, 32, normalize=True)
        self.conv1 = SAGEConv(10, 32, normalize=True)
        self.conv2 = SAGEConv(32, 20, normalize=True)
        self.linear = nn.Linear(in_features=20, out_features=4)

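    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # Returns class probabilities. Note that CrossEntropyLoss (used for
        # training below) applies log-softmax itself, so the loss is computed
        # on outputs that are already normalized here.
        x = self.encode(x, edge_index)
        x = self.linear(x)
        return F.softmax(x, dim=1)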
    
    def encode(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

In [16]:
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

model = GCN().to(device)
if not gnnTrain:
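    # map_location lets weights saved on one device load on another (e.g. GPU to CPU).
    model.load_state_dict(torch.load(gnn_weights, map_location=device))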
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

In [17]:
from sklearn.utils import class_weight


if gnnTrain or create_store:
    file_path = '../SysClient0051.systemia.com.txt'
    nodes,labels,edges,mapp,lbl,nemap = load_data(file_path)

    l = np.array(labels)
    
    ngram_class_weights = [
        0.5000,  
        1.3549,  
        0.6373,   
        1.5000  
    ]
    class_weights = torch.tensor(ngram_class_weights,dtype=torch.float).to(device)
    
    criterion = CrossEntropyLoss(weight=class_weights,reduction='mean')

    graph = Data(x=torch.tensor(nodes,dtype=torch.float).to(device),y=torch.tensor(labels,dtype=torch.long).to(device), edge_index=torch.tensor(edges,dtype=torch.long).to(device))

In [18]:
from torch_geometric.loader import NeighborLoader

def train_model(batch):
    model.train()
    optimizer.zero_grad()
    predictions = model(batch.x, batch.edge_index)
    loss = criterion(predictions, batch.y)
    loss.backward()
    optimizer.step()
    return loss.item(), batch.x.size(0)

def evaluate_model(batch):
    model.eval()
    with torch.no_grad():
        predictions = model(batch.x, batch.edge_index)
        pred_labels = predictions.argmax(dim=1)
        correct_predictions = int((pred_labels == batch.y).sum())
    return correct_predictions

losses = []

if gnnTrain:
    loader = NeighborLoader(graph, num_neighbors=[-1, -1], batch_size=5000)
    

    for epoch in range(100):
        total_loss = total_correct = total_nodes = 0

        for batch in loader:
            # Use a distinct name so the global `nodes` feature list is not overwritten.
            loss, batch_nodes = train_model(batch)
            total_loss += loss
            total_nodes += batch_nodes
            total_correct += evaluate_model(batch)

        average_loss = total_loss / total_nodes
        accuracy = total_correct / total_nodes

        losses.append(average_loss)
        print(f"Epoch #{epoch}. Training Loss: {average_loss:.5f}, Accuracy: {accuracy:.5f}")
        torch.save(model.state_dict(), gnn_weights)


Epoch #0. Training Loss: 0.00023, Accuracy: 0.82485
Epoch #1. Training Loss: 0.00018, Accuracy: 0.87195
Epoch #2. Training Loss: 0.00017, Accuracy: 0.87537


Epoch #3. Training Loss: 0.00016, Accuracy: 0.88153
Epoch #4. Training Loss: 0.00016, Accuracy: 0.88316
Epoch #5. Training Loss: 0.00016, Accuracy: 0.90279


Epoch #6. Training Loss: 0.00016, Accuracy: 0.92697
Epoch #7. Training Loss: 0.00015, Accuracy: 0.95542
Epoch #8. Training Loss: 0.00015, Accuracy: 0.95895


Epoch #9. Training Loss: 0.00015, Accuracy: 0.95891
Epoch #10. Training Loss: 0.00015, Accuracy: 0.94947
Epoch #11. Training Loss: 0.00015, Accuracy: 0.95891


Epoch #12. Training Loss: 0.00015, Accuracy: 0.96759
Epoch #13. Training Loss: 0.00015, Accuracy: 0.97118
Epoch #14. Training Loss: 0.00015, Accuracy: 0.97073


Epoch #15. Training Loss: 0.00015, Accuracy: 0.97412
Epoch #16. Training Loss: 0.00015, Accuracy: 0.97541
Epoch #17. Training Loss: 0.00015, Accuracy: 0.97369


Epoch #18. Training Loss: 0.00015, Accuracy: 0.97110
Epoch #19. Training Loss: 0.00015, Accuracy: 0.97473
Epoch #20. Training Loss: 0.00015, Accuracy: 0.97814


Epoch #21. Training Loss: 0.00015, Accuracy: 0.97745
Epoch #22. Training Loss: 0.00015, Accuracy: 0.97290
Epoch #23. Training Loss: 0.00015, Accuracy: 0.97842


Epoch #24. Training Loss: 0.00015, Accuracy: 0.97454
Epoch #25. Training Loss: 0.00015, Accuracy: 0.97703
Epoch #26. Training Loss: 0.00015, Accuracy: 0.96976


Epoch #27. Training Loss: 0.00015, Accuracy: 0.96819
Epoch #28. Training Loss: 0.00015, Accuracy: 0.97546
Epoch #29. Training Loss: 0.00015, Accuracy: 0.97665


Epoch #30. Training Loss: 0.00015, Accuracy: 0.97973
Epoch #31. Training Loss: 0.00015, Accuracy: 0.97978
Epoch #32. Training Loss: 0.00015, Accuracy: 0.97973


Epoch #33. Training Loss: 0.00015, Accuracy: 0.97621
Epoch #34. Training Loss: 0.00015, Accuracy: 0.97465
Epoch #35. Training Loss: 0.00015, Accuracy: 0.97753


Epoch #36. Training Loss: 0.00015, Accuracy: 0.97731
Epoch #37. Training Loss: 0.00015, Accuracy: 0.97837
Epoch #38. Training Loss: 0.00015, Accuracy: 0.97180


Epoch #39. Training Loss: 0.00015, Accuracy: 0.97526
Epoch #40. Training Loss: 0.00015, Accuracy: 0.97212
Epoch #41. Training Loss: 0.00015, Accuracy: 0.96928


Epoch #42. Training Loss: 0.00015, Accuracy: 0.97574
Epoch #43. Training Loss: 0.00015, Accuracy: 0.96846
Epoch #44. Training Loss: 0.00015, Accuracy: 0.96820


Epoch #45. Training Loss: 0.00015, Accuracy: 0.96817
Epoch #46. Training Loss: 0.00015, Accuracy: 0.97159
Epoch #47. Training Loss: 0.00015, Accuracy: 0.97330


Epoch #48. Training Loss: 0.00015, Accuracy: 0.97307
Epoch #49. Training Loss: 0.00015, Accuracy: 0.97098
Epoch #50. Training Loss: 0.00015, Accuracy: 0.97035


Epoch #51. Training Loss: 0.00015, Accuracy: 0.97122
Epoch #52. Training Loss: 0.00015, Accuracy: 0.97487
Epoch #53. Training Loss: 0.00015, Accuracy: 0.97442


Epoch #54. Training Loss: 0.00015, Accuracy: 0.97561
Epoch #55. Training Loss: 0.00015, Accuracy: 0.97824
Epoch #56. Training Loss: 0.00015, Accuracy: 0.97970


Epoch #57. Training Loss: 0.00015, Accuracy: 0.98144
Epoch #58. Training Loss: 0.00015, Accuracy: 0.98163
Epoch #59. Training Loss: 0.00015, Accuracy: 0.98123


Epoch #60. Training Loss: 0.00015, Accuracy: 0.97940
Epoch #61. Training Loss: 0.00015, Accuracy: 0.97481
Epoch #62. Training Loss: 0.00015, Accuracy: 0.98114


Epoch #63. Training Loss: 0.00015, Accuracy: 0.97918
Epoch #64. Training Loss: 0.00015, Accuracy: 0.97957
Epoch #65. Training Loss: 0.00015, Accuracy: 0.97994


Epoch #66. Training Loss: 0.00015, Accuracy: 0.97847
Epoch #67. Training Loss: 0.00015, Accuracy: 0.97796
Epoch #68. Training Loss: 0.00015, Accuracy: 0.97587


Epoch #69. Training Loss: 0.00015, Accuracy: 0.97631
Epoch #70. Training Loss: 0.00015, Accuracy: 0.97803
Epoch #71. Training Loss: 0.00015, Accuracy: 0.97901


Epoch #72. Training Loss: 0.00015, Accuracy: 0.98114
Epoch #73. Training Loss: 0.00015, Accuracy: 0.98062
Epoch #74. Training Loss: 0.00015, Accuracy: 0.98285


Epoch #75. Training Loss: 0.00015, Accuracy: 0.98114
Epoch #76. Training Loss: 0.00015, Accuracy: 0.97868
Epoch #77. Training Loss: 0.00015, Accuracy: 0.97811


Epoch #78. Training Loss: 0.00015, Accuracy: 0.98085
Epoch #79. Training Loss: 0.00015, Accuracy: 0.97881
Epoch #80. Training Loss: 0.00015, Accuracy: 0.98144


Epoch #81. Training Loss: 0.00015, Accuracy: 0.98121
Epoch #82. Training Loss: 0.00015, Accuracy: 0.98096
Epoch #83. Training Loss: 0.00015, Accuracy: 0.97753


Epoch #84. Training Loss: 0.00015, Accuracy: 0.97397
Epoch #85. Training Loss: 0.00015, Accuracy: 0.97027
Epoch #86. Training Loss: 0.00015, Accuracy: 0.96674


Epoch #87. Training Loss: 0.00015, Accuracy: 0.97674
Epoch #88. Training Loss: 0.00015, Accuracy: 0.97821
Epoch #89. Training Loss: 0.00015, Accuracy: 0.97561


Epoch #90. Training Loss: 0.00015, Accuracy: 0.97907
Epoch #91. Training Loss: 0.00015, Accuracy: 0.97692
Epoch #92. Training Loss: 0.00015, Accuracy: 0.97708


Epoch #93. Training Loss: 0.00015, Accuracy: 0.97893
Epoch #94. Training Loss: 0.00015, Accuracy: 0.98064
Epoch #95. Training Loss: 0.00015, Accuracy: 0.97888


Epoch #96. Training Loss: 0.00015, Accuracy: 0.97983
Epoch #97. Training Loss: 0.00015, Accuracy: 0.97793
Epoch #98. Training Loss: 0.00015, Accuracy: 0.98158


Epoch #99. Training Loss: 0.00015, Accuracy: 0.97893


In [19]:
if create_store:
    model.eval()
    out = model.encode(graph.x, graph.edge_index).tolist()
    
    gnn_map = {}
    
    for i in range(len(mapp)):
        gnn_map[lbl[mapp[i]]] = (out[i],list(nemap[mapp[i]]))
    
    # with open("data_files/emb_store.json", "w") as file:
    #     json.dump(gnn_map, file)
        
    with open("emb_store.json", "w") as file:
        json.dump(gnn_map, file)
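The store written above maps a node name to a pair of GNN embedding and neighbor names. The sketch below mimics that layout with toy values to show what comes back after a JSON round trip. The node IDs, names, and vectors are invented, and `sorted(...)` replaces the notebook's `list(...)` only to make the printed output deterministic.

```python
import json

# Toy values: node IDs, names, vectors, and neighbors are made up for illustration.
embeddings = {"id-1": [0.1, 0.2], "id-2": [0.3, 0.4], "id-3": [0.5, 0.6]}
names = {"id-1": "explorer.exe", "id-2": "notes.txt", "id-3": "explorer.exe"}
neighbors = {"id-1": {"notes.txt"}, "id-2": {"explorer.exe"}, "id-3": {"cmd.exe"}}

store = {}
for node_id, emb in embeddings.items():
    # Keyed by name, so id-3 overwrites id-1 (both are "explorer.exe").
    store[names[node_id]] = (emb, sorted(neighbors[node_id]))

restored = json.loads(json.dumps(store))
print(len(restored))             # 2
print(restored["explorer.exe"])  # [[0.5, 0.6], ['cmd.exe']]
```

Two consequences are worth keeping in mind when reading the store back: tuples and sets come back as plain lists, and nodes that share a name share a single entry, with the last one written taking precedence.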

In [20]:
with open("emb_store.json", "r") as file:
    gnn_map = json.load(file)

In [21]:
import numpy as np

def load_features(filename=None, similarity=1):
    nodes, y_train, edges, mapp, lbl, nemap = load_data(filename)
    zero_vector = np.zeros(20)

    X_train = []
    for idx, map_item in enumerate(mapp):
        label = lbl[map_item]
        node_feature = nodes[idx]

        if label in gnn_map:
            emb, stored_set = gnn_map[label]
            current_set = nemap[map_item]
            jaccard_similarity = len(current_set.intersection(stored_set)) / len(current_set.union(stored_set))

            feature_vector = emb if jaccard_similarity >= similarity else zero_vector
        else:
            feature_vector = zero_vector

        X_train.append(np.hstack((node_feature, feature_vector)))

    return np.array(X_train), y_train, edges, mapp
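`load_features` reuses a stored GNN embedding only when the node's current neighbor names are similar enough to the stored ones, and pads with zeros otherwise. A minimal plain-Python sketch of that gating follows. `jaccard`, `gated_feature`, and the toy vectors are illustrative assumptions; `emb_dim=2` is used here for readability, whereas the notebook pads with 20 zeros.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gated_feature(w2v_vec, stored, current_neighbors, threshold=1.0, emb_dim=20):
    """Concatenate the word2vec feature with the stored embedding, or with zeros."""
    zero = [0.0] * emb_dim
    if stored is None:  # name not present in the embedding store
        return list(w2v_vec) + zero
    emb, stored_neighbors = stored
    similar = jaccard(current_neighbors, stored_neighbors) >= threshold
    return list(w2v_vec) + (list(emb) if similar else zero)

w2v_vec = [0.5, -0.5]
stored = ([1.0, 2.0], ["notes.txt", "report.docx"])

same = gated_feature(w2v_vec, stored, {"notes.txt", "report.docx"}, emb_dim=2)
changed = gated_feature(w2v_vec, stored, {"notes.txt", "payload.bin"}, emb_dim=2)
unseen = gated_feature(w2v_vec, None, {"notes.txt"}, emb_dim=2)
print(same)     # [0.5, -0.5, 1.0, 2.0]
print(changed)  # [0.5, -0.5, 0.0, 0.0]
print(unseen)   # [0.5, -0.5, 0.0, 0.0]
```

With the default `similarity=1`, the stored embedding is used only when the neighbor sets match exactly; lowering the threshold would accept partially changed neighborhoods.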

In [22]:
from sklearn.metrics import accuracy_score
from collections import Counter
import xgboost as xgb

if xgbTrain:
    file_path = '../SysClient0051.systemia.com.txt'
    x,y,_,_ = load_features(file_path)
    

    xgb_cl = xgb.XGBClassifier()

    xgb_cl.fit(x,y)
    pickle.dump(xgb_cl, open(xgboost_weights, "wb"))

    preds = xgb_cl.predict(x)
    print(accuracy_score(y, preds))

0.9930428981570828


In [23]:
def load_pkl(fname):
    with open(fname, 'rb') as f:
        obj = pickle.load(f)
    return obj

In [24]:
def validate(file_path):
    x,y,_,_ = load_features(file_path)
    xgb_cl = load_pkl(xgboost_weights)

    pred = xgb_cl.predict(x)
    proba = xgb_cl.predict_proba(x)

    # Use a distinct name so the built-in `sorted` is not shadowed.
    sorted_proba = np.sort(proba, axis=1)
    conf = (sorted_proba[:,-1] - sorted_proba[:,-2]) / sorted_proba[:,-1]
    conf = (conf - conf.min()) / conf.max()

    check = (pred == y)
    flag = ~torch.tensor(check)
    scores = conf[flag].tolist()
    return scores
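
The score in `validate` is the relative margin between the two largest class probabilities, shifted by the smallest margin and divided by the largest. A pure-Python sketch of the same arithmetic on three hypothetical probability rows follows; `margin_confidence` is an illustrative name, not a function from the notebook.

```python
def margin_confidence(proba_rows):
    """Relative margin between the two largest probabilities in each row,
    shifted by the minimum margin and divided by the maximum margin."""
    margins = []
    for row in proba_rows:
        ordered = sorted(row)
        top, second = ordered[-1], ordered[-2]
        margins.append((top - second) / top)
    lo, hi = min(margins), max(margins)
    # Like the notebook's formula, this divides by the maximum margin,
    # so it fails if every row is a perfect tie (hi == 0).
    return [(m - lo) / hi for m in margins]


scores = margin_confidence([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5]])
```

Because the denominator is the maximum rather than the range, the largest score reaches 1 only when the smallest margin is 0, as it is in this example.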

In [25]:
from itertools import compress
from torch_geometric import utils

def Get_Adjacent(ids, mapp, edges, hops):
    if hops == 0:
        return set()
    
    neighbors = set()
    for edge in zip(edges[0], edges[1]):
        if any(mapp[node] in ids for node in edge):
            neighbors.update(mapp[node] for node in edge)

    if hops > 1:
        neighbors = neighbors.union(Get_Adjacent(neighbors, mapp, edges, hops - 1))
    
    return neighbors

def calculate_metrics(TP, FP, FN, TN):
    FPR = FP / (FP + TN) if FP + TN > 0 else 0
    TPR = TP / (TP + FN) if TP + FN > 0 else 0

    prec = TP / (TP + FP) if TP + FP > 0 else 0
    rec = TP / (TP + FN) if TP + FN > 0 else 0
    fscore = (2 * prec * rec) / (prec + rec) if prec + rec > 0 else 0

    return prec, rec, fscore, FPR, TPR

def helper(MP, all_pids, GP, edges, mapp):
    TP = MP.intersection(GP)
    FP = MP - GP
    FN = GP - MP
    TN = all_pids - (GP | MP)

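    # Relax the raw counts using two-hop neighborhoods: flagged nodes close to
    # ground truth are not counted as false positives, and missed ground-truth
    # nodes close to a true positive are counted as detected.
    two_hop_gp = Get_Adjacent(GP, mapp, edges, 2)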
    two_hop_tp = Get_Adjacent(TP, mapp, edges, 2)
    FPL = FP - two_hop_gp
    TPL = TP.union(FN.intersection(two_hop_tp))
    FN = FN - two_hop_tp

    TP, FP, FN, TN = len(TPL), len(FPL), len(FN), len(TN)

    prec, rec, fscore, FPR, TPR = calculate_metrics(TP, FP, FN, TN)
    print(f"True Positives: {TP}, True Negatives: {TN}, False Positives: {FP}, False Negatives: {FN}")
    print(f"Precision: {round(prec, 2)}, Recall: {round(rec, 2)}, Fscore: {round(fscore, 2)}")
    
    return TPL, FPL
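
`helper` relaxes the raw set differences with a two-hop neighborhood before counting. The sketch below reproduces that relaxation on a toy graph using node ids directly, without the `mapp` indirection. The graph, the sets, and the name `neighbors_within` are hypothetical.

```python
def neighbors_within(ids, edges, hops):
    """Nodes on any edge touching `ids`, expanded `hops` times."""
    reached = set()
    frontier = set(ids)
    for _ in range(hops):
        step = set()
        for src, dst in edges:
            if src in frontier or dst in frontier:
                step.update((src, dst))
        reached |= step
        frontier = step
    return reached


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")]
ground_truth = {"a", "b"}
flagged = {"a", "c", "x"}

tp = flagged & ground_truth
fp = flagged - ground_truth
fn = ground_truth - flagged

near_gt = neighbors_within(ground_truth, edges, 2)
near_tp = neighbors_within(tp, edges, 2)

relaxed_fp = fp - near_gt         # "c" is near ground truth, so it is dropped
relaxed_tp = tp | (fn & near_tp)  # "b" is near a true positive, so it counts
relaxed_fn = fn - near_tp
```

In this toy case only the unrelated node `x` remains a false positive, and the missed node `b` is credited as detected.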

In [26]:
import numpy as np

def load_features_test(dataframe, similarity_threshold=1):
    nodes, y_train, edges, mapping, label_map, node_entity_map = Featurize(dataframe)
    X_train = []

    for i, map_id in enumerate(mapping):
        label = label_map[map_id]
        node_embedding = np.zeros(20)  

        if label in gnn_map:
            embedding, stored_set = gnn_map[label]
            current_set = node_entity_map[map_id]
            similarity_metric = len(current_set.intersection(stored_set)) / len(current_set.union(stored_set))

            if similarity_metric >= similarity_threshold:
                node_embedding = np.array(embedding)

        X_train.append(np.hstack((nodes[i], node_embedding)))

    return np.array(X_train), y_train, edges, mapping

In [27]:
import json
import numpy as np
import torch
from torch_geometric import utils

In [28]:
def load_events_from_hosts(hosts):
    all_events = []
    for host in hosts:
        path = f'../SysClient0{host}.systemia.com.txt'
        with open(path, 'r') as file:
            raw_events = [json.loads(line) for line in file]
        all_events.extend(raw_events)
    return all_events

def load_ground_truth(gt_file):
    with open(gt_file, 'r') as file:
        gt_nodes = set(file.read().split())
    return gt_nodes

def evaluate_model(df, xgb_cl, similarity_threshold, confidence_threshold):
    # Forward the caller's threshold instead of silently using the default.
    x, y, edges, mapp = load_features_test(df, similarity_threshold)

    pred = xgb_cl.predict(x)
    proba = xgb_cl.predict_proba(x)

    sorted_proba = np.sort(proba, axis=1)
    conf = (sorted_proba[:, -1] - sorted_proba[:, -2]) / sorted_proba[:, -1]
    normalized_conf = (conf - conf.min()) / conf.max()

    check = (pred == y) & (normalized_conf > confidence_threshold)
    flag = ~torch.tensor(check)

    index = utils.mask_to_index(flag).tolist()
    ids = {mapp[idx] for idx in index}
    return ids,edges,mapp
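
In `evaluate_model`, a node stays out of the alert set only when the classifier predicts its type correctly and does so with a normalized confidence above `confidence_threshold`; every other node is flagged. The following sketch shows that rule on hypothetical predictions. `flag_nodes` is an illustrative name.

```python
def flag_nodes(pred, truth, confidence, threshold):
    """Flag every node that is not both correctly and confidently classified."""
    return [
        not (p == t and c > threshold)
        for p, t, c in zip(pred, truth, confidence)
    ]


flags = flag_nodes(
    pred=["proc", "file", "proc"],
    truth=["proc", "file", "file"],
    confidence=[0.9, 0.3, 0.95],
    threshold=0.6,
)
```

The second node is flagged despite a correct prediction because its confidence is below the threshold, and the third is flagged because the prediction is wrong.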

In [29]:
import json
import numpy as np
import torch

def read_event_data(host):
    file_path = f'../SysClient0{host}.systemia.com.txt'
    with open(file_path, 'r') as file:
        return [json.loads(line) for line in file]
        
def stream_events(batch_size, window_size):
    event_buffer = {}
    hosts = ['051']
    positions = {host: 0 for host in hosts}
    while True:
        for host in hosts:
            if host not in event_buffer or len(event_buffer[host]) < positions[host] + batch_size:
                events = read_event_data(host)
                dframe = transform(events)
                if host in event_buffer:
                    # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
                    event_buffer[host] = pd.concat([event_buffer[host], dframe], ignore_index=True)
                else:
                    event_buffer[host] = dframe
            start = positions[host]
            end = start + batch_size
            yield event_buffer[host][start:end]
            positions[host] += window_size
            if positions[host] >= len(event_buffer[host]):
                return

def analyze_events(data_frame, ground_truth_nodes):
    
    if data_frame['properties'].apply(lambda x: isinstance(x, str)).any():
        data_frame['properties'] = data_frame['properties'].apply(json.loads)
        
    actor_and_object_ids = set(data_frame['actorID']) | set(data_frame['objectID'])
    relevant_ground_truth = {x for x in ground_truth_nodes if x in actor_and_object_ids}

    features, labels, edges, mapping = load_features_test(data_frame)
    model = load_pkl(xgboost_weights)

    predictions = model.predict(features)
    probabilities = model.predict_proba(features)

    sorted_probabilities = np.sort(probabilities, axis=1)
    confidence_scores = (sorted_probabilities[:, -1] - sorted_probabilities[:, -2]) / sorted_probabilities[:, -1]
    normalized_confidence = (confidence_scores - confidence_scores.min()) / confidence_scores.max()

    misclassified = ~torch.tensor(predictions == labels)
    misclassified_indices = utils.mask_to_index(misclassified).tolist()
    misclassified_ids = {mapping[idx] for idx in misclassified_indices}

    helper(misclassified_ids, actor_and_object_ids, relevant_ground_truth, edges, mapping)
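
`stream_events` yields a slice of `batch_size` rows and then advances by `window_size` rows, so consecutive batches overlap whenever the step is smaller than the batch. The sketch below shows the same sliding pattern on a plain list. It is a simplified illustration without the per-host buffering, and `sliding_batches` is a hypothetical name.

```python
def sliding_batches(items, batch_size, step):
    """Yield overlapping slices of `items`, advancing `step` items each time."""
    start = 0
    while start < len(items):
        yield items[start:start + batch_size]
        start += step


batches = list(sliding_batches(list(range(10)), batch_size=4, step=2))
```

Each batch here shares two items with the previous one, and the final batch is shorter because the list runs out.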

In [30]:
def traverse(ids, mapping, edges, hops, visited=None):
    if hops == 0:
        return set()

    if visited is None:
        visited = set()

    neighbors = set()
    for src, dst in zip(edges[0], edges[1]):
        src_mapped, dst_mapped = mapping[src], mapping[dst]

        if (src_mapped in ids and dst_mapped not in visited) or \
           (dst_mapped in ids and src_mapped not in visited):
            neighbors.add(src_mapped)
            neighbors.add(dst_mapped)

        visited.add(src_mapped)
        visited.add(dst_mapped)

    neighbors.difference_update(ids) 
    return ids.union(traverse(neighbors, mapping, edges, hops - 1, visited))

# def load_data(file_path):
#     with open(file_path, 'r') as file:
#         return json.load(file)

def find_connected_alerts(start_alert, mapping, edges, depth, remaining_alerts):
    connected_path = traverse({start_alert}, mapping, edges, depth)
    return connected_path.intersection(remaining_alerts)

def generate_incident_graphs(alerts, edges, mapping, depth):
    incident_graphs = []
    remaining_alerts = set(alerts)

    while remaining_alerts:
        alert = remaining_alerts.pop()
        connected_alerts = find_connected_alerts(alert, mapping, edges, depth, remaining_alerts)

        if len(connected_alerts) > 1:
            incident_graphs.append(connected_alerts)
            remaining_alerts -= connected_alerts

    return incident_graphs
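
The functions above group alerts that lie within a few hops of each other into candidate incident graphs. The following is a simplified variant of that idea using breadth-first search on a toy graph, not the notebook's exact procedure: unlike `generate_incident_graphs`, it keeps the seed alert in its own cluster and picks seeds deterministically. All names and data are hypothetical.

```python
from collections import deque


def within_hops(start, adjacency, depth):
    """Breadth-first search: all nodes at most `depth` hops from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


def group_alerts(alerts, edges, depth):
    """Cluster alerts that are within `depth` hops of a seed alert."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    remaining = set(alerts)
    groups = []
    while remaining:
        seed = min(remaining)  # deterministic pick for the example
        cluster = within_hops(seed, adjacency, depth) & remaining
        remaining -= cluster
        if len(cluster) > 1:  # isolated alerts do not form an incident
            groups.append(cluster)
    return groups


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
groups = group_alerts({"a", "c", "x", "q"}, edges, depth=2)
```

Alerts `a` and `c` are two hops apart and form one group, while `x` and `q` have no other alert nearby and are dropped.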

### Testing Flash on OpTC Malicious Upgrade Attack

In [31]:
all_events = load_events_from_hosts(['051'])

EnActIds = [x['actorID'] for x in all_events]
EnObjIds = [x['objectID'] for x in all_events]
EntitySet = set(EnActIds).union(set(EnObjIds))

df = transform(all_events)

gt_nodes = load_ground_truth('../optc.txt')
gt_nodes = [x for x in gt_nodes if x in EntitySet]
gt_nodes = set(gt_nodes)

xgboost_model = load_pkl(xgboost_weights)
identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0.6)

alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

True Positives: 164, True Negatives: 179084, False Positives: 62, False Negatives: 17
Precision: 0.73, Recall: 0.91, Fscore: 0.81
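
The reported precision, recall, and F-score follow directly from the counts printed above. As a quick check, the arithmetic from `calculate_metrics` can be reproduced by hand:

```python
# Counts copied from the output above.
tp, fp, fn, tn = 164, 62, 17, 179084

precision = tp / (tp + fp)
recall = tp / (tp + fn)
fscore = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)

print(round(precision, 2), round(recall, 2), round(fscore, 2))
```

The false positive rate, which `helper` computes but does not print, is well below 0.1% for these counts.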


### Testing Flash on OpTC Plain PowerShell Empire Attack

In [32]:
# all_events = load_events_from_hosts(['201'])

# EnActIds = [x['actorID'] for x in all_events]
# EnObjIds = [x['objectID'] for x in all_events]
# EntitySet = set(EnActIds).union(set(EnObjIds))

# df = transform(all_events)

# gt_nodes = load_ground_truth('optc.txt')
# gt_nodes = [x for x in gt_nodes if x in EntitySet]
# gt_nodes = set(gt_nodes)

# xgboost_model = load_pkl(xgboost_weights)
# identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0)

# alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

### Testing Flash on OpTC Custom PowerShell Empire Attack

In [33]:
# all_events = load_events_from_hosts(['501'])

# EnActIds = [x['actorID'] for x in all_events]
# EnObjIds = [x['objectID'] for x in all_events]
# EntitySet = set(EnActIds).union(set(EnObjIds))

# df = transform(all_events)

# gt_nodes = load_ground_truth('optc.txt')
# gt_nodes = [x for x in gt_nodes if x in EntitySet]
# gt_nodes = set(gt_nodes)

# xgboost_model = load_pkl(xgboost_weights)
# identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0.98)

# alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

### Testing Flash on Streaming Batches Generated from OpTC Attack Logs

In [34]:
stream = False
if stream:
    # Load the ground truth once rather than re-reading it for every batch.
    gt_nodes = load_ground_truth('optc.txt')
    for data_frame in stream_events(250000, 250):
        analyze_events(data_frame, gt_nodes)