# VECTR Link Analysis with Gephi and ipysigma

[Gephi](https://gephi.org) and other link analysis tools can be used to surface trends within a VECTR dataset, allowing defenders to focus their efforts on prevalent techniques observed across a range of threat groups. This Jupyter Notebook processes a standard VECTR Environment CSV export, returning an in-line node graph, as well as `gephi_nodes.csv` and `gephi_edges.csv` inputs for use within Gephi. The subsequent analysis can be used to discover commonalities between Threat Actors (VECTR Assessments), their intrusions (VECTR Campaigns), and corresponding MITRE ATT&CK Tactics, Tools and Techniques (VECTR Test Cases).

Note that this methodology works best when your VECTR Environment is organised to reflect the approach documented in my original presentation, *[Intelligence-Led Adversarial Threat Modelling with VECTR](https://github.com/ssnkhan/adversarial-threat-modelling)*.

In [1]:
#!/usr/bin/env python3

# Wednesday 15 January 2025
# Sajid Nawaz Khan, HSBC CITA
# @snkhan@infosec.exchange

# Intelligence-Led Adversarial Threat Modelling with VECTR
# See https://github.com/ssnkhan/adversarial-threat-modelling
# Takes a VECTR Environment export (CSV), and generates a directed graph using ipysigma
# Additionally, generates inputs for use within Gephi


import pandas as pd
import warnings


# Hide Future Warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

In [None]:
# Import the VECTR CSV
# Generate via {Environment} ▶ Export Active Environment

vectr_df = pd.read_csv("./VECTR.csv")
vectr_df.columns
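
The later cells depend on a handful of columns being present in the export. The following is a minimal sketch of an up-front check, assuming the column names used elsewhere in this notebook; `missing_columns` is a hypothetical helper, and the sample frame only stands in for a real export.

```python
import pandas as pd

# Columns referenced by the cells in this notebook
REQUIRED_COLUMNS = ["AssessmentGroup", "Campaign", "Phase", "MitreID", "Method", "Attacker Tools"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `df` (hypothetical helper)."""
    return [name for name in required if name not in df.columns]


# Illustrative stand-in for a VECTR export; the values are made up
sample_df = pd.DataFrame({
    "AssessmentGroup": ["Example Actor"],
    "Campaign": ["2024 Example Campaign"],
    "Phase": ["Execution"],
    "MitreID": ["T1059"],
    "Method": ["Command and Scripting Interpreter"],
})

absent = missing_columns(sample_df)
print(absent)
```

Running `missing_columns(vectr_df)` straight after the import would flag a renamed or missing column before it surfaces as a `KeyError` further down.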

In [3]:
# Optional
# Filter Campaigns by Threat Group or Year, or both
# Expects Campaign names to be prefixed with YYYY

# Lists of values -- a list containing only an empty string (`[""]`) matches every row
threat_actor_include = [""]
threat_actor_exclude = ["Red Team"]
year = [""]


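# Filter by one specific list
# Each line below is an alternative: every assignment overwrites the previous one, so keep only the one you need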
vectr_filtered_df = vectr_df[vectr_df["AssessmentGroup"].str.contains("|".join(threat_actor_include), na=False)]
vectr_filtered_df = vectr_df[~vectr_df["AssessmentGroup"].str.contains("|".join(threat_actor_exclude), na=False)]
vectr_filtered_df = vectr_df[vectr_df["Campaign"].str.startswith(tuple(year))]


# Or all
# If `threat_actor_exclude` is empty, remove the exclude condition (the line beginning with `~`) below, otherwise every row is excluded
vectr_filtered_df = vectr_df[
    (vectr_df["AssessmentGroup"].str.contains("|".join(threat_actor_include), na=False)) &
    ~vectr_df["AssessmentGroup"].str.contains("|".join(threat_actor_exclude), na=False) &
    (vectr_df["Campaign"].str.startswith(tuple(year)))
]
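
The combined filter above can also be wrapped so that an empty list switches a condition off, rather than requiring a line to be removed by hand. This is a sketch only: `filter_campaigns` is a hypothetical helper, the sample data is made up, and `re.escape` is an addition (not in the cell above) so that names containing regex characters are matched literally.

```python
import re

import pandas as pd


def filter_campaigns(df, include=(), exclude=(), years=()):
    """Hypothetical helper: each empty argument leaves that filter switched off."""
    mask = pd.Series(True, index=df.index)
    if include:
        pattern = "|".join(re.escape(name) for name in include)
        mask &= df["AssessmentGroup"].str.contains(pattern, na=False)
    if exclude:
        pattern = "|".join(re.escape(name) for name in exclude)
        mask &= ~df["AssessmentGroup"].str.contains(pattern, na=False)
    if years:
        # Expects Campaign names to be prefixed with YYYY
        mask &= df["Campaign"].str.startswith(tuple(years), na=False)
    return df[mask]


# Illustrative data only
sample_df = pd.DataFrame({
    "AssessmentGroup": ["Example Actor A", "Example Actor B", "Red Team"],
    "Campaign": ["2023 Campaign One", "2024 Campaign Two", "2024 Internal Exercise"],
})

filtered = filter_campaigns(sample_df, exclude=["Red Team"], years=["2024"])
print(filtered)
```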

In [None]:
# Uncomment below to use the filtered dataframe
# vectr_df = vectr_filtered_df

actor_tactic_technique_df = vectr_df[["AssessmentGroup", "Phase", "MitreID", "Method", "Attacker Tools"]].copy()
actor_tactic_technique_df.rename(columns={"AssessmentGroup":"Actor", "Phase":"Tactic", "MitreID":"Technique", "Attacker Tools":"Tool"}, inplace=True)
actor_tactic_technique_df.value_counts()[0:]

In [5]:
# Nodes
# Columns: Id, Type, Label
# Nodes tell Gephi all the possible nodes in a network

gephi_nodes_df = pd.DataFrame(columns=["Id", "Type", "Label"])
gephi_nodes_df["Id"] = actor_tactic_technique_df["Actor"].values
gephi_nodes_df["Type"] = "Threat Actor"

# last_row_df = gephi_nodes_df.index[-1] + 1
# gephi_nodes_df = pd.concat([gephi_nodes_df, pd.DataFrame({"Id": actor_tactic_technique_df["Tactic"]})], ignore_index=True)
# gephi_nodes_df.loc[last_row_df:, "Type"] = "Tactic"

last_row_df = gephi_nodes_df.index[-1] + 1
gephi_nodes_df = pd.concat([gephi_nodes_df, pd.DataFrame({"Id": actor_tactic_technique_df["Technique"], "Label": actor_tactic_technique_df["Tactic"] + ": " + actor_tactic_technique_df["Method"]})], ignore_index=True)
# Use .loc rather than chained indexing, which does not reliably write back to the dataframe
gephi_nodes_df.loc[last_row_df:, "Type"] = "Technique"


# Account for Tool cells containing multiple comma-separated values
last_row_df = gephi_nodes_df.index[-1] + 1
gephi_nodes_df = pd.concat([gephi_nodes_df, pd.DataFrame({"Id": actor_tactic_technique_df["Tool"]})], ignore_index=True)
empty_rows_nodes = gephi_nodes_df[gephi_nodes_df["Type"].isnull()]

expanded_rows_nodes = (
    empty_rows_nodes.assign(Id=empty_rows_nodes["Id"].str.split(","))
    .explode("Id")
    .reset_index(drop=True)
)

non_empty_rows_nodes = gephi_nodes_df[gephi_nodes_df["Type"].notnull()]
gephi_nodes_df = pd.concat([non_empty_rows_nodes, expanded_rows_nodes], ignore_index=True)
# Use .loc rather than chained indexing, which does not reliably write back to the dataframe
gephi_nodes_df.loc[last_row_df:, "Type"] = "Tool"


# Clean up empty rows, and duplicates
gephi_nodes_df = gephi_nodes_df.sort_values(by=["Type", "Id"])
gephi_nodes_df = gephi_nodes_df.dropna(subset=["Id"])
gephi_nodes_df = gephi_nodes_df.drop_duplicates()
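
The handling of multi-valued Tool cells can be illustrated in isolation. The sketch below uses made-up data and a single method chain; the `.str.strip()` step is an assumption that is not part of the cell above, added because splitting `"ToolOne, ToolTwo"` on a comma leaves a leading space on the second value. If it were adopted, the same step would be needed when building the edges so that the identifiers still match.

```python
import pandas as pd

# Illustrative data: one cell with two tools, one empty cell
sample_df = pd.DataFrame({
    "Actor": ["Example Actor A", "Example Actor B"],
    "Tool": ["ToolOne, ToolTwo", None],
})

tool_nodes = (
    sample_df[["Tool"]]
    .rename(columns={"Tool": "Id"})
    .assign(Id=lambda frame: frame["Id"].str.split(","))  # one list per cell
    .explode("Id")                                        # one row per tool
    .dropna(subset=["Id"])                                # drop empty cells
    .assign(Id=lambda frame: frame["Id"].str.strip(), Type="Tool")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(tool_nodes)
```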

In [6]:
# Edges
# Columns: Source, Target, Type
# Edges tell Gephi how the nodes are connected

gephi_edges_df = pd.DataFrame(columns=["Source", "Target", "Type"])


# These columns can be reintroduced, but may create busy graphs
# actor_tactic = pd.DataFrame({"Source": actor_tactic_technique_df["Actor"], "Target": actor_tactic_technique_df["Tactic"]})
# gephi_edges_df = pd.concat([gephi_edges_df, actor_tactic], ignore_index=True)

# tactic_technique = pd.DataFrame({"Source": actor_tactic_technique_df["Technique"], "Target": actor_tactic_technique_df["Tactic"]})
# gephi_edges_df = pd.concat([gephi_edges_df, tactic_technique], ignore_index=True)


actor_technique = pd.DataFrame({"Source": actor_tactic_technique_df["Actor"], "Target": actor_tactic_technique_df["Technique"]})
gephi_edges_df = pd.concat([gephi_edges_df, actor_technique], ignore_index=True)


# Account for Tool cells containing multiple, comma-separated values
actor_tool = pd.DataFrame({"Source": actor_tactic_technique_df["Actor"], "Target": actor_tactic_technique_df["Tool"]})
actor_tool_expanded = actor_tool.assign(Target=actor_tool["Target"].str.split(",")).explode("Target", ignore_index=True)
actor_tool_expanded = actor_tool_expanded.dropna(subset=["Target"])
gephi_edges_df = pd.concat([gephi_edges_df, actor_tool_expanded], ignore_index=True)


# Clean up empty rows, set Type to directed
gephi_edges_df = gephi_edges_df.dropna(subset=["Source", "Target"])
gephi_edges_df["Type"] = "Directed"
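
Because networkx's `add_edge` silently creates any endpoint that is not already a node, an edge whose `Source` or `Target` does not match a node `Id` would produce a node with no `Type` attribute. A small consistency check, sketched here on made-up data (the frame names are illustrative, not the notebook's own), can surface such edges before graphing.

```python
import pandas as pd

# Illustrative nodes and edges; the last edge target has a stray leading space
nodes_df = pd.DataFrame({"Id": ["Example Actor A", "T1059", "ToolOne"]})
edges_df = pd.DataFrame({
    "Source": ["Example Actor A", "Example Actor A", "Example Actor A"],
    "Target": ["T1059", "ToolOne", " ToolTwo"],
})

known_ids = set(nodes_df["Id"])

# Edges where either endpoint is missing from the node list
dangling = edges_df[
    ~edges_df["Source"].isin(known_ids) | ~edges_df["Target"].isin(known_ids)
]
print(dangling)
```

The same expression could be run against `gephi_nodes_df` and `gephi_edges_df`; an empty result means every edge endpoint has a matching node.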

In [None]:
# Draw a graph using ipysigma
# https://github.com/medialab/ipysigma
# %pip install ipysigma networkx


import networkx as nx
from ipysigma import Sigma


# Create a directed graph
graph = nx.DiGraph()


# Populate the graph
# Add nodes from our existing `gephi_nodes_df` dataframe
for _, row in gephi_nodes_df.iterrows():
    node_attributes = row.to_dict()
    node_id = node_attributes.pop("Id")
    graph.add_node(node_id, **node_attributes)


# Add edges from our existing `gephi_edges_df` dataframe
for _, row in gephi_edges_df.iterrows():
    edge_attributes = row.to_dict()
    source = edge_attributes.pop("Source")
    target = edge_attributes.pop("Target")
    graph.add_edge(source, target, **edge_attributes)


# Write the graph to a GEXF file
nx.write_gexf(graph, "./VECTR_Directed.gexf")


# Display the graph with the node size mapped to the node's in-degree, and the colour mapped to the node type
# This emphasises the MITRE techniques, as well as the tools observed
# Other `node_size` values to try: `graph.degree`, `graph.out_degree`, `nx.eigenvector_centrality(graph)`, `nx.pagerank(graph)`

# node_size
# To emphasise tools and techniques, use `graph.in_degree`
# To emphasise threat actors, use `graph.out_degree`

Sigma(
    graph,
    name="VECTR Analysis",
    height=750, 
    node_metrics=["louvain"],
    default_edge_type="curve", 
    node_size_range=(3, 20),
    node_size=graph.in_degree, 
    default_node_color="#9370DB", 
    node_color_palette="Purples",
    node_color="Type",
    node_halo_size=graph.in_degree,
    node_halo_color="MediumBlue",
    start_layout=1
)
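
The in-degree that drives the node size above can also be inspected directly, which gives a quick ranked list of the most widely shared techniques and tools. A minimal sketch on a made-up graph follows; with the notebook's own data, `example_graph` would be replaced by `graph`.

```python
import networkx as nx

# Illustrative graph: two actors share one technique
example_graph = nx.DiGraph()
example_graph.add_edges_from([
    ("Example Actor A", "T1059"),
    ("Example Actor B", "T1059"),
    ("Example Actor A", "T1566"),
])

# `in_degree` yields (node, in-degree) pairs; sort with the highest first
ranked = sorted(example_graph.in_degree, key=lambda pair: pair[1], reverse=True)
top_node, top_in_degree = ranked[0]
print(top_node, top_in_degree)
```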

In [8]:
# Save an offline, self-contained copy
# Add `show_all_labels=True` to display all labels -- can make for a busy graph

Sigma.write_html(
    graph,
    "./VECTR_Directed.html",
    name="VECTR Analysis",
    fullscreen=True,
    node_metrics=["louvain"],
    default_node_color="#20B2AA", 
    node_color_palette="Purples",
    node_color="Type",
    node_size_range=(3, 20),
    default_edge_type="curve",
    node_border_color_from="node",
    default_node_label_size=14,
    node_size=graph.in_degree,
    node_halo_size=graph.in_degree,
    node_halo_color="MediumBlue",
    start_layout=1
)

In [9]:
# Alternatively, use Gephi to generate and export the GEXF file
# This has additional benefits, such as being able to calculate specific node attributes and statistics

# Export to CSV for Gephi
gephi_nodes_df.to_csv("./gephi_nodes.csv", index=False)
gephi_edges_df.to_csv("./gephi_edges.csv", index=False)


# Then import and draw the Gephi graph as detailed below, before exporting the graph in GEXF format
# Open and replace the GEXF declaration on line 2 of the GEXF file as follows:
# <gexf version="1.2" xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.2/viz" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd">

# Import the Gephi GEXF export
# graph = nx.read_gexf("./VECTR.gexf")


# Then call Sigma as above
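
The manual replacement of the GEXF declaration could also be scripted. The sketch below assumes the opening `<gexf ...>` tag sits on a single line, as the comment above implies; `replace_gexf_declaration` is a hypothetical helper, and the declaration string is copied verbatim from the comment above.

```python
from pathlib import Path

# Declaration quoted in the cell above
GEXF_DECLARATION = (
    '<gexf version="1.2" xmlns="http://www.gexf.net/1.2draft" '
    'xmlns:viz="http://www.gexf.net/1.2/viz" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd">'
)


def replace_gexf_declaration(path, declaration=GEXF_DECLARATION):
    """Hypothetical helper: swap the first line starting with `<gexf` for `declaration`."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for index, line in enumerate(lines):
        if line.lstrip().startswith("<gexf"):
            lines[index] = declaration
            break
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Example usage, assuming the Gephi export was saved as ./VECTR.gexf:
# replace_gexf_declaration("./VECTR.gexf")
# graph = nx.read_gexf("./VECTR.gexf")
```

The helper overwrites the file in place, so it may be worth keeping a copy of the original Gephi export.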

## Gephi Analysis


### Data Laboratory: Data Import
1. Create a New Project
2. Switch to the Data Laboratory tab
3. Select "Import Spreadsheet"
4. Select `gephi_nodes.csv`, then click "Next" and "Finish"
5. Set the "Graph Type" to "Directed"
6. Set the "Edges merge strategy" to "Don't merge", then select "Append to existing workspace", then "OK"
7. Select "Import Spreadsheet" again
8. Select `gephi_edges.csv`, then click "Next" and "Finish"
9. Set the "Graph Type" to "Directed"
10. Set the "Edges merge strategy" to "Don't merge", then select "Append to existing workspace", then "OK"


### Overview: Statistics: Calculate
With the data imported, we can now draw the nodes and edges to our graph. We can use Gephi's powerful statistical capabilities to calculate various attributes for each data point, such as its degree and modularity, and use these to organise and resize each node accordingly.

1. In the Statistics tab, run the "Average Degree" and "Modularity" statistics
2. Under the Appearance tab, select Nodes ▶ Size ▶ Ranking ▶ Degree (Minimum 5, Maximum 30), then "Apply"
3. Under the Appearance tab, select Nodes ▶ Colour ▶ Partition ▶ Modularity Class, then "Apply"
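
For readers who want a rough equivalent of these statistics without leaving Python, the sketch below computes node degrees and Louvain communities with networkx on a made-up graph. It is an approximation under stated assumptions: the graph is converted to undirected before community detection, a fixed `seed` is used for repeatability, and the resulting values are not guaranteed to match Gephi's output.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Illustrative graph: two actors with no techniques in common
example_graph = nx.DiGraph()
example_graph.add_edges_from([
    ("Example Actor A", "T1059"), ("Example Actor A", "T1566"), ("Example Actor A", "T1105"),
    ("Example Actor B", "T1003"), ("Example Actor B", "T1021"), ("Example Actor B", "T1486"),
])

# Degree per node (in-degree plus out-degree), and the mean across all nodes
degrees = dict(example_graph.degree)
average_degree = sum(degrees.values()) / example_graph.number_of_nodes()

# Louvain community detection on the undirected view of the graph
communities = louvain_communities(example_graph.to_undirected(), seed=42)
modularity_class = {
    node: index for index, members in enumerate(communities) for node in members
}

print(average_degree)
print(modularity_class)
```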


### Overview: Layout: Drawing the Graph
1. Select and run the Fruchterman Reingold model. This force-directed layout pulls connected nodes together and pushes all nodes apart
2. Under Layout, select "ForceAtlas 2" ▶ Prevent Overlap
3. Optionally, run the Expansion and/or the Noverlap models as necessary
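
networkx offers a comparable force-directed layout: `nx.spring_layout` positions nodes using the Fruchterman-Reingold algorithm. The sketch below, on a made-up graph, only illustrates the idea; the coordinates it produces will differ from Gephi's, and the fixed `seed` is an assumption added for repeatability.

```python
import networkx as nx

# Illustrative graph
example_graph = nx.DiGraph()
example_graph.add_edges_from([
    ("Example Actor A", "T1059"),
    ("Example Actor B", "T1059"),
    ("Example Actor A", "T1566"),
])

# A fixed seed makes the layout reproducible between runs
positions = nx.spring_layout(example_graph, seed=42)

for node, (x, y) in positions.items():
    print(f"{node}: ({x:.2f}, {y:.2f})")
```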


### Overview: Graph: Labels
1. Below the main graph canvas, select the "More settings ..." icon
2. Enable the Nodes checkbox, and set a suitable font
3. Select the "Configure ..." button, and enable the "Id" checkbox, and disable the "Label" checkbox


### Overview: Filters (Optional)
A large VECTR dataset can result in hundreds of nodes being added to the graph. Thankfully, Gephi's filtering functionality can be used to remove statistically insignificant nodes from the graph.

1. Ensure you have already calculated the appropriate node characteristics, detailed above
2. In the Filters panel, select Library ▶ Attributes ▶ Range, then select the appropriate node characteristic
3. In the Queries panel below, expand Range ▶ Parameters, and set a sensible minimum value. The range of values can be explored using the Data Laboratory
4. Click Filter
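
A similar range filter can be approximated in networkx before the graph is passed to ipysigma. The sketch below keeps only nodes whose degree meets a minimum; `minimum_degree` is an illustrative threshold and the graph is made up.

```python
import networkx as nx

# Illustrative graph
example_graph = nx.DiGraph()
example_graph.add_edges_from([
    ("Example Actor A", "T1059"),
    ("Example Actor B", "T1059"),
    ("Example Actor A", "T1566"),
    ("Example Actor B", "ToolOne"),
])

# Illustrative threshold: drop nodes connected to fewer than two others
minimum_degree = 2
keep = [node for node, degree in example_graph.degree if degree >= minimum_degree]

# `.copy()` returns an independent graph rather than a read-only view
filtered_graph = example_graph.subgraph(keep).copy()
print(sorted(filtered_graph.nodes))
```

Note that removing low-degree nodes also removes their edges, so the degrees of the remaining nodes in `filtered_graph` can be lower than in the original graph.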


### Preview: Customising and Exporting the Graph
1. Select the Preview tab to view and export the graph
2. Optionally, select a new Font
3. Click the "Refresh" button after every adjustment to redraw the Graph


### Export to GEXF
1. File ▶ Export ▶ Graph File, then select `GEXF Files (*.gexf)` in the dropdown
2. The `.gexf` file can be used as an input for ipysigma above