# Simplicial Paths as Higher-Order Motifs

A strong inductive bias for deep learning models is processing signals while respecting the local structure of their underlying space. Many real-world systems operate on asymmetric relational structures, leading to directed graph representations. However, most graph and topological models forcibly symmetrize these relationships, thereby losing critical information. While some graph neural networks have recently started incorporating asymmetric pairwise interactions, extending the topological data analysis (TDA) framework to account for asymmetric higher-order relationships remains unexplored.

Recent studies have examined cascading dynamics on networks at the simplicial level [2]. In TDA, the use of topological tools to address questions in neuroscience has generated interest in constructing topological spaces from digraphs to better understand the phenomena they support [3].

Driven by the belief that complex high-order dynamics reveal crucial interactions among related signals, we propose integrating high-order directionality to identify simplicial paths. This approach aims to uncover intricate patterns and dynamics that traditional methods often overlook.

Specifically, we suggest using maximal simplicial paths as high-dimensional motifs derived from directed graphs and encoding these paths as high-order cells within a combinatorial complex. We expect this encoding to surface information that symmetrized, pairwise representations miss and to yield more expressive graph representation models. We propose this approach as a first step towards defining directed topological neural network architectures.

Next, we introduce the basic background for the approach. For a more extensive introduction, we refer the reader to [1].

## Complexes

**Directed Graphs**

A *directed graph* (digraph) is a pair $G = (V,E)$, where $V$ is a finite set of vertices and $E \subseteq (V \times V) \setminus \Delta_V$ is a relation, with $\Delta_V = \{(v,v) \mid v \in V\}$. Note that the relation is not necessarily symmetric. Removing the diagonal $\Delta_V$ rules out loops, i.e., there are no edges $(v,v)$.
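As a small illustration of this definition, the sketch below stores a digraph as a vertex set and a set of ordered pairs. The helper name `make_digraph` is hypothetical and not part of the lifting code; dropping loops and unknown endpoints silently is a choice made here for brevity.

```python
def make_digraph(vertices, pairs):
    """Return (V, E) where E is a set of ordered pairs of distinct vertices of V."""
    V = set(vertices)
    # Keep only pairs inside V x V and discard the diagonal (no loops).
    E = {(u, v) for (u, v) in pairs if u in V and v in V and u != v}
    return V, E


# (0, 1) and (1, 0) are two different edges; (2, 2) is a loop and is dropped.
V, E = make_digraph(range(3), [(0, 1), (1, 0), (1, 2), (2, 2)])
```

Because `E` holds ordered pairs, `(1, 2)` being present says nothing about `(2, 1)`, which is exactly the asymmetry the definition allows.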

**Abstract Simplicial Complexes**

An *abstract simplicial complex* is a pair $K = (V, \Sigma)$, where $V$ is a finite set of vertices and $\Sigma$ is a collection of subsets of $V$ such that for every $\sigma \in \Sigma$, $\tau \subseteq \sigma$ implies $\tau \in \Sigma$. An element $\sigma$ of $\Sigma$ is an *abstract simplex* of $K$. It is a *$k$-simplex*, of dimension $\dim(\sigma) = k$, if $|\sigma| = k+1$. If $\tau \subseteq \sigma \in \Sigma$, then $\tau$ is a *face* of $\sigma$. If moreover $\dim(\tau) = \dim(\sigma) - 1$, then $\tau$ is a *facet* of $\sigma$. The *dimension* $\dim(K)$ of $K$ is the maximal dimension of a simplex in $K$.
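The closure condition can be checked mechanically. The following is a minimal sketch, assuming simplices are given as Python sets and that only non-empty faces are required (the empty simplex is ignored here by convention); the function names are illustrative only.

```python
from itertools import combinations


def is_simplicial_complex(simplices):
    """Check that a family of vertex sets is closed under non-empty subsets."""
    family = {frozenset(s) for s in simplices}
    for sigma in family:
        # Every proper non-empty subset of sigma must also be in the family.
        for size in range(1, len(sigma)):
            for tau in combinations(sigma, size):
                if frozenset(tau) not in family:
                    return False
    return True


def dimension(simplices):
    """Maximal dimension of a simplex, where a k-simplex has k + 1 vertices."""
    return max(len(s) for s in simplices) - 1


filled = [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]
missing_edge = [{0}, {1}, {2}, {0, 1}, {0, 2}, {0, 1, 2}]  # face {1, 2} is absent
```

Here `filled` is a valid complex of dimension 2, while `missing_edge` fails the check because the triangle has a facet that is not in the family.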

There is a standard way of building an abstract simplicial complex from a graph.

**Flag Complex**

Given a graph $G$, its associated flag complex is the abstract simplicial complex whose $k$-simplices are formed by the $(k+1)$-cliques of the graph.
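A brute-force sketch of this construction is shown below, assuming an undirected graph given as a vertex collection and an edge list. The function `flag_complex` is a hypothetical helper; it tests every vertex subset, so it is only meant for toy graphs.

```python
from itertools import combinations


def flag_complex(vertices, edges, max_dim=2):
    """Map each dimension k to the (k + 1)-cliques of an undirected graph."""
    adjacent = {frozenset(e) for e in edges}
    ordered = sorted(vertices)
    simplices = {0: [(v,) for v in ordered]}
    for k in range(1, max_dim + 1):
        # A (k + 1)-subset is a clique when all of its pairs are edges.
        simplices[k] = [
            c for c in combinations(ordered, k + 1)
            if all(frozenset(p) in adjacent for p in combinations(c, 2))
        ]
    return simplices


# A triangle 0-1-2 with a pendant edge 2-3.
fc = flag_complex(range(4), [(0, 1), (0, 2), (1, 2), (2, 3)])
```

The triangle contributes a single 2-simplex, while the pendant edge only contributes a 1-simplex.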

The following is the natural generalization of flag complexes to digraphs.

**Directed Flag Complex**

An *ordered $k$-clique* of a directed graph $G$ is a totally ordered $k$-tuple $(v_1, \dots, v_k)$ of vertices of $G$ such that $(v_i, v_j) \in E$ whenever $i < j$. Given a digraph $G$, its *directed flag complex* is the complex whose $k$-simplices are the ordered $(k+1)$-cliques of $G$. Since the vertices of each simplex carry a total order, it is an ordered simplicial complex.
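To make the role of edge direction concrete, here is a brute-force sketch that enumerates ordered cliques by testing every vertex ordering. The function `directed_flag_complex` is a hypothetical helper written for this illustration and is unrelated to the `DirectedFlagComplex` class used in the notebook; it is only practical for very small digraphs.

```python
from itertools import permutations


def directed_flag_complex(vertices, edges, max_dim=2):
    """Map each dimension k to the ordered (k + 1)-cliques of a digraph."""
    E = set(edges)
    ordered = sorted(vertices)
    simplices = {0: [(v,) for v in ordered]}
    for k in range(1, max_dim + 1):
        # A tuple is an ordered clique when every earlier vertex points to every later one.
        simplices[k] = [
            t for t in permutations(ordered, k + 1)
            if all((t[a], t[b]) in E
                   for a in range(k + 1) for b in range(a + 1, k + 1))
        ]
    return simplices


transitive = directed_flag_complex(range(3), [(0, 1), (1, 2), (0, 2)])
cyclic = directed_flag_complex(range(3), [(0, 1), (1, 2), (2, 0)])
```

Both digraphs have the same underlying undirected triangle, yet only the transitive orientation yields a 2-simplex; the cyclic one has none.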


## Simplicial Paths

**Edge paths on digraphs** A path on a digraph is a sequence of vertices $(v_0, v_1, \dots, v_n)$ such that every consecutive pair $(v_i, v_{i+1})$ is an edge in $E$; it runs from the source vertex $v_0$ to the sink vertex $v_n$. A directed graph generally supports many such directed edge paths.

Directed cliques have an inherent directionality, which we exploit to extend the notion to higher-dimensional simplicial paths formed by sequences of simplices in the directed flag complex.

We will impose the notion of direction via face maps.

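**Face maps** Face maps identify the faces of a simplex by omitting one of its vertices. Let $\sigma = (v_0, \ldots, v_n)$ be an $n$-simplex, and let $\hat{v}_i$ indicate that the vertex $v_i$ is omitted. We denote by $\hat{d}_i$ the face map that omits the $i$th vertex, or the last vertex when $i \geq n$: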

$$
\hat{d}_i(\sigma) =
\begin{cases}
(v_0, \ldots, \hat{v}_i, \ldots, v_n) & \text{if } i < n, \\
(v_0, \ldots, v_{n-1}, \hat{v}_n) & \text{if } i \geq n.
\end{cases}
$$
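The face map translates directly into a tuple operation. The sketch below assumes simplices are stored as tuples of vertices in their total order; `face_map` is an illustrative name.

```python
def face_map(sigma, i):
    """Omit the i-th vertex of sigma, or the last vertex when i >= dim(sigma)."""
    n = len(sigma) - 1          # dimension of the simplex
    idx = min(i, n)             # indices past the end fall back to the last vertex
    return sigma[:idx] + sigma[idx + 1:]


sigma = (0, 1, 2)
faces = [face_map(sigma, i) for i in range(3)]  # d_0, d_1, d_2 of a 2-simplex
clamped = face_map((0, 1), 2)                   # i >= n: the last vertex is dropped
```

For the 2-simplex `(0, 1, 2)`, $\hat{d}_0$ drops the source vertex and $\hat{d}_2$ drops the sink vertex, which is what gives the pair $(\hat{d}_i, \hat{d}_j)$ a sense of direction.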

**Directed Q-Connectivity** For an ordered simplicial complex $K$, let $(\sigma, \tau)$ be an ordered pair of simplices $\sigma \in K_s$ and $\tau \in K_t$, where $K_s$ denotes the set of $s$-simplices of $K$ and $s, t \geq q$. Let $(\hat{d}_i, \hat{d}_j)$ be an ordered pair of the $i$th and $j$th face maps. Then $(\sigma, \tau)$ is $q$-*near along* $(\hat{d}_i, \hat{d}_j)$ if either of the following conditions holds:

1. $\sigma \hookrightarrow \tau$,
2. $\hat{d}_i(\sigma) \hookleftarrow \alpha \hookrightarrow \hat{d}_j(\tau)$, for some $q$-simplex $\alpha \in K$,

where $\alpha \hookrightarrow \beta$ means that $\alpha$ is a face of $\beta$.

By closing the directed q-nearness transitively, the ordered pair $(\sigma, \tau)$ of simplices of $K$ is $q$-*connected along* $(\hat{d}_i, \hat{d}_j)$ if there is a sequence of simplices in $K$,

$$\sigma = \alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_n, \alpha_{n+1} = \tau,$$

such that any two consecutive simplices are $q$-*near along* $(\hat{d}_i, \hat{d}_j)$. The sequence is called a $q$-*connection* along $(\hat{d}_i, \hat{d}_j)$ between $\sigma$ and $\tau$, or a $(q, \hat{d}_i, \hat{d}_j)$-*connection* once the choices of $q$ and of the directions $\hat{d}_i$ and $\hat{d}_j$ are fixed. From now on we refer to $(q, \hat{d}_i, \hat{d}_j)$ as $(q, i, j)$.
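The definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by the lifting: simplices are vertex tuples, the relation in conditions 1 and 2 is read as face inclusion, the complex is assumed closed under faces (so a shared $q$-face is automatically a simplex of $K$), and all function names are hypothetical.

```python
from itertools import combinations


def face_map(sigma, i):
    """Omit the i-th vertex, or the last one when i exceeds the dimension."""
    idx = min(i, len(sigma) - 1)
    return sigma[:idx] + sigma[idx + 1:]


def is_face(alpha, beta):
    """True when alpha is an order-preserving subsequence of beta."""
    remaining = iter(beta)
    return all(v in remaining for v in alpha)


def q_near(sigma, tau, q, i, j):
    """Conditions 1 and 2 for the ordered pair (sigma, tau)."""
    if min(len(sigma), len(tau)) - 1 < q:
        return False
    if is_face(sigma, tau):
        return True
    # Condition 2: d_i(sigma) and d_j(tau) share a common q-face.
    shared = set(combinations(face_map(sigma, i), q + 1)) & set(
        combinations(face_map(tau, j), q + 1))
    return bool(shared)


def q_connected(simplices, sigma, tau, q, i, j):
    """Transitive closure of q-nearness, searching only within `simplices`."""
    seen, stack = {sigma}, [sigma]
    while stack:
        current = stack.pop()
        if current == tau:
            return True
        for candidate in simplices:
            if candidate not in seen and q_near(current, candidate, q, i, j):
                seen.add(candidate)
                stack.append(candidate)
    return False


triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
forward = q_connected(triangles, (0, 1, 2), (2, 3, 4), q=1, i=0, j=2)
backward = q_connected(triangles, (2, 3, 4), (0, 1, 2), q=1, i=0, j=2)
```

On this strip of three triangles the $(1,0,2)$-connection runs from `(0, 1, 2)` to `(2, 3, 4)` but not back, so the relation is genuinely asymmetric.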

**Theorem** The relation of being $(q,i,j)$-connected is a preorder on $\Sigma_{\geq q}$, the set of simplices of dimension at least $q$.

Instead of focusing on the path structure of the digraph, we look at the path structure of the high-dimensional simplices by exploring the $q$-connectivity preorder.

Different choices of $q, i, j$ allow us to emphasize different features of directionality. For instance, $(1,0,2)$-connected paths of 2-simplices exhibit directed flows aligned with the directionality of the total order of the adjacent simplices. On the other hand, the $(1,1,2)$ preorder reveals circular flows around a source vertex.

<p align="center">
    <img src="./figures/sp.jpeg" alt="Alt Text" style="max-width: 50%; max-height: 50%;">
</p>
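The contrast between the two choices can be reproduced on two toy families of 2-simplices. This is a minimal sketch: the helper `near_pairs` and both examples are made up for illustration, and only condition 2 of $q$-nearness is tested, since distinct simplices of the same dimension cannot be faces of one another.

```python
from itertools import combinations


def face_map(sigma, i):
    """Omit the i-th vertex, or the last one when i exceeds the dimension."""
    idx = min(i, len(sigma) - 1)
    return sigma[:idx] + sigma[idx + 1:]


def near_pairs(simplices, q, i, j):
    """Ordered pairs of distinct simplices that share a q-face along (d_i, d_j)."""
    pairs = []
    for sigma in simplices:
        left = set(combinations(face_map(sigma, i), q + 1))
        for tau in simplices:
            right = set(combinations(face_map(tau, j), q + 1))
            if sigma != tau and left & right:
                pairs.append((sigma, tau))
    return pairs


# Three triangles around the common source vertex 0.
fan = [(0, 1, 2), (0, 2, 3), (0, 3, 1)]
# Three triangles glued along edges in the direction of their vertex order.
strip = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]

fan_112 = near_pairs(fan, 1, 1, 2)
fan_102 = near_pairs(fan, 1, 0, 2)
strip_102 = near_pairs(strip, 1, 0, 2)
strip_112 = near_pairs(strip, 1, 1, 2)
```

With $(1,1,2)$ the fan closes up into a cycle of three triangles around vertex 0 and the strip is disconnected; with $(1,0,2)$ the roles are reversed and the strip becomes a directed chain.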

The $(q, i, j)$-connections carry homotopical information that differs from that of the original complex and that reflects the structure of the digraph. The following two digraphs each span a $2$-dimensional directed flag complex homotopy equivalent to the $2$-sphere, so homology cannot tell them apart. However, by examining the $(1,0,2)$ and $(1,1,2)$ preorders, we can homotopically distinguish these complexes. The $(1,1,2)$ preorder, in particular, detects circular flows within each hemisphere: the first complex has a circular flow only in the upper hemisphere, whereas the second exhibits circular flows in both the upper and lower hemispheres.

<p align="center">
    <img src="./figures/sph.jpeg" alt="Alt Text" style="max-width: 50%; max-height: 50%;">
</p>

## References

[1] Henri Riihimäki. [Simplicial q-Connectivity of Directed Graphs with Applications to Network Analysis](https://arxiv.org/pdf/2202.07307).

[2] Bengier Ulgen, Dane Taylor. [Simplicial cascades are orchestrated by the multidimensional geometry of neuronal complexes](https://arxiv.org/pdf/2201.02071).

[3] Dane Taylor, Florian Klimm. [Topological data analysis of contagion maps for examining spreading processes on networks](https://arxiv.org/pdf/1408.1168).

[4] D. Lütgehetmann, D. Govc, J.P. Smith, and R. Levi. [Computing persistent homology of directed flag complexes](https://arxiv.org/pdf/1906.10458).


# TO DO

1. Dataset Loading
   Implements the pipeline to load a dataset from the src domain. Since the challenge repository does not allow storing large files, loaders must download datasets from external sources into the datasets/ folder.
   This pipeline is provided for several graph-based datasets. For any other src domain, participants are allowed to transform graph datasets into the corresponding domain through our provided lifting mappings, or by simply dropping their connectivity to obtain point clouds.
   (Bonus) Designing a loader for a new dataset (one not already provided in the tutorials) will be positively taken into consideration in the final evaluation.

2. Pre-processing the Dataset
   Applies the lifting transform to the dataset.
   This needs to be done through the PreProcessor, which we provide in modules/data/preprocess/preprocessor.py.

3. Running a Model over the Lifted Dataset
   Creates a Neural Network model that operates over the dst domain, leveraging TopoModelX for higher order topologies or torch_geometric for graphs.
   Runs the model on the lifted dataset.

In [1]:
import csv
import time
import torch
import numpy as np
import networkx as nx
import scipy.sparse as sp
import pyflagsercount as pfc
import sys

sys.path.append("../../")
from modules.transforms.liftings.graph2combinatorial.sp_lifting import (
    DirectedFlagComplex as dfc,
)

# from datasets.data_loading import get_dataset, get_dataset_split

In [2]:
# With this cell any imported module is reloaded before each cell execution
%load_ext autoreload
%autoreload 2
from modules.data.load.loaders import GraphLoader
from modules.data.preprocess.preprocessor import PreProcessor
from modules.utils.utils import (
    describe_data,
    load_dataset_config,
    load_model_config,
    load_transform_config,
)

In [3]:
CHAMELEON = "chameleon"
CORNELL = "Cornell"
WISCONSIN = "Wisconsin"
TEXAS = "Texas"
ROMAN_EMPIRE = "directed-roman-empire"
SQUIRREL = "squirrel"
OGBN_ARXIV = "ogbn-arxiv"
SNAP_PATENTS = "snap-patents"
CORA_ML = "cora_ml"
CITESEER_FULL = "citeseer_full"
ARXIV_YEAR = "arxiv-year"
SYN_DIR = "syn-dir"

In [4]:
dataset_name = "cocitation_cora"
dataset_config = load_dataset_config(dataset_name)
loader = GraphLoader(dataset_config)


Dataset configuration for cocitation_cora:

{'data_domain': 'graph',
 'data_type': 'cocitation',
 'data_name': 'Cora',
 'data_dir': 'datasets/graph/cocitation',
 'num_features': 1433,
 'num_classes': 7,
 'task': 'classification',
 'loss_type': 'cross_entropy',
 'monitor_metric': 'accuracy',
 'task_level': 'node'}


In [5]:
dataset = loader.load()
describe_data(dataset)


Dataset only contains 1 sample:
 - Graph with 2708 vertices and 10556 edges.
 - Features dimensions: [1433, 0]
 - There are 0 isolated nodes.



In [6]:
dataset.edge_index.shape

torch.Size([2, 10556])
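Before lifting, it can be useful to check whether the loaded edge list is actually directed, since a symmetric edge list makes every clique appear in all of its vertex orderings. The sketch below assumes the edge index is available as two parallel lists of sources and targets (for a tensor, something like `dataset.edge_index.tolist()` would produce that shape); `directed_edge_set` and `is_symmetric` are hypothetical helpers, and the toy input stands in for real data.

```python
def directed_edge_set(edge_index):
    """Turn a [sources, targets] pair of lists into a set of loop-free directed edges."""
    sources, targets = edge_index
    return {(int(u), int(v)) for u, v in zip(sources, targets) if int(u) != int(v)}


def is_symmetric(edges):
    """True when every edge (u, v) is accompanied by its reverse (v, u)."""
    return all((v, u) in edges for (u, v) in edges)


# Toy stand-in for an edge index: 0 <-> 1 is reciprocal, 1 -> 2 and 2 -> 0 are not.
toy_edge_index = [[0, 1, 1, 2], [1, 0, 2, 0]]
toy_edges = directed_edge_set(toy_edge_index)
toy_is_symmetric = is_symmetric(toy_edges)
```

If the check returns `True` for a dataset, the directed flag complex carries no more orientation information than the undirected one, so a genuinely directed dataset may be a better showcase for the lifting.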

In [11]:
# Define transformation type and id
transform_type = "liftings"
# If the transform is a topological lifting, it should include both the type of the lifting and the identifier
transform_id = "graph2combinatorial/sp_lifting"

# Read yaml file
transform_config = {
    "lifting": load_transform_config(transform_type, transform_id)
    # other transforms (e.g. data manipulations, feature liftings) can be added here
}


Transform configuration for graph2combinatorial/sp_lifting:

{'transform_type': 'lifting',
 'transform_name': 'Graph2CombinatorialLifting',
 'd1': 2,
 'd2': 2,
 'q': 1,
 'i': 0,
 'j': 2,
 'complex_dim': 2,
 'offset': 'torch.tensor([[0], [0]])',
 'chunk_size': 1024,
 'save_path': 'None',
 'threshold': 1}


In [None]:
lifted_dataset = PreProcessor(dataset, transform_config, loader.data_dir)
describe_data(lifted_dataset)

In [None]:
# # %%

# def create_csv_datasets(dataset_name, dataset_dir="../dataset/"):
#     dataset, evaluator = get_dataset(dataset_name, dataset_dir)
#     source = dataset.edge_index[0].tolist()  # source
#     target = dataset.edge_index[1].tolist()  # target

#     csv_file_name = "./dataset/vis/original/" + dataset_name + ".csv"

#     with open(csv_file_name, "w", newline="") as file:
#         writer = csv.writer(file)

#         # Write the list content as rows
#         for a, b in zip(source, target):
#             writer.writerow([a, b])

#     print(f'CSV file "{csv_file_name}" created successfully.')


# def create_csv_condensations(dataset_name):

#     dataset_digraph = create_digraph_from_dataset(dataset_name)
#     condensation_digraph = nx.condensation(dataset_digraph)

#     condensation_digraph_edges = list(condensation_digraph.edges)

#     if dataset_name == "cora_ml":
#         dataset_name = "cora-ml"

#     if dataset_name == "citeseer_full":
#         dataset_name = "citeseer-full"

#     csv_file_name = "./dataset/vis/condensations/" + dataset_name + "-condensation.csv"

#     with open(csv_file_name, "w", newline="") as file:
#         writer = csv.writer(file)

#         # Write the list content as rows
#         for e in condensation_digraph_edges:
#             writer.writerow([e[0], e[1]])

#     print(f'CSV file "{csv_file_name}" created successfully.')


# def create_csv_condensations_from_dataset():

#     dataset_list = [
#         CHAMELEON,
#         ROMAN_EMPIRE,
#         SQUIRREL,
#         OGBN_ARXIV,
#         CORA_ML,
#         CITESEER_FULL,
#         ARXIV_YEAR,
#     ]

#     for dataset in dataset_list:
#         create_csv_datasets(dataset)
#         create_csv_condensations(dataset)


# def flagser_count(dataset_name, complex_dim=2, num_threads=4):
#     dataset_digraph = create_digraph_from_dataset(dataset_name)

#     sparse_adjacency_matrix = nx.to_scipy_sparse_array(dataset_digraph, format="csr")

#     start_time = time.time()

#     X = pfc.flagser_count(
#         sparse_adjacency_matrix,
#         threads=num_threads,
#         return_simplices=True,
#         max_dim=complex_dim,
#     )
#     end_time = time.time()
#     print("Time elapsed: ", end_time - start_time)
#     return X


# def create_digraph_from_dataset(dataset_name, dataset_dir="../dataset/"):
#     dataset, evaluator = get_dataset(dataset_name, dataset_dir)
#     dataset_digraph = nx.DiGraph()
#     dataset_digraph.add_edges_from(
#         list(zip(dataset.edge_index[0].tolist(), dataset.edge_index[1].tolist()))
#     )
#     print("Number of nodes: ", dataset_digraph.number_of_nodes(), " Number of edges: ", dataset_digraph.number_of_edges())
#     return dataset_digraph


# def create_flag_complex_from_dataset(dataset_name, dataset_dir, complex_dim=2):
#     dataset_digraph = create_digraph_from_dataset(dataset_name, dataset_dir)
#     flag_complex = dfc.DirectedFlagComplex(dataset_digraph, complex_dim)
#     return flag_complex


# def create_condensed_digraph_from_dataset(dataset_name):
#     dataset_digraph = create_digraph_from_dataset(dataset_name)
#     condensation_digraph = nx.condensation(dataset_digraph)
#     return condensation_digraph


# def dataset_stats(dataset_name, complex_dim=2):
#     G = create_digraph_from_dataset(dataset_name)
#     FlG = dfc.DirectedFlagComplex(G, complex_dim)
#     for d in range(complex_dim + 1):
#         print(dataset_name + " number of " + str(d) + "-simplices", len(FlG.complex[d]))
#     return FlG


# def qij(
#     dataset_name, dataset_dir, q, i, j, complex_dim=2, chunk_size=1024, save_path=None
# ):
#     FlG = create_flag_complex_from_dataset(dataset_name, dataset_dir, complex_dim)
#     return FlG.qij(q, i, j, chunk_size, save_path)


# if __name__ == "__main__":
#     DATASET_DIR = "../../dataset/"
#     qij(
#         WISCONSIN,
#         DATASET_DIR,
#         1,
#         0,
#         2,
#         complex_dim=2,
#         chunk_size=100,
#         # save_path="../../dataset/cornell/102.pt",
#         save_path=None
#     )
#     pass