# Graph-to-Combinatorial Path Lifting Tutorial

***
This notebook shows how to import a dataset, with the desired lifting, and how to run a neural network using the loaded data.

The notebook is divided into sections:

- [Loading the dataset](#loading-the-dataset) loads the config file for the data, creates a dataset object and describes it.
- [Loading and applying the lifting](#loading-and-applying-the-lifting) loads the config for the desired transformation and applies the path lifting to the dataset.
- [Create and run a combinatorial NN model](#create-and-run-a-combinatorial-nn-model) runs a forward pass of a simple model to check that the lifting creates the expected features and incidence matrices.

***

Note that, for simplicity, the notebook is set up to use a simple graph. However, several other datasets are available for you to experiment with.

To switch to one of the available datasets, change the *dataset_name* variable in [Loading the dataset](#loading-the-dataset) to one of the following names:

* cocitation_cora
* cocitation_citeseer
* cocitation_pubmed
* MUTAG
* NCI1
* NCI109
* PROTEINS_TU
* AQSOL
* ZINC

***

## Imports and utilities

In [None]:
# With this cell any imported module is reloaded before each cell execution
%load_ext autoreload
%autoreload 2
from modules.data.load.loaders import GraphLoader
from modules.data.preprocess.preprocessor import PreProcessor
from modules.utils.utils import (
    describe_data,
    load_dataset_config,
    load_model_config,
    load_transform_config,
)

## Loading the dataset

Here we load the `manual_dataset`. First, the dataset config is read from the corresponding yaml file (located in the `/configs/datasets/` directory), and then the data is loaded via the implemented `Loaders`.

In [None]:
dataset_name = "manual_dataset"
dataset_config = load_dataset_config(dataset_name)
loader = GraphLoader(dataset_config)

In [None]:
dataset = loader.load()
describe_data(dataset)

## Loading and Applying the Lifting - Path Lifting 

In [None]:
# Define transformation type and id
transform_type = "liftings"

# If the transform is a topological lifting, it should include both the type of the lifting and the identifier
transform_id = "graph2combinatorial/path_lifting"

# Read yaml file
transform_config = {
    "lifting": load_transform_config(transform_type, transform_id)
    # other transforms (e.g. data manipulations, feature liftings) can be added here
}

In this section we instantiate the lifting we want to apply to the data. Since we are lifting graphs to combinatorial complexes (CCs), we implemented the path lifting approach from [[1]](https://arxiv.org/abs/2406.04916). We first briefly recall some of the definitions used in this notebook from [[2]](https://arxiv.org/abs/2206.00606). Combinatorial complexes constitute a higher-order domain that can be viewed from three perspectives:
- as a simplicial complex whose cells and simplices are allowed to be missing;
- as a generalized cell complex with relaxed structure;
- or as a hypergraph enriched through the inclusion of a rank function.

__Definition__ (Neighbourhood function) _Let $S$ be a non-empty set. A neighbourhood function on $S$ is a function $\mathcal{N}: S \rightarrow \mathcal{P}(\mathcal{P}(S))$ that assigns to each point $x$ in $S$ a non-empty collection $\mathcal{N}(x)$ of subsets of $S$. The elements of $\mathcal{N}(x)$ are called neighbourhoods of $x$ with respect to $\mathcal{N}$._


__Definition__ (Combinatorial complex) _A combinatorial complex (CC) is a triple $(S, \mathcal{X}, rk)$ consisting of a set $S$, a subset $\mathcal{X}$ of $\mathcal{P}(S)\backslash \{\emptyset\}$, and a function $rk: \mathcal{X} \rightarrow \mathbb{N}$ with the following properties:_

 - _$\forall s \in S, \{s\} \in \mathcal{X}$;_
 - _the function $rk$ is order-preserving, which means that if $x, y \in \mathcal{X}$ satisfy $x \subseteq y$, then $rk(x) \leq rk(y)$._

_The elements of $S$ are called entities or vertices, the elements of $\mathcal{X}$ are called relations or cells, and $rk$ is called the rank function of the CC. The dimension of a CC is $\text{dim}(CC) = \max(rk(\mathcal{X}))$ and, for all $r \in [\![ 0, \text{dim}(CC)]\!]$, we denote by $\mathcal{X} _{r}$ the set of all cells of rank $r$, that is, $\mathcal{X} _{r} = rk^{-1}(r)$._
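To make the definition concrete, here is a minimal sketch in plain Python that checks the two CC properties on a toy example. The helper `is_combinatorial_complex` and the `{frozenset: rank}` representation are illustrative assumptions, not part of the repository.

```python
from itertools import combinations


def is_combinatorial_complex(vertices, rank):
    """Check the two CC properties for cells given as a {frozenset: rank} mapping.

    Hypothetical helper written for this illustration only.
    """
    cells = list(rank)
    # Every cell must be a non-empty subset of the vertex set S.
    if any(len(c) == 0 or not c <= vertices for c in cells):
        return False
    # Property 1: every singleton {s} is a cell.
    if any(frozenset({s}) not in rank for s in vertices):
        return False
    # Property 2: rk is order-preserving (x subset of y implies rk(x) <= rk(y)).
    for x, y in combinations(cells, 2):
        if x <= y and rank[x] > rank[y]:
            return False
        if y <= x and rank[y] > rank[x]:
            return False
    return True


vertices = frozenset({0, 1, 2, 3})
rank = {frozenset({v}): 0 for v in vertices}
rank.update({frozenset(e): 1 for e in [(0, 1), (1, 2), (2, 3)]})
rank[frozenset({0, 1, 2})] = 2  # a rank-2 cell; {0, 2} does not need to be a cell

is_cc = is_combinatorial_complex(vertices, rank)
dimension = max(rank.values())
# X_r = rk^{-1}(r): group the cells by rank.
cells_by_rank = {
    r: [c for c, rc in rank.items() if rc == r] for r in range(dimension + 1)
}
print(is_cc, dimension, {r: len(cs) for r, cs in cells_by_rank.items()})
```

Note that the rank-2 cell `{0, 1, 2}` is valid even though `{0, 2}` is not a cell: unlike a simplicial complex, a CC does not require every subset of a cell to be present.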

A lift represents a transformation from one featured domain to another featured domain, as thoroughly discussed in [[2]](https://arxiv.org/abs/2206.00606) and [[3]](https://arxiv.org/abs/2304.10031). For instance, the incorporation of rank-2 cells onto a graph, transforming it into a combinatorial complex, represents a lifting procedure. In [[3]](https://arxiv.org/abs/2304.10031) two lifting procedures are outlined:
- the __loop-based__ method;
- and the __path-based__ method.

We implemented the path-based approach, which is defined as follows in [[1]](https://arxiv.org/abs/2406.04916):

__Definition__ (Path-based CC of a graph) _Let $G = (S, E)$ be a graph. We associate a CC structure with $G$ that considers paths in $G$. The path-based CC of $G$ is denoted by $CC _{P}(G)$ and consists of $0$-cells, $1$-cells and $2$-cells specified as follows. First, one sets $\mathcal{X} _{0}$ and $\mathcal{X} _{1}$ in $CC _{P}(G)$ to be the nodes and edges of $G$, respectively. A $2$-cell in $CC _{P}(G)$ is constructed as follows. Let $\mathcal{S}$ be a set of source nodes and $k\geq 1$ be the path length. Both of these objects are parameters (see the function `path_based_lift_CC` in `path_lifting.py`)._

_Let $P$ be the set of all paths in $G$ that start from a node belonging to $\mathcal{S}$ and have exactly $k$ distinct nodes. A $2$-cell in $CC _{P}(G)$ is a set $$ C = \{x _{0}^{1},  \ldots, x _{0}^{k}\} \subset \mathcal{X} _{0}$$ such that for all $x \in (\{x _{0}^{1},  \ldots, x _{0}^{k}\})$ there exists a permutation $\pi _{k}$ such that $\pi _{k}(x) \in P$ and such that for all $i \in [\![1, k]\!], (\pi _{k}(x) _{i}, \pi _{k}(x) _{i+1 \bmod k}) \in \mathcal{X}$._

- We start from the graph in `manual_dataset`; the same procedure can be reproduced on any of the datasets listed above.
- We choose one or more source nodes and a path length $k = 3$.
- We identify the nodes that belong to the same path of length $k$ in the graph, where the path starts at one of the source nodes.
- We group these nodes together to form a rank-2 cell, which is added to the graph to create a combinatorial complex (i.e., the lifted topology).
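The steps above can be sketched in plain Python as follows. This is only an illustration of the informal procedure on a toy graph, not the repository's `path_based_lift_CC`; the function names `paths_with_k_nodes` and `path_two_cells` and the toy edge list are assumptions made for this example.

```python
def paths_with_k_nodes(adjacency, source, k):
    """Enumerate simple paths that start at `source` and visit exactly k nodes."""
    paths = []
    stack = [(source,)]
    while stack:
        path = stack.pop()
        if len(path) == k:
            paths.append(path)
            continue
        # Extend the path with every neighbour that has not been visited yet.
        for neighbour in adjacency.get(path[-1], ()):
            if neighbour not in path:
                stack.append(path + (neighbour,))
    return paths


def path_two_cells(edges, source_nodes, k):
    """Group the nodes of each k-node path from a source node into a rank-2 cell."""
    adjacency = {}
    for u, v in edges:  # undirected graph
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    cells = set()
    for source in source_nodes:
        for path in paths_with_k_nodes(adjacency, source, k):
            # Paths visiting the same nodes in a different order give the same cell.
            cells.add(frozenset(path))
    return cells


# Toy graph (hypothetical, not the graph stored in `manual_dataset`).
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
two_cells = path_two_cells(edges, source_nodes=[0], k=3)
print(sorted(sorted(cell) for cell in two_cells))
```

With source node `0` and $k = 3$, the paths `0-1-2` and `0-2-1` collapse into the single cell `{0, 1, 2}`, and the path `0-2-3` yields `{0, 2, 3}`. Storing cells as sets is what removes the dependence on the order in which a path is traversed.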
  
***
[[1]](https://arxiv.org/abs/2406.04916) Carrel, A. (2024). Combinatorial Complex Score-based Diffusion Modelling through Stochastic Differential Equations (PhD thesis).

[[2]](https://arxiv.org/abs/2206.00606) Hajij, M., Zamzmi, G., Papamarkou, T., Miolane, N., Guzmán-Sáenz, A., Ramamurthy, K. N., et al. (2022). Topological deep learning: Going beyond graph data.

[[3]](https://arxiv.org/abs/2304.10031) Papillon, M., Sanborn, S., Hajij, M., & Miolane, N. (2023). Architectures of Topological Deep Learning: A Survey on Topological Neural Networks.
***

We then apply the transform via the `PreProcessor` class:

In [None]:
lifted_dataset = PreProcessor(dataset, transform_config, loader.data_dir)
describe_data(lifted_dataset)

## Create and Run a Combinatorial NN Model

In this section a simple model is created to test that the lifting works as intended. The model uses `x_0`, `x_1` and `x_2`, which are the features of the nodes, edges and rank-2 cells respectively. It also uses the `adjacency_1`, `incidence_1` and `incidence_2` matrices, so the lifting must add them to the data.
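As a rough illustration of what these matrices encode, the sketch below builds dense, unsigned versions for a toy complex in plain Python. It assumes that `incidence_r` relates rank-$(r-1)$ cells (rows) to rank-$r$ cells (columns) by containment, and that `adjacency_1` links two edges when they lie in a common rank-2 cell. The repository may use a different convention (for example sparse tensors), and the helper names here are hypothetical.

```python
def incidence_matrix(lower_cells, upper_cells):
    """Entry (i, j) is 1 when lower cell i is contained in upper cell j."""
    return [
        [1 if lower <= upper else 0 for upper in upper_cells]
        for lower in lower_cells
    ]


def adjacency_from_incidence(incidence):
    """Entry (i, j) is 1 when distinct cells i and j share a higher-rank cell."""
    n = len(incidence)
    return [
        [
            1
            if i != j and any(a and b for a, b in zip(incidence[i], incidence[j]))
            else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy complex (hypothetical): 4 nodes, 4 edges and 2 rank-2 cells.
nodes = [frozenset({v}) for v in range(4)]
edges = [frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (0, 2)]]
two_cells = [frozenset({0, 1, 2}), frozenset({0, 2, 3})]

incidence_1 = incidence_matrix(nodes, edges)      # shape: nodes x edges
incidence_2 = incidence_matrix(edges, two_cells)  # shape: edges x rank-2 cells
adjacency_1 = adjacency_from_incidence(incidence_2)  # shape: edges x edges

for name, matrix in [
    ("incidence_1", incidence_1),
    ("incidence_2", incidence_2),
    ("adjacency_1", adjacency_1),
]:
    print(name, matrix)
```

The shapes are the useful sanity check here: the number of rows of `incidence_2` must match the number of columns of `incidence_1`, because both count the edges.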

In [None]:
from modules.models.combinatorial.hmc import HMCModel

model_type = "combinatorial"
model_id = "hmc"
model_config = load_model_config(model_type, model_id)
model = HMCModel(model_config, dataset_config)

y_hat = model(lifted_dataset)
print(y_hat)

If everything is correct, the cell above should execute without errors and print the model output.