# 2 
> GDL

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]


## __1.2 $\dots$ equivariance__ 

We henceforth consider groups $G$ acting on some space of signals $\mathcal{X}(\Omega)$ over a (fixed) signal domain $\Omega$. The group action is encoded in a linear representation $\rho$, assumed to be described in a given _input coordinate system_, just as we would need to specify to a computer. Thus, if $\dim ( \mathcal{X}(\Omega) ) = n$, for each $g \in G$, we have that $\rho(g)$ is an $n \times n$ matrix with real entries.  

The learning dynamics occur in the _hypothesis space_ $\textsf{H}$, a collection of functions 
$$
\textsf{H} \subset \{\, F : \mathcal{X}(\Omega) \to \mathcal{Y}\, \} ,
$$
where $\mathcal{Y}$ is some unspecified (context-specific) space of output signals. We describe $\textsf{H}$ explicitly in the "learning blueprint", up to hyperparameters such as depth and layer widths. 

A key aspect of this blueprint is that $F \in \textsf{H}$ should be expressed as a composition of functions, most of which are $G$-equivariant. The requirement of $G$-equivariance constitutes a _geometric prior_ from which one can derive the architecture of a generic CNN when $G$ corresponds to translations, and a family of generalizations for other domains and group actions. 

___

__Def__ $\quad$ Let $\rho$ be a representation of group $G$ over $\mathcal{X}(\Omega)$, and let $\rho'$ be a representation of the same group over the $\mathbb{R}$-vector space $\mathcal{Y}$. A function $F: \mathcal{X}(\Omega) \to \mathcal{Y}$ is _$G$-equivariant_ if for all $g \in G$ and all $x \in \mathcal{X}(\Omega)$, we have $\rho'(g) F(x) = F ( \rho(g) x )$. We say $F$ is _$G$-invariant_ if this holds when $\rho'$ is the trivial representation, which is to say $F ( \rho(g) x) = F(x)$ for all $g \in G$ and $x \in \mathcal{X}(\Omega)$.

___
___

__Example__ $\quad$ Suppose we are given either a set $\mathcal{V}$, or more generally a graph $\mathcal{G} = ( \mathcal{V}, \mathcal{E} )$, with $\# \mathcal{V} = n$ in either case. As discussed, a signal over $\mathcal{V} = \{ v_1, \dots, v_n \}$ can be thought of as a collection of node features $\{ \, x(v_1), \dots, x(v_n) \, \}$, with $x(v_j) \in \mathbb{R}^s$. We stack the node features as rows of the $n \times s$ _design matrix_

$$
\mathbf{X} = 
\left[ 
\begin{matrix}
x_1\\ 
\vdots\\
x_n
\end{matrix}
\right] ,
$$

which is effectively the same object as signal $x$, provided the vertices are labeled as described. The action of $g \in \mathfrak{S}_n$ on this input data is naturally represented as an $n \times n$ permutation matrix, $\mathbf{P} \equiv \rho(g)$.

One standard way to construct a permutation invariant function $F$ in this setting is through the following recipe: a function $\psi$ is applied independently to every node's features, and a function $\varphi$ is then applied to the sum of the outputs of $\psi$:
$$
    F( \mathbf{X}) = \varphi \left( \, \sum_{j \, = \, 1}^n \psi(x_j) \, \right) .
$$ 
Such a function can be thought of as reporting some "global statistic" of signal $x$.
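
The recipe above can be checked numerically. The following is a minimal sketch, assuming NumPy; the particular choices of $\psi$ (a `tanh` of a linear map) and $\varphi$ (a linear readout), along with the names `W_psi` and `w_phi`, are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, d = 5, 3, 4                 # nodes, input channels, hidden width (arbitrary)

# Hypothetical parameters; any fixed psi and varphi would do.
W_psi = rng.normal(size=(s, d))
w_phi = rng.normal(size=d)

def psi(X):
    """Apply the same map to every row (node feature vector) of X."""
    return np.tanh(X @ W_psi)

def varphi(z):
    """Readout applied to the sum-aggregated node outputs."""
    return float(z @ w_phi)

def F(X):
    """F(X) = varphi( sum_j psi(x_j) ), with x_j the rows of X."""
    return varphi(psi(X).sum(axis=0))

X = rng.normal(size=(n, s))
P = np.eye(n)[rng.permutation(n)]  # random n x n permutation matrix

# Reordering the rows of X should leave F unchanged (up to round-off).
print(np.isclose(F(P @ X), F(X)))
```

The sum is what removes the dependence on row order; replacing it with another symmetric aggregator such as the mean or the maximum would preserve invariance as well.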


Equivariance manifests even more naturally. Suppose we want to apply a function $F$ to the signal to "update" the node features, producing a set of _latent_ (node) features.  

This is the situation, for instance, when a NN outputs an image segmentation mask: the underlying domain does not change, but the features at each node are updated, and the latent features need not even have the same number of channels as the input features. 

We can stack these latent features into another design matrix, $\mathbf{H} = F(\mathbf{X})$. The order of the rows of $\mathbf{H}$ should clearly be tied to the order of the rows of $\mathbf{X}$; that is, $F$ should be permutation equivariant: for any permutation matrix $\mathbf{P}$, it holds that $F(\mathbf{P} \mathbf{X} ) = \mathbf{P} F(\mathbf{X})$. 

As a concrete example of a permutation equivariant function, consider a weight matrix $\theta \in \mathbb{R}^{s \, \times \, s'}$. This matrix can be used to map a length-$s$ feature vector at a given node to a new, updated feature vector with $s'$ channels. 

Applying this matrix to each row of the design matrix is an example of a _shared node-wise linear transform_, and constitutes a linear, $\mathfrak{S}_n$-equivariant map. Note that, if one wishes to express this map in coordinates, it seems the correct object to consider is a $3$-tensor, "$\Theta,$" constructed as a stack of $n$ copies of $\theta$.
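
A short numerical sketch of the shared node-wise linear transform, assuming NumPy. The variable names and the use of `np.einsum` to realize the stacked tensor $\Theta$ are illustrative choices, not something fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, s_out = 5, 3, 2                    # s_out plays the role of s'

X = rng.normal(size=(n, s))              # design matrix, one row per node
theta = rng.normal(size=(s, s_out))      # shared weight matrix

def F(X):
    """Apply theta to every row of X: H = X theta."""
    return X @ theta

P = np.eye(n)[rng.permutation(n)]        # random permutation matrix

# Equivariance: F(P X) = P F(X).
print(np.allclose(F(P @ X), P @ F(X)))

# The same map written with a 3-tensor Theta made of n copies of theta.
Theta = np.broadcast_to(theta, (n, s, s_out))
H_tensor = np.einsum("ni,nij->nj", X, Theta)
print(np.allclose(H_tensor, F(X)))
```

Because every node uses the same $\theta$, permuting the rows before or after the transform gives the same result; a stack of $n$ distinct matrices would in general break this.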

___
___

The above example considered signals over the nodes only, so label permutation symmetry applies equally well regardless of the graph structure (or lack of structure) of the underlying domain. 

In the case that signals $x$ have a domain with graph structure, encoded by an adjacency matrix $\mathbf{A}$, it is natural to work with a hypothesis space that recognizes this structure.

This is to say that we wish to consider functions $F \in \textsf{H}$ with $F \equiv F( \mathbf{X}, \mathbf{A} )$. Such a function is (label) _permutation invariant_ if $F( \mathbf{P} \mathbf{X},\, \mathbf{P} \mathbf{A} \mathbf{P}^{\textrm{T}} ) = F ( \mathbf{X}, \mathbf{A})$, and is _permutation equivariant_ if
$$
F( \mathbf{P} \mathbf{X},\, \mathbf{P} \mathbf{A} \mathbf{P}^{\textrm{T}} ) = \mathbf{P}F( \mathbf{X}, \mathbf{A})
$$
for any permutation matrix $\mathbf{P}$. 
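
Both conditions can be tested on a small random graph. The sketch below assumes NumPy and uses a hypothetical choice of $F$, namely $F(\mathbf{X}, \mathbf{A}) = \mathbf{A} \mathbf{X} \theta$ (sum the features of each node's neighbors, then apply a shared linear map); summing its rows gives an invariant function.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, s_out = 6, 3, 2

X = rng.normal(size=(n, s))
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = upper + upper.T                      # symmetric 0/1 adjacency, no self-loops
theta = rng.normal(size=(s, s_out))      # hypothetical shared weights

def F_equiv(X, A):
    """Row u is the sum of neighbor features of u, mapped by theta."""
    return A @ X @ theta

def F_inv(X, A):
    """Sum the equivariant output over nodes to get a graph-level output."""
    return F_equiv(X, A).sum(axis=0)

P = np.eye(n)[rng.permutation(n)]        # random permutation matrix
X_p, A_p = P @ X, P @ A @ P.T            # relabel nodes consistently

print(np.allclose(F_equiv(X_p, A_p), P @ F_equiv(X, A)))  # equivariance
print(np.allclose(F_inv(X_p, A_p), F_inv(X, A)))          # invariance
```

Note that the adjacency matrix is permuted on both sides: rows and columns are both indexed by nodes, so a relabeling acts as $\mathbf{P} \mathbf{A} \mathbf{P}^{\textrm{T}}$.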

___

__Rmk__ $\quad$ One can characterize linear $\mathfrak{S}_n$-equivariant functions on nodes. From [BBCV21]:

> "One can verify any such map can be written as a linear combination of two generators, the identity and the average. As observed by Maron et al. 2018, any linear permutation equivariant $\mathbf{F}$ can be expressed as a linear combination of fifteen linear generators; remarkably, this family is independent of $n \equiv \# \mathcal{V}$."

___
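
As a small sanity check of the first claim in the quote, the sketch below (assuming NumPy, and restricting to single-channel signals, $s = 1$) verifies that a linear combination of the identity and the averaging matrix commutes with a permutation. The coefficients `a` and `b` are arbitrary. This checks only that such combinations are equivariant, not that every linear equivariant map has this form.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a, b = 0.7, -1.3                   # arbitrary coefficients, for illustration only

I = np.eye(n)                      # identity generator
Avg = np.ones((n, n)) / n          # averaging generator
L = a * I + b * Avg                # candidate linear equivariant map

P = np.eye(n)[rng.permutation(n)]  # random permutation matrix
x = rng.normal(size=n)             # single-channel signal on n nodes

# Equivariance: L (P x) = P (L x).
print(np.allclose(L @ (P @ x), P @ (L @ x)))
```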

__Rmk__ $\quad$ Amongst the generators just discussed, the geometric learning blueprint "specifically advocates" for those that are also local, in the sense that the output on node $u$ directly depends on its neighboring nodes in the graph. 

___

We can formalize this constraint explicitly, by defining the _$1$-hop neighborhood_ of node $u$ as
$$
\mathcal{N}(u) := \{ v : \{ u,v \} \in \mathcal{E} \} ,
$$
as well as the corresponding _neighborhood features_, 
$$
\mathbf{X}_{\mathcal{N}(u)} := \{ \{ \, x_v : v \in \mathcal{N}(u) \, \} \} ,
$$
which is a multiset (indicated by double braces) for the simple reason that distinct nodes may be decorated with the same feature vector. 
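
The two definitions can be made concrete on a toy graph. This is a minimal sketch in plain Python; the graph, the feature values, and the use of `collections.Counter` to model a multiset are all assumptions made for illustration.

```python
from collections import Counter

# Toy graph on nodes 0..3 with undirected edges stored as 2-element sets.
edges = [{0, 1}, {0, 2}, {2, 3}]
# Node features; nodes 1 and 2 deliberately share the same feature vector.
features = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (0.0, 1.0), 3: (1.0, 0.0)}

def neighborhood(u):
    """N(u) = { v : {u, v} in E }."""
    return {v for e in edges if u in e for v in e if v != u}

def neighborhood_features(u):
    """Multiset of features over N(u), stored as a Counter of tuples."""
    return Counter(features[v] for v in neighborhood(u))

print(neighborhood(0))            # {1, 2}
print(neighborhood_features(0))   # Counter({(0.0, 1.0): 2})
```

Node 0 has two neighbors carrying the same feature vector; an ordinary set would collapse them into one element, whereas the multiset records the multiplicity 2.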


___
___

__Example__ $\quad$ The node-wise linear transformation described in the previous example can be thought of as operating on "$0$-hop neighborhoods" of nodes. We generalize this with an example of a function operating on $1$-hop neighborhoods. Instead of a fixed map between feature spaces $\theta : \mathbb{R}^s \to \mathbb{R}^{s'}$, cloned to a pointwise map $\Theta$, we specify a _local_ function 
$$
\varphi \equiv \varphi( \, x_u, \, \mathbf{X}_{\mathcal{N}(u)} \, )
$$ 

operating on the features of a node as well as those of its $1$-hop neighborhood. 


We may construct a permutation equivariant function $\Phi$ (the analogue of $\Theta$ here, just as $\varphi$ corresponds to $\theta$) by applying $\varphi$ to each $1$-hop neighborhood in isolation, and then collecting the outputs as the rows of a new feature matrix. As before, for $\mathcal{V} = \{ v_1, \dots, v_n \}$, we write $x_j$ in place of $x_{v_j}$, and we similarly write $\mathcal{N}_j$ instead of $\mathcal{N}( v_j )$.


$$
\Phi ( \mathbf{X}, \mathbf{A} ) = 
\left[
\begin{matrix}
\varphi( \, x_1 , \, \mathbf{X}_{\mathcal{N}_1} \, ) \\
\varphi( \, x_2 , \, \mathbf{X}_{\mathcal{N}_2} \, ) \\
\vdots \\
\varphi( \, x_n , \, \mathbf{X}_{\mathcal{N}_n} \, )
\end{matrix}
\right]
$$

The permutation equivariance of $\Phi$ rests on the output of $\varphi$ being independent of the ordering of the nodes in $\mathcal{N}_u$. Thus, if $\varphi$ is permutation invariant in its second argument (as is the case for a local averaging, for example), this property is satisfied. 
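
The construction of $\Phi$ can be sketched row by row, mirroring the displayed formula. The code assumes NumPy, and the specific $\varphi$ used here (sum the neighbor features, mix with the node's own features through hypothetical weights `W_self` and `W_nbr`, then apply `tanh`) is only one possible choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, s_out = 6, 3, 4

# Hypothetical parameters of the local function varphi.
W_self = rng.normal(size=(s, s_out))
W_nbr = rng.normal(size=(s, s_out))

def varphi(x_u, X_nbrs):
    """Local function; X_nbrs holds one row per neighbor, in arbitrary order."""
    agg = X_nbrs.sum(axis=0)             # the sum does not depend on row order
    return np.tanh(x_u @ W_self + agg @ W_nbr)

def Phi(X, A):
    """Apply varphi to each node and its 1-hop neighborhood, then stack rows."""
    rows = []
    for u in range(len(X)):
        nbrs = np.flatnonzero(A[u])      # indices v with A[u, v] != 0
        rows.append(varphi(X[u], X[nbrs]))
    return np.stack(rows)

X = rng.normal(size=(n, s))
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = upper + upper.T                      # symmetric 0/1 adjacency, no self-loops
P = np.eye(n)[rng.permutation(n)]        # random permutation matrix

# Equivariance: Phi(P X, P A P^T) = P Phi(X, A).
print(np.allclose(Phi(P @ X, P @ A @ P.T), P @ Phi(X, A)))
```

If `varphi` instead depended on the order of the rows of `X_nbrs` (for example, by concatenating them), the check would generally fail, since relabeling changes the order in which neighbors are listed.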

___
___

> [BBCV21] : "The choice of $\varphi$ plays a crucial role in the expressive power of the learning scheme."

When considering signals $x \equiv \mathbf{X}$ over graphs, it is natural to consider a hypothesis space whose functions operate on the pair $( \mathbf{X}, \mathbf{A})$, where $\mathbf{A}$ is the adjacency matrix of the signal domain. Thus, for such signals the domain naturally becomes part of the input. 

The GDL blueprint distinguishes between two contexts: one in which the input domain is fixed, and another in which the input domain varies from signal to signal. The preceding example demonstrates that, even in the former context, it can be essential that elements of $\textsf{H}$ treat the (fixed) domain as an argument. 


When the signal domain has geometric structure, it can often be leveraged to construct a _coarsening operator_, one of the components of a GDL block in the learning blueprint. Such an operator passes a signal $x \in \mathcal{X}(\Omega)$ to a signal $y \in \mathcal{X}( \Omega')$, where $\Omega'$ is a "coarsened version" of $\Omega$. 
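
As one hypothetical instance of a coarsening operator, consider a signal on a $1$-dimensional grid of $n$ nodes, coarsened by averaging consecutive groups of nodes. The sketch below assumes NumPy; the function name `coarsen_grid` and the choice of average pooling are illustrative only.

```python
import numpy as np

def coarsen_grid(X, factor=2):
    """Average-pool consecutive groups of `factor` nodes on a 1-D grid.

    X has shape (n, s): n grid nodes with s channels each.
    The output has shape (n // factor, s) and lives on the coarser grid.
    """
    n, s = X.shape
    if n % factor != 0:
        raise ValueError("n must be divisible by factor")
    return X.reshape(n // factor, factor, s).mean(axis=1)

X = np.arange(16, dtype=float).reshape(8, 2)   # signal on 8 nodes, 2 channels
Y = coarsen_grid(X)                            # signal on 4 nodes, 2 channels

print(Y.shape)   # (4, 2)
print(Y[0])      # [1. 2.]
```

The grouping of nodes relies on the grid's notion of adjacency; a bare set of nodes offers no such canonical grouping.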

As the blueprint calls for a sequence of such blocks, we often wish to act on the latent signal $y$ by a pointwise non-linearity, followed by the action of an equivariant function on signals in $\mathcal{X}( \Omega')$. 

The domain may be fixed for each input, but this domain changes as the signal passes through the layers of the NN, and it is natural that the functions the NN is built out of should pass this data forward. 

> [BBCV21] "Due to their additional structure, graphs and grids, unlike sets, can be coarsened in a non-trivial way, giving rise to a variety of pooling operations... more precisely, we cannot define a non-trivial coarsening assuming set structure alone. There exist established approaches that infer topological structure from unordered sets, and those can admit non-trivial coarsening"