# notebook 2
> Notes on the geometric deep learning [blueprint](https://arxiv.org/abs/2104.13478). 

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]



$\quad$ Modern neural network (NN) design is built on two algorithmic principles: hierarchical feature learning (concerning the NN architecture), and learning by local gradient-descent driven by backpropagation (concerning the dynamics undergone by the NN during its training). An instance of training data is modeled as an element of some high-dimensional vector space, making a generic learning problem subject to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). Most tasks of interest are not generic; the data comes from the physical world, and so it inherits low-dimensional structure and symmetries. Requiring an NN to respect the symmetries of the data upon which it acts amounts to a prior. The notes [BBCV21] construct a blueprint for neural network architecture incorporating these priors, termed _geometric priors_ throughout the notes. Importantly, this blueprint provides a unified perspective of the most successful neural network architectures, at least at the level of the building blocks chosen for the network.

$\quad$ I am choosing to start the notes with the definition of a category. This is in part because it leads to a definition of a group of symmetries acting on some general 'object', and the goal of GDL is to incorporate into the NN architecture the symmetries of (and which act on) input data. I am also trusting that it will come in handy, at some point, following [Bartosz Milewski](https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-preface/)'s book on category theory for programmers: _"...category theory is a treasure trove of extremely useful programming ideas... Composition is at the very root of category theory — it’s part of the definition of the category itself. And I will argue strongly that composition is the essence of programming."_  

$\quad$ Moreover, the 'geometric' constraints in GDL are summed up in a property called _equivariance_. Functors, mathematical objects central to category theory, are in a sense characterized by equivariance. They could become a useful perspective for describing some things in the notes, eventually. 

# algebra

_References_

* [Bronstein-Bruna-Cohen-Veličković 2021](https://arxiv.org/abs/2104.13478), _Geometric deep learning_

* [A first impression of group representations](https://wlou.blog/2018/06/22/a-first-impression-of-group-representations/)

## graphs, categories, groups


___

### __Definition.__ ( graph ) 

$\quad$ A __graph__ is a pair $G = (\textrm{V}, \textrm{E})$, where $\textrm{V}$ is a set whose elements are called __vertices__, and $\textrm{E}$ is a set of __edges__. Each edge is a multiset of exactly two vertices in $\textrm{V}$, not necessarily distinct. 

___


___

### __Definition.__ ( directed graph ) 

$\quad$ A __directed graph__ is a pair of sets  $G = (\textrm{V}, \textrm{A})$ of vertices and arrows (aka, directed edges). An __arrow__ is an ordered pair of vertices. Self-edges have no distinct orientations.

___

___

### __Definition.__ ( domain and codomain operations )

$\quad$ Consider an arrow $f$ of a directed graph $G = ( \textrm{V}, \textrm{A})$, specifically $f \equiv (a,b) \in \textrm{A}$, with $a,b \in \textrm{V}$. The operations $\textsf{dom}$ and $\textsf{cod}$ act on the arrows $f \in \textrm{A}$ via 
$$
\textsf{dom}f = a,\, \textsf{cod} f = b\,,
$$ 
and are called the __domain__ operation and __codomain__ operation,  respectively. 

___

$\quad$ Given arrows $f$ and $g$ of some directed graph, say that the ordered pair $(g,f)$ is a __composable pair__ if $\textsf{dom} g = \textsf{cod} f$. 
Going forward, we express the relations $a = \textsf{dom} f$ and $b = \textsf{cod} f$ equivalently as
$$
f : a \to b \, \quad \text { or  } \, \quad a \xrightarrow[\,]{ f} b  
$$

The next definition formalizes the behavior of a collection of structure-respecting maps between mathematical objects. 

___

### __Definition.__ ( category ) 

$\quad$ A __category__ is a directed graph $\textsf{C} = (\textrm{O},\textrm{A})$, whose vertices $\textrm{O}$ we call __objects__, such that 

1. For each object $a \in \textrm{O}$, there is a unique __identity__ arrow $\textrm{A} \ni \textrm{id}_a : a \to a$, defined by the following property: for all arrows $f : b \to a$ and $g : a \to c$, composition with the identity arrow $\textrm{id}_a $ gives
$$
\textrm{id}_a \circ f = f \quad \text{ and } \quad g \circ \textrm{id}_a = g
$$

2. For each composable pair $(g, f)$ of arrows, there is a unique arrow $g \circ f$ called their __composite__, with $g \circ f : \textsf{dom} f \to \textsf{cod} g$, such that the composition operation is associative. Namely, for given objects and arrows in the configuration 
$$a \xrightarrow[\,]{ f} b \xrightarrow[\,]{ g} c \xrightarrow[\,]{ k} d \,,
$$
one always has the equality $k \circ (g \circ f) = (k \circ g ) \circ f$.

___

$\quad$ Given a category $\textsf{C} = (\textrm{O},\textrm{A})$, let
$$
\textsf{hom} (b,c) := \{ \, f : f \in \textrm{A},  \, \textsf{dom} f = b \in \textrm{O}, \, \textsf{cod} f = c \in \textrm{O} \, \}
$$
denote the set of arrows from $b$ to $c$. We use the terms __morphism__ and arrow interchangeably. 


$\quad$ Groups are collections of symmetries. A __group__ $\textsf{G}$ is a category $\textsf{C} = (\textrm{O}, \textrm{A})$ with $\textrm{O} = \{ o \}$ ( so that we may identify $\textsf{G}$ with the collection of arrows $\textrm{A}$ ) such that each arrow has a unique inverse: for $g \in \textrm{A}$, there is an arrow $h$ such that $g \circ h = h \circ g = \textrm{id}_o$. Each arrow $g \in \textrm{A}$ thus has $\textsf{dom} g = \textsf{cod} g = o$. As remarked, the arrows $g \in \textrm{A}$ correspond to group elements $g \in \textsf{G}$. The categorical interpretation suggests that the group acts on some abstract object $o \in \textrm{O}$. In the present context, we care how groups act on data, and how this action is represented to a computer. 
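$\quad$ As a minimal sketch of this definition, the snippet below models the arrows of a one-object category as the permutations of three points, with composition of arrows given by composition of permutations. The choice of example and the names `compose`, `identity` and `inverse` are illustrative assumptions, not notation from [BBCV21]. It checks the two category axioms and the existence of inverses by brute force.

```python
from itertools import permutations

# Arrows of a one-object category: permutations of {0, 1, 2}, stored as
# tuples p with p[i] the image of i.
arrows = list(permutations(range(3)))
identity = (0, 1, 2)

def compose(g, f):
    """Composite g ∘ f: apply f first, then g."""
    return tuple(g[f[i]] for i in range(len(f)))

def inverse(g):
    """The arrow h with g ∘ h = h ∘ g = identity."""
    h = [0] * len(g)
    for i, gi in enumerate(g):
        h[gi] = i
    return tuple(h)

# Category axioms: identity and associativity.
assert all(compose(identity, g) == g == compose(g, identity) for g in arrows)
assert all(
    compose(k, compose(g, f)) == compose(compose(k, g), f)
    for f in arrows for g in arrows for k in arrows
)
# Group axiom: every arrow is invertible.
assert all(
    compose(g, inverse(g)) == identity == compose(inverse(g), g) for g in arrows
)
```

Since there is a single object, every pair of arrows is composable, which is why `compose` needs no domain or codomain check.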

## group representations

$\quad$ Linear representation theory uses linear algebra to study groups, by 'representing' group elements as self-maps on a vector space $V$. When $V$ is finite dimensional, and is given a coordinate basis, group elements correspond to square matrices. Consider a function $\varphi : \textsf{G} \times V \to V$, where $\textsf{G}$ is a group, and where $V$ is a vector space over $\mathbb{R}$. This function will encode how group elements $g$ are identified with self-maps of $V$. Filling in the first coordinate with $g$ produces the corresponding self-map $\varphi(g, \cdot) : V \to V$. When the map $\varphi$ is understood, or left general, we write $g.v$ in place of $\varphi(g,v)$, and we write $(g.)$ in place of $\varphi(g, \cdot)$. 

$\quad$ The 'representatives' $(g.)$ of these group elements $g$ can always be composed. If this compositional structure is compatible with the original group operation, we say $\varphi$ is a __group action__ on $V$. Specifically, $\varphi$ should satisfy $e.v = v$ for all $v \in V$, where $e$ denotes the identity element of $\textsf{G}$, and in general one has $(gh).v  = g.(h.v)$. The group action $\varphi$ is __$\mathbb{R}$-linear__ if it is compatible with the $\mathbb{R}$-vector space structure on $V$, i.e. additive and homogeneous. Specifically, if for all $v,w \in V$ and all scalars $\lambda \in \mathbb{R}$, one has $g.(v+w) = g.v + g.w$ and $g.(\lambda v) = \lambda \, g.v$. 

___

### __Definition.__ ( linear representation )

 $\quad$ 
An __$\mathbb{R}$-linear representation__ of group $\textsf{G}$ over $\mathbb{R}$-vector space $V$ is an $\mathbb{R}$-linear group action on $V$.

___
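$\quad$ To make the definition concrete, here is a small sketch of the permutation representation of the symmetric group on three letters over $\mathbb{R}^3$. The helper names (`rho`, `act`, `matmul`) and the test vectors are assumptions made for illustration. The loop checks compatibility with the group operation, $\rho(gh) = \rho(g)\rho(h)$, together with additivity and homogeneity of the action.

```python
from itertools import permutations

def compose(g, h):
    """Group product gh of permutations stored as tuples: (gh)(i) = g(h(i))."""
    return tuple(g[h[i]] for i in range(len(h)))

def rho(g):
    """Permutation matrix of g, sending basis vector e_j to e_{g(j)}."""
    n = len(g)
    return [[1 if i == g[j] else 0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def act(g, v):
    """The action g.v = rho(g) v on a vector v in R^n."""
    m = rho(g)
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

group = list(permutations(range(3)))
e = (0, 1, 2)
v, w, lam = [1.0, 2.0, 3.0], [0.5, -1.0, 4.0], 2.0

for g in group:
    for h in group:
        # Compatibility with the group operation: rho(gh) = rho(g) rho(h).
        assert rho(compose(g, h)) == matmul(rho(g), rho(h))
    # Linearity: additive and homogeneous.
    v_plus_w = [a + b for a, b in zip(v, w)]
    assert act(g, v_plus_w) == [a + b for a, b in zip(act(g, v), act(g, w))]
    assert act(g, [lam * a for a in v]) == [lam * a for a in act(g, v)]
assert act(e, v) == v  # the identity element acts trivially
```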

$\quad$ The next example illustrates how linear group representations arise naturally when considering group actions on data. Input data are modeled as members of some vector space $\mathcal{X}$, defined with respect to a domain $\Omega$ and a smaller __channel vector space__ $\mathcal{C}$. We may assume all vector spaces discussed to be finite dimensional. Specifically, we consider the vector space of $\mathcal{C}$-valued signals over some finite, discrete domain $\Omega$, which perhaps has some graph structure. This vector space is denoted $\mathcal{X}(\Omega, \mathcal{C})$. 

___
___

### __Example.__ ( shifts of RGB images )



 $\quad$  Consider, for some $n \in \mathbb{N}$, a signal domain $\Omega = \mathbb{T}_n^2$, where $\mathbb{T}_n^2$ denotes the two-dimensional discrete torus of side-length $n$, namely $( \mathbb{Z} / n\mathbb{Z} )^2$. This domain has natural graph as well as group structures. If we imagine each vertex of $\Omega$ to be a pixel, we can express an $n \times n$-pixel color (RGB) image as a signal $x : \Omega \to \mathbb{R}^3$, with the first, second and third coordinates of $\mathbb{R}^3$ reporting Red, Green and Blue values of a given pixel. We make two observations: 

1. As a vector space, $\mathcal{X}(\Omega)$ is isomorphic to $\mathbb{R}^d$, with $d$ typically very large. In the above example, $d = 3n^2$, which is thirty thousand for an $n \times n \equiv 100 \times 100$ pixel image. 

2. Any group action on $\Omega$ induces a group action on $\mathcal{X}(\Omega)$. 


$\quad$ Expanding on the latter, consider a group action of $\textsf{G}$ on domain $\Omega$. As the torus $\Omega$ already has group structure, it is natural to think of it acting on itself through translations, i.e. we now additionally consider $\textsf{G}\equiv \Omega$. The action of $\textsf{G}$ on $\Omega \equiv \mathbb{T}_n^2$ induces a $\textsf{G}$-action on $\mathcal{X}(\Omega)$ as follows: for $g \in \textsf{G}$ and signal $x \in \mathcal{X}(\Omega)$, the action $(g, x) \mapsto \mathbf{g}.x \in \mathcal{X}(\Omega)$ is defined pointwise at each $u \in \Omega$:
$$
(\mathbf{g}.x)(u) := x(g^{-1}.u),
$$
where the bold $(\mathbf{g}.)$ is used to distinguish the action on signals from the action on the domain. The inverse ensures that $(\mathbf{gh}).x = \mathbf{g}.(\mathbf{h}.x)$ even when $\textsf{G}$ is not abelian; for translations of the torus, it means that $\mathbf{g}$ shifts the image by $g$. 
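$\quad$ The induced action can be sketched in a few lines. The snippet below stores a signal on a small torus as a dictionary from pixels to RGB triples, and uses the convention $(\mathbf{g}.x)(u) = x(u - g)$, i.e. evaluation at $g^{-1}.u$; on the abelian torus, $x(u + g)$ would equally define an action. The side length `n = 4`, the toy signals and the helper names are assumptions chosen for illustration.

```python
n = 4  # side length of the discrete torus (Z/nZ)^2; a small illustrative value

def shift(g, x):
    """Action of g = (g1, g2) on a signal x: (g.x)(u) = x(u - g), indices mod n."""
    return {(i, j): x[((i - g[0]) % n, (j - g[1]) % n)] for (i, j) in x}

def add(x, y):
    """Pointwise sum of two signals."""
    return {u: tuple(a + b for a, b in zip(x[u], y[u])) for u in x}

def scale(lam, x):
    """Pointwise scalar multiple of a signal."""
    return {u: tuple(lam * a for a in x[u]) for u in x}

# Toy RGB "images": each pixel u = (i, j) carries a vector in R^3.
x = {(i, j): (float(i), float(j), float(i * j)) for i in range(n) for j in range(n)}
y = {(i, j): (1.0, float(i + j), 0.0) for i in range(n) for j in range(n)}

g, h = (1, 3), (2, 2)
gh = ((g[0] + h[0]) % n, (g[1] + h[1]) % n)  # group operation on the torus

assert shift((0, 0), x) == x                                  # identity acts trivially
assert shift(gh, x) == shift(g, shift(h, x))                  # (gh).x = g.(h.x)
assert shift(g, add(x, y)) == add(shift(g, x), shift(g, y))   # additivity
assert shift(g, scale(2.0, x)) == scale(2.0, shift(g, x))     # homogeneity
```

The last two assertions are the linearity conditions, so the induced action on signals is a linear representation in the sense defined earlier.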

$\vdots$

> To summarize: any $\textsf{G}$-action on the domain $\Omega$ induces an $\mathbb{R}$-linear representation of $\textsf{G}$ over the vector space of signals on $\Omega$.

___
___

___
___

### __Example.__ ( one-hot encoding )

$\quad$  It seems like standard practice to encode the collection of classes associated to some ML classification problem as an orthonormal basis. These are given ( to the computer ) in the usual coordinate basis 
$$
e_1 \equiv (1, 0, \dots, 0),\, e_2 \equiv (0,1,\dots, 0),\, \dots, \, e_n \equiv (0,\dots, 0,1) \,,
$$
hence the nomenclature __one-hot__. In the preceding example, if one considers a one-hot encoding of the $n^2$ vertices of $\mathbb{T}_n^2$, we see that each signal is expressed with respect to this coordinate system, in the sense that $x = \sum_{j=1}^{n^2} x_j e_j$, where $x_j$ is the value of the signal at the $j$th vertex. This kind of encoding is useful for considering general symmetries of the domain. For instance, if permuting the labels of $m$ nodes is a relevant symmetry, the action of the symmetric group $\mathfrak{S}_m$ is naturally represented by $m \times m$ permutation matrices.


___
___

$\quad$ The following definition reformulates the notion of a signal.

___

### __Definition.__ ( features )

$\quad$ A graph $G = ( \textrm{V}, \textrm{E} )$ has __features__ if for each $v  \in \textrm{V}$, one has the additional data of an $s$-dimensional vector $x(v) \in \mathbb{R}^s$, called the __features of node $v$__. Here, $\mathbb{R}^s$ is understood as the channel vector space $\mathcal{C}$ equipped with a coordinate basis.  

___


$\quad$ The above usage of the term _features_ is compatible with its usage in ML. We can think of a neural network as a sequence of node-layers built on top of the signal domain graph $\Omega$, which forms the input layer of the network. An input signal endows the first node layer of a neural network with features, and the weights of the neural network propagate these through to node features on the nodes of successive layers of the network. The features on the last layer of the network can be read off as the output of the neural network. 

# equivariance

_references_

* [Bronstein--Bruna--Cohen--Veličković 2021](https://arxiv.org/abs/2104.13478), _Geometric deep learning_

* [Maron--Ben-Hamu--Shamir--Lipman 2018](https://arxiv.org/abs/1812.09902), _Invariant and equivariant graph networks_

## definitions

$\quad$ We henceforth consider groups $\textsf{G}$ acting on some space of signals $\mathcal{X}(\Omega)$ through a given action on some fixed signal domain $\Omega$. The group action is encoded in a linear representation $\rho$, assumed to be described in a given input coordinate system, just as we would need to specify to a computer. Thus, if $\dim ( \mathcal{X} ) = n$, for each $g \in \textsf{G}$, the object $\rho(g)$ is an $n \times n$ matrix with real entries.  

$\quad$ The training dynamics of the neural network take place in the __hypothesis space__ $\mathcal{H}$, a collection of functions 
$$
\mathcal{H} \subset \{\, F : \mathcal{X}(\Omega) \to \mathcal{Y}\, \} ,
$$
where $\mathcal{Y}$ is a vector space of output channels. The point of the learning blueprint is to be able to choose $\mathcal{H}$ compatible with the symmetries of input data. The blueprint describes $\mathcal{H}$ explicitly, up to hyperparameters such as depth and layer widths. A key aspect of this blueprint is that $F \in \mathcal{H}$ should be expressed as a composition of functions, most of which are $\textsf{G}$-equivariant. The requirement of $\textsf{G}$-equivariance constitutes a geometric prior. From this prior, one can derive the architecture of a generic CNN when $\textsf{G}$ corresponds to translations, and a family of generalizations for other domains and group actions. 

___

### __Definition.__ ( group invariance, equivariance )

$\quad$ Let $\rho$ be a representation of group $\textsf{G}$ over $\mathcal{X}(\Omega)$, and let $\rho'$ be a representation of the same group over the $\mathbb{R}$-vector space $\mathcal{Y}$. A function $F: \mathcal{X}(\Omega) \to \mathcal{Y}$ is __$\textsf{G}$-equivariant__ if for all $g \in \textsf{G}$ and all $x \in \mathcal{X}(\Omega)$, we have $\rho'(g) F(x) = F ( \rho(g) x )$. Say $F$ is __$\textsf{G}$-invariant__ if this holds when $\rho'$ is the trivial representation, i.e. $F ( \rho(g) x) = F(x)$ for all $g \in \textsf{G}$ and $x \in \mathcal{X}(\Omega)$.

___

## examples

___
___

### __Example.__ ( permutation invariance and equivariance in data )

$\quad$ Suppose we are given either a set $\textrm{V}$, or more generally a graph $G = ( \textrm{V}, \textrm{E} )$, with $\# \textrm{V} = n$ in either case. As discussed, a signal $x$ over $\textrm{V} = \{ v_1, \dots, v_n \}$ can be thought of as a collection of node features $\{ \, x(v_1), \dots, x(v_n) \, \}$, with $x(v_j) \in \mathbb{R}^s$. Let us stack the node features as rows of an $n \times s$ matrix called the __design matrix__, which we also refer to as $x$:

$$
x = 
\left[ 
\begin{matrix}
x(v_1)\\ 
\vdots\\
x(v_n)
\end{matrix}
\right] ,
$$

which is effectively the same object as signal $x$, provided the vertices are labeled as described. The action of $g \in \mathfrak{S}_n$ on this input data is naturally represented as an $n \times n$ permutation matrix, $P \equiv \rho(g)$.

$\quad$ One standard way to construct a permutation invariant function $F$ in this setting is through the following recipe: a function $\psi$ is applied independently to every node's features, and a function $\varphi$ is applied to the sum-aggregated outputs:
$$
    F(x) = \varphi \left( \, \sum_{j \, = \, 1}^n \psi(x(v_j)) \, \right) .
$$ 
Such a function can be thought of as reporting some 'global statistic' of signal $x$.
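$\quad$ The recipe can be sketched as follows. The particular choices of $\psi$ (entrywise squaring) and $\varphi$ (square root of the total) are assumptions made only to have something concrete to run; the resulting $F$ is the Frobenius norm of the design matrix. Integer features are used so that sums are exact regardless of the order of the rows.

```python
from itertools import permutations

def psi(features):
    """Node-wise map, applied independently to each node's feature vector."""
    return [f * f for f in features]

def phi(aggregate):
    """Readout applied to the sum-aggregated outputs."""
    return sum(aggregate) ** 0.5

def F(x):
    """F(x) = phi(sum_j psi(x(v_j))) for a design matrix x given as a list of rows."""
    outputs = [psi(row) for row in x]
    aggregate = [sum(col) for col in zip(*outputs)]  # sum over nodes, per channel
    return phi(aggregate)

x = [[1, 2], [3, 4], [5, 6]]  # n = 3 nodes, s = 2 channels

# Reordering the rows of x (relabeling the nodes) leaves F unchanged.
assert all(F([x[i] for i in p]) == F(x) for p in permutations(range(3)))
```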


$\quad$ Equivariance manifests even more naturally. Suppose we want to apply a function $F$ to the signal to 'update' the node features to a set of __latent__ node features. This is the case when a neural network outputs an image segmentation mask; the underlying domain does not change, but the features at each node are updated to the extent that they may not even agree on the number of channels. We can stack these latent features into another design matrix, $F(x)$. The order of the rows of $F(x)$ should clearly be tied to the order of the rows of $x$, i.e. permutation equivariant: for any permutation matrix $P$, it holds that $F(P x ) = P F(x)$. 

$\quad$ As a concrete example of a permutation equivariant function, consider a weight matrix $\theta \in \mathbb{R}^{s \, \times \, s'}$. This matrix can be used to map a length-$s$ feature vector at a given node to some new, updated feature vector with $s'$ channels. Applying this matrix to each row of the design matrix is an example of a __shared node-wise linear transform__, and constitutes a linear, $\mathfrak{S}_n$-equivariant map.
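$\quad$ A quick numerical check of this claim, with an arbitrary integer weight matrix `theta` (an assumption for illustration, with $s = 2$ and $s' = 4$): multiplying the design matrix on the right by $\theta$ commutes with permuting its rows.

```python
from itertools import permutations

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def permutation_matrix(p):
    """Matrix P with (P x)[i] = x[p[i]], i.e. P reorders the rows of x."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]

theta = [[1, 0, 2, -1],   # weight matrix in R^{s x s'} with s = 2, s' = 4
         [0, 3, 1, 1]]

def F(x):
    """Shared node-wise linear transform: each row x(v_j) is mapped to x(v_j) theta."""
    return matmul(x, theta)

x = [[1, 2], [3, 4], [5, 6]]  # design matrix: n = 3 nodes, s = 2 channels

for p in permutations(range(3)):
    P = permutation_matrix(p)
    assert F(matmul(P, x)) == matmul(P, F(x))  # F(Px) = P F(x)
```

The check is just associativity of matrix multiplication, $(Px)\theta = P(x\theta)$, which is why the same $\theta$ must be shared across all nodes.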

___

$\quad$ The above example considered signals over the nodes of a graph only, thus label permutation symmetry applies equally well, regardless of the graph structure ( or lack of structure ) underlying such functions. In the case that signals $x$ have a domain with graph structure, encoded by adjacency matrix $\textrm{A}$, it feels right to work with a hypothesis space recognizing this structure.

This is to say that we wish to consider functions $F \in \mathcal{H}$ with $F \equiv F( x, \textrm{A} )$. Such a function is (label) __permutation invariant__ if $F( Px,\, P \textrm{A} P^{\textrm{T}} ) = F ( x, \textrm{A})$, and is __permutation equivariant__ if
$$
F( P x, P \textrm{A} P^{\textrm{T}} ) = PF( x, \textrm{A} )
$$
for any permutation matrix $P$. 

___

__Rmk__ $\quad$ On a characterization of linear $\mathfrak{S}_n$-equivariant functions on nodes.
[BBCV 21](https://arxiv.org/abs/2104.13478): _"One can verify any such map can be written as a linear combination of two generators, the identity and the average. As observed by [Maron et al. 2018](https://arxiv.org/abs/1812.09902), any linear permutation equivariant $F$ can be expressed as a linear combination of fifteen linear generators; remarkably, this family is independent of $n \equiv \# \mathcal{V}$."_

___


$\quad$ Among the generators described in the above remark, the geometric learning blueprint 'specifically advocates' for those that are also local, in the sense that the output on node $u$ directly depends on its neighboring nodes in the graph. We can formalize this constraint explicitly, by defining the __$1$-hop neighborhood__ of node $u$ as
$$
\textsf{nbhd}(u) \equiv \textsf{nbhd}_1(u)  := \{ v : \{ u,v \} \in \textrm{E} \} ,
$$
as well as the corresponding __neighborhood features__, 
$$
x({\textsf{nbhd}(u)}) := \{ \!\{ \, x(v) : v \in \textsf{nbhd}(u) \, \} \!\} ,
$$
which is a multiset, as indicated by double-brackets, as distinct nodes may be decorated with the same feature vector. 



___
___

### __Example.__ ( permutations and local averaging )

$\quad$ The node-wise linear transformation described in the previous example can be thought of as operating on $0$-hop neighborhoods of nodes. We generalize this with an example of a function operating on $1$-hop neighborhoods. Instead of a fixed map between feature spaces $\theta : \mathbb{R}^s \to \mathbb{R}^{s'}$, cloned to a pointwise map, we instead specify a _local_ function 
$$
\varphi \equiv \varphi( \, x(u), \, x(\textsf{nbhd}(u)) \, )
$$
operating on the features of a node as well as those of its $1$-hop neighborhood. 


$\quad$ We may construct a permutation equivariant function $\phi$ by applying $\varphi$ to each $1$-hop neighborhood in isolation, and then collecting these into a new feature matrix. As above, write $\textrm{V} = \{ v_1, \dots, v_n \}$ for the vertices of the signal domain graph; then


$$
\phi ( x, \textrm{A} ) = 
\left[
\begin{matrix}
\varphi( \, x(v_1) , \, x( \, \textsf{nbhd}(v_1) \, ) \, ) \\
\varphi( \, x(v_2) , \, x( \, \textsf{nbhd}(v_2) \, ) \, ) \\
\vdots \\
\varphi( \, x(v_n) , \, x( \, \textsf{nbhd}(v_n) \, ) \, )
\end{matrix}
\right]
$$

The permutation equivariance of $\phi$ rests on the output of $\varphi$ being independent of the ordering of the nodes in each $\textsf{nbhd}(v)$. Thus, if $\varphi$ is permutation invariant in its neighborhood argument, as in a local averaging, then $\phi$ is permutation equivariant. [BBCV 21](https://arxiv.org/abs/2104.13478): _"The choice of $\varphi$ plays a crucial role in the expressive power of the learning scheme."_
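$\quad$ As a sketch of one such choice, take $\varphi$ to add to $x(u)$ the mean of the neighborhood features (an assumed, illustrative choice of local averaging, not the one singled out by [BBCV 21]). The snippet checks $\phi(Px, P \textrm{A} P^{\textrm{T}}) = P \phi(x, \textrm{A})$ over all relabelings of a path graph on four nodes. Integer features keep the neighborhood sums exact.

```python
from itertools import permutations

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def permutation_matrix(p):
    """Matrix P with (P x)[i] = x[p[i]]."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]

def local_average(x, A):
    """Row u of the output is phi(x(u), x(nbhd(u))) = x(u) + mean of neighbor features."""
    n, s = len(x), len(x[0])
    out = []
    for u in range(n):
        nbhd = [v for v in range(n) if A[u][v] == 1]
        mean = [sum(x[v][c] for v in nbhd) / len(nbhd) if nbhd else 0.0
                for c in range(s)]
        out.append([x[u][c] + mean[c] for c in range(s)])
    return out

# Path graph 0 - 1 - 2 - 3 with integer-valued features.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
x = [[1, 0], [2, 5], [4, 1], [8, 3]]

for p in permutations(range(4)):
    P = permutation_matrix(p)
    Px = matmul(P, x)
    PAPt = matmul(matmul(P, A), transpose(P))
    assert local_average(Px, PAPt) == matmul(P, local_average(x, A))
```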

___
___

$\quad$ For signals $x$ over graphs, it is natural to consider a hypothesis space whose functions operate on the pair $( x, \textrm{A})$, where $\textrm{A}$ is the adjacency matrix of the signal domain. Thus, for such signals the domain naturally becomes part of the input. The GDL blueprint distinguishes between two contexts: 

1. one in which the input domain is fixed, and 

2. another in which the input domain varies from signal to signal. 

The preceding example demonstrates that, even in the former context, it can be essential that elements of $\mathcal{H}$ treat the fixed domain as an argument. 
 

$\quad$ When the signal domain has geometric structure, it can often be leveraged to construct a __coarsening operator__, one of the components of a GDL block in the learning blueprint. Such an operator passes a signal $x \in \mathcal{X}(\Omega)$ to a signal $y \in \mathcal{X}( \Omega')$, where $\Omega'$ is a coarse-grained version of $\Omega$. The domain may be fixed for each input, but this domain changes as the signal passes through the layers of the NN, and it is natural that the functions the NN is built out of should pass this data forward. [BBCV 21](https://arxiv.org/abs/2104.13478): _"Due to their additional structure, graphs and grids, unlike sets, can be coarsened in a non-trivial way, giving rise to a variety of pooling operations... more precisely, we cannot define a non-trivial coarsening assuming set structure alone. There exist established approaches that infer topological structure from unordered sets, and those can admit non-trivial coarsening."_

# equivariance in neural networks

_references_

* [Kondor-Trivedi 2018](https://arxiv.org/abs/1802.03690)

## feed-forward neural networks

$\quad$ The building blocks of neural networks are tensor objects. It is convenient for now to present a definition of a neural network without introducing the notion of a tensor. 

___

### __Definition.__ ( feed-forward neural network ) 

$\quad$ A __feed-forward neural network__ 
$$
(\textsf{N}, \textsf{S}, b, w, a )  
$$ 
is a directed graph $(\textsf{N},\textsf{S})$, called the __computation graph__, whose vertices $\textsf{N}$ and edges $\textsf{S}$ are equipped with weights, respectively $b$ and $w$, along with an __activation function__ $a$. For simplicity, we assume a single activation function $a$ is used across the network.  The objects should satisfy the following:

* The vertices $\textsf{N}$ of the computation graph, also called __neurons__, are partitioned into $L+1$ distinct __layers__, for some $L \in \mathbb{N}$. The collection of vertices in layer $\ell$ is denoted $\textsf{N}_\ell$. Layer $\textsf{N}_0$ is called the __input layer__, while $\textsf{N}_L$ is the __output layer__.

* We write $\mathfrak{n}_i^\ell$ for the $i$th neuron in layer $\ell$. Each neuron is equipped with a __bias__ parameter, $b_i^\ell$. The vector of biases for neurons in a given layer $\ell$ is denoted $b^\ell$, and the collection of all bias parameters across all layers is denoted $b$. 

* The edges $\textsf{S}$ of the computation graph, called __synapses__, only join neurons in consecutive layers, and importantly, each edge is oriented towards the layer of larger index. This is why the network is "feed-forward." 

* This structure effectively partitions the synapses into layers, indexed by the larger of the layer indices of the two endpoints of each edge. We write $\textsf{S}_\ell$ for the $\ell$th layer of synapses, though we could also write $\textsf{S}_{\ell-1, \ell}$ to emphasize that the edges in $\textsf{S}_\ell$ join neurons in $\textsf{N}_{\ell-1}$ to those in $\textsf{N}_\ell$. The synapse in $\textsf{S}_\ell$ joining neuron $\mathfrak{n}_i^{\ell-1}$ to $\mathfrak{n}_j^\ell$ is denoted $\mathfrak{s}_{ij}^{\ell}$.

* Each synapse $\mathfrak{s}_{ij}^{\ell}$ is equipped with a __weight__ parameter, $w_{ij}^{\ell}$. The collection of weights associated to synapses in layer $\ell$ is denoted $w^\ell$, and the collection of all weights across all layers, $w$. 

* For all layer indices $\ell >0$, and given an input vector $x$, let $a^\ell$ denote the __activation at layer__ $\ell$. The activation at layer $\ell$ is a vector $a^\ell = ( a_1^\ell, \dots, a_{\# \textsf{N}_\ell }^\ell)$, the $j$th coordinate of which is called the __activation at neuron $\mathfrak{n}_j^\ell$__. These activations are defined inductively by
$$
a_j^\ell = a \left( \left( \sum_{i} w_{ij}^\ell a_i^{\ell-1} \right) + b_j^{\ell}\right) , 
$$
with the activation at the $0$th layer, $a^0$, defined as the input signal $x$ itself, without any transformation applied to it.

___
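$\quad$ The inductive definition of the activations translates directly into a forward pass. The sketch below assumes a ReLU activation and a small network with layer widths $2 \to 3 \to 1$; the weights, biases and the name `forward` are illustrative choices. The index convention follows the definition: `w[i][j]` is the weight of the synapse from neuron $i$ in layer $\ell - 1$ to neuron $j$ in layer $\ell$.

```python
def relu(t):
    """An illustrative choice of activation function a."""
    return max(0.0, t)

def forward(x, weights, biases, a=relu):
    """Return the activations a^0, ..., a^L, where
    a_j^l = a(sum_i w_ij^l * a_i^{l-1} + b_j^l)."""
    activations = [list(x)]                 # a^0 is the input signal itself
    for w, b in zip(weights, biases):       # one (w^l, b^l) pair per layer l = 1..L
        prev = activations[-1]
        layer = [a(sum(w[i][j] * prev[i] for i in range(len(prev))) + b[j])
                 for j in range(len(b))]
        activations.append(layer)
    return activations

# A depth L = 2 network with layer widths 2 -> 3 -> 1.
weights = [
    [[1.0, -1.0, 0.5],
     [0.0, 2.0, -1.0]],
    [[1.0], [1.0], [1.0]],
]
biases = [[0.0, 1.0, 0.0], [-1.0]]

acts = forward([1.0, 2.0], weights, biases)
output = acts[-1]   # the activation at the output layer, a^L
print(acts)         # [[1.0, 2.0], [1.0, 4.0, 0.0], [4.0]]
```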

$\quad$ The parameter $L$, determining the number of layers, is the __depth__ of the network, while the __width of layer $\ell$__ is the number of neurons in the layer. We will often abuse notation slightly, and write $\textsf{N}$ in place of the quintuple $(\textsf{N}, \textsf{S}, b, w, a )$. Each feed-forward neural network $\textsf{N}$ of depth $L$ is associated to a function 
$$x \mapsto F_{\textsf{N}}(x) \equiv a^L(x,\textsf{N})$$

So, $F_{\textsf{N}}$ can be expressed as the alternating composition of nonlinearities and transformations between layers

$$
\alpha \circ B_L \circ \alpha \circ B_{L-1} \circ \alpha \dots \circ  \alpha \circ B_1,
$$

where $\alpha$ denotes the nonlinearity $a$ applied entrywise to its vector argument. In the present subsection, it is convenient to consider each layer of neurons as embedded in $\mathbb{N}$ through some enumeration. Signals in the input layer are thus compactly supported functions $x : \mathbb{N} \to \mathcal{C}$, where $\mathcal{C}$ is the input channel vector space. 

$\quad$ As signal $x$ is propagated through the network, its codomain may change to a different channel vector space, so that in general,

$$
B_\ell : \mathcal{X}_c(\mathbb{N}, \mathcal{C}_{\ell-1} ) \to \mathcal{X}_c (\mathbb{N}, \mathcal{C}_\ell ), \quad z \mapsto w^\ell z + b^\ell
$$

so that $F_{\textsf{N}}$ may also be expressed as 

$$ \mathcal{X}_c( \mathbb{N}, \mathcal{C} ) \xrightarrow[\,]{ B_1} \mathcal{X}_c( \mathbb{N}, \mathcal{C}_1 )  \xrightarrow[\,]{ \alpha } \mathcal{X}_c( \mathbb{N}, \mathcal{C}_1 )  \xrightarrow[\,]{ B_2} \mathcal{X}_c(\mathbb{N}, \mathcal{C}_2) \xrightarrow[\,]{ \alpha }  \, \dots \, \xrightarrow[\,]{ B_L} \mathcal{X}_c(\mathbb{N}, \mathcal{C}_L) \xrightarrow[\,]{ \alpha } \mathcal{X}_c(\mathbb{N}, \mathcal{C}_L)
$$

This view obscures the width of each layer, as well as the shape of the signal as it propagates through the layers. Nonetheless, it suffices to discuss equivariance in the context of such functions. Henceforth, we use the term neural network to indicate a feed-forward neural network. 

## equivariant neural networks

$\quad$ In the present context, we suppose there is a group $\textsf{G}$ acting on each of the signal domains. As discussed, the signal domains have been obscured by embedding each domain into $\mathbb{N}$. We should keep in mind that these signal domains are changing, a priori, with each new layer. As such, the group $\textsf{G}$ may act differently on the signal domain at each layer, leading to different representations of $\textsf{G}$, which we wish to distinguish in our notation. For $\ell = 0, \dots, L$, let $\rho^\ell$ denote the representation of $\textsf{G}$ at the $\ell$th layer, so that for each $g \in \textsf{G}$, $\rho_g^\ell$ is an invertible linear map,

$$
\rho_g^\ell : \mathcal{X}_c( \mathbb{N} , \mathcal{C}_\ell) \to  \mathcal{X}_c( \mathbb{N} , \mathcal{C}_\ell).
$$

___

### __Definition.__ ( group equivariant neural network )

$\quad$ Let $\textsf{N}$ be a feed-forward neural network of depth $L$. Let $\textsf{G}$ be a group acting on the signal domain of each layer of $\textsf{N}$. The associated representation of this action on signals, at layer $\ell$, is denoted $\rho^\ell$ as above. The network $\textsf{N}$ is said to be __$\textsf{G}$-equivariant__ if each $B_\ell$ is $\textsf{G}$-equivariant, in the sense that for all $g \in \textsf{G}$, and all appropriately supported $z \in \mathcal{X}_c(\mathbb{N}, \mathcal{C}_{\ell-1})$, one has

$$
\rho_g^{\ell} B_\ell (z) = B_\ell (\rho_g^{\ell-1}z ) 
$$


___
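$\quad$ To see what the equivariance condition demands of a single layer, the sketch below takes $\textsf{G} = \mathbb{Z}/n\mathbb{Z}$ acting by cyclic shifts on signals over $n$ points, with the same representation on both sides of the layer. It compares a layer whose weights depend only on the difference $u - v$ (a circulant weight matrix) with a layer whose weights are unconstrained. Biases are omitted for brevity, and the kernel, weights and function names are assumptions for illustration.

```python
n = 5

def shift(g, z):
    """Representation of g in Z/nZ on signals: (rho_g z)(u) = z(u - g)."""
    return [z[(u - g) % n] for u in range(n)]

def circulant_layer(z, kernel):
    """B(z)(u) = sum_v kernel[(u - v) mod n] * z[v]: weights depend only on u - v."""
    return [sum(kernel[(u - v) % n] * z[v] for v in range(n)) for u in range(n)]

def dense_layer(z, w):
    """B(z)(u) = sum_v w[u][v] * z[v], with unconstrained weights."""
    return [sum(w[u][v] * z[v] for v in range(n)) for u in range(n)]

z = [1, 4, 0, 2, 7]
kernel = [2, -1, 0, 0, 3]
w = [[(u + 1) * (v + 2) % 7 for v in range(n)] for u in range(n)]  # arbitrary weights

# rho_g B(z) == B(rho_g z) for every g?
equivariant = all(
    shift(g, circulant_layer(z, kernel)) == circulant_layer(shift(g, z), kernel)
    for g in range(n)
)
dense_equivariant = all(
    shift(g, dense_layer(z, w)) == dense_layer(shift(g, z), w)
    for g in range(n)
)

assert equivariant            # weight sharing along u - v gives equivariance
assert not dense_equivariant  # this unconstrained layer fails the condition
```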

## convolution

$\quad$ The main result of [KT 18](https://arxiv.org/abs/1802.03690) is a classification of equivariant neural networks: they show each $B_\ell$ corresponds to a generalized convolutional layer, which we now introduce. According to the usual analytic definition, given $\varphi, \theta : \mathbb{R} \to \mathbb{R}$ sufficiently nice, their __convolution__ $\varphi \star \theta$ is defined by
$$
( \varphi \star \theta ) (u) = \int_{\mathbb{R}} \varphi(u-v) \theta (v) \, \textrm{d} v
$$
This operation is commutative, and we can interchangeably think of $\varphi$ and $\theta$ as signal and filter. 

This definition can be generalized to (say, $\mathbb{R}$-valued) functions/signals over other domains, in particular compact groups, which admit unique Haar probability measures. 

___

### __Definition.__ ( convolution )  

$\quad$ Letting $\mu_{\textsf{G}}$ denote the unique Haar probability measure of compact group $\textsf{G}$, we define the __convolution__ of two functions $\varphi, \theta : \textsf{G} \to \mathbb{R}$ by
$$
(\varphi \star \theta) (g) = \int_{\textsf{G}} \varphi(gh^{-1}) \theta(h) \, \textrm{d}\mu_{\textsf{G}}(h)
$$
___
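$\quad$ For a finite group, the Haar probability measure is the normalized counting measure, so the integral becomes an average over the group. The sketch below spells this out for the cyclic group $\mathbb{Z}/n\mathbb{Z}$, written additively so that $gh^{-1}$ becomes $g - h$. The value `n = 6` and the functions `phi`, `theta` are illustrative assumptions.

```python
n = 6  # the cyclic group Z/nZ, written additively: g h^{-1} becomes g - h

def convolve(phi, theta):
    """(phi * theta)(g) = (1/n) sum_h phi(g - h) theta(h), uniform Haar measure."""
    return [sum(phi[(g - h) % n] * theta[h] for h in range(n)) / n
            for g in range(n)]

phi = [1, 0, 2, 0, 0, 3]
theta = [0, 1, 1, 0, 4, 0]

conv = convolve(phi, theta)

# Z/nZ is abelian, so this convolution is commutative.
assert conv == convolve(theta, phi)

# A point mass at the identity, rescaled to have total Haar mass 1,
# acts as a unit for convolution.
delta = [n, 0, 0, 0, 0, 0]
assert convolve(phi, delta) == phi
```

Commutativity here relies on the group being abelian; for a non-abelian group the two orders of convolution generally differ.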


$\quad$ We have already seen an instance in which the signal domain is a compact group: when modeling images, RGB or grayscale, we took the domain to be the discrete two-dimensional torus $\mathbb{T}_n^2$. The domains of the signals that propagate through the network aren't necessarily groups. In general, we require these domains to be __homogeneous spaces__, which means they are equipped with a __transitive__ $\textsf{G}$-action. Going forward, we also assume that $\textsf{G}$ is some finite group, which is reasonable if $\textsf{G}$ is to be represented to a computer.

$\quad$ Let $\textsf{G}$ be a compact group acting on homogeneous space $\Omega$. Let us fix a distinguished 'basepoint' $v_e \in \Omega$. The basepoint allows us to index the space $\Omega$ using $\textsf{G}$, through the transitivity of its action. Any $v \in \Omega$ can be expressed as $g.v_e$ for some $g \in \textsf{G}$. Let us write
$$
v_g := g.v_e\,,
$$
and observe that group elements of $\textsf{G}$ which fix $v_e$ form a subgroup, the __stabilizer__ of $v_e$, denoted $\textsf{H}$. The __(left) quotient space__ $\textsf{G} / \textsf{H}$ is the collection of __left cosets__  
$$
g\textsf{H} := \{ gh : h \in \textsf{H} \},
$$
and through the map $g \mapsto v_g$, this quotient space can be identified with the domain $\Omega$ itself. 

$\quad$ The quotient map $g \mapsto g\textsf{H}$ is natural to consider, given a subgroup $\textsf{H} \subset \textsf{G}$. It is convenient to introduce another map $\textsf{G} / \textsf{H} \to \textsf{G}$ which is effectively a one-sided inverse of the quotient map. This 'inverse' of the quotient map can be rephrased as a selection of a __coset representative__ $[g] \in g \textsf{H}$. For $v \in \Omega$, let $[v]$ denote a coset representative of the coset of group elements which map $v_e$ to $v$. This leads to an identification of $v \in \Omega$ with $[v]\textsf{H} \in \textsf{G} / \textsf{H}$.
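$\quad$ These identifications can be checked by hand in a small case. The sketch below assumes $\textsf{G}$ is the symmetric group on three letters acting on $\Omega = \{0, 1, 2\}$ with basepoint $0$; the rule used to pick coset representatives (the lexicographically smallest element of each coset) is an arbitrary illustrative choice.

```python
from itertools import permutations

G = list(permutations(range(3)))   # S_3 acting on Omega = {0, 1, 2} by g.v = g[v]
v_e = 0                            # basepoint

def compose(g, h):
    """Group product gh: (gh)(i) = g(h(i))."""
    return tuple(g[h[i]] for i in range(len(h)))

H = [g for g in G if g[v_e] == v_e]   # stabilizer of the basepoint

def left_coset(g):
    """The left coset gH = { gh : h in H }, as a frozenset."""
    return frozenset(compose(g, h) for h in H)

cosets = {left_coset(g) for g in G}   # the quotient space G / H

# Every element of a coset sends v_e to the same point, so gH -> g.v_e
# is well defined and identifies G / H with Omega.
coset_to_point = {c: {g[v_e] for g in c} for c in cosets}
assert all(len(points) == 1 for points in coset_to_point.values())

# A choice of coset representative [v] for each v in Omega.
representative = {v: min(g for g in G if g[v_e] == v) for v in range(3)}
assert all(representative[v][v_e] == v for v in range(3))
```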

___

### __Definition.__ ( projection, lift )

$\quad$ Consider a function $\varphi : \textsf{G} \to \mathbb{R}$, with $\textsf{G}$ a group acting (transitively) on some homogeneous space $\Omega$ with stabilizer $\textsf{H}$, so that $\Omega \cong \textsf{G}/\textsf{H}$. The __projection__ of $\varphi$ to $\Omega$ is the function $\varphi_\downarrow : \Omega \to \mathbb{R}$ given by averaging $\varphi$ over the coset corresponding to a given $v \in \Omega$:
$$
\varphi_\downarrow (v) := \frac{1}{\# \textsf{H}} \sum_{g \in [v]\textsf{H}} \varphi(g).
$$
Conversely, given $\varphi : \Omega \to \mathbb{R}$, the __lift__ of $\varphi$ to $\textsf{G}$ is the function $\varphi^\uparrow: \textsf{G} \to \mathbb{R}$ given by
$$
\varphi^\uparrow (g) := \varphi([v_g]\textsf{H}).
$$
___
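
$\quad$ The two operations in this definition can be sketched as follows, again under the illustrative assumption that $\textsf{G} = S_3$ acts on $\Omega = \{0,1,2\}$ with basepoint $v_e = 0$. Functions are stored as dictionaries, and the names `project` and `lift` are hypothetical. The sketch suggests that projecting a lift recovers the original function, since a lift is constant on each coset.

```python
from itertools import permutations

# Assumption for illustration: G = S_3 acting on Omega = {0, 1, 2}, basepoint v_e = 0.
G = list(permutations(range(3)))
OMEGA = [0, 1, 2]
V_E = 0

def compose(g, h):
    """Group product gh: apply h first, then g."""
    return tuple(g[h[i]] for i in range(len(h)))

H = [g for g in G if g[V_E] == V_E]                          # stabilizer of v_e
rep = {v: next(g for g in G if g[V_E] == v) for v in OMEGA}  # representatives [v]

def project(phi):
    """phi: dict on G -> dict on Omega, averaging phi over the coset [v]H."""
    return {v: sum(phi[compose(rep[v], h)] for h in H) / len(H) for v in OMEGA}

def lift(phi):
    """phi: dict on Omega -> dict on G, constant on each coset: g -> phi(v_g)."""
    return {g: phi[g[V_E]] for g in G}

signal = {0: 1.0, 1: -2.0, 2: 0.5}
lifted = lift(signal)
print(project(lifted) == signal)
```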

$\quad$ The notion of a lift allows us to generalize the convolution operation to pairs of functions whose domains are perhaps distinct quotient spaces of group $\textsf{G}$. Let $\Omega$ and $\Omega'$ be such quotient spaces, and consider $\varphi: \Omega \to \mathbb{R}$ and $\theta : \Omega' \to \mathbb{R}$; their __convolution__ $\varphi \star \theta : \textsf{G} \to \mathbb{R}$ is defined by
$$
(\varphi \star \theta) (g) = \sum_{h \in \textsf{G}} \varphi^\uparrow(gh^{-1}) \, \theta^\uparrow(h) \, .
$$

$\quad$ The definition of a lift clearly generalizes to functions whose domain is some **_right_ quotient** space $\textsf{H} \backslash \textsf{G}$, consisting of right cosets 
$$
\textsf{H}g := \{ hg : h \in \textsf{H} \},
$$
as well as to functions defined on the **double quotient space** $\textsf{H} \backslash \textsf{G} / \textsf{K}$, for some subgroups $\textsf{H}, \textsf{K} \subset \textsf{G}$, whose elements are the double cosets
$$
\textsf{H}g\textsf{K} := \{ hgk : h \in \textsf{H}, k \in \textsf{K} \}.
$$

Thus, the definition of convolution can be extended to pairs of functions whose domains are any type of quotient space associated to $\textsf{G}$.

___

### __Definition.__ ( generalized convolution )

$\quad$ Let $\textsf{G}$ be a finite group, and consider two, perhaps distinct, associated homogeneous spaces $\Omega$ and $\Omega'$. Consider functions $\varphi : \Omega \to \mathbb{R}$ and $\theta : \Omega' \to \mathbb{R}$. The __generalized convolution__ $\varphi \star \theta : \textsf{G} \to \mathbb{R}$ is 
$$
(\varphi \star \theta) (g) = \sum_{h \in \textsf{G}} \varphi^\uparrow(gh^{-1}) \, \theta^\uparrow(h). 
$$
___
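
$\quad$ A minimal sketch of the generalized convolution, assuming for illustration that $\textsf{G} = S_3$ and that both $\Omega$ and $\Omega'$ are the set $\{0,1,2\}$ with basepoint $0$. The helper names are hypothetical. The last lines check numerically that translating the lifted input on the left by $u \in \textsf{G}$ translates the output in the same way.

```python
from itertools import permutations

# Assumption for illustration: G = S_3 acting on {0, 1, 2}, basepoint v_e = 0.
G = list(permutations(range(3)))
V_E = 0

def compose(g, h):
    """Group product gh: apply h first, then g."""
    return tuple(g[h[i]] for i in range(len(h)))

def inverse(g):
    """Inverse permutation."""
    inv = [0] * len(g)
    for i, gi in enumerate(g):
        inv[gi] = i
    return tuple(inv)

def lift(phi):
    """Lift a function on Omega to G: g -> phi(v_g)."""
    return {g: phi[g[V_E]] for g in G}

def convolve(phi_up, theta_up):
    """(phi * theta)(g) = sum_h phi_up(g h^{-1}) theta_up(h), both already lifted to G."""
    return {g: sum(phi_up[compose(g, inverse(h))] * theta_up[h] for h in G) for g in G}

def translate(u, psi):
    """Left translation of a function on G: (u.psi)(g) = psi(u^{-1} g)."""
    u_inv = inverse(u)
    return {g: psi[compose(u_inv, g)] for g in G}

phi = {0: 1.0, 1: 2.0, 2: -1.0}     # signal on Omega
theta = {0: 0.5, 1: 0.0, 2: 3.0}    # filter on Omega' (here Omega' = Omega)
out = convolve(lift(phi), lift(theta))

u = G[3]
lhs = convolve(translate(u, lift(phi)), lift(theta))
rhs = translate(u, out)
print(all(abs(lhs[g] - rhs[g]) < 1e-12 for g in G))
```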

$\quad$ Now that we have some understanding of the structural requirements we make of signal domains, we are more specific about the output $F_{\textsf{N}}$ of some neural network $\textsf{N}$; it can be expressed as 

$$ \mathcal{X}( \Omega, \mathcal{C} ) \xrightarrow[\,]{ B_1} \mathcal{X}( \Omega_1, \mathcal{C}_1 )  \xrightarrow[\,]{ \alpha } \mathcal{X}( \Omega_1, \mathcal{C}_1 )  \xrightarrow[\,]{ B_2} \mathcal{X}(\Omega_2, \mathcal{C}_2) \xrightarrow[\,]{ \alpha }  \, \dots \, \xrightarrow[\,]{ B_L} \mathcal{X}(\Omega_L, \mathcal{C}_L) \xrightarrow[\,]{ \alpha } \mathcal{X}(\Omega_L, \mathcal{C}_L) \, ,
$$

where $\Omega \equiv \Omega_0, \Omega_1, \dots, \Omega_L$ is a sequence of homogeneous spaces associated to $\textsf{G}$. In particular, we may associate each $\Omega_\ell$ with its stabilizer subgroup $\textsf{H}_\ell$ in $\textsf{G}$, so that $\Omega_\ell \cong \textsf{G} / \textsf{H}_\ell$. Each affine block $B_\ell$ consists of a linear transformation $T_\ell : \mathcal{X}(\Omega_{\ell-1}, \mathcal{C}_{\ell-1}) \to \mathcal{X}(\Omega_\ell, \mathcal{C}_\ell)$, followed by addition of the bias vector $b^\ell$. The object $T_\ell$ is very similar to the weight matrix $w^\ell \equiv [ w_{ij}^\ell ]_{i,j}$. The linear transformation $T_\ell$ associated to this weight matrix may involve first reshaping it into a tensor, depending on the signal domain. 

___

### __Definition.__ ( group-convolutional neural network )

$\quad$ Let $\textsf{N}$ be a feed-forward neural network of depth $L$, whose layers $\Omega \equiv \Omega_0, \Omega_1, \dots, \Omega_L$ are homogeneous spaces associated to finite group $\textsf{G}$. Each $\Omega_\ell$ corresponds to its stabilizer in $\textsf{G}$, denoted $\textsf{H}_\ell$. Let $(\mathcal{C}_j)_{j=0}^L$ denote the sequence of channel vector spaces associated to each layer. The affine blocks of $\textsf{N}$ are, as usual, denoted $B_\ell$ for $\ell \in \{1, \dots, L\}$, and the linear transformation within each block is denoted $T_\ell$. We call $\textsf{N}$ a __$\textsf{G}$-convolutional neural network ($\textsf{G}$-CNN)__ if each $T_\ell$ may be expressed as
$$
T_\ell (z) = z \star \theta_\ell,
$$
for some __filter__ 
$$
\theta_\ell \in \mathcal{X} ( \textsf{H}_{\ell-1} \backslash \textsf{G} / \textsf{H}_\ell , \,\mathcal{C}_{\ell-1} \otimes \mathcal{C}_{\ell} ),
$$
where the symbol $\otimes$ denotes the tensor product of vector spaces. 
___

$\quad$ The tensor product will be defined in the next section. The relevance of $\textsf{G}$-CNNs to equivariance in neural networks is encapsulated in the following theorem: if we feel $\textsf{G}$-equivariance constitutes an important prior on the network structure, the notion of a $\textsf{G}$-CNN is fundamental. 

___

### __Theorem__ ( [Kondor-Trivedi 2018](https://arxiv.org/abs/1802.03690), Theorem 1 ) 

$\quad$ Let $\textsf{G}$ be a compact group with transitive action on each layer of the feed-forward neural network $\textsf{N}$. Then $\textsf{N}$ is $\textsf{G}$-equivariant, in the sense described [above](https://colab.research.google.com/drive/13S7RUk6kP5SX8xUycbvJRm3rufkcTisN#scrollTo=ME1hkruixTRy&line=1&uniqifier=1), if and only if it is a $\textsf{G}$-CNN. 

___

# tensors

_references_

* [Schwarz](https://www.math.ucdavis.edu/~schwarz/), _Topology for Physicists_

* Wald, _General Relativity_

## vectors and covectors, through the lens of rep. theory

$\quad$ Consider the action of some finite group $\textsf{G}$ on discrete domain $\Omega$. Let $\mathcal{X}$ denote the space of $\mathcal{C}$-valued signals, where $\mathcal{C}$ is itself a vector space over $\mathbb{R}$. In this case, $\mathcal{X}$ also has the structure of a vector space over $\mathbb{R}$. The action of $\textsf{G}$ on $\Omega$ induces an action on $\mathcal{X}$, putting elements $g$ of $\textsf{G}$ in correspondence with elements of $\textrm{GL}(\mathcal{X})$, which denotes the group of invertible linear maps from $\mathcal{X}$ to itself. 

$\quad$ Let $\rho$ denote the $\mathcal{X}$-representation of $\textsf{G}$ just described. A subspace $\mathcal{X}' \subset \mathcal{X}$ is called an __invariant subspace__ of representation $\rho$ if all operators $\rho(g)$ map $\mathcal{X}'$ to itself. Clearly, restricting $\rho$ to $\mathcal{X}'$ is itself a representation of $\textsf{G}$ over vector space $\mathcal{X}'$. The representation $\rho$ is called __irreducible__ if there are no invariant subspaces of $\mathcal{X}$ other than $\mathcal{X}$ itself and the trivial subspace $\{0\}$. 

$\quad$ If $\rho_1$ and $\rho_2$ are representations of $\textsf{G}$ over $\mathbb{R}$-vector spaces $\mathcal{X}_1$ and $\mathcal{X}_2$, their __direct sum__ is defined as the representation $\rho$ of $\textsf{G}$ over $\mathcal{X}_1 \oplus \mathcal{X}_2$ given by
$$
[\rho(g)]\,( v_1, v_2) = ( \, \rho_1(g) v_1, \,\rho_2(g) v_2 \, ),
$$
for any $v_1 \in \mathcal{X}_1$ and $v_2 \in \mathcal{X}_2$. 

Two representations $\rho_1$ and $\rho_2$ over $\mathcal{X}_1$ and $\mathcal{X}_2$ are __equivalent__ if there is an isomorphism $\varphi : \mathcal{X}_1 \to \mathcal{X}_2$ so that, for every $g \in \textsf{G}$,

$$
\varphi \rho_1(g) = \rho_2(g) \varphi \,.
$$
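
$\quad$ As a small worked instance of the direct sum and of equivalence, the sketch below takes $\textsf{G} = \mathbb{Z}_2$ (my choice, for illustration) and checks that the representation swapping the two coordinates of $\mathbb{R}^2$ is equivalent to the direct sum of the trivial and sign representations. The names `direct_sum`, `swap` and `phi` are hypothetical.

```python
import numpy as np

# Assumption for illustration: G = Z_2 = {0, 1}, with 1 the non-identity element.
trivial = {0: np.array([[1.0]]), 1: np.array([[1.0]])}
sign = {0: np.array([[1.0]]), 1: np.array([[-1.0]])}
swap = {0: np.eye(2), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}  # permutes coordinates of R^2

def direct_sum(r1, r2):
    """Matrix of (rho_1 + rho_2)(g): block diagonal with blocks rho_1(g) and rho_2(g)."""
    m, n = r1.shape[0], r2.shape[0]
    out = np.zeros((m + n, m + n))
    out[:m, :m] = r1
    out[m:, m:] = r2
    return out

summed = {g: direct_sum(trivial[g], sign[g]) for g in (0, 1)}

# Intertwining isomorphism: coordinates with respect to (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
phi = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

# phi rho_1(g) = rho_2(g) phi for every g, with rho_1 = swap and rho_2 = trivial + sign.
equivalent = all(np.allclose(phi @ swap[g], summed[g] @ phi) for g in (0, 1))
print(equivalent)
```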

$\quad$ We say a representation $\rho$ is __orthogonal__ if each $\rho(g)$ is an orthogonal linear transformation (it preserves the length of every vector). If $\mathcal{Y}$ is an invariant subspace of an orthogonal representation $\rho$, so is its orthogonal complement $\mathcal{Y}^\perp$. Furthermore, $\mathcal{Y}$ and $\mathcal{Y}^\perp$ inherit representations of $\textsf{G}$ through restriction, and the original representation $\rho$ is equivalent to the direct sum of these restrictions. 

$\quad$ For any representation $\rho$ of compact group $\textsf{G}$ over $\mathcal{X}$, one can find a scalar product on $\mathcal{X}$ that is invariant under $\rho$. Equivalently, with respect to an appropriately chosen scalar product on $\mathcal{X}$ (one that depends on $\rho$), the representation $\rho$ is orthogonal. One constructs this scalar product by taking any scalar product on $\mathcal{X}$ and averaging its images under the $\textsf{G}$-action, with respect to Haar measure. Crucially, the existence of this invariant scalar product implies that every finite-dimensional representation of a compact $\textsf{G}$ is equivalent to a direct sum of irreducible representations.  
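
$\quad$ The averaging construction can be tried numerically. In the sketch below, the group $\mathbb{Z}_4$, the conjugating matrix `A`, and the use of a Cholesky factor to pass to an orthonormal basis are all my own assumptions for illustration. For a finite group, averaging against Haar measure is just the uniform average over group elements.

```python
import numpy as np

n = 4

# Assumption for illustration: G = Z_4 acting on R^4 by cyclic shifts, conjugated by
# a fixed invertible (unit upper triangular) matrix A so that rho(g) is not orthogonal.
shift = np.roll(np.eye(n), 1, axis=0)
A = np.eye(n) + np.triu(np.ones((n, n)), 1)
A_inv = np.linalg.inv(A)
rho = [A @ np.linalg.matrix_power(shift, k) @ A_inv for k in range(n)]

# Average the standard scalar product over the group: M = (1/#G) sum_g rho(g)^T rho(g).
M = sum(r.T @ r for r in rho) / len(rho)

# The scalar product <x, y>_M = x^T M y is invariant: rho(g)^T M rho(g) = M for every g.
invariant = all(np.allclose(r.T @ M @ r, M) for r in rho)

# In an M-orthonormal basis (columns of B, with B^T M B = I) every rho(g) is orthogonal.
L = np.linalg.cholesky(M)          # M = L L^T
B = np.linalg.inv(L.T)
rho_orth = [np.linalg.inv(B) @ r @ B for r in rho]
orthogonal = all(np.allclose(q.T @ q, np.eye(n)) for q in rho_orth)
print(invariant, orthogonal)
```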

$\quad$ Let $\textrm{GL}(n, \mathbb{R})$ denote the group of invertible $n \times n$ matrices with real entries. As the group $\textrm{GL}(\mathbb{R}^n)$ of invertible linear transformations of $\mathbb{R}^n$ is isomorphic to $\textrm{GL}(n, \mathbb{R})$, one can regard any group homomorphism $\textsf{G} \to \textrm{GL}(n, \mathbb{R})$ as an $n$-dimensional representation of $\textsf{G}$. In the case that $\textsf{G}$ is itself $\textrm{GL}(n, \mathbb{R})$, we have the natural representation $\rho(g) v \equiv g v$, called the __vector representation__. The __covector representation__ is defined by $\rho(g) v = (g^T)^{-1}v$. The latter matrix can be written unambiguously as $g^{-T}$, as inverse and transpose commute with one another. The elements of a space on which a matrix group acts via the vector representation are _vectors_, and _covectors_  can be defined analogously. We expand on the discussion of vectors and covectors after introducing a basis in the next section.

$\quad$ By the above discussion, we should feel no loss in generality working with orthogonal representations of groups over some signal $\mathbb{R}$-vector space $\mathcal{X}$. Moreover, as feeding any signal to a computer requires choosing a basis for $\mathcal{X}$, it is natural to expect this basis to be "maximally compatible" with the orthogonal representation, in the sense that the basis is orthonormal. 

## vectors and covectors, in coordinates

$\quad$ Thus it is natural to assume that an orthonormal basis $(e_j)_{j=1}^n$ for some $n$-dimensional vector space $\mathcal{X}$ is always given alongside $\mathcal{X}$, and it is through the lens of this basis that we explore the covector representation described above. The basis furnishes a canonical isomorphism between $\mathcal{X}$ and $\mathbb{R}^n$, and hence between $\textrm{GL}(\mathcal{X})$ and $\textrm{GL}(n, \mathbb{R})$. On the other hand, the inner product $\langle \cdot, \cdot \rangle$ furnishes a canonical isomorphism between $\mathcal{X}$ and its dual $\mathcal{X}^*$. Elements of $\mathcal{X}$ are __vectors__ and elements of $\mathcal{X}^*$ are called __covectors__.

> The above is technically correct. But I can't help but wonder: if $\mathcal{X}$, as a vector space, has a further structure as a space of functions taking values in channel vector space $\mathcal{C}$, how should this be reflected in the basis for $\mathcal{X}$? Perhaps this structure means that it's proper to think of elements of the signal vector space as already having some tensor structure. 

$\quad$ Through the assumed basis, any representation $\rho$ of $\textsf{G}$ corresponds uniquely to a representation of $\textsf{G}$ over $\mathbb{R}^n$, as well as to a representation of $\textsf{G}$ over the dual space $( \mathbb{R}^n)^*$. In either case, group elements can effectively be identified with $n \times n$ matrices. We will write coordinates of a vector $x \in \mathcal{X}$ using raised indices, i.e. for

$$
x =  x^1e_1 + \dots + x^ne_n\, \equiv \,x^je_j
$$

the $x^j$ are the __$(e_j)$-coordinates__ of $x$. We are adopting the usual summation convention for tensors. 

$\quad$ The basis $(e_j)$ identifies the vector $x$ with the $n$-tuple $(x^1, \dots, x^n) \in \mathbb{R}^n$. It also gives rise to a canonical identification with a dual basis, through the inner product; this identification induces the isomorphism of $\mathcal{X}$ and $\mathcal{X}^*$ mentioned above. Given $e_j$, the associated dual basis vector $e^j$ is given by

$$
e^j = \langle e_j, \cdot \rangle \,.
$$

$\quad$ Having assumed an orthogonal representation, the action of any group element, through the representation, takes the $(e_j)$ to some other orthonormal basis, say $(\tilde{e}_j)$. This leads to the questions of 
* how the dual basis $(e^j)$ transforms under the mapping $e^j \mapsto \langle \tilde{e}_j, \cdot \rangle $, 
* and of how coordinates $(x^j)_{j=1}^n$ describing vector $x$ transform under this change of basis, which is effectively the same question.   

$\quad$ We can express the basis $( \tilde{e}_j )$ in terms of the $( e_j)$ via
$$
\tilde{e}_j = S^i_je_i  \, ,
$$
where $S$ is called the __direct transformation matrix__ from the original basis to the new. Let $T$ denote $S^{-1}$, the transformation matrix from the new basis back to the old:
$$
e_i = T_i^j \tilde{e}_j \,.
$$
This inverse determines how the coordinates $x^j$ transform. This is to say, coordinate objects themselves transform __contravariantly__:
$$
\tilde{x}^j =  T_i^j x^i\,.
$$

To see this, one starts with the two equivalent expressions for $x$, namely $x^i e_i$ and $\tilde{x}^j \tilde{e}_j$, and then uses the identity $e_i = T_i^j \tilde{e}_j$ on the first expression.  
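
$\quad$ A quick numerical check of the contravariant rule, as a sketch. The random orthogonal matrix `S` (obtained from a QR factorization) and the array layout `S[i, j]` $= S^i_j$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# Columns of E are the old basis vectors e_i (here the standard basis of R^3).
E = np.eye(n)

# Assumption for illustration: S is a random orthogonal matrix; S[i, j] stores S^i_j,
# so that the new basis is e~_j = S^i_j e_i.
S, _ = np.linalg.qr(rng.normal(size=(n, n)))
T = np.linalg.inv(S)               # T[j, i] stores T^j_i

E_new = E @ S                      # columns are the new basis vectors e~_j

x = rng.normal(size=n)             # coordinates x^i in the old basis
x_new = T @ x                      # contravariant rule: x~^j = T^j_i x^i

# Both coordinate tuples describe the same vector: x^i e_i = x~^j e~_j.
same_vector = np.allclose(E @ x, E_new @ x_new)
print(same_vector)
```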

$\quad$ Let us discuss the form of the covector representation described above. We think of the action of group element $g$ as describing a change of coordinates. The vector representation describes this coordinate change directly, as an action on vectors. The covector representation encodes how covectors transform under the same change in coordinates. From the above discussion, we know that this transformation on covectors is contravariant, namely the induced action on covectors is given by
$$
\langle v , \cdot \rangle \mapsto \langle g^{-1}v, \cdot \rangle ,
$$
but the inner product allows us to transfer this to yet another action on vectors, through the relation

$$
\langle g^{-1} v, w \rangle = \langle v, g^{-T} w \rangle ,
$$

thus the covector representation, defined in the previous section, describes how vectors _see_ the induced contravariant transformation. Conveniently, in the orthogonal setting, the vector and covector representations coincide. 
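
$\quad$ The identities in this passage are easy to verify numerically. The sketch below uses a fixed invertible matrix `g` and a random orthogonal matrix `q`, both chosen by me purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

# Assumption for illustration: a fixed invertible matrix g (its determinant is 7).
g = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
g_inv = np.linalg.inv(g)
g_cov = g_inv.T                    # covector representation g^{-T}

v, w = rng.normal(size=n), rng.normal(size=n)

# The identity <g^{-1} v, w> = <v, g^{-T} w> from the text.
adjoint_identity = np.isclose((g_inv @ v) @ w, v @ (g_cov @ w))

# The pairing is preserved when one argument moves by g and the other by g^{-T}.
pairing_preserved = np.isclose((g_cov @ w) @ (g @ v), w @ v)

# For an orthogonal matrix q the two representations coincide: q^{-T} = q.
q, _ = np.linalg.qr(rng.normal(size=(n, n)))
coincide = np.allclose(np.linalg.inv(q).T, q)
print(adjoint_identity, pairing_preserved, coincide)
```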

## tensors

$\quad$ Recall that the tensor product $\mathcal{X}_1 \otimes \mathcal{X}_2$ of two vector spaces $\mathcal{X}_1$ and $\mathcal{X}_2$, with respective bases $e_{(1), \,1}, \dots, e_{(1), \,m}$ and $e_{(2), \,1}, \dots, e_{(2), \,n}$, is the space of formal linear combinations of the symbols
$$
 e_{(1), \,a} \otimes  e_{(2), \,b},
$$
meaning that each element of $\mathcal{X}_1 \otimes \mathcal{X}_2$ can be uniquely expressed in the form 

$$
C^{ab} e_{(1), \,a} \otimes  e_{(2), \,b} 
$$

where $C^{ab}$ are the coordinates of an object we denote $C$, and which is an example of a _$2$-tensor_. 



___
___

### __Example.__ ( changing coordinates with a $2$-tensor )

$\quad$ Suppose $\rho_1$ is a representation of $\textsf{G}$ over $\mathcal{X}_1$, and that $\rho_2$ is a representation of $\textsf{G}$ over $\mathcal{X}_2$. Suppose that $x = (x^1, \dots, x^m) \in \mathcal{X}_1$ and that $y = (y^1, \dots, y^n) \in \mathcal{X}_2$, so that $x$ and $y$ are each transformed under the $\textsf{G}$ action on each space. By definition, the quantity with components $x^iy^j$ transforms according to the __tensor product representation__ $\rho_1 \otimes \rho_2$ over $\mathcal{X}_1 \otimes \mathcal{X}_2$. The representation $\rho_1 \otimes \rho_2$, acting in $\mathcal{X}_1 \otimes \mathcal{X}_2$, changes the coordinates $C^{ab}$ according to the following rule; we denote the result of this transformation by $D^{ab}$
$$
D^{ab} = (\rho_1(g))_k^a \, (\rho_2(g))_\ell^b \, C^{k\ell},
$$
where $(\rho_1(g))_k^a$ is the matrix of $\rho_1(g)$ in the basis $e_{(1), \,1}, \dots, e_{(1), \,m}$, and analogously for the other matrix.
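
$\quad$ In coordinates, this rule is a single `einsum`. In the sketch below, `R1` and `R2` are random stand-ins for the matrices of $\rho_1(g)$ and $\rho_2(g)$ at one fixed $g$ (an assumption for illustration, not matrices of an actual representation). It also checks the equivalent Kronecker-product form and the behaviour on a product tensor $x^i y^j$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3

# Assumption for illustration: R1 and R2 stand in for the matrices of rho_1(g) and
# rho_2(g) at one fixed group element g.
R1 = rng.normal(size=(m, m))
R2 = rng.normal(size=(n, n))
C = rng.normal(size=(m, n))        # coordinates C^{kl} of a 2-tensor

# D^{ab} = (rho_1(g))^a_k (rho_2(g))^b_l C^{kl}
D = np.einsum("ak,bl,kl->ab", R1, R2, C)

# The same rule as one matrix acting on the flattened (row-major) coordinates.
D_kron = (np.kron(R1, R2) @ C.reshape(-1)).reshape(m, n)

# A product tensor x^i y^j transforms by transforming each factor separately.
x, y = rng.normal(size=m), rng.normal(size=n)
D_xy = np.einsum("ak,bl,kl->ab", R1, R2, np.outer(x, y))

print(np.allclose(D, D_kron), np.allclose(D_xy, np.outer(R1 @ x, R2 @ y)))
```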



$\quad$ Let us consider this situation, specialized to the case $\mathcal{X}_1 = \mathcal{X}$, and $\mathcal{X}_2 = \mathcal{X}^*$. Using the above display, which relates the different coordinate representations of a tensor under a change of basis, along with the work of the previous section, one finds in this case
$$
D^{ab} = S_k^a \, T_\ell^b \,C^{k\ell},
$$
where $S$ and $T$ are the change of basis matrices previously discussed. The point is that, because the representation $\rho$ is orthogonal, and because all bases are orthonormal, the action of any $\rho(g)$ for $g \in \textsf{G}$ functions as a change of basis. We are imagining $\rho(g)$ as $S$. Importantly, the above display can be used to 'lift' the representation $\rho$ to the larger vector space $\mathcal{X} \otimes \mathcal{X}^*$. We index this induced representation by the _valency_ of the tensors under consideration (defined just below). Here, the induced representation is denoted $\rho_{(1,1)}$, and for $g$ as above, $\rho_{(1,1)}(g)$ corresponds to the $4$-tensor $ST$ with entries $S_k^a \, T_\ell^b$. 

___
___

___
### __Definition.__ ( tensors )

$\quad$  A __tensor of type__, or __valency__, $(k,\ell)$ over vector space $\mathcal{X}$ is a multilinear map 
$$
A : \underbrace{\mathcal{X}^* \times \, \dots \, \times \mathcal{X}^*}_{ k \text{ copies }}  \,\times \, \underbrace{ \mathcal{X} \times \, \dots \, \times \mathcal{X} }_{ \ell \text{ copies }} \to \mathbb{R} .
$$
In instances where the precise valency is not required to identify $A$, we refer to $A$ as a __$(k+\ell)$-tensor__. The space of all such objects is denoted $\mathcal{X}[k,\ell]$. When expressed in coordinates, the integer $k$ is the number of __upper indices__ and $\ell$ the number of __lower indices__. 

___


$\quad$ An orthonormal basis on $\mathcal{X}$ produces an orthonormal basis on any $\mathcal{X}[k, \ell]$. Through this induced basis, a tensor $A \in \mathcal{X}[k,\ell]$ has entries $A_{ \quad j_1, \, \dots\, , \, j_\ell} ^ { i_1, \, \dots \, , \, i_k}$, where each index ranges from $1$ to $n$, and where each _entry_ or _component_ takes values in $\mathbb{R}$. These entries are the __$(e_j)$-coordinates__ of tensor $A$. 

$\quad$ For coordinate representations of such objects, the transformation rules for coordinates above generalize. Indeed, Schwarz defines _tensors, with $k$ upper indices and $\ell$ lower indices_, as quantities transforming like a product of $k$ vectors and $\ell$ covectors. This is effectively the definition above. The space $\mathcal{X}[k,\ell]$ is isomorphic to $\mathbb{R}^N$, where $N = n^{k + \ell}$, and $\textrm{dim}(\mathcal{X}) = n$. 


$\quad$ The relevance of transforming coordinates is, again, that $\rho(g)$ can be interpreted as a change of coordinates. For a change of coordinates matrix $S \equiv \rho(g)$ (and its 'dual' $T$), the generalized transformation rules are thus a recipe for obtaining the representation $\rho_{(k,\ell)}$ over $\mathcal{X}[k,\ell]$ from the original representation $\rho$ over $\mathcal{X}$. Here we use $S$ to denote the (orthogonal) direct transformation matrix from one orthonormal basis $(e_j)$ to another $(\tilde{e}_j)$, and we use $T$ to denote $S^{-1}$ as before. 




$\quad$ Given a tensor $A \in \mathcal{X}[k,\ell]$, we use $A$ and $\tilde{A}$ with accompanying indices to denote the $(e_j)$- and $(\tilde{e}_j)$-coordinate representations of $A$. The relation between these coordinate representations is as follows:

$$
\tilde{A}_{ \quad i_1, \, \dots\, , \, i_\ell} ^ { j_1, \, \dots \, , \, j_k} = S_{i_1}^{m_1} \dots S_{i_\ell}^{m_\ell} T_{n_1}^{j_1} \dots T_{n_k}^{j_k} A_{ \quad m_1, \, \dots\, , \, m_\ell} ^ { n_1, \, \dots \, , \, n_k} \,.
$$
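
$\quad$ As a sketch of the rule in low valency: each upper index picks up a factor of $T$ and each lower index a factor of $S$, so for a $(1,1)$-tensor one has $\tilde{A}^{j}_{i} = S_i^m T_n^j A^n_m$, and for a $(0,2)$-tensor $\tilde{g}_{ij} = S_i^m S_j^n g_{mn}$. The array layouts below (`S[i, j]` $= S^i_j$, upper index first in `A`) and the random orthogonal `S` are my own conventions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

S, _ = np.linalg.qr(rng.normal(size=(n, n)))   # S[i, j] = S^i_j, orthogonal
T = np.linalg.inv(S)                           # T[j, i] = T^j_i

A = rng.normal(size=(n, n))        # A[up, low] = A^{up}_{low}, a (1,1)-tensor

# A~^j_i = S^m_i T^j_n A^n_m : one factor of T for the upper index, one S for the lower.
A_new = np.einsum("mi,jn,nm->ji", S, T, A)

# Viewed as a linear map, this is the familiar similarity transform S^{-1} A S.
print(np.allclose(A_new, T @ A @ S))

# A (0,2)-tensor: the Kronecker delta (the inner product in an orthonormal basis)
# keeps the same coordinates under an orthogonal change of basis.
delta_new = np.einsum("mi,nj,mn->ij", S, S, np.eye(n))
print(np.allclose(delta_new, np.eye(n)))
```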




> Does anything need to be said about the endpoint case of $k = \ell = 0$?

## tensor operations

$\quad$ Aside from the operations which give $\mathcal{X}[k, \ell]$ the structure of a vector space, there are two operations between such spaces to discuss, for now. The first is called __contraction__ with respect to the $i$th (dual vector) and $j$th (vector) slots. For $k, \ell \geq 1$, contraction is a map 
$$
\textsf{c} : \mathcal{X}[k, \ell] \to \mathcal{X}[k-1, \ell-1]
$$
given by
$$
\textsf{c} A = A( \dots, e^a,\dots ; \dots, e_a, \dots ), 
$$
where the repeated indexing denotes a sum indexed by $a$, which runs over the dimensions of $\mathcal{X}$, and where $(e_a)$ is an orthonormal basis for $\mathcal{X}$, with $(e^a)$ the corresponding orthonormal basis for $\mathcal{X}^*$. The $e^a$ are in the $i$th slot of $A$, while the $e_a$ are in the $j$th slot of $A$. The contraction of a tensor of type $(1,1)$, when viewed as a linear map from $\mathcal{X}$ to itself, is just the trace of this map. Contraction thus generalizes trace, and both objects are independent of the choice of $e_a$. 
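
$\quad$ A sketch of contraction in coordinates, using `einsum` with a repeated index. The $(1,2)$-tensor below is random, and the index layout `A[i, j, k]` $= A^i_{jk}$ is an assumption for illustration. The check is that contracting after a change of basis agrees with changing basis after contracting, which is one way to see the basis independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# A (1,2)-tensor with coordinates A[i, j, k] = A^i_{jk}.
A = rng.normal(size=(n, n, n))

# Contract the upper slot with the first lower slot: (cA)_k = A^a_{ak}.
cA = np.einsum("aak->k", A)

# Change basis: the upper index picks up T, each lower index picks up S.
S, _ = np.linalg.qr(rng.normal(size=(n, n)))   # S[i, j] = S^i_j
T = np.linalg.inv(S)                           # T[j, i] = T^j_i
A_new = np.einsum("ip,qj,rk,pqr->ijk", T, S, S, A)

# Contracting in the new basis gives the transformed (0,1)-tensor of the old contraction.
cA_new = np.einsum("aak->k", A_new)
print(np.allclose(cA_new, np.einsum("rk,r->k", S, cA)))

# For a (1,1)-tensor, contraction is the trace.
B = rng.normal(size=(n, n))
print(np.isclose(np.einsum("aa->", B), np.trace(B)))
```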


$\quad$ The second operation we discuss is the __outer product__ of tensors $A \in \mathcal{X}[k, \ell]$ and $B \in \mathcal{X}[k', \ell']$, denoted $A \otimes B$. The tensor $A \otimes B$ is an element of $\mathcal{X}[ k + k', \ell + \ell ']$, defined as follows. Given $(k+k')$ dual vectors $v^{(1)}, \dots, v^{(k+k')}$ and $(\ell + \ell')$ vectors $w_{(1)}, \dots, w_{(\ell + \ell')}$, we define the action of $A \otimes B$ on these objects to be the product 
$$
A( v^{(1)}, \dots, v^{(k)} ; w_{(1)}, \dots, w_{(\ell)}) \cdot B ( v^{(k+1)}, \dots, v^{(k + k')} ; w_{(\ell+1)}, \dots, w_{(\ell + \ell')} )\,.
$$
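
$\quad$ In coordinates, the outer product simply multiplies entries. A minimal sketch, with a $(1,1)$-tensor `A` and a $(0,1)$-tensor `B` chosen at random for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

A = rng.normal(size=(n, n))        # A[i, j] = A^i_j, type (1,1)
B = rng.normal(size=(n,))          # B[k]    = B_k,   type (0,1)

# Coordinates of the outer product: (A ⊗ B)^i_{jk} = A^i_j B_k, type (1,2).
AB = np.einsum("ij,k->ijk", A, B)

# Evaluate on a covector v and vectors w1, w2, as a multilinear map.
v, w1, w2 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
lhs = np.einsum("ijk,i,j,k->", AB, v, w1, w2)
rhs = (v @ A @ w1) * (B @ w2)      # A(v; w1) times B(w2)
print(np.isclose(lhs, rhs))
```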

# GDL blueprint

_( signal domain fixed )_

## setup

Our formal treatment of a classification problem requires: 

* A finite group $\textsf{G}$, the __data symmetry group__. 

* A sequence of discrete domains, $( \Omega_j )_{j=0}^L$, with $\Omega \equiv \Omega_0$ the domain of input signals. 

    * Each $\Omega_j$ is a homogeneous space with respect to $\textsf{G}$, meaning each domain admits a transitive $\textsf{G}$ action, denoted $\xi_j$. In particular, we can write $\Omega_j \cong \textsf{G} / \textsf{H}_j$ for some subgroup $\textsf{H}_j \subset \textsf{G}$. 

* A sequence of finite dimensional channel vector spaces $( \mathcal{C}_j )_{j=0}^L$, with $\mathcal{C}_0 \equiv \mathcal{C}$ the space of input channels. 

    * The $\xi_j$ induce representations $\rho_j$ over respective signal vector spaces $\mathcal{X}(\Omega_j, \mathcal{C}_j)$.

* A collection of orthonormal bases for each signal vector space. Unless stated otherwise, for each $j$, these are  assumed to have the form 
$$
e_{(j),\, \textrm{space} , \,i} \otimes e_{(j),\, \textrm{channel}, \, q}
$$
where the index $i$ runs between $1$ and $\# \Omega_j$, while index $q$ runs between $1$ and $\textrm{dim}(\mathcal{C}_j)$.


## three flavors of equivariant map

$\quad$ The essential components of the GDL blueprint are $\textsf{G}$-equivariant maps. These are separated into three 'types', discussed below. Each map below is imagined as associated to some layer $j$ of a neural network, though we omit the subscripts for the time being. We consider at most two domains, $\Omega, \tilde{\Omega}$ and at most two channel spaces $\mathcal{C}, \tilde{\mathcal{C}}$. The signal domains are as described in the setup, and thus can be expressed as $\Omega \equiv \textsf{G} / \textsf{H}$ and $\tilde{\Omega} \equiv \textsf{G} / \tilde{\textsf{H}}$ for subgroups $\textsf{H}, \tilde{\textsf{H}} \subset \textsf{G}$. The cases are

1. Affine, $\textsf{G}$-equivariant maps, 
$$
\tilde{B} : \mathcal{X}( \Omega, \mathcal{C}) \to \mathcal{X} ( \tilde{\Omega}, \tilde{\mathcal{C}})
$$
of the following form: first a generalized convolution with filter 
$$
\tilde{\theta} \in \mathcal{X}( \textsf{H} \backslash \textsf{G} / \tilde{\textsf{H}} , \, \mathcal{C} \otimes \tilde{\mathcal{C}}) \,,
$$ followed by the addition of a bias vector $\tilde{b} \in \mathcal{X}( \tilde{\Omega}, \tilde{\mathcal{C}} )$. The convolution itself can be expressed as a matrix, with respect to basis
$$
e_{\textrm{space},\,\ell} \otimes \tilde{e}_{\textrm{space}, k} \otimes e_{\textrm{channel},\, q} \otimes \tilde{e}_{\textrm{channel},\,r},
$$
where $\ell$ and $k$ range from $1$ to $\# \Omega$ and $\# \tilde{\Omega}$ respectively. Likewise, $q$ and $r$ range from $1$ to $\textrm{dim}(\mathcal{C})$ and $\textrm{dim}(\tilde{\mathcal{C}})$ respectively. The entries of this matrix, as well as those of the bias vector $\tilde{b}$ are all _learned_. 

2. Non-linear activation function 
$$
\alpha : \mathcal{X}( \Omega, \mathcal{C}) \to \mathcal{X}( \Omega, \mathcal{C})
$$ 
applied 'entrywise' to a given input. This implicitly assumes that the input is expressed with respect to the basis
$$
e_{\textrm{space}, \ell}  \otimes e_{\textrm{channel},\,q},
$$
There are no learned parameters associated to this map. Despite its basis dependence, the entrywise application of the nonlinearity is the reason this map is equivariant. 

3. Equivariant local pooling operators, 
$$
P : \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\tilde{\Omega}, \tilde{\mathcal{C}} )
$$
which contain no learnable parameters. These perform a kind of renormalization (or coarse graining) of $\Omega$. These are not in general convolutions, because in particular they are not necessarily linear nor affine. In addition to the hope that such renormalization works as in its physics usage, washing away non-essential information, a smaller output layer usually means fewer learned parameters in the network overall. It is also a way to ensure that the successive map examines the preceding signal with effectively a larger field of view. 
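
$\quad$ The claim in item 2, that an entrywise nonlinearity is equivariant but basis dependent, can be checked in a toy setting. The sketch assumes $\Omega = \mathbb{Z}_8$ acting on itself by shifts, three channels, and ReLU as the activation; all of these are my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: Omega = Z_8 (cyclic shifts), with 3 channels.
x = rng.normal(size=(8, 3))        # x[v, q]: signal in the space ⊗ channel basis

def relu(z):
    """Entrywise activation."""
    return np.maximum(z, 0.0)

def shift(z, s):
    """Action of s in Z_8 on a signal: (s.z)(v) = z(v - s)."""
    return np.roll(z, s, axis=0)

# Entrywise activation commutes with the action, which only permutes spatial entries.
equivariant = all(np.array_equal(relu(shift(x, s)), shift(relu(x), s)) for s in range(8))

# The same nonlinearity applied in a rotated channel basis is a different map,
# which illustrates the basis dependence.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
basis_dependent = not np.allclose(relu(x @ Q) @ Q.T, relu(x))
print(equivariant, basis_dependent)
```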


Before discussing some examples of equivariant pooling in the sense of (3.) above, we remark that convolutional layers can effectively function as renormalization operations as well. 

We give three examples of equivariant pooling.




___
___

### __Example.__ ( strideless group-pooling )


$\quad$ Let $\textsf{G}$ be a finite group; an example we will bear in mind is the usual one for images, with $\textsf{G} \equiv \Omega \equiv \mathbb{T}_{2,\,n}$. In addition to viewing the domain as the group acting on itself, we identify $\mathbb{T}_{2,\,n}$ with its Cayley graph, so that the graph distance endows $\mathbb{T}_{2, \,n}$ with a natural notion of distance. In general, we write $\textsf{d}_\Omega$ to denote this domain graph-distance, and we let $\textsf{B}_\Omega(v_g,r)$ for $r \in \mathbb{N}$ denote the ball around $v_g \in \Omega$ of radius $r$, with respect to $\textsf{d}_\Omega$.  

$\quad$ Consider $\textsf{B}_\Omega(v_e,r)$, the ball of radius $r$ centered at the 'basepoint' $v_e$ of $\Omega$ corresponding to the group identity $e$. Let us denote this neighborhood as $\textsf{B}_r$ for brevity. The following defines a pooling operation $P$ on signals $x \in \mathcal{X}(\Omega, \mathcal{C})$:
$$
Px(v_g) = x \left( \textrm{argmax}_{h \, \in \, g.\textsf{B}_r } \| x(h) \|_{\mathcal{C}} \right),
$$
where $\| \cdot \|_{\mathcal{C} }$ denotes some norm on the channel vector space $\mathcal{C}$. 
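
$\quad$ A minimal sketch of this pooling operation on the discrete torus, under some assumptions of mine: the Cayley graph is taken with respect to the generators $(\pm 1, 0)$ and $(0, \pm 1)$, the channel norm is Euclidean, and ties in the argmax are ignored (they do not occur for generic real-valued signals). The final line checks shift equivariance numerically.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, channels, r = 6, 3, 1

def torus_norm(a):
    """Distance from a to 0 on the cycle Z_n."""
    return min(a % n, (-a) % n)

# Offsets in the ball B_r around the identity of Z_n x Z_n, for the Cayley graph
# with generators (+-1, 0), (0, +-1): wrap-around L1 distance at most r.
ball = [(a, b) for a, b in product(range(n), repeat=2)
        if torus_norm(a) + torus_norm(b) <= r]

def pool(x):
    """Px(v_g) = x(argmax over h in g.B_r of ||x(h)||), with the Euclidean channel norm."""
    norms = np.linalg.norm(x, axis=-1)
    out = np.empty_like(x)
    for i, j in product(range(n), repeat=2):
        neighbours = [((i + a) % n, (j + b) % n) for a, b in ball]
        best = max(neighbours, key=lambda h: norms[h])
        out[i, j] = x[best]
    return out

def shifted(z, s):
    """Action of s = (s0, s1) in Z_n x Z_n on a signal: (s.z)(v) = z(v - s)."""
    return np.roll(z, s, axis=(0, 1))

x = rng.normal(size=(n, n, channels))
print(np.array_equal(pool(shifted(x, (2, 5))), shifted(pool(x), (2, 5))))
```

Note that the output lives on the same domain as the input, which is what "strideless" refers to here: no subsampling takes place.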

___
___

___
___
### __Example.__ ( subsampling )
___
___

$\quad$

___
___


___
___
### __Example.__ ( coset pooling )
___
___

$\quad$

___
___

## hypothesis space


$\quad$

## discussion 


Shift-invariance arises naturally in vision and pattern recognition. In this case, the desired function $f \in \textsf{H}$, typically implemented as a CNN, inputs an image and outputs the probability that the image contains an object from a certain class. It is often reasonably assumed that the classification result should not be affected by the position of the object in the image, i.e., the function $f$ must be shift-invariant.

Multi-layer perceptrons lack this property, a reason why early (1970s) attempts to apply these architectures to pattern recognition problems failed. The development of NN architectures with local weight sharing, as epitomized by CNNs, was among other reasons motivated by the need for shift-invariant object classification. 



A prototypical application requiring shift-equivariance is image segmentation, where the output of $f$ is a pixel-wise image mask. This segmentation mask must follow shifts in the input image. In this example, the domains of the input and output are the same, but since the input has three color channels while the output has _one channel per class_, the representations $(\rho, \mathcal{X}(\Omega, \mathcal{C}) )$ and $(\rho', \mathcal{Y} \equiv \mathcal{X}(\Omega, \mathcal{C}'))$ are somewhat different. 

When $f$ is implemented as a CNN, it may be written as a composition of $L$ functions, where $L$ is determined by the depth and other hyperparameters:
$$
f = f_L \circ f_{L-1} \circ \dots \circ f_2 \circ f_1 .
$$

Examining the individual layer functions making up a CNN, one finds they are not shift-invariant in general but rather shift-equivariant. The last function applied, namely $f_L$, is typically a "global-pooling" function that is genuinely shift-invariant, causing $f$ to be shift-invariant, but focusing on this alone ignores the structure we will leverage for purposes of expressivity and regularity. 
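
The interplay between equivariant layers and an invariant final layer can be seen in a one-dimensional toy model. The sketch below is an illustration under my own assumptions (signals on $\mathbb{Z}_{16}$, a single channel, random filters, ReLU activations, global max pooling as $f_L$), not a description of any particular CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

def circ_conv(x, theta):
    """Circular convolution on Z_n: (x * theta)(g) = sum_h x(g - h) theta(h)."""
    return np.array([sum(x[(g - h) % n] * theta[h] for h in range(n)) for g in range(n)])

def relu(z):
    """Entrywise activation."""
    return np.maximum(z, 0.0)

theta1, theta2 = rng.normal(size=n), rng.normal(size=n)

def features(x):
    """f_2 ∘ f_1: two shift-equivariant layers (convolution followed by ReLU)."""
    return relu(circ_conv(relu(circ_conv(x, theta1)), theta2))

def f(x):
    """Full network: the equivariant layers followed by global max pooling."""
    return features(x).max()

x = rng.normal(size=n)
x_shifted = np.roll(x, 5)

# The intermediate features move with the input; the pooled output does not change.
print(np.allclose(features(x_shifted), np.roll(features(x), 5)))
print(np.isclose(f(x_shifted), f(x)))
```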