# 5
> GDL

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]


# __2 $\quad$ Geometric deep learning models__  

## __2.1 $\quad$ Learning with scalar signals on a cyclic group__

In the simplest setting we consider, the underlying domain is a one-dimensional grid, and the signals have only a single channel. We can identify this grid $\Omega$ with the Cayley graph of the cyclic group
$$
C_n = \langle \, a : a^n = 1 \, \rangle \equiv \{ \, 1, a, a^2, \dots, a^{n-1} \, \}.
$$
It is convenient to parametrize the group, and hence the grid, through the exponent of the generator 
$$
C_n \equiv \{ 0, 1, \dots, n -1 \}
$$
as this indexing is consistent with the way most programming languages index vectors, reinterpreting the group operation as addition modulo $n$.

The vector space of single-channeled (i.e. real-valued) signals
$$
\mathcal{X}(C_n,\mathbb{R}) = \{ x : C_n \to \mathbb{R} \} ,
$$
is finite dimensional, and each $x \in \mathcal{X}(C_n, \mathbb{R})$ may be expressed as 
$$
x = 
\left[ 
\begin{matrix}
x_0\\ 
\vdots\\
\,x_{n-1}\,
\end{matrix}
\right] 
$$
with respect to some implicit coordinate system used by the computer, the _input coordinate system_. This is the same coordinate system used to express the representation $\rho$ of the translation group $G \equiv C_n$, which we now describe.

Given a vector $\theta = (\theta_0 , \dots, \theta_{n-1})$, recall that the associated _circulant matrix_ is the $n \times n$ matrix with entries 
$$
\mathbf{C}(\theta) := \left( \, \theta_{ (u - v) \bmod n} \right)_{ 0 \, \leq \,u,\,v \, \leq \, n-1 } .
$$

Consider the case $\theta_S := (0,1,0,\dots, 0)^T$. The associated circulant matrix $\mathbf{S} := \mathbf{C}(\theta_S)$ acts on a vector by shifting its entries to the right by one position, modulo $n$; that is, $(\mathbf{S} x)_u = x_{(u-1) \bmod n}$, so the last entry wraps around to the front. This is the shift, or translation, operator.
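As a quick sanity check of the definitions above, here is a minimal sketch in plain Python. The helper names `circulant` and `matvec` are our own, introduced only for illustration, and the matrix is stored as a list of rows.

```python
def circulant(theta):
    """Return C(theta) as a list of rows, with entry [u][v] = theta[(u - v) % n]."""
    n = len(theta)
    return [[theta[(u - v) % n] for v in range(n)] for u in range(n)]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m_uv * x_v for m_uv, x_v in zip(row, x)) for row in matrix]


theta_S = [0, 1, 0, 0, 0]   # theta_S = (0, 1, 0, ..., 0) with n = 5
S = circulant(theta_S)      # the shift operator

x = [10, 20, 30, 40, 50]
shifted = matvec(S, x)      # (S x)_u = x_{(u - 1) mod n}
print(shifted)              # [50, 10, 20, 30, 40]
```

The last entry of `x` wraps around to position $0$, which is the "modulo $n$" part of the shift.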

___
__Lemma__ $\quad$
A matrix is circulant if and only if it commutes with $\mathbf{S}$. Moreover, given any two vectors $\theta, \eta \in \mathbb{R}^n$, one has $\mathbf{C}(\theta) \mathbf{C}(\eta) = \mathbf{C}(\eta) \mathbf{C}(\theta)$. 

___
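The lemma is easy to probe numerically. The following is a sketch on one small integer example, not a proof; the helpers `circulant` and `matmul` and the particular vectors are assumptions made for illustration.

```python
def circulant(theta):
    """Return C(theta) as a list of rows, with entry [u][v] = theta[(u - v) % n]."""
    n = len(theta)
    return [[theta[(u - v) % n] for v in range(n)] for u in range(n)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


S = circulant([0, 1, 0, 0])          # shift operator for n = 4
C_theta = circulant([1, 2, 3, 4])
C_eta = circulant([5, -1, 0, 2])

# Circulant matrices commute with the shift, and with each other.
commutes_with_shift = matmul(C_theta, S) == matmul(S, C_theta)
circulants_commute = matmul(C_theta, C_eta) == matmul(C_eta, C_theta)

# A diagonal matrix with distinct entries is not circulant,
# and it does not commute with the shift.
D = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
diagonal_commutes = matmul(D, S) == matmul(S, D)

print(commutes_with_shift, circulants_commute, diagonal_commutes)
```

Integer entries are used so that the equality checks are exact.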

The importance of $\mathbf{S}$ to the present discussion is that it generates a group isomorphic to the one-dimensional translation group $C_n$. That is, a natural representation of $C_n = \langle \, a : a^n = 1 \, \rangle$ to consider is the group isomorphism induced by mapping the generator $a$ of $C_n$ to $\mathbf{S}$. Specifically, the representation $\rho$ of $G$ on $\mathcal{X}( C_n, \mathbb{R})$ is given by
$$
\rho ( a^j ) := \mathbf{S}^j .
$$

___

__Corollary__ $\quad$ Any $f : \mathcal{X}(C_n, \mathbb{R}) \to \mathcal{X}(C_n,\mathbb{R})$ which is linear and $C_n$-equivariant can be expressed (in the input coordinate system) as an $n \times n$ circulant matrix $\mathbf{C}(\theta)$ for some vector $\theta$.

___
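One way to see how the corollary follows from the lemma is sketched below. We write $\mathbf{M}$ for the matrix of $f$ in the input coordinate system; this symbol is introduced here only for illustration. Equivariance means $f( \rho(g) x ) = \rho(g) f(x)$ for all $g \in C_n$ and all signals $x$. Taking $g = a$ gives
$$
\mathbf{M} \mathbf{S} x = \mathbf{S} \mathbf{M} x \quad \text{for all } x
\quad \Longleftrightarrow \quad
\mathbf{M} \mathbf{S} = \mathbf{S} \mathbf{M} ,
$$
so $\mathbf{M}$ commutes with $\mathbf{S}$ and is therefore circulant by the lemma. Conversely, if $\mathbf{M} \mathbf{S} = \mathbf{S} \mathbf{M}$, then by induction
$$
\mathbf{M} \mathbf{S}^j = \mathbf{S}^j \mathbf{M} \quad \text{for } j = 0, 1, \dots, n-1 ,
$$
so commuting with the generator alone already gives equivariance under all of $C_n$.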


___
___

__Example__ $\quad$ 
Recall our previous recipe for designing an equivariant function $\mathbf{F} = \Phi( \mathbf{X}, \mathbf{A})$ using a local aggregation function $\varphi$. On the cyclic grid, the neighbors of node $u$ are $u-1$ and $u+1$, so we can express
$$
\varphi ( \mathbf{x}_u, \mathbf{X}_{\mathcal{N}(u)} ) = \varphi( \mathbf{x}_{u-1}, \, \mathbf{x}_u, \, \mathbf{x}_{u+1} ),
$$
where the addition and subtraction in the indices above are understood to be modulo $n$. 

If, in addition, we insist that $\varphi$ is linear, then it has the form 
$$
 \varphi( \mathbf{x}_{u-1}, \, \mathbf{x}_u, \, \mathbf{x}_{u+1} ) = \theta_{-1} \mathbf{x}_{u-1} + \theta_0 \mathbf{x}_u + \theta_1 \mathbf{x}_{u+1},
$$
and in this case we can express $\mathbf{F} = \Phi (\mathbf{X}, \mathbf{A} )$ through the following matrix multiplication:
$$
\left[
\begin{matrix}
\theta_0 & \theta_1 & \text{ } & \text{ } & \theta_{-1} \\
\theta_{-1} & \theta_0 & \theta_1 & \text{ } &   \text{ } \\
\text{} & \ddots & \ddots & \ddots & \text{ } \\
\text{ } & \text{ } & \theta_{-1} & \theta_0 & \theta_1 \\
\theta_1 & \text{ } & \text{ } & \theta_{-1} & \theta_0 
\end{matrix} 
\right]
\left[
\begin{matrix}
\mathbf{x}_0 \\
\mathbf{x}_1 \\
\vdots \\
\,\mathbf{x}_{n-2} \, \\
\mathbf{x}_{n-1}  
\end{matrix}
\right]
$$
This special multi-diagonal structure is sometimes referred to as "weight sharing" in the machine learning literature. 

___
___
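The example can be checked on a small grid. The sketch below, assuming scalar node features and illustrative weights of our own choosing, applies the linear $\varphi$ node by node and compares the result with multiplication by the weight-sharing matrix; the function names are hypothetical.

```python
def local_aggregate(x, theta_m1, theta_0, theta_p1):
    """Apply phi(x_{u-1}, x_u, x_{u+1}) at every node u, with indices modulo n."""
    n = len(x)
    return [theta_m1 * x[(u - 1) % n] + theta_0 * x[u] + theta_p1 * x[(u + 1) % n]
            for u in range(n)]


def weight_sharing_matrix(n, theta_m1, theta_0, theta_p1):
    """Build the n x n matrix whose row u holds the three shared weights."""
    rows = [[0] * n for _ in range(n)]
    for u in range(n):
        rows[u][(u - 1) % n] += theta_m1
        rows[u][u] += theta_0
        rows[u][(u + 1) % n] += theta_p1
    return rows


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m_uv * x_v for m_uv, x_v in zip(row, x)) for row in matrix]


x = [1, 2, 3, 4, 5]
weights = (1, 10, 100)      # (theta_{-1}, theta_0, theta_1), chosen for readability

node_by_node = local_aggregate(x, *weights)
W = weight_sharing_matrix(len(x), *weights)
via_matrix = matvec(W, x)

print(W[0])                 # [10, 100, 0, 0, 1]
print(node_by_node)         # [215, 321, 432, 543, 154]
print(via_matrix == node_by_node)
```

Only three numbers parametrize the whole $n \times n$ matrix, regardless of $n$.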

Circulant matrices are synonymous with discrete convolutions: for $x \in \mathcal{X}(\Omega,\mathbb{R})$ and $\theta \in \mathbb{R}^n$, their _convolution_ $x \star \theta$ is defined by 
$$
( x \star \theta )_u := \sum_{v = 0}^{n-1} x_{v}\, \theta_{ (u-v) \bmod n} = \left( \mathbf{C}(\theta)\, x \right)_u ,
$$
that is, $x \star \theta \equiv \mathbf{C}(\theta) x$.
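The identity between circular convolution and multiplication by a circulant matrix can be verified directly. This is a minimal sketch with hypothetical helper names and an arbitrary integer example.

```python
def circular_convolve(x, theta):
    """Compute (x * theta)_u = sum_v x_v * theta_{(u - v) mod n}."""
    n = len(x)
    return [sum(x[v] * theta[(u - v) % n] for v in range(n)) for u in range(n)]


def circulant(theta):
    """Return C(theta) as a list of rows, with entry [u][v] = theta[(u - v) % n]."""
    n = len(theta)
    return [[theta[(u - v) % n] for v in range(n)] for u in range(n)]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m_uv * x_v for m_uv, x_v in zip(row, x)) for row in matrix]


x = [1, 2, 3, 4]
theta = [1, 0, 0, 2]

conv = circular_convolve(x, theta)
via_circulant = matvec(circulant(theta), x)

print(conv)                     # [5, 8, 11, 6]
print(conv == via_circulant)    # True
```

Swapping the roles of `x` and `theta` gives the same vector, which reflects the commutativity of circulant matrices stated in the lemma.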


___

__Remark__ $\quad$ This leads to an alternate, equivalent definition of convolution as a translation-equivariant linear operation. Moreover, by replacing translations with a more general group $G$, one can generalize convolution to settings whose domain has symmetries other than translational ones. 
___ 

## __2.2 $\quad$ A simple CNN__

We consider possibly the simplest neural network that we can construct through the above blueprint. Suppose we have a binary classification problem. Let $\textsf{H}_1$ denote the hypothesis space of functions $f : \mathcal{X}( C_n, \mathbb{R}) \to \{0,1\}$ of the form 

$$
f = A \circ P \circ \mathbf{a} \circ B \,,
$$
where the components of $f$ are 


* $B$  : $\quad$ A $C_n$-equivariant function, to be learned. It is represented as a circulant matrix $\mathbf{C}(\theta)$, where $\theta$ is a vector $\theta \equiv (\theta_0, \dots, \theta_{n-1})$ whose entries $\theta_j$ are parameters to be learned. 

* $\mathbf{a}$ : $\quad$ We consider the ReLU activation function $a : \mathbb{R} \to \mathbb{R}_{\geq\, 0}$, defined by $a(w) = \max(0,w)$ for $w \in \mathbb{R}$. The bold-face $\mathbf{a}$ denotes the entry-wise action of this function on a given vector: for $y \equiv (\,y_0, \,\dots, \, y_{n-1} \, ) \in \mathcal{X}(C_n, \mathbb{R})$, which we imagine as the output $B(x)$ for some input signal $x$, we have $\mathbf{a} (y ) = ( \,  \max(0,y_0), \,  \dots, \, \max(0,y_{n-1}) \, )$. There are no learned parameters in this layer. 

* $P$ : $\quad$ A coarsening operator. In this case, let us say it is a _zero-padded group homomorphism_ $P : C_n \to C_{n / d }$ for some divisor $d \mid n$, and let us say that it operates on the signal through max-pooling over the pre-images of each element of $C_{n / d}$. 

* $A$ : $\quad$ A global-pooling layer. We assume this has the form of a fully-connected layer, followed by a softmax. Specifically,