# 1
> GDL1

- toc: true 
- badges: true
- comments: false
- categories: [jupyter]


Modern neural network (NN) design is built on two algorithmic principles: hierarchical feature learning (concerning the architecture of the NN), and learning by local gradient descent driven by backpropagation (concerning the learning dynamics of the NN).

An instance of training data is modeled as an element of some high-dimensional vector space, making a generic learning problem subject to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). Fortunately, most tasks of interest are not generic, inheriting regularities from the underlying low-dimensionality and structure of the physical world.

Exploiting known symmetries of a large system is a useful, classical remedy against the curse of dimensionality, and forms the basis of most physical theories. The notes [BBCV21] construct a blueprint for neural network architecture which incorporates these "physical" priors, termed _geometric priors_ throughout the notes. Importantly, this blueprint provides a unified perspective on the most successful neural network architectures.

## __1.1 categories and groups__


___

__Def__ $\quad$ A _graph_ is a pair $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set whose elements are called _vertices_. The set $\mathcal{E}$ consists of _edges_, each of which is a multi-set of exactly two vertices in $\mathcal{V}$, not necessarily distinct.


___

__Def__ $\quad$ A _directed graph_ is a pair of sets  $\mathcal{G} = (\mathcal{V}, \mathcal{A})$ of vertices and _arrows_ (or _directed edges_). An _arrow_ is an ordered pair of vertices.

___

__Def__ $\quad$ Consider an arrow $f$ of a directed graph $\mathcal{G} = ( \mathcal{V}, \mathcal{A})$, specifically $f \equiv (a,b) \in \mathcal{A}$, with $a,b \in \mathcal{V}$. The operations $\mathbb{dom}$ and $\mathbb{cod}$ act on the arrows $f \in \mathcal{A}$ via $\mathbb{dom}f = a,\, \mathbb{cod} f = b$, and are called the _domain_ operation and _codomain_ operation,  respectively. 

___

$\vdots$


Given two arrows $f$ and $g$ in some directed graph, we say that the ordered pair of arrows $(g,f)$ is a _composable pair_ if $\mathbb{dom} g = \mathbb{cod} f$. 
Going forward, let us express the relations $a = \mathbb{dom} f$ and $b = \mathbb{cod} f$ more concisely via
$$
f : a \to b \, \quad \text { or equivalently, } \, \quad a \xrightarrow[\,]{ f} b  
$$
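The notions of arrow, domain, codomain and composable pair can be sketched in a few lines of Python. This is only an illustration: the class `Arrow`, its `name` field and the helper `is_composable` are hypothetical names introduced here, and an arrow carries a label purely for readability.

```python
from typing import NamedTuple


class Arrow(NamedTuple):
    """An arrow f = (a, b) of a directed graph, with a label for readability."""
    name: str
    dom: str  # domain: the vertex the arrow leaves
    cod: str  # codomain: the vertex the arrow enters


def is_composable(g: Arrow, f: Arrow) -> bool:
    """The ordered pair (g, f) is composable when dom g = cod f."""
    return g.dom == f.cod


f = Arrow("f", "a", "b")  # f : a -> b
g = Arrow("g", "b", "c")  # g : b -> c

print(is_composable(g, f))  # (g, f): dom g = b = cod f
print(is_composable(f, g))  # (f, g): dom f = a, cod g = c
```

Note that the order matters: $(g, f)$ is a composable pair here, while $(f, g)$ is not.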

The next definition formalizes the behavior of a collection of structure-respecting maps between mathematical objects. 

$\vdots$



___

__Def__ $\quad$ A _category_ is a directed graph $\mathcal{C} = (\mathcal{O},\mathcal{A})$, whose vertices $\mathcal{O}$ we call _objects_, such that 

1. For each object $a \in \mathcal{O}$, there is a unique _identity_ arrow $\textrm{id}_a \equiv \mathbf{1}_a : a \to a$, defined by the following property: for all arrows $f : b \to a$ and $g : a \to c$, composition with the identity arrow $\mathbf{1}_a$ gives
$$
\mathbf{1}_a \circ f = f \quad \text{ and } \quad g \circ \mathbf{1}_a = g
$$

2. For each composable pair $(g, f)$ of arrows, there is a unique arrow $g \circ f$ called their _composite_, with $g \circ f : \mathbb{dom} f \to \mathbb{cod} g$, such that the composition operation is associative. Namely, for given objects and arrows in the configuration $a \xrightarrow[\,]{ f} b \xrightarrow[\,]{ g} c \xrightarrow[\,]{ k} d$,
one always has the equality $k \circ (g \circ f) = (k \circ g ) \circ f$.

___

$\vdots$

Given a category $\mathcal{C} = (\mathcal{O},\mathcal{A})$, let
$$
\mathbb{hom} (b,c) := \{ \, f : f \in \mathcal{A},  \, \mathbb{dom} f = b \in \mathcal{O}, \, \mathbb{cod} f = c \in \mathcal{O} \, \}
$$
denote the set of arrows from $b$ to $c$. Henceforth, we use the terms _morphism_ and arrow interchangeably. 
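As a small sanity check of the definition, the sketch below builds a toy category and verifies the identity and associativity axioms by brute force. The choice of example is ours, not taken from the notes: objects are the numbers $0, 1, 2$, and there is exactly one arrow $(a, b)$ whenever $a \le b$. All function names are hypothetical.

```python
from itertools import product

objects = [0, 1, 2]
# One arrow (a, b) for each pair with a <= b (an assumed toy example).
arrows = [(a, b) for a, b in product(objects, repeat=2) if a <= b]


def dom(f):
    return f[0]


def cod(f):
    return f[1]


def compose(g, f):
    """Return the composite g . f of a composable pair (g, f)."""
    if dom(g) != cod(f):
        raise ValueError("not a composable pair")
    return (dom(f), cod(g))


def identity(a):
    return (a, a)


def hom(b, c):
    """The set of arrows from b to c."""
    return [f for f in arrows if dom(f) == b and cod(f) == c]


# Identity axiom: 1_{cod f} . f = f and f . 1_{dom f} = f for every arrow f.
identity_ok = all(
    compose(identity(cod(f)), f) == f and compose(f, identity(dom(f))) == f
    for f in arrows
)

# Associativity: k . (g . f) = (k . g) . f for every composable triple.
assoc_ok = all(
    compose(k, compose(g, f)) == compose(compose(k, g), f)
    for f, g, k in product(arrows, repeat=3)
    if dom(g) == cod(f) and dom(k) == cod(g)
)

print(identity_ok, assoc_ok)
print(hom(0, 2), hom(2, 0))
```

In this example every hom-set has at most one element, and `hom(2, 0)` is empty because there is no arrow from $2$ to $0$.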


Groups are collections of symmetries. A _group_ $G$ is a category $\mathcal{C} = (\mathcal{O}, \mathcal{A})$ with a single object, $\mathcal{O} = \{ o \}$ (so that we may identify $G$ with the collection of arrows $\mathcal{A}$), such that each arrow has a unique inverse: for each $g \in \mathcal{A}$, there is a unique arrow $h$ such that $g \circ h = h \circ g = \mathbf{1}_o$. 
          
Each arrow $g \in \mathcal{A}$ thus has $\mathbb{dom} g = \mathbb{cod} g = o$. As remarked, the arrows $g \in \mathcal{A}$ correspond to group elements $g \in G$. The categorical interpretation suggests that the group _acts_ on some abstract object $o \in \mathcal{O}$. In the present context, we care how groups act on data, and how this action is represented to a computer. 
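To make the one-object picture concrete, here is a minimal sketch using the cyclic group $\mathbb{Z}/5\mathbb{Z}$ (our choice of example). Every arrow $g : o \to o$ is labeled by a residue, composition is addition modulo $5$, and we check by enumeration that each arrow has exactly one inverse. The names `compose` and `inverses` are hypothetical.

```python
n = 5
arrows = range(n)  # each arrow g : o -> o is labeled by a residue mod n
identity = 0       # the identity arrow 1_o


def compose(g, h):
    """Composition of arrows, here addition modulo n."""
    return (g + h) % n


# For each g, collect every h with g . h = h . g = 1_o.
inverses = {
    g: [h for h in arrows
        if compose(g, h) == identity and compose(h, g) == identity]
    for g in arrows
}

print(inverses)
```

Each list in `inverses` has exactly one entry, which is the uniqueness of inverses in the definition above.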

### $\quad$ *__group representations__*

Linear representation theory allows us to study groups using linear algebra ([a source](https://wlou.blog/2018/06/22/a-first-impression-of-group-representations/)). We start by considering a function $\varphi : G \times V \to V$, where $G$ is a group, and where $V$ is a vector space over $\mathbb{R}$. This allows us to identify group elements $g$ with functions $\varphi(g, \cdot) : V \to V$ from the vector space to itself. When the map $\varphi$ is understood, or general (as now), we write $g.v$ in place of $\varphi(g,v)$, and we write $(g.)$ in place of $\varphi(g, \cdot)$. 

The "representatives" $(g.)$ of these group elements $g$ can be composed, and if this compositional structure is compatible with the original group operation, we say $\varphi$ is a _group action_ on $V$. Specifically, $\varphi$ should satisfy $e.v = v$ for all $v \in V$, where $e$ denotes the identity element of $G$, and $(gh).v  = g.(h.v)$ for all $g, h \in G$ and all $v \in V$. 

The map $\varphi$ is _$\mathbb{R}$-linear_ if it is compatible with the $\mathbb{R}$-vector space structure on $V$, i.e. additive and homogeneous. That is, for all $g \in G$, all $v,w \in V$ and all scalars $\lambda \in \mathbb{R}$, one has $g.(v+w) = g.v + g.w$ and $g.(\lambda v) = \lambda (g.v)$. 

$\vdots$

___

__Def__ $\quad$ 
An _$\mathbb{R}$-linear representation_ of a group $G$ over an $\mathbb{R}$-vector space $V$ is an $\mathbb{R}$-linear group action on $V$.
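A minimal numerical sketch of this definition, assuming NumPy and taking $G = \mathbb{Z}/4\mathbb{Z}$ acting on $V = \mathbb{R}^4$ by cyclic shifts of coordinates (an example chosen here for illustration). The function `rho` is a hypothetical name for the map $g \mapsto (g.)$, realized as a matrix.

```python
import numpy as np

n = 4


def rho(g: int) -> np.ndarray:
    """Matrix of the shift by g: sends basis vector e_j to e_{(j + g) mod n}."""
    return np.roll(np.eye(n, dtype=int), shift=g, axis=0)


# Group action axioms: e.v = v and (gh).v = g.(h.v), checked on matrices.
identity_ok = np.array_equal(rho(0), np.eye(n, dtype=int))
compat_ok = all(
    np.array_equal(rho((g + h) % n), rho(g) @ rho(h))
    for g in range(n)
    for h in range(n)
)

# Linearity holds automatically, since each g acts by a matrix.
v = np.array([1, 2, 3, 4])
w = np.array([0, 1, 0, 1])
linear_ok = np.array_equal(rho(1) @ (2 * v + w), 2 * (rho(1) @ v) + rho(1) @ w)

print(identity_ok, compat_ok, linear_ok)
```

Representing each group element by a matrix is what lets us study the group with linear algebra: the group operation becomes matrix multiplication.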

___

$\vdots$

The next example illustrates how linear group representations arise naturally when considering group actions on data. As mentioned, we consider input data as members of some vector space $V$, which we may assume to be finite dimensional for any practical discussion. Specifically, we consider some finite, discrete domain $\Omega$, which may also have the structure of an undirected graph. 

A _signal_ over $\Omega$ is a function $x : \Omega \to \mathbb{R}^s$, where $s$ is the number of _channels_. The vector space $\mathcal{X}(\Omega,\mathbb{R}^s)$ is defined to be the collection of all such signals, for given $\Omega$ and $s$.

$\vdots$

___
___

*__Example__* $\quad$  Consider, for some $n \in \mathbb{N}$, a signal domain $\Omega = \mathbb{T}_n^2$, where $\mathbb{T}_n^2$ denotes the two-dimensional discrete torus of side-length $n$, namely $( \mathbb{Z} / n\mathbb{Z} )^2$. This domain has natural graph as well as group structures. 

If we imagine each vertex of $\Omega$ to be a pixel, we can express an $n \times n$-pixel color (RGB) image as a signal $x : \Omega \to \mathbb{R}^3$, with the first, second and third coordinates of $\mathbb{R}^3$ reporting R, G and B values of a given pixel. 

We make two observations: 

1. As a vector space, $\mathcal{X}(\Omega)$ is isomorphic to $\mathbb{R}^d$, with $d$ typically very large. In the above example, $d = 3n^2$, which is $30{,}000$ for a $100 \times 100$ pixel image ($n = 100$). 

2. Any group action on $\Omega$ induces a group action on $\mathcal{X}(\Omega)$. 
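The first observation is easy to see numerically. The sketch below, assuming NumPy, stores an RGB image as an array of shape $(n, n, 3)$, so that `x[u1, u2]` is the value of the signal at the pixel $u = (u_1, u_2)$, and then flattens it into a single vector of $\mathbb{R}^d$. The pixel values are random placeholders, not real image data.

```python
import numpy as np

n, s = 100, 3  # side length of the torus and number of channels (RGB)
rng = np.random.default_rng(0)

# A signal x : Omega -> R^3, stored as an (n, n, 3) array of placeholder values.
x = rng.random((n, n, s))

# The identification of X(Omega) with R^d is just a flattening of the array.
flat = x.reshape(-1)
d = flat.size

print(x[0, 0])  # the three channel values at pixel (0, 0)
print(d)        # equals 3 * n**2
```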


Expanding on the latter, consider a group action of $G$ on domain $\Omega$. As the torus $\Omega$ already has group structure, it is natural to think of it acting on itself through translations, i.e. we now additionally consider $G = \mathbb{T}_n^2$. 

The action of $G \equiv \mathbb{T}_n^2$ on itself $\Omega \equiv \mathbb{T}_n^2$ induces a $G$-action on $\mathcal{X}(\Omega)$ as follows: for $g \in G$ and a signal $x \in \mathcal{X}(\Omega)$, the action $(g, x) \mapsto \mathbf{g}.x \in \mathcal{X}(\Omega)$ is defined pointwise at each $u \in \Omega$:
$$
(\mathbf{g}.x)(u) := x(g.u),
$$
where the bold $(\mathbf{g}.)$ is used to distinguish the action on signals from the action on the domain. 
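The induced action can be sketched with NumPy's `np.roll`, which shifts an array cyclically and so implements translation on the discrete torus. Here we take the action of $g$ on the domain to be $g.u = u + g$ (coordinates modulo $n$), so that $(\mathbf{g}.x)(u) = x(u + g)$. The function name `act` is hypothetical, and the signals are random placeholders. Because translations of the torus commute, this convention does satisfy the group action axioms.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.random((n, n, 3))  # placeholder signals on the n x n torus
y = rng.random((n, n, 3))


def act(g, x):
    """(g.x)(u) = x(u + g), with u + g computed modulo n in each coordinate."""
    g1, g2 = g
    # np.roll with shift k gives result[i] = x[i - k], hence the minus signs.
    return np.roll(x, shift=(-g1, -g2), axis=(0, 1))


g, h = (1, 2), (3, 7)
gh = ((g[0] + h[0]) % n, (g[1] + h[1]) % n)  # group operation on the torus

pointwise_ok = np.array_equal(act(g, x)[0, 0], x[1, 2])       # (g.x)(0,0) = x(g)
identity_ok = np.array_equal(act((0, 0), x), x)               # e.x = x
compat_ok = np.array_equal(act(g, act(h, x)), act(gh, x))     # g.(h.x) = (gh).x
linear_ok = np.allclose(act(g, 2.0 * x + y), 2.0 * act(g, x) + act(g, y))

print(pointwise_ok, identity_ok, compat_ok, linear_ok)
```

The last check is the linearity of the action on signals: translating a linear combination of images gives the same linear combination of the translated images.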
___
___

$\vdots$

To summarize: any $G$-action on the domain $\Omega$ induces an $\mathbb{R}$-linear representation of $G$ over the vector space of signals on $\Omega$.

$\vdots$

___
___

*__Example__* $\quad$  It is standard practice to encode the collection of classes associated to some ML classification problem as an orthonormal basis. These are given (to the computer) in the usual coordinate basis 
$$
e_1 \equiv (1, 0, \dots, 0),\, e_2 \equiv (0,1,\dots, 0),\, \dots, \, e_n \equiv (0,\dots, 0,1) \,,
$$
hence the nomenclature _one-hot_. In the preceding example, if one considers a one-hot encoding of the $n^2$ vertices of $\mathbb{T}_n^2$, we see that each (single-channel) signal is expressed with respect to this coordinate system, in the sense that $x = \sum_{j=1}^{n^2} x_j e_j$, where $x_j$ is the value of the signal at the $j$-th vertex. 

This kind of encoding is useful for considering general symmetries of the domain. For instance, if permuting node labels is a relevant symmetry, the action of the symmetric group $\mathfrak{S}_n$ on $n$ nodes is naturally represented by $n \times n$ permutation matrices.
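As a hedged illustration, assuming NumPy, the sketch below builds the permutation matrix of a permutation $\sigma$ of $\{0, \dots, n-1\}$ (zero-based for convenience) and checks that composition of permutations corresponds to matrix multiplication. The names `perm_matrix` and `compose` are hypothetical.

```python
from itertools import permutations

import numpy as np


def perm_matrix(sigma):
    """Matrix sending the one-hot vector e_j to e_{sigma[j]}."""
    n = len(sigma)
    P = np.zeros((n, n), dtype=int)
    P[list(sigma), np.arange(n)] = 1  # column j has its single 1 in row sigma[j]
    return P


def compose(sigma, tau):
    """The composite permutation: (sigma . tau)[j] = sigma[tau[j]]."""
    return tuple(sigma[t] for t in tau)


n = 3
one_hot = np.eye(n, dtype=int)  # column j is the one-hot vector e_j

# Representation property: P_{sigma . tau} = P_sigma P_tau for all pairs.
rep_ok = all(
    np.array_equal(perm_matrix(compose(s, t)), perm_matrix(s) @ perm_matrix(t))
    for s in permutations(range(n))
    for t in permutations(range(n))
)

sigma = (1, 2, 0)
print(perm_matrix(sigma) @ one_hot[:, 0])  # the image of e_0, namely e_1
print(rep_ok)
```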

___
___

$\vdots$

The following definition reformulates the notion of a signal over the nodes of some graph as _node features_. 

$\vdots$

___

__Def__ $\quad$ We say a graph $\mathcal{G} = ( \mathcal{V}, \mathcal{E} )$ is equipped with _node features_ if for each $v  \in \mathcal{V}$, one has the additional data of an $s$-dimensional vector $x(v) \in \mathbb{R}^s$, called the _features_ of node $v$. 

___


$\vdots$

The term "features" is compatible with the usage in ML, supposing that our input signal has some graph $\mathcal{G}$ as its domain. In this case, we can think of a neural network as a sequence of node-layers built "on top of" the graph $\mathcal{G}$. An input signal endows the first node layer of a NN with features, and the weights of the neural network propagate these through to node features on the nodes of the rest of the network. The features on the last layer of the network can be read off as the output of the NN function. 
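The picture of features propagating through node-layers can be sketched as follows, assuming NumPy. The layer rule used here, $X \mapsto \tanh(A X W)$ with $A$ the adjacency matrix including self-loops, is one simple choice made for illustration only and is not taken from the text above. The graph, the weights and all names (`layer`, `W1`, `W2`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Adjacency matrix of a 4-cycle with self-loops added (an assumed toy graph).
A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)

s_in, s_hidden, s_out = 3, 5, 2
X = rng.random((4, s_in))  # input signal: row v holds the features x(v)

# Random placeholder weights; in practice these would be learned.
W1 = rng.standard_normal((s_in, s_hidden))
W2 = rng.standard_normal((s_hidden, s_out))


def layer(A, X, W):
    """Assumed layer: sum features over neighbours, mix channels, apply tanh."""
    return np.tanh(A @ X @ W)


H = layer(A, X, W1)  # features on the second node-layer
Y = layer(A, H, W2)  # features on the last layer: the output of the network

print(H.shape, Y.shape)
```

Each layer keeps one feature vector per node of $\mathcal{G}$ and only changes the number of channels, which matches the picture of node-layers stacked on top of the graph.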