# Shallow neural networks

In Chapter 1 we introduced supervised learning using linear regression.
The main limiting factor is that this model can only describe linear relationships, i.e. the output is a straight-line function of the input.
What if we wanted to model more complex relationships?

We'll introduce shallow neural networks, which describe piecewise linear functions.
This method is simple, yet expressive enough to approximate arbitrarily complex relationships between multi-dimensional inputs and outputs.

## Neural network example

We first consider an example to see the main concepts that are involved in a neural network:

$\text{input: } x \in \mathbb{R}$ <br> 
$\text{output: } y \in \mathbb{R}$ <br>
$\text{parameters: } \{\phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31} \} = \phi \in \mathbb{R}^{10}$

$$\begin{aligned} y &= f[x, \phi] \\ &= \phi_0 + \phi_1 a[\textcolor{green}{\theta_{10} + \theta_{11}x}] + \phi_2a[\textcolor{gold}{\theta_{20} + \theta_{21}x}] + \phi_3a[\textcolor{pink}{\theta_{30} + \theta_{31}x}] \end{aligned}$$

### The breakdown 

1. We computed 3 linear functions from the input data.<br>
   i. $\textcolor{green}{\theta_{10} + \theta_{11}x}$ <br>
   ii. $\textcolor{gold}{\theta_{20} + \theta_{21}x}$ <br>
   iii. $\textcolor{pink}{\theta_{30} + \theta_{31}x}$
2. We then pass the three linear results through an $\textcolor{lightblue}{activation \ function}$ $a[\cdot]$. Here it is the $\textcolor{lightblue}{rectified \ linear \ unit}$ (ReLU), applied to each result:
   1. $a[i] = \text{ReLU}[i] = \begin{cases} 0, & i \lt 0 \\ i, & i \ge 0 \end{cases}$
   2. $a[ii] = \text{ReLU}[ii] = \begin{cases} 0, & ii \lt 0 \\ ii, & ii \ge 0 \end{cases}$
   3. $a[iii] = \text{ReLU}[iii] = \begin{cases} 0, & iii \lt 0 \\ iii, & iii \ge 0 \end{cases}$
3. We finally take a weighted sum of the results of these functions using $\phi_1, \ \phi_2, \ \phi_3$ and add the offset $\phi_0$.
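
The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the function names `relu` and `shallow_net` and the parameter values are made up for the example, not taken from the book.

```python
from typing import Sequence, Tuple


def relu(z: float) -> float:
    """Rectified linear unit: clip negative values to zero."""
    return z if z >= 0.0 else 0.0


def shallow_net(x: float,
                phi: Sequence[float],
                theta: Sequence[Tuple[float, float]]) -> float:
    """y = phi_0 + sum_d phi_d * a[theta_d0 + theta_d1 * x]."""
    # Step 1: the linear functions of the input (one per hidden unit).
    pre_activations = [t0 + t1 * x for (t0, t1) in theta]
    # Step 2: pass each result through the ReLU activation function.
    hidden = [relu(z) for z in pre_activations]
    # Step 3: weighted sum of the results plus the offset phi_0.
    return phi[0] + sum(p * h for p, h in zip(phi[1:], hidden))


# Hypothetical parameter values, chosen only for illustration.
phi = [-0.5, 2.0, -1.0, 0.5]                    # phi_0 .. phi_3
theta = [(-1.0, 1.0), (0.0, 1.0), (2.0, -1.0)]  # (theta_d0, theta_d1)
y = shallow_net(1.5, phi, theta)
```

Changing any entry of `phi` or `theta` gives a different member of the same family of functions.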

We can see that $y = f[x, \phi]$ represents a family of functions, where the particular member depends on the chosen parameter values $\phi$.<br>
As seen in Chapter 1, given our full training dataset $\{x_i, y_i\}_{i=1}^N$, we can define a loss function $L[\phi]$.<br>
With this we can train our model to find the parameters $\hat\phi$ that best describe the relationship between the inputs and the outputs.
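
As a rough sketch of what $L[\phi]$ could look like, the snippet below assumes the least squares loss (the sum of squared differences between predictions and targets). The choice of loss, the helper names, and the toy data are all assumptions made for illustration. The ten parameters are packed in the same order as in the parameter set above.

```python
from typing import Callable, Sequence


def least_squares_loss(model: Callable[[float, Sequence[float]], float],
                       phi: Sequence[float],
                       xs: Sequence[float],
                       ys: Sequence[float]) -> float:
    """L[phi] = sum_i (f[x_i, phi] - y_i)^2 over the training set."""
    return sum((model(x, phi) - y) ** 2 for x, y in zip(xs, ys))


def shallow_net(x: float, phi: Sequence[float]) -> float:
    """Three hidden units; phi = (phi_0..phi_3, theta_10, theta_11, ..., theta_31)."""
    out_weights, thetas = phi[:4], phi[4:]
    hidden = [max(0.0, thetas[2 * d] + thetas[2 * d + 1] * x) for d in range(3)]
    return out_weights[0] + sum(w * h for w, h in zip(out_weights[1:], hidden))


# Hypothetical toy dataset and parameters, for illustration only.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 1.5]
phi = [0.0, 1.0, -0.5, 0.0, 0.0, 1.0, -1.0, 1.0, 0.0, 0.0]
loss = least_squares_loss(shallow_net, phi, xs, ys)
```

Training then amounts to searching for the `phi` that makes this number as small as possible.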

### Small insight on ReLU

Below we can see the general ReLU function over any given scalar input:

<img src="images/chap2/ReLU_func.png" alt="ReLU Function" width="400" />

## Understanding The Neural Network

More specifically, the function $y = f[x, \phi]$ represents a family of continuous piecewise linear functions.

The following intermediate quantities are referred to as $\textcolor{lightblue}{hidden \ units}$:

$$h_1 = a[\textcolor{brown}{\theta_{10} + \theta_{11}x}]$$

$$h_2 = a[\textcolor{cyan}{\theta_{20} + \theta_{21}x}]$$

$$h_3 = a[\textcolor{grey}{\theta_{30} + \theta_{31}x}]$$


Using the hidden units we compute the output: 

$$y = \phi_0 + \phi_1\textcolor{brown}{h_1} + \phi_2\textcolor{cyan}{h_2} + \phi_3\textcolor{grey}{h_3}$$

| Description | Visualization |
|---|---|
| Each hidden unit contains a linear function $\theta_{\bullet 0} + \theta_{\bullet 1}x$, which in essence produces a line. <br> These are also referred to as the $\textcolor{lightblue}{pre-activations}$. | <div align="center"><img src="images/chap2/relu_lin1.png" alt="ReLU Function" width="500" /></div> |
| Each line is then clipped below zero by the activation function (ReLU). <br> These are also referred to as the $\textcolor{lightblue}{activations}$. | <div align="center"><img src="images/chap2/relu_act1.png" alt="ReLU Function" width="500" /></div> |
| We then weight each of the three clipped lines by its parameter $\phi_1, \phi_2, \phi_3$. <br>Individually, this is a scaling of each activation (a negative weight flips it). <br> These are the contributions to the output.| <div align="center"><img src="images/chap2/relu_lin2.png" alt="ReLU Function" width="500" /></div> |
| The three contributions are summed. The positions where the three lines cross zero become the three "joints" in the final output. <br> We add the offset $\phi_0$ to the weighted sum, which controls the overall height of the final function. | <div align="center"><img src="images/chap2/relu_res.png" alt="ReLU Function" width="300" /></div> |

I strongly advise visiting the same site as before: the entry "3.3a - 1D shallow network (ReLU)" provides great visualisations.

https://udlbook.github.io/udlfigures/
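
The joints described in the table can also be computed directly: hidden unit $d$ switches on or off where its pre-activation crosses zero, i.e. at $x = -\theta_{d0} / \theta_{d1}$ (when $\theta_{d1} \ne 0$). A minimal sketch, with a hypothetical helper name and made-up parameter values:

```python
from typing import List, Sequence, Tuple


def joint_positions(theta: Sequence[Tuple[float, float]]) -> List[float]:
    """Sorted x positions where theta_d0 + theta_d1 * x crosses zero."""
    joints = []
    for t0, t1 in theta:
        if t1 != 0.0:  # a constant pre-activation produces no joint
            joints.append(-t0 / t1)
    return sorted(joints)


# Hypothetical (theta_d0, theta_d1) pairs for three hidden units.
theta = [(-1.0, 1.0), (0.5, 1.0), (2.0, -1.0)]
joints = joint_positions(theta)
```

Between two neighbouring joints the set of active hidden units does not change, so the output is a single straight line there.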

## First Look at a Computational Graph

Another visual way to look at this function is via a computational graph.

This is a directed acyclic graph where:<br>
- The vertices represent the inputs or functions applied to inputs.<br>
- The edges represent the parameters: each edge multiplies the value it carries by its parameter, and the edges arriving at a vertex are summed.

For example we have: 


| $$h_1 = a[\textcolor{brown}{\theta_{10} + \theta_{11}x}]$$ | $$h_2 = a[\textcolor{cyan}{\theta_{20} + \theta_{21}x}]$$| $$y =\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3$$| 
|---|---|---|
|So it would be interpreted as $1 \cdot \theta_{10} + x \cdot \theta_{11}$, which is going to be the input to the activation function.| So it would be interpreted as $1 \cdot \theta_{20} + x \cdot \theta_{21}$, <br> which is going to be the input to the activation function. | This is represented by the middle vertices, each multiplied by its respective parameter.<br> All the values flow to the output: the edges arriving at the same vertex $y$ imply the summation of these edges.|

The third graph is the format generally presented in the literature.
<br>
<div align="center">

<img src="images/chap2/explicit_cg.png" alt="ReLU Function" width="350" />

<img src="images/chap2/fcannotate.png" alt="ReLU Function" width="355" />

<img src="images/chap2/implicit_cg.png" alt="ReLU Function" width="368" />

</div>




## Universal Approximation Theorem

Suppose we have $D$ hidden units where the $d^{th}$ hidden unit is: 
$$ h_d = a[\theta_{d0} + \theta_{d1}x]$$

The output is then a linear combination of the hidden units:

$$ y = \phi_0 +\sum_{d=1}^D \phi_d h_d$$

In a shallow network the number of hidden units is a measure of the $\textcolor{lightblue}{network \ capacity}$.<br> With ReLU activations, each hidden unit contributes at most one joint, so a shallow network with $D$ hidden units has at most $D$ joints and thus at most $D+1$ linear regions.<br> The more hidden units we add to the network, the more complex the functions we can approximate.


<div align="center">
<img  src="images/chap2/UnivApprxThm.png" alt="ReLU Function" width="700" />
</div>

$\text{The universal approximation theorem := } \forall \text{ continuous functions (defined on a compact subset of } \mathbb{R}^{D_i} \text{), } \exists \text{ a shallow network that can approximate this function to any specified precision}$


The proof is omitted from this discussion but can be found in the proof directory.
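
To get a feel for the statement (this is an illustration, not the proof), the sketch below builds a 1D ReLU network with $D$ hidden units that matches a target function at $D+1$ equally spaced points, then measures the worst error on a fine grid. The construction, the function names, and the choice of $\sin$ on $[0, 2\pi]$ as the target are assumptions made for this example.

```python
import math
from typing import Callable, List, Tuple


def build_interpolating_net(f: Callable[[float], float], a: float, b: float,
                            D: int) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Return (phi, theta) for a D-unit network matching f at D+1 knots in [a, b]."""
    knots = [a + (b - a) * k / D for k in range(D + 1)]
    slopes = [(f(knots[k]) - f(knots[k - 1])) / (knots[k] - knots[k - 1])
              for k in range(1, D + 1)]
    phi = [f(a)]                        # phi_0: value at the left end
    theta = []
    previous_slope = 0.0
    for d in range(D):
        theta.append((-knots[d], 1.0))  # unit d switches on at knot d
        phi.append(slopes[d] - previous_slope)  # change of slope at that joint
        previous_slope = slopes[d]
    return phi, theta


def shallow_net(x: float, phi: List[float],
                theta: List[Tuple[float, float]]) -> float:
    """y = phi_0 + sum_d phi_d * ReLU[theta_d0 + theta_d1 * x]."""
    return phi[0] + sum(p * max(0.0, t0 + t1 * x)
                        for p, (t0, t1) in zip(phi[1:], theta))


def max_error(f: Callable[[float], float], a: float, b: float,
              D: int, n_test: int = 1000) -> float:
    """Largest absolute gap between the network and f on a test grid."""
    phi, theta = build_interpolating_net(f, a, b, D)
    grid = [a + (b - a) * i / n_test for i in range(n_test + 1)]
    return max(abs(shallow_net(x, phi, theta) - f(x)) for x in grid)


errors = [max_error(math.sin, 0.0, 2.0 * math.pi, D) for D in (4, 16, 64)]
```

The entries of `errors` shrink as `D` grows, which is the behaviour the theorem promises: with enough hidden units the gap can be pushed below any chosen tolerance.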

## Multivariate inputs and outputs

The universal approximation theorem also applies in higher dimensions, not just to scalar inputs and outputs.

For now we'll just discuss how to:
1.  Obtain any number of outputs **from** the (final) hidden layer.
2.  Provide multiple inputs **to** the (first) hidden layer.

Suppose we wanted 2 outputs $y_1, y_2$. <br>
Assume we have $d$ hidden units $h_1, h_2, \dots, h_d$ in the layer before the output then:

$$y_1 = \sum_{i=1}^d \phi_{i1}h_i$$
$$y_2 = \sum_{i=1}^d \phi_{i2}h_i$$

We see that we just double the number of output parameters, allocating a disjoint subset of them to each output we desire.

<div align="center">

<img src="images/chap2/multiout.png" alt="ReLU Function" width="800" />

</div>

Suppose we wanted to provide 2 inputs $x_1, x_2$ to the first hidden layer. <br>
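Assume we have $d$ hidden units $h_1, h_2, \dots, h_d$ in the first hidden layer. Each hidden unit takes a weighted sum of both inputs:

$$\sum_{i=1}^2 \theta_{ij}\,x_i, \quad j = 1,2,\dots,d$$

Adding the bias $\theta_{0j}$ and applying the activation function gives the hidden unit itself:

$$h_j = a\!\left[\theta_{0j} + \sum_{i=1}^2 \theta_{ij}\,x_i\right], \quad j = 1,2,\dots,d$$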

<div align="center">

<img src="images/chap2/multiIn.png" alt="ReLU Function" width="600" />

</div>

**NOTE: <br> In both cases the bias term is not always shown explicitly, <br> but each pre-activation and each output has an additional bias parameter in the computation.**


$_\text{multivariate := two or more quantities}$

## Shallow networks: General Case
$\text{A shallow network maps an input } x \in \mathbb{R}^{D_i} \text{ to a multi-dimensional output } y \in \mathbb{R}^{D_o} \text{ using } D \text{ hidden units } h \in \mathbb{R}^D \text{, meaning:}$
$$x = [x_1, x_2, \dots, x_{D_i}]^T \quad y = [y_1, y_2, \dots, y_{D_o}]^T \quad h = [h_1, h_2, \dots, h_D]^T$$

$\text{A hidden unit is computed as follows:}$ $$h_d = a\left[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di}x_i\right] \quad \forall d \in \{1, \dots, D\}$$

$\text{An output is computed as follows:}$ $$y_j = \phi_{j0} + \sum_{d=1}^D \phi_{jd}h_d \quad \forall j \in \{1, \dots , D_o\}$$
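
The two formulas above translate almost line for line into code. The sketch below stores one row of parameters per hidden unit and per output, with the bias first in each row. This layout, the function name, and the example values are choices made for the illustration.

```python
from typing import List, Sequence


def shallow_forward(x: Sequence[float],
                    theta: Sequence[Sequence[float]],
                    phi: Sequence[Sequence[float]]) -> List[float]:
    """General shallow network with ReLU activations.

    theta: D rows,   each [theta_d0, theta_d1, ..., theta_dDi]
    phi:   D_o rows, each [phi_j0,  phi_j1,  ..., phi_jD]
    """
    # h_d = a[theta_d0 + sum_i theta_di * x_i]
    hidden = [max(0.0, row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in theta]
    # y_j = phi_j0 + sum_d phi_jd * h_d
    return [row[0] + sum(w * h for w, h in zip(row[1:], hidden))
            for row in phi]


# Hypothetical network with D_i = 2 inputs, D = 3 hidden units, D_o = 2 outputs.
theta = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, -1.0, -1.0]]
phi = [[0.5, 1.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 2.0]]
y = shallow_forward([2.0, 3.0], theta, phi)
```

For the input `[2.0, 3.0]` the third hidden unit is clipped to zero, so only the first two units contribute to the outputs.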

**NOTE: For future reference, a shallow network will consist of:<br> $$\text{Input layer } \rightarrow \text{ hidden layer } \rightarrow \text{ output layer}$$**


## Counting the Number of Parameters
We're starting to see that these parameters are crucial to the results produced, so let's calculate the number of parameters in a shallow network.

**My approach:**
1. Draw the computational graph
2. Go layer by layer (left to right)
3. For each connection to the next layer:
   - Write the linear function feeding into one unit
   - Count parameters (weights + bias)
   - Multiply by the number of units receiving inputs <br> (the number of vertices on the other side of the connection)
4. Sum parameters across all layers
   
$\text{Consider a network that maps } x \in \mathbb{R}^{D_i} \text{ to a multi-dimensional output } y \in \mathbb{R}^{D_o} \text{ using } D \text{ hidden units, with } D_i, D_o, D \ge 2 \text{.}$

### Input Layer → Hidden Layer

Each of the $D$ hidden units computes:
$$h_d = a\left[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di}x_i\right]$$

**Parameters per hidden unit:** $D_i + 1$ (weights $\theta_{d1}, \ldots, \theta_{dD_i}$ plus bias $\theta_{d0}$)

**Total parameters:** $D \times (D_i + 1) = D \cdot D_i + D$

### Hidden Layer → Output Layer

Each of the $D_o$ outputs computes:
$$y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd}h_d$$

**Parameters per output:** $D + 1$ (weights $\phi_{j1}, \ldots, \phi_{jD}$ plus bias $\phi_{j0}$)

**Total parameters:** $D_o \times (D + 1) = D_o \cdot D + D_o$


### Total Network Parameters

$$\boxed{\text{Total} = D(D_i + 1) + D_o(D + 1) = D \cdot D_i + D + D_o \cdot D + D_o}$$

Simplifying:
$$\boxed{\text{Total} = D(D_i + D_o) + D + D_o}$$
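
As a quick sanity check of the formula, here is a small helper (the name `count_parameters` is made up for this example) that counts layer by layer, exactly as in the approach above.

```python
def count_parameters(d_in: int, d_hidden: int, d_out: int) -> int:
    """Number of parameters in a shallow network with one hidden layer."""
    # Each hidden unit has one weight per input plus a bias.
    input_to_hidden = d_hidden * (d_in + 1)
    # Each output has one weight per hidden unit plus a bias.
    hidden_to_output = d_out * (d_hidden + 1)
    return input_to_hidden + hidden_to_output


# The first example in this chapter: 1 input, 3 hidden units, 1 output.
first_example = count_parameters(1, 3, 1)
# A hypothetical larger network: 2 inputs, 3 hidden units, 2 outputs.
larger = count_parameters(2, 3, 2)
```

With one input, three hidden units, and one output the count is 10, matching $\phi \in \mathbb{R}^{10}$ from the first example.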

## Mathematical Technicality

Note that throughout this discussion we described the shallow network as a sequence of $\textcolor{lightblue}{linear \ functions}$.

**This isn't mathematically true:**

$\text{Let } \mathbb{F} \text{ be a field, let } \mathbb{V}, \mathbb{W} \text{ be two vector spaces over } \mathbb{F} \text{, and let } f : \mathbb{V} \rightarrow \mathbb{W}.$ <br>
$\text{Then } f \text{ is a linear function iff } \forall a, b \in \mathbb{V} \text{ and } \forall c \in \mathbb{F}:$

1. $f(a + b) =f(a) + f(b)$
2. $f(ca) = cf(a)$

**Our Problem: bias term**

$$f(\vec{h}) = f(h_1, h_2, h_3) = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3$$

Counter example for $\phi_0 \ne 0$ (the conditions must hold **for all** inputs and scalars, so a single failure is enough):

$$f(2 \vec{h}) = \phi_0 + 2\phi_1 h_1 + 2\phi_2 h_2 + 2\phi_3 h_3 \ne 2\phi_0 + 2\phi_1 h_1 + 2\phi_2 h_2 + 2\phi_3 h_3 = 2f(\vec{h})$$

The more correct term is an $\textcolor{lightblue}{affine \ function}$; however, "linear function" is the accepted term in the machine learning world.
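
The counter example is easy to verify numerically. The values below are arbitrary and only serve as an illustration: with a non-zero bias the scaling property fails by exactly $\phi_0$, and with a zero bias it holds.

```python
from typing import Sequence


def f(h: Sequence[float], phi: Sequence[float]) -> float:
    """f(h) = phi_0 + phi_1 * h_1 + phi_2 * h_2 + phi_3 * h_3."""
    return phi[0] + sum(p * hi for p, hi in zip(phi[1:], h))


# Arbitrary illustrative values with a non-zero bias phi_0.
phi = [1.0, 2.0, -1.0, 0.5]
h = [1.0, 2.0, 4.0]

scaled_input = f([2.0 * hi for hi in h], phi)  # f(2h)
scaled_output = 2.0 * f(h, phi)                # 2 f(h)
gap = scaled_output - scaled_input             # equals phi_0

# The same weights with the bias removed satisfy f(2h) = 2 f(h).
phi_no_bias = [0.0] + phi[1:]
holds_without_bias = (f([2.0 * hi for hi in h], phi_no_bias)
                      == 2.0 * f(h, phi_no_bias))
```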
