# Deep Neural Network

In the previous chapter we saw that the more hidden units a shallow network has, the more complex the functions it can describe. <br>
Deep networks arose because some functions would require so many hidden units in a shallow network that it becomes impractical. <br>
For the same number of parameters, a deep network can produce many more linear regions than a shallow network, which makes it more powerful.

## Composing neural networks

Suppose we have two shallow networks and want to know how to "concatenate" (compose) them.

$$\boxed{\begin{aligned}
h_1 &= a[\theta_{10}+ \theta_{11}x]\\
h_2 &= a[\theta_{20}+ \theta_{21}x]\\
h_3 &= a[\theta_{30}+ \theta_{31}x]\\
\textcolor{brown}{y}   &= \textcolor{brown}{\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3}
\end{aligned}}$$

$$\boxed{\begin{aligned}
h_1' &= a[\theta_{10}'+ \theta_{11}'y]\\
h_2' &= a[\theta_{20}'+ \theta_{21}'y]\\
h_3' &= a[\theta_{30}'+ \theta_{31}'y]\\
y'    &= \phi_0' + \phi_1' h_1' + \phi_2' h_2' + \phi_3' h_3'
\end{aligned}}$$

Notice that the output from the first shallow network is directly fed through as the input of the next shallow network.

<div align="center">
<img  src="images/chap3/concatnet.png" alt="2-Layer Net" width="700" />
</div>

The effect is that the first function, with at most 3 regions, is passed through another function that can also produce at most 3 regions, <br> so we obtain up to $3 \times 3 = 9$ regions in total, which is already more than a function with 6 regions.
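
As a rough illustration of the $3 \times 3 = 9$ count, the sketch below composes a three-unit ReLU network with itself and counts linear regions on $[0, 1]$. The zig-zag parameter values, the reuse of one parameter set for both networks, and the helper names (`shallow`, `count_regions`) are assumptions chosen for this example, not values taken from the figure.

```python
from fractions import Fraction


def relu(z):
    return max(0, z)


def shallow(x, theta, phi):
    """Shallow network: theta holds (offset, slope) pairs, phi = [phi_0, phi_1, ...]."""
    hidden = [relu(t0 + t1 * x) for t0, t1 in theta]
    return phi[0] + sum(p * h for p, h in zip(phi[1:], hidden))


# Hypothetical parameters: a zig-zag that maps [0, 1] onto [0, 1] three times.
THETA = [(Fraction(0), 1), (Fraction(-1, 3), 1), (Fraction(-2, 3), 1)]
PHI = [0, 3, -6, 6]


def first(x):
    return shallow(x, THETA, PHI)


def composed(x):
    # The output y of the first network is the input of the second one.
    return shallow(first(x), THETA, PHI)


def count_regions(f, steps=90):
    """Count linear regions of f on [0, 1] using exact slopes on a uniform grid.

    Assumes every kink of f lies on a grid point (true here: kinks at k/9).
    """
    xs = [Fraction(k, steps) for k in range(steps + 1)]
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(xs, xs[1:])]
    return 1 + sum(s != t for s, t in zip(slopes, slopes[1:]))


print(count_regions(first), count_regions(composed))
```

Exact `Fraction` arithmetic is used so that slope comparisons are not affected by floating-point rounding.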

## Connecting to Deep Networks

The concatenation technique is actually a special case of a deep network with two hidden layers.

Returning to the example above, we can expand the activation functions of the second network by substituting the expression for $y$:

$$h_1' = a[\theta_{10}'+ \theta_{11}'\textcolor{brown}{y}] = a[\theta_{10}' + \theta_{11}'\textcolor{brown}{(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)}] = a[\textcolor{cyan}{\theta_{10}' + \theta_{11}'\phi_0} + \textcolor{royalblue}{\theta_{11}'\phi_1 }h_1 + \textcolor{gold}{\theta_{11}'\phi_2 }h_2 + \textcolor{pink}{\theta_{11}'\phi_3} h_3]$$

$$h_2' = a[\theta_{20}'+ \theta_{21}'\textcolor{brown}{y}] = a[\theta_{20}' + \theta_{21}'\textcolor{brown}{(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)}] = a[\textcolor{cyan}{\theta_{20}' + \theta_{21}'\phi_0 } + \textcolor{royalblue}{\theta_{21}'\phi_1} h_1 + \textcolor{gold}{\theta_{21}'\phi_2 }h_2 + \textcolor{pink}{\theta_{21}'\phi_3} h_3]$$

$$
h_3' = \underbrace{a[\theta_{30}'+ \theta_{31}'\textcolor{brown}{y}]}_{\text{Feedback Form}} 
= \underbrace{a[\theta_{30}' + \theta_{31}'\textcolor{brown}{(\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3)}]}_{\text{Substitute } y} 
= \underbrace{a[\textcolor{cyan}{\theta_{30}' + \theta_{31}'\phi_0} + \textcolor{royalblue}{ \theta_{31}'\phi_1 }h_1 + \textcolor{gold}{\theta_{31}'\phi_2} h_2 + \textcolor{pink}{\theta_{31}'\phi_3} h_3]}_{\text{Group by Terms}}
$$

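Then we can rename the grouped coefficients, writing $\psi_{j0} = \theta_{j0}' + \theta_{j1}'\phi_0$ and $\psi_{jk} = \theta_{j1}'\phi_k$ for $j, k \in \{1, 2, 3\}$: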
$$h_1' = a[\textcolor{cyan}{\psi_{10}} + \textcolor{royalblue}{\psi_{11}} h_1 + \textcolor{gold}{\psi_{12}} h_2 + \textcolor{pink}{\psi_{13}} h_3]$$

$$h_2' = a[\textcolor{cyan}{\psi_{20}} + \textcolor{royalblue}{\psi_{21}} h_1 + \textcolor{gold}{\psi_{22}} h_2 + \textcolor{pink}{\psi_{23}} h_3]$$

$$h_3' = a[\textcolor{cyan}{\psi_{30} }+ \textcolor{royalblue}{\psi_{31}} h_1 + \textcolor{gold}{\psi_{32}} h_2 + \textcolor{pink}{\psi_{33}} h_3]$$

Finally, as before, the output is a linear function of the second-layer hidden units:

$$y' = \phi_0' + \phi_1' h_1' + \phi_2' h_2' + \phi_3' h_3'$$

This two-layer network represents a broader family of functions. In the version written with $\psi$, the nine slope parameters $\psi_{11}, \psi_{12}, \psi_{13}, \dots, \psi_{33}$ can take any values, whereas in the composed version they are constrained to be the outer product

$$\left[\theta_{11}', \ \theta_{21}', \ \theta_{31}'\right]^T \left[\phi_1, \ \phi_2, \ \phi_3 \right] = \begin{bmatrix} \theta_{11}'\phi_1 & \theta_{11}'\phi_2 & \theta_{11}'\phi_3 \\ \theta_{21}'\phi_1 & \theta_{21}'\phi_2 & \theta_{21}'\phi_3 \\ \theta_{31}'\phi_1 & \theta_{31}'\phi_2 & \theta_{31}'\phi_3 \end{bmatrix}$$

which is a $3 \times 3$ matrix.

Note that each row of this matrix is a scalar multiple of every other row, so there is only a single linearly independent row and the matrix has rank at most 1.
In contrast, the unconstrained $\psi$ matrix allows all 9 parameters to vary independently, so it can have rank up to 3, which makes the network more expressive and thus more powerful.
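
The rank argument can be checked numerically. The sketch below builds the outer-product matrix from made-up slope values and compares it with an arbitrary unconstrained matrix; the specific numbers and the helper names (`outer`, `has_rank_at_most_one`, `det3`) are assumptions for illustration only.

```python
from itertools import combinations


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def has_rank_at_most_one(m):
    """A matrix has rank <= 1 exactly when every 2x2 minor is zero."""
    rows, cols = range(len(m)), range(len(m[0]))
    return all(
        m[i][k] * m[j][l] - m[i][l] * m[j][k] == 0
        for i, j in combinations(rows, 2)
        for k, l in combinations(cols, 2)
    )


def det3(m):
    """Determinant of a 3x3 matrix (non-zero means rank 3)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical slopes theta'_{j1} of the second network and phi_k of the first.
theta_prime = [2, -1, 3]
phi = [1, 4, -2]
psi_composed = outer(theta_prime, phi)

# Hypothetical unconstrained psi matrix of a general two-layer network.
psi_free = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]

print(has_rank_at_most_one(psi_composed), det3(psi_composed))
print(has_rank_at_most_one(psi_free), det3(psi_free))
```

Integer entries keep the arithmetic exact, so the zero tests do not need a tolerance.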

#### Another interpretation of deep network construction

1. The three hidden units $h_1, h_2, h_3$ of the first layer are computed as before: form three linear functions of the input and pass each one through the activation function.
2. Form three new linear functions of these hidden units. These are the pre-activations of the **second layer**, and from here on the structure is that of a shallow network whose inputs are $h_1, h_2, h_3$.
3. Apply the activation function (ReLU) to each of these second-layer pre-activations to obtain $h_1', h_2', h_3'$.
4. Take a weighted sum of the second-layer activations, plus an offset, to obtain the output $y'$.

If one really wanted to, the whole network could be written as a single expression:

$$\begin{aligned}
y' &= \phi_0' + \phi_1' a\left[\psi_{10} + \psi_{11}a[\theta_{10} + \theta_{11}x] + \psi_{12}a[\theta_{20} + \theta_{21}x] + \psi_{13}a[\theta_{30} + \theta_{31}x]\right]\\
&\quad + \phi_2' a\left[\psi_{20} + \psi_{21}a[\theta_{10} + \theta_{11}x] + \psi_{22}a[\theta_{20} + \theta_{21}x] + \psi_{23}a[\theta_{30} + \theta_{31}x]\right]\\
&\quad + \phi_3' a\left[\psi_{30} + \psi_{31}a[\theta_{10} + \theta_{11}x] + \psi_{32}a[\theta_{20} + \theta_{21}x] + \psi_{33}a[\theta_{30} + \theta_{31}x]\right]
\end{aligned}$$

However, this form obscures many of the insights above.

### Hyperparameters and Terminology

This construction extends to deeper and wider networks, so there are a few general terms to keep in mind:

$\textcolor{lightblue}{Width \ of \ a \ network :=} \text{The number of hidden units in each layer}$

$\textcolor{lightblue}{Depth \ of \ a \ network :=} \text{The number of hidden layers}$

$\textcolor{lightblue}{Capacity \ of \ a \ network :=} \text{The total number of hidden units}$

Let $K$ denote the depth of the network and $D_1, D_2, \dots, D_K$ the width of each layer.

The above quantities are referred to as $\textcolor{lightblue}{hyperparameters}$: they are chosen before we learn the model parameters.<br> Once the hyperparameters are fixed, the model describes a family of functions, <br> and the values of the parameters select a particular function within this family.


## Matrix Notation and General Formulation

Deep neural networks need to handle complicated relations between the provided inputs and the desired outputs,<br>
so in practice we will see large multi-layered networks. Up to now we have written every equation out in full to build intuition for shallow and deep networks. <br> However, writing out all the linear functions and activations quickly becomes tedious and cumbersome.<br>
We therefore convert the scalar multiplications and summations into vector and matrix operations.

To begin, let's rewrite the above formulation in matrix/vector notation:


$$ \begin{bmatrix} h_1 \\ h_2 \\ h_3\end{bmatrix} = a \left[ \begin{bmatrix} \theta_{10} \\ \theta_{20} \\ \theta_{30} \end{bmatrix} + \begin{bmatrix} \theta_{11} \\ \theta_{21} \\ \theta_{31} \end{bmatrix}x \right]$$

$$ \Downarrow $$

$$ 
\begin{bmatrix} h_1' \\ h_2' \\ h_3' \end{bmatrix} = a \left[ 
\begin{bmatrix} \textcolor{cyan}{\psi_{10}} \\ \textcolor{cyan}{\psi_{20}} \\ \textcolor{cyan}{\psi_{30}} \end{bmatrix} + 
\begin{bmatrix} 
\textcolor{royalblue}{\psi_{11}} & \textcolor{gold}{\psi_{12}} & \textcolor{pink}{\psi_{13}} \\ 
\textcolor{royalblue}{\psi_{21}} & \textcolor{gold}{\psi_{22}} & \textcolor{pink}{\psi_{23}} \\ 
\textcolor{royalblue}{\psi_{31}} & \textcolor{gold}{\psi_{32}} & \textcolor{pink}{\psi_{33}} 
\end{bmatrix} 
\begin{bmatrix} h_1 \\ h_2 \\ h_3\end{bmatrix} 
\right]
$$

$$ \Downarrow $$
$$y' = \phi_0' + \left[ \phi_1' \quad \phi_2' \quad \phi_3' \right] \begin{bmatrix} h_1' \\ h_2' \\ h_3' \end{bmatrix}$$


Let $h_k$ be the vector of hidden units at hidden layer $k$ (not to be confused with the scalar $h_k$ above, which is a single hidden unit).

Let $\beta_k$ be the bias vector that is added when computing layer $k+1$ from the outputs of layer $k$.

Let $\Omega_k$ be the matrix of weights (slopes) applied to the outputs of layer $k$.

With this notation we can describe the general process of a deep neural network.

$$\begin{align}
h_1 &= a\left[\beta_0 + \Omega_0 x\right] \\ 
h_2 &= a\left[\beta_1 + \Omega_1 h_1\right] \\
h_3 &= a\left[\beta_2 + \Omega_2 h_2\right] \\
\dots &= \dots \\
h_K &= a\left[\beta_{K-1} + \Omega_{K-1} h_{K-1}\right] \\
y &= \beta_K + \Omega_K h_K
\end{align}$$

<div align="center">
<img  src="images/chap3/fcNet.png" alt="ReLU Function" width="700" />
</div>

#### Dimensionality matching

Because we now work with vectors and matrices, tracking input/output dimensions across layers is crucial for training, fitting the model, and implementing these networks in code.

Our parameters can be summarised concisely as $\phi = \left\{\beta_{k}, \Omega_{k}\right\}_{k=0}^K$. Let $D_i$ be the input dimension and $D_o$ the output dimension.

If layer $k$ has $D_k$ hidden units, then the bias vector $\beta_{k-1} \in \mathbb{R}^{D_k}$ and the weight matrix $\Omega_{k-1} \in \mathbb{R}^{D_k \times D_{k-1}}$.

The last bias vector $\beta_K \in \mathbb{R}^{D_o}$

The first weight matrix $\Omega_{0} \in \mathbb{R}^{D_1 \times D_i}$

The last weight matrix $\Omega_{K} \in \mathbb{R}^{D_o \times D_K}$
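
The shape rules above can be turned into a small bookkeeping helper. This is a minimal sketch: the function name `layer_shapes` and the example sizes are hypothetical, and shapes are written as plain tuples rather than tied to any array library.

```python
def layer_shapes(d_in, widths, d_out):
    """Return [(beta_k shape, Omega_k shape)] for k = 0..K.

    d_in   : input dimension D_i
    widths : [D_1, ..., D_K], the width of each hidden layer
    d_out  : output dimension D_o
    """
    dims = [d_in] + list(widths) + [d_out]
    # Omega_k maps a vector of size dims[k] to one of size dims[k + 1].
    return [((rows,), (rows, cols)) for cols, rows in zip(dims, dims[1:])]


# Hypothetical network: D_i = 2, hidden widths D_1 = 4 and D_2 = 3, D_o = 1.
for k, (beta_shape, omega_shape) in enumerate(layer_shapes(2, [4, 3], 1)):
    print(f"beta_{k}: {beta_shape}, Omega_{k}: {omega_shape}")
```

Each matrix has as many columns as the previous layer has units and as many rows as the next layer, which is exactly what makes the products $\Omega_k h_k$ well defined.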

The whole network can be written as a single nested function:

$$
y = \beta_K + \Omega_K\, a\Big[\,\beta_{K-1} + \Omega_{K-1} a\Big[\, \dots\, \beta_2 + \Omega_2 a\big[\, \beta_1 + \Omega_1 a[\, \beta_0 + \Omega_0 x \,] \big] \dots \Big] \Big]
$$
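
In code, the nested expression is usually evaluated as a loop over layers. The sketch below assumes ReLU for the activation $a$ and uses nested Python lists for the vectors and matrices; the function names and the tiny example parameters are made up for illustration.

```python
def relu(v):
    """Apply ReLU to every entry of a vector."""
    return [max(0.0, z) for z in v]


def affine(beta, omega, h):
    """Compute beta + Omega h for a bias vector and a weight matrix (list of rows)."""
    return [b + sum(w * x for w, x in zip(row, h)) for b, row in zip(beta, omega)]


def forward(x, betas, omegas):
    """Evaluate y = beta_K + Omega_K a[ ... a[beta_0 + Omega_0 x] ... ]."""
    h = x
    # Hidden layers: h_{k+1} = a[beta_k + Omega_k h_k] for k = 0..K-1.
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = relu(affine(beta, omega, h))
    # Output layer: linear, with no activation.
    return affine(betas[-1], omegas[-1], h)


# Hypothetical network with D_i = 1, D_1 = D_2 = 2, D_o = 1.
betas = [[0.0, -1.0], [0.0, 0.0], [1.0]]
omegas = [[[1.0], [2.0]], [[1.0, -1.0], [1.0, 1.0]], [[1.0, 1.0]]]

print(forward([1.0], betas, omegas))
```

For the input `[1.0]` the hidden layers evaluate to `[1.0, 1.0]` and `[0.0, 2.0]`, so the output is `1 + 0 + 2 = 3`.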


## Number of linear regions per parameter

#### Shallow Network

- A shallow network with one input, one output, and $D > 2$ hidden units can create up to $D+1$ linear regions: each hidden unit can bend the line in at most one place, so $D$ hidden units give at most $D$ bends and therefore at most $D+1$ linear regions.
- This shallow network has $3D + 1$ parameters.

#### Deep Network

- A deep network with one input, one output, and $K$ layers of $D > 2$ hidden units can create a function with up to $(D+1)^K$ linear regions:
    - The shallow network is the base case for a deep network: $N_1 = D+1$
  $$\begin{align}N_K &= (D+1)N_{K-1} \\ &= (D+1)(D+1)N_{K-2}  \\ &= \quad \vdots  \\N_K &= (D+1)^K \end{align} $$
- This deep network has $3D + 1 + (K-1)D(D+1)$ parameters.
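
To get a feel for these counts, the sketch below evaluates the parameter and region formulas from the two lists and compares a deep network with a shallow network of roughly the same parameter budget. The function names and the choice $D = 10$, $K = 5$ are assumptions for this example.

```python
def shallow_params(d):
    """Parameters of a shallow network with d hidden units (one input, one output)."""
    return 3 * d + 1


def shallow_regions(d):
    """Upper bound on linear regions of that shallow network."""
    return d + 1


def deep_params(d, k):
    """Parameters of a deep network with k hidden layers of d units each."""
    return 3 * d + 1 + (k - 1) * d * (d + 1)


def deep_regions(d, k):
    """Upper bound on linear regions of that deep network."""
    return (d + 1) ** k


# Hypothetical comparison: D = 10 units per layer, K = 5 layers.
budget = deep_params(10, 5)
# Largest shallow network that fits within the same parameter budget.
d_shallow = (budget - 1) // 3

print(budget, deep_regions(10, 5))
print(shallow_params(d_shallow), shallow_regions(d_shallow))
```

With $K = 1$ the deep formulas reduce to the shallow ones, which is a quick sanity check on the parameter count.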

### Discrete Neural Lookup

Below is an example of constructing a neural network that can **perfectly memorize and retrieve** a set of $N$ unique binary input vectors $\mathbf{x}_i \in \{0,1\}^d$ and their associated targets $y_i \in \mathbb{R}$.<br>This construction demonstrates how a simple two-layer network can act as a lookup table, outputting the correct $y_k$ for any input $\mathbf{x}_k$ from the training set.

**Algorithm: Discrete Neural Lookup (DNL)**

Given $N$ unique inputs $\mathbf{x}_i \in \{0, 1\}^d$ and targets $y_i \in \mathbb{R}$:

1. **Mapping Function ($\phi$):**  
   $\phi(\mathbf{x}) = 2\mathbf{x} - \mathbf{1}$, mapping $\{0,1\}^d \to \{-1,1\}^d$

2. **Layer 1 (Detection):**  
   Define weight matrix $\mathbf{W} \in \mathbb{R}^{N \times d}$ and bias $\mathbf{b} \in \mathbb{R}^N$: <br>
   $
   \mathbf{W} = \begin{bmatrix} \phi(\mathbf{x}_1)^T \\ \vdots \\ \phi(\mathbf{x}_N)^T \end{bmatrix}, \quad
   \mathbf{b} = \begin{bmatrix} 1-d \\ \vdots \\ 1-d \end{bmatrix}
   $
   <br>
   <br>
   $\mathbf{h} = \mathrm{ReLU}(\mathbf{W}\phi(\mathbf{x}) + \mathbf{b})$

3. **Layer 2 (Selection):**  
   $f(\mathbf{x}) = \sum_{i=1}^N h_i y_i$

4. **Correctness:**  
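   For input $\mathbf{x}_k$, the inner product $\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_k)$ equals $d$ when $i = k$ and is at most $d - 2$ otherwise, because at least one coordinate differs. After adding the bias $1 - d$, the pre-activation is $1$ for $i = k$ and at most $-1$ for $i \neq k$, so only $h_k = 1$ (others are $0$) and $f(\mathbf{x}_k) = y_k$.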
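
The four steps above can be sketched in a few lines of plain Python. The function names (`to_signs`, `build_lookup`) and the three example patterns are hypothetical; `to_signs` plays the role of the mapping $\phi$.

```python
def to_signs(x):
    """Mapping function phi: {0, 1}^d -> {-1, 1}^d."""
    return [2 * bit - 1 for bit in x]


def build_lookup(inputs, targets):
    """Build f(x) for N unique binary inputs and their real-valued targets."""
    d = len(inputs[0])
    weights = [to_signs(x) for x in inputs]  # rows of W are phi(x_i)^T
    biases = [1 - d] * len(inputs)           # every entry of b is 1 - d

    def f(x):
        z = to_signs(x)
        # Layer 1 (detection): h = ReLU(W phi(x) + b)
        h = [
            max(0, sum(w * zi for w, zi in zip(row, z)) + b)
            for row, b in zip(weights, biases)
        ]
        # Layer 2 (selection): weighted sum of the targets
        return sum(hi * yi for hi, yi in zip(h, targets))

    return f


# Hypothetical training set: three 3-bit patterns with arbitrary targets.
inputs = [[0, 0, 1], [1, 0, 1], [1, 1, 0]]
targets = [2.5, -1.0, 7.0]
f = build_lookup(inputs, targets)

for x, y in zip(inputs, targets):
    print(x, f(x), y)
```

For a binary input that is not in the training set, every detection unit is zero in this construction, so `f` returns `0`.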