## Deep Neural Networks

In our [last notebook](../03_ArtificialNeurons/Introduction_ActivationFunctions.ipynb), we learned how multiple inputs to a single activation function can influence the activation function value. Given a particular set of inputs, we interpreted the function's output to be the degree of confidence in a positive or negative decision; for instance, an output of `0.80` corresponds to an 80% confidence that a positive decision is the correct one for the provided data (and conversely, a 20% confidence that a negative decision is the correct one).

In this notebook, we create a new mathematical model by composing these activation functions into a network of activation functions wherein the output of several activation functions serves as weighted input into another activation function. Such models are known as _neural networks_, and are extraordinarily well-equipped to learn an extremely wide variety of classification and decision functions.

### Software Prerequisites

The following Python libraries are prerequisites to run this notebook; simply run the following code block to install them. They are also listed in the `requirements.txt` file in the root of this notebook's [GitHub repository](https://github.com/uccs-math-clinic/mc-workshops).

In [None]:
%pip install matplotlib==3.5.1 \
             numpy==1.21.5

The Python kernel must be restarted after running the above code block for the first time within a particular virtual environment. This may be accomplished by navigating to `Kernel -> Restart` in the menu bar.

With our package dependencies installed, we can run the following [boilerplate code](https://en.wikipedia.org/wiki/Boilerplate_code) in order to import the packages needed for this notebook:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

%matplotlib notebook
plt.ion()

## Introducing our Model

Earlier, we claimed that neural networks are well-suited to learn a wide variety of classification and decision functions. Theoretically, neural networks act as [universal function approximators](https://en.wikipedia.org/wiki/Universal_approximation_theorem); that is, given sufficiently many neurons within the middle (hidden) layers, neural networks are capable of approximating arbitrary continuous functions defined on compact subsets of $\mathbb{R}^n$ to any desired accuracy. This is a statement about what a network can represent; whether training actually finds such an approximation depends on the training data and the training procedure.

In practice, there exist pragmatic computational constraints upon the structure and size of neural networks which are feasible to work with. A neural network's efficacy in classification and training efficiency depend highly on both the quality of training data (i.e., how well the data actually represents the phenomenon we wish to model) and the structure of the underlying neural network. Oftentimes, it is only the latter of these over which we have any control; in fact, choosing an appropriate neural network structure for a particular data set often comprises much of the work behind learning a useful data model.

Recall that neural networks consist of weighted compositions of activation functions. In particular, a set of activation functions (or neurons) whose output serves as weighted input into another activation function is said to comprise a _layer_ of the neural network. In this notebook, we use the now-familiar logistic function as our activation function of choice, though of course [many](https://en.wikipedia.org/wiki/ReLU) [others](https://en.wikipedia.org/wiki/Gaussian_function) [exist](https://en.wikipedia.org/wiki/Heaviside_step_function).
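To make the idea of a layer concrete, here is a minimal sketch of the logistic function and of a single layer evaluated with NumPy. The function names `logistic` and `layer`, and all numeric values, are illustrative assumptions rather than code from this notebook.

```python
import numpy as np

def logistic(z):
    """Logistic activation: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """Evaluate one layer of neurons.

    inputs  : shape (n_in,)        activations arriving from the previous layer
    weights : shape (n_out, n_in)  one row of weights per neuron in this layer
    biases  : shape (n_out,)       one bias per neuron in this layer
    """
    return logistic(weights @ inputs + biases)

# Hypothetical values, chosen only for illustration.
previous_activations = np.array([0.2, 0.9])
weights = np.array([[ 1.0, -1.0],
                    [ 0.5,  0.5],
                    [-2.0,  1.0]])
biases = np.zeros(3)

activations = layer(previous_activations, weights, biases)
print(activations.shape)
```

Each row of `weights` holds the weights feeding one neuron, so a three-row matrix describes a layer of three neurons, each receiving the same two inputs.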

### Notation
As we delve into the mathematics behind training multi-layer neural networks, we will see that managing notation among layers, weights, biases, and activation function values (or more simply, _activations_) gets tedious rather quickly and is in fact quite unintuitive when encountered for the first time. To alleviate this, we will deviate slightly from the standard notation by denoting neural network layers by uppercase letters in alphabetical order (e.g., the input/first layer is denoted by $A$; for a three-layer network, the final layer is denoted $C$, etc.). In practice, these layers are denoted by natural numbers, so our notation converts to the standard one (and hence extends to arbitrary network depths) simply by exchanging each letter for its numerical position in the alphabet.

We denote weights, biases, and activation function values as follows:

- The weight applied to the $k^{th}$ output neuron from layer $A$ which contributes to the input for the $j^{th}$ neuron in layer $B$ is denoted $w_{j \leftarrow k}^B$. Furthermore, we denote the set of all weights which contribute to the input for the $j^{th}$ neuron in layer $B$ by the vector $\textbf{w}_j^B$.

- The bias added to the input for the $j^{th}$ neuron in layer $B$ is denoted $b_{j}^B$.

- The activation function value of the $k^{th}$ output neuron from layer $A$ which contributes to the input for the $j^{th}$ neuron in layer $B$ is denoted $x_{j \leftarrow k}^B$. Furthermore, we denote the set of all activation function values which contribute to the input for the $j^{th}$ neuron in layer $B$ by the vector $\textbf{x}_j^B$.

It is worth spending time becoming comfortable with this notation, as we will use it extensively.
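The index conventions above can be sketched with NumPy arrays. This is only one possible mapping, assuming a fully connected network (so every neuron in layer $B$ receives the same vector of activations from layer $A$); the array names `W_B`, `b_B`, and `x_B` and the layer sizes are hypothetical.

```python
import numpy as np

# Hypothetical sizes: layer A has 3 neurons, layer B has 2 neurons.
n_A, n_B = 3, 2
rng = np.random.default_rng(0)

# W_B[j, k] plays the role of w_{j <- k}^B: the weight applied to the output
# of neuron k in layer A on its way into neuron j in layer B.
W_B = rng.normal(size=(n_B, n_A))

# b_B[j] plays the role of b_j^B.
b_B = rng.normal(size=n_B)

# x_B[k] plays the role of x_{j <- k}^B. In a fully connected network this
# vector is the same for every j, so a single array suffices.
x_B = rng.uniform(size=n_A)

j = 0                             # zero-based index of a neuron in layer B
w_j = W_B[j]                      # the vector w_j^B
z_j = np.dot(w_j, x_B) + b_B[j]   # weighted input to neuron j, before activation
print(z_j)
```

Note that Python indices start at zero, whereas the mathematical notation counts neurons from one.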

### One Hidden Layer
We begin by training a very simple neural network to approximate $f(x) = \sin(x)$ over the interval $[0, 2\pi]$. Our neural network will consist of a single input layer (or activation function) connected to a two-neuron hidden layer which is in turn connected to a single-neuron output layer. We interpret the output of this last layer to be the predicted $y$-value for the model function. Qualitatively, our network looks like this:

![](./nn.svg)
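As a rough sketch of what training data for this problem might look like, the following samples $\sin(x)$ on an evenly spaced grid over $[0, 2\pi]$. The sample count and variable names are assumptions. The rescaling step is likewise only one possible choice and is not taken from this notebook: it is included because the logistic function only produces values in $(0, 1)$, while $\sin(x)$ ranges over $[-1, 1]$.

```python
import numpy as np

n_samples = 200  # hypothetical number of training points

# Evenly spaced inputs covering the interval [0, 2*pi], endpoints included.
x_train = np.linspace(0.0, 2.0 * np.pi, n_samples)

# Target values of the function we want the network to approximate.
y_train = np.sin(x_train)

# Assumption: map targets from [-1, 1] into [0, 1] so that a logistic
# output neuron is able to reach them.
y_scaled = (y_train + 1.0) / 2.0

print(x_train.shape, y_scaled.min(), y_scaled.max())
```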

With the notation introduced earlier, our network $N(x)$ becomes the following composition of functions:

$$
\begin{aligned}
    N(x) &= \sigma(\textbf{w}_{1}^C \cdot \textbf{x}_{1}^C + b_{1}^C) \\
         &= TODO
\end{aligned}
$$
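The composition can also be evaluated numerically. The sketch below assumes that the input layer $A$ passes $x$ through unchanged, so that each hidden activation is $x_{1 \leftarrow k}^C = \sigma(w_{k \leftarrow 1}^B \, x + b_k^B)$ for $k = 1, 2$; that assumption, the function names, and the parameter values are all illustrative rather than taken from this notebook.

```python
import numpy as np

def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def network(x, W_B, b_B, W_C, b_C):
    """Evaluate N(x) for a network with 1 input, 2 hidden neurons, 1 output.

    Assumes the input layer A passes x through unchanged.
    """
    x_B = np.array([x])               # output of layer A, shape (1,)
    x_C = logistic(W_B @ x_B + b_B)   # hidden-layer activations, shape (2,)
    return logistic(W_C @ x_C + b_C)[0]

# Hypothetical, untrained parameters.
W_B = np.array([[1.0], [-1.0]])   # W_B[j, 0] stands for w_{j <- 1}^B
b_B = np.array([0.0, 0.5])
W_C = np.array([[0.7, -0.3]])     # W_C[0, k] stands for w_{1 <- k}^C
b_C = np.array([0.1])

y_hat = network(np.pi / 2, W_B, b_B, W_C, b_C)
print(y_hat)
```

Because the final step is a logistic function, the value returned always lies strictly between 0 and 1, whatever the parameters.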