# L12a: Feed Forward Neural Networks (FNNs)

___

In this lecture, we will explore the concept of Feed Forward Neural Networks (FNNs), which are a type of artificial neural network where connections between layers of nodes do not form cycles. This is in contrast to recurrent neural networks (RNNs), where data can flow in cycles (e.g., in time series data). The key concepts discussed in this lecture are:

* __FNN Architecture__: Feedforward neural networks (FNNs) are foundational artificial neural network architectures where information flows _unidirectionally_ from input nodes through (potentially many) hidden layers of arbitrary dimension to output nodes, without cycles or feedback loops. Each node in the network is a simple processing unit that applies a linear transformation followed by a (potentially) non-linear activation function to its inputs. 
* __FNN Applications__: FNNs are widely employed for pattern recognition, classification tasks (including for non-linearly separable data), and predictive modeling, such as identifying objects in images, sentiment analysis in text, or forecasting process trends. They also work well in structured data applications like medical diagnostics (classification of patient records) and marketing (personalized recommendations). FNNs are also components of more advanced architectures in fields like computer vision and natural language processing.
* __FNN Training__: FNNs are trained using supervised learning, where the model learns to map inputs to outputs by minimizing a loss function. The most common training approach combines [backpropagation](https://en.wikipedia.org/wiki/Backpropagation), which computes the gradient of the loss with respect to the parameters, with [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), which uses that gradient to update the weights (and bias values) of the network based on the error between predicted and actual outputs.

The source(s) for this lecture can be found here:
* [John Hertz, Anders Krogh, and Richard G. Palmer. 1991. Introduction to the theory of neural computation. Addison-Wesley Longman Publishing Co., Inc., USA.](https://dl.acm.org/doi/10.5555/104000)
* [Mehlig, B. (2021). Machine Learning with Neural Networks. Chapter 5: Perceptrons and Chapter 6: Stochastic Gradient Descent](https://arxiv.org/abs/1901.05639v4)

___

## Origin story: McCulloch-Pitts Neurons
In [their 1943 paper](https://link.springer.com/article/10.1007/BF02478259), McCulloch and Pitts explored how the brain could produce highly complex patterns by using many [interconnected _basic cells (neurons)_](https://en.wikipedia.org/wiki/Biological_neuron_model). McCulloch and Pitts suggested a _highly simplified model_ of a neuron. Nevertheless, they made a foundational contribution to the development of the artificial neural networks that we find in wide use today. Let's take a look at the model of a neuron proposed by McCulloch and Pitts.

### McCulloch-Pitts Neuron
Suppose we have a neuron that takes an input vector $\mathbf{n}(t) = (n_1(t), n_2(t), \ldots, n_m(t))$, where each component $n_j(t)$ is a binary value (`0` or `1`) that represents the state of predecessor neuron $j$ at time $t$. Then, the state of our neuron (say neuron $k$) at time $t+1$ is given by:
$$
\begin{align*}
n_{k}(t+1) &= \sigma\left(\sum_{j=1}^{m} w_{kj} n_j(t) - \theta_k\right) \\
\end{align*}
$$
where $\sigma:\mathbb{R}\rightarrow\{0,1\}$ is an _activation function_ that maps the weighted sum of the inputs (minus the threshold) to a scalar binary output. In the original paper, the state of neuron $k$ at time $t+1$ is denoted as $n_k(t+1)\in\{0,1\}$, $w_{kj}$ is the weight of the connection from (predecessor) neuron $j$ to neuron $k$, and $\theta_k$ is the threshold for neuron $k$.
* _Activation function_: In the original McCulloch and Pitts model, the activation function $\sigma$ is a step function: the neuron "fires" (produces an output of `1`) if the weighted sum of its inputs is greater than or equal to the threshold $\theta_k$, and outputs `0` otherwise. This binary output is a simplification of real biological neurons, which can produce continuous outputs.
* _Parameters_: The weights $w_{kj}\in\mathbb{R}$ and the threshold $\theta_k\in\mathbb{R}$ are parameters of the neuron that determine its behavior. The weights can be positive or negative, and they represent the strength and direction of the influence of the input neurons on the output neuron. The threshold determines how much input is needed for the neuron to "fire" (i.e., produce an output of `1`).
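
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' formulation: the function names, the unit weights, and the thresholds are hypothetical choices, and the sketch assumes the convention that the neuron fires when the weighted sum is at least $\theta_k$.

```python
def step(a: float) -> int:
    """Step activation: 1 if the net input is non-negative, 0 otherwise."""
    return 1 if a >= 0.0 else 0


def mcculloch_pitts(inputs: list[int], weights: list[float], theta: float) -> int:
    """State of one neuron at time t+1 given predecessor states at time t."""
    net = sum(w * n for w, n in zip(weights, inputs)) - theta
    return step(net)


# Hypothetical parameters: unit weights with theta = 2 behave like a logical AND.
and_weights, and_theta = [1.0, 1.0], 2.0
and_table = [
    mcculloch_pitts([a, b], and_weights, and_theta) for a in (0, 1) for b in (0, 1)
]
```

With the same unit weights, lowering the threshold to `1.0` turns the neuron into a logical OR, which shows how the parameters alone determine the neuron's behavior.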

While the McCulloch-Pitts neuron model is a simplification of real biological neurons, it laid the groundwork for the development of more complex artificial neural networks. The key idea is that by combining many simple neurons in a network, we can create complex functions and learn to approximate any continuous function. This idea is at the heart of modern deep learning and neural networks. 

__Hmmmm__. These ideas _really_ seem familiar. Have we seen this before? Yes! The McCulloch-Pitts Neuron underpins [The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron), [Hopfield networks](https://en.wikipedia.org/wiki/Hopfield_network) and [Boltzmann machines](https://en.wikipedia.org/wiki/Boltzmann_machine). Wow!!

___

<img
  src="figs/nn-4.svg"
  alt="Example feedforward neural network with an input layer, a hidden layer, and an output layer"
  height="400"
  width="800" />

## Activation functions
The activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ of a neuron is a mathematical function that determines the output of the neuron based on its (scalar) net input.

The activation function takes the weighted sum of the inputs to the neuron and applies a non-linear transformation to produce the output. The choice of activation function is important because it affects the learning process and the performance of the neural network. There is a [wide variety of activation functions](https://en.wikipedia.org/wiki/Activation_function#Table_of_activation_functions), each with its own characteristics and applications. 

Some common activation functions are:

* __Sigmoid function__: The sigmoid function is a smooth, S-shaped curve that maps input values to the range (0, 1). It is defined as: $\sigma(x) = \frac{1}{1 + e^{-x}}$. The sigmoid function is often used in the output layer of binary classification problems, as it can be interpreted as a probability.
* __Hyperbolic tangent function (tanh)__: The $\texttt{tanh}$ function is similar to the sigmoid function but maps input values to the range (-1, 1). It is defined as: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. It is often used in hidden layers of neural networks.
* __Rectified Linear Unit (ReLU)__: The ReLU function is a piecewise linear function that outputs the input value if it is positive and zero otherwise. It is defined as: $\text{ReLU}(x) = \max(0, x)$. The $\texttt{ReLU}$ function helps mitigate the vanishing gradient problem, a complication in training, making it a popular choice for hidden layers in deep networks. However, it can suffer from [the dying ReLU problem](https://arxiv.org/abs/1903.06733), where neurons can become inactive and stop learning if they output zero for all inputs.
* __Softmax function__: The $\texttt{softmax}$ function is often used in the output layer of multi-class classification problems. Unlike the functions above, it acts on a whole vector of raw scores (logits) $\mathbf{z}$ and converts it into a probability distribution over multiple classes. It is defined as: $\texttt{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ where $z_i$ is the raw score for class $i$, and $K$ is the total number of classes. The softmax function ensures that the output probabilities sum to 1, making it suitable for multi-class classification tasks.
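
The four definitions above translate almost directly into code. The following is a sketch using only the Python standard library; the branch inside `sigmoid` and the max-shift inside `softmax` are common numerical safeguards added here as an assumption, and they do not change the mathematical values.

```python
import math


def sigmoid(x: float) -> float:
    """1 / (1 + exp(-x)), evaluated so that exp never receives a large positive argument."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """max(0, x): passes positive inputs through and zeroes out the rest."""
    return max(0.0, x)


def softmax(z: list[float]) -> list[float]:
    """Map a vector of raw scores to probabilities that sum to 1."""
    shift = max(z)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

Note that `sigmoid`, `tanh`, and `relu` act on one number at a time, while `softmax` needs the full score vector because of the normalizing sum.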

## Feedforward Neural Networks
Let's consider the simple network shown above. The network has three layers: an input layer (five nodes), a hidden layer (12 nodes), and an output layer (three nodes). 

* __Input Layer__: Denote the input to the network as the vector $\mathbf{x}\in\mathbb{R}^{m}$; in the example network $m = 5$. In general, the input to the network is a vector of binary or continuous valued features that represent the data we want to process. Each component of the input vector corresponds to a feature of the data. For example, in an image classification task, the input vector could represent pixel values of an image.
* __Hidden Layer__: The hidden layer performs computations on the input data. Each node in the hidden layer takes the input vector $\mathbf{x}$ and applies a linear transformation followed by a non-linear activation function. The output of each node in the hidden layer is then passed to the next layer (in this case, the output layer). The number of nodes in the hidden layer can vary depending on the complexity of the task. Furthermore, a network can have multiple hidden layers (i.e., deep neural networks) to learn more complex representations of the data. In the example network, we have a single hidden layer with 12 nodes.
* __Output Layer__: The output layer is the final layer of the network that produces the output of the network $\mathbf{y} = \left(y_{1},y_{2},\dots,y_{k}\right)$ where $\mathbf{y}\in\mathbb{R}^{k}$. Each node in the output layer takes the output from the hidden layer and applies a linear transformation followed by a (potentially) non-linear activation function. In this example, we have three output nodes. For example, this could mean that we are predicting three different classes in a multiclass classification task. In this case, the output of the network can be interpreted as probabilities for each class, and we can use techniques like softmax to convert the output into probabilities.

Let's generalize this example to a more formal definition of a feedforward neural network.

### Function Composition
Suppose we have a feedforward neural network with $L$ layers. The network has $d_{in}$ input nodes (we'll call this layer 0, so $m_{0} = d_{in}$), hidden layers $i=1,2,\dots,L-1$ where hidden layer $i$ has $m_{i}$ nodes, and an output layer (layer $L$) with $m_{L} = d_{out}$ output nodes. Each hidden layer is fully connected to the previous and next layers (but there are no connections between the nodes inside a layer, and no self connections). Information flows from the input to the output layer, forming a feedforward structure.

Let's dig into the states and parameters of the network.

* __Inputs and outputs__: Let $\mathbf{x} = \left(x_{1},x_{2},\dots,x_{d_{in}},1\right)$ be the _augmented_ input vector, where $x_{i}\in\mathbb{R}$ is the $i$-th feature of the input vector. The dimension of the _augmented_ input vector is $d_{in} + 1$, where $d_{in}$ is the number of features in the input vector. The extra `1` is added to the input vector to allow us to include the bias term in the weight vector. This is a common technique used in machine learning to simplify the representation of the model. Further, let $\mathbf{z}_{i} = \left(z^{(i)}_{1},z^{(i)}_{2},\dots,z^{(i)}_{m_{i}}\right)$ (also written $\mathbf{z}^{(i)}$) be the output vector of the $i$-th hidden layer, where $z^{(i)}_{j}\in\mathbb{R}$ is the $j$-th component of the output of layer $i$. Finally, let $\mathbf{y} = \left(y_{1},y_{2},\dots,y_{d_{out}}\right)$ be the output vector, where $y_{k}\in\mathbb{R}$ is the $k$-th component of the output of the network.

* __Parameters__: Each node $j=1,2,\dots,m_{i}$ in layer $i\geq{1}$ has a parameter vector $\mathbf{w}^{(i)}_{j} = \left(w^{(i)}_{j,1},w^{(i)}_{j,2},\dots,w^{(i)}_{j,m_{i-1}}, b^{(i)}_{j}\right)$, where $w^{(i)}_{j,k}\in\mathbb{R}$ is the weight of the $k$-th input to node $j$ in layer $i$, and $b^{(i)}_{j}\in\mathbb{R}$ is the bias term for node $j$ in layer $i$. The weight vector $\mathbf{w}^{(i)}_{j}$ represents the strength of the connection between node $j$ in layer $i$ and all nodes in layer $i-1$. The bias term $b^{(i)}_{j}$ allows the model to shift the activation function to the left or right.

This may seem a bit confusing, so let's think about this in a different way. A feedforward neural network can be thought of as a series of function compositions. For example, consider layer $1$ with $m_{1}$ nodes. The output of layer $1$ (given the input vector $\mathbf{z}_{\circ}$) is given by:

$$
\begin{align*}
\mathbf{z}^{(1)} &= \begin{bmatrix}
\sigma_{1}\left(\mathbf{z}_{\circ}^{\top}\cdot\mathbf{w}^{(1)}_{1}\right) \\
\sigma_{1}\left(\mathbf{z}_{\circ}^{\top}\cdot\mathbf{w}^{(1)}_{2}\right) \\
\vdots \\
\sigma_{1}\left(\mathbf{z}_{\circ}^{\top}\cdot\mathbf{w}^{(1)}_{m_{1}}\right)
\end{bmatrix}
\end{align*}
$$
where $\mathbf{w}^{(1)}_{j}$ is the weight vector for node $j$ in layer $1$, and $\sigma_{1}$ is the activation function for layer $1$ (assumed to be the same for nodes in layer $1$). The output of layer $1$ is a vector $\mathbf{z}^{(1)}\in\mathbb{R}^{m_{1}}$ which is then passed to layer $2$, which has $m_{2}$ nodes:
$$
\begin{align*}
\mathbf{z}^{(2)} &= \begin{bmatrix}
\sigma_{2}\left(\mathbf{z}_{1}^{\top}\cdot\mathbf{w}^{(2)}_{1}\right) \\
\sigma_{2}\left(\mathbf{z}_{1}^{\top}\cdot\mathbf{w}^{(2)}_{2}\right) \\
\vdots \\
\sigma_{2}\left(\mathbf{z}_{1}^{\top}\cdot\mathbf{w}^{(2)}_{m_{2}}\right)
\end{bmatrix}
\end{align*}
$$
where $\mathbf{w}^{(2)}_{j}$ is the weight vector for node $j$ in layer $2$, and $\sigma_{2}$ is the activation function for layer $2$ (assumed to be the same for all nodes in layer $2$). However, we can also think of the output of layer $2$ as: $\mathbf{z}^{(2)} = \sigma_{2}\circ\sigma_{1}\left(\mathbf{z}_{\circ}\right)$, where, with a slight abuse of notation, $\sigma_{i}$ now stands for the entire map of layer $i$ (the linear transformation by the layer's weights followed by the activation function), and $\sigma_{2}\circ\sigma_{1}$ is the composition of the two layer maps, i.e., $\mathbf{z}^{(2)} = \sigma_{2}\left(\sigma_{1}\left(\mathbf{z}_{\circ}\right)\right)$. Putting these ideas together gives a really nice way to think about a feedforward neural network as a series of function compositions:
$$
\begin{align*}
\hat{\mathbf{y}} &= f_{\theta}(\mathbf{x}) = \sigma_{L}\circ\sigma_{L-1}\circ\dots\circ\sigma_{1}\left(\mathbf{x}\right)
\end{align*}
$$
where $\sigma_{L}$ is the activation function for the output layer, $\mathbf{x}$ is the _augmented_ input vector, and $\hat{\mathbf{y}}\in\mathbb{R}^{d_{out}}$ is the output of the network. The function $f_{\theta}(\mathbf{x})$ represents the mapping from the input vector $\mathbf{x}=\mathbf{z}_{\circ}$ to the output vector $\hat{\mathbf{y}}$, and $\theta$ represents the parameters of the network (i.e., the weights and biases). 
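
To make the composition view concrete, here is a small sketch in which each layer is an ordinary Python function and the network is their composition. The helper names `make_layer` and `compose`, the 2-2-1 architecture, and the hand-picked weights are all hypothetical. The sketch also assumes that each layer appends the constant `1` to its own input, so the bias sits in the last column of each weight row.

```python
from functools import reduce
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(W: List[Vector], activation: Callable[[float], float]) -> Layer:
    """Build one layer map: z -> activation applied to each row of W dotted with [z, 1]."""
    def layer(z: Vector) -> Vector:
        z_aug = z + [1.0]  # constant 1 so the last entry of each row acts as the bias
        return [activation(sum(w * v for w, v in zip(row, z_aug))) for row in W]
    return layer


def compose(layers: List[Layer]) -> Layer:
    """Return layer_L o ... o layer_1; the first layer in the list is applied first."""
    return lambda x: reduce(lambda z, layer: layer(z), layers, x)


def relu(a: float) -> float:
    return max(0.0, a)


def identity(a: float) -> float:
    return a


# Hypothetical weights chosen so the network computes |x1 - x2|.
layer1 = make_layer([[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]], relu)
layer2 = make_layer([[1.0, 1.0, 0.0]], identity)
f = compose([layer1, layer2])
y_hat = f([3.0, 1.0])
```

Because `f` is just a function built from other functions, calling `layer2(layer1(x))` by hand gives the same result as `f(x)`.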

__Wow!__ A feedforward neural network $f_{\theta}:\mathbb{R}^{d_{in}}\rightarrow\mathbb{R}^{d_{out}}$ is just _some complicated function_ that takes an input vector $\mathbf{x}$ and produces an output vector $\hat{\mathbf{y}}$. Thus, we can do all the things we do with functions: we can compose them, take their derivatives, and so on. This is a really powerful idea because it allows us to use all the tools of calculus and linear algebra to analyze and optimize neural networks.

### Parameterization
Before we move on, let's take a moment to think about the parameters of the network. 

The parameters of the network are the weights and biases of each node in the network. Each layer $i$ has $m_{i}$ nodes, and each node in layer $i$ has a _weight vector_ $\mathbf{w}^{(i)}_{j} = \left(w^{(i)}_{j,1},w^{(i)}_{j,2},\dots,w^{(i)}_{j,m_{i-1}}, b^{(i)}_{j}\right)$, where $w^{(i)}_{j,k}\in\mathbb{R}$ is the weight of the $k$-th input to node $j$ in layer $i$, and $b^{(i)}_{j}\in\mathbb{R}$ is the bias term for node $j$ in layer $i$. 

We represent the parameters of layer $i$ in the matrix $\mathbf{W}_{i}\in\mathbb{R}^{m_{i}\times(m_{i-1}+1)}$. The weight matrix $\mathbf{W}_{i}$ contains the weights and biases for all nodes in layer $i$: row $j$ of $\mathbf{W}_{i}$ is the parameter vector $\mathbf{w}^{(i)}_{j}$. Consistent with the ordering of $\mathbf{w}^{(i)}_{j}$ (and with the constant `1` in the last position of the augmented input), the last column of $\mathbf{W}_{i}$ contains the bias terms for each node in layer $i$, and the remaining columns contain the weights. We can pack all the parameters into the $\theta$ vector:
$$
\begin{align*}
\theta &= \left(\mathbf{W}_{1},\mathbf{W}_{2},\dots,\mathbf{W}_{L}\right) \\
&= \left(w^{(1)}_{1,1},w^{(1)}_{1,2},\dots,w^{(1)}_{1,m_{0}}, b^{(1)}_{1}, w^{(1)}_{2,1},w^{(1)}_{2,2},\dots,w^{(1)}_{2,m_{0}}, b^{(1)}_{2}, \ldots, w^{(L)}_{k,1},w^{(L)}_{k,2},\dots,w^{(L)}_{k,m_{L-1}}, b^{(L)}_{k}\right)
\end{align*}
$$
where $k=1,2,\dots,d_{out}$ is the index of the output node and $m_{0} = d_{in}$. The parameter vector $\theta$ contains all the weights and biases for all nodes in the network.
The dimension of the parameter vector $\theta$ is $\sum_{i=1}^{L} m_{i}\cdot(m_{i-1}+1)$, where $m_{0} = d_{in}$ is the number of features in the input vector, $m_{L} = d_{out}$ is the number of output nodes, and $m_{i}$ is the number of nodes in layer $i$. For example, the network with five inputs, 12 hidden nodes, and three outputs has $12\cdot(5+1) + 3\cdot(12+1) = 111$ parameters.
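
Since each $\mathbf{W}_{i}$ has $m_{i}$ rows and $m_{i-1}+1$ columns, the number of parameters can be tallied by summing the matrix sizes layer by layer. A short sketch, where the function name `count_parameters` and the list-of-sizes input format are assumptions made for illustration:

```python
def count_parameters(layer_sizes: list[int]) -> int:
    """Total weights and biases for layer_sizes = [m_0, m_1, ..., m_L].

    Each layer i contributes an m_i x (m_{i-1} + 1) matrix: one weight per
    incoming connection plus one bias per node.
    """
    return sum(
        m_i * (m_prev + 1) for m_prev, m_i in zip(layer_sizes, layer_sizes[1:])
    )


# The example network: five inputs, 12 hidden nodes, three outputs.
n_params = count_parameters([5, 12, 3])
```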

___

## Training
Classically, the training of feedforward neural networks is done [using the _backpropagation_ algorithm](https://en.wikipedia.org/wiki/Backpropagation), which is a common _supervised learning_ algorithm based on gradient descent. Let's dig into this algorithm a bit more, consider some variations, and then discuss alternatives to this approach.

_What is Backpropagation_? Backpropagation is an algorithm for efficiently computing the gradient of the loss function of a multi-layer neural network so that the network can be trained using gradient descent. It uses the chain rule of calculus to compute the gradient of the loss function with respect to the weights and biases of the network. The algorithm consists of two main steps: 
  1. **Forward Pass**: Compute the output of the network for a given input by passing the input through each layer of the network, applying the activation function at each node. This produces an output vector $\hat{\mathbf{y}}$ for the input $\mathbf{x}$.
  2. **Backward Pass**: Compute the gradient of the loss function (which measures the difference between the true output $\mathbf{y}$ and the model predicted output $\hat{\mathbf{y}}$) with respect to each weight and bias in the network by propagating the error backward through the network using the chain rule.
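
The two steps can be illustrated on the smallest possible "deep" network: one scalar input, one hidden node, and one output node. Everything specific in this sketch is an assumption made for illustration: the sigmoid activations, the squared-error loss $\frac{1}{2}(\hat{y}-y)^{2}$, the starting parameter values, the training example, and the learning rate.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def forward(params: list[float], x: float) -> tuple[float, float]:
    """Forward pass: returns the hidden output z1 and the prediction y_hat."""
    w1, b1, w2, b2 = params
    z1 = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * z1 + b2)
    return z1, y_hat


def loss(params: list[float], x: float, y: float) -> float:
    """Squared-error loss (an assumed choice for this sketch)."""
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backward(params: list[float], x: float, y: float) -> list[float]:
    """Backward pass: gradient of the loss with respect to (w1, b1, w2, b2)."""
    _, _, w2, _ = params
    z1, y_hat = forward(params, x)
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/d(net input) at the output node
    delta1 = delta2 * w2 * z1 * (1.0 - z1)        # error propagated back to the hidden node
    return [delta1 * x, delta1, delta2 * z1, delta2]


def gradient_descent_step(params, x, y, learning_rate=0.5):
    """Move each parameter a small step against its gradient."""
    grads = backward(params, x, y)
    return [p - learning_rate * g for p, g in zip(params, grads)]


params = [0.5, -0.3, 0.8, 0.1]  # hypothetical starting values
x, y = 1.5, 1.0                 # one hypothetical training example
new_params = gradient_descent_step(params, x, y)
```

Notice that `delta1` reuses `delta2`: the error at the output is computed once and then pushed backward, which is what makes the algorithm efficient for networks with many layers.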

Suppose we have a training dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ examples, where 
$\mathbf{x}_{i}\in\mathbb{R}^{d_{in}}$ is the $i$-th feature vector, and $y_{i}\in\mathbb{R}$ is the corresponding output. The output can be a discrete label (e.g., in classification tasks), where each example has been assigned to a category $y_{i}\in\{-1,1\}$ by an expert (i.e., a human), or a continuous value $y_{i}\in\mathbb{R}$ (e.g., a real-valued measurement such as temperature or pressure) for regression tasks. 

### Forward Pass
The forward pass computes the output of the network $\hat{\mathbf{y}}$ for a given input $\mathbf{x}$ by passing it through each layer of the network. 

__Initialization__: Initialize the weights and biases of the network randomly or using some heuristic method. For example, the weights are typically initialized to small random values, and the biases can be initialized to zero or small random values. Let $\mathbf{z}_{\circ}^{\top} = \left(x_{1},x_{2},\dots,x_{d_{in}}, 1\right)$ be the _augmented input vector_, where the last component is a constant `1` that allows us to include the bias term in the weight vector. 

For each layer $i=1,2,\dots,L$ of the network, we compute the output of the layer as follows:
1. Compute the input to the activation function for each node $j=1,2,\dots,m_{i}$ in the layer: $a^{(i)}_{j} = \left(\mathbf{w}^{(i)}_{j}\right)^{\top}\cdot\mathbf{z}_{i-1}$, where $\mathbf{w}^{(i)}_{j}$ is the weight vector for node $j$ in layer $i$, and $\mathbf{z}_{i-1}$ is the output vector from the previous layer (augmented with a constant `1` for the bias term).
2. Compute the output of the activation function for each node in the layer: $\mathbf{z}_{i} = \left(\sigma_{i}(a^{(i)}_{1}),\dots,\sigma_{i}(a^{(i)}_{m_{i}})\right)$, where $\sigma_{i}:\mathbb{R}\rightarrow\mathbb{R}$ denotes the activation function for layer $i$.
3. The output of the last layer is the final model predicted output from the network: $\hat{\mathbf{y}} = \mathbf{z}_{L}$.
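
The initialization and the three steps above can be sketched as a short loop. This is an illustrative sketch only: the helper names, the uniform range $[-0.1, 0.1]$ for the initial weights, the zero biases, the `tanh` hidden activation, and the linear output layer are all assumptions. Each weight matrix is stored as a list of rows with the bias in the last position.

```python
import math
import random


def init_weights(layer_sizes, seed=0):
    """One m_i x (m_{i-1} + 1) matrix per layer: small random weights, zero biases."""
    rng = random.Random(seed)
    weights = []
    for m_prev, m_i in zip(layer_sizes, layer_sizes[1:]):
        W = [[rng.uniform(-0.1, 0.1) for _ in range(m_prev)] + [0.0] for _ in range(m_i)]
        weights.append(W)
    return weights


def forward_pass(weights, activations, x):
    """Propagate x through every layer and return the network output."""
    z = list(x)
    for W, sigma in zip(weights, activations):
        z_aug = z + [1.0]  # augment with a constant 1 for the bias
        a = [sum(w * v for w, v in zip(row, z_aug)) for row in W]  # step 1: net inputs
        z = [sigma(a_j) for a_j in a]                              # step 2: activations
    return z                                                       # step 3: z_L = y_hat


def linear(a: float) -> float:
    return a


layer_sizes = [5, 12, 3]            # the example network
weights = init_weights(layer_sizes)
activations = [math.tanh, linear]   # hypothetical choice of activations
y_hat = forward_pass(weights, activations, [0.2, -0.1, 0.4, 0.0, 1.0])
```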

## Pros and Cons of FNNs
There are several advantages and disadvantages to using feedforward neural networks (FNNs) for machine learning tasks. Here are some of the key points to consider:

### Advantages
* __Flexibility__: FNNs can be used for a wide range of tasks, including classification, regression, and generative modeling. They can also be adapted to work with different types of data, such as images, text, and time series. Lastly, universal approximation results show that FNNs with a suitable non-linear activation function can approximate any continuous function on a bounded domain to arbitrary precision, given enough hidden units.
* __Non-linearity__: FNNs can learn complex non-linear relationships between inputs and outputs by using non-linear activation functions. This allows them to model complex patterns in the data that linear models cannot capture.

### Disadvantages
* __Overfitting__: FNNs can easily overfit the training data, especially when the model is too complex or the training data is limited. This can lead to poor generalization to new data. Regularization techniques such as dropout, L1/L2 regularization, and early stopping can help mitigate this issue.
* __Computationally expensive__: Training FNNs can be computationally expensive, especially for large datasets or deep networks, and can require significant computational resources and time. Modern hardware and software frameworks make this less of a concern, but using them efficiently requires (for the most part) expert-level knowledge.
* __Interpretability__: FNNs are often considered "black box" models, meaning that it can be difficult to interpret how they make predictions. This can be a disadvantage in applications where interpretability is important, such as in healthcare or finance. However, there are techniques such as LIME and SHAP that can help improve the interpretability of FNNs.

## Lab
In Lab `L12b`, we will implement (and train) a feedforward model for a simple computer vision task. 
* _Cool, what is this task?_ We'll give handwritten digits to the model and ask it to classify them. The model will be a simple feedforward neural network with one hidden layer. We'll use the MNIST dataset, which contains images of handwritten digits (0-9). The goal is to train the model to recognize these digits based on the pixel values of the images.

# Today?
That's a wrap! What are some of the interesting things we discussed today?