# L12a: Feed Forward Neural Networks (FNNs)
In this lecture, we will explore Feed Forward Neural Networks (FNNs), a type of artificial neural network in which the connections between nodes do not form cycles. This is in contrast to recurrent neural networks (RNNs), where data can flow in cycles (e.g., when processing time series data). The key concepts discussed in this lecture are:

* __FNN Architecture__: Feedforward neural networks (FNNs) are foundational artificial neural network architectures where information flows _unidirectionally_ from input nodes through (potentially many) hidden layers of arbitrary dimension to output nodes, without cycles or feedback loops. Each node in the network is a simple processing unit that applies a linear transformation followed by a (potentially) non-linear activation function to its inputs. 
* __FNN Applications__: FNNs are widely employed for pattern recognition, classification tasks (including for non-linearly separable data), and predictive modeling, such as identifying objects in images, sentiment analysis in text, or forecasting process trends. They also work well in structured data applications like medical diagnostics (classification of patient records) and marketing (personalized recommendations). FNNs are also components of more advanced architectures in fields like computer vision and natural language processing.
* __FNN Training__: FNNs are typically trained using supervised learning, where the model learns to map inputs to outputs by minimizing a loss function. The most common approach combines [backpropagation](https://en.wikipedia.org/wiki/Backpropagation), which computes the gradient of the loss with respect to the weights (and bias values) of the network, with [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), which uses that gradient to update the parameters and reduce the error between predicted and actual outputs.

The source(s) for this lecture can be found here:
* [John Hertz, Anders Krogh, and Richard G. Palmer. 1991. Introduction to the theory of neural computation. Addison-Wesley Longman Publishing Co., Inc., USA.](https://dl.acm.org/doi/10.5555/104000)
* [Mehlig, B. (2021). Machine Learning with Neural Networks. Chapter 5: Perceptrons and Chapter 6: Stochastic Gradient Descent](https://arxiv.org/abs/1901.05639v4)

## The Beginning: McCulloch-Pitts Neurons
The [McCulloch-Pitts Neuron (1943)](https://link.springer.com/article/10.1007/BF02478259) is a simplified model of an [actual biological neuron](https://en.wikipedia.org/wiki/Biological_neuron_model) that serves as the building block for the artificial neural networks that we find in wide use today. 
*  _Origin story_: In [their paper](https://link.springer.com/article/10.1007/BF02478259), McCulloch and Pitts explored how the brain could produce highly complex patterns by using many interconnected _basic cells (neurons)_. Their model of a _neuron_ was highly simplified. Nevertheless, it was a foundational contribution to the development of artificial neural networks, which model key features of biological neurons.

Suppose neuron $k$ takes an input vector $\mathbf{n}(t) = (n_1(t), n_2(t), \ldots, n_n(t))$ at time $t$, where each component $n_j(t)$ is a binary value (`0` or `1`) that represents the state of predecessor neuron $j$ at time $t$, for $j = 1,2,\ldots,n$. Then, the state of neuron $k$ at time $t+1$ is given by:
$$
\begin{align*}
n_{k}(t+1) &= \sigma\left(\sum_{j=1}^{n} w_{kj} n_j(t) - \theta_k\right) \\
\end{align*}
$$
where $\sigma$ is an _activation function_, $w_{kj}$ is the weight of the connection from neuron $j$ to neuron $k$, and $\theta_k$ is the threshold for neuron $k$. In the original publication, the activation function $\sigma$ was a step function that outputs `1` if its argument is non-negative, and `0` otherwise.

The weights $w_{kj}\in\mathbb{R}$ and the threshold $\theta_k\in\mathbb{R}$ are parameters of the neuron that determine its behavior. The weights can be positive or negative, and they represent the strength and direction of the influence of the input neurons on the output neuron. The threshold determines how much input is needed for the neuron to "fire" (i.e., produce an output of `1`).
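
To make the update rule concrete, here is a minimal sketch of a single McCulloch-Pitts neuron in Python. The function names (`step`, `mp_neuron`) and the particular weights and thresholds are illustrative assumptions, not values from the original paper; with unit weights, a threshold of 2 reproduces a logical AND and a threshold of 1 reproduces a logical OR.

```python
def step(z):
    """Step activation: 1 if the argument is non-negative, 0 otherwise."""
    return 1 if z >= 0 else 0


def mp_neuron(inputs, weights, theta):
    """Next state of a neuron given binary input states, weights, and a threshold."""
    z = sum(w * n for w, n in zip(weights, inputs)) - theta
    return step(z)


# Hypothetical parameters: unit weights, threshold 2 (AND) or threshold 1 (OR)
for a in (0, 1):
    for b in (0, 1):
        and_out = mp_neuron([a, b], [1.0, 1.0], theta=2.0)
        or_out = mp_neuron([a, b], [1.0, 1.0], theta=1.0)
        print(f"inputs=({a}, {b})  AND={and_out}  OR={or_out}")
```

Changing only the threshold changes which logical function the neuron computes, which illustrates how $\theta_k$ controls how much input is needed for the neuron to fire.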

## Is the Perceptron a Neural Network?
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes a scalar input $z\in\mathbb{R}$ and transforms it, using the _activation_ function $\sigma(\star) = \text{sign}(\star)$, into a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case.
* Suppose there exists a data set
$\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert, i.e., a human, to be in a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
* [The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes of possible objects (binary classification) in $\mathcal{D}$ by repeatedly processing the data. During each pass, a parameter vector $\beta$ is updated; the passes continue until the classifier makes no more than a specified number of mistakes. 
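
The incremental learning loop can be sketched as follows. This is a minimal sketch that assumes the classic mistake-driven update $\beta \leftarrow \beta + y_{i}\hat{\mathbf{x}}_{i}$, which is not spelled out in this section, where $\hat{\mathbf{x}}_{i}$ is the feature vector with a constant `1` appended for the bias. The names `train_perceptron`, `max_passes`, and `max_mistakes`, as well as the toy data set, are hypothetical.

```python
def sign(z):
    """Sign activation: 1 if z is non-negative, -1 otherwise."""
    return 1 if z >= 0 else -1


def train_perceptron(data, max_passes=100, max_mistakes=0):
    """Learn beta = (w_1, ..., w_m, b) from (x, y) pairs with y in {-1, 1}."""
    m = len(data[0][0])
    beta = [0.0] * (m + 1)                      # start from all-zero parameters
    for _ in range(max_passes):
        mistakes = 0
        for x, y in data:
            x_aug = list(x) + [1.0]             # append 1 so the bias is part of beta
            y_hat = sign(sum(xj * bj for xj, bj in zip(x_aug, beta)))
            if y_hat != y:                      # assumed update: only on a mistake
                beta = [bj + y * xj for bj, xj in zip(beta, x_aug)]
                mistakes += 1
        if mistakes <= max_mistakes:            # stop once a pass is good enough
            break
    return beta


# Hypothetical, linearly separable toy data
data = [([2.0, 2.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
beta = train_perceptron(data)
print(beta)
```

With `max_mistakes=0`, the loop only stops early after a full pass in which every example is classified correctly, so on separable data the returned parameters separate the two classes.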

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \sigma\left(\hat{\mathbf{x}}_{i}^{\top}\cdot\beta\right)
\end{equation*}
$$
where $\beta=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, $w_{j}\in\mathbb{R}$ is the importance of feature $j$, and $b\in\mathbb{R}$ is a bias parameter. The features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p$-dimensional (row) vectors with $p = m+1$ (the original features augmented with a constant `1` for the bias term), and $\sigma(z)$ is an _activation_ function. In the case of the Perceptron, the activation function is the _sign_ function, which maps a real-valued input to one of two possible outputs, typically -1 or 1. The sign function is defined as follows:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$
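
As a small illustration of the prediction step, the sketch below augments a feature vector with a constant `1` and applies the sign function to the inner product with $\beta$. The function name `predict` and the parameter values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def sign(z):
    """Sign activation: 1 if z is non-negative, -1 otherwise."""
    return 1 if z >= 0 else -1


def predict(x, beta):
    """Estimated label for features x (length m) and beta = (w_1, ..., w_m, b)."""
    x_aug = list(x) + [1.0]                     # features augmented with the bias term
    z = sum(xj * bj for xj, bj in zip(x_aug, beta))
    return sign(z)


beta = [1.0, -1.0, 0.5]                         # hypothetical: w_1 = 1, w_2 = -1, b = 0.5
print(predict([2.0, 1.0], beta))                # z = 2 - 1 + 0.5 = 1.5, so the label is 1
print(predict([0.0, 3.0], beta))                # z = 0 - 3 + 0.5 = -2.5, so the label is -1
```
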
Let's imagine that we have _many_ perceptrons arranged in layers, where each perceptron in a layer is connected to all perceptrons in the next layer. This is the basic idea behind a feedforward neural network (FNN).

___

<img
  src="figs/nn-4.svg"
  alt="feedforward neural network with five input nodes, a hidden layer of 12 nodes, and three output nodes"
  height="400"
  width="800" />

## Feedforward Neural Networks with a Single Hidden Layer
Let's consider the simple network shown above. The network has three layers: an input layer (five nodes), a hidden layer (12 nodes), and an output layer (three nodes). 

* __Input__: Denote the input to the network as the vector $\mathbf{x} = \left(x_{1},x_{2},\dots,x_{n}\right)$ where $\mathbf{x}\in\mathbb{R}^{n}$; in the example network $n = 5$. In general, the input to the network is a vector of features that represent the data we want to process. Each component of the input vector corresponds to a feature of the data. For example, in an image classification task, the input vector could represent pixel values of an image.
* __Hidden Layer__: The hidden layer performs computations on the input data. Each node in the hidden layer takes the input vector $\mathbf{x}$ and applies a linear transformation followed by a non-linear activation function. The output of each node in the hidden layer is then passed to the next layer (in this case, the output layer). The number of nodes in the hidden layer can vary depending on the complexity of the task. Furthermore, a network can have multiple hidden layers (i.e., a deep neural network) to learn more complex representations of the data. In the example network, we have a single hidden layer with 12 nodes.
* __Output Layer__: The output layer is the final layer of the network, and it produces the network output $\mathbf{y} = \left(y_{1},y_{2},\dots,y_{m}\right)$ where $\mathbf{y}\in\mathbb{R}^{m}$. Each node in the output layer takes the output from the hidden layer and applies a linear transformation followed by a non-linear activation function. In this example, we have three output nodes, which could correspond to three different classes in a multiclass classification task. In that case, a function such as softmax can convert the raw outputs into values that are interpreted as probabilities for each class.
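
The three layers described above can be sketched as a forward pass through a 5-12-3 network. This is a minimal sketch using only the Python standard library; the helper names (`dense`, `softmax`, `forward`, `rand_matrix`), the choice of `tanh` as the hidden activation, the random initialization, and the sample input are all assumptions made for illustration.

```python
import math
import random


def dense(x, W, b):
    """Linear transformation: one weighted sum plus bias per node in the layer."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    """Convert raw outputs into positive values that sum to one."""
    zmax = max(z)                               # subtract the max for numerical stability
    exps = [math.exp(v - zmax) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer (linear + tanh, assumed) -> output layer (linear + softmax)."""
    h = [math.tanh(v) for v in dense(x, W1, b1)]
    return softmax(dense(h, W2, b2))


def rand_matrix(rows, cols):
    """Hypothetical initialization: uniform random weights in [-1, 1]."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_in, n_hidden, n_out = 5, 12, 3                # layer sizes of the example network
W1, b1 = rand_matrix(n_hidden, n_in), [0.0] * n_hidden
W2, b2 = rand_matrix(n_out, n_hidden), [0.0] * n_out

x = [0.5, -1.0, 2.0, 0.0, 1.5]                  # hypothetical input with five features
y = forward(x, W1, b1, W2, b2)
print(len(y), sum(y))                           # three outputs that sum to (approximately) 1
```

Because the weights here are random rather than trained, the three output values carry no meaning beyond showing the shapes and the flow of information from input to output.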