# L12a: Feed Forward Neural Networks (FNNs)
In this lecture, we explore Feed Forward Neural Networks (FNNs), a type of artificial neural network in which the connections between nodes do not form cycles. This is in contrast to recurrent neural networks (RNNs), where data can flow in cycles (e.g., when processing time series data). The key concepts discussed in this lecture are:

* __FNN Architecture__: Feedforward neural networks (FNNs) are foundational artificial neural network architectures where information flows _unidirectionally_ from input nodes through (potentially many) hidden layers of arbitrary dimension to output nodes, without cycles or feedback loops. Each node in the network is a simple processing unit that applies a linear transformation followed by a (potentially) non-linear activation function to its inputs. 
* __FNN Applications__: FNNs are widely employed for pattern recognition, classification tasks (including for non-linearly separable data), and predictive modeling, such as identifying objects in images, sentiment analysis in text, or forecasting process trends. They also work well in structured data applications like medical diagnostics (classification of patient records) and marketing (personalized recommendations). FNNs are also components of more advanced architectures in fields like computer vision and natural language processing.
* __FNN Training__: FNNs are trained using supervised learning, where the model learns to map inputs to outputs by minimizing a loss function. The most common training approach uses [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) to compute the gradient of the loss with respect to the network parameters, and [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) to update the weights (and bias values) of the network based on the error between predicted and actual outputs.

The source(s) for this lecture can be found here:
* [John Hertz, Anders Krogh, and Richard G. Palmer. 1991. Introduction to the theory of neural computation. Addison-Wesley Longman Publishing Co., Inc., USA.](https://dl.acm.org/doi/10.5555/104000)
* [Mehlig, B. (2021). Machine Learning with Neural Networks. Chapter 5: Perceptrons and Chapter 6: Stochastic Gradient Descent](https://arxiv.org/abs/1901.05639v4)

## Start at the start: McCulloch-Pitts Neurons
The [McCulloch-Pitts Neuron (1943)](https://link.springer.com/article/10.1007/BF02478259) is a simplified model of an actual biological neuron that serves as the building block for the artificial neural networks that we find in wide use today. 
*  _Origin story_: In [their paper](https://link.springer.com/article/10.1007/BF02478259), McCulloch and Pitts explored how the brain could produce highly complex patterns by using many interconnected _basic cells (neurons)_. Although the model of a _neuron_ in their paper is highly simplified, it was a foundational contribution to the development of artificial neural networks, which model key features of biological neurons.

## Is the Perceptron a Neural Network?
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes a scalar input $z\in\mathbb{R}$ and transforms it using an _activation_ function $\sigma(\star) = \text{sign}(\star)$ into a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case. 
* Suppose there exists a data set
$\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert (i.e., a human) as belonging to a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
* [The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes of possible objects (binary classification) in $\mathcal{D}$ by repeatedly processing the data. During each pass, a regression parameter vector $\mathbf{\beta}$ is updated until it makes no more than a specified number of mistakes. 
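
To make the incremental learning idea concrete, here is a minimal sketch of a mistake-driven training loop. The lecture text does not spell out the update rule, so the sketch assumes the classic Perceptron update (add $y_{i}\hat{\mathbf{x}}_{i}$ to the parameter vector whenever example $i$ is misclassified). The function name `train_perceptron`, the arguments `max_passes` and `max_mistakes`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def train_perceptron(X, y, max_passes=100, max_mistakes=0):
    """Mistake-driven training (classic Perceptron update, assumed here).

    X: (n, m) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    Returns the parameter vector beta = (w_1, ..., w_m, b).
    """
    n, m = X.shape
    X_aug = np.hstack([X, np.ones((n, 1))])  # append a 1 so the bias lives in beta
    beta = np.zeros(m + 1)
    for _ in range(max_passes):
        mistakes = 0
        for x_i, y_i in zip(X_aug, y):
            y_hat = 1 if x_i @ beta >= 0 else -1  # sign with sign(0) = +1
            if y_hat != y_i:
                beta = beta + y_i * x_i  # move the boundary toward the missed example
                mistakes += 1
        if mistakes <= max_mistakes:  # stop once a pass makes few enough mistakes
            break
    return beta

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

beta = train_perceptron(X, y)
X_aug = np.hstack([X, np.ones((len(X), 1))])
predictions = np.where(X_aug @ beta >= 0, 1, -1)
print(beta, predictions)
```

The `max_passes` cap is a safeguard: if the data are not linearly separable, the mistake count may never fall below the threshold, so the loop needs another way to stop.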

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \sigma\left(\hat{\mathbf{x}}_{i}^{\top}\cdot\beta\right)
\end{equation*}
$$
where $\beta=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, $w_{j}\in\mathbb{R}$ corresponds to the importance of feature $j$ and $b\in\mathbb{R}$ is a bias parameter, the features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p = (m+1)$-dimensional (row) vectors (features augmented with a constant 1 for the bias term), and $\sigma(z)$ is an _activation_ function. In the case of the Perceptron, the activation function is the _sign_ function, which maps a real-valued input to one of two possible outputs, typically -1 or 1. The sign function is defined as follows:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$
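
As a small worked example of the prediction rule, the sketch below augments each feature vector with a trailing 1 and applies the sign function to $\hat{\mathbf{x}}_{i}^{\top}\cdot\beta$. The helper names `sign` and `predict` and the parameter values are hypothetical. Note that `np.sign(0)` returns 0, so the sketch defines its own `sign` to match the convention above, where $z = 0$ maps to $+1$.

```python
import numpy as np

def sign(z):
    """Sign activation with the convention sign(0) = +1."""
    return np.where(z >= 0, 1, -1)

def predict(X, beta):
    """Estimated labels for the rows of X, given beta = (w_1, ..., w_m, b)."""
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])  # augment each row with a 1
    return sign(X_hat @ beta)

# Hypothetical parameters: w = (1, -1), b = 0.5
beta = np.array([1.0, -1.0, 0.5])

X = np.array([
    [2.0, 1.0],  # 2.0 - 1.0 + 0.5 =  1.5  -> +1
    [0.0, 3.0],  # 0.0 - 3.0 + 0.5 = -2.5  -> -1
    [0.5, 1.0],  # 0.5 - 1.0 + 0.5 =  0.0  -> +1 (on the boundary)
])

labels = predict(X, beta)
print(labels)
```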
Let's imagine that we have _many_ perceptrons arranged in layers, where each perceptron in a layer is connected to all perceptrons in the next layer. This is the basic idea behind a feedforward neural network (FNN).
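
A minimal sketch of this layered arrangement, assuming each layer is stored as a weight matrix `W` (one row per perceptron) and a bias vector `b`, might look like the following. The layer sizes, weights, and the names `layer` and `forward` are hypothetical. The sketch keeps the sign activation to stay close to the Perceptron; networks trained with gradient-based methods typically swap in a differentiable activation instead.

```python
import numpy as np

def sign(z):
    """Sign activation with the convention sign(0) = +1."""
    return np.where(z >= 0, 1, -1)

def layer(x, W, b):
    """One layer: row k of W together with b[k] acts as one perceptron."""
    return sign(W @ x + b)

def forward(x, layers):
    """Pass x through each (W, b) pair in order; information never flows backward."""
    for W, b in layers:
        x = layer(x, W, b)
    return x

# Hypothetical network: 2 inputs -> 3 hidden perceptrons -> 1 output perceptron
W1 = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([-0.5, 0.0, 0.5])
W2 = np.array([[1.0, 1.0, -1.0]])
b2 = np.array([-1.5])
layers = [(W1, b1), (W2, b2)]

x = np.array([1.0, 0.0])
hidden = layer(x, W1, b1)     # outputs of the three hidden perceptrons
output = forward(x, layers)   # output of the final perceptron
print(hidden, output)
```

Every perceptron in the hidden layer sees the full input vector, and the output perceptron sees every hidden output, which is the "connected to all perceptrons in the next layer" structure described above.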