# 5. Neural Networks

### *Table of Contents*
5.1 [Feed-forward Network Functions](#5.1-Feed-forward-Network-Functions)

In [1]:
import math
import numpy as np
import matplotlib.pyplot as plt

# Set random seed to make deterministic
np.random.seed(0)

# Suppress NumPy warnings for division by zero and invalid (NaN-producing) operations.
np.seterr(divide='ignore', invalid='ignore')

# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'

# Enable autoreload all modules before executing code
%load_ext autoreload
%autoreload 2

## 5.1 Feed-forward Network Functions

The linear models discussed in previous chapters are based on linear combinations of fixed (non)linear basis functions $\phi_j(\mathbf{x})$ and take the form

$$
y(\mathbf{x},\mathbf{w}) = f\Bigg(\sum_{j=1}^M w_j\phi_j(\mathbf{x})\Bigg)
$$

where $f(\cdot)$ is a nonlinear activation function in the case of classification and the identity in the case of regression. Although such models have useful analytical properties, their practical applicability is limited by the curse of dimensionality. To apply them to large-scale problems, the basis functions must be adapted to the data. One approach is to fix the number of basis functions in advance but allow them to be adaptive during training. Thus, our goal is to extend the model above by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then adjusting these parameters, along with the coefficients $\{w_j\}$, during training.
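To make the starting point concrete, the following is a minimal sketch of the fixed-basis model above for one-dimensional inputs. It assumes Gaussian basis functions with hand-picked centers and a shared width; the names `gaussian_basis`, `linear_basis_model`, `centers`, and `width` are illustrative choices, not part of the original text. Note that the basis functions here have no trainable parameters: only `w` would be learned.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Evaluate M fixed Gaussian basis functions at N scalar inputs.

    Assumption: Gaussian bumps are used purely for illustration.
    Returns an array of shape (N, M) holding phi_j(x_n).
    """
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def linear_basis_model(x, w, centers, width, f=lambda a: a):
    """Compute y(x, w) = f(sum_j w_j * phi_j(x)).

    f defaults to the identity (regression); pass a nonlinear
    function for classification.
    """
    return f(gaussian_basis(x, centers, width) @ w)

# Hypothetical example values: 5 inputs, M = 3 basis functions.
x = np.linspace(-1.0, 1.0, 5)
centers = np.linspace(-1.0, 1.0, 3)
w = np.array([0.5, -1.0, 2.0])

y = linear_basis_model(x, w, centers, width=0.5)
```

The centers and width stay fixed no matter what data the model sees, which is exactly the limitation that adaptive basis functions are meant to remove.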

Neural networks use basis functions of the same form: each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients of the linear combination are adaptive parameters. Thus, the basic neural network model can be described as a series of functional transformations.

1. Given the input variables $x_1,\dots,x_D$, we construct $M$ linear combinations in the form:

$$
a_j = \sum_{i=1}^D w_{ji}^{(1)}x_i + w_{j0}^{(1)}
$$

where $j=1,\dots,M$, and the superscript $(1)$ indicates that the corresponding parameters belong to the *first* layer of the network. The parameters $w_{ji}^{(1)}$ are the weights and $w_{j0}^{(1)}$ are the biases. The quantities $a_j$ are known as *activations*, and each of them is transformed using a *differentiable* nonlinear activation function $h(\cdot)$ to give

$$
z_j = h(a_j)
$$

The quantities $z_j$ correspond to the outputs of the basis functions and, in the context of neural networks, are called *hidden units*.

2. Following the same procedure, the output values $z_j$ of the *first* layer are linearly combined again to give

$$
a_k = \sum_{j=1}^M w_{kj}^{(2)}z_j + w_{k0}^{(2)}
$$

where $k=1,\dots,K$. This transformation corresponds to the *second* layer of the network. These output activations are transformed again using an appropriate activation function $h$ (the identity for regression, a nonlinear function for classification) to give a set of outputs $y_k$.
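The two steps above can be sketched as a single forward pass. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `forward`, the argument names, the choice of `tanh` for the hidden units, and the identity output function are all illustrative. The weights $w_{ji}^{(1)}$ and $w_{kj}^{(2)}$ are stored as matrices `W1` (shape $M \times D$) and `W2` (shape $K \times M$), with the biases $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ kept in separate vectors `b1` and `b2`.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, out=lambda a: a):
    """Forward pass of a two-layer feed-forward network (illustrative sketch).

    x  : input vector, shape (D,)
    W1 : first-layer weights w_ji^(1), shape (M, D); b1: biases w_j0^(1), shape (M,)
    W2 : second-layer weights w_kj^(2), shape (K, M); b2: biases w_k0^(2), shape (K,)
    h  : hidden-unit activation (assumed tanh here)
    out: output activation (assumed identity, i.e. regression)
    """
    a_hidden = W1 @ x + b1   # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = h(a_hidden)          # z_j = h(a_j), the hidden units
    a_out = W2 @ z + b2      # a_k = sum_j w_kj^(2) z_j + w_k0^(2)
    return out(a_out)        # y_k

# Hypothetical sizes: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
D, M, K = 3, 4, 2
rng = np.random.default_rng(0)  # local generator; leaves the global seed untouched
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=D)

y = forward(x, W1, b1, W2, b2)
```

Unlike the fixed-basis model, both `W1` and `b1` (which define the basis functions) and `W2` and `b2` (which combine them) are parameters that can be adjusted during training.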