# Neural Networks

### Zhentao Shi


* The workhorse of AI

## Types of Neural Networks


* Feedforward (FNN)
* Convolutional (CNN)
* Recurrent (RNN)
* Long short-term memory (LSTM)
* etc.

## Layers

* The transition from layer $k-1$ to layer $k$ can be written as

$$
\begin{align*}
z_l^{(k)} & = b_{l0}^{(k-1)} + \sum_{j=1}^{p_{k-1} } w_{lj}^{(k-1)} a_j^{(k-1)} \\ 
a_l^{(k)} & = \sigma ( z_l^{(k)})
\end{align*}
$$

where $a_j^{(0)} = x_j$ is the input.

* The latent variable $z_l^{(k)}$ usually takes a linear form
* *Activation function* $\sigma(\cdot)$ is usually a simple nonlinear function
* Popular choices
  * Sigmoid: $1/(1+\exp(-x))$
  * Rectified linear unit (ReLU): $x\cdot 1\{x\geq 0\}$
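
As a minimal sketch of the layer transition above, the NumPy snippet below computes $z^{(k)}$ and $a^{(k)}$ for a single layer. The function names (`sigmoid`, `relu`, `layer_forward`) and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU activation: x * 1{x >= 0}."""
    return np.maximum(0.0, x)

def layer_forward(a_prev, W, b, activation):
    """One layer transition: z = b + W a_prev, then a = activation(z)."""
    z = b + W @ a_prev
    return z, activation(z)

# Toy example (hypothetical numbers): 3 inputs feeding 2 nodes
a0 = np.array([1.0, -2.0, 0.5])      # a^(0) = x
W = np.array([[0.5, 0.0, 1.0],
              [1.0, 1.0, 0.0]])      # W[l, j] plays the role of w_{lj}
b = np.array([0.0, 0.5])             # b[l] plays the role of b_{l0}

z1, a1 = layer_forward(a0, W, b, relu)
print("z:", z1)   # latent variables
print("a:", a1)   # activations: negative entries of z are set to zero
```

Swapping `relu` for `sigmoid` in the last call changes only the nonlinearity; the linear step is the same.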

## Why Does It Work?

* Animated video by [3Blue1Brown](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

* Feedforward: criterion evaluation
* Back propagation: parameter adjustment

## Optimization

* Feedforward NN with one hidden layer for demonstration
* Input dimension: $p$
* Number of hidden nodes: $K$
  
* Criterion: 
$$
\min_{\theta}   \frac{1}{2}\sum_{i=1}^n  Q_i \textrm{ where } Q_i = [y_i - \hat{y}_i ]^2
$$


## Components in Optimization

* The input -> hidden layer is indexed as (1)
* The hidden -> output layer is indexed as (2):
$$
\begin{align*}
\hat{y}_i & =  b^{(2)} + \sum_{j=1}^K w_{j}^{(2)} a_j^{(1)} \\ 
a_j^{(1)} & = \sigma \left( z^{(1)}_j\right) \\
z_j^{(1)} & =b_j^{(1)} + \sum_{\ell=1}^p w_{j\ell}^{(1)} x_{i\ell} 
\end{align*}
$$

* Intercept is called **bias**
* Slope coefficient is called **weight**



## Gradient Method

Taylor expansion

$$
Q(\theta_{t+1}) = Q(\theta_t) +  \nabla^{\top} Q(\theta_t) (\theta_{t+1}-\theta_{t}) + h.o.t.
$$
where
* $\nabla Q(\theta_t)$ is the **gradient**
* $(\theta_{t+1}-\theta_{t})$ is unknown; denote it by $p_t$ and drop the higher-order terms:
$$
Q(\theta_{t+1}) \approx Q(\theta_t) +  \nabla^{\top} Q(\theta_t) p_t
$$
Which direction $p_t$ reduces the value of the function?

* Choosing $\theta_{t+1}-\theta_{t} = p_t = - \alpha \cdot \nabla Q(\theta_t)$ gives $\nabla^{\top} Q(\theta_t) p_t = -\alpha \|\nabla Q(\theta_t)\|^2 \leq 0$, so $Q$ decreases for a sufficiently small **learning rate** $\alpha > 0$. This yields the iterative formula
$$
 \theta_{t+1} = \theta_{t} - \alpha \cdot \nabla Q(\theta_t)
$$
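
To make the iteration concrete, here is a minimal sketch of gradient descent on a simple quadratic criterion. The criterion `Q`, its minimizer `target`, and the values of `alpha` and the number of steps are assumptions chosen for illustration; they are not part of the neural network example.

```python
import numpy as np

target = np.array([1.0, -2.0])   # hypothetical minimizer of Q

def Q(theta):
    """A simple convex criterion: half the squared distance to `target`."""
    return 0.5 * np.sum((theta - target) ** 2)

def grad_Q(theta):
    """Gradient of Q with respect to theta."""
    return theta - target

alpha = 0.1                      # learning rate
theta = np.zeros(2)              # starting value theta_0
values = [Q(theta)]

for t in range(100):
    theta = theta - alpha * grad_Q(theta)   # theta_{t+1} = theta_t - alpha * gradient
    values.append(Q(theta))

print("final theta:", theta)
print("first and last criterion values:", values[0], values[-1])
```

With this small learning rate, the recorded criterion values decrease at every step and `theta` approaches `target`.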



## Backpropagation

* Output layer -> hidden layer (the factor $\frac{1}{2}$ in the criterion cancels the 2 from differentiating the square):
$$
\begin{align*}
\frac{1}{2}\frac{\partial Q_{i}}{\partial b^{(2)}} & =-\left[y_{i}-\hat{y}_{i}\right]\\
\frac{1}{2}\frac{\partial Q_{i}}{\partial w_{j}^{(2)}} & =-\left[y_{i}-\hat{y}_{i}\right]\sigma\left(z_{j}^{(1)}\right)
\end{align*}
$$

* Hidden layer -> input layer: 
  * NN is a composite function. By the chain rule, the output-layer gradient is passed back through $w_{j}^{(2)}$ and $\sigma'$:
$$
\begin{align*}
\frac{\partial Q_{i}}{\partial b_{j}^{(1)}} & =\frac{\partial Q_{i}}{\partial b^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}^{(1)}\right)\\
\frac{\partial Q_{i}}{\partial w_{j\ell}^{(1)}} & =\frac{\partial Q_{i}}{\partial b^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}^{(1)}\right)x_{i\ell}
\end{align*}
$$
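
One way to check chain-rule gradients is to compare them with a finite-difference approximation. The sketch below does this for a single observation, using $\frac{1}{2}Q_i$ as the loss and a sigmoid activation. The function names (`half_Q`, `backprop`, `numeric_dW1`), the dimensions, and the random numbers are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_Q(params, x, y):
    """Half squared error (1/2) * (y - y_hat)^2 for one observation."""
    W1, b1, w2, b2 = params
    a = sigmoid(b1 + W1 @ x)          # hidden activations
    y_hat = b2 + w2 @ a               # output
    return 0.5 * (y - y_hat) ** 2

def backprop(params, x, y):
    """Analytical gradients of half_Q via the chain rule."""
    W1, b1, w2, b2 = params
    a = sigmoid(b1 + W1 @ x)
    y_hat = b2 + w2 @ a
    d_b2 = -(y - y_hat)               # output layer, bias
    d_w2 = d_b2 * a                   # output layer, weights
    d_b1 = d_b2 * w2 * a * (1.0 - a)  # sigmoid'(z) = a * (1 - a)
    d_W1 = np.outer(d_b1, x)          # hidden layer, weights
    return d_W1, d_b1, d_w2, d_b2

def numeric_dW1(params, x, y, j, l, eps=1e-6):
    """Central finite difference of half_Q with respect to W1[j, l]."""
    W1, b1, w2, b2 = params
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[j, l] += eps
    W_minus[j, l] -= eps
    return (half_Q((W_plus, b1, w2, b2), x, y)
            - half_Q((W_minus, b1, w2, b2), x, y)) / (2 * eps)

rng = np.random.default_rng(0)
p, K = 3, 4                           # input dimension, hidden nodes
params = (rng.normal(size=(K, p)), rng.normal(size=K),
          rng.normal(size=K), rng.normal())
x, y = rng.normal(size=p), 1.5

d_W1, d_b1, d_w2, d_b2 = backprop(params, x, y)
approx = numeric_dW1(params, x, y, j=0, l=0)
print("analytical:", d_W1[0, 0], "finite difference:", approx)
```

If the two printed numbers disagree beyond numerical error, the analytical formula or its implementation contains a mistake.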

**Example**: Use an NN to fit a linear model.
* Notice $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$. A linear function can therefore be represented exactly by an NN with ReLU activations.
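
The identity can be verified numerically. The sketch below builds, by hand, a network with two hidden ReLU nodes that reproduces a linear function; the coefficients `beta0` and `beta` and the random inputs are hypothetical values chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                # 5 observations, 2 inputs

beta0, beta = 1.0, np.array([2.0, 1.0])    # hypothetical linear coefficients
linear = beta0 + X @ beta                  # target linear function

# Hand-picked weights: node 1 computes ReLU(x'beta), node 2 computes ReLU(-x'beta)
W1 = np.vstack([beta, -beta])              # hidden weights, shape (2 nodes, 2 inputs)
w2 = np.array([1.0, -1.0])                 # output weights
hidden = relu(X @ W1.T)                    # hidden biases are zero
nn_out = beta0 + hidden @ w2               # output bias is the intercept

print(np.max(np.abs(linear - nn_out)))     # discrepancy between the two
```

The trained network need not find these particular weights; the construction only shows that an exact representation exists.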

In [1]:
import numpy as np

# simulate data
np.random.seed(1)  # For reproducibility
n = 100
x = np.random.rand(n, 2)
y = 1 + 2 * x[:, 0] + 1 * x[:, 1] + np.random.randn(n)  # Linear relationship with noise
y = y.reshape(-1, 1)  # Reshape to column vector (n, 1)


In [2]:
# Define the Neural Network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """Initialize weights and biases."""
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01  # Input to hidden weights
        self.b1 = np.zeros((1, hidden_size))  # bias
        
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01  # Hidden to output weights
        self.b2 = np.zeros((1, output_size))  # bias

    def forward(self, X):
        """Compute the forward pass."""
        self.Z1 = np.dot(X, self.W1) + self.b1  # Input to hidden layer
        self.A1 = np.maximum(0, self.Z1)  # ReLU activation

        self.Z2 = np.dot(self.A1, self.W2) + self.b2  # Hidden to output layer
        self.A2 = self.Z2  # Linear activation (identity)
        return self.A2

    def compute_loss(self, y_true, y_pred):
        """Calculate Mean Squared Error loss."""
        return np.mean((y_true - y_pred) ** 2)

    def backward(self, X, y):
        """Compute gradients using backpropagation."""
        n = X.shape[0]  # sample size

        # Output layer gradients. 2nd layer is computed first.
        dZ2 = 2 * (self.A2 - y) / n  # Gradient of loss w.r.t. Z2
        
        dW2 = np.dot(self.A1.T, dZ2)  # Gradient of loss w.r.t. W2
        db2 = np.sum(dZ2, axis=0, keepdims=True)  # Gradient of loss w.r.t. b2

        # Hidden layer gradients. 1st layer is computed last.
        dA1 = np.dot(dZ2, self.W2.T)  # Gradient of loss w.r.t. A1
        dZ1 = dA1 * (self.Z1 > 0)  # Gradient of loss w.r.t. Z1 (ReLU derivative)
        dW1 = np.dot(X.T, dZ1)  # Gradient of loss w.r.t. W1
        db1 = np.sum(dZ1, axis=0, keepdims=True)  # Gradient of loss w.r.t. b1
        return dW1, db1, dW2, db2

    def update(self, dW1, db1, dW2, db2, learning_rate):
        """Update weights and biases using gradients."""
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

    def train(self, X, y, epochs, learning_rate):
        """Train the neural network."""
        for epoch in range(epochs):
            y_pred = self.forward(X)  # Forward pass
            loss = self.compute_loss(y, y_pred)  # Compute loss
            if epoch % 100 == 0:  # Print loss every 100 epochs
                print(f"Epoch {epoch}, Loss: {loss:.6f}")
            dW1, db1, dW2, db2 = self.backward(X, y)  # Backward pass
            self.update(dW1, db1, dW2, db2, learning_rate)  # Update parameters


In [3]:
# Main execution
if __name__ == "__main__":
    # Initialize the neural network
    # (named `net` so that it does not clash with `torch.nn`, imported as `nn` later)
    net = NeuralNetwork(input_size=2, hidden_size=8, output_size=1)
    
    # Train the network
    net.train(x, y, epochs=1000, learning_rate=0.01)
    
    # Evaluate the final performance
    final_pred = net.forward(x)
    final_loss = net.compute_loss(y, final_pred)
    print(f"Final Loss: {final_loss:.6f}")

Epoch 0, Loss: 7.843827
Epoch 100, Loss: 1.589406
Epoch 200, Loss: 1.461798
Epoch 300, Loss: 1.415684
Epoch 400, Loss: 1.347697
Epoch 500, Loss: 1.262448
Epoch 600, Loss: 1.181384
Epoch 700, Loss: 1.124319
Epoch 800, Loss: 1.090392
Epoch 900, Loss: 1.071762
Final Loss: 1.061774


A PyTorch implementation of the same neural network:

In [4]:
import torch

import torch.nn as nn
import torch.optim as optim

# Define the Neural Network class
class NeuralNetwork_torch(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out


In [5]:
# Convert numpy arrays to torch tensors
x_tensor = torch.tensor(x, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Initialize the neural network
input_size = x.shape[1]
hidden_size = 8
output_size = 1
model = NeuralNetwork_torch(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the neural network
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    outputs = model(x_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

# Evaluate the final performance (no gradient tracking is needed here)
with torch.no_grad():
    final_pred = model(x_tensor)
    final_loss = criterion(final_pred, y_tensor).item()
print(f"Final Loss: {final_loss:.6f}")

Epoch 0, Loss: 8.963584
Epoch 100, Loss: 1.196773
Epoch 200, Loss: 1.137686
Epoch 300, Loss: 1.098162
Epoch 400, Loss: 1.073694
Epoch 500, Loss: 1.059587
Epoch 600, Loss: 1.051506
Epoch 700, Loss: 1.046908
Epoch 800, Loss: 1.044397
Epoch 900, Loss: 1.043045
Final Loss: 1.042303


## Stochastic Gradient Descent

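* When $n$ is large, evaluating the full gradient at every step is costly
* Sample a *minibatch* at each step and compute the gradient on it
  * The minibatch gradient is unbiased for the full gradient, but has a large variance
* The learning rate controls the step size
* Many epochs (passes over the data) are needed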
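
A minimal sketch of minibatch sampling, assuming a linear model with mean squared error so that the gradient has a simple closed form. The simulated data, the batch size of 20, the learning rate, and the helper name `grad_mse` are illustrative assumptions. The first part shows that minibatch gradients over a random partition average to the full gradient; the second part runs SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.random((n, p))
y = 1 + X @ np.array([2.0, 1.0]) + rng.normal(size=n)
Xc = np.column_stack([np.ones(n), X])      # add an intercept column

def grad_mse(theta, Xb, yb):
    """Gradient of the mean squared error on the batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

def mse(theta):
    return np.mean((y - Xc @ theta) ** 2)

theta = np.zeros(p + 1)
batch_size = 20

# Part 1: minibatch gradients vary, but their average is the full gradient
batches = rng.permutation(n).reshape(-1, batch_size)
mb_grads = np.array([grad_mse(theta, Xc[idx], y[idx]) for idx in batches])
full_grad = grad_mse(theta, Xc, y)
print("minibatch gradients:\n", mb_grads)
print("their average:", mb_grads.mean(axis=0), "full gradient:", full_grad)

# Part 2: SGD, reshuffling the data at the start of every epoch
alpha = 0.05                               # learning rate
initial_mse = mse(theta)
for epoch in range(200):
    for idx in rng.permutation(n).reshape(-1, batch_size):
        theta = theta - alpha * grad_mse(theta, Xc[idx], y[idx])
final_mse = mse(theta)
print("MSE before and after:", initial_mse, final_mse)
```

Each update uses only 20 observations, so one epoch consists of five cheap steps instead of one step on the full sample.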



## Regularization

* $L_1$-norm (Lasso)
* $L_2$-norm (ridge)
* Learning rate
* Number of epochs and minibatches
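
As a sketch of how a penalty enters the gradient update, the snippet below adds an $L_2$ (ridge) penalty to a linear model without intercept and compares gradient descent with the closed-form ridge solution. The data, the penalty level `lam`, and the learning rate are hypothetical choices; the $L_1$ penalty is not covered because it is not differentiable at zero and needs a different update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(size=n)

lam = 0.5      # penalty level (hypothetical value)
alpha = 0.1    # learning rate

def grad_ridge(w):
    """Gradient of mean((y - Xw)^2) + lam * ||w||^2."""
    return 2 * X.T @ (X @ w - y) / n + 2 * lam * w

# Gradient descent on the penalized criterion
w = np.zeros(p)
for t in range(500):
    w = w - alpha * grad_ridge(w)

# Closed-form ridge and unpenalized least squares for comparison
w_closed = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("ridge by gradient descent:", w)
print("ridge in closed form:     ", w_closed)
print("least squares:            ", w_ols)
```

The penalty contributes the extra term `2 * lam * w` to every update, which shrinks the weights toward zero relative to least squares.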


## Theory is Incomplete

* Theoretical understanding is an ongoing endeavor.
* Hornik, Stinchcombe, and White (1989):
  * A neural network with a single hidden layer, given sufficiently many nodes, is a *universal approximator* for any measurable function.
* Deep learning: engineering breakthrough
* Big data available
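
The approximation idea can be illustrated, though not proved, with a hand-built example: a one-hidden-layer ReLU network whose $K$ nodes linearly interpolate a target function at equally spaced knots. The sketch below is an assumption-laden illustration for $\sin(x)$ on $[0, 2\pi]$; it is not the argument used by Hornik, Stinchcombe, and White, and the helper name `relu_network` is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_network(f, K, lo=0.0, hi=2 * np.pi):
    """Build a one-hidden-layer ReLU network with K nodes that linearly
    interpolates f at K + 1 equally spaced knots on [lo, hi]."""
    knots = np.linspace(lo, hi, K + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)   # slope on each segment
    w2 = np.diff(slopes, prepend=0.0)             # change of slope at each knot
    b2 = f(knots[0])                              # output bias

    def predict(x):
        hidden = relu(x[:, None] - knots[:-1])    # node k computes ReLU(x - knot_k)
        return b2 + hidden @ w2

    return predict

grid = np.linspace(0.0, 2 * np.pi, 1001)
err = {}
for K in (5, 20, 80):
    net = relu_network(np.sin, K)
    err[K] = np.max(np.abs(np.sin(grid) - net(grid)))
    print(f"K = {K:3d}, max abs error = {err[K]:.5f}")
```

The maximum error on the grid shrinks as the number of hidden nodes grows, which is the sense in which more nodes buy a better approximation.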