# Neural Networks

### Zhentao Shi

![NN](graph/Colored_neural_network.png)

* The workhorse of AI

## Types of Neural Networks


* Feedforward (FNN)
* Convolutional (CNN)
* Recurrent (RNN)
* Long short-term memory (LSTM)
* etc.

## Layers

* The transition from layer $k-1$ to layer $k$ can be written as

$$
\begin{align*}
z_l^{(k)} & = b_{l0}^{(k-1)} + \sum_{j=1}^{p_{k-1} } w_{lj}^{(k-1)} a_j^{(k-1)} \\ 
a_l^{(k)} & = \sigma ( z_l^{(k)})
\end{align*}
$$

where $a_j^{(0)} = x_j$ is the input.

* The latent variable $z_l^{(k)}$ usually takes a linear form
* *Activation function* $\sigma(\cdot)$ is usually a simple nonlinear function
* Popular choices
  * Sigmoid: $1/(1+\exp(-x))$
  * Rectified linear unit (ReLU): $x\cdot 1\{x\geq 0\}$
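
The layer transition and the two activation functions can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the sizes, the weight values, and the helper name `layer` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU activation: x * 1{x >= 0}."""
    return np.maximum(0.0, x)

def layer(a_prev, W, b, sigma):
    """One transition: z = b + W a_prev, then a = sigma(z)."""
    z = b + W @ a_prev
    return sigma(z)

# Hypothetical sizes: 3 inputs feeding 2 nodes in the next layer
a0 = np.array([1.0, -2.0, 0.5])          # a^{(0)} = x
W = np.array([[0.2, -0.1, 0.4],
              [-0.3, 0.5, 0.1]])         # weights w_{lj}, shape (p_k, p_{k-1})
b = np.array([0.1, -0.2])                # biases b_{l0}

a1_relu = layer(a0, W, b, relu)          # negative z is set to zero
a1_sigmoid = layer(a0, W, b, sigmoid)    # every entry lies in (0, 1)
```

Here the latent values are $z = (0.7, -1.45)$, so ReLU keeps the first node and zeroes out the second, while the sigmoid maps both into $(0, 1)$.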

## Why Does It Work?

* Animated video by [3Blue1Brown](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

* Feedforward: criterion evaluation
* Back propagation: parameter adjustment

## Optimization

* Feedforward NN with one hidden layer, for demonstration
* Number of inputs: $p$
* Number of hidden nodes: $K$
  
* Criterion: 
$$
\min_{\theta}   \sum_{i=1}^n  Q_i \textrm{ where } Q_i = \frac{1}{2} [y_i - \hat{y}_i ]^2
$$


## Components in Optimization

* The input -> hidden layer is indexed as (1)
* The hidden -> output layer is indexed as (2):
$$
\begin{align*}
\hat{y}_i & =  \beta^{(2)} + \sum_{j=1}^K w_{j}^{(2)} z_j^{(2)} \\ 
z_j^{(2)} & = \sigma \left( z^{(1)}_j\right) \\
z_j^{(1)} & =\beta_j^{(1)} + \sum_{\ell=1}^p w_{j\ell}^{(1)} x_{i\ell} 
\end{align*}
$$

* Intercept is called **bias**
* Slope coefficient is called **weight**



## Gradient method

Taylor expansion

$$
Q(\theta_{t+1}) = Q(\theta_t) +  \nabla^{\top} Q(\theta_t) (\theta_{t+1}-\theta_{t}) + h.o.t.
$$
where
* $\nabla Q(\theta_t)$ is the **gradient**
* $(\theta_{t+1}-\theta_{t})$ is unknown; replace it with a **step** $p_t$ and drop the higher-order terms:
$$
Q(\theta_{t+1}) \approx Q(\theta_t) +  \nabla^{\top} Q(\theta_t) p_t
$$
Which direction reduces the value of the function?

* Choosing $p_t = - \alpha \cdot \nabla Q(\theta_t)$ gives $\nabla^{\top} Q(\theta_t) p_t = -\alpha \|\nabla Q(\theta_t)\|^2 \leq 0$, so $Q$ decreases for a sufficiently small **learning rate** $\alpha > 0$.
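
A small sketch of the update $\theta_{t+1} = \theta_t - \alpha \nabla Q(\theta_t)$ on a toy criterion. The quadratic `Q`, its minimizer $(1, -2)$, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def Q(theta):
    """Hypothetical criterion: a convex quadratic with minimizer (1, -2)."""
    return 0.5 * ((theta[0] - 1.0) ** 2 + 4.0 * (theta[1] + 2.0) ** 2)

def grad_Q(theta):
    """Gradient of Q."""
    return np.array([theta[0] - 1.0, 4.0 * (theta[1] + 2.0)])

alpha = 0.1                      # learning rate (assumed value)
theta = np.array([5.0, 5.0])     # arbitrary starting point
values = [Q(theta)]              # track the criterion along the path
for t in range(200):
    p_t = -alpha * grad_Q(theta)  # step in the negative gradient direction
    theta = theta + p_t
    values.append(Q(theta))
```

With this learning rate the criterion never increases from one iteration to the next, and `theta` approaches the minimizer. A learning rate that is too large (here, above 0.5) would make the iterates diverge.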



## Backpropagation

* Output layer -> hidden layer (with the factor $1/2$ included in $Q_i$)
$$
\begin{align*}
\frac{\partial Q_{i}}{\partial\beta^{(2)}} & =-\left[y_{i}-\hat{y}_{i}\right]\\
\frac{\partial Q_{i}}{\partial w_{j}^{(2)}} & =-\left[y_{i}-\hat{y}_{i}\right]\sigma\left(z_{j}^{(1)}\right)
\end{align*}
$$

* Hidden layer -> input layer: 
  * NN is a composite function. By the chain rule 
$$
\begin{align*}
\frac{\partial Q_{i}}{\partial\beta_{j}^{(1)}} & =\frac{\partial Q_{i}}{\partial\beta^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}^{(1)}\right)\\
\frac{\partial Q_{i}}{\partial w_{j\ell}^{(1)}} & =\frac{\partial Q_{i}}{\partial\beta^{(2)}}\cdot w_{j}^{(2)}\sigma'\left(z_{j}^{(1)}\right)x_{i\ell}
\end{align*}
$$
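
A common sanity check for backpropagation is to compare the chain-rule gradients with a finite-difference approximation. The sketch below does this for a single observation, assuming a sigmoid activation and $Q_i = \frac{1}{2}(y_i - \hat{y}_i)^2$. The function names, the sizes, and the random parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, beta1, W1, beta2, w2):
    """y_hat = beta2 + sum_j w2[j] * sigmoid(beta1[j] + W1[j, :] @ x)."""
    z1 = beta1 + W1 @ x
    return beta2 + w2 @ sigmoid(z1), z1

def loss(x, y, beta1, W1, beta2, w2):
    """Q_i with the factor 1/2 included: 0.5 * (y - y_hat)^2."""
    y_hat, _ = forward(x, beta1, W1, beta2, w2)
    return 0.5 * (y - y_hat) ** 2

def backprop(x, y, beta1, W1, beta2, w2):
    """Analytic gradients from the chain rule."""
    y_hat, z1 = forward(x, beta1, W1, beta2, w2)
    s = sigmoid(z1)
    d_beta2 = -(y - y_hat)                 # output-layer bias
    d_w2 = d_beta2 * s                     # output-layer weights
    delta1 = d_beta2 * w2 * s * (1.0 - s)  # uses sigmoid'(z) = s * (1 - s)
    d_beta1 = delta1                       # hidden-layer biases
    d_W1 = np.outer(delta1, x)             # hidden-layer weights
    return d_beta1, d_W1, d_beta2, d_w2

rng = np.random.default_rng(0)
p, K = 3, 4                                # assumed sizes
x, y = rng.normal(size=p), 0.7
beta1, W1 = rng.normal(size=K), rng.normal(size=(K, p))
beta2, w2 = 0.3, rng.normal(size=K)

d_beta1, d_W1, d_beta2, d_w2 = backprop(x, y, beta1, W1, beta2, w2)

# Central finite difference for one hidden-layer weight, W1[1, 2]
eps = 1e-6
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(x, y, beta1, W_plus, beta2, w2)
           - loss(x, y, beta1, W_minus, beta2, w2)) / (2 * eps)
```

If the two numbers `numeric` and `d_W1[1, 2]` disagree beyond rounding error, the analytic derivative is suspect. Note the factor `w2` in `delta1`: the hidden-layer gradient passes through the output weight before reaching $\sigma'$.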

**Example**: Use NN to fit a linear model.
* Notice $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$. A linear function can therefore be represented exactly by an NN with ReLU activation.
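
The identity can be checked numerically. The sketch below also writes the linear function $1 + 2x_1 + x_2$ as a ReLU network with four hidden nodes, using hand-picked weights. These weights are an illustration of exact representability, not the values that training would find.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Check x = ReLU(x) - ReLU(-x) on a grid
x = np.linspace(-3.0, 3.0, 13)
identity_holds = np.allclose(x, relu(x) - relu(-x))

# Hidden nodes compute ReLU(x1), ReLU(-x1), ReLU(x2), ReLU(-x2)
X = np.random.default_rng(0).normal(size=(5, 2))
W1 = np.array([[1.0, -1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, -1.0]])
b1 = np.zeros(4)
# Output layer recombines them: 2 * (x1) + 1 * (x2), plus the intercept
W2 = np.array([2.0, -2.0, 1.0, -1.0])
b2 = 1.0

nn_output = relu(X @ W1 + b1) @ W2 + b2
linear_output = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1]
```

`nn_output` and `linear_output` agree up to floating-point error, so a hidden layer of size 4 is already enough to represent this linear model exactly.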

In [1]:
import numpy as np

# simulate data
np.random.seed(1)  # For reproducibility
n = 100
x = np.random.rand(n, 2)
y = 1 + 2 * x[:, 0] + 1 * x[:, 1] + np.random.randn(n)  # Linear relationship with noise
y = y.reshape(-1, 1)  # Reshape to column vector (n, 1)


In [2]:
# Define the Neural Network class
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """Initialize weights and biases."""
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01  # Input to hidden weights
        self.b1 = np.zeros((1, hidden_size))  # bias
        
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01  # Hidden to output weights
        self.b2 = np.zeros((1, output_size))  # bias

    def forward(self, X):
        """Compute the forward pass."""
        self.Z1 = np.dot(X, self.W1) + self.b1  # Input to hidden layer
        self.A1 = np.maximum(0, self.Z1)  # ReLU activation

        self.Z2 = np.dot(self.A1, self.W2) + self.b2  # Hidden to output layer
        self.A2 = self.Z2  # Linear activation (identity)
        return self.A2

    def compute_loss(self, y_true, y_pred):
        """Calculate Mean Squared Error loss."""
        return np.mean((y_true - y_pred) ** 2)

    def backward(self, X, y):
        """Compute gradients using backpropagation."""
        n = X.shape[0]  # sample size

        # Output layer gradients. 2nd layer is computed first.
        dZ2 = 2 * (self.A2 - y) / n  # Gradient of loss w.r.t. Z2
        
        dW2 = np.dot(self.A1.T, dZ2)  # Gradient of loss w.r.t. W2
        db2 = np.sum(dZ2, axis=0, keepdims=True)  # Gradient of loss w.r.t. b2

        # Hidden layer gradients. 1st layer is computed last.
        dA1 = np.dot(dZ2, self.W2.T)  # Gradient of loss w.r.t. A1
        dZ1 = dA1 * (self.Z1 > 0)  # Gradient of loss w.r.t. Z1 (ReLU derivative)
        dW1 = np.dot(X.T, dZ1)  # Gradient of loss w.r.t. W1
        db1 = np.sum(dZ1, axis=0, keepdims=True)  # Gradient of loss w.r.t. b1
        return dW1, db1, dW2, db2

    def update(self, dW1, db1, dW2, db2, learning_rate):
        """Update weights and biases using gradients."""
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

    def train(self, X, y, epochs, learning_rate):
        """Train the neural network."""
        for epoch in range(epochs):
            y_pred = self.forward(X)  # Forward pass
            loss = self.compute_loss(y, y_pred)  # Compute loss
            if epoch % 100 == 0:  # Print loss every 100 epochs
                print(f"Epoch {epoch}, Loss: {loss:.6f}")
            dW1, db1, dW2, db2 = self.backward(X, y)  # Backward pass
            self.update(dW1, db1, dW2, db2, learning_rate)  # Update parameters


In [None]:
# Main execution
if __name__ == "__main__":
    # Initialize the neural network
    nn = NeuralNetwork(input_size=2, hidden_size=8, output_size=1)
    
    # Train the network
    nn.train(x, y, epochs=1000, learning_rate=0.01)
    
    # Evaluate the final performance
    final_pred = nn.forward(x)
    final_loss = nn.compute_loss(y, final_pred)
    print(f"Final Loss: {final_loss:.6f}")

Epoch 0, Loss: 7.843827
Epoch 100, Loss: 1.589406
Epoch 200, Loss: 1.461798
Epoch 300, Loss: 1.415684
Epoch 400, Loss: 1.347697
Epoch 500, Loss: 1.262448
Epoch 600, Loss: 1.181384
Epoch 700, Loss: 1.124319
Epoch 800, Loss: 1.090392
Epoch 900, Loss: 1.071762
Final Loss: 1.061774


A PyTorch implementation of the same neural network:

In [4]:
import torch

import torch.nn as nn
import torch.optim as optim

# Define the Neural Network class
class NeuralNetwork_torch(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NeuralNetwork_torch, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out


In [5]:
# Convert numpy arrays to torch tensors
x_tensor = torch.tensor(x, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Initialize the neural network
input_size = x.shape[1]
hidden_size = 8
output_size = 1
model = NeuralNetwork_torch(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Train the neural network
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    outputs = model(x_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

# Evaluate the final performance
final_pred = model(x_tensor).detach().numpy()
final_loss = criterion(torch.tensor(final_pred), y_tensor).item()
print(f"Final Loss: {final_loss:.6f}")

Epoch 0, Loss: 8.963584
Epoch 100, Loss: 1.196773
Epoch 200, Loss: 1.137686
Epoch 300, Loss: 1.098162
Epoch 400, Loss: 1.073694
Epoch 500, Loss: 1.059587
Epoch 600, Loss: 1.051506
Epoch 700, Loss: 1.046908
Epoch 800, Loss: 1.044397
Epoch 900, Loss: 1.043045
Final Loss: 1.042303


## Stochastic Gradient Descent

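* When $n$ is large, evaluating the full-sample gradient at every step is costly
* Sample a *minibatch* and compute the gradient on it
  * Unbiased estimate of the full gradient, but with large variance
* The learning rate controls how far each noisy step moves the parameters
* Many epochs (passes over the data) are typically needed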
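
A minimal minibatch SGD loop, sketched for a linear model with squared-error loss so that the gradient is easy to read. The data-generating process, batch size, learning rate, and number of epochs are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, batch_size = 1000, 3, 50          # hypothetical sizes
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def gradient(theta, Xb, yb):
    """Gradient of the mean squared error on the rows (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(p)
alpha = 0.05                             # learning rate (assumed value)
for epoch in range(20):                  # one epoch = one pass over the data
    order = rng.permutation(n)           # reshuffle the data each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        theta -= alpha * gradient(theta, X[idx], y[idx])

# Averaging equal-sized minibatch gradients over a full partition of the
# data recovers the full-sample gradient
theta0 = np.zeros(p)
batch_grads = [gradient(theta0, X[i:i + batch_size], y[i:i + batch_size])
               for i in range(0, n, batch_size)]
full_grad = gradient(theta0, X, y)
```

Each individual entry of `batch_grads` differs from `full_grad`, which is the variance cost of using a minibatch, but their average matches it.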



## Regularization

* $L_1$-norm (Lasso)
* $L_2$-norm (ridge)
* Learning rate
* Number of epochs and minibatches
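
An $L_2$ penalty only adds a term proportional to the parameter itself to the gradient. The sketch below illustrates this on a linear model, where gradient descent can be compared with the closed-form ridge solution. The penalty strength `lam`, the learning rate, and the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(size=n)

lam = 0.5      # penalty strength (hypothetical value)
alpha = 0.05   # learning rate (hypothetical value)

theta = np.zeros(p)
for t in range(2000):
    grad_fit = X.T @ (X @ theta - y) / n   # gradient of (1/2n) * ||y - X theta||^2
    grad_penalty = lam * theta             # gradient of (lam/2) * ||theta||^2
    theta -= alpha * (grad_fit + grad_penalty)

# Closed-form ridge solution of the same penalized objective
theta_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
# Unpenalized least squares, to compare the size of the coefficients
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The penalized estimate has a smaller norm than the least-squares estimate. In neural network software the same gradient term is often exposed as "weight decay".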


## Simulation Example

* Use NN to solve Poisson regression
  * A trivial example for demonstration
  * No hidden layer
  * Keep the essence
  
* See `data_example/nn_torch.ipynb`
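
For readers without the notebook at hand, here is a NumPy sketch of the same idea. It is not the notebook's code: it treats Poisson regression as a network with no hidden layer, an exponential output, and the Poisson negative log-likelihood as the loss. The true parameters, sample size, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
w_true, b_true = np.array([0.5, -0.3]), 0.2      # hypothetical true parameters
y = rng.poisson(np.exp(b_true + X @ w_true))

def neg_loglik(w, b):
    """Average Poisson negative log-likelihood, dropping the log(y!) term."""
    eta = b + X @ w
    return np.mean(np.exp(eta) - y * eta)

w, b = np.zeros(2), 0.0
alpha = 0.1                                      # learning rate (assumed value)
for t in range(2000):
    mu = np.exp(b + X @ w)         # network output: exp of a linear index
    resid = mu - y                 # derivative of the loss w.r.t. the linear index
    w -= alpha * X.T @ resid / n   # weight update
    b -= alpha * resid.mean()      # bias update
```

After training, `w` and `b` are close to the values used to simulate the data, which is what the maximum likelihood estimator would deliver.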

## Network Structures

* Time series
  * Recurrent NN (RNN)
  * Long short-term memory (LSTM) (See `data_example/nn_LSTM.ipynb`)
* Graphics
  * Convolutional NN (CNN)


In [6]:
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic time series data (sine wave with noise)
def generate_data(n_samples, seq_length):
    t = np.linspace(0, 10, n_samples)
    data = np.sin(t) + np.random.normal(0, 0.1, n_samples)  # Sine wave + noise
    X, y = [], []
    for i in range(n_samples - seq_length):
        X.append(data[i:i + seq_length])  # Input sequence
        y.append(data[i + seq_length])    # Next value to predict
    return np.array(X), np.array(y).reshape(-1, 1)

# Define RNN class
class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        """Initialize weights and biases."""
        # Weight matrices
        self.Wxh = np.random.randn(input_size, hidden_size) * 0.01  # Input to hidden
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden to hidden
        self.Why = np.random.randn(hidden_size, output_size) * 0.01  # Hidden to output
        # Biases
        self.bh = np.zeros((1, hidden_size))  # Hidden bias
        self.by = np.zeros((1, output_size))  # Output bias

    def forward(self, X):
        """Forward pass through the RNN."""
        seq_length = X.shape[1]
        batch_size = X.shape[0]
        hidden_size = self.Whh.shape[0]
        
        # Initialize hidden state and storage for activations
        h = np.zeros((batch_size, hidden_size))
        self.hs = [h]  # Store hidden states for backprop
        self.xs = [X[:, t, :] for t in range(seq_length)]  # Store inputs
        
        # Forward pass over time steps
        for t in range(seq_length):
            h = np.tanh(np.dot(X[:, t, :], self.Wxh) + np.dot(h, self.Whh) + self.bh)
            self.hs.append(h)
        
        # Output layer (at the last time step)
        y = np.dot(h, self.Why) + self.by
        return y

    def backward(self, X, y_true, y_pred, learning_rate):
        """Backpropagation through time (BPTT)."""
        seq_length = X.shape[1]
        batch_size = X.shape[0]
        
        # Initialize gradients
        dWxh, dWhh, dWhy = np.zeros_like(self.Wxh), np.zeros_like(self.Whh), np.zeros_like(self.Why)
        dbh, dby = np.zeros_like(self.bh), np.zeros_like(self.by)
        
        # Output layer gradient
        dy = y_pred - y_true  # Gradient of 0.5 * squared error w.r.t. output (averaged over the batch below)
        dWhy = np.dot(self.hs[-1].T, dy) / batch_size  # Gradient w.r.t. Why
        dby = np.sum(dy, axis=0, keepdims=True) / batch_size  # Gradient w.r.t. by
        
        # Initialize hidden state gradient
        dh_next = np.dot(dy, self.Why.T)  # Gradient from output to last hidden state
        
        # Backprop through time
        for t in range(seq_length - 1, -1, -1):
            dh_raw = dh_next * (1 - self.hs[t + 1] ** 2)  # Gradient w.r.t. pre-activation (tanh derivative)
            dbh += np.sum(dh_raw, axis=0, keepdims=True) / batch_size
            dWxh += np.dot(self.xs[t].T, dh_raw) / batch_size
            dWhh += np.dot(self.hs[t].T, dh_raw) / batch_size
            dh_next = np.dot(dh_raw, self.Whh.T)  # Propagate gradient to previous time step
        
        # Clip gradients to prevent exploding gradients
        for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
            np.clip(dparam, -5, 5, out=dparam)
        
        # Update weights and biases
        self.Wxh -= learning_rate * dWxh
        self.Whh -= learning_rate * dWhh
        self.Why -= learning_rate * dWhy
        self.bh -= learning_rate * dbh
        self.by -= learning_rate * dby

    def train(self, X, y, epochs, learning_rate):
        """Train the RNN."""
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = np.mean((y_pred - y) ** 2)  # Mean Squared Error
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.6f}")
            self.backward(X, y, y_pred, learning_rate)

# Main execution
if __name__ == "__main__":
    # Generate data
    n_samples = 1000
    seq_length = 10
    X, y = generate_data(n_samples, seq_length)
    X = X.reshape(-1, seq_length, 1)  # Shape: (samples, timesteps, features)
    
    # Initialize and train RNN
    rnn = SimpleRNN(input_size=1, hidden_size=5, output_size=1)
    rnn.train(X, y, epochs=1000, learning_rate=0.01)
    
    # Test the model
    y_pred = rnn.forward(X)
    final_loss = np.mean((y_pred - y) ** 2)
    print(f"Final Loss: {final_loss:.6f}")
    
    # Optional: Print some predictions
    print("\nSample Predictions vs Targets:")
    for i in range(5):
        print(f"Predicted: {y_pred[i][0]:.4f}, Target: {y[i][0]:.4f}")

Epoch 0, Loss: 0.497974
Epoch 100, Loss: 0.467624
Epoch 200, Loss: 0.463342
Epoch 300, Loss: 0.462317
Epoch 400, Loss: 0.461109
Epoch 500, Loss: 0.458347
Epoch 600, Loss: 0.451663
Epoch 700, Loss: 0.435664
Epoch 800, Loss: 0.398754
Epoch 900, Loss: 0.321377
Final Loss: 0.194452

Sample Predictions vs Targets:
Predicted: 0.1906, Target: 0.0536
Predicted: 0.1661, Target: 0.0633
Predicted: 0.1653, Target: 0.1440
Predicted: 0.1913, Target: -0.0616
Predicted: 0.1289, Target: -0.0328


In [7]:
import torch

import torch.nn as nn
import torch.optim as optim

# Define the RNN class
class RNN_torch(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN_torch, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)  # Initial hidden state
        out, _ = self.rnn(x, h0)  # RNN forward pass
        out = self.fc(out[:, -1, :])  # Fully connected layer on the last time step
        return out

# Convert numpy arrays to torch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Initialize the RNN
input_size = X.shape[2]
hidden_size = 5
output_size = 1
model = RNN_torch(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Train the RNN
epochs = 1000
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

# Evaluate the final performance
model.eval()
with torch.no_grad():
    final_pred = model(X_tensor).numpy()
    final_loss = criterion(torch.tensor(final_pred), y_tensor).item()
    print(f"Final Loss: {final_loss:.6f}")

# Optional: Print some predictions
print("\nSample Predictions vs Targets:")
for i in range(5):
    print(f"Predicted: {final_pred[i][0]:.4f}, Target: {y[i][0]:.4f}")

Epoch 0, Loss: 0.708658
Epoch 100, Loss: 0.013469
Epoch 200, Loss: 0.012991
Epoch 300, Loss: 0.012638
Epoch 400, Loss: 0.012350
Epoch 500, Loss: 0.012142
Epoch 600, Loss: 0.011987
Epoch 700, Loss: 0.011860
Epoch 800, Loss: 0.011754
Epoch 900, Loss: 0.011671
Final Loss: 0.011614

Sample Predictions vs Targets:
Predicted: 0.0969, Target: 0.0536
Predicted: 0.0838, Target: 0.0633
Predicted: 0.0803, Target: 0.1440
Predicted: 0.0931, Target: -0.0616
Predicted: 0.0501, Target: -0.0328


## Theory is Incomplete

* Theoretical understanding is an ongoing endeavor.
* Hornik, Stinchcombe, and White (1989):
  * A single-hidden-layer neural network, given sufficiently many nodes, is a *universal approximator* for any measurable function.
* Deep learning: engineering breakthrough
* Big data available
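
The flavor of the universal approximation result can be illustrated, though not proved, with a hand-built network. The sketch below constructs a single-hidden-layer ReLU network that linearly interpolates $\sin(x)$ on $[0, \pi]$ and shows the maximum error shrinking as the number of hidden nodes $K$ grows. The construction, the target function, and the helper name `relu_interpolant` are assumptions made for this example; the theorem itself is stated for a different class of activation functions.

```python
import numpy as np

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU network that linearly interpolates f at the knots."""
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)               # slope on each segment
    coefs = np.concatenate(([slopes[0]], np.diff(slopes)))  # output weights

    def network(x):
        # K hidden nodes, node k computes ReLU(x - knots[k])
        hidden = np.maximum(0.0, x[:, None] - knots[None, :-1])
        return values[0] + hidden @ coefs

    return network

grid = np.linspace(0.0, np.pi, 1001)
errors = {}
for K in (5, 50):
    knots = np.linspace(0.0, np.pi, K + 1)
    net = relu_interpolant(np.sin, knots)
    errors[K] = np.max(np.abs(net(grid) - np.sin(grid)))
```

Going from 5 to 50 hidden nodes reduces the maximum error by roughly two orders of magnitude, in line with the $O(K^{-2})$ error of piecewise-linear interpolation.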