# Multi-Layer Perceptron from Scratch

In this notebook, we build a multi-layer perceptron (MLP) using only `numpy`. The data are sampled from the model
$$ y = \sin x + 0.1 \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) $$

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Data generation
np.random.seed(87)
N = 200                                             # number of data points
X = np.linspace(-np.pi, np.pi, N).reshape(-1, 1)    # column vector for matrix operations. X: N x 1
y = np.sin(X) + 0.1 * np.random.randn(N, 1)         # column vector for matrix operations. y: N x 1

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(X, y, s=5, label="data", color='black')

ax.set_title("Data")
ax.legend()

plt.show()

## 1. Linear regression
Because $\sin x$ is nonlinear on $[-\pi, \pi]$, a straight line cannot follow the shape of the data, so we expect linear regression to give a poor fit.
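
For reference, the gradient-descent updates in the next cell follow from differentiating the MSE loss. Writing $\hat{y}_i = w x_i + b$ and $e_i = \hat{y}_i - y_i$ (notation introduced here, corresponding to `y_pred_lin` and `error_lin`), a short derivation gives
$$ L(w, b) = \frac{1}{N} \sum_{i=1}^{N} e_i^2, \qquad \frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} x_i e_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} e_i. $$
In matrix form, with $e$ the $N \times 1$ vector of errors, these are $\frac{2}{N} X^\top e$ and twice the mean of $e$, which is what `grad_w` and `grad_b` compute.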

In [None]:
lr = 0.01
epochs = 1000         # number of iterations

w = np.zeros((1, 1))  # For the matrix operation
b = np.zeros(1)

lin_loss_history = np.zeros(epochs)

for epoch in range(epochs):
    y_pred_lin = X @ w + b
    error_lin = y_pred_lin - y

    lin_loss_history[epoch] = np.mean(error_lin**2)

    grad_w = 2 * X.T @ error_lin / len(X)
    grad_b = 2 * np.mean(error_lin)

    w -= lr * grad_w      # grad_w already has the same shape as w
    b -= lr * grad_b

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(X, y, s=5, label="data", color='black')
ax.plot(X, y_pred_lin, label="predicted", color='red')

ax.set_title("Linear Regression")
ax.legend()

plt.show()

In [None]:
plt.plot(lin_loss_history)
plt.ylim([0, 0.6])
plt.xlabel("epochs")
plt.ylabel("MSE")
plt.title("Loss history (Linear regression)")
plt.show()

## 2. One hidden layer with ReLU
Recall that 
$$ \text{ReLU}(x) = \max (0, x) = \begin{cases}
x, & x > 0, \\
0, & x \le 0.
\end{cases} $$
Although ReLU is not differentiable at $x = 0$, we adopt the convention $\text{ReLU}'(0) = 0$ and define
$$ \text{ReLU}'(x) = \begin{cases}
1, & x > 0, \\
0, & x \le 0.
\end{cases} $$

In [None]:
# ReLU
def relu(x):
    return np.maximum(0, x)
def relu_deriv(x):
    return (x > 0).astype(float)

We insert one hidden layer of dimension 5 (`hidden_dim`) between the input layer and the output layer.
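
The training loop below implements the following forward and backward passes. The symbols $\delta_y$, $\delta_z$, $\delta_h$ are shorthand introduced here for the arrays `dy`, `dz1`, `dh1`; $\odot$ denotes the elementwise product, $\mathbf{1}$ is the all-ones vector of length $N$, and the biases $b_1$, $b_2$ are broadcast across the $N$ rows.
$$ \begin{aligned}
\text{Forward:} \quad & h_1 = X W_1 + b_1, \quad z_1 = \text{ReLU}(h_1), \quad \hat{y} = z_1 W_2 + b_2, \quad L = \frac{1}{N} \lVert \hat{y} - y \rVert^2, \\
\text{Backward:} \quad & \delta_y = \frac{2}{N} (\hat{y} - y), \quad \frac{\partial L}{\partial W_2} = z_1^\top \delta_y, \quad \frac{\partial L}{\partial b_2} = \mathbf{1}^\top \delta_y, \\
& \delta_z = \delta_y W_2^\top, \quad \delta_h = \delta_z \odot \text{ReLU}'(h_1), \quad \frac{\partial L}{\partial W_1} = X^\top \delta_h, \quad \frac{\partial L}{\partial b_1} = \mathbf{1}^\top \delta_h.
\end{aligned} $$
Each gradient has the same shape as the parameter it updates, which is a quick sanity check when writing the backward pass by hand.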

In [None]:
# Hyperparameters
input_dim = 1
hidden_dim = 5
output_dim = 1
lr = 0.01
epochs = 1000


# Initialization
np.random.seed(100)
W1 = np.random.randn(input_dim, hidden_dim) * 0.1
b1 = np.zeros((1, hidden_dim))
W2 = np.random.randn(hidden_dim, output_dim) * 0.1
b2 = np.zeros((1, output_dim))

In [None]:
loss_history_1 = []

for epoch in range(epochs):
    # Forward
    h1 = X @ W1 + b1
    z1 = relu(h1) 
    y_pred_1 = z1 @ W2 + b2

    # Loss (MSE)
    error_1 = y_pred_1 - y
    loss_1 = np.mean(error_1**2)
    loss_history_1.append(loss_1)

    # Backward
    dy = 2 * error_1 / len(X)
    
    dW2 = z1.T @ dy
    db2 = np.sum(dy, axis=0, keepdims=True)

    dz1 = dy @ W2.T
    dh1 = dz1 * relu_deriv(h1)
    dW1 = X.T @ dh1
    db1 = np.sum(dh1, axis=0, keepdims=True)

    # Update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: MSE = {loss_1:.4f}")

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(X, y, s=5, label="data", color='black')
ax.plot(X, y_pred_1, label="predicted", color='red')

ax.set_title(f"1 Hidden Layer (after {epochs} epochs)")
ax.legend()

plt.show()

In [None]:
plt.plot(loss_history_1)
plt.ylim([0, 0.6])
plt.xlabel("epochs")
plt.ylabel("MSE")
plt.title("Loss history (1 Hidden Layer)")
plt.show()

## 3. Two hidden layers with tanh

The hyperbolic tangent, $\tanh$, is another popular activation function. It is defined as
$$ \tanh (x) = \frac{e^x -e^{-x}}{e^x + e^{-x}}. $$
Its derivative is given by
$$ \tanh'(x) = 1 - \tanh^2(x). $$
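
As a quick sanity check on this identity, the analytic derivative can be compared against a central finite difference. This is only a minimal sketch: the grid `z`, the step size `eps`, and the name `max_abs_diff` are illustrative choices, and `tanh_grad` mirrors the function defined in the next cell.

```python
import numpy as np

def tanh_grad(z):
    # Analytic derivative: 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

# Central finite difference as an independent numerical estimate
z = np.linspace(-3, 3, 13)   # illustrative grid of test points
eps = 1e-6                   # illustrative step size
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - tanh_grad(z)))
print(f"max |numeric - analytic| = {max_abs_diff:.2e}")
```

The same finite-difference idea can be applied to a full loss function to check a hand-written backward pass, one parameter at a time.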

In [None]:
def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

In [None]:
# Hyperparameters
input_dim = 1
h1_dim = 50
h2_dim = 50
output_dim = 1
lr = 0.01
epochs = 1000

# Initialization
W1 = np.random.randn(input_dim, h1_dim) * 0.1
b1 = np.zeros((1, h1_dim))

W2 = np.random.randn(h1_dim, h2_dim) * 0.1
b2 = np.zeros((1, h2_dim))

W3 = np.random.randn(h2_dim, output_dim) * 0.1
b3 = np.zeros((1, output_dim))

loss_history_2 = []

for epoch in range(epochs):
    # Forward
    h1 = X @ W1 + b1
    z1 = tanh(h1)

    h2 = z1 @ W2 + b2
    z2 = tanh(h2)

    y_pred_2 = z2 @ W3 + b3

    # Loss
    error_2 = y_pred_2 - y
    loss_2 = np.mean(error_2 ** 2)
    loss_history_2.append(loss_2)

    # Backward
    dy = 2 * error_2 / len(X)

    dW3 = z2.T @ dy
    db3 = np.sum(dy, axis=0, keepdims=True)

    dz2 = dy @ W3.T
    dh2 = dz2 * tanh_grad(h2)

    dW2 = z1.T @ dh2
    db2 = np.sum(dh2, axis=0, keepdims=True)

    dz1 = dh2 @ W2.T
    dh1 = dz1 * tanh_grad(h1)

    dW1 = X.T @ dh1
    db1 = np.sum(dh1, axis=0, keepdims=True)

    # Update
    W3 -= lr * dW3
    b3 -= lr * db3
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

    if epoch % 100 == 0:
        print(f"Epoch {epoch}: MSE = {loss_2:.4f}")

In [None]:
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(X, y, s=5, label="data", color='black')
ax.plot(X, y_pred_2, label="predicted", color='red')

ax.set_title(f"2 Hidden Layers (after {epochs} epochs)")
ax.legend()

plt.show()

In [None]:
plt.plot(loss_history_2)
plt.ylim([0, 0.6])
plt.xlabel("epochs")
plt.ylabel("MSE")
plt.title("Loss history (2 Hidden layers)")
plt.show()

#### Discussion
In this example, the MLP was trained on the entire dataset, with no data held out. The training loss therefore keeps decreasing as the number of epochs grows, but a lower training loss does not guarantee better performance on new, unseen data. In fact, it may lead to **overfitting**.

To achieve better generalization, how could you redesign the training process?

- How would you split the dataset?
- What would be an appropriate stopping condition to prevent overfitting?
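
One possible way to approach these questions, sketched here under stated assumptions rather than as the intended answer: hold out a random 20% of the points as a validation set, train on the remaining 80%, and stop once the validation MSE has not improved for `patience` consecutive epochs, keeping the best parameters seen so far. The split ratio, `patience`, the network size, and helper names such as `forward` and `mse` are illustrative choices, not part of the notebook above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same data model as above: y = sin(x) + 0.1 * noise
N = 200
X = np.linspace(-np.pi, np.pi, N).reshape(-1, 1)
y = np.sin(X) + 0.1 * rng.standard_normal((N, 1))

# Random 80/20 train/validation split (the ratio is an assumption)
perm = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, val_idx = perm[:n_train], perm[n_train:]
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# One tanh hidden layer; all hyperparameters here are illustrative
hidden_dim, lr, max_epochs, patience = 20, 0.01, 2000, 100
params = {
    "W1": rng.standard_normal((1, hidden_dim)) * 0.1,
    "b1": np.zeros((1, hidden_dim)),
    "W2": rng.standard_normal((hidden_dim, 1)) * 0.1,
    "b2": np.zeros((1, 1)),
}

def forward(p, X):
    # Returns the hidden activations and the network output
    z1 = np.tanh(X @ p["W1"] + p["b1"])
    return z1, z1 @ p["W2"] + p["b2"]

def mse(p, X, y):
    return float(np.mean((forward(p, X)[1] - y) ** 2))

initial_val_loss = mse(params, X_val, y_val)
best_val_loss, best_epoch = initial_val_loss, 0
best_params = {k: v.copy() for k, v in params.items()}
epochs_without_improvement = 0

for epoch in range(1, max_epochs + 1):
    # Gradient step computed on the training set only
    z1, y_pred = forward(params, X_tr)
    dy = 2 * (y_pred - y_tr) / len(X_tr)
    dh1 = (dy @ params["W2"].T) * (1 - z1 ** 2)   # tanh'(h) = 1 - tanh(h)^2
    grads = {
        "W2": z1.T @ dy,
        "b2": dy.sum(axis=0, keepdims=True),
        "W1": X_tr.T @ dh1,
        "b1": dh1.sum(axis=0, keepdims=True),
    }
    for k in params:
        params[k] -= lr * grads[k]

    # Monitor the held-out validation loss
    val_loss = mse(params, X_val, y_val)
    if val_loss < best_val_loss:
        best_val_loss, best_epoch = val_loss, epoch
        best_params = {k: v.copy() for k, v in params.items()}
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss has stalled: stop early

print(f"stopped at epoch {epoch}, best epoch {best_epoch}, "
      f"best validation MSE {best_val_loss:.4f}")
```

After the loop, `best_params` holds the weights with the lowest validation MSE, which may differ from the final `params`. A third held-out test set would be needed for an unbiased estimate of performance, since the validation set has been used to choose the stopping point.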