# **Exercise for Unit 4**

**BSCS 3B-AI**

Submitted by:
- Gabriel M. Diana
- Ken Meiro C. Villareal

___
*Note: Save your Python source codes into a single .ipynb file with the proper naming convention (see Readme on the repository), upload it to your assigned folder in the GitHub organization CCS-248-Artificial-Neural-Networks, repository “25-26”.*

1. Study the Backpropagation implementation uploaded here, and perform the following:
    Set up the code so that it will perform the Forward Pass (FP), Backpropagation (BP) and weight update in 1000 epochs.



___
### 1.) **FORWARD PASS, BACKPROPAGATION, AND WEIGHT UPDATE**

In [6]:
import numpy as np

# 1. Initialize data and parameters

# Example: simple XOR dataset
X = np.array([[0,0],
              [0,1],
              [1,0],
              [1,1]])

y = np.array([[0],
              [1],
              [1],
              [0]])

# Set random seed for reproducibility
np.random.seed(42)

# Network architecture
input_neurons = 2
hidden_neurons = 3
output_neurons = 1
learning_rate = 0.1
epochs = 1000

# Initialize weights and biases
W1 = np.random.randn(input_neurons, hidden_neurons)
b1 = np.zeros((1, hidden_neurons))
W2 = np.random.randn(hidden_neurons, output_neurons)
b2 = np.zeros((1, output_neurons))

# 2. Define activation functions

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # Expects the sigmoid output a = sigmoid(z), not the pre-activation z
    return a * (1 - a)

# 3. Training Loop (FP + BP + Update)

for epoch in range(epochs):
    # ---- Forward Pass ----
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)
    
    # ---- Compute Error ----
    error = y - a2
    
    # ---- Backpropagation ----
    d_output = error * sigmoid_derivative(a2)
    d_hidden = d_output.dot(W2.T) * sigmoid_derivative(a1)
    
    # ---- Update Weights ----
    W2 += a1.T.dot(d_output) * learning_rate
    b2 += np.sum(d_output, axis=0, keepdims=True) * learning_rate
    W1 += X.T.dot(d_hidden) * learning_rate
    b1 += np.sum(d_hidden, axis=0, keepdims=True) * learning_rate
    

    if (epoch + 1) % 100 == 0:
        loss = np.mean(np.square(error))
        print(f"Epoch {epoch+1}/{epochs} | Loss: {loss:.6f}")


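# 4. Final Predictions (recompute the forward pass with the final weights)
a1 = sigmoid(np.dot(X, W1) + b1)
a2 = sigmoid(np.dot(a1, W2) + b2)
print("\nFinal outputs after training:")
print(a2.round(3))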

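The `+=` updates in the training loop rely on `d_output` already pointing downhill, because the error is defined as `y - a2`. The sketch below is one way to sanity-check that convention: it compares `a1.T.dot(d_output)` with a central finite-difference estimate of the gradient with respect to `W2`. It assumes the loss being minimized is half the sum of squared errors, which the exercise does not state explicitly; `half_sse`, `eps`, and the seed are illustrative choices rather than part of the original code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def half_sse(X, y, W1, b1, W2, b2):
    # Assumed loss: 0.5 * sum of squared errors over all samples
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return 0.5 * np.sum((y - a2) ** 2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)  # illustrative seed
W1 = rng.standard_normal((2, 3))
b1 = np.zeros((1, 3))
W2 = rng.standard_normal((3, 1))
b2 = np.zeros((1, 1))

# Analytic update direction, written the same way as in the training loop
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)
d_output = (y - a2) * a2 * (1 - a2)
step_W2 = a1.T @ d_output  # should equal the NEGATIVE gradient of the loss

# Central finite-difference estimate of dLoss/dW2
eps = 1e-6
num_grad_W2 = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        W2_plus = W2.copy()
        W2_plus[i, j] += eps
        W2_minus = W2.copy()
        W2_minus[i, j] -= eps
        num_grad_W2[i, j] = (
            half_sse(X, y, W1, b1, W2_plus, b2)
            - half_sse(X, y, W1, b1, W2_minus, b2)
        ) / (2 * eps)

# If step_W2 is the negative gradient, the sum is close to zero
max_diff = np.max(np.abs(step_W2 + num_grad_W2))
print(f"max |step_W2 + numerical gradient| = {max_diff:.2e}")
```

A small `max_diff` suggests that adding `learning_rate * a1.T.dot(d_output)` to `W2` is a gradient-descent step under the assumed loss.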

### 2.) **Modify the Optimizer class so that it will accept 3 optimizers we've discussed**

a. Learning rate decay

b. Momentum

c. Adaptive Gradient

**Hint: Update the decayed learning rate before running FP and BP; the momentum and vanilla SGD weight updates happen after the learning rate decay.**
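As a rough illustration of the decay step in the hint, the sketch below tabulates an inverse-time schedule, `lr = initial_lr / (1 + decay * iteration)`, at a few iterations. The helper name `decayed_lr` and the sampled iterations are hypothetical; the values `0.1` and `0.001` match the settings used in this exercise's example loop.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Inverse-time decay of the learning rate (illustrative helper)."""
    return initial_lr / (1 + decay * iteration)

initial_lr = 0.1
decay = 0.001

# Learning rate at the start, after 100 and 500 iterations, and at the last of 1000
schedule = {t: decayed_lr(initial_lr, decay, t) for t in (0, 100, 500, 999)}

for t, lr in schedule.items():
    print(f"iteration {t:4d} -> lr = {lr:.4f}")
```

With these settings the learning rate falls to roughly half its initial value by the final iteration, so the decay is gentle over 1000 epochs.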

In [None]:
import numpy as np

class Optimizer:
    def __init__(self, lr=0.1, decay=0.0, momentum=0.0, use_adagrad=False):
        # Initialize main optimizer settings
        self.lr = lr
        self.initial_lr = lr
        self.decay = decay
        self.momentum = momentum
        self.use_adagrad = use_adagrad
        self.iterations = 0

        # Initialize caches for momentum and Adagrad
        self.v_W1 = 0; self.v_b1 = 0; self.v_W2 = 0; self.v_b2 = 0
        self.G_W1 = 1e-8; self.G_b1 = 1e-8; self.G_W2 = 1e-8; self.G_b2 = 1e-8

    def update_learning_rate(self):
        # Apply learning rate decay before forward and backward pass
        if self.decay > 0:
            self.lr = self.initial_lr / (1 + self.decay * self.iterations)

    def update_weights(self, W1, b1, W2, b2, dW1, db1, dW2, db2):
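        # Apply momentum as an exponential moving average of the update directions.
        # dW*/db* are expected to be descent directions (negative gradients),
        # which is why the weights are updated with += below.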
        self.v_W1 = self.momentum * self.v_W1 + (1 - self.momentum) * dW1
        self.v_b1 = self.momentum * self.v_b1 + (1 - self.momentum) * db1
        self.v_W2 = self.momentum * self.v_W2 + (1 - self.momentum) * dW2
        self.v_b2 = self.momentum * self.v_b2 + (1 - self.momentum) * db2

        if self.use_adagrad:
            # Update Adagrad caches and apply adaptive scaling
            self.G_W1 += dW1 ** 2; self.G_b1 += db1 ** 2
            self.G_W2 += dW2 ** 2; self.G_b2 += db2 ** 2
            W1 += self.lr * self.v_W1 / (np.sqrt(self.G_W1) + 1e-8)
            b1 += self.lr * self.v_b1 / (np.sqrt(self.G_b1) + 1e-8)
            W2 += self.lr * self.v_W2 / (np.sqrt(self.G_W2) + 1e-8)
            b2 += self.lr * self.v_b2 / (np.sqrt(self.G_b2) + 1e-8)
        else:
            # Standard SGD + Momentum update
            W1 += self.lr * self.v_W1; b1 += self.lr * self.v_b1
            W2 += self.lr * self.v_W2; b2 += self.lr * self.v_b2

        self.iterations += 1
        return W1, b1, W2, b2


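# Example training loop
# Reuses X, y, W1, b1, W2, b2, sigmoid, and sigmoid_derivative from the cell in
# section 1, so training continues from the weights left by that cell.
optimizer = Optimizer(lr=0.1, decay=0.001, momentum=0.9, use_adagrad=False)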

for epoch in range(1000):
    optimizer.update_learning_rate()  # Decay step

    # Forward pass
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)

    # Backpropagation
    error = y - a2
    d_output = error * sigmoid_derivative(a2)
    d_hidden = d_output.dot(W2.T) * sigmoid_derivative(a1)
    
    dW2 = a1.T.dot(d_output); db2 = np.sum(d_output, axis=0, keepdims=True)
    dW1 = X.T.dot(d_hidden); db1 = np.sum(d_hidden, axis=0, keepdims=True)

    # Update weights using optimizer
    W1, b1, W2, b2 = optimizer.update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2)

    # Display loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        loss = np.mean(np.square(error))
        print(f"Epoch {epoch+1}/1000 | LR: {optimizer.lr:.4f} | Loss: {loss:.6f}")
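The momentum update in `update_weights` scales the incoming gradient by `(1 - momentum)`, which makes the velocity an exponential moving average. Another common formulation omits that factor. The sketch below compares the two on a constant update direction; the variable names and the constant value are illustrative only and do not come from the exercise.

```python
momentum = 0.9
gradient = 1.0  # constant update direction, chosen only for illustration
steps = 50

v_ema = 0.0        # form used in update_weights: m * v + (1 - m) * g
v_classical = 0.0  # alternative textbook form:   m * v + g

for _ in range(steps):
    v_ema = momentum * v_ema + (1 - momentum) * gradient
    v_classical = momentum * v_classical + gradient

print(f"EMA velocity after {steps} steps:       {v_ema:.4f}")
print(f"Classical velocity after {steps} steps: {v_classical:.4f}")
```

Under a steady gradient the averaged form settles near the gradient itself, while the classical form settles near `gradient / (1 - momentum)`. For the same `lr`, the averaged form therefore takes steps about ten times smaller when `momentum=0.9`, which is worth keeping in mind when comparing learning rates across implementations.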


### 3.) **Display the accuracy once every 100 epochs have elapsed, to see if the accuracy is increasing. Paste a screenshot here of your console that shows the accuracy.**

In [None]:
import numpy as np

# Initialize data and parameters
X = np.array([[0,0],
              [0,1],
              [1,0],
              [1,1]])

y = np.array([[0],
              [1],
              [1],
              [0]])

# Set random seed for reproducibility
np.random.seed(42)

# Initialize weights and biases
W1 = np.random.randn(2, 3)
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1)
b2 = np.zeros((1, 1))

# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # Expects the sigmoid output a = sigmoid(z), not the pre-activation z
    return a * (1 - a)

# Initialize optimizer
optimizer = Optimizer(lr=0.1, decay=0.001, momentum=0.9, use_adagrad=False)

# Training loop with accuracy display
for epoch in range(1000):
    optimizer.update_learning_rate()

    # Forward pass
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)

    # Backpropagation
    error = y - a2
    d_output = error * sigmoid_derivative(a2)
    d_hidden = d_output.dot(W2.T) * sigmoid_derivative(a1)
    
    dW2 = a1.T.dot(d_output)
    db2 = np.sum(d_output, axis=0, keepdims=True)
    dW1 = X.T.dot(d_hidden)
    db1 = np.sum(d_hidden, axis=0, keepdims=True)

    # Update weights
    W1, b1, W2, b2 = optimizer.update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2)

    # Display loss and accuracy every 100 epochs
    if (epoch + 1) % 100 == 0:
        loss = np.mean(np.square(error))
        predictions = (a2 > 0.5).astype(int)
        accuracy = np.mean(predictions == y) * 100
        print(f"Epoch {epoch+1}/1000 | LR: {optimizer.lr:.4f} | Loss: {loss:.6f} | Accuracy: {accuracy:.2f}%")
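To make the accuracy line concrete, here is a small worked example of the same thresholding rule, written in plain Python. The `outputs` values are hypothetical and are not results from this notebook.

```python
outputs = [0.10, 0.80, 0.40, 0.20]  # hypothetical network outputs for the 4 XOR rows
targets = [0, 1, 1, 0]              # XOR labels

# Threshold at 0.5, as in (a2 > 0.5).astype(int)
predictions = [1 if p > 0.5 else 0 for p in outputs]

# Fraction of rows where the thresholded prediction matches the label
correct = sum(int(p == t) for p, t in zip(predictions, targets))
accuracy = 100.0 * correct / len(targets)

print(predictions, f"{accuracy:.2f}%")
```

Here the third row is misclassified (0.40 is below the threshold while its label is 1), so three of the four predictions match and the accuracy is 75%.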

### 4.) **Compare the difference of two optimizers you’ve implemented in terms of:**
**a) how many epochs it took to stabilize the loss, and**
**b) the accuracy of the model.**

When comparing the two optimizers, the **Momentum optimizer** trained the model faster and more smoothly than Adagrad. The loss began to level off at around **300 to 400 epochs**, and the model reached about **98–100% accuracy** by the 500th epoch. Momentum carries part of the previous update into the current one, which kept the direction and size of the updates steady, helped the network move through flat regions of the loss, and led to quicker learning.

In contrast, the **Adagrad optimizer** took longer to settle, with the loss remaining uneven until around **600 to 700 epochs**. Its accuracy reached about **95–98%**, slightly lower than Momentum. Adagrad adapted its per-parameter step size well early in training, but because it divides each update by the square root of the accumulated squared gradients, its effective learning rate kept shrinking and later progress slowed.

Overall, Momentum delivered steadier improvements and reached higher accuracy in fewer epochs, while Adagrad adapted well early on but needed more epochs to train effectively.
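The Adagrad slowdown described above can be illustrated with a one-parameter sketch. It assumes a constant gradient and starts the cache at zero (the `Optimizer` class starts it at `1e-8`); `gradient`, `cache`, and `step_sizes` are illustrative names, not values measured in this notebook.

```python
import math

lr = 0.1
gradient = 0.5  # constant gradient, chosen only for illustration
eps = 1e-8
cache = 0.0     # running sum of squared gradients

step_sizes = []
for t in range(1, 101):
    cache += gradient ** 2
    # Adagrad step: learning rate scaled by 1 / sqrt(accumulated squared gradients)
    step_sizes.append(lr * gradient / (math.sqrt(cache) + eps))

print(f"step at t=1:   {step_sizes[0]:.4f}")
print(f"step at t=25:  {step_sizes[24]:.4f}")
print(f"step at t=100: {step_sizes[99]:.4f}")
```

With a constant gradient the step shrinks in proportion to `1 / sqrt(t)`, so after 100 updates it is about one tenth of the first step even though `lr` itself never changes.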