The equations below describe a recurrent neural network (RNN) unit with input, forget, and output gates, an LSTM-style cell in which the gates also read the cell state. The key components are:

### Equations for RNNs with Forget Gate:

1. **Input Gate** ($i_t$):
   $$i_t = \sigma(W_{ix} x_t + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_i)$$

2. **Forget Gate** ($f_t$):
   $$f_t = \sigma(W_{fx}x_t + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_f)$$

3. **Cell State Update** ($c_t$):
   $$c_t = f_t c_{t-1} + i_t \tanh(W_{cx}x_t + W_{ch}h_{t-1} + b_c)$$

4. **Output Gate** ($y_t$):
   $$y_t = \sigma(W_{ox}x_t + W_{oh}h_{t-1} + W_{oc}c_t + b_o)$$

5. **Hidden State Update** ($h_t$):
   $$h_t = y_t \tanh(c_t)$$

### Network Parameters:
- $x_t$: Input vector to the network unit.
- $f_t$, $i_t$, $y_t$: Forget, input, and output gate vectors; $h_t$, $c_t$: hidden state and cell state vectors.
- $W_{ix}$, $W_{fx}$, $W_{ox}$, $W_{cx}$: Weight matrices from the input vector to the respective gates and to the cell update.
- $W_{ih}$, $W_{fh}$, $W_{oh}$, $W_{ch}$: Weight matrices from the previous hidden state $h_{t-1}$ to the respective gates and to the cell update.
- $W_{ic}$, $W_{fc}$, $W_{oc}$: Weight matrices from the cell state to the respective gates.
- $b_i$, $b_f$, $b_o$, $b_c$: Bias vectors.
- $\sigma(\cdot)$: Logistic sigmoid function.
- $\tanh(\cdot)$: Hyperbolic tangent function.
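
To make the five equations concrete, here is a minimal NumPy sketch of a single time step. The function name `gated_step`, the parameter dictionary `p`, and the sizes are illustrative assumptions rather than part of the formulation above; the products $f_t c_{t-1}$, $i_t \tanh(\cdot)$, and $y_t \tanh(c_t)$ are taken elementwise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x_t, h_prev, c_prev, p):
    """One time step of the equations above; p maps names such as "W_ix" to arrays."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["W_ic"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["W_fc"] @ c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_ch"] @ h_prev + p["b_c"])
    # The output gate reads the updated cell state c_t, not c_prev
    y_t = sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["W_oc"] @ c_t + p["b_o"])
    h_t = y_t * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes, chosen only for illustration
n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
p = {}
for gate in "ifoc":
    p["W_" + gate + "x"] = rng.standard_normal((n_hid, n_in))
    p["W_" + gate + "h"] = rng.standard_normal((n_hid, n_hid))
    p["b_" + gate] = np.zeros((n_hid, 1))
for gate in "ifo":  # only the three gates read the cell state
    p["W_" + gate + "c"] = rng.standard_normal((n_hid, n_hid))

h_t, c_t = gated_step(rng.standard_normal((n_in, 1)),
                      np.zeros((n_hid, 1)), np.zeros((n_hid, 1)), p)
print(h_t.shape, c_t.shape)
```

Because $h_t$ is a sigmoid gate times a $\tanh$, every entry of `h_t` lies strictly between -1 and 1, which is a quick sanity check when debugging a hand-written cell.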

### Recurrent Computation:

The recurrent computation is described by a deterministic transition from the previous hidden state to the current hidden state:
$$h_{t}^{l} = f(T_{n,n} h_{t}^{l-1} + T_{n,n} h_{t-1}^{l})$$
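where $f$ is typically either the sigmoid or hyperbolic tangent activation function, $T_{n,n}$ denotes an affine transform from $\mathbb{R}^n$ to $\mathbb{R}^n$ (each occurrence has its own parameters), and $h_t^l$ is the hidden state of layer $l$ at time step $t$.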
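
As an illustration of this transition, the sketch below runs a small stack of layers over a few time steps. It assumes that each $T_{n,n}$ is an affine map $Wx + b$ with its own parameters, that $h_t^0$ is the input $x_t$ (taken to have the same dimension $n$), and that $f = \tanh$; the names `W_below`, `W_prev`, and `b` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_layers, T = 4, 2, 5

# Per layer: one map applied to the layer below, one applied to the previous time step.
# A single bias per layer stands in for the sum of the two affine biases.
W_below = [rng.standard_normal((n, n)) * 0.5 for _ in range(num_layers)]
W_prev = [rng.standard_normal((n, n)) * 0.5 for _ in range(num_layers)]
b = [np.zeros((n, 1)) for _ in range(num_layers)]

x = rng.standard_normal((n, T))
h = [np.zeros((n, 1)) for _ in range(num_layers)]  # h[l] holds h_{t-1}^l

for t in range(T):
    below = x[:, t].reshape(-1, 1)  # h_t^0 is the input
    for l in range(num_layers):
        # h_t^l = f(T h_t^{l-1} + T h_{t-1}^l)
        h[l] = np.tanh(W_below[l] @ below + W_prev[l] @ h[l] + b[l])
        below = h[l]

top = h[-1]  # hidden state of the top layer at the last time step
print(top.shape)
```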

### Jordan Network vs. Elman Network:

1. **Jordan Network**: In a Jordan network, the output from the previous time step is used as an input along with the current input. The hidden state is computed using both the input and the previous output.

2. **Elman Network**: In an Elman network, introduced by Jeffrey Elman, the hidden state from the previous time step is fed back into the hidden layer along with the current input, so the recurrence runs from hidden state to hidden state rather than from output to hidden state. It is widely used for capturing sequential dependencies.

These networks are fundamental in sequence modeling tasks and have various applications in natural language processing, time series prediction, and more. Choosing between Jordan and Elman networks depends on the specific requirements and characteristics of the problem being addressed.
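
The difference between the two feedback paths can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `elman_step` and `jordan_step`, the weight names, the sizes, and the choice of $\tanh$ and sigmoid activations are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 6

W_x = rng.standard_normal((n_hid, n_in))     # input -> hidden (shared by both variants)
W_h = rng.standard_normal((n_hid, n_hid))    # Elman: previous hidden state -> hidden
W_yh = rng.standard_normal((n_hid, n_out))   # Jordan: previous output -> hidden
W_out = rng.standard_normal((n_out, n_hid))  # hidden -> output
b_h = np.zeros((n_hid, 1))
b_y = np.zeros((n_out, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x_t, h_prev):
    # Feedback comes from the previous hidden state
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)
    return h_t, sigmoid(W_out @ h_t + b_y)

def jordan_step(x_t, y_prev):
    # Feedback comes from the previous output
    h_t = np.tanh(W_x @ x_t + W_yh @ y_prev + b_h)
    return h_t, sigmoid(W_out @ h_t + b_y)

x = rng.standard_normal((n_in, T))
h = np.zeros((n_hid, 1))         # Elman carries the hidden state forward
y_jordan = np.zeros((n_out, 1))  # Jordan carries the output forward
for t in range(T):
    x_t = x[:, [t]]
    h, y_elman = elman_step(x_t, h)
    _, y_jordan = jordan_step(x_t, y_jordan)

print(y_elman.shape, y_jordan.shape)
```

Note that the Jordan feedback matrix has shape `(n_hid, n_out)`, so the amount of context carried between steps is bounded by the output dimension, whereas the Elman network carries the full hidden state.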


In [10]:
import numpy as np

# Define the sample input
# Assuming input_size = 3, hidden_size = 4, output_size = 2
x_sample = np.random.randn(3, 5)  # Shape: (input_size, time_steps)

# Create an instance of RNNForgetGate
# (the version of the class whose forward(x) returns six per-step sequences)
rnn_forget_gate = RNNForgetGate(input_size=3, hidden_size=4, output_size=2)

# Perform forward pass
# forward() takes only the input sequence; the hidden and cell states are kept
# inside the instance and start at zero, so no previous states are passed in
h_seq, c_seq, y_seq, i_seq, f_seq, o_seq = rnn_forget_gate.forward(x_sample)

# Display the hidden state and cell state at the last time step
print("Output hidden state (h_t):\n", h_seq[:, -1])
print("\nOutput cell state (c_t):\n", c_seq[:, -1])

In [20]:
import numpy as np

# Activation functions and their derivatives
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

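def sigmoid_derivative(s):
    # Expects s = sigmoid(x), the activation itself, not the pre-activation x
    return s * (1 - s)

def tanh_derivative(x):
    # Expects the pre-activation x
    return 1 - np.tanh(x) ** 2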

class RNNForgetGate:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Initialize weights
        self.W_ix = np.random.randn(hidden_size, input_size)
        self.W_ih = np.random.randn(hidden_size, hidden_size)
        self.W_ic = np.random.randn(hidden_size, hidden_size)
        self.b_i = np.zeros((hidden_size, 1))
        
        self.W_fx = np.random.randn(hidden_size, input_size)
        self.W_fh = np.random.randn(hidden_size, hidden_size)
        self.W_fc = np.random.randn(hidden_size, hidden_size)
        self.b_f = np.zeros((hidden_size, 1))
        
        self.W_cx = np.random.randn(hidden_size, input_size)
        self.W_ch = np.random.randn(hidden_size, hidden_size)
        self.b_c = np.zeros((hidden_size, 1))
        
        self.W_ox = np.random.randn(hidden_size, input_size)
        self.W_oh = np.random.randn(hidden_size, hidden_size)
        self.W_oc = np.random.randn(hidden_size, hidden_size)
        self.b_o = np.zeros((hidden_size, 1))
        
        self.W_y = np.random.randn(output_size, hidden_size)
        self.b_y = np.zeros((output_size, 1))
        
        self.h = np.zeros((hidden_size, 1))
        self.c = np.zeros((hidden_size, 1))
    
    def forward(self, x):
        T = x.shape[1]
        h_seq = np.zeros((self.hidden_size, T))
        c_seq = np.zeros((self.hidden_size, T))
        y_seq = np.zeros((self.output_size, T))
        
        i_seq = np.zeros((self.hidden_size, T))
        f_seq = np.zeros((self.hidden_size, T))
        o_seq = np.zeros((self.hidden_size, T))
        
        for t in range(T):
            x_t = x[:, t].reshape(-1, 1)
            i_t = sigmoid(np.dot(self.W_ix, x_t) + np.dot(self.W_ih, self.h) + np.dot(self.W_ic, self.c) + self.b_i)
            f_t = sigmoid(np.dot(self.W_fx, x_t) + np.dot(self.W_fh, self.h) + np.dot(self.W_fc, self.c) + self.b_f)
            c_t = f_t * self.c + i_t * np.tanh(np.dot(self.W_cx, x_t) + np.dot(self.W_ch, self.h) + self.b_c)
            o_t = sigmoid(np.dot(self.W_ox, x_t) + np.dot(self.W_oh, self.h) + np.dot(self.W_oc, c_t) + self.b_o)
            h_t = o_t * np.tanh(c_t)
            y_t = sigmoid(np.dot(self.W_y, h_t) + self.b_y)
            
            self.h = h_t
            self.c = c_t
            
            h_seq[:, t] = h_t.ravel()
            c_seq[:, t] = c_t.ravel()
            y_seq[:, t] = y_t.ravel()
            
            i_seq[:, t] = i_t.ravel()
            f_seq[:, t] = f_t.ravel()
            o_seq[:, t] = o_t.ravel()
        
        return h_seq, c_seq, y_seq, i_seq, f_seq, o_seq
    
    def backward(self, x, target, h_seq, c_seq, y_seq, i_seq, f_seq, o_seq, lr=0.01, h0=None, c0=None):
        # Backpropagation through time for the loss 0.5 * sum((y_seq - target) ** 2).
        # h0 and c0 are the states the forward pass started from; they default to
        # zeros, which matches the first forward() call on a new instance.
        T = x.shape[1]
        h0 = np.zeros((self.hidden_size, 1)) if h0 is None else h0
        c0 = np.zeros((self.hidden_size, 1)) if c0 is None else c0

        # One zero-initialized gradient accumulator per parameter
        names = ["W_ix", "W_ih", "W_ic", "b_i",
                 "W_fx", "W_fh", "W_fc", "b_f",
                 "W_cx", "W_ch", "b_c",
                 "W_ox", "W_oh", "W_oc", "b_o",
                 "W_y", "b_y"]
        grads = {name: np.zeros_like(getattr(self, name)) for name in names}

        dh_next = np.zeros((self.hidden_size, 1))
        dc_next = np.zeros((self.hidden_size, 1))

        for t in reversed(range(T)):
            x_t = x[:, t].reshape(-1, 1)
            y_t = target[:, t].reshape(-1, 1)
            h_t = h_seq[:, t].reshape(-1, 1)
            c_t = c_seq[:, t].reshape(-1, 1)
            y_pred = y_seq[:, t].reshape(-1, 1)
            i_t = i_seq[:, t].reshape(-1, 1)
            f_t = f_seq[:, t].reshape(-1, 1)
            o_t = o_seq[:, t].reshape(-1, 1)

            # States entering step t
            h_prev = h_seq[:, t - 1].reshape(-1, 1) if t > 0 else h0
            c_prev = c_seq[:, t - 1].reshape(-1, 1) if t > 0 else c0
            # Candidate cell input, recomputed because forward() does not return it
            g_t = np.tanh(np.dot(self.W_cx, x_t) + np.dot(self.W_ch, h_prev) + self.b_c)

            # Output layer: gradient with respect to its pre-activation
            dy = (y_pred - y_t) * sigmoid_derivative(y_pred)
            grads["W_y"] += np.dot(dy, h_t.T)
            grads["b_y"] += dy

            # Hidden state: contributions from the output layer and from step t + 1
            dh = np.dot(self.W_y.T, dy) + dh_next

            # Output gate (pre-activation gradient); it reads c_t through W_oc
            do = dh * np.tanh(c_t) * sigmoid_derivative(o_t)
            grads["W_ox"] += np.dot(do, x_t.T)
            grads["W_oh"] += np.dot(do, h_prev.T)
            grads["W_oc"] += np.dot(do, c_t.T)
            grads["b_o"] += do

            # Cell state: through h_t, through the output gate, and from step t + 1
            dc = dh * o_t * tanh_derivative(c_t) + np.dot(self.W_oc.T, do) + dc_next

            # Input gate (reads c_prev through W_ic)
            di = dc * g_t * sigmoid_derivative(i_t)
            grads["W_ix"] += np.dot(di, x_t.T)
            grads["W_ih"] += np.dot(di, h_prev.T)
            grads["W_ic"] += np.dot(di, c_prev.T)
            grads["b_i"] += di

            # Forget gate (reads c_prev through W_fc)
            df = dc * c_prev * sigmoid_derivative(f_t)
            grads["W_fx"] += np.dot(df, x_t.T)
            grads["W_fh"] += np.dot(df, h_prev.T)
            grads["W_fc"] += np.dot(df, c_prev.T)
            grads["b_f"] += df

            # Candidate cell input
            dg = dc * i_t * (1 - g_t ** 2)
            grads["W_cx"] += np.dot(dg, x_t.T)
            grads["W_ch"] += np.dot(dg, h_prev.T)
            grads["b_c"] += dg

            # Gradients passed back to step t - 1
            dc_next = f_t * dc + np.dot(self.W_ic.T, di) + np.dot(self.W_fc.T, df)
            dh_next = (np.dot(self.W_ih.T, di) + np.dot(self.W_fh.T, df)
                       + np.dot(self.W_oh.T, do) + np.dot(self.W_ch.T, dg))

        # Gradient descent update (in place on each parameter array)
        for name in names:
            param = getattr(self, name)
            param -= lr * grads[name]

        # Return the gradients so the caller can inspect them
        return grads

# Example usage
input_size = 10
hidden_size = 10
output_size = 10
sequence_length = 12

# Create an RNN instance
rnn = RNNForgetGate(input_size, hidden_size, output_size)

# Generate random input matrix
x = np.random.randn(input_size, sequence_length)

# Forward pass
h_seq, c_seq, y_seq, i_seq, f_seq, o_seq = rnn.forward(x)

# Random targets, used only to exercise the backward pass
target = np.random.randn(output_size, sequence_length)

# Backward pass: updates the parameters and returns the gradients it applied
grads = rnn.backward(x, target, h_seq, c_seq, y_seq, i_seq, f_seq, o_seq, lr=0.01)

# Print the gradients (for demonstration purposes)
for name in ["W_ix", "W_fx", "W_ox", "W_cx", "W_ih", "W_fh", "W_oh", "W_ch",
             "b_i", "b_f", "b_o", "b_c"]:
    print(f"Gradient for {name}:\n", grads[name])

# Print the output after forward pass
print("Output after forward pass:")
print(y_seq)

In [17]:
import numpy as np

# Define the RNN class with forget gate
class RNNForgetGate:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize parameters
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize weights and biases
        self.W_ix = np.random.randn(hidden_size, input_size)
        self.W_fx = np.random.randn(hidden_size, input_size)
        self.W_ox = np.random.randn(hidden_size, input_size)
        self.W_cx = np.random.randn(hidden_size, input_size)
        self.W_ih = np.random.randn(hidden_size, hidden_size)
        self.W_fh = np.random.randn(hidden_size, hidden_size)
        self.W_oh = np.random.randn(hidden_size, hidden_size)
        self.W_ch = np.random.randn(hidden_size, hidden_size)
        self.b_i = np.zeros((hidden_size, 1))
        self.b_f = np.zeros((hidden_size, 1))
        self.b_o = np.zeros((hidden_size, 1))
        self.b_c = np.zeros((hidden_size, 1))

    def backward(self, X, h, c, dy):
        # X: (input_size, T) inputs; h, c: (hidden_size, T) hidden and cell states;
        # dy: (hidden_size, T) gradient of the loss with respect to each hidden state.
        # The states before the first time step are assumed to be zero.
        T = X.shape[1]

        # Initialize gradients
        dW_ix, dW_fx, dW_ox, dW_cx = np.zeros_like(self.W_ix), np.zeros_like(self.W_fx), np.zeros_like(self.W_ox), np.zeros_like(self.W_cx)
        dW_ih, dW_fh, dW_oh, dW_ch = np.zeros_like(self.W_ih), np.zeros_like(self.W_fh), np.zeros_like(self.W_oh), np.zeros_like(self.W_ch)
        db_i, db_f, db_o, db_c = np.zeros_like(self.b_i), np.zeros_like(self.b_f), np.zeros_like(self.b_o), np.zeros_like(self.b_c)
        zero_state = np.zeros((self.hidden_size, 1))
        dh_next = np.zeros((self.hidden_size, 1))
        dc_next = np.zeros((self.hidden_size, 1))

        # Loop backward through time steps
        for t in reversed(range(T)):
            x_t = X[:, t].reshape(-1, 1)
            c_t = c[:, t].reshape(-1, 1)
            h_prev = h[:, t - 1].reshape(-1, 1) if t > 0 else zero_state
            c_prev = c[:, t - 1].reshape(-1, 1) if t > 0 else zero_state

            # Recompute the gate activations and candidate cell input for this step
            i = sigmoid(np.dot(self.W_ix, x_t) + np.dot(self.W_ih, h_prev) + self.b_i)
            f = sigmoid(np.dot(self.W_fx, x_t) + np.dot(self.W_fh, h_prev) + self.b_f)
            o = sigmoid(np.dot(self.W_ox, x_t) + np.dot(self.W_oh, h_prev) + self.b_o)
            g = np.tanh(np.dot(self.W_cx, x_t) + np.dot(self.W_ch, h_prev) + self.b_c)

            # Compute total gradient reaching the hidden state
            dh = dy[:, t].reshape(-1, 1) + dh_next

            # Compute gradient of output gate (with respect to its pre-activation)
            do = dh * np.tanh(c_t) * sigmoid_derivative(o)
            dW_ox += np.dot(do, x_t.T)
            dW_oh += np.dot(do, h_prev.T)
            db_o += do

            # Compute gradient of cell state
            dc = dh * o * (1 - np.tanh(c_t) ** 2) + dc_next

            # Compute gradient of candidate cell input
            dg = dc * i * (1 - g ** 2)
            dW_cx += np.dot(dg, x_t.T)
            dW_ch += np.dot(dg, h_prev.T)
            db_c += dg

            # Compute gradient of input gate
            di = dc * g * sigmoid_derivative(i)
            dW_ix += np.dot(di, x_t.T)
            dW_ih += np.dot(di, h_prev.T)
            db_i += di

            # Compute gradient of forget gate
            df = dc * c_prev * sigmoid_derivative(f)
            dW_fx += np.dot(df, x_t.T)
            dW_fh += np.dot(df, h_prev.T)
            db_f += df

            # Compute gradient for the previous hidden state and cell state
            dh_next = np.dot(self.W_ih.T, di) + np.dot(self.W_fh.T, df) + np.dot(self.W_oh.T, do) + np.dot(self.W_ch.T, dg)
            dc_next = dc * f

        # Return gradients
        return dW_ix, dW_fx, dW_ox, dW_cx, dW_ih, dW_fh, dW_oh, dW_ch, db_i, db_f, db_o, db_c

# Sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # Expects s = sigmoid(x), the activation itself, not the pre-activation x
    return s * (1 - s)

# Generate sample input data (random stand-ins, used only to exercise the shapes)
X = np.random.randn(5, 10)  # 5 input features, 10 time steps
h = np.random.randn(10, 10)  # Hidden state, 10 hidden units, 10 time steps
c = np.random.randn(10, 10)  # Cell state, 10 hidden units, 10 time steps
dy = np.random.randn(10, 10)  # Gradient of loss with respect to hidden state, 10 hidden units, 10 time steps

# Initialize RNN with forget gate
rnn_forget_gate = RNNForgetGate(input_size=5, hidden_size=10, output_size=5)

# Perform backward pass and get gradients
dW_ix, dW_fx, dW_ox, dW_cx, dW_ih, dW_fh, dW_oh, dW_ch, db_i, db_f, db_o, db_c = rnn_forget_gate.backward(X, h, c, dy)

# Print the gradients (for demonstration purposes)
print("Gradient for W_ix:\n", dW_ix)
print("Gradient for W_fx:\n", dW_fx)
print("Gradient for W_ox:\n", dW_ox)
print("Gradient for W_cx:\n", dW_cx)
print("Gradient for W_ih:\n", dW_ih)
print("Gradient for W_fh:\n", dW_fh)
print("Gradient for W_oh:\n", dW_oh)
print("Gradient for W_ch:\n", dW_ch)
print("Gradient for b_i:\n", db_i)
print("Gradient for b_f:\n", db_f)
print("Gradient for b_o:\n", db_o)
print("Gradient for b_c:\n", db_c)

# Optical Character Recognition (OCR) with Recurrent Neural Network (RNN)

1. **Model Input**: Let's denote our input data as \( X = \{x^{(1)}, x^{(2)}, ..., x^{(T)}\} \), where \( T \) is the number of time steps (or sequence length) and \( x^{(t)} \) represents the input image at time step \( t \).

2. **Model Output**: Similarly, let's denote the output (or prediction) as \( Y = \{\hat{y}^{(1)}, \hat{y}^{(2)}, ..., \hat{y}^{(T)}\} \), where \( \hat{y}^{(t)} \) is the predicted probability distribution over character classes at time step \( t \). The hat distinguishes predictions from the ground-truth labels \( y^{(t)} \) used in the loss.

3. **RNN Model**: The RNN processes the input sequence \( X \) and generates the output sequence \( Y \). At each time step \( t \), the RNN takes the current input \( x^{(t)} \) and the previous hidden state \( h^{(t-1)} \) to compute the next hidden state \( h^{(t)} \) and the output \( \hat{y}^{(t)} \).

   Mathematically, this can be represented as:
   $$
   h^{(t)} = \text{RNN}(x^{(t)}, h^{(t-1)})
   $$
   $$
   \hat{y}^{(t)} = \text{softmax}(W_{\text{out}} h^{(t)} + b_{\text{out}})
   $$
   where \( \text{RNN} \) represents the recurrent neural network cell (such as LSTM or GRU), \( W_{\text{out}} \) and \( b_{\text{out}} \) are the weight matrix and bias vector for the output layer, and \( \text{softmax} \) is the softmax activation function.

4. **Loss Function**: We typically use the cross-entropy loss function to measure the difference between the predicted output and the ground truth labels:
   $$
   \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i} y_i^{(t)} \log(\hat{y}_i^{(t)})
   $$
   where \( \hat{y}^{(t)} \) is the predicted probability distribution over classes at time step \( t \), and \( y^{(t)} \) is the true probability distribution (one-hot encoded) over classes at time step \( t \).

5. **Training**: During training, we minimize the loss function \( \mathcal{L} \) with respect to the model parameters (weights and biases) using gradient descent-based optimization algorithms like Adam or SGD.

6. **Inference**: During inference, we feed the input image sequence \( X \) into the trained model, and the model generates the output sequence \( Y \). We can then decode the output sequence to obtain the final predicted labels for the characters in the input image.
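
The output layer, the loss, and the decoding step described above can be sketched as follows. This is a minimal example under stated assumptions: the hidden states `H` are random stand-ins for the RNN output, the sizes and labels are made up for illustration, and `sequence_cross_entropy` is a hypothetical helper name. Decoding here is a simple per-step argmax.

```python
import numpy as np

def softmax(z):
    # Column-wise softmax; subtracting the max keeps exp() from overflowing
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def sequence_cross_entropy(y_hat, y_true, eps=1e-12):
    # y_hat, y_true: (num_classes, T); cross-entropy averaged over the T steps
    T = y_true.shape[1]
    return -np.sum(y_true * np.log(y_hat + eps)) / T

rng = np.random.default_rng(0)
num_classes, hidden_size, T = 3, 10, 4

H = np.tanh(rng.standard_normal((hidden_size, T)))  # stand-in for h^(1), ..., h^(T)
W_out = rng.standard_normal((num_classes, hidden_size)) * 0.1
b_out = np.zeros((num_classes, 1))

y_hat = softmax(W_out @ H + b_out)       # one probability column per time step
labels = np.array([0, 2, 1, 0])          # made-up ground-truth class indices
y_true = np.eye(num_classes)[:, labels]  # one-hot columns, shape (num_classes, T)

loss = sequence_cross_entropy(y_hat, y_true)
decoded = y_hat.argmax(axis=0)           # inference: most probable class per step
print(loss, decoded)
```

A useful reference point is that a model predicting the uniform distribution over $K$ classes has a loss of $\log K$, so an untrained network with small output weights should start close to that value.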


In [23]:
import numpy as np

# Define the dimensions
input_size = 25   # Size of a flattened 5x5 image
hidden_size = 10

# Initialize the weight matrices and biases
Wxh = np.random.randn(hidden_size, input_size)
Whh = np.random.randn(hidden_size, hidden_size)
bh = np.zeros((hidden_size, 1))

# Define the input vector
x_t = np.random.randn(input_size, 1)

# Initialize the hidden state
h = np.zeros((hidden_size, 1))

# Update the hidden state
h = np.tanh(np.dot(Wxh, x_t) + np.dot(Whh, h) + bh)

print("Updated hidden state shape:", h.shape)

# Sample input images (binary representation)
# For simplicity, each character is represented as a 5x5 image
image_A = np.array([[0, 1, 1, 1, 0],
                     [1, 0, 0, 0, 1],
                     [1, 1, 1, 1, 1],
                     [1, 0, 0, 0, 1],
                     [1, 0, 0, 0, 1]])

image_B = np.array([[1, 1, 1, 1, 0],
                     [1, 0, 0, 0, 1],
                     [1, 1, 1, 1, 0],
                     [1, 0, 0, 0, 1],
                     [1, 1, 1, 1, 0]])

image_C = np.array([[0, 1, 1, 1, 0],
                     [1, 0, 0, 0, 1],
                     [1, 0, 0, 0, 0],
                     [1, 0, 0, 0, 1],
                     [0, 1, 1, 1, 0]])

# Flatten each image to a 1D array
# In a real OCR system, you would preprocess the images and extract relevant features
image_A_flat = image_A.flatten()
image_B_flat = image_B.flatten()
image_C_flat = image_C.flatten()

# Each character is a sequence of length 1: shape (time_steps, input_size)
seq_A = image_A_flat.reshape(1, -1)
seq_B = image_B_flat.reshape(1, -1)
seq_C = image_C_flat.reshape(1, -1)

# Sample output labels (one-hot encoded column vectors, matching the output shape)
label_A = np.array([1, 0, 0]).reshape(-1, 1)  # A
label_B = np.array([0, 1, 0]).reshape(-1, 1)  # B
label_C = np.array([0, 0, 1]).reshape(-1, 1)  # C

# Model parameters (weights and biases)
output_size = 3  # Number of output classes

Wxh = np.random.randn(hidden_size, input_size)  # Input-to-hidden weights
Whh = np.random.randn(hidden_size, hidden_size)  # Hidden-to-hidden weights
Why = np.random.randn(output_size, hidden_size)  # Hidden-to-output weights
bh = np.zeros((hidden_size, 1))  # Hidden bias
by = np.zeros((output_size, 1))  # Output bias

# Learning rate
learning_rate = 0.01

# Forward pass function
def forward_pass(x):
    # x is a sequence of flattened images with shape (time_steps, input_size)
    h = np.zeros((hidden_size, 1))  # Initialize hidden state
    for t in range(len(x)):
        x_t = x[t].reshape(-1, 1)  # Column vector of shape (input_size, 1)
        h = np.tanh(np.dot(Wxh, x_t) + np.dot(Whh, h) + bh)  # Hidden state update
    y = np.dot(Why, h) + by  # Output computation (raw scores)
    return y, h

# Softmax function for converting raw scores to probabilities
def softmax(x):
    exp_scores = np.exp(x - np.max(x))  # Subtracting max for numerical stability
    return exp_scores / np.sum(exp_scores)

# Loss function (cross-entropy)
def cross_entropy_loss(y_pred, y_true):
    return -np.sum(y_true * np.log(y_pred))

# Training loop
num_epochs = 1000

for epoch in range(num_epochs):
    # Forward pass (keep the final hidden states for the output-layer gradient)
    y_pred_A, h_A = forward_pass(seq_A)
    y_pred_B, h_B = forward_pass(seq_B)
    y_pred_C, h_C = forward_pass(seq_C)

    # Compute loss
    loss_A = cross_entropy_loss(softmax(y_pred_A), label_A)
    loss_B = cross_entropy_loss(softmax(y_pred_B), label_B)
    loss_C = cross_entropy_loss(softmax(y_pred_C), label_C)
    total_loss = (loss_A + loss_B + loss_C) / 3

    # Backpropagation through the output layer only; Wxh, Whh, and bh stay fixed
    dy_A = softmax(y_pred_A) - label_A
    dy_B = softmax(y_pred_B) - label_B
    dy_C = softmax(y_pred_C) - label_C

    # h_A, h_B, h_C are already tanh activations, so they are used directly
    dWhy_A = np.dot(dy_A, h_A.T)
    dWhy_B = np.dot(dy_B, h_B.T)
    dWhy_C = np.dot(dy_C, h_C.T)

    dby_A = dy_A
    dby_B = dy_B
    dby_C = dy_C

    # Update weights and biases
    Why -= learning_rate * (dWhy_A + dWhy_B + dWhy_C) / 3
    by -= learning_rate * (dby_A + dby_B + dby_C) / 3

    # Print loss
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: Loss = {total_loss}')

# Test the model on sample images after training
print("Test Output for Image A:", softmax(forward_pass(seq_A)[0]).ravel())
print("Test Output for Image B:", softmax(forward_pass(seq_B)[0]).ravel())
print("Test Output for Image C:", softmax(forward_pass(seq_C)[0]).ravel())


Updated hidden state shape: (10, 1)