<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/05.nn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/05.nn.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# Neural Networks

📝 SALP chapter 7

## Neural Networks
- A neural network is a `machine learning model` inspired by biological neurons. 
- It consists of `layers of connected nodes (neurons)` that process and transform input data.
  - **Input Layer:** Receives input features.
  - **Hidden Layers:** Transform the input through weights and activation functions.
  - **Output Layer:** Produces final output or predictions.
- **Applications:** Image recognition, natural language processing (NLP), speech recognition.

---

### **Neuron (Unit)**
- A single processing unit in a neural network that computes the `weighted sum of its inputs` and applies a `non-linear activation function`.
  - $y = \sigma(z)$, where $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b$
    - $\mathbf{x} = [x_1, x_2, \dots, x_n]$ is the input vector.
    - $\mathbf{w} = [w_1, w_2, \dots, w_n]$ is the weight vector.
    - $b$ is the bias.
    - $\sigma(\cdot)$ is the activation function (e.g., sigmoid, ReLU, tanh).
- **Common Activation Functions:**
  - **Sigmoid:** $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  - **ReLU:** $\sigma(z) = \max(0, z)$
  - **Tanh:** $\sigma(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$
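
The computation of a single unit can be sketched in a few lines of NumPy. The input, weight, and bias values below are illustrative assumptions, not values from the text; the helper name `neuron` is likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

# Illustrative values (assumed, not from the text)
x = np.array([0.5, 0.6, 0.1])
w = np.array([0.2, 0.3, 0.9])
b = 0.5

z = np.dot(w, x) + b  # weighted sum: 0.10 + 0.18 + 0.09 + 0.5 = 0.87
print("weighted sum:", z)
print("sigmoid:", neuron(x, w, b, sigmoid))
print("ReLU:   ", neuron(x, w, b, relu))
print("tanh:   ", neuron(x, w, b, np.tanh))
```

The same weighted sum gives a different output under each activation: sigmoid squashes it into (0, 1), tanh into (-1, 1), and ReLU passes positive values through unchanged.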

---

### **The XOR Problem**
- A binary classification problem in which the output is `1` exactly when the two binary inputs differ.
  - $\text{XOR} (x_1, x_2) = \begin{cases}
  1, & \text{if } x_1 \neq x_2 \\
  0, & \text{otherwise}
  \end{cases}$
- It is **not linearly separable**:
  - i.e. no single straight line separates the two classes, so it cannot be solved by a single layer of neurons.
  - It can be solved by a **multi-layer neural network**, whose hidden layer learns a non-linear transformation of the inputs.
- It shows the need for `multi-layer networks with hidden layers` to solve non-linear problems.
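
As a minimal sketch of such a multi-layer solution, the network below uses two ReLU hidden units with hand-set weights. These particular weights are an assumption chosen for illustration (many other solutions exist), and `xor_net` is a hypothetical helper name.

```python
import numpy as np

# Hand-set parameters (an illustrative assumption, not learned)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])   # each hidden unit sums both inputs
b_hidden = np.array([0.0, -1.0])    # the second unit fires only when both inputs are 1
w_out = np.array([1.0, -2.0])
b_out = 0.0

def xor_net(x):
    h = np.maximum(0.0, W_hidden @ x + b_hidden)  # ReLU hidden layer
    return float(w_out @ h + b_out)               # linear output unit

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", xor_net(np.array([x1, x2], dtype=float)))
```

The hidden layer maps the inputs (0, 1) and (1, 0) to the same point, which makes the two classes linearly separable for the output unit.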

---

### **Feedforward Neural Networks (FNNs)**
- A network in which information flows in one direction, from input to output, without cycles.
- **Architecture:**
  - **Input Layer → Hidden Layers → Output Layer.**
  - Each neuron in a layer is connected to every neuron in the next layer.
- **Mathematics of Feedforward Pass:**
  - $h^{(l)} = \sigma(\mathbf{W}^{(l)} h^{(l-1)} + \mathbf{b}^{(l)})$
    - $h^{(l)}$ is the output of layer $l$, with $h^{(0)} = \mathbf{x}$ (the input data).
    - $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are weights and biases for layer $l$.
    - $\sigma$ is the activation function.
  - The final output layer computes the predicted value $\hat{y}$:
    - $\hat{y} = \sigma(\mathbf{W}^{(L)} h^{(L-1)} + \mathbf{b}^{(L)})$
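
The layer-by-layer recurrence above can be sketched as a short loop. The layer sizes, the random weights, and the function name `forward` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers, activation=sigmoid):
    """Apply h = activation(W @ h + b) layer by layer, starting from h = x."""
    h = x
    for W, b in layers:
        h = activation(W @ h + b)
    return h

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 1 output
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(1, 4)), np.zeros(1)),
]

x = np.array([0.5, -1.0, 2.0])
y_hat = forward(x, layers)
print(y_hat.shape, y_hat)
```

Each weight matrix has shape (units in this layer, units in the previous layer), so `W @ h` is defined at every step and the output of one layer becomes the input of the next.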

---

### **Training Neural Networks**
- **Goal:** Minimize the difference between predicted output ($\hat{y}$) and actual label ($y$) using a loss function.
  
- **Loss Function (Binary Classification):**
  - $\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$

- **Training via Backpropagation:**
  - **Forward Pass:** Calculate predictions.
  - **Backward Pass:** Compute gradients of the loss function with respect to weights.
  - **Update Weights:** Using gradient descent.
    - $w := w - \eta \dfrac{\partial \mathcal{L}}{\partial w}$
    - $\eta$ is the learning rate.
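
The three steps can be sketched for the simplest case, a single sigmoid neuron trained with the binary cross-entropy loss. The toy dataset (logical OR), the learning rate, and the number of epochs are assumptions chosen for illustration; the gradient `y_hat - Y` is the standard result for a sigmoid output combined with cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_hat, y):
    """Binary cross-entropy loss, element-wise."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy data (assumed for illustration): the logical OR of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 1], dtype=float)

w = np.zeros(2)
b = 0.0
eta = 0.5  # learning rate

losses = []
for epoch in range(1000):
    y_hat = sigmoid(X @ w + b)        # forward pass
    losses.append(bce(y_hat, Y).mean())
    delta = y_hat - Y                 # dL/dz for sigmoid + cross-entropy
    grad_w = X.T @ delta / len(Y)     # backward pass: average gradient over the data
    grad_b = delta.mean()
    w -= eta * grad_w                 # gradient-descent update
    b -= eta * grad_b

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("predictions:", (sigmoid(X @ w + b) > 0.5).astype(int))
```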


---

### **Optimization Algorithms**
- **Stochastic Gradient Descent (SGD):**
  - Updates weights using one training example at a time.
    - $w := w - \eta \cdot \dfrac{\partial \mathcal{L}}{\partial w}$

- **Mini-Batch Gradient Descent:**
  - Instead of one sample or the whole dataset, updates are done using a `small batch` of data.
  - Combines the efficiency of SGD with more stable updates.

- **Adam Optimizer (Adaptive Moment Estimation):**
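  - Uses running averages of the gradient (first moment, the mean) and of the squared gradient (second moment, the uncentered variance) to adapt the step size of each weight.
    - $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
  \quad \text{and} \quad
  v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
    - $w := w - \eta \cdot \dfrac{m_t}{\sqrt{v_t} + \epsilon}$
      - $g_t$ is the gradient at step $t$,
      - $m_t$ and $v_t$ are estimates of the first and second moments of the gradient,
      - $\beta_1$ and $\beta_2$ are decay rates, and $\epsilon$ is a small constant that prevents division by zero.
      - The original algorithm also divides $m_t$ by $1 - \beta_1^t$ and $v_t$ by $1 - \beta_2^t$ before this update, to correct their initial bias toward zero.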

- **Comparison of Optimization Algorithms:**
  - SGD: Cheap, frequent updates, but the noisy single-example gradient may cause the loss to oscillate.
  - Adam: Smoother convergence, especially for noisy or sparse gradients.
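
The update rules above can be compared on a one-dimensional toy problem. This is a sketch under stated assumptions: the loss $(w - 3)^2$, the hyperparameter values, and the helper names are all illustrative, and the Adam step includes the bias-correction terms from the original Adam algorithm.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2 (assumed for illustration)."""
    return 2.0 * (w - 3.0)

def sgd_step(w, g, eta=0.1):
    return w - eta * g

def adam_step(w, g, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w_sgd = 0.0
w_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    w_sgd = sgd_step(w_sgd, grad(w_sgd))
    w_adam, m, v = adam_step(w_adam, grad(w_adam), m, v, t)

print(f"SGD:  w = {w_sgd:.4f}")
print(f"Adam: w = {w_adam:.4f}")
```

Note that Adam's first step has size close to $\eta$ regardless of the gradient's magnitude, because $m_t / \sqrt{v_t}$ normalizes the gradient scale.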

### 🍎 **Backward Differentiation of Sigmoid**
- Consider a neural network with:
  - Input: $x = [x_1, x_2]$
  - Weights: $w = [w_1, w_2]$ and bias $b$
  - Activation function: Sigmoid.

  **Forward Pass:**
  - $z = w_1 x_1 + w_2 x_2 + b \quad \text{and} \quad \hat{y} = \sigma(z)$

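  **Backward Pass (Error Term $\delta$):**
  1. Compute the gradient of the loss with respect to the pre-activation $z$, assuming the squared-error loss $\mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2$:
  - $\delta = \dfrac{\partial \mathcal{L}}{\partial z} = \dfrac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \dfrac{\partial \hat{y}}{\partial z} = (\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$
  - With the binary cross-entropy loss, $\dfrac{\partial \mathcal{L}}{\partial \hat{y}} = \dfrac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$, so the error term simplifies to $\delta = \hat{y} - y$.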

  2. Compute the gradients for weights $w_1$ and $w_2$:
  - $\dfrac{\partial \mathcal{L}}{\partial w_1} = \delta \cdot x_1
  \quad \text{and} \quad \dfrac{\partial \mathcal{L}}{\partial w_2} = \delta \cdot x_2$

- **Weight Update:**
  - $w_1 := w_1 - \eta \cdot \dfrac{\partial \mathcal{L}}{\partial w_1}
  \quad \text{and} \quad w_2 := w_2 - \eta \cdot \dfrac{\partial \mathcal{L}}{\partial w_2}$
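
One way to sanity-check these formulas is to compare them against finite differences. The sketch below assumes the squared-error loss $\tfrac{1}{2}(\hat{y} - y)^2$, for which $\delta = (\hat{y} - y)\,\hat{y}(1 - \hat{y})$; the input, weight, bias, and label values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Squared-error loss 0.5 * (y_hat - y)^2 for a single sigmoid neuron."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return 0.5 * (y_hat - y) ** 2

# Illustrative values (assumed)
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
y = 1.0

# Analytic gradient from the error term delta
y_hat = sigmoid(np.dot(w, x) + b)
delta = (y_hat - y) * y_hat * (1 - y_hat)
grad_analytic = delta * x

# Central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    grad_numeric[i] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

If the two vectors disagree beyond rounding error, the analytic derivation (or its implementation) contains a mistake, which makes this check useful when writing backpropagation by hand.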

### 🍎 **Feedforward Neural Network**

In [None]:
# Import necessary libraries
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate a dataset for binary classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a simple feedforward neural network model
model = Sequential()
model.add(Dense(10, activation='relu', input_shape=(20,)))  # Input layer + hidden layer
model.add(Dense(8, activation='relu'))  # Hidden layer
model.add(Dense(1, activation='sigmoid'))  # Output layer

# Compile and train the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

### 🍎 **Backward Differentiation of Softmax**
#### **Network Structure:**
1. **Input layer (x):** $x_1, x_2$ (2 input neurons)
2. **Hidden layer (h):** $h_1, h_2$ (2 hidden neurons, ReLU activation)
3. **Output layer (y):** $y_1, y_2$ (2 output neurons, softmax activation)

#### **Given Data:**
- Input: $x = [x_1, x_2] = [0.5, 0.2]$
- Initial weights:
  - Input to hidden weights: 
    $W_1 = \begin{bmatrix}
    0.1 & 0.3 \\
    0.4 & 0.7
    \end{bmatrix}$
  - Hidden to output weights:
    $W_2 = \begin{bmatrix}
    0.2 & 0.6 \\
    0.5 & 0.9
    \end{bmatrix}$
- Biases:
  - Hidden layer biases: $b_1 = [0.1, 0.2]$
  - Output layer biases: $b_2 = [0.1, 0.2]$

### **1. Forward Pass:**

#### **Step 1: Input to Hidden Layer Calculation**
Each hidden neuron $h_j$ (before applying activation) is calculated as:

- $h_j = x_1 W_1[1, j] + x_2 W_1[2, j] + b_1[j]$

For the hidden neurons:

- $h_1 = 0.5 \times 0.1 + 0.2 \times 0.4 + 0.1 = 0.05 + 0.08 + 0.1 = 0.23$
- $h_2 = 0.5 \times 0.3 + 0.2 \times 0.7 + 0.2 = 0.15 + 0.14 + 0.2 = 0.49$

#### **Step 2: Apply ReLU Activation Function to Hidden Layer**
The ReLU activation function is:

- $\text{ReLU}(z) = \max(0, z)$

Thus, for the hidden neurons:

- $h_1 = \text{ReLU}(0.23) = 0.23$
- $h_2 = \text{ReLU}(0.49) = 0.49$

#### **Step 3: Hidden to Output Layer Calculation**
Each output neuron $y_k$ (before applying softmax) is calculated as:

- $y_k = h_1 W_2[1, k] + h_2 W_2[2, k] + b_2[k]$

For the output neurons:
- $y_1 = 0.23 \times 0.2 + 0.49 \times 0.5 + 0.1 = 0.046 + 0.245 + 0.1 = 0.391$
- $y_2 = 0.23 \times 0.6 + 0.49 \times 0.9 + 0.2 = 0.138 + 0.441 + 0.2 = 0.779$

#### **Step 4: Apply Softmax Activation Function to Output Layer**
The softmax function converts the output into probabilities:

- $\text{softmax}(y_i) = \dfrac{e^{y_i}}{\sum_{j} e^{y_j}}$

Thus:

- $\text{softmax}(y_1) = \dfrac{e^{0.391}}{e^{0.391} + e^{0.779}} = \dfrac{1.478}{1.478 + 2.18} = \dfrac{1.478}{3.658} = 0.404$
- $\text{softmax}(y_2) = \dfrac{e^{0.779}}{e^{0.391} + e^{0.779}} = \dfrac{2.18}{3.658} = 0.596$

Thus, the output probabilities are:

- $\hat{y} = [0.404, 0.596]$
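
A minimal sketch of softmax in NumPy is shown below. It subtracts the maximum before exponentiating, a common numerical-stability step that is an addition here rather than part of the hand calculation; it does not change the result because the shift cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

probs = softmax(np.array([0.391, 0.779]))
print(probs)  # agrees with the hand calculation to three decimals

large = softmax(np.array([1000.0, 1001.0]))
print(large)  # finite values despite the large inputs
```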

### **2. Backward Pass:**

#### **Step 1: Compute Loss**
For a multi-class classification, the cross-entropy loss is used:

- $\mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i)$

Assuming the true labels are $y = [1, 0]$ (i.e., class 1 is the true class), the loss becomes:

- $\mathcal{L} = -(1 \times \log(0.404) + 0 \times \log(0.596)) = -\log(0.404) = 0.906$

#### **Step 2: Backpropagation (Compute Gradients)**
We compute gradients with respect to each weight and update them using gradient descent.

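The derivative of the loss with respect to the pre-softmax output $y_1$ from Step 3 of the forward pass (on the right-hand side, $y_1^{\text{true}}$ denotes the true label, to keep it distinct from that pre-softmax value):

- $\dfrac{\partial \mathcal{L}}{\partial y_1} = \hat{y}_1 - y_1^{\text{true}} = 0.404 - 1 = -0.596$

For $y_2$:

- $\dfrac{\partial \mathcal{L}}{\partial y_2} = \hat{y}_2 - y_2^{\text{true}} = 0.596 - 0 = 0.596$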
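
The "prediction minus label" form of this gradient can be checked numerically. The sketch below reuses the pre-softmax values from the forward pass and compares the analytic gradient with central finite differences; the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def cross_entropy(z, y_true):
    """Cross-entropy of softmax(z) against a one-hot label vector."""
    return -np.sum(y_true * np.log(softmax(z)))

z = np.array([0.391, 0.779])   # pre-softmax outputs from the forward pass
y_true = np.array([1.0, 0.0])  # class 1 is the true class

grad_analytic = softmax(z) - y_true

eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = eps
    grad_numeric[k] = (cross_entropy(z + step, y_true)
                       - cross_entropy(z - step, y_true)) / (2 * eps)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```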

---

### 🍎 Implementation of this Simple Neural Network in Python

In [None]:

import numpy as np

# Input data
x = np.array([0.5, 0.2])

# Weights and biases (initial)
W1 = np.array([[0.1, 0.3], [0.4, 0.7]])  # Input to hidden weights
b1 = np.array([0.1, 0.2])               # Hidden layer bias

W2 = np.array([[0.2, 0.6], [0.5, 0.9]])  # Hidden to output weights
b2 = np.array([0.1, 0.2])               # Output layer bias

# Forward pass

# Step 1: Input to hidden
h = np.dot(x, W1) + b1
h = np.maximum(0, h)  # ReLU activation

# Step 2: Hidden to output
y = np.dot(h, W2) + b2

# Apply softmax
def softmax(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

output = softmax(y)
print("Output probabilities:", output)

# Assume the true label is [1, 0] for the first class
y_true = np.array([1, 0])

# Compute cross-entropy loss
loss = -np.sum(y_true * np.log(output))
print("Loss:", loss)

# Backward pass (compute gradients)
grad_y = output - y_true  # Gradient of loss w.r.t. output

# Gradient w.r.t. W2 and b2
grad_W2 = np.outer(h, grad_y)
grad_b2 = grad_y

# Backpropagate through ReLU (only pass gradients where h > 0)
grad_h = np.dot(W2, grad_y)
grad_h[h <= 0] = 0  # ReLU derivative

# Gradient w.r.t. W1 and b1
grad_W1 = np.outer(x, grad_h)
grad_b1 = grad_h

# Print gradients
print("Gradients for W2:", grad_W2)
print("Gradients for W1:", grad_W1)

# Update weights using gradient descent (learning rate = 0.01)
learning_rate = 0.01
W1 -= learning_rate * grad_W1
b1 -= learning_rate * grad_b1
W2 -= learning_rate * grad_W2
b2 -= learning_rate * grad_b2

# Updated weights
print("Updated W1:", W1)
print("Updated W2:", W2)

### 🍎 In PyTorch

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define the network: 2 inputs -> 2 hidden units (ReLU) -> 2 outputs
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers
        self.fc1 = nn.Linear(2, 2)  # Input layer to hidden layer (2 inputs, 2 hidden units)
        self.fc2 = nn.Linear(2, 2)  # Hidden layer to output layer (2 hidden units, 2 outputs)

        # nn.Linear stores its weight as (out_features, in_features) and computes
        # x @ weight.T + bias, so the (in, out) matrices W1 and W2 from the worked
        # example are transposed before being copied in.
        with torch.no_grad():
            self.fc1.weight.copy_(torch.tensor([[0.1, 0.3], [0.4, 0.7]]).t())
            self.fc1.bias.copy_(torch.tensor([0.1, 0.2]))
            self.fc2.weight.copy_(torch.tensor([[0.2, 0.6], [0.5, 0.9]]).t())
            self.fc2.bias.copy_(torch.tensor([0.1, 0.2]))

    def forward(self, x):
        # Hidden layer with ReLU activation
        h = F.relu(self.fc1(x))
        # Return raw scores (logits): nn.CrossEntropyLoss applies log-softmax itself,
        # so applying softmax here as well would distort the loss
        return self.fc2(h)

# Create a sample input (batch size of 1, 2 input features)
x = torch.tensor([[0.5, 0.2]], dtype=torch.float32)

# Initialize the model
model = SimpleNN()

# Forward pass
logits = model(x)

# Print the output probabilities (softmax is applied only for display)
print("Output after forward pass:", F.softmax(logits, dim=1))

# Let's also check the parameters (weights and biases)
print("\nModel Parameters (Before Training):")
for name, param in model.named_parameters():
    print(f"{name}: {param.data}")

# Example target, given as a class index rather than one-hot: the true class is the first one
target = torch.tensor([0])

# Define loss function (cross-entropy loss, which expects logits)
loss_fn = nn.CrossEntropyLoss(reduction='mean')

# Optimizer: SGD (Stochastic Gradient Descent)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Compute loss (matches the hand-computed value of about 0.906)
loss = loss_fn(logits, target)
print("\nInitial Loss:", loss.item())

# Backward pass (compute gradients), starting from zeroed gradients
optimizer.zero_grad()
loss.backward()

# Update parameters based on gradients
optimizer.step()

# After one step of training, let's print the updated weights
print("\nModel Parameters (After One Training Step):")
for name, param in model.named_parameters():
    print(f"{name}: {param.data}")

# Optionally: Run multiple iterations (epochs) for training
epochs = 5
for epoch in range(epochs):
    # Forward pass
    logits = model(x)

    # Compute loss
    loss = loss_fn(logits, target)

    # Zero the gradients
    optimizer.zero_grad()

    # Backward pass (compute gradients)
    loss.backward()

    # Update weights
    optimizer.step()

    # Print loss for each epoch
    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

# Final output probabilities after training (no gradients needed for inference)
with torch.no_grad():
    output = F.softmax(model(x), dim=1)
print("\nFinal Output after Training:", output)

### **Feedforward Neural Language Model (FFNLM)**
- Predicts the next word in a sequence, given previous words.
- **Model Structure:**
  - **Input:** Vector representations of words (usually via embeddings like Word2Vec or GloVe).
  - **Hidden Layers:** Process input through non-linear transformations.
  - **Output Layer:** Produces a probability distribution over possible next words using softmax.
- **Mathematics:**
  - $P(w_t | w_{t-1}, w_{t-2}, \dots, w_{t-n}) = \text{softmax}(\mathbf{W}^{(L)} h^{(L-1)} + \mathbf{b}^{(L)})$
    - $P(w_t)$ is the predicted probability for the next word $w_t$.
- **Loss Function (Cross-Entropy):**
  - $\displaystyle\mathcal{L}(w_t) = - \sum_{i=1}^{N} y_i \log(\hat{y}_i)$
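    - $N$ is the vocabulary size, $y_i$ is the true label (1 for the actual next word, 0 otherwise), and $\hat{y}_i$ is the predicted probability for word $i$.
    - Because $y$ is one-hot, the sum reduces to $-\log(\hat{y}_{w_t})$, the negative log-probability of the true next word.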
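
Since the true label is one-hot, only the term for the actual next word contributes to the loss. The sketch below illustrates this with a hypothetical 4-word vocabulary and an assumed predicted distribution; none of these values come from the text.

```python
import numpy as np

# Hypothetical 4-word vocabulary and model output (values assumed for illustration)
vocab = ["the", "cat", "sat", "down"]
y_hat = np.array([0.1, 0.2, 0.6, 0.1])  # predicted distribution over the next word
target = vocab.index("sat")             # index of the true next word

y = np.zeros(len(vocab))
y[target] = 1.0                         # one-hot true label

loss_full = -np.sum(y * np.log(y_hat))  # sum over the whole vocabulary
loss_short = -np.log(y_hat[target])     # only the true word's term survives
print(loss_full, loss_short)
```

This is why implementations usually take a class index as the target and look up a single log-probability instead of building the one-hot vector.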

### 🍎 Implement a FFNLM in PyTorch
- The `nn.Embedding` layer takes word indices as input and maps each one to a dense vector.
   - Input: `[word index 1, word index 2]` (2 previous words in context).
   - Output: One embedding vector per word; the model concatenates the 2 vectors into a single input vector.
- The model predicts the next word based on the previous two words $w_{t-1}$ and $w_{t-2}$:
  - $P(w_t | w_{t-1}, w_{t-2}) = \text{softmax}(\mathbf{W}^{(2)} h^{(1)} + \mathbf{b}^{(2)})$
    - $h^{(1)} = \text{ReLU}(\mathbf{W}^{(1)} [\text{Embed}(w_{t-2}), \text{Embed}(w_{t-1})] + \mathbf{b}^{(1)})$, where the context embeddings are concatenated in sentence order.
    - $\mathbf{W}^{(1)}$, $\mathbf{W}^{(2)}$, $\mathbf{b}^{(1)}$, and $\mathbf{b}^{(2)}$ are the learned weights and biases.
- For a more realistic language model, try a larger dataset and pretrained embeddings such as Word2Vec or GloVe.
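
The data preparation and the embedding lookup can be sketched without PyTorch. In the sketch below, `make_ngrams` is a hypothetical helper, the sentence and embedding size are assumptions, and a random matrix stands in for learned embeddings; an embedding lookup is then just row indexing into that matrix.

```python
import numpy as np

def make_ngrams(tokens, context_size):
    """Slide a window over the tokens, yielding (context, target) pairs."""
    return [
        (tokens[i:i + context_size], tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

tokens = "i like to play football and tennis".split()
pairs = make_ngrams(tokens, context_size=2)
print(pairs[0])

# Embedding lookup as plain array indexing
word_to_idx = {w: i for i, w in enumerate(tokens)}      # all words are distinct here
embedding_dim = 10                                      # assumed size
rng = np.random.default_rng(0)
E = rng.normal(size=(len(word_to_idx), embedding_dim))  # stand-in for learned embeddings

context, target = pairs[0]
indices = [word_to_idx[w] for w in context]
concat = E[indices].reshape(-1)  # two 10-dimensional vectors -> one 20-dimensional input
print(concat.shape)
```

The concatenated vector has length `context_size * embedding_dim`, which is the input size the first fully connected layer must accept.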

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define the Feedforward Neural Language Model
class FeedforwardNeuralLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, context_size):
        super(FeedforwardNeuralLM, self).__init__()
        
        # Embedding layer to convert word indices into vectors
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Feedforward network layers
        self.fc1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, inputs):
        # Convert input word indices into embeddings
        embeds = self.embeddings(inputs).view(1, -1)
        
        # Pass embeddings through the first fully connected layer
        out = F.relu(self.fc1(embeds))
        
        # Final output layer to predict next word probabilities
        out = self.fc2(out)
        
        # Apply log-softmax to get log-probabilities (the input nn.NLLLoss expects)
        log_probs = F.log_softmax(out, dim=1)
        
        return log_probs

# Sample data: let's say we have a simple vocabulary
vocab = ['i', 'like', 'to', 'play', 'football', 'and', 'tennis']
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for i, word in enumerate(vocab)}

# Model hyperparameters
vocab_size = len(vocab)
embedding_dim = 10   # Size of word embeddings
hidden_dim = 20      # Size of hidden layer
context_size = 2     # How many previous words to consider (context)

# Create a language model instance
model = FeedforwardNeuralLM(vocab_size, embedding_dim, hidden_dim, context_size)

# Define loss function (negative log likelihood) and optimizer
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training data: context-target pairs
data = [
    (['i', 'like'], 'to'),
    (['like', 'to'], 'play'),
    (['to', 'play'], 'football'),
    (['play', 'football'], 'and'),
    (['football', 'and'], 'tennis')
]

# Prepare the inputs (context words) and targets (next word)
def make_context_vector(context, word_to_idx):
    return torch.tensor([word_to_idx[w] for w in context], dtype=torch.long)

# Training loop
n_epochs = 100
for epoch in range(n_epochs):
    total_loss = 0
    
    for context, target in data:
        # Convert context and target to tensor format
        context_vector = make_context_vector(context, word_to_idx)
        target_idx = torch.tensor([word_to_idx[target]], dtype=torch.long)

        # Zero the gradients from the previous iteration
        model.zero_grad()

        # Forward pass: get predicted log probabilities
        log_probs = model(context_vector)

        # Compute loss: how much did we deviate from the true label?
        loss = loss_function(log_probs, target_idx)
        
        # Backpropagate and update the weights
        loss.backward()
        optimizer.step()

        # Accumulate loss for reporting
        total_loss += loss.item()
    
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {total_loss:.4f}')

# Test the model: predict the next word after the context ['like', 'to']
with torch.no_grad():
    context = ['like', 'to']
    context_vector = make_context_vector(context, word_to_idx)
    predicted_log_probs = model(context_vector)
    predicted_word_idx = torch.argmax(predicted_log_probs, dim=1).item()
    predicted_word = idx_to_word[predicted_word_idx]
    print(f"Given context: {context}, Predicted next word: {predicted_word}")