| **x₁**  | **x₂**  | **y**  |
|--------|--------|-------|
| 0.1    | 0.7    | 1     |
| 0.4    | 0.9    | 1     |
| 0.3    | 0.2    | 0     |
| 0.8    | 0.4    | 0     |
| 0.9    | 0.8    | 1     |
| 0.2    | 0.1    | 0     |
| 0.5    | 0.5    | 1     |
| 0.6    | 0.3    | 0     |
| 0.7    | 0.9    | 1     |
| 0.2    | 0.8    | 1     |

Training a small neural network by hand is a good way to see how the forward pass, the loss, and backpropagation fit together. The steps below train a simple network on the dataset above, one sample at a time.

---

### **1. Define the Neural Network Structure**
- **Input layer**: 2 inputs (`x1` and `x2`).
- **Hidden layer**: 2 neurons (for simplicity) with sigmoid activation.
- **Output layer**: 1 neuron (since `y` is binary) with sigmoid activation.
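
As a quick check on the structure above, the layer sizes fix how many trainable parameters the network has. The sketch below counts them; the `count_parameters` helper is a hypothetical name introduced only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, inputs first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total

n_params = count_parameters([2, 2, 1])
print(n_params)  # 6 weights + 3 biases = 9
```

These nine values are the $ w_1, \dots, w_6 $ and $ b_1, b_2, b_3 $ used in the rest of the walkthrough.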


### **Example Calculation (1 Epoch, 1 Sample)**
1. **Given** the first training sample: $ x_1 = 0.1, x_2 = 0.7, y = 1 $
2. **Feedforward** (compute $ z_1, z_2, h_1, h_2, z_3, \hat{y} $)
3. **Loss**: Calculate cross-entropy loss.
4. **Backpropagate**: Calculate $ \delta_3, \delta_1, \delta_2 $ and update weights and biases.

<a href="https://lucid.app/lucidchart/f1747941-87bb-470d-9e7d-99eb4992e12b/edit?beaconFlowId=F6CBE2C75F2F7444&invitationId=inv_38519930-d244-4e91-b32d-d154cfd90e9f&page=0_0#">Neural Network Chart</a>

In [21]:
import random
import math

data = [
    (0.1, 0.7, 1), # first row
    (0.4, 0.9, 1),
    (0.3, 0.2, 0),
    (0.8, 0.4, 0),
    (0.9, 0.8, 1),
    (0.2, 0.1, 0),
    (0.5, 0.5, 1),
    (0.6, 0.3, 0),
    (0.7, 0.9, 1),
    (0.2, 0.8, 1)
]

### **2. Initialize Weights and Biases**
- **Weights**: Randomly initialize one weight per connection.
  - Between input and hidden layer: $ w_1, w_2, w_3, w_4 $ (two weights for each of the 2 hidden neurons).
  - Between hidden layer and output layer: $ w_5, w_6 $ (for 1 neuron).

- **Biases**: Randomly initialize one bias per neuron.
  - Hidden layer biases: $ b_1, b_2 $ for 2 neurons.
  - Output layer bias: $ b_3 $.

In [22]:
# Each draw is 5*U[0, 1) - 10, i.e. uniform on [-10, -5), rounded to 2 decimals.
# Input-to-hidden weights
w1 = round(5*random.random() - 10, 2)
w2 = round(5*random.random() - 10, 2)
w3 = round(5*random.random() - 10, 2)
w4 = round(5*random.random() - 10, 2)

# Hidden-layer biases
b1 = round(5*random.random() - 10, 2)
b2 = round(5*random.random() - 10, 2)

# Hidden-to-output weights
w5 = round(5*random.random() - 10, 2)
w6 = round(5*random.random() - 10, 2)

# Output bias
b3 = round(5*random.random() - 10, 2)

print(f"{w1=}, {w2=}, {w3=}, {w4=}")
print(f"{b1=}, {b2=}")
print(f"{w5=}, {w6=}")
print(f"{b3=}")

w1=-7.16, w2=-9.21, w3=-7.46, w4=-9.45
b1=-9.78, b2=-6.35
w5=-5.07, w6=-6.58
b3=-6.76
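
All nine values printed above are large and negative. Starting values of this magnitude push the sigmoids into their flat region, so the hidden-layer gradients later in this walkthrough come out very small. A common alternative, shown here only as a sketch, is to draw small values centered on zero. The range `[-1, 1]`, the fixed seed, and the `init_params` helper are assumptions made for illustration and are not part of the original notebook.

```python
import random

def init_params(rng, low=-1.0, high=1.0):
    """Draw the nine parameters uniformly from [low, high] (assumed range)."""
    names = ["w1", "w2", "w3", "w4", "b1", "b2", "w5", "w6", "b3"]
    return {name: rng.uniform(low, high) for name in names}

rng = random.Random(0)  # fixed seed so the run is repeatable
params = init_params(rng)
for name, value in params.items():
    print(f"{name}={value:.2f}")
```

Keeping the parameters in a dictionary is a design choice of this sketch; the notebook itself uses nine separate variables.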


### **3. Forward Pass (Feedforward)**
1. **Input to Hidden Layer**
   - Calculate the weighted sum $ z_1 $ and $ z_2 $ for each hidden neuron:
     $$
     z_1 = w_1 \cdot x_1 + w_2 \cdot x_2 + b_1
     $$

     $$
     z_2 = w_3 \cdot x_1 + w_4 \cdot x_2 + b_2
     $$
   - Apply the sigmoid activation function to each neuron in the hidden layer:
     $$
     h_1 = \frac{1}{1 + e^{-z_1}}, \quad h_2 = \frac{1}{1 + e^{-z_2}}
     $$

In [23]:
x1, x2, y = data[0]
print(f"{x1=}, {x2=}, {y=}")

x1=0.1, x2=0.7, y=1


In [24]:
z1 = w1*x1 + w2*x2 + b1
z2 = w3*x1 + w4*x2 + b2
print(f"{z1=}, {z2=}")

h1 = 1/(1 + math.exp(-z1))
h2 = 1/(1 + math.exp(-z2))
print(f"{h1=}, {h2=}")

z1=-16.942999999999998, z2=-13.710999999999999
h1=4.382768928580775e-08, h2=1.1101658824394257e-06
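
The cell above calls `math.exp(-z)` directly, which raises `OverflowError` once `-z` grows past about 709. That does not happen with the values here, but a hand-written network can reach it if the weights grow. One possible guard, sketched under the assumption that a helper named `sigmoid` is acceptable, is to branch on the sign of `z` so the exponent is never positive:

```python
import math

def sigmoid(z):
    """Sigmoid that only ever exponentiates a non-positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so this cannot overflow
    return ez / (1.0 + ez)

print(sigmoid(-16.943))  # agrees with h1 above up to rounding
print(sigmoid(-1000.0))  # underflows to 0.0 instead of raising
```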


2. **Hidden Layer to Output Layer**
   - Calculate the weighted sum for the output neuron:
     $$
     z_3 = w_5 \cdot h_1 + w_6 \cdot h_2 + b_3
     $$
   - Apply the sigmoid activation function to get the final output $ \hat{y} $:
     $$
     \hat{y} = \frac{1}{1 + e^{-z_3}}
     $$

In [25]:
z3 = w5 * h1 + w6 * h2 + b3
print(f"{z3=}")

y_pred = 1/(1 + math.exp(-z3))
print(f"{y_pred=}")

z3=-6.7600075270978905
y_pred=0.001157878212205724
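
The two forward-pass stages can be wrapped into a single function so they are easy to reuse for every sample. This is a minimal sketch: the `forward` function and the dictionary layout are assumptions for illustration, while the parameter values are the initial ones printed by the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    """Return (h1, h2, y_pred) for one sample, given a dict of parameters."""
    h1 = sigmoid(p["w1"] * x1 + p["w2"] * x2 + p["b1"])
    h2 = sigmoid(p["w3"] * x1 + p["w4"] * x2 + p["b2"])
    y_pred = sigmoid(p["w5"] * h1 + p["w6"] * h2 + p["b3"])
    return h1, h2, y_pred

# Initial values printed by the notebook earlier in this walkthrough.
params = {"w1": -7.16, "w2": -9.21, "w3": -7.46, "w4": -9.45,
          "b1": -9.78, "b2": -6.35, "w5": -5.07, "w6": -6.58, "b3": -6.76}

h1, h2, y_pred = forward(params, 0.1, 0.7)
print(f"{h1=}, {h2=}, {y_pred=}")
```

Running it on the first sample should reproduce the `h1`, `h2`, and `y_pred` values shown in the cells above, up to floating-point rounding.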


### **4. Compute the Loss**
- Use the binary cross-entropy loss (since $ y $ is binary):
  $$
  L = - \left( y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right)
  $$
  This tells us how far the prediction $ \hat{y} $ is from the actual $ y $.


In [26]:
L = -(y * math.log(y_pred) + (1-y)*math.log(1-y_pred))
print(f"{L=}")

L=6.761166076168972
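
A loss above 6 for a target of 1 reflects how confident and wrong the prediction is. If `y_pred` ever reaches exactly 0 or 1, `math.log` raises a domain error, so a hand-rolled loss is often clipped first. The sketch below assumes a clipping constant `eps=1e-12` and a helper named `binary_cross_entropy`; neither appears in the original notebook.

```python
import math

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """Cross-entropy for one sample, with y_pred clipped away from 0 and 1."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y * math.log(y_pred) + (1 - y) * math.log(1.0 - y_pred))

print(binary_cross_entropy(1, 0.001157878212205724))  # loss for the first sample
print(binary_cross_entropy(1, 0.0))                   # finite instead of a domain error
```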


### **5. Backward Pass (Backpropagation)**
1. **Error for Output Neuron**
   - Calculate the derivative of the loss $ L $ with respect to $ z_3 $ (output neuron's pre-activation):
     $$
     \delta_3 = \hat{y} - y
     $$

In [27]:
delta3 = y_pred - y
print(f"{delta3=}")

delta3=-0.9988421217877943
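
The compact form of $ \delta_3 $ follows from applying the chain rule to the cross-entropy loss and the sigmoid. The derivation is sketched here for reference:

$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}, \qquad \frac{\partial \hat{y}}{\partial z_3} = \hat{y}(1 - \hat{y})
$$

$$
\delta_3 = \frac{\partial L}{\partial z_3} = \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}(1 - \hat{y}) = -y(1 - \hat{y}) + (1 - y)\hat{y} = \hat{y} - y
$$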


2. **Gradients for Hidden-to-Output Weights**
   - Calculate gradients for $ w_5 $, $ w_6 $, and $ b_3 $:
     $$
     \frac{\partial L}{\partial w_5} = \delta_3 \cdot h_1
     $$

     $$
     \frac{\partial L}{\partial w_6} = \delta_3 \cdot h_2
     $$

     $$
     \frac{\partial L}{\partial b_3} = \delta_3
     $$

In [28]:
grad_w5 = delta3 * h1
grad_w6 = delta3 * h2
grad_b3 = delta3

print(f"{grad_w5=}")
print(f"{grad_w6=}")
print(f"{grad_b3=}")

grad_w5=-4.377694215929239e-08
grad_w6=-1.108880445552215e-06
grad_b3=-0.9988421217877943


3. **Error for Hidden Layer Neurons**
   - Backpropagate the error from the output layer to the hidden layer using the chain rule:
     $$
     \delta_1 = \delta_3 \cdot w_5 \cdot h_1 \cdot (1 - h_1)
     $$

     $$
     \delta_2 = \delta_3 \cdot w_6 \cdot h_2 \cdot (1 - h_2)
     $$

In [29]:
delta1 = delta3 * w5 * h1 * (1 - h1)
delta2 = delta3 * w6 * h2 * (1 - h2)
print(f"{delta1=}")
print(f"{delta2=}")

delta1=2.2194908702009636e-07
delta2=7.296425231482227e-06
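
The hidden-layer errors come from chaining three factors: the output error, the weight that connects the hidden neuron to the output, and the sigmoid derivative at the hidden neuron. For the first hidden neuron:

$$
\delta_1 = \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_3} \cdot \frac{\partial z_3}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_1} = \delta_3 \cdot w_5 \cdot h_1 (1 - h_1)
$$

Plugging in the values from this run, with $ h_1 (1 - h_1) \approx 4.38 \times 10^{-8} $:

$$
\delta_1 \approx (-0.99884) \cdot (-5.07) \cdot (4.38 \times 10^{-8}) \approx 2.22 \times 10^{-7}
$$

The tiny sigmoid derivative is what makes $ \delta_1 $ so small even though $ \delta_3 $ is close to $ -1 $.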


4. **Gradients for Input-to-Hidden Weights**
   - Calculate gradients for $ w_1, w_2, w_3, w_4 $, and biases $ b_1, b_2 $:
     $$
     \frac{\partial L}{\partial w_1} = \delta_1 \cdot x_1
     $$
     $$
     \frac{\partial L}{\partial w_2} = \delta_1 \cdot x_2
     $$
     $$
     \frac{\partial L}{\partial w_3} = \delta_2 \cdot x_1
     $$
     $$
     \frac{\partial L}{\partial w_4} = \delta_2 \cdot x_2
     $$
     $$
     \frac{\partial L}{\partial b_1} = \delta_1
     $$
     $$
     \frac{\partial L}{\partial b_2} = \delta_2
     $$

In [30]:
grad_w1 = delta1 * x1
grad_w2 = delta1 * x2
grad_w3 = delta2 * x1
grad_w4 = delta2 * x2
grad_b1 = delta1
grad_b2 = delta2

print(f"{grad_w1=}")
print(f"{grad_w2=}")
print(f"{grad_w3=}")
print(f"{grad_w4=}")
print(f"{grad_b1=}")
print(f"{grad_b2=}")

grad_w1=2.2194908702009636e-08
grad_w2=1.5536436091406745e-07
grad_w3=7.296425231482227e-07
grad_w4=5.107497662037558e-06
grad_b1=2.2194908702009636e-07
grad_b2=7.296425231482227e-06
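
One way to gain confidence in hand-derived gradients is to compare them with central finite differences of the loss. The sketch below does this for all nine parameters. The parameter values are illustrative (chosen so the sigmoids are not saturated), and the helper names `loss`, `analytic_grads`, and `numeric_grad` are assumptions, not part of the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    h1 = sigmoid(p["w1"] * x1 + p["w2"] * x2 + p["b1"])
    h2 = sigmoid(p["w3"] * x1 + p["w4"] * x2 + p["b2"])
    y_pred = sigmoid(p["w5"] * h1 + p["w6"] * h2 + p["b3"])
    return h1, h2, y_pred

def loss(p, x1, x2, y):
    _, _, y_pred = forward(p, x1, x2)
    return -(y * math.log(y_pred) + (1 - y) * math.log(1 - y_pred))

def analytic_grads(p, x1, x2, y):
    """Gradients from the backpropagation formulas in this section."""
    h1, h2, y_pred = forward(p, x1, x2)
    delta3 = y_pred - y
    delta1 = delta3 * p["w5"] * h1 * (1 - h1)
    delta2 = delta3 * p["w6"] * h2 * (1 - h2)
    return {"w1": delta1 * x1, "w2": delta1 * x2, "b1": delta1,
            "w3": delta2 * x1, "w4": delta2 * x2, "b2": delta2,
            "w5": delta3 * h1, "w6": delta3 * h2, "b3": delta3}

def numeric_grad(p, name, x1, x2, y, eps=1e-6):
    """Central difference of the loss with respect to one parameter."""
    up = dict(p)
    up[name] += eps
    down = dict(p)
    down[name] -= eps
    return (loss(up, x1, x2, y) - loss(down, x1, x2, y)) / (2 * eps)

# Illustrative parameters, not the notebook's random draw.
params = {"w1": 0.5, "w2": -0.3, "w3": 0.8, "w4": 0.1,
          "b1": 0.0, "b2": -0.2, "w5": 0.7, "w6": -0.4, "b3": 0.1}
x1, x2, y = 0.1, 0.7, 1  # first training sample

grads = analytic_grads(params, x1, x2, y)
max_err = max(abs(grads[k] - numeric_grad(params, k, x1, x2, y)) for k in params)
print(f"largest analytic-vs-numeric difference: {max_err:.2e}")
```

If the formulas are right, the largest difference should be many orders of magnitude smaller than the gradients themselves.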


### **6. Update Weights and Biases**
- Update each parameter using gradient descent:
  $$
  w = w - \eta \cdot \frac{\partial L}{\partial w}
  $$
  $$
  b = b - \eta \cdot \frac{\partial L}{\partial b}
  $$
  Here, $ \eta $ is the learning rate (for example, 0.1 or 0.01). Apply this update to every weight and bias.


In [31]:
lr = 0.01 # learning rate

w1 -= lr*grad_w1
w2 -= lr*grad_w2
w3 -= lr*grad_w3
w4 -= lr*grad_w4
w5 -= lr*grad_w5
w6 -= lr*grad_w6
b1 -= lr*grad_b1
b2 -= lr*grad_b2
b3 -= lr*grad_b3

print(f"{w1=}")
print(f"{w2=}")
print(f"{w3=}")
print(f"{w4=}")
print(f"{w5=}")
print(f"{w6=}")
print(f"{b1=}")
print(f"{b2=}")
print(f"{b3=}")

w1=-7.160000000221949
w2=-9.210000001553645
w3=-7.460000007296425
w4=-9.450000051074976
w5=-5.0699999995622305
w6=-6.5799999889111955
b1=-9.78000000221949
b2=-6.350000072964252
b3=-6.7500115787821215
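
To make the update concrete, here is the arithmetic for two of the parameters above, using $ \eta = 0.01 $ and the gradients computed in the previous step:

$$
b_3 \leftarrow -6.76 - 0.01 \cdot (-0.9988421) \approx -6.7500116
$$

$$
w_1 \leftarrow -7.16 - 0.01 \cdot (2.2195 \times 10^{-8}) \approx -7.16 - 2.2 \times 10^{-10}
$$

Only $ b_3 $ moves noticeably in this step, because the gradients that pass through the hidden layer are on the order of $ 10^{-6} $ or smaller.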


### **7. Repeat for Each Training Sample**
- For each sample, do a forward pass, calculate the loss, perform backpropagation, and update the weights.
- Repeat for multiple epochs (full passes over all the samples) until the loss is small or the weights stop changing.
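
The two bullets above can be sketched as a loop over epochs that records the mean loss of each pass. This is an illustrative sketch under several assumptions that are not in the original notebook: parameters are stored in a dictionary, initialized uniformly in `[-1, 1]` with a fixed seed, the learning rate is 0.5, training runs for 500 epochs, and the prediction is clipped before the logarithm.

```python
import math
import random

data = [
    (0.1, 0.7, 1), (0.4, 0.9, 1), (0.3, 0.2, 0), (0.8, 0.4, 0), (0.9, 0.8, 1),
    (0.2, 0.1, 0), (0.5, 0.5, 1), (0.6, 0.3, 0), (0.7, 0.9, 1), (0.2, 0.8, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(p, x1, x2, y, lr):
    """One forward pass, backward pass, and update; returns the sample loss."""
    h1 = sigmoid(p["w1"] * x1 + p["w2"] * x2 + p["b1"])
    h2 = sigmoid(p["w3"] * x1 + p["w4"] * x2 + p["b2"])
    y_pred = sigmoid(p["w5"] * h1 + p["w6"] * h2 + p["b3"])

    clipped = min(max(y_pred, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    loss = -(y * math.log(clipped) + (1 - y) * math.log(1.0 - clipped))

    # All gradients are computed before any parameter changes.
    delta3 = y_pred - y
    delta1 = delta3 * p["w5"] * h1 * (1 - h1)
    delta2 = delta3 * p["w6"] * h2 * (1 - h2)
    grads = {"w1": delta1 * x1, "w2": delta1 * x2, "b1": delta1,
             "w3": delta2 * x1, "w4": delta2 * x2, "b2": delta2,
             "w5": delta3 * h1, "w6": delta3 * h2, "b3": delta3}
    for name, g in grads.items():
        p[name] -= lr * g
    return loss

rng = random.Random(0)  # assumed seed, for repeatability
params = {name: rng.uniform(-1, 1) for name in
          ["w1", "w2", "w3", "w4", "b1", "b2", "w5", "w6", "b3"]}

history = []  # mean loss per epoch
for epoch in range(500):
    total = sum(train_step(params, x1, x2, y, lr=0.5) for x1, x2, y in data)
    history.append(total / len(data))

print(f"first epoch mean loss: {history[0]:.4f}")
print(f"last epoch mean loss:  {history[-1]:.4f}")
```

Tracking the mean loss per epoch gives a simple stopping signal: training can end once the value stops decreasing by a meaningful amount.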

### **Summary of Steps**
1. **Initialize** weights and biases.
2. **Feedforward** to calculate $ \hat{y} $.
3. **Calculate loss** between $ \hat{y} $ and true $ y $.
4. **Backpropagate** errors to update weights and biases.
5. **Repeat** for all 10 samples for several epochs.

### **Example Calculation (1 Epoch, All Samples)**

In [35]:
lr = 0.1 # learning rate (larger than the 0.01 used for the single-sample update)
for i, (x1, x2, y) in enumerate(data, start=1):
    print(f"SAMPLE {i}")

    # Forward pass: hidden layer
    z1 = w1*x1 + w2*x2 + b1
    z2 = w3*x1 + w4*x2 + b2

    h1 = 1/(1 + math.exp(-z1))
    h2 = 1/(1 + math.exp(-z2))

    # Forward pass: output neuron
    z3 = w5 * h1 + w6 * h2 + b3

    y_pred = 1/(1 + math.exp(-z3))
    print(f"y: {y}")
    print(f"y_pred: {y_pred}")

    # Binary cross-entropy loss for this sample
    L = -(y * math.log(y_pred) + (1-y)*math.log(1-y_pred))
    print(f"Loss: {L}")

    # Backpropagation: output error and hidden-to-output gradients
    delta3 = y_pred - y

    grad_w5 = delta3 * h1
    grad_w6 = delta3 * h2
    grad_b3 = delta3

    # Hidden-layer errors (computed with w5 and w6 before they are updated)
    delta1 = delta3 * w5 * h1 * (1 - h1)
    delta2 = delta3 * w6 * h2 * (1 - h2)

    # Input-to-hidden gradients
    grad_w1 = delta1 * x1
    grad_w2 = delta1 * x2
    grad_w3 = delta2 * x1
    grad_w4 = delta2 * x2
    grad_b1 = delta1
    grad_b2 = delta2

    # Gradient descent update
    w1 -= lr*grad_w1
    w2 -= lr*grad_w2
    w3 -= lr*grad_w3
    w4 -= lr*grad_w4
    w5 -= lr*grad_w5
    w6 -= lr*grad_w6
    b1 -= lr*grad_b1
    b2 -= lr*grad_b2
    b3 -= lr*grad_b3

    print(f"{w1=}")
    print(f"{w2=}")
    print(f"{w3=}")
    print(f"{w4=}")
    print(f"{w5=}")
    print(f"{w6=}")
    print(f"{b1=}")
    print(f"{b2=}")
    print(f"{b3=}")

    print("="*100)

print()
print()
print(f"{w1=}")
print(f"{w2=}")
print(f"{w3=}")
print(f"{w4=}")
print(f"{w5=}")
print(f"{w6=}")
print(f"{b1=}")
print(f"{b2=}")
print(f"{b3=}")

SAMPLE 1
y: 1
y_pred: 0.0013992735716420975
Loss: 6.5718020544228875
w1=-7.160000004354449
w2=-9.21000002396348
w3=-7.460000138691741
w4=-9.45000078215713
w5=-5.069999993363855
w6=-6.579999833870503
b1=-9.780000033645265
b2=-6.350001093137008
b3=-6.470534201217211
SAMPLE 2
y: 1
y_pred: 0.001546004336745643
Loss: 6.472081523681357
w1=-7.160000004518626
w2=-9.210000024332874
w3=-7.460000143393213
w4=-9.450000792735441
w5=-5.0699999932829005
w6=-6.57999983208423
b1=-9.780000034055705
b2=-6.350001104890688
b3=-6.370688801650886
SAMPLE 3
y: 0
y_pred: 0.0017077330063818932
Loss: 0.0017091928446385373
w1=-7.160000004246785
w2=-9.210000024151647
w3=-7.460000133904561
w4=-9.450000786409673
w5=-5.069999993461625
w6=-6.579999836891179
b1=-9.780000033149571
b2=-6.350001073261849
b3=-6.370859574951524
SAMPLE 4
y: 0
y_pred: 0.00170776548802987
Loss: 0.0017092253818518656
w1=-7.1600000042435825
w2=-9.210000024150045
w3=-7.46000013381284
w4=-9.450000786363812
w5=-5.069999993462415
w6=-6.57999983690860