**Source**: [Neural Networks and Deep Learning - Chapter 1](http://neuralnetworksanddeeplearning.com/chap1.html)

### 🧠 Sigmoid Neuron

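A **sigmoid neuron** takes inputs $x_1, x_2, \ldots$, each of which can be any value between 0 and 1, and outputs $\sigma(w \cdot x + b)$, a value strictly between 0 and 1. Here $w$ holds the weights, $b$ is the bias, and $\sigma$ is the sigmoid activation function.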

The sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
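As a small illustration, the definition can be written directly in Python. This is a sketch rather than the book's code: the function names `sigmoid` and `neuron_output` are chosen here for illustration, and the weighted input $z = w \cdot x + b$ is assumed to be the usual one for a single neuron.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def neuron_output(weights, inputs, bias):
    """Output of one sigmoid neuron: sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Made-up weights, inputs, and bias for a two-input neuron.
out = neuron_output([0.5, -0.3], [1.0, 0.2], 0.1)
```

Whatever the weights and inputs, the result stays between 0 and 1, and `sigmoid(0)` sits exactly in the middle at 0.5.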

---

### 🧱 Neural Network Layers

- **Input Layer**: Contains *input neurons*
- **Hidden (Middle) Layer**: Contains *hidden neurons*
- **Output Layer**: Contains *output neurons*

---

### 🔻 Gradient Descent

Let:
- $x$ be the training input  
- $y = y(x)$ be the corresponding desired output  
- $a$ be the actual output of the network  

The **cost function** is defined as:

$$
C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2
$$
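To make the formula concrete, here is a minimal sketch that evaluates the quadratic cost for a handful of examples. The function name `quadratic_cost` and the numbers are made up for illustration; each example contributes the squared distance between its desired and actual output vectors.

```python
def quadratic_cost(desired, actual):
    """C = 1/(2n) * sum over examples of ||y(x) - a||^2.

    desired, actual: lists of equal length, one output vector per example.
    """
    n = len(desired)
    total = 0.0
    for y, a in zip(desired, actual):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

# Two examples with two outputs each (made-up numbers).
ys = [[1.0, 0.0], [0.0, 1.0]]
outs = [[0.8, 0.1], [0.3, 0.6]]
cost = quadratic_cost(ys, outs)  # (0.05 + 0.25) / 4
```

The cost is zero only when every actual output matches its desired output exactly.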

Suppose we simplify notation by writing $(w, b) \rightarrow (v_1, v_2)$.  
Then the change in cost is approximately:

$$
\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2
$$

We want to choose $\Delta v_1$ and $\Delta v_2$ such that $\Delta C < 0$ (i.e., the cost decreases).

Define:
- $\Delta v \equiv \begin{pmatrix} \Delta v_1 \\ \Delta v_2 \end{pmatrix}$
- $\nabla C \equiv \begin{pmatrix} \frac{\partial C}{\partial v_1} \\ \frac{\partial C}{\partial v_2} \end{pmatrix}$

Then:

$$
\Delta C \approx \nabla C \cdot \Delta v
$$

To ensure the cost decreases, we choose:

$$
\Delta v = -\eta \nabla C
$$

where $\eta$ is a small positive number called the *learning rate*. Substituting, we get:

$$
\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \| \nabla C \|^2
$$

Since $\| \nabla C \|^2 \geq 0$, this gives, to first order:

$$
\Delta C \leq 0
$$

So, as long as $\eta$ is small enough for the first-order approximation to hold, the cost $C$ decreases (or stays the same where $\nabla C = 0$), and the update rule becomes:

$$
v \rightarrow v' = v - \eta \nabla C
$$

This can be extended to functions with any number of variables.
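The update rule can be tried on a toy cost. The sketch below assumes $C(v) = v_1^2 + v_2^2$, which is not from the source but has a gradient that is easy to write by hand. It also shows that the decrease relies on the first-order approximation: with a small $\eta$ the cost shrinks at every step, while an overly large $\eta$ makes it grow.

```python
def cost(v):
    """Toy cost C(v) = v1^2 + v2^2 (an assumption for this example)."""
    return v[0] ** 2 + v[1] ** 2

def grad(v):
    """Gradient of the toy cost: (2*v1, 2*v2)."""
    return [2 * v[0], 2 * v[1]]

def gradient_descent(v, eta, steps):
    """Repeat v -> v - eta * grad C(v); return the final v and the cost history."""
    history = [cost(v)]
    for _ in range(steps):
        g = grad(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
        history.append(cost(v))
    return v, history

# Small learning rate: every step scales v by 0.8, so the cost keeps falling.
v_final, costs = gradient_descent([3.0, -4.0], eta=0.1, steps=50)

# Learning rate too large: every step scales v by -1.2, so the cost grows.
_, diverging = gradient_descent([3.0, -4.0], eta=1.1, steps=5)
```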


Exercise 1: We want the step that decreases the cost $C$ as much as possible. Fix the size of the step to $\|\Delta v\| = \epsilon$ for some small $\epsilon > 0$. To first order, $\Delta C \approx \nabla C \cdot \Delta v$, so the objective is to choose a vector $\Delta v$ of length $\epsilon$ that minimizes $\nabla C \cdot \Delta v$. The dot product is $\nabla C \cdot \Delta v = \|\nabla C\| \, \|\Delta v\| \cos\theta$, where $\theta$ is the angle between the two vectors. This is smallest when $\cos\theta = -1$, that is, when $\Delta v$ points directly opposite to $\nabla C$. Since $\|\Delta v\| = \epsilon$, the minimum value is $-\epsilon \|\nabla C\|$, attained at $\Delta v = -\epsilon \nabla C / \|\nabla C\|$. Comparing with $\Delta v = -\eta \nabla C$ gives $\eta = \epsilon / \|\nabla C\|$.
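The argument above can be checked numerically. This sketch uses a made-up two-dimensional gradient and compares the step against the gradient with 360 other steps of the same length; the helper names `dot` and `norm` are chosen here for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

grad_c = [3.0, -4.0]  # hypothetical gradient with norm 5
eps = 0.1             # fixed step length

# Candidate from the argument: a step of length eps against the gradient.
eta = eps / norm(grad_c)
best_step = [-eta * g for g in grad_c]

# Steps of the same length in 360 evenly spaced directions, for comparison.
candidates = [
    [eps * math.cos(math.radians(k)), eps * math.sin(math.radians(k))]
    for k in range(360)
]
smallest_sampled = min(dot(grad_c, dv) for dv in candidates)
```

None of the sampled directions gives a smaller value of $\nabla C \cdot \Delta v$ than `best_step`, whose value is $-\epsilon \|\nabla C\| = -0.5$ for these numbers.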

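Exercise 2: In one dimension, the "gradient" is just the derivative $dC/dv$, i.e. the slope of the tangent line, so gradient descent steps downhill along the curve: $v \rightarrow v - \eta \, dC/dv$.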

### 🔁 Gradient Descent (Component-wise Update)

When using gradient descent to minimize a cost function $C$, we update each **weight** $w_k$ and **bias** $b_l$ individually using the gradients of the cost function:

$$
w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k}
$$

$$
b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}
$$

where:
- $\eta$ is the **learning rate**,
- $\frac{\partial C}{\partial w_k}$ and $\frac{\partial C}{\partial b_l}$ are the **partial derivatives** of the cost function with respect to each parameter.

---

### 🎲 Stochastic Gradient Descent (SGD)

Instead of computing the full gradient $\nabla C$ over the entire dataset (which can be computationally expensive), **Stochastic Gradient Descent** estimates it using a small, randomly selected subset (mini-batch) of the training data.

We approximate the true gradient:

$$
\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{x_j}
$$

where:
- $m$ is the **mini-batch size**,
- $x_j$ is the $j$-th example in the mini-batch,
- $\nabla C_{x_j}$ is the gradient computed for the individual training example $x_j$.

This is an approximation to the full batch gradient:

$$
\nabla C = \frac{1}{n} \sum_{x} \nabla C_x
$$
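A quick way to see the approximation at work is to compare the two averages on toy data. The sketch below assumes a one-parameter model with per-example cost $C_x = \frac{1}{2}(wx - y)^2$, which is not from the source; the data, the seed, and the batch size of 32 are all made up.

```python
import random

def per_example_grad(w, x, y):
    """Gradient of C_x = 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def mean_grad(w, examples):
    """Average of the per-example gradients over the given examples."""
    return sum(per_example_grad(w, x, y) for x, y in examples) / len(examples)

rng = random.Random(0)

# Toy data: y = 2x plus a little Gaussian noise, n = 1000 examples.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

w = 0.0
full = mean_grad(w, data)            # (1/n) * sum over all examples
mini_batch = rng.sample(data, 32)    # m = 32 examples chosen at random
estimate = mean_grad(w, mini_batch)  # (1/m) * sum over the mini-batch
```

The estimate is noisy but points the same way as the full gradient, at roughly 3% of the computation for these sizes.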

---

### 🧮 SGD Component-wise Update Rules

Using the estimated gradient from the mini-batch, we update the weights and biases as:

$$
w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_{j=1}^{m} \frac{\partial C_{x_j}}{\partial w_k}
$$

$$
b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_{j=1}^{m} \frac{\partial C_{x_j}}{\partial b_l}
$$

This speeds up training, since each update uses only $m$ examples rather than all $n$. The noise introduced by random sampling is also often credited with helping the model generalize.
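The update rules above can be sketched as a full training loop. To stay self-contained, this example assumes a linear model $a = wx + b$ with per-example cost $C_x = \frac{1}{2}(a - y)^2$ instead of a neural network; the function name `sgd`, the data, and the hyperparameters are illustrative.

```python
import random

def sgd(data, eta, epochs, m, seed=0):
    """Mini-batch SGD for the model a = w*x + b with C_x = 0.5 * (a - y)**2."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random mini-batches every epoch
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            # Sum the per-example partial derivatives over the mini-batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch)
            grad_b = sum((w * x + b - y) for x, y in batch)
            # w -> w - (eta/m) * sum dC_x/dw, and likewise for b.
            w -= eta / len(batch) * grad_w
            b -= eta / len(batch) * grad_b
    return w, b

# Noise-free toy data on the line y = 3x - 1 (made-up numbers).
toy = [(i / 50.0, 3.0 * (i / 50.0) - 1.0) for i in range(-50, 51)]
w, b = sgd(toy, eta=0.5, epochs=200, m=10)
```

Dividing by `len(batch)` rather than `m` keeps the average correct for the final, shorter mini-batch of an epoch.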


### Backpropagation Algorithm

- **Input** $x$: Set the corresponding activation $a^1$ for the input layer.
- **Feedforward**: For each layer $l = 2, 3, \ldots, L$ (layer 1 is the input layer), compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
- **Output error** $\delta^L$: Compute $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- **Backpropagate the error**: For each $l = L-1, L-2, \ldots, 2$, compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
- The gradient of the cost function is given by $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$ and $\frac{\partial C}{\partial b_j^l} = \delta_j^l$.
- **Gradient descent**: Update each $w^l$ and $b^l$ using these gradients.
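The steps above can be sketched in plain Python for a single training example. This is an illustrative sketch, not the book's implementation: it assumes the quadratic cost $C = \frac{1}{2}\|a^L - y\|^2$ (so that $\nabla_a C = a^L - y$), stores each weight matrix as a list of rows, and uses a made-up `[2, 3, 1]` network. A finite-difference estimate on one weight serves as a sanity check.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def feedforward(weights, biases, x):
    """Return (zs, activations); activations[0] is the input x."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(wjk * ak for wjk, ak in zip(row, a_prev)) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations

def cost(weights, biases, x, y):
    """Single-example quadratic cost 0.5 * ||a^L - y||^2 (an assumption here)."""
    _, acts = feedforward(weights, biases, x)
    return 0.5 * sum((aj - yj) ** 2 for aj, yj in zip(acts[-1], y))

def backprop(weights, biases, x, y):
    """Gradients of the single-example quadratic cost for every weight and bias."""
    zs, acts = feedforward(weights, biases, x)
    # Output error: delta^L = (a^L - y) * sigma'(z^L), elementwise.
    delta = [(aj - yj) * sigmoid_prime(zj)
             for aj, yj, zj in zip(acts[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for i in range(len(weights) - 1, -1, -1):
        # dC/db_j = delta_j and dC/dw_jk = a_k^{l-1} * delta_j
        grad_b[i] = delta
        grad_w[i] = [[dj * ak for ak in acts[i]] for dj in delta]
        if i > 0:
            # delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l), elementwise.
            w = weights[i]
            delta = [
                sum(w[j][k] * delta[j] for j in range(len(delta)))
                * sigmoid_prime(zs[i - 1][k])
                for k in range(len(w[0]))
            ]
    return grad_w, grad_b

# Hypothetical network with random parameters and one made-up example.
rng = random.Random(1)
sizes = [2, 3, 1]
weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[rng.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.5, -0.2], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Central finite difference on one weight as a sanity check.
h = 1e-6
weights[0][1][0] += h
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * h
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += h  # restore the original weight
numeric = (c_plus - c_minus) / (2 * h)
```

If the backpropagated value `grad_w[0][1][0]` and `numeric` disagree by more than rounding error, the index bookkeeping in the backward pass is the first place to look.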

Cross-Entropy Cost Function and Regularization: four regularization methods are covered:
- L1 regularization
- L2 regularization
- Dropout
- Artificial expansion of the training data


L2 Regularization: The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term. Here is the regularized cross-entropy:

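$$
C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a_j^L + (1 - y_j) \ln(1 - a_j^L) \right] + \frac{\lambda}{2n} \sum_w w^2
$$

- The first term is the usual expression for the cross-entropy; the second is the regularization term, with regularization parameter $\lambda > 0$.
- Writing $C_0$ for the unregularized cost: $C = C_0 + \frac{\lambda}{2n}\sum_w w^2$
- The regularization term does not include the biases, so their update rule is unchanged.
- The weight update becomes $w \rightarrow \left(1 - \frac{\eta\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$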


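```python
# mnist_loader ships with the book's code repository and loads the MNIST data.
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
```

```python
import network2

# 784 input neurons, 30 hidden neurons, 10 output neurons, cross-entropy cost.
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
# Positional arguments: 30 epochs, mini-batch size 10, learning rate eta = 0.5.
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)
```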