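Sigmoid Neuron: Its inputs can take any real value between 0 and 1 (not only 0 or 1). For inputs $x$, weights $w$ and bias $b$, the output is $\sigma(w \cdot x + b)$, where $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$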
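
As a small illustration of the sigmoid neuron, the sketch below computes the output for one input vector. The names `sigmoid_neuron`, `x`, `w`, and `b`, and the numeric values, are made up for this example and are not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output of one sigmoid neuron: sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical inputs, weights, and bias (chosen only for illustration)
x = np.array([0.2, 0.9])
w = np.array([1.5, -2.0])
b = 0.5

out = sigmoid_neuron(x, w, b)  # weighted input z = 0.3 - 1.8 + 0.5 = -1.0
print(out)
```

Unlike a perceptron, a small change in `w` or `b` produces only a small change in `out`, which is what makes gradient-based learning possible.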

Input Layer: Input Neurons

Hidden (Middle) Layer: Hidden Neurons

Output Layer: Output Neurons

Gradient Descent: $x$ is a training input and $y = y(x)$ is the corresponding desired output.
    Cost function: $$ C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2 $$ where $n$ is the number of training inputs and $a$ is the network's output for input $x$.
    To simplify, think of $C$ as a function of just two variables: $(w, b) \rightarrow (v_1, v_2)$.
    Change in cost function: $$ \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 $$
    We need a way of choosing $\Delta v_1$ and $\Delta v_2$ so that $\Delta C$ is negative. Let $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, and denote the gradient vector by $\nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. Then $\Delta C \approx \nabla C \cdot \Delta v$. Since we need $\Delta C \lt 0$, we pick $\Delta v = -\eta \nabla C$, where $\eta > 0$ is the learning rate. Hence $\Delta C \approx (-\eta \nabla C) \cdot \nabla C = -\eta \|\nabla C\|^2 \le 0$, i.e., $C$ will decrease, provided $\eta$ is small enough for the first-order approximation to hold. The update rule is therefore $v \rightarrow v' = v - \eta \nabla C$. This extends to any number of variables.
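
The update rule $v \rightarrow v' = v - \eta \nabla C$ can be tried on a toy problem. The sketch below assumes a hypothetical cost $C(v_1, v_2) = v_1^2 + 2 v_2^2$ (not the network cost above), whose gradient is known in closed form. The starting point and learning rate are arbitrary choices.

```python
import numpy as np

def cost(v):
    """Hypothetical cost C(v1, v2) = v1^2 + 2*v2^2, minimized at (0, 0)."""
    return v[0] ** 2 + 2.0 * v[1] ** 2

def grad_cost(v):
    """Gradient of the hypothetical cost: (dC/dv1, dC/dv2)."""
    return np.array([2.0 * v[0], 4.0 * v[1]])

eta = 0.1                    # learning rate (assumed value)
v = np.array([2.0, -1.5])    # arbitrary starting point
costs = [cost(v)]

for _ in range(50):
    v = v - eta * grad_cost(v)   # v -> v' = v - eta * grad C
    costs.append(cost(v))

print(costs[0], costs[-1])   # the cost shrinks toward 0
```

With this $\eta$ the cost decreases at every step. A much larger $\eta$ (for example 1.0) would break the first-order approximation and the iterates could diverge.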


Exercise 1: The goal is to choose the step $\Delta v$ that makes the cost $C$ go down as much as possible. Limit the size of the step to a fixed value, $\|\Delta v\| = \epsilon$, with $\epsilon > 0$ small. To first order, $\Delta C \approx \nabla C \cdot \Delta v$, so the objective is to choose a vector $\Delta v$ of fixed length $\epsilon$ that minimizes $\nabla C \cdot \Delta v$. The dot product equals $\|\nabla C\| \, \|\Delta v\| \cos(\theta)$, where $\theta$ is the angle between $\nabla C$ and $\Delta v$. It is minimized when $\cos(\theta) = -1$, i.e., when $\Delta v$ points in the direction opposite to $\nabla C$. Since $\|\Delta v\| = \epsilon$, the smallest value of the dot product is $-\|\nabla C\| \epsilon$, attained at $\Delta v = -\epsilon \nabla C / \|\nabla C\|$. Comparing with $\Delta v = -\eta \nabla C$ gives $\eta = \epsilon / \|\nabla C\|$.
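
The claim in Exercise 1 can be checked numerically. The sketch below uses a made-up gradient vector and compares the step $-\epsilon \nabla C / \|\nabla C\|$ against many random steps of the same length. It is a sanity check under these assumed values, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

grad = np.array([3.0, -4.0])   # hypothetical gradient, norm 5
eps = 0.1                      # fixed step length

# Candidate optimum: step of length eps directly opposite to the gradient
best_step = -eps * grad / np.linalg.norm(grad)
best_value = grad @ best_step  # equals -||grad|| * eps

# 1000 random steps, each rescaled to length eps
dirs = rng.normal(size=(1000, 2))
steps = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
values = steps @ grad          # first-order change in C for each random step

eta = eps / np.linalg.norm(grad)  # learning rate implied by the fixed step length
print(best_value, values.min(), eta)
```

No random step achieves a smaller value of $\nabla C \cdot \Delta v$ than `best_value`, in line with the Cauchy-Schwarz inequality.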

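Exercise 2: In $1D$, the "gradient" is just the derivative $dC/dv$, i.e., the slope of the curve $C(v)$ at the current point. Gradient descent moves $v$ in the direction in which the curve goes down.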

Writing out the gradient descent update rule in terms of components, we have:

$$
w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} 
$$

$$
b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l} 
$$

**Stochastic Gradient Descent:** Estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample (a mini-batch) of $m$ randomly chosen training examples $x_1, \ldots, x_m$ and averaging over them.

$$
\frac{1}{m} \sum_{j=1}^{m} \nabla C_{x_j} \approx \frac{1}{n} \sum_{x} \nabla C_{x} = \nabla C
$$

$$
w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{x_j}}{\partial w_k}
$$

$$
b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{x_j}}{\partial b_l}
$$
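
To see the mini-batch update rules in action without a full network, the sketch below fits a single linear unit $a = w x + b$ with the per-example quadratic cost $C_x = \frac{1}{2}(y - a)^2$. The synthetic data, the true parameters (3 and 1), and all hyperparameters are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed for illustration): y = 3x + 1 plus small noise
n = 1000
x = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=n)

m = 10        # mini-batch size
eta = 0.1     # learning rate
epochs = 20
w, b = 0.0, 0.0

for epoch in range(epochs):
    order = rng.permutation(n)              # reshuffle the data each epoch
    for start in range(0, n, m):
        idx = order[start:start + m]        # indices of one mini-batch
        a = w * x[idx] + b                  # outputs for the mini-batch
        # Averages of dC_x/dw = (a - y) * x and dC_x/db = (a - y)
        grad_w = np.mean((a - y[idx]) * x[idx])
        grad_b = np.mean(a - y[idx])
        w -= eta * grad_w                   # w -> w - (eta/m) * sum_j dC_xj/dw
        b -= eta * grad_b                   # b -> b - (eta/m) * sum_j dC_xj/db

print(w, b)   # should land close to 3 and 1
```

Each update uses only 10 of the 1000 examples, so the gradient is noisy, but on average it points in the same direction as the full gradient.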



In [1]:
import numpy as np

In [None]:
class Network(object):
    def __init__(self, sizes):
        """Initialize the network with a list of sizes."""
        self.num_layers = len(sizes) # number of layers in the network
        self.sizes = sizes # number of neurons in each layer
        self.biases = [np.random.randn(cur, 1) for cur in sizes[1:]] # biases for each layer (except the input layer)
        # weights[i] has shape (cur, prev) so that np.dot(w, a) maps layer i activations to layer i+1
        self.weights = [np.random.randn(cur, prev) for prev, cur in zip(sizes[:-1], sizes[1:])]
    def feedforward(self, a):
        """Return the output of the network given input a."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a
    
    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        """Train the network using mini batch stochastic gradient descent (SGD). The 
        training_data is a list of tuples (x, y) where x is the input and y is the expected output.
        The test_data is a list of tuples (x, y) for evaluating the performance of the network. Other parameters
        are epochs (number of iterations), mini_batch_size (size of each mini batch), and eta (learning rate)."""
        if test_data:
            n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            np.random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print(f"Epoch {j}: {self.evaluate(test_data)} / {n_test}")
            else:
                print(f"Epoch {j} complete")

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying gradient descent using a single mini batch."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            # backprop (not defined in this cell yet) must return the per-layer gradients of C_x
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w - (eta/len(mini_batch)) * nw for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta/len(mini_batch)) * nb for b, nb in zip(self.biases, nabla_b)]
    
    
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

In [5]:
net = Network([2, 3, 1]) # Example: a network with 2 input neurons, 3 hidden neurons, and 1 output neuron
net.weights

[array([[-1.54174811,  0.44301886],
        [-1.17677569, -2.97927487],
        [-0.33045123, -0.28177142]]),
 array([[ 0.24636485, -0.45324362, -0.58637459]])]
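
The shapes in the output above, `(3, 2)` and `(1, 3)`, are what the feedforward step `sigmoid(np.dot(w, a) + b)` needs. The standalone sketch below rebuilds the same layer sizes outside the `Network` class and pushes one input column through. The seeded generator `rng` and the input values are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

sizes = [2, 3, 1]                 # 2 input, 3 hidden, 1 output neuron
rng = np.random.default_rng(0)    # seeded only to make the example repeatable

# One bias column and one (cur, prev) weight matrix per non-input layer
biases = [rng.standard_normal((cur, 1)) for cur in sizes[1:]]
weights = [rng.standard_normal((cur, prev))
           for prev, cur in zip(sizes[:-1], sizes[1:])]

a = np.array([[0.5], [0.1]])      # hypothetical input as a (2, 1) column vector
for b, w in zip(biases, weights):
    a = sigmoid(w @ a + b)        # (cur, prev) @ (prev, 1) + (cur, 1) -> (cur, 1)

print([w.shape for w in weights])  # [(3, 2), (1, 3)]
print(a.shape)                     # (1, 1)
```

Passing the input as a column of shape `(2, 1)` rather than a flat array of shape `(2,)` matters: adding a `(3, 1)` bias to a flat `(3,)` result would broadcast to a `(3, 3)` matrix.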