# 💫 Hands-on Learning of Deep Learning and Neural Networks 💫

<div align="center">
    <video width="1280" height="720" src="./assets/Introduction.mp4" controls></video>
</div>

<div align="center">
    <video width="1280" height="720" src="./assets/Applications.mp4" controls></video>
</div>

This tutorial will help you understand the basics of neural networks from the perspective of optimization. 
By the end of the lecture, you will understand:
1. Gradient Descent Algorithm for Optimization
2. Modelling Simple Multi-Layer Perceptrons
3. Backpropagation Algorithm
4. Training your own simple neural network for classification

## ⚒️ Let's set up by importing the relevant libraries and utility functions


First we import the relevant libraries. 

**Note**: Graphviz may require manual installation; see https://graphviz.org/download/ for more information.

In [None]:
%pip install numpy
%pip install matplotlib
%pip install graphviz
%pip install scikit-learn
%pip install adjustText

import math
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets # import make_moons, make_blobs
from graphviz import Digraph
from adjustText import adjust_text

seed = 1337

np.random.seed(seed)

%matplotlib inline

## ⏮️ Recap on Optimization: Playing with optimization

Let's recap our optimization class. 

Suppose we have an objective function that looks like this: 

<div align="center">
<img src="./assets/Optimization.png" style="height: 500px"/>
</div>

We want to create an **algorithm** that can find the maximum (or minimum) of the given function, i.e., the reddest (or bluest) point in the figure. 

Let's start by creating this toy function. 

In [None]:
# Defining a grid of Xs and Ys
resolution = 100
X, Y = np.meshgrid( np.linspace(-1,1,resolution), np.linspace(-1,1,resolution) )

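# Defining 4 different 2D Gaussian-shaped bumps
# Note: `/2.0*sigma**2` evaluates as `(... / 2.0) * sigma**2`, so a larger sigma gives a narrower bump here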
mux, muy, sigma = 0.3, -0.3, 4
G1 = np.exp(-((X-mux)**2+(Y-muy)**2)/2.0*sigma**2)

mux, muy, sigma = -0.3, 0.3, 2
G2 = np.exp(-((X-mux)**2+(Y-muy)**2)/2.0*sigma**2)

mux, muy, sigma = 0.6, 0.6, 2
G3 = np.exp(-((X-mux)**2+(Y-muy)**2)/2.0*sigma**2)

mux ,muy, sigma = -0.4, -0.2, 3
G4 = np.exp(-((X-mux)**2+(Y-muy)**2)/2.0*sigma**2)

# Composing the final function
G = G1 + G2 - G3 - G4

Let's visualise the function

In [None]:
fig = plt.figure(figsize=(6*4,6)) # Defining the figure space
axes = fig.subplots(1, 4)         # Defining the subplots in the figure

for ax, g, t in zip(axes.flat, [G1, G2, G3, G4], ['G1', 'G2', 'G3', 'G4']): # Iterating over axes and functions
    ax.imshow(g, vmin=-1, vmax=1, cmap='jet')                               # Plotting the function on the subplot
    ax.set(title=t)                                                          # Setting the title of the subplot

fig.tight_layout() # Removes extra spacing from the figure

fig = plt.figure(figsize=(6,6))
ax = fig.subplots()

cax = ax.imshow(G, vmin=-1, vmax=1, cmap='jet')
ax.set(title="G")

cbar = fig.colorbar(cax) # Attaching the colorbar to the figure

fig.tight_layout() 
plt.show()               # Instruct Matplotlib to show the figures created

# fig.savefig("./assets/Optimization.png", dpi=300)

Now that we have the function, we can start optimizing it. 

Let's **start** at the point (70.0, 60.0) on the grid.
At each step, we sample points around the current point and use their function values to estimate the gradient, which gives the direction of movement.  
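
The sampling idea can be sketched on its own before applying it to the grid. The snippet is a minimal illustration, not the exact procedure of this notebook: `estimate_gradient`, `toy_reward`, the step size `0.05` and the target `(1, -2)` are all hypothetical choices, and the estimate is averaged over the samples, which only rescales the step.

```python
import numpy as np

def estimate_gradient(f, w, sigma=0.1, n_samples=200, rng=None):
    """Estimate an ascent direction of f at w from random samples around w."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((n_samples, w.size))               # random offsets
    rewards = np.array([f(w + sigma * eps) for eps in noise])      # f at each sample
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # standardise rewards
    return rewards @ noise / n_samples                             # reward-weighted mean offset

def toy_reward(p):
    # Hypothetical reward with a single maximum at (1, -2)
    return -np.sum((p - np.array([1.0, -2.0])) ** 2)

rng = np.random.default_rng(0)
point = np.zeros(2)
for _ in range(200):
    point = point + 0.05 * estimate_gradient(toy_reward, point, rng=rng)  # step uphill
print(point)
```

Offsets that land on higher values receive positive weights and offsets that land on lower values receive negative weights, so the weighted average points uphill without ever differentiating the function.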


In [None]:
n_iter = 5     # Number of Steps to take for optimisation
alpha  = 0.03  # Learning rate of the optimisation

w = np.array([70.0, 60.0]) # Starting Parameter (Point)
sigma  = 3                 # Standard deviation of the samples around current parameter vector

fig  = plt.figure( figsize=(5*n_iter, 5) )
axes = fig.subplots(1, n_iter) 

prevx, prevy = [], []
for q, ax in zip(range(n_iter), axes):
    
    # Draw the Optimization Landscape
    ax.imshow(G, vmin=-1, vmax=1, cmap='jet')

    # Sample Random Population
    noise = np.random.randn(200, 2)
    wp = np.expand_dims(w, 0) + sigma * noise
    x,y = zip(*wp)
    
    # Estimate Gradient (Direction)
    R  = np.array([G[int(wi[1]), int(wi[0])] for wi in wp])
    R -= R.mean()
    R /= R.std() 
    g  = np.dot(R, noise)
    u  = alpha * g
    
    prevx.append(w[0])
    prevy.append(w[1])
    
    # Draw Population on Landscape (Black Points)
    ax.scatter(x, y, 4, 'k', edgecolors='face')
    
    # Draw estimated gradient (direction) as arrow (White Arrow)
    ax.arrow(w[0], w[1], u[0], u[1], head_width=3, head_length=5, fc='w', ec='w')
    
    # Draw Parameter History (White Points)
    ax.plot(prevx, prevy, 'wo-')
    
    # Update Parameter According to the gradient
    w += u
    
    ax.set(title=f"Iteration: {q+1} | Reward: {G[int(w[1]), int(w[0])]:.2f}")  # G is indexed as [row (y), column (x)]

fig.tight_layout()

#### 🚀 **Try-it-out**: Try finding the minimum (the bluest part of the figure)

## 📈 Playing with gradients


Let's start by creating our own **datatype** to save a single scalar value and its gradient. 

**Note**: We are creating our own datatype just to understand the behind-the-scenes working of well-established libraries like NumPy, TensorFlow, PyTorch, etc. 

In [None]:
class Value:
    def __init__(self, data, label='', _children=(), _op=''):
        
        # Information about value, gradient and its name
        self.data  = data
        self.grad  = 0.0
        self.label = label
        
        # Utility attributes for calculating and passing gradients (Backprop)
        self._backward = lambda: None
        self._prev     = set(_children)
        self._op       = _op 
    
    # Simple arithmetic operations on value and computing corresponding gradients   
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, label='+', _children=(self, other), _op='+')

        def _backward():
            self.grad  += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, label='*', _children=(self, other), _op='*')

        def _backward():
            self.grad  += other.data * out.grad
            other.grad += self.data  * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, label=f'**{other}', _children=(self,), _op='**')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out

    # Other arithmetic operations
    # No separate backward functions are needed here: these reuse __add__, __mul__ and __pow__, whose backward functions are already defined.
    def __neg__(self): # -self
        return self * -1

    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other)

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __rtruediv__(self, other): # other / self
        return other * self**-1
    
    # Simple transformations on Value and computing corresponding gradients
    def relu(self):
        out = Value(0 if self.data < 0 else self.data, label='ReLU', _children=(self,), _op='ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        out = Value(t, label='Tanh', _children=(self, ), _op='Tanh')
        
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        
        return out
  
    def exp(self):
        x = self.data
        out = Value(math.exp(x), label='Exp',  _children=(self, ), _op='Exp')
        
        def _backward():
            self.grad += out.data * out.grad 
        out._backward = _backward
        
        return out
    
    # Information when printing instance
    def __repr__(self):
        if self.label:
            return f"Value(node={self.label}, data={self.data}, grad={self.grad})"
        else:
            return f"Value(data={self.data}, grad={self.grad})"
    
    # Call backward on every node in reverse topological order -> Backprop
    def backward(self):

        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            v._backward()

Let's also create some visualisation utilities that will help us understand the flow of gradients and data in a complicated function. 

In [None]:
# Builds the graph from a root node
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

# Visualises the graph built from root node
def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            dot.node(name = uid + n._op, label = n._op)
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot

Let's start by defining our own simple one-variable function and plot it. 

We will use a simple function, 
$$ f(x) = y = x^2 - 4x + 3 $$

In [None]:
def cost_function(x):
    return x**2 - 4*x + 3 

In [None]:
X = np.linspace(-7, 15, 100)
Y = cost_function(X)

In [None]:
plt.plot(X, Y)
plt.show()

In [None]:
x = Value(15.0, label='X')
y = cost_function(x)

draw_dot(y)

Fortunately, we know the gradient of this function:
$$ \frac{d(f(x))}{dx} = \frac{dy}{dx} = 2x - 4 $$ 

However, let's use backprop, which computes the gradient for us without an explicit formula, and see what it returns. 

We know that the gradient of the function should be: $$ 26 \quad \text{when} \quad  x = 15.0 $$ 
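
A quick way to sanity-check a gradient without any calculus is a finite-difference estimate. This is a small sketch using plain floats; `numerical_gradient` and the step `h` are illustrative names and values, not part of the tutorial's `Value` class.

```python
def cost_function(x):
    return x**2 - 4*x + 3

def numerical_gradient(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

analytic = 2 * 15.0 - 4                              # 2x - 4 at x = 15
numeric = numerical_gradient(cost_function, 15.0)    # approximately the same value
print(analytic, numeric)
```

The two numbers should agree to several decimal places, which is the same value backprop is expected to report for `x`.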

In [None]:
y.backward()
draw_dot(y)

Now let's use the gradient to update x. Can we find the minimum of the function just by repeating this update **iteratively**?
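
For this particular quadratic, the effect of the learning rate $\alpha$ can be worked out by hand. Substituting the gradient $2x - 4$ into the update $x \leftarrow x - \alpha \frac{dy}{dx}$ gives:

$$ x_{t+1} = x_t - \alpha (2x_t - 4) \quad \Longrightarrow \quad x_{t+1} - 2 = (1 - 2\alpha)(x_t - 2) $$

So the distance to the minimum at $x = 2$ is multiplied by $(1 - 2\alpha)$ at every step. With $\alpha = 0.3$ the factor is $0.4$ and x approaches 2 steadily; with $\alpha = 0.75$ the factor is $-0.5$, so x jumps from one side of the minimum to the other while still converging; any $\alpha > 1$ makes the factor larger than 1 in magnitude and the iterates diverge.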

In [None]:
alpha = 0.3 # 0.1 # 0.3 # 0.75 #
num_iterations = 10

x = Value(15.0, label='X')

xy_list = []
for i in range(num_iterations):
    
    # Calculate f(x)
    y = cost_function(x)
    
    # Calculate dy/dx
    y.backward()
    
    xy_list.append((x.data, y.data))
    print(f"Step: {i+1:2d} | X: {x.data:4.1f} | f(X): {y.data:6.2f} | Gradient dy/dx: {x.grad:6.2f}")
    
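    # Update x in place and reset its gradient, since gradients accumulate across backward() calls
    x.data -= alpha * x.grad
    x.grad  = 0.0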

Let's also visualise what happens to x over the iterations.

In [None]:
xy_list = np.asarray(xy_list)

fig = plt.figure(figsize=(16, 8))
ax = fig.subplots()

ax.plot(X, Y)
ax.plot(xy_list[:, 0], xy_list[:, 1], 'r')

for i in range(len(xy_list)):
    ax.text(xy_list[i, 0], xy_list[i, 1] + 0.5, round(xy_list[i, 1], 2))

    
ax.set(xlabel='X', ylabel='Cost Function', title=f"$f(x) = y = x^2 - 4x + 3$")
plt.show()

#### 🚀 **Try-it-out**: Try a 2D function and see how it approaches the minimum. 
#### 🚀 **Try-it-out**: Can you guess how gradients would be calculated manually for 2D functions?

## 🎭 What about classification?


We have done a **regression task**, where we ride along the function to find the minimum. 

However, what should be done when we want to perform **classification**?

Moreover, what do we do when the function is known only approximately, through **sampled points**? 

Let's first visualise what I am talking about...

In [None]:
X, y = datasets.make_moons(n_samples=100, noise=0.1)

y = y*2 - 1 # make y be -1 or 1
fig = plt.figure(figsize=(5,5))
ax = fig.subplots()
ax.scatter(X[:,0], X[:,1], c=y, s=20, cmap=plt.cm.Spectral)
ax.set(xlabel="X", ylabel="Y", xlim=(X[:, 0].min()-1, X[:, 0].max()+1), ylim=(X[:, 1].min()-1, X[:, 1].max()+1))  # limits from the coordinates, not the labels
plt.show()

As the figure above shows, we don't have an exactly defined function for either the left moon or the right moon. Moreover, we have to separate the two moons rather than find a minimum. 

<br>

This is the kind of task at which current deep learning networks (artificial neural networks) excel! 

Let's learn about neural networks! 

<div align="center">
    <video width="1280" height="720" src="./assets/ML-DL.mp4" controls></video>
</div>

## 🔬 Microscopic view of Neural Networks

<div align="center">
    <video width="1280" height="720" src="./assets/Neuron.mp4" controls></video>
</div>

### 👀 Visualising Artificial Neuron (Perceptron)


Feed-Forward Neural Networks are inspired by the information processing of one or more neural cells, called neurons. 
A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. 
The axon carries the signal out to synapses, which are the connections of a cell’s axon to other cells’ dendrites.

<details>
<br>
The human nervous system is composed of more than 100 billion cells known as neurons. 

A neuron is a cell in the nervous system whose function it is to receive and transmit information. 

Neurons are made up of three major parts:

* The cell body, or **soma**, which contains the nucleus of the cell and keeps the cell alive
* A branching treelike fiber known as the **dendrite**, which collects information from other cells and sends the information to the soma
* A long, segmented fiber known as the **axon**, which transmits information away from the cell body toward other neurons or to the muscles and glands

<img src="https://c4.staticflickr.com/3/2656/4253587827_9723c3ffd3_z.jpg" />

*Photo courtesy of GE Healthcare, http://www.flickr.com/photos/gehealthcare/4253587827/*

<img src="https://askabiologist.asu.edu/sites/default/files/resources/articles/neuron_anatomy.jpg"/>

Some neurons have hundreds or even thousands of dendrites, and these dendrites may themselves be branched to allow the cell to receive information from thousands of other cells. 

The axons are also specialized; some, such as those that send messages from the spinal cord to the muscles in the hands or feet, may be very long---even up to several feet in length. 
To improve the speed of their communication, and to keep their electrical charges from shorting out with other neurons, axons are often surrounded by a **myelin sheath**. 
The myelin sheath is a layer of fatty tissue surrounding the axon of a neuron that both acts as an insulator and allows faster transmission of the electrical signal.
Axons branch out toward their ends, and at the tip of each branch is a *terminal button*.

</details>

The actual working of neurons involves many aspects (chemical, electrical, physical, and temporal). 

We will abstract all of this away into three numbers:

* **Activation** - A value representing the excitement of a neuron
* **Bias** - A value representing a default or bias (sometimes called a threshold)
* **Weight** - A value representing a connection to another neuron

In addition, there is a **transfer function** that takes all of the incoming activations times their associated weights plus the bias, and squashes the resulting sum. 
This limits the activations from growing too big or too small.

<div align="center">
<img src="./assets/Perceptron.png"/>
</div>

A perceptron maintains an activation value that depends on the activation values of its incoming neighbors, the weights from its incoming neighbors, and an additional value, called the **default bias value**. To compute this activation value, we first calculate the node's net input.

The net input is a weighted sum of all the incoming activations plus the node's bias value. Applying the activation function to the net input gives the neuron's output:

$$ y = f\left(b + \sum\limits_{i=1}^n x_i w_i\right) $$

where $w_{i}$ is the weight, or connection strength, from the $i^{th}$ node, $x_i$ is the input, $b$ is the bias value, and $f$ is the activation function that transforms the linear combination of inputs and weights. 
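
As a minimal sketch with plain Python floats, the formula can be evaluated directly. The helper name `neuron_output` is hypothetical; the numbers mirror the two-input neuron built with `Value` objects in the next cell, with $f = \tanh$.

```python
import math

def neuron_output(inputs, weights, bias, activation=math.tanh):
    # Net input: bias plus the weighted sum of the inputs
    net = bias + sum(x * w for x, w in zip(inputs, weights))
    # The activation function squashes the net input
    return activation(net)

out = neuron_output(inputs=[2.0, 0.0], weights=[-3.0, 1.0], bias=6.8813735870195432)
print(round(out, 4))  # 0.7071
```

Here the net input is $6.8814 - 6 + 0 \approx 0.8814$, and $\tanh(0.8814) \approx 0.7071$.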


In [None]:
# Inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

# Weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

# Bias of the neuron
b = Value(6.8813735870195432, label='b')

# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'

x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'

n = x1w1x2w2 + b; n.label = 'n'

o = n.tanh(); o.label = 'o'

draw_dot(o)

### 🕸️ Network of Neurons (Multi Layer Perceptron --- MLP)


To build a network of neurons, we first start by grouping neurons together in **layers**.

A typical Artificial Neural Network (ANN) is composed of three layers: **input**, **hidden**, and **output**. Each layer contains a collection of neurons. Typically, the neurons in a layer are **fully connected** to the neurons in the next layer. 

For instance, every input neuron will have a weighted connection to every hidden neuron. Similarly, every hidden neuron will have a *weighted connection* to every output neuron.

<div align="center">
    <img src="./assets/Lecture-ANN.png"/>
    <br>
    <video width="1280" height="720" src="./assets/NeuralNetwork.mp4" controls></video>
</div>

Processing in a network works as follows:

Input is propagated forward from the input layer through the hidden layer and finally through the output layer to produce a response. Each neuron, regardless of the layer it is in, uses the same transfer function in order to propagate its information forward to the next layer. 
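
Forward propagation can also be written compactly with matrices. The following is a sketch under assumptions: the 2 → 3 → 1 layout, the random weights, and the names `forward`, `relu` and `identity` are illustrative and are not the classes defined in this notebook.

```python
import numpy as np

def forward(x, layers):
    """Propagate x through a list of (W, b, activation) layers."""
    for W, b, act in layers:
        x = act(W @ x + b)   # weighted sum plus bias, then the transfer function
    return x

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(0)
# Hypothetical 2 -> 3 -> 1 network with random weights and zero biases
layers = [
    (rng.uniform(-1, 1, size=(3, 2)), np.zeros(3), relu),
    (rng.uniform(-1, 1, size=(1, 3)), np.zeros(1), identity),
]
response = forward(np.array([0.5, -1.0]), layers)
print(response.shape)  # (1,)
```

Each row of a weight matrix holds the incoming weights of one neuron, so one matrix product computes the net input of a whole layer at once.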

<div align="center">
    <video width="1280" height="720" src="./assets/ForwardPropagation.mp4" controls></video>
</div>



Let's now create a neuron, a layer and a network of neurons using our defined Value class!

In [None]:
class Module:

    # Explicitly reset gradients to 0.0
    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0

    # List of Parameters
    def parameters(self):
        return []

class Neuron(Module):

    def __init__(self, nin, activation='ReLU', layer_name='', neuron_name=''):
        
        # Sets weights, bias and activation for the neuron
        self.w = [Value(np.random.uniform(-1,1), label=f"Weight of {layer_name} {neuron_name} for Input {i+1}") for i in range(nin)]
        self.b = Value(0, label=f"Bias of {layer_name} {neuron_name}")
        self.activation = activation

    # Sets the list of parameters in the neuron
    def parameters(self):
        return self.w + [self.b]

    # Information when printing neuron
    def __repr__(self):
        return f"{self.activation}Neuron(nin={len(self.w)})"
    
    # Forward Pass -> Compute the output of the neuron
    def __call__(self, x):
        
        w = sum((wi*xi for wi,xi in zip(self.w, x)))
        out = w + self.b
        
        if self.activation == 'ReLU':
            out = out.relu()
        elif self.activation == 'Tanh':
            out = out.tanh()
        elif self.activation == 'Linear':
            out = out
            
        return out

class Layer(Module):

    def __init__(self, nin, nout, **kwargs):
        # Define neurons of a layer
        self.neurons = [Neuron(nin, neuron_name=f"Neuron {i+1}", **kwargs) for i in range(nout)]

    # Sets the list of parameters in the layer
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

    # Information when printing layer
    def __repr__(self):
        return f"Layer of [ {', '.join(str(n) for n in self.neurons)} ]"
    
    # Forward Pass -> Compute the output of the layer
    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out

class MLP(Module):

    def __init__(self, nin, nouts, activations=None):
        if activations is not None:
            assert len(nouts) == len(activations), 'Activations not defined for some layers'
        else:
            activations = ['Linear'] * len(nouts)
            
        sz = [nin] + nouts 
        
        # Define layers of a MLP
        self.layers = [Layer(sz[i], sz[i+1], activation=activations[i], layer_name=f"Layer {i+1}") for i in range(len(nouts))]

    # Sets the list of parameters in the MLP
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

    # Information when printing MLP
    def __repr__(self):
        new_line = f"\n{'-'*8}> "
        return f"MLP of [{new_line}{new_line.join(str(layer) for layer in self.layers)}\n]"
    
    # Forward Pass -> Compute the output of the MLP
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

Let's also define a function that will compute **loss** *(reward/penalty)* for our neural network
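
The loss used in the next cell is a max-margin (hinge) loss on labels in {-1, +1}. As a small NumPy sketch on made-up scores (the arrays and the name `max_margin_loss` are hypothetical), each sample contributes $\max(0, 1 - y_i s_i)$:

```python
import numpy as np

def max_margin_loss(labels, scores):
    # A sample is penalised unless its score has the right sign with a margin of at least 1
    return np.maximum(0.0, 1.0 - labels * scores).mean()

labels = np.array([1, -1, 1])            # targets in {-1, +1}
scores = np.array([2.0, -0.5, -1.0])     # hypothetical raw model outputs
loss_value = max_margin_loss(labels, scores)
accuracy = np.mean((labels > 0) == (scores > 0))
print(loss_value, accuracy)
```

The first sample is correct with a comfortable margin (loss 0), the second is correct but too close to the boundary (loss 0.5), and the third is misclassified (loss 2), giving a mean loss of 2.5/3 and an accuracy of 2/3.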

In [None]:
def compute_loss(batch_size=None):
    
    # Process Data in batches, in case data is too big to handle
    if batch_size is None:
        Xb, yb = X, y
    else:
        ri = np.random.permutation(X.shape[0])[:batch_size]
        Xb, yb = X[ri], y[ri]
    
    # Format Data to our Datatype
    inputs = [ [Value(xrow[0], label='X'), Value(xrow[1], label='Y')] for xrow in Xb]
    
    # Forward Pass to get the scores
    scores = list(map(model, inputs))
    
    # Max-Margin Loss to calculate fitness based on scores and y
    losses = [(1 + -yi*scorei).relu() for yi, scorei in zip(yb, scores)]
    data_loss = sum(losses) * (1.0 / len(losses))
    
    # L2 Regularization (Optional)
    ## To improve performance, we also regularise the parameters. 
    alpha = 1e-4
    reg_loss = alpha * sum((p*p for p in model.parameters()))
    
    # Compute Final Loss -> Max-Margin Loss + L2 Regularization Loss
    data_loss = data_loss + reg_loss
    
    # Compute Accuracy
    accuracy = [(yi > 0) == (scorei.data > 0) for yi, scorei in zip(yb, scores)]
    
    # Return everything required
    return data_loss, scores, sum(accuracy) / len(accuracy)

In [None]:
model = MLP(nin=2, nouts=[2, 2, 1]) # 2 hidden layers (2 neurons each) + 1 output neuron
print(model)
print(f"Number of Parameters: {len(model.parameters())}")

In [None]:
loss, scores, acc = compute_loss()
print(f"Loss: {loss.data:.4f} | Accuracy: {100*acc: 5.2f}")
# print(f"Prediction of 1st Input: {int(scores[1].data>0)} | True classification of 1st Input: {y[1]}")

### 🔁 Backpropagation --- Trick to update weights (parameters) of multilayer neural networks

<div align="center">
    <video width="1280" height="720" src="./assets/Backpropagation.mp4" controls></video>
</div>

<details>
<br>
For many years, it was unknown how to learn the weights in a multi-layered neural network. In addition, Marvin Minsky and Seymour Papert proved in their 1969 book [Perceptrons](https://en.wikipedia.org/wiki/Perceptrons_(book)) that a single-layer perceptron cannot compute even some simple functions; multiple layers are required. 

(Actually, the idea of using simulated evolution to search for the weights could have been used, but no one thought to do that.) 

Specifically, they looked at the function XOR:

**Input 1** | **Input 2** | **Target**
------------|-------------|-------
 0 | 0 | 0
 0 | 1 | 1 
 1 | 0 | 1 
 1 | 1 | 0 

This largely stalled research into neural networks for more than a decade. The idea was generally ignored until the mid-1980s, when the **Back-Propagation of Error** (Backprop) algorithm was popularized.
</details>

The **Backpropagation algorithm** (*Backprop* for short), also called the *generalized delta rule*, is a *supervised* learning method for multilayer feed-forward networks in the field of Deep Learning.
Technically, it is a method for training the weights in a multilayer feed-forward neural network. 

The principle of the backpropagation approach is to model a given function by modifying internal weightings of input signals to produce an expected output signal. 
The system is trained using a supervised learning method, where the error between the system’s output and a known expected output is presented to the system and used to modify its internal state.

Backpropagation can be used for both classification and regression problems, but we will focus on classification in this lecture.
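
At its core, backpropagation is the chain rule applied from the output back to the inputs. As a hand-worked sketch with plain floats, here are the gradients of the single tanh neuron from earlier in this tutorial; the variable names are illustrative.

```python
import math

inputs, weights, bias = [2.0, 0.0], [-3.0, 1.0], 6.8813735870195432

# Forward pass
net = bias + sum(x * w for x, w in zip(inputs, weights))
out = math.tanh(net)

# Backward pass (chain rule)
dout_dnet = 1.0 - out**2                          # derivative of tanh at net
grad_weights = [dout_dnet * x for x in inputs]    # d(out)/d(w_i) = d(out)/d(net) * x_i
grad_inputs  = [dout_dnet * w for w in weights]   # d(out)/d(x_i) = d(out)/d(net) * w_i
grad_bias    = dout_dnet                          # d(net)/d(b) = 1
print(grad_weights, grad_inputs, grad_bias)
```

With these numbers the local tanh derivative is about 0.5, so the weight gradients are about 1.0 and 0.0, and the input gradients about -1.5 and 0.5. The weight on the zero input receives no gradient because changing it cannot change the output.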

In [None]:
scores[0].backward()
draw_dot(scores[0])

### ⚙️ Train a simple neural network

<div align="center">
    <video width="480" height="360" src="./assets/Part6.mp4" controls></video>
</div>

Now let's train our neural network to classify the two moons. 

Backprop in action!

In [None]:
n_iter = 20
n_log  = 1
learning_rate = 1.0

model = MLP(nin=2, nouts=[16, 16, 1], activations=['ReLU', 'ReLU', 'Linear']) # 2 hidden layers (16 neurons each) + 1 output neuron
print(model)
print(f"Number of Parameters: {len(model.parameters())}")
print(f"\n{'-'*70}\n")
data = []

# Optimize Iteratively
for k in range(n_iter):
    
    # Zero-Grad
    model.zero_grad()

    # Forward Pass -> Compute Loss
    loss, scores, acc = compute_loss()
    
    # Backward Pass
    loss.backward()
    
    # Update Weights using gradient descent on the full batch, with a linearly decaying learning rate
    lr = learning_rate - 0.9*(k+1)/n_iter
    for p in model.parameters():
        p.data -= lr * p.grad
    
    if k % n_log == 0:
        print(f"Step: {k+1:3d} | Loss: {loss.data:.4f} | Accuracy: {acc*100:3.2f}% | Learning Rate: {lr:.2f}")
        data.append((loss.data, acc, lr))

Let's also visualise the loss, accuracy and learning rate over the iterations. 

In [None]:
fig = plt.figure(figsize=(6*3, 6))
axes = fig.subplots(1, 3)

data = np.asarray(data)

for ax, d, t in zip(axes.flat, [data[:, 0], data[:, 1], data[:, 2]], ['Loss', 'Accuracy', 'Learning Rate']):
    ax.plot(d)
    ax.set(title=t, xlim=(0, n_iter))

plt.show()

Now let's visualise the decision boundary of the trained neural network: what does the boundary separating the two moons look like? 

In [None]:
# Visualise Decision Boundary
resolution = 0.25
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution), np.arange(y_min, y_max, resolution))
Xmesh = np.c_[xx.ravel(), yy.ravel()]

inputs = [list(map(Value, xrow)) for xrow in Xmesh]

scores = list(map(model, inputs))

Z = np.array([s.data > 0 for s in scores]).reshape(xx.shape)

fig = plt.figure()
ax = fig.subplots()
ax.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
ax.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
ax.set(xlabel='X', ylabel='Y', xlim=(xx.min(), xx.max()), ylim=(yy.min(), yy.max()))
plt.show()

#### 🚀 **Try-it-out**: Try some other simple datasets

<div align="center">
    <video width="1280" height="720" src="./assets/Quiz-1.mp4" controls></video>
    <video width="1280" height="720" src="./assets/Quiz-2.mp4" controls></video>
</div>

## 🤖 Moving Forward: Example of Handwritten Digits Classification using PyTorch

Given a task, how does one train a neural network to do/solve the task? This involves the following steps:

1. Determine an appropriate network architecture.
2. Define a data set that will be used for training.
3. Define the neural network parameters to be used for training.
4. Train the network.
5. Test the trained network.
6. Do post training analysis.

### ⚒️ Libraries Setup



We again start by importing the libraries

In [None]:
%pip install torch torchvision tqdm

from tqdm.auto import tqdm
from itertools import repeat

import torch
import torchvision
import torch.nn.functional as F
from torch import nn, optim
from torchvision import datasets, transforms, utils as vutils

torch.manual_seed(seed)
torch.use_deterministic_algorithms(True)

### 🏛️ Determining an appropriate architecture

Recall that a neural network consists of an input layer, an output layer, and zero or more hidden layers. Once a network has been trained, when you present an input to the network, the network will propagate the inputs through its layers to produce an output. 

If the input represents an instance of the task, the output should be the solution to that instance after the network has been trained. One can therefore view a neural network as a general pattern associator, and given a task, the first step is to identify the nature of the inputs to the pattern associator. 

This is normally expressed as the number of neurons required to represent the input. 
Similarly, you will need to determine how many output neurons will be required. 

For example, consider a simple logical connective, AND, whose input-output characteristics are summarized in the table below:

**Input A** | **Input B** | **Target**
------------|-------------|-------
 0 | 0 | 0
 0 | 1 | 0 
 1 | 0 | 0 
 1 | 1 | 1 

This is a very simple example, but it will help us illustrate all of the important concepts in defining and training neural networks.

In this example, it is clear that we will need two neurons in the input layer, and one in the output layer. 
We can start by assuming that we will not need a hidden layer. In general, as far as the design of a neural network is concerned, you always begin by identifying the size of the input and output layers. 
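
To see why no hidden layer is needed for AND, here is a minimal sketch of a single neuron with hand-picked parameters. The weights `(1.0, 1.0)`, the bias `-1.5` and the step activation are illustrative choices, not values learned by training.

```python
def and_neuron(a, b, w=(1.0, 1.0), bias=-1.5):
    # Fires (returns 1) only when the weighted sum plus bias is positive
    return int(a * w[0] + b * w[1] + bias > 0)

table = [(a, b, and_neuron(a, b)) for a in (0, 1) for b in (0, 1)]
print(table)  # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```

The weighted sum exceeds the threshold of 1.5 only when both inputs are 1, which reproduces the AND table with two input neurons and one output neuron.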

Then, you decide how many hidden layers you would use. 
In most situations you will need at least one hidden layer, though there are no hard and fast rules about its size. 
Through much empirical practice, you will develop your own heuristics about this. 

For the MNIST dataset, we use 2 convolutional layers followed by 2 fully connected (MLP) layers. 

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
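
The input size `9216` of `fc1` follows from the shapes produced by the layers before it. A small arithmetic sketch shows where it comes from; `conv_output_size` is a hypothetical helper implementing the usual output-size formula for an unpadded window, not a PyTorch function.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    # Output-size formula for a convolution or pooling window
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                            # MNIST images are 28x28
size = conv_output_size(size, kernel=3)              # conv1: 28 -> 26
size = conv_output_size(size, kernel=3)              # conv2: 26 -> 24
size = conv_output_size(size, kernel=2, stride=2)    # max_pool2d(x, 2): 24 -> 12
flat_features = 64 * size * size                     # 64 output channels from conv2
print(flat_features)  # 9216
```

If you change the kernel sizes or the number of channels, the first argument of `nn.Linear` in `fc1` has to change accordingly.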

### 🎨 Define a data set that will be used for training

Once you have decided on the network architecture, you have to prepare the data set that will be used for training. Each item in the data set represents an input pattern and the correct output pattern that should be produced by the network (since this is supervised training). 

In most tasks, there can be an infinite number of such input-output associations. Obviously it would be impossible to enumerate all associations for all tasks (and it would make little sense to even try to do this!). 
You have to then decide what comprises a good representative data set that, when used in training a network, would generalize to all situations.

In the example of the AND, the data set is very small, finite (only 4 cases!), and **exhaustive**.

However, the MNIST dataset consists of single-channel images of size 28×28 pixels. If every pixel could only be black or white, there would be $ 2^{28 \times 28} $ possible images (this number is 237 digits long). With 256 grey levels per pixel, there are $ 256^{28\times28} $ possible images (this number is 1889 digits long). 
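
These digit counts are easy to verify, since Python integers have arbitrary precision. The variable names are illustrative.

```python
binary_images = 2 ** (28 * 28)      # every pixel is either black or white
grey_images = 256 ** (28 * 28)      # every pixel takes one of 256 grey levels
print(len(str(binary_images)), len(str(grey_images)))  # 237 1889
```

Either number is far too large to enumerate, which is why a representative sample of images is used for training instead.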

In [None]:
batch_size      = 128
test_batch_size = 512
n_iter          = 5000
n_log           = 1
device          = 'cpu'
dry_run         = False

train_kwargs = {'batch_size': batch_size}
test_kwargs  = {'batch_size':  test_batch_size}

def infinite_loader(data_loader):
    for loader in repeat(data_loader):
        for data in loader:
            yield data
            
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST('../data', train=True,  download=True, transform=transform)
test_dataset  = datasets.MNIST('../data', train=False, transform=transform)
train_loader  = infinite_loader(torch.utils.data.DataLoader(train_dataset,**train_kwargs))
test_loader   = torch.utils.data.DataLoader(test_dataset, **test_kwargs)

In [None]:
batch = next(iter(train_loader))

fig = plt.figure(figsize=(8, 8))
ax = fig.subplots()

ax.imshow(np.transpose(torchvision.utils.make_grid(batch[0].to(device)[:64], padding=2, normalize=True).cpu(), (1,2,0)))
ax.set(xticks=[], yticks=[])
plt.show()

### 📝 Define the neural network parameters

The next step is to define the parameters required to train the neural network. 

In [None]:
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, eta_min=0.01, T_max=n_iter)
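
To get a feel for what the scheduler above does to the learning rate, here is a minimal sketch of the closed-form cosine-annealing rule, using the same values as the cell above (`lr=0.1`, `eta_min=0.01`, `T_max=n_iter=5000`). The function `cosine_annealing_lr` is a hypothetical helper written for illustration; it is not the PyTorch implementation, only the formula that the schedule follows when nothing else modifies the learning rate.

```python
import math

def cosine_annealing_lr(step, base_lr=0.1, eta_min=0.01, t_max=5000):
    """Learning rate after `step` scheduler updates (closed-form cosine annealing)."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)

# The rate starts at base_lr, decays slowly at first, faster in the middle,
# and flattens out again as it approaches eta_min at step t_max.
for step in (0, 1250, 2500, 3750, 5000):
    print(step, round(cosine_annealing_lr(step), 4))
```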

### ♻️ Training and Testing the network


Once all the parameters are specified, you start the training process. 
This involves presenting each input pattern to the network, propagating it all the way until an output is produced, comparing the output with the desired target, computing the error, backpropagating the error, and applying the learning rule. 
This process is repeated as needed. 

In general, you should train the network for many iterations: anywhere from a few hundred to millions, depending on the dataset. 
Gradually, the network will begin to show improved and stable performance. 
Performance of the network is measured according to the task: for our classification task we measure accuracy, while for a regression task we can measure the mean squared error. 

You can either stop the training process after a certain number of iterations have elapsed, or after the performance has saturated (early-stopping).
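
The early-stopping rule can be sketched in a few lines. The `EarlyStopping` class below, together with its `patience` and `min_delta` parameters and the list of loss values, is a hypothetical illustration and not part of this notebook: it assumes you periodically measure a loss on held-out data and stop once it has failed to improve for `patience` consecutive checks.

```python
class EarlyStopping:
    """Hypothetical helper: signal a stop when the monitored loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # checks without improvement we tolerate
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")      # best loss seen so far
        self.bad_checks = 0           # consecutive checks without improvement

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses, one per evaluation round
validation_losses = [0.90, 0.60, 0.45, 0.44, 0.46, 0.45, 0.47, 0.43]

stopper = EarlyStopping(patience=3)
stopped_at = None
for check, loss in enumerate(validation_losses):
    if stopper.should_stop(loss):
        stopped_at = check
        break

print(stopped_at, stopper.best)
```

Note that training stops before the final value of 0.43 is ever seen: a small `patience` saves time, but it can also stop too early.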

Once the network has been trained, it is time to test it. 
There are several ways of doing this. 
Perhaps the easiest is to turn learning off and then see the outputs produced by the network for each input in the data set. 
When a trained network is going to be used in a *deployed* application, all you have to do is save the weights of all interconnections in the network into a file. 
The trained network can then be recreated at anytime by reloading the weights.
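
In PyTorch, saving and reloading weights is usually done through the model's state dict. The sketch below uses a tiny `nn.Linear` layer as a stand-in for the trained network and a temporary file named `weights.pt`; both are assumptions made for illustration, and the same calls apply to any `nn.Module`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network
trained = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pt")  # assumed file name

    # Save only the learned parameters (the state dict), not the Python object
    torch.save(trained.state_dict(), path)

    # Recreate the same architecture, then load the saved parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch off training-specific behaviour before inference

# Both networks should now produce identical outputs for the same input
x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(trained(x), restored(x))
print(same_output)
```

Because only the parameters are stored, the code that defines the architecture must still be available when the weights are reloaded.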

**Note**: Instead of training-then-testing, there is another methodology: you can test-while-training, which encompasses other evaluation techniques like cross validation. 

In [None]:
def train(model, device, train_loader, optimizer, scheduler, n_iter):
    model.train()
    with tqdm(total=n_iter) as bar:
        for batch_idx, (data, target) in enumerate(train_loader):
            
            # Converting data to required format
            data, target = data.to(device), target.to(device)
            
            # Explicitly zeroing gradients left over from the previous iteration
            optimizer.zero_grad()
            
            # Forward Pass
            output = model(data)
            
            # Calculate Loss
            loss = F.nll_loss(output, target)
            
            # Backward Pass
            loss.backward()
            
            # Updating Weights
            optimizer.step()
            
            if batch_idx % n_log == 0:
                bar.update(n_log)
                bar.set_postfix({'Loss':  f"{loss.item():.4f}", 'Learning Rate': f"{scheduler.get_last_lr()[0]:.4f}"})
                if dry_run:
                    break
            
            # Changing Learning Rate
            scheduler.step()
            
            if batch_idx == n_iter-1:
                break

def test(model, device, test_loader):
    model.eval()
    
    test_loss = 0
    correct = 0
    
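    # Use a comma, not `and`: `a and b` evaluates to the tqdm object alone,
    # so torch.no_grad() would never actually be entered.
    with torch.no_grad(), tqdm(total=len(test_loader)) as bar: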
        for data, target in test_loader:
            
            # Converting data to required format
            data, target = data.to(device), target.to(device)
            
            # Forward Pass
            output = model(data)
            
            # Calculate Loss
            loss = F.nll_loss(output, target, reduction='sum').item()  
            
            # Get Prediction
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            
            correct   += pred.eq(target.view_as(pred)).sum().item()
            test_loss += loss
            
            bar.update(1)

    test_loss /= len(test_loader.dataset)
    correct   /= len(test_loader.dataset)

    print(f"Test set: Average loss: {test_loss:.4f}, Accuracy: {100. * correct:.2f}%")

In [None]:
train(model, device, train_loader, optimizer, scheduler, n_iter)

In [None]:
test(model, device, test_loader)

### 💡 Do post training analysis

Perhaps the most important step in using neural networks is the analysis one performs once a network has been trained. There is a whole host of analysis techniques; here we present a few of them.

In [None]:
batch = next(iter(test_loader))

fig = plt.figure(figsize=(8, 8))
ax = fig.subplots()

ax.imshow(np.transpose(torchvision.utils.make_grid(batch[0].to(device)[:16], padding=2, normalize=True).cpu(), (1,2,0)))
ax.set(xticks=[], yticks=[])
plt.show()
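
One common analysis beyond looking at sample images is a confusion matrix, which shows which classes get mistaken for which. The sketch below is a minimal pure-Python version run on made-up labels; `confusion_matrix` is a hypothetical helper, not part of this notebook. With the network above you could pass it `target.tolist()` and `pred.flatten().tolist()` collected inside `test`.

```python
def confusion_matrix(targets, predictions, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(targets, predictions):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up labels for a 3-class problem
targets     = [0, 1, 2, 2, 1, 0, 2]
predictions = [0, 2, 2, 2, 1, 0, 1]

matrix = confusion_matrix(targets, predictions, n_classes=3)
for row in matrix:
    print(row)

# Per-class accuracy: correct predictions on the diagonal divided by the row total
per_class_accuracy = [row[i] / sum(row) for i, row in enumerate(matrix)]
print(per_class_accuracy)
```

Off-diagonal entries point to the pairs of classes the network confuses, which a single accuracy number hides.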

## 🦉 Words of Wisdom


- You have to take into consideration the range of each input and output value. 
Remember that a sigmoid-like activation function squashes its input into the range 0.0 to 1.0. 
Thus, regardless of the size of each input value into a node, the output produced by that node lies between 0.0 and 1.0, and so do the values of all output nodes. 
If the task you are dealing with expects outputs between 0.0 and 1.0, then there is nothing to worry about. 
However, in most situations, you will need to *scale* the output values back to the values in the task domain. 
A similar caveat applies to ReLU, which maps every negative input to 0.0, so its outputs are never negative. 

- In practice, it is also a good idea to scale the input values from the domain into the 0.0 to 1.0 range (especially if most input values fall outside the -5.0 to 5.0 range). Thus, defining a data set for training almost always requires a collection of input-output pairs, as well as scaling and unscaling operations. Luckily, for the AND task, we do not need to do any scaling, but we will see several examples of this later.
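
The scaling and unscaling operations mentioned above can be sketched with simple min-max scaling. The helper names and the temperature values below are made up for illustration; other schemes, such as the mean and standard deviation normalization used for MNIST earlier, work just as well.

```python
def fit_min_max(values):
    """Return the (lowest, highest) value seen in the data."""
    return min(values), max(values)

def scale(value, lo, hi):
    """Map a value from the task domain [lo, hi] into [0.0, 1.0]."""
    return (value - lo) / (hi - lo)

def unscale(value, lo, hi):
    """Map a value from [0.0, 1.0] back into the task domain [lo, hi]."""
    return value * (hi - lo) + lo

# Made-up task-domain values, e.g. temperatures in degrees Celsius
temperatures = [-10.0, 0.0, 15.0, 30.0]
lo, hi = fit_min_max(temperatures)

scaled = [scale(t, lo, hi) for t in temperatures]
print(scaled)                 # [0.0, 0.25, 0.625, 1.0]

# A network output of 0.5 corresponds to 10.0 degrees in the task domain
print(unscale(0.5, lo, hi))   # 10.0
```

The `lo` and `hi` values should come from the training data and be reused unchanged at test time.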


The learning rate, EPSILON, and the momentum constant, MOMENTUM, are usually chosen between 0.0 and 1.0 and are critical to the overall training algorithm. 
The appropriate values of these constants are best determined by experimentation. 
Tolerance (which is also between 0.0 and 1.0) refers to the level of tolerance that is acceptable for determining correctness of the output. 
For example, if tolerance is set to 0.1, then an output value within 10% of the desired output is considered correct. 
Other training parameters generally exist to specify the reporting rate of the progress of the training, where to log such progress, etc. 
We will see specific examples of these as we start working with actual networks.
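
As a small illustration of the tolerance idea, the sketch below counts an output as correct when it lies within `tolerance` of the target. It assumes outputs and targets in the 0.0 to 1.0 range, so "within 10%" is read here as an absolute difference of at most 0.1; the helper name and the sample values are made up.

```python
def within_tolerance(output, target, tolerance=0.1):
    """True if the output is close enough to the target to count as correct."""
    return abs(output - target) <= tolerance

# Made-up network outputs and their desired targets
outputs = [0.93, 0.08, 0.85, 0.30]
targets = [1.0, 0.0, 1.0, 0.0]

correct = [within_tolerance(o, t) for o, t in zip(outputs, targets)]
accuracy = sum(correct) / len(correct)

print(correct)    # [True, True, False, False]
print(accuracy)   # 0.5
```

A larger tolerance accepts more outputs as correct, so reported accuracy always depends on the tolerance chosen.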


## 📚 Extra Study Material

- https://www.youtube.com/watch?v=aircAruvnKk

## 🙏 Acknowledgements

- SimpliLearn- What is a Neural Network? https://www.youtube.com/watch?v=bfmFfD2RIcg
- SimpliLearn- What is Deep Learning? https://www.youtube.com/watch?v=6M5VXKLf4D4
- https://jupyter.brynmawr.edu/services/public/dblank/BioCS115%20Computing%20through%20Biology/2016-Spring/Notebooks/Artificial_Neural_Networks.ipynb
- https://github.com/karpathy/micrograd