# Introduction

## The Perceptron: Forward Propagation

The Perceptron is a linear combination of inputs plus a bias, passed through a non-linear activation function $g$. It is defined as:

$$
\hat{y} = g\Biggl(\sum_{i=1}^{m} x_i w_i + w_0 \Biggr)
$$

where:

- $x_i$ represents the inputs, for $i = 1, \dots, m$;
    
- $w_i$ represents the weights, for $i = 1, \dots, m$;
    
- $w_0$ represents the bias of the neuron;

It can be implemented as follows:

In [2]:
# Import PyTorch, installing it first if it is missing
try:
    import torch
except ImportError:
    !pip3 install torch torchvision torchaudio
    import torch

Build model:

In [3]:
class Perceptron(torch.nn.Module):
    def __init__(self, x_dim):
        super().__init__()
        
        # Initialize weights randomly and the bias to zero
        self.w = torch.randn(x_dim, 1)
        self.b = torch.zeros(1)
        
    def forward(self, x):
        z = torch.matmul(x, self.w) + self.b  # Equivalent to xT*w + b
        g = torch.sigmoid(z)
        
        return g
        

For background on building standard models with PyTorch, see this discussion of the [init and forward methods](https://discuss.pytorch.org/t/beginner-should-relu-sigmoid-be-called-in-the-init-method/18689/5).

Forward:

In [4]:
b_dim = 1  # Number of samples
x_dim = 20  # Feature dimension

x = torch.randn(b_dim, x_dim) # Equivalent to xT in matrix notation

model = Perceptron(x.shape[-1]) # Auto-detected the input dimension, same as x_dim

y = model(x)

print(f"Neuron output: {y.item():.02}")

Neuron output: 0.95
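As a sanity check, the matrix form used in `forward` can be compared with the explicit sum from the Perceptron equation. The following is a minimal sketch: the size `m_inputs`, the seed, and the names `x_vec`, `w_vec` and `b0` are illustrative assumptions, independent of the `Perceptron` class above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

m_inputs = 5                    # number of inputs (illustrative)
x_vec = torch.randn(m_inputs)   # inputs x_1..x_m
w_vec = torch.randn(m_inputs)   # weights w_1..w_m
b0 = torch.zeros(1)             # bias w_0

# Explicit sum: g(sum_i x_i * w_i + w_0)
z_sum = sum(x_vec[i] * w_vec[i] for i in range(m_inputs)) + b0
y_sum = torch.sigmoid(z_sum)

# Matrix form: g(x^T w + w_0), the same operation as in Perceptron.forward
z_mat = torch.matmul(x_vec.unsqueeze(0), w_vec.unsqueeze(1)) + b0
y_mat = torch.sigmoid(z_mat)

# Both forms should agree up to floating-point rounding
print(torch.allclose(y_sum, y_mat.squeeze(0), atol=1e-6))
```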


## Single Layer

Perceptrons can be combined into a single hidden layer followed by an output layer, defined as:

Hidden:
$$
z_i = \sum_{j=1}^{m} x_{j} w_{ji}^{(1)} + w_{0,i}^{(1)}
$$

Output (with $d_1$ hidden units):
$$
\hat{y}_i = g\Biggl(\sum_{j=1}^{d_1} g(z_{j}) w_{ji}^{(2)} + w_{0,i}^{(2)} \Biggr)
$$
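The hidden and output equations can be written with explicit weight matrices. This is a minimal sketch, not the notebook's implementation: the sizes, the sigmoid activation, and the names `x_in`, `W1`, `b1`, `W2`, `b2` are all assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

m_in, d_hidden, n_out = 4, 3, 2       # illustrative sizes: inputs, hidden units, outputs

x_in = torch.randn(1, m_in)           # one sample with m_in features
W1 = torch.randn(m_in, d_hidden)      # hidden weights w_{ji}^{(1)}
b1 = torch.zeros(d_hidden)            # hidden biases  w_{0,i}^{(1)}
W2 = torch.randn(d_hidden, n_out)     # output weights w_{ji}^{(2)}
b2 = torch.zeros(n_out)               # output biases  w_{0,i}^{(2)}

g = torch.sigmoid                     # activation choice is an assumption

z_hidden = x_in @ W1 + b1             # hidden pre-activations z_i
y_hat = g(g(z_hidden) @ W2 + b2)      # network output

print(y_hat.shape)
```

Each column of `W1` holds the weights of one hidden neuron, so a single matrix product evaluates all the sums over $j$ at once.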


## Multiple Layers (deep)

Dense neural networks can be created by stacking multiple layers:

Hidden:
$$
z_{i}^{(k)} = \sum_{j=1}^{n^{(k-1)}} g\bigl(z_{j}^{(k-1)}\bigr) w_{j,i}^{(k)} + w_{0,i}^{(k)}
$$

Output:
$$
\hat{y}_i = g\Biggl(\sum_{j=1}^{n^{(k-1)}} g\bigl(z_{j}^{(k-1)}\bigr) w_{j,i}^{(k)} + w_{0,i}^{(k)} \Biggr)
$$

where:

- $g(z_{j}^{(k-1)})$: represents the output of neuron $j$ in the previous layer $k-1$;

- $w_{j,i}^{(k)}$: represents the weight connecting neuron $j$ in the previous layer to neuron $i$ in the current layer $k$;
    
- $w_{0,i}^{(k)}$: represents the bias of neuron $i$ in the current layer $k$;

- $n^{(k-1)}$: represents the number of neurons in the previous layer $k-1$.
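The layer-by-layer recursion can be sketched as a loop over weight matrices. The layer sizes, the use of the same sigmoid in every layer, and the names `layer_sizes`, `layer_params` and `a` are assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

layer_sizes = [20, 8, 8, 4]   # illustrative: input, two hidden layers, output
g = torch.sigmoid             # same activation in every layer, for simplicity

# One (weights, bias) pair per layer k
layer_params = [
    (torch.randn(n_prev, n_curr), torch.zeros(n_curr))
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:])
]

a = torch.randn(1, layer_sizes[0])   # input sample, playing the role of layer 0's output
for W_k, b_k in layer_params:
    z_k = a @ W_k + b_k              # pre-activations z_i^{(k)}
    a = g(z_k)                       # activations g(z_i^{(k)}), fed to layer k+1

y_hat_deep = a
print(y_hat_deep.shape)
```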


Building the model:

In [5]:
class MLP(torch.nn.Module):
    def __init__(self, x_dim, h_units):
        super().__init__()
        # Layers
        self.linear_hidden = torch.nn.Linear(x_dim, h_units)
        self.linear_output = torch.nn.Linear(h_units, 1)
        
        # Activations
        self.relu = torch.nn.ReLU()
        self.sigmoid = torch.nn.Sigmoid()
        
    def forward(self, x):
        # hidden layer
        zh = self.linear_hidden(x)
        gh = self.relu(zh)
        
        # output layer
        zo = self.linear_output(gh)
        go = self.sigmoid(zo)
        
        return go

Forward:

In [6]:
h_units = 8

# Model configuration: x_dim:h_units:o_dim -> 20:8:1
model = MLP(x.shape[-1], h_units)

y = model(x)

print(y)

tensor([[0.5607]], grad_fn=<SigmoidBackward0>)


## Quantify prediction

A loss function (also called an objective, cost, or empirical risk function) is used to quantify incorrect predictions.

The loss function quantifies how well or poorly your model is performing on the given task.

It measures the error or discrepancy between the predicted output of your model and the actual target values.

$$J(W) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(x^{(i)};W), y^{(i)})$$

where:

- $f(x^{(i)};W)$: represents the model prediction for sample $i$;
- $y^{(i)}$: represents the desired output for sample $i$;

As an example, the Mean Squared Error (MSE) loss is defined as:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_{i})^2$$
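The MSE formula can be computed by hand and compared with PyTorch's built-in `torch.nn.functional.mse_loss`. This is a small sketch; the prediction and target values are made up for illustration.

```python
import torch

y_pred = torch.tensor([0.9, 0.2, 0.6])   # illustrative predictions
y_true = torch.tensor([1.0, 0.0, 1.0])   # illustrative targets

# Direct translation of the formula: mean of squared differences
mse_manual = ((y_pred - y_true) ** 2).mean()

# Built-in equivalent (default reduction is the mean)
mse_builtin = torch.nn.functional.mse_loss(y_pred, y_true)

print(mse_manual.item(), mse_builtin.item())
```

Here the squared errors are 0.01, 0.04 and 0.16, so both values are approximately 0.07.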



## Loss optimization
Predictions are improved by adjusting the model's parameters with an optimization algorithm.

For a neural network, the optimization problem is defined as:

$$W^* = \arg\min_{W} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(f(x^{(i)};W), y^{(i)})$$


The formula represents the general loss optimization for neural networks:

- $W$: The parameters of the neural network that are being optimized to minimize the loss.
- $W^*$: The parameter values that achieve the minimum.
- $N$: The total number of data samples in the dataset.
- $x^{(i)}$: The input features of the $i$-th data sample.
- $f(x^{(i)};W)$: The prediction made by the neural network for input $x^{(i)}$ using parameters $W$.
- $y^{(i)}$: The target output for the $i$-th data sample.
- $\mathcal{L}(f(x^{(i)};W), y^{(i)})$: The per-sample loss, which measures the discrepancy between the predicted output and the target output.

The goal of the optimization is to minimize the average loss $J(W)$ by adjusting the parameters $W$ of the neural network.

The most widely used methods are gradient-based.

### Gradient

The gradient of a function with respect to its parameters is the vector of its partial derivatives. At a given point it points in the direction of steepest ascent of the function, and its magnitude gives the rate of increase in that direction. The negative gradient therefore points in the direction of steepest descent.

In loss optimization, the aim is to find the values of the model's parameters that minimize the loss function, so the parameters are moved along the negative gradient.

Gradient Descent (Algorithm)

1. Randomly initialize parameters
2. Loop until convergence:
3. &emsp; Compute gradient of the loss (backpropagation)
4. &emsp; Update weights based on gradient
5. Return weights
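The steps above can be sketched on a one-parameter problem. This is an illustrative example only: the loss $J(w) = (w - 3)^2$, the learning rate, the fixed (rather than random) starting point, and the stopping criterion are all assumptions. The update uses the standard rule $w \leftarrow w - \eta \, \partial J / \partial w$.

```python
def loss_grad(w):
    # Derivative of the illustrative loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w_gd = 0.0    # step 1: initial parameter (fixed here instead of random)
lr_gd = 0.1   # learning rate eta (assumed value)

for step in range(1000):             # step 2: loop until convergence
    g_w = loss_grad(w_gd)            # step 3: compute the gradient
    if abs(g_w) < 1e-8:              # convergence criterion (assumed)
        break
    w_gd = w_gd - lr_gd * g_w        # step 4: move against the gradient

print(w_gd)                          # step 5: close to the minimizer w = 3
```

In a neural network the gradient in step 3 comes from backpropagation instead of a hand-written derivative.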


### Backpropagation

The process of computing gradients for all the model's parameters is typically performed using an algorithm called backpropagation. 

Backpropagation efficiently computes the gradients of the loss function with respect to each parameter by applying the chain rule of calculus.

I believe the best way to understand backpropagation is to work through a simple example.

The example used here, taken from [2], computes the gradients of the equation below:
$$ r = w^2 $$

where
$$w = zv$$
$$v = u+y$$
$$u = x^2$$

thus
$$ r = z^2 (x^2 + y)^2 $$

Calculating local gradients (partial derivatives):

1. Partial derivative of $r$ with respect to $w$:
$
\frac{\partial r}{\partial w} = \frac{\partial w^2}{\partial w} = 2w
$

2. Partial derivatives of $w$ with respect to $z$ and $v$:
$
\frac{\partial w}{\partial z} = \frac{\partial (zv)}{\partial z} = v, \qquad
\frac{\partial w}{\partial v} = \frac{\partial (zv)}{\partial v} = z
$

3. Partial derivatives of $v$ with respect to $u$ and $y$:
$
\frac{\partial v}{\partial u} = \frac{\partial (u+y)}{\partial u} = 1, \qquad
\frac{\partial v}{\partial y} = \frac{\partial (u+y)}{\partial y} = 1
$

4. Partial derivative of $u$ with respect to $x$:
$
\frac{\partial u}{\partial x} = \frac{\partial x^2}{\partial x} = 2x
$

5. Partial derivative of $r$ with respect to $z$:
$
\frac{\partial r}{\partial z} = \frac{\partial r}{\partial w} \frac{\partial w}{\partial z} = 2wv
$

6. Partial derivative of $r$ with respect to $y$:
$
\frac{\partial r}{\partial y} = \frac{\partial r}{\partial w} \frac{\partial w}{\partial v} \frac{\partial v}{\partial y} = 2wz
$

7. Partial derivative of $r$ with respect to $x$:
$
\frac{\partial r}{\partial x} = \frac{\partial r}{\partial w} \frac{\partial w}{\partial v} \frac{\partial v}{\partial u} \frac{\partial u}{\partial x} = 2w \cdot z \cdot 1 \cdot 2x = 4wzx
$


Hence

In [7]:
# input
x = 1; y = 2; z = 4

# nodes
u = x**2
v = u+y
w = z*v

# forward
r = w**2
print(f"Forward: {r}")

# backward
drdz = 2*w*v
drdy = 2*w*z
drdx = 2*w*z*2*x

print(f"Gradients: drdz={drdz}, drdy={drdy}, drdx={drdx}")

Forward: 144
Gradients: drdz=72, drdy=96, drdx=192
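The analytic gradients can be cross-checked numerically with central finite differences on the composed expression $r = z^2 (x^2 + y)^2$. This is a sketch; the helper `r_fn`, the step size `eps`, and the names `x0`, `y0`, `z0` are assumptions introduced for illustration.

```python
def r_fn(x, y, z):
    # r = z^2 (x^2 + y)^2, the composed expression from the example
    return (z * (x**2 + y)) ** 2

x0, y0, z0 = 1.0, 2.0, 4.0   # same input values as in the cell above
eps = 1e-6                   # finite-difference step size (assumed)

# Central differences: (f(p + eps) - f(p - eps)) / (2 * eps)
drdx_num = (r_fn(x0 + eps, y0, z0) - r_fn(x0 - eps, y0, z0)) / (2 * eps)
drdy_num = (r_fn(x0, y0 + eps, z0) - r_fn(x0, y0 - eps, z0)) / (2 * eps)
drdz_num = (r_fn(x0, y0, z0 + eps) - r_fn(x0, y0, z0 - eps)) / (2 * eps)

print(f"Numerical gradients: drdz={drdz_num:.3f}, drdy={drdy_num:.3f}, drdx={drdx_num:.3f}")
```

The numerical values should agree with the chain-rule results (72, 96 and 192) up to a small approximation error.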


For small problems, calculating the gradient by hand is simple, but for large neural models, deriving the partial derivative of each node and applying the chain rule is challenging and error-prone.

This is where frameworks like PyTorch shine: they compute these gradients automatically.

Using PyTorch ([autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)):

In [8]:
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

# forward pass
u = x**2
v = u+y
w = z*v
r = w**2
print(f'r = {r}')

# backward pass
r.backward()  # autograd computes all the gradients in this single call

print(f'dr/dx = {x.grad}')
print(f'dr/dy = {y.grad}')
print(f'dr/dz = {z.grad}')

r = 144.0


dr/dx = 192.0
dr/dy = 96.0
dr/dz = 72.0


Main pieces:
- Loss: Tells us how good or bad the predictions are compared with the targets.
- Gradient: Tells us how the weights should change to improve the loss; to decrease it, move along the negative gradient.
- Updating weights: Change the weights based on their previous values and the gradient of the loss.
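These three pieces can be seen in a single manual update step, without an optimizer object. This is a minimal sketch: the one-weight "model", the target value, and the learning rate `lr_manual` are assumptions chosen so the numbers are easy to follow.

```python
import torch

w_param = torch.tensor(2.0, requires_grad=True)   # illustrative single weight
target_val = torch.tensor(1.0)                    # illustrative target
lr_manual = 0.1                                   # learning rate (assumed)

# Loss: squared error of a trivial "model" whose output is w_param itself
loss_before = (w_param - target_val) ** 2

# Gradient: autograd computes d(loss)/d(w_param) = 2 * (w_param - target_val)
loss_before.backward()

# Update: step against the gradient, outside of autograd tracking
with torch.no_grad():
    w_param -= lr_manual * w_param.grad
w_param.grad.zero_()   # clear the gradient before the next step

loss_after = (w_param - target_val) ** 2
print(loss_before.item(), loss_after.item())
```

With these values the gradient is 2, the weight moves from 2.0 to about 1.8, and the loss drops from 1.0 to about 0.64. `torch.optim.SGD` without momentum performs the same kind of update for every parameter.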

Putting everything together:

In [11]:
# Random input data, for illustration only
n_samples = 10

data = torch.rand(n_samples, x_dim)
labels = torch.rand(n_samples, 1)

samples = list(zip(data, labels))

epochs = 2

# Loss function and optimizer
criterion = torch.nn.MSELoss()
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

for epoch in range(epochs):
    
    running_loss = 0.0
    for i, (feature, target) in enumerate(samples):
        # zero the parameter gradients
        optim.zero_grad()
        
        # forward
        predictions = model(feature)

        # calculate loss
        loss = criterion(predictions, target)

        # compute gradients (backpropagation)
        loss.backward()

        # update parameters
        optim.step()
        
        # print the loss accumulated so far in this epoch
        running_loss += loss.item()
        print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss:.3f}')
        
print('Finished Training')

[1,     1] loss: 0.107
[1,     2] loss: 0.108
[1,     3] loss: 0.121
[1,     4] loss: 0.124
[1,     5] loss: 0.199
[1,     6] loss: 0.322
[1,     7] loss: 0.530
[1,     8] loss: 0.571
[1,     9] loss: 0.608
[1,    10] loss: 0.738
[2,     1] loss: 0.088
[2,     2] loss: 0.092
[2,     3] loss: 0.096
[2,     4] loss: 0.096
[2,     5] loss: 0.140
[2,     6] loss: 0.222
[2,     7] loss: 0.371
[2,     8] loss: 0.388
[2,     9] loss: 0.447
[2,    10] loss: 0.535
Finished Training


This code does not learn anything meaningful, since both the data and the labels are random; it is only an illustration.
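For contrast, the same loop does reduce the loss when the targets actually depend on the inputs. The sketch below is an assumption-laden illustration, not part of the original notebook: it uses a plain linear model, synthetic targets generated from hypothetical weights `true_w`, and full-batch updates.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

n_syn, d_syn = 64, 3                             # illustrative dataset size and feature dimension
X_syn = torch.rand(n_syn, d_syn)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])    # hypothetical ground-truth weights
y_syn = X_syn @ true_w                           # targets that depend on the inputs

lin_model = torch.nn.Linear(d_syn, 1)
mse = torch.nn.MSELoss()
sgd = torch.optim.SGD(lin_model.parameters(), lr=0.1)

losses = []
for _ in range(200):
    sgd.zero_grad()                              # reset gradients
    loss_syn = mse(lin_model(X_syn), y_syn)      # forward pass and loss
    loss_syn.backward()                          # backpropagation
    sgd.step()                                   # parameter update
    losses.append(loss_syn.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Because the targets are a deterministic function of the inputs, the last loss should be clearly lower than the first one.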

References:

[1] [Lecture 1 - Intro to Deep Learning](https://www.youtube.com/watch?v=QDX-1M5Nj7s&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=3)

[2] [Backpropagation - Chain Rule and PyTorch in action](https://towardsdatascience.com/backpropagation-chain-rule-and-pytorch-in-action-f3fb9dda3a7d)

