<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/math-and-architectures-of-deep-learning/02-introduction-to-vectors-calculus/03_cat_brain_linear_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Cat brain linear model

In machine learning, we identify the input and output variables
pertaining to the problem at hand and cast the problem as
generating outputs from input variables. All the inputs are
represented together by the vector $\vec{x}$. Sometimes there
are multiple outputs, sometimes a single output. Accordingly,
we have an output vector $\vec{y}$ or an output scalar $y$.
Let us denote the function that generates the output from the input
vector as $f$, i.e., $y = f\left(\vec{x}\right)$.

In real-life problems, we do not know $f$. The crux of machine
learning is to estimate $f$ from a set of observed inputs
$\vec{x}_{i}$ and their corresponding outputs $y_{i}$.
Each observation can be depicted as a pair $\langle\vec{x}_{i}, y_{i}\rangle$.
We model the unknown function $f$ with a known, parameterized function $\phi$.
Although the form of $\phi$
is known, its parameter values are unknown. These parameter values
are "learnt" via training: we estimate the parameter
values such that the overall error on the observations is minimized.

If $\vec{w}, b$ denote the current set of parameters (weights, bias), then the model will
output $\phi\left(\vec{x}_{i}, \vec{w}, b\right)$ on the observed input $\vec{x}_{i}$.
Thus the squared error on this $i^{th}$ observation is $e_{i}^{2}=\left(\phi\left(\vec{x}_{i}, \vec{w}, b\right) - y_{i}\right)^{2}$.
We can batch up several observations and add up their squared errors into a batch error
$L = \sum_{i=0}^{N} e_{i}^{2}$.
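
To make the batch error concrete, here is a minimal sketch that evaluates it for a handful of observations. The helper name `batch_error`, the stand-in model `phi`, and all numbers are assumptions chosen for illustration; they are not part of the original notebook.

```python
import numpy as np

def batch_error(phi, X, y, w, b):
    """Sum of squared errors of the model phi over all observations."""
    predictions = np.array([phi(x_i, w, b) for x_i in X])
    errors = predictions - y            # e_i for each observation
    return float(np.sum(errors ** 2))   # L = sum_i e_i^2

# Hypothetical stand-in for phi: a weighted sum of the inputs plus a bias.
def phi(x, w, b):
    return float(np.dot(w, x) + b)

X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])  # three made-up inputs
y = np.array([1.0, 0.0, 3.0])                       # made-up observed outputs
w, b = np.array([1.0, 0.5]), -1.0                   # current parameter guess

L = batch_error(phi, X, y, w, b)
print(L)  # per-observation errors are 0, -0.5 and -2, so L = 4.25
```

Any other parameterized `phi` with the same call signature could be dropped in without changing `batch_error`.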

The batch error is a function of the parameters $\vec{w}, b$.
The question is: how do we adjust $\vec{w}, b$ so that the error $L$ decreases?
We know that a function's value increases fastest when we move along the direction
of its gradient with respect to the parameters, so it decreases fastest in the
opposite direction. Hence, we adjust the parameters
$\vec{w}, b$ as
$\begin{bmatrix}
\vec{w}\\b
\end{bmatrix} = \begin{bmatrix}
\vec{w}\\b
\end{bmatrix} - \mu \nabla_{\vec{w}, b}L\left(\vec{w}, b\right)$,
where $\mu$ is a small positive step size (the learning rate).
For a sufficiently small $\mu$, each adjustment reduces the error. Starting from a random set of parameter values
and repeating this update a "sufficiently" large number of times yields the desired model.
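
The update rule can be sketched on a toy problem. The quadratic `loss` below is a made-up stand-in for $L\left(\vec{w}, b\right)$, and the gradient is estimated by central differences so that the loop does not depend on any particular model; the step size and iteration count are likewise illustrative assumptions.

```python
import numpy as np

def loss(params):
    # Made-up loss with its minimum at (w, b) = (3, -1); stands in for L(w, b).
    w, b = params
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def numerical_gradient(f, params, eps=1e-6):
    """Central-difference estimate of the gradient of f at params."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        grad[k] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
params = rng.normal(size=2)   # random starting point for (w, b)
mu = 0.1                      # step size (learning rate)

history = [loss(params)]
for _ in range(200):
    # Move against the gradient, exactly as in the update rule above.
    params = params - mu * numerical_gradient(loss, params)
    history.append(loss(params))

print(params)                   # close to [3, -1]
print(history[0], history[-1])  # the loss shrinks towards 0
```

With a much larger `mu` the same loop can overshoot and diverge, which is why the step size has to be "sufficiently" small.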

A simple and popular model $\phi$ is the linear function (the predicted value is
the dot product between input and parameters, plus bias):
$\tilde{y}_{i} = \phi\left(\vec{x}_{i}, \vec{w}, b\right) = \vec{w}^{T}\vec{x}_{i} + b
= \sum_{j}w_{j}x_{ij} + b$, where $x_{ij}$ is the $j^{th}$ component of $\vec{x}_{i}$.
This is the model architecture used in the example below.
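
As a quick illustration of the linear model, the sketch below evaluates it for one input in both the dot-product and the explicit-sum form, and then for a small batch. The weights, bias, and inputs are arbitrary values picked for the example.

```python
import numpy as np

w = np.array([2.0, -1.0])   # illustrative weights
b = 0.5                     # illustrative bias
x = np.array([3.0, 4.0])    # a single input vector

y_dot = w @ x + b                                     # w^T x + b
y_sum = sum(w[j] * x[j] for j in range(len(w))) + b   # sum_j w_j x_j + b

# For a batch, stack the inputs as rows: one matrix product gives all predictions.
X = np.array([[3.0, 4.0], [1.0, 1.0]])
y_batch = X @ w + b

print(y_dot, y_sum)   # both are 2.5
print(y_batch)        # [2.5 1.5]
```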

Thus
\begin{align*}
L &= \sum_{i=0}^{N} e_{i}^{2}\\
  &= \sum_{i=0}^{N}\left(\vec{w}^{T}\vec{x}_{i} + b - y_{i}\right)^{2}\\
\nabla_{\vec{w}, b}L &\propto \sum_{i=0}^{N}\left(\vec{w}^{T}\vec{x}_{i} + b - y_{i}\right)\begin{bmatrix}\vec{x}_{i}\\1\end{bmatrix} \\
                     &\propto \sum_{i=0}^{N}\left(\tilde{y}_{i} - y_{i}\right)\begin{bmatrix}\vec{x}_{i}\\1\end{bmatrix}
\end{align*}
Here the trailing $1$ yields the gradient component for the bias $b$.
Our initial implementation will simply mimic this formula.
For more complicated models $\phi$ (with millions of parameters and non-linearities),
deriving closed-form gradients like this by hand is impractical.
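
A minimal NumPy sketch of gradient descent that uses the hand-derived gradient of the linear model directly. The synthetic data, random seed, learning rate, and step count are assumptions made for illustration and are not taken from the original notebook; the result is compared with NumPy's least-squares solver as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2*x1 - 3*x2 + 0.5 plus small noise.
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = X @ np.array([2.0, -3.0]) + 0.5 + 0.01 * rng.normal(size=50)

w = rng.normal(size=2)   # random initial weights
b = 0.0                  # initial bias
mu = 0.005               # learning rate

for _ in range(5000):
    residual = X @ w + b - y           # y_tilde_i - y_i for every observation
    grad_w = 2.0 * X.T @ residual      # 2 * sum_i residual_i * x_i
    grad_b = 2.0 * residual.sum()      # 2 * sum_i residual_i
    w = w - mu * grad_w
    b = b - mu * grad_b

# Reference: least-squares solution on the inputs augmented with a column of ones.
X_aug = np.column_stack([X, np.ones(len(X))])
reference, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(np.append(w, b))
print(reference)
```

The factor of 2 comes from differentiating the square; it could equally be absorbed into the learning rate, which is why the formula above is stated only up to proportionality.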

The next example, based on NumPy and PyTorch, relies on PyTorch's
autograd (automatic gradient computation), which does not have this limitation.
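
For the linear model, where the gradient is known in closed form, autograd can be checked against the hand-derived formula. The sketch below does that on made-up random data; the tensor shapes and seed are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Made-up data: 8 observations with 2 features each.
X = torch.rand(8, 2)
y = torch.rand(8)

w = torch.randn(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass and sum-of-squares loss.
y_pred = X @ w + b
loss = ((y_pred - y) ** 2).sum()
loss.backward()   # autograd fills w.grad and b.grad

# Hand-derived gradients of the same loss.
with torch.no_grad():
    residual = X @ w + b - y
    manual_grad_w = 2.0 * (X.T @ residual)
    manual_grad_b = 2.0 * residual.sum().reshape(1)

print(torch.allclose(w.grad, manual_grad_w, atol=1e-5))
print(torch.allclose(b.grad, manual_grad_b, atol=1e-5))
```

For models without a convenient closed form, only the `loss.backward()` half of this comparison is available, which is the point of using autograd.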

##Setup

In [8]:
import numpy as np
import torch
import matplotlib.pyplot as plt

##Linear Model

Let's first solve the cat-brain problem directly via the pseudo-inverse,
and then train the same linear model by gradient descent using autograd.

As expected, the model parameters
converge to a solution close to that obtained by the pseudo-inverse technique.

In [9]:
torch.manual_seed(42)

X = torch.tensor([
  [0.11, 0.09], [0.01, 0.02], [0.98, 0.91],
  [0.12, 0.21], [0.98, 0.99], [0.85, 0.87],
  [0.03, 0.14], [0.55, 0.45], [0.49, 0.51], 
  [0.99, 0.01], [0.02, 0.89], [0.31, 0.47],
  [0.55, 0.29], [0.87, 0.76], [0.63, 0.24]
], dtype=torch.float)

# Append a column of ones so that the bias is learned as the last weight
X = torch.column_stack((X, torch.ones(X.shape[0])))

y = torch.tensor([
  -0.8, -0.97, 0.89, -0.67, 0.97, 0.72,
  -0.83, 0.00, 0.00, 0.00, -0.09, -0.22, 
  -0.16, 0.63, 0.37
], dtype=torch.float)

# Compute the least-squares solution via the pseudo-inverse: (X^T X)^{-1} X^T y
solution_pseudo = torch.linalg.inv(X.T @ X) @ X.T @ y
print(f"Solution via pseudo inverse: {solution_pseudo}")

y = y.reshape((-1, 1))

Solution via pseudo inverse: tensor([ 1.0766,  0.8976, -0.9582])
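
As a cross-check, the same least-squares problem can be solved without forming the inverse of $X^{T}X$ explicitly. The sketch below repeats the 15 observations from the cell above in NumPy and uses `np.linalg.pinv` and `np.linalg.lstsq`, both SVD-based routines; using NumPy in double precision here is a choice made for this illustration, not something the original notebook does.

```python
import numpy as np

# The same 15 observations as in the cell above.
X = np.array([
  [0.11, 0.09], [0.01, 0.02], [0.98, 0.91],
  [0.12, 0.21], [0.98, 0.99], [0.85, 0.87],
  [0.03, 0.14], [0.55, 0.45], [0.49, 0.51],
  [0.99, 0.01], [0.02, 0.89], [0.31, 0.47],
  [0.55, 0.29], [0.87, 0.76], [0.63, 0.24]
])
y = np.array([
  -0.8, -0.97, 0.89, -0.67, 0.97, 0.72,
  -0.83, 0.00, 0.00, 0.00, -0.09, -0.22,
  -0.16, 0.63, 0.37
])

# Append the column of ones for the bias term.
X_aug = np.column_stack([X, np.ones(len(X))])

solution_pinv = np.linalg.pinv(X_aug) @ y
solution_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(solution_pinv)
print(solution_lstsq)
```

Both results should agree with the pseudo-inverse solution printed above up to rounding. Explicitly inverting $X^{T}X$ is fine for a small, well-conditioned problem like this one, but it tends to lose accuracy when the columns of $X$ are nearly collinear.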


In [10]:
class LinearModel(torch.nn.Module):

  def __init__(self, num_features):
    super().__init__()
    self.w = torch.nn.Parameter(torch.randn(num_features, 1))

  def forward(self, X):
    y_pred = torch.mm(X, self.w)
    return y_pred

In [11]:
num_unknowns = 3
model = LinearModel(num_features=num_unknowns)

# Use PyTorch's MSE loss; reduction="sum" adds up the squared errors instead of averaging them
loss_fn = torch.nn.MSELoss(reduction="sum")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Train model iteratively
num_steps = 1000

for step in range(num_steps):
  # linear model
  y_pred = model(X)

  # calculate loss
  loss = loss_fn(y_pred, y)

  # Periodically print the loss to check how we are doing
  if step % 100 == 0:
    print(f"Loss at step {step, loss}")

  # zero out all partial derivatives
  optimizer.zero_grad()

  # Compute partial derivatives via autograd
  loss.backward()

  # Update parameters from gradient computed in the backward() step
  optimizer.step()

solution_gd = torch.squeeze(model.w.data)
print(f"\n\nThe solution via gradient descent is {solution_gd}")

Loss at step (0, tensor(6.6479, grad_fn=<MseLossBackward0>))
Loss at step (100, tensor(0.2207, grad_fn=<MseLossBackward0>))
Loss at step (200, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (300, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (400, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (500, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (600, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (700, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (800, tensor(0.2172, grad_fn=<MseLossBackward0>))
Loss at step (900, tensor(0.2172, grad_fn=<MseLossBackward0>))


The solution via gradient descent is tensor([ 1.0766,  0.8976, -0.9582])
