<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Math_450_Notebook_3_(Matrix_Vector_multiplication).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Lecture 3 of Math 450

Overall goal of our class: learn machine learning with the `torch` package.
- Build our own neural net using Torch's LEGO-like blocks.
- Write torch-like code from scratch.
- Write our own optimizer.

Today's goal:
- Use matrix-vector multiplication in `torch` to build a network.
- The dense (fully connected) layer `nn.Linear` in the `nn` module.

This is a worksheet version of the notebook. We can follow along during the coding lecture and then download the annotated version from our GitHub repository.

Download this notebook at: 

Reference: a neural network implemented from scratch in NumPy: https://www.kaggle.com/scaomath/simple-mnist-numpy-from-scratch

In [None]:
import numpy as np
import torch

# Gradient of a single sample

We have the following neural network.

<img src="https://sites.wustl.edu/scao/files/2021/02/3Lnn.png" alt="drawing" width="800"/>

In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)

In [None]:
dtype = torch.float
device = torch.device("cpu")

# first go to Runtime->Runtime type->Select GPU as accelerator 
# then uncomment this to run on GPU
# device = torch.device("cuda:0") 

In [None]:
# N is the sample size (or current mini-batch size); 
# D_in is input dimension;
# N_H is hidden dimension; 
# D_out is output dimension.
N, D_in, N_H, D_out = 1, 10, 5, 3

In [None]:
# Create random Tensors to hold the input and the output.
# Tensors are created with requires_grad=False by default, which indicates that
# we do not need to compute gradients with respect to them during the backward pass.
torch.manual_seed(42)
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

tensor([[ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229, -0.1863,  2.2082, -0.6380,
          0.4617,  0.2674]])


In [None]:
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
# Because our data has zero mean, there is no need to include a bias.
torch.manual_seed(42)
w1 = torch.randn(D_in, N_H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(N_H, D_out, device=device, dtype=dtype, requires_grad=True)
# Here w1 and w2 are stored as the transposes of W^(1) and W^(2) in the forward-pass formulas.

## Forward pass

$$\begin{aligned}
\mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\
\mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\
h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = \mathbf{z}^{(3)}
\end{aligned}
$$

In [None]:
# code here
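
The forward pass above can be sketched as follows. This is one possible way to fill in the worksheet cell, not the official solution: the activation $f$ is not specified in the notebook, so ReLU is assumed here, and the biases are dropped as the earlier comment suggests. The tensors are re-created so the block runs on its own.

```python
import torch

torch.manual_seed(42)

# Same sizes as in the notebook: one sample, 10 inputs, 5 hidden units, 3 outputs.
N, D_in, N_H, D_out = 1, 10, 5, 3
x = torch.randn(N, D_in)
w1 = torch.randn(D_in, N_H, requires_grad=True)
w2 = torch.randn(N_H, D_out, requires_grad=True)

# w1 and w2 are stored as transposes, so the row-vector sample x
# multiplies them from the left.
z2 = x @ w1            # hidden pre-activation, shape (N, N_H)
a2 = torch.relu(z2)    # ASSUMPTION: f is ReLU (the notebook does not fix f)
z3 = a2 @ w2           # output pre-activation, shape (N, D_out)
h = z3                 # identity output layer: a^(3) = z^(3)

print(h.shape)
```

Because `w1` and `w2` have `requires_grad=True`, `h` carries a computation graph that `backward()` can use later.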

## Actual data in batch

In the actual implementation, the data normally comes in batches, i.e., as a matrix. For example, the input is a matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples in a batch and each row represents a sample $\mathbf{x} \in \mathbb{R}^{1\times d}$. The weight matrix $W$ is actually formulated as:
$$
W = \left(
\begin{array}{cccc}| & | & | & | \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_m \\
| & | & | & |
\end{array}\right),
$$
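if the output dimension of the layer of interest is $m$ (each column $\mathbf{w}_i \in \mathbb{R}^{d}$ holds the weights of one output perceptron, so this column layout corresponds to the transpose $(W^{(0)})^{\top}$ of the $W^{(0)} \in \mathbb{R}^{m\times d}$ used below). The vectorized formulation is, for example, from the input (layer 0, dimension $d$) to layer 1 (dimension $m$)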
$$
A^{(1)} = X (W^{(0)})^{\top} + B
$$
where $X \in \mathbb{R}^{N \times d}$, $W^{(0)} \in \mathbb{R}^{m\times d}$ (input from $d$ perceptrons, output from $m$ perceptrons), $B$ is a matrix with each row being the same $\mathbf{b} \in \mathbb{R}^{1\times m}$ (layer 1 has $m$ perceptrons and has $m$ biases if applicable).
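
To make the batch formula concrete, here is a minimal sketch that compares $X (W^{(0)})^{\top} + B$ with the sample-by-sample product $W^{(0)}\mathbf{x} + \mathbf{b}$. The sizes `N, d, m` and the random values are illustrative assumptions; the matrix $B$ is never built explicitly because broadcasting adds `b` to every row.

```python
import torch

torch.manual_seed(0)

N, d, m = 4, 10, 5          # illustrative sizes, not from the notebook
X = torch.randn(N, d)       # each row is one sample
W0 = torch.randn(m, d)      # maps d inputs to m outputs
b = torch.randn(m)          # one bias per output perceptron

# Batch version: broadcasting adds b to every row, which plays the role of B.
A1 = X @ W0.T + b           # shape (N, m)

# Sample-by-sample version: apply W0 x + b to each row of X.
rows = torch.stack([W0 @ X[i] + b for i in range(N)])

print(A1.shape)
print(torch.allclose(A1, rows, atol=1e-5))
```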

## Torch's nn module

We will demo this batch-based operation using `torch`'s neural network module `nn`. `nn.Linear` applies an (affine) linear transformation to the incoming data:
$$
Y = X W^{\top} + \mathbf{b}
$$

Reference: https://pytorch.org/docs/stable/nn.html

In [None]:
import torch.nn as nn

In [None]:
layer = nn.Linear(4, 3) # Wx+b transforms (4,1) vector to (3,1) vector
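
As a quick check of the formula $Y = X W^{\top} + \mathbf{b}$, the sketch below inspects the parameters of an `nn.Linear(4, 3)` layer and compares its output with the manual product. The batch size of 8 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

layer = nn.Linear(4, 3)      # weight has shape (3, 4), bias has shape (3,)
X = torch.randn(8, 4)        # a batch of 8 samples, each of dimension 4

Y = layer(X)                                  # shape (8, 3)
manual = X @ layer.weight.T + layer.bias      # same formula written by hand

print(layer.weight.shape, layer.bias.shape, Y.shape)
print(torch.allclose(Y, manual, atol=1e-5))
```

Note that the weight is stored with shape `(out_features, in_features)`, which is why the transpose appears in the formula.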

## Gradient in Torch: autograd

In [None]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [None]:
Q = 3*a**3 - b**2

In [None]:
L = Q.sum()

In [None]:
L.backward()
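
After `L.backward()`, the gradients are stored in `a.grad` and `b.grad`. Since $L = \sum_i (3a_i^3 - b_i^2)$, we expect $\partial L/\partial a_i = 9a_i^2$ and $\partial L/\partial b_i = -2b_i$. The following self-contained sketch repeats the cells above and compares autograd's result with these hand-derived formulas.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3 * a**3 - b**2
L = Q.sum()
L.backward()

# Hand-derived gradients: dL/da = 9 a^2 = [36, 81], dL/db = -2 b = [-12, -8].
expected_a = 9 * a.detach()**2
expected_b = -2 * b.detach()

print(a.grad, b.grad)
print(torch.allclose(a.grad, expected_a), torch.allclose(b.grad, expected_b))
```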