<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Math_450_Notebook_3_(Matrix_Vector_multiplication).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Lecture 3 of Math 450

Overall goal of our class: learn machine learning with the `torch` package.
- Build our own neural net using Torch's LEGO-like blocks.
- Write torch-like code from scratch.
- Write our own optimizer.

Today's goal:
- Use matrix vector multiplication in `torch` to build a network.
- `Dense` layer in `nn` module.

This is a worksheet version of the notebook. We can follow along during the coding lecture and then download the annotated version from our GitHub repository.

Download this notebook at: https://github.com/scaomath/wustl-math450/blob/main/Lectures/Math_450_Notebook_3_(Matrix_Vector_multiplication).ipynb

Reference: a neural network implemented from scratch in NumPy: https://www.kaggle.com/scaomath/simple-mnist-numpy-from-scratch

In [None]:
import numpy as np
import torch

# Gradient of a single sample

We have the following neural network.

<img src="https://sites.wustl.edu/scao/files/2021/02/3Lnn.png" alt="drawing" width="500"/>

## Reproducibility
Fixing the random number generation seed.

In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)

In [None]:
torch.manual_seed(2) 
# in a notebook (cell mode, not a single .py file), we have to call
# torch.manual_seed(SEED) in every cell whose output should be reproducible
torch.randn((5,))

tensor([ 0.3923, -0.2236, -0.3195, -1.2050,  1.0445])
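To see why the seed has to be set again, here is a minimal sketch (the variable names `first`, `second`, and `third` are only illustrative): each call to `torch.randn` advances the generator, so only a re-seeded call reproduces the first draw.

```python
import torch

torch.manual_seed(2)
first = torch.randn(5)    # first draw after seeding

second = torch.randn(5)   # no re-seed: the generator state has advanced

torch.manual_seed(2)
third = torch.randn(5)    # re-seeded: reproduces the first draw

print(torch.equal(first, third))   # True
print(torch.equal(first, second))  # False
```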

In [None]:
dtype = torch.float # single-precision float number
device = torch.device("cpu")

# first go to Runtime->Runtime type->Select GPU as accelerator 
# then uncomment this to run on GPU
# device = torch.device("cuda:0") 
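As an alternative to commenting and uncommenting the line above, a common pattern is to pick the device at run time. This is only a sketch; it assumes that falling back to the CPU is acceptable when no GPU is visible.

```python
import torch

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

t = torch.zeros(2, 3, device=device)  # tensor created directly on that device
print(device, t.device)
```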

In [None]:
# N is the sample size (or current mini-batch size); 
# D_in is input dimension;
# N_H is hidden dimension; 
# D_out is output dimension.
N, D_in, N_H, D_out = 1, 10, 5, 3

In [None]:
# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
# x: each row represents one sample, so a single sample is a row vector
torch.manual_seed(42)
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
print(x, '\n' , y)

tensor([[ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229, -0.1863,  2.2082, -0.6380,
          0.4617,  0.2674]]) 
 tensor([[0.5349, 0.8094, 1.1103]])


In [None]:
print(x.size())

torch.Size([1, 10])


In [None]:
# why use rows as samples?
X = torch.randn((5, 7)) # 5 samples, 7 features (input_dim)
print(X)

tensor([[-1.1109,  0.0915, -2.3169, -0.2168, -1.3847, -0.8712, -0.2234],
        [ 1.7174, -0.5920, -0.0631, -0.8286,  0.3309, -1.5576,  0.9956],
        [-0.8798, -0.6011,  1.3123,  0.6872, -1.0892, -0.4459,  1.4451],
        [ 0.8564,  2.2181,  0.5232,  0.3466, -0.1973, -1.0546, -0.7718],
        [-0.1722,  0.5238,  0.0566,  0.4263,  0.5750, -0.6417, -2.2064]])


In [None]:
# row corresponds to axis 0
print(X[0]) # returns row 0 (the 1st row), single index, handy
print(X[:, 0]) # column 0 (not convenient to track column vectors)
print(X[..., 0]) # also column 0 (... stands for all leading axes)

tensor([-1.1109,  0.0915, -2.3169, -0.2168, -1.3847, -0.8712, -0.2234])
tensor([-1.1109,  1.7174, -0.8798,  0.8564, -0.1722])
tensor([-1.1109,  1.7174, -0.8798,  0.8564, -0.1722])
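A small sketch of what the row-as-sample layout buys us, using a fresh random `X` of the same shape (the values differ from the ones printed above): the number of samples is `len(X)`, iterating over `X` yields one sample at a time, and reducing over `dim=0` gives per-feature statistics.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 7)            # 5 samples, 7 features

print(len(X))                    # number of samples: 5
print(X[0].shape)                # one sample: torch.Size([7])

# Iterating over a 2D tensor walks along axis 0, i.e. sample by sample.
sample_shapes = [tuple(sample.shape) for sample in X]
print(sample_shapes)

# Reducing over axis 0 averages across samples, one value per feature.
feature_mean = X.mean(dim=0)
print(feature_mean.shape)        # torch.Size([7])
```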


In [None]:
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
# because our data/target has zero mean, there is no need to include bias
torch.manual_seed(42)
w1 = torch.randn(D_in, N_H, 
                 device=device, 
                 dtype=dtype, 
                 requires_grad=True)

w2 = torch.randn(N_H, D_out, 
                 device=device, 
                 dtype=dtype, 
                 requires_grad=True)
# here w1 and w2 are stored as the transposes of W^(1) and W^(2) in the formulas

In [None]:
print(w2)

tensor([[-0.5687,  1.2580, -1.5890],
        [-1.1208,  0.8423,  0.1744],
        [-2.1256,  0.9629,  0.7596],
        [ 0.7343, -0.6708,  2.7421],
        [ 0.5568, -0.8123,  1.1964]], requires_grad=True)


## Forward pass

$$\begin{aligned}
\mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\
\mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\
h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = \mathbf{z}^{(3)}
\end{aligned}
$$

$\mathbf{b}^{(1)}$ and $\mathbf{b}^{(2)}$ are zero vectors in our example.

In [None]:
# code here
z2 = x.mm(w1) # z2 is a row vector (1, 5) = (1, 10)*(10, 5)
print(z2)

tensor([[ 0.3988,  0.6874, -4.7359, -4.3096, -0.2963]], grad_fn=<MmBackward>)


In [None]:
# relu activation
a2 = z2.clamp(min=0)
print(a2)

tensor([[0.3988, 0.6874, 0.0000, 0.0000, 0.0000]], grad_fn=<ClampBackward>)
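`clamp(min=0)` is one way to write ReLU. The sketch below, on a small hand-written tensor `z` (illustrative values, not the `z2` above), checks that it agrees with `torch.relu` and with an element-wise maximum against zero.

```python
import torch

z = torch.tensor([[0.4, 0.7, -4.7, -4.3, -0.3]])

a_clamp = z.clamp(min=0)                        # ReLU via clamping from below
a_relu = torch.relu(z)                          # built-in ReLU
a_max = torch.maximum(z, torch.zeros_like(z))   # max(z, 0) element-wise

print(a_clamp)
print(torch.equal(a_clamp, a_relu), torch.equal(a_clamp, a_max))
```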


In [None]:
out = a2.mm(w2)
print(out)

tensor([[-0.9972,  1.0807, -0.5137]], grad_fn=<MmBackward>)
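The formulas use the column convention $W\mathbf{x}$, while the code multiplies a row vector from the left. A minimal sketch of why the two agree, using freshly drawn tensors (the names `W1`, `z_row`, and `z_col` are only for illustration): the row form is the transpose of the column form.

```python
import torch

torch.manual_seed(42)
x = torch.randn(1, 10)     # one sample stored as a row, shape (1, 10)
w1 = torch.randn(10, 5)    # stored as the transpose of W^(1)

z_row = x.mm(w1)           # row convention: (1, 10) @ (10, 5) -> (1, 5)

W1 = w1.t()                # W^(1) as in the formula, shape (5, 10)
z_col = W1.mm(x.t())       # column convention: (5, 10) @ (10, 1) -> (5, 1)

print(z_row.shape, z_col.shape)
print(torch.allclose(z_row, z_col.t(), atol=1e-5))
```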


## Actual data in batch

In the actual implementation, the data normally comes in batches, i.e., as a matrix. For example, the input is a matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples in a batch and each row represents a sample $\mathbf{x} \in \mathbb{R}^{1\times d}$. The weight matrix $W$ is actually formulated as:
$$
W = \left(
\begin{array}{cccc}| & | & | & | \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_m \\
| & | & | & |
\end{array}\right),
$$
if the output dimension of the layer of interest is $m$. The vectorized formulation is, for example, from the input (layer 0, dimension $d$) to layer 1 (dimension $m$)
$$
A^{(1)} = X (W^{(0)})^{\top} + B
$$
where $X \in \mathbb{R}^{N \times d}$, $W^{(0)} \in \mathbb{R}^{m\times d}$ (input from $d$ perceptrons, output to $m$ perceptrons), and $B$ is a matrix whose rows are all the same $\mathbf{b} \in \mathbb{R}^{1\times m}$ (layer 1 has $m$ perceptrons and hence $m$ biases, if applicable).
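The vectorized formula can be checked against a per-sample loop. The sketch below uses hypothetical sizes `N, d, m = 8, 10, 5` and random values; the bias `b` is a single vector that broadcasts across the $N$ rows, which plays the role of $B$.

```python
import torch

torch.manual_seed(0)
N, d, m = 8, 10, 5
X = torch.randn(N, d)     # each row is one sample
W0 = torch.randn(m, d)    # W^(0): one row of weights per output perceptron
b = torch.randn(m)        # one bias per output perceptron

# Vectorized: b is broadcast to every row, so no explicit matrix B is built.
A1 = X @ W0.t() + b

# Per-sample version for comparison: W x + b for each sample x.
A1_loop = torch.stack([W0 @ X[i] + b for i in range(N)])

print(A1.shape)
print(torch.allclose(A1, A1_loop, atol=1e-5))
```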



In [None]:
N = 8 # 8 samples in a batch
torch.manual_seed(42)
X = torch.randn(N, D_in, dtype=torch.float, device=device)
Y = torch.randn(N, D_out, dtype=torch.float, device=device)
print(X, '\n\n', Y)

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
         -0.7521,  1.6487],
        [-0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624,  1.6423, -0.1596,
         -0.4974,  0.4396],
        [-0.7581,  1.0783,  0.8008,  1.6806,  1.2791,  1.2964,  0.6105,  1.3347,
         -0.2316,  0.0418],
        [-0.2516,  0.8599, -1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,
          0.3057, -0.7746],
        [-1.5576,  0.9956, -0.8798, -0.6011, -1.2742,  2.1228, -1.2347, -0.4879,
         -0.9138, -0.6581],
        [ 0.0780,  0.5258, -0.4880,  1.1914, -0.8140, -0.7360, -1.4032,  0.0360,
         -0.0635,  0.6756],
        [-0.0978,  1.8446, -1.1845,  1.3835,  1.4451,  0.8564,  2.2181,  0.5232,
          0.3466, -0.1973],
        [-1.0546,  1.2780, -0.1722,  0.5238,  0.0566,  0.4263,  0.5750, -0.6417,
         -2.2064, -0.7508]]) 

 tensor([[ 0.0109, -0.3387, -1.3407],
        [-0.5854,  0.5362,  0.5246],
        [ 1.1412,  0.0516, -0.6788],
        [ 0.5

In [None]:
# Z2 applies W1 to every sample x in this 8-sample batch
Z2 = X.mm(w1)
print(Z2)

tensor([[ 0.6411, -2.4556, -1.5984, -4.3532,  5.3757],
        [ 3.1372,  0.6441,  1.0954, -1.0668, -2.6129],
        [-0.2949,  2.6706,  0.3077, -0.4743,  2.1944],
        [-1.2595,  2.0259, -0.3845,  1.5235,  1.4738],
        [-1.0205, -1.3909, -0.1755,  2.3356, -0.6578],
        [ 1.9354, -0.5580,  0.6821, -1.7389,  1.0952],
        [-1.8684,  7.3570, -2.8494, -0.8635,  5.9925],
        [-2.6531,  3.5318, -4.8040,  0.5250,  2.0460]], grad_fn=<MmBackward>)


In [None]:
A2 = Z2.clamp(min=0)
print(A2)

tensor([[0.6411, 0.0000, 0.0000, 0.0000, 5.3757],
        [3.1372, 0.6441, 1.0954, 0.0000, 0.0000],
        [0.0000, 2.6706, 0.3077, 0.0000, 2.1944],
        [0.0000, 2.0259, 0.0000, 1.5235, 1.4738],
        [0.0000, 0.0000, 0.0000, 2.3356, 0.0000],
        [1.9354, 0.0000, 0.6821, 0.0000, 1.0952],
        [0.0000, 7.3570, 0.0000, 0.0000, 5.9925],
        [0.0000, 3.5318, 0.0000, 0.5250, 2.0460]], grad_fn=<ClampBackward>)


In [None]:
# output z3
Z3 = A2.mm(w2)
print(Z3)

tensor([[ 2.6286, -3.5604,  5.4129],
        [-4.8344,  5.5440, -4.0405],
        [-2.4255,  0.7632,  3.3249],
        [-0.3313, -0.5127,  6.2941],
        [ 1.7151, -1.5667,  6.4045],
        [-1.9406,  2.2018, -1.2468],
        [-4.9091,  1.3290,  8.4527],
        [-2.4337,  0.9607,  4.5034]], grad_fn=<MmBackward>)


## Torch's nn module

We will demo this batch-based operation using `torch`'s neural network module `nn`. `nn.Linear` applies an (affine) linear transformation to the incoming data:
$$
Y = X W^{\top} + \mathbf{b}
$$

Reference: https://pytorch.org/docs/stable/nn.html
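A short sketch to connect the formula with the module: `nn.Linear(10, 5)` stores its weight with shape `(out_features, in_features)`, so the layer output should match `X @ weight.T + bias` computed by hand. The input `X` here is freshly drawn for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
X = torch.randn(8, 10)       # batch of 8 samples, 10 features each
layer = nn.Linear(10, 5)

print(layer.weight.shape)    # torch.Size([5, 10])
print(layer.bias.shape)      # torch.Size([5])

# Reproduce the layer by hand: Y = X W^T + b
manual = X @ layer.weight.t() + layer.bias
print(torch.allclose(layer(X), manual, atol=1e-5))
```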

In [None]:
import torch.nn as nn

In [None]:
layer1 = nn.Linear(10, 5) 
# Wx+b transforms (10,1) vector to (5,1) vector
# or xW^T transforms (1,10) vector to (1,5) vector 
layer2 = nn.Linear(5,3)
activation = nn.ReLU()

In [None]:
Z2 = layer1(X)
print(Z2)

tensor([[ 0.2808,  0.3782, -0.4373,  0.2746,  0.4777],
        [-0.1306, -0.6695,  0.5177, -0.0283, -0.7960],
        [-1.6454, -0.8339, -0.0197, -0.4252, -0.4004],
        [-0.1117,  0.1235,  0.1512, -0.2560, -0.1221],
        [-0.1911,  0.0570, -0.5551, -1.2714, -0.6000],
        [-0.0703,  0.9140, -0.7532, -0.8300, -0.4561],
        [-1.3858, -0.1712, -0.2908,  0.0939, -0.4832],
        [-0.9507, -0.1689, -0.9392, -0.6117, -0.8422]],
       grad_fn=<AddmmBackward>)


In [None]:
A2 = activation(Z2)
print(A2)

tensor([[0.2808, 0.3782, 0.0000, 0.2746, 0.4777],
        [0.0000, 0.0000, 0.5177, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.1235, 0.1512, 0.0000, 0.0000],
        [0.0000, 0.0570, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.9140, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0939, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]], grad_fn=<ReluBackward0>)


In [None]:
def forward(x):
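  '''
  forward pass function
  note: the layers are created inside the function, so every call
  draws fresh random weights (not the layer1/layer2 defined above)
  '''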
  layer1 = nn.Linear(10,5)
  layer2 = nn.Linear(5,3)
  act = nn.ReLU()
  x = layer1(x) # z2
  x = act(x) # a2
  x = layer2(x) # z3
  return x

In [None]:
output = forward(X)
print(output)

tensor([[ 1.5598e-01, -5.9261e-01, -5.2584e-01],
        [-7.8872e-02,  6.1956e-04, -2.0961e-01],
        [-2.0640e-01,  3.4190e-01, -3.4736e-01],
        [-8.7701e-03,  9.9241e-02, -3.1565e-01],
        [ 1.9981e-01, -4.6423e-01, -6.2108e-01],
        [-8.0052e-02,  6.6529e-02, -2.3789e-01],
        [-1.8169e-01,  1.0619e-01, -2.3450e-01],
        [ 1.4324e-01, -3.5152e-01, -4.9707e-01]], grad_fn=<AddmmBackward>)
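Because the `forward` function above builds its layers on every call, its weights change from call to call. One common way to keep them fixed is to wrap the layers in an `nn.Module`. The class below is a sketch under that assumption, with a hypothetical name `TwoLayerNet`; it is not the notebook's own code.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    """Linear -> ReLU -> Linear, with layers created once in __init__."""

    def __init__(self, d_in=10, n_h=5, d_out=3):
        super().__init__()
        self.layer1 = nn.Linear(d_in, n_h)
        self.act = nn.ReLU()
        self.layer2 = nn.Linear(n_h, d_out)

    def forward(self, x):
        x = self.layer1(x)   # z2
        x = self.act(x)      # a2
        return self.layer2(x)  # z3

torch.manual_seed(42)
model = TwoLayerNet()
X = torch.randn(8, 10)

out1 = model(X)
out2 = model(X)              # same weights, so the same output
print(out1.shape)
print(torch.allclose(out1, out2))
```

Calling `model(X)` runs `forward` through `nn.Module.__call__`, and `model.parameters()` exposes the four weight and bias tensors to an optimizer.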


## Gradient in Torch: autograd

In [None]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [None]:
print(a)

tensor([2., 3.], requires_grad=True)


In [None]:
Q = 3*a**3 - b**2

In [None]:
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)


In [None]:
L = Q.sum()
print(L)

tensor(53., grad_fn=<SumBackward0>)


In [None]:
L.backward() # backprop in a simple command

$\frac{\partial L}{\partial \mathbf{a}}$ has the same shape as $\mathbf{a}$.

In [None]:
a.grad

tensor([36., 81.])

In [None]:
(9*a**2).detach() # detach means we do not track the gradient

tensor([36., 81.])

In [None]:
b.grad

tensor([-12.,  -8.])

In [None]:
(-2*b).detach()

tensor([-12.,  -8.])
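The same check can be done for the two-layer network of a single sample. The sketch below assumes a sum-of-squared-errors loss, which is a choice made here for illustration and is not specified in this notebook. It compares the gradients from `backward()` with the chain rule written out by hand.

```python
import torch

torch.manual_seed(42)
x = torch.randn(1, 10)
y = torch.randn(1, 3)
w1 = torch.randn(10, 5, requires_grad=True)
w2 = torch.randn(5, 3, requires_grad=True)

# Forward pass, then an assumed loss: sum of squared errors.
a2 = x.mm(w1).clamp(min=0)
out = a2.mm(w2)
loss = (out - y).pow(2).sum()
loss.backward()

# Manual backward pass (no graph tracking needed here).
with torch.no_grad():
    d_out = 2 * (out - y)               # dL/d(out), shape (1, 3)
    grad_w2 = a2.t().mm(d_out)          # dL/d(w2), shape (5, 3)
    d_a2 = d_out.mm(w2.t())             # dL/d(a2), shape (1, 5)
    d_z2 = d_a2 * (a2 > 0).float()      # ReLU passes gradient only where z2 > 0
    grad_w1 = x.t().mm(d_z2)            # dL/d(w1), shape (10, 5)

print(torch.allclose(w1.grad, grad_w1, atol=1e-4))
print(torch.allclose(w2.grad, grad_w2, atol=1e-4))
```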