<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Math_450_Notebook_2_(From_Numpy_to_PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Lecture 2 of Math 450

Goal of this class: learn machine learning using the `torch` package.
- Build our own neural net using Torch's LEGO-like blocks.
- Write testing code from scratch.
- Write our own optimizer.


This is a worksheet version of the notebook. We can follow along during the coding lecture and then download the annotated version from our GitHub repository.

Download this notebook at:

In [None]:
import numpy as np
import torch

## Review of Coding Lecture 1: From Numpy to Torch

- Matrix-vector multiplication (`dot` in `numpy`; `mm` in `torch`, which expects 2D tensors) vs `*` (elementwise multiplication).
- Axes of an array, `squeeze()`.
- Object-oriented way of applying functions.
- `reshape()` in `numpy` vs `view()` in torch.
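
As a quick refresher on the last three bullets, here is a minimal sketch. The array values and variable names are illustrative assumptions, not part of the original lecture.

```python
import numpy as np
import torch

# A 2 x 3 array and the matching tensor: [[0, 1, 2], [3, 4, 5]]
a = np.arange(6).reshape(2, 3)   # numpy uses reshape()
t = torch.arange(6).view(2, 3)   # torch offers view() for the same purpose

# Axis 0 runs down the rows, axis 1 runs across the columns
col_sums = a.sum(axis=0)   # sums over rows, one value per column
row_sums = t.sum(dim=1)    # torch names the argument `dim`

# squeeze() drops every axis of length 1
s = torch.zeros(1, 3, 1).squeeze()

# Object-oriented style: call the method on the object itself
m = a.mean()               # same value as np.mean(a)
```

Both `reshape()` and `view()` keep the number of elements fixed and only change how they are indexed.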

In [None]:
# matrix vector multiplication vs *
x = np.array([[1,2], [0,5]])
y = np.array([1.3, 2.5])
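
One possible way to complete this comparison is sketched below, reusing the same `x` and `y`. The result names (`matvec`, `elementwise`, and so on) are illustrative assumptions. The sketch uses `torch.mv` for a 1D vector and shows that `torch.mm` needs the vector reshaped into a 2D column.

```python
import numpy as np
import torch

x = np.array([[1, 2], [0, 5]])
y = np.array([1.3, 2.5])

matvec = x.dot(y)        # matrix-vector product, shape (2,)
elementwise = x * y      # y is broadcast across the rows of x, shape (2, 2)

# Same computation in torch; cast x to float64 so both dtypes match
x_t = torch.from_numpy(x).double()
y_t = torch.from_numpy(y)

matvec_t = torch.mv(x_t, y_t)                # 2D matrix times 1D vector
matvec_mm = torch.mm(x_t, y_t.view(2, 1))    # mm needs 2D inputs, result is (2, 1)
elementwise_t = x_t * y_t                    # elementwise, with broadcasting
```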

# Key components for a neural network


## Multi-layer, multiple perceptrons per layer
If we have $m$ perceptrons in a single layer, for example layer 2:
<img src="https://sites.wustl.edu/scao/files/2021/02/neural_net_3-layer.png" alt="drawing" width="800"/>

Our neural network has parameters $(W, b) := \big(W^{(1)},b^{(1)},W^{(2)},b^{(2)}\big)$.

* $W^{(l)} = \big(w^{(l)}_{ij}\big)$ denotes the weight matrix, where entry $ij$ is associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Note the order of the indices: $j$ indexes the unit closer to the input, i.e., a unit in the layer that this matrix acts on.

* $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. 

In our example above, we have $W^{(1)}\in \mathbb{R}^{3\times 2}$, and $W^{(2)}\in \mathbb{R}^{1\times 3}$. Note that bias units do not have inputs or connections going into them; for convenience, we take their output to be the value $+1$. When we count the number of units in layer $l$, we do not count the bias unit.



## Matrix-vector representation
If we allow the activation function $f(\cdot)$ to act on vectors in an element-wise fashion: $f([\mathbf{z}_1,\mathbf{z}_2,\mathbf{z}_3])=[f(\mathbf{z}_1),f(\mathbf{z}_2),f(\mathbf{z}_3)]$, then we can write the equations above more compactly as:
$$\begin{aligned}
\mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\
\mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\
h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = \mathbf{z}^{(3)}
\end{aligned}
$$
More generally, if we have an arbitrary number of layers, recalling that $\mathbf{a}^{(1)}=\mathbf{x}$ also denotes the values from the input layer, then given layer $l$'s activations $\mathbf{a}^{(l)}$, we can compute layer $(l+1)$'s activations $\mathbf{a}^{(l+1)}$ as:
$$
\begin{aligned}
\mathbf{z}^{(l+1)} &= W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)}   \\
\mathbf{a}^{(l+1)} &= f(\mathbf{z}^{(l+1)}),
\end{aligned}
$$
except for the last layer, where we do not apply any activation.
By organizing the parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.

In [None]:
# generate all the x, z, a variables above

In [None]:
# from input to first layer (hidden layer)

In [None]:
# from the first layer to the output layer
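
One possible way to fill in the three cells above is sketched here. The shapes follow the example ($W^{(1)}\in \mathbb{R}^{3\times 2}$, $W^{(2)}\in \mathbb{R}^{1\times 3}$). The random parameter values and the choice of sigmoid for $f$ are assumptions, since the text leaves $f$ unspecified.

```python
import torch

torch.manual_seed(0)  # reproducible random parameters

# Shapes from the example: 2 inputs, 3 hidden units, 1 output
W1, b1 = torch.randn(3, 2), torch.randn(3)
W2, b2 = torch.randn(1, 3), torch.randn(1)
x = torch.randn(2)

f = torch.sigmoid     # assumed activation, applied element-wise

# from input to first layer (hidden layer)
z2 = W1 @ x + b1
a2 = f(z2)

# from the first layer to the output layer
z3 = W2 @ a2 + b2
h = z3                # no activation on the last layer
```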

## Actual data

In an actual implementation, the data normally comes in batches, i.e., as a matrix. For example, the input is a matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples in a batch and each row represents a sample $\mathbf{x} \in \mathbb{R}^{1\times d}$. The weight matrix $W$ is then arranged column by column as:
$$
W = \left(
\begin{array}{cccc}| & | & | & | \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_m \\
| & | & | & |
\end{array}\right),
$$
if the output dimension of the layer of interest is $m$. The vectorized formulation is, for example, from the input (layer 0, dimension $d$) to layer 1 (dimension $m$)
$$
A^{(1)} = X W^{(0)} + B
$$
where $X \in \mathbb{R}^{N \times d}$, $W^{(0)} \in \mathbb{R}^{d\times m}$ (input from $d$ perceptrons, output from $m$ perceptrons), $B$ is a matrix with each row being the same $\mathbf{b} \in \mathbb{R}^{1\times m}$ (layer 1 has $m$ perceptrons and has $m$ biases if applicable).
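
A minimal sketch of this batched formula is given below. The sizes `N`, `d`, `m` and the random values are illustrative assumptions. It builds $B$ explicitly and also shows that broadcasting a single bias vector gives the same result.

```python
import torch

torch.manual_seed(0)
N, d, m = 5, 4, 3          # hypothetical batch size and layer widths

X = torch.randn(N, d)      # each row is one sample
W = torch.randn(d, m)      # column j holds the weights of perceptron j
b = torch.randn(m)         # one bias per perceptron in layer 1

B = b.expand(N, m)         # explicit matrix whose rows all equal b
A = X @ W + B              # the vectorized formula

A_broadcast = X @ W + b    # broadcasting adds b to every row

# Row 0 of A matches the single-sample formula applied to sample 0
row0 = W.T @ X[0] + b
```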

## Torch's nn module

We will demo this batch-based operation using `torch`'s neural network module `nn`. `nn.Linear` applies an (affine) linear transformation to the incoming data:
$$
Y = X W^{\top} + \mathbf{b}
$$

Reference: https://pytorch.org/docs/stable/nn.html

In [None]:
import torch.nn as nn

In [None]:
# Linear layer example
layer = nn.Linear(4, 3)
input = torch.randn(32, 4)
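
A short sketch of how the layer above behaves on a batch is given below. The names `X`, `Y`, and `Y_manual` are illustrative. `nn.Linear` stores its weight with shape `(out_features, in_features)`, which is why the formula uses $W^{\top}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)        # in_features=4, out_features=3
X = torch.randn(32, 4)         # a batch of 32 samples, 4 features each

Y = layer(X)                   # one output row per sample

# Reproduce the layer by hand from its stored parameters
Y_manual = X @ layer.weight.T + layer.bias
```

Here `layer.weight` has shape `(3, 4)` and `layer.bias` has shape `(3,)`, so the bias is broadcast across all 32 rows.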

# MNIST

"MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike."

[Read more.](https://www.kaggle.com/c/digit-recognizer)


<a title="By Josef Steppan [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:MnistExamples.png"><img width="512" alt="MnistExamples" src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png"/></a>

This code is adapted from the PyTorch examples repository. 
It is licensed under the BSD 3-Clause "New" or "Revised" License.
Source: https://github.com/pytorch/examples/
LICENSE: https://github.com/pytorch/examples/blob/master/LICENSE


In [None]:
from __future__ import print_function
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torchvision.utils import make_grid
import matplotlib.pyplot as plt


In [None]:
train = datasets.MNIST('../data', train=True, download=True, transform = transforms.ToTensor())

In [None]:
train_loader = DataLoader(train, batch_size=1, shuffle=True, num_workers=2)

In [None]:
data_iter = iter(train_loader)
images, labels = next(data_iter)

In [None]:
im = make_grid(images)
plt.imshow(np.transpose(im.numpy(), (1, 2, 0)))
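
The transpose in the last cell is needed because torch image tensors are stored channels-first, `(C, H, W)`, while `plt.imshow` expects color images as `(H, W, C)`. The sketch below illustrates this with a random stand-in tensor, which is an assumption chosen to avoid downloading the dataset.

```python
import numpy as np
import torch

torch.manual_seed(0)

# Stand-in for an image grid: 3 channels, 28 x 28 pixels, channels first
im = torch.rand(3, 28, 28)

# Move the channel axis to the end, as in the cell above
hwc = np.transpose(im.numpy(), (1, 2, 0))

# The same reordering done on the torch side
hwc_torch = im.permute(1, 2, 0)
```

Either `hwc` or `hwc_torch` can be passed to `plt.imshow`; the pixel at row `i`, column `j`, channel `c` of `hwc` is `im[c, i, j]`.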