<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Math_450_Notebook_1_(From_Numpy_to_PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Lecture 1 of Math 450


## Welcome to Google Colab
We are using Google Colab~

This is a text cell. It uses Markdown syntax.

For example, we can enter inline math with `$ $`: $E = mc^2$, and use `$$   $$` for a displayed equation:
$$
\int^1_0 f(x) dx + \int^1_0 f^{-1}(x) dx  = 1. 
$$

Python code:
```python
import numpy as np
x = np.array([9,1,1])
print(x)
```

In [2]:
from time import time
print("Welcome to Math 450.")
print(f"{time():.2f}") # f-string

Welcome to Math 450.
1611957316.34


In [3]:
import numpy as np
import torch

## Introduction of PyTorch and GPUs

Colab uses an NVIDIA Tesla T4, and Kaggle uses an NVIDIA Tesla P100, both of which are powerful GPUs that fall short only of the newer Ampere GPUs (RTX 3090, A4000, A8000). 

A GPU instance has a time limit (12h on Colab, 9h on Kaggle). However, Colab's GPU limit is less transparent, as Colab's own documentation puts it:
> Colab resources are not guaranteed and not unlimited, and the usage limits sometimes fluctuate. This is necessary for Colab to be able to provide resources for free. For more details, see Resource Limits.

If you want to get into serious machine learning, my personal recommendation is to build a budget computer around an RTX 3060 12GB and learn Linux. If you start working on CV (computer vision) or NLP (natural language processing), then an RTX 3090 is recommended.

In [None]:
torch.__version__ # cu means cuda

'1.7.0+cu101'

In [None]:
torch.cuda.is_available()

True
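
The check above is usually combined with a device selection, so the same notebook runs with or without a GPU. A minimal sketch, assuming only that `torch` is installed; the names `device` and `x_dev` are illustrative and not part of the lecture code:

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.arange(10)   # tensors are created on the CPU by default
x_dev = x.to(device)   # move to the chosen device (no copy if already there)

print(device, x_dev.device)
print(x_dev.sum().item())  # 45, regardless of the device
```

Tensors that take part in the same operation must live on the same device, which is why the `.to(device)` pattern is applied to both data and parameters.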

In [None]:
!nvidia-smi

Fri Jan 29 21:14:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Compare a `torch.tensor` with a `numpy.ndarray`

- Initialization
- Convert one to the other and vice versa
- Common methods (functions) associated with them
- PyTorch has special "in-place" operations, marked by an underscore `_` suffix on the method name (for example `add_`), which modify the underlying tensor instead of returning a new one.

In [4]:
np.__version__

'1.19.5'

In [5]:
x = np.array(range(10))

In [6]:
np.sum(x)

45

In [None]:
print(x)

[0 1 2 3 4 5 6 7 8 9]


In [None]:
x.sum()

45

In [None]:
torch.tensor(list([1,2,5]))

tensor([1, 2, 5])

In [25]:
x_t = torch.tensor(range(10))
print(x_t)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


In [None]:
x_t.sum()

tensor(45)

In [None]:
torch.tensor(x)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
x_np = x_t.numpy()
print(type(x_np))

<class 'numpy.ndarray'>


In [None]:
# clip from below at 5 (ReLU is the special case clip(min=0))
x.clip(min=5)

array([5, 5, 5, 5, 5, 5, 6, 7, 8, 9])

In [None]:
np.array([-0.3, -0.1, 2, 4]).clip(min=0)

array([0., 0., 2., 4.])

In [None]:
x_t.clamp(min=0)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
x_t.add(-5).clamp(min=0)

tensor([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

In [27]:
print(x_t.add(1))
print(x_t)

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


In [28]:
x_t.add_(1) # in-place operations

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [29]:
print(x_t)

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
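
One detail worth knowing when converting between the two types is whether the data is copied or shared. The sketch below is an illustration under the assumption that everything lives on the CPU: `torch.tensor` copies the data, while `torch.from_numpy` (and `.numpy()` on a CPU tensor) shares memory with the original array. The variable names are made up for this example.

```python
import numpy as np
import torch

x = np.arange(5)                 # array([0, 1, 2, 3, 4])

x_copy = torch.tensor(x)         # copies the data
x_shared = torch.from_numpy(x)   # shares memory with x

x[0] = 100                       # modify the NumPy array
print(x_copy)                    # unchanged: first entry is still 0
print(x_shared)                  # first entry is now 100

x_shared.add_(1)                 # in-place operation on the tensor ...
print(x)                         # ... also changes the NumPy array
```

This is why in-place operations such as `add_` should be used with care on tensors that were created from NumPy arrays.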


## Neural network

We want to implement the following key component of a multi-layer perceptron neural net: 

### A single perceptron in the $l$-th layer.  

<img src="https://sites.wustl.edu/scao/files/2021/01/neuron-1.png" alt="drawing" width="800"/>

- Input: $\mathbf{a} \in \mathbb{R}^d$ ($d$ is the dimension of an input vector).
- Output: $\hat{a}$ (if this perceptron is the final one, then the output tries to approximate the actual target value $y$).
- The formula (with bias) is then
$$
\hat{a} = f(\mathbf{w} \cdot \mathbf{a} + b) = f (w_1 a_1 + w_2 a_2 + ... + w_d a_d + b)
$$
Here, $\mathbf{w} \in \mathbb{R}^d$ is a weight vector; $b \in \mathbb{R}$ is a bias; and $f(\cdot)$ denotes the nonlinear activation function we learned in Lecture 2 (the ReLU):
$$
f (x) = \begin{cases}
x & \text{if } x>0,
\\
0 & \text{if } x\leq 0.
\end{cases}
$$
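
The formula for a single perceptron can be sketched in a few lines of NumPy. This is only an illustration: the function names `relu` and `perceptron` and the numerical values of `a`, `w`, and `b` are made up for this example.

```python
import numpy as np

def relu(z):
    """Activation f: returns z where z > 0 and 0 otherwise."""
    return np.maximum(z, 0.0)

def perceptron(a, w, b):
    """Compute f(w . a + b) for an input vector a, weights w, and bias b."""
    return relu(np.dot(w, a) + b)

# Hypothetical input, weights, and bias with d = 3.
a = np.array([1.0, -2.0, 0.5])
w = np.array([0.5, 0.25, 1.0])
b = 0.1

a_hat = perceptron(a, w, b)   # w . a = 0.5 - 0.5 + 0.5 = 0.5, plus b gives about 0.6
print(a_hat)
```

If the pre-activation value $\mathbf{w} \cdot \mathbf{a} + b$ were negative, the output would be exactly $0$.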


### Multi-layer, multiple perceptrons per layer
Suppose we have $m$ perceptrons in a single layer, for example $m = 3$ in layer 2:
<img src="https://sites.wustl.edu/scao/files/2021/01/neural_net_3l.png" alt="drawing" width="800"/>

Our neural network has parameters $(W, b) := \big(W^{(1)},b^{(1)},W^{(2)},b^{(2)}\big)$.

* We write $W^{(l)} = \big(w^{(l)}_{ij}\big)$ to denote the weight matrix, where entry $ij$ is associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Note the order of the indices: $j$ indexes the unit closer to the input, that is, the components of the vector this matrix acts on.

* $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. 

In our example above, we have $W^{(1)}\in \mathbb{R}^{3\times 2}$ and $W^{(2)}\in \mathbb{R}^{1\times 3}$. Note that bias units do not have inputs or connections going into them; for convenience, we write their output as the value $+1$. When we count the number of units in layer $l$, we do not count the bias unit.


## Representation of the neural network

### Step 1: component-wise representation
We write $a^{(l)}_i$ to denote the activation or output value of unit $i$ in layer $l$. For $l=1$, $a^{(1)}_i= x_i$ denotes the $i$-th input to this network. Given a fixed set of parameters $(W,b)$, and the input $\mathbf{x}$, the neural network above defines a model function $h(\mathbf{x}; W, b)$ made of layers of function compositions that outputs a real number. Specifically, the computation that this neural network represents is given by:
$$
a_1^{(2)} = f\big(w_{11}^{(1)}x_1 + w_{12}^{(1)} x_2  + b_1^{(1)}\big)
$$
$$
a_2^{(2)} = f\big(w_{21}^{(1)}x_1 + w_{22}^{(1)} x_2 + b_2^{(1)}\big) 
$$
$$
a_3^{(2)} = f\big(w_{31}^{(1)}x_1 + w_{32}^{(1)} x_2 + b_3^{(1)}\big) 
$$
and the output 
$$
h(\mathbf{x}; W,b) =\hat{y} =  a_1^{(3)} =  f\big(w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big) 
$$
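
To make the indices concrete, the four equations above can be typed out literally. The sketch below uses made-up parameter values for the 2-3-1 network; note that Python indices start at 0, so $w_{ij}^{(1)}$ in the text corresponds to `W1[i-1, j-1]`.

```python
import numpy as np

def f(z):
    """ReLU activation for a scalar or an array."""
    return np.maximum(z, 0.0)

# Hypothetical parameters; the values are chosen only for illustration.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])       # shape (3, 2)
b1 = np.array([0.0, 0.5, -2.0])
W2 = np.array([[1.0, 2.0, 3.0]])  # shape (1, 3)
b2 = np.array([0.5])

x1, x2 = 2.0, -1.0                # the two inputs

# Layer 2 activations, one equation per hidden unit.
a2_1 = f(W1[0, 0] * x1 + W1[0, 1] * x2 + b1[0])
a2_2 = f(W1[1, 0] * x1 + W1[1, 1] * x2 + b1[1])
a2_3 = f(W1[2, 0] * x1 + W1[2, 1] * x2 + b1[2])

# Output of the network.
h = f(W2[0, 0] * a2_1 + W2[0, 1] * a2_2 + W2[0, 2] * a2_3 + b2[0])

# Hidden activations are 2, 0, 0 and the output is 2.5.
print(a2_1, a2_2, a2_3, h)
```

Writing every term by hand quickly becomes unmanageable as the layers grow, which motivates the vector and matrix forms in the next steps.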

### Step 2: from component to vector to matrix
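Collect the weights feeding into unit $i$ of layer $l+1$ into a vector $\mathbf{w}_i^{(l)} = \big(w_{i1}^{(l)}, w_{i2}^{(l)}, \dots\big)$. Each activation is then an inner product plus a bias, followed by the activation function:
\begin{align*}
a_1^{(l+1)} &= f\big(\mathbf{w}_1^{(l)} \cdot \mathbf{a}^{(l)} + b_1^{(l)}\big) 
\\
a_2^{(l+1)} &= f\big(\mathbf{w}_2^{(l)} \cdot \mathbf{a}^{(l)} + b_2^{(l)}\big) 
\\
a_3^{(l+1)} &= f\big(\mathbf{w}_3^{(l)} \cdot \mathbf{a}^{(l)} + b_3^{(l)}\big) 
\end{align*}
The weight matrix $W^{(l)}$ then has $\mathbf{w}_i^{(l)}$ as its $i$-th row.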


### Step 3: Matrix-vector representation
If we allow the activation function $f(\cdot)$ to act on vectors in an element-wise fashion: $f([z_1,z_2,z_3])=[f(z_1),f(z_2),f(z_3)]$, then we can write the equations above more compactly as:
$$\begin{aligned}
\mathbf{z}^{(2)} &= W^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(2)} &= f(\mathbf{z}^{(2)}) \\
\mathbf{z}^{(3)} &= W^{(2)} \mathbf{a}^{(2)} + \mathbf{b}^{(2)} \\
h(\mathbf{x}; W, b) &= \mathbf{a}^{(3)} = f(\mathbf{z}^{(3)})
\end{aligned}
$$
More generally, recalling that $\mathbf{a}^{(1)}=\mathbf{x}$ denotes the values from the input layer, then given layer $l$'s activations $\mathbf{a}^{(l)}$, we can compute layer $(l+1)$'s activations $\mathbf{a}^{(l+1)}$ as:
$$
\begin{aligned}
\mathbf{z}^{(l+1)} &= W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)}   \\
\mathbf{a}^{(l+1)} &= f(\mathbf{z}^{(l+1)})
\end{aligned}
$$
By organizing the parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
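
The layer-by-layer recursion translates into a short loop. The following is a minimal PyTorch sketch, not a full implementation: the parameter values are made up, and the lists `weights` and `biases` are just one convenient way to store $W^{(l)}$ and $\mathbf{b}^{(l)}$.

```python
import torch

def f(z):
    """Element-wise ReLU activation."""
    return z.clamp(min=0)

# Hypothetical parameters of a 2-3-1 network.
weights = [torch.tensor([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0]]),       # W^(1), shape (3, 2)
           torch.tensor([[1.0, 2.0, 3.0]])]  # W^(2), shape (1, 3)
biases = [torch.tensor([0.0, 0.5, -2.0]),    # b^(1)
          torch.tensor([0.5])]               # b^(2)

a = torch.tensor([2.0, -1.0])   # input x, i.e. the first layer's activations

for W, b in zip(weights, biases):
    z = W @ a + b               # z^(l+1) = W^(l) a^(l) + b^(l)
    a = f(z)                    # a^(l+1) = f(z^(l+1))

print(a)  # tensor([2.5000])
```

Adding another layer only requires appending one more matrix and bias vector to the lists; the loop itself does not change.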


## Linear algebra: Numpy vs PyTorch
### Operations needed to implement this model
- Inner product
- Matrix-vector multiplication
- Element-wise operation



In [7]:
# recall numpy's various operations
a = np.array([[1,0], [2,3]])
print(a)

[[1 0]
 [2 3]]


In [8]:
x = np.array([2,-1])
print(x)

[ 2 -1]


In [9]:
# a*x is not the correct way to implement matrix-vector multiplication
a.dot(x) # a times x

array([2, 1])

In [10]:
# pytorch's counterparts
a_t = torch.tensor(a)
x_t = torch.tensor(x)
print(a_t,'\n', x_t)

tensor([[1, 0],
        [2, 3]]) 
 tensor([ 2, -1])


In [11]:
a_t.mm(x_t) # raises an error: mm needs two 2-D tensors, but x_t is 1-D

RuntimeError: ignored

In [12]:
x_t.reshape(-1,1).size() # -1 means that we do not specify that dimension

torch.Size([2, 1])

In [13]:
# x_t has to be a tensor of Size(2,1)
a_t.mm(x_t.reshape(-1,1))

tensor([[2],
        [1]])
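
Reshaping is one way around the error. As an alternative sketch, the `@` operator (`torch.matmul`) and `mv` both accept a 1-D vector directly. The example below also shows what `a * x` actually computes in NumPy. The tensors are cast to `float32` here, which is a choice made for this example rather than a requirement stated in the lecture.

```python
import numpy as np
import torch

a = np.array([[1, 0], [2, 3]])
x = np.array([2, -1])

print(a * x)   # element-wise product with broadcasting: [[2, 0], [4, -3]]
print(a @ x)   # matrix-vector product, same as a.dot(x): [2, 1]

a_t = torch.tensor(a, dtype=torch.float32)
x_t = torch.tensor(x, dtype=torch.float32)

print(a_t @ x_t)     # matmul handles the 1-D vector: tensor([2., 1.])
print(a_t.mv(x_t))   # dedicated matrix-vector product: tensor([2., 1.])
print(x_t.dot(x_t))  # inner product of x with itself: tensor(5.)
```

The result of `@` and `mv` is a 1-D tensor of size 2, whereas `mm` with a reshaped input returns a 2-D tensor of size (2, 1).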

In [14]:
# relu
y_t = torch.randn((2,5))
print(y_t)

tensor([[ 0.5825, -0.0327,  0.1823, -0.8353,  1.0416],
        [-0.6860,  1.2339, -0.1908, -0.0246, -0.5685]])


In [15]:
y_t.clamp(min=0)

tensor([[0.5825, 0.0000, 0.1823, 0.0000, 1.0416],
        [0.0000, 1.2339, 0.0000, 0.0000, 0.0000]])

In [16]:
y = y_t.numpy()
print(y)

[[ 0.5824803  -0.03270331  0.18226774 -0.8352577   1.0416421 ]
 [-0.68601185  1.2339315  -0.19084905 -0.02458323 -0.56846607]]


In [17]:
# boolean array
y>0

array([[ True, False,  True, False,  True],
       [False,  True, False, False, False]])

In [18]:
# boolean array as indices
y[y<=0] = 0 # first y<=0 is getting indices of y such that its entry is <= 0
# then we set these entries to be 0
print(y)

[[0.5824803  0.         0.18226774 0.         1.0416421 ]
 [0.         1.2339315  0.         0.         0.        ]]


In [20]:
y_t[y_t <= 0] = 0 # same syntax applies to torch tensor
print(y_t)

tensor([[0.5825, 0.0000, 0.1823, 0.0000, 1.0416],
        [0.0000, 1.2339, 0.0000, 0.0000, 0.0000]])
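
The cells above implement the ReLU in several ways. As a small sanity check, the sketch below compares them on a fixed tensor, together with the built-in `torch.relu`. The test values are made up. Note that boolean-mask assignment modifies the tensor in place, so it is applied to a copy here.

```python
import numpy as np
import torch

y_t = torch.tensor([[0.5, -0.25, 0.0],
                    [-1.0, 2.0, -3.0]])

r_clamp = y_t.clamp(min=0)     # returns a new tensor
r_builtin = torch.relu(y_t)    # built-in ReLU, also returns a new tensor

r_mask = y_t.clone()           # copy first: the next line works in place
r_mask[r_mask <= 0] = 0

r_numpy = np.maximum(y_t.numpy(), 0)   # NumPy counterpart

print(torch.equal(r_clamp, r_builtin))  # True
print(torch.equal(r_clamp, r_mask))     # True
print(r_numpy)
```

`y_t` itself is left unchanged by all four variants in this sketch.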


## Final remark
In the actual implementation, the data normally comes in batches, i.e., as a matrix. For example, the input is a matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of samples in a batch and each row represents a sample $\mathbf{x} \in \mathbb{R}^{1\times d}$. The weight matrix $W$ is actually formulated as:
$$
W = \left(
\begin{array}{cccc}| & | & | & | \\
\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_m \\
| & | & | & |
\end{array}\right),
$$
if the output dimension of the layer of interest is $m$. The vectorized formulation is, for example, from the input (layer 0, dimension $d$) to layer 1 (dimension $m$)
$$
A^{(1)} = f\big(X W^{(0)} + B\big)
$$
where $X \in \mathbb{R}^{N \times d}$, $W^{(0)} \in \mathbb{R}^{d\times m}$ (input from $d$ perceptrons, output from $m$ perceptrons), $B$ is a matrix with each row being the same $\mathbf{b} \in \mathbb{R}^{1\times m}$ (layer 1 has $m$ perceptrons and has $m$ biases if applicable).
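
In code, the matrix $B$ rarely needs to be built explicitly: adding a vector of size $m$ to an $N \times m$ matrix broadcasts it over the rows. The sketch below illustrates this with `torch.randn`, computing $X W^{(0)} + B$ for a batch and then applying the ReLU. The sizes `N`, `d`, `m` and the seed are arbitrary choices for this example.

```python
import torch

torch.manual_seed(0)            # arbitrary seed, only for reproducibility

N, d, m = 4, 2, 3               # batch size, input dimension, output dimension
X = torch.randn(N, d)           # each row is one sample
W = torch.randn(d, m)           # column j holds the weights of perceptron j
b = torch.randn(m)              # one bias per perceptron

Z = X @ W + b                   # b is broadcast to every row of X @ W
A = Z.clamp(min=0)              # element-wise ReLU
print(A.shape)                  # torch.Size([4, 3])

# The same computation with B written out as an N x m matrix.
B = b.expand(N, m)
print(torch.allclose(Z, X @ W + B))

# Row i of Z agrees with the single-sample formula W^T x_i + b.
print(torch.allclose(Z[0], W.t() @ X[0] + b, atol=1e-6))
```

The shapes have to match for broadcasting to work: a bias of size `N` instead of `m` would raise an error here, since `N` and `m` differ.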

In [None]:

# demo a matrix plus a vector (dimension match)

In [None]:
# demo using torch.randn()