# PyTorch

© Anatolii Stehnii, 2018

This assignment should give you a basic understanding of the main torch concepts: tensors, broadcasting, ~~variables~~ (obsolete since PyTorch 0.4.0), autogradient, optimizers, and modules.

In [2]:
import torch

## Tensor operations
For **tensors**, torch has an API that is mostly similar to NumPy's.

In [12]:
x = torch.rand(2, 2)
x

tensor([[ 0.1601,  0.2916],
        [ 0.8702,  0.2465]])

In [15]:
x * 3

tensor([[ 0.4802,  0.8748],
        [ 2.6105,  0.7394]])

In [17]:
x + 1

tensor([[ 1.1601,  1.2916],
        [ 1.8702,  1.2465]])

Element-wise operations with two tensors:

In [18]:
y = torch.rand(2, 2)
y

tensor([[ 0.1639,  0.9313],
        [ 0.0997,  0.9620]])

In [19]:
x + y

tensor([[ 0.3240,  1.2229],
        [ 0.9699,  1.2084]])

In [20]:
x / y

tensor([[ 0.9764,  0.3131],
        [ 8.7277,  0.2562]])

Element-wise operations also work on tensors of different sizes, as long as the shapes are compatible: dimensions are compared starting from the last one, and each pair must either be equal or contain a 1:

In [21]:
z = torch.rand(1, 2)
z

tensor([[ 0.4346,  0.1989]])

In [22]:
z * x

tensor([[ 0.0696,  0.0580],
        [ 0.3782,  0.0490]])

In [24]:
w = torch.rand(2, 1)
w

tensor([[ 0.4038],
        [ 0.5767]])

In [25]:
x / w

tensor([[ 0.3964,  0.7221],
        [ 1.5089,  0.4274]])

This is called **broadcasting**: if we have two tensors of size $m\times n\times k$ and $m \times 1 \times k$, we can perform element-wise operations on them. The smaller tensor is virtually cloned $n$ times along the second dimension to fit the size of the larger tensor. See https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html
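The rule can be checked directly on the shapes from the paragraph above. This is a minimal sketch; the concrete sizes `m, n, k = 2, 3, 4` and the names `a`, `b`, `v` are chosen here for illustration only.

```python
import torch

# Shapes follow the m x n x k and m x 1 x k example from the text.
m, n, k = 2, 3, 4
a = torch.rand(m, n, k)
b = torch.rand(m, 1, k)

# Broadcasting: b is treated as if it were repeated n times along dim 1.
broadcast_sum = a + b

# The same result with the repetition made explicit (expand copies no data).
explicit_sum = a + b.expand(m, n, k)

print(broadcast_sum.shape)                        # torch.Size([2, 3, 4])
print(torch.equal(broadcast_sum, explicit_sum))   # True

# A tensor with fewer dimensions also works: shapes are aligned from the right,
# so a vector of size k is applied along the last dimension.
v = torch.rand(k)
print((a * v).shape)                              # torch.Size([2, 3, 4])
```

Because `expand` only creates a view, broadcasting does not allocate `n` real copies of the smaller tensor.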

Tensor **dot product** is possible for two tensors of size $n_1\times\ldots\times n_{i-1}\times n_i$ and $m_1\times m_2 \times\ldots\times m_k$ if $n_i = m_1$. The result has size $n_1\times\ldots\times n_{i-1}\times m_2\times\ldots\times m_k$. For two-dimensional tensors this is ordinary matrix multiplication, which `mm` computes:

In [35]:
# inner product (1 x 2)*(2 x 1) = (1 x 1)
z.mm(w)

tensor([[ 0.2902]])

In [34]:
# outer product (2 x 1)*(1 x 2) = (2 x 2)
w.mm(z)

tensor([[ 0.1755,  0.0803],
        [ 0.2507,  0.1147]])

In [37]:
# (4 x 2)*(2 x 3) = (4 x 3)
w1 = torch.tensor([[1,3], [5,4], [7,8], [1,7]])
w2 = torch.tensor([[3,3,4], [7,8,9]])
w1.mm(w2)

tensor([[  24,   27,   31],
        [  43,   47,   56],
        [  77,   85,  100],
        [  52,   59,   67]])
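`mm` itself accepts only two-dimensional tensors. For the general size rule stated earlier, one option is `torch.tensordot` with `dims=1`, which contracts the last dimension of the first tensor with the first dimension of the second. The sketch below uses shapes picked here for illustration.

```python
import torch

a = torch.rand(2, 3, 4)   # n_1 x n_2 x n_3, last dimension is 4
b = torch.rand(4, 5, 6)   # m_1 x m_2 x m_3, first dimension is 4

# dims=1 contracts the last dimension of a with the first dimension of b.
c = torch.tensordot(a, b, dims=1)
print(c.shape)            # torch.Size([2, 3, 5, 6])

# For 2-D inputs this agrees with the matrix product computed by mm.
w1 = torch.tensor([[1., 3.], [5., 4.], [7., 8.], [1., 7.]])
w2 = torch.tensor([[3., 3., 4.], [7., 8., 9.]])
same = torch.allclose(torch.tensordot(w1, w2, dims=1), w1.mm(w2))
print(same)               # True
```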

### Assignment
Write code for the forward propagation of a simple neural network with one hidden layer: $\textbf{x} \in \mathbb{R}^{10}$, $y \in \mathbb{R}^{1}$, $\textbf{h} \in \mathbb{R}^{20}$. Use any nonlinearity you prefer.

(Additional) Write code for batch forward propagation through the same network.

In [96]:
import numpy as np
X = np.random.rand(50, 10)
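One practical detail before using `X` with torch: NumPy creates `float64` arrays by default, while `torch.rand` and most torch layers use `float32`, and mixing the two in `mm` raises a dtype error. A possible conversion is sketched below; the name `X_t` is introduced here and is not part of the assignment.

```python
import numpy as np
import torch

X = np.random.rand(50, 10)          # NumPy defaults to float64

# from_numpy keeps the array's dtype (float64) and shares its memory;
# .float() makes a float32 copy that matches tensors created by torch.rand.
X_t = torch.from_numpy(X).float()

print(X_t.shape, X_t.dtype)         # torch.Size([50, 10]) torch.float32
```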

---

## Autogradient

The concept of a computational graph is an essential part of deep learning, because it frees us from deriving gradients for backpropagation by hand. A torch `tensor` stores not only numerical data but also a reference to its origin: the operation that created it and the tensors that operation used.

*Note: In Torch, you are not forced to define a static computational graph before optimization. The order of your computations can change during execution as easily as in any ordinary algorithm. This lets Torch process dynamic structures like trees and graphs.*
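The note above can be illustrated with ordinary Python control flow. In this toy sketch the number of layers applied depends on the input data, and the graph is simply whatever was executed. The function `repeated_tanh` and its branching rule are hypothetical, made up here for illustration.

```python
import torch

def repeated_tanh(x, W):
    """Apply tanh(h W) a data-dependent number of times (toy example)."""
    h = x
    # Plain Python branching: the number of steps depends on the input values.
    steps = 1 if x.sum().item() < 0 else 3
    for _ in range(steps):
        h = torch.tanh(h.mm(W))
    return h, steps

W = torch.rand(5, 5, requires_grad=True)
h, steps = repeated_tanh(torch.ones(1, 5), W)   # positive input: 3 steps
h.sum().backward()                              # differentiates the graph that was actually run
print(steps, W.grad.shape)                      # 3 torch.Size([5, 5])
```

Calling the function with a different input may build a graph of a different depth, with no change to the gradient code.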

In [101]:
from torch.nn import functional as F
x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)
h = torch.tanh(x.mm(W))  # torch.tanh replaces the deprecated F.tanh
h

tensor([[ 0.9748,  0.9281,  0.9093,  0.3735,  0.9701,  0.9616,  0.9895,
          0.9634,  0.9531,  0.8809]])

In [102]:
h.grad_fn

<TanhBackward at 0x10ed0f128>

In [103]:
h.grad_fn.next_functions

((<MmBackward at 0x10ed0f080>, 0),)

This makes it possible to compute an **autogradient** for any scalar that results from operations on tensors that require gradients:

In [110]:
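# backward() without arguments can only be called on a scalar
if W.grad is not None:
    W.grad.zero_()  # gradients accumulate across backward() calls, so reset them first
h.sum().backward(retain_graph=True)  # retain_graph=True lets this cell be re-run on the same graph
W.grad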

tensor([[ 0.0323,  0.0902,  0.1126,  0.5596,  0.0383,  0.0489,  0.0135,
          0.0467,  0.0595,  0.1457],
        [ 0.0471,  0.1315,  0.1643,  0.8160,  0.0559,  0.0714,  0.0197,
          0.0681,  0.0868,  0.2125],
        [ 0.0218,  0.0609,  0.0761,  0.3779,  0.0259,  0.0331,  0.0091,
          0.0315,  0.0402,  0.0984],
        [ 0.0471,  0.1316,  0.1643,  0.8165,  0.0559,  0.0714,  0.0198,
          0.0682,  0.0869,  0.2126],
        [ 0.0374,  0.1045,  0.1305,  0.6486,  0.0444,  0.0567,  0.0157,
          0.0541,  0.0690,  0.1689]])
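The gradient printed above can be cross-checked by hand. For $\textbf{h} = \tanh(\textbf{x}W)$ and the scalar $\sum_j h_j$, the chain rule gives $\partial/\partial W_{ij} = x_i (1 - h_j^2)$, which is the outer product of $\textbf{x}^T$ and $1 - \textbf{h}^2$. The sketch below recreates `x` and `W` with a fixed seed (an assumption made here, so the values differ from the cells above) and compares both results.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)

h = torch.tanh(x.mm(W))
h.sum().backward()

# Chain rule by hand: d tanh(u)/du = 1 - tanh(u)^2, with u = x W.
manual_grad = x.t().mm(1 - h.detach() ** 2)

matches = torch.allclose(W.grad, manual_grad, atol=1e-6)
print(matches)  # True
```

This also explains the structure of the printed matrix: every row is the same vector $1 - \textbf{h}^2$ scaled by one component of $\textbf{x}$.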

### Assignment
Use the provided $\textbf{X}$ and $\textbf{y}$ values to find the gradient of the RMSE loss with respect to the parameters of your network.

In [97]:
y = X[:, 0]*X[:, 1] + \
    np.log(X[:, 2]) + \
    (np.random.rand(5) * X[:, 3:8]).sum(axis=1) + \
    np.random.normal(0, 0.1, 50) 
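For reference, RMSE (root mean squared error) is usually defined as follows. The symbol $\hat{y}_i$ is notation introduced here for the network output on the $i$-th row of $\textbf{X}$, and $N = 50$ is the number of rows:

$$\mathrm{RMSE}(\hat{\textbf{y}}, \textbf{y}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$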

---
## Optimizer
Having a gradient is enough to start training your network with simple gradient descent; however, you can use more sophisticated optimization methods like Adam or Adagrad. These methods are already implemented in torch and available in the module `torch.optim` (https://pytorch.org/docs/stable/optim.html#module-torch.optim):

In [111]:
from torch.optim import SGD, Adam
# every optimizer accepts an iterable of model parameters as its first argument;
# optimizer-specific parameters come next
sgd = SGD([W], lr=0.1)
adam = Adam([W], lr=0.1)

In [122]:
h_true = torch.tensor([1.0]*10)

In [154]:
sgd.zero_grad()

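h = torch.tanh(x.mm(W))
# ad hoc loss (a difference of squares, not a squared error); the sum stays
# non-negative because |h| never exceeds h_true = 1
loss = torch.sqrt((h_true**2-h**2).sum())/10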

loss.backward()
print('Gradient sum: {0:.4f}\nLoss: {1:.4f}'.format(W.grad.abs().sum().item(), loss.item()))
sgd.step()

Gradient sum: 0.4104
Loss: 0.1271
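For plain SGD (no momentum, no weight decay), `step()` amounts to subtracting `lr * grad` from each parameter. The sketch below checks this on a fresh `x` and `W` with a simple sum-of-tanh objective; the seed, the objective, and the names `lr` and `expected` are assumptions made here for illustration.

```python
import torch
from torch.optim import SGD

torch.manual_seed(0)
x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)
lr = 0.1

loss = torch.tanh(x.mm(W)).sum()
loss.backward()

# The plain gradient descent update, computed by hand on a detached copy.
expected = W.detach() - lr * W.grad

# Let the optimizer apply its update to W in place.
SGD([W], lr=lr).step()

same_update = torch.allclose(W.detach(), expected)
print(same_update)  # True
```

Optimizers such as Adam keep extra per-parameter state, so their updates cannot be reproduced with this one-line formula.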


### Assignment
Train your simple network with the SGD optimizer, learning rate `0.01` and momentum `0.9`.

---
## Module

At this point, you already have everything you need to start training your network. A module is simply a handy way to organize your code into a separate unit and to provide the optimizer with a list of parameters:

In [184]:
import torch.nn as nn
import torch.nn.init as init

class SimpleModule(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(SimpleModule, self).__init__()
        self.W = nn.Parameter(torch.empty(in_dim, out_dim))
        init.normal_(self.W)
        self.b = nn.Parameter(torch.zeros(out_dim))
        
    def forward(self, X):
        return torch.tanh(X.mm(self.W)) + self.b
    
model = SimpleModule(10, 1)
optimizer = Adam(model.parameters(), lr=0.1)
loss = nn.MSELoss()

x = torch.rand(1, 10)
y_true = torch.tensor([[5.0]])

for epoch in range(100):
    optimizer.zero_grad()
    
    y = model(x)
    l = loss(y, y_true)
    l.backward()
    
    optimizer.step()
    if (epoch+1) % 10 == 0:
        print('Epoch {0}, loss {1:.4f}, prediction {2:.2f}'.format(epoch+1, l.item(), y.item()))

Epoch 10, loss 9.7928, prediction 1.87
Epoch 20, loss 4.9344, prediction 2.78
Epoch 30, loss 2.0424, prediction 3.57
Epoch 40, loss 0.6333, prediction 4.20
Epoch 50, loss 0.1207, prediction 4.65
Epoch 60, loss 0.0061, prediction 4.92
Epoch 70, loss 0.0023, prediction 5.05
Epoch 80, loss 0.0062, prediction 5.08
Epoch 90, loss 0.0039, prediction 5.06
Epoch 100, loss 0.0011, prediction 5.03
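How the optimizer receives its parameter list can be seen by inspecting the module. Any `nn.Parameter` assigned as an attribute is registered automatically, and `parameters()` / `named_parameters()` iterate over the registered ones. The sketch below repeats a condensed version of `SimpleModule` (using `torch.randn` for initialization, a simplification made here) so that it runs on its own.

```python
import torch
import torch.nn as nn

class SimpleModule(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim))
        self.b = nn.Parameter(torch.zeros(out_dim))

    def forward(self, X):
        return torch.tanh(X.mm(self.W)) + self.b

model = SimpleModule(10, 1)

# Parameters are listed in the order in which they were assigned in __init__.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# W (10, 1) True
# b (1,) True
```

A plain tensor attribute that is not wrapped in `nn.Parameter` would not appear in this list, so the optimizer would never update it.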
