# PyTorch

© Anatolii Stehnii, 2018

This assignment should give you a basic understanding of the main torch concepts: tensors, broadcasting, ~~variables~~ (obsolete since PyTorch 0.4.0), autogradient, optimizers, and modules.

In [1]:
import torch

## Tensor operations
For **tensors**, torch has an API that is mostly similar to numpy's.

In [2]:
x = torch.rand(2, 2)
x

tensor([[ 0.0726,  0.7453],
        [ 0.7320,  0.9103]])

In [3]:
x * 3

tensor([[ 0.2177,  2.2360],
        [ 2.1960,  2.7308]])

In [4]:
x + 1

tensor([[ 1.0726,  1.7453],
        [ 1.7320,  1.9103]])

Element-wise operations with two tensors:

In [5]:
y = torch.rand(2, 2)
y

tensor([[ 0.6491,  0.2243],
        [ 0.3772,  0.3437]])

In [6]:
x + y

tensor([[ 0.7217,  0.9697],
        [ 1.1092,  1.2540]])

In [7]:
x / y

tensor([[ 0.1118,  3.3224],
        [ 1.9406,  2.6487]])

Element-wise operations also work on tensors of different sizes, as long as the sizes in each dimension (aligned from the last one) are either equal or one of them is 1:

In [8]:
z = torch.rand(1, 2)
z

tensor([[ 0.8274,  0.0905]])

In [9]:
z * x

tensor([[ 0.0600,  0.0674],
        [ 0.6057,  0.0823]])

In [10]:
w = torch.rand(2, 1)
w

tensor([[ 0.2742],
        [ 0.9058]])

In [11]:
x / w

tensor([[ 0.2647,  2.7183],
        [ 0.8081,  1.0049]])

This is called **broadcasting**: if we have two tensors of size $m\times n\times k$ and $m \times 1 \times k$, we can perform element-wise operations with them. The smaller tensor is virtually cloned $n$ times along the second dimension to match the size of the larger tensor. See https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html
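The rule above can be checked directly. The following is a minimal sketch; the sizes `m`, `n`, `k` and the tensor names are arbitrary choices made for illustration. It compares the broadcast result with one where the repetition is written out via `expand`, and also shows that a tensor with fewer dimensions is aligned from the last dimension.

```python
import torch

m, n, k = 2, 3, 4  # arbitrary illustrative sizes
a = torch.rand(m, n, k)
b = torch.rand(m, 1, k)

# Broadcasting: b behaves as if it were repeated n times along dim 1.
broadcast_sum = a + b

# The same result with the repetition written out explicitly.
explicit_sum = a + b.expand(m, n, k)

print(broadcast_sum.shape)                       # torch.Size([2, 3, 4])
print(torch.equal(broadcast_sum, explicit_sum))  # True

# A tensor with fewer dimensions is aligned from the last dimension.
c = torch.rand(k)
print((a + c).shape)                             # torch.Size([2, 3, 4])
```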

A tensor **dot product** is possible for two tensors of size $n_1\times\ldots\times n_{i-1}\times n_i$ and $m_1\times m_2 \times\ldots\times m_k$ if $n_i = m_1$. The result has size $n_1\times\ldots\times n_{i-1}\times m_2\times\ldots\times m_k$.
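For tensors with more than two dimensions, this contraction of the last dimension of one tensor with the first dimension of the other can be sketched with `torch.tensordot` and `dims=1`. The shapes below are arbitrary illustrative choices. The second half reproduces the same numbers with a plain matrix product on flattened tensors.

```python
import torch

a = torch.rand(2, 3, 4)  # n_1 x n_2 x n_3
b = torch.rand(4, 5, 6)  # m_1 x m_2 x m_3, with m_1 == n_3

# dims=1 contracts the last dimension of a with the first dimension of b.
c = torch.tensordot(a, b, dims=1)
print(c.shape)  # torch.Size([2, 3, 5, 6])

# The same contraction as a 2-D matrix product on flattened tensors.
c_mm = a.reshape(6, 4).mm(b.reshape(4, 30)).reshape(2, 3, 5, 6)
print(torch.allclose(c, c_mm))  # True
```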

In [12]:
# inner product (1 x 2)*(2 x 1) = (1 x 1)
z.mm(w)

tensor([[ 0.3088]])

In [13]:
# outer product (2 x 1)*(1 x 2) = (2 x 2)
w.mm(z)

tensor([[ 0.2269,  0.0248],
        [ 0.7495,  0.0819]])

In [14]:
# (4 x 2)*(2 x 3) = (4 x 3)
w1 = torch.tensor([[1,3], [5,4], [7,8], [1,7]])
w2 = torch.tensor([[3,3,4], [7,8,9]])
w1.mm(w2)

tensor([[  24,   27,   31],
        [  43,   47,   56],
        [  77,   85,  100],
        [  52,   59,   67]])

In [15]:
w1 = torch.rand(2,3,3)
w2 = torch.rand(3,2, 2)
w1.matmul(w2)

RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0
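The error arises because `matmul` treats all but the last two dimensions as batch dimensions, and those must match or be broadcastable. A minimal sketch of shapes that do work (the names `w3` and `out` are illustrative):

```python
import torch

w1 = torch.rand(2, 3, 3)
w2 = torch.rand(2, 3, 2)  # same batch size 2; inner sizes 3 and 3 match

out = w1.matmul(w2)
print(out.shape)  # torch.Size([2, 3, 2])

# Each batch element is an ordinary matrix product.
print(torch.allclose(out[0], w1[0].mm(w2[0])))  # True

# A batch dimension of size 1 is broadcast against the other batch size.
w3 = torch.rand(1, 3, 2)
print(w1.matmul(w3).shape)  # torch.Size([2, 3, 2])
```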

### Assignment
Write code for the forward propagation of a simple neural network with one hidden layer. $\textbf{x} \in \mathbb{R}^{10}, y \in \mathbb{R}^{1}, \textbf{h} \in \mathbb{R}^{20}$. Select any nonlinearity you prefer. 

(Additional) Write code for batch forward propagation of the same network.

In [118]:
import numpy as np
X = np.random.rand(50, 10)

In [119]:
W1 = torch.rand(10, 20, requires_grad=True)
W2 = torch.rand(20, 1, requires_grad=True)
X = torch.tensor(X).float()

In [120]:
X.requires_grad

False

In [121]:
import torch.nn.functional as F

In [122]:
y = F.relu(X.mm(W1)).mm(W2)
y.grad_fn

<MmBackward at 0x11e2eacf8>

---

## Autogradient

The concept of a computational graph is an essential part of deep learning, because it frees us from deriving gradients for backpropagation by hand. A torch `tensor` stores not only numerical data but also a reference to its origin: the operation that created it and the tensors involved.

*Note: In Torch, you are not forced to define a static computational graph before optimization. The order of your computations can change during execution as easily as in any ordinary algorithm. This lets Torch process dynamic structures like trees and graphs.*
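As a small illustration of this dynamic behaviour, the sketch below builds graphs of different depth from the same Python function. The function name `forward`, the sizes, and the step counts are assumptions made for the example. The graph is rebuilt on every call, so the number of layers is an ordinary run-time value.

```python
import torch

torch.manual_seed(0)
W = torch.rand(3, 3, requires_grad=True)
x = torch.rand(1, 3)

def forward(x, n_steps):
    # The same weights are applied n_steps times; the graph grows with the loop.
    h = x
    for _ in range(n_steps):
        h = torch.tanh(h.mm(W))
    return h.sum()

for n_steps in (1, 3):
    W.grad = None  # discard the gradient from the previous run
    forward(x, n_steps).backward()
    print(n_steps, W.grad.shape)
```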

In [135]:
from torch.nn import functional as F
x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)
h = torch.tanh(x.mm(W))  # F.tanh is deprecated in favour of torch.tanh
h

tensor([[ 0.2696,  0.8314,  0.7390,  0.8474,  0.7649,  0.6143,  0.7942,
          0.7798,  0.6302,  0.1675]])

In [136]:
h.grad_fn

<TanhBackward at 0x1171567f0>

In [137]:
h.grad_fn.next_functions

((<MmBackward at 0x1171569b0>, 0),)
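Following `next_functions` recursively exposes the whole graph behind a tensor. The helper `walk` below is a hypothetical utility written for this illustration, not part of torch. The exact class names it prints (for example, whether a numeric suffix is appended) depend on the installed PyTorch version.

```python
import torch

x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)
h = torch.tanh(x.mm(W))

def walk(fn, depth=0, names=None):
    """Print the backward graph rooted at fn and return the visited node names."""
    names = [] if names is None else names
    if fn is None:  # inputs that do not require gradients have no node
        return names
    names.append(type(fn).__name__)
    print('  ' * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

names = walk(h.grad_fn)
```

The last node reached is the one that accumulates the gradient into `W.grad`; `x` contributes no node because it does not require a gradient.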

This makes it possible to compute an **autogradient** of any scalar that results from operations on tensors which require gradients:

In [48]:
# backward() without arguments can only be called on a scalar
if W.grad is not None:
    W.grad.zero_()
h.sum().backward(retain_graph=True)
W.grad

tensor([[ 0.0423,  0.0398,  0.0973,  0.0291,  0.0697,  0.0100,  0.0401,
          0.0630,  0.0338,  0.0322],
        [ 0.3134,  0.2952,  0.7215,  0.2157,  0.5168,  0.0740,  0.2977,
          0.4668,  0.2504,  0.2384],
        [ 0.1151,  0.1085,  0.2651,  0.0793,  0.1899,  0.0272,  0.1094,
          0.1715,  0.0920,  0.0876],
        [ 0.0165,  0.0155,  0.0379,  0.0113,  0.0271,  0.0039,  0.0156,
          0.0245,  0.0132,  0.0125],
        [ 0.2331,  0.2196,  0.5367,  0.1604,  0.3844,  0.0551,  0.2214,
          0.3472,  0.1863,  0.1773]])
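For this particular graph the gradient can also be derived by hand, which gives a quick sanity check of autograd. With $\textbf{h} = \tanh(\textbf{x}W)$ and $s = \sum_j h_j$, the chain rule gives $\partial s / \partial W_{ij} = x_i (1 - h_j^2)$, that is, $\nabla_W s = \textbf{x}^T (1 - \textbf{h}^2)$. The sketch below uses fresh random tensors, so its numbers differ from the output above.

```python
import torch

x = torch.rand(1, 5)
W = torch.rand(5, 10, requires_grad=True)

h = torch.tanh(x.mm(W))
h.sum().backward()

# Hand-derived gradient: outer product of x with the tanh derivative (1 - h^2).
manual_grad = x.t().mm(1 - h.detach() ** 2)

print(torch.allclose(W.grad, manual_grad, atol=1e-6))  # True
```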

### Assignment
Use the provided $\textbf{X}$ and $\textbf{y}$ values to find the gradient of the RMSE loss with respect to the parameters of your network.

In [111]:
y_true = X[:, 0]*X[:, 1] + \
    torch.log(X[:, 2]) + \
    (torch.tensor(np.random.rand(5)).float() * X[:, 3:8]).sum(dim=1) + \
    torch.tensor(np.random.normal(0, 0.1, 50)).float()

In [158]:
def rse_loss(y, y_true):
    return torch.sqrt(torch.sum((y - y_true) ** 2))

y = F.relu(X.mm(W1)).mm(W2)
loss = rse_loss(y, y_true)/50
loss.item()

0.965482771396637
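One thing worth checking in a loss like this is the shapes of its arguments. In the cell above, `y` comes from a product with a `(20, 1)` matrix and so has shape `(50, 1)`, while `y_true` is built from column slices and has shape `(50,)`. Broadcasting then produces a `(50, 50)` matrix of pairwise differences rather than 50 per-sample errors. The sketch below uses random stand-in tensors to show the effect, and one possible remedy if per-sample differences are what is intended.

```python
import torch

y = torch.rand(50, 1)    # stand-in for the network output
y_true = torch.rand(50)  # stand-in for the flat target vector

diff = y - y_true
print(diff.shape)     # torch.Size([50, 50])

# Dropping the trailing dimension gives one difference per sample.
diff_ok = y.squeeze(1) - y_true
print(diff_ok.shape)  # torch.Size([50])
```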

In [130]:
y.grad_fn

<MmBackward at 0x11e2bd5c0>

In [131]:
loss.backward(retain_graph=True)

In [132]:
alpha = 0.001

In [133]:
W1.grad

In [128]:
# update in place so that W1 and W2 remain leaf tensors tracked by autograd
with torch.no_grad():
    W1 -= alpha*W1.grad
    W2 -= alpha*W2.grad
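When updating weights by hand, the update is best done in place under `torch.no_grad()` so that the parameter stays a leaf tensor. An out-of-place assignment such as `w = w - alpha * w.grad` would instead create a new non-leaf tensor whose `.grad` is not populated by later `backward()` calls. A minimal sketch of a full manual gradient descent loop on a toy quadratic loss (the loss and the values are assumptions chosen for illustration):

```python
import torch

w = torch.tensor([5.0], requires_grad=True)
alpha = 0.1

for _ in range(100):
    loss = ((w - 2.0) ** 2).sum()  # toy loss with its minimum at w = 2
    loss.backward()
    with torch.no_grad():
        w -= alpha * w.grad        # in-place step, not recorded by autograd
    w.grad.zero_()                 # gradients accumulate, so reset them

print(w.item())
print(w.is_leaf, w.requires_grad)  # True True
```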

---
## Optimizer
Having a gradient is enough to start training your network with simple gradient descent; however, you can use more sophisticated optimization methods like Adam or Adagrad. These methods are already implemented in torch and available in the module `torch.optim` (https://pytorch.org/docs/stable/optim.html#module-torch.optim):

In [138]:
from torch.optim import SGD, Adam
# every optimizer accepts an iterable of model parameters as its first argument;
# optimizer-specific parameters come next
sgd = SGD([W], lr=0.1)
adam = Adam([W], lr=0.1)

In [139]:
h_true = torch.tensor([1.0]*10)

In [152]:
sgd.zero_grad()

h = torch.tanh(x.mm(W))  # F.tanh is deprecated in favour of torch.tanh
loss = torch.sqrt((h_true**2-h**2).sum())/10

loss.backward()
print('Gradient sum: {0:.4f}\nLoss: {1:.4f}'.format(W.grad.abs().sum().item(), loss.item()))
sgd.step()

Gradient sum: 0.2197
Loss: 0.2295
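With no momentum or weight decay, a single `SGD.step()` is the plain update $W \leftarrow W - \text{lr} \cdot \nabla W$. The sketch below checks this on fresh random tensors; the loss is an arbitrary choice made only to obtain a gradient.

```python
import torch
from torch.optim import SGD

W = torch.rand(5, 10, requires_grad=True)
x = torch.rand(1, 5)
lr = 0.1
sgd = SGD([W], lr=lr)

sgd.zero_grad()
loss = torch.tanh(x.mm(W)).sum()
loss.backward()

# Plain gradient descent update, computed by hand before the optimizer runs.
expected = W.detach() - lr * W.grad
sgd.step()

print(torch.allclose(W.detach(), expected))  # True
```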


### Assignment
Train your simple network with SGD optimizer, learning rate `0.01` and momentum `0.9`.
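For reference, with momentum $\mu$ (and the default dampening of 0, no Nesterov), `torch.optim.SGD` keeps a velocity buffer $v \leftarrow \mu v + g$ and updates $w \leftarrow w - \text{lr} \cdot v$. The sketch below replays three steps by hand on a toy loss $\sum w^2$ and compares them with the optimizer. The names `w_manual` and `v` and the toy loss are assumptions made for illustration.

```python
import torch
from torch.optim import SGD

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr, mu = 0.01, 0.9
sgd = SGD([w], lr=lr, momentum=mu)

w_manual = w.detach().clone()    # hand-tracked copy of the weights
v = torch.zeros_like(w_manual)   # velocity buffer

for _ in range(3):
    sgd.zero_grad()
    (w ** 2).sum().backward()
    g_manual = 2 * w_manual      # gradient of sum(w^2)
    v = mu * v + g_manual
    w_manual = w_manual - lr * v
    sgd.step()

print(torch.allclose(w.detach(), w_manual))  # True
```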

In [159]:
W1 = torch.rand(10, 20, requires_grad=True)
W2 = torch.rand(20, 1, requires_grad=True)
sgd = SGD([W1, W2], lr=0.01, momentum=0.9)

for i in range(1000):
    sgd.zero_grad()

    y = F.relu(X.mm(W1)).mm(W2)
    loss = rse_loss(y, y_true)/50

    loss.backward()
    print('Gradient sum: {0:.4f}\nLoss: {1:.4f}'.format(W1.grad.abs().sum().item(), loss.item()))
    sgd.step()

Gradient sum: 45.6538
Loss: 20.7527
Gradient sum: 43.3473
Loss: 19.4472
Gradient sum: 39.7143
Loss: 17.0063
Gradient sum: 36.0490
Loss: 13.6116
Gradient sum: 32.5564
Loss: 9.4453
Gradient sum: 29.1474
Loss: 4.7003
Gradient sum: 16.3909
Loss: 1.2550
Gradient sum: 30.0169
Loss: 4.8444
Gradient sum: 31.5699
Loss: 7.3291
Gradient sum: 32.3008
Loss: 8.5187
Gradient sum: 32.0718
Loss: 8.5748
Gradient sum: 30.9307
Loss: 7.6639
Gradient sum: 29.1122
Loss: 5.9482
Gradient sum: 26.6150
Loss: 3.6000
Gradient sum: 11.7205
Loss: 1.1451
Gradient sum: 23.2191
Loss: 2.6393
Gradient sum: 25.0427
Loss: 4.3725
Gradient sum: 25.3926
Loss: 5.1605
Gradient sum: 25.0413
Loss: 5.0949
Gradient sum: 23.9928
Loss: 4.3047
Gradient sum: 22.1355
Loss: 2.9359
Gradient sum: 14.2979
Loss: 1.3105
Gradient sum: 18.3267
Loss: 1.6309
Gradient sum: 22.0279
Loss: 2.7988
Gradient sum: 22.4641
Loss: 3.3183
Gradient sum: 22.0168
Loss: 3.1936
Gradient sum: 20.6669
Loss: 2.5274
Gradient sum: 16.2149
Loss: 1.5110
Gradient sum: 6.
...
Loss: 0.9655
Gradient sum: 0.0056
Loss: 0.9655
Gradient sum: 0.0055
Loss: 0.9655
Gradient sum: 0.0055
Loss: 0.9655
Gradient sum: 0.0055
Loss: 0.9655
Gradient sum: 0.0055
Loss: 0.9655
Gradient sum: 0.0055
Loss: 0.9655
Gradient sum: 0.0054
Loss: 0.9655
Gradient sum: 0.0054
Loss: 0.9655
Gradient sum: 0.0054
Loss: 0.9655
Gradient sum: 0.0054
Loss: 0.9655
Gradient sum: 0.0054
Loss: 0.9655
Gradient sum: 0.0053
Loss: 0.9655
Gradient sum: 0.0053
Loss: 0.9655
Gradient sum: 0.0053
Loss: 0.9655
Gradient sum: 0.0053
Loss: 0.9655
Gradient sum: 0.0052
Loss: 0.9655
Gradient sum: 0.0052
Loss: 0.9655
Gradient sum: 0.0052
Loss: 0.9655
Gradient sum: 0.0052
Loss: 0.9655
Gradient sum: 0.0052
Loss: 0.9655
Gradient sum: 0.0051
Loss: 0.9655
Gradient sum: 0.0051
Loss: 0.9655
Gradient sum: 0.0051
Loss: 0.9655
Gradient sum: 0.0051
Loss: 0.9655
Gradient sum: 0.0051
Loss: 0.9655
Gradient sum: 0.0050
Loss: 0.9655
Gradient sum: 0.0050
Loss: 0.9655
Gradient sum: 0.0050
Loss: 0.9655
Gradient sum: 0.0050
Loss: 0.9655
...

---
## Module

At this point, you already have everything you need to start training your network. A module is simply a handy way to organize your code into a separate unit and to provide the optimizer with a list of parameters:

In [184]:
import torch.nn as nn
import torch.nn.init as init

class SimpleModule(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(SimpleModule, self).__init__()
        self.W = nn.Parameter(torch.empty(in_dim, out_dim))
        init.normal_(self.W)
        self.b = nn.Parameter(torch.zeros(out_dim))
        
    def forward(self, X):
        return torch.tanh(X.mm(self.W)) + self.b
    
model = SimpleModule(10, 1)
optimizer = Adam(model.parameters(), lr=0.1)
loss = nn.MSELoss()

x = torch.rand(1, 10)
y_true = torch.tensor([[5.0]])

for epoch in range(100):
    optimizer.zero_grad()
    
    y = model(x)
    l = loss(y, y_true)
    l.backward()
    
    optimizer.step()
    if (epoch+1) % 10 == 0:
        print('Epoch {0}, loss {1:.4f}, prediction {2:.2f}'.format(epoch+1, l.item(), y.item()))

Epoch 10, loss 9.7928, prediction 1.87
Epoch 20, loss 4.9344, prediction 2.78
Epoch 30, loss 2.0424, prediction 3.57
Epoch 40, loss 0.6333, prediction 4.20
Epoch 50, loss 0.1207, prediction 4.65
Epoch 60, loss 0.0061, prediction 4.92
Epoch 70, loss 0.0023, prediction 5.05
Epoch 80, loss 0.0062, prediction 5.08
Epoch 90, loss 0.0039, prediction 5.06
Epoch 100, loss 0.0011, prediction 5.03
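The list of parameters handed to the optimizer is collected automatically from attributes wrapped in `nn.Parameter`. The sketch below uses a hypothetical `TinyModule` (including a deliberately unregistered `scale` attribute, added only for contrast) to show what `named_parameters()` reports.

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):  # hypothetical module for illustration
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim))
        self.b = nn.Parameter(torch.zeros(out_dim))
        self.scale = torch.ones(out_dim)  # plain tensor: not registered as a parameter

    def forward(self, X):
        return torch.tanh(X.mm(self.W)) + self.b

model = TinyModule(10, 1)
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# W (10, 1)
# b (1,)
```

Only `W` and `b` would be updated by an optimizer built from `model.parameters()`; a plain tensor attribute such as `scale` is ignored.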
