# An introduction to PyTorch

PyTorch is a platform for deep learning in Python/C++. In this lecture we will focus on the Python side.

It provides tools for efficiently creating, training, testing and analyzing neural networks:

* Different types of layers (embedding, linear, convolutional, recurrent)
* Activation functions (tanh, relu, sigmoid, etc.)
* Gradient computation
* Optimizers (Adam, Adagrad, RMSprop, SGD, etc.)
* GPU-accelerated implementations

## Tensors

Let's start with some basics: tensors are similar to numpy arrays.

In [None]:
import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)

In [None]:
v1 = np.arange(10)
v2 = np.arange(10, 20)

print("v1: %s\n" % v1)
print("v2: %s\n" % v2)
print("Dot product: %d" % v1.dot(v2))

In [None]:
v1 = torch.arange(10)
v2 = torch.arange(10, 20)

print("v1: %s\n" % v1)
print("v2: %s\n" % v2)
print("Dot product: %d" % v1.dot(v2))

#### Setting values manually or randomly:

In [None]:
v3 = np.array([2, 4, 6, 8])
v4 = np.random.random(10)

print("v3: %s\n" % v3)
print("v4: %s\n" % v4)

In [None]:
v3 = torch.tensor([2, 4, 6, 8])
v4 = torch.rand(10)

print("v3: %s\n" % v3)
print("v4: %s\n" % v4)

#### You can also change a value inside the array manually

In [None]:
v4[1] = 0.1
print(v4)

#### Accessing values (indexing)

Indexing a single position returns a scalar, i.e. a 0-dimensional tensor:

In [None]:
v1 = torch.arange(10)

In [None]:
print(v1[0])
print(v1[0].shape)

`.item()` returns a Python number:

In [None]:
number = v1[0].item()
print(number)
print(isinstance(number, int))

## Converting

In [None]:
A = torch.eye(3)
A

In [None]:
# torch --> numpy
B = A.numpy()
B

In [None]:
# numpy --> torch
torch.from_numpy(np.eye(3))
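One detail worth knowing about these conversions: for CPU tensors, `.numpy()` and `torch.from_numpy()` share memory with the original object instead of copying it, and `from_numpy` keeps numpy's default `float64` dtype. The sketch below illustrates both points; the variable names are just for illustration.

```python
import numpy as np
import torch

t = torch.eye(3)
a = t.numpy()            # a shares memory with t (no copy is made)
t[0, 0] = 5.0            # modify the tensor in place...
shared_value = a[0, 0]   # ...and the numpy array sees the new value

b = np.eye(3)               # numpy defaults to float64
t64 = torch.from_numpy(b)   # the dtype is preserved: torch.float64
t32 = t64.float()           # explicit conversion to float32 (this one copies)

print(shared_value)
print(t64.dtype, t32.dtype)
```

If you need an independent copy, calling `.clone()` on the tensor (or `.copy()` on the array) before converting avoids the shared memory.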

## Elementwise operations

In [None]:
v1

In [None]:
v2

In [None]:
v1 + v2

In [None]:
v1 * v2

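Take care when dividing integer tensors! Depending on the PyTorch version, `/` on integer tensors either truncates to integers (older releases) or performs true division and returns floats (recent releases).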

In [None]:
v1 / v2 

In [None]:
x = v1.float()
y = v2.float()
x / y
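To make the intent explicit regardless of version, you can pick the operator that matches what you want. The sketch below assumes a recent PyTorch release, where `/` is true division and `//` is floor division on integer tensors.

```python
import torch

a = torch.arange(10)        # int64
b = torch.arange(10, 20)    # int64

true_div = a / b            # recent releases: true division, float result
floor_div = a // b          # floor division keeps the integer dtype

print(true_div.dtype)
print(floor_div.dtype)
print(floor_div)
```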

#### Operations with constants

In [None]:
x

In [None]:
x + 1

In [None]:
x ** 2

#### Matrices

In [None]:
m1 = torch.rand(5, 4)
m2 = torch.rand(4, 5)

print("m1: %s\n" % m1)
print("m2: %s\n" % m2)
print(m1.dot(m2))

Oops... `dot` can be misleading if you are used to numpy: in PyTorch it only accepts 1-D tensors. For matrix multiplication, call `mm` instead:

In [None]:
print(m1.mm(m2))

In [None]:
print(m1 @ m2)

What if I have batched data? It's better to use `.bmm()`! This is a common source of errors.

In [None]:
m1 = torch.rand(2, 5, 4)
m2 = torch.rand(2, 4, 5)

print(m1.bmm(m2))

For 3-D tensors, `@` behaves like `.bmm()`:

In [None]:
print(m1 @ m2)

What if I have even more dimensions?

In [None]:
m1 = torch.rand(2, 3, 5, 4)
m2 = torch.rand(2, 3, 4, 5)

print(m1.bmm(m2))

`.bmm` only works with 3-D tensors. We can use the more general `matmul` instead. In fact, the `@` operator is a shorthand for `matmul`.

In [None]:
print(m1.matmul(m2).shape)
print(m1.matmul(m2))

Another option is to use the powerful `einsum` function. Let's say our inputs have the following representation:
- `b` = batch size 
- `c` = channels
- `i` = `m1` timesteps
- `j` = `m2` timesteps
- `d` = hidden size

In [None]:
torch.einsum('bcid,bcdj->bcij', m1, m2)

See more about `einsum` here: https://pytorch.org/docs/master/generated/torch.einsum.html#torch.einsum
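As a quick sanity check, the `einsum` expression above should give the same result as `matmul` on the same 4-D inputs. A minimal sketch that compares the two:

```python
import torch

torch.manual_seed(0)
m1 = torch.rand(2, 3, 5, 4)   # (b, c, i, d)
m2 = torch.rand(2, 3, 4, 5)   # (b, c, d, j)

# contract over the shared index d, keep b, c, i and j
out_einsum = torch.einsum('bcid,bcdj->bcij', m1, m2)
out_matmul = m1 @ m2

print(out_einsum.shape)
print(torch.allclose(out_einsum, out_matmul, atol=1e-6))
```

The index string makes the contracted dimension explicit, which can be easier to read than remembering which axes `matmul` treats as batch dimensions.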

## Broadcasting

Broadcasting means doing some arithmetic operation with tensors of different ranks, as if the smaller one were expanded, or broadcast, to match the larger.

Let's experiment with a matrix (rank 2 tensor) and a vector (rank 1).

In [None]:
m = torch.rand(5, 4)
v = torch.arange(4)

In [None]:
print("m:", m)
print()
print("v:", v)
print()

In [None]:
m_plus_v = m + v
print("m + v:\n", m_plus_v)

Sanity check:

In [None]:
print("m[0] = %s\n" % m[0])
print("v = %s\n" % v)

row_sum = m[0] + v
print("m[0] + v = %s\n" % row_sum)
print("(m + v)[0] = %s" % m_plus_v[0])

We can also reshape tensors

In [None]:
v.shape

In [None]:
v

In [None]:
v = v.view(2, 2)
v

In [None]:
v = v.view(4, 1)
v

Note that shape `[4, 1]` is not broadcastable to match `[5, 4]`!

In [None]:
m + v

... but `[1, 4]` is!

In [None]:
v = v.view(1, 4)
m + v

### General Broadcast Semantics

See more here: https://pytorch.org/docs/master/notes/broadcasting.html

Two tensors are “broadcastable” if the following rules hold:

- Each tensor has at least one dimension.

- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
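The trailing-dimension rule can be written down as a few lines of plain Python. The helper below, `broadcast_shape`, is a hypothetical name introduced here for illustration (it is not part of PyTorch); it is cross-checked against `torch.broadcast_shapes`, which is available in recent PyTorch releases.

```python
import torch

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or None if incompatible."""
    result = []
    # walk both shapes from the trailing dimension backwards
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1  # missing dims act like 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            return None  # sizes differ and neither is 1
        result.append(b if a == 1 else a)
    return tuple(reversed(result))

print(broadcast_shape((5, 4), (4,)))               # matrix + vector
print(broadcast_shape((5, 4), (4, 1)))             # incompatible
print(broadcast_shape((5, 3, 4, 1), (3, 1, 1)))

# cross-check against PyTorch's own implementation
print(tuple(torch.broadcast_shapes((5, 3, 4, 1), (3, 1, 1))))
```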

In [None]:
x = torch.rand(5,7,3)
y = torch.rand(5,7,3)
z = x + y
# same shapes are always broadcastable (i.e. the above rules always hold)

In [None]:
x = torch.rand((0,))
y = torch.rand(2,2)
z = x + y
# x and y are not broadcastable, because x does not have at least 1 dimension

In [None]:
# can line up trailing dimensions
x = torch.empty(5,3,4,1)
y = torch.empty(  3,1,1)
z = x + y
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

In [None]:
# but:
x = torch.empty(5,2,4,1)
y = torch.empty(  3,1,1)
z = x + y
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

Always take care with tensor shapes! It is a good practice to verify in the interpreter how some expression is evaluated before inserting into your model code. 

In other words, **you can use PyTorch's dynamic graph creation ability to debug your model by printing tensor shapes!**

## Useful Functions

PyTorch (and other libraries) have many functions that operate on tensors. Let's try some of them and plot the results.

In [None]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as pl

Create a vector x with values from -10 to 10, and intervals of 0.1.

In [None]:
x = torch.arange(-10, 10, 0.1, dtype=torch.float)

In [None]:
x.shape

The `.numpy()` method converts PyTorch tensors to numpy arrays. It is necessary to plot with matplotlib.

In [None]:
y = x.sin()
pl.plot(x.numpy(), y.numpy())

Hyperbolic tangent

In [None]:
y = x.tanh()
pl.plot(x.numpy(), y.numpy())

$e^x$ 

In [None]:
y = x.exp()
pl.plot(x.numpy(), y.numpy())

In [None]:
# log is undefined for x <= 0: those entries become nan (or -inf at 0) and are not drawn
y = torch.log(x)
pl.plot(x.numpy(), y.numpy())

# But what about GPUs?
How do I use a GPU?

In [None]:
my_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
my_device

If you have a GPU you should get something like: 
`device(type='cuda', index=0)`

In [None]:
# you can initialize a tensor on a specific device
torch.ones(5, device=my_device)

In [None]:
# you can move data to the GPU by doing .to(device)
# note: .to() returns a new tensor, it does not modify data in place
data = torch.eye(3)  # data is on the cpu
data = data.to(my_device)  # data now refers to the tensor on my_device

Now the computation happens on the GPU.

In [None]:
res = data + data
res

In [None]:
# you can get a tensor's device via the .device attribute
res.device
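A point that often trips people up is that `.to()` returns a tensor instead of modifying the original in place, so the result has to be assigned to a name. The sketch below shows this with a dtype conversion (which behaves the same way and also runs on machines without a GPU), followed by the usual device-agnostic pattern.

```python
import torch

data = torch.eye(3)                  # float32 on the cpu
converted = data.to(torch.float64)   # returns a new tensor

print(data.dtype)        # the original is unchanged
print(converted.dtype)

# same idea for devices: reassign to keep the moved tensor
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
data = data.to(device)
print(data.device)
```

Operations between tensors on different devices raise an error, so it is worth moving the model and all of its inputs to the same `device` up front.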

# Automatic differentiation with `autograd`

Central to all neural networks in PyTorch is the `autograd` package. 

We can say that it is the _true_ power behind PyTorch. The autograd package provides automatic differentiation for all operations on Tensors. It is a **define-by-run** framework, which means that your backprop is defined by how your code is run, and that **every single iteration can be different**.

Refs:
- https://pytorch.org/docs/stable/autograd.html
- https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

`torch.Tensor` is the central class of the package. If you set its attribute `.requires_grad` as `True`, it starts to track all operations on it. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into `.grad` attribute.

In [None]:
x = torch.tensor(2.)
print(x)

In [None]:
# setting requires_grad directly via the tensor's constructor
x = torch.tensor(2., requires_grad=True)

# or by setting .requires_grad attribute
# you can do this at any moment to track operations on x
x.requires_grad = True  

print(x)

In [None]:
print(x.requires_grad)
print(x.grad)  # no gradient yet

In [None]:
# let's perform a simple operation on x
y = x ** 2

print("Grad of x:", x.grad)

In [None]:
# if you want to compute the derivatives, you can call .backward() on a Tensor
y.backward()
print("Grad of y with respect to x:", x.grad)
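Note the word "accumulated" in the description of `.grad`: calling `.backward()` again adds to the stored gradient instead of overwriting it. A minimal sketch of this behaviour, using the same $y = x^2$ example:

```python
import torch

x = torch.tensor(2., requires_grad=True)

y = x ** 2
y.backward()
first = x.grad.item()        # dy/dx = 2x = 4

y = x ** 2                   # build the graph again
y.backward()
second = x.grad.item()       # 4 + 4 = 8: the gradients were summed

x.grad.zero_()               # reset the accumulated gradient in place
after_reset = x.grad.item()

print(first, second, after_reset)
```

This is why training loops typically reset gradients before each backward pass.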

To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history, and to prevent future computation from being tracked.

In [None]:
x = torch.tensor(2., requires_grad=True)
print(x)

y = x ** 2
print(y)

c = y.detach()  # c will be treated as a constant! c has the same contents as y but requires_grad=False
print(c)

z = c * y.exp()  
print(z)

z.backward()
print(x.grad)
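To see what `.detach()` changes, it helps to work out the derivative by hand. With $c$ treated as a constant, $z = c \cdot e^{x^2}$ gives $\frac{dz}{dx} = c \cdot e^{x^2} \cdot 2x = 16 e^4$ at $x = 2$. Without detaching, $z = y e^{y}$ gives $\frac{dz}{dx} = (1 + y) e^{y} \cdot 2x = 20 e^4$. The sketch below checks both values against autograd.

```python
import math
import torch

# with detach: c = y.detach() is a constant in the graph
x = torch.tensor(2., requires_grad=True)
y = x ** 2
z = y.detach() * y.exp()
z.backward()
grad_detached = x.grad.item()
expected_detached = 16 * math.exp(4)   # c * e^(x^2) * 2x with c = 4, x = 2

# without detach: both factors depend on x
x = torch.tensor(2., requires_grad=True)
y = x ** 2
z = y * y.exp()
z.backward()
grad_full = x.grad.item()
expected_full = 20 * math.exp(4)       # (1 + y) * e^y * 2x with y = 4, x = 2

print(grad_detached, expected_detached)
print(grad_full, expected_full)
```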

To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`. This can be particularly helpful when evaluating a model because the model may have trainable parameters with `requires_grad=True`, but for which we don’t need the gradients.

In [None]:
x = torch.tensor(2.)
x.requires_grad = True
print('x:', x)

y = x ** 2
print('y:', y)

with torch.no_grad():
    y = 2 * y
    print('x:', x)  # Try to think why x.requires_grad is True
    print('y:', y)

There’s one more class which is very important for autograd implementation - a `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph, that encodes a complete history of computation. Each tensor has a `.grad_fn` attribute that references a `Function` that has created the `Tensor` (except for `Tensor`s created by the user - their `grad_fn` is `None`).

====> Let's go back and see the `grad_fn` in our previous examples.
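A small sketch of what to look for: a tensor created by the user has no `grad_fn`, the result of a tracked operation has one, and results produced via `.detach()` or inside `torch.no_grad()` have none.

```python
import torch

x = torch.tensor(2., requires_grad=True)   # created by the user
y = x ** 2                                 # result of a tracked operation
z = y.detach()                             # cut off from the graph

print(x.grad_fn)   # None
print(y.grad_fn)   # a backward node for the power operation
print(z.grad_fn)   # None

with torch.no_grad():
    w = 2 * y      # not tracked
print(w.requires_grad, w.grad_fn)
```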

If you still don't believe autograd works, here's something that I think will change your mind --- we're going to compute the derivative of an unnecessarily complicated function:

$$ y(x) = \sum_{i} \left[ e^{0.001 x_i^2} + \sin(x_i^3) \cdot \log(x_i) \right] $$

In [None]:
def complicated_func(X):
    return torch.sum(torch.exp(0.001 * X ** 2) + torch.sin(X ** 3) * torch.log(X))

In [None]:
x = torch.arange(1, 10, 0.1, dtype=torch.float, requires_grad=True)
x

In [None]:
y = complicated_func(x)
y.backward()

In [None]:
x.grad

In [None]:
pl.plot(x.detach(), x.grad.detach())

### Concepts not covered in this lecture

PyTorch's `autograd` is a very powerful tool. For instance, it can calculate the Jacobian and Hessian of any given function! Here is a list of more advanced things that you can accomplish with `autograd`:

- Vector-Jacobian products for non-scalar outputs (e.g. when `y` is a vector)
- Compute Jacobian and Hessian
- Retain the computation graph (useful for inspecting gradients inside a model)
- Sparse gradients
- Register and remove hooks (useful for saving gradients)
- How to set up user-designed `Function`s properly
- Numerical gradient checking


More info: pytorch.org/docs/stable/autograd.html

### The interaction of `autograd` with `nn.Module`s and `nn.Parameter`s

In the next notebook we will see how to build a linear regression model using PyTorch's `nn.Module`. You will see that you don't need to worry about gradients when using `nn.Module` and `nn.Parameter`. This is because they automatically keep track of gradients for you.

In [None]:
# w.x + b
lin = torch.nn.Linear(2, 1, bias=True)  # nn.Linear is a nn.Module
lin.weight  # lin.weight is a nn.Parameter!

In [None]:
type(lin.weight)
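To see the tracking in action, the sketch below runs one forward and backward pass through a linear layer. Since the output is $w \cdot x + b$, the gradient with respect to the weight should equal the input and the gradient with respect to the bias should be 1. The input values are arbitrary and chosen only for illustration.

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(2, 1, bias=True)
inp = torch.tensor([[1., 2.]])      # a batch with a single example

out = lin(inp).sum()                # w.x + b, reduced to a scalar
out.backward()

print(lin.weight.requires_grad)     # parameters track gradients by default
print(lin.weight.grad)              # d(out)/dw equals the input
print(lin.bias.grad)                # d(out)/db equals 1
```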

**Exercise:**

Derive the gradient $\frac{\partial y}{\partial x}$ of the complicated function above and make a function that computes it. Check that it gives the same as `x.grad`.