# An introduction to PyTorch

PyTorch is a platform for deep learning in Python or C++. In this lecture we will focus on the **Python** API.

# Tensors

Tensors are the elementary units of PyTorch. They are very similar to NumPy arrays.

In [None]:
import numpy as np
np.random.seed(0)

import torch
torch.manual_seed(0)

In [None]:
x = np.array([1.0, 2.0, 3.0])
y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

In [None]:
x

In [None]:
y

In [None]:
z = y ** 2
z

Broadly speaking, a tensor is like a NumPy array that can carry gradient information from the chain of operations applied to it. There are other differences (for example, tensors can live on a GPU), but this is the key distinction.
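
To make this distinction concrete, here is a minimal sketch (the variable names are only illustrative) that reduces `y ** 2` to a scalar, calls `.backward()`, and reads the gradient stored on the tensor. The NumPy array has no such attribute.

```python
import numpy as np
import torch

# A numpy array only stores values.
x = np.array([1.0, 2.0, 3.0])

# A tensor created with requires_grad=True also records the operations applied to it.
y = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = (y ** 2).sum()   # scalar result, so backward() needs no extra arguments
z.backward()         # populates y.grad with dz/dy = 2 * y

print(hasattr(x, "grad"))  # numpy arrays have no gradient attribute
print(y.grad)
```

Autograd is covered in more detail later in this notebook.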

## Creating tensors 

In [None]:
# directly from data
data = [[0, 1], [1, 0]]
x_data = torch.tensor(data)
x_data

In [None]:
# from a numpy array
x_numpy = np.array([[1, 2], [3, 4]])
x_torch = torch.from_numpy(x_numpy)
x_torch

In [None]:
# convert it back to a numpy array
x_numpy = x_torch.numpy()
x_numpy
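
One detail worth knowing: `torch.from_numpy` and `.numpy()` do not copy the data, so the array and the tensor share the same memory. The sketch below (with illustrative names `a`, `t`, `c`) contrasts this with `torch.tensor`, which copies.

```python
import numpy as np
import torch

a = np.array([[1, 2], [3, 4]])
t = torch.from_numpy(a)   # shares memory with `a`
c = torch.tensor(a)       # torch.tensor copies the data

a[0, 0] = 100             # modify the numpy array in place

print(t[0, 0].item())     # the shared tensor sees the change
print(c[0, 0].item())     # the copy is unaffected
```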

In [None]:
# with constant data
x = torch.ones(2, 3)  # 2 rows and 3 columns
print(x)
y = torch.zeros(3, 2) # 3 rows and 2 columns
print(y)
z = torch.full((3, 1), -5)  # 3 rows and 1 column (aka column vector)
print(z)

In [None]:
# with random data
x = torch.rand(2, 3)  # uniform distribution U(0, 1)
print(x)
y = torch.randn(2, 3)  # standard gaussian N(0, 1)
print(y)
z = torch.randint(0, 10, size=(2, 3))  # random integers [0, 10)
print(z)

In [None]:
# other initializations
print(torch.arange(5))  # from 0 (inclusive) to 5 (exclusive)
print(torch.arange(2, 8))  # from 2 (inclusive) to 8 (exclusive)
print(torch.arange(2, 8, 2))  # from 2 (inclusive) to 8 (exclusive), with stepsize=2

print(torch.linspace(0, 1, 6))  # returns 6 linearly spaced numbers from 0 to 1 (inclusive)
print(torch.linspace(-1, 1, 8))  # returns 8 linearly spaced numbers from -1 to 1 (inclusive)

print(torch.eye(3))  # identity matrix

See the full set of creation ops [here](https://pytorch.org/docs/stable/torch.html#creation-ops).

## Tensor attributes

In [None]:
x = torch.rand(3, 4, requires_grad=True)
print(x.device)
print(x.shape)
print(x.dtype)
print(x)
print(x.data)
print(x[0, 0])
print(x[0, 0].item())

Tensor data types:

<table class="docutils colwidths-auto align-default">
<thead>
<tr class="row-odd"><th class="head"><p>Data type</p></th>
<th class="head"><p>dtype</p></th>
<th class="head"><p>Legacy Constructors</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>32-bit floating point</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.float32</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.float</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.FloatTensor</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>64-bit floating point</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.float64</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.double</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.DoubleTensor</span></code></p></td>
</tr>
<tr class="row-even"><td><p>64-bit complex</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.complex64</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.cfloat</span></code></p></td>
<td></td>
</tr>
<tr class="row-odd"><td><p>128-bit complex</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.complex128</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.cdouble</span></code></p></td>
<td></td>
</tr>
<tr class="row-even"><td><p>16-bit floating point (IEEE half precision: 5 exponent bits, 10 significand bits)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.float16</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.half</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.HalfTensor</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>16-bit floating point (bfloat16, "brain floating point": 8 exponent bits, 7 significand bits)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.bfloat16</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.BFloat16Tensor</span></code></p></td>
</tr>
<tr class="row-even"><td><p>8-bit integer (unsigned)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.uint8</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.ByteTensor</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>8-bit integer (signed)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.int8</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.CharTensor</span></code></p></td>
</tr>
<tr class="row-even"><td><p>16-bit integer (signed)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.int16</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.short</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.ShortTensor</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>32-bit integer (signed)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.int32</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.int</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.IntTensor</span></code></p></td>
</tr>
<tr class="row-even"><td><p>64-bit integer (signed)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.int64</span></code> or <code class="docutils literal notranslate"><span class="pre">torch.long</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.LongTensor</span></code></p></td>
</tr>
<tr class="row-odd"><td><p>Boolean</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.bool</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">torch.*.BoolTensor</span></code></p></td>
</tr>
</tbody>
</table>


When an operation mixes dtypes, the result is promoted according to the usual Python ordering:
```
complex > floating > integral > boolean
```
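
A quick way to see this ordering in action is to add tensors of different dtypes and inspect the dtype of the result. This is a minimal sketch with illustrative tensor names; `torch.result_type` reports the promoted dtype without performing the operation.

```python
import torch

bool_t = torch.ones(2, dtype=torch.bool)
int_t = torch.ones(2, dtype=torch.int32)
float_t = torch.ones(2, dtype=torch.float32)
complex_t = torch.ones(2, dtype=torch.complex64)

print((bool_t + int_t).dtype)       # integral wins over boolean
print((int_t + float_t).dtype)      # floating wins over integral
print((float_t + complex_t).dtype)  # complex wins over floating

# torch.result_type reports the promoted dtype without computing anything
print(torch.result_type(int_t, float_t))
```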

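Also, be careful with explicit casts: casting to a narrower dtype of the same category (e.g., `long` to `int`) can overflow, and casting from floating point to integer drops the fractional part: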

In [None]:
float_tensor = torch.randn(2, 2, dtype=torch.float)
int_tensor = torch.ones(1, dtype=torch.int)
long_tensor = torch.ones(1, dtype=torch.long)
uint_tensor = torch.ones(1, dtype=torch.uint8)

In [None]:
long_tensor_big_number = long_tensor * 2**33
long_tensor_big_number, long_tensor_big_number.int()

In [None]:
float_tensor, float_tensor.long()
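
One way to guard against the overflow above is to check the representable range of the target dtype before casting. The sketch below uses `torch.iinfo` for this; the names `big`, `info` and `fits` are only illustrative.

```python
import torch

big = torch.tensor([2**33], dtype=torch.long)
info = torch.iinfo(torch.int32)  # min/max representable values of int32

# True only if every element can be stored in an int32 without overflow
fits = bool(((big >= info.min) & (big <= info.max)).all())
print(info.min, info.max, fits)

# float -> integer casts truncate toward zero
f = torch.tensor([-1.7, 0.2, 1.9])
print(f.long())
```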

See the full list of attributes [here](https://pytorch.org/docs/stable/tensor_attributes.html)

## Examples

In [None]:
# scalar
x = torch.tensor(2)
print(x)
print(x.shape)
print(x.item())  # access the (single) element inside the tensor
print('')

# vector
x = torch.rand(4)
print(x)
print(x.shape)
print('')

# matrix
x = torch.rand(4, 3)
print(x)
print(x.shape)
print('')

# n-dimensional array
x = torch.rand(3, 4, 3)  # e.g., image with height=3, width=4, and channels=3
print(x)
print(x.shape)
print('')

from matplotlib import pyplot as plt
plt.imshow(x.numpy())  # imshow expects an array of shape (height, width, channels)

## Tensor operations

In [None]:
v1 = torch.arange(8)
v2 = torch.arange(10, 18)

print("v1: %s" % v1)
print("v2: %s" % v2)
print("Dot product: %d" % v1.dot(v2))

#### You can also change a value inside the tensor manually

In [None]:
v2[1] = 25
print(v2)

**Accessing values:**

Individual tensor elements are scalars, i.e., 0-dimensional tensors:

In [None]:
print(v1[0])
print(v1[0].shape)

`.item()` returns a Python number:

In [None]:
number = v1[0].item()
print(number)
print(isinstance(number, int))

**Numpy-style indexing:**

In [None]:
m = torch.randn(3, 4, 3)
m

In [None]:
m[0,1,0]

In [None]:
m[:, 1, 0]

In [None]:
m[0, :, -1]

In [None]:
m[:, :, -1]

In [None]:
m[..., -1]

## Elementwise operations

In [None]:
v1

In [None]:
v2

In [None]:
v1 + v2

In [None]:
v1 * v2

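Some caveats when working with integer values! In recent PyTorch versions, `/` always performs true division and returns a floating-point tensor, even when both operands are integer tensors (older versions performed integer division instead):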

In [None]:
v1 / v2 

In [None]:
x = v1.float()
y = v2.float()
x / y
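
If you actually want integer division, you have to ask for it explicitly. The sketch below assumes a reasonably recent PyTorch release, in which `/` performs true division and `torch.div` accepts a `rounding_mode` argument.

```python
import torch

a = torch.arange(8)          # int64
b = torch.arange(10, 18)     # int64

true_div = a / b                                     # floating-point result
floor_div = a // b                                   # integer result
trunc_div = torch.div(a, b, rounding_mode="trunc")   # integer result, rounds toward zero

print(true_div.dtype, floor_div.dtype, trunc_div.dtype)
```

For non-negative values, as here, flooring and truncating give the same result; they differ only when the quotient is negative.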

#### Operations with constants

In [None]:
x

In [None]:
x + 1

In [None]:
x ** 2

## Aggregating tensors

In [None]:
(x ** 2).sum().sqrt()

In [None]:
x.mean(), x.std()

In [None]:
x.min(), x.max()

In [None]:
x.norm(p=3)

## Joining tensors

In [None]:
torch.cat([x, y])

In [None]:
z = torch.stack([x, y])
z

In [None]:
torch.vstack([z, x])

## Tensor multiplication

In [None]:
m1 = torch.rand(5, 4)
m2 = torch.rand(4, 5)

print("m1: %s\n" % m1)
print("m2: %s\n" % m2)
print(m1.dot(m2))

Oops... that can be misleading if you are used to numpy. In PyTorch, `dot` is reserved for vectors only.
For matrices, call `mm`:

In [None]:
print(m1.mm(m2))

Or use Python's built-in matrix multiplication operator `@`:

In [None]:
print(m1 @ m2)

What if I have batched data? It's better to use `.bmm()` (this is a common source of error)

In [None]:
m1 = torch.rand(2, 5, 4)
m2 = torch.rand(2, 4, 5)

print(m1.bmm(m2))

`@` will work as `.bmm()`!

In [None]:
print(m1 @ m2)

What if I have even more dimensions?

In [None]:
m1 = torch.rand(2, 3, 5, 4)
m2 = torch.rand(2, 3, 4, 5)

print(m1.bmm(m2))

`.bmm` works only with 3d tensors. For higher dimensionalities, we can use the more general `matmul`. In fact, the `@` operator is a shorthand for `matmul` (which is implemented in the magic method `__matmul__`).

In [None]:
print(m1.matmul(m2).shape)
print(m1.matmul(m2))
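
`matmul` also broadcasts the leading (batch) dimensions, which `.bmm` does not. As a small sketch with illustrative names, a single matrix `b` can be multiplied against a whole batch of matrices `a`:

```python
import torch

a = torch.rand(2, 3, 5, 4)
b = torch.rand(4, 5)          # a single matrix, no batch dimensions

out = a @ b                   # b is broadcast across the (2, 3) batch dimensions
print(out.shape)

# each batch entry equals an ordinary 2-D product
print(torch.allclose(out[1, 2], a[1, 2].mm(b), atol=1e-6))
```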

Another option is to use the powerful `einsum` function. Let's say our inputs have the following dimensions:
- `b` = batch size 
- `c` = channels
- `i` = `m1` timesteps
- `j` = `m2` timesteps
- `d` = hidden size

In [None]:
torch.einsum('bcid,bcdj->bcij', m1, m2)

See more about `einsum` here: https://pytorch.org/docs/master/generated/torch.einsum.html#torch.einsum
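
To convince ourselves that the `einsum` expression above is the same batched product, here is a small check against `@`. The index `d` appears in both inputs but not in the output, so it is summed over, while `b` and `c` are kept as batch indices.

```python
import torch

m1 = torch.rand(2, 3, 5, 4)   # (b, c, i, d)
m2 = torch.rand(2, 3, 4, 5)   # (b, c, d, j)

via_einsum = torch.einsum('bcid,bcdj->bcij', m1, m2)
via_matmul = m1 @ m2

print(via_einsum.shape)
print(torch.allclose(via_einsum, via_matmul, atol=1e-6))
```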

## Broadcasting

Broadcasting means doing some arithmetic operation with tensors of different ranks, as if the smaller one were expanded, or broadcast, to match the larger.

Let's experiment with a matrix (rank 2 tensor) and a vector (rank 1).

In [None]:
m = torch.rand(5, 4)
v = torch.arange(4)

In [None]:
print("m:", m)
print("v:", v)

In [None]:
m_plus_v = m + v
print("m + v:\n", m_plus_v)

Sanity check:

In [None]:
print("m[0] = %s\n" % m[0])
print("v = %s\n" % v)

row_sum = m[0] + v
print("m[0] + v = %s\n" % row_sum)
print("(m + v)[0] = %s" % m_plus_v[0])

We can also reshape tensors

In [None]:
v.shape

In [None]:
v

In [None]:
v = v.view(2, 2)
v

In [None]:
v = v.view(4, 1)
v

Note that shape `[4, 1]` is not broadcastable to match `[5, 4]`!

In [None]:
m + v

... but `[1, 4]` is!

In [None]:
v = v.view(1, 4)
m + v

## Squeezing and Unsqueezing

Broadcasting is one of the most important concepts for manipulating n-dimensional arrays. PyTorch offers some ways of expanding the rank of a tensor. 

In [None]:
v = torch.rand(4).view(1, 4, 1)
print(v)
print(v.shape)

In [None]:
v.squeeze().shape  # remove all dimensions of size 1

In [None]:
v.squeeze(0).shape  # remove only dimension 0 (a no-op if its size is not 1)

In [None]:
v.unsqueeze(1).shape  # insert a new dimension of size 1 at index 1 (0-indexed)

In [None]:
# using numpy notation (better since it explicitly says where a new dimension is being created)
v[:, None].shape

In [None]:
v.unsqueeze(1).unsqueeze(-1).unsqueeze(1).shape  # what does unsqueeze(1).unsqueeze(1) do?

In [None]:
v[:, None, None, ..., None].shape

In [None]:
# we can also use .view(dims) as long as the specified dims are valid
v.view(1, 1, 1, 4, 1, 1).shape
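
A typical reason to add size-1 dimensions is to set up a broadcast. As an illustrative sketch (the tensor `x` and the name `diff` are made up for this example), here is how to compute all pairwise differences of a vector without a loop:

```python
import torch

x = torch.tensor([1.0, 2.0, 4.0])

# x[:, None] has shape (3, 1) and x[None, :] has shape (1, 3);
# broadcasting expands both to (3, 3)
diff = x[:, None] - x[None, :]

print(diff.shape)
print(diff)   # diff[i, j] == x[i] - x[j]
```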

## General Broadcast Semantics

Two tensors are “broadcastable” if the following rules hold:

- Each tensor has at least one dimension.

- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
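
The second rule can be written down as a few lines of plain Python. The helper `broadcast_shape` below is a hypothetical illustration, not part of PyTorch, and it does not check the first rule. The comparison with `torch.broadcast_shapes` assumes a PyTorch version that provides that function.

```python
from itertools import zip_longest

import torch


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # walk both shapes from the trailing dimension; missing dimensions count as 1
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"sizes {a} and {b} are not compatible")
    return tuple(reversed(result))


print(broadcast_shape((5, 3, 4, 1), (3, 1, 1)))
print(tuple(torch.broadcast_shapes((5, 3, 4, 1), (3, 1, 1))))
```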

In [None]:
x = torch.rand(5,7,3)
y = torch.rand(5,7,3)
z = x + y
# same shapes are always broadcastable (i.e. the above rules always hold)

In [None]:
x = torch.rand((0,))
y = torch.rand(2,2)
print(x.shape)
z = x + y
# x and y are not broadcastable: x is empty (its only dimension has size 0), and 0 is neither equal to 2 nor 1

In [None]:
# can line up trailing dimensions
x = torch.empty(5,3,4,1)
y = torch.empty(  3,1,1)
z = x + y
# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

In [None]:
# but:
x = torch.empty(5,2,4,1)
y = torch.empty(  3,1,1)
z = x + y
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

Always take care of tensor shapes! It is good practice to check how an expression is evaluated before adding it to your codebase.

<!-- In other words, **you can use pytorch's dynamic graph creation ability to debug your model by printing tensor shapes!** -->

See more here: https://pytorch.org/docs/master/notes/broadcasting.html

## Useful Functions

PyTorch (like other libraries) has many functions that operate on tensors. Let's try some of them and plot the results.

In [None]:
import matplotlib.pyplot as plt

Create a vector x with values from -10 (inclusive) to 10 (exclusive), in steps of 0.1.

In [None]:
x = torch.arange(-10, 10, 0.1, dtype=torch.float)

In [None]:
x.shape

In [None]:
y = x.sin()
plt.plot(x.numpy(), y.numpy())

In [None]:
y = x.tanh()
plt.plot(x.numpy(), y.numpy())

In [None]:
y = x.exp()
plt.plot(x.numpy(), y.numpy())

In [None]:
y = torch.log(x)  # log is nan for negative x, so only the positive part of the curve is drawn
plt.plot(x.numpy(), y.numpy())

# But what about GPUs?
How do I use a GPU?

In [None]:
my_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
my_device

If you have a GPU you should get something like: 
`device(type='cuda', index=0)`

In [None]:
# you can initialize a tensor on a specific device
torch.ones(5, device=my_device)

In [None]:
# you can move data to the GPU by doing .to(device)
data = torch.eye(3)  # data is on the cpu
data = data.to(my_device)  # .to() is not in-place for tensors, so keep the returned tensor

Now the computation happens on the GPU (or on the CPU, if no GPU is available).

In [None]:
res = data + data
res

In [None]:
# you can get a tensor's device via the .device attribute
res.device
z = torch.arange(10)
z = z.to(res.device)
print(z.device)
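
Putting these pieces together, a common device-agnostic pattern looks like the sketch below (the names `device`, `a`, `b`, `c` are only illustrative). Note that `.to()` does not modify the tensor it is called on, and that tensors taking part in the same operation should live on the same device.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.eye(3)        # created on the CPU
b = a.to(device)        # .to() returns a tensor on `device`; `a` itself is unchanged

print(a.device, b.device)

# create the second operand directly on the same device as b
c = torch.ones(3, 3, device=b.device)
print((b + c).device)
```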

# Automatic differentiation with `autograd`

Central to all neural networks in PyTorch is the `autograd` package. 

We can say that it is the _true_ power behind PyTorch. The autograd package provides automatic differentiation for all operations on Tensors. It is a **define-by-run** framework, which means that your backprop is defined by how your code is run, and that **every single iteration can be different**.

`torch.Tensor` is the central class of the package. If you set its attribute `.requires_grad` to `True`, it starts to track all operations applied to it. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into the `.grad` attribute.

In [None]:
x = torch.tensor(2.)
print(x)

In [None]:
# setting requires_grad directly via the tensor's constructor
x = torch.tensor(2., requires_grad=True)

# or by setting .requires_grad attribute
# you can do this at any moment to track operations on x
x.requires_grad = True  

print(x)

In [None]:
print(x.requires_grad)
print(x.grad)  # no gradient yet

In [None]:
# let's perform a simple operation on x
y = x ** 2

print("Grad of x:", x.grad)

In [None]:
# if you want to compute the derivatives, you can call .backward() on a Tensor
y.backward()
print("Grad of y with respect to x:", x.grad)
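
The word "accumulated" matters here: each call to `.backward()` adds to `.grad` instead of overwriting it. A minimal sketch with illustrative names `first` and `second`:

```python
import torch

x = torch.tensor(2., requires_grad=True)

(x ** 2).backward()
first = x.grad.item()      # d(x^2)/dx = 2x = 4

(x ** 3).backward()
second = x.grad.item()     # previous 4 plus d(x^3)/dx = 3x^2 = 12

x.grad.zero_()             # reset in place before the next backward pass
print(first, second, x.grad.item())
```

This is why training loops reset the gradients before each backward pass.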

To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history, and to prevent future computation from being tracked.

In [None]:
x = torch.tensor(2., requires_grad=True)
print(x)

y = x ** 2
print(y)

c = y.detach()  # c will be treated as a constant! c has the same contents as y but requires_grad=False
print(c)

z = c * y.exp()  
print(z)

z.backward()
print(x.grad)
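
We can verify the result by hand. Since `c` is treated as a constant, the chain rule gives dz/dx = c · exp(y) · dy/dx = c · exp(x²) · 2x. The sketch below recomputes the example and compares `x.grad` with this expression (the name `expected` is illustrative).

```python
import math

import torch

x = torch.tensor(2., requires_grad=True)
y = x ** 2
c = y.detach()            # constant with value 4
z = c * y.exp()
z.backward()

# with c held constant: dz/dx = c * exp(y) * dy/dx = c * exp(x^2) * 2x
expected = 4.0 * math.exp(4.0) * 2 * 2.0
print(x.grad.item(), expected)

# without detach, the product rule would give (1 + y) * exp(y) * 2x instead
```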

To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`. This can be particularly helpful when evaluating a model, because the model may have trainable parameters with `requires_grad=True` for which we don't need the gradients.

In [None]:
x = torch.tensor(2.)
x.requires_grad = True
print('x:', x)

y = x ** 2
print('y:', y)

with torch.no_grad():
    y = 2 * y
    print('x:', x)  # Try to think why x.requires_grad is True
    print('y:', y)
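
To see what `torch.no_grad()` changes, compare the same operation inside and outside the block. In this sketch, `w` is an illustrative name for the result computed without tracking.

```python
import torch

x = torch.tensor(2., requires_grad=True)

y = x ** 2                 # tracked as usual
with torch.no_grad():
    w = x ** 2             # same values, but no history is recorded

print(y.requires_grad, y.grad_fn is not None)  # tracked
print(w.requires_grad, w.grad_fn is None)      # not tracked
print(x.requires_grad)                         # the leaf tensor itself is unchanged
```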

There’s one more class which is very important for autograd implementation - a `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph, that encodes a complete history of computation. Each tensor has a `.grad_fn` attribute that references a `Function` that has created the `Tensor` (except for `Tensor`s created by the user - their `grad_fn` is `None`).

Let's go back and see the `grad_fn` in our previous example:
```
input -> x -> Pow(2) -> y -> Exp() -> Mul(constant) -> output
```

We can create a `Function` and manually define its gradient (this is particularly useful for originally non-differentiable operations)

In [None]:
class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result
    
    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method:
x = torch.arange(4, dtype=torch.float, requires_grad=True)  # float input, so gradients can be tracked
output = Exp.apply(x)
output
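
To check that the hand-written `backward` is correct, we can compare the gradient it produces with the one PyTorch computes for the built-in `torch.exp`. This is only a sketch: the class is repeated so that the block is self-contained, and `custom_grad` and `builtin_grad` are illustrative names.

```python
import torch


class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result


x = torch.arange(4, dtype=torch.float, requires_grad=True)

# gradient through our custom Function
Exp.apply(x).sum().backward()
custom_grad = x.grad.clone()

# gradient through the built-in op, after resetting the accumulated gradient
x.grad.zero_()
torch.exp(x).sum().backward()
builtin_grad = x.grad.clone()

print(torch.allclose(custom_grad, builtin_grad))
```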

If you still don't believe autograd works, here's something that I think will change your mind. We're going to compute the derivative of an unnecessarily complicated function:

$$ y(x) = \sum_{x_i} \left[ e^{0.001 x_i^2} + \sin(x_i^3) \cdot \log(x_i) \right] $$

In [None]:
def complicated_func(X):
    return torch.sum(torch.exp(0.001 * X ** 2) + torch.sin(X ** 3) * torch.log(X))

In [None]:
x = torch.arange(1, 10, 0.1, dtype=torch.float, requires_grad=True)
x

In [None]:
y = complicated_func(x)
y.backward()

In [None]:
x.grad

### Concepts not covered in this lecture

PyTorch's `autograd` is a very powerful tool. For instance, it can calculate the Jacobian and Hessian of any given function! Here is a list of more advanced things that you can accomplish with `autograd`:

- Vector-Jacobian products for non-scalar outputs (e.g., when `y` is a vector)
- Compute Jacobian and Hessian
- Retain the computation graph (useful for inspecting gradients inside a model)
- Sparse gradients
- Register and remove hooks (useful for saving gradients)
- How to set up user-designed `Function`s properly
- Numerical gradient checking


More info: https://pytorch.org/docs/stable/autograd.html

### The interaction of `autograd` with `nn.Module`s and `nn.Parameter`s

In the next notebook we will see how to build a linear regression model using PyTorch's `nn.Module`. You will see that you don't need to worry about gradients when using `nn.Module` and `nn.Parameter`. This is because they automatically keep track of gradients for you.

In [None]:
# w.x + b
lin = torch.nn.Linear(2, 1, bias=True)  # nn.Linear is a nn.Module
lin.weight  # lin.weight is a nn.Parameter!

In [None]:
type(lin.weight)
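
As a small preview, the sketch below shows that the parameters of an `nn.Linear` require gradients by default and receive a `.grad` after a backward pass. The input batch and the loss are arbitrary choices made only for illustration.

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(2, 1, bias=True)

# parameters are registered on the module and require gradients by default
for name, param in lin.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

x = torch.rand(4, 2)            # a batch of 4 inputs with 2 features each
loss = lin(x).pow(2).mean()     # an arbitrary scalar loss, for illustration only
loss.backward()

print(lin.weight.grad.shape, lin.bias.grad.shape)
```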

---

<center>
    <b>Exercise:</b> Derive the gradient 
    <br><br>
    $$
    \dfrac{\partial \sum_{x_i} \big[ e^{0.001 x_i^2} + \sin(x_i^3) \cdot \log(x_i) \big]}{\partial x}
    $$
    <br>
    and make a function that computes it. Check that it gives the same output as `x.grad` in our previous example.
</center>