# Introduction to PyTorch for former Torchies

In this tutorial, you will learn the following:

* Using torch Tensors, and the important differences from (Lua)Torch
* Using the autograd package
* Building a ConvNet
* Building a Recurrent Net


## Tensors 

Tensors behave almost exactly the same way in PyTorch as they do in Torch.

In [1]:
import torch
a = torch.FloatTensor(10, 20)
# creates tensor of size (10 x 20) with uninitialized memory

a = torch.randn(10, 20)
# creates a tensor with values drawn from a normal distribution with mean=0, var=1

print(a.size())

 10
 20
[torch.LongStorage of size 2]


### Inplace / Out-of-place

The first difference is that ALL in-place operations on a tensor have an `_` suffix.
For example, `add` is the out-of-place version, and `add_` is the in-place version.

In [2]:
a.fill_(3.5)
# a has now been filled with the value 3.5

b = a.add(4.0)
# a is still filled with 3.5
# new tensor b is returned with values 3.5 + 4.0 = 7.5

Some operations like `narrow` do not have in-place versions, and hence `.narrow_` does not exist.
Similarly, some operations like `fill_` do not have an out-of-place version, so `.fill` does not exist.
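
The naming rule can be checked directly. The following is a minimal sketch, assuming a recent PyTorch release (it relies on `.tolist()`, which the cells above do not use); the tensor shape is arbitrary.

```python
import torch

a = torch.ones(2, 3)

# out-of-place: returns a new tensor and leaves `a` untouched
b = a.add(4.0)

# in-place: modifies `a` and returns the very same tensor object
c = a.add_(1.0)

print(a.tolist())  # every entry is now 2.0
print(b.tolist())  # every entry is 5.0, computed before `a` was modified
print(c is a)      # the in-place call handed back `a` itself
```

Because in-place methods return the tensor they modify, calls such as `a.add_(1).mul_(2)` can be chained.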

### Zero Indexing

Another difference is that Tensors are zero-indexed. (Torch tensors are one-indexed.)

In [3]:
b = a[0,3] # select 1st row, 4th column from a

Tensors can also be indexed with Python's slicing.

In [4]:
b = a[:,3:5] # selects all rows, columns with index 3 and 4 (the end of a slice is exclusive)
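
A small sketch of zero-based indexing and slicing on a tensor whose entries make the selection easy to see. It assumes a recent PyTorch release (`torch.arange` with a float argument, `reshape`, `item`, and `tolist` are not used elsewhere in this tutorial).

```python
import torch

# 3 x 4 tensor holding 0..11, so each value reveals its own position
a = torch.arange(12.0).reshape(3, 4)

first = a[0, 0]    # first row, first column (index 1 in Lua Torch)
cols = a[:, 1:3]   # all rows, columns with index 1 and 2

print(first.item())   # 0.0
print(cols.tolist())  # [[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]]
```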

### No camel casing

The next small difference is that function names are no longer camelCase.
For example, `indexAdd` is now called `index_add_`.

In [5]:
x = torch.ones(5, 5)
print(x)


 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]



In [6]:
z = torch.Tensor(5, 2)
z[:,0] = 10
z[:,1] = 100
print(z)


  10  100
  10  100
  10  100
  10  100
  10  100
[torch.FloatTensor of size 5x2]



In [7]:
x.index_add_(1, torch.LongTensor([4,0]), z)
print(x)


 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
[torch.FloatTensor of size 5x5]
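
To make the semantics of `index_add_` concrete: along dimension 1, column `i` of `z` is added to column `index[i]` of `x`. The sketch below compares the call with an explicit loop. It assumes a recent PyTorch release, and the names `index` and `expected` are introduced here only for illustration.

```python
import torch

x = torch.ones(5, 5)
z = torch.zeros(5, 2)
z[:, 0] = 10
z[:, 1] = 100
index = torch.LongTensor([4, 0])

# reference result: add column i of z to column index[i] of a copy of x
expected = x.clone()
for i, col in enumerate(index.tolist()):
    expected[:, col] += z[:, i]

x.index_add_(1, index, z)
print(torch.equal(x, expected))  # True
```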



### Numpy Bridge

Converting a torch Tensor to a numpy array and vice versa is a breeze.
The torch Tensor and numpy array will share their underlying memory, and changing one will change the other.

#### Converting torch Tensor to numpy Array

In [8]:
a = torch.ones(5)
print(a)


 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]



In [9]:
b = a.numpy()
print(b)

[ 1.  1.  1.  1.  1.]


In [10]:
a.add_(1)
print(a)
print(b) # see how the numpy array changed in value


 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]

[ 2.  2.  2.  2.  2.]


#### Converting numpy Array to torch Tensor

In [11]:
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)  # shares memory with the numpy array
np.add(a, 1, out=a)
print(a)
print(b) # see how changing the np array changed the torch Tensor automatically

[ 2.  2.  2.  2.  2.]

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]



All CPU Tensors except CharTensor support conversion to NumPy and back.

### CUDA Tensors

CUDA Tensors are nice and easy in PyTorch, and they are much more consistent as well.
Transferring a tensor from the CPU to the GPU will retain its type.

In [None]:
# creates a LongTensor and transfers it 
# to GPU as torch.cuda.LongTensor
a = torch.LongTensor(10).fill_(3).cuda()
print(type(a))
b = a.cpu()
# transfers it to CPU, back to 
# being a torch.LongTensor
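
The cell above needs a CUDA device. A hedged variant that also runs on CPU-only machines guards the transfer with `torch.cuda.is_available()`; this is a sketch rather than the tutorial's own code, and it assumes `Tensor.type()` returns the type name as a string.

```python
import torch

a = torch.LongTensor(10).fill_(3)

# move to the GPU only when one is present
if torch.cuda.is_available():
    a = a.cuda()

# a CUDA LongTensor on GPU machines, a plain LongTensor otherwise
print(a.type())

# back on the CPU (a no-op if `a` never left it); the element type is unchanged
b = a.cpu()
print(b.type())
```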

## Autograd

Autograd is now a core torch package for automatic differentiation. 

It uses a tape-based system for automatic differentiation.

In the forward phase, the autograd tape records every operation it executes; in the backward phase, it replays those operations in reverse to compute gradients.

In autograd, we introduce a `Variable` class, which is a very thin wrapper around a `Tensor`. 
You can access the raw tensor through the `.data` attribute, and after the backward pass, the gradient w.r.t. this variable is accumulated into the `.grad` attribute.

There's one more class that is very important for the autograd implementation: `Function`. `Variable` and `Function` are interconnected and build up an acyclic graph that encodes the complete history of the computation. Each variable has a `.creator` attribute that references the function that created it (Variables created by the user have `None` as their `.creator`).

If you want to compute the derivatives, you can call `.backward()` on a `Variable`.
If the `Variable` is a scalar (i.e. it holds a one-element tensor), you don't need to pass any arguments to `backward()`. If it has more elements, you need to pass a `grad_output` argument that is a tensor of matching shape.

In [13]:
from torch.autograd import Variable
x = Variable(torch.ones(2, 2))
x  # notice the "Variable containing" line

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [14]:
x.data


 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [15]:
x.grad


 0  0
 0  0
[torch.FloatTensor of size 2x2]

In [16]:
x.creator is None  # we've created x ourselves

True

In [17]:
y = x + 2
y

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

In [18]:
y.creator
# y was created as a result of an operation, 
# so it has a creator

<torch.autograd.functions.basic_ops.AddConstant at 0x1079a80d0>

In [19]:
z = y * y * 3
z

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]

In [20]:
out = z.mean()
out

Variable containing:
 27
[torch.FloatTensor of size 1]

In [21]:
# let's backprop now
out.backward()

In [22]:
# print gradients d(out)/dx
x.grad


 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]
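
The value 4.5 can be checked by hand. Writing $x_i$ for one entry of `x` (the index $i$ is notation introduced here, not part of the code), `out` is the mean of the four entries of `z`:

$$
\text{out} = \frac{1}{4}\sum_{i} z_i, \qquad z_i = 3\,(x_i + 2)^2
$$

$$
\frac{\partial\,\text{out}}{\partial x_i} = \frac{1}{4}\cdot 6\,(x_i + 2) = \frac{3}{2}\,(x_i + 2), \qquad \left.\frac{\partial\,\text{out}}{\partial x_i}\right|_{x_i = 1} = 4.5
$$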

By default, gradient computation flushes all the internal buffers contained in the graph, so if you ever want to run the backward pass on some part of the graph twice, you need to pass in `retain_variables=True` during the first pass.

In [23]:
x = Variable(torch.ones(2, 2))
y = x + 2
y.backward(torch.ones(2, 2), retain_variables=True)
# the retain_variables flag will prevent the internal buffers from being freed
x.grad


 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [24]:
z = y * y
z

Variable containing:
 9  9
 9  9
[torch.FloatTensor of size 2x2]

In [25]:
gradient = torch.randn(2, 2)
# backpropagate a random gradient of the same shape as z

z.backward(gradient)
# this would fail if the first backward pass had not been
# called with retain_variables=True
x.grad


 -0.2229  13.6553
  2.7308   5.1657
[torch.FloatTensor of size 2x2]
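
The values printed above are not the gradient of `z` alone. `.grad` accumulates across backward passes, so each entry equals 1 (from the first pass) plus `2 * y * gradient = 6 * gradient`. Below is a minimal sketch of that accumulation, using a fixed gradient of 0.5 instead of a random one. It assumes a PyTorch version in which `Variable(..., requires_grad=True)` is accepted and `backward` takes a plain tensor, and it rebuilds the graph for the second pass instead of retaining it.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)

# first backward pass: d(x + 2)/dx = 1, so x.grad becomes all ones
(x + 2).backward(torch.ones(2, 2))

# second backward pass on a freshly built graph:
# d(y * y)/dx = 2 * y = 6, scaled by the incoming gradient 0.5, adds 3
y = x + 2
(y * y).backward(torch.ones(2, 2) * 0.5)

print(x.grad)  # every entry is 1 + 3 = 4
```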

## nn package

In [30]:
import torch.nn as nn

We've redesigned the nn package, so that it's much more intuitive and fully integrated with autograd. 
You no longer have to use Containers like ConcatTable or modules like CAddTable, or build and debug graphs with nngraph.
We will seamlessly use autograd to define our neural networks.

Debugging is intuitive using Python's pdb debugger, and **the debugger and stack traces stop at exactly where an error occurred.** What you see is what you get.

### Example 1: ConvNet

In [35]:
class MNISTConvNet(nn.Container):
    def __init__(self):
        # this is the place where you instantiate all your modules
        # you can later access them using the same names you've given them in here
        super(MNISTConvNet, self).__init__(
            conv1 = nn.Conv2d(1, 20, 5),
            pool1 = nn.MaxPool2d(2, 2),
            conv2 = nn.Conv2d(20, 50, 5),
            pool2 = nn.MaxPool2d(2, 2),
            fc1   = nn.Linear(800, 500),
            fc2   = nn.Linear(500, 10),
            relu  = nn.ReLU(),
            softmax = nn.LogSoftmax(),
        )
        
    # it's the forward function that defines the network structure
    # we're accepting only a single input in here, but if you want,
    # feel free to use more
    def forward(self, input):
        x = self.pool1(self.relu(self.conv1(input)))
        x = self.pool2(self.relu(self.conv2(x)))

        # in your model definition you can go full crazy and use arbitrary
        # python code to define your model structure
        # all these are perfectly legal, and will be handled correctly 
        # by autograd:
        # if x.gt(0).sum() > x.numel() / 2:
        #      ...
        # 
        # you can even do a loop and reuse the same module inside it
        # modules no longer hold ephemeral state, so you can use them
        # multiple times during your forward pass
        # e.g. in this example we're using relu multiple times (but
        # modules with parameters will have correct gradients w.r.t.
        # their weights as well)
        # while x.norm(2) < 10:
        #    x = self.conv1(x) 
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.softmax(x)
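
The input size of 800 for `fc1` follows from the shapes of the layers before it. The sketch below retraces that arithmetic for a 28 x 28 input in plain Python, assuming the default stride of 1 and no padding for `nn.Conv2d`; `conv_out` is a hypothetical helper written for this illustration, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                  # MNIST images are 28 x 28
size = conv_out(size, kernel=5)            # conv1: 28 -> 24
size = conv_out(size, kernel=2, stride=2)  # pool1: 24 -> 12
size = conv_out(size, kernel=5)            # conv2: 12 -> 8
size = conv_out(size, kernel=2, stride=2)  # pool2: 8 -> 4

flat_features = 50 * size * size           # conv2 produces 50 channels
print(flat_features)                       # 800
```

If the input resolution or a kernel size changes, the first argument of `fc1` has to change accordingly.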

Using this ConvNet is now intuitive.
You create an instance of it, run an input through it, and then backpropagate through the result.

In [36]:
net = MNISTConvNet()
net.conv1.add_forward_hook(lambda x, y: print(x))
# TODO: tutorial on using hooks
# all of nn supports only mini-batches, so the input carries a
# leading batch dimension: (batch, channels, height, width)
input = Variable(torch.randn(1, 1, 28, 28))
out = net(input)
print(out.size())

 1
 10
[torch.LongStorage of size 2]


In [40]:
target = Variable(torch.LongTensor([3]), requires_grad = False)
loss = nn.NLLLoss()
err = loss(out, target)
print(err)

Variable containing:
 2.3660
[torch.FloatTensor of size 1]



In [41]:
err.backward()

It's easy to access the gradients of individual layers.

In [44]:
print(net.conv1.weight.grad.size())

 20
 1
 5
 5
[torch.LongStorage of size 4]


In [45]:
print(net.conv1.weight.grad.norm())

0.49256013227


In [46]:
# add device id / location to print of tensor / variable

nn.Sequential

torch.nn.modules.container.Sequential
