# Introduction to PyTorch for former Torchies

In this tutorial, you will learn the following:

1. Using torch Tensors, and important differences from (Lua)Torch
2. Using the autograd package
3. Building neural networks
  - Building a ConvNet
  - Building a Recurrent Net
  - Using multiple GPUs


## Tensors 

Tensors behave almost exactly the same way in PyTorch as they do in Torch.

In [83]:
import torch
a = torch.FloatTensor(10, 20)
# creates tensor of size (10 x 20) with uninitialized memory

a = torch.randn(10, 20)
# creates a tensor filled with samples from a normal distribution with mean=0, var=1

print(a.size())

 10
 20
[torch.LongStorage of size 2]


### Inplace / Out-of-place

The first difference is that ALL in-place tensor operations carry an `_` suffix.
For example, `add` is the out-of-place version, and `add_` is the in-place version.

In [84]:
a.fill_(3.5)
# a has now been filled with the value 3.5

b = a.add(4.0)
# a is still filled with 3.5
# new tensor b is returned with values 3.5 + 4.0 = 7.5

Some operations, like `narrow`, do not have in-place versions, so `.narrow_` does not exist.
Similarly, some operations, like `fill_`, do not have an out-of-place version, so `.fill` does not exist.
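
The naming rule can be checked directly. The sketch below is illustrative only: the variable names are made up, and it assumes a PyTorch release that provides `data_ptr()` and `.item()`. It shows that `add` allocates a new tensor while `add_` writes into the existing storage.

```python
import torch

a = torch.ones(2, 3)

# Out-of-place: `a` is untouched and a new tensor is returned.
b = a.add(4.0)

# In-place: `a` itself is modified, and the result refers to the same storage.
c = a.add_(1.0)

print(a.sum().item(), b.sum().item())   # 12.0 30.0
print(c.data_ptr() == a.data_ptr())     # True
print(b.data_ptr() == a.data_ptr())     # False
```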

### Zero Indexing

Another difference is that Tensors are zero-indexed. (In Torch, tensors are one-indexed.)

In [85]:
b = a[0,3] # select 1st row, 4th column from a

Tensors can also be indexed with Python's slicing:

In [86]:
b = a[:,3:5] # selects all rows, columns at indices 3 and 4 (the end index 5 is exclusive)
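
A small, self-contained sketch of both indexing styles follows. It assumes a recent PyTorch release (for `torch.arange` with a float argument and `.item()`), and the tensor contents are chosen only so that the selected values are easy to read off.

```python
import torch

a = torch.arange(20.0).view(4, 5)   # rows 0..3, columns 0..4, values 0..19

first = a[0, 3].item()   # 1st row, 4th column (zero-indexed)
cols = a[:, 3:5]         # all rows, columns at indices 3 and 4

print(first)                 # 3.0
print(tuple(cols.size()))    # (4, 2)

# Basic slicing returns a view, so writing through it changes `a` as well.
cols.fill_(-1)
print(a[2, 4].item())        # -1.0
```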

### No camel casing

The next small difference is that function names are no longer camelCase.
For example, `indexAdd` is now called `index_add_`.

In [87]:
x = torch.ones(5, 5)
print(x)


 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]



In [88]:
z = torch.Tensor(5, 2)
z[:,0] = 10
z[:,1] = 100
print(z)


  10  100
  10  100
  10  100
  10  100
  10  100
[torch.FloatTensor of size 5x2]



In [89]:
x.index_add_(1, torch.LongTensor([4,0]), z)
print(x)


 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
 101    1    1    1   11
[torch.FloatTensor of size 5x5]
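
To make the semantics of `index_add_` explicit, the sketch below compares it with a plain Python loop. This is an illustration under the assumption of a recent PyTorch release, not the library's implementation: along dimension 1, column `index[i]` of the target receives column `i` of the source.

```python
import torch

x = torch.ones(5, 5)
z = torch.zeros(5, 2)
z[:, 0] = 10
z[:, 1] = 100
index = torch.LongTensor([4, 0])

# Reference result computed with an explicit loop.
expected = x.clone()
for i, col in enumerate(index.tolist()):
    expected[:, col] += z[:, i]

x.index_add_(1, index, z)
print(torch.equal(x, expected))  # True
```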



### Numpy Bridge

Converting a torch Tensor to a numpy array and vice versa is a breeze.
The torch Tensor and numpy array will share their underlying memory, and changing one will change the other.

#### Converting torch Tensor to numpy Array

In [90]:
a = torch.ones(5)
print(a)


 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]



In [91]:
b = a.numpy()
print(b)

[ 1.  1.  1.  1.  1.]


In [92]:
a.add_(1)
print(a)
print(b) # see how the numpy array changed in value


 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]

[ 2.  2.  2.  2.  2.]


#### Converting numpy Array to torch Tensor

In [93]:
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)  # shares memory with the numpy array
np.add(a, 1, out=a)
print(a)
print(b) # see how changing the np array changed the torch Tensor automatically

[ 2.  2.  2.  2.  2.]

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]



All the Tensors on the CPU except a CharTensor support converting to NumPy and back.

### CUDA Tensors

CUDA Tensors are nice and easy in PyTorch, and they are much more consistent as well.
Transferring a tensor from the CPU to the GPU retains its underlying type.

In [None]:
# creates a LongTensor and transfers it 
# to GPU as torch.cuda.LongTensor
a = torch.LongTensor(10).fill_(3).cuda()
print(type(a))
b = a.cpu()
# transfers it to CPU, back to 
# being a torch.LongTensor
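
The round trip can be written so that it also runs on a machine without a GPU. The sketch below assumes a recent PyTorch release (it relies on `torch.cuda.is_available()` and the `.dtype` attribute); the fallback branch is only there so the example stays runnable everywhere.

```python
import torch

a = torch.LongTensor(10).fill_(3)

if torch.cuda.is_available():
    a_gpu = a.cuda()                 # moved to the GPU, still 64-bit integers
    assert a_gpu.is_cuda
    assert a_gpu.dtype == a.dtype
    back = a_gpu.cpu()               # moved back to the CPU
else:
    back = a                         # no GPU present: nothing to transfer

print(back.dtype, back.is_cuda)
```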

## Autograd

Autograd is now a core torch package for automatic differentiation. 

It uses a tape-based system for automatic differentiation.

In the forward phase, the autograd tape will remember all the operations it executed, and in the backward phase, it will replay the operations.

In autograd, we introduce a `Variable` class, which is a very thin wrapper around a `Tensor`. 
You can access the raw tensor through the `.data` attribute, and after computing the backward pass, a gradient w.r.t. this variable is accumulated into the `.grad` attribute.

![Variable](images/Variable.png)

There's one more class which is very important for the autograd implementation: a `Function`. `Variable` and `Function` are interconnected and build up an acyclic graph that encodes a complete history of computation. Each variable has a `.creator` attribute that references the function that created it (except for Variables created by the user, which have `None` as `.creator`).

If you want to compute the derivatives, you can call `.backward()` on a `Variable`. 
If the `Variable` is a scalar (i.e. it holds a one-element tensor), you don't need to pass any arguments to `backward()`. If it has more elements, you need to pass a `grad_output` argument, which is a tensor of matching shape.

In [94]:
from torch.autograd import Variable
x = Variable(torch.ones(2, 2))
x  # notice the "Variable containing" line

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [95]:
x.data


 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [96]:
x.grad


 0  0
 0  0
[torch.FloatTensor of size 2x2]

In [97]:
x.creator is None  # we've created x ourselves

True

In [98]:
y = x + 2
y

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

In [99]:
y.creator
# y was created as a result of an operation, 
# so it has a creator

<torch.autograd.functions.basic_ops.AddConstant at 0x10d4076d0>

In [100]:
z = y * y * 3
z

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]

In [101]:
out = z.mean()
out

Variable containing:
 27
[torch.FloatTensor of size 1]

In [102]:
# let's backprop now
out.backward()

In [103]:
# print gradients d(out)/dx
x.grad


 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]
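
The value 4.5 can be checked by hand. With $z_i = 3\,(x_i + 2)^2$ and `out` being the mean over the four elements of $z$:

$$
\text{out} = \frac{1}{4}\sum_{i} 3\,(x_i + 2)^2
\quad\Longrightarrow\quad
\frac{\partial\,\text{out}}{\partial x_i} = \frac{1}{4}\cdot 6\,(x_i + 2) = \frac{3}{2}\,(x_i + 2),
\qquad
\left.\frac{\partial\,\text{out}}{\partial x_i}\right|_{x_i = 1} = \frac{3}{2}\cdot 3 = 4.5
$$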

By default, gradient computation flushes all the internal buffers contained in the graph, so if you ever want to run backward on some part of the graph twice, you need to pass in `retain_variables=True` during the first pass.

In [104]:
x = Variable(torch.ones(2, 2))
y = x + 2
y.backward(torch.ones(2, 2), retain_variables=True)
# the retain_variables flag will prevent the internal buffers from being freed
x.grad


 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [105]:
z = y * y
z

Variable containing:
 9  9
 9  9
[torch.FloatTensor of size 2x2]

In [106]:
gradient = torch.randn(2, 2)
# just backpropagating random gradients

z.backward(gradient)

# this would fail if we didn't specify 
# that we want to retain variables
x.grad


 5.8201  9.2361
 1.5865 -6.4018
[torch.FloatTensor of size 2x2]
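
The numbers printed above are not arbitrary: gradients accumulate across backward calls, so `x.grad` ends up as `1 + 2 * y * gradient`, which is `1 + 6 * gradient` here. The sketch below checks this. It assumes a recent PyTorch release, where tensors track gradients directly and the flag is spelled `retain_graph` rather than `retain_variables`.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2

# First backward pass: d(y)/dx is 1 everywhere.
y.backward(torch.ones(2, 2), retain_graph=True)
first = x.grad.clone()

# Second backward pass through z = y * y with an arbitrary upstream gradient.
z = y * y
gradient = torch.randn(2, 2)
z.backward(gradient)

# The second gradient is added on top of the first one.
expected = first + 2 * y.detach() * gradient
print(torch.allclose(x.grad, expected))  # True
```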

## nn package

In [107]:
import torch.nn as nn

We've redesigned the nn package, so that it's much more intuitive and fully integrated with autograd.

### Replace containers with autograd

You no longer have to use Containers like ConcatTable, or modules like CAddTable, or use and debug with nngraph. 
We will seamlessly use autograd to define our neural networks.
For example, 

`output = CAddTable():forward({input1, input2})` simply becomes `output = input1 + input2`

`output = MulConstant(0.5):forward(input)` simply becomes `output = input * 0.5`
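
As a runnable illustration of the two rewrites above, the sketch below combines them and backpropagates through the result. It assumes a recent PyTorch release (tensors created with `requires_grad=True`); the input values are arbitrary.

```python
import torch

input1 = torch.ones(2, 3, requires_grad=True)
input2 = torch.full((2, 3), 2.0, requires_grad=True)

# Plain Python operators take the place of CAddTable and MulConstant.
output = (input1 + input2) * 0.5
output.sum().backward()

print(output[0, 0].item())        # 1.5
print(input1.grad[0, 0].item())   # 0.5
```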

### State is no longer held in the module, but in the network graph

Using recurrent networks is a breeze. If you want to create a recurrent network, simply use the same Linear layer multiple times, and the weights are shared.

![torch-nn-vs-pytorch-nn](images/torch-nn-vs-pytorch-nn.png)
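
A minimal sketch of this weight sharing, assuming a recent PyTorch release: one `nn.Linear` is applied twice in a row, and a single backward pass produces one gradient for its one weight matrix.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
x = torch.randn(3, 4)

# The same module is applied twice; both applications use one weight matrix.
h = layer(layer(x))
h.sum().backward()

# Only one weight and one bias exist, and the weight gradient holds the
# contributions of both applications.
n_params = len(list(layer.parameters()))
print(n_params, tuple(layer.weight.grad.size()))  # 2 (4, 4)
```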

### Simplified debugging

Debugging is intuitive using Python's pdb debugger, and **the debugger and stack traces stop at exactly where an error occurred.** What you see is what you get.

### Example 1: ConvNet

Creating networks is simple. All of your networks are derived from the base class `nn.Container`.

- In the constructor, you declare all the layers you want to use.
- In the forward function, you define how your model is going to be run, from input to output.

In [108]:
class MNISTConvNet(nn.Container):
    def __init__(self):
        # this is the place where you instantiate all your modules
        # you can later access them using the same names you've given them in here
        super(MNISTConvNet, self).__init__(
            conv1 = nn.Conv2d(1, 20, 5),
            pool1 = nn.MaxPool2d(2, 2),
            conv2 = nn.Conv2d(20, 50, 5),
            pool2 = nn.MaxPool2d(2, 2),
            fc1   = nn.Linear(800, 500),
            fc2   = nn.Linear(500, 10),
            relu  = nn.ReLU(),
            softmax = nn.LogSoftmax(),
        )
        
    # it's the forward function that defines the network structure
    # we're accepting only a single input in here, but if you want,
    # feel free to use more
    def forward(self, input):
        x = self.pool1(self.relu(self.conv1(input)))
        x = self.pool2(self.relu(self.conv2(x)))

        # in your model definition you can go full crazy and use arbitrary
        # python code to define your model structure
        # all these are perfectly legal, and will be handled correctly 
        # by autograd:
        # if x.gt(0).sum() > x.numel() / 2:
        #      ...
        # 
        # you can even do a loop and reuse the same module inside it
        # modules no longer hold ephemeral state, so you can use them
        # multiple times during your forward pass
        # e.g. in this example we're using relu multiple times (but
        # modules with parameters will have correct gradients w.r.t.
        # their weights as well)
        # while x.norm(2) < 10:
        #    x = self.conv1(x) 
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.softmax(x)

Using this ConvNet is now intuitive.
You create an instance of it, run an input through it, and backpropagate.

In [109]:
net = MNISTConvNet()
# only mini-batches are supported in all of nn
input = Variable(torch.randn(1, 1, 28, 28))
out = net(input)
print(out.size())

 1
 10
[torch.LongStorage of size 2]


In [110]:
target = Variable(torch.LongTensor([3]), requires_grad = False)
loss_fn = nn.NLLLoss()
err = loss_fn(out, target)
print(err)

Variable containing:
 2.3291
[torch.FloatTensor of size 1]



In [111]:
err.backward()

It's easy to access individual layer gradients

In [112]:
print(net.conv1.weight.grad.size())

 20
 1
 5
 5
[torch.LongStorage of size 4]


In [113]:
print(net.conv1.weight.grad.norm())

0.451569274352


A full, working MNIST example is located here:
https://github.com/pytorch/examples/tree/master/mnist
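
For readers on a later PyTorch release, the same network can be sketched with `nn.Module` as the base class and layers assigned as attributes. This is an adaptation under that assumption, not the code from this tutorial; the use of `torch.nn.functional` and the single shared pooling layer are choices made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MNISTConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 50, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(800, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # spatial size 28 -> 24 -> 12
        x = self.pool(F.relu(self.conv2(x)))   # spatial size 12 -> 8 -> 4
        x = x.view(x.size(0), -1)              # 50 * 4 * 4 = 800 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)


net = MNISTConvNet()
out = net(torch.randn(1, 1, 28, 28))
loss = nn.NLLLoss()(out, torch.LongTensor([3]))
loss.backward()
print(tuple(out.size()), tuple(net.conv1.weight.grad.size()))  # (1, 10) (20, 1, 5, 5)
```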

### Example 2: Recurrent Net

Building recurrent nets with PyTorch is quite a breeze.
Since the state of the network is held in the graph and not
in the layers, you can simply create an nn.Linear and 
reuse it over and over again for the recurrence.

In [114]:
class RNN(nn.Container):

    # you can also accept arguments in your model constructor
    # if you want to parametrize it somehow
    def __init__(self, data_size, hidden_size, output_size):
        self.hidden_size = hidden_size
        input_size = data_size + hidden_size
        super(RNN, self).__init__(
            i2h=nn.Linear(input_size, hidden_size),
            i2o=nn.Linear(input_size, output_size)
        )
    
    def forward(self, data, last_hidden):
        input = torch.cat((data, last_hidden), 1)
        hidden = self.i2h(input)
        output = self.i2o(input)
        return hidden, output

rnn = RNN(50, 20, 10)

In [115]:
loss_fn = nn.MSELoss()

batch_size = 10
TIMESTEPS = 5
batch = Variable(torch.randn(batch_size, 50), requires_grad = False)
hidden = Variable(torch.zeros(batch_size, 20), requires_grad = False)
target = Variable(torch.zeros(batch_size, 10), requires_grad = False)
loss = 0
for t in range(TIMESTEPS):                  
    # yes! you can reuse the same network several times,
    # sum up the losses, and call backward!
    hidden, output = rnn(batch, hidden)
    loss += loss_fn(output, target)
loss.backward()
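
The unrolled loop can also be sketched against a later PyTorch release, assuming `nn.Module` as the base class and plain tensors in place of `Variable`. The sizes mirror the ones used above; everything else is an adaptation for illustration.

```python
import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, data_size, hidden_size, output_size):
        super().__init__()
        input_size = data_size + hidden_size
        self.i2h = nn.Linear(input_size, hidden_size)
        self.i2o = nn.Linear(input_size, output_size)

    def forward(self, data, last_hidden):
        # concatenate the input and the previous hidden state along dim 1
        combined = torch.cat((data, last_hidden), 1)
        return self.i2h(combined), self.i2o(combined)


rnn = RNN(50, 20, 10)
loss_fn = nn.MSELoss()
batch = torch.randn(10, 50)
hidden = torch.zeros(10, 20)
target = torch.zeros(10, 10)

loss = 0
for _ in range(5):
    # the same two Linear layers are reused at every timestep
    hidden, output = rnn(batch, hidden)
    loss = loss + loss_fn(output, target)
loss.backward()

print(tuple(rnn.i2h.weight.grad.size()))  # (20, 70)
```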

### Multi-GPU examples

#### Data Parallel

In [117]:
class DataParallelModel(nn.Container):
    def __init__(self):
        super().__init__(
            block1=nn.Linear(10, 20),
            block2=nn.Linear(20, 20),
            block3=nn.Linear(20, 20),
        )
        
    def forward(self, x):
        x = self.block1(x)
        x = nn.parallel.data_parallel(self.block2, x, (0, 1, 2, 3))
        x = self.block3(x)
        return x

#### Model Parallel

In [121]:
class DistributedModel(nn.Container):
    def __init__(self):
        super().__init__(
            embedding=nn.Embedding(1000, 10),
            rnn=nn.Linear(10, 10).cuda(0),
        )
        
    def forward(self, x):
        x = self.embedding(x)
        x = x.cuda(0)
        x = self.rnn(x)
        return x

#### Primitives on which data parallel is implemented

In [123]:
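# the helpers used below are assumed to be importable from torch.nn.parallel
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

def data_parallel(module, input, device_ids, output_device=None):
    """Distributes replicas of module across GPUs given in device_ids,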
       slices the input and applies the copies in parallel.
       Outputs are concatenated on the same device as input (or on
       output_device if specified). Device id -1 means the CPU.
    """
    if not device_ids:
        return module(input)
    replicas = replicate(module, device_ids)
    inputs = scatter(input, device_ids)
    outputs = parallel_apply(replicas, inputs)
    if output_device is None:
        output_device = -1 if not input.is_cuda else input.get_device()
    return gather(outputs, output_device)
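
The scatter / apply / gather pattern can be illustrated on the CPU alone. In the sketch below, `torch.chunk` stands in for `scatter`, a plain loop stands in for `parallel_apply`, and `torch.cat` stands in for `gather`. These stand-ins are assumptions made for illustration and run sequentially; they are not the primitives themselves.

```python
import torch
import torch.nn as nn

module = nn.Linear(20, 20)
batch = torch.randn(8, 20)

# "scatter": split the batch along dimension 0 into one slice per worker.
slices = torch.chunk(batch, 4, dim=0)

# "parallel_apply": here the slices are simply processed one after another.
outputs = [module(s) for s in slices]

# "gather": concatenate the partial results back into one batch.
gathered = torch.cat(outputs, dim=0)

# Processing the slices separately gives the same result as the full batch.
matches = torch.allclose(gathered, module(batch), atol=1e-5)
print(tuple(gathered.size()), matches)
```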