![](pics/header.jpg)

# PyTorch

Kevin Walchko, PhD

---

Some of this material comes from Udacity's AI course.

## Origins

PyTorch was released in early 2017 and has had a significant impact on the deep learning community. It is developed as an open source project by the Facebook AI Research team. Two of its core pieces are:

- tensor: main data structure 
- autograd: automatically calculates gradients for backpropagation
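
As a minimal sketch of how these two pieces fit together, the example below builds a small tensor, computes a scalar from it, and lets autograd work out the gradient. The values and the function $y = \sum x^2$ are made up for illustration.

```python
import torch

# A tensor that records operations so gradients can be computed later
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = sum(x^2), so dy/dx = 2x
y = (x ** 2).sum()
y.backward()  # autograd fills in x.grad

print(x.grad)  # tensor([2., 4., 6.])
```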

## Simple Network

There are several ways to build a network; here is a simple, handcrafted one.

In [21]:
import torch
from torch import nn
from helper import summary

In [22]:
class Network(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Inputs to hidden layer linear transformation
        # MNIST images 1 channel (grayscale) x 28 pix x 28 pix
        self.hidden = nn.Linear(1*28*28, 256)
        # Output layer, 10 units - one for each digit
        self.output = nn.Linear(256, 10)
        
        # Define sigmoid activation and softmax output 
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden(x)  # input to hidden layer 
        x = self.sigmoid(x) # activation - sigmoid
        x = self.output(x)  # hidden to output layer
        x = self.softmax(x) # activation - softmax
        
        return x

Let's go through this bit by bit.

```python
class Network(nn.Module):
```

Here we're inheriting from `nn.Module`. Combined with `super().__init__()` this creates a class that tracks the architecture and provides a lot of useful methods and attributes. It is mandatory to inherit from `nn.Module` when you're creating a class for your network. The name of the class itself can be anything.

```python
self.hidden = nn.Linear(784, 256)
```

This line creates a module for a linear transformation, $x\mathbf{W}^T + b$, with 784 inputs and 256 outputs and assigns it to `self.hidden`. The module automatically creates the weight and bias tensors, which are used when the module is called in the `forward` method. PyTorch stores the weight with shape `[out_features, in_features]`, hence the transpose. Once the network (`net`) is created, you can access the weight and bias tensors with `net.hidden.weight` and `net.hidden.bias`.
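
A quick way to check this is to inspect the shapes and reproduce the layer's output by hand. This is only a sketch: the batch `x` is random stand-in data, not MNIST.

```python
import torch
from torch import nn

hidden = nn.Linear(784, 256)

# nn.Linear stores the weight as (out_features, in_features)
print(hidden.weight.shape)  # torch.Size([256, 784])
print(hidden.bias.shape)    # torch.Size([256])

# Calling the layer should match x @ W.T + b
x = torch.randn(4, 784)     # hypothetical batch of 4 flattened images
manual = x @ hidden.weight.T + hidden.bias
print(torch.allclose(hidden(x), manual, atol=1e-5))
```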

```python
self.output = nn.Linear(256, 10)
```

Similarly, this creates another linear transformation with 256 inputs and 10 outputs.

```python
self.sigmoid = nn.Sigmoid()
self.softmax = nn.Softmax(dim=1)
```

Here I defined operations for the sigmoid activation and softmax output. Setting `dim=1` in `nn.Softmax(dim=1)` applies softmax across the columns of each row, so the class probabilities for each sample in the batch sum to 1.
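
The effect of `dim=1` is easiest to see on a tiny example. The scores below are made up; each row stands for one sample and each column for one class.

```python
import torch
from torch import nn

softmax = nn.Softmax(dim=1)

# Hypothetical scores: 2 samples (rows) x 3 classes (columns)
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])
probs = softmax(scores)

print(probs)             # second row is uniform because its scores are equal
print(probs.sum(dim=1))  # each row sums to 1
```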

```python
def forward(self, x):
```

PyTorch networks created with `nn.Module` must have a `forward` method defined. It takes in a tensor `x` and passes it through the operations you defined in the `__init__` method.

```python
x = self.hidden(x)
x = self.sigmoid(x)
x = self.output(x)
x = self.softmax(x)
```

Here the input tensor `x` is passed through each operation and reassigned to `x`. We can see that the input tensor goes through the hidden layer, then a sigmoid function, then the output layer, and finally the softmax function. It doesn't matter what you name the variables here, as long as the inputs and outputs of the operations match the network architecture you want to build. The order in which you define things in the `__init__` method doesn't matter, but you'll need to sequence the operations correctly in the `forward` method.

Now we can create a `Network` object.

In [23]:
model = Network()
print("inputs * output + bias:", 784 * 256 + 256*10 + 256 + 10, '\n')
summary(model)

inputs * output + bias: 203530 

Layer (type (var_name))                  Kernel Shape              Param #
Network                                  --                        --
├─Linear (hidden)                        [784, 256]                200,960
├─Linear (output)                        [256, 10]                 2,570
├─Sigmoid (sigmoid)                      --                        --
├─Softmax (softmax)                      --                        --
Total params: 203,530
Trainable params: 203,530
Non-trainable params: 0
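
The totals reported by `summary` can be cross-checked without the helper by summing the element counts of `model.parameters()`. The sketch below rebuilds the same layer sizes compactly with `nn.Sequential` so that it runs on its own; it is a stand-in for the `Network` class above, not a replacement.

```python
from torch import nn

# Same layer sizes as Network above, written compactly for this sketch
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.Sigmoid(),
    nn.Linear(256, 10),
    nn.Softmax(dim=1),
)

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(total, trainable)  # 203530 203530
```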


In [9]:
model

Network(
  (hidden): Linear(in_features=784, out_features=256, bias=True)
  (output): Linear(in_features=256, out_features=10, bias=True)
  (sigmoid): Sigmoid()
  (softmax): Softmax(dim=1)
)

In [10]:
model.hidden

Linear(in_features=784, out_features=256, bias=True)

In [11]:
model.hidden.weight

Parameter containing:
tensor([[-0.0269,  0.0144,  0.0353,  ..., -0.0349, -0.0097, -0.0321],
        [-0.0108, -0.0138, -0.0250,  ..., -0.0159, -0.0313, -0.0222],
        [-0.0092, -0.0259,  0.0007,  ...,  0.0054,  0.0047,  0.0022],
        ...,
        [-0.0221, -0.0261, -0.0052,  ...,  0.0322, -0.0269,  0.0170],
        [-0.0038,  0.0097,  0.0065,  ...,  0.0174,  0.0319,  0.0217],
        [ 0.0159,  0.0173,  0.0321,  ...,  0.0234,  0.0190, -0.0074]],
       requires_grad=True)

## Training

Typically it's more convenient to build the model with a log-softmax output using `nn.LogSoftmax` or `F.log_softmax` ([documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LogSoftmax)). Then you can get the actual probabilities by taking the exponential `torch.exp(output)`. With a log-softmax output, you want to use the negative log likelihood loss, `nn.NLLLoss` ([documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss)).
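
The relationship between these pieces can be sketched on random data. The `logits` and `labels` below are hypothetical stand-ins for a model's raw output and the true classes. The last lines illustrate that log-softmax followed by `nn.NLLLoss` should give the same value as `F.cross_entropy` applied to the raw scores.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)             # hypothetical raw scores: 5 samples, 10 classes
labels = torch.tensor([3, 0, 7, 1, 9])  # hypothetical class labels

log_probs = F.log_softmax(logits, dim=1)
probs = torch.exp(log_probs)            # recover the actual probabilities

loss = nn.NLLLoss()(log_probs, labels)

# log-softmax + NLLLoss matches cross-entropy on the raw scores
same = F.cross_entropy(logits, labels)
print(torch.allclose(loss, same))
```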

In [12]:
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.ToTensor()

train_data = datasets.MNIST(
    'data', 
    train=True,
    download=True, 
    transform=transform)

train_loader = DataLoader(
    train_data, 
    batch_size=20,
    shuffle=True)

In [13]:
from torch import optim

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for ii, (inputs, labels) in enumerate(train_loader):
    inputs = inputs.view(inputs.shape[0], -1) # flatten each image to 784 values
    outputs = model(inputs) # softmax probabilities from Network
    # NLLLoss expects log-probabilities, so take the log of the softmax output
    loss = criterion(torch.log(outputs), labels)

    optimizer.zero_grad() # clear gradients from the previous step
    loss.backward()       # compute gradients via backpropagation
    optimizer.step()      # update the weights using the gradients
    
    if ii % 1000 == 0:
        print(f"{ii+1}/{len(train_loader)}")

1/3000
1001/3000
2001/3000


## GPU Support

You can write device-agnostic code that automatically uses CUDA when it is available:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...)
model.to(device)
```

In [14]:
torch.cuda.is_available()

False
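
A runnable version of the same pattern might look like the sketch below. The single `nn.Linear` layer and the random batch are stand-ins chosen for illustration. Note that `Module.to` moves the module's parameters in place, while `Tensor.to` returns a tensor on the target device, so its result has to be assigned.

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(784, 10).to(device)    # stand-in model for this sketch
batch = torch.randn(20, 784).to(device)  # stand-in batch of flattened images

output = model(batch)                    # runs on whichever device was selected
print(output.device, output.shape)
```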

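## Sequential

`nn.Sequential` chains modules so that the output of each one feeds the next. The cells below build the same network four ways.
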
In [16]:
net = nn.Sequential()
net.add_module('fc1', nn.Linear(1000, 100))
net.add_module('relu', nn.ReLU())
net.add_module('dropout', nn.Dropout(0.2))
net.add_module('fc2', nn.Linear(100, 10))

summary(net)

Layer (type (var_name))                  Kernel Shape              Param #
Sequential                               --                        --
├─Linear (fc1)                           [1000, 100]               100,100
├─ReLU (relu)                            --                        --
├─Dropout (dropout)                      --                        --
├─Linear (fc2)                           [100, 10]                 1,010
Total params: 101,110
Trainable params: 101,110
Non-trainable params: 0


In [18]:
from collections import OrderedDict

net = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1000, 100)),
    ('relu', nn.ReLU()),
    ('dropout', nn.Dropout(0.2)),
    ('fc2', nn.Linear(100, 10))
]))

summary(net)

Layer (type (var_name))                  Kernel Shape              Param #
Sequential                               --                        --
├─Linear (fc1)                           [1000, 100]               100,100
├─ReLU (relu)                            --                        --
├─Dropout (dropout)                      --                        --
├─Linear (fc2)                           [100, 10]                 1,010
Total params: 101,110
Trainable params: 101,110
Non-trainable params: 0


In [19]:
n = (nn.Linear(1000, 100),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(100, 10))
net = nn.Sequential(*n)

summary(net)

Layer (type (var_name))                  Kernel Shape              Param #
Sequential                               --                        --
├─Linear (0)                             [1000, 100]               100,100
├─ReLU (1)                               --                        --
├─Dropout (2)                            --                        --
├─Linear (3)                             [100, 10]                 1,010
Total params: 101,110
Trainable params: 101,110
Non-trainable params: 0


In [20]:
net = nn.Sequential(
    nn.Linear(1000, 100),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(100, 10)
)

summary(net)

Layer (type (var_name))                  Kernel Shape              Param #
Sequential                               --                        --
├─Linear (0)                             [1000, 100]               100,100
├─ReLU (1)                               --                        --
├─Dropout (2)                            --                        --
├─Linear (3)                             [100, 10]                 1,010
Total params: 101,110
Trainable params: 101,110
Non-trainable params: 0
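
The practical difference between the named and unnamed forms is how the layers are accessed afterwards. As a rough sketch with random stand-in input, named layers are available as attributes and unnamed ones by position:

```python
from collections import OrderedDict

import torch
from torch import nn

named = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1000, 100)),
    ('relu', nn.ReLU()),
    ('dropout', nn.Dropout(0.2)),
    ('fc2', nn.Linear(100, 10)),
]))
unnamed = nn.Sequential(
    nn.Linear(1000, 100),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(100, 10),
)

print(named.fc1)              # access by name
print(unnamed[0])             # access by position
print(named[0] is named.fc1)  # named layers can also be indexed

x = torch.randn(8, 1000)      # hypothetical batch of 8 inputs
named.eval()                  # disable dropout for a deterministic forward pass
print(named(x).shape)         # torch.Size([8, 10])
```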
