# Transfer Learning

In this notebook, you'll learn how to use pre-trained networks to solved challenging problems in computer vision. Specifically, you'll use networks trained on [ImageNet](http://www.image-net.org/) [available from torchvision](http://pytorch.org/docs/0.3.0/torchvision/models.html). 

ImageNet is a massive dataset with over 1 million labeled images in 1000 categories. It's used to train deep neural networks using an architecture called convolutional layers. I'm not going to get into the details of convolutional networks here, but if you want to learn more about them, please [watch this](https://www.youtube.com/watch?v=2-Ol7ZB0MmU).

Once trained, these models work astonishingly well as feature detectors for images they weren't trained on. Using a pre-trained network on images not in the training set is called transfer learning. Here we'll use transfer learning to train a network that can classify our cat and dog photos with near perfect accuracy.

With `torchvision.models` you can download these pre-trained networks and use them in your applications. We'll include `models` in our imports now.

In [129]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models

Most of the pretrained models require the input to be 224x224 images. Also, we'll need to match the normalization used when the models were trained. Each color channel was normalized separately, the means are `[0.485, 0.456, 0.406]` and the standard deviations are `[0.229, 0.224, 0.225]`.

In [130]:
data_dir = 'Cat_Dog_data'

# TODO: Define transforms for the training data and testing data
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                      transforms.RandomResizedCrop(224),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406],
                                                          [0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(225),
                                     transforms.CenterCrop(224),
                                     transforms.ToTensor()])

# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=64)

We can load in a model such as [DenseNet](http://pytorch.org/docs/0.3.0/torchvision/models.html#id5). Let's print out the model architecture so we can see what's going on.

In [132]:
model = models.densenet121(pretrained=True)
model

  nn.init.kaiming_normal(m.weight.data)


DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplac

This model is built out of two main parts, the features and the classifier. The features part is a stack of convolutional layers and overall works as a feature detector that can be fed into a classifier. The classifier part is a single fully-connected layer `(classifier): Linear(in_features=1024, out_features=1000)`. This layer was trained on the ImageNet dataset, so it won't work for our specific problem. That means we need to replace the classifier, but the features will work perfectly on their own. In general, I think about pre-trained networks as amazingly good feature detectors that can be used as the input for simple feed-forward classifiers.

In [133]:
# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False

from collections import OrderedDict
classifier = nn.Sequential(OrderedDict([
                          ('fc1', nn.Linear(1024, 500)),
                          ('relu', nn.ReLU()),
                          ('fc2', nn.Linear(500, 2)),
                          ('output', nn.LogSoftmax(dim=1))
                          ]))
    
model.classifier = classifier

With our model built, we need to train the classifier. However, now we're using a **really deep** neural network. If you try to train this on a CPU like normal, it will take a long, long time. Instead, we're going to use the GPU to do the calculations. The linear algebra computations are done in parallel on the GPU leading to 100x increased training speeds. It's also possible to train on multiple GPUs, further decreasing training time.

PyTorch, along with pretty much every other deep learning framework, uses [CUDA](https://developer.nvidia.com/cuda-zone) to efficiently compute the forward and backwards passes on the GPU. In PyTorch, you move your model parameters and other tensors to the GPU memory using `model.to('cuda')`. You can move them back from the GPU with `model.to('cpu')` which you'll commonly do when you need to operate on the network output outside of PyTorch. As a demonstration of the increased speed, I'll compare how long it takes to perform a forward and backward pass with and without a GPU.

In [134]:
import time

In [79]:
for device in ['cpu', 'cuda']:

    criterion = nn.NLLLoss()
    # Only train the classifier parameters, feature parameters are frozen
    optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

    model.to(device)

    for ii, (inputs, labels) in enumerate(trainloader):

        # Move input and label tensors to the GPU
        inputs, labels = inputs.to(device), labels.to(device)

        start = time.time()

        outputs = model.forward(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        if ii==3:
            break
        
    print(f"Device = {device}; Time per batch: {(time.time() - start)/3:.3f} seconds")

Device = cpu; Time per batch: 5.636 seconds
Device = cuda; Time per batch: 0.010 seconds


You can write device agnostic code which will automatically use CUDA if it's enabled like so:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```

From here, I'll let you finish training the model. The process is the same as before except now your model is much more powerful. You should get better than 95% accuracy easily.

>**Exercise:** Train a pretrained models to classify the cat and dog images. Continue with the DenseNet model, or try ResNet, it's also a good model to try out first. Make sure you are only training the classifier and the parameters for the features part are frozen.

In [135]:
model= models.resnet50(pretrained=True)
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=F

In [136]:
## TODO: Use a pretrained model to classify the cat and dog images
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
# freeze feature parameters
for param in model.parameters():
    param.requires_grad=False

# bild classifier
from collections import OrderedDict
classifier= nn.Sequential(nn.Linear(2048,512),
                          nn.ReLU(),
                          nn.Dropout(p=.2),
                          nn.Linear(512,128),
                          nn.ReLU(),
                          nn.Dropout(p=.2),
                          nn.Linear(128,2),
                          nn.LogSoftmax(dim=1))

# need to match full-connect layer (classifier or fc)
model.fc= classifier

# criterion and optim
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=0.003)

# set to GPU
model.to(device)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=F

In [137]:
# train
from time import time
time0=time()
running_loss=0
step=0

for traininputs, trainlabels in trainloader:
    step+=1
    # setup
    traininputs, trainlabels= traininputs.to(device), trainlabels.to(device)
    optimizer.zero_grad()
    # implementation
    output=model.forward(traininputs)
    loss=criterion(output, trainlabels)
    loss.backward()
    optimizer.step()
    running_loss+= loss.item()
    
    '''Inner forloop will finish the entire loop cycle then go back to outer forloop'''
    if step%10==0:
        model.eval()
        test_loss=0
        accuracy=0
        with torch.no_grad():
            # setups
            for testinput, testlabels in testloader:
                testinput,testlabels=testinput.to(device), testlabels.to(device)
                testoutput=model.forward(testinput)
                batch_loss=criterion(testoutput,testlabels)
                test_loss+= batch_loss.item()

                # get difference
                ps=torch.exp(testoutput)
                equals=ps.max(1)[1]==testlabels
                # equals contains 64 1 or 0, get mean will get 1 batch accuracy
                accuracy+= torch.mean(equals.type(torch.FloatTensor)).item()
                
        '''training mode need to be reset here, out of for loop, otherwise it will run validation set with train mode'''        
        model.train()        
        print('steps: {}/{}'.format(step,len(trainloader)),
              'training loss: {:.3f}'.format(running_loss/step),
              'test loss: {:.3f}'.format(test_loss/len(testloader)),
              'test accuracy: {:.3f}'.format(accuracy/len(testloader)),
              'duration:{:.0f}'.format(time()-time0))

steps: 10/352 training loss: 0.874 test loss: 0.495 test accuracy: 0.723 duration:39
steps: 20/352 training loss: 0.556 test loss: 0.129 test accuracy: 0.946 duration:78
steps: 30/352 training loss: 0.442 test loss: 0.338 test accuracy: 0.880 duration:117
steps: 40/352 training loss: 0.387 test loss: 0.184 test accuracy: 0.928 duration:156
steps: 50/352 training loss: 0.361 test loss: 0.553 test accuracy: 0.786 duration:195
steps: 60/352 training loss: 0.347 test loss: 0.192 test accuracy: 0.916 duration:234
steps: 70/352 training loss: 0.326 test loss: 0.232 test accuracy: 0.909 duration:273
steps: 80/352 training loss: 0.307 test loss: 0.162 test accuracy: 0.934 duration:311
steps: 90/352 training loss: 0.290 test loss: 0.144 test accuracy: 0.940 duration:351
steps: 100/352 training loss: 0.278 test loss: 0.119 test accuracy: 0.949 duration:390
steps: 110/352 training loss: 0.272 test loss: 0.341 test accuracy: 0.852 duration:429
steps: 120/352 training loss: 0.269 test loss: 0.160 t

> second approach with densenet121

In [138]:
model = models.densenet121(pretrained=True)
model

  nn.init.kaiming_normal(m.weight.data)


DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplac

In [139]:
# freeze parameters
for param in model.parameters():
    param.requires_grad=False
    
classifier =nn.Sequential(nn.Linear(1024, 256),
                                 nn.ReLU(),
                                 nn.Dropout(0.2),
                                 nn.Linear(256, 2),
                                 nn.LogSoftmax(dim=1))

model.classifier= classifier
criterion=nn.NLLLoss()
optimizer=optim.Adam(model.classifier.parameters(),lr=0.003)
model.to(device)

DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplac

In [140]:
# train
time0=time()
running_loss=0
step=0

for traininputs, trainlabels in trainloader:
    step+=1
    # setup
    traininputs, trainlabels= traininputs.to(device), trainlabels.to(device)
    optimizer.zero_grad()
    # implementation
    output=model.forward(traininputs)
    loss=criterion(output, trainlabels)
    loss.backward()
    optimizer.step()
    running_loss+= loss.item()
    
    if step%10==0:
        model.eval()
        test_loss=0
        accuracy=0
        
        with torch.no_grad():
            # setups
            for testinput, testlabels in testloader:
                testinput,testlabels=testinput.to(device), testlabels.to(device)
                testoutput=model.forward(testinput)
                batch_loss=criterion(testoutput,testlabels)
                test_loss+= batch_loss.item()

                # get difference
                ps=torch.exp(testoutput)
                equals=ps.max(1)[1]==testlabels
                # equals contains 64 1 or 0, get mean will get 1 batch accuracy
                accuracy+= torch.mean(equals.type(torch.FloatTensor)).item()
                
        model.train()        
        print('steps: {}/{}'.format(step,len(trainloader)),
              'training loss: {:.3f}'.format(running_loss/step),
              'test loss: {:.3f}'.format(test_loss/len(testloader)),
              'test accuracy: {:.3f}'.format(accuracy/len(testloader)),
              'duration:{:.0f}'.format(time()-time0))

steps: 10/352 training loss: 0.586 test loss: 0.155 test accuracy: 0.945 duration:38
steps: 20/352 training loss: 0.411 test loss: 0.121 test accuracy: 0.945 duration:76
steps: 30/352 training loss: 0.329 test loss: 0.111 test accuracy: 0.955 duration:115
steps: 40/352 training loss: 0.284 test loss: 0.143 test accuracy: 0.938 duration:153
steps: 50/352 training loss: 0.284 test loss: 0.300 test accuracy: 0.873 duration:191
steps: 60/352 training loss: 0.307 test loss: 0.107 test accuracy: 0.964 duration:230
steps: 70/352 training loss: 0.302 test loss: 0.114 test accuracy: 0.956 duration:268
steps: 80/352 training loss: 0.289 test loss: 0.128 test accuracy: 0.945 duration:307
steps: 90/352 training loss: 0.280 test loss: 0.167 test accuracy: 0.925 duration:345
steps: 100/352 training loss: 0.269 test loss: 0.108 test accuracy: 0.954 duration:383
steps: 110/352 training loss: 0.261 test loss: 0.094 test accuracy: 0.961 duration:421
steps: 120/352 training loss: 0.255 test loss: 0.083 t

In [142]:
epochs = 1
steps = 0
running_loss = 0
print_every = 5
for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        # Move input and label tensors to the default device
        inputs, labels = inputs.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        logps = model.forward(inputs)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        
        if steps % 10 == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    
                    test_loss += batch_loss.item()
                    
                    # Calculate accuracy
                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
                    
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(testloader):.3f}.. "
                  f"Test accuracy: {accuracy/len(testloader):.3f}")
            running_loss = 0
            model.train()

Epoch 1/1.. Train loss: 0.224.. Test loss: 0.206.. Test accuracy: 0.912
Epoch 1/1.. Train loss: 0.281.. Test loss: 0.143.. Test accuracy: 0.941
Epoch 1/1.. Train loss: 0.299.. Test loss: 0.123.. Test accuracy: 0.949
Epoch 1/1.. Train loss: 0.312.. Test loss: 0.138.. Test accuracy: 0.939
Epoch 1/1.. Train loss: 0.315.. Test loss: 0.088.. Test accuracy: 0.963
Epoch 1/1.. Train loss: 0.351.. Test loss: 0.109.. Test accuracy: 0.954
Epoch 1/1.. Train loss: 0.273.. Test loss: 0.102.. Test accuracy: 0.957
Epoch 1/1.. Train loss: 0.228.. Test loss: 0.133.. Test accuracy: 0.948
Epoch 1/1.. Train loss: 0.272.. Test loss: 0.211.. Test accuracy: 0.918
Epoch 1/1.. Train loss: 0.373.. Test loss: 0.077.. Test accuracy: 0.970
Epoch 1/1.. Train loss: 0.391.. Test loss: 0.167.. Test accuracy: 0.926
Epoch 1/1.. Train loss: 0.320.. Test loss: 0.100.. Test accuracy: 0.957
Epoch 1/1.. Train loss: 0.392.. Test loss: 0.191.. Test accuracy: 0.908
Epoch 1/1.. Train loss: 0.336.. Test loss: 0.076.. Test accuracy