# Part 3: Transfer Learning with ResNet
Here we will use ResNet, a pretrained CNN with a unique architecture that allows it to retain some information from the original features after each convolutional layer.

We will utilize it for transfer learning just like we saw in the previous notebook, both for CIFAR-10 and our custom dataset, and compare its performance with our own solution.

## Setup

In [17]:
import torch
import torchvision
import torchvision.models as models
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from torch.optim.lr_scheduler import CosineAnnealingLR
import cv2
import utils
from pathlib import Path
import copy

In [18]:

chess_transforms = transforms.Compose([
    # Resize to 32x32 to match the model's input
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    # Same normalization stats we used during CIFAR-10 training
    transforms.Normalize((0.49139968, 0.48215841, 0.44653091), (0.24703223, 0.24348513, 0.26158784))
])

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.49139968, 0.48215841, 0.44653091), (0.24703223, 0.24348513, 0.26158784))
])

# CIFAR-10
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
full_test_data = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
val_data = torch.utils.data.Subset(full_test_data, indices=list(range(5000)))
test_data = torch.utils.data.Subset(full_test_data, indices=list(range(5000, 10000)))

# Create DataLoaders
batch_size = 64
cifar_trainloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=2)
cifar_valloader = DataLoader(val_data, batch_size=batch_size, shuffle=False, num_workers=2)
cifar_testloader = DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=2)

## Importing ResNet50

In [19]:
resnet = models.resnet50(weights='DEFAULT')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Freeze the entire body (The "Warmup" prep)
for param in resnet.parameters():
    param.requires_grad = False

# Replace the head
num_ftrs = resnet.fc.in_features
resnet.fc = nn.Linear(num_ftrs, 10) # 10 classes for the CIFAR-10 dataset

resnet = resnet.to(device)

## Training and Evaluation functions from before

In [20]:
def evaluate(model, testloader, criterion, device):
    model.eval()  # Set to evaluation mode (turns off Dropout/BatchNorm)
    test_loss = 0
    correct = 0
    with torch.no_grad():  # No gradient calculation saved (saves memory)
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            test_loss += criterion(outputs, labels).item()
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()

    loss = test_loss / len(testloader)
    acc = 100. * correct / len(testloader.dataset)
    return loss, acc

In [21]:
def train_CNN(model, trainloader, valloader, criterion, optimizer, device, epochs=5,
               console=False, early_stopping=False, patience=3, scheduler=None):
    model.to(device)
    history = {
        'train_loss': [], 'val_loss': [],
        'train_acc': [], 'val_acc': []
    }
    best_val_loss = float('inf') # For early stopping
    best_model_wts = copy.deepcopy(model.state_dict())
    no_improvement_counter = 0

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            # Calculate training accuracy
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

        if scheduler:
          scheduler.step()

        # Calculate epoch metrics
        epoch_train_loss = running_loss / len(trainloader)
        epoch_train_acc = 100. * correct / total
        epoch_val_loss, epoch_val_acc = evaluate(model, valloader, criterion, device)

        # Save to history
        history['train_loss'].append(epoch_train_loss)
        history['val_loss'].append(epoch_val_loss)
        history['train_acc'].append(epoch_train_acc)
        history['val_acc'].append(epoch_val_acc)

        if console:
            print(f"Epoch {epoch+1}/{epochs} | "
                  f"Train Loss: {epoch_train_loss:.4f} | Train Acc: {epoch_train_acc:.2f}% | "
                  f"Val Loss: {epoch_val_loss:.4f} | Val Acc: {epoch_val_acc:.2f}%")

        if early_stopping:
            if epoch_val_loss < best_val_loss:
                best_val_loss = epoch_val_loss
                # Save the best model weights
                best_model_wts = copy.deepcopy(model.state_dict())
                no_improvement_counter = 0 # Reset counter
            else:
                no_improvement_counter += 1
                if console: print(f"  EarlyStopping counter: {no_improvement_counter} out of {patience}")

                if no_improvement_counter >= patience:
                    if console: print("Early stopping triggered! Restoring best weights...")
                    model.load_state_dict(best_model_wts) # Restore best model
                    break

    if console: print("Finished Training")
    return history

In [22]:
print(resnet.fc)

Linear(in_features=2048, out_features=10, bias=True)


Before evaluating ResNet on CIFAR-10 we will fine-tune, by freezing it except for its classifier head, and then training it with low learning rate.

In [23]:
# Unfreeze the last group of residual blocks
for param in resnet.layer4.parameters():
    param.requires_grad = True

total_params = sum(p.numel() for p in resnet.parameters())
trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-5, weight_decay=2e-5)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=5, console=True)

Total Parameters: 23,528,522
Trainable Parameters: 14,985,226
Epoch 1/5 | Train Loss: 2.2554 | Train Acc: 16.62% | Val Loss: 2.1561 | Val Acc: 25.12%
Epoch 2/5 | Train Loss: 2.0285 | Train Acc: 31.17% | Val Loss: 1.9643 | Val Acc: 37.72%
Epoch 3/5 | Train Loss: 1.7968 | Train Acc: 41.05% | Val Loss: 1.8591 | Val Acc: 45.18%
Epoch 4/5 | Train Loss: 1.6090 | Train Acc: 47.60% | Val Loss: 1.5403 | Val Acc: 51.50%
Epoch 5/5 | Train Loss: 1.4540 | Train Acc: 52.33% | Val Loss: 1.4197 | Val Acc: 54.94%
Finished Training


In [24]:
# Unfreeze the whole model
for param in resnet.parameters():
    param.requires_grad = True

total_params = sum(p.numel() for p in resnet.parameters())
trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-4, weight_decay=2e-5)
cos_scheduler = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-6)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=15,
                    console=True, early_stopping=True, patience=5, scheduler=cos_scheduler)

Total Parameters: 23,528,522
Trainable Parameters: 23,528,522
Epoch 1/15 | Train Loss: 0.8501 | Train Acc: 71.30% | Val Loss: 0.6076 | Val Acc: 80.48%
Epoch 2/15 | Train Loss: 0.4610 | Train Acc: 84.46% | Val Loss: 0.5025 | Val Acc: 83.68%
Epoch 3/15 | Train Loss: 0.2927 | Train Acc: 90.10% | Val Loss: 0.6542 | Val Acc: 82.64%
  EarlyStopping counter: 1 out of 5
Epoch 4/15 | Train Loss: 0.1841 | Train Acc: 93.94% | Val Loss: 0.6927 | Val Acc: 83.88%
  EarlyStopping counter: 2 out of 5
Epoch 5/15 | Train Loss: 0.1376 | Train Acc: 95.47% | Val Loss: 0.5465 | Val Acc: 84.74%
  EarlyStopping counter: 3 out of 5
Epoch 6/15 | Train Loss: 0.0872 | Train Acc: 97.21% | Val Loss: 0.6913 | Val Acc: 85.06%
  EarlyStopping counter: 4 out of 5
Epoch 7/15 | Train Loss: 0.0589 | Train Acc: 98.11% | Val Loss: 0.6153 | Val Acc: 85.94%
  EarlyStopping counter: 5 out of 5
Early stopping triggered! Restoring best weights...
Finished Training


ResNet50 is an extremely powerful model and much larger than the CNN we have built specifically for CIFAR-50. As such, it is easy for it to overfit to the dataset. The simplest way to deal with this problem would be to introduce the same kind of data augmentation we used to train our own Deeper-Wider CNN model.

## Data Augmentation

Those are the exact same augmentations we used for our own CNN in notebook 3.

In [25]:
transform_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4), # Adds a 4px border, then crops a 32x32 square randomly
    transforms.RandomHorizontalFlip(), # 50% chance to flip the image horizontally
    transforms.ToTensor(),
    transforms.Normalize((0.49139968, 0.48215841, 0.44653091), (0.24703223, 0.24348513, 0.26158784))
])

train_aug_data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_augment)
augtrainloader = DataLoader(train_aug_data, batch_size=64, shuffle=True, num_workers=2)

The model is imported again fresh:

In [26]:
resnet = models.resnet50(weights='DEFAULT')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for param in resnet.parameters():
    param.requires_grad = False

num_ftrs = resnet.fc.in_features
resnet.fc = nn.Linear(num_ftrs, 10)

resnet = resnet.to(device)

### Second Training Attempt

In [27]:
# Unfreeze the last group of residual blocks
for param in resnet.layer4.parameters():
    param.requires_grad = True

trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-5, weight_decay=2e-5)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=5, console=True)

Total Parameters: 23,528,522
Trainable Parameters: 14,985,226
Epoch 1/5 | Train Loss: 2.2538 | Train Acc: 16.25% | Val Loss: 2.1623 | Val Acc: 24.38%
Epoch 2/5 | Train Loss: 2.0171 | Train Acc: 30.84% | Val Loss: 1.9155 | Val Acc: 38.26%
Epoch 3/5 | Train Loss: 1.7897 | Train Acc: 41.29% | Val Loss: 1.7305 | Val Acc: 45.84%
Epoch 4/5 | Train Loss: 1.6017 | Train Acc: 47.65% | Val Loss: 1.5198 | Val Acc: 51.72%
Epoch 5/5 | Train Loss: 1.4469 | Train Acc: 52.40% | Val Loss: 1.3560 | Val Acc: 54.92%
Finished Training


In [28]:
# Unfreeze the whole model
for param in resnet.parameters():
    param.requires_grad = True

total_params = sum(p.numel() for p in resnet.parameters())
trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-4, weight_decay=2e-5)
cos_scheduler = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-6)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=15,
                    console=True, early_stopping=True, patience=5, scheduler=cos_scheduler)

Total Parameters: 23,528,522
Trainable Parameters: 23,528,522
Epoch 1/15 | Train Loss: 0.8420 | Train Acc: 71.53% | Val Loss: 0.5847 | Val Acc: 81.22%
Epoch 2/15 | Train Loss: 0.4477 | Train Acc: 84.92% | Val Loss: 0.6756 | Val Acc: 82.46%
  EarlyStopping counter: 1 out of 5
Epoch 3/15 | Train Loss: 0.2781 | Train Acc: 90.69% | Val Loss: 0.5649 | Val Acc: 83.86%
Epoch 4/15 | Train Loss: 0.1785 | Train Acc: 94.02% | Val Loss: 0.5447 | Val Acc: 84.60%
Epoch 5/15 | Train Loss: 0.1163 | Train Acc: 96.17% | Val Loss: 0.5732 | Val Acc: 84.14%
  EarlyStopping counter: 1 out of 5
Epoch 6/15 | Train Loss: 0.0842 | Train Acc: 97.29% | Val Loss: 0.7484 | Val Acc: 85.44%
  EarlyStopping counter: 2 out of 5
Epoch 7/15 | Train Loss: 0.0599 | Train Acc: 98.09% | Val Loss: 0.8217 | Val Acc: 84.86%
  EarlyStopping counter: 3 out of 5
Epoch 8/15 | Train Loss: 0.0441 | Train Acc: 98.59% | Val Loss: 0.7161 | Val Acc: 85.32%
  EarlyStopping counter: 4 out of 5
Epoch 9/15 | Train Loss: 0.0353 | Train Acc: 9

As we can see, Data Augmentation provided little benefit. ResNet50 was designed for 224x224 images, and even with the augmentation that we've added it still isn't enough to give the model's huge structure enough to learn to generalize. The core problem is simply not complex enough for ResNet50.

We can attempt to solve this by adding a Dropout layer to the head of the model.

In [29]:
resnet = models.resnet50(weights='DEFAULT')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for param in resnet.parameters():
    param.requires_grad = False

num_ftrs = resnet.fc.in_features
resnet.fc = nn.Sequential(
    nn.Dropout(p=0.3),
    nn.Linear(num_ftrs, 10)
    )

resnet = resnet.to(device)

## Third Training Attempt

In [30]:
# Unfreeze the last group of residual blocks
for param in resnet.layer4.parameters():
    param.requires_grad = True

trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-5, weight_decay=2e-5)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=5, console=True)

Total Parameters: 23,528,522
Trainable Parameters: 14,985,226
Epoch 1/5 | Train Loss: 2.2831 | Train Acc: 14.15% | Val Loss: 2.2037 | Val Acc: 20.84%
Epoch 2/5 | Train Loss: 2.1100 | Train Acc: 24.72% | Val Loss: 2.0222 | Val Acc: 33.22%
Epoch 3/5 | Train Loss: 1.9205 | Train Acc: 34.40% | Val Loss: 1.8288 | Val Acc: 42.24%
Epoch 4/5 | Train Loss: 1.7424 | Train Acc: 41.49% | Val Loss: 1.6591 | Val Acc: 47.90%
Epoch 5/5 | Train Loss: 1.5920 | Train Acc: 46.89% | Val Loss: 1.5680 | Val Acc: 52.52%
Finished Training


In [31]:
# Unfreeze the whole model
for param in resnet.parameters():
    param.requires_grad = True

total_params = sum(p.numel() for p in resnet.parameters())
trainable_params = sum(p.numel() for p in resnet.parameters() if p.requires_grad)

print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")


optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet.parameters()),
                             lr=1e-4, weight_decay=2e-5)
cos_scheduler = CosineAnnealingLR(optimizer, T_max=15, eta_min=1e-6)
criterion = nn.CrossEntropyLoss()

results = train_CNN(resnet, cifar_trainloader, cifar_valloader, criterion, optimizer, device, epochs=15,
                    console=True, early_stopping=True, patience=5, scheduler=cos_scheduler)

Total Parameters: 23,528,522
Trainable Parameters: 23,528,522
Epoch 1/15 | Train Loss: 0.9368 | Train Acc: 68.42% | Val Loss: 0.6708 | Val Acc: 79.20%
Epoch 2/15 | Train Loss: 0.5136 | Train Acc: 82.74% | Val Loss: 0.5997 | Val Acc: 82.66%
Epoch 3/15 | Train Loss: 0.3402 | Train Acc: 88.52% | Val Loss: 0.6370 | Val Acc: 83.66%
  EarlyStopping counter: 1 out of 5
Epoch 4/15 | Train Loss: 0.2345 | Train Acc: 92.29% | Val Loss: 0.5263 | Val Acc: 85.30%
Epoch 5/15 | Train Loss: 0.1623 | Train Acc: 94.63% | Val Loss: 0.5874 | Val Acc: 85.52%
  EarlyStopping counter: 1 out of 5
Epoch 6/15 | Train Loss: 0.1136 | Train Acc: 96.26% | Val Loss: 0.6982 | Val Acc: 85.16%
  EarlyStopping counter: 2 out of 5
Epoch 7/15 | Train Loss: 0.0813 | Train Acc: 97.35% | Val Loss: 0.7325 | Val Acc: 85.12%
  EarlyStopping counter: 3 out of 5
Epoch 8/15 | Train Loss: 0.0566 | Train Acc: 98.17% | Val Loss: 0.5916 | Val Acc: 86.32%
  EarlyStopping counter: 4 out of 5
Epoch 9/15 | Train Loss: 0.0449 | Train Acc: 9

## Conclusions
The model reached 85% accuracy on the validation set regardless of what we did with the data.

ResNet50 is overfitting on CIFAR-10, which is hard to solve without more aggressive regularization and more tampering of its architecture. For this reason we are certain that it would overfit much harder on the chess piece dataset, which is smaller and simpler, even for our simple CNN.

 As we've seen, ResNet50 has worse performance on smaller image datasets than other models that have been selectively designed for such datasets (like our DW-CNN which reached 90% accuracy on CIFAR-10 under the same conditions). This seeming paradox is because ResNet50's 'receptive field' is larger than the image itself and it is capable of memorizing entire images as 'high level features', something that a smaller model like the CNN we have constructed is unable to do and is thus forced to look for more generalizing patterns.