As we have seen during the course, a test accuracy above 0.99 on the MNIST dataset can be reached in a couple of minutes on a laptop. However, many machine learning applications, especially in embedded systems, also need to optimize the training time and the inference time.

Look for the fastest possible way to reach a test accuracy of 0.97 on hardware of your choice. The hardware may be the laptop of one of your group members (you can also compare two laptops). Briefly present the hardware used (type of processor, frequency, number of cores, etc.).
In that respect, your objective is to work on accelerating learning (model training) and/or inference, using any method that seems relevant.
Please note that your objective is not simply to reach the target test accuracy.
You are required to explore at least one method taken from a book or from a scientific article. You can find the book or the article yourself, or use one of the references presented during the course. For instance, Part II of the Deep Learning book (see pdf for link) contains many interesting discussions, but other sources are accepted as well. Any research is welcome, including research on libraries that are less well known than PyTorch or TensorFlow, such as JAX.
Here are some suggestions:
- You can start from the architecture that we use in this example (see pdf for link), and then try other methods to accelerate training or inference.
- Try to isolate parameters, and present and analyze the results of code profiling in order to find speed bottlenecks. Keep in mind that profiling GPU or PyTorch code has its own specificities.
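
As a starting point for the profiling suggestion, Python's built-in `cProfile` and `pstats` modules can rank functions by cumulative time. The sketch below is only illustrative: `slow_step`, `training_step` and `profile_report` are hypothetical names standing in for real training code. `cProfile` measures time at the level of Python function calls and does not see inside asynchronous GPU kernels, so GPU or PyTorch code calls for a dedicated profiler.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for an expensive operation (e.g. data loading)
    return sum(i * i for i in range(n))


def training_step():
    # Hypothetical stand-in for one training iteration
    return slow_step(200_000) + slow_step(1_000)


def profile_report(func, top=5):
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(training_step)
print(report)
```

Sorting by cumulative time puts the functions that contain the bottleneck at the top of the report, which helps decide where a finer-grained profiler is worth applying.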

In [12]:
import torch
import platform

print("PyTorch version:", torch.__version__)
print("Platform:", platform.platform())
print("Processor:", platform.processor())
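# Note: get_num_threads() reports PyTorch's intra-op thread count,
# which may differ from the number of physical cores
print("Number of CPU cores:", torch.get_num_threads())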
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

PyTorch version: 2.9.1
Platform: macOS-15.6.1-arm64-arm-64bit-Mach-O
Processor: arm
Number of CPU cores: 8
CUDA available: False
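
The value printed as the number of CPU cores comes from `torch.get_num_threads()`, which is a PyTorch thread setting rather than a hardware property. As a complement, a minimal sketch using only the standard library could report the logical core count with `os.cpu_count()`. The helper name `describe_hardware` is hypothetical, and clock frequency is not exposed by these modules, so it would have to be looked up separately.

```python
import os
import platform


def describe_hardware():
    """Collect basic hardware facts using only the standard library."""
    return {
        "system": platform.system(),
        "machine": platform.machine(),      # CPU architecture string
        "processor": platform.processor(),  # may be empty or generic on some systems
        "logical_cores": os.cpu_count(),    # logical cores seen by the OS (may be None)
    }


info = describe_hardware()
for key, value in info.items():
    print(f"{key}: {value}")
```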


In [14]:
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Transform
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load datasets
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# DataLoaders
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2)
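
Because MNIST fits in memory, one option worth measuring is to normalize the whole dataset once and slice mini-batches from a single tensor, which avoids per-sample transforms and worker processes. The sketch below is an assumption about how this could be done, not part of the original notebook: `preload` and `iterate_batches` are hypothetical helpers, and synthetic data stands in for `train_dataset.data` and `train_dataset.targets`. Whether it is actually faster on a given machine has to be checked by timing it.

```python
import torch


def preload(images_uint8, labels, mean=0.1307, std=0.3081):
    """Normalize a whole uint8 image tensor once, mirroring ToTensor + Normalize."""
    x = images_uint8.float().div(255.0)    # ToTensor: scale pixel values to [0, 1]
    x = x.sub(mean).div(std).unsqueeze(1)  # Normalize, then add the channel dimension
    return x, labels


def iterate_batches(x, y, batch_size, shuffle=True):
    """Yield (images, labels) mini-batches by indexing the preloaded tensors."""
    n = x.shape[0]
    order = torch.randperm(n) if shuffle else torch.arange(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]


# Synthetic stand-in for MNIST so the sketch runs without downloading data
fake_images = torch.randint(0, 256, (1000, 28, 28), dtype=torch.uint8)
fake_labels = torch.randint(0, 10, (1000,))

x_all, y_all = preload(fake_images, fake_labels)
batches = list(iterate_batches(x_all, y_all, batch_size=128))
print(len(batches), tuple(batches[0][0].shape))
```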

In [15]:
import torch.nn as nn
import torch.nn.functional as F

class FastCNN(nn.Module):
    def __init__(self):
        super(FastCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc1 = nn.Linear(32*7*7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.pool = nn.MaxPool2d(2, 2)
        
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 28->14
        x = self.pool(F.relu(self.conv2(x)))  # 14->7
        x = x.view(-1, 32*7*7)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
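
To see where the parameters of `FastCNN` sit, the layer sizes can be worked out by hand. The short sketch below does the arithmetic in plain Python from the layer shapes given in the class; the helper names `conv2d_params` and `linear_params` are hypothetical.

```python
def conv2d_params(in_ch, out_ch, kernel):
    # weights (out_ch * in_ch * k * k) plus one bias per output channel
    return out_ch * in_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    # weight matrix plus one bias per output feature
    return in_features * out_features + out_features


layer_params = {
    "conv1": conv2d_params(1, 16, 3),
    "conv2": conv2d_params(16, 32, 3),
    "fc1": linear_params(32 * 7 * 7, 128),
    "fc2": linear_params(128, 10),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count} parameters ({100 * count / total_params:.1f}%)")
print("total:", total_params)
```

The first fully connected layer holds about 97% of the 206,922 parameters. Parameter count is not the same as compute cost, so this only suggests where to look; profiling is still needed to locate the actual bottleneck.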

In [16]:
import torch.optim as optim
import time

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FastCNN().to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Loss scaler for mixed precision; only enabled when CUDA is used
scaler = torch.amp.GradScaler("cuda", enabled=(device.type == "cuda"))

num_epochs = 5  # Usually 2-5 is enough to reach 0.97

start_time = time.time()
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        if device.type == "cuda":
            # Mixed-precision forward pass (CUDA only)
            with torch.autocast(device_type="cuda"):
                outputs = model(images)
                loss = criterion(outputs, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
        running_loss += loss.item()
    
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")

end_time = time.time()
print("Training time: {:.2f} seconds".format(end_time - start_time))



Epoch [1/5], Loss: 0.2158
Epoch [2/5], Loss: 0.0582
Epoch [3/5], Loss: 0.0412
Epoch [4/5], Loss: 0.0317
Epoch [5/5], Loss: 0.0238
Training time: 60.85 seconds
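
The loop above runs a fixed number of epochs, whereas the stated objective is the time needed to reach a test accuracy of 0.97. One way to measure that, sketched here under assumptions, is to evaluate after every epoch and stop as soon as the target is met. `time_to_target`, `train_one_epoch` and `evaluate` are hypothetical names, and the fake accuracies only make the sketch runnable on its own.

```python
import time


def time_to_target(train_one_epoch, evaluate, target=0.97, max_epochs=10):
    """Train epoch by epoch and stop as soon as the accuracy reaches target.

    Returns (epochs_run, accuracy, elapsed_seconds); elapsed time includes evaluation.
    """
    start = time.perf_counter()
    accuracy = 0.0
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        accuracy = evaluate()
        epochs_run = epoch
        if accuracy >= target:
            break
    return epochs_run, accuracy, time.perf_counter() - start


# Hypothetical stand-ins: no real training, accuracy simply improves each epoch
fake_accuracies = iter([0.91, 0.96, 0.975, 0.98])
epochs_run, accuracy, elapsed = time_to_target(
    train_one_epoch=lambda: None,
    evaluate=lambda: next(fake_accuracies),
)
print(f"Reached {accuracy:.3f} after {epochs_run} epochs in {elapsed:.4f} s")
```

With the notebook's code, `train_one_epoch` would wrap the inner batch loop and `evaluate` the test loop. Evaluating more often than once per epoch gives a finer time measurement at the cost of extra evaluation time.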


In [17]:
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        predicted = outputs.argmax(dim=1)  # class with the highest logit for each sample
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print("Test Accuracy: {:.4f}".format(accuracy))

Test Accuracy: 0.9896
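
The cell above reports accuracy but not how long inference takes. A minimal timing sketch, assuming a generic callable in place of the model, could use warm-up runs and the median of several repetitions to reduce noise. `benchmark` and `fake_inference` are hypothetical names.

```python
import statistics
import time


def benchmark(fn, warmup=3, repeats=20):
    """Return the median and minimum wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (caches, lazy initialisation)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)


# Hypothetical workload standing in for a forward pass on one batch
def fake_inference():
    return sum(i * i for i in range(50_000))


median_s, best_s = benchmark(fake_inference)
print(f"median: {median_s * 1e3:.3f} ms, best: {best_s * 1e3:.3f} ms")
```

With the notebook's model, `fn` could be a closure that calls `model(images)` under `torch.no_grad()`. On a CUDA device, `torch.cuda.synchronize()` should be called before reading the clock, because GPU kernels run asynchronously.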


In [18]:
import torch.autograd.profiler as profiler

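with profiler.profile(record_shapes=True) as prof:
    with profiler.record_function("model_inference"):
        # Note: the batch is fetched inside the profiled region, so DataLoader
        # worker start-up is counted under "model_inference" as well
        model(next(iter(train_loader))[0].to(device))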
print(prof.key_averages().table(sort_by="cuda_time_total" if device.type=="cuda" else "cpu_time_total", row_limit=10))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference        27.22%     308.823ms       100.00%        1.135s        1.135s             1  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        72.10%     818.020ms        72.11%     818.037ms     818.037ms             1  
                                           aten::conv2d         0.00%      15.790us         0.40%       4.548ms       2.274ms             2  
                                      aten::convolution         0.00%      50.627us         0.40%       4.533ms       2.266ms             2  
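
In the table above, most of the time attributed to `model_inference` is the `enumerate(DataLoader)` row (about 818 ms of 1.135 s), because the batch is fetched inside the profiled region. A sketch of how the forward pass alone could be profiled is given below, using `torch.profiler`. The small model and the random batch are stand-ins chosen so the example runs on its own; they are assumptions, not the notebook's `FastCNN` or MNIST data.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Small stand-in model and random batch so the sketch is self-contained
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
)
model.eval()
images = torch.randn(128, 1, 28, 28)  # prepared before profiling starts

with torch.no_grad():
    model(images)  # warm-up call, excluded from the profile
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            model(images)

table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

Fetching the batch and running a warm-up call before the profiler starts keeps data loading and one-time initialisation out of the measurement, so the remaining rows reflect the model's own operators.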
      