# **Basic Introduction to PyTorch**
## Prepared by: **Lamine Deen**


  
  ---------------------------------------------

**1. Introduction to PyTorch**

What is PyTorch?

- PyTorch is a flexible and powerful library for deep learning. It allows for the building of neural networks, handling tensors (multi-dimensional arrays), and performing automatic differentiation with ease.

Why PyTorch for Deep Learning?

- Easy to understand and debug with dynamic computation graphs.
- Rich support for GPU acceleration.
- Active community and robust ecosystem.

  
  ---------------------------------------------

**2. Tensors: The Foundation of PyTorch**

What is a Tensor?

- In PyTorch, a tensor is the fundamental data structure used for building deep learning models. It can represent data in many dimensions—scalars, vectors, matrices, and higher-dimensional generalizations.

- Scalar: A 0D tensor (e.g., 5).

- Vector: A 1D tensor (e.g., [1, 2, 3]).

- Matrix: A 2D tensor (e.g., [[1, 2], [3, 4]]).

Let's start by creating some basic tensors:

In [1]:
import torch

# Scalar tensor (0D)
scalar = torch.tensor(5)
print(scalar)

# Vector tensor (1D)
vector = torch.tensor([1, 2, 3])
print(vector)

# Matrix tensor (2D)
matrix = torch.tensor([[1, 2], [3, 4]])
print(matrix)

# 3D tensor (a stack of two 2x2 matrices)
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(tensor3d)


tensor(5)
tensor([1, 2, 3])
tensor([[1, 2],
        [3, 4]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])
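
The dimensionality of each tensor can be checked directly. The short sketch below recreates the tensors from the cell above and inspects their ndim, shape, and dtype attributes; it assumes nothing beyond a standard PyTorch install.

```python
import torch

scalar = torch.tensor(5)
vector = torch.tensor([1, 2, 3])
matrix = torch.tensor([[1, 2], [3, 4]])
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("tensor3d", tensor3d)]:
    # ndim is the number of dimensions; shape gives the size along each one
    print(name, t.ndim, tuple(t.shape), t.dtype)
```

Python integers produce torch.int64 tensors by default, while Python floats produce torch.float32.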


Basic Tensor Operations

Once we have tensors, we can perform operations like addition, multiplication, etc.



In [2]:
# Tensor addition
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
z = x + y
print(z)  # Output: tensor([5., 7., 9.])

# Tensor multiplication (element-wise)
z = x * y
print(z)  # Output: tensor([4., 10., 18.])

# Matrix multiplication (rows of mat1 times columns of mat2, not element-wise)
mat1 = torch.tensor([[1, 2], [3, 4]])
mat2 = torch.tensor([[5, 6], [7, 8]])
result = torch.mm(mat1, mat2)  # or mat1 @ mat2
print(result)


tensor([5., 7., 9.])
tensor([ 4., 10., 18.])
tensor([[19, 22],
        [43, 50]])
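
Element-wise multiplication, matrix multiplication, and the dot product are easy to mix up. The following is a minimal sketch that reuses the values from the cell above to show all three side by side.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
mat1 = torch.tensor([[1, 2], [3, 4]])
mat2 = torch.tensor([[5, 6], [7, 8]])

dot = torch.dot(x, y)        # 1*4 + 2*5 + 3*6 = 32, a single number
elementwise = mat1 * mat2    # multiplies matching entries
matmul = mat1 @ mat2         # same result as torch.mm(mat1, mat2)

print(dot)
print(elementwise)
print(matmul)
```

The dot product applies to two 1D tensors and returns a scalar, whereas torch.mm expects two 2D tensors.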



Moving Tensors to GPU

One of the major advantages of PyTorch is its easy GPU support for tensors.

In [3]:
# Check for a CUDA-capable NVIDIA GPU (available on Linux and Windows)
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("cuda")
# Otherwise check for the MPS backend (Apple Silicon GPUs on macOS)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("apple silicon")
# Fall back to the CPU when no GPU backend is available
else:
    device = torch.device("cpu")
    print("no silicon, no cuda")


# Creating a tensor directly on the selected device
x = torch.tensor([1, 2, 3], device=device)

# Moving an existing tensor to the selected device
y = torch.tensor([4, 5, 6])
y = y.to(device)


cuda
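
The output above depends on the machine. The sketch below is a device-agnostic variant that also checks where a result lives and copies it back to the CPU. The hasattr guard is an assumption added for older PyTorch builds that lack the MPS backend.

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.tensor([1, 2, 3], device=device)
y = torch.tensor([4, 5, 6]).to(device)

z = x + y                    # both operands live on the same device
print(z.device)              # e.g. cpu, cuda:0, or mps:0
values = z.cpu().tolist()    # copy back to the CPU and convert to a Python list
print(values)                # [5, 7, 9]
```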


  
  ---------------------------------------------

**3. Autograd: Automatic Differentiation in PyTorch**

In deep learning, we need to calculate gradients during backpropagation to update model parameters. PyTorch simplifies this process with its Autograd module, which automatically computes gradients of tensors involved in computational graphs.

What is Autograd?

- Autograd is PyTorch’s automatic differentiation engine. It tracks operations performed on tensors and automatically computes their derivatives during backpropagation.
- When you perform operations on tensors, PyTorch builds a computational graph in the background. This graph stores information about operations and allows for efficient gradient computation.

Key Concept: Computational Graph

- The computational graph represents all operations and tensors as nodes and edges. Each operation applied to a tensor creates a new tensor, and PyTorch tracks these operations.
- When you call .backward() on a loss, PyTorch traverses this graph in reverse to compute gradients for every leaf tensor in the graph that requires gradients (for a model, these are its parameters).
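
The graph can be inspected through a few tensor attributes. The sketch below is illustrative only: it uses requires_grad=True (explained in the next subsection) to switch tracking on, then looks at is_leaf and grad_fn before and after an operation.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)  # leaf tensor created by the user
y = x * 2                                          # result of a tracked operation
z = y.sum()

print(x.is_leaf, x.grad_fn)    # True None: nothing produced x
print(y.is_leaf, y.grad_fn)    # False, plus the node that produced y

z.backward()                   # walk the graph in reverse
print(x.grad)                  # d(sum(2x))/dx = 2 for each element

detached = y.detach()          # same values, cut off from the graph
print(detached.requires_grad)  # False
```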

Tracking Gradients with requires_grad

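- By default, tensors you create do not track operations for gradients. To enable gradient tracking, set requires_grad=True when creating the tensor. (Parameters of torch.nn layers have it enabled automatically.)
- For a tensor with requires_grad=True, the gradient is stored in its .grad attribute when .backward() is called, and it accumulates (adds up) across calls until it is reset.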

In [4]:
# Create a tensor with requires_grad=True
x = torch.tensor([2.0, 3.0], requires_grad=True, device=device)

# Perform operations on the tensor
y = x ** 2 + 3 * x + 1
print(y)  # Outputs the result of the operations

# Compute gradients by calling .backward() on the output
y.sum().backward()  # backward() without arguments needs a scalar, so reduce y first
print(x.grad)  # Gradient of y.sum() with respect to x, i.e. 2 * x + 3


tensor([11., 19.], device='cuda:0', grad_fn=<AddBackward0>)
tensor([7., 9.], device='cuda:0')
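
As a cross-check, the gradient can be derived by hand: the derivative of x² + 3x + 1 is 2x + 3, which gives 7 and 9 for x = 2 and x = 3. The sketch below compares autograd against that formula and shows an alternative to y.sum(): passing a tensor of ones as the gradient argument of backward(). It runs on the CPU for simplicity.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2 + 3 * x + 1

# backward() on a non-scalar needs an explicit upstream gradient;
# a tensor of ones gives the same result as y.sum().backward()
y.backward(gradient=torch.ones_like(y))

analytic = 2 * x.detach() + 3            # derivative computed by hand
print(x.grad)                            # tensor([7., 9.])
print(torch.allclose(x.grad, analytic))  # True
```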


Using torch.no_grad() for Inference: When you don’t need to compute gradients (for example, during inference), you can disable autograd with torch.no_grad(). This saves memory and speeds up computation by preventing the creation of the computational graph.

In [5]:
with torch.no_grad():
    # Perform operations without tracking gradients
    y = x ** 2 + 3 * x + 1
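
The effect of torch.no_grad() can be seen on the result tensor itself. A minimal sketch, computing the same expression with and without tracking:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

tracked = x ** 2 + 3 * x + 1
with torch.no_grad():
    untracked = x ** 2 + 3 * x + 1

print(tracked.requires_grad, tracked.grad_fn is not None)      # True True
print(untracked.requires_grad, untracked.grad_fn is not None)  # False False
print(torch.equal(tracked, untracked))                         # True: same values
```

The values are identical; only the bookkeeping needed for backpropagation is skipped.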


  
  ---------------------------------------------

**4. Neural Networks with torch.nn**

In PyTorch, neural networks are built using the torch.nn module, which provides:

- Layers: Predefined building blocks like fully connected layers, convolutional layers, etc.

- Activation Functions: Functions that introduce non-linearity, allowing neural networks to solve complex problems.

- Loss Functions and Optimizers: Tools to define objectives and methods for updating model parameters.

Step 1: Define a Simple Neural Network

In [6]:
import torch.nn as nn

# Define the network by subclassing nn.Module
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        # Define layers: a fully connected layer with 2 inputs and 3 outputs
        self.fc1 = nn.Linear(2, 3)  # input size = 2, output size = 3
        self.fc2 = nn.Linear(3, 1)  # input size = 3, output size = 1

    # Define forward pass
    def forward(self, x):
        x = self.fc1(x)  # Pass through the first layer
        x = torch.relu(x)  # Apply activation function (ReLU)
        x = self.fc2(x)  # Pass through the second layer
        return x

# Initialize the network
net = SimpleNN()
print(net)


SimpleNN(
  (fc1): Linear(in_features=2, out_features=3, bias=True)
  (fc2): Linear(in_features=3, out_features=1, bias=True)
)


Explanation:

nn.Linear(in_features, out_features): This defines a fully connected layer, where each input is connected to each output. In this example:

- fc1 connects a 2-dimensional input to a 3-dimensional output.

- fc2 connects the 3-dimensional output from fc1 to a 1-dimensional output.

forward() method: This method defines how data flows through the network. In the example:

- Data is passed through the first fully connected layer (fc1).

- A ReLU activation function is applied to introduce non-linearity.

- The output is passed through the second fully connected layer (fc2).

Activation Functions: These functions are used to introduce non-linearity. Without them, a neural network would just be a linear model. Common activation functions include:

- ReLU (Rectified Linear Unit): Sets all negative values to zero and leaves positive values unchanged. It is widely used in hidden layers.

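- Sigmoid and Tanh: Sigmoid squashes values into the range (0, 1) and is typically used in the output layer for binary classification. Tanh squashes values into the range (-1, 1). For multi-class classification, the output layer usually relies on Softmax instead (nn.CrossEntropyLoss applies it internally, so the network returns raw scores).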
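
To make the differences concrete, the sketch below applies ReLU, Sigmoid, and Tanh to the same three values, plus Softmax, which is commonly used for multi-class outputs. The input values are arbitrary.

```python
import torch

scores = torch.tensor([-2.0, 0.0, 3.0])

print(torch.relu(scores))       # negatives become 0: [0., 0., 3.]
print(torch.sigmoid(scores))    # values in (0, 1); sigmoid(0) = 0.5
print(torch.tanh(scores))       # values in (-1, 1); tanh(0) = 0

probs = torch.softmax(scores, dim=0)   # non-negative values that sum to 1
print(probs, probs.sum())
```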

Step 2: Create Input Data and Forward Pass

In [7]:
# Create a sample input tensor (batch size of 1, 2 features)
input_data = torch.tensor([[1.0, 2.0]])

# Perform a forward pass (getting the output)
output = net(input_data)
print(output)  # Output will be a tensor of shape (1, 1)


tensor([[0.2837]], grad_fn=<AddmmBackward0>)


Explanation:

- Input Shape: We pass in a tensor with shape (1, 2), meaning a batch size of 1 and 2 input features (matching the fc1 layer’s input size).

- Forward Pass: When calling net(input_data), PyTorch automatically executes the forward() method, passing the data through the layers and returning the network's output.
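
The same network accepts any batch size, since only the last dimension has to match the layer's input size. The sketch below redefines the network from Step 1 so that it runs on its own, passes a batch of four samples, and counts the trainable parameters.

```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 3)
        self.fc2 = nn.Linear(3, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = SimpleNN()
batch = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])  # shape (4, 2)
output = net(batch)
print(output.shape)   # torch.Size([4, 1]): one prediction per sample

# fc1: 2*3 weights + 3 biases = 9; fc2: 3*1 weights + 1 bias = 4
n_params = sum(p.numel() for p in net.parameters())
print(n_params)       # 13
```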


  
  ---------------------------------------------

To train a neural network, we need:

- A loss function to quantify how far the predictions are from the actual values.
- An optimizer to update the model's weights to minimize the loss.

Let’s use Mean Squared Error (MSE) loss and Stochastic Gradient Descent (SGD) for optimization:

Step 3: Define a Loss Function and Optimizer

In [8]:
# Define a loss function (Mean Squared Error)
criterion = nn.MSELoss()

# Define an optimizer (Stochastic Gradient Descent)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)  # Learning rate = 0.01


Explanation:

- nn.MSELoss(): This is used when the task is regression. It calculates the mean squared error between the predicted and target values.

- torch.optim.SGD(): The optimizer adjusts the weights of the network to minimize the loss. It takes the model parameters (net.parameters()) and a learning rate (lr=0.01).

- net.parameters() returns the weights and biases associated with each layer, from input to output.
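
To see what the loss and the optimizer compute, it helps to work through one step by hand. The sketch below uses a hypothetical model with a single weight w (not the network above): the prediction is w * x, the MSE loss is (w * x - target)², and plain SGD updates w to w - lr * gradient.

```python
import torch
import torch.nn as nn

w = torch.tensor([1.0], requires_grad=True)   # a single trainable weight
x = torch.tensor([2.0])
target = torch.tensor([4.0])

criterion = nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

loss = criterion(w * x, target)   # (1*2 - 4)^2 = 4
loss.backward()                   # dloss/dw = 2 * (w*x - target) * x = -8
print(loss.item(), w.grad.item())

optimizer.step()                  # w <- 1 - 0.01 * (-8) = 1.08
print(w.item())
```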

  
  ---------------------------------------------

Step 4: Training the Network

The training process involves:

1. Forward Pass: Compute predictions.
2. Compute Loss: Compare predictions with actual values using the loss function.
3. Backward Pass: Compute gradients via backpropagation.
4. Update Weights: Use the optimizer to adjust weights.

In [9]:
# Sample target output
target = torch.tensor([[1.0]])

# Training Loop
for epoch in range(100):  # Train for 100 epochs
    optimizer.zero_grad()  # Zero the gradient buffers

    output = net(input_data)  # Forward pass: compute the output
    loss = criterion(output, target)  # Compute the loss

    loss.backward()  # Backward pass: compute the gradients
    optimizer.step()  # Update weights

    if epoch % 10 == 0:  # Print loss every 10 epochs
        print(f'Epoch {epoch}, Loss: {loss.item()}')


Epoch 0, Loss: 0.5130363702774048
Epoch 10, Loss: 0.20748411118984222
Epoch 20, Loss: 0.08128686249256134
Epoch 30, Loss: 0.0293605774641037
Epoch 40, Loss: 0.009833571501076221
Epoch 50, Loss: 0.0031189327128231525
Epoch 60, Loss: 0.0009556243312545121
Epoch 70, Loss: 0.00028686862788163126
Epoch 80, Loss: 8.512281783623621e-05
Epoch 90, Loss: 2.509541081963107e-05


Explanation:

- optimizer.zero_grad(): Resets the gradients to zero. This is necessary because gradients accumulate by default in PyTorch.

- loss.backward(): Computes the gradients of the loss with respect to the model parameters.

- optimizer.step(): Updates the model parameters using the gradients.

- The training loop runs for 100 epochs, printing the loss every 10 epochs.

Gradients accumulate by default in PyTorch:
- optimizer.zero_grad() is called at the start of each iteration to reset the gradients, so that gradients from previous batches are not added to the current batch's gradients.
- Accumulating gradients on purpose is useful in specific cases (for example, simulating a larger batch size by summing gradients over several small batches before calling optimizer.step()), but in most cases gradients should be reset for every batch.
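
The gradient accumulation case can be sketched as follows. This is an illustration under stated assumptions, not part of the training loop above: the model is a hypothetical nn.Linear(2, 1), the data is random, and each micro-batch loss is divided by the number of accumulation steps so that the summed gradient matches the one from a single large batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
inputs = torch.randn(4, 2)
targets = torch.randn(4, 1)

# Reference: gradient from one batch of 4 samples
model.zero_grad()
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 2 samples, each loss scaled by 1/2
model.zero_grad()
accum_steps = 2
for chunk_x, chunk_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = criterion(model(chunk_x), chunk_y) / accum_steps
    loss.backward()   # gradients add up in .grad across calls
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```

In a real loop, optimizer.step() and optimizer.zero_grad() would be called once after the accumulation steps rather than after every micro-batch.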

  
  ---------------------------------------------

Step 5: Evaluate the Model
- Once the model has been trained, you can use it to make predictions on new data. When making predictions, we don’t need to compute gradients, so we wrap the code in torch.no_grad() to save memory and improve efficiency.

In [10]:
# New input data
test_data = torch.tensor([[4.0, 5.0]])

# Disable gradient computation for inference
with torch.no_grad():
    prediction = net(test_data)
    print("Prediction:", prediction)


Prediction: tensor([[1.6793]])


  
  ---------------------------------------------

**5. Working with Data in PyTorch**

Handling data efficiently is critical. PyTorch provides two key abstractions to help with this: Dataset and DataLoader. These allow for efficient loading, preprocessing, and batching of data, which is essential for training models on large datasets.

Step 1: The Dataset Class

The Dataset class in PyTorch provides an abstraction for datasets. You can either use built-in datasets (like MNIST, CIFAR-10) or create a custom dataset by subclassing torch.utils.data.Dataset.

Built-in Datasets Example
Let's start by loading a simple dataset like MNIST, a popular dataset of handwritten digits.

In [15]:
from torchvision import datasets, transforms

# Define a transformation to normalize the data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Load the MNIST dataset
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)


Explanation:

transforms.Compose() allows you to combine multiple transformations. In this case, we are using two: ToTensor() and Normalize().

ToTensor()

- It changes the image's shape from (Height, Width, Channels) (typical for images) to (Channels, Height, Width) (expected format for PyTorch).
- It also scales pixel values from the range [0, 255] (common in image formats) to the range [0, 1], which keeps the inputs in a small, consistent range for training.

Normalize()

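- Normalize(mean, std) computes (x - mean) / std for each channel. With mean=0.5 and std=0.5, pixel values move from the range [0, 1] to the range [-1, 1]. Note that 0.5 is a convenient choice rather than the actual mean or standard deviation of MNIST, so the result is not exactly zero-mean with unit standard deviation. Keeping the inputs in a small range around zero generally helps training converge.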

datasets.MNIST: 
Downloads the MNIST dataset and applies the transformations. The train=True argument indicates we're loading the training data.
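
The arithmetic behind Normalize((0.5,), (0.5,)) can be reproduced with plain tensors. The sketch below applies the per-channel formula (x - mean) / std by hand to three sample pixel values, so it does not need torchvision or a download.

```python
import torch

pixels = torch.tensor([0.0, 0.5, 1.0])   # values after ToTensor() lie in [0, 1]
mean, std = 0.5, 0.5

normalized = (pixels - mean) / std       # the formula applied per channel
print(normalized)                        # -1, 0, and 1: the range is now [-1, 1]
```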

  
  ---------------------------------------------

Custom Dataset Example

You can also create a custom dataset by subclassing torch.utils.data.Dataset.  
This is useful when your data is stored in formats like CSV files, image folders, or other custom formats.

In [16]:
from torch.utils.data import Dataset

# Custom dataset class
class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data  # Your input data
        self.labels = labels  # Corresponding labels

    def __len__(self):
        return len(self.data)  # Return the size of the dataset

    def __getitem__(self, idx):
        # Return a tuple of (input, label) for the given index
        return self.data[idx], self.labels[idx]

# Example data
data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 3.0]])
labels = torch.tensor([0, 1, 1, 0, 0])

# Create an instance of MyDataset
dataset = MyDataset(data, labels)


Explanation:  
- __len__(): Returns the total number of samples in the dataset.
- __getitem__(): Returns a single sample (input and label) from the dataset based on the index provided (idx).
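
Both methods are called implicitly by Python's len() and indexing syntax. The sketch below repeats the class so that it runs on its own, then compares it with torch.utils.data.TensorDataset, a built-in class that offers the same behavior for data already held in tensors.

```python
import torch
from torch.utils.data import Dataset, TensorDataset

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 3.0]])
labels = torch.tensor([0, 1, 1, 0, 0])
dataset = MyDataset(data, labels)

print(len(dataset))          # 5, via __len__
sample, label = dataset[2]   # via __getitem__
print(sample, label)         # tensor([5., 6.]) tensor(1)

# Built-in equivalent for tensor data
builtin = TensorDataset(data, labels)
print(len(builtin), builtin[2])
```

A custom class becomes worthwhile once __getitem__ has to do real work, such as reading a file from disk or applying a transform.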

  
  ---------------------------------------------

Step 2: The DataLoader Class

The DataLoader class is used to load data in batches and shuffle it, which is crucial for training models efficiently. It handles:

- Batching: Grouping multiple samples into batches.
- Shuffling: Randomizing the order of data at each epoch to improve learning.

In [17]:
from torch.utils.data import DataLoader

# Create a DataLoader for the custom dataset
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate through the DataLoader
for batch_idx, (inputs, labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f"Inputs: {inputs}")
    print(f"Labels: {labels}")


Batch 0:
Inputs: tensor([[1., 2.],
        [3., 4.]])
Labels: tensor([0, 1])
Batch 1:
Inputs: tensor([[5., 6.],
        [7., 8.]])
Labels: tensor([1, 0])
Batch 2:
Inputs: tensor([[9., 3.]])
Labels: tensor([0])


Explanation:

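- batch_size: The number of samples in each batch. In this case, 2 samples are loaded in each batch, and the last batch holds the single remaining sample.
- shuffle=True: Randomizes the order of the data at the start of each epoch, so the batches you see will usually differ from run to run and from the output shown above.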
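
Two further DataLoader options are often useful alongside these. The sketch below is a small illustration rather than part of the tutorial's pipeline: it passes a seeded torch.Generator to make the shuffled order repeatable, and uses drop_last=True to discard the final, smaller batch. It uses TensorDataset in place of the custom class to stay short.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 3.0]])
labels = torch.tensor([0, 1, 1, 0, 0])
dataset = TensorDataset(data, labels)

# A seeded generator makes the shuffled order repeatable between runs
generator = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=generator)
print(len(loader))         # 3 batches: 2 + 2 + 1 samples

# drop_last=True discards the final, incomplete batch
loader_drop = DataLoader(dataset, batch_size=2, shuffle=True, drop_last=True)
print(len(loader_drop))    # 2 batches

seen = sum(inputs.size(0) for inputs, _ in loader)
print(seen)                # 5: every sample appears once per epoch
```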

  
  ---------------------------------------------

Step 3: Putting It All Together

Let’s integrate everything we’ve learned so far (building a simple neural network and handling data) into a complete example. We’ll load the MNIST dataset, create a neural network, train it using a DataLoader, and evaluate it on the test set.

Code Example:

In [19]:
# Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)  # Fully connected layer (input: 28x28 pixels, output: 128)
        self.fc2 = nn.Linear(128, 10)  # Output layer (10 classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation after first layer
        x = self.fc2(x)  # Linear output
        return x

# Define transformations (same for train and test)
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Load the MNIST training and test datasets
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize the network, loss function, and optimizer
net = SimpleNN()
criterion = nn.CrossEntropyLoss()  # Cross-entropy loss for classification
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Training loop
for epoch in range(5):  # Train for 5 epochs
    net.train()  # Set the network to training mode
    for inputs, labels in train_loader:
        inputs = inputs.view(-1, 28 * 28)  # Flatten the images
        
        optimizer.zero_grad()  # Clear the gradients
        outputs = net(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Compute the loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update the weights

    # Note: this reports the loss of the last batch in the epoch, not the epoch average
    print(f'Epoch {epoch + 1}, Training Loss: {loss.item():.4f}')

# Testing loop
correct = 0
total = 0
test_loss = 0.0

net.eval()  # Set the network to evaluation mode
with torch.no_grad():  # Disable gradient computation
    for inputs, labels in test_loader:
        inputs = inputs.view(-1, 28 * 28)
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        test_loss += loss.item() * inputs.size(0)

        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

avg_loss = test_loss / total
accuracy = 100 * correct / total

print(f'\nTest Loss: {avg_loss:.4f}, \nTest Accuracy: {accuracy:.2f}%')


Epoch 1, Training Loss: 0.4330
Epoch 2, Training Loss: 0.4072
Epoch 3, Training Loss: 0.1734
Epoch 4, Training Loss: 0.1398
Epoch 5, Training Loss: 0.3940

Test Loss: 0.2557, 
Test Accuracy: 92.68%


Explanation:

- Dataset: We load the MNIST dataset, applying transformations to convert images to tensors and normalize them.
- Train DataLoader: We use the DataLoader to shuffle and load the data in batches of 64 samples at a time.
- Test DataLoader:
  - train=False: This ensures that we are loading the test split of the MNIST dataset, not the training data.
  - shuffle=False: We don’t need to shuffle the test data because we are not training on it; we’re simply evaluating the model's performance on each sample.
- Neural Network: The network has two fully connected layers. The first takes a 28x28 pixel image (flattened into a vector of size 784) and produces 128 hidden features, followed by ReLU. The second maps those 128 features to 10 output scores, one per class (digits 0–9).
- Training Loop: For each epoch, we iterate over the training batches, compute the loss using cross-entropy, and update the model’s parameters using stochastic gradient descent. The printed training loss is the loss of the last batch in each epoch, which is why it can go up from one epoch to the next.
- optimizer.zero_grad() is called before each batch to reset the gradients, so that the gradients for the current batch are calculated independently of previous batches. If we don’t reset them, they accumulate across batches, which leads to incorrect weight updates and unstable training. Gradient accumulation can be useful when simulating larger batch sizes by accumulating gradients over multiple small batches, but in most cases, zeroing the gradients before each batch is the correct approach.
- Evaluating the Model: To evaluate the model on the test data, we switch to evaluation mode, disable gradient tracking, average the loss over the entire test dataset, and calculate accuracy by comparing the model's predictions to the actual labels.
  - net.eval(): This sets the model to evaluation mode. In this mode, layers such as dropout and batch normalization behave differently from training mode. This network contains neither, but switching to evaluation mode during testing is a good habit because it keeps such layers correct if they are added later.
  - torch.no_grad(): During evaluation, we don’t need to compute gradients (since we’re not updating the model). This context manager disables gradient tracking, making the process more memory-efficient and faster.
  - Loss Calculation: criterion returns the mean loss over a batch, so we multiply it by the batch size (inputs.size(0)) to accumulate the total loss for the entire dataset. At the end, we divide the total loss by the number of samples to get the average loss over the test dataset.
  - Accuracy Calculation: torch.max(outputs, 1) returns two tensors: the highest score for each sample in the batch and the index of that score. The index is the predicted class. We compare it to the actual class labels (predicted == labels) and sum the number of correct predictions. Accuracy is the ratio of correct predictions to total predictions, expressed here as a percentage.
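
The accuracy and average-loss arithmetic can be checked on a tiny example. All numbers below are hypothetical: three samples with three class scores each, and two per-batch mean losses with batch sizes 64 and 16 (the MNIST test set of 10,000 images splits into 156 batches of 64 and a final batch of 16).

```python
import torch

# Hypothetical scores for 3 samples and 3 classes
outputs = torch.tensor([[2.0, 0.5, 0.1],
                        [0.1, 0.2, 3.0],
                        [1.0, 2.0, 0.5]])
labels = torch.tensor([0, 2, 0])

max_scores, predicted = torch.max(outputs, 1)  # (values, indices) along the class dimension
print(predicted)                               # tensor([0, 2, 1])

correct = (predicted == labels).sum().item()   # 2 of 3 predictions match
accuracy = 100 * correct / labels.size(0)
print(f"{accuracy:.2f}%")                      # 66.67%

# Weighted average of per-batch mean losses
batch_losses, batch_sizes = [0.30, 0.50], [64, 16]
avg_loss = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)
print(f"{avg_loss:.2f}")                       # 0.34
```

Weighting by batch size matters because the last batch is smaller; a plain mean of the two batch losses would give 0.40 instead of 0.34. When only the indices are needed, outputs.argmax(dim=1) returns the same predictions.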

  
  ---------------------------------------------

Step 4: Step 3 Optimized for GPU Acceleration

Key Points:
- Check GPU Availability: We will check whether a GPU is available on the system (using MPS for Apple Silicon or CUDA for NVIDIA GPUs).
- Move Data and Model to GPU: Both the neural network and the input data need to be moved to the GPU to leverage its acceleration.
- Move Results Back to CPU When Needed: Scalar values can be read directly with .item(), but converting a tensor to NumPy (for example, for plotting) requires moving it to the CPU first with .cpu(). Gradients stay on the same device as the parameters and do not need to be moved.

Steps to Run on the GPU:
- Move the model to the selected device: Use .to(device), where device is MPS, CUDA, or CPU.
- Move input and target tensors to the same device during training and testing.
- Create the optimizer after moving the model to the device, as the torch.optim documentation recommends, so that it updates the parameters on the GPU.

In [40]:
# Check if a GPU is available
if torch.backends.mps.is_available():
    device = torch.device("mps")  # Use MPS for Apple Silicon
elif torch.cuda.is_available():
    device = torch.device("cuda")  # Use CUDA for NVIDIA GPUs
else:
    device = torch.device("cpu")  # Use CPU if no GPU is available

# Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)  # Fully connected layer (input: 28x28 pixels, output: 128)
        self.fc2 = nn.Linear(128, 10)  # Output layer (10 classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation after first layer
        x = self.fc2(x)  # Linear output
        return x

# Define transformations (same for train and test)
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# Load the MNIST training and test datasets
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Initialize the network, loss function, and optimizer
net = SimpleNN().to(device)  # Move the network to the selected device
criterion = nn.CrossEntropyLoss()  # Cross-entropy loss for classification
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)  # torch.optim is used directly; a bare `optim` was never imported

# Training loop
for epoch in range(5):  # Train for 5 epochs
    net.train()  # Set the network to training mode
    for inputs, labels in train_loader:
        inputs, labels = inputs.view(-1, 28 * 28).to(device), labels.to(device)  # Move inputs and labels to the GPU
        
        optimizer.zero_grad()  # Clear the gradients
        outputs = net(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Compute the loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update the weights

    print(f'Epoch {epoch + 1}, Training Loss: {loss.item():.4f}')

# Testing loop
correct = 0
total = 0
test_loss = 0.0

net.eval()  # Set the network to evaluation mode
with torch.no_grad():  # Disable gradient computation
    for inputs, labels in test_loader:
        inputs, labels = inputs.view(-1, 28 * 28).to(device), labels.to(device)  # Move inputs and labels to the GPU
        outputs = net(inputs)  # Forward pass
        loss = criterion(outputs, labels)  # Compute the loss
        test_loss += loss.item() * inputs.size(0)

        _, predicted = torch.max(outputs, 1)  # Get the predicted class
        correct += (predicted == labels).sum().item()  # Track the number of correct predictions
        total += labels.size(0)

avg_loss = test_loss / total
accuracy = 100 * correct / total

print(f'\nTest Loss: {avg_loss:.4f}, \nTest Accuracy: {accuracy:.2f}%')


Epoch 1, Training Loss: 0.4639
Epoch 2, Training Loss: 0.2711
Epoch 3, Training Loss: 0.1035
Epoch 4, Training Loss: 0.1338
Epoch 5, Training Loss: 0.1750

Test Loss: 0.2637, 
Test Accuracy: 92.43%


Explanation of GPU Usage:
- Detect GPU Availability: We check if an MPS device (for Apple Silicon) or CUDA device (for NVIDIA GPUs) is available. If neither is available, the code defaults to using the CPU.
- Move Model to GPU: The model is moved to the GPU with net = SimpleNN().to(device). This ensures that all operations (forward pass, backward pass, etc.) are performed on the GPU.
- Move Data to GPU: Before each forward pass, the input data (inputs) and the target labels (labels) are moved to the GPU using .to(device). Important: You must move both the model and the data to the same device (either CPU or GPU) for operations to work.
- Testing on the GPU: During testing, we also move the input data and labels to the GPU using .to(device) before making predictions.
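
A quick way to confirm that the model and a batch share a device is to compare the device of a parameter with the device of the data. The sketch below uses a hypothetical nn.Linear(4, 2) model and random inputs; the hasattr guard is an assumption for PyTorch builds without the MPS backend.

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = nn.Linear(4, 2).to(device)      # hypothetical small model
batch = torch.randn(3, 4).to(device)    # hypothetical batch of 3 samples

param_device = next(model.parameters()).device
print(param_device, batch.device)       # should be identical
assert param_device == batch.device, "model and data are on different devices"

output = model(batch)
print(output.shape)                     # torch.Size([3, 2])
```

If the two devices differ, the forward pass raises a RuntimeError, so a check like this can make the cause easier to spot.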

  
  ---------------------------------------------

Benefits of Using the GPU:
- Faster Computation: GPUs are optimized for parallel computation, making operations like matrix multiplication (which are common in deep learning) much faster.
- Efficient Training: Training deep learning models on large datasets, especially with complex architectures, is significantly faster when using GPUs.
- Noticeable Difference: The speedup becomes noticeable for medium-sized and large models; for small models it is often negligible.

For this particular task, the GPU may actually be slower because of the small model size, small batch size, and data transfer overhead. For small-scale tasks like MNIST, the CPU can sometimes match or even outperform the GPU. For larger models and datasets, the GPU will typically outperform the CPU.
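
One way to see this on your own machine is to time the same operation on each device. The sketch below is a rough illustration, not a rigorous benchmark: the helper time_matmul, the matrix size, and the repeat count are all arbitrary choices. It only covers CUDA, where torch.cuda.synchronize() is needed because GPU operations run asynchronously.

```python
import time
import torch

def time_matmul(device, size=256, repeats=5):
    """Average time of a (size x size) matrix multiplication on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()   # make sure setup work has finished
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for the queued GPU work to finish
    return (time.perf_counter() - start) / repeats

cpu_time = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_time * 1e3:.3f} ms per matmul")

if torch.cuda.is_available():
    gpu_time = time_matmul(torch.device("cuda"))
    print(f"CUDA: {gpu_time * 1e3:.3f} ms per matmul")
```

Increasing size usually widens the gap in the GPU's favor, while very small matrices tend to favor the CPU.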