<a href="https://colab.research.google.com/github/vijaygwu/IntroToDeepLearning/blob/main/CNNwithAndWithoutPIL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Explanation of the Code (without a CNN)**

This code compares two approaches to loading and preprocessing the MNIST dataset in PyTorch: one wraps the dataset in a custom class that explicitly round-trips every image through `PIL`, and the other uses `torchvision.datasets.MNIST` with its transforms directly. Both approaches train the same simple neural network to classify handwritten digits, and the results are compared in terms of training time and test accuracy.

Let's go through the code step by step.

---

### **1. Importing Required Libraries**

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms
from PIL import Image
import time
```

- **`torch`**: Core PyTorch library.
- **`torch.nn`**: Provides modules to build neural networks.
- **`torch.optim`**: Contains optimizers like SGD, Adam, etc.
- **`torch.nn.functional`**: Contains functions like `relu` and `cross_entropy` that are commonly used in neural networks.
- **`torch.utils.data.DataLoader`**: Loads datasets in batches during training.
- **`torchvision.datasets`**: Provides popular datasets like MNIST.
- **`torchvision.transforms`**: Contains functions to transform data, such as converting images to tensors and normalizing them.
- **`PIL`**: Used for handling image files manually.
- **`time`**: For tracking execution time of training loops.

---


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms
from PIL import Image
import time



### **2. Version 1: Using PIL for Image Handling**

#### Custom Dataset Class

```python
class MNISTDataset(Dataset):
    def __init__(self, mnist_data, transform=None):
        self.mnist_data = mnist_data  # Store the MNIST dataset from torchvision.
        self.transform = transform    # Transformation like converting to tensor and normalization.

    def __len__(self):
        return len(self.mnist_data)  # Return the number of images in the dataset.

    def __getitem__(self, index):
        img, label = self.mnist_data[index]  # Get an image and its label.
        img = transforms.ToPILImage()(img)   # Convert the tensor image back to a PIL image.

        if self.transform:
            img = self.transform(img)  # Apply the specified transformations (to tensor, normalization).

        return img, label  # Return the transformed image and its label.
```

- **Purpose**: This class wraps the `torchvision.datasets.MNIST` dataset. Because the wrapped dataset is created with `transforms.ToTensor()`, each image arrives as a tensor. The class converts it back to a `PIL` image, which is the point where custom `PIL` processing could be added, and then applies `transform` to turn it into a normalized tensor again.
  
#### Dataset Loading and Transformations

```python
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert the PIL image to a tensor.
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize the data with the mean and std of MNIST.
])

train_data = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transforms.ToTensor())
test_data = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transforms.ToTensor())

train_dataset_pil = MNISTDataset(train_data, transform=transform)
test_dataset_pil = MNISTDataset(test_data, transform=transform)

train_loader_pil = DataLoader(dataset=train_dataset_pil, batch_size=64, shuffle=True)
test_loader_pil = DataLoader(dataset=test_dataset_pil, batch_size=64, shuffle=False)
```

- **Transformations**:
  - `ToTensor()`: Converts the PIL image to a PyTorch tensor.
  - `Normalize((0.1307,), (0.3081,))`: Standard normalization for the MNIST dataset (mean and std are specific to MNIST).
  
- **Datasets**:
  - `datasets.MNIST`: Automatically downloads and loads the MNIST dataset if it's not already available.
  
- **DataLoader**: The `DataLoader` is used to load the data in batches, with `shuffle=True` for the training data to randomize the order of images for better generalization.
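
To make the effect of the two transforms concrete, the sketch below reproduces their arithmetic in plain Python for a single pixel. It assumes the raw pixel is an 8-bit value in the range 0 to 255; the helper name `normalize_pixel` is illustrative and is not part of `torchvision`.

```python
MNIST_MEAN = 0.1307  # mean used in the Normalize call above
MNIST_STD = 0.3081   # standard deviation used in the Normalize call above

def normalize_pixel(raw_value):
    """Mimic ToTensor followed by Normalize for one 8-bit pixel."""
    scaled = raw_value / 255.0                # ToTensor: map 0..255 to 0.0..1.0
    return (scaled - MNIST_MEAN) / MNIST_STD  # Normalize: subtract mean, divide by std

background = normalize_pixel(0)  # black background pixel
ink = normalize_pixel(255)       # fully white stroke pixel
print(f"background -> {background:.4f}, ink -> {ink:.4f}")
```

After normalization the background sits slightly below zero and the brightest strokes sit near 2.8, so the inputs are roughly centered rather than confined to the 0 to 1 range.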

---



In [None]:
###############################################
# Version 1: Using PIL                        #
###############################################

# Custom Dataset Class for MNIST using PIL and torchvision
# This class wraps the torchvision MNIST dataset but loads images using PIL to allow for manual control over image processing.
class MNISTDataset(Dataset):
    def __init__(self, mnist_data, transform=None):
        self.mnist_data = mnist_data  # Store the dataset passed in (torchvision MNIST dataset).
        self.transform = transform    # Transformation (like converting to tensors and normalizing).

    def __len__(self):
        return len(self.mnist_data)  # Return the number of items in the dataset.

    def __getitem__(self, index):
        # Get the image and label at the specified index from the original dataset.
        img, label = self.mnist_data[index]

        # Convert the image from a tensor back to a PIL image for further processing.
        img = transforms.ToPILImage()(img)

        # Apply any transformations (like converting back to a tensor and normalizing).
        if self.transform:
            img = self.transform(img)

        # Return the processed image and its corresponding label.
        return img, label

# Set up transforms (convert to tensor and normalize)
# We need to convert the images to tensors and normalize them (mean and std values are specific to MNIST).
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert PIL image to PyTorch tensor.
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize the data with MNIST-specific mean and std.
])

# Download the MNIST dataset using torchvision.
# The dataset will be downloaded if not already present in './mnist_data'.
train_data = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transforms.ToTensor())
test_data = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transforms.ToTensor())

# Wrap the torchvision MNIST dataset with our custom dataset class, which uses PIL for image handling.
train_dataset_pil = MNISTDataset(train_data, transform=transform)
test_dataset_pil = MNISTDataset(test_data, transform=transform)

# DataLoader for batching. Batching helps in loading a set of images at once during training.
# Shuffle=True ensures that the training data is shuffled each epoch for better generalization.
train_loader_pil = DataLoader(dataset=train_dataset_pil, batch_size=64, shuffle=True)
test_loader_pil = DataLoader(dataset=test_dataset_pil, batch_size=64, shuffle=False)



### **3. Version 2: Without PIL for Image Handling**

```python
train_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transform)
test_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transform)

train_loader_no_pil = DataLoader(dataset=train_dataset_no_pil, batch_size=64, shuffle=True)
test_loader_no_pil = DataLoader(dataset=test_dataset_no_pil, batch_size=64, shuffle=False)
```

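- **Difference**: In this version, `torchvision.datasets.MNIST` is used as is, with the `ToTensor` and `Normalize` transforms passed to it directly. There is no wrapper class and no additional tensor-to-`PIL`-to-tensor round trip per image. Note that `torchvision` itself still hands each image to the transform pipeline as a `PIL` image, so "without PIL" here means "without the extra manual conversion".

- **Advantages**: This is more efficient when working with standard datasets like MNIST because each image is converted to a tensor once, instead of being converted, turned back into a `PIL` image, and converted again.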

---



In [None]:
###############################################
# Version 2: Without PIL                      #
###############################################

# In this version, we use the dataset directly as provided by torchvision, without wrapping it in a custom dataset class.

# Transformations (convert to tensor and normalize)
transform = transforms.Compose([
    transforms.ToTensor(),  # Directly convert the images to PyTorch tensors.
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize the data (mean and std specific to MNIST).
])

# Download and load the MNIST dataset directly.
# The dataset will be downloaded and directly loaded without any manual PIL processing.
train_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transform)
test_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transform)

# DataLoader for batching. Same as the PIL version.
train_loader_no_pil = DataLoader(dataset=train_dataset_no_pil, batch_size=64, shuffle=True)
test_loader_no_pil = DataLoader(dataset=test_dataset_no_pil, batch_size=64, shuffle=False)



### **4. Neural Network Architecture**

```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)  # Fully connected layer 1: from input size (28*28) to 128 neurons.
        self.fc2 = nn.Linear(128, 64)     # Fully connected layer 2: from 128 neurons to 64.
        self.fc3 = nn.Linear(64, 10)      # Output layer: 10 neurons for 10 digit classes.

    def forward(self, x):
        x = x.view(-1, 28*28)  # Flatten the 28x28 image into a vector of size 784.
        x = F.relu(self.fc1(x))  # Apply ReLU activation to the first layer.
        x = F.relu(self.fc2(x))  # Apply ReLU activation to the second layer.
        x = self.fc3(x)          # No activation here (cross-entropy will handle softmax).
        return x
```

- **Network Overview**:
  - Input size is `28x28` (since MNIST images are 28x28 pixels).
  - Two hidden layers with ReLU activation.
  - The final output layer has 10 neurons (one for each digit class).

- **Purpose**: The network takes in a flattened image, processes it through two fully connected layers with ReLU activation, and then outputs a vector of 10 scores (one for each digit).
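
As a rough sense of scale, the number of trainable parameters follows directly from the layer sizes given above. The sketch below does the arithmetic in plain Python; the helper `linear_params` is an illustrative name, not a PyTorch function.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

layer_shapes = [(28 * 28, 128), (128, 64), (64, 10)]  # fc1, fc2, fc3 as defined above
per_layer = [linear_params(i, o) for i, o in layer_shapes]
total_params = sum(per_layer)

print(per_layer)     # [100480, 8256, 650]
print(total_params)  # 109386
```

Almost all of the parameters sit in the first layer, which connects every one of the 784 input pixels to each of the 128 hidden units.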

---



In [None]:
###############################################
# Shared Neural Network Code                  #
###############################################

# Define a simple fully connected neural network for classification.
# The model has three layers: two hidden layers with ReLU activation and one output layer for classification.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Input size is 28*28 (since MNIST images are 28x28 pixels).
        self.fc1 = nn.Linear(28*28, 128)  # First fully connected layer, 128 neurons.
        self.fc2 = nn.Linear(128, 64)     # Second fully connected layer, 64 neurons.
        self.fc3 = nn.Linear(64, 10)      # Output layer, 10 neurons (for 10 digit classes).

    def forward(self, x):
        # Flatten the input tensor (28x28 pixels) into a vector of size 784.
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))  # Apply ReLU activation to the first layer.
        x = F.relu(self.fc2(x))  # Apply ReLU activation to the second layer.
        x = self.fc3(x)          # Output layer (no activation, as we'll use CrossEntropyLoss).
        return x

# Create separate models for the two versions (PIL and No PIL).
model_pil = Net()      # Model for the PIL version.
model_no_pil = Net()   # Model for the non-PIL version.


### **5. Loss Function and Optimizer**

```python
criterion = nn.CrossEntropyLoss()  # Cross-entropy loss for classification tasks.
optimizer_pil = optim.SGD(model_pil.parameters(), lr=0.01)  # Optimizer for the PIL version.
optimizer_no_pil = optim.SGD(model_no_pil.parameters(), lr=0.01)  # Optimizer for the non-PIL version.
```

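- **`CrossEntropyLoss`**: This loss function is used for classification tasks. It combines `LogSoftmax` and `NLLLoss` (negative log-likelihood loss) in one function, which is why the network outputs raw scores (logits) and applies no softmax of its own.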
- **`SGD` Optimizer**: Stochastic Gradient Descent is used for optimization. The learning rate is set to 0.01 for both models.
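
The "log-softmax followed by negative log-likelihood" description can be checked by hand. The following is a minimal plain-Python sketch for a single sample; the logits are made-up values and the function names are illustrative, not PyTorch APIs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -log_softmax(logits)[target]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for a 3-class problem
loss_correct = cross_entropy(logits, target=0)  # target is the top-scoring class
loss_wrong = cross_entropy(logits, target=2)    # target is the lowest-scoring class
print(f"{loss_correct:.4f} {loss_wrong:.4f}")
```

The loss is small when the target class already has the highest score and grows as the target's score falls behind the others, which is the signal the optimizer uses to adjust the weights.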

---



In [None]:


# Use CrossEntropyLoss for classification tasks and SGD for optimization.
criterion = nn.CrossEntropyLoss()
optimizer_pil = optim.SGD(model_pil.parameters(), lr=0.01)  # Optimizer for the PIL version.
optimizer_no_pil = optim.SGD(model_no_pil.parameters(), lr=0.01)  # Optimizer for the non-PIL version.


### **6. Training and Testing Loop for PIL Version**

#### Training Loop

```python
start_time_pil = time.time()

for epoch in range(5):
    model_pil.train()  # Set the model to training mode.
    running_loss = 0.0
    
    for batch_idx, (data, target) in enumerate(train_loader_pil):
        optimizer_pil.zero_grad()  # Clear previous gradients.
        output = model_pil(data)  # Forward pass through the network.
        loss = criterion(output, target)  # Compute the loss.
        loss.backward()  # Backward pass to compute gradients.
        optimizer_pil.step()  # Update model weights.
        running_loss += loss.item()  # Accumulate the loss.
    
    print(f'PIL Version - Epoch {epoch+1}, Training Loss: {running_loss/len(train_loader_pil):.4f}')
```

- **Training Process**:
  - **Forward pass**: The data is passed through the network to make predictions.
  - **Loss calculation**: The difference between the predicted output and true labels is computed using cross-entropy loss.
  - **Backward pass**: The gradients of the loss with respect to the model parameters are computed.
  - **Optimization**: The optimizer updates the model parameters based on the gradients.
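
The four steps above can be tried in isolation without downloading MNIST. The sketch below is an illustration under stated assumptions: it uses one fixed batch of random tensors in place of real images, and a small `nn.Sequential` stand-in model rather than the notebook's `Net` class.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # make the sketch reproducible

# Stand-in data: one fixed batch of 32 random "images" with random labels (not MNIST).
data = torch.randn(32, 1, 28, 28)
target = torch.randint(0, 10, (32,))

# A small stand-in model; the notebook's Net class could be used instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(20):
    optimizer.zero_grad()             # clear gradients from the previous step
    output = model(data)              # forward pass
    loss = criterion(output, target)  # loss calculation
    loss.backward()                   # backward pass
    optimizer.step()                  # optimization
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Because the same batch is reused at every step, the loss should fall steadily; on real data each step sees a different batch, so the per-batch loss is noisier.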

#### Testing Loop

```python
model_pil.eval()  # Set the model to evaluation mode (changes the behavior of layers such as dropout and batch norm).
correct_pil = 0
total_pil = 0

with torch.no_grad():  # No need to compute gradients during evaluation.
    for data, target in test_loader_pil:
        outputs = model_pil(data)  # Forward pass.
        _, predicted = torch.max(outputs.data, 1)  # Get the predicted class.
        total_pil += target.size(0)  # Increment the total number of samples.
        correct_pil += (predicted == target).sum().item()  # Count correct predictions.

accuracy_pil = 100 * correct_pil / total_pil  # Compute accuracy.
print(f'PIL Version - Test Accuracy: {accuracy_pil:.2f}%')
```

- **Evaluation**: The model is evaluated on the test set by making predictions, comparing them to the true labels, and calculating accuracy.
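
The accuracy calculation itself is simple enough to trace by hand. The sketch below repeats it in plain Python on four made-up score vectors; `argmax` is an illustrative helper standing in for the `torch.max(..., 1)` call in the loop above.

```python
def argmax(scores):
    """Index of the largest score, the role torch.max(outputs, 1) plays in the loop above."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical model outputs for four samples over three classes.
outputs = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.0, 2.5, 0.1],
]
targets = [1, 0, 2, 0]

predicted = [argmax(row) for row in outputs]
correct = sum(p == t for p, t in zip(predicted, targets))
accuracy = 100 * correct / len(targets)
print(predicted, f"{accuracy:.2f}%")  # [1, 0, 2, 1] 75.00%
```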

---


In [None]:

###############################################
# Training and Testing for PIL Version        #
###############################################

# Record start time for the PIL version to measure training time.
start_time_pil = time.time()

# Training loop for PIL version
for epoch in range(5):  # We train the model for 5 epochs.
    model_pil.train()  # Set the model to training mode.
    running_loss = 0.0  # Variable to track loss over the epoch.

    # Loop over batches of data in the training set.
    for batch_idx, (data, target) in enumerate(train_loader_pil):
        optimizer_pil.zero_grad()  # Zero the gradients (required before every backward pass).
        output = model_pil(data)   # Forward pass: get predictions from the model.
        loss = criterion(output, target)  # Calculate the loss (how far predictions are from true labels).
        loss.backward()  # Backward pass: compute gradients.
        optimizer_pil.step()  # Update model weights based on gradients.
        running_loss += loss.item()  # Accumulate the loss.

    # Print the average loss for the epoch.
    print(f'PIL Version - Epoch {epoch+1}, Training Loss: {running_loss/len(train_loader_pil):.4f}')

# Testing the model for PIL version
model_pil.eval()  # Set the model to evaluation mode (changes the behavior of layers such as dropout and batch norm).
correct_pil = 0   # To count how many predictions were correct.
total_pil = 0     # To count the total number of examples.

# Loop through the test dataset.
with torch.no_grad():  # No need to compute gradients during evaluation.
    for data, target in test_loader_pil:
        outputs = model_pil(data)  # Forward pass: get predictions.
        _, predicted = torch.max(outputs.data, 1)  # Get the index of the highest score as the prediction.
        total_pil += target.size(0)  # Increment the total number of examples.
        correct_pil += (predicted == target).sum().item()  # Count correct predictions.

# Record end time for the PIL version (taken after the test loop).
end_time_pil = time.time()
training_time_pil = end_time_pil - start_time_pil  # Elapsed time for training plus evaluation.
accuracy_pil = 100 * correct_pil / total_pil  # Calculate accuracy as a percentage.

print(f'PIL Version - Test Accuracy: {accuracy_pil:.2f}%')
print(f'PIL Version - Training Time: {training_time_pil:.2f} seconds')


PIL Version - Epoch 1, Training Loss: 0.8008
PIL Version - Epoch 2, Training Loss: 0.3135
PIL Version - Epoch 3, Training Loss: 0.2565
PIL Version - Epoch 4, Training Loss: 0.2187
PIL Version - Epoch 5, Training Loss: 0.1904
PIL Version - Test Accuracy: 94.78%
PIL Version - Training Time: 117.36 seconds


In [None]:

###############################################
# Training and Testing for No PIL Version     #
###############################################

# Record start time for the non-PIL version.
start_time_no_pil = time.time()

# Training loop for no PIL version (same as PIL version, but using the non-PIL data loader).
for epoch in range(5):
    model_no_pil.train()
    running_loss = 0.0

    # Loop over batches of data in the training set.
    for batch_idx, (data, target) in enumerate(train_loader_no_pil):
        optimizer_no_pil.zero_grad()  # Zero the gradients.
        output = model_no_pil(data)   # Forward pass.
        loss = criterion(output, target)  # Calculate the loss.
        loss.backward()  # Backward pass: compute gradients.
        optimizer_no_pil.step()  # Update model weights.
        running_loss += loss.item()  # Accumulate the loss.

    # Print the average loss for the epoch.
    print(f'No PIL Version - Epoch {epoch+1}, Training Loss: {running_loss/len(train_loader_no_pil):.4f}')

# Testing the model for no PIL version
model_no_pil.eval()  # Set the model to evaluation mode.
correct_no_pil = 0   # To count how many predictions were correct.
total_no_pil = 0     # To count the total number of examples.

# Loop through the test dataset.
with torch.no_grad():  # No need to compute gradients during evaluation.
    for data, target in test_loader_no_pil:
        outputs = model_no_pil(data)  # Forward pass.
        _, predicted = torch.max(outputs.data, 1)  # Get the index of the highest score as the prediction.
        total_no_pil += target.size(0)  # Increment the total number of examples.
        correct_no_pil += (predicted == target).sum().item()  # Count correct predictions.

# Record end time for the non-PIL version (taken after the test loop).
end_time_no_pil = time.time()
training_time_no_pil = end_time_no_pil - start_time_no_pil  # Elapsed time for training plus evaluation.
accuracy_no_pil = 100 * correct_no_pil / total_no_pil  # Calculate accuracy as a percentage.

print(f'No PIL Version - Test Accuracy: {accuracy_no_pil:.2f}%')
print(f'No PIL Version - Training Time: {training_time_no_pil:.2f} seconds')




No PIL Version - Epoch 1, Training Loss: 0.8548
No PIL Version - Epoch 2, Training Loss: 0.3192
No PIL Version - Epoch 3, Training Loss: 0.2609
No PIL Version - Epoch 4, Training Loss: 0.2233
No PIL Version - Epoch 5, Training Loss: 0.1952
No PIL Version - Test Accuracy: 94.50%
No PIL Version - Training Time: 68.90 seconds


### **7. Timing and Comparison**

```python
# Timing and accuracy are tracked for both versions.
end_time_pil = time.time()
training_time_pil = end_time_pil - start_time_pil

print(f'PIL Version - Training Time: {training_time_pil:.2f} seconds')

# Repeat the same for the non-PIL version.
```

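- **Timing**: The `time.time()` function is called before the training loop and again after the test loop, so the reported "training time" covers both training and evaluation. Because both versions are measured the same way, the two numbers can still be compared directly.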

- **Results Comparison**:
  - Training times and accuracies for both the `PIL` and non-`PIL` versions are printed side by side to compare the performance of each approach.
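
In the notebook cells the end timestamp is taken after the test loop, so a single number covers both phases. If separate figures are wanted, one option is a small timing helper like the sketch below. The `timed` context manager and the `timings` dictionary are hypothetical names introduced here, and the `time.sleep` calls are placeholders for the real loops.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> elapsed seconds

@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start

with timed("train"):
    time.sleep(0.05)  # placeholder for the training loop

with timed("test"):
    time.sleep(0.01)  # placeholder for the evaluation loop

print({phase: round(seconds, 3) for phase, seconds in timings.items()})
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.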

---



In [None]:
###############################################
# Results Comparison                          #
###############################################

# Print a side-by-side comparison of the training times and test accuracies.
print("\n================= Comparison =================")
print(f"Training Time (PIL): {training_time_pil:.2f} seconds")
print(f"Training Time (No PIL): {training_time_no_pil:.2f} seconds")
print(f"Test Accuracy (PIL): {accuracy_pil:.2f}%")
print(f"Test Accuracy (No PIL): {accuracy_no_pil:.2f}%")


Training Time (PIL): 117.36 seconds
Training Time (No PIL): 68.90 seconds
Test Accuracy (PIL): 94.78%
Test Accuracy (No PIL): 94.50%
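
To put the two timings in proportion, the short sketch below computes the speedup and the accuracy gap from the numbers printed above. These come from a single run, so they are indicative rather than a careful benchmark.

```python
training_time_pil = 117.36    # seconds, from the output above
training_time_no_pil = 68.90  # seconds, from the output above
accuracy_pil = 94.78          # percent, from the output above
accuracy_no_pil = 94.50       # percent, from the output above

speedup = training_time_pil / training_time_no_pil
time_saved = training_time_pil - training_time_no_pil
accuracy_gap = accuracy_pil - accuracy_no_pil

print(f"Speedup without the extra PIL step: {speedup:.2f}x")
print(f"Time saved: {time_saved:.2f} s")
print(f"Accuracy gap (PIL - No PIL): {accuracy_gap:.2f} points")
```

The roughly 1.7x difference in elapsed time is large, while the 0.28-point accuracy gap is small and may simply reflect that two separately initialized models were trained.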


## **Using PIL and not using PIL**

| **Aspect**                  | **Using PIL**                                      | **Without PIL**                                     |
|-----------------------------|----------------------------------------------------|----------------------------------------------------|
| **Image Handling**           | Each sample arrives from the base dataset as a tensor, is converted back to a **PIL** image, and is then converted to a tensor again. The explicit PIL step is where custom image processing with `PIL` (Python Imaging Library) could be inserted. | Uses `torchvision.datasets.MNIST` with its transforms directly, so each image is converted to a tensor once. This avoids the overhead of the extra format conversions. |
| **Custom Dataset Class**     | Requires a custom `MNISTDataset` class to convert images to `PIL` and apply transformations manually. | Does not require a custom dataset class. `torchvision.datasets.MNIST` is passed straight to the `DataLoader`. |
| **Transformations**          | The images are first converted back to **PIL** images and then transformed back to tensors using `transforms.ToTensor()`. This allows for more flexibility with custom image handling if needed. | Transformations (`ToTensor()` and normalization) are passed to the dataset and applied once per image by `torchvision`. No manual PIL handling is needed in user code. |
| **Efficiency**               | **Less efficient**: Converting images from tensors to **PIL** and then back to tensors introduces per-image overhead, making this approach slower, especially for large datasets. | **More efficient**: Avoiding the redundant conversions makes data loading faster, and the saving grows with dataset size. |
| **Flexibility**              | **More flexible**: The explicit `PIL` step allows arbitrary image manipulation (for example custom filters) that may not have a ready-made equivalent in `torchvision.transforms`. | **Less flexible**: `torchvision.transforms` covers common needs such as resizing, cropping, and standard augmentations, but unusual custom operations take extra work. |
| **Code Complexity**          | **Higher complexity**: Requires a custom dataset class to manage PIL conversions and manual handling of transformations. This adds extra code. | **Lower complexity**: Using `torchvision.datasets.MNIST` directly with transformations keeps the code short and easier to implement and maintain. |
| **Use Case Suitability**     | Suitable if you need **custom image preprocessing** or manipulation (e.g., filtering or non-standard augmentation) before converting to tensors. Common in projects requiring preprocessing beyond normalization or conversion. | Suitable for most standard datasets where the focus is on efficient loading and training. Common in projects where you need fast, **out-of-the-box dataset handling**, especially for widely used datasets like MNIST. |
| **Training Time**            | Takes longer due to the additional conversion steps between tensor and PIL images (117.36 seconds in the run above). The extra work is especially noticeable with large datasets or many epochs. | Faster since the redundant PIL round trip is avoided (68.90 seconds in the run above). |
| **Code Maintenance**         | More difficult to maintain, especially if adding or modifying the transformations requires working through a custom dataset class. | Easier to maintain since you rely on PyTorch's well-documented and widely used data handling functionality. |
| **Memory Overhead**          | Slightly higher transient memory use per sample, since each image briefly exists in more than one format during conversion. | Lower overhead, since no extra intermediate copies are created by an added round trip. |
| **Transform Customizability** | Provides full control over how images are loaded, processed, and transformed. You can create custom pipelines involving PIL methods before converting to a tensor. | Less customizable but still allows common transformations like normalization, resizing, and data augmentation with `torchvision.transforms`. Custom transformations can still be added, within the transform pipeline that `torchvision` provides. |

---

### **Key Points:**

1. **Performance**:
   - The **without PIL** approach is faster because it skips the redundant conversions between image formats. In the run above it took 68.90 seconds versus 117.36 seconds for the PIL version, with nearly identical test accuracy (94.50% versus 94.78%). The saving matters more as datasets grow.

2. **Flexibility**:
   - The **using PIL** approach offers more flexibility for custom image manipulation. For instance, if you need to perform advanced image preprocessing, like applying filters, specific augmentations, or detailed custom transformations, the PIL approach gives you more control.
   - The **without PIL** approach still supports common transformations such as resizing, normalization, and flipping, but it is limited to what `torchvision.transforms` provides.

3. **Complexity**:
   - **Using PIL** adds complexity because it requires creating a custom dataset class and manually handling image conversions. This additional code increases the risk of bugs and makes the code more difficult to maintain.
   - **Without PIL** is simpler and easier to manage since you're using PyTorch’s built-in functions for handling datasets and transformations.

4. **Use Cases**:
   - **Using PIL** is more appropriate when working with custom datasets where you might need non-standard image preprocessing steps.
   - **Without PIL** is ideal for standard tasks like MNIST classification, where the dataset is already structured and doesn't require complex image manipulations. This approach is faster and easier to implement.

---

### **Which Approach Should You Use?**

- **Use `PIL`** when:
  - You need **advanced image preprocessing**.
  - You're working with **custom datasets** that require custom image handling.
  - You want **fine-grained control** over how images are loaded and processed.

- **Skip `PIL` (Use tensors directly)** when:
  - You're working with **standard datasets** like MNIST, CIFAR, etc.
  - You prioritize **efficiency** and **simplicity**.
  - You want to reduce **code complexity** and **training time**.

In conclusion, for most typical scenarios like MNIST classification, **not using PIL** is the better choice due to its efficiency, simplicity, and ease of use. However, **using PIL** offers more control when you need custom processing for complex datasets.

## **For more information**

1. **PyTorch Documentation**:
   - [PyTorch Datasets and DataLoader](https://pytorch.org/docs/stable/data.html)
     - The DataLoader and Dataset classes are foundational to PyTorch’s data handling mechanisms. They support efficient batching, shuffling, and loading of data.
   - [torchvision.datasets.MNIST](https://pytorch.org/vision/stable/datasets.html#mnist)
     - torchvision's dataset handling, specifically for MNIST, which is a widely used dataset for digit classification tasks.
   - [torchvision.transforms](https://pytorch.org/vision/stable/transforms.html)
     - PyTorch provides standard image transformations such as converting to tensors and normalizing image data. This is the foundation of the **without PIL** approach.

2. **Python Imaging Library (PIL/Pillow)**:
   - [Pillow Documentation](https://pillow.readthedocs.io/en/stable/)
     - PIL (now maintained as Pillow) is a Python library for image processing. It offers functionality to open, manipulate, and save images in various formats. The **using PIL** approach leverages these capabilities for custom image handling.

3. **PyTorch’s Autograd and Optimization**:
   - [PyTorch Autograd Documentation](https://pytorch.org/docs/stable/autograd.html)
     - The process of automatic differentiation and optimization using backward propagation is explained in the PyTorch Autograd docs. This is crucial to understanding how gradients are computed and applied to update model parameters during training.


## **Code with a CNN**

**Explanation of the code**

1. Imports and Setup:
   The code imports necessary PyTorch libraries and other utilities.

2. CNN Architecture (class Net):
   - This class defines the CNN structure.
   - conv1: First convolutional layer (1 input channel, 32 output channels, 3x3 kernel)
   - conv2: Second convolutional layer (32 input channels, 64 output channels, 3x3 kernel)
   - dropout1 and dropout2: Dropout layers to prevent overfitting
   - fc1 and fc2: Fully connected layers for final classification

   The forward method defines how data flows through the network:
   - Apply convolutions with ReLU activation
   - Max pooling to reduce spatial dimensions
   - Dropout for regularization
   - Flatten the output and pass through fully connected layers
   - Apply log softmax for classification output

3. Data Loading (PIL version):
   - Custom MNISTDataset class for handling PIL image conversions
   - Data transformations (ToTensor and Normalize)
   - Loading MNIST dataset and wrapping with custom dataset class
   - Creating DataLoader for batching

4. Data Loading (Non-PIL version):
   - Directly uses torchvision's MNIST dataset
   - Same transformations as PIL version
   - Creating DataLoader for batching

5. Training Function:
   - Iterates over batches in the training data
   - Moves data to the appropriate device (CPU/GPU)
   - Performs forward pass, calculates loss, backpropagation, and optimization
   - Prints training progress

6. Testing Function:
   - Evaluates the model on the test set
   - Calculates average loss and accuracy
   - Returns the accuracy for comparison

7. Training and Testing Loop:
   - Sets up models, optimizers, and device (CPU/GPU)
   - Trains and tests both PIL and non-PIL versions
   - Measures training time for each version

8. Results Comparison:
   - Prints training times and test accuracies for both versions

Key Differences from Previous Code:
- Uses a CNN architecture instead of a simple fully connected network
- Implements separate train and test functions
- Uses Adam optimizer instead of SGD
- Supports GPU training if available
- Uses NLL Loss with log softmax output

This CNN should perform better on image classification tasks like MNIST compared to the fully connected network, as it can capture spatial hierarchies in the image data. The comparison between PIL and non-PIL versions allows for analysis of any performance differences in data loading and preprocessing approaches.
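
To see how the layers described above change the size of the data, the sketch below walks a 28x28 image through the two convolutions and one pooling step using the standard output-size formula. Several details are assumptions, since they are not stated in the list above: `conv2` is taken to use stride 1, neither convolution uses padding, and a single 2x2 max pool follows `conv2`. The helper name `conv_out` is illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                  # MNIST images are 28x28
size = conv_out(size, kernel=3)            # after conv1 (3x3, stride 1) -> 26
size = conv_out(size, kernel=3)            # after conv2 (3x3, stride assumed 1) -> 24
size = conv_out(size, kernel=2, stride=2)  # after an assumed 2x2 max pool -> 12

channels = 64                              # output channels of conv2
flattened = channels * size * size         # features fed to the first fully connected layer
print(size, flattened)                     # 12 9216
```

Under these assumptions the first fully connected layer would need 9216 input features; a different pooling size or padding choice changes that number.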

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms
from PIL import Image
import time

**How the filters (also called kernels) work in this CNN and how their dimensions are determined:**

**1. First Convolutional Layer (self.conv1):**
   ```python
   self.conv1 = nn.Conv2d(1, 32, 3, 1)
   ```
   This creates 32 filters, each of size 3x3, that operate on a single input channel.

   - Input: 1 channel (grayscale MNIST images)
   - Output: 32 channels
   - Filter size: 3x3
   - Stride: 1

   How it works:
   - Each 3x3 filter slides over the 28x28 input image.
   - At each position, it performs element-wise multiplication and summation.
   - This creates 32 feature maps, each 26x26 rather than 28x28, because a 3x3 filter with no padding cannot be centered on the border pixels.

Let's break down this statement about the first convolutional layer (self.conv1) in detail:

"This creates 32 filters, each of size 3x3, that operate on a single input channel."

1. Number of Filters: 32
   - The layer creates 32 distinct filters.
   - Each filter will produce one feature map in the output.
   - This means the output of this layer will have 32 channels.

2. Filter Size: 3x3
   - Each filter is a small 2D matrix of size 3 rows by 3 columns.
   - This size defines the receptive field of the filter - how much of the input it "sees" at once.

3. Single Input Channel
   - This refers to the input being a grayscale image (like MNIST digits).
   - Grayscale images have only one channel, representing intensity.
   - If it were an RGB image, we'd have 3 input channels.

4. How it Works:
   - Each 3x3 filter slides over the entire input image.
   - At each position, it performs a dot product operation:
     * Multiply each filter value with the corresponding image pixel.
     * Sum up all these multiplications.
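     * Add the filter's bias term.
   - This process creates one value in the output feature map.
   - In this network, the ReLU activation is not part of `nn.Conv2d`; it is applied afterwards to the whole feature map via `F.relu` in `forward`.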

5. Output Dimension:
   - If the input is 28x28 (like MNIST), the output will be 26x26x32.
   - The spatial dimensions shrink by 2 in each direction due to the 3x3 filter.
   - We get 32 of these 26x26 feature maps, one from each filter.

6. Purpose:
   - Each filter learns to detect a specific pattern or feature (e.g., edges, textures).
   - Having 32 filters allows the network to detect 32 different types of features.

7. Parameters:
   - Each filter has 3 * 3 = 9 weights, plus 1 bias term.
   - Total parameters for this layer: (3 * 3 + 1) * 32 = 320

8. Intuition:
   - Think of each filter as a "feature detector" sliding over the image.
   - It's looking for a specific pattern, and it lights up (activates) when it finds that pattern.

This layer is the first step in transforming the raw pixel data into more abstract features, which subsequent layers will further process and combine to eventually classify the image.
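
The numbers above can be checked with a short sketch. The input is a random tensor standing in for one MNIST image (an assumption), and the last lines recompute one output value by hand as a multiply-and-sum over a 3x3 patch plus the bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(1, 32, 3, 1)
x = torch.randn(1, 1, 28, 28)    # (batch, channels, height, width)
out = conv1(x)

print(out.shape)                  # torch.Size([1, 32, 26, 26])
n_params = sum(p.numel() for p in conv1.parameters())
print(n_params)                   # 320

# Recompute the top-left output value of filter 0 by hand
patch = x[0, 0, 0:3, 0:3]                               # 3x3 input patch
manual = (patch * conv1.weight[0, 0]).sum() + conv1.bias[0]
print(manual.item(), out[0, 0, 0, 0].item())
```

Note that PyTorch stores the result channels-first, so the "26x26x32" output described above appears as shape `(1, 32, 26, 26)`.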

**2. Second Convolutional Layer (self.conv2):**
   ```python
   self.conv2 = nn.Conv2d(32, 64, 3, 1)
   ```
   This creates 64 filters, each of size 3x3, that operate on 32 input channels.

   - Input: 32 channels (from previous layer)
   - Output: 64 channels
   - Filter size: 3x3
   - Stride: 1

   How it works:
   - Each filter is actually 3x3x32 (matching the input depth).
   - It slides over the 32 input feature maps, combining information across all channels.
   - The output is 64 feature maps, further reduced in size (24x24).

When we say "Each filter is actually 3x3x32", we mean that every filter in the second convolutional layer (self.conv2) spans all 32 input channels. This structure is key to understanding how convolutional layers process multi-channel inputs. Let's break it down:

1. Filter Dimensions:
   - The filter is 3x3 in spatial dimensions (height and width).
   - It has a depth of 32, matching the number of channels from the previous layer's output.

2. Why 32 channels?
   - The first convolutional layer (self.conv1) outputs 32 feature maps.
   - These 32 feature maps become the input channels for the second layer.

3. Structure of the Filter:
   - Each filter in conv2 is not just a 2D matrix, but a 3D volume: 3 (height) x 3 (width) x 32 (depth).
   - You can think of it as 32 separate 3x3 matrices stacked together.

4. How it Works:
   - When this 3x3x32 filter slides over the input, it processes all 32 input channels simultaneously.
   - For each position, it performs 3x3x32 = 288 multiplications (plus a bias term).
   - The results are summed up to produce a single value in the output feature map.

5. Conceptual View:
   - If you were to "unroll" this 3D filter, you'd have a single row of 3 * 3 * 32 = 288 weights.
   - This allows the filter to capture patterns that span across all input channels.

6. Number of Parameters:
   - Each filter in conv2 has 3 * 3 * 32 + 1 = 289 parameters (including the bias).
   - With 64 such filters, we get (3 * 3 * 32 + 1) * 64 = 18,496 parameters in total for this layer.

7. Importance:
   - This 3D structure allows the network to learn features that combine information from all input channels.
   - It's how CNNs can detect complex patterns that aren't visible in any single channel alone.

In essence, the 3x3x32 structure of each filter in the second layer enables the CNN to process and combine information from all 32 feature maps produced by the first layer. This allows for the detection of increasingly complex and abstract features as we go deeper into the network.
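
A quick way to see this structure is to inspect the layer's weight tensor. The sketch below only reads shapes and counts from a freshly constructed `nn.Conv2d(32, 64, 3, 1)`.

```python
import torch.nn as nn

conv2 = nn.Conv2d(32, 64, 3, 1)

# Weight layout is (out_channels, in_channels, height, width)
weight_shape = tuple(conv2.weight.shape)
bias_shape = tuple(conv2.bias.shape)

per_filter = conv2.weight[0].numel() + 1            # 288 weights + 1 bias
total = sum(p.numel() for p in conv2.parameters())

print(weight_shape, bias_shape)   # (64, 32, 3, 3) (64,)
print(per_filter, total)          # 289 18496
```

PyTorch keeps the channel dimension first, so each "3x3x32" filter is stored as a `32 x 3 x 3` slice of the weight tensor.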

**Transition from the output of conv1 to the input of conv2**

1. Output of conv1:
   - We have 32 feature maps, each of size 26x26.
   - This can be thought of as a 3D volume: 26 x 26 x 32.

2. Input to conv2:
   - This entire 26 x 26 x 32 volume becomes the input to conv2.
   - Each of the 32 feature maps is treated as an "input channel" for conv2.

3. How conv2 processes this input:
   - Each filter in conv2 is 3x3x32 in size.
   - The '32' in 3x3x32 corresponds to the 32 input channels (feature maps) from conv1.
   - Each filter in conv2 slides over all 32 input channels simultaneously.

4. Computation in conv2:
   - At each position, the 3x3x32 filter performs element-wise multiplication with a 3x3x32 patch of the input.
   - These 3 * 3 * 32 = 288 multiplications are summed up (along with a bias term).
   - This sum produces a single value in one of conv2's output feature maps.

5. Dimensionality:
   - Input to conv2: 26 x 26 x 32
   - Each filter in conv2: 3 x 3 x 32
   - Output of conv2: 24 x 24 x 64 (assuming 64 filters in conv2)

6. Preserving spatial relationships:
   - The spatial relationship between features in the 32 input channels is preserved.
   - This allows conv2 to detect higher-level features that combine patterns from multiple conv1 feature maps.

7. Analogy:
   - If conv1's feature maps detected simple edges and textures, conv2 can combine these to detect more complex shapes or patterns.
   - It's like going from detecting lines (conv1) to detecting combinations of lines that form specific shapes (conv2).

8. Importance:
   - This transition allows the network to build a hierarchy of features.
   - Each subsequent layer can detect increasingly complex and abstract patterns.

In essence, the 32 feature maps from conv1 are not "converted" in the traditional sense. Rather, they are used as a multi-channel input for conv2, allowing the network to build upon the features detected in the first layer to create more complex feature representations in the second layer.
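
The computation described in step 4 can be reproduced by hand. In this sketch a random tensor stands in for conv1's output, and the filter index and position (`k`, `i`, `j`) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv2 = nn.Conv2d(32, 64, 3, 1)
feature_maps = torch.randn(1, 32, 26, 26)   # stand-in for conv1's output
out = conv2(feature_maps)
print(out.shape)                            # torch.Size([1, 64, 24, 24])

# One output value: filter k at spatial position (i, j)
k, i, j = 5, 10, 7
patch = feature_maps[0, :, i:i + 3, j:j + 3]        # 32 x 3 x 3 = 288 inputs
manual = (patch * conv2.weight[k]).sum() + conv2.bias[k]
print(manual.item(), out[0, k, i, j].item())
```

The hand-computed value matches the layer's output up to floating-point rounding, which confirms that each conv2 output value mixes all 32 input channels.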

**3. Dimension Calculations:**
   - Input image: 28x28
   - After conv1: 26x26 (28 - 3 + 1 = 26)
   - After conv2: 24x24 (26 - 3 + 1 = 24)
   - After max pooling: 12x12 (24 / 2 = 12)

   The formula for output size (rounded down when the division is not exact) is:
   Output size = floor((Input size - Filter size + 2 * Padding) / Stride) + 1
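
As a small sketch, the formula can be written as a helper function. The name `conv_output_size` is hypothetical; the floor division reflects how PyTorch rounds when the stride does not divide evenly.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution or pooling window along one axis."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


after_conv1 = conv_output_size(28, 3)                      # 26
after_conv2 = conv_output_size(after_conv1, 3)             # 24
after_pool = conv_output_size(after_conv2, 2, stride=2)    # 12
print(after_conv1, after_conv2, after_pool)
```

The same helper covers max pooling, since a 2x2 pool with stride 2 is just a window of size 2 moved in steps of 2.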

**4. Number of Parameters:**
   - conv1: (3 * 3 * 1 + 1) * 32 = 320 parameters
     (filter size * input channels + bias) * number of filters
   - conv2: (3 * 3 * 32 + 1) * 64 = 18,496 parameters
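
The same counting can be done programmatically. The sketch below also includes the two fully connected layers from the architecture (`fc1` and `fc2`), which the list above does not tabulate.

```python
import torch.nn as nn

layers = {
    "conv1": nn.Conv2d(1, 32, 3, 1),
    "conv2": nn.Conv2d(32, 64, 3, 1),
    "fc1": nn.Linear(9216, 128),
    "fc2": nn.Linear(128, 10),
}

# Count weights and biases per layer
counts = {name: sum(p.numel() for p in layer.parameters())
          for name, layer in layers.items()}

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(counts.values()):,}")
```

Most of the parameters sit in `fc1` (9216 * 128 + 128), not in the convolutional layers.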

**5. Flattening for Fully Connected Layer:**
   ```python
   x = torch.flatten(x, 1)
   ```
   This flattens the output of conv2 after max pooling:
   12 * 12 * 64 = 9,216 (which is the input size for fc1)
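
A minimal sketch of the flatten step, using a random tensor with the post-pooling shape (the batch size of 2 is an arbitrary choice):

```python
import torch

pooled = torch.randn(2, 64, 12, 12)    # (batch, channels, height, width)
flat = torch.flatten(pooled, 1)        # keep dim 0 (batch), merge the rest
print(flat.shape)                      # torch.Size([2, 9216])
```

The second argument, `1`, is the first dimension to flatten, so the batch dimension is left intact and each image becomes one row of 9,216 values.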

The filters work by detecting specific patterns or features in the input. In early layers, they might detect simple features like edges or textures. In deeper layers, they can detect more complex, abstract features. The use of multiple filters allows the network to learn a diverse set of features at each layer.

The dimensionality reduction through convolutions and pooling helps in:
1. Reducing the number of parameters in the fully connected layer that follows, which helps limit overfitting.
2. Increasing the receptive field, allowing later layers to "see" more of the original input.
3. Building a hierarchy of features, from simple to complex.

This structure is what gives CNNs their power in image-related tasks, as it mirrors the hierarchical nature of visual information processing.

In [None]:
# CNN Architecture
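class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)    # 1x28x28 -> 32x26x26
        self.conv2 = nn.Conv2d(32, 64, 3, 1)   # 32x26x26 -> 64x24x24
        self.dropout1 = nn.Dropout2d(0.25)     # channel-wise dropout on the 4D feature maps
        # Element-wise dropout for the flat 128-d vector; recent PyTorch
        # versions warn when nn.Dropout2d receives a 2D input
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)        # 9216 = 64 * 12 * 12
        self.fc2 = nn.Linear(128, 10)          # one output per digit class

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)                 # 64x24x24 -> 64x12x12
        x = self.dropout1(x)
        x = torch.flatten(x, 1)                # -> vector of length 9216
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)         # log-probabilities, paired with F.nll_loss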

## PIL and Non-PIL Versions

In [1]:
# PIL version
class MNISTDataset(Dataset):
    def __init__(self, mnist_data, transform=None):
        self.mnist_data = mnist_data
        self.transform = transform

    def __len__(self):
        return len(self.mnist_data)

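    def __getitem__(self, index):
        # The wrapped dataset already returns a tensor (it was built with ToTensor)
        img, label = self.mnist_data[index]
        # Convert back to a PIL image so the transform pipeline starts from PIL
        img = transforms.ToPILImage()(img)
        if self.transform:
            img = self.transform(img)
        return img, label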

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transforms.ToTensor())
test_data = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transforms.ToTensor())

train_dataset_pil = MNISTDataset(train_data, transform=transform)
test_dataset_pil = MNISTDataset(test_data, transform=transform)

train_loader_pil = DataLoader(dataset=train_dataset_pil, batch_size=64, shuffle=True)
test_loader_pil = DataLoader(dataset=test_dataset_pil, batch_size=64, shuffle=False)

# Non-PIL version
train_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=True, download=True, transform=transform)
test_dataset_no_pil = datasets.MNIST(root='./mnist_data', train=False, download=True, transform=transform)

train_loader_no_pil = DataLoader(dataset=train_dataset_no_pil, batch_size=64, shuffle=True)
test_loader_no_pil = DataLoader(dataset=test_dataset_no_pil, batch_size=64, shuffle=False)

# Training function
def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} '
                  f'({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}')

# Testing function
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    accuracy = 100. * correct / len(test_loader.dataset)
    print(f'\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({accuracy:.2f}%)\n')
    return accuracy

# Training and testing for PIL version
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_pil = Net().to(device)
optimizer_pil = optim.Adam(model_pil.parameters())

# Note: the timed region covers 5 training epochs plus the final test pass
start_time_pil = time.time()
for epoch in range(1, 6):
    train(model_pil, device, train_loader_pil, optimizer_pil, epoch)
accuracy_pil = test(model_pil, device, test_loader_pil)
end_time_pil = time.time()
training_time_pil = end_time_pil - start_time_pil

# Training and testing for non-PIL version
model_no_pil = Net().to(device)
optimizer_no_pil = optim.Adam(model_no_pil.parameters())

# Note: the timed region covers 5 training epochs plus the final test pass
start_time_no_pil = time.time()
for epoch in range(1, 6):
    train(model_no_pil, device, train_loader_no_pil, optimizer_no_pil, epoch)
accuracy_no_pil = test(model_no_pil, device, test_loader_no_pil)
end_time_no_pil = time.time()
training_time_no_pil = end_time_no_pil - start_time_no_pil

# Results comparison
print("\n================= Comparison =================")
print(f"Training Time (PIL): {training_time_pil:.2f} seconds")
print(f"Training Time (No PIL): {training_time_no_pil:.2f} seconds")
print(f"Test Accuracy (PIL): {accuracy_pil:.2f}%")
print(f"Test Accuracy (No PIL): {accuracy_no_pil:.2f}%")

Test set: Average loss: 0.0318, Accuracy: 9900/10000 (99.00%)


Test set: Average loss: 0.0279, Accuracy: 9923/10000 (99.23%)


Training Time (PIL): 127.01 seconds
Training Time (No PIL): 74.76 seconds
Test Accuracy (PIL): 99.00%
Test Accuracy (No PIL): 99.23%


### **CNN Workflow**
This workflow allows the network to progressively transform raw pixel data into increasingly abstract and task-relevant features, culminating in a classification decision. Each stage plays a crucial role in the overall learning and inference process of the CNN.

1. Input:
   - Start with an input image, e.g., a 28x28 grayscale MNIST digit.

2. Convolutional Layers:
   - Conv1:
     * Input: 28x28x1
     * Apply 32 filters of size 3x3
     * Output: 26x26x32 feature maps
   - ReLU Activation:
     * Applies element-wise ReLU to introduce non-linearity
   - Conv2:
     * Input: 26x26x32
     * Apply 64 filters of size 3x3
     * Output: 24x24x64 feature maps
   - ReLU Activation:
     * Again, apply element-wise ReLU

3. Pooling Layer:
   - Max Pooling:
     * Input: 24x24x64
     * Use 2x2 max pooling with stride 2
     * Output: 12x12x64
   - Purpose: Reduce spatial dimensions, retain important features
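
To make the pooling step concrete, here is a small sketch on a hand-written 4x4 input (made-up values), followed by a shape check on a random tensor with the same dimensions as conv2's output.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

# Each non-overlapping 2x2 window is replaced by its maximum
pooled = F.max_pool2d(x, 2)
print(pooled.squeeze().tolist())   # [[4.0, 8.0], [12.0, 16.0]]

# Same operation at the network's scale
big = F.max_pool2d(torch.randn(1, 64, 24, 24), 2)
print(big.shape)                   # torch.Size([1, 64, 12, 12])
```

When only the kernel size is given, `F.max_pool2d` uses a stride equal to the kernel size, which is why `F.max_pool2d(x, 2)` halves each spatial dimension.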

4. Flattening:
   - Input: 12x12x64
   - Flatten operation: Reshape to 1D vector
   - Output: 9216 (12 * 12 * 64) dimensional vector
   - Purpose: Prepare convolutional output for fully connected layers

5. Fully Connected (Linear) Layers:
   - FC1:
     * Input: 9216 dimensional vector
     * Apply linear transformation to 128 neurons
     * Output: 128 dimensional vector
   - ReLU Activation:
     * Apply element-wise ReLU
   - Dropout:
     * Randomly zero out neurons with probability 0.5 during training to reduce overfitting; dropout is disabled in evaluation mode
   - FC2 (Output Layer):
     * Input: 128 dimensional vector
     * Apply linear transformation to 10 neurons (for 10 digit classes)
     * Output: 10 dimensional vector
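
The dropout step behaves differently in training and evaluation mode. The sketch below uses `nn.Dropout(0.5)` on a vector of ones (both are choices made for illustration; element-wise dropout is assumed here because the input at this stage is a flat vector).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.5)
x = torch.ones(1, 128)

dropout.train()
train_out = dropout(x)   # each entry is 0 or 1 / (1 - 0.5) = 2
print(train_out.unique().tolist())

dropout.eval()
eval_out = dropout(x)    # identity in evaluation mode
print(torch.equal(eval_out, x))
```

Surviving activations are scaled up during training so that the expected value of each activation stays the same, which is why no rescaling is needed at test time. Calling `model.train()` and `model.eval()` switches this behavior for every dropout layer in the model.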

6. Log-Softmax:
   - Input: 10 dimensional vector
   - Apply the log-softmax function (`F.log_softmax` in the code)
   - Output: 10 log-probabilities; exponentiating them gives probabilities (sum to 1) representing the likelihood of each digit class
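
The network's `forward` returns `F.log_softmax(x, dim=1)`, so its raw outputs are log-probabilities. A minimal sketch on made-up scores shows how to recover probabilities and the predicted class:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 10)                 # hypothetical FC2 output for one image
log_probs = F.log_softmax(scores, dim=1)    # what the network returns
probs = log_probs.exp()                     # ordinary probabilities

predicted = log_probs.argmax(dim=1)         # same class as argmax of probs or scores
print(probs.sum().item(), predicted.item())
```

Because the logarithm is monotonic, taking `argmax` of the log-probabilities picks the same class as taking it over the probabilities, so the test loop never needs to exponentiate.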

Workflow Summary:
1. Convolutions extract spatial features from the input image.
2. Pooling reduces the spatial dimensions while retaining important features.
3. Flattening converts the 3D feature maps into a 1D vector.
4. Fully connected layers combine these features for high-level reasoning.
5. Log-softmax converts the final layer outputs into class log-probabilities, which is the input `F.nll_loss` expects; exponentiating them gives class probabilities.

Key Points:
- Convolutions and pooling work in the image's spatial domain.
- Flattening bridges the gap between spatial (convolutions) and non-spatial (fully connected) processing.
- Fully connected layers perform high-level feature combination and classification.
- Log-softmax ensures the network outputs correspond to a valid probability distribution (after exponentiation) for multi-class classification.
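
As a rough end-to-end check of the workflow, the sketch below pushes one random stand-in image through freshly initialized layers with the same sizes as the network and records the shape after each stage. Dropout is left out because it does not change shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1, conv2 = nn.Conv2d(1, 32, 3, 1), nn.Conv2d(32, 64, 3, 1)
fc1, fc2 = nn.Linear(9216, 128), nn.Linear(128, 10)

# Each stage: (name, function applied to the running tensor)
stages = [
    ("conv1 + relu", lambda t: F.relu(conv1(t))),
    ("conv2 + relu", lambda t: F.relu(conv2(t))),
    ("max_pool", lambda t: F.max_pool2d(t, 2)),
    ("flatten", lambda t: torch.flatten(t, 1)),
    ("fc1 + relu", lambda t: F.relu(fc1(t))),
    ("fc2", fc2),
    ("log_softmax", lambda t: F.log_softmax(t, dim=1)),
]

x = torch.randn(1, 1, 28, 28)   # random stand-in for one MNIST image
shapes = {"input": tuple(x.shape)}
with torch.no_grad():
    for name, fn in stages:
        x = fn(x)
        shapes[name] = tuple(x.shape)

for name, shape in shapes.items():
    print(f"{name:>12}: {shape}")
```

Shapes are printed channels-first, so "24x24x64" in the text corresponds to `(1, 64, 24, 24)` here.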

