<h1><center>Laboratory work 5.</center></h1>
<h2><center>PyTorch Custom Datasets Exercises</center></h2>

**Performed:** Last name and First name

**Variant:** #__

<a class="anchor" id="5"></a>

## Content

1. [Task 1. Preparing data](#5.1)
2. [Task 2. Creating a model](#5.2)
3. [Task 3. Training and testing loops](#5.3)
4. [Task 4. Conducting experiments with hyperparameters](#5.4)
5. [Task 5. Conducting experiments with the model's layers](#5.5)
6. [Task 6. Conducting experiments with the data](#5.6)
7. [Task 7. Making predictions](#5.7)

In [None]:
# Import torch
import torch
from torch import nn

# Exercises require PyTorch > 1.10.0
print(torch.__version__)

# Setup device agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

<a class="anchor" id="5.1"></a>

## <span style="color:blue; font-size:1em;"> Task 1. Preparing data</span>

[Go back to the content](#5)

**For all variants:** Recreate the data loading functions we built in sections 1-4 of [notebook 05](https://github.com/radiukpavlo/conducting-experiments/blob/main/01_notebooks/ce_05_pytorch_custom_datasets.ipynb). By the end of this task, you should have the training and testing `DataLoader`s ready to use.

In [None]:
# 1. Get data


In [None]:
# 2. Become one with the data
import os
def walk_through_dir(dir_path):
  """Walks through dir_path returning file counts of its contents."""
  for dirpath, dirnames, filenames in os.walk(dir_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")

In [None]:
# Setup train and testing paths


In [None]:
# Visualize an image

In [None]:
# Do the image visualization with matplotlib


We've got some images in our folders.

Now we need to make them compatible with PyTorch by:
1. Transforming the data into tensors.
2. Turning the tensor data into a `torch.utils.data.Dataset` and later a `torch.utils.data.DataLoader`.

In [None]:
# 3.1 Transforming data with torchvision.transforms


In [None]:
# Write transform for turning images into tensors


In [None]:
# Write a function to plot transformed images


### Load image data using `ImageFolder`

In [None]:
# Use ImageFolder to create dataset(s)


In [None]:
# Get class names as a list
class_names = train_data.classes
class_names

In [None]:
# Can also get class names as a dict
class_dict = train_data.class_to_idx
class_dict

In [None]:
# Check the lengths of each dataset
len(train_data), len(test_data)

In [None]:
# Turn train and test Datasets into DataLoaders


In [None]:
# How many batches of images are in our data loaders?


<a class="anchor" id="5.2"></a>

## <span style="color:blue; font-size:1em;"> Task 2. Creating a model</span>

[Go back to the content](#5)

**For variants 1-3:** Enhance the efficiency of TinyVGG by integrating depthwise separable convolutions.
* Replace standard convolutional layers (`torch.nn.Conv2d`) in TinyVGG with depthwise separable convolutions: a depthwise `torch.nn.Conv2d` with `groups` equal to the number of input channels, followed by a pointwise `torch.nn.Conv2d` with `kernel_size=1`. This reduces the computational load and model size by decomposing each convolution into a spatial (per-channel) step and a channel-mixing step.
* It is beneficial for deploying models on devices with limited computational resources, such as mobile phones or IoT devices, without substantially sacrificing accuracy.
* Configure the model parameters to adjust the depth multiplier and the pointwise convolution to ensure it fits the needs of the food classification task.
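
The decomposition described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `DepthwiseSeparableConv2d`, the helper `count_params`, and the `depth_multiplier` argument are assumptions introduced here, and the channel counts are arbitrary.

```python
import torch
from torch import nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 3, padding: int = 1,
                 depth_multiplier: int = 1):
        super().__init__()
        mid_channels = in_channels * depth_multiplier
        # groups=in_channels: every input channel gets its own spatial filter(s)
        self.depthwise = nn.Conv2d(in_channels, mid_channels,
                                   kernel_size=kernel_size, padding=padding,
                                   groups=in_channels, bias=False)
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(mid_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


def count_params(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


standard = nn.Conv2d(10, 10, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv2d(10, 10)
x = torch.randn(1, 10, 64, 64)  # dummy batch: 1 image, 10 channels, 64x64

print(separable(x).shape)
print(count_params(standard), count_params(separable))
```

Comparing the two parameter counts gives a quick check of how much smaller the separable layer is for a given channel configuration.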

**For variants 4-6:** Improve training stability and accelerate convergence using batch normalization layers.
* Insert a `torch.nn.BatchNorm2d` layer after each convolutional layer in the TinyVGG architecture. Batch normalization normalizes the activations of the previous layer at each batch, maintaining a mean output close to 0 and an output standard deviation close to 1.
* This adjustment helps in reducing internal covariate shift, which can speed up training and lead to higher overall accuracy.
* Ensure that hyperparameters such as momentum and epsilon are tuned for optimal performance specific to the dataset.

**For variants 7-9:** Incorporate dropout layers to prevent overfitting.
* Add `torch.nn.Dropout` layers following each activation layer in the TinyVGG setup. Dropout randomly zeros some elements of the input tensor with probability p during training, acting as a form of regularization.
* This approach is especially useful when dealing with a small or highly similar dataset, where the model is at a higher risk of overfitting.
* Adjust the dropout rate based on the validation set performance to find the right balance between learning and regularization.

**For variants 10-12:** Implement residual connections within the TinyVGG to enable training of deeper networks by mitigating the vanishing gradient problem.

* Add residual connections by incorporating skip connections that add the input of a convolution block to its output, facilitating deeper architectures without degradation in training performance.
* Use `torch.nn.Identity` or direct tensor addition to implement these connections.
* This setup allows for the potential increase in model depth, enhancing its ability to learn more complex features without suffering from training difficulties.
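
A skip connection of the kind described above might look like the sketch below. The class name `ResidualBlock` and the two-convolution body are assumptions chosen for illustration; the block keeps the channel count fixed so that the input and output can be added directly.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the block input."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Identity shortcut: valid because input and output shapes match
        self.shortcut = nn.Identity()
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.activation(self.body(x) + self.shortcut(x))


block = ResidualBlock(channels=10)
x = torch.randn(2, 10, 16, 16)
out = block(x)
print(out.shape)
```

If a block changes the number of channels or the spatial size, the shortcut would need a matching projection (for example a 1x1 convolution) instead of `nn.Identity`.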

**For variants 13-15:** Expand the receptive field using dilated convolutions to capture a broader context without increasing the number of parameters.
* Replace some of the standard convolution layers in TinyVGG with dilated convolutions (`torch.nn.Conv2d` with a dilation parameter greater than 1). This helps in increasing the receptive field of the model, allowing it to incorporate context from a larger area of the input.
* Dilated convolutions are particularly useful for images where understanding broader spatial relationships is beneficial for classification accuracy.
* Carefully adjust dilation rates and layer configurations to optimize performance without causing gridding artifacts.
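
To make the "larger receptive field, same parameter count" claim concrete, here is a small sketch. The helper `effective_kernel_size` is a name introduced here; the padding value is chosen so that the spatial size is preserved.

```python
import torch
from torch import nn


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


standard = nn.Conv2d(3, 10, kernel_size=3, padding=1)
# padding=dilation keeps the output size equal to the input size for a 3x3 kernel
dilated = nn.Conv2d(3, 10, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 3, 64, 64)
standard_params = sum(p.numel() for p in standard.parameters())
dilated_params = sum(p.numel() for p in dilated.parameters())

print(standard(x).shape, dilated(x).shape)
print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
print(standard_params, dilated_params)
```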

**For variants 16-18:** Integrate attention mechanisms to focus the model more on relevant parts of the image.
* Embed attention modules such as Squeeze-and-Excitation (SE) blocks within the TinyVGG structure. These blocks reweight channel-wise features by explicitly modeling interdependencies between channels, enhancing the representational power of the network.
* Such attention mechanisms can help the model to focus on more informative features and improve the classification of complex food images where certain features are more discriminative than others.
* Experiment with the placement and configuration of SE blocks to maximize their effect without overwhelming the network's capacity.
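
One possible shape for an SE block is sketched below. The class name `SEBlock` and the `reduction` ratio are assumptions for illustration; the structure (global average pooling, a small bottleneck, sigmoid gating) follows the description above.

```python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweights channels with learned gates in (0, 1)."""

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Squeeze: one scalar per channel via global average pooling
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck MLP producing one gate per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, channels, _, _ = x.shape
        gates = self.excite(self.squeeze(x).view(batch, channels))
        return x * gates.view(batch, channels, 1, 1)


se = SEBlock(channels=10)
x = torch.randn(4, 10, 16, 16)
out = se(x)
print(out.shape)
```

A block like this could be placed after each convolutional block of TinyVGG, since it leaves the tensor shape unchanged.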

**For variants 19-21:** Aggregate features from multiple scales to improve model robustness and accuracy.
* Modify TinyVGG to incorporate feature aggregation from different layers, using techniques like feature pyramids or concatenated outputs from different stages of the network.
* This allows the model to leverage both low-level details and high-level semantic information, which is crucial for accurately classifying food items that can vary significantly in appearance at different scales.
* Implement custom forward methods to handle multiscale feature aggregation and ensure that the feature dimensions are compatible for concatenation or merging.

**For variants 22-24:** Explore the impact of different activation functions on the performance of the TinyVGG model.
* Experiment with advanced activation functions beyond ReLU, such as LeakyReLU, PReLU, or Swish, which may offer advantages in certain scenarios by providing non-linearities that can improve learning dynamics.
* Integrate these activation functions into the TinyVGG model by replacing all ReLU layers. Each activation function has unique characteristics: LeakyReLU allows a small, non-zero gradient when the unit is not active, potentially helping with the dying ReLU problem; PReLU introduces learnable parameters that allow it to adapt its shape during training; Swish, being a smooth function, often provides benefits in deeper networks.
* Evaluate the impact of these changes through experimentation, analyzing how they affect convergence speed and overall accuracy on the food classification task.

**For variants 25-27:** Modify TinyVGG to prioritize inference speed, suitable for real-time applications.
* Apply techniques like layer fusion, precision reduction (e.g., using half-precision floats), and pruning to reduce the computational cost and model size. This makes the model more suitable for deployment in environments where computational resources or latency is a concern, such as mobile apps.
* Consider integrating platform-specific optimizations if the deployment target is known, using libraries like ONNX or TensorRT, which can provide significant speedups by optimizing network graphs.
* Thoroughly test the optimized model to ensure that these speed enhancements do not unduly compromise classification accuracy.

**For variants 28-30:** Combine multiple TinyVGG models in an ensemble to improve robustness and accuracy.
* Train multiple TinyVGG models with variations in initialization, data shuffling, and hyperparameters to create a diverse set of classifiers.
* Use ensemble techniques such as averaging, voting, or stacking to combine the outputs of these models. Averaging can reduce variance in predictions; voting can be used for a robust majority rule in classification; stacking uses another model to learn the optimal combination of classifiers.
* This approach often results in better performance than any single model, especially in complex classification tasks like food recognition, where different models might learn to focus on different aspects of the inputs.
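
The averaging variant of the ensemble can be sketched as follows. The function name `ensemble_predict` is hypothetical, and the tiny linear models stand in for independently trained TinyVGG instances.

```python
import torch
from torch import nn


def ensemble_predict(models, x: torch.Tensor) -> torch.Tensor:
    """Average the class probabilities predicted by several models."""
    with torch.inference_mode():
        probs = torch.stack(
            [torch.softmax(model.eval()(x), dim=1) for model in models]
        )
    return probs.mean(dim=0)


def make_model(seed: int) -> nn.Module:
    """Stand-in classifier; different seeds give different initializations."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3))


models = [make_model(seed) for seed in range(3)]
x = torch.randn(5, 3, 8, 8)
avg_probs = ensemble_predict(models, x)
pred_labels = avg_probs.argmax(dim=1)
print(avg_probs.shape, pred_labels)
```

Majority voting would replace the mean with a mode over each model's `argmax`, and stacking would train a further model on the concatenated probabilities.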

<a class="anchor" id="5.3"></a>

## <span style="color:blue; font-size:1em;"> Task 3. Training and testing loops</span>

[Go back to the content](#5)

**For variants 1-3:** Implement dynamic adjustment of the learning rate during training to improve convergence and avoid overshooting minima.
* Use a learning rate scheduler from `torch.optim.lr_scheduler` to adjust the learning rate as training progresses. Start with a higher learning rate to quickly converge towards the general vicinity of the optimum, and then decrease it progressively to fine-tune the model’s parameters without overshooting.
* Apply `StepLR` to decrease the learning rate by a certain factor every few epochs, or use `ExponentialLR` for a steady exponential decay, or `ReduceLROnPlateau` to reduce the learning rate when the validation loss plateaus, indicating that the model might benefit from more subtle updates.
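
As a rough sketch of how `StepLR` behaves, the loop below records the learning rate at the start of each epoch. The model, the initial learning rate, `step_size`, and `gamma` are placeholder values, and the training step itself is omitted.

```python
import torch
from torch import nn

model = nn.Linear(4, 3)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 2 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... train_step(...) and test_step(...) would run here ...
    optimizer.step()   # the optimizer steps before the scheduler
    scheduler.step()   # called once per epoch

print(lrs)
```

`ReduceLROnPlateau` differs in that its `step()` call takes the monitored metric, for example `scheduler.step(test_loss)`.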

**For variants 4-6:** Integrate an early stopping mechanism to halt training when the model begins to overfit.
* Monitor the model's performance on a validation set at the end of each epoch. If the validation loss fails to improve or starts to increase over several consecutive epochs, terminate training early. This prevents the model from learning noise and non-generalizable patterns present in the training data.
* Implement early stopping using a custom function that tracks the best validation loss observed during training and counts the number of epochs since it last improved. If this count exceeds a predefined threshold (patience), stop the training process.
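
The tracking logic described above could be sketched like this. The class name `EarlyStopping`, the `min_delta` argument, and the list of validation losses are all illustrative assumptions; in practice the losses would come from `test_step` at the end of each epoch.

```python
class EarlyStopping:
    """Signals a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses for demonstration only
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

It is common to also save the model weights whenever `best_loss` improves, so that the best state can be restored after stopping.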

**For variants 7-9:** Apply gradient clipping during training to prevent the exploding gradient problem, which can lead to destabilized learning processes.
* Use `torch.nn.utils.clip_grad_norm_` or `torch.nn.utils.clip_grad_value_` to clip the gradients during backpropagation. This keeps them within a manageable range and prevents the gradients from growing too large, which can cause the model parameters to oscillate wildly or diverge.
* Gradient clipping is particularly useful in training deep networks or networks with recurrent layers, where gradients can grow exponentially through time or depth.
* Incorporate this modification into the training loop, applying the clipping right after computing gradients (`loss.backward()`) and before updating the model parameters (`optimizer.step()`). Experiment with different clipping thresholds to find a balance that minimizes the impact on the natural training dynamics of the model while preventing instability.
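
A single training step with norm-based clipping might be sketched as below. The linear model, the deliberately large inputs, and `max_norm=1.0` are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 3)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(8, 4) * 100  # large inputs to provoke large gradients
y = torch.randint(0, 3, (8,))

optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()

# Clip after backward() and before step(); returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
)
optimizer.step()

print(float(norm_before), float(norm_after))
```

`clip_grad_value_` would instead clamp each gradient element to a fixed range, which changes the gradient direction as well as its magnitude.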

**For variants 10-12:** Utilize mixed precision training to speed up the training process and reduce memory usage while maintaining the model's performance.
* Leverage PyTorch’s `torch.cuda.amp` for automatic mixed precision (AMP). This module runs selected operations in lower precision (float16), which is faster on compatible hardware, while keeping numerically sensitive operations in higher precision (float32) to preserve accuracy.
* Use `amp.GradScaler` to scale the loss so that small gradients do not underflow in float16.
* Integrate AMP into the training loop by wrapping the forward pass and the loss computation in an `amp.autocast()` context manager; call `backward()` on the scaled loss outside that context. This strategy helps achieve faster training and lower GPU memory consumption without significant loss in model accuracy or training stability.
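
One way to lay out a single AMP training step is sketched below. It is written so that mixed precision is simply disabled when no GPU is available; the flag `use_amp` and the linear placeholder model are assumptions introduced here. Newer PyTorch releases also expose the scaler under `torch.amp`, so the import path may need adjusting for your version.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # float16 autocast only makes sense on a GPU here

torch.manual_seed(0)
model = nn.Linear(4, 3).to(device)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

X = torch.randn(8, 4, device=device)
y = torch.randint(0, 3, (8,), device=device)

optimizer.zero_grad()
# Only the forward pass and the loss computation run under autocast
with torch.autocast(device_type=device, enabled=use_amp):
    logits = model(X)
    loss = loss_fn(logits, y)

# Backward pass on the scaled loss, outside the autocast context
scaler.scale(loss).backward()
scaler.step(optimizer)  # unscales gradients, skips the step if they overflowed
scaler.update()

print(logits.dtype, float(loss))
```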

**For variants 13-15:** Utilize adaptive gradient algorithms to dynamically adjust learning rates at the parameter level, improving convergence speeds and model robustness.
* Implement an optimizer like Adam or RMSprop, which adjusts the learning rate for each parameter based on estimates of the first and second moments of the gradients. This method helps in handling sparse gradients and different scales of parameters effectively.
* Set up Adam with specific hyperparameters: `betas=(beta1, beta2)`, which control the decay rates of the moving averages of the gradient and its square, and `eps`, the term added to the denominator to improve numerical stability.
* Integrate this optimizer into your training loop. Monitor the effect on training dynamics, specifically looking at how quickly and smoothly the model converges compared to using standard SGD. Adjust the learning rate and other parameters based on empirical results.

**For variants 16-18:** Gradually increase the complexity of training data, simulating a learning "curriculum" to help the model learn more effectively.
* Start with simpler or smaller subsets of the training data and gradually introduce more complex or larger batches as the training progresses. This can be implemented by sorting the training data by some measure of complexity (e.g., image resolution, the presence of noise, etc.) or by modifying the data loader to emit progressively more challenging examples.
* Use a scheduling mechanism to increase complexity, such as increasing the size of the input data or the number of classes the model needs to predict after certain epochs.
* Experiment with different metrics for defining complexity and schedules for introducing new challenges to optimize training outcomes.

**For variants 19-21:** Improve generalization by averaging multiple points along the trajectory of SGD, capturing a wider "ensemble" of models.
* Integrate Stochastic Weight Averaging (SWA) by replacing the final phase of the conventional training loop with one in which the model weights are periodically averaged. This typically begins after the model has initially converged using standard training methods.
* Implement SWA with a custom routine or with the utilities in `torch.optim.swa_utils` (`AveragedModel` and `SWALR`), which manage the averaging and the learning rate schedule. Adjust the frequency of updates and the number of cycles based on validation performance.
* SWA often leads to better generalization and more stable predictions, as it smooths out sharp minima in the loss landscape that are sensitive to small perturbations in inputs or parameters.
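
A compact sketch of weight averaging with `torch.optim.swa_utils` is shown below. The starting epoch `swa_start`, the learning rates, and the single-batch "epoch" are placeholder assumptions; a real loop would iterate over the training `DataLoader`.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR

torch.manual_seed(0)
model = nn.Linear(4, 3)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

swa_model = AveragedModel(model)               # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # learning rate used in the SWA phase
swa_start = 3                                  # assumed epoch at which averaging begins

X = torch.randn(16, 4)
y = torch.randint(0, 3, (16,))

for epoch in range(6):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()

# For models with BatchNorm layers, the running statistics should be recomputed
# afterwards, e.g. torch.optim.swa_utils.update_bn(train_dataloader, swa_model)
print(int(swa_model.n_averaged), swa_model(X).shape)
```

Evaluation is then done with `swa_model` rather than `model`.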

**For variants 22-24:** Tailor the loss function to address specific characteristics or challenges of the dataset, such as class imbalance or outliers.
* Develop and integrate custom loss functions that can better reflect the importance of certain examples or balance the influence of different classes. For instance, use a weighted cross-entropy loss where weights are inversely proportional to class frequencies.
* Include mechanisms to handle outliers, such as using a robust loss function like Huber loss, which is less sensitive to outliers than squared error loss.
* Test different configurations of the loss function to identify the best setup for balancing learning across the diverse elements of the dataset, thereby enhancing model accuracy and robustness.
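
Inverse-frequency class weights can be computed as in the sketch below. The label tensor is made up to show an imbalanced case, and the particular normalization (total count divided by number of classes times class count) is one common choice, not the only one.

```python
import torch
from torch import nn

# Made-up labels: class 0 is frequent, class 2 is rare
targets = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
num_classes = 3

class_counts = torch.bincount(targets, minlength=num_classes).float()
# Weight is inversely proportional to how often the class appears
class_weights = class_counts.sum() / (num_classes * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

torch.manual_seed(0)
logits = torch.randn(len(targets), num_classes)
loss = loss_fn(logits, targets)
print(class_weights, float(loss))
```

With real data, the counts could be derived from the `targets` attribute of the `ImageFolder` training dataset.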

**For variants 25-27:** Implement checkpointing to save the model at various stages during training, allowing recovery and fine-tuning from specific states.
* Set up a checkpointing system that periodically saves the model's state, including the weights, optimizer state, and current epoch number. This is crucial for long training sessions or when using expensive computational resources, as it allows training to resume from the last checkpoint in case of a failure.
* Optionally, use model snapshots at different points (e.g., every 5 epochs) to evaluate how the model's performance evolves over time and to choose the best model based on validation metrics rather than simply keeping the final epoch.
* Use PyTorch’s `torch.save` and `torch.load` for efficient management of checkpoints.
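
A checkpoint round trip might be sketched as follows. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the file name are assumptions; a temporary directory is used here only so the example cleans up after itself.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save everything needed to resume training."""
    torch.save({"epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict()}, path)


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the saved epoch."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]


model = nn.Linear(4, 3)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint_epoch_5.pth")
    save_checkpoint(path, model, optimizer, epoch=5)

    # A fresh model and optimizer, as after a restart
    restored_model = nn.Linear(4, 3)
    restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.01)
    start_epoch = load_checkpoint(path, restored_model, restored_optimizer) + 1

print(start_epoch)
```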

**For variants 28-30:** Adjust the batch size dynamically during training to find an optimal balance between learning stability and computational efficiency.
* Start with a smaller batch size and increase it as training progresses. This can help in stabilizing the initial learning phase when the model is more sensitive to noisy gradients, then scale up to exploit computational efficiencies of larger batch sizes once the training stabilizes.
* Implement a schedule or a performance-based trigger for adjusting the batch size, such as increasing the batch size when the training loss decreases consistently or plateaus. This approach can help in optimizing the use of GPU memory and computational power throughout the training process.
* Monitor the impact of dynamic batch sizing on training speed and model accuracy. Larger batches give more stable, lower-variance gradient estimates but can generalize less well, while smaller batches lead to noisier updates that might escape suboptimal local minima more effectively.
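
Since a `DataLoader` has a fixed batch size, one simple way to follow a schedule is to rebuild the loader at the start of each epoch. The sketch below assumes a doubling schedule; the function `batch_size_for_epoch` and all its defaults are hypothetical, and random tensors stand in for the image dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def batch_size_for_epoch(epoch: int, initial: int = 8, growth_every: int = 2,
                         factor: int = 2, maximum: int = 64) -> int:
    """Multiply the batch size by `factor` every `growth_every` epochs, up to `maximum`."""
    return min(initial * factor ** (epoch // growth_every), maximum)


# Stand-in dataset: 128 random "images" with 3 classes
dataset = TensorDataset(torch.randn(128, 3, 8, 8), torch.randint(0, 3, (128,)))

schedule = []
for epoch in range(8):
    batch_size = batch_size_for_epoch(epoch)
    # Rebuild the DataLoader so the new batch size takes effect
    train_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    schedule.append((batch_size, len(train_dataloader)))
    # ... train_step(model, train_dataloader, ...) would run here ...

print(schedule)
```

A performance-based trigger would replace `batch_size_for_epoch` with a rule that inspects the recent training loss.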

In [None]:
def train_step(model: torch.nn.Module,
               dataloader: torch.utils.data.DataLoader,
               loss_fn: torch.nn.Module,
               optimizer: torch.optim.Optimizer):
  
  # Put the model in train mode
  model.train()

  # Setup train loss and train accuracy values
  train_loss, train_acc = 0, 0

  # Loop through data loader and data batches
 
    # Send data to target device

    # 1. Forward pass
    
    # 2. Calculate and accumulate loss
    

    # 3. Optimizer zero grad 
    

    # 4. Loss backward 
    

    # 5. Optimizer step
    

    # Calculate and accumulate accuracy metric across all batches
   

  # Adjust metrics to get average loss and average accuracy per batch
  

In [None]:
def test_step(model: torch.nn.Module,
              dataloader: torch.utils.data.DataLoader,
              loss_fn: torch.nn.Module):
  
  # Put model in eval mode
  model.eval()

  # Setup the test loss and test accuracy values
  test_loss, test_acc = 0, 0

  # Turn on inference context manager
  
    # Loop through DataLoader batches
    
      # Send data to target device
      

      # 1. Forward pass
      

      # 2. Calculate and accumulate loss


      # Calculate and accumulate accuracy

    
  # Adjust metrics to get average loss and accuracy per batch


In [None]:
from tqdm.auto import tqdm

def train(model: torch.nn.Module,
          train_dataloader: torch.utils.data.DataLoader,
          test_dataloader: torch.utils.data.DataLoader,
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module = nn.CrossEntropyLoss(),
          epochs: int = 5):
  
  # Create results dictionary
  results = {"train_loss": [],
             "train_acc": [],
             "test_loss": [],
             "test_acc": []}

  # Loop through the training and testing steps for a number of epochs
  for epoch in tqdm(range(epochs)):
    # Train step
    train_loss, train_acc = train_step(model=model, 
                                       dataloader=train_dataloader,
                                       loss_fn=loss_fn,
                                       optimizer=optimizer)
    # Test step
    test_loss, test_acc = test_step(model=model, 
                                    dataloader=test_dataloader,
                                    loss_fn=loss_fn)
    
    # Print out what's happening
    print(f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
    )

    # Update the results dictionary
    results["train_loss"].append(train_loss)
    results["train_acc"].append(train_acc)
    results["test_loss"].append(test_loss)
    results["test_acc"].append(test_acc)

  # Return the results dictionary
  return results

<a class="anchor" id="5.4"></a>

## <span style="color:blue; font-size:1em;"> Task 4. Conducting experiments with hyperparameters</span>

[Go back to the content](#5)

**For variants 1-3:** Explore the effects of varying learning rates and the addition of weight decay in the optimization process.
* Conduct experiments by training the same model with different learning rates, such as 0.01, 0.001, and 0.0001, combined with weight decay values of 0, 0.0001, and 0.001. Use the Adam optimizer for this purpose, as it adapts the learning rate per parameter and handles sparse gradients well.
* Weight decay adds a regularization term to the loss function that helps in reducing overfitting by penalizing large weights. Experimenting with weight decay can show its effect on the generalization ability of the model.
* Use metrics such as validation loss and accuracy to assess the impact of different learning rates and weight decay configurations. Plotting these metrics over epochs will help visualize trends and identify the best configurations for balance between speed of convergence and model accuracy.

**For variants 4-6:** Investigate the impact of different batch sizes on model training dynamics and performance.
* Train your model with varying batch sizes, such as 32, 64, and 128, to understand how they affect the learning process, computational demand, and model performance. Larger batch sizes provide more accurate estimates of the gradient but require more memory and computational power.
* Monitor changes in training and validation loss, as well as accuracy across different batch sizes. Additionally, observe the GPU utilization and training time per epoch to evaluate the computational trade-offs.
* This experiment will help in finding an optimal batch size that balances between efficient use of computational resources and model performance, especially important in scenarios where resources are limited or costs are a concern.

**For variants 7-9:** Examine how different activation functions influence model training and accuracy.
* Modify the model to use various activation functions such as ReLU, LeakyReLU, and ELU. Each of these functions has different properties; for example, LeakyReLU allows a small gradient when the unit is inactive, which can help mitigate the dying ReLU problem.
* Train the model using each activation function and compare performance metrics like training loss, validation loss, and final accuracy. It's crucial to observe not just the performance but also how quickly each model converges and its behavior during training (e.g., whether it exhibits more stable or erratic loss reductions).
* Analyze the results to determine which activation function performs best with your specific model architecture and dataset. This experiment can reveal insights into the non-linear dynamics of your model and how they affect learning and generalization.

**For variants 10-12:** Compare different optimizers to see how they impact model performance and training speed.
* Set up experiments to train the model using different optimizers such as SGD, Adam, and RMSprop. Each optimizer has a distinct mechanism: SGD applies one global learning rate to every parameter and benefits from manual tuning, while Adam and RMSprop adapt per-parameter step sizes based on running averages of recent gradients.
* Evaluate how each optimizer influences the rate of convergence, stability of training, and final model accuracy. Also, consider factors like the ease of reaching a satisfactory solution and the sensitivity to initial conditions or hyperparameter settings.
* Collecting data on training duration, epochs needed to converge, and performance on a held-out validation set will provide a comprehensive view of the strengths and weaknesses of each optimizer.

**For variants 13-15:** Experiment with different optimization algorithms and learning rates.

Instead of using `torch.optim.Adam()`, try training the model from Task 2 with the following optimizers:
* `torch.optim.SGD()` with a learning rate of 0.01
* `torch.optim.RMSprop()` with a learning rate of 0.001
* `torch.optim.Adagrad()` with a learning rate of 0.01

Train the model for 20 epochs with each optimizer and observe the impact on the training and testing loss and accuracy. Additionally, try varying the learning rate for each optimizer (e.g., 0.001, 0.005, 0.01) and note the differences in performance.
* Import the required optimization algorithms from `torch.optim`.
* Create a list of tuples containing the optimizer instances and corresponding learning rates.
* Iterate over the list, training the model for 20 epochs with each optimizer and learning rate combination.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which optimizer and learning rate combination works best for your model.

**For variants 16-18:** Experiment with different loss functions.

Instead of using `nn.CrossEntropyLoss()`, try training the model from Task 2 with the following loss functions:
* `nn.NLLLoss()` (Negative Log Likelihood Loss)
* `nn.BCELoss()` (Binary Cross-Entropy Loss)
* `nn.MSELoss()` (Mean Squared Error Loss)

Train the model for 20 epochs with each loss function and observe the impact on the training and testing loss and accuracy.
* Import the required loss functions from `torch.nn`.
* Create a list containing the loss function instances.
* Iterate over the list, training the model for 20 epochs with each loss function.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which loss function works best for your model and dataset.
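
These loss functions do not accept the same inputs as `nn.CrossEntropyLoss()`, so the model outputs and targets need some preparation. The sketch below shows one plausible setup under the assumption that the model returns raw logits; the logits and labels are random placeholders.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
num_classes = 3
logits = torch.randn(4, num_classes)  # raw model outputs
targets = torch.tensor([0, 2, 1, 1])  # integer class labels

# CrossEntropyLoss: raw logits + integer labels
ce = nn.CrossEntropyLoss()(logits, targets)

# NLLLoss: expects log-probabilities, so apply log_softmax first
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

# BCELoss and MSELoss: expect float targets with the same shape as the input,
# so the labels are one-hot encoded and the logits turned into probabilities
one_hot = F.one_hot(targets, num_classes=num_classes).float()
probs = torch.softmax(logits, dim=1)
bce = nn.BCELoss()(probs, one_hot)
mse = nn.MSELoss()(probs, one_hot)

print(float(ce), float(nll), float(bce), float(mse))
```

Because the loss values are on different scales, accuracy is the fairer metric when comparing runs across loss functions.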

**For variants 19-21:** Experiment with different regularization techniques to prevent overfitting.

Try incorporating the following regularization methods into your model from Task 2:
* L2 regularization (weight decay)
* Dropout
* Early stopping

Train the model for 50 epochs with each regularization technique and observe the impact on the training and testing loss and accuracy.
* Implement L2 regularization by adding a `weight_decay` parameter to your optimizer.
* Incorporate dropout layers into your model by adding `nn.Dropout()` layers after the linear layers.
* Implement early stopping by monitoring the validation loss and stopping the training when it starts to increase.
* Train the model for 50 epochs with each regularization technique.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which regularization technique works best for preventing overfitting in your model.

**For variants 22-24:** Experiment with different batch sizes and data augmentation techniques.

Try training the model from Task 2 with the following configurations:
* Batch size of 32 without data augmentation
* Batch size of 64 without data augmentation
* Batch size of 32 with data augmentation (e.g., random flips, rotations, and crops)
* Batch size of 64 with data augmentation (e.g., random flips, rotations, and crops)

Train the model for 20 epochs with each configuration and observe the impact on the training and testing loss and accuracy.
* Create different `DataLoader` instances with various batch sizes.
* Implement data augmentation techniques using `torchvision.transforms`.
* Create a list containing the different configurations (batch size and data augmentation).
* Iterate over the list, training the model for 20 epochs with each configuration.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which batch size and data augmentation combination works best for your model and dataset.

**For variants 25-27:** Experiment with different activation functions and weight initialization techniques.

Try training the model from Task 2 with the following configurations:
* ReLU activation function and Xavier uniform weight initialization
* Leaky ReLU activation function and Kaiming weight initialization
* ELU activation function and Xavier normal weight initialization

Train the model for 20 epochs with each configuration and observe the impact on the training and testing loss and accuracy.
* Import the required activation functions from `torch.nn`.
* Implement the different weight initialization techniques using `torch.nn.init`.
* Create a list containing the different configurations (activation function and weight initialization).
* Iterate over the list, modifying the model accordingly and training it for 20 epochs with each configuration.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which activation function and weight initialization combination works best for your model and dataset.
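
Applying an initialization scheme to every layer can be done with `nn.Module.apply`, as in the sketch below. The function `init_weights`, its scheme names, and the small stand-in model are assumptions for illustration; biases are zeroed here as one common convention.

```python
from functools import partial

import torch
from torch import nn


def init_weights(module: nn.Module, scheme: str) -> None:
    """Initialize Conv2d/Linear weights with the chosen scheme and zero the biases."""
    if not isinstance(module, (nn.Conv2d, nn.Linear)):
        return
    if scheme == "xavier_uniform":
        nn.init.xavier_uniform_(module.weight)
    elif scheme == "xavier_normal":
        nn.init.xavier_normal_(module.weight)
    elif scheme == "kaiming_normal":
        # a=0.01 matches the default negative slope of nn.LeakyReLU
        nn.init.kaiming_normal_(module.weight, a=0.01, nonlinearity="leaky_relu")
    else:
        raise ValueError(f"Unknown scheme: {scheme}")
    if module.bias is not None:
        nn.init.zeros_(module.bias)


# Small stand-in for TinyVGG with a Leaky ReLU activation
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3, padding=1),
    nn.LeakyReLU(),
    nn.Flatten(),
    nn.Linear(10 * 8 * 8, 3),
)
# apply() visits every submodule, so each layer is initialized in turn
model.apply(partial(init_weights, scheme="kaiming_normal"))

print(model(torch.randn(2, 3, 8, 8)).shape)
```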

**For variants 28-30:** Experiment with different learning rate scheduling techniques.

Instead of using a fixed learning rate, try training the model from Task 2 with the following learning rate schedulers:
* `torch.optim.lr_scheduler.StepLR()`: Decays the learning rate by a factor every specified number of epochs.
* `torch.optim.lr_scheduler.ReduceLROnPlateau()`: Decays the learning rate when the validation loss plateaus.
* `torch.optim.lr_scheduler.CosineAnnealingLR()`: Decays the learning rate following a cosine annealing schedule.

Train the model for 50 epochs with each learning rate scheduler and observe the impact on the training and testing loss and accuracy.
* Import the required learning rate scheduler from `torch.optim.lr_scheduler`.
* Create an instance of the optimizer (e.g., `torch.optim.Adam()`).
* Create an instance of the learning rate scheduler, passing the optimizer as an argument.
* In the training loop, update the learning rate after each epoch using the scheduler's `step()` method.
* Record the training and testing loss and accuracy for each experiment.
* Analyze and compare the results to determine which learning rate scheduling technique works best for your model and dataset.
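
One detail worth noting is that `ReduceLROnPlateau.step()` takes the monitored metric, while the other two schedulers take no argument. A small wrapper, named `step_scheduler` here purely for illustration, can hide that difference; the constant validation losses are made up to force a plateau.

```python
import torch
from torch import nn


def step_scheduler(scheduler, val_loss: float) -> None:
    """Call step() with the metric only for schedulers that need it."""
    if isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
        scheduler.step(val_loss)
    else:
        scheduler.step()


model = nn.Linear(4, 3)  # placeholder for the TinyVGG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=1
)

# Made-up validation losses that never improve
val_losses = [1.0, 1.0, 1.0, 1.0]
for val_loss in val_losses:
    # ... train_step(...) and test_step(...) would run here ...
    optimizer.step()
    step_scheduler(scheduler, val_loss)

final_lr = optimizer.param_groups[0]["lr"]
print(final_lr)
```

The same loop works unchanged with `StepLR` or `CosineAnnealingLR` instances.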

If the model performs far better on the training data than on the testing data towards the end of training, it is likely starting to overfit.

To fix this, introduce ways of preventing overfitting.

<a class="anchor" id="5.5"></a>

## <span style="color:blue; font-size:1em;"> Task 5. Conducting experiments with the model's layers</span>

[Go back to the content](#5)

**For variants 1-3:** Experiment with different types of convolutional layers.

Instead of using the standard convolutional layers in your model from Task 2, try replacing them with the following types of convolutional layers:
* Depthwise Separable Convolutions
* Dilated Convolutions
* Transposed Convolutions (for upsampling)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required convolutional layers from `torch.nn`.
* Modify the existing model architecture by replacing the standard convolutional layers with the new types of convolutional layers.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different convolutional layer types.
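
As a starting point, the three layer types can be expressed with `nn.Conv2d` and `nn.ConvTranspose2d`. The sketch below assumes 64x64 RGB inputs and 10 hidden channels; the class name `DepthwiseSeparableConv` is a hypothetical helper, not part of PyTorch.

```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch makes each filter see only its own input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 64, 64)

separable = DepthwiseSeparableConv(3, 10)
# dilation=2 widens a 3x3 kernel to a 5x5 receptive field; padding=2 keeps the size
dilated = nn.Conv2d(3, 10, kernel_size=3, padding=2, dilation=2)
# stride=2 with kernel_size=2 doubles the spatial resolution
upsample = nn.ConvTranspose2d(10, 10, kernel_size=2, stride=2)

sep_out = separable(x)
dil_out = dilated(x)
up_out = upsample(sep_out)
print(sep_out.shape, dil_out.shape, up_out.shape)
```

Because a transposed convolution increases the spatial size, any `nn.Linear` classifier that follows it needs its `in_features` adjusted accordingly.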

**For variants 4-6:** Experiment with different types of pooling layers.

Instead of using the standard max pooling layers in your model from Task 2, try replacing them with the following types of pooling layers:
* Average Pooling
* Adaptive Max Pooling
* Adaptive Average Pooling

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required pooling layers from `torch.nn`.
* Modify the existing model architecture by replacing the max pooling layers with the new types of pooling layers.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different pooling layer types.
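
The pooling layers differ mainly in how the output size is determined, which the following sketch illustrates. The input shape and the target output sizes are assumptions for demonstration only.

```python
import torch
from torch import nn

x = torch.randn(1, 10, 32, 32)  # assumed feature map: (batch, channels, height, width)

avg_pool = nn.AvgPool2d(kernel_size=2)        # output size depends on the input size
adaptive_max = nn.AdaptiveMaxPool2d((4, 4))   # output size is fixed to 4x4
adaptive_avg = nn.AdaptiveAvgPool2d(1)        # global average pooling: 1x1 per channel

avg_out = avg_pool(x)
amax_out = adaptive_max(x)
aavg_out = adaptive_avg(x)
print(avg_out.shape, amax_out.shape, aavg_out.shape)
```

Adaptive pooling fixes the output size regardless of the input resolution, so the `in_features` of the classifier layer can be computed from the target size alone.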

**For variants 7-9:** Experiment with different types of normalization layers.

Instead of using batch normalization in your model from Task 2, try replacing it with the following types of normalization layers:
* Layer Normalization
* Instance Normalization
* Group Normalization

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required normalization layers from `torch.nn`.
* Modify the existing model architecture by replacing the batch normalization layers with the new types of normalization layers.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different normalization layer types.
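
One way to keep the experiments comparable is a small factory that returns the normalization layer by name. The helper `make_norm` below is hypothetical. It also makes one assumption worth noting: layer normalization over a convolutional feature map is expressed here as `nn.GroupNorm` with a single group, since `nn.LayerNorm` would need the full `(C, H, W)` shape up front.

```python
import torch
from torch import nn

def make_norm(kind, channels):
    """Return a normalization layer for (N, C, H, W) feature maps."""
    if kind == "batch":
        return nn.BatchNorm2d(channels)
    if kind == "layer":
        # One group: statistics are computed over (C, H, W) for each sample
        return nn.GroupNorm(num_groups=1, num_channels=channels)
    if kind == "instance":
        return nn.InstanceNorm2d(channels, affine=True)
    if kind == "group":
        # num_groups must divide channels; 2 is an illustrative choice
        return nn.GroupNorm(num_groups=2, num_channels=channels)
    raise ValueError(f"Unknown normalization: {kind}")

x = torch.randn(4, 10, 16, 16)
norm_outputs = {kind: make_norm(kind, 10)(x) for kind in ["batch", "layer", "instance", "group"]}
for kind, out in norm_outputs.items():
    print(kind, out.shape)
```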

**For variants 10-12:** Experiment with different types of attention mechanisms.

Instead of using standard convolutional and fully connected layers in your model from Task 2, try incorporating the following attention mechanisms:
* Self-Attention
* Squeeze-and-Excitation Attention
* Convolutional Block Attention Module (CBAM)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required attention modules from PyTorch or implement them manually.
* Modify the existing model architecture by incorporating the attention mechanisms at appropriate locations.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different attention mechanisms.
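
Of the three mechanisms, Squeeze-and-Excitation is the simplest to implement manually. The block below is a minimal sketch under assumed settings (10 channels, reduction ratio 2); the class name `SqueezeExcitation` is chosen for illustration and is not a `torch.nn` module.

```python
import torch
from torch import nn

class SqueezeExcitation(nn.Module):
    """Reweights channels using a gate computed from globally pooled features."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: (N, C, H, W) -> (N, C, 1, 1)
        self.gate = nn.Sequential(           # excitation: per-channel weights in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        weights = self.gate(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * weights

x = torch.randn(2, 10, 16, 16)
se_block = SqueezeExcitation(channels=10)
se_out = se_block(x)
print(se_out.shape)
```

A block like this is typically placed after a convolutional block, leaving the spatial shape unchanged.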

**For variants 13-15:** Experiment with different types of activation functions.

Instead of using the ReLU activation function in your model from Task 2, try replacing it with the following activation functions:
* Leaky ReLU
* ELU (Exponential Linear Unit)
* Swish (available in PyTorch as `nn.SiLU`)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required activation functions from `torch.nn`.
* Modify the existing model architecture by replacing the ReLU activation function with the new activation functions.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different activation functions.
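
Passing the activation as an argument keeps the rest of the architecture identical across experiments. The sketch below assumes a single convolutional block with 10 hidden units; `make_block` is a hypothetical helper.

```python
import torch
from torch import nn

def make_block(activation):
    """Conv block whose activation is swappable (layer sizes are assumptions)."""
    return nn.Sequential(
        nn.Conv2d(3, 10, kernel_size=3, padding=1),
        activation,
        nn.MaxPool2d(2),
    )

activations = {
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "elu": nn.ELU(alpha=1.0),
    "swish": nn.SiLU(),  # Swish with beta = 1, i.e. x * sigmoid(x)
}

x = torch.randn(1, 3, 64, 64)
block_outputs = {name: make_block(act)(x) for name, act in activations.items()}
for name, out in block_outputs.items():
    print(name, out.shape)
```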

**For variants 16-18:** Experiment with different types of recurrent layers.

Instead of using a feed-forward neural network architecture in your model from Task 2, try incorporating the following recurrent layer types:
* Long Short-Term Memory (LSTM)
* Gated Recurrent Unit (GRU)
* Bidirectional LSTM

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required recurrent layers from `torch.nn`.
* Modify the existing model architecture by replacing the feed-forward layers with the recurrent layer types.
* Preprocess the input data to be suitable for the recurrent layers (e.g., sequences of images or text).
* Train the modified model for 20 epochs using the preprocessed data and appropriate hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different recurrent layer types.
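
Since the dataset consists of single images rather than sequences, one possible preprocessing choice (an assumption of this sketch, not a requirement of the task) is to treat each image row as one time step. The class `RowSequenceClassifier` and its sizes are hypothetical.

```python
import torch
from torch import nn

class RowSequenceClassifier(nn.Module):
    """Reads a (C, H, W) image as a sequence of H rows, each with C*W features."""
    def __init__(self, rnn_type="lstm", bidirectional=False,
                 channels=3, width=64, hidden=32, num_classes=3):
        super().__init__()
        rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
        self.rnn = rnn_cls(channels * width, hidden,
                           batch_first=True, bidirectional=bidirectional)
        # A bidirectional RNN concatenates forward and backward hidden states
        self.head = nn.Linear(hidden * (2 if bidirectional else 1), num_classes)

    def forward(self, x):
        n, c, h, w = x.shape
        seq = x.permute(0, 2, 1, 3).reshape(n, h, c * w)  # (N, H, C*W)
        out, _ = self.rnn(seq)                            # (N, H, hidden * directions)
        return self.head(out.mean(dim=1))                 # average over time steps

x = torch.randn(2, 3, 64, 64)
lstm_logits = RowSequenceClassifier("lstm")(x)
gru_logits = RowSequenceClassifier("gru")(x)
bilstm_logits = RowSequenceClassifier("lstm", bidirectional=True)(x)
print(lstm_logits.shape, gru_logits.shape, bilstm_logits.shape)
```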

**For variants 19-21:** Experiment with different types of skip connections.

Instead of using a standard feed-forward architecture in your model from Task 2, try incorporating the following skip connection types:
* Residual Connections (as in ResNet)
* Dense Connections (as in DenseNet)
* Inception Modules (as in GoogLeNet/Inception)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required modules or implement the skip connection types manually.
* Modify the existing model architecture by incorporating the skip connection types at appropriate locations.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different skip connection types.
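
A residual connection, the simplest of the three, adds the block's input back to its output. The sketch below assumes the channel count and spatial size are preserved so that the addition is valid; `ResidualBlock` is a hypothetical name.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Computes relu(x + F(x)), where F is two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # skip connection

x = torch.randn(2, 10, 16, 16)
res_block = ResidualBlock(10)
res_out = res_block(x)
print(res_out.shape)
```

A dense connection would instead concatenate the input and output along the channel dimension (`torch.cat([x, self.body(x)], dim=1)`), which increases the channel count for the following layer.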

**For variants 22-24:** Experiment with different types of multi-head attention layers.

Instead of using standard attention mechanisms in your model from Task 2 (if applicable), try incorporating the following multi-head attention layer types:
* Scaled Dot-Product Attention
* Multi-Head Attention (as in Transformers)
* Convolutional Multi-Head Attention

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required multi-head attention layers from PyTorch or implement them manually.
* Modify the existing model architecture by incorporating the multi-head attention layer types at appropriate locations.
* Train the modified model for 20 epochs using the same data and hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different multi-head attention layer types.
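
To apply attention to image features, the spatial positions of a feature map can be flattened into a sequence of tokens. The sketch below shows a manual scaled dot-product attention and `nn.MultiheadAttention` on such tokens; the feature map size (16 channels, 8x8) and the number of heads are assumptions.

```python
import math
import torch
from torch import nn

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v, returning the output and the attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

features = torch.randn(2, 16, 8, 8)            # assumed CNN feature map (N, C, H, W)
tokens = features.flatten(2).transpose(1, 2)   # (N, H*W, C): one token per position

# Single-head self-attention implemented manually
sdp_out, sdp_weights = scaled_dot_product_attention(tokens, tokens, tokens)

# Multi-head self-attention from torch.nn (embed_dim must be divisible by num_heads)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
mha_out, mha_weights = mha(tokens, tokens, tokens)
print(sdp_out.shape, mha_out.shape, mha_weights.shape)
```

The attended tokens can be reshaped back to `(N, C, H, W)` with `mha_out.transpose(1, 2).reshape(2, 16, 8, 8)` before passing them to later convolutional layers.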

**For variants 25-27:** Experiment with different types of generative layers.

Instead of using a discriminative model architecture in your model from Task 2, try incorporating the following generative layer types:
* Variational Autoencoder (VAE)
* Generative Adversarial Network (GAN)
* Autoregressive Layers (as in PixelCNN or WaveNet)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy (if applicable).
* Import the required generative layers from PyTorch or implement them manually.
* Modify the existing model architecture by incorporating the generative layer types at appropriate locations.
* Preprocess the data to be suitable for the generative model (e.g., images or text sequences).
* Train the modified model for 20 epochs using the preprocessed data and appropriate hyperparameters.
* Record the training and testing loss and accuracy (if applicable).
* Analyze and compare the results with the original model to determine the impact of the different generative layer types.

**For variants 28-30:** Experiment with different types of transformer layers.

Instead of using standard convolutional or recurrent layers in your model from Task 2, try incorporating the following transformer layer types:
* Encoder-Decoder Transformer (as in Machine Translation models)
* Vision Transformer (ViT)
* Perceiver (a unified transformer architecture for various data modalities)

Train the modified model for 20 epochs and observe the impact on the training and testing loss and accuracy.
* Import the required transformer layers from PyTorch or implement them manually.
* Modify the existing model architecture by incorporating the transformer layer types at appropriate locations.
* Preprocess the data to be suitable for the transformer layers (e.g., sequences of images or text).
* Train the modified model for 20 epochs using the preprocessed data and appropriate hyperparameters.
* Record the training and testing loss and accuracy.
* Analyze and compare the results with the original model to determine the impact of the different transformer layer types.

It looks like the model might be overfitting, even when changing the number of hidden units.

To fix this, we'd have to look at ways to prevent overfitting with our model.

<a class="anchor" id="5.6"></a>

## <span style="color:blue; font-size:1em;"> Task 6. Conducting experiments with the data</span>

[Go back to the content](#5)

**For variants 1-3:** Examine the impact of different data augmentation techniques on model robustness and accuracy.
* Implement various data augmentation strategies using PyTorch's `torchvision.transforms` module. Common techniques include random rotations, horizontal flipping, vertical flipping, random crops, color jitters (adjusting brightness, contrast, and saturation), and adding noise.
* Integrate these transformations into the data loading pipeline so that each image is randomly transformed during training but remains unchanged during validation and testing. This can be achieved by defining separate transform chains for training and testing datasets using `transforms.Compose`.
* Train the model using the augmented dataset for a fixed number of epochs and compare the performance (accuracy and loss metrics) against the same model trained on non-augmented data. Analyze how different augmentation techniques influence the learning process, convergence behavior, and final model accuracy.

**For variants 4-6:** Investigate the effects of different scaling and normalization techniques on model training dynamics and performance.
* Experiment with various feature scaling and normalization methods, such as Min-Max scaling, Z-score normalization (standardization), and L2 normalization. These techniques adjust the range and distribution of feature values, which can significantly impact the convergence rate and stability of gradient descent algorithms.
* Use PyTorch’s `transforms.Normalize` for image data to normalize pixel values based on the mean and standard deviation of the channels across the training set.
* Train your model with each method and monitor how each affects the speed of convergence during training, the stability of training (variance in loss over epochs), and the accuracy on a validation set.
* Determine the optimal pre-processing strategy for your dataset. This matters because neural networks are sensitive to the scale of their input data.
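
The per-channel statistics needed for standardization can be computed directly from a batch of training tensors. The sketch below uses random data as a stand-in for the training images; the helper names are hypothetical.

```python
import torch

def channel_mean_std(images):
    """Per-channel mean and std of a (N, C, H, W) tensor of images."""
    return images.mean(dim=(0, 2, 3)), images.std(dim=(0, 2, 3))

def standardize(images, mean, std):
    """Z-score normalization; the same arithmetic transforms.Normalize applies."""
    return (images - mean[None, :, None, None]) / std[None, :, None, None]

def min_max_scale(images):
    """Rescale all values to the range [0, 1]."""
    return (images - images.min()) / (images.max() - images.min())

torch.manual_seed(42)
train_images = torch.rand(16, 3, 8, 8) * 0.5 + 0.2  # placeholder training batch

mean, std = channel_mean_std(train_images)
standardized = standardize(train_images, mean, std)
scaled = min_max_scale(train_images)
print(mean, std)
```

The resulting values can be passed to `transforms.Normalize(mean=mean.tolist(), std=std.tolist())`. They should be computed on the training set only and then reused for the testing data.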

**For variants 7-9:** Assess the impact of adding synthetic data to the training dataset on model performance.
* Use generative techniques, such as a simple Generative Adversarial Network (GAN), to create additional synthetic training data. For the food classification task, generate images that resemble the existing data categories but with variations not present in the original dataset.
* Integrate this synthetic data into the training set to see if it helps in improving model robustness and filling gaps in the data distribution, especially for underrepresented classes or features.
* Train the model on a mix of real and synthetic data and compare the results with training solely on real data. Evaluate the model based on its accuracy, recall, and precision on a balanced validation set.
* Analyze whether the synthetic data provides a meaningful diversity that aids the model during training, particularly checking for overfitting or underfitting scenarios.

**For variants 10-12:** Explore the effects of training the model on different subsets of the data, focusing on diversity and representation.
* Create multiple subsets of the original dataset based on different criteria, such as data recency, image quality, and the presence of specific features or classes. For example, train separate models on high-resolution images versus low-resolution images.
* Train the model on these different subsets to determine how each subset's characteristics influence learning dynamics, model bias, and performance on a general test set.
* This experiment can reveal dependencies on specific data characteristics and help in designing more effective data collection and preprocessing strategies to improve model performance and fairness.

**For variants 13-15:** Test the model's robustness against noisy data and its ability to generalize under less ideal conditions.
* Introduce various types of noise to the training data, such as Gaussian noise, salt-and-pepper noise, or occlusions (e.g., random black boxes on images). This can simulate real-world imperfections in data that the model might encounter.
* Train the model with increasing levels of noise and monitor how its performance on validation data changes with noise intensity.
* Test the model's resilience to data corruption and its ability to extract useful information from degraded inputs. This test is particularly valuable in applications where data quality cannot be consistently guaranteed.
* Evaluate not just overall accuracy, but also how sensitivity and specificity change with noise levels. This will help understand if the model is still reliable for certain classes more than others when the data quality deteriorates.
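
Noise can be added with small custom transforms applied after `transforms.ToTensor()`. The sketch below assumes image tensors of shape `(C, H, W)` with values in `[0, 1]`; `AddGaussianNoise` and `salt_and_pepper` are hypothetical helpers, not torchvision functions.

```python
import torch

class AddGaussianNoise:
    """Callable transform that adds zero-mean Gaussian noise to an image tensor."""
    def __init__(self, std=0.1):
        self.std = std

    def __call__(self, img):
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

def salt_and_pepper(img, amount=0.05):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    mask = torch.rand(img.shape[-2:])        # one random value per pixel
    out = img.clone()
    out[:, mask < amount / 2] = 0.0          # pepper
    out[:, mask > 1 - amount / 2] = 1.0      # salt
    return out

torch.manual_seed(0)
clean = torch.full((3, 32, 32), 0.5)         # placeholder image
gaussian_noisy = AddGaussianNoise(std=0.1)(clean)
sp_noisy = salt_and_pepper(clean, amount=0.2)
print(gaussian_noisy.std().item(), (sp_noisy != 0.5).float().mean().item())
```

Increasing `std` or `amount` across experiments gives the series of noise levels described above.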

**For variants 16-18:** Investigate the effects of class imbalance on model performance and explore strategies to mitigate its impact.
* Simulate varying degrees of class imbalance by artificially reducing the representation of certain classes in the training data. For instance, reduce the number of images for one or more classes by 50%, 75%, and 90%.
* Train the model on these imbalanced datasets and observe how the imbalance affects model accuracy, precision, recall, and F1 score, particularly for the underrepresented classes.
* Implement and compare various techniques to handle imbalance, such as oversampling the minority class, undersampling the majority class, or using class-weighted or balanced loss functions like weighted cross-entropy.
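
Two of the mitigation techniques above, a class-weighted loss and oversampling, can be sketched as follows. The label distribution (90 samples of class 0, 10 of class 1) is an invented example used only to make the weights easy to check.

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(42)
labels = torch.tensor([0] * 90 + [1] * 10)   # invented imbalanced labels

# Class-weighted loss: weights inversely proportional to class frequency
counts = torch.bincount(labels).float()
class_weights = counts.sum() / (len(counts) * counts)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# Oversampling: each sample is drawn with probability inverse to its class size
sample_weights = (1.0 / counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

drawn_labels = labels[torch.tensor(list(sampler))]
print(class_weights, torch.bincount(drawn_labels, minlength=2))
```

The sampler is passed to `DataLoader(..., sampler=sampler)`. When a sampler is given, `shuffle` must be left at its default of `False`.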

**For variants 19-21:** Examine the impact of augmenting training data with external datasets on model generalization.
* Augment the original Food101 dataset with additional data from similar but external sources, such as images from open datasets like ImageNet or specialized food datasets that might include dishes not represented in the original dataset.
* Ensure that the external data is preprocessed and formatted to match the original dataset in terms of image size, scaling, and color channels.
* Train the model on this augmented dataset and evaluate changes in its ability to generalize across a broader range of food categories, particularly looking at performance metrics on a separate, diverse test set.

**For variants 22-24:** Determine how variations in image quality affect model training and performance.
* Create multiple versions of your dataset with different levels of image quality. This could involve altering image resolution, adding noise, and varying lighting conditions.
* Use PyTorch's `torchvision.transforms` to simulate these quality variations dynamically during data loading, which allows for a scalable approach to manipulating image properties.
* Train your model on these datasets and monitor how changes in image quality impact the training speed, convergence behavior, and accuracy on a validation set.

**For variants 25-27:** Explore the effect of the order in which data is presented to the model during training.
* Implement two data feeding strategies: sequential, where the dataset is sorted based on certain criteria like class labels or image complexity before training, and random, where data order is shuffled for each epoch.
* Use the `shuffle`, `sampler`, and `batch_sampler` parameters of PyTorch's `DataLoader` to control shuffling and batch sampling for these strategies.
* Evaluate how each strategy affects the learning dynamics and model performance, particularly focusing on whether a particular ordering leads to faster learning or better generalization.
* Test the hypothesis that certain sequences of data presentation might prime the model more effectively, potentially leading to better or faster learning outcomes.
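
The two feeding strategies can be set up as in the sketch below, which sorts by class label as one possible ordering criterion. The tiny `TensorDataset` is a placeholder for the real image dataset.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

torch.manual_seed(42)
features = torch.randn(8, 3, 4, 4)                 # placeholder images
labels = torch.tensor([2, 0, 1, 0, 2, 1, 1, 0])    # placeholder labels
dataset = TensorDataset(features, labels)

# Sequential strategy: sort sample indices by label, then never shuffle
sorted_indices = torch.argsort(labels).tolist()
sequential_loader = DataLoader(Subset(dataset, sorted_indices), batch_size=4, shuffle=False)

# Random strategy: a new order is drawn at the start of every epoch
random_loader = DataLoader(dataset, batch_size=4, shuffle=True)

sequential_labels = torch.cat([y for _, y in sequential_loader])
random_labels = torch.cat([y for _, y in random_loader])
print(sequential_labels.tolist(), random_labels.tolist())
```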

**For variants 28-30:** Assess the impact of advanced feature engineering and data transformation techniques on model performance.
* Beyond basic image preprocessing, apply advanced feature engineering techniques such as PCA for dimensionality reduction, edge detection filters, or Fourier transforms to transform the input data.
* Integrate these transformations into the PyTorch data pipeline using custom `torchvision.transforms` functions or by modifying the dataset class to preprocess data before training.
* Train the model on this transformed dataset and compare the results with training on the original data. Look for changes in accuracy, training efficiency, and model interpretability.

In [None]:
# Download 20% data for Pizza/Steak/Sushi from GitHub
import requests
import zipfile
from pathlib import Path

# Setup path to data folder
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi_20_percent"

# If the image folder doesn't exist, download it and prepare it...
if image_path.is_dir():
    print(f"{image_path} directory exists.")
else:
    print(f"Did not find {image_path} directory, creating one...")
    image_path.mkdir(parents=True, exist_ok=True)

    # Download pizza, steak, sushi data
    # (use the "raw" URL: the "blob" URL returns an HTML page, not the zip file)
    with open(data_path / "pizza_steak_sushi_20_percent.zip", "wb") as f:
        request = requests.get("https://github.com/radiukpavlo/applied-math-packages/raw/main/data/pizza_steak_sushi_20_percent.zip")
        print("Downloading pizza, steak, sushi 20% data...")
        f.write(request.content)

    # Unzip pizza, steak, sushi data
    with zipfile.ZipFile(data_path / "pizza_steak_sushi_20_percent.zip", "r") as zip_ref:
        print("Unzipping pizza, steak, sushi 20% data...")
        zip_ref.extractall(image_path)

In [None]:
# See how many images we have
walk_through_dir(image_path)

Excellent, we now have double the training and testing images... 

In [None]:
# Create the train and test paths
train_data_20_percent_path = image_path / "train"
test_data_20_percent_path = image_path / "test"

train_data_20_percent_path, test_data_20_percent_path

In [None]:
# Turn the 20 percent datapaths into Datasets and DataLoaders
from torchvision.datasets import ImageFolder
from torchvision import transforms
from torch.utils.data import DataLoader

simple_transform = transforms.Compose([
  transforms.Resize((64, 64)),                                     
  transforms.ToTensor()
])

# Create datasets


# Create dataloaders


In [None]:
# Train a model with increased amount of data
torch.manual_seed(42)
torch.cuda.manual_seed(42)

<a class="anchor" id="5.7"></a>

## <span style="color:blue; font-size:1em;"> Task 7. Making predictions</span>

[Go back to the content](#5)

**For variants 1-3:** Utilize an ensemble of models to improve prediction accuracy and robustness.
* Instead of relying on a single model, train several models (potentially with different architectures or hyperparameters) on the same dataset. For instance, train variations of your main model with different layers, activation functions, or trained with different data augmentation strategies.
* Use a voting system among the models in the ensemble. Each model provides a prediction, and the final output is decided by majority vote. Alternatively, use a weighted approach where models that showed higher accuracy on the validation set have more influence on the final prediction.
* Evaluate the ensemble's performance by testing it on a new set of images that were not part of the training or validation datasets. Compare its accuracy and reliability against the single-model approach.
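
Weighted soft voting, one of the combination rules mentioned above, can be sketched as follows. The three small linear models and the validation accuracies used as weights are placeholders; `ensemble_predict` is a hypothetical helper.

```python
import torch
from torch import nn

def ensemble_predict(models, x, weights=None):
    """Average the class probabilities of several models (optionally weighted)."""
    if weights is None:
        weights = torch.ones(len(models))
    weights = weights / weights.sum()            # normalize so the weights sum to 1
    with torch.inference_mode():
        probs = torch.stack([torch.softmax(m.eval()(x), dim=1) for m in models])
    avg_probs = (probs * weights.view(-1, 1, 1)).sum(dim=0)   # (batch, classes)
    return avg_probs.argmax(dim=1), avg_probs

torch.manual_seed(42)
models = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3)) for _ in range(3)]
images = torch.rand(4, 3, 8, 8)                  # placeholder batch

val_accuracies = torch.tensor([0.6, 0.7, 0.8])   # assumed validation accuracies
ens_preds, ens_probs = ensemble_predict(models, images, weights=val_accuracies)
print(ens_preds, ens_probs.sum(dim=1))
```

Calling `ensemble_predict(models, images)` without weights gives plain probability averaging. Majority voting would instead take the most frequent `argmax` across the models.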

**For variants 4-6:** Implement a system for making real-time predictions as new image data is streamed to the model.
* Set up a simulated streaming environment where new images (e.g., from a live camera feed at a restaurant) are continuously fed into the model. This could involve integrating PyTorch with a web application or a database that periodically updates with new images.
* Use PyTorch's `DataLoader` to handle incoming data streams effectively (for example, by wrapping the stream in a `torch.utils.data.IterableDataset`). Ensure that the images are preprocessed and normalized in the same way as the training data before they are passed to the model for prediction.
* Implement performance metrics to evaluate the model's prediction speed and accuracy in real-time. This might include measuring the latency between receiving an image and outputting a prediction, as well as tracking the model's accuracy over time as more data is processed.

**For variants 7-9:** Adapt the trained model to make predictions on images from a different but related domain.
* Suppose the original model is trained on high-quality images of food. To adapt this model to work with lower-quality images (e.g., images from social media), implement techniques like fine-tuning or domain adaptation.
* Explore advanced domain adaptation techniques that can bridge the gap between different data distributions, such as feature-level adaptation, where the internal representations of images from both domains are encouraged to be similar.
* Evaluate the adapted model on a validation set from the new domain to measure how well it has adjusted to the new types of images. This evaluation helps in understanding the effectiveness of the adaptation process and whether further tuning or more sophisticated adaptation techniques are required.

**For variants 10-12:** Implement methods to estimate and communicate the uncertainty in the model's predictions.
* Incorporate Bayesian methods or Monte Carlo dropout into the prediction process to estimate the confidence of the model in its output. This involves modifying the model to include dropout layers during both training and testing, and running multiple forward passes with dropout enabled to get a distribution of outputs.
* Use these distributions to calculate confidence intervals or probabilities that reflect how certain the model is about its predictions. For example, a high variance in the outputs could indicate low confidence, prompting the system to request additional data or human verification.
* Integrate this uncertainty estimation into the prediction output, providing end-users or downstream systems with not just a classification result, but also a measure of reliability. This can be crucial in applications where decisions based on these predictions carry significant consequences.
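
Monte Carlo dropout can be sketched by keeping only the dropout layers in training mode at prediction time. The model below is a placeholder with one `nn.Dropout` layer, and the number of passes is an assumption; `mc_dropout_predict` is a hypothetical helper.

```python
import torch
from torch import nn

def mc_dropout_predict(model, x, n_passes=20):
    """Run several stochastic forward passes; return mean and std of the probabilities."""
    model.eval()                                  # e.g. freezes batch norm statistics
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()                        # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_passes)])
    return probs.mean(dim=0), probs.std(dim=0)

torch.manual_seed(42)
mc_model = nn.Sequential(                         # placeholder classifier
    nn.Flatten(),
    nn.Linear(3 * 8 * 8, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(32, 3),
)
images = torch.rand(4, 3, 8, 8)
mean_probs, std_probs = mc_dropout_predict(mc_model, images)
print(mean_probs.argmax(dim=1), std_probs.max(dim=1).values)
```

A large standard deviation for the predicted class indicates that the passes disagree, which can be used to flag a prediction for human verification.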

**For variants 13-15:** Test the model's robustness to adversarial examples and implement strategies to mitigate potential vulnerabilities.
* Generate adversarial examples using techniques like FGSM (Fast Gradient Sign Method) or more sophisticated methods available in libraries like Foolbox. These examples are designed to fool the model into making incorrect predictions, highlighting potential vulnerabilities in its training.
* Test the model's ability to correctly classify these adversarially perturbed images. Analyze the types of errors it makes, which could indicate specific weaknesses in the model’s understanding of the input data.
* Implement defensive strategies against such adversarial attacks. These could include adversarial training, where the model is trained on a mix of normal and adversarial examples to improve its robustness, or using model architectures that are inherently more robust to adversarial perturbations.
* Evaluate the model’s performance after incorporating these defenses, particularly focusing on maintaining or improving accuracy on both standard and adversarially modified test sets.
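
FGSM perturbs each pixel by a fixed step `epsilon` in the direction that increases the loss. The sketch below implements it manually for inputs in `[0, 1]`; the linear model, the random batch, and `epsilon=0.03` are placeholders, and `fgsm_attack` is a hypothetical helper rather than a library function.

```python
import torch
from torch import nn

def fgsm_attack(model, loss_fn, images, labels, epsilon=0.03):
    """Return images + epsilon * sign(gradient of the loss w.r.t. the images)."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), labels)
    model.zero_grad()
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()   # keep pixel values valid

torch.manual_seed(42)
victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3)).eval()  # placeholder model
images = torch.rand(4, 3, 8, 8)
labels = torch.tensor([0, 1, 2, 0])

adv_images = fgsm_attack(victim, nn.CrossEntropyLoss(), images, labels, epsilon=0.03)
with torch.no_grad():
    clean_loss = nn.CrossEntropyLoss()(victim(images), labels).item()
    adv_loss = nn.CrossEntropyLoss()(victim(adv_images), labels).item()
print(clean_loss, adv_loss)
```

For adversarial training, batches produced this way can be mixed with the clean batches inside the training loop.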

**For variants 16-18:** Use a fusion of different models to enhance prediction accuracy and reliability.
* Develop a system where multiple models trained independently (e.g., using different architectures or subsets of data) contribute to a final decision. This could involve simple techniques like model averaging or more complex strategies like stacking, where a new model learns how to best combine the outputs of the individual models.
* Implement this by training several models on the same dataset, or different segments of it, ensuring diversity in the learning process. Each model might specialize in different features of the data, improving overall predictive performance when combined.
* For prediction, input the same image into all models and aggregate their predictions. Depending on the approach, this could mean taking the majority vote (voting ensemble) or weighting predictions based on the confidence or historical accuracy of each model (weighted average).
* Evaluate this system by comparing its accuracy, precision, and recall against those of individual models. Check if the ensemble reduces overfitting and provides more stable and reliable predictions across a diverse set of test images.

**For variants 19-21:** Implement attention mechanisms to improve model predictions by focusing on relevant parts of the image.
* Modify the existing model architecture to include attention layers, which help the model focus on areas of the image that are more relevant for making a prediction. This is particularly useful for complex images where not all parts of the image are equally informative.
* Techniques like the Self-Attention mechanism or Transformer models can be incorporated into traditional CNN architectures to enhance their ability to localize and emphasize important features without the need for additional input or segmentation maps.
* Train the modified model on the dataset, ensuring that the attention mechanism is properly integrated and contributing positively to the model's learning process.
* For prediction, visualize the attention maps generated by the model to understand which parts of the image are being focused on. This not only aids in prediction but also provides insights into the model's decision-making process, which can be crucial for applications requiring high levels of trust and interpretability.

**For variants 22-24:** Implement dynamic thresholding to handle uncertain predictions more effectively.
* Instead of using a fixed threshold (e.g., 0.5 in binary classification) to decide the class of an image, use a dynamic threshold that adjusts based on the confidence of the model’s predictions or the specific requirements of the application.
* Develop a method to calculate the threshold dynamically, perhaps based on the distribution of prediction probabilities seen in the validation data. For example, the threshold could be set at the 90th percentile of confidence scores for predicted classes.
* Integrate this dynamic thresholding into the prediction pipeline, applying it to decide the final class labels for new images.
* Evaluate how this approach affects the number of uncertain predictions, the overall accuracy, and the balance between precision and recall.
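
One way to realize a percentile-based threshold is sketched below. Predictions whose confidence falls under the threshold are marked as uncertain with the label `-1`, which is a convention assumed here; the validation confidences and probabilities are invented values, and both helper functions are hypothetical.

```python
import torch

def percentile_threshold(val_confidences, q=0.9):
    """Threshold at the q-th quantile of the validation confidence scores."""
    return torch.quantile(val_confidences, q).item()

def predict_with_rejection(probs, threshold):
    """Return the predicted class, or -1 where the confidence is below the threshold."""
    confidence, predicted = probs.max(dim=1)
    predicted = predicted.clone()
    predicted[confidence < threshold] = -1
    return predicted

val_confidences = torch.linspace(0.0, 1.0, 11)        # invented validation confidences
threshold = percentile_threshold(val_confidences, q=0.9)

probs = torch.tensor([[0.95, 0.03, 0.02],             # confident prediction
                      [0.40, 0.35, 0.25]])            # uncertain prediction
thresholded_preds = predict_with_rejection(probs, threshold)
print(threshold, thresholded_preds)
```

A high quantile rejects most predictions, so `q` is worth varying when evaluating the trade-off between coverage and accuracy.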

**For variants 25-27:** Establish a real-time feedback system for continuously improving the model based on new predictions and user feedback.
* Set up a mechanism where users can provide immediate feedback on the model’s predictions (e.g., correct/incorrect). Use this feedback to dynamically adjust the model, either through direct model updates or by periodically retraining the model with the new data.
* Implement a lightweight version of the model for quick updates and integrate an online learning protocol where the model can update its parameters in real-time based on user feedback.
* Monitor and analyze the impact of this continuous learning approach on model performance over time, especially how quickly the model adapts to new patterns or corrections in its predictions.

**For variants 28-30:** Quantify the uncertainty in the model’s predictions to better assess risks and confidences.
* Convert the existing model into a Bayesian framework where instead of having fixed weights, the model maintains a distribution over possible weights, reflecting uncertainty in its predictions.
* Use techniques like Monte Carlo Dropout to approximate Bayesian inference, performing multiple forward passes through the network with dropout enabled at inference time to generate a distribution of outputs for each input image.
* Analyze the variance and other statistical properties of the outputs to estimate the uncertainty associated with each prediction.
* Implement this uncertainty quantification as part of the prediction output, providing end-users with not just a categorical result but also a measure of confidence in that result.
* Evaluate how effectively this Bayesian approach mitigates overconfident errors, improves decision-making processes, and aligns with the operational needs of applications where understanding uncertainty is critical.