# Coursework 1 - Mathematics for Machine Learning

## CID: 01843211

**Colab link:** insert colab link here

***
***

## Part 1: Quickfire questions [3 points]

#### Question 1 (True risk / Empirical risk):

Let $(\mathbf{x},\mathbf{y})$ be sampled from the data-generating distribution $\mathcal{D}$. Let $f$ be a classifier in function space $\hat{\mathcal{F}}$ and let $L$ be a loss function which is a metric or pseudo-metric. Then the risk $R$ corresponding to $f$ is defined as $$R(f)=\mathbb{E}_{\mathcal{D}}[L(f(\mathbf{x}), \mathbf{y})]$$


This definition assumes that $(\mathbf{x}, \mathbf{y})$ are drawn from the true underlying distribution $\mathcal{D}$, making $R(f)$ a measure of how well $f$ is expected to perform on the entire data space, not just the observed samples.

In statistical learning we seek an $\hat{f}\in \hat{\mathcal{F}}$ which minimises this true risk, i.e.
$$\hat{f}\in\text{arg min}_{f\in\hat{\mathcal{F}}}R(f)$$

However, since $\mathcal{D}$ is unknown in real-world scenarios, we cannot compute $R(f)$ directly. Instead, we approximate it using the empirical risk $\hat{R}(f)$, which is calculated based on the observed dataset $D$:


$$\hat{R}(f)=\frac{1}{n}\sum_{i=1}^n L(f(\mathbf{x}^i), \mathbf{y}^i)$$

For large datasets, this is approximately equal to the true risk and allows us to learn a function

$$f^*=\text{arg min}_{f\in\mathcal{F}}\hat{R}(f)$$

where $\mathcal{F}$ is a choice of function class that we believe is likely to contain the true function, or a suitable approximation thereof. If $\mathcal{F}$ is well-chosen this will approximate the $\hat{f}$ defined above.
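
As a minimal sketch of empirical risk minimisation, the snippet below evaluates $\hat{R}(f)$ on a toy sample and picks the minimiser from a tiny function class. The 0-1 loss, the toy data and the two threshold classifiers are assumptions made for illustration; they are not part of the coursework.

```python
def zero_one_loss(prediction, label):
    """0-1 loss (the discrete metric): 0 if the prediction is right, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def empirical_risk(f, xs, ys, loss=zero_one_loss):
    """Average loss of f over the observed sample {(x^i, y^i)}."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)


# Toy sample (hypothetical): the label is 1 exactly when x is positive.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 1, 1]

# A tiny function class F with two candidate classifiers.
candidates = {
    "threshold_at_0": lambda x: int(x > 0),
    "threshold_at_1": lambda x: int(x > 1),
}

# Empirical risk of each candidate, and the empirical risk minimiser f*.
risks = {name: empirical_risk(f, xs, ys) for name, f in candidates.items()}
best = min(risks, key=risks.get)
print(risks, best)
```

Here the minimiser is selected purely from the observed sample; whether it also has low true risk depends on how representative that sample is.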

#### Question 2 ('Large' or 'rich' hypothesis class):

A rich hypothesis class $\mathcal{F}$ allows the learned function to drive the training error very low. For example, consider a dataset of $n$ points with distinct inputs: if the function class is all polynomials of degree at most $n-1$, then the learned function can perfectly interpolate every point in the dataset.

However, this can be undesirable, as variations in the data are often due to random noise. We instead seek a simpler function which will *generalise* better, and thus minimise the *generalisation error*, which is exactly the difference between $f^*$ and $\hat{f}$ above.

Linear regression is an example of a basic function class. We approximate relationships with linear maps from the feature space to the output space. This performs well on simple relationships and exhibits low variance, but can be biased when the true relationship is more complex, e.g. polynomial.

On the other hand, neural networks are an example of a very rich function class that can capture much more complex relationships. However, when the true relationship is a simple linear map they are prone to overfitting, with high variance, and can therefore exhibit high generalisation error.
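
The contrast between a rich and a basic function class can be sketched numerically. The example below fits noisy linear data with the degree $n-1$ interpolating polynomial and with a least-squares line. The data-generating function, noise level and helper names are assumptions for illustration only.

```python
import random


def lagrange_interpolant(xs, ys):
    """Return the unique polynomial of degree <= n-1 through the n points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p


def least_squares_line(xs, ys):
    """Return the ordinary least-squares line fitted to the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(f, xs, ys):
    """Mean squared error of f on the given points."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
true_f = lambda x: 2.0 * x + 1.0                        # assumed ground truth
train_x = [i / 7 for i in range(8)]                     # 8 training inputs on [0, 1]
train_y = [true_f(x) + rng.gauss(0, 0.3) for x in train_x]
test_x = [(i + 0.5) / 7 for i in range(7)]              # held-out inputs between them
test_y = [true_f(x) + rng.gauss(0, 0.3) for x in test_x]

poly = lagrange_interpolant(train_x, train_y)           # rich class: degree 7
line = least_squares_line(train_x, train_y)             # basic class: degree 1

print("train MSE  poly:", mse(poly, train_x, train_y), " line:", mse(line, train_x, train_y))
print("test MSE   poly:", mse(poly, test_x, test_y), " line:", mse(line, test_x, test_y))
```

The interpolant reaches zero training error by construction, while the line does not. Comparing the two test errors shows how much of that training fit carries over to unseen points.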

#### Question 3 (Dataset splitting):

In the case that validation data is drawn from the same distribution as training data, it would be fair to assume that model performance on unseen data is similar to that on the validation data.

However, there are exceptions to this rule. Consider a time series generated by a non-stationary model, for example the S&P 500 closing price. If the training data is from 2023 and the validation data is from Jan/Feb 2024, it would be unreasonable to assume that the model will perform as well when evaluated on the remainder of the 2024 data.
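
A toy sketch of this effect is given below. It uses a synthetic series with linear drift rather than real market data, and a deliberately naive constant predictor. The series, the window sizes and the model are all assumptions chosen only to make the point visible.

```python
import random

rng = random.Random(0)

# Synthetic non-stationary series (assumed for illustration): linear drift plus noise.
series = [0.05 * t + rng.gauss(0, 0.5) for t in range(300)]

train = series[:200]           # stands in for the training period
validation = series[200:220]   # short window immediately after training
later = series[220:]           # the remainder of the period the model is used on

# Deliberately naive model: always predict the training mean.
prediction = sum(train) / len(train)


def mean_abs_error(values, pred):
    """Mean absolute error of a constant prediction."""
    return sum(abs(v - pred) for v in values) / len(values)


val_mae = mean_abs_error(validation, prediction)
later_mae = mean_abs_error(later, prediction)
print(f"validation MAE: {val_mae:.2f}, later MAE: {later_mae:.2f}")
```

Because the distribution keeps drifting, the error on the short validation window understates the error on the later data.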

#### Question 4 (Occam’s razor):

---

Occam's Razor suggests that among competing hypotheses that predict equally well, the simplest one should be selected. This simplicity is often equated with a model or explanation that makes the fewest assumptions.

In machine learning, Occam's Razor is interpreted as a guideline for model selection. The principle advises choosing the simplest model that adequately fits the data. This approach is grounded in the idea that simpler models are less likely to overfit the training data. Overfitting occurs when a model captures noise or random fluctuations in the training set rather than the underlying distribution, leading to poor generalisation to new, unseen data.

When dealing with naturally occurring data, such as images, the application of Occam's Razor becomes particularly relevant. Images are high-dimensional data with complex, intricate structures. Models trained on image data, such as convolutional neural networks (CNNs), can easily become overly complex, with millions of parameters capable of fitting the training data very closely.

Applying Occam's Razor in this context means preferring simpler models that still capture the essential patterns in the image data without memorising specific details. This simplicity helps in generalising better to unseen images by focusing on broader, more universal features rather than specifics of the training set.

For image data, a simpler model according to Occam's Razor would still need to be complex enough to handle the data's inherent complexity but not so complex that it learns the noise or irrelevant details. This balance helps in achieving good performance on new images that were not part of the training process.

Simpler models are often more interpretable and computationally efficient. This efficiency is crucial for deploying models in real-world applications where computational resources may be limited, and understanding model predictions is important for trust and transparency.

#### Question 5 (Generalisation error):


The generalisation error of a model quantifies how well the model performs on new, unseen data compared to the training data on which it was trained. For a "good" model, the generalisation error should be small. A small generalisation error indicates that the model effectively captures the underlying patterns or distributions of the data without being overly fitted to the noise or specific details of the training set.
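
One common way to make this precise, written in the notation of Question 1, is the gap between the true risk and the empirical risk of a learned function $f$. This is an assumed formalisation added for illustration, not a definition taken from the coursework:

$$\text{gen}(f) = R(f) - \hat{R}(f) = \mathbb{E}_{\mathcal{D}}[L(f(\mathbf{x}), \mathbf{y})] - \frac{1}{n}\sum_{i=1}^n L(f(\mathbf{x}^i), \mathbf{y}^i)$$

Since $R(f)$ cannot be computed directly, in practice it is estimated by the average loss on held-out data.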

A model with a small generalisation error has successfully learned the true underlying patterns in the data. This ability suggests that the model can apply what it has learned from the training data to unseen data, making accurate predictions across a variety of scenarios that were not specifically presented during training.

A good model strikes an optimal balance between bias (the error from erroneous assumptions in the learning algorithm) and variance (the error from sensitivity to small fluctuations in the training set). A small generalisation error indicates that the model has achieved this balance, being neither too simple (high bias and unable to capture complex patterns) nor too complex (high variance and overfitting to the training data).

Overfitting occurs when a model learns the noise or random fluctuations in the training data rather than the actual signal. A model with a small generalisation error is robust against overfitting, meaning it has generalised well from the training data to unseen data, focusing on the signal rather than the noise.

The ultimate goal of a machine learning model is to perform well on real-world data, which is often different in various ways from the data used during training. A small generalisation error implies that the model is likely to be effective and reliable when deployed in real-world applications, accurately handling new examples that reflect the complexities and variations of real-world scenarios.

A good model, by definition, is one that generalises well from the training data to unseen data, evidenced by a small generalisation error. Achieving a small generalisation error requires careful model design, including selecting the right model complexity, employing proper training techniques, and using strategies such as cross-validation and regularisation to prevent overfitting. Ultimately, the small generalisation error of a good model reflects its ability to make accurate predictions across a wide range of data, embodying the core goal of machine learning.

#### Question 6 (Rademacher complexity pt1):



#### Question 7 (Rademacher complexity pt2):

The empirical Rademacher complexity of a function class $\mathcal{F}$ with respect to a sample $S = \{x_1, x_2, \ldots, x_n\}$ is defined as follows:

$$
\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}}\left(\frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i)\right)\right]
$$

Where:

- $S$ represents a sample of $n$ points drawn from a distribution $\mathcal{D}$.
- $\mathcal{F}$ is a class of functions where each function $f$ maps an input space $\mathcal{X}$ to real numbers $\mathbb{R}$.
- $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ is a sequence of independent Rademacher variables, each taking values $-1$ or $+1$ with equal probability.
- $\mathbb{E}_{\sigma}$ denotes the expectation over all possible realizations of the Rademacher sequence $\sigma$.

The empirical Rademacher complexity measures the expected maximum correlation between the functions in $\mathcal{F}$ and a random pattern of signs provided by $\sigma$ on the sample $S$. It serves as an indicator of the function class's ability to fit random noise, which is directly related to its potential for overfitting and its generalization capacity.

A high value of empirical Rademacher complexity suggests that the function class $\mathcal{F}$ has a strong ability to fit random patterns. Specifically, it means that there exist functions within $\mathcal{F}$ that can align closely with random assignments of labels (as represented by the Rademacher variables) to the data points in $S$. This ability is indicative of $\mathcal{F}$'s flexibility or complexity in modelling data. While the capacity to fit data closely might seem desirable, a high empirical Rademacher complexity raises concerns about overfitting. Overfitting occurs when a model captures not just the underlying signal in the data but also the noise. Therefore, a high complexity value warns that models from $\mathcal{F}$ might not generalise well to unseen data, as they could be fitting the noise present in the training sample.
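
The definition can be estimated by Monte Carlo sampling of $\sigma$. The sketch below does this for one specific class chosen for illustration, the linear functions $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le 1\}$. For this class the supremum has the closed form $\frac{2}{n}\|\sum_i \sigma_i x_i\|_2$. The class, the random unit-norm sample and the function names are assumptions; the $\frac{2}{n}$ factor follows the definition above.

```python
import math
import random


def empirical_rademacher_linear(sample, n_draws=500, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity (2/n convention)
    of F = {x -> <w, x> : ||w||_2 <= 1} on the given sample.

    For this class, sup_w (2/n) sum_i sigma_i <w, x_i> = (2/n) ||sum_i sigma_i x_i||_2.
    """
    rng = random.Random(seed)
    n, dim = len(sample), len(sample[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw independent Rademacher variables sigma_i in {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Signed sum of the sample points, coordinate by coordinate.
        s = [sum(sigma[i] * sample[i][k] for i in range(n)) for k in range(dim)]
        total += (2.0 / n) * math.sqrt(sum(v * v for v in s))
    return total / n_draws


def random_unit_points(n, dim, rng):
    """Sample n points uniformly on the unit sphere in R^dim (hypothetical data)."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([c / norm for c in v])
    return points


rng = random.Random(1)
r_small = empirical_rademacher_linear(random_unit_points(20, 5, rng))
r_large = empirical_rademacher_linear(random_unit_points(200, 5, rng))
print(f"n=20: {r_small:.3f}, n=200: {r_large:.3f}")
```

For unit-norm inputs the exact value is at most $2/\sqrt{n}$ by Jensen's inequality, so this class fits random signs less well as the sample grows.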

Mathematically, a high empirical Rademacher complexity will result in a looser bound on the generalisation error. Recall:
$$
\mathbb{P}_{S \sim \mathcal{D}^n}\left( \forall f \in \mathcal{F},\, \left| \mathbb{E}[L(f, S)] - \hat{\mathbb{E}}[L(f, S)] \right| \leq 2\hat{\mathcal{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\log(\frac{2}{\delta})}{2n}} \right) \geq 1 - \delta
$$


This bound implies that with high probability (at least $1 - \delta$), the absolute difference between the true expected loss and the empirical loss for all functions in the class $\mathcal{F}$ is bounded by the term involving the empirical Rademacher complexity and a term that decreases as the sample size $n$ increases. The presence of $\hat{\mathcal{R}}_S(\mathcal{F})$ in the bound highlights the role of the function class's complexity in determining generalisation performance. A higher empirical Rademacher complexity leads to a looser bound, meaning that models with higher complexity may have a larger difference between their training and test performance.
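
To get a feel for the sizes involved, the right-hand side of the bound can be evaluated directly. The helper names below are hypothetical, and the Rademacher complexity is simply passed in as a number rather than computed.

```python
import math


def confidence_term(n, delta):
    """The sample-size term 3 * sqrt(log(2 / delta) / (2 n)) from the bound."""
    return 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def generalisation_bound(rademacher, n, delta):
    """Right-hand side of the bound: 2 * R_hat_S(F) + confidence term."""
    return 2.0 * rademacher + confidence_term(n, delta)


# The confidence term shrinks like 1 / sqrt(n) for a fixed delta.
for n in (100, 1_000, 10_000):
    print(n, round(confidence_term(n, delta=0.05), 4))
```

With $\delta = 0.05$ the second term is roughly $0.41$ at $n = 100$ and shrinks by a factor of $\sqrt{10}$ for each tenfold increase in $n$, so for large samples the complexity term dominates the bound.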


#### Question 8 (Regularisation term in the loss function):

Enter your answer here

#### Question 9 (Momentum gradient descent):

Enter your answer here

#### Question 10 (Adam):

Enter your answer here

#### Question 11 (AdaGrad):

Enter your answer here

#### Question 12 (Decaying Learning Rate):

Enter your answer here

*** 
***

## Part 2: Short-ish proofs [6 points]


### Question 2.1: Bounds on the risk [1 point]


***

### Question 2.2: On semi-definiteness [1 point]

***

### Question 2.3: A quick recap of momentum [1 point]

***

### Question 2.4: Convergence proof [3 points]

***
***

## Part 3: A deeper dive into neural network implementations [3 points]

In [5]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import matplotlib.pyplot as plt
import seaborn as sns


In [7]:
# Download datasets
train_set_mnist = torchvision.datasets.MNIST(root="./", download=True,
                                         train=True, transform=transforms.Compose([transforms.ToTensor()]))

test_set_mnist = torchvision.datasets.MNIST(root="./",download=True,
                                        train=False,transform=transforms.Compose([transforms.ToTensor()]),)

train_set_cifar = torchvision.datasets.CIFAR10(root="./", download=True,
                                         train=True, transform=transforms.Compose([transforms.ToTensor()]))

test_set_cifar = torchvision.datasets.CIFAR10(root="./",download=True,
                                        train=False,transform=transforms.Compose([transforms.ToTensor()]),)



In [None]:
SEED = 1843211
np.random.seed(SEED)
torch.manual_seed(SEED)

***

### Part 3.1: Implementations [1 point]

In [None]:
# You can of course add more cells of both code and markdown. Please remember to comment the code and explain your reasoning. Include docstrings. Tutorial provide a good example of how to style your code.
# Although not compulsory you could challenge yourself by using object oriented programming to structure your code.


class Net(nn.Module):
    """
    A fully-connected neural network with ReLU activations that outputs class logits
    (the softmax is applied inside the cross-entropy loss, not in the network).
    
    Args:
        dim (int): The dimension of the input.
        nclass (int): The number of classes.
        width (int): The width of the hidden layers.
        depth (int): The number of hidden layers.
    """
    def __init__(self, dim, nclass, width, depth):
        # Call the parent constructor
        super().__init__()
        # Define the parameters
        self.dim = dim
        self.nclass = nclass
        self.width = width
        self.depth = depth
        # Define the layers
        self.layers = nn.ModuleList([nn.Flatten()])
        self.layers.extend([nn.Linear(self.dim, self.width)])
        self.layers.extend([nn.ReLU()]) # Add ReLU activation function as every Linear layer is followed by a ReLU activation function
        # Define the hidden layers
        for i in range(self.depth-1):
            self.layers.extend([nn.Linear(self.width, self.width), nn.ReLU()])
        # Define the output layer
        self.layers.extend([nn.Linear(self.width, self.nclass)])

    
    def forward(self, input):
        # Forward pass
        x = input
        for layer in self.layers:
            x = layer(x)
        return x
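
As a quick sanity check on the architecture above, the number of trainable parameters can be worked out by hand from `dim`, `width`, `depth` and `nclass`. The helper `count_parameters` below is hypothetical and only mirrors the layer structure described in `Net`: one input layer, `depth - 1` further hidden layers, and one output layer, each with a bias.

```python
def count_parameters(dim, nclass, width, depth):
    """Weights plus biases of a fully-connected net with `depth` hidden layers."""
    first = dim * width + width                      # input -> first hidden layer
    hidden = (depth - 1) * (width * width + width)   # remaining hidden layers
    output = width * nclass + nclass                 # last hidden layer -> logits
    return first + hidden + output


# Example sizes for 784-dimensional inputs, width 256 and 10 classes.
for depth in (1, 5, 10):
    print(depth, count_parameters(784, 10, 256, depth))
```

The same number can be read off a constructed network with `sum(p.numel() for p in net.parameters())`, which gives a direct way to confirm that the layers were assembled as intended.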

In [None]:
def loading_data(batch_size, train_set, test_set):
    """
    This function loads the data using the torch.utils.data.DataLoader function.
    
    Args:
        batch_size (int): The batch size.
        train_set (torch.utils.data.Dataset): The training set.
        test_set (torch.utils.data.Dataset): The test set.
        
    Returns:
        trainloader (torch.utils.data.DataLoader): The training set loader.
        testloader (torch.utils.data.DataLoader): The test set loader.
    """
    # Load the data
    trainloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)
    testloader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False)

    return trainloader, testloader

In [None]:
def train_epoch(trainloader, net, optimizer, criterion):
    """
    Trains the network for one epoch.
    
    Args:
        trainloader: The training data loader.
        net: The network to train.
        optimizer: The optimizer to use.
        criterion: The loss function to use.
        
    Returns:
        The average train loss over the epoch.
        The train error over the epoch.
    """
    # Set the network to training mode
    net.train()
    # Initialize the loss and error
    total_loss = 0
    total_error = 0
    # Loop over the training set
    for i, (images, labels) in enumerate(trainloader):
        optimizer.zero_grad()
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Update the loss and error
        total_loss += loss.item()
        total_error += (outputs.argmax(dim=1) != labels).sum().item()

    
    return total_loss / len(trainloader), total_error / len(trainloader.dataset) # Return the average train loss and train error

In [None]:
def test_epoch(testloader, net, criterion):
    """
    Tests the network for one epoch.

    Args:
        testloader: The test data loader.
        net: The network to test.
        criterion: The loss function to use.
    
    Returns:
        The average test loss over the epoch.
        The test error over the epoch.
    """
    # Set the network to evaluation mode
    net.eval()
    # Initialize the loss and error
    test_loss = 0
    correct = 0
    total = 0
    # Loop over the test set
    with torch.no_grad():
        for i, (images, labels) in enumerate(testloader):
            outputs = net(images)
            loss = criterion(outputs, labels)
            # Update the loss and error
            test_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    return test_loss / len(testloader), 1 - correct / total # Return the average test loss and test error

In [None]:
def main(batch_size, dim, nclass, width, depth, lr, epochs, train_set, test_set, Adam=True, momentum=None, early_stop_loss=None, verbose=True):
    """
    This function runs the training and testing epochs.
    
    Args:
        batch_size (int): The batch size.
        dim (int): The dimension of the input.
        nclass (int): The number of classes.
        width (int): The width of the hidden layers.
        depth (int): The number of hidden layers.
        lr (float): The learning rate.
        epochs (int): The number of epochs.
        train_set (torch.utils.data.Dataset): The training set.
        test_set (torch.utils.data.Dataset): The test set.
        Adam (bool): Whether to use Adam or SGD.
        momentum (float): The momentum to use for SGD.
        early_stop_loss (float): The loss threshold to stop the training.
        verbose (bool): Whether to print the results.
    
    Returns:
        The train loss over the epochs.
        The test loss over the epochs.
        The train error over the epochs.
        The test error over the epochs.
    """

    # load data
    trainloader, testloader = loading_data(batch_size, train_set, test_set)

    # define network
    net = Net(dim, nclass, width, depth)

    # define criterion function
    criterion = nn.CrossEntropyLoss()

    # define Adam or SGD optimizer
    if Adam:
        optimizer = optim.Adam(net.parameters(), lr=lr)
    else:
        if momentum is None:
            optimizer = optim.SGD(net.parameters(), lr=lr)
        else:
            optimizer = optim.SGD(net.parameters(), lr=lr, momentum=momentum)

    # Storing the results for each epoch
    store_train_loss = []
    store_test_loss = []
    store_train_error = []
    store_test_error = []
    for epoch in range(1, epochs+1):
        # Training and testing the network
        train_loss, train_error = train_epoch(trainloader, net, optimizer, criterion)
        test_loss, test_error = test_epoch(testloader, net, criterion)
        # Storing the results
        store_train_loss.append(train_loss)
        store_test_loss.append(test_loss)
        store_train_error.append(train_error)
        store_test_error.append(test_error)
        # Printing the results
        if verbose:
            print(f"Epoch: {epoch:03} | Train Loss: {train_loss:.04} | Test Loss: {test_loss:.04} | Train Error: {train_error:.04} | Test Error: {test_error:.04}")
        # Early stopping once the train loss falls below the threshold (treated as convergence)
        if early_stop_loss is not None and train_loss < early_stop_loss:
            print(f"Early stopping at epoch {epoch} with train loss {train_loss}")
            break
    return store_train_loss, store_test_loss, store_train_error, store_test_error

***

### Part 3.2: Numerical exploration [2 points]

In [None]:
# Define the hyperparameters
depths = [1, 5, 10] # Varying hyperparameter
width = 256 # Fixed hyperparameter according to exercise guidelines
lr = 0.001
epochs = 20
batch_size = 32
dim = 3 * 32 * 32 # CIFAR-10 images are 3x32x32 = 3072 features (use 28 * 28 = 784 for MNIST)
nclass = 10
Adam = True

# Store the results
store_depth = {}

# Run the main function for different depths 
for depth in depths:
    print(f"Depth: {depth}")
    train_loss, test_loss, train_error, test_error = main(batch_size, dim, nclass, width, depth, lr, epochs, train_set_cifar, test_set_cifar, Adam=Adam)
    store_depth[depth] = [train_loss, test_loss, train_error, test_error]

***
***

## Part 4: The link between Neural Networks and Gaussian Processes [8 points]

### Part 4.1: Proving the relationship between a Gaussian process and a neural network [4 points]

### Task 1: Proper weight scaling

### Task 2: Derive the GP relation for a single hidden layer

### Task 3: Why in succession

### Task 4: Derive the GP relation for multiple hidden layers

***

### Part 4.2: Analysing the performance of the Gaussian process and a neural network [4 points]

In [None]:
# You can of course add more cells of both code and markdown.