
# 2. Neural Network

A neural network (NN) is a machine learning model inspired by how the *human brain* works (see [16, 20] for a detailed history, including the work of McCulloch and Pitts, the development of the perceptron, and more). In this document, we come full circle, as we use it to classify brain tumours!

## 2.1. Basics

We explain the basics of neural networks, using information from [17, 18]. A neural network (NN) is composed of layers of interconnected nodes called **neurons**. These neurons are organized into three main types of layer:

- **Input layer**: This is where data enters the network. Each input feature (e.g., pixels of an image or values in a dataset) corresponds to one neuron in this layer.
- **Hidden layers**: These intermediate layers perform the actual processing. Each neuron in a hidden layer receives inputs from the previous layer, applies a mathematical operation (known as an **activation function**), and passes the result to the next layer.
- **Output layer**: This layer produces the network's final output or prediction, such as a classification label.

Collectively, this is called the **architecture** of the NN. This is shown in the Figure below.

<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*L1bH8PUeLf67Aat-h5phpw.png" alt="Figure taken from Medium Article - accessed 10/11/2024." title="Architecture of a neural network">
    <figcaption>Figure showing the architecture of a neural network. Figure taken from <a href="https://miro.medium.com/v2/resize:fit:1400/1*L1bH8PUeLf67Aat-h5phpw.png">Medium Article</a> - accessed 10/11/2024 08:37.</figcaption>
</figure>

## 2.2. Mathematical formulation

To understand how this works mathematically, we give an example and provide an analogy to the human nervous system. Consider a neuron $i$ in a hidden or output layer. It receives inputs from previous-layer neurons $a_j$, with respective outputs $x_{a_j}$. Each input is **weighted** before passing through an activation function $f$. Mathematically, we write the output of neuron $i$ as:

$$
f(W_{0,i} + \sum_{j} W_{j,i} x_{a_j}).
$$

Here, $W_{0,i}$ is a bias term, and the $W_{j,i}$ values are weights. The bias term can be thought of as a neuron $a_0$ that always provides a signal $x_{a_0} = 1$, helping to adjust the activation function's behavior by situating the output signal in the appropriate range [18].
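As a minimal sketch of the formula above, the snippet below computes the output of a single neuron in plain Python. The sigmoid activation, the function names and the numerical values are our own assumptions for illustration; they are not taken from [17, 18], and in practice a library such as TensorFlow performs this computation for whole layers at once.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute f(W_0 + sum_j W_j * x_j) for a single neuron."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical values: three previous-layer outputs and their weights.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

print(neuron_output(x, w, b))
```

Changing a weight changes how much the corresponding input contributes to the weighted sum, while the bias shifts the sum before the activation function is applied.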

This mimics the biological neuron, which takes *signals* (the $x_{a_j}$ here) from other neurons (say, signals relating to taste, sight, touch, hearing and emotions). Depending on the situation (e.g. if you are eating, taste and touch are most important), each signal has a certain **importance** (the **weights** $W_{j,i}$). The signals are then processed into a new signal (this is where we apply the activation function) and transmitted onwards (this is where we output to the next layer).

For our tumour classification task, we are going to make use of a special kind of neural network: the convolutional neural network.

## 2.3. Convolutional Neural Network

A convolutional neural network (CNN) is a specialized type of neural network. CNNs are especially good at image-related tasks (like recognizing objects in pictures) because they can detect patterns and spatial hierarchies in data, such as edges, textures, and shapes [24]. They are even compared to the way the brain achieves vision processing in living organisms like cats [23]. They are the "de-facto standard" [19] in image processing, though in some cases newer architectures like transformers are being favoured. The picture below summarises how they work.

<figure>
    <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/08/183560-qcmbdpukpdviccdd-66c7065d8f850.webp" alt="Figure from Analytics Vidhya - accessed 10/11/2024.">
    <figcaption>Figure from <a href="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/08/183560-qcmbdpukpdviccdd-66c7065d8f850.webp">Analytics Vidhya</a> - accessed 10/11/2024.</figcaption>
</figure>

We explain this in words, focussing on the concepts rather than technical details, and synthesising information from [25, 26, 27]. CNNs work as follows:

1. Input: Instead of connecting every input to every neuron, CNNs use small filters, or kernels, that slide over the image to detect features. Each filter is essentially a small matrix (e.g. 3x3), which multiplies and sums pixel values (mathematically, this is a dot product) in small regions of the input image. This process mimics the way our vision works. Work by Hubel and Wiesel in the 1950s and 1960s showed that neurons related to vision individually respond to small regions of the visual field for cats [19] and therefore the kernel scans small regions of the image. The output of the convolutional neuron is sometimes called a feature map [19] or activation map [25].

    Linking back to the previous section, we can see an example of a tensor: the image input is a tensor with shape (number of inputs) × (input height) × (input width) × (input channels (3 if RGB)) [19]. In the image below, we show the action of a convolutional kernel on parts of a 64 x 64 image.

<figure>
  <img src="https://poloclub.github.io/cnn-explainer/assets/figures/convlayer_detailedview_demo.gif" alt="Convolutional Layer Demo">
  <figcaption>Figure showing a detailed view of a convolutional layer in action. GIF taken from [25]. </figcaption>
</figure>

The kernels have some hyperparameters: padding, kernel size and stride. We explain these below, synthesising information from [25].

- Padding: When a filter extends beyond the edge of the activation map, padding is used to preserve data at the borders. Techniques to do this vary. A simple padding method is zero-padding, which involves adding zeroes around the edges of the input. This technique is frequently employed in high-performing CNNs, such as AlexNet (60 million parameters, 650,000 neurons [27]). This helps in maintaining spatial dimensions and improves performance [26]. In essence, it is like augmenting the image so that we can extract information from the corners and edges better. Other padding techniques can be found in an accessible tutorial in [28].

- Kernel Size: Also known as the filter size, kernel size defines the dimensions of the sliding window used to scan over the input. This hyperparameter greatly impacts feature extraction:

    - Smaller kernels (e.g., 3x3) capture finer, highly localized features, allowing for deeper architectures with more complex feature hierarchies. This smaller reduction in layer dimensions enables stacking more layers, which can enhance performance on tasks like image classification.
    - Larger kernels (e.g., 7x7) capture broader, more generalized features, leading to faster reductions in layer dimensions but often resulting in less detailed feature extraction.

- Stride: The stride determines the step size by which the kernel moves across the input. For example, a stride of 1 moves the kernel one pixel at a time, covering more of the input and producing larger output layers. This slower, more thorough process allows for detailed feature extraction but requires more computation. Conversely, a larger stride (e.g., 2 or more) moves the kernel further with each step, reducing the number of computations but resulting in smaller output layers and potentially less feature detail.
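To make the effect of these three hyperparameters concrete, here is a small sketch in plain Python. It assumes a single-channel image stored as a list of lists and the commonly used output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$ and stride $s$. The function names and the example filter are hypothetical. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def conv2d(image, kernel, padding=0, stride=1):
    """Slide `kernel` over a zero-padded 2D `image` (lists of lists)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Zero-padding: surround the image with `padding` rows/columns of zeros.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    out_h = conv_output_size(h, kh, padding, stride)
    out_w = conv_output_size(w, kw, padding, stride)

    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply-and-sum (dot product) of the kernel with one image patch.
            row.append(sum(kernel[a][b] * padded[top + a][left + b]
                           for a in range(kh) for b in range(kw)))
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, -1]]  # hypothetical 2x2 filter, chosen only for illustration

print(conv2d(image, kernel))            # stride 1: a 3x3 feature map
print(conv2d(image, kernel, stride=2))  # stride 2: a smaller 2x2 feature map
print(conv_output_size(64, 3, padding=1))  # a 3x3 kernel with padding 1 keeps size 64
```

With a 64 x 64 input, a 3x3 kernel without padding gives a 62 x 62 output and a 7x7 kernel gives 58 x 58, which illustrates why larger kernels shrink the layers faster.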

2. Feature detection: Each filter is designed to recognize specific patterns, like edges or textures. As the filter slides across the image (a process called convolution), it creates the feature map, highlighting areas where the pattern is detected. We can see this above, as the shape of the cup has been found, and the "non-cup" part (the liquid inside) is also identified.

<figure>
  <img src="https://i.sstatic.net/pLlwx.png" alt="Example of two feature maps produced by two filters">
  <figcaption>Figure showing an example of Feature Maps. Here we use 2 filters and get 2 feature maps.</figcaption>
</figure>


3. Multiple filters: CNNs use multiple filters in each convolutional layer. The first layer might detect basic patterns, like edges, while later layers combine these simpler patterns into more complex shapes or features.

4. Pooling: After a convolutional layer, a pooling layer reduces the spatial size of the feature maps [26], which decreases the number of parameters and computations. For example, in max pooling (the most common type), a small window slides over the feature map and takes the maximum value in each window. This effectively "summarizes" the strongest features in each area.

5. Stacking layers: In a CNN, multiple convolutional and pooling layers are stacked one after another. As data moves through each layer, the network learns increasingly abstract and complex features. Early layers detect simple patterns like edges, while deeper layers detect complex shapes and objects.

6. Fully connected layers: After the convolutional and pooling layers, the feature maps are “flattened” (converted into a 1D vector) and fed into one or more fully connected (FC) layers, also called dense layers. These layers combine features from all previous layers to make the final prediction.
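The pooling and flattening steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the 2x2 window with stride 2 and the example feature map are our own assumptions, and a real network would use library layers rather than nested lists.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum value in each `size` x `size` window of a 2D feature map."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[top + a][left + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def flatten(feature_maps):
    """Convert a list of 2D feature maps into a single 1D list."""
    return [value for fmap in feature_maps for row in fmap for value in row]


fmap = [[1, 3, 2, 0],
        [4, 6, 5, 1],
        [7, 2, 9, 8],
        [0, 1, 3, 4]]

pooled = max_pool2d(fmap)
print(pooled)             # [[6, 5], [7, 9]]
print(flatten([pooled]))  # [6, 5, 7, 9]
```

The 4 x 4 feature map is reduced to 2 x 2, so the following layer has a quarter as many values to process, and the flattened list is the kind of 1D vector that a fully connected layer takes as input.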

CNNs have key advantages [29, 30] which make them especially good for image processing. We provide a brief summary below:
- Automatic feature learning: CNNs can automatically learn relevant features from raw input data, which eliminates the need for manual feature engineering.
- High accuracy: CNNs can achieve state-of-the-art performance in various image and video recognition tasks.
- Robustness: Using data augmentation (see later section), CNNs can have high performance even if the images have different qualitative features, such as being taken with different brightness, contrast or angles.

Despite their many advantages, CNNs also have some disadvantages. The one we felt the most was the high computational requirement -- CNNs require significant computational resources, including GPUs (graphics processing units) (see later section on parallelism with GPUs) or TPUs (tensor processing units), to train and deploy. They can also be prone to overfitting and can fail on adversarial examples, where some noise is added to an image [29]. We now move on to the training process and explain how training is done using backpropagation and (stochastic) gradient descent.

# 3. Training using Backpropagation

## 3.1. Gradient Descent
The training process in neural networks involves adjusting the weights and biases through backpropagation combined with gradient descent. When the network makes a mistake, the error is propagated backward, and the weights are updated to reduce the error. Gradient descent adjusts the weights in the direction that reduces the *loss function* the most, i.e. along the negative gradient. The "learning rate" controls how far we move in that direction at each update, i.e. it is a sort of step size. The loss function measures the discrepancy between the predicted output and actual output, so repeating these weight updates over many passes through the training data, or "epochs," generally improves the network's accuracy in making predictions.
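As a minimal sketch of this idea, the snippet below fits a single weight $w$ so that $w x$ matches $y$, using the mean squared error as the loss and the update $w \leftarrow w - \eta \, \partial L / \partial w$, where $\eta$ is the learning rate. The toy data, the loss and the function names are assumptions made purely for illustration. A real network has many weights, and backpropagation is what computes the gradient for each of them.

```python
def loss(w, xs, ys):
    """Mean squared error between predictions w * x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, epochs=100):
    """Repeatedly step against the gradient, using all data points each time."""
    for _ in range(epochs):
        w -= learning_rate * gradient(w, xs, ys)
    return w


# Toy data generated from y = 2x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_fit = gradient_descent(xs, ys)
print(w_fit, loss(w_fit, xs, ys))
```

On this toy problem a learning rate that is too large makes the updates overshoot and diverge, while a very small one converges slowly, which is why the step size matters.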

<figure>
    <img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/631731_P7z2BKhd0R-9uyn9ThDasA.webp" alt="Figure taken from Analytics Vidhya - accessed 10/11/2024.">
    <figcaption>Figure showing the process of gradient descent. Figure taken from <a href="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/631731_P7z2BKhd0R-9uyn9ThDasA.webp">Analytics Vidhya</a> - accessed 10/11/2024.</figcaption>
</figure>

<figure>
    <img src="https://i.makeagif.com/media/12-08-2023/g4oDOp.gif" alt="GIF demonstrating backpropagation">
    <figcaption>Figure showing the process of backpropagation. Figure taken from <a href="https://makeagif.com/gif/backpropagation-on-neural-networks-g4oDOp">makeagif.com</a> - accessed 11/11/2024.</figcaption>
</figure>


Note that there is a lot of information about layers and neurons to be stored, as each neuron $a_i^{\ell}$ in each layer $\ell$ will have weights $W_{j,i}^{\ell}$ for the inputs from neurons $a_{j}^{\ell -1}$ from the previous layer. Hence, we need a lot of indices to store this information. To do this effectively, we use *tensors* [19].

Tensors are generalisations of matrices ($A = (a_{ij})$) and vectors ($a = (a_i)$) to higher dimensions [21]. For example, "rank 0" tensors are scalars, "rank 1" tensors are vectors, and "rank 2" tensors are matrices. When adding more dimensions, we call the corresponding quantity a rank $n$ tensor. Using tensors, we can write out the backpropagation process in a compact form. We direct the interested reader to [22] if they wish to see a mathematical formulation of backpropagation.
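To illustrate the idea of rank, the sketch below represents tensors as nested Python lists and reads off their shape, where the rank is the number of entries in the shape. The helper `shape` is hypothetical and assumes non-empty, regularly nested lists; tensor libraries provide this information directly.

```python
def shape(tensor):
    """Return the shape of a regularly nested list; a scalar has shape ()."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


scalar = 3.0                           # rank 0
vector = [1.0, 2.0, 3.0]               # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]      # rank 2

# Rank 4: a batch of 2 images, each 2 x 2 pixels with 3 colour channels.
batch = [[[[0, 0, 0] for _ in range(2)] for _ in range(2)] for _ in range(2)]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, "shape:", shape(t), "rank:", len(shape(t)))
```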

In short, neural networks learn patterns in data by adjusting the connections between their neurons. Tensors provide a compact way to organise the information about the weights and neurons, allowing us to express backpropagation mathematically in a compact way. (This explains both `Tensor` and `flow` in `Tensorflow` -- a particular platform that we will use to fit NNs.) The traditional gradient descent approach uses all data points to compute the gradient in each backpropagation step. This becomes computationally expensive with large datasets, as it requires going through the entire dataset to perform just one weight update. An alternative, called *stochastic gradient descent*, provides a computationally cheaper way to optimise the loss function.

## 3.2. Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) was first introduced by Herbert Robbins and Sutton Monro in their 1951 paper, "[A Stochastic Approximation Method](https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-22/issue-3/A-Stochastic-Approximation-Method/10.1214/aoms/1177729586.full)" [31]. Although they did not specifically apply the method to neural networks (which were not yet developed in their modern form), Robbins and Monro's work laid the foundation for the iterative optimization techniques used in machine learning today.

SGD differs from traditional gradient descent by randomly selecting a small subset of data points (called a batch) and using this subset to approximate the gradient. Each update is therefore much cheaper to compute, and the code can still make use of vectorization libraries within a batch, making SGD better suited to larger datasets. Typical implementations may use an adaptive learning rate so that the algorithm converges [32]. Note that an "epoch" refers to a pass through all the data, as in traditional gradient descent. Hence smaller batch sizes need more iterations for one epoch of training. For example, if we have 1000 samples and a batch size of 500, we need 2 iterations for 1 epoch of training [32].
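The following sketch shows mini-batch SGD on a toy one-weight problem, together with the iterations-per-epoch arithmetic from the example above. The data, the mean squared error loss, the fixed learning rate and the function names are our own assumptions for illustration and are not taken from [32].

```python
import random


def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once (rounded up)."""
    return -(-n_samples // batch_size)


def sgd(xs, ys, w=0.0, learning_rate=0.01, batch_size=2, epochs=50, seed=0):
    """Mini-batch SGD for the model y = w * x with mean squared error loss."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient estimated from this batch only, not the full dataset.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


print(iterations_per_epoch(1000, 500))  # 2, matching the example above

# Toy data generated from y = 2x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(sgd(xs, ys))
```

Each update here looks at only `batch_size` samples, so one epoch consists of several cheap updates instead of one expensive update.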

SGD has been analyzed using the theories of convex minimization and of stochastic approximation [31]. When the learning rates decrease with an appropriate rate, stochastic gradient descent converges to a global minimum with probability 1 (under some conditions) when the objective function is convex, and otherwise converges to a local minimum. While our loss function is unlikely to be convex, this provides us with reassurance that SGD is a principled method. In fact, it is widely used throughout the deep learning community.
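For readers who want to see what "an appropriate rate" can mean, a standard pair of conditions from the stochastic approximation literature is given below. Writing $\eta_t$ for the learning rate at update $t$ (this symbol is our own notation), the steps must be large enough in total to reach the minimum, yet shrink fast enough for the noise to average out:

$$
\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty.
$$

For example, $\eta_t = 1/t$ satisfies both conditions, whereas a constant learning rate satisfies only the first.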
