# Two types of Optimization
In the weeks since we began this course, we have built up the basic elements of a neural network and seen how complex architectures can be built out of these basic elements. As the complexity of Deep Learning networks has grown, so has the need to optimize them, both in terms of selecting the correct weights and in terms of running inference once the network has been trained. In general, both the training and the inference operations of our networks need to be optimized. In this week's module we will have a look at how these optimizations can be achieved.  

# Training Time Optimization 

Training a Deep Neural Network boils down to iteratively adjusting our weights in response to errors in training. Deciding how much error should be propagated back through the network for a given training run is the job of the optimization algorithm. In earlier notes we were introduced to the Gradient Descent algorithm which is the most transparent of all the optimization algorithms -- and hence is the most useful for teaching. However when we are trying to train on more complicated datasets we need optimizers with a lot more power than vanilla Gradient Descent. 

With that in mind, in this section we examine a number of the most important Optimization algorithms that we might want to use in practice. We begin by briefly reviewing the Gradient Descent algorithm and considering some of its limitations. 

Please note that the figures below are not mine -- many thanks to the original designers. I've included a number of image links to the original material at the end of the lecture notes. 

## Review: Vanilla Gradient Descent 
We introduced Gradient Descent in some detail during our discussion of Linear and Logistic Regression. In those notes we discovered that the Gradient Descent algorithm can be understood as a process for discovering optimum parameter values for a model based on a series of small incremental adjustments to an initial parameter set that seek to minimize an error function. 

We saw that we can often visualize our error function as a surface over our model parameters. Our goal in Gradient Descent can then be thought of as moving from a starting position to the minimum of that surface. While visualizing the problem in this way is only practical for 1 or 2 parameters, it is nonetheless useful for illustration purposes. 

<!-- gd.jpg --> 
<img width="500" src="https://drive.google.com/uc?id=1jIFMn4LvLCek0_e1sTuyB5mvGCF6auic"/>

The adjustments we make to parameters are driven by the error between expected and actual values. More formally, the weight adjustment is proportional to the negative gradient of the error (cost) function with respect to that weight, and we can express this as follows:  

\begin{equation}
\Delta \theta_{j} = - \alpha \frac{\partial \; J(\theta)}{\partial \theta_{j}} 
\end{equation}

where $\theta_{j}$ is the $j^{th}$ model parameter, $\alpha$ is a learning rate, and $J(\theta)$ is our cost function over the full parameter vector. 

We saw before that, for the sum of squared errors cost function, this partial derivative can be expressed as a sum over training examples of the difference between produced and expected values, each multiplied by the corresponding feature value. This in turn leads us to our common expression of the Gradient Descent algorithm: 

Repeat Until Convergence {

  > perform simultaneously {

> \begin{equation}
\theta_{j} = \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) \, x_{j}^{(i)}
\end{equation}
  }

}

where $m$ is the number of training examples, $h$ is our hypothesis function, $y$ is our target value, and $x_{j}$ is our $j^{th}$ feature. 
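
To make the update rule above concrete, here is a minimal NumPy sketch of full-batch gradient descent for linear regression. The function name `batch_gradient_descent`, the toy data, and the chosen values of `alpha` and `n_iters` are assumptions made purely for illustration; they are not taken from earlier notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Vanilla (full-batch) gradient descent for linear regression.

    X is an (m, n) feature matrix whose first column is all ones (the bias
    feature) and y is an (m,) target vector. Returns the learned theta.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ errors) / m       # (1/m) * sum_i error_i * x_j^(i)
        theta = theta - alpha * gradient    # all theta_j updated simultaneously
    return theta

# Toy data generated from y = 1 + 2x without noise (illustrative only).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = batch_gradient_descent(X, y, alpha=0.5, n_iters=5000)
print(theta)  # should be close to the generating parameters [1, 2]
```

Computing the gradient as a single matrix product gives the simultaneous update over all $\theta_{j}$ for free, since every component is computed from the same old value of $\theta$.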

### Limitations

While the Gradient Descent Algorithm underlies much of modern machine learning, its vanilla form as expressed above has a number of significant limitations. In the following we highlight some of these. 

#### Limitation 1 - Setting $\alpha$
The first serious consideration is connected to the learning rate $\alpha$. The Gradient Descent algorithm requires $\alpha$ to be set manually, but a good value is often difficult to determine. First, a suitable learning rate depends on the shape of the error surface, and different problems have different error surfaces. A grid search can be useful in determining the best learning rate, but that is generally a very expensive process that we would like to avoid. Related to this is the problem that the learning rate we would like changes over the course of training. Generally we favor a larger learning rate at the start of training when we are far from the error minimum, but prefer a smaller value for $\alpha$ once training has progressed and we are much closer to the minimum. This is illustrated below, where our jumps through the weight space get smaller as we approach the minimum. 

<!-- alpha.png --> 
<img width="500" src="https://drive.google.com/uc?id=1iupuiLTKVNYgMM2lsl3W5XC2jt-SCsgS"/>

Programmatically adjusting the learning rate through the use of a decay factor is one way in which this can be addressed, but generally speaking it would be good to have a more systematic way to select $\alpha$ over the course of training. 
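
As a rough illustration of a decay factor, the sketch below uses inverse-time decay, which is only one of several common schedules. The function name `decayed_learning_rate` and the values of `alpha0` and `decay` are assumptions chosen for the example.

```python
def decayed_learning_rate(alpha0, decay, epoch):
    """Inverse-time decay: the learning rate shrinks as the epoch count grows."""
    return alpha0 / (1.0 + decay * epoch)

# The learning rate starts at alpha0 and falls steadily over training.
for epoch in (0, 1, 10, 100):
    print(epoch, decayed_learning_rate(alpha0=0.1, decay=0.1, epoch=epoch))
```

Note that the decay factor is itself another value that has to be chosen by hand, which is part of the motivation for the adaptive methods discussed later.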

#### Limitation 2 - Meandering Towards the Minimum 

One of the biggest limitations of the Gradient Descent algorithm is that it will always progress in the direction of steepest descent even if that descent isn't targeted at the minimum. This is best illustrated with an elliptical 2D error surface. 

<!-- direction.png --> 
<img width="500" src="https://drive.google.com/uc?id=1iwO7sn6-pP7uViANBr1fXhxuuNhFGa-k"/>

Gradient Descent, when applied strictly, would likely take a step in the direction of the red arrow, since at the start point the steepest descent is straight downhill towards the middle of the valley. However, from our global perspective we can see that a step in the direction of the black arrow would actually bring us closer to the global error minimum. 
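
We can put a number on this mismatch with a small sketch. The elliptical surface $J(\theta) = \frac{1}{2}(\theta_{1}^{2} + 25\,\theta_{2}^{2})$ and the start point used here are hypothetical choices for illustration and are not the surface shown in the figure.

```python
import numpy as np

def grad_J(theta):
    """Gradient of the assumed surface J = 0.5 * (theta_1^2 + 25 * theta_2^2)."""
    return np.array([theta[0], 25.0 * theta[1]])

theta = np.array([10.0, 1.0])        # hypothetical start point
steepest = -grad_J(theta)            # direction gradient descent will move in
to_minimum = -theta                  # direction straight to the minimum at (0, 0)

cos_angle = steepest @ to_minimum / (
    np.linalg.norm(steepest) * np.linalg.norm(to_minimum)
)
angle_deg = np.degrees(np.arccos(cos_angle))
print(angle_deg)  # angle between the two directions, in degrees
```

On this surface the two directions differ by well over 45 degrees, so most of each step is spent crossing the valley rather than moving along it.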

#### Limitation 3 - Getting Stuck in Local Minima 
The third serious consideration is the tendency of Gradient Descent to get stuck in local minima in our error surface. For real world learning problems our error surface is not going to be a strictly convex surface over our parameters. Instead it will be a far more complex error surface that is puckered by many local minima. While there may be a global minimum, it may be very far away, and there may be a local minimum which we would pass through before reaching it. This is illustrated in the simple case below, where we see that the path from a global maximum to the global minimum would be intercepted by a local minimum. 

<!-- errors.png --> 
<img width="500" src="https://drive.google.com/uc?id=1j67_jyVug2OmtUEnplfWB3c5rmprYYKm"/>

Local minima are an intrinsic problem for optimization algorithms - and not just the gradient descent algorithm. However there are mechanisms that allow us to avoid getting stuck in local minima. These typically rely on ideas such as momentum which should allow us to carry on past a minimum for a while but fall back to that minimum if no significantly further downward slope is detected. 

### Aside - notation 
In the next few sections we will consider some alternatives to vanilla Gradient Descent. Before moving on we note for reference that we can rewrite our weight adjustment operation in vector notation as follows, where $\nabla$ (del, or nabla) gives the gradient of a function of several variables: 

\begin{equation}
\theta = \theta - \alpha \nabla J(\theta) 
\end{equation}

This notation is used in a number of places, including some papers and Wikipedia pages, and should be kept in mind as an alternative to the partial derivative operator $\partial$ which we have been using. 

## Stochastic Gradient Descent 

The first important variant on the vanilla Gradient Descent algorithm is Stochastic Gradient Descent (SGD). This method is a very simple variant on the vanilla algorithm which says that we should update our weights in response to individual training case errors rather than waiting to make our adjustment on a mean over all training cases. Generally we should also randomize the order of our training cases between epochs in order to avoid entering in cycles over the application of training updates. 

The SGD algorithm is a simple variant on our vanilla gradient descent algorithm where we focus on a case by case basis rather than averaging over all training cases. This is summarized as follows: 

Repeat Until Convergence {

  > Randomly Shuffle Training Data

  > for i in range(Training Data) {

    > perform simultaneously for all features {

\begin{equation}
\theta_{j} = \theta_{j} - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) \, x_{j}^{(i)}
\end{equation}

          }
  
  > }
  
}

The result of this approach to training is that we take a slightly more chaotic path towards the error surface minimum as our trajectory will be highly influenced by individual cases. Empirically though it has been found that this often results in better training as averaging out over individual cases isn't always advantageous. This stochastic path is illustrated below in comparison to a conventional averaged path: 

<!-- sgd.png --> 
<img width="500" src="https://drive.google.com/uc?id=1jSP7l0MAqrPkWi5k0fzOcZSITFHrx-EU"/>

In this context vanilla Gradient Descent is often referred to as full Batch Gradient Descent. It is worth noting that mini-batch gradient descent is an important variant on Stochastic Gradient Descent in that it provides a compromise between full Stochastic Gradient Descent and Vanilla Gradient Descent.
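
The relationship between these three variants can be sketched in a single function. In the sketch below, `batch_size=1` gives Stochastic Gradient Descent, `batch_size=m` gives full Batch Gradient Descent, and anything in between gives mini-batch gradient descent. The function name `minibatch_sgd`, the toy data, and the hyperparameter values are assumptions for illustration only.

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, batch_size=1, n_epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression.

    batch_size=1 is plain SGD; batch_size=len(X) is full-batch descent.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # indices of this mini-batch
            errors = X[idx] @ theta - y[idx]
            gradient = X[idx].T @ errors / len(idx)
            theta = theta - alpha * gradient       # update after each batch
    return theta

# Toy data generated from y = 1 + 2x without noise (illustrative only).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_sgd = minibatch_sgd(X, y, alpha=0.5, batch_size=1, n_epochs=500)
theta_mini = minibatch_sgd(X, y, alpha=0.5, batch_size=5, n_epochs=500)
print(theta_sgd, theta_mini)  # both should be close to [1, 2]
```

On this noise-free toy problem all batch sizes reach the same answer; the differences between the variants show up in the path taken and in the cost of each update.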


## Adaptive Learning Rates

While SGD gives great improvements over GD and is the backbone of all modern GD approaches, one of the biggest jumps in optimization performance has come from the use of Adaptive Learning Rates. Here, instead of using a single value for $\alpha$, we use individual learning rates for each parameter in our model. The idea is that it makes sense to vary some parameters a lot more during the training process than others. The mechanism we use to decide which parameters should change and by how much is addressed differently in a number of recent adaptive learning rate algorithms. 

Note that the maths in the notes below is not examinable but given to you to help those who are interested understand the key concepts. If this is not important to you at this point, remember two key points:

 1. It is a good idea to have a different learning rate for each individual parameter
 2. It is a good idea to use momentum, so that the adjustments we make in the current time step are influenced by our last set of adjustments 

### AdaGrad

AdaGrad, the Adaptive Gradient Algorithm, is one of the best known examples of an adaptive learning rate algorithm. AdaGrad is a relatively clean extension of the Stochastic Gradient Descent algorithm where individual learning rate adjustments are a product of a global learning rate and an individual adjustment factor extracted from the derivative of our error surface. 

The weight update rule for AdaGrad can be simplified to the following form where the update for parameter j, $\Delta \theta_{j}$, is given by: 

\begin{equation}
\Delta \theta_{j} = - \frac{\eta}{\sqrt{\sum_{\tau = 1}^{t} g^{2}_{\tau,j}}} \, g_{t,j}
\end{equation}

where $\eta$ is the base learning rate (analogous to $\alpha$ in SGD), and $g_{\tau,j}$ is the gradient of the cost function with respect to parameter $j$ at training step $\tau$. 

The core of this definition, $g$, at a given point is simply given by:

\begin{equation}
g_{j} = \frac{\partial \; J(\theta)}{\partial \theta_{j}}
\end{equation}

i.e., $g_{j}$ is the partial derivative of our cost function. 

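Stepping back, we see that the key difference between the AdaGrad update and a standard SGD update is that the update for an individual weight is scaled by the inverse of the l2-norm (Euclidean norm) of the history of gradients seen so far for that specific parameter j. Parameters that have received large gradients therefore have their effective learning rate reduced more quickly than parameters that have received small gradients. 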
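
The AdaGrad update can be sketched in a few lines of NumPy. The function name `adagrad_step` and the small constant `eps` are assumptions: `eps` does not appear in the simplified formula above and is added here only to avoid dividing by zero.

```python
import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    """One AdaGrad update; returns the new theta and the running sum of g^2."""
    grad_sq_sum = grad_sq_sum + grad ** 2               # sum over tau of g_{tau,j}^2
    theta = theta - eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum

theta = np.array([1.0, 1.0])
grad_sq_sum = np.zeros_like(theta)
grad = np.array([10.0, 0.1])   # one steep direction, one shallow direction

theta, grad_sq_sum = adagrad_step(theta, grad, grad_sq_sum)
print(theta)
```

Although the two gradients differ by a factor of 100, on the very first step both parameters move by approximately $\eta$, because each gradient is divided by its own accumulated magnitude.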

## Momentum Based Methods

In momentum based optimization algorithms the update we make to a given weight is not simply based on the gradient at the current time (or a history of recent gradients), but is also based on the last made adjustment. Therefore if a large adjustment was made in the $t-1$ step, we generally expect the $t$ step to carry forward a large amount. Thus we have an analogy with momentum where once we have some speed behind us it is hard to slow down. 

Formally we structure momentum based weight updates as follows:

\begin{equation}
\Delta \theta_{t} = \rho \Delta \theta_{t-1} - \eta g_{t}
\end{equation}

where $\rho$ is the momentum coefficient (a value between 0 and 1 that controls how quickly the influence of earlier updates decays), $\Delta \theta_{t-1}$ is the update made during the last iteration, and all other terms are as used in the notes above. 

We can see therefore that there are two elements to an update to the weights. As in the gradient descent updates above, the second term, $-\eta g_{t}$, gives a weight update based on the current gradient and the overall learning rate $\eta$. The first term is the momentum term, through which some influence of the last weight update is carried into the current weight update. 
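
A minimal sketch of this update rule follows. The function name `momentum_step` and the values $\eta = 0.1$ and $\rho = 0.9$ are assumptions chosen for illustration.

```python
def momentum_step(theta, grad, prev_update, eta=0.1, rho=0.9):
    """One momentum update: returns the new parameter and the update applied."""
    update = rho * prev_update - eta * grad   # Delta theta_t
    return theta + update, update

# With a constant gradient of 1.0, each update is larger than the last.
theta, update = 0.0, 0.0
updates = []
for _ in range(3):
    theta, update = momentum_step(theta, grad=1.0, prev_update=update)
    updates.append(update)
print(updates)
```

The growing step size under a constant gradient is the "speed" of the momentum analogy: repeated pushes in the same direction accumulate rather than each being applied in isolation.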

### RMSProp

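RMSProp (Root Mean Square Propagation) is one of the best known optimizers to make use of a running, exponentially decaying history. Note that what it carries forward from step to step is an average of squared gradients rather than the previous update itself, so it is also commonly described as an adaptive learning rate method. The overall form of the parameter update in RMSProp is straightforward and is similar to that for AdaGrad: 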

\begin{equation}
\Delta \theta_{t} = - \frac{\eta}{\sqrt{v(\theta,t)}} \, g_{t}
\end{equation}

The key to this update is however the moment estimate $v(\theta,t)$, which is defined by: 

\begin{equation}
v(\theta,t) = \gamma \; v(\theta,t-1) + (1-\gamma) g_{t}^{2}
\end{equation}

where $\gamma$ allows us to set a balance between the influence of earlier squared gradients and that of the current squared gradient. In the limiting case, by setting $\gamma = 0$ we remove any influence from earlier steps, so that $v(\theta,t) = g_{t}^{2}$ and each gradient is divided by its own absolute value. The update then depends only on the sign of the gradient and has a fixed magnitude of $\eta$. 
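
The two RMSProp equations can be sketched as follows. The function name `rmsprop_step`, the default values, and the small constant `eps` (added only to avoid dividing by zero) are assumptions and not part of the equations above.

```python
import numpy as np

def rmsprop_step(theta, grad, v, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; returns the new theta and the new running average v."""
    v = gamma * v + (1.0 - gamma) * grad ** 2        # v(theta, t)
    theta = theta - eta * grad / (np.sqrt(v) + eps)  # Delta theta_t
    return theta, v

theta = np.zeros(2)
v = np.zeros(2)
grad = np.array([4.0, -0.5])

# Limiting case gamma = 0: v is just the current squared gradient.
theta_limit, v_limit = rmsprop_step(theta, grad, v, eta=0.01, gamma=0.0)
print(theta_limit)
```

In the $\gamma = 0$ case each parameter moves by approximately $\eta$ in the direction opposite to the sign of its gradient, regardless of how large that gradient is.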

### Adam Optimizer
The Adam optimizer is an extension of the RMSProp optimizer that includes the influence of both first and second moments of the error gradient. The name ``Adam`` is derived from Adaptive Moment Estimation, although it is not written as an acronym. As with Adaptive Gradient methods, Adam keeps a separate learning rate for each network parameter and adapts that rate over the course of training. As mentioned above, the adaptive learning rate is determined from estimates of the first and second moments of the error gradients. 

Adam combines the advantages of two optimizers that we considered above. Like AdaGrad, the Adam optimizer maintains an individual learning rate for each parameter, and as with RMSProp, these rates are based on a running average of recent gradient magnitudes. Specifically, Adam makes use of a running average of the error gradient (the first moment) and a running average of the square of the error gradient (the second moment). 
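
These notes do not give the Adam equations, so the sketch below follows the update as it is usually stated in the original Adam paper, including its bias correction step. The function name `adam_step`, the toy cost function $J(\theta) = \sum_{j} \theta_{j}^{2}$, and the learning rate used in the loop are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns new theta, m and v."""
    m = beta1 * m + (1.0 - beta1) * grad         # running average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # running average of its square
    m_hat = m / (1.0 - beta1 ** t)               # bias correction: m and v start at 0
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise the toy cost J(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.01)
print(theta)
```

The first moment plays the role of the momentum term seen earlier, while the second moment plays the role of the RMSProp scaling factor.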

## Quasi-Newton Methods: L-BFGS

All the examples above are based on extensions of Stochastic Gradient Descent. However there are a wide range of optimizers that are based on entirely different approaches to optimization. 

The set of (quasi-)Newton approaches to optimization is based on an iterative approach to finding the solutions of equations. In the case of optimization we aim to find the roots of the derivative of the error function, i.e., the points where $\nabla J(\theta) = 0$. Newton's method for finding the roots of simple single-variable functions is commonly taught in introductory calculus. Applying the full Newton method to functions of many variables, as we require in machine learning, is in practice more complicated and requires the calculation of the full Hessian of the error function (equivalently, the Jacobian of its gradient). 
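
For a single parameter the idea can be sketched very compactly: we repeatedly divide the first derivative by the second derivative, which is the one-dimensional stand-in for the Hessian. The function name `newton_minimize` and the toy cost $J(\theta) = e^{\theta} - 2\theta$ (with its minimum at $\theta = \ln 2$) are assumptions chosen for illustration. This shows plain Newton's method only, not L-BFGS.

```python
import math

def newton_minimize(first_deriv, second_deriv, theta, n_iters=20):
    """Newton's method applied to J'(theta) = 0 for a single parameter."""
    for _ in range(n_iters):
        theta = theta - first_deriv(theta) / second_deriv(theta)
    return theta

# Toy cost J(theta) = exp(theta) - 2 * theta.
theta_star = newton_minimize(
    first_deriv=lambda th: math.exp(th) - 2.0,   # J'(theta)
    second_deriv=lambda th: math.exp(th),        # J''(theta)
    theta=0.0,
)
print(theta_star, math.log(2.0))
```

Note that no learning rate appears anywhere: the second derivative sets the step size. The price is that in $n$ dimensions the second derivative becomes an $n \times n$ matrix, which is what motivates the approximations discussed next.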

Quasi-Newton methods are a set of methods which allow us to iteratively find the solution when the full Hessian is not available or is too expensive to compute, by building up an approximation to it from successive gradients. The specifics of how quasi-Newton methods are applied are not necessary to consider here, except to say that the best known example of a quasi-Newton optimizer or solver is L-BFGS (limited-memory BFGS). 

While L-BFGS is a widely respected optimizer which often outperforms plain Stochastic Gradient Descent, it is computationally more expensive per step and is often outperformed by the more sophisticated SGD variants such as RMSProp and AdaGrad. As such L-BFGS is not implemented in the TensorFlow core, though it is available in custom code through a link into a scipy implementation. 

## What should I use? 
There are many good online discussions on the relative merits of different optimizer uses in Deep Learning. In short, start with Adam for non-recurrent classification problems such as image classification, but use AdaGrad for problems based around recurrent architectures. 

Of course in practice you may want to try multiple options at the start with a small dataset before you commit to one optimizer or another. 

# Hardware Centric Optimization (for Training and Inference)

One of the main reasons it took 50 years for Neural Networks to really come into their own was that they are computationally expensive to train. By their nature, neural networks have many different parameters that need to be updated during each training batch. In practice, working through these computations was just too costly for many years, even with the speedup provided by improvements in optimizers, activation functions etc.  

Rather than relying on the CPU to sequentially apply all operations for the training and inference processes, parallelisation can instead be applied to perform many of these operations in parallel. Due to the vectorised nature of the underlying mathematics, these parallelisations could often be highly efficient, reducing 100s or even 1000s of operations to a small number of these vectorised functions.  
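
The effect of vectorisation can be seen even without a GPU. The sketch below computes the output of one dense layer twice: once with explicit loops, one multiply-add at a time, and once as a single matrix-vector operation. The function names and the layer sizes are hypothetical, and NumPy on a CPU is used here only as a stand-in for the kind of vectorised operation that parallel hardware accelerates.

```python
import numpy as np

def dense_layer_loop(W, x, b):
    """Sequential version: one multiply-add at a time."""
    out = [0.0] * len(b)
    for i in range(len(b)):
        total = b[i]
        for j in range(len(x)):
            total += W[i][j] * x[j]
        out[i] = total
    return out

def dense_layer_vectorised(W, x, b):
    """Vectorised version: the same arithmetic as one matrix-vector product."""
    return W @ x + b

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))   # hypothetical layer: 128 inputs, 64 outputs
x = rng.normal(size=128)
b = rng.normal(size=64)

looped = dense_layer_loop(W, x, b)
vectorised = dense_layer_vectorised(W, x, b)
print(np.allclose(looped, vectorised))
```

The loop version performs 64 x 128 individual multiply-adds in sequence, while the vectorised version expresses the same work as a single operation that the underlying library or hardware is free to parallelise.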

While many hardware platforms could in principle be used to achieve parallelism, the game changer really came about through the application of commodity GPU (Graphics Processing Unit) hardware. GPUs or graphics cards are designed to apply basic elemental operations to vectors of numbers to perform visual transformations, compute shading etc. The hardware in practice could easily be built upon to perform elemental DNN operations.  

If you are reading these notes in a Colab notebook, you already have access to custom hardware, including GPUs, that is ideal for speeding up the Deep Learning process. In this section we will briefly give an overview of those hardware options and how they are applied both for Training and run-time Inference. 

## GPUs for Training 

The primary hardware for training Deep Learning models at the time of writing is the GPU or Graphics Processing Unit. The GPU is an important part of any modern computer and can be thought of as a sub-processor that is specialised for processing graphics for displays. One of the benefits of a GPU over a CPU is it is designed to do lots of operations in parallel quickly -- consider for example the need to adjust lighting on a whole lot of pixels in parallel. This inherent parallelism and ability to operate on vectors of numbers in parallel is the main advantage of GPUs over CPUs for our purposes.  

<!-- GPU vs CPU -->
<img width="500" src="https://drive.google.com/uc?id=1_F2qa8w6wCjEeTOPeczLW_g5QWZTOFTy"/>

GPUs generally have a shared memory where the network graph (i.e., the operations, tensors etc.) is loaded, and a large number of parallel processing units that have access to this shared memory. It is these factors that primarily separate one GPU from another. However, it is not the case that all GPUs at the same price point are created equal. The brand of GPU card is a major consideration. NVIDIA have until recently been the only mainstream GPU manufacturer to support GPU use for deep learning. The main mechanism for this support was the NVIDIA drivers and the CUDA library for numerical computing that was built upon them. This meant that even if you had a top of the line AMD graphics card, you could not use it for TensorFlow training. It is worth noting though that things have started changing a little. For example, the Apple M1 chip when launched was billed as a major step forward in Apple's support of Deep Learning model training on its own hardware. This support was initially focused on TensorFlow, but support for PyTorch also came in 2022. 

AMD have also released ROCm as a library to support numerical computing on their devices. As of 2023, however, its usage is still limited due to problems with driver support for different TensorFlow and PyTorch versions etc. (If anyone has gotten it working and wants to report their experiences, drop me a line!)

Beyond the issue of brand itself, GPUs can vary vastly in power. There is a world of difference between an entry level Nvidia GPU and their premium lines. As of 2023, the state-of-the-art consumer focused GPUs for Deep Learning are typified by the Nvidia RTX 4080 and RTX 4090. Premium GPUs are characterised by: (a) a large amount of video memory; and (b) a large number of specialist parallel processing units that perform the actual calculations. The memory size of a GPU is of particular importance in Deep Network training. Essentially the GPU memory size becomes an upper limit on the size of model that can be loaded into memory for training purposes. The memory required includes not only all of the model parameters, but also the intermediate values (activations) computed for every example in a training batch, and this second part grows in proportion to the batch size. So training with a batch size of 32 will require roughly 32 times more activation memory than training with a batch size of just 1.  

The number of computational units (e.g., Nvidia’s Tensor Cores) is less of a direct limitation on the training process, and is instead an influence on how fast training can be performed. Putting it simply, the more ‘cores’ that a GPU has available, the more parallel operations can be performed. Newer architecture GPUs will also tend to have newer core types. These newer core types often allow more complex basic operations to be performed directly in the hardware. Again, the more complex a basic hardware operation can be, the greater the speedup that can be achieved.  

It is notable that for very large model types (such as large transformers) it may not be possible to fit the entire model in the memory of a single GPU. Instead the model must be distributed over multiple GPUs. Many non-consumer focused GPUs from Nvidia allow multiple GPUs to be connected directly together so that larger models can be trained.  

Unfortunately availability and subsequently pricing on high-end devices has become a challenge due to a combination of factors including supply chain disruption caused by the COVID-19 pandemic and the high demand for GPU devices for mining certain crypto currencies.  

## Inference Focused Hardware 

While training deep learning models traditionally requires large GPUs, the actual use of a network at run-time (or inference time) is usually much less computationally expensive and as such can be achieved on a variety of different hardware platforms.  

The computational cost is reduced because inference needs only a single forward propagation for a given input. This is in contrast to the many forward and backward propagations that we usually have to perform for neural network training.  

Given this reduction in complexity, direct inference on a CPU is possible. However since all operations now need to be performed sequentially, the processing time for an individual model can still be non-trivial. In terms of practical limitations, arguably the most important consideration is that enough memory (RAM) be available to fully load a model. Beyond this, the clock speed of the CPU will determine the processing time in practice for a given input. It is worth remembering that this can be a non-trivial consideration given the considerable speed differences across CPU architectures.  

Specific hardware solutions have however been developed that sit between the full power of the GPU and the limited CPU. Arguably the most famous of these are the Jetson platforms developed by Nvidia. The Jetson range of devices is designed for AI inference tasks and provides parallel processing power and access to shared memory resources, which makes these devices well suited to ‘Edge’ computing scenarios.  

Jetson devices are, strictly speaking, combined memory / CPU / GPU modules without any typical motherboard elements like USB or ethernet connectivity, power supplies, or storage. Instead these capabilities are provided by ‘carrier boards’ into which the Jetson device can be plugged (in a way that is similar to how a CPU or RAM can be plugged into a traditional motherboard). It is worth noting that Nvidia do produce so called Developer Kits for the Jetson platforms, but these are not intended for end system deployments and developers are encouraged to source alternatives.  

<!-- Nvidia device -->
<img width="500" src="https://drive.google.com/uc?id=1UR9H6MTP-ES5FrycSD3ovf9kqKEX-h-g"/>

The most famous of the Jetson devices is the entry level Jetson Nano (which itself comes in two variants as of Spring 2022). While the Jetson Nano platforms are in theory sub 150 euro in price, availability challenges have meant that when they are available, their actual retail price is often greater than this. In addition to the Jetson Nano, there are many more complex Jetson platforms available for more sophisticated processing tasks. These include the Jetson Xavier NX and the Jetson AGX. In Spring 2022 Nvidia also launched the Jetson Orin as their highest rated machine in the current line-up, with a price point to match of approximately 2000 euro for the developer kit.  

It is worth understanding that a key difference between the computational configuration of these Jetson devices and standard GPU hardware is that they will often have lower precision operators on board. This means for example that a device might only be capable of 16-bit floating point operations rather than the 32-bit or 64-bit operations that we are used to. These devices often even include specific hardware to perform integer-only operations, but in a way that is highly optimised.  

The side effect of this is that models cannot in general be run both directly and efficiently on these more limited hardware platforms. In addition, some operations that are assumed to be atomic within one neural network library might not be supported on the inference hardware. In practice then some level of model transformation is required to take a trained model and ‘port it’ over onto the target hardware. There are many ways in which this can be done. TensorFlow for example has for many years had TensorFlow Lite available as an alternative lightweight deployment library. In simple terms, a fully trained TensorFlow model can be transformed into a more lightweight model that has simplifications for the target hardware. Compromises often exist in the transformation. The most obvious of these is that the precision of a model may be reduced due to the conversion of the model to 16-bit float or lower.  
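
The precision cost of such a conversion can be illustrated with plain NumPy. The sketch below converts a vector of randomly generated stand-in "weights" to 16-bit floats and to 8-bit integers and measures the worst-case error in each case. The helper names `quantize_int8` and `dequantize` are hypothetical, and the simple symmetric scaling scheme used here is an assumption for illustration; it is not a description of how TensorFlow Lite or any other converter actually performs quantization.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values from the stored integers."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # stand-in for trained weights

w16 = w.astype(np.float16)                     # 16-bit float copy
q, scale = quantize_int8(w)                    # 8-bit integer copy

err16 = np.max(np.abs(w - w16.astype(np.float32)))
err8 = np.max(np.abs(w - dequantize(q, scale)))
print(err16, err8)
```

The 8-bit version needs a quarter of the storage of the 32-bit original, at the cost of a larger rounding error per weight. Whether that error matters for the accuracy of a given model has to be checked empirically.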

While TensorFlow Lite is one of the best known lightweight model implementations, there are many others. Covering all of these is beyond our scope here, but one notable case is ONNX (Open Neural Network Exchange), a format originally introduced by Facebook and Microsoft. ONNX is designed in part as an intermediate format for converting neural networks between various training and inference platforms. In short, the idea is that a model can be trained on any platform and then converted over to another platform via ONNX. There is also an ONNX Runtime available (including for example for the Jetson Nano), which means that the Nano can in principle run any model that can be converted to ONNX.  

<!-- ONNX --> 
<img width="500" src="https://drive.google.com/uc?id=1QVutSMNLrNK5cGJ7kdN6vG9CUj21JlBf"/>

While the Jetson Nano is one of the best known inference platforms, there are many other alternatives available, some of which perform significantly faster than the Nano. The Google offering is the Coral, a physically smaller edge-focused TPU (Tensor Processing Unit). The Google Coral has a slightly higher price point than the entry level Jetson Nano but does offer features such as Bluetooth that aren’t available on the Jetson Nano developer board without upgrades. Most significantly, the Coral is seen as providing higher processing power than the Nano across a number of benchmarks. A similar device to the Coral is the Intel Neural Compute Stick (and its related hardware underpinnings, which were originally developed by Movidius in Dublin).  

Smaller inference devices are designed expressly for phones, watches and smaller integrated devices. These components often work in the same fashion as the Google Coral and Jetson Nano but have greater limitations in terms of the size of model that can be run on the hardware and the number of optimized elemental operations (see the discussion of cores above) that can be used at inference time.  

This is an area that changes frequently, so please see the lecture notes for visuals and benchmarks associated with this section. I do try to keep these up to date.  

### Links
https://en.wikipedia.org/wiki/Gradient_descent

https://towardsdatascience.com/gradient-descent-vs-neuroevolution-f907dace010f

https://arxiv.org/pdf/1808.06332.pdf