# Important details of GPU programming

It is wise to study the general principles of GPU programming before delving into deep learning frameworks.
This notebook is loosely based on the following sources:

* [NVIDIA blog. Inside Pascal](https://developer.nvidia.com/blog/inside-pascal/)
* [NVIDIA blog. Unified Memory in CUDA 6](https://developer.nvidia.com/blog/unified-memory-in-cuda-6/)
* [NVIDIA blog. CUDA Refresher: The CUDA Programming Model](https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/)
* [A. Minnaar. CUDA Grid-Stride Loops: What if you Have More Data Than Threads?](http://alexminnaar.com/2019/08/02/grid-stride-loops.html)
* [NVIDIA blog. Using Tensor Cores in CUDA Fortran](https://developer.nvidia.com/blog/using-tensor-cores-in-cuda-fortran/)
* [PyTorch blog. Accelerating PyTorch with CUDA Graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/)
* [JAX reference documentation. Asynchronous dispatch](https://jax.readthedocs.io/en/latest/async_dispatch.html)
* [TensorFlow resources. XLA: Optimizing Compiler for Machine Learning](https://www.tensorflow.org/xla)
* [TensorFlow resources. XLA Architecture](https://www.tensorflow.org/xla/architecture)
* [Cloud TPU documentation. System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm)
* [J. Hui blog. TensorFlow with multiple GPUs](https://jhui.github.io/2017/03/07/TensorFlow-GPU/)
* [T. Mayeesha. Introduction to Tensorflow Estimators](https://medium.com/learning-machine-learning/introduction-to-tensorflow-estimators-part-1-39f9eb666bc7)
* [T. Verhulsdonck. An Advanced Example of the Tensorflow Estimator Class](https://towardsdatascience.com/an-advanced-example-of-tensorflow-estimators-part-1-3-c9ffba3bff03)
* [I. Danish. Learning TensorFlow 2: Use tf.function and Forget About tf.Session](https://irdanish.medium.com/learning-tensorflow-2-use-tf-function-and-forget-about-tf-session-a8117158edd9)
* [TensorFlow resources. Estimators](https://www.tensorflow.org/guide/estimator)
* [TensorFlow resources. Migrate from Estimator to Keras APIs](https://www.tensorflow.org/guide/migrate/migrating_estimator)
* [TensorFlow resources. Keras optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
* [A. Rosebrock. Easy Hyperparameter Tuning with Keras Tuner and TensorFlow](https://pyimagesearch.com/2021/06/07/easy-hyperparameter-tuning-with-keras-tuner-and-tensorflow/)
* [MJ Bahmani. HyperBand and BOHB: Understanding State of the Art Hyperparameter Optimization Algorithms](https://neptune.ai/blog/hyperband-and-bohb-understanding-state-of-the-art-hyperparameter-optimization-algorithms)

<center><h1>Roadmap to deep learning frameworks</h1> </center>
<center><h2>Important notions</h2></center>
<br>
<center><h3>Sven Laur</h3></center>
<center><h3>swen@ut.ee</h3></center>

## How are GPUs used to speed up computations?

<br>
<center><img src="./illustrations/4-GPU-CPU-Quad.png" alt="A possible NVIDIA Tesla P100 configuration" width="200"></center>

* CPUs and GPUs are connected by high-speed memory buses.
* Different links have different bandwidth and latency parameters.
* A data transfer from CPU to GPU may pass through OS-level buffers.

## How does data move between CPU and GPU?

<br>
<center>
<img src="./illustrations/Unified-Memory-MultiGPU-FI.png" alt="Unified memory layout for CUDA 6" width="300">
</center>    

* Modern GPU drivers use unified memory addressing.
* Data can be fetched on demand using the standard page fault mechanism.
* This makes it possible to use direct memory access and copy-on-write mechanisms.
* Modern GPU hardware has atomic operations and synchronisation mechanisms.

## What type of data is sent to GPU?

**Kernel:** A function that is executed on the GPU.
* A kernel uses many threads to evaluate the function.
* Threads are grouped into **blocks** for synchronisation. Blocks are grouped into a **grid**.
* Threads share memory, but different threads can do different computations.

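**Grid stride:** The step with which a single thread walks through the data. It equals the total number of threads in the grid, so the elements handled by one thread are spread over the whole input.
* Usually a kernel performs a single instruction on multiple data elements.
* There might not be enough threads to dedicate one thread to each data element.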
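
The grid-stride pattern can be sketched in plain Python. This is a sequential simulation under stated assumptions, not GPU code: the names `grid_stride_kernel`, `launch`, `block_dim` and `grid_dim` are illustrative and only mirror CUDA's `blockDim.x` and `gridDim.x`. Each simulated thread starts at its own index and then jumps forward by the total number of threads until the data runs out.

```python
def grid_stride_kernel(thread_id, total_threads, data, out):
    """Work done by one simulated thread: square every element it owns."""
    # Start at the thread's own index and jump by the total number of threads
    for i in range(thread_id, len(data), total_threads):
        out[i] = data[i] * data[i]


def launch(data, block_dim=4, grid_dim=2):
    """Run the kernel for every simulated thread, one after another."""
    total_threads = block_dim * grid_dim  # this is the grid stride
    out = [None] * len(data)
    for block_id in range(grid_dim):
        for local_id in range(block_dim):
            thread_id = block_id * block_dim + local_id
            grid_stride_kernel(thread_id, total_threads, data, out)
    return out


data = list(range(20))   # 20 elements but only 8 simulated threads
result = launch(data)
print(result)
```

With 8 threads and 20 elements, thread 0 handles indices 0, 8 and 16, thread 1 handles 1, 9 and 17, and so on. On a real GPU the threads would run in parallel rather than in the nested loops used here.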

**Tensor:** A memory segment that can be treated as a multi-dimensional array.
* Tensors are very common in the evaluation of neural networks.
* Modern GPUs have dedicated hardware components (Tensor Cores) for them.
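
To make the "memory segment treated as a multi-dimensional array" idea concrete, here is a minimal sketch in plain Python. It assumes a row-major (C-style) layout, which is one common convention and not the only possible one; the helper names `row_major_strides` and `flat_index` are hypothetical.

```python
def row_major_strides(shape):
    """Number of flat elements to skip per step along each axis."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def flat_index(index, strides):
    """Translate a multi-dimensional index into a flat buffer offset."""
    return sum(i * s for i, s in zip(index, strides))


buffer = list(range(24))   # one contiguous memory segment
shape = (2, 3, 4)          # viewed as a 2 x 3 x 4 tensor
strides = row_major_strides(shape)

element = buffer[flat_index((1, 2, 3), strides)]
print(strides, element)
```

The same buffer could be viewed as a 4 x 6 or 24 x 1 tensor just by changing `shape`; no data needs to move.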

## Three ways to evaluate a neural network
<br>
<center>
<img src="./illustrations/cuda-evaluation.png" alt="Benefits of using CUDA graphs" width="500">
</center>
    
* An evaluation of a neural network consists of several small steps. There are three options for the evaluation.
* **Static execution:** You must specify the entire computation before it is executed (**TensorFlow**). 
* **Dynamic execution:** You can specify operations interactively but have to wait until they are completed (**PyTorch**).
* **Asynchronous execution:** You can specify operations interactively but have to wait only if you explicitly request completion (**JAX**).
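
The difference between waiting for every operation and waiting only on request can be illustrated with an analogy from the Python standard library. This is a sketch, not how any framework is implemented: a single-worker thread pool stands in for the device, `slow_matmul` is a hypothetical stand-in for a device computation, and `future.result()` plays the role that an explicit synchronisation call such as JAX's `block_until_ready()` plays in asynchronous dispatch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_matmul(a, b):
    """Stand-in for a device computation that takes a while."""
    time.sleep(0.2)
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

with ThreadPoolExecutor(max_workers=1) as device:
    start = time.perf_counter()
    future = device.submit(slow_matmul, a, b)   # returns immediately
    dispatch_time = time.perf_counter() - start

    product = future.result()                   # blocks until the value is ready
    total_time = time.perf_counter() - start

print(product)
print(f"dispatch: {dispatch_time:.4f} s, total: {total_time:.4f} s")
```

Between `submit` and `result` the host is free to queue more work, which is the main benefit of asynchronous execution.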


## XLA: A compiler for Accelerated Linear Algebra
<br>
<center>
<img src="./illustrations/XLA_execution.png" alt="XLA execution path" width="500">
</center>

* Modern GPUs use special Tensor Cores that are optimised for vector and matrix operations.
* With the XLA compiler, a kernel is expressed in terms of these special operations.
* The XLA compiler can be used for both static and dynamic execution.
* The XLA compiler can be applied only to a subset of Python functions and libraries.

## Tensors in PyTorch
<br>
<center>
<img src="./illustrations/pytorch_logo.png" alt="PyTorch logo" width="500">
</center>

* A tensor in PyTorch is a multidimensional array on a specified device (CPU/GPU).
* You can perform operations only on tensors that reside on the same device.
* You cannot control how the tensor is split between different GPU instances.

In [1]:
import torch

# Place data on specific device 
x = torch.tensor([[1,2,3],[4,5,6]], device = 'cpu')
y = torch.tensor([[7,8,9],[10,11,12]], device = 'cpu')  # both tensors must share a device, e.g. 'mps' or 'cuda'
z = x + y

# Fetch the tensor to the main memory
print(z)

tensor([[ 8, 10, 12],
        [14, 16, 18]])


## Tensors in TensorFlow

* A tensor in TensorFlow has an execution type:
  * **Constant:** the value of the tensor remains the same throughout computations.
  * **Variable:** the value of the tensor can be overwritten during the execution (**lvalue**).
  * **Placeholder:** the value is assigned when the computation is run (**input**). **Not any more!**  
  
* TensorFlow gives you fine-grained control over which devices carry out the operations.
* Tensors used to be evaluated in the session that executed the computational graph. **Not any more!**

In [38]:
import tensorflow as tf

# Use standard memory 
with tf.device('CPU:0'):

    x = tf.constant([[1,2,3],[4,5,6]])
    y = tf.constant([[7,8,9],[10,11,12]])

# Transfer computations to GPU    
with tf.device('GPU:0'):    
    z = x + y

print(z)
print(type(z))

tf.Tensor(
[[ 8 10 12]
 [14 16 18]], shape=(2, 3), dtype=int32)
<class 'tensorflow.python.framework.ops.EagerTensor'>


## TensorFlow 1.x components
<br>
<center>
<img src="./illustrations/tensorflow_components.png" alt="TensorFlow components" width="600">
</center>    

* **Estimators:** prepackaged models with standardised training and evaluation loops (analogues of **sklearn** models).
* **Model building:** A way to define neural networks from predefined components: layers and activation functions (**keras**).
* **Optimisation:** Methods for finding near-best values for parameters and hyperparameters.
* **Instrumentation:** a way to observe what occurs during training.
* **Session:** an old way to perform a single static execution. You can use **tf.function** instead to group operations into a single graph to be evaluated. 

## Estimators vs Keras models
<br>
<center>
<img src="./illustrations/tensorflow_interface.png" alt="TensorFlow interface" width="400">
</center>

* **Estimator API** was the native way to train TensorFlow models.
* **Keras API** was subsumed by TensorFlow as it was really popular and convenient.
* Estimators are now legacy and only predefined models are useful.  **Do not create new custom estimators!**



## Defining computing kernels with @tf.function 

* One can chain TensorFlow operations into blocks by defining functions.
* Such plain Python functions are evaluated step by step, and thus there is communication overhead.
* The decorator **@tf.function** makes TensorFlow trace the function into a single graph that is compiled and run as one unit.
* The compilation might fail if the function body contains **foreign** (non-TensorFlow) functions.

In [46]:
def f():
    x = tf.constant([[1,2,3],[4,5,6]])
    y = tf.constant([[7,8,9],[10,11,12]])
    z = x + y
    return z

f()

@tf.function
def compiled_f():
    x = tf.constant([[1,2,3],[4,5,6]])
    y = tf.constant([[7,8,9],[10,11,12]])
    z = x + y
    return z

# The first execution initiates just-in-time compilation 
compiled_f()

# The second execution uses the cached code and is much faster
compiled_f()

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[ 8, 10, 12],
       [14, 16, 18]], dtype=int32)>

## Instrumentation: What is happening during training?
<br>
<center>
<img src="./illustrations/tensorboard.png" alt="Tensorboard" width="700">
</center>
    
* Keras and Estimator training code contains callback hooks for logging.
* TensorBoard is a visualisation tool, also available as a Jupyter extension, that presents this information in a graphical way.

## Parameter vs hyperparameter optimisation

* Parameter optimisation is not as important as the **data** and the **model architecture**.
* **Stochastic Gradient Descent** and **Adam** are good enough to get a baseline model.
* If not, there are many alternatives in [**keras.optimizers**](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).
* **Hyperparameters** are nasty! You cannot simply apply gradient descent to the outer problem in $\min_{\boldsymbol{h}} \min_{\boldsymbol{w}} f(\boldsymbol{w},\boldsymbol{h})$, because $\boldsymbol{h}$ is often discrete and every evaluation of the inner minimum requires a full training run.
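
The nested structure can be sketched with a toy problem. Everything here is an illustrative assumption: the model $y = w \cdot x^h$, the data, and the helper names `loss` and `fit_weight`. The parameter $w$ is found by gradient descent for each fixed $h$, while the discrete hyperparameter $h$ (the exponent) has no useful gradient and is simply searched over a small grid.

```python
def loss(w, h, xs, ys):
    """Mean squared error of the model y = w * x**h."""
    return sum((w * x**h - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_weight(h, xs, ys, lr=0.01, steps=500):
    """Inner problem: gradient descent over the parameter w for a fixed h."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x**h - y) * x**h for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


# Toy data generated by y = 2 * x**2
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2 * x**2 for x in xs]

# Outer problem: exhaustive search over the discrete hyperparameter h
results = {}
for h in (1, 2, 3):
    w = fit_weight(h, xs, ys)
    results[h] = (loss(w, h, xs, ys), w)

best_h = min(results, key=lambda h: results[h][0])
best_loss, best_w = results[best_h]
print(best_h, round(best_w, 3))
```

Each candidate $h$ costs one complete inner optimisation, which is why hyperparameter search becomes expensive for real models. In practice the outer loop would also score candidates on held-out validation data rather than on the training loss used in this sketch.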

## Hyperparameter optimisation

<br>
<center>
<img src="./illustrations/hyperparamater_optimisation.png" alt="Hyperparameter optimisation" width="700">
</center>

* Keras Tuner offers several hyperparameter optimizers. The state of the art is **Hyperband**.
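
Hyperband builds on a simpler procedure called successive halving, and the sketch below shows only that building block, not Hyperband itself and not the Keras Tuner API. The function `toy_validation_loss`, the candidate learning rates and the parameter `eta` are hypothetical choices for illustration. The idea is to give many candidates a small training budget, discard the worse ones, and spend a larger budget on the survivors.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the candidates each round and multiply the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(survivors) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_validation_loss(learning_rate, budget):
    """Hypothetical loss after `budget` epochs: gradient descent on w**2 from w = 1."""
    w = 1.0
    for _ in range(budget):
        w -= learning_rate * 2 * w
    return w * w


candidates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1.2]
best = successive_halving(candidates, toy_validation_loss)
print(best)
```

Real learning curves can cross, so a candidate that looks poor after one epoch may win later. Hyperband hedges against this by running successive halving several times with different trade-offs between the number of candidates and the starting budget.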
