## NVidia GPU Architecture (and comparisons)

Picture of the NVidia Ampere from nvidia.com

<img src="https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/ampere-architecture/dpu-a100x_front-2c50-d.jpg" width=512 />

### CUDA Architecture Evolution

Ampere <- Turing <- Volta <- Pascal <- Maxwell <- Kepler <- Fermi <- Tesla <- G70
  * G70 is the GeForce 7 series (2005)

#### Turing Architecture

<img src="https://devblogs.nvidia.com/wp-content/uploads/2018/09/image2.jpg" width=768 />

* 4,608 CUDA Cores
  * 64 cores / SM
* 672 GB/sec memory throughput
* 6 MB L2 memory
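
The core counts above imply the number of SMs and, given a clock rate, a peak arithmetic rate. A rough sketch in Python follows. The boost clock of 1.77 GHz is an assumption for illustration and does not come from these notes; the function names are made up for this example.

```python
# Back-of-the-envelope numbers for the Turing figures listed above.
CUDA_CORES = 4608         # from the list above
CORES_PER_SM = 64         # from the list above
BOOST_CLOCK_HZ = 1.77e9   # ASSUMPTION: illustrative boost clock, not from the notes
FLOPS_PER_FMA = 2         # one fused multiply-add counts as two floating-point ops


def sm_count(cores: int, cores_per_sm: int) -> int:
    """Number of streaming multiprocessors implied by the core counts."""
    return cores // cores_per_sm


def peak_flops(cores: int, clock_hz: float) -> float:
    """Peak rate if every core retires one FMA per clock cycle."""
    return cores * FLOPS_PER_FMA * clock_hz


num_sms = sm_count(CUDA_CORES, CORES_PER_SM)
peak = peak_flops(CUDA_CORES, BOOST_CLOCK_HZ)
print(f"{num_sms} SMs, peak ~{peak / 1e12:.1f} TFLOP/s")
```

Real kernels rarely reach this peak; it assumes every core issues an FMA on every cycle with no stalls on memory.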

<img src="https://devblogs.nvidia.com/wp-content/uploads/2018/09/image11.jpg" width=512 />

Properties:
  * 16.3 TFLOPS of single-precision floating point
  * 32.6 TFLOPS of half-precision floating point
  * 16.3 TIPS of integer arithmetic
  * Reduced precision tensor cores (INT8 and INT4)
    * for ML workloads that can tolerate quantization
    
What's different from a CPU?
  * High core count
  * Different use of transistor "real-estate".  Roughly
    * CPU -- ~50% of transistors in managed caches, ~20% in I/O
    * GPU -- ~80% of transistors in floating-point and integer units
  * Very small, SM-private L1 cache.
  * Small shared L2 cache.
    
CPU image of AMD Ryzen 5000 from [https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/](https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/)

<img src="https://cdn.wccftech.com/wp-content/uploads/2020/11/AMD-Ryzen-5000-Zen-3-Desktop-CPU_Vermeer_Die-Shot_1-2048x1350.jpg" width=512 />


### Properties of CUDA (all architectures)

* Fundamental unit is the CUDA core
  * Integer arithmetic logic unit (ALU)
  * Double-precision floating-point unit (FPU)
  * Fused multiply-add (FMA) instruction
  * Fully pipelined
* CUDA cores are grouped into streaming multiprocessors (SMs)
  * Each SM runs in SIMD lockstep, i.e., it behaves like a vector processor
* Observations
  * High memory throughput (now 672 GB/s) cf. 256 GB/s on Intel Skylake
  * Little cache -- essentially useless
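
To make "SIMD lockstep" concrete, here is a toy Python model of one vector unit executing a branch. It is a simplified sketch, not a description of actual NVIDIA hardware: the function name `lockstep_branch` and the example computation are hypothetical. All lanes share one instruction stream, so when lanes disagree on a condition, both sides of the branch are issued and a per-lane mask selects which lanes keep each result.

```python
def lockstep_branch(values):
    """Simulate one SIMD unit running `y = x * 2 if x is even else x + 1`.

    Returns (results, instructions_issued). Every lane executes the same
    instruction; a boolean mask decides which lanes commit the result.
    """
    # Step 1: every lane evaluates the condition.
    mask = [x % 2 == 0 for x in values]
    results = [None] * len(values)
    issued = 1

    # Step 2: the "then" side is issued if any lane needs it; other lanes idle.
    if any(mask):
        results = [x * 2 if m else r for x, m, r in zip(values, mask, results)]
        issued += 1

    # Step 3: the "else" side is issued if any lane needs it; other lanes idle.
    if not all(mask):
        results = [x + 1 if not m else r for x, m, r in zip(values, mask, results)]
        issued += 1

    return results, issued


uniform, uniform_cost = lockstep_branch([0, 2, 4, 6])      # all lanes agree
divergent, divergent_cost = lockstep_branch([0, 1, 2, 3])  # lanes disagree
print(uniform, uniform_cost)
print(divergent, divergent_cost)
```

When all lanes agree, only one side of the branch is issued. When they disagree, the unit pays for both sides, which is one reason branch-heavy code maps poorly onto this kind of processor.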

### Programming Model

<img src="https://upload.wikimedia.org/wikipedia/commons/5/59/CUDA_processing_flow_%28En%29.PNG" width=512 />

* Transfer data to accelerator
  * limited by PCIe speed, typically 8 GB/s in each direction
* Invoke remote computation
* Extract result

#### Consequences
* For CUDA to be effective, the computation must be intense w.r.t. the data.
  * 8 GB/s of transfer compared with 600 GB/s memory to processor
  * _Must_ use data values multiple times
* CUDA programs must be simple
  * FP and integer arithmetic
  * no cache hierarchy to support data reuse
  * similarly no HW support for branching and speculation (more on this later).
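
The gap between the two bandwidths can be put into numbers. The sketch below uses the 8 GB/s and 600 GB/s figures quoted above; the 1 GB payload and the helper names are assumptions chosen for illustration.

```python
PCIE_BW = 8e9        # bytes/s, host memory -> GPU memory (figure quoted above)
GPU_MEM_BW = 600e9   # bytes/s, GPU memory -> processors (figure quoted above)


def transfer_seconds(num_bytes: float, bandwidth: float) -> float:
    """Time to move num_bytes at a sustained bandwidth, ignoring latency."""
    return num_bytes / bandwidth


def reuse_to_balance(pcie_bw: float, mem_bw: float) -> float:
    """Times each transferred byte must be re-read from GPU memory before
    the time spent reading equals the time spent on the PCIe transfer."""
    return mem_bw / pcie_bw


payload = 1e9  # ASSUMPTION: a hypothetical 1 GB input
pcie_time = transfer_seconds(payload, PCIE_BW)
mem_time = transfer_seconds(payload, GPU_MEM_BW)
reuse = reuse_to_balance(PCIE_BW, GPU_MEM_BW)

print(f"PCIe transfer:   {pcie_time * 1e3:.1f} ms")
print(f"GPU memory read: {mem_time * 1e3:.2f} ms")
print(f"Reuse needed to balance: {reuse:.0f}x")
```

Under these assumptions, a kernel that touches each input value only once spends almost all of its time waiting on PCIe, which is why data reuse matters so much.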
  
Connecting back to __Roofline performance__.
  * Requires kernels with high intensity
  * CUDA has fused multiply-add -- this is a form of ILP
  * There are two off-chip transfers:
    * CPU memory -> GPU memory
    * GPU memory -> SMs
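
Because there are two off-chip transfers, there are effectively two rooflines. A minimal Python sketch follows, with intensity measured in FLOPs per byte. The peak rate of 16.3 TFLOP/s is an assumed value for a Turing-class card; the bandwidths are the 672 GB/s and 8 GB/s figures used earlier, and the function names are hypothetical.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: limited by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOPs/byte) above which a kernel becomes compute bound."""
    return peak_flops / bandwidth


PEAK = 16.3e12       # ASSUMPTION: peak FLOP/s of a Turing-class GPU
GPU_MEM_BW = 672e9   # bytes/s, GPU memory -> SMs
PCIE_BW = 8e9        # bytes/s, CPU memory -> GPU memory

ridge_mem = ridge_point(PEAK, GPU_MEM_BW)
ridge_pcie = ridge_point(PEAK, PCIE_BW)

print(f"Ridge point vs GPU memory: {ridge_mem:.1f} FLOPs/byte")
print(f"Ridge point vs PCIe:       {ridge_pcie:.1f} FLOPs/byte")

# A low-intensity kernel (1 FLOP/byte) is bandwidth bound under both roofs.
low_mem = attainable_flops(1.0, PEAK, GPU_MEM_BW)
low_pcie = attainable_flops(1.0, PEAK, PCIE_BW)
print(f"At 1 FLOP/byte: {low_mem / 1e9:.0f} GFLOP/s (GPU memory), "
      f"{low_pcie / 1e9:.0f} GFLOP/s (PCIe)")
```

The PCIe ridge point is far to the right of the GPU-memory ridge point, so data that crosses the bus should be reused many times before results are copied back.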



### Recent Developments

The changes from the original (2005) GPU architecture that have made CUDA general purpose:

* Double precision
    * graphics cards were always single precision.  Why?
* Memory system
    * L1 cache per SM
    * Global L2 cache
    * ECC (error correcting memory)
        * why does graphics not need ECC memory?
* Concurrent kernel execution
    * kernel is the name for a CUDA program
    * used to be one kernel at a time
* Multi-GPU interconnects
    * fast data transfer among GPUs
    * needed for ML training

<img src="https://en.wikichip.org/w/images/8/88/nvidia_dgx-1_nvlink-gpu-xeon_config.svg" width=512 />

### So What??

CUDA delivers FLOPs faster, more cheaply, and at lower power than CPUs.

<img src="https://www.karlrupp.net/wp-content/uploads/2013/06/gflops-sp.png" width=512 />

GPUs are everywhere:

  * on supercomputers as accelerators
  * on laptops already because they have screens
  * on phones (or architectures inspired by GPU principles) because of power
  
### System on a Chip

System on a Chip architectures have graphics (GPU) built in. These architectures will dominate the desktop and laptop markets going forward.

  * Apple M1 (shared GPU/CPU memory, so no copying)
  * Intel Alder Lake (shown below)
  * Samsung Exynos with AMD graphics, coming later in 2021

<img src="https://preview.redd.it/lyhmdzo6c3w71.jpg?width=960&crop=smart&auto=webp&s=89fa3a1cf9d1925bb5f4922333e9c902e7c75148" width=768 />

So what about GPU PCIe cards???? 
* In supercomputers
* On the cloud
  * machine learning training
  * collaborative/cloud gaming
  * rendering, video editing 
  * ????
