## Lecture 24: The Google TPU and the Evolution of Accelerators

Lecture derived mostly from [Jouppi et al. _In-Datacenter Performance Analysis of a Tensor Processing Unit_. ISCA, 2017.](https://arxiv.org/pdf/1704.04760.pdf)


### Unpacking the abstract

> Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. 

This is not at all an obvious assertion. Let's come back to it.

> custom ASIC—called a Tensor Processing Unit (TPU)—deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). 

ASIC == application-specific integrated circuit. 

This is a somewhat general term that captures a lot of different chips. The hardware is specifically designed for an individual task, rather than serving as a general-purpose computing platform for many programs. In this case, it means that Google built a new, fully customized chip, i.e. defined from the photolithography up.

_Inference phase_: This is important. Neural networks operate in two modes:
  * training: run examples through the neural network, backpropagate the error, and use gradient descent to update the weights (and perhaps a bunch of other stuff)
  * inference: evaluate an input on a trained neural network to make a classification or regression decision
  
Training has always been the limiting factor in NN scalability. Researchers focus on training performance, in pursuit of the highest accuracy NN.

Google has a different problem. They service many inference requests on already trained models. It is a simpler computation, but it is being done at massive scale.

> The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).

* TOPS -- they are not FLOPS because they are 8-bit
* 92T compares with <3T for GPU or CPU at the time
* it is a 256x256 array for 2-d MM
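
The 92 TOPS figure can be checked with a little arithmetic. This is a minimal sketch that assumes the 700 MHz clock reported in the paper and counts each multiply-accumulate as two operations (one multiply, one add). The variable names are just for illustration.

```python
# Peak throughput of the matrix unit, from first principles.
ROWS, COLS = 256, 256   # dimensions of the MAC array (from the abstract)
CLOCK_HZ = 700e6        # assumption: 700 MHz clock, as reported in the paper
OPS_PER_MAC = 2         # one multiply plus one add

macs = ROWS * COLS
peak_ops_per_sec = macs * OPS_PER_MAC * CLOCK_HZ
peak_tops = peak_ops_per_sec / 1e12

print(macs)                 # 65536
print(round(peak_tops, 2))  # 91.75
```

So the headline number is simply "every MAC busy on every cycle", rounded up to 92.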

> The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...)

If you have a predictable and simple program:
  * no branching
  * no stack (no function calls)
a simpler allocation of resources can lead to better price/power/performance.

GPU simplifies over CPU.
  * eliminates stack
  * eliminates most managed cache
  * no branch prediction or OOE
  
TPU simplifies over GPU
  * eliminates all cache
  * supports only 8-bit arithmetic ops
  * no threads, warps
  * no pipelining or interleaved execution
  
Essentially, the TPU is a coprocessor that performs one structured computation at a time.

> that help average throughput more than guaranteed latency.

This connects to the latency goal. The performance target is the time to perform a single inference. The architecture is well suited to this task. Deterministic computation is predictable and will always meet latency targets. 
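
To see why average throughput and tail latency pull in different directions, here is a small sketch with made-up numbers. The two latency lists are hypothetical: one mimics a machine that is usually fast but occasionally slow (cache misses, contention), the other a machine that always takes the same time. The `p99` helper is a plain nearest-rank percentile, not anything from the paper.

```python
import statistics

def p99(samples):
    """99th-percentile latency using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for 100 requests (made-up numbers).
variable = [4.0] * 95 + [30.0] * 5    # usually fast, occasionally very slow
deterministic = [6.0] * 100           # every request takes the same time

for name, lat in [("variable", variable), ("deterministic", deterministic)]:
    print(name, "mean =", statistics.mean(lat), "p99 =", p99(lat))
```

The variable machine wins on the mean but loses badly at the 99th percentile. If the service-level target is on the tail, the boring machine is the better choice.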

> The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power.

Not surprised. This is a remarkably limited tool. Lean and mean.

> NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. 

This is the real secret. Google has a few classes of ML models and a very specific workload. They want to run inference on neural networks interactively, i.e. a single inference at a time, with minimum latency.  

The trend of customizing HW to SW workloads is sometimes called **co-design** and is a hot topic in AI.


### The Target (OK Google)

>  people use voice search for 3 minutes a day using speech recognition DNNs would require our datacenters to double to meet computation demands

The evolution of Google's services toward voice interaction transformed their computing needs. They used to run ML in the corners of their big clusters. With voice, ML became the \#1 user of compute resources. It became worthwhile to build an entirely new chip for this specific application.


### The Secret (reduced precision)

> A step called quantization transforms floating-point numbers into narrow integers—often just 8 bits—which are usually good enough for inference.

> Eight-bit integer multiplies can be 6X less energy and 6X less area than IEEE 754 16-bit floating-point multiplies, and the advantage for integer addition is 13X in energy and 38X in area

These observations lead to a design point.
  * train models on full-precision hardware (GPUs)
  * quantize models
  * deploy models on low-precision hardware (TPUs)
  
This is exactly what Google was doing in 2017. 
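
A minimal sketch of the quantize-then-infer idea, in plain Python. It uses generic symmetric linear quantization, which is an assumption for illustration and not necessarily the exact scheme Google used. The function names `quantize` and `int_dot` are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization: floats -> signed integers plus a scale."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def int_dot(qa, qb):
    """Integer multiply-accumulate, the only arithmetic the MAC array needs."""
    return sum(a * b for a, b in zip(qa, qb))

# Made-up weights and activations for one neuron.
weights = [0.42, -1.30, 0.07, 0.88]
activations = [1.5, 0.25, -2.0, 0.6]

qw, sw = quantize(weights)
qx, sx = quantize(activations)

exact = sum(w * x for w, x in zip(weights, activations))
approx = int_dot(qw, qx) * sw * sx   # rescale the integer result once at the end

print(qw, qx)
print(exact, approx)
```

All of the multiplies and adds happen on small integers. The floating-point scales are applied once, after accumulation, and the result lands close to the full-precision answer.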



### The TPU as an Accelerator

>TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. 

<img src="./images/tpublockdiagram.png" width=512 />

> Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself.

This is a very different design point from the GPU.
  * GPU -- the host transfers a program (kernel) that the device then fetches and executes on its own
  * TPU -- the host sends each instruction over PCIe and the TPU executes it
    * an instruction buffer queues instructions to hide that latency
    * but the CPU must initiate each instruction
    
Instruction throughput is a function of input precision:
  * full speed: 8-bit weights and 8-bit activations
  * half speed: 8 and 16
  * quarter speed: 16 and 16
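
The precision rule above amounts to "each 16-bit operand halves the rate". A small sketch, assuming the 92 TOPS peak from the abstract applies to the 8-bit/8-bit case. The helper name `relative_speed` is hypothetical.

```python
PEAK_TOPS_8BIT = 92.0   # peak from the abstract: 8-bit weights and activations

def relative_speed(weight_bits, activation_bits):
    """Fraction of full speed for a given pair of operand widths."""
    if weight_bits not in (8, 16) or activation_bits not in (8, 16):
        raise ValueError("only 8- and 16-bit operands are modeled here")
    wide_operands = (weight_bits == 16) + (activation_bits == 16)
    return 1.0 / (2 ** wide_operands)   # 1, 1/2, or 1/4

for w, a in [(8, 8), (8, 16), (16, 8), (16, 16)]:
    print(w, a, relative_speed(w, a) * PEAK_TOPS_8BIT)
```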

### Silicon real estate

Layout reflects simplicity of computations. No managed caches. Little control hardware.
  * 37% of area for memory (28+ MB)
  * 30% of area for computation
  
<img src="./images/tpuareal.png" width=512 />
  
This memory is on-chip. It should not be compared with GPU memory (DRAM). Rather, it is more analogous to the combination of GPU registers, L1, and L2.

Memory is used to programmatically reuse data:
  * activations: outputs of one layer are inputs to next layer
  
Weights prefetched and streamed onto MAC:
  * TPU has an off-chip 8 GiB DRAM for weights
  * Weights are static: the model is already trained
  * Needed data is known ahead of time

### TPUs follow the CISC tradition!!!!

This is a huge departure from modern practice. The goal of modern CPU design has been to maximize instruction-level parallelism (ILP). The RISC (Reduced Instruction Set Computer) architecture pushed this agenda, starting in the 1980s. The paper summarizes it in one line:

> traditional RISC pipeline with one clock cycle per stage

CISC (complex instruction set computer) has negative connotations because variable and unpredictable instruction execution times reduce ILP. The CISC/RISC debate is 40 years old and not that informative. As we learned in the midterm, X86 instructions are not "reduced" and not "predictable".

However, X86 wants to maximize instruction level parallelism through RISC-like principles:
  * pipelines
  * out-of-order execution

And other approaches:
  * vector processing
  * fused multiply/add
  
This is not what the TPU is doing (for the most part). The TPU has 5 main instructions, each of which takes 10-20 clock cycles. The TPU does benefit from pipelining and from overlapping I/O with computation in the MAC. It does reads and writes asynchronously and it runs the activation and pool steps in parallel. The whole process is limited by the throughput of matrix multiply.
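
To make the host-driven, few-big-instructions style concrete, here is a toy functional model of one fully connected layer. The five instruction names are the ones the paper highlights. Everything else (the `ToyTPU` class, its fields, the `run` loop, the tiny data) is a hypothetical sketch and says nothing about timing or the real hardware interface.

```python
class ToyTPU:
    """Toy functional model; the buffers only loosely mirror the block diagram."""
    def __init__(self):
        self.unified_buffer = {}   # on-chip activations, keyed by name
        self.weight_fifo = []      # weights staged from off-chip DRAM
        self.accumulators = None   # result of the latest matrix multiply

    def read_host_memory(self, host, src, dst):
        self.unified_buffer[dst] = host[src]

    def read_weights(self, weight_dram, name):
        self.weight_fifo.append(weight_dram[name])

    def matrix_multiply(self, src):
        w = self.weight_fifo.pop(0)            # rows = inputs, cols = outputs
        x = self.unified_buffer[src]
        self.accumulators = [sum(x[i] * w[i][j] for i in range(len(x)))
                             for j in range(len(w[0]))]

    def activate(self, dst):
        # ReLU chosen for illustration; result goes back to the unified buffer.
        self.unified_buffer[dst] = [max(0, v) for v in self.accumulators]

    def write_host_memory(self, host, src, dst):
        host[dst] = self.unified_buffer[src]

def run(tpu, program, host, weight_dram):
    """The host walks the program and issues each instruction to the TPU."""
    for op, *args in program:
        if op == "Read_Host_Memory":
            tpu.read_host_memory(host, *args)
        elif op == "Read_Weights":
            tpu.read_weights(weight_dram, *args)
        elif op == "MatrixMultiply":
            tpu.matrix_multiply(*args)
        elif op == "Activate":
            tpu.activate(*args)
        elif op == "Write_Host_Memory":
            tpu.write_host_memory(host, *args)
        else:
            raise ValueError(f"unknown instruction {op}")

host = {"input": [1, -2, 3]}
weight_dram = {"layer0": [[1, 0], [0, 1], [-1, 2]]}   # 3 inputs -> 2 outputs
program = [
    ("Read_Host_Memory", "input", "act0"),
    ("Read_Weights", "layer0"),
    ("MatrixMultiply", "act0"),
    ("Activate", "act1"),
    ("Write_Host_Memory", "act1", "output"),
]
run(ToyTPU(), program, host, weight_dram)
print(host["output"])   # [0, 4]
```

Each instruction moves or transforms a whole block of data, which is why a handful of them is enough to describe a layer.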

Matrix multiply is done using a "systolic" array:

> A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators.

<img src="./images/tpusystolic.png" width=512 />

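Systolic is an arcane term. The idea is often traced back to WWII codebreaking machines, although the name itself comes from later work on parallel processor arrays. It essentially means that data is passed through a grid of processing units, each of which does a little work and hands its result to a neighbor. It has been questionably connected to the MISD (multiple instruction single data) taxon of Flynn's taxonomy.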

The taxonomy confuses the issues.  Once the pipeline is full, the TPU starts and finishes one 256-element vector by 256x256 matrix product every clock cycle. Think of it as a 2-d pipeline.
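
The 2-d pipeline is easier to believe after simulating one. Below is a minimal cycle-by-cycle sketch of a generic weight-stationary systolic array in plain Python. It assumes activations flow left to right, partial sums flow top to bottom, and row `i` of the input is delayed by `i` cycles to form the diagonal wavefront. The function name and register layout are hypothetical, and the real TPU differs in many details (size, weight loading, accumulators).

```python
def systolic_matmul(X, W):
    """Multiply each row of X by W on a simulated weight-stationary array."""
    n_in, n_out = len(W), len(W[0])
    batch = len(X)
    a_reg = [[0] * n_out for _ in range(n_in)]   # activation held in each cell
    p_reg = [[0] * n_out for _ in range(n_in)]   # partial sum held in each cell
    Y = [[0] * n_out for _ in range(batch)]
    cycles = batch + n_in + n_out - 2            # pipeline fill + drain included
    for t in range(cycles):
        new_a = [[0] * n_out for _ in range(n_in)]
        new_p = [[0] * n_out for _ in range(n_in)]
        for i in range(n_in):
            for j in range(n_out):
                if j == 0:
                    b = t - i                    # row i enters i cycles late
                    a_in = X[b][i] if 0 <= b < batch else 0
                else:
                    a_in = a_reg[i][j - 1]       # activation from the left
                p_in = p_reg[i - 1][j] if i > 0 else 0   # partial sum from above
                new_a[i][j] = a_in
                new_p[i][j] = p_in + a_in * W[i][j]      # one MAC per cell per cycle
        a_reg, p_reg = new_a, new_p
        # Column j of the bottom row finishes vector b at cycle b + (n_in - 1) + j.
        for j in range(n_out):
            b = t - (n_in - 1) - j
            if 0 <= b < batch:
                Y[b][j] = p_reg[n_in - 1][j]
    return Y, cycles

W = [[1, 2], [3, 4], [5, 6]]                 # 3 inputs -> 2 outputs
X = [[1, 0, 0], [1, 1, 1], [2, -1, 0]]       # three input vectors
Y, cycles = systolic_matmul(X, W)
print(Y, cycles)   # [[1, 2], [9, 12], [-1, 0]] 6
```

Every cell only ever talks to its neighbors, and after the fill time a new input vector goes in and a finished result comes out on each cycle.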




### Roofline Performance

We've already looked at this. The TPU:
  * has a roofline corner that requires high operational intensity
  * results in efficient kernels that approach the roofline
  * exceeds CPU and GPU performance on neural networks

<img src="./images/tpuroofline.png" width=512 />
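
As a reminder of how the roofline is computed, here is a minimal sketch. The 92 TOPS peak is from the abstract. The 34 GB/s weight-memory bandwidth is an assumption taken from the figure reported in the paper, and the helper names are hypothetical. Operations are counted with a multiply and an add as two separate ops, so counting a MAC as one op would halve the ridge point.

```python
def attainable_tops(ops_per_byte, peak_tops, bandwidth_gb_s):
    """Roofline: performance is capped by compute or by memory traffic."""
    memory_bound_tops = ops_per_byte * bandwidth_gb_s / 1000.0   # Gops/s -> Tops/s
    return min(peak_tops, memory_bound_tops)

def ridge_point(peak_tops, bandwidth_gb_s):
    """Operational intensity (ops/byte) where the two limits meet."""
    return peak_tops * 1000.0 / bandwidth_gb_s

PEAK_TOPS = 92.0        # from the abstract
WEIGHT_BW_GB_S = 34.0   # assumption: off-chip weight bandwidth reported in the paper

print(ridge_point(PEAK_TOPS, WEIGHT_BW_GB_S))
for intensity in (100, 1000, 10000):
    print(intensity, attainable_tops(intensity, PEAK_TOPS, WEIGHT_BW_GB_S))
```

A kernel needs thousands of operations per weight byte before the MAC array becomes the bottleneck, which is why the corner sits so far to the right.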


### What's happened since?

* 2017 TPU generation 2
  * High-bandwidth memory (600 GB/s). Push the Roofline corner left.
  * Floating point. Suitable for training as well as inference.
  * Deployed in pods: 4 TPUs x 64 modules
  
* 2018 TPU generation 3
  * 2x TOPS, 4x pods = 8x performance per pod
  
* 2018 Edge TPU
  * 8-bit, lower power, inference.
  
* 2019 Pixel 4 with Edge TPU (Pixel Neural Core)
  
* 2021 TPU v4
  * pods of 4096 TPUs
  * 10x bandwidth
  * exaflops of 16-bit floating-point operations per pod
