# Lecture 8: Processing Near Memory

*Notes*

**PIM**: Processing in Memory

<hr>

## 1 - Main Idea

We want to view memory as an **accelerator** (e.g. specialized compute capability in memory).

The goal is to place compute units *near* memory.

<u>Example:</u> 3D-stacked logic + memory, where a logic layer sits at the bottom of a stack of memory layers as part of the memory device itself (driven by the Hybrid Memory Cube consortium).

<hr>

## 2 - 3D-stacked PIM

Using 3D-stacked memory as a coarse-grained accelerator allows us to change the entire system by offloading simple functions to memory. This requires only minimal processing-in-memory support.

This can be used for large-scale graph processing (e.g. rank computation as in PageRank, which requires frequent random memory accesses but little computation). Solution: an interconnected set of 3D-stacked memory + logic chips with simple cores (the Tesseract system for graph processing).
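
To make the "frequent random accesses, little computation" point concrete, here is a minimal sketch of one rank-computation step in plain Python. It is an illustration only, not Tesseract's programming model: the function name `pagerank_step`, the adjacency-list layout, and the damping value are assumptions chosen for the example.

```python
def pagerank_step(out_edges, ranks, damping=0.85):
    """One PageRank-style iteration over an adjacency list {vertex: [neighbors]}.

    Each edge costs one multiply-free add, but the write to new_ranks[dst]
    lands at an essentially random location: the work is dominated by
    memory accesses, not arithmetic.
    """
    n = len(ranks)
    new_ranks = [(1.0 - damping) / n] * n
    for src, neighbors in out_edges.items():
        if not neighbors:
            continue  # simplification: dangling vertices contribute nothing
        share = damping * ranks[src] / len(neighbors)
        for dst in neighbors:
            new_ranks[dst] += share  # irregular, data-dependent access
    return new_ranks


# Tiny hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = [1.0 / 3] * 3
ranks = pagerank_step(graph, ranks)
```

On a real graph with billions of edges, the `new_ranks[dst]` updates rarely hit in cache, which is why moving the simple add close to the memory that holds the data is attractive.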

![tess](images/tesseract.png)

### Tesseract (2015)

<u>Advantages:</u>
- Specialized graph processing accelerator using PIM
- Large system performance and energy benefits
- Takes advantage of 3D stacking for an important workload
- More general than just graph processing

<u>Disadvantages:</u>

- Changes a lot in the system (new programming model, specialized Tesseract cores for graph processing)
- Cost
- Scalability limited by off-chip links or graph partitioning

<hr>

## 3 - 3D-stacked PIM in mobile devices

### Energy cost of data movement

<u>Observations:</u>

1. 62.7% of the total system energy is spent on data movement between the different parts of a computer (**57.3% on TensorFlow Mobile**)

2. A significant fraction of the data movement often comes from simple functions

### Operation: Packing

A reordering of the elements of matrices to minimize cache misses during matrix multiplication.

$$Matrix \Rightarrow Packing \Rightarrow Packed\,\,Matrix$$

Up to 40% of the inference energy and 31% of the inference execution time correspond to cache misses. Packing's data movement accounts for up to 35.3% of the inference energy.

> A simple data reorganization process that requires only simple arithmetic
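
As a rough illustration of what such a reordering looks like, the sketch below packs a row-major matrix into contiguous tiles. The tile-by-tile layout and the name `pack_tiles` are assumptions for the example; the notes do not specify the exact layout TensorFlow Mobile uses.

```python
def pack_tiles(matrix, tile):
    """Reorder a row-major matrix into a flat list, one tile after another.

    Elements of each tile x tile block end up adjacent in memory, so a
    blocked matrix multiplication can stream through them sequentially.
    """
    rows, cols = len(matrix), len(matrix[0])
    packed = []
    for r0 in range(0, rows, tile):          # tile row origin
        for c0 in range(0, cols, tile):      # tile column origin
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    packed.append(matrix[r][c])
    return packed


matrix = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
packed = pack_tiles(matrix, tile=2)
# packed == [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Note that the loop body only computes indices and copies values: almost all of the cost is moving data, which is what makes packing a candidate for PIM.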

### Operation: Quantization

Converts 32-bit floating-point values to 8-bit integers to improve inference execution time and energy consumption.

**PIM cores and PIM accelerators reduce energy consumption by 40-50%**.
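
A minimal sketch of one common way to do this conversion is linear (min/max) quantization, shown below. The scheme, the function names `quantize` and `dequantize`, and the sample weights are assumptions for illustration; the notes do not say which quantization scheme is used. Python floats (64-bit) stand in for the 32-bit values.

```python
def quantize(values, num_bits=8):
    """Map floats linearly onto the integer range [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    qmax = 2 ** num_bits - 1
    # Guard against a constant input, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    quantized = [round((v - lo) / scale) for v in values]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Recover approximate float values from the integer codes."""
    return [q * scale + lo for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical layer weights
q, scale, lo = quantize(weights)
restored = dequantize(q, scale, lo)
```

Each element needs one scan for the min/max and one subtract, divide, and round, so the operation is again dominated by reading and writing the whole tensor rather than by computation.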


<hr>

## 4 - PIM-Enabled Instructions (PEI)

Goal: develop mechanisms to get the most out of near-data processing with **minimal cost, minimal changes to the system, no changes to the programming model**.

1. Expose each PIM operation as a cache-coherent, virtually-addressed host processor instruction (called PEI) that operates on only a single data block.

2. Dynamically decide where to execute a PEI (host processor or PIM accelerator) based on simple locality characteristics
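
The dispatch decision in step 2 can be sketched as follows. This is a toy model under stated assumptions, not the actual hardware mechanism: the class `LocalityMonitor`, its LRU policy, the 64-byte block size, and the capacity are all hypothetical. The idea is that a PEI whose data block was touched recently is likely cached, so it runs on the host; otherwise it is sent to memory.

```python
from collections import OrderedDict

BLOCK_SIZE = 64  # bytes per cache block (assumed)


class LocalityMonitor:
    """Tracks recently accessed blocks with a small LRU set (hypothetical)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.recent = OrderedDict()

    def choose_target(self, address):
        """Return "host" if the block shows locality, else "pim"."""
        block = address // BLOCK_SIZE
        hit = block in self.recent
        if hit:
            self.recent.move_to_end(block)       # refresh LRU position
        else:
            self.recent[block] = True
            if len(self.recent) > self.capacity:
                self.recent.popitem(last=False)  # evict least recently used
        return "host" if hit else "pim"


monitor = LocalityMonitor(capacity=2)
trace = [0, 8, 4096, 16, 8192, 12288, 0]  # byte addresses of PEI operands
targets = [monitor.choose_target(a) for a in trace]
# targets == ["pim", "host", "pim", "host", "pim", "pim", "pim"]
```

Because each PEI touches a single data block, one lookup per instruction is enough to make the decision, which keeps the monitor simple.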

<u>Benefits:</u>

- Localization: each PEI is bound to one memory module
- Interoperability: easier support for cache coherence and virtual memory
- Simplified locality monitoring: data locality of PEIs can be identified simply by the cache control logic

<hr>

## 5 - Barriers to adoption of PIM

1. Functionality of PIM, and applications and software for it
2. Ease of programming
3. System support: coherence & virtual memory
4. Runtime and compilation systems for adaptive scheduling, data mapping, access/sharing control
5. Infrastructures to assess benefits and feasibility

### Challenge 1: Code Mapping

Which operations should be executed in-memory vs. CPU?

### Challenge 2: Maintain memory coherence for hybrid CPU-PIM Apps

We need a mechanism that keeps memory coherent (i.e. every access always sees the most up-to-date copy of the data).

#### CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators

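Specialized accelerators: GPUs, FPGAs, ASICs

Coherence must be maintained between near-data accelerators (NDAs) and CPUs. Traditional coherence protocols are impractical here because they would generate a large amount of off-chip coherence traffic.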

![conda](images/conda.png)

