# Lecture 9: Real PIM Systems: UPMEM
*Notes*

**PIM**: Processing in Memory using PIM-enabled instructions.

<hr>

## 1 - Main Idea

Data movement between memory/storage units and compute units is a major contributor to execution time and energy consumption.

Processing-in-Memory is a paradigm that can tackle the data movement bottleneck. Although it has been explored for 50 years, technology challenges prevented its successful materialization.

UPMEM has designed and fabricated the first publicly available real-world PIM architecture: DDR4 chips embedding in-order, multi-threaded DRAM Processing Units (DPUs).

**Takeaway:**

UPMEM-based PIM systems outperform state-of-the-art CPUs in terms of performance and energy efficiency on most PrIM benchmarks. They also outperform a state-of-the-art GPU on a majority of PrIM benchmarks. Overall, the PIM systems are more energy efficient.

<hr>

## 2 - UPMEM PIM Programming

![obs](images/obs.png)

UPMEM DIMMs coexist with conventional DIMMs. Integration of UPMEM DIMMs in a system follows an **accelerator model**. It resembles GPU computing: data movement between the main processor and the accelerator is explicit, and kernels are launched explicitly onto the UPMEM processors.

### Vector Addition

Vector addition is a data-parallel (array programming) example: each DPU processes its own portion of the arrays, which allows high parallelism without streaming the data through the CPU.
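
To make the partitioning concrete, here is a minimal Python sketch that only simulates the host-side flow (split the inputs, run one kernel per DPU, gather the outputs). It does not use the UPMEM SDK. The names `split_evenly`, `dpu_kernel`, and `vector_add` are hypothetical, and the sequential loop merely stands in for DPUs that would run in parallel.

```python
def split_evenly(n_items, n_dpus):
    """Return (start, end) index ranges that partition n_items across n_dpus."""
    base, extra = divmod(n_items, n_dpus)
    ranges, start = [], 0
    for i in range(n_dpus):
        # The first `extra` DPUs take one additional element each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def dpu_kernel(a_chunk, b_chunk):
    """Stand-in for the per-DPU kernel: element-wise addition of one chunk."""
    return [x + y for x, y in zip(a_chunk, b_chunk)]


def vector_add(a, b, n_dpus=4):
    """Simulate the host flow: scatter inputs, run kernels, gather outputs."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    result = []
    for start, end in split_evenly(len(a), n_dpus):
        # Slicing plays the role of the CPU-DPU transfer; extending the
        # result list plays the role of the DPU-CPU transfer.
        result.extend(dpu_kernel(a[start:end], b[start:end]))
    return result
```

Each chunk is independent of the others, which is why this workload maps well onto many DPUs.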

### CPU-DPU Data Transfers

There are three kinds of transfers: serial (a single DPU), parallel (multiple DPUs), and broadcast (multiple DPUs with a single buffer).

### Inter-DPU Communication

There is no direct communication channel between DPUs; all inter-DPU communication has to go through the host CPU.
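
As an illustration of routing communication through the host, the following Python sketch simulates a sum over data spread across several DPUs. Every DPU reduces its own chunk, the host collects and combines the partial results, and the total is sent back to every DPU. The function names are hypothetical and no UPMEM SDK call is involved; this is a model of the data flow only.

```python
def partial_sums(chunks):
    """Each simulated DPU reduces its own chunk locally."""
    return [sum(chunk) for chunk in chunks]


def host_allreduce(chunks):
    """Combine per-DPU results on the host and hand the total back to all DPUs."""
    # Step 1 (DPU-CPU transfer): the host retrieves one partial sum per DPU.
    partials = partial_sums(chunks)
    # Step 2 (host computation): DPUs cannot exchange values directly,
    # so the host performs the final combination.
    total = sum(partials)
    # Step 3 (CPU-DPU broadcast): every DPU receives the same value.
    return [total for _ in chunks]
```

The cost of steps 1 and 3 is paid on every exchange, so workloads with frequent inter-DPU communication are harder to scale in this model.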

### CPU-DPU/DPU-CPU Transfer Bandwidth

In the bandwidth measurements, the data transfer size varies between 8 bytes and 32 MB.

<hr>

## 3 - DRAM Processing Unit

### Arithmetic Throughput

Arithmetic throughput is measured against operational intensity; the goal is to characterize the memory-bound and compute-bound regions.

Microbenchmark: (1) load one chunk of an MRAM array into WRAM, (2) perform a variable number of operations on the data, (3) write the result back to MRAM.

**Operational Intensity** is defined as the number of arithmetic operations performed per byte accessed from MRAM.

> In a memory-bound region, the arithmetic throughput increases with the operational intensity. In the compute-bound region, the arithmetic throughput is flat at its maximum. The throughput saturation point is the operational intensity where the transition between the memory-bound region and the compute-bound region happens.
>
> The throughput saturation point is as low as $\frac{1}{4}$ operation/byte, i.e. 1 integer addition per 32-bit element fetched.
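
The two regions can be sketched with an idealized roofline-style model, in which throughput is the smaller of the compute peak and the rate at which memory can feed operations. This is a simplification for illustration, not the measured DPU curve, and the peak and bandwidth values in the test of this sketch (50e6 operations/s and 200e6 B/s) are hypothetical numbers chosen only so that the saturation point lands at 1/4 operation/byte.

```python
def arithmetic_throughput(op_intensity, peak_ops_per_s, mem_bytes_per_s):
    """Idealized throughput (operations/s) at a given operational intensity.

    op_intensity is in operations per byte accessed from memory.
    """
    memory_limited = op_intensity * mem_bytes_per_s
    return min(peak_ops_per_s, memory_limited)


def saturation_point(peak_ops_per_s, mem_bytes_per_s):
    """Operational intensity where the memory-bound region ends."""
    return peak_ops_per_s / mem_bytes_per_s
```

Below the saturation point, doubling the operational intensity doubles the throughput; above it, the throughput stays at the peak.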

### WRAM and MRAM bandwidth

#### WRAM Copy

$$\text{WRAM bandwidth in }\frac{B}{s}=\frac{\text{Bytes} \times \text{frequency}_{DPU}}{\#\,\text{instructions}}$$

**Copy** executes 2 instructions (load and store). With 11 × 16 bytes moved in 22 cycles, this gives $2800\frac{MB}{s}$ at 350 MHz.
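
The arithmetic behind the 2800 MB/s figure can be checked with a short Python sketch. It assumes that the 16 bytes are the bytes moved per load/store pair and that the factor 11 is the number of concurrent threads, which is one reading of the note above. The function name `wram_bandwidth` is hypothetical.

```python
DPU_FREQUENCY_HZ = 350e6  # 350 MHz, as stated in the notes


def wram_bandwidth(bytes_moved, dpu_frequency_hz, instructions):
    """WRAM bandwidth in bytes/s: bytes moved per instruction, times frequency."""
    return bytes_moved * dpu_frequency_hz / instructions


# Copy: 16 bytes moved by 2 instructions (one load, one store).
copy_bw = wram_bandwidth(16, DPU_FREQUENCY_HZ, 2)

# Cross-check with the aggregate numbers: 11 x 16 bytes in 22 cycles.
copy_bw_aggregate = (11 * 16) * DPU_FREQUENCY_HZ / 22

print(copy_bw / 1e6, "MB/s")
```

Both ways of counting give 8 bytes per cycle, which corresponds to 2800 MB/s at 350 MHz.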

#### MRAM Read and Write

$$\text{MRAM bandwidth in }\frac{B}{s}=\frac{\text{size} \times \text{frequency}_{DPU}}{\text{MRAM latency}}$$

We can model the MRAM latency with a linear expression: MRAM latency (in cycles) = $\alpha + \beta \times \text{size}$.
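
Combining the linear latency model with the bandwidth formula shows why larger transfers achieve higher MRAM bandwidth: the fixed cost $\alpha$ is amortized over more bytes, and the bandwidth approaches $\text{frequency}_{DPU} / \beta$. The Python sketch below uses $\alpha = 77$ cycles and $\beta = 0.5$ cycles/byte as illustrative assumptions; these values are not given in these notes.

```python
DPU_FREQUENCY_HZ = 350e6  # 350 MHz
ALPHA_CYCLES = 77.0       # assumed fixed latency per transfer (illustrative)
BETA_CYCLES_PER_BYTE = 0.5  # assumed per-byte latency (illustrative)


def mram_latency_cycles(size_bytes, alpha=ALPHA_CYCLES, beta=BETA_CYCLES_PER_BYTE):
    """Linear latency model: alpha + beta * size, in DPU cycles."""
    return alpha + beta * size_bytes


def mram_bandwidth(size_bytes, dpu_frequency_hz=DPU_FREQUENCY_HZ,
                   alpha=ALPHA_CYCLES, beta=BETA_CYCLES_PER_BYTE):
    """MRAM bandwidth in bytes/s for a single transfer of size_bytes."""
    return size_bytes * dpu_frequency_hz / mram_latency_cycles(size_bytes, alpha, beta)


for size in (8, 64, 512, 2048):
    print(size, "bytes:", round(mram_bandwidth(size) / 1e6, 1), "MB/s")
```

Under these assumptions the bandwidth grows with the transfer size and stays below the 700 MB/s asymptote given by $350\,\text{MHz} / 0.5$.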

<hr>

## 4 - PrIM Benchmarks

PrIM is a common set of workloads that can be used to evaluate the UPMEM PIM architecture, compare software improvements and compilers, and compare future PIM architectures and hardware.

Two key selection criteria: (1) workloads are selected from different application domains, and (2) workloads are memory-bound on processor-centric architectures.