# Lecture 11: Low-Latency Memory

*Notes*

<hr>

## 1 - Main Idea

One of the four key directions of the industry is **low-latency** and **predictable** memory.

One of the hardest problems in modern memory management is **latency**.

### Data-Centric Architectures: Properties

- Process data where it resides (processing in and near memory structures)
- Low-latency and low-energy data access
- Low-cost data storage & processing (high-capacity memory at low cost)
- Intelligent data management


<hr>

## 2 - Low-Latency and Low-Energy data access

In general, reducing interconnect usage and the number of data accesses also reduces energy consumption.

### Fundamental trade-offs

1. **Memory latency generally lags behind capacity and bandwidth** in terms of scaling since 1999. Memory latency has remained almost constant for two decades.

<hr>

**Note on DRAM**:

- DRAM latency is critical for performance (in-memory databases, graph/tree processing, in-memory data analytics, and datacenter workloads).
- New DRAM types have increased latency: manufacturers are developing many new types of DRAM, but DRAM still limits performance and energy improvements.
- DRAM latency remains a critical bottleneck for many applications.
- Bank parallelism is not fully utilized by a wide variety of applications.
- Spatial locality continues to provide significant performance benefits if it is exploited by the memory subsystem.
- For some classes of applications, low-power memory can provide energy savings without sacrificing significant performance.
    
Modern DRAM is not designed for low latency: its latency is set by worst-case operating conditions and worst-case devices.
    
<hr>

    
2. Memory contention keeps increasing in multi-core and heterogeneous architectures, which exacerbates the bandwidth need and the QoS problem.

3. Processor design complexity increases because of the mechanisms incorporated to tolerate memory latency.

<hr>

## 3 - Runahead Execution

Cache Miss: *A cache miss occurs when the application attempts to retrieve information from cache memory, but the given item is not actually present in the cache. This is the opposite of a cache hit, in which the system does discover the requested item in the cache.*

![runahead](images/runahead.png)

**Runahead is a pre-fetching technique to compress stall times.** Runahead execution improves processor performance by accurately prefetching long-latency memory accesses. When a long-latency load causes the instruction window to fill up and halt the pipeline, the processor enters runahead mode and keeps speculatively executing code to trigger accurate prefetches.

This leads to significant performance increases (+40% in the 2008 Sun ROCK processor).
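The effect of runahead can be illustrated with a toy cycle count. This is only a minimal sketch under strong assumptions that are not from the lecture: all loads are independent, the cache is unbounded, a hit costs one cycle, and every prefetch issued in runahead mode arrives before the blocking miss returns. The function name `run_loads` and the parameters `miss_latency` and `runahead_depth` are hypothetical.

```python
def run_loads(addresses, miss_latency=100, runahead=False, runahead_depth=8):
    """Return the total cycles needed to execute a stream of independent loads.

    Toy model (assumptions): unbounded cache, 1-cycle hits, and prefetches
    issued during runahead complete by the time the blocking miss returns.
    """
    cache = set()
    cycles = 0
    for i, addr in enumerate(addresses):
        if addr in cache:
            cycles += 1  # cache hit
            continue
        # Cache miss: the pipeline is blocked for the full memory latency.
        cycles += miss_latency
        cache.add(addr)
        if runahead:
            # While blocked, speculatively pre-execute the next loads and
            # prefetch their blocks so that they hit later.
            for future in addresses[i + 1 : i + 1 + runahead_depth]:
                cache.add(future)
    return cycles


stream = list(range(16))  # 16 loads to distinct blocks
baseline = run_loads(stream)
with_runahead = run_loads(stream, runahead=True)
print(baseline, with_runahead)
```

In this model the baseline pays the full miss latency for every load, while runahead overlaps the misses that fall inside the speculative window, so only the first load of each window stalls.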

<hr>

## 4 - Long Memory Latency: an issue of design

### Tiered Latency DRAM

DRAM chips are composed of stacked subarrays where the latency depends on the distance from a channel (subarrays are accessed through an I/O, which also incurs some latency). The subarray latency is dominant over the I/O latency.

![subarray](images/subarray.png)

There is a clear trade-off between **area (die size)** and **latency**. A shorter bitline can help mitigate this issue. 

**TL-DRAM is a substrate that can be leveraged by the hardware and/or software.**

<u>Many potential uses:</u>

1. Using the near segment as a hardware-managed inclusive cache to the far segment
2. Using the near segment as a hardware-managed exclusive cache to the far segment
3. Profile-based page mapping by the operating system
4. Simply replacing DRAM with TL-DRAM
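As a rough illustration of the first use case, the sketch below treats the near segment as a small LRU cache in front of the far segment and computes the average access latency. It is not the TL-DRAM design itself: the LRU policy, the function name `average_latency`, and the latency values `near_ns` and `far_ns` are illustrative assumptions, and the cost of copying a row into the near segment is ignored.

```python
from collections import OrderedDict


def average_latency(row_accesses, near_rows=4, near_ns=8.0, far_ns=15.0):
    """Average latency when the near segment caches far-segment rows (LRU).

    near_ns and far_ns are made-up illustrative values, not measured timings.
    """
    near = OrderedDict()  # row id -> None, ordered from least to most recent
    total = 0.0
    for row in row_accesses:
        if row in near:
            near.move_to_end(row)  # refresh recency on a near-segment hit
            total += near_ns
        else:
            total += far_ns  # served from the slower far segment
            near[row] = None  # keep a copy in the near segment
            if len(near) > near_rows:
                near.popitem(last=False)  # evict the least recently used row
    return total / len(row_accesses)


print(average_latency([1, 2, 1, 2, 1, 2, 1, 2]))  # reuse: mostly near hits
print(average_latency(list(range(8))))            # streaming: all far accesses
```

Access patterns with row reuse benefit from the fast near segment, whereas a streaming pattern sees only far-segment latency in this model.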

### LISA: Low-Cost Inter-Linked Subarrays

LISA allows fast bulk data movement between subarrays, as it relies on a **versatile substrate**.

You can perform **row buffer movement** (RBM), where a row of data is moved from an activated row buffer to a precharged one.
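A purely functional sketch of row buffer movement is given below. It models only the data flow described above (activated source buffer, precharged destination buffer) and says nothing about circuit-level behavior or timing. The class `Subarray` and the functions `row_buffer_movement` and `copy_row` are hypothetical names, and the two subarrays are assumed to be linked.

```python
class Subarray:
    """Toy subarray: a list of rows plus one local row buffer."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.row_buffer = None  # None means the row buffer is precharged

    def activate(self, row_index):
        # Latch a copy of the selected row into the row buffer.
        self.row_buffer = list(self.rows[row_index])

    def precharge(self):
        self.row_buffer = None


def row_buffer_movement(src, dst):
    """Move data from an activated row buffer to a precharged one."""
    if src.row_buffer is None:
        raise ValueError("source row buffer must be activated")
    if dst.row_buffer is not None:
        raise ValueError("destination row buffer must be precharged")
    dst.row_buffer = list(src.row_buffer)


def copy_row(src, src_row, dst, dst_row):
    """Bulk-copy one row between two (assumed linked) subarrays via RBM."""
    src.activate(src_row)
    row_buffer_movement(src, dst)
    dst.rows[dst_row] = list(dst.row_buffer)  # store the buffer into the row
    src.precharge()
    dst.precharge()


a = Subarray([[1, 2], [3, 4]])
b = Subarray([[0, 0], [0, 0]])
copy_row(a, 1, b, 0)
print(b.rows)
```

The point of the sketch is that the copied row never leaves the subarrays: it goes from one row buffer to the other without passing through the memory channel.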

### CROW: The Copy Row Substrate

### SALP: Reducing DRAM Bank Conflict Impact

**Problem**: Bank conflicts are costly for performance and energy.

**Goal**: Reduce bank conflicts without adding more banks (*low cost*)

**Observation 1**: A DRAM bank is divided into subarrays and each subarray has its own local row buffer.

**Observation 2**: Subarrays are mostly independent except when sharing global structures to reduce costs

**Key Idea**: Minimally reduce sharing of global structures: reduce sharing of the global decoder (enables almost parallel access to subarrays) and of the global row buffer (utilizes multiple local row buffers).
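The benefit of using multiple local row buffers can be sketched with a simple latency count for one bank. This is a simplified model, not the SALP mechanism as specified: it only contrasts one open row per bank with one open row per subarray, and the function name `bank_latency` and the timing values `t_rp`, `t_rcd`, and `t_cas` (in arbitrary units) are illustrative assumptions.

```python
def bank_latency(requests, per_subarray_buffers, t_rp=15, t_rcd=15, t_cas=15):
    """Total latency of (subarray, row) requests served one after another.

    per_subarray_buffers=False: one open row for the whole bank.
    per_subarray_buffers=True:  each subarray keeps its own row open.
    Timing values are illustrative, not taken from a datasheet.
    """
    open_rows = {}  # buffer key -> (subarray, row) currently held open
    total = 0
    for subarray, row in requests:
        key = subarray if per_subarray_buffers else "bank"
        current = open_rows.get(key)
        if current == (subarray, row):
            total += t_cas                  # row hit
        elif current is None:
            total += t_rcd + t_cas          # buffer empty: activate, then read
        else:
            total += t_rp + t_rcd + t_cas   # conflict: precharge first
        open_rows[key] = (subarray, row)
    return total


# Two rows in different subarrays of the same bank, accessed alternately.
requests = [(0, 1), (1, 7)] * 4
print(bank_latency(requests, per_subarray_buffers=False))
print(bank_latency(requests, per_subarray_buffers=True))
```

With a single bank-level buffer every alternation is a conflict, while with per-subarray buffers both rows stay open and later accesses become row hits.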

<hr>

## 5 - 
