## The Cache Hierarchy

* Memory is an abstraction
  * looks to the processor like a 1-d address space of data locations
  * uniform access from all cores/processors
* Actually a steep hierarchy of caches in which different levels have different:
  * Performance
  * Capacity
  * Sharing
  
<img src="https://sites.google.com/site/cachememory2011/_/rsrc/1311628836036/memory-hierarchy/hei.png" width=512 title="Cache Hierarchy" />

* Caches are a place to store a smaller amount of data that is frequently/recently used, to make data access faster.
  * Processor caches (on chip) are a cache for memory.  Managed by hardware.
  * Memory (DRAM) is a cache for pages from disk.  Managed by a storage system (database, file system).
  * Management refers to the process of loading and evicting the contents in response to workload.
  
### The Hierarchy

<img src="http://www.imexresearch.com/newsletters/images/201009_SSDImages/20100913_SSD_0000.png" width=512 title="IMEX Data on Latency and Cost" />

<img src="https://eda360insider.files.wordpress.com/2012/05/wegener-1.gif?w=1400" width=512 title="Cache latency and granularity" />

### Latency

Delays (in clock cycles) to different levels in the cache hierarchy for an i7 (Nehalem):
 * $1$ cycle to registers (private to each core)
 * $1$ cycle to L1 (private to each core)
 * $4$ cycles to L2 (private to each core)
 * $35$ cycles to L3 (shared by cores)
 * $145$ cycles to memory (shared by processors)
 * $10^5$ cycles to NVRAM
 * $10^7$ cycles to disk
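
To get a feel for these numbers, the cycle counts can be converted to wall-clock time. The sketch below assumes a 3 GHz clock, which is an illustrative figure and not something stated in these notes. The names `LATENCY_CYCLES`, `CLOCK_HZ`, and `cycles_to_ns` are made up for this example.

```python
# Cycle counts copied from the latency list in this section.
LATENCY_CYCLES = {
    "registers": 1,
    "L1": 1,
    "L2": 4,
    "L3": 35,
    "memory": 145,
    "NVRAM": 10**5,
    "disk": 10**7,
}

CLOCK_HZ = 3e9  # ASSUMPTION: 3 GHz clock, chosen only for illustration


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a delay in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    for level, cycles in LATENCY_CYCLES.items():
        slowdown = cycles / LATENCY_CYCLES["L1"]
        print(f"{level:>9}: {cycles:>9} cycles {cycles_to_ns(cycles):>12.2f} ns {slowdown:>9.0f}x L1")
```

Under that assumed clock, a trip to disk costs milliseconds while an L1 hit costs a fraction of a nanosecond, which is why the hierarchy is described as steep.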

_Data Loading_: New data that has not been used yet must come from SSD or disk, which can be very slow.

_Data Sharing_: When two threads need to share data, they incur the cost of transferring data through the fastest shared cache.
  * 2 cores on the same processor take 70 cycles (35 to write to L3 and 35 to read from L3)
  * 2 processors take 290 cycles (145 to write to memory and 145 to read from memory)

Sharing dynamics result in _interference_ between processes that share data in OpenMP and threaded programs.  This is the major source of lost parallelism in these programming models.
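
The sharing costs quoted here are one write plus one read at the fastest level both threads can reach. A minimal sketch of that arithmetic, using the cycle counts from this section; the function name `sharing_cost` and its `same_processor` flag are hypothetical.

```python
L3_CYCLES = 35       # shared by the cores of one processor
MEMORY_CYCLES = 145  # shared by all processors


def sharing_cost(same_processor):
    """Cycles to hand one piece of data from one thread to another.

    The writer pushes the data to the fastest level both threads share,
    then the reader pulls it back: one write plus one read at that level.
    """
    shared_level = L3_CYCLES if same_processor else MEMORY_CYCLES
    return 2 * shared_level


if __name__ == "__main__":
    print("same processor:      ", sharing_cost(True), "cycles")
    print("different processors:", sharing_cost(False), "cycles")
```

This is a simplified model: it ignores coherence protocol details and counts only the two transfers described in the text.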

### Processor Caching Concepts

The memory system should be thought of as a vectorized parallel system.  Whenever you 
get data, you get many words of data.  To get good memory throughput, you must 
use all that data.  Most important to understanding cache performance are:
* __cache line__: data are moved among levels in the cache one line at a time
  * as small as 64 bytes (L1 or L2)
  * think of each load as the parallel load of an entire line
  * good parallel programs will use all 64 bytes
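* __shared vs. private__: whether the cache is shared (among cores or processors) or private to one core.  The term __unified__ usually means something different: a cache that holds both instructions and data.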
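
To make the cache-line idea concrete, the sketch below counts how many 64-byte lines a strided walk pulls in and what fraction of the loaded bytes the program uses. It is a simplified model (no prefetching, stride assumed to be at least the element size). `LINE_BYTES`, `lines_touched`, and `utilization` are names invented for this example.

```python
LINE_BYTES = 64  # line size quoted in this section for L1/L2


def lines_touched(n_accesses, stride_bytes, elem_bytes=8, start=0):
    """Number of distinct cache lines touched by a strided walk."""
    lines = set()
    for i in range(n_accesses):
        first = start + i * stride_bytes
        last = first + elem_bytes - 1
        # A misaligned element can span more than one line.
        for line in range(first // LINE_BYTES, last // LINE_BYTES + 1):
            lines.add(line)
    return len(lines)


def utilization(n_accesses, stride_bytes, elem_bytes=8, start=0):
    """Fraction of the bytes loaded into cache that the program uses."""
    used = n_accesses * elem_bytes
    loaded = lines_touched(n_accesses, stride_bytes, elem_bytes, start) * LINE_BYTES
    return used / loaded


if __name__ == "__main__":
    for stride in (8, 16, 32, 64, 128):
        print(f"stride {stride:>3} bytes: utilization {utilization(1024, stride):.3f}")
```

With 8-byte elements, a stride of 8 bytes uses every byte of every line loaded, while a stride of 64 bytes or more loads a full line for each element and uses only one eighth of it.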

Other concepts that don't matter as much:
* __inclusive vs exclusive__: has implications for hardware management policies.  We don't care.
  * __inclusive__: data in higher level caches are also in lower level caches
  * __exclusive__: data in higher level caches are not in lower level caches
* __associativity__: the number of hardware locations that a cache line can go into
  * important for HW design.  We typically don't care until we get to CUDA.

### So What

* You can’t just access memory
  * Different memory access patterns result in 50x performance differences for the same computation
* Worry about:
  * Parallelism: am I using all the data in a cache line?
    * To access a single byte, one must load a whole line
    * Sequential access to memory is always parallel!
  * Sharing/reuse: is my program referencing data in the cache more than once?  At what levels?

Good memory access patterns are __aligned__, __sequential__, and __coalesced__.
  * Aligned – access range starts/ends on cache line boundaries
  * Sequential – a continuous range of bytes
  * Coalesced – combine multiple small accesses into fewer large accesses
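
These three properties can be checked mechanically for a list of accesses. The sketch below is one way to do that, assuming 64-byte cache lines and accesses given as `(offset, length)` byte ranges with positive length. The helper names `lines_in_range`, `is_aligned`, and `coalesce` are hypothetical.

```python
LINE_BYTES = 64  # ASSUMPTION: 64-byte cache lines


def lines_in_range(offset, length):
    """Cache lines covered by the byte range [offset, offset + length)."""
    return (offset + length - 1) // LINE_BYTES - offset // LINE_BYTES + 1


def is_aligned(offset, length):
    """True if the range starts and ends on cache line boundaries."""
    return offset % LINE_BYTES == 0 and (offset + length) % LINE_BYTES == 0


def coalesce(accesses):
    """Merge (offset, length) accesses that touch or overlap into larger ones."""
    merged = []
    for offset, length in sorted(accesses):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            prev_offset, prev_length = merged[-1]
            end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, end - prev_offset)
        else:
            merged.append((offset, length))
    return merged


if __name__ == "__main__":
    # The same 128 bytes cost 2 lines when aligned and 3 when shifted by 32 bytes.
    print(lines_in_range(0, 128), is_aligned(0, 128))
    print(lines_in_range(32, 128), is_aligned(32, 128))
    # Sixteen sequential 8-byte reads collapse into one 128-byte access.
    print(coalesce([(i * 8, 8) for i in range(16)]))
```

Sequential accesses are exactly the ones that `coalesce` can merge into a single range; scattered accesses stay separate and each pays for its own line.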


### A Neat Experiment

For arrays of different sizes, loop over the array repeatedly, accessing every 128th byte, on a Power 7 processor.

<img src="https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/26579cc3-66fe-42b8-baf9-1fcc88445848/page/523bd8d0-b51c-4c19-94ef-b1674779b5d8/attachment/ff9ae87f-2a1d-43d5-9c6d-916426cb4fc2/media/lat_mem_rd-1st.png?preventCache=1466696220189" width=512 title="lat_mem_rd" />

The results show:
  * arrays are accessed from the smallest cache that contains the entire array
  * there are steep performance cliffs between each of the hardware levels
  * main memory has a more complex transition
    * it is software managed with more complex policies
    * for large arrays, performance matches main-memory latency
    
__Conclusion__: the exact same code at different sizes can have >20x performance differences.
  * you have to understand the cache hierarchy and reason about where your code runs to write fast code.
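
The first observation (an array is served from the smallest cache that holds all of it) can be written as a toy step-function model. The sketch below reuses the cycle counts from the Latency section; the cache capacities are assumed, illustrative values, not measurements of the Power 7 in the figure. `LEVELS` and `predicted_latency` are names invented here.

```python
KB = 1024
MB = 1024 * KB

# (level name, capacity in bytes, latency in cycles), fastest first.
# Latencies come from the Latency section; capacities are ASSUMED for illustration.
LEVELS = [
    ("L1", 32 * KB, 1),
    ("L2", 256 * KB, 4),
    ("L3", 8 * MB, 35),
]
MEMORY_CYCLES = 145


def predicted_latency(array_bytes):
    """Return (level, cycles) for the smallest level that holds the whole array."""
    for name, capacity, cycles in LEVELS:
        if array_bytes <= capacity:
            return name, cycles
    return "memory", MEMORY_CYCLES


if __name__ == "__main__":
    size = 4 * KB
    while size <= 64 * MB:
        level, cycles = predicted_latency(size)
        print(f"{size // KB:>8} KB -> {level:<6} {cycles:>4} cycles")
        size *= 2
```

This model reproduces the cliffs between levels but not the gradual transition into main memory seen in the measured curve.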
    