# Investigation: The Memory Hierarchy

**From The Nature of Fast, Chapter 1**

This notebook lets you investigate the memory hierarchy on your own hardware. You'll see how access patterns dramatically affect performance—even when the algorithm is "the same."

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ttsugriy/performance-book/blob/main/notebooks/tier2-experimental/01-memory-hierarchy.ipynb)

---

## Setup

First, let's check our environment and import dependencies.

In [None]:
import numpy as np
import time
import platform
import matplotlib.pyplot as plt

print(f"Python: {platform.python_version()}")
print(f"NumPy: {np.__version__}")
print(f"Platform: {platform.platform()}")
print(f"Processor: {platform.processor()}")

## Investigation 1: Sequential vs Random Access

Let's measure how access pattern affects performance for summing an array.

**Hypothesis (RAM model)**: Both should take the same time, since each run reads the same elements exactly once and the RAM model charges every access the same cost.

**Hypothesis (Memory hierarchy)**: Sequential should be much faster due to caching and prefetching.

In [None]:
def benchmark_access_patterns(n=100_000_000, num_runs=5):
    """Compare sequential vs random array access."""
    
    # Create array and random indices
    print(f"Creating array of {n:,} elements ({n * 8 / 1e9:.2f} GB)...")
    arr = np.arange(n, dtype=np.int64)
    indices = np.random.permutation(n).astype(np.int64)
    
    # Warmup
    _ = arr.sum()
    _ = arr[indices[:1000]].sum()
    
    # Benchmark sequential access
    print("\nBenchmarking sequential access...")
    seq_times = []
    for i in range(num_runs):
        start = time.perf_counter()
        total_seq = arr.sum()
        seq_times.append(time.perf_counter() - start)
        print(f"  Run {i+1}: {seq_times[-1]:.4f}s")
    
    # Benchmark random access
    print("\nBenchmarking random access...")
    rand_times = []
    for i in range(num_runs):
        start = time.perf_counter()
        total_rand = arr[indices].sum()
        rand_times.append(time.perf_counter() - start)
        print(f"  Run {i+1}: {rand_times[-1]:.4f}s")
    
    # Verify correctness
    assert total_seq == total_rand, "Results don't match!"
    
    # Report results
    seq_mean = np.mean(seq_times)
    rand_mean = np.mean(rand_times)
    
    print("\n" + "="*50)
    print(f"Sequential: {seq_mean:.4f}s (±{np.std(seq_times):.4f}s)")
    print(f"Random:     {rand_mean:.4f}s (±{np.std(rand_times):.4f}s)")
    print(f"Ratio:      {rand_mean/seq_mean:.1f}× slower")
    print("="*50)
    
    return seq_times, rand_times

# Run the benchmark
seq_times, rand_times = benchmark_access_patterns()

### Discussion

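You should typically see random access run roughly 5-20× slower than sequential, though the exact ratio depends on your hardware. Note that `arr[indices].sum()` also reads the index array and writes a temporary copy before summing, so part of the gap is extra memory traffic rather than access order alone.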

**Why?**

1. **Cache lines**: When you access one element, the CPU fetches a whole cache line (typically 64 bytes = 8 int64s). Sequential access uses all 8; random access usually uses 1 and wastes the other 7.

2. **Prefetching**: The CPU hardware predicts sequential patterns and fetches ahead. Random access defeats prediction.

3. **TLB misses**: Virtual-to-physical address translation is cached. Random access causes more TLB misses.
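
To separate access order from the cost of the gather itself, one option is to run the same `arr[idx].sum()` expression twice: once with an in-order index array and once with a shuffled one. The following is a minimal sketch, not part of the original benchmark. `gather_ordered_vs_shuffled` is an illustrative helper, and the default `n` is an assumption chosen to exceed most last-level caches while keeping memory use modest.

```python
import time
import numpy as np

def gather_ordered_vs_shuffled(n=10_000_000, num_runs=3, seed=0):
    """Time arr[idx].sum() for an in-order and a shuffled index array."""
    rng = np.random.default_rng(seed)
    arr = np.arange(n, dtype=np.int64)
    ordered = np.arange(n, dtype=np.int64)  # visits elements 0, 1, 2, ...
    shuffled = rng.permutation(n)           # same elements, random order

    def best_time(idx):
        # Take the best of num_runs to reduce noise from other processes.
        best = float("inf")
        total = None
        for _ in range(num_runs):
            start = time.perf_counter()
            total = arr[idx].sum()
            best = min(best, time.perf_counter() - start)
        return best, total

    t_ordered, sum_ordered = best_time(ordered)
    t_shuffled, sum_shuffled = best_time(shuffled)
    # Both index arrays visit every element exactly once.
    assert sum_ordered == sum_shuffled
    return t_ordered, t_shuffled

t_ordered, t_shuffled = gather_ordered_vs_shuffled()
print(f"Ordered gather:  {t_ordered:.4f}s")
print(f"Shuffled gather: {t_shuffled:.4f}s")
print(f"Ratio:           {t_shuffled / t_ordered:.1f}x")
```

Both runs pay for reading an index array and writing a temporary, so any remaining gap should come mostly from the access order.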

## Investigation 2: Working Set Size

How does array size affect performance? If the working set fits in cache, access is fast. If it spills to DRAM, access slows down.

Let's find the cache boundaries!

In [None]:
def benchmark_working_set_sizes():
    """Measure how working set size affects access speed."""
    
    # Candidate sizes from 1 KB to 1 GB (arrays under 1000 elements are skipped below)
    sizes_kb = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 
                1024, 2048, 4096, 8192, 16384, 32768, 
                65536, 131072, 262144, 524288, 1048576]
    
    results = []
    
    for size_kb in sizes_kb:
        # Create array
        n_elements = (size_kb * 1024) // 8  # 8 bytes per int64
        if n_elements < 1000:
            continue
            
        arr = np.arange(n_elements, dtype=np.int64)
        
        # Measure random access (to stress the cache). Use a fixed number of
        # accesses so per-call overhead does not dominate for small arrays.
        indices = np.random.randint(0, n_elements, size=10_000_000)
        
        # Warmup
        _ = arr[indices[:1000]].sum()
        
        # Measure
        times = []
        for _ in range(3):
            start = time.perf_counter()
            _ = arr[indices].sum()
            times.append(time.perf_counter() - start)
        
        mean_time = np.mean(times)
        accesses = len(indices)
        ns_per_access = (mean_time / accesses) * 1e9
        
        results.append({
            'size_kb': size_kb,
            'ns_per_access': ns_per_access,
        })
        
        print(f"Size: {size_kb:>7} KB | {ns_per_access:>6.1f} ns/access")
    
    return results

results = benchmark_working_set_sizes()

In [None]:
# Visualize the results
sizes = [r['size_kb'] for r in results]
latencies = [r['ns_per_access'] for r in results]

plt.figure(figsize=(12, 6))
plt.loglog(sizes, latencies, 'o-', linewidth=2, markersize=8)

# Add cache size annotations (typical values)
cache_sizes = [
    (32, 'L1 (~32 KB)'),
    (256, 'L2 (~256 KB)'),
    (32768, 'L3 (~32 MB)'),
]

for size, label in cache_sizes:
    plt.axvline(x=size, color='red', linestyle='--', alpha=0.5)
    plt.text(size * 1.1, max(latencies) * 0.8, label, fontsize=10, rotation=90)

plt.xlabel('Working Set Size (KB)', fontsize=12)
plt.ylabel('Latency per Access (ns)', fontsize=12)
plt.title('Memory Latency vs Working Set Size', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nNote: The 'steps' in the curve show cache boundaries.")
print("When data spills from one cache level to the next, latency increases.")

### What to Look For

You should see a **staircase pattern**:

1. **Flat region (small sizes)**: Data fits in L1 cache. Fast access.
2. **Step up**: Data spills to L2. Slower.
3. **Another step**: Data spills to L3. Even slower.
4. **Final step**: Data goes to DRAM. Much slower.

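The exact boundaries depend on your hardware. Because the accesses are independent, the CPU can overlap many cache misses, so these numbers are the average time per access rather than the full latency of a single miss, and the steps may look rounded rather than sharp.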
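
To compare the curve against your machine's real cache sizes rather than typical values, Linux exposes them under sysfs. The sketch below assumes a Linux system that exposes `/sys/devices/system/cpu/cpu0/cache`; on other platforms it simply returns an empty list. `parse_size_kb` and `read_cache_sizes_kb` are hypothetical helper names, not part of the notebook.

```python
from pathlib import Path

def parse_size_kb(text):
    """Convert a sysfs size string such as '32K' to KB; None if unrecognized."""
    units = {"K": 1, "M": 1024}
    if len(text) >= 2 and text[-1] in units and text[:-1].isdigit():
        return int(text[:-1]) * units[text[-1]]
    return None

def read_cache_sizes_kb(cpu=0):
    """Return [(level, type, size_kb), ...] from Linux sysfs, or [] if unavailable."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    caches = []
    if not base.is_dir():
        return caches  # not Linux, or cache info is not exposed (e.g. some VMs)
    for index_dir in sorted(base.glob("index*")):
        try:
            level = int((index_dir / "level").read_text().strip())
            cache_type = (index_dir / "type").read_text().strip()
            size_kb = parse_size_kb((index_dir / "size").read_text().strip())
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        if size_kb is not None:
            caches.append((level, cache_type, size_kb))
    return caches

for level, cache_type, size_kb in read_cache_sizes_kb():
    print(f"L{level} {cache_type:<12} {size_kb:>8} KB")
```

Entries of type `Data` or `Unified` are the ones relevant to the plot; `Instruction` caches do not hold array data.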

## Investigation 3: Stride Patterns

What happens if we access every Nth element? This is called a **strided** access pattern.

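- Stride 1: Sequential (best)
- Stride 8: Every 8th element, i.e. every 64th byte (uses 1 of 8 elements per cache line)
- Stride 64: Every 64th element, i.e. every 512th byte (touches 1 in 8 cache lines, still using 1 element per line touched)
- Stride ≥ 8 (one cache line or more): Maximum cache waste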
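
The waste in each case can be estimated with simple arithmetic. The following is a minimal sketch of that model, assuming 64-byte cache lines, aligned data, and no prefetching effects. `cache_line_utilization` is a hypothetical helper introduced only for illustration.

```python
def cache_line_utilization(stride, itemsize=8, line_bytes=64):
    """Fraction of each fetched cache line that a strided scan actually uses.

    Simplified model: for strides shorter than a line, 1 in `stride` elements
    of every line is used. Once the stride reaches a full line, each access
    lands on a new line and uses exactly one element of it.
    """
    elements_per_line = line_bytes // itemsize
    return 1.0 / min(stride, elements_per_line)

for stride in [1, 2, 4, 8, 16, 64]:
    used = cache_line_utilization(stride)
    print(f"Stride {stride:>2}: {used:.1%} of each fetched line is used")
```

Under this model, utilization bottoms out at one element per line (12.5% for int64), which is why strides beyond 8 cannot waste more of each line than stride 8 does.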

In [None]:
def benchmark_strides():
    """Measure how access stride affects performance."""
    
    # Large array to ensure we're measuring memory effects
    n = 100_000_000
    arr = np.arange(n, dtype=np.int64)
    
    strides = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    results = []
    
    for stride in strides:
        # Create indices with this stride
        indices = np.arange(0, n, stride, dtype=np.int64)
        num_accesses = len(indices)
        
        # Warmup
        _ = arr[indices[:1000]].sum()
        
        # Measure
        times = []
        for _ in range(3):
            start = time.perf_counter()
            _ = arr[indices].sum()
            times.append(time.perf_counter() - start)
        
        mean_time = np.mean(times)
        gb_per_sec = (num_accesses * 8) / mean_time / 1e9
        
        results.append({
            'stride': stride,
            'gb_per_sec': gb_per_sec,
            'time': mean_time,
        })
        
        print(f"Stride {stride:>3}: {gb_per_sec:>6.2f} GB/s effective bandwidth")
    
    return results

stride_results = benchmark_strides()

In [None]:
# Visualize stride effects
strides = [r['stride'] for r in stride_results]
bandwidths = [r['gb_per_sec'] for r in stride_results]

plt.figure(figsize=(10, 6))
plt.bar(range(len(strides)), bandwidths, tick_label=strides)
plt.xlabel('Stride (elements)', fontsize=12)
plt.ylabel('Effective Bandwidth (GB/s)', fontsize=12)
plt.title('How Stride Affects Memory Bandwidth', fontsize=14)

# Annotate cache line boundary
cache_line_elements = 64 // 8  # 8 bytes per int64
plt.axvline(x=strides.index(cache_line_elements) if cache_line_elements in strides else -1, 
            color='red', linestyle='--', alpha=0.5, label='Cache line boundary')

plt.legend()
plt.tight_layout()
plt.show()

print(f"\nStride 1 to Stride {strides[-1]} bandwidth ratio: {bandwidths[0]/bandwidths[-1]:.1f}×")

### Key Observation

Notice how effective bandwidth typically **drops sharply** as the stride grows from 1 toward the cache line size (8 elements for int64, since 64 bytes / 8 bytes per element = 8), and stays low beyond it.

At stride 8+, you're fetching a full cache line but only using 1 element from it. That's 87.5% waste!
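
The benchmark above uses index arrays, which add their own memory traffic. As a cross-check, the same experiment can be sketched with basic slicing, where `arr[::stride]` is a view over the same buffer and no index array is read. This is an illustrative variant under stated assumptions, not the notebook's method: `strided_view_bandwidth` is a hypothetical helper, and the default `n` assumes an array larger than your last-level cache (increase it if yours is bigger).

```python
import time
import numpy as np

def strided_view_bandwidth(n=20_000_000, strides=(1, 2, 4, 8, 16, 32), num_runs=3):
    """Effective GB/s of useful data when summing the view arr[::stride]."""
    arr = np.ones(n, dtype=np.int64)  # np.ones writes every element up front
    results = {}
    for stride in strides:
        view = arr[::stride]  # no copy: same buffer, larger step between elements
        best = float("inf")
        for _ in range(num_runs):
            start = time.perf_counter()
            view.sum()
            best = min(best, time.perf_counter() - start)
        # Count only the bytes actually summed, not the bytes fetched.
        results[stride] = view.size * arr.itemsize / best / 1e9
    return results

for stride, gb_per_sec in strided_view_bandwidth().items():
    print(f"Stride {stride:>2}: {gb_per_sec:>6.2f} GB/s effective bandwidth")
```

If the drop between stride 1 and stride 8 shows up here as well, it is more likely caused by cache line waste than by index-array overhead.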

## Your Turn: Experiments

Try modifying the benchmarks to answer these questions:

1. **Different data types**: How do float32 vs float64 vs int8 affect the results?

2. **Read vs write**: Is writing faster or slower than reading? (Hint: try `arr[indices] = value` vs `arr[indices].sum()`)

3. **Multiple passes**: If you access the same data multiple times, does the second pass benefit from caching?

In [None]:
# Your experiments here!

# Example: Try different data types
def compare_dtypes():
    n = 10_000_000
    
    for dtype in [np.float32, np.float64, np.int8, np.int32, np.int64]:
        # np.ones avoids the out-of-range values np.arange(n) would produce for int8
        arr = np.ones(n, dtype=dtype)
        
        start = time.perf_counter()
        for _ in range(10):
            _ = arr.sum()
        elapsed = time.perf_counter() - start
        
        bytes_per_elem = arr.itemsize
        gb_per_sec = (n * bytes_per_elem * 10) / elapsed / 1e9
        
        print(f"{dtype.__name__:>10} ({bytes_per_elem} bytes): {gb_per_sec:.2f} GB/s")

compare_dtypes()
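
For the read vs write question, one possible starting point is sketched below. It assumes a shuffled permutation as the index array so that every element is touched exactly once in both cases. `compare_read_write` is a hypothetical helper. The read side also allocates a temporary before summing, so treat the comparison as approximate.

```python
import time
import numpy as np

def compare_read_write(n=10_000_000, num_runs=3, seed=0):
    """Time a random gather-and-sum against a random scatter write."""
    rng = np.random.default_rng(seed)
    arr = np.zeros(n, dtype=np.int64)
    indices = rng.permutation(n)  # each element is touched exactly once

    def best_of(fn):
        # Best of num_runs reduces noise from other processes.
        best = float("inf")
        for _ in range(num_runs):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    def gather_read():
        return arr[indices].sum()  # random reads into a temporary, then a sum

    def scatter_write():
        arr[indices] = 1           # random writes directly into arr

    return best_of(gather_read), best_of(scatter_write)

read_time, write_time = compare_read_write()
print(f"Random read:  {read_time:.4f}s")
print(f"Random write: {write_time:.4f}s")
```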

## Key Takeaways

1. **The RAM model lies**: Memory access cost is NOT uniform. Latency can differ by roughly 200× depending on which level of the hierarchy holds the data.

2. **Sequential beats random**: Often by 5-20×, because of caching and prefetching.

3. **Working set size matters**: Keep frequently-used data small enough to fit in cache.

4. **Stride kills performance**: Strided access wastes cache lines. Use stride-1 (sequential) when possible.

5. **Measure on YOUR hardware**: Cache sizes and latencies vary. The specific numbers depend on your machine.

---

*Continue to [Chapter 2: The Tyranny of Bandwidth](https://ttsugriy.github.io/performance-book/chapters/02-bandwidth-tyranny.html)*