# 🚀 Day 4: Special Memory Types - Constant & Texture Memory

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-02/day-4-special-memory.ipynb)

---

## 🎯 Why Special Memory Types?

> **The Problem:** Not all data access patterns are created equal. Some data is read by every thread (like a filter kernel), while other data has strong 2D spatial locality (like image pixels). Using only global memory for these patterns leaves performance on the table.

**Real-World Impact:**
- 🎨 **Image processing** - Convolution filters read by millions of threads simultaneously
- 🔬 **Scientific computing** - Physical constants and lookup tables accessed uniformly
- 🎮 **Graphics** - Texture sampling with automatic interpolation

**Today's Mission:** Learn when and how to use CUDA's specialized memory types to match your data access patterns for maximum performance.

---

## 📋 Learning Objectives

| # | Objective | Skill Level |
|---|-----------|-------------|
| 1 | Understand when constant memory provides broadcast benefits | 🔵 Core |
| 2 | Implement convolution with constant memory filters | 🔵 Core |
| 3 | Understand texture memory's 2D spatial locality optimization | 🔵 Core |
| 4 | Choose the right memory type for different access patterns | 🟢 Essential |
| 5 | Combine multiple memory optimizations in a complete solution | 🟡 Advanced |

---

## Learning Philosophy

> **CUDA C++ First, Python/Numba as Optional Backup**

This notebook shows:
1. **CUDA C++ code** - The PRIMARY implementation you should learn
2. **Python/Numba code** - OPTIONAL for quick interactive testing in Colab

---

## Setup

**For ODU HPC (Wahab):**
```bash
module load container_env cuda-12.3.0
crun -p ~/envs/cuda python -m jupyter lab
```

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
# Python/Numba is OPTIONAL - for quick interactive testing only
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally")

import numpy as np
from numba import cuda
import math
import time

print(f"\nCUDA available: {cuda.is_available()}")
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")
print("\n⚠️  Remember: CUDA C++ code is the PRIMARY learning material!")

---

## Part 1: Complete CUDA Memory Hierarchy Review

Before diving into special memory types, let's review the complete memory hierarchy:

```
┌─────────────────────────────────────────────────────────────┐
│                    CUDA Memory Hierarchy                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Global Memory (GB, slowest)            │    │
│  │  • All threads can read/write                       │    │
│  │  • Persists for application lifetime                │    │
│  │  • ~400-900 GB/s bandwidth                          │    │
│  └────────────────────────┬────────────────────────────┘    │
│                           │                                 │
│         ┌─────────────────┼─────────────────┐               │
│         ▼                 ▼                 ▼               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Constant   │  │   Texture    │  │    Shared    │       │
│  │    Memory    │  │    Memory    │  │    Memory    │       │
│  │  (64KB,      │  │  (cached,    │  │  (48-164KB   │       │
│  │   cached)    │  │   read-only) │  │   per SM)    │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           │                                 │
│                           ▼                                 │
│           ┌───────────────────────────────┐                 │
│           │  Registers (fastest, private) │                 │
│           │  • Per-thread, ~256 per thread│                 │
│           │  • ~12 TB/s equivalent        │                 │
│           └───────────────────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Memory Types Summary

| Memory Type | Location | Scope | Lifetime | Cache | Speed |
|-------------|----------|-------|----------|-------|-------|
| Register | On-chip | Thread | Thread | N/A | Fastest |
| Local | Off-chip | Thread | Thread | L1/L2 | Slow |
| Shared | On-chip | Block | Block | N/A | Fast |
| Global | Off-chip | Grid | Application | L1/L2 | Slow |
| Constant | Off-chip | Grid | Application | Constant cache | Fast (broadcast) |
| Texture | Off-chip | Grid | Application | Texture cache | Fast (spatial) |

---

## 🃏 Concept Card: The Complete Memory Hierarchy

> **Analogy: A City's Communication Systems**
>
> Think of GPU memory like different communication systems in a city:
>
> | Memory Type | City Analogy | Best For |
> |-------------|--------------|----------|
> | **Registers** | Person's own thoughts | Private calculations |
> | **Shared Memory** | Conference room | Team collaboration |
> | **Global Memory** | Public library | Large shared data |
> | **Constant Memory** | 📻 **Radio broadcast** | Same info to everyone |
> | **Texture Memory** | 🗺️ **GPS navigation** | Spatial lookups with interpolation |
>
> **Today's Focus:** The specialized "broadcast" and "spatial" systems that match specific access patterns.

---


## Part 2: Constant Memory

Now that we understand where constant memory fits in the hierarchy, let's explore it in depth. Remember our radio broadcast analogy: constant memory shines when **all threads need the same data**.

### What is Constant Memory?

Constant memory is a **read-only** memory space that:
- Has 64KB of total capacity per GPU
- Is cached in a dedicated **constant cache**
- Is optimized for **broadcasting** the same value to all threads in a warp
- Is written by the host before kernel launch and cannot be modified by device code
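
As a quick sanity check on the 64KB limit, the sketch below counts how many `float32` values fit and tests a few filter sizes. The helper name `fits_in_constant_memory` is made up for illustration, and it assumes the whole 64KB is available to a single array, which ignores any other `__constant__` data in the program.

```python
# Sketch: does a square float32 filter fit in the 64KB constant memory space?
CONSTANT_MEMORY_BYTES = 64 * 1024  # capacity stated in the list above
FLOAT32_BYTES = 4


def fits_in_constant_memory(num_elements, bytes_per_element=FLOAT32_BYTES):
    """Return True if an array of this size fits in constant memory.

    Assumption: nothing else in the program uses constant memory.
    """
    return num_elements * bytes_per_element <= CONSTANT_MEMORY_BYTES


max_floats = CONSTANT_MEMORY_BYTES // FLOAT32_BYTES
print(f"float32 values that fit: {max_floats}")

for side in (3, 15, 127, 129):
    status = "fits" if fits_in_constant_memory(side * side) else "too large"
    print(f"{side}x{side} filter: {status}")
```

A 3x3 Sobel filter uses 36 bytes, so capacity only becomes a concern for large lookup tables or very large filters.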

### When Constant Memory Excels

```
BEST CASE: All threads read the SAME address
┌──────────────────────────────────────────────────┐
│  Thread 0  Thread 1  Thread 2  ...  Thread 31    │
│     │         │         │              │         │
│     └─────────┴─────────┴──────────────┘         │
│                    │                             │
│                    ▼                             │
│           ┌─────────────────┐                    │
│           │ Constant Cache  │                    │
│           │   (1 read)      │                    │
│           └─────────────────┘                    │
│  → One read is broadcast to all 32 threads       │
└──────────────────────────────────────────────────┘

WORST CASE: All threads read DIFFERENT addresses
┌──────────────────────────────────────────────────┐
│  Thread 0  Thread 1  Thread 2  ...  Thread 31    │
│     │         │         │              │         │
│     ▼         ▼         ▼              ▼         │
│   addr[0]  addr[1]  addr[2]  ...  addr[31]       │
│     │         │         │              │         │
│     └─────────┴─────────┴──────────────┘         │
│                    │                             │
│                    ▼                             │
│           ┌─────────────────┐                    │
│           │ 32 serial reads │ ← SLOW!            │
│           └─────────────────┘                    │
└──────────────────────────────────────────────────┘
```
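
The two cases above are the extremes of one rule of thumb: a warp's constant memory request is split into one pass per distinct address. The sketch below models only that rule; `constant_read_passes` is a hypothetical helper, and real hardware timing depends on details it does not capture.

```python
# Simplified model: the constant cache serves one distinct address per pass,
# so the number of passes equals the number of distinct addresses in the warp.
WARP_SIZE = 32


def constant_read_passes(addresses):
    """Return the number of serialized passes for one warp-wide read."""
    return len(set(addresses))


uniform = [0] * WARP_SIZE                               # every lane reads filter[0]
two_groups = [lane // 16 for lane in range(WARP_SIZE)]  # half-warps differ
divergent = list(range(WARP_SIZE))                      # every lane reads its own element

for name, pattern in (("uniform", uniform),
                      ("two groups", two_groups),
                      ("divergent", divergent)):
    print(f"{name:>10}: {constant_read_passes(pattern)} pass(es)")
```

Indexing a constant array with a loop counter that is the same for every thread stays in the best case; indexing it with `threadIdx.x` moves toward the worst case.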

### Ideal Use Cases for Constant Memory

1. **Filter kernels/masks** (convolution, Gaussian blur)
2. **Lookup tables** accessed uniformly
3. **Configuration parameters** (dimensions, coefficients)
4. **Mathematical constants** (π, e, conversion factors)

### 🔷 CUDA C++ Implementation (Primary)

Let's see constant memory in action with a classic use case: **image convolution**. Every thread applies the same filter, making this a perfect broadcast scenario.

---

## 🃏 Concept Card: Constant Memory - The Broadcast Channel

> **Analogy: Radio Station Broadcasting**
>
> Imagine a radio station broadcasting the weather report:
> - 📻 **One transmission** reaches **thousands of listeners simultaneously**
> - Each listener doesn't need their own phone call to the station
> - The information is the **same for everyone**
>
> **Constant memory works the same way:**
> ```
> Traditional (Global Memory):         Constant Memory (Broadcast):
> ┌─────────────────────────┐          ┌─────────────────────────┐
> │ Thread 0: "Give me data"│          │ Thread 0 ─┐             │
> │ Thread 1: "Give me data"│          │ Thread 1 ─┼─ 📻 ──→ data│
> │ Thread 2: "Give me data"│          │ Thread 2 ─┤  (1 read)   │
> │    ...32 separate reads │          │   ...    ─┘             │
> └─────────────────────────┘          └─────────────────────────┘
>         32 reads!                           1 read, broadcast!
> ```
>
> **Perfect For:** Filter kernels, lookup tables, constants that ALL threads need

---

In [None]:
%%writefile constant_mem_demo.cu
#include <stdio.h>
#include <cuda_runtime.h>

// Constant memory declaration (64KB max)
__constant__ float d_filter[9];

// Convolution using constant memory for filter
__global__ void convolution_constant(const float* input, float* output, 
                                      int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        float result = 0.0f;
        
        // 3x3 convolution using constant memory filter
        // All threads read the SAME filter values -> broadcast!
        for (int i = -1; i <= 1; i++) {
            for (int j = -1; j <= 1; j++) {
                int idx = (y + i) * width + (x + j);
                int fIdx = (i + 1) * 3 + (j + 1);
                result += input[idx] * d_filter[fIdx];  // Constant memory access
            }
        }
        output[y * width + x] = result;
    }
}

// Convolution using global memory for filter (for comparison)
__global__ void convolution_global(const float* input, const float* filter,
                                   float* output, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        float result = 0.0f;
        
        for (int i = -1; i <= 1; i++) {
            for (int j = -1; j <= 1; j++) {
                int idx = (y + i) * width + (x + j);
                int fIdx = (i + 1) * 3 + (j + 1);
                result += input[idx] * filter[fIdx];  // Global memory access
            }
        }
        output[y * width + x] = result;
    }
}

int main() {
    int width = 2048, height = 2048;
    int size = width * height;
    
    // Sobel edge detection filter
    float h_filter[9] = {
        -1, 0, 1,
        -2, 0, 2,
        -1, 0, 1
    };
    
    // Allocate host memory
    float* h_input = (float*)malloc(size * sizeof(float));
    float* h_output = (float*)malloc(size * sizeof(float));
    
    // Initialize with test pattern
    for (int i = 0; i < size; i++) {
        h_input[i] = (float)(rand() % 256) / 255.0f;
    }
    
    // Allocate device memory
    float *d_input, *d_output, *d_filter_global;
    cudaMalloc(&d_input, size * sizeof(float));
    cudaMalloc(&d_output, size * sizeof(float));
    cudaMalloc(&d_filter_global, 9 * sizeof(float));
    
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_filter_global, h_filter, 9 * sizeof(float), cudaMemcpyHostToDevice);
    
    // Copy filter to constant memory
    cudaMemcpyToSymbol(d_filter, h_filter, 9 * sizeof(float));
    
    printf("=== Constant Memory Demonstration ===\n");
    printf("Image size: %dx%d, Filter: 3x3 Sobel\n\n", width, height);
    
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
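    // Warm-up launches so one-time startup costs are not counted in the timings
    convolution_constant<<<blocks, threads>>>(d_input, d_output, width, height);
    convolution_global<<<blocks, threads>>>(d_input, d_filter_global, d_output, width, height);
    cudaDeviceSynchronize();

    // Benchmark constant memory version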
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        convolution_constant<<<blocks, threads>>>(d_input, d_output, width, height);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float const_time;
    cudaEventElapsedTime(&const_time, start, stop);
    
    // Benchmark global memory version
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) {
        convolution_global<<<blocks, threads>>>(d_input, d_filter_global, d_output, width, height);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float global_time;
    cudaEventElapsedTime(&global_time, start, stop);
    
    printf("Constant memory filter: %.3f ms\n", const_time / 100);
    printf("Global memory filter:   %.3f ms\n", global_time / 100);
    printf("Speedup:                %.2fx\n", global_time / const_time);
    printf("\n💡 Constant memory broadcasts filter values to all threads efficiently!\n");
    
    // Cleanup
    cudaFree(d_input);
    cudaFree(d_output);
    cudaFree(d_filter_global);
    free(h_input);
    free(h_output);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o constant_mem_demo constant_mem_demo.cu
!./constant_mem_demo

### 🔶 Python/Numba (Optional - Quick Testing)

Example: Convolution with Constant Memory

Image convolution is the **classic** use case for constant memory because all threads apply the **same filter kernel**.

In [None]:
# 3x3 Sobel edge detection filter
SOBEL_X = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=np.float32)

print("Sobel X filter (edge detection):")
print(SOBEL_X)

In [None]:
# Version 1: Passing filter as regular global memory
@cuda.jit
def convolution_global(image, filter_kernel, output):
    """Convolution using global memory for filter."""
    row, col = cuda.grid(2)
    height, width = image.shape
    
    if row > 0 and row < height - 1 and col > 0 and col < width - 1:
        result = 0.0
        for i in range(-1, 2):
            for j in range(-1, 2):
                # Each read of filter_kernel goes to global memory
                result += image[row + i, col + j] * filter_kernel[i + 1, j + 1]
        output[row, col] = result

In [None]:
# Numba exposes constant memory through cuda.const.array_like(), called inside a
# kernel on an array known at compile time (for example a module-level NumPy array).
# This cell shows a different approach for comparison: staging the filter in shared memory.

# Version 2: Filter in shared memory (simulating constant memory)
@cuda.jit
def convolution_shared_filter(image, filter_kernel, output):
    """Convolution with filter loaded to shared memory."""
    # Load filter to shared memory (done once per block)
    shared_filter = cuda.shared.array((3, 3), dtype=np.float32)
    
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    
    # First 9 threads load the filter
    linear_tid = ty * cuda.blockDim.x + tx
    if linear_tid < 9:
        fi, fj = linear_tid // 3, linear_tid % 3
        shared_filter[fi, fj] = filter_kernel[fi, fj]
    
    cuda.syncthreads()
    
    row, col = cuda.grid(2)
    height, width = image.shape
    
    if row > 0 and row < height - 1 and col > 0 and col < width - 1:
        result = 0.0
        for i in range(-1, 2):
            for j in range(-1, 2):
                # Filter reads now from fast shared memory
                result += image[row + i, col + j] * shared_filter[i + 1, j + 1]
        output[row, col] = result

In [None]:
# Test both versions
SIZE = 2048
image = np.random.rand(SIZE, SIZE).astype(np.float32)
output = np.zeros_like(image)

d_image = cuda.to_device(image)
d_filter = cuda.to_device(SOBEL_X)
d_output = cuda.to_device(output)

threads = (16, 16)
blocks = ((SIZE + 15) // 16, (SIZE + 15) // 16)

# Warm up
convolution_global[blocks, threads](d_image, d_filter, d_output)
convolution_shared_filter[blocks, threads](d_image, d_filter, d_output)
cuda.synchronize()

# Benchmark global memory filter
start = time.perf_counter()
for _ in range(100):
    convolution_global[blocks, threads](d_image, d_filter, d_output)
cuda.synchronize()
global_time = (time.perf_counter() - start) / 100 * 1000

# Benchmark shared memory filter
start = time.perf_counter()
for _ in range(100):
    convolution_shared_filter[blocks, threads](d_image, d_filter, d_output)
cuda.synchronize()
shared_time = (time.perf_counter() - start) / 100 * 1000

print(f"Image size: {SIZE}x{SIZE}")
print(f"Global memory filter: {global_time:.3f} ms")
print(f"Shared memory filter: {shared_time:.3f} ms")
print(f"Speedup: {global_time/shared_time:.2f}x")
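
Timing alone does not show that either kernel is correct. A CPU reference like the sketch below can be used to spot-check a small image; `convolve3x3_reference` is a hypothetical helper written in plain Python, and it leaves border pixels at zero to match the kernels above, which skip them.

```python
# CPU reference for a 3x3 convolution, for spot-checking GPU results on small inputs.
def convolve3x3_reference(image, kernel):
    """Apply a 3x3 kernel to interior pixels; border pixels stay 0.0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            acc = 0.0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    acc += image[row + i][col + j] * kernel[i + 1][j + 1]
            out[row][col] = acc
    return out


sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Horizontal ramp: each pixel equals its column index, so the x-gradient is constant.
ramp = [[float(col) for col in range(5)] for _ in range(5)]
reference = convolve3x3_reference(ramp, sobel_x)
print(reference[2][2])  # interior response to the ramp
print(reference[0][0])  # border pixel, left untouched
```

For the ramp, every interior pixel sees a difference of 2 between its right and left neighbors, weighted by 1 + 2 + 1, so the interior response is 8. Comparing this output with `d_output.copy_to_host()` on the same small input is a reasonable first check before trusting a benchmark.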

### Understanding Constant Memory Benefits

Numba's constant memory support (`cuda.const.array_like`) is less explicit than the `__constant__` qualifier in CUDA C++, but the concept is the same:

```
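Why constant memory is fast for filters:

For a 3x3 convolution:
- Each thread reads the SAME 9 filter values
- 1 warp = 32 threads all reading filter[0,0]
- With constant cache: 1 read, broadcast to all 32 threads
- With global memory: requests for the same address are typically merged
  into one transaction too, but they share L1/L2 and load/store
  bandwidth with the image loads

Result: filter reads use a dedicated cache path and stay out of the way of image data.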
```

In native CUDA C++:
```cpp
__constant__ float filter[9];  // Declared at file scope

// Copied before kernel launch
cudaMemcpyToSymbol(filter, host_filter, 9 * sizeof(float));
```

---

**Transition:** Now that we've mastered constant memory for uniform broadcast access, let's explore texture memory, which is designed for a completely different pattern: **2D spatial locality**.

---

## Part 3: Texture Memory

While constant memory broadcasts the same value to all threads, texture memory excels at a different pattern: **2D spatial access with hardware interpolation**.

### What is Texture Memory?

Texture memory is another **read-only** memory that:
- Is optimized for **2D spatial locality**
- Has a dedicated **texture cache**
- Supports **hardware interpolation** (free bilinear/trilinear)
- Supports **automatic boundary handling** (clamp, wrap, mirror)
- Originally designed for graphics, but useful for compute
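
To make the boundary modes concrete, the sketch below shows how an out-of-range integer index would be remapped under each one. These helpers are illustrative models, not CUDA APIs: the hardware applies the modes to texture coordinates, and in CUDA the wrap and mirror modes are only available with normalized coordinates.

```python
# Index-level models of the three boundary (address) modes for a texture of n texels.
def clamp_index(i, n):
    """Clamp: out-of-range indices stick to the nearest edge texel."""
    return max(0, min(i, n - 1))


def wrap_index(i, n):
    """Wrap: the texture repeats, like tiling."""
    return i % n


def mirror_index(i, n):
    """Mirror: the texture repeats, flipping direction on every repetition."""
    period = 2 * n
    m = i % period
    return m if m < n else period - 1 - m


n = 4
for i in (-2, -1, 0, 3, 4, 5):
    print(i, clamp_index(i, n), wrap_index(i, n), mirror_index(i, n))
```

In a hand-written kernel each of these would be extra index arithmetic and branching per neighbor; with a texture object it is one field in the texture descriptor.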

### Texture Cache vs L1 Cache

```
L1 Cache: Optimized for 1D linear access (coalescing)
┌────────────────────────────────────────┐
│ [0][1][2][3][4][5][6][7] → cache line  │
│ Linear memory layout                   │
└────────────────────────────────────────┘

Texture Cache: Optimized for 2D spatial access
┌────────────────────────────────────────┐
│ [0,0][0,1]│[0,2][0,3]                  │
│ [1,0][1,1]│[1,2][1,3]  → 2D tiles      │
│ ──────────┼──────────                  │
│ [2,0][2,1]│[2,2][2,3]                  │
│ [3,0][3,1]│[3,2][3,3]                  │
└────────────────────────────────────────┘
```
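
One way to see why a tiled layout helps is to compare memory offsets of a 2x2 neighborhood under row-major order and under a Z-order (Morton) curve. This is only an illustration: the actual layout of CUDA arrays is not documented, so the Morton curve here is an assumed stand-in for "some 2D-tiled layout", and both function names are made up.

```python
# Compare how far apart 2D neighbors land in memory under two layouts.
def row_major_index(row, col, width):
    """Offset of (row, col) in an ordinary row-major array."""
    return row * width + col


def morton_index(row, col, bits=16):
    """Offset of (row, col) on a Z-order curve: interleave the bits of col and row."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


WIDTH = 1024
tile = [(0, 0), (0, 1), (1, 0), (1, 1)]  # the 2x2 neighborhood bilinear filtering reads

row_major = [row_major_index(r, c, WIDTH) for r, c in tile]
tiled = [morton_index(r, c) for r, c in tile]

print("row-major offsets:", row_major)  # [0, 1, 1024, 1025]
print("Z-order offsets:  ", tiled)      # [0, 1, 2, 3]
```

In row-major order the vertical neighbor is a full row away, so it usually sits in a different cache line; in the tiled layout all four texels are adjacent.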

### When Texture Memory Excels

1. **Image processing** (resizing, rotation, warping)
2. **Interpolation** operations
3. **Random 2D access patterns**
4. **Volume rendering** (3D textures)
5. **Data with spatial locality** in 2D/3D


---

## 🃏 Concept Card: Texture Memory - The Image Cache

> **Analogy: GPS Navigation with Smart Caching**
>
> Your GPS doesn't load the entire world map; it loads **tiles around your location**:
> - 🗺️ Moving **north**? Nearby northern tiles are likely needed next
> - 🗺️ Moving **east**? Eastern tiles are pre-cached
> - 🗺️ Need a point **between** grid points? GPS **interpolates** automatically
>
> **Texture memory works the same way:**
> ```
> Regular L1 Cache (1D optimized):     Texture Cache (2D optimized):
> ┌──────────────────────────┐         ┌──────────────────────────┐
> │ Cache line: [0][1][2][3] │         │ 2D Tile: [0,0][0,1]      │
> │ Great for linear access  │         │          [1,0][1,1]      │
> │ Poor for 2D neighbors    │         │ Great for 2D neighbors!  │
> └──────────────────────────┘         └──────────────────────────┘
> ```
>
> **Bonus Features (Free in Hardware!):**
> - ✨ **Bilinear interpolation** between pixels
> - ✨ **Boundary handling** (clamp, wrap, mirror)
> - ✨ **Normalized coordinates** (0.0 to 1.0)
>
> **Perfect For:** Image resizing, rotation, texture mapping, any 2D spatial access

---

### Texture Memory in Modern CUDA

Modern CUDA code uses **texture objects** (introduced in CUDA 5.0), which are created at runtime and passed to kernels as ordinary arguments. Texture fetches provide automatic interpolation and boundary handling.

Unfortunately, Numba CUDA doesn't directly support texture objects. For texture-like benefits in Numba:
1. Use shared memory tiling for 2D spatial locality
2. Implement manual interpolation
3. For advanced cases, use CuPy or raw CUDA
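
For the manual interpolation route, it helps to know what the hardware computes. The sketch below follows the linear-filtering rule for unnormalized coordinates with clamp addressing: texel centers sit at integer + 0.5, so the coordinate is shifted by 0.5 before splitting it into an index and a weight. `tex2d_linear` is a hypothetical CPU model, and it uses full floating-point weights, whereas the hardware uses lower-precision fixed-point weights.

```python
import math


def tex2d_linear(image, x, y):
    """CPU model of a bilinear texture fetch (unnormalized coords, clamp mode)."""
    height, width = len(image), len(image[0])

    def fetch(row, col):
        # Clamp addressing: out-of-range texels repeat the edge value.
        row = max(0, min(row, height - 1))
        col = max(0, min(col, width - 1))
        return image[row][col]

    xb, yb = x - 0.5, y - 0.5        # shift so texel centers land on integers
    j, i = math.floor(xb), math.floor(yb)
    a, b = xb - j, yb - i            # fractional weights in [0, 1)

    top = (1 - a) * fetch(i, j) + a * fetch(i, j + 1)
    bottom = (1 - a) * fetch(i + 1, j) + a * fetch(i + 1, j + 1)
    return (1 - b) * top + b * bottom


image = [[0.0, 1.0],
         [2.0, 3.0]]

print(tex2d_linear(image, 0.5, 0.5))  # center of texel (0, 0)
print(tex2d_linear(image, 1.0, 1.0))  # midpoint of all four texels
print(tex2d_linear(image, 1.5, 1.5))  # center of texel (1, 1)
```

The half-texel shift is the detail most often missed when comparing a texture kernel against a manual one: sampling at pixel index `k` without adding 0.5 blends texel `k` with its neighbor instead of returning texel `k` exactly.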

### 🔷 CUDA C++ Implementation (Primary)

In [None]:
%%writefile texture_demo.cu
#include <stdio.h>
#include <cuda_runtime.h>

// The texture object is created in main() and passed to the kernel by value.

// Kernel using texture memory for bilinear interpolation
__global__ void resizeWithTexture(cudaTextureObject_t tex, float* output,
                                   int outWidth, int outHeight,
                                   int inWidth, int inHeight) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < outWidth && y < outHeight) {
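        // Map output coords to input coords, using the same mapping as resizeManual
        float srcX = (float)x * (inWidth - 1) / (outWidth - 1);
        float srcY = (float)y * (inHeight - 1) / (outHeight - 1);

        // With unnormalized coordinates, texel centers sit at integer + 0.5,
        // so add 0.5f before sampling. Linear filtering then interpolates in hardware.
        float value = tex2D<float>(tex, srcX + 0.5f, srcY + 0.5f);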
        output[y * outWidth + x] = value;
    }
}

// Kernel without texture (manual bilinear interpolation)
__global__ void resizeManual(float* input, float* output,
                              int outWidth, int outHeight,
                              int inWidth, int inHeight) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x < outWidth && y < outHeight) {
        // Map output coords to input coords
        float srcX = (float)x * (inWidth - 1) / (outWidth - 1);
        float srcY = (float)y * (inHeight - 1) / (outHeight - 1);
        
        // Manual bilinear interpolation
        int x0 = (int)srcX;
        int y0 = (int)srcY;
        int x1 = min(x0 + 1, inWidth - 1);
        int y1 = min(y0 + 1, inHeight - 1);
        
        float fx = srcX - x0;
        float fy = srcY - y0;
        
        float v00 = input[y0 * inWidth + x0];
        float v01 = input[y0 * inWidth + x1];
        float v10 = input[y1 * inWidth + x0];
        float v11 = input[y1 * inWidth + x1];
        
        float v0 = v00 * (1 - fx) + v01 * fx;
        float v1 = v10 * (1 - fx) + v11 * fx;
        
        output[y * outWidth + x] = v0 * (1 - fy) + v1 * fy;
    }
}

int main() {
    int inWidth = 256, inHeight = 256;
    int outWidth = 512, outHeight = 512;
    
    // Allocate and initialize input
    float* h_input = (float*)malloc(inWidth * inHeight * sizeof(float));
    for (int i = 0; i < inHeight; i++) {
        for (int j = 0; j < inWidth; j++) {
            h_input[i * inWidth + j] = (float)(i + j) / (inWidth + inHeight);
        }
    }
    
    // Allocate device memory
    float *d_input, *d_output_tex, *d_output_manual;
    cudaMalloc(&d_input, inWidth * inHeight * sizeof(float));
    cudaMalloc(&d_output_tex, outWidth * outHeight * sizeof(float));
    cudaMalloc(&d_output_manual, outWidth * outHeight * sizeof(float));
    cudaMemcpy(d_input, h_input, inWidth * inHeight * sizeof(float), cudaMemcpyHostToDevice);
    
    // Create CUDA array for texture
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray* cuArray;
    cudaMallocArray(&cuArray, &channelDesc, inWidth, inHeight);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_input, inWidth * sizeof(float),
                        inWidth * sizeof(float), inHeight, cudaMemcpyHostToDevice);
    
    // Create texture object
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;
    
    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModeLinear;  // Bilinear interpolation!
    texDesc.normalizedCoords = false;
    
    cudaTextureObject_t texObj;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
    
    printf("=== Texture Memory Demonstration ===\n");
    printf("Resizing %dx%d -> %dx%d with bilinear interpolation\n\n", 
           inWidth, inHeight, outWidth, outHeight);
    
    dim3 threads(16, 16);
    dim3 blocks((outWidth + 15) / 16, (outHeight + 15) / 16);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
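    // Warm-up launches so one-time startup costs are not counted in the timings
    resizeWithTexture<<<blocks, threads>>>(texObj, d_output_tex,
                                            outWidth, outHeight, inWidth, inHeight);
    resizeManual<<<blocks, threads>>>(d_input, d_output_manual,
                                      outWidth, outHeight, inWidth, inHeight);
    cudaDeviceSynchronize();

    // Benchmark texture version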
    cudaEventRecord(start);
    for (int i = 0; i < 1000; i++) {
        resizeWithTexture<<<blocks, threads>>>(texObj, d_output_tex, 
                                                outWidth, outHeight, inWidth, inHeight);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tex_time;
    cudaEventElapsedTime(&tex_time, start, stop);
    
    // Benchmark manual version
    cudaEventRecord(start);
    for (int i = 0; i < 1000; i++) {
        resizeManual<<<blocks, threads>>>(d_input, d_output_manual,
                                          outWidth, outHeight, inWidth, inHeight);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float manual_time;
    cudaEventElapsedTime(&manual_time, start, stop);
    
    // Report the average time per launch (each loop ran 1000 launches)
    printf("Texture memory (hw interpolation): %.4f ms\n", tex_time / 1000);
    printf("Manual bilinear interpolation:     %.4f ms\n", manual_time / 1000);
    printf("Speedup from texture:              %.2fx\n", manual_time / tex_time);
    printf("\n✅ Texture memory provides FREE hardware interpolation!\n");
    
    // Cleanup
    cudaDestroyTextureObject(texObj);
    cudaFreeArray(cuArray);
    cudaFree(d_input);
    cudaFree(d_output_tex);
    cudaFree(d_output_manual);
    free(h_input);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o texture_demo texture_demo.cu
!./texture_demo

### 🔶 Python/Numba (Optional - Quick Testing)

Simulating Texture Benefits: Image Interpolation

In [None]:
# Manual bilinear interpolation (what texture memory does for free)
@cuda.jit(device=True)
def bilinear_sample(image, x, y, height, width):
    """Bilinear interpolation at floating-point coordinates."""
    # Clamp to valid range
    x = max(0.0, min(x, width - 1.001))
    y = max(0.0, min(y, height - 1.001))
    
    # Get integer coordinates
    x0 = int(x)
    y0 = int(y)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    
    # Fractional parts
    fx = x - x0
    fy = y - y0
    
    # Bilinear interpolation
    v00 = image[y0, x0]
    v01 = image[y0, x1]
    v10 = image[y1, x0]
    v11 = image[y1, x1]
    
    v0 = v00 * (1 - fx) + v01 * fx
    v1 = v10 * (1 - fx) + v11 * fx
    
    return v0 * (1 - fy) + v1 * fy

In [None]:
# Image resize using bilinear interpolation
@cuda.jit
def resize_bilinear(src, dst):
    """Resize image using bilinear interpolation."""
    dst_y, dst_x = cuda.grid(2)
    dst_h, dst_w = dst.shape
    src_h, src_w = src.shape
    
    if dst_y < dst_h and dst_x < dst_w:
        # Map destination coords to source coords
        src_x = dst_x * (src_w - 1) / (dst_w - 1)
        src_y = dst_y * (src_h - 1) / (dst_h - 1)
        
        dst[dst_y, dst_x] = bilinear_sample(src, src_x, src_y, src_h, src_w)

In [None]:
# Test image resizing
src_size = 256
dst_size = 512

# Create a simple test pattern
src_image = np.zeros((src_size, src_size), dtype=np.float32)
for i in range(src_size):
    for j in range(src_size):
        src_image[i, j] = (i + j) / (2 * src_size)

dst_image = np.zeros((dst_size, dst_size), dtype=np.float32)

d_src = cuda.to_device(src_image)
d_dst = cuda.to_device(dst_image)

threads = (16, 16)
blocks = ((dst_size + 15) // 16, (dst_size + 15) // 16)

resize_bilinear[blocks, threads](d_src, d_dst)
result = d_dst.copy_to_host()

print(f"Resized from {src_size}x{src_size} to {dst_size}x{dst_size}")
print(f"Source range: [{src_image.min():.3f}, {src_image.max():.3f}]")
print(f"Result range: [{result.min():.3f}, {result.max():.3f}]")
print("\nNote: With texture memory, interpolation would be automatic!")

---

## Part 4: Memory Type Decision Guide

### Decision Flowchart

```
                 ┌──────────────────────┐
                 │ Need to store data   │
                 │   for GPU kernel?    │
                 └──────────┬───────────┘
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
   ┌────────────────────┐    ┌────────────────────┐
   │ Read-only data?    │    │ Read-write data?   │
   └────────┬───────────┘    └─────────┬──────────┘
            │                          │
   ┌────────┴────────┐         ┌───────┴────────┐
   ▼                 ▼         ▼                ▼
┌───────┐       ┌────────┐  ┌────────┐   ┌───────────┐
│ Small │       │ Large  │  │Private │   │  Shared   │
│ <64KB │       │data or │  │to each │   │among block│
│uniform│       │spatial │  │thread  │   │threads    │
│access │       │access  │  │        │   │           │
└───┬───┘       └───┬────┘  └───┬────┘   └─────┬─────┘
    │               │           │              │
    ▼               ▼           ▼              ▼
┌────────┐   ┌──────────┐  ┌────────┐   ┌───────────┐
│CONSTANT│   │ TEXTURE  │  │REGISTER│   │  SHARED   │
│ MEMORY │   │  MEMORY  │  │(auto)  │   │  MEMORY   │
└────────┘   └──────────┘  └────────┘   └───────────┘

Default: GLOBAL MEMORY (with coalescing optimizations)
```
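
The flowchart can also be read as a small function. The sketch below is only an illustration: `choose_memory` and its argument names are invented here, and the 64 KB threshold is the constant-memory size quoted in this notebook. Real decisions should still be confirmed by profiling.

```python
CONSTANT_LIMIT_BYTES = 64 * 1024  # constant-memory size quoted in this notebook

def choose_memory(read_only, size_bytes=0, uniform_access=False,
                  spatial_locality=False, thread_private=False,
                  block_shared=False):
    """Hypothetical helper that walks the flowchart top to bottom."""
    if read_only:
        # Small data that every thread reads at the same address
        if size_bytes <= CONSTANT_LIMIT_BYTES and uniform_access:
            return "constant"
        # Large data, or data read through 2D/3D neighborhoods
        if spatial_locality:
            return "texture"
        return "global"
    if thread_private:
        return "register"
    if block_shared:
        return "shared"
    # Default branch of the flowchart
    return "global"

# A 5x5 float32 filter read identically by every thread
print(choose_memory(True, size_bytes=25 * 4, uniform_access=True))
# A 2048x2048 float32 image sampled by 2D neighborhood
print(choose_memory(True, size_bytes=2048 * 2048 * 4, spatial_locality=True))
# A per-block partial sum that threads cooperate on
print(choose_memory(False, block_shared=True))
```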

---

## 🃏 Concept Card: When to Use Each Memory Type

> **The Decision Tree**
>
> Ask yourself these questions in order:
>
> ```
> ❓ Is the data READ-ONLY during kernel execution?
>    │
>    ├─ NO → Use Global Memory (read-write) or Shared Memory (block-local)
>    │
>    └─ YES → Continue...
>        │
>        ❓ Is the data < 64KB AND accessed UNIFORMLY by all threads?
>           │
>           ├─ YES → 📻 CONSTANT MEMORY
>           │        Examples: filter kernels, LUTs, config parameters
>           │
>           └─ NO → Continue...
>               │
>               ❓ Does the data have 2D/3D SPATIAL LOCALITY?
>                  │
>                  ├─ YES → 🗺️ TEXTURE MEMORY
>                  │        Examples: image processing, volume rendering
>                  │
>                  └─ NO → Use GLOBAL MEMORY with coalescing
>
> **Quick Cheat Sheet:**
> | Access Pattern | Memory Choice | Why |
> |----------------|---------------|-----|
> | Same value → all threads | Constant | Broadcast efficiency |
> | 2D neighborhood reads | Texture | 2D cache + interpolation |
> | Linear streaming | Global | Coalescing works well |
> | Block-local reuse | Shared | Fastest for collaboration |

---

### Quick Reference Table

| Scenario | Best Memory | Why |
|----------|-------------|-----|
| Convolution kernel/filter | Constant | Same values read by all threads |
| Configuration parameters | Constant | Small, uniform read access |
| Lookup table (uniform access) | Constant | Broadcast efficiency |
| Image processing (resize, rotate) | Texture | 2D spatial locality + interpolation |
| Volume rendering | Texture | 3D spatial locality |
| Scattered 2D reads that stay spatially local | Texture | 2D-optimized cache |
| Thread-local accumulator | Register | Fastest, private to thread |
| Block-wide reduction | Shared | Threads need to communicate |
| Tiled matrix multiply | Shared | Data reuse within block |
| Histogram (atomic updates) | Shared → Global | Reduce atomic contention |
| Large arrays with streaming | Global | Only option for large data |

---

## Part 5: Practical Example - Optimized Gaussian Blur

Now let's bring together everything we've learned this week! We'll combine:
- **Shared memory** for image tile caching (Day 2)
- **Constant-like behavior** for the filter kernel (Day 4)
- **Coalesced access** patterns (Day 1)

This is what real-world CUDA optimization looks like: layering multiple techniques for maximum performance.

In [None]:
def gaussian_kernel_2d(size, sigma):
    """Generate 2D Gaussian kernel."""
    ax = np.arange(-size // 2 + 1, size // 2 + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (kernel / kernel.sum()).astype(np.float32)

# 5x5 Gaussian kernel
KERNEL_SIZE = 5
GAUSSIAN = gaussian_kernel_2d(KERNEL_SIZE, 1.0)
print("5x5 Gaussian kernel:")
print(np.round(GAUSSIAN, 4))

In [None]:
# Naive implementation: Global memory only
@cuda.jit
def gaussian_blur_naive(image, kernel, output, ksize):
    """Naive Gaussian blur - all global memory."""
    row, col = cuda.grid(2)
    height, width = image.shape
    half_k = ksize // 2
    
    if row >= half_k and row < height - half_k and col >= half_k and col < width - half_k:
        result = 0.0
        for i in range(-half_k, half_k + 1):
            for j in range(-half_k, half_k + 1):
                result += image[row + i, col + j] * kernel[i + half_k, j + half_k]
        output[row, col] = result

In [None]:
# Optimized implementation: Shared memory tiling + kernel in shared memory
TILE_SIZE = 16
BLOCK_SIZE = TILE_SIZE + KERNEL_SIZE - 1  # Tile + halo

@cuda.jit
def gaussian_blur_optimized(image, kernel, output, ksize):
    """Optimized Gaussian blur with shared memory tiling."""
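    # Shared memory for image tile (with halo) and kernel.
    # Shapes must be compile-time constants, so they come from module-level globals.
    shared_tile = cuda.shared.array((BLOCK_SIZE, BLOCK_SIZE), dtype=np.float32)
    shared_kernel = cuda.shared.array((KERNEL_SIZE, KERNEL_SIZE), dtype=np.float32)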
    
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    bx, by = cuda.blockIdx.x, cuda.blockIdx.y
    height, width = image.shape
    half_k = ksize // 2
    
    # Load kernel to shared memory (first ksize * ksize threads, 25 for a 5x5 kernel)
    linear_tid = ty * cuda.blockDim.x + tx
    if linear_tid < ksize * ksize:
        ki, kj = linear_tid // ksize, linear_tid % ksize
        shared_kernel[ki, kj] = kernel[ki, kj]
    
    # Calculate tile starting position (with halo offset)
    tile_start_row = by * TILE_SIZE - half_k
    tile_start_col = bx * TILE_SIZE - half_k
    
    # Load tile with halo into shared memory
    # Each thread may need to load multiple elements
    for i in range(0, BLOCK_SIZE, TILE_SIZE):
        for j in range(0, BLOCK_SIZE, TILE_SIZE):
            si = ty + i
            sj = tx + j
            if si < BLOCK_SIZE and sj < BLOCK_SIZE:
                gi = tile_start_row + si
                gj = tile_start_col + sj
                if 0 <= gi < height and 0 <= gj < width:
                    shared_tile[si, sj] = image[gi, gj]
                else:
                    shared_tile[si, sj] = 0.0
    
    cuda.syncthreads()
    
    # Compute output
    out_row = by * TILE_SIZE + ty
    out_col = bx * TILE_SIZE + tx
    
    if out_row < height and out_col < width:
        result = 0.0
        for i in range(ksize):
            for j in range(ksize):
                result += shared_tile[ty + i, tx + j] * shared_kernel[i, j]
        output[out_row, out_col] = result
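
The strided load loop and the benefit of tiling can be checked with plain arithmetic on the CPU. The sketch below is an illustration only: it replays the load loop's index math for one block (no GPU involved), counts how often each shared-tile element is written, and compares global-memory reads per output pixel for the naive and tiled kernels. It reuses the notebook's `TILE_SIZE`, `KERNEL_SIZE`, and `BLOCK_SIZE` values; the other names are made up here.

```python
import numpy as np

TILE_SIZE = 16
KERNEL_SIZE = 5
BLOCK_SIZE = TILE_SIZE + KERNEL_SIZE - 1  # 20: tile plus halo

# Replay the load loop for every (ty, tx) in one 16x16 thread block.
loads = np.zeros((BLOCK_SIZE, BLOCK_SIZE), dtype=np.int32)
for ty in range(TILE_SIZE):
    for tx in range(TILE_SIZE):
        for i in range(0, BLOCK_SIZE, TILE_SIZE):
            for j in range(0, BLOCK_SIZE, TILE_SIZE):
                si, sj = ty + i, tx + j
                if si < BLOCK_SIZE and sj < BLOCK_SIZE:
                    loads[si, sj] += 1

# Every tile element should be loaded exactly once.
print(loads.min(), loads.max())  # 1 1

# Image reads from global memory per output pixel.
naive_reads = KERNEL_SIZE ** 2                  # each thread reads its full window
tiled_reads = BLOCK_SIZE ** 2 / TILE_SIZE ** 2  # 400 loads shared by 256 outputs
print(naive_reads, tiled_reads)  # 25 1.5625
```

The naive count ignores hardware caching, so the measured speedup will generally be smaller than the 16x ratio of these two numbers.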

In [None]:
# Benchmark comparison
SIZE = 2048
image = np.random.rand(SIZE, SIZE).astype(np.float32)
output = np.zeros_like(image)

d_image = cuda.to_device(image)
d_kernel = cuda.to_device(GAUSSIAN)
d_output = cuda.to_device(output)

# Naive version
threads_naive = (16, 16)
blocks_naive = ((SIZE + 15) // 16, (SIZE + 15) // 16)

gaussian_blur_naive[blocks_naive, threads_naive](d_image, d_kernel, d_output, KERNEL_SIZE)
cuda.synchronize()

start = time.perf_counter()
for _ in range(50):
    gaussian_blur_naive[blocks_naive, threads_naive](d_image, d_kernel, d_output, KERNEL_SIZE)
cuda.synchronize()
naive_time = (time.perf_counter() - start) / 50 * 1000

# Optimized version
threads_opt = (TILE_SIZE, TILE_SIZE)
blocks_opt = ((SIZE + TILE_SIZE - 1) // TILE_SIZE, (SIZE + TILE_SIZE - 1) // TILE_SIZE)

gaussian_blur_optimized[blocks_opt, threads_opt](d_image, d_kernel, d_output, KERNEL_SIZE)
cuda.synchronize()

start = time.perf_counter()
for _ in range(50):
    gaussian_blur_optimized[blocks_opt, threads_opt](d_image, d_kernel, d_output, KERNEL_SIZE)
cuda.synchronize()
optimized_time = (time.perf_counter() - start) / 50 * 1000

print(f"\n{'='*50}")
print(f"Gaussian Blur Performance ({SIZE}x{SIZE} image)")
print(f"{'='*50}")
print(f"Naive (global memory):     {naive_time:.3f} ms")
print(f"Optimized (shared memory): {optimized_time:.3f} ms")
print(f"Speedup:                   {naive_time/optimized_time:.2f}x")
print(f"\nOptimizations applied:")
print("  ✓ Kernel loaded to shared memory (constant-like behavior)")
print("  ✓ Image tile with halo in shared memory")
print("  ✓ Coalesced global memory loads")

---

## 🎯 Exercises

Now it's your turn! These exercises will help you internalize constant and texture memory patterns through hands-on practice.

### 🔷 CUDA C++ Exercises (Primary)

Complete these exercises using constant and texture memory in CUDA C++.

In [None]:
%%writefile special_memory_exercises.cu
// special_memory_exercises.cu - Constant and texture memory exercises
#include <stdio.h>
#include <cuda_runtime.h>
#include <math.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// =============================================================================
// Exercise 1: Separable Gaussian Blur using Constant Memory
// =============================================================================

// Constant memory for 1D Gaussian kernel (shared by all threads)
#define MAX_KERNEL_SIZE 25
__constant__ float c_gaussianKernel[MAX_KERNEL_SIZE];
__constant__ int c_kernelRadius;

// Generate 1D Gaussian kernel on host
void generateGaussianKernel(float* kernel, int size, float sigma) {
    int radius = size / 2;
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        int x = i - radius;
        kernel[i] = expf(-(x * x) / (2.0f * sigma * sigma));
        sum += kernel[i];
    }
    // Normalize
    for (int i = 0; i < size; i++) {
        kernel[i] /= sum;
    }
}

// Horizontal 1D convolution
__global__ void gaussianBlurHorizontal(const float* input, float* output, 
                                        int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x >= width || y >= height) return;
    
    float sum = 0.0f;
    int radius = c_kernelRadius;
    
    for (int k = -radius; k <= radius; k++) {
        int nx = min(max(x + k, 0), width - 1);  // Clamp to border
        sum += input[y * width + nx] * c_gaussianKernel[k + radius];
    }
    
    output[y * width + x] = sum;
}

// Vertical 1D convolution
__global__ void gaussianBlurVertical(const float* input, float* output, 
                                      int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x >= width || y >= height) return;
    
    float sum = 0.0f;
    int radius = c_kernelRadius;
    
    for (int k = -radius; k <= radius; k++) {
        int ny = min(max(y + k, 0), height - 1);  // Clamp to border
        sum += input[ny * width + x] * c_gaussianKernel[k + radius];
    }
    
    output[y * width + x] = sum;
}

// =============================================================================
// Exercise 2: Lookup Table with Constant Memory
// =============================================================================

// Heatmap colormap (256 RGB entries)
__constant__ float c_heatmapLUT[256 * 3];

void generateHeatmapLUT(float* lut) {
    for (int i = 0; i < 256; i++) {
        float t = i / 255.0f;
        float r, g, b;
        
        // Blue -> Cyan -> Green -> Yellow -> Red
        if (t < 0.25f) {
            r = 0; g = t * 4; b = 1;
        } else if (t < 0.5f) {
            r = 0; g = 1; b = 1 - (t - 0.25f) * 4;
        } else if (t < 0.75f) {
            r = (t - 0.5f) * 4; g = 1; b = 0;
        } else {
            r = 1; g = 1 - (t - 0.75f) * 4; b = 0;
        }
        
        lut[i * 3 + 0] = r;
        lut[i * 3 + 1] = g;
        lut[i * 3 + 2] = b;
    }
}

__global__ void applyHeatmap(const unsigned char* grayscale, float* output_rgb,
                              int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    
    if (x >= width || y >= height) return;
    
    int idx = y * width + x;
    int gray = grayscale[idx];
    
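    // Read from the constant-memory LUT. The index depends on per-thread data,
    // so a warp gets a single broadcast only when its threads share a gray value;
    // lanes that read different entries are serialized.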
    output_rgb[idx * 3 + 0] = c_heatmapLUT[gray * 3 + 0];
    output_rgb[idx * 3 + 1] = c_heatmapLUT[gray * 3 + 1];
    output_rgb[idx * 3 + 2] = c_heatmapLUT[gray * 3 + 2];
}

// =============================================================================
// Test harness
// =============================================================================

int main() {
    printf("=== Special Memory Exercises ===\n\n");
    
    // Exercise 1: Separable Gaussian Blur
    printf("Exercise 1: Separable Gaussian Blur\n");
    printf("-" "-----------------------------------\n");
    {
        const int WIDTH = 512;
        const int HEIGHT = 512;
        const int KERNEL_SIZE = 5;
        const float SIGMA = 1.0f;
        size_t imageSize = WIDTH * HEIGHT * sizeof(float);
        
        // Generate Gaussian kernel and copy to constant memory
        float h_kernel[MAX_KERNEL_SIZE];
        generateGaussianKernel(h_kernel, KERNEL_SIZE, SIGMA);
        CUDA_CHECK(cudaMemcpyToSymbol(c_gaussianKernel, h_kernel, 
                                       KERNEL_SIZE * sizeof(float)));
        int radius = KERNEL_SIZE / 2;
        CUDA_CHECK(cudaMemcpyToSymbol(c_kernelRadius, &radius, sizeof(int)));
        
        printf("Gaussian kernel (size=%d, sigma=%.1f):\n  ", KERNEL_SIZE, SIGMA);
        for (int i = 0; i < KERNEL_SIZE; i++) printf("%.3f ", h_kernel[i]);
        printf("\n\n");
        
        // Create test image (gradient)
        float* h_image = (float*)malloc(imageSize);
        for (int y = 0; y < HEIGHT; y++) {
            for (int x = 0; x < WIDTH; x++) {
                h_image[y * WIDTH + x] = (float)((x + y) % 256) / 255.0f;
            }
        }
        
        // Allocate device memory
        float *d_input, *d_temp, *d_output;
        CUDA_CHECK(cudaMalloc(&d_input, imageSize));
        CUDA_CHECK(cudaMalloc(&d_temp, imageSize));
        CUDA_CHECK(cudaMalloc(&d_output, imageSize));
        CUDA_CHECK(cudaMemcpy(d_input, h_image, imageSize, cudaMemcpyHostToDevice));
        
        // Launch separable blur
        dim3 block(16, 16);
        dim3 grid((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
        
        cudaEvent_t start, stop;
        CUDA_CHECK(cudaEventCreate(&start));
        CUDA_CHECK(cudaEventCreate(&stop));
        
        CUDA_CHECK(cudaEventRecord(start));
        gaussianBlurHorizontal<<<grid, block>>>(d_input, d_temp, WIDTH, HEIGHT);
        gaussianBlurVertical<<<grid, block>>>(d_temp, d_output, WIDTH, HEIGHT);
        CUDA_CHECK(cudaGetLastError());  // Catch launch errors from either pass
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));
        
        float ms;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        
        CUDA_CHECK(cudaMemcpy(h_image, d_output, imageSize, cudaMemcpyDeviceToHost));
        
        printf("Separable blur: %.3f ms\n", ms);
        printf("Operations per pixel: 2 × %d = %d (vs %d for 2D)\n", 
               KERNEL_SIZE, 2 * KERNEL_SIZE, KERNEL_SIZE * KERNEL_SIZE);
        printf("Sample output: [0,0]=%.3f, [255,255]=%.3f\n\n", 
               h_image[0], h_image[255 * WIDTH + 255]);
        
        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_input); cudaFree(d_temp); cudaFree(d_output);
        free(h_image);
    }
    
    // Exercise 2: Heatmap Colorization
    printf("Exercise 2: Heatmap Colorization\n");
    printf("-" "--------------------------------\n");
    {
        const int WIDTH = 1024;
        const int HEIGHT = 1024;
        
        // Generate and copy LUT to constant memory
        float h_lut[256 * 3];
        generateHeatmapLUT(h_lut);
        CUDA_CHECK(cudaMemcpyToSymbol(c_heatmapLUT, h_lut, 256 * 3 * sizeof(float)));
        
        // Create grayscale test image
        unsigned char* h_gray = (unsigned char*)malloc(WIDTH * HEIGHT);
        float* h_rgb = (float*)malloc(WIDTH * HEIGHT * 3 * sizeof(float));
        
        for (int y = 0; y < HEIGHT; y++) {
            for (int x = 0; x < WIDTH; x++) {
                h_gray[y * WIDTH + x] = (unsigned char)((x + y) % 256);
            }
        }
        
        // Allocate device memory
        unsigned char* d_gray;
        float* d_rgb;
        CUDA_CHECK(cudaMalloc(&d_gray, WIDTH * HEIGHT));
        CUDA_CHECK(cudaMalloc(&d_rgb, WIDTH * HEIGHT * 3 * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_gray, h_gray, WIDTH * HEIGHT, cudaMemcpyHostToDevice));
        
        // Launch kernel
        dim3 block(16, 16);
        dim3 grid((WIDTH + 15) / 16, (HEIGHT + 15) / 16);
        
        cudaEvent_t start, stop;
        CUDA_CHECK(cudaEventCreate(&start));
        CUDA_CHECK(cudaEventCreate(&stop));
        
        CUDA_CHECK(cudaEventRecord(start));
        applyHeatmap<<<grid, block>>>(d_gray, d_rgb, WIDTH, HEIGHT);
        CUDA_CHECK(cudaGetLastError());  // Catch launch errors
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));
        
        float ms;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        
        CUDA_CHECK(cudaMemcpy(h_rgb, d_rgb, WIDTH * HEIGHT * 3 * sizeof(float), 
                              cudaMemcpyDeviceToHost));
        
        printf("Colorization: %.3f ms (%.2f Mpixels/s)\n", 
               ms, (WIDTH * HEIGHT / 1e6) / (ms / 1000));
        printf("Sample: gray[0]=%d → RGB=(%.2f, %.2f, %.2f)\n",
               h_gray[0], h_rgb[0], h_rgb[1], h_rgb[2]);
        printf("Sample: gray[128]=%d → RGB=(%.2f, %.2f, %.2f)\n",
               h_gray[128], h_rgb[128*3], h_rgb[128*3+1], h_rgb[128*3+2]);
        
        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_gray); cudaFree(d_rgb);
        free(h_gray); free(h_rgb);
    }
    
    printf("\n=== All exercises complete! ===\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o special_memory_exercises special_memory_exercises.cu && ./special_memory_exercises
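
The two exercises use constant memory quite differently. The filter taps in Exercise 1 are read at the same index by every lane of a warp, whereas the LUT in Exercise 2 is indexed by each pixel's gray value. The sketch below uses a deliberately simplified cost model, one serialized constant-memory read per distinct address in a warp, to compare the two. `constant_reads_per_warp` is a made-up helper, and the lane layout assumes the 16x16 blocks and the `(x + y) % 256` test image used in the exercise file.

```python
WARP_SIZE = 32
BLOCK_WIDTH = 16  # matches dim3 block(16, 16) in the exercises

def constant_reads_per_warp(addresses):
    """Simplified model: one serialized read per distinct address in the warp."""
    return len(set(addresses))

# Exercise 1: at a given loop iteration, every lane reads the same filter tap.
filter_tap = [2 for lane in range(WARP_SIZE)]

# Exercise 2: warp 0 of block (0, 0) covers rows y = 0 and y = 1, x = 0..15.
# Each lane reads the LUT entry for its own gray value (x + y) % 256.
lut_index = []
for lane in range(WARP_SIZE):
    x, y = lane % BLOCK_WIDTH, lane // BLOCK_WIDTH
    gray = (x + y) % 256
    lut_index.append(gray * 3)  # offset of the red channel for this gray value

print(constant_reads_per_warp(filter_tap))  # 1
print(constant_reads_per_warp(lut_index))   # 17
```

Under this model the gradient test image costs 17 serialized reads per LUT channel for this warp, while an image with large flat regions would get close to the broadcast case. Measuring on real hardware is the only way to know whether constant, shared, or global memory suits a particular LUT best.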

### 🔶 Python/Numba Exercises (Optional)

### Exercise 1: Separable Gaussian Blur

A 2D Gaussian is **separable**: it can be computed as two 1D passes (horizontal, then vertical). For a k×k kernel this cuts the work from k² to 2k multiply-adds per pixel.

In [None]:
# 1D Gaussian kernel
def gaussian_kernel_1d(size, sigma):
    ax = np.arange(-size // 2 + 1, size // 2 + 1)
    kernel = np.exp(-ax**2 / (2 * sigma**2))
    return (kernel / kernel.sum()).astype(np.float32)

GAUSSIAN_1D = gaussian_kernel_1d(5, 1.0)
print("1D Gaussian kernel:", GAUSSIAN_1D)
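
Before writing the GPU kernels, the separability claim can be checked on the CPU. The sketch below is a NumPy illustration, not the exercise solution: `gaussian_1d`, `conv_rows`, and `conv_2d` are helper names invented here, and borders are clamped (edge padding) as in the CUDA C++ version of this exercise. It confirms that the 2D kernel is the outer product of the 1D kernel with itself and that two 1D passes match one 2D pass.

```python
import numpy as np

def gaussian_1d(size, sigma):
    """Normalized 1D Gaussian with `size` taps (odd size assumed)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-ax**2 / (2.0 * sigma**2))
    return k / k.sum()

def conv_rows(img, k):
    """Filter each row with k, clamping coordinates at the left/right borders."""
    r = len(k) // 2
    padded = np.pad(img, ((0, 0), (r, r)), mode="edge")
    return sum(k[i] * padded[:, i:i + img.shape[1]] for i in range(len(k)))

def conv_2d(img, k2d):
    """Direct 2D filtering with the same clamped borders."""
    r = k2d.shape[0] // 2
    padded = np.pad(img, r, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    for i in range(k2d.shape[0]):
        for j in range(k2d.shape[1]):
            out += k2d[i, j] * padded[i:i + h, j:j + w]
    return out

g1 = gaussian_1d(5, 1.0)
g2 = np.outer(g1, g1)  # 2D Gaussian as an outer product of the 1D kernel

img = np.random.default_rng(0).random((32, 32))
horizontal = conv_rows(img, g1)
two_pass = conv_rows(horizontal.T, g1).T  # vertical pass via transpose
one_pass = conv_2d(img, g2)

print(np.allclose(two_pass, one_pass))  # True
print(len(g1) * 2, "vs", g2.size, "multiplications per pixel")
```

The same `two_pass` array could serve as a reference when testing `gaussian_blur_horizontal` and `gaussian_blur_vertical`, provided the kernels use the same border handling.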

In [None]:
# TODO: Implement horizontal and vertical 1D convolution kernels

@cuda.jit
def gaussian_blur_horizontal(image, kernel, output, ksize):
    """Apply 1D Gaussian blur horizontally."""
    # Your implementation here
    pass

@cuda.jit
def gaussian_blur_vertical(image, kernel, output, ksize):
    """Apply 1D Gaussian blur vertically."""
    # Your implementation here
    pass

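# Compare:
# 1. 5x5 2D convolution: 25 multiplications per pixel
# 2. Two 1x5 1D convolutions: 10 multiplications per pixel
# Arithmetic drops by 2.5x. The measured speedup may be smaller, because the blur
# is largely memory-bound and the second pass re-reads an intermediate image.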

### Exercise 2: Lookup Table with Constant Memory Pattern

Implement a color mapping kernel where all threads read from the same lookup table.

In [None]:
# Create a colormap lookup table (256 entries)
# Maps grayscale values to "heat" colors
def create_heatmap_lut():
    lut = np.zeros((256, 3), dtype=np.float32)
    for i in range(256):
        t = i / 255.0
        # Blue -> Cyan -> Green -> Yellow -> Red
        if t < 0.25:
            lut[i] = [0, t * 4, 1]
        elif t < 0.5:
            lut[i] = [0, 1, 1 - (t - 0.25) * 4]
        elif t < 0.75:
            lut[i] = [(t - 0.5) * 4, 1, 0]
        else:
            lut[i] = [1, 1 - (t - 0.75) * 4, 0]
    return lut

HEATMAP_LUT = create_heatmap_lut()
print(f"Lookup table shape: {HEATMAP_LUT.shape}")
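
A CPU reference makes it easier to check the colorization kernel once it is written. The sketch below is an illustration with made-up helper names (`heatmap_lut_interp`, `apply_heatmap_reference`): it rebuilds the same blue, cyan, green, yellow, red ramp with `np.interp` and applies it with NumPy integer indexing, which performs one table lookup per pixel.

```python
import numpy as np

def heatmap_lut_interp():
    """Same colour ramp as create_heatmap_lut, via piecewise-linear interpolation."""
    t = np.arange(256) / 255.0
    stops = [0.0, 0.25, 0.5, 0.75, 1.0]
    r = np.interp(t, stops, [0, 0, 0, 1, 1])
    g = np.interp(t, stops, [0, 1, 1, 1, 0])
    b = np.interp(t, stops, [1, 1, 0, 0, 0])
    return np.stack([r, g, b], axis=1).astype(np.float32)

def apply_heatmap_reference(grayscale, lut):
    """CPU reference: index the (256, 3) table with an (H, W) integer image."""
    return lut[grayscale]

lut = heatmap_lut_interp()
gray = np.add.outer(np.arange(4), np.arange(6)).astype(np.uint8)
rgb = apply_heatmap_reference(gray, lut)

print(rgb.shape)  # (4, 6, 3)
print(rgb[0, 0])  # [0. 0. 1.]
```

Comparing `apply_heatmap_reference(gray_image, HEATMAP_LUT)` with the kernel's `output_rgb` (for example with `np.allclose`) would be one way to validate the exercise.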

In [None]:
# TODO: Implement color mapping with LUT in shared memory

@cuda.jit
def apply_heatmap(grayscale, lut, output_rgb):
    """
    Apply heatmap colorization using lookup table.
    
    Args:
        grayscale: 2D input (H, W), values 0-255
        lut: Lookup table (256, 3)
        output_rgb: 3D output (H, W, 3)
    
    Hint: Load LUT to shared memory for constant-memory-like behavior
    """
    # Your implementation here
    pass

---

## 🎯 Key Takeaways

### 📋 Quick Reference Card: Special Memory Types

| Memory Type | Size Limit | Best Access Pattern | Hardware Feature | Use Case |
|-------------|------------|---------------------|------------------|----------|
| **Constant** | 64 KB | Uniform (all threads same address) | Broadcast to warp | Filter kernels, LUTs, config |
| **Texture** | Bounded by device memory and texture dimension limits | 2D/3D spatial locality | Interpolation, boundary handling | Image processing, volume rendering |
| **Shared** | 48-164 KB/SM (architecture-dependent) | Block-local reuse | Low latency, high bandwidth | Tiled algorithms, reductions |
| **Global** | GB | Coalesced linear | L1/L2 cache | General purpose |

### 🧠 Three Things to Remember

1. **Constant Memory = Radio Broadcast**
   - One read serves an entire warp when all threads access the same address
   - Perfect for filter kernels, lookup tables, and configuration parameters
   - ⚠️ Different addresses per thread → serialized reads (slow!)

2. **Texture Memory = 2D GPS Cache**
   - Optimized for 2D spatial locality (nearby pixels likely accessed together)
   - FREE hardware interpolation and boundary handling
   - Great for image resizing, rotation, and any 2D neighborhood access

3. **Match Memory to Access Pattern**
   - There's no universally "best" memory type
   - Profile and measure: the right choice depends on YOUR data access pattern
   - Layering techniques (shared + constant-like filter) yields best results

### Week 2 Summary: Memory Mastery

| Day | Topic | Key Insight |
|-----|-------|-------------|
| 1 | Memory Coalescing | Adjacent threads → adjacent memory = single transaction |
| 2 | Shared Memory | On-chip cache for data reuse within a block |
| 3 | Bank Conflicts | 32 banks; different addresses in the same bank → serialization |
| 4 | Special Memory | Constant (broadcast) + Texture (2D spatial) |

### 🔧 Optimization Hierarchy

```
1. First: Choose right algorithm (parallelizable)
2. Then:  Ensure coalesced global memory access
3. Then:  Use shared memory for data reuse
4. Then:  Avoid bank conflicts (padding)
5. Then:  Consider special memory types
6. Then:  Fine-tune thread/block configuration
```

### Memory Selection Quick Guide

```
Small read-only + uniform access → Constant
2D spatial access + interpolation → Texture
Block-local data reuse           ‚Üí Shared
Thread-private temporary         ‚Üí Register
Everything else                  ‚Üí Global (with coalescing)
```

---

## 🚀 What's Next?

### ✅ Week 2 Complete: Memory Patterns & Optimization

You've mastered the memory hierarchy! You now understand:
- How to achieve coalesced memory access
- When and how to use shared memory for data reuse
- How to avoid bank conflicts
- Which special memory types to choose for different patterns

### 📋 Before Moving On
- Complete the **Day 5 Review & Checkpoint Quiz**
- Try the exercises above to solidify your understanding

### 🔮 Week 3 Preview: Synchronization & Atomics

Next week, we tackle **thread coordination**, that is, what happens when threads need to work together:

| Day | Topic | Why It Matters |
|-----|-------|----------------|
| 1 | Thread Synchronization | Coordinate threads within a block |
| 2 | Atomic Operations | Safe concurrent updates to shared data |
| 3 | Warp-Level Programming | Leverage warp-level primitives |
| 4 | Parallel Reduction | Efficient patterns for aggregation |

**Key Question for Week 3:** *How do we safely combine results from thousands of threads?*

---

*Great work completing Week 2! You now have the memory optimization skills that separate efficient CUDA code from naive implementations.* 🎉