# 🚀 Day 3: Atomic Operations

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-04/day-3-atomic-operations.ipynb)

## Learning Objectives
- Understand race conditions and why atomics are needed
- Use atomic add, max, min, and CAS operations
- Apply atomics to counting and reduction problems
- Understand privatization for reducing atomic contention

> **Primary Focus:** CUDA C++ code examples first, Python/Numba backup for interactive testing

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

print(f"\nCUDA available: {cuda.is_available()}")
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")

---

## Part 1: The Race Condition Problem

### What Goes Wrong Without Atomics?

```
Thread A reads counter (0)
Thread B reads counter (0)     ← Same old value!
Thread A writes counter+1 (1)
Thread B writes counter+1 (1)  ← Overwrites A's work!

Expected: 2, Actual: 1 ← DATA RACE!
```
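
The interleaving above can be replayed deterministically on the CPU. The sketch below is only a model: `replay`, `racy`, and `serialized` are illustrative names that do not come from the notebook, and no GPU is involved. It shows that the same four steps give 1 or 2 depending purely on their order.

```python
# CPU-only replay of the interleaving above; no GPU is involved.
def replay(schedule):
    """Run (thread, op) steps against one shared counter and return its final value."""
    counter = 0
    registers = {}  # each thread's private copy of the value it last read
    for thread, op in schedule:
        if op == "read":
            registers[thread] = counter
        elif op == "write":
            counter = registers[thread] + 1
    return counter

# Both threads read before either writes: one increment is lost.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]
# Each read-modify-write completes before the next one starts.
serialized = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

racy_result = replay(racy)
serialized_result = replay(serialized)
print(f"racy: {racy_result}, serialized: {serialized_result}")
```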

### CUDA C++ Atomic Operations (Primary)

The following CUDA C++ implementation demonstrates atomic operations for thread-safe memory updates.

In [None]:
%%writefile atomic_ops.cu
// atomic_ops.cu - Thread-safe operations
#include <stdio.h>
#include <cuda_runtime.h>

// BAD: Race condition!
__global__ void raceCondition(int* counter) {
    // Multiple threads read-modify-write simultaneously
    counter[0] = counter[0] + 1;  // NOT SAFE!
}

// GOOD: Using atomicAdd
__global__ void safeIncrement(int* counter, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        atomicAdd(counter, 1);  // Thread-safe!
    }
}

// Common atomic operations:
__global__ void atomicExamples(int* arr, float* farr) {
    // Integer atomics
    atomicAdd(&arr[0], 1);        // arr[0] += 1
    atomicSub(&arr[1], 1);        // arr[1] -= 1
    atomicMax(&arr[2], 100);      // arr[2] = max(arr[2], 100)
    atomicMin(&arr[3], 0);        // arr[3] = min(arr[3], 0)
    atomicExch(&arr[4], 42);      // arr[4] = 42, returns old value
    atomicCAS(&arr[5], 0, 1);     // if (arr[5] == 0) arr[5] = 1
    
    // Floating-point atomics
    atomicAdd(&farr[0], 1.0f);    // farr[0] += 1.0
    // Note: CUDA has no built-in atomicMax/atomicMin for float;
    // build one with an atomicCAS loop if you need it
}

// Reduce with atomics (simple but slow)
__global__ void atomicSum(const float* input, float* result, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    float localSum = 0.0f;
    for (int i = tid; i < n; i += stride) {
        localSum += input[i];
    }
    
    // One atomic per thread is MUCH better than one per element
    atomicAdd(result, localSum);
}

// Privatization: reduce contention with per-block counters
__global__ void countWithPrivatization(const int* data, int* blockCounts, int n) {
    __shared__ int localCount;
    
    if (threadIdx.x == 0) {
        localCount = 0;
    }
    __syncthreads();
    
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        if (data[i] > 0) {
            atomicAdd(&localCount, 1);  // Shared memory atomic (fast)
        }
    }
    __syncthreads();
    
    if (threadIdx.x == 0) {
        atomicAdd(&blockCounts[blockIdx.x], localCount);  // Global atomic once
    }
}

int main() {
    int n = 1000;
    int *d_counter;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    
    safeIncrement<<<4, 256>>>(d_counter, n);
    
    int result;
    cudaMemcpy(&result, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Counter after %d increments: %d\n", n, result);
    
    cudaFree(d_counter);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o atomic_ops atomic_ops.cu
!./atomic_ops

In [None]:
# Python demonstration of race condition
@cuda.jit
def race_condition_demo(counter):
    """
    BAD: Multiple threads increment counter without protection.
    """
    counter[0] = counter[0] + 1  # NOT ATOMIC!

In [None]:
# Demonstrate the race condition
def test_race_condition(num_threads):
    counter = np.zeros(1, dtype=np.int32)
    d_counter = cuda.to_device(counter)
    
    threads_per_block = min(256, num_threads)
    blocks = (num_threads + threads_per_block - 1) // threads_per_block
    
    race_condition_demo[blocks, threads_per_block](d_counter)
    cuda.synchronize()
    
    result = d_counter.copy_to_host()[0]
    return result

print("Race Condition Demo")
print("="*50)
print(f"{'Threads':>10} | {'Expected':>10} | {'Actual':>10} | {'Lost':>10}")
print("-"*50)

for n in [100, 1000, 10000, 100000]:
    actual = test_race_condition(n)
    lost = n - actual
    print(f"{n:>10,} | {n:>10,} | {actual:>10,} | {lost:>10,}")

print("\n⚠️  Notice: Most increments are LOST due to race conditions!")

---

## Part 2: Atomic Operations

### The Solution: Atomic Operations

```
Atomic = Indivisible

Read-Modify-Write happens as ONE operation:
┌────────────────────────┐
│ 1. Read old value      │
│ 2. Compute new value   │  ← All protected!
│ 3. Write new value     │
└────────────────────────┘

Other threads must WAIT until this completes.
```
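
As a rough CPU analogy for the diagram above, the sketch below wraps the read-modify-write in a `threading.Lock`. GPU atomics are performed by the memory hardware rather than by locks, so this only models the indivisibility and the "return the old value" behavior. `AtomicCounter` and `worker` are hypothetical names introduced for illustration.

```python
import threading

class AtomicCounter:
    """Toy model of an atomic integer: the whole read-modify-write is one critical section."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # Read, compute, and write happen while holding the lock,
        # so no other thread can observe or modify the value in between.
        with self._lock:
            old = self._value
            self._value = old + amount
        return old  # like GPU atomics, return the value seen before the update

    @property
    def value(self):
        return self._value

counter = AtomicCounter()

def worker(increments):
    for _ in range(increments):
        counter.add(1)

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Final value: {counter.value}")
```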

In [None]:
@cuda.jit
def atomic_increment(counter):
    """CORRECT: Use atomic operation for thread-safe increment."""
    cuda.atomic.add(counter, 0, 1)  # Atomically: counter[0] += 1

In [None]:
def test_atomic_increment(num_threads):
    counter = np.zeros(1, dtype=np.int32)
    d_counter = cuda.to_device(counter)
    
    threads_per_block = min(256, num_threads)
    blocks = (num_threads + threads_per_block - 1) // threads_per_block
    
    atomic_increment[blocks, threads_per_block](d_counter)
    cuda.synchronize()
    
    return d_counter.copy_to_host()[0]

print("Atomic Increment Demo")
print("="*50)
print(f"{'Threads':>10} | {'Expected':>10} | {'Actual':>10} | {'Match':>10}")
print("-"*50)

for n in [100, 1000, 10000, 100000]:
    actual = test_atomic_increment(n)
    match = "✓" if actual == n else "✗"
    print(f"{n:>10,} | {n:>10,} | {actual:>10,} | {match:>10}")

print("\n✓ All increments are now counted correctly!")

---

## Part 3: Available Atomic Operations

### Numba CUDA Atomic Functions

In [None]:
# Available atomic operations in Numba CUDA:
#
# cuda.atomic.add(array, index, value)     # array[index] += value
# cuda.atomic.max(array, index, value)     # array[index] = max(array[index], value)
# cuda.atomic.min(array, index, value)     # array[index] = min(array[index], value)
# cuda.atomic.compare_and_swap(array, old, val)  # CAS on array[0]: store val if array[0] == old
#
# All return the OLD value before the operation

@cuda.jit
def demo_atomic_add(arr, values, result, n):
    """Sum values using atomic add."""
    tid = cuda.grid(1)
    if tid < n:
        cuda.atomic.add(result, 0, values[tid])

@cuda.jit
def demo_atomic_max(arr, result, n):
    """Find maximum using atomic max."""
    tid = cuda.grid(1)
    if tid < n:
        cuda.atomic.max(result, 0, arr[tid])

@cuda.jit
def demo_atomic_min(arr, result, n):
    """Find minimum using atomic min."""
    tid = cuda.grid(1)
    if tid < n:
        cuda.atomic.min(result, 0, arr[tid])

In [None]:
# Test atomic operations
n = 10000
arr = np.random.randint(1, 1000, n).astype(np.int32)

d_arr = cuda.to_device(arr)

# Atomic add (sum)
d_sum = cuda.to_device(np.zeros(1, dtype=np.int32))
demo_atomic_add[40, 256](d_arr, d_arr, d_sum, n)
gpu_sum = d_sum.copy_to_host()[0]

# Atomic max
d_max = cuda.to_device(np.array([np.iinfo(np.int32).min], dtype=np.int32))
demo_atomic_max[40, 256](d_arr, d_max, n)
gpu_max = d_max.copy_to_host()[0]

# Atomic min
d_min = cuda.to_device(np.array([np.iinfo(np.int32).max], dtype=np.int32))
demo_atomic_min[40, 256](d_arr, d_min, n)
gpu_min = d_min.copy_to_host()[0]

print(f"Array: {n:,} random integers [1, 1000)")
print(f"\nAtomic Sum: {gpu_sum:,} (CPU: {np.sum(arr):,})")
print(f"Atomic Max: {gpu_max} (CPU: {np.max(arr)})")
print(f"Atomic Min: {gpu_min} (CPU: {np.min(arr)})")
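
Because each atomic returns the old value, an atomic add can also hand out unique output slots, which is one way to compact an array. The sketch below models that idea on the CPU: `fetch_add` is a hypothetical stand-in for an atomic add on a one-element array, and each loop iteration plays the role of one GPU thread. On a real GPU the order of the kept elements would not be deterministic.

```python
def fetch_add(cell, amount):
    """Model of an atomic add on cell[0]: update it and return the previous value."""
    old = cell[0]
    cell[0] = old + amount
    return old

data = [3, -1, 4, -1, 5, -9, 2, 6]
out = [None] * len(data)
cursor = [0]  # one-element list standing in for a device counter

for x in data:  # each iteration plays the role of one GPU thread
    if x > 0:
        slot = fetch_add(cursor, 1)  # the old value is a slot no other thread receives
        out[slot] = x

kept = out[:cursor[0]]
print(f"Kept {cursor[0]} positive values: {kept}")
```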

---

## Part 4: Atomics Performance Problem

### Atomic Contention

In [None]:
# Atomics are SLOW when many threads compete for the same location!

@cuda.jit
def high_contention_atomic(result, n):
    """All threads atomically add to ONE location."""
    tid = cuda.grid(1)
    if tid < n:
        cuda.atomic.add(result, 0, 1)  # Everyone fights for index 0!

@cuda.jit
def low_contention_atomic(result, n):
    """Threads spread across MULTIPLE locations."""
    tid = cuda.grid(1)
    if tid < n:
        # Spread across 256 locations
        bin_idx = tid % 256
        cuda.atomic.add(result, bin_idx, 1)

In [None]:
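# Benchmark contention
# Note: each timed iteration also resets the result with cuda.to_device,
# so that host-to-device copy is included in both measurements.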
n = 1_000_000
iterations = 100

# High contention (1 location)
d_result1 = cuda.device_array(1, dtype=np.int32)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_result1 = cuda.to_device(np.zeros(1, dtype=np.int32))
    high_contention_atomic[4000, 256](d_result1, n)
cuda.synchronize()
high_time = (time.perf_counter() - start) / iterations * 1000

# Low contention (256 locations)
d_result256 = cuda.device_array(256, dtype=np.int32)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_result256 = cuda.to_device(np.zeros(256, dtype=np.int32))
    low_contention_atomic[4000, 256](d_result256, n)
cuda.synchronize()
low_time = (time.perf_counter() - start) / iterations * 1000

print(f"Atomic Contention Benchmark (N={n:,})")
print(f"{'='*45}")
print(f"High contention (1 loc):   {high_time:.3f} ms")
print(f"Low contention (256 locs): {low_time:.3f} ms")
print(f"Speedup:                   {high_time/low_time:.1f}x")
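
To see why spreading updates helps, it can be useful to count how many updates land on the busiest location, since updates to one address must be applied one after another. The sketch below is a simple CPU model using the same `tid % 256` mapping as the kernel above. `max_updates_per_location` is a hypothetical helper, and the count is only an upper bound on serialization because hardware and compilers may combine same-address updates within a warp.

```python
from collections import Counter

def max_updates_per_location(num_threads, num_locations):
    """Largest number of atomic updates that land on any single location."""
    hits = Counter(tid % num_locations for tid in range(num_threads))
    return max(hits.values())

n = 1_000_000
high = max_updates_per_location(n, 1)    # every thread targets index 0
low = max_updates_per_location(n, 256)   # threads spread over 256 bins

print(f"1 location:    {high:,} updates on the busiest address")
print(f"256 locations: {low:,} updates on the busiest address")
```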

---

## Part 5: Privatization Pattern

### Reducing Contention with Local Accumulation

In [None]:
@cuda.jit
def privatized_sum(arr, result, n):
    """
    Privatization: Each block has its own accumulator in shared memory.
    Assumes blockDim.x is a power of two no larger than 256.
    
    1. Per-thread accumulation in a register (no atomic)
    2. Block-level reduction in shared memory (no atomic)
    3. Single atomic per block to global
    """
    # Shared memory for block's local sum
    shared = cuda.shared.array(256, dtype=np.float32)
    
    tid = cuda.threadIdx.x
    bid = cuda.blockIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # Phase 1: Local accumulation
    local_sum = 0.0
    for i in range(gid, n, stride):
        local_sum += arr[i]
    
    shared[tid] = local_sum
    cuda.syncthreads()
    
    # Phase 2: Block reduction
    s = cuda.blockDim.x // 2
    while s > 0:
        if tid < s:
            shared[tid] += shared[tid + s]
        cuda.syncthreads()
        s //= 2
    
    # Phase 3: ONE atomic per block
    if tid == 0:
        cuda.atomic.add(result, 0, shared[0])

In [None]:
# Compare naive atomic vs privatized
@cuda.jit
def naive_atomic_sum(arr, result, n):
    """Naive: one atomic per element."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(tid, n, stride):
        cuda.atomic.add(result, 0, arr[i])

n = 1_000_000
arr = np.random.rand(n).astype(np.float32)
d_arr = cuda.to_device(arr)

iterations = 50

# Naive atomic
d_result1 = cuda.to_device(np.zeros(1, dtype=np.float32))
naive_atomic_sum[256, 256](d_arr, d_result1, n)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_result1 = cuda.to_device(np.zeros(1, dtype=np.float32))
    naive_atomic_sum[256, 256](d_arr, d_result1, n)
cuda.synchronize()
naive_time = (time.perf_counter() - start) / iterations * 1000

# Privatized
d_result2 = cuda.to_device(np.zeros(1, dtype=np.float32))
privatized_sum[256, 256](d_arr, d_result2, n)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_result2 = cuda.to_device(np.zeros(1, dtype=np.float32))
    privatized_sum[256, 256](d_arr, d_result2, n)
cuda.synchronize()
priv_time = (time.perf_counter() - start) / iterations * 1000

print(f"Sum Benchmark (N={n:,})")
print(f"{'='*45}")
print(f"Naive atomic (N atomics): {naive_time:.3f} ms")
print(f"Privatized (256 atomics): {priv_time:.3f} ms")
print(f"Speedup:                  {naive_time/priv_time:.1f}x")
print(f"\nResults match: {'✓' if np.isclose(d_result1.copy_to_host()[0], d_result2.copy_to_host()[0], rtol=1e-4) else '✗'}")
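
The three phases of the privatized sum can be traced on the CPU to check the bookkeeping. The sketch below is a sequential model, not GPU code: `privatized_sum_model` is a hypothetical name, the block size is assumed to be a power of two, and the loop over blocks stands in for blocks running in parallel. It also counts how many global atomic adds the pattern would issue.

```python
def privatized_sum_model(values, num_blocks, block_dim):
    """Sequential model of: per-thread partial sums -> block tree reduction -> one add per block."""
    n = len(values)
    stride = num_blocks * block_dim
    total = 0.0
    global_atomics = 0
    for block in range(num_blocks):
        # Phase 1: each thread sums its grid-strided elements privately.
        shared = [
            sum(values[i] for i in range(block * block_dim + t, n, stride))
            for t in range(block_dim)
        ]
        # Phase 2: tree reduction, halving the number of active threads each step.
        s = block_dim // 2
        while s > 0:
            for t in range(s):
                shared[t] += shared[t + s]
            s //= 2
        # Phase 3: thread 0 of the block performs the single global atomic add.
        total += shared[0]
        global_atomics += 1
    return total, global_atomics

values = [float(i % 7) for i in range(10_000)]
total, atomics = privatized_sum_model(values, num_blocks=4, block_dim=8)
print(f"sum = {total}, global atomics = {atomics} (naive would use {len(values)})")
```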

---

## Part 6: Shared Memory Atomics

### Faster Atomics in Shared Memory

In [None]:
@cuda.jit
def count_values_global_atomic(arr, counts, n, num_bins):
    """Count values using global memory atomics (slow)."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        bin_idx = arr[i] % num_bins
        cuda.atomic.add(counts, bin_idx, 1)  # Global atomic!

@cuda.jit
def count_values_shared_atomic(arr, counts, n, num_bins):
    """
    Count values using shared memory atomics (faster).
    Assumes num_bins <= 256 (the shared array size) and num_bins <= blockDim.x.
    
    1. Accumulate in shared memory (fast atomics)
    2. Merge to global memory (fewer atomics)
    """
    # Shared memory histogram
    shared_counts = cuda.shared.array(256, dtype=np.int32)
    
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # Initialize shared memory
    if tid < num_bins:
        shared_counts[tid] = 0
    cuda.syncthreads()
    
    # Phase 1: Count in shared memory
    for i in range(gid, n, stride):
        bin_idx = arr[i] % num_bins
        cuda.atomic.add(shared_counts, bin_idx, 1)  # Shared atomic (fast!)
    
    cuda.syncthreads()
    
    # Phase 2: Merge to global
    if tid < num_bins:
        cuda.atomic.add(counts, tid, shared_counts[tid])  # One global atomic per bin

In [None]:
# Benchmark global vs shared atomics
n = 10_000_000
num_bins = 256

arr = np.random.randint(0, num_bins, n).astype(np.int32)
d_arr = cuda.to_device(arr)

iterations = 50

# Global atomics
d_counts1 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
count_values_global_atomic[256, 256](d_arr, d_counts1, n, num_bins)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_counts1 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
    count_values_global_atomic[256, 256](d_arr, d_counts1, n, num_bins)
cuda.synchronize()
global_time = (time.perf_counter() - start) / iterations * 1000

# Shared atomics
d_counts2 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
count_values_shared_atomic[256, 256](d_arr, d_counts2, n, num_bins)
cuda.synchronize()

start = time.perf_counter()
for _ in range(iterations):
    d_counts2 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
    count_values_shared_atomic[256, 256](d_arr, d_counts2, n, num_bins)
cuda.synchronize()
shared_time = (time.perf_counter() - start) / iterations * 1000

print(f"Counting Benchmark (N={n:,}, bins={num_bins})")
print(f"{'='*50}")
print(f"Global memory atomics: {global_time:.3f} ms")
print(f"Shared memory atomics: {shared_time:.3f} ms")
print(f"Speedup:               {global_time/shared_time:.1f}x")
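
The two-phase histogram can also be modeled on the CPU to compare how many global atomics each approach issues. In the sketch below, `histogram_privatized` is a hypothetical name, the grid shape is an arbitrary assumption, and the per-block list stands in for the shared-memory array. The naive kernel issues one global atomic per element, while the privatized version issues one per bin per block.

```python
from collections import Counter

def histogram_privatized(data, num_bins, num_blocks, block_dim):
    """Sequential model of per-block private histograms merged into a global one."""
    counts = [0] * num_bins
    global_atomics = 0
    stride = num_blocks * block_dim
    for block in range(num_blocks):
        shared = [0] * num_bins  # stands in for the block's shared-memory histogram
        # Phase 1: every thread of the block counts its grid-strided elements locally.
        for t in range(block_dim):
            for i in range(block * block_dim + t, len(data), stride):
                shared[data[i] % num_bins] += 1
        # Phase 2: one global atomic add per bin for this block.
        for b in range(num_bins):
            counts[b] += shared[b]
            global_atomics += 1
    return counts, global_atomics

data = [(i * 37) % 16 for i in range(4096)]
counts, atomics = histogram_privatized(data, num_bins=16, num_blocks=4, block_dim=32)
reference = Counter(data)
print(f"global atomics: {atomics} (naive would use {len(data)})")
```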

---

## Part 7: Compare-and-Swap (CAS)

### Building Custom Atomics

In [None]:
# cuda.atomic.compare_and_swap(array, old, val)
# If array[0] == old, set array[0] = val (it always targets the FIRST element)
# Returns the value array[0] held before the call
# Numba documents it for integer arrays only (int32/int64/uint32/uint64)
#
# Any other atomic update can be built from a CAS retry loop.

@cuda.jit(device=True)
def atomic_max_cas(arr, val):
    """
    Implement atomic max on arr[0] using CAS (integer arrays).
    Numba already provides cuda.atomic.max for ints and floats;
    this device function only illustrates the custom-atomic pattern.
    Call it from inside a kernel.
    """
    old = arr[0]
    
    # Keep trying until we succeed or arr[0] is already >= val
    while val > old:
        # Try to swap old with val
        assumed = old
        old = cuda.atomic.compare_and_swap(arr, assumed, val)
        
        # If old == assumed, swap succeeded
        # If old != assumed, someone else updated, retry
        if old == assumed:
            break
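
The retry loop above can be traced on the CPU. A common way to get a float atomic max from an integer-only CAS, for example with `atomicCAS` in CUDA C++, is to run the loop on the value's 32-bit pattern. The sketch below models that idea with `struct`. All helper names are hypothetical, and `compare_and_swap` here is a plain Python stand-in for the hardware operation, not Numba's function.

```python
import struct

def float_to_bits(x):
    """Reinterpret a float32 value as its unsigned 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Reinterpret an unsigned 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def compare_and_swap(cell, old, val):
    """Model of CAS on cell[0]: store val only if cell[0] == old; return the previous value."""
    prev = cell[0]
    if prev == old:
        cell[0] = val
    return prev

def atomic_max_float(cell, val):
    """CAS retry loop: cell[0] holds float bits; raise it to val if val is larger."""
    old_bits = cell[0]
    while val > bits_to_float(old_bits):
        assumed = old_bits
        old_bits = compare_and_swap(cell, assumed, float_to_bits(val))
        if old_bits == assumed:  # nobody changed the cell in between: done
            break
    return bits_to_float(old_bits)

cell = [float_to_bits(-1.5)]
for v in [0.25, -3.0, 2.5, 1.0]:  # each value plays the role of one thread's candidate
    atomic_max_float(cell, v)
result = bits_to_float(cell[0])
print(f"max = {result}")
```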

---

## Part 8: When to Use What?

### Decision Guide

In [None]:
print("""
╔═══════════════════════════════════════════════════════════════╗
║           WHEN TO USE REDUCTION VS ATOMICS                    ║
╠═══════════════════════════════════════════════════════════════╣
║                                                               ║
║  Use REDUCTION when:                                          ║
║  ├─ Computing single result (sum, max, min)                   ║
║  ├─ Regular access pattern (element-wise)                     ║
║  ├─ Maximum performance needed                                ║
║  └─ Can use shared memory + warp primitives                   ║
║                                                               ║
║  Use ATOMICS when:                                            ║
║  ├─ Multiple output locations (histogram)                     ║
║  ├─ Irregular/data-dependent access pattern                   ║
║  ├─ Simple counting/accumulation                              ║
║  └─ Low contention (few conflicts per location)               ║
║                                                               ║
║  Use PRIVATIZATION when:                                      ║
║  ├─ High atomic contention expected                           ║
║  ├─ Can accumulate locally first                              ║
║  └─ Want best of both worlds                                  ║
║                                                               ║
╚═══════════════════════════════════════════════════════════════╝
""")

---

## Exercises

### Exercise 1: Count Specific Values

In [None]:
# TODO: Count how many times each unique value appears in array
# Use atomic add to shared memory, then merge to global

@cuda.jit
def count_occurrences(arr, counts, n):
    """Count occurrences of values 0-255 in array."""
    pass

# Test with arr = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

### Exercise 2: Find First Occurrence

In [None]:
# TODO: Find the index of first element > threshold
# Use atomic min on the index

@cuda.jit
def find_first_above(arr, threshold, result_idx, n):
    """Find index of first element > threshold."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # TODO: For each element > threshold, do atomic min on result_idx
    # Initialize result_idx to n (meaning "not found")
    pass

### Exercise 3: Parallel Counter with Saturation

In [None]:
# TODO: Implement a counter that stops at a maximum value
# Use compare_and_swap to implement saturating increment

@cuda.jit
def saturating_increment(counter, max_val):
    """Increment counter, but don't exceed max_val."""
    # Hint: Loop with CAS until either:
    # 1. Successfully incremented, or
    # 2. Counter already at max_val
    pass

---

## Summary

### Atomic Operations Reference

| Operation | Syntax | Description |
|-----------|--------|-------------|
| Add | `cuda.atomic.add(arr, idx, val)` | arr[idx] += val |
| Max | `cuda.atomic.max(arr, idx, val)` | arr[idx] = max(...) |
| Min | `cuda.atomic.min(arr, idx, val)` | arr[idx] = min(...) |
| CAS | `cuda.atomic.compare_and_swap(arr, old, new)` | If arr[0] == old, set arr[0] = new (first element only) |

### Performance Tips

```
1. MINIMIZE CONTENTION
   • Spread atomics across locations
   • Use privatization pattern

2. PREFER SHARED MEMORY ATOMICS
   • Typically much faster than global under contention (factor varies by GPU)
   • Merge to global at end

3. USE REDUCTION WHEN POSSIBLE
   • No atomics needed for sum/max/min
   • Usually faster than per-element atomics

4. BATCH UPDATES
   • Accumulate locally first
   • One atomic per thread/warp/block
```

### Key Takeaways

1. **Race conditions break correctness** - use atomics!
2. **Atomics serialize threads** - minimize contention
3. **Shared memory atomics are faster** than global
4. **Privatization reduces contention** dramatically
5. **CAS is fundamental** - any other atomic can be built from a CAS loop

### Next: Day 4 - Histogram & Counting
Apply atomics to build practical histogram kernels!