# 🚀 Day 4: Histogram - A Complete Example

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-04/day-4-histogram.ipynb)

## Learning Objectives
- Implement histogram counting with atomics
- Use shared memory privatization for performance
- Extend to 2D histograms and image processing
- Apply best practices for counting algorithms

> **Primary Focus:** CUDA C++ code examples first, Python/Numba backup for interactive testing

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

print(f"\nCUDA available: {cuda.is_available()}")
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")

---

## Part 1: What is a Histogram?

### Counting Values into Bins

```
Data:  [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

Histogram (bins 0-9):
Bin 0: 0 occurrences  ░░░░░░░░░░
Bin 1: 2 occurrences  ██░░░░░░░░
Bin 2: 1 occurrence   █░░░░░░░░░
Bin 3: 2 occurrences  ██░░░░░░░░
Bin 4: 1 occurrence   █░░░░░░░░░
Bin 5: 2 occurrences  ██░░░░░░░░
Bin 6: 1 occurrence   █░░░░░░░░░
Bin 7: 0 occurrences  ░░░░░░░░░░
Bin 8: 0 occurrences  ░░░░░░░░░░
Bin 9: 1 occurrence   █░░░░░░░░░
```
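
The counting shown in the diagram can be reproduced on the CPU in a few lines. This is a minimal sketch using only the Python standard library; it is here to make the bin counts concrete, not to stand in for the GPU kernels that follow.

```python
from collections import Counter

# Same data and bin range as the diagram above
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
num_bins = 10

# Counter tallies each value; bins with no occurrences default to 0
counts = Counter(data)
hist = [counts.get(b, 0) for b in range(num_bins)]

print(hist)        # one count per bin, bin 0 first
print(sum(hist))   # total equals the number of data points
```

Checking that the counts sum to `len(data)` is the same sanity check the GPU tests later in this notebook rely on.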

### 🔷 CUDA C++ Implementation (Primary)

The following CUDA C++ implementation demonstrates histogram computation with shared memory privatization for reduced atomic contention.

In [None]:
%%writefile histogram.cu
// histogram.cu - GPU histogram with privatization
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand
#include <cuda_runtime.h>

#define NUM_BINS 256
#define BLOCK_SIZE 256

// Naive: Global atomics only (slow due to contention)
__global__ void histogramNaive(const unsigned char* data, int* hist, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        atomicAdd(&hist[data[i]], 1);  // High contention!
    }
}

// Optimized: Shared memory privatization
__global__ void histogramPrivatized(const unsigned char* data, int* hist, int n) {
    // Private histogram in shared memory
    __shared__ int localHist[NUM_BINS];
    
    // Initialize shared memory
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        localHist[i] = 0;
    }
    __syncthreads();
    
    // Count into shared memory (low contention within block)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        atomicAdd(&localHist[data[i]], 1);
    }
    __syncthreads();
    
    // Merge local histograms to global (once per bin per block)
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x) {
        if (localHist[i] > 0) {
            atomicAdd(&hist[i], localHist[i]);
        }
    }
}

int main() {
    int n = 10000000;  // 10M data points
    
    unsigned char *h_data = (unsigned char*)malloc(n);
    for (int i = 0; i < n; i++) {
        h_data[i] = rand() % 256;
    }
    
    unsigned char *d_data;
    int *d_hist;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_hist, NUM_BINS * sizeof(int));
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(int));
    
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    
    histogramPrivatized<<<256, 256>>>(d_data, d_hist, n);
    
    int h_hist[NUM_BINS];
    cudaMemcpy(h_hist, d_hist, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
    
    printf("Sample histogram values:\n");
    for (int i = 0; i < 5; i++) {
        printf("  Bin %d: %d\n", i, h_hist[i]);
    }
    
    cudaFree(d_data); cudaFree(d_hist);
    free(h_data);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o histogram histogram.cu
!./histogram
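
To see why privatization helps, it can be useful to count global atomic operations for the launch configuration used above (256 blocks, 256 bins, 10M elements). The sketch below is plain arithmetic; `global_atomic_counts` is a hypothetical helper name, and the privatized figure is an upper bound because blocks skip bins whose local count is zero.

```python
def global_atomic_counts(n, num_blocks, num_bins):
    """Return (naive, privatized_upper_bound) counts of global atomics.

    naive: one global atomicAdd per input element.
    privatized: at most one global atomicAdd per bin per block.
    """
    naive = n
    privatized_upper_bound = num_blocks * num_bins
    return naive, privatized_upper_bound

naive, privatized = global_atomic_counts(10_000_000, 256, 256)
print(f"naive global atomics:      {naive:,}")
print(f"privatized (upper bound):  {privatized:,}")
print(f"reduction factor:          {naive / privatized:.1f}x")
```

The privatized kernel still performs one atomic per element, but those go to shared memory; only the merge step touches global memory.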

### 🔶 Python/Numba (Optional - Quick Testing)

CPU baseline for comparison with GPU histogram implementations.

In [None]:
# CPU baseline for comparison
def cpu_histogram(data, num_bins):
    """Simple CPU histogram."""
    hist = np.zeros(num_bins, dtype=np.int32)
    for val in data:
        if 0 <= val < num_bins:
            hist[val] += 1
    return hist

# NumPy optimized
def numpy_histogram(data, num_bins):
    return np.bincount(data.astype(np.int32), minlength=num_bins)[:num_bins]

# Test
data = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=np.int32)
hist = cpu_histogram(data, 10)
print(f"Data: {data}")
print(f"Histogram: {hist}")

---

## Part 2: Naive GPU Histogram

### Using Global Memory Atomics

In [None]:
@cuda.jit
def histogram_global_atomic(data, hist, n, num_bins):
    """
    Naive histogram: each thread does global atomic add.
    Simple but slow due to contention.
    """
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        val = data[i]
        if 0 <= val < num_bins:
            cuda.atomic.add(hist, val, 1)

In [None]:
# Test naive histogram
n = 10_000_000
num_bins = 256

# Random data with values 0-255
data = np.random.randint(0, num_bins, n).astype(np.int32)
d_data = cuda.to_device(data)

blocks, threads = 256, 256

# Reset and compute
d_hist = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
histogram_global_atomic[blocks, threads](d_data, d_hist, n, num_bins)

gpu_hist = d_hist.copy_to_host()
cpu_hist = numpy_histogram(data, num_bins)

print(f"GPU histogram matches CPU: {'✓' if np.array_equal(gpu_hist, cpu_hist) else '✗'}")
print(f"\nFirst 10 bins: {gpu_hist[:10]}")
print(f"Total count: {np.sum(gpu_hist):,} (expected: {n:,})")

---

## Part 3: Optimized Histogram with Shared Memory

### Privatization Pattern

In [None]:
@cuda.jit
def histogram_shared_atomic(data, hist, n, num_bins):
    """
    Optimized histogram using shared memory privatization.
    
    1. Each block has private histogram in shared memory
    2. Threads atomically update shared (fast!)
    3. Merge to global at the end (fewer global atomics)
    """
    # Shared memory for block's private histogram
    # Fixed size: this kernel supports at most 256 bins
    shared_hist = cuda.shared.array(256, dtype=np.int32)
    
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # Phase 1: Initialize shared histogram to zeros
    # (strided loop so it also works when blockDim.x < num_bins)
    for b in range(tid, num_bins, cuda.blockDim.x):
        shared_hist[b] = 0
    cuda.syncthreads()
    
    # Phase 2: Count into shared memory (fast atomics!)
    for i in range(gid, n, stride):
        val = data[i]
        if 0 <= val < num_bins:
            cuda.atomic.add(shared_hist, val, 1)
    
    cuda.syncthreads()
    
    # Phase 3: Merge to global (at most one atomic per bin per block)
    # (strided loop so it also works when blockDim.x < num_bins)
    for b in range(tid, num_bins, cuda.blockDim.x):
        if shared_hist[b] > 0:
            cuda.atomic.add(hist, b, shared_hist[b])

In [None]:
# Test optimized histogram
d_hist = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
histogram_shared_atomic[blocks, threads](d_data, d_hist, n, num_bins)

gpu_hist_opt = d_hist.copy_to_host()

print(f"Optimized histogram matches CPU: {'✓' if np.array_equal(gpu_hist_opt, cpu_hist) else '✗'}")
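
The three phases of the privatization pattern can also be traced on the CPU. The following is a sequential sketch, not a GPU implementation: `simulate_privatized` is a hypothetical helper in which a plain list stands in for each block's shared-memory histogram and the grid-stride loop is replayed thread by thread.

```python
def simulate_privatized(data, num_bins, num_blocks, threads_per_block):
    """Replay the privatized kernel sequentially; return (hist, global_updates)."""
    grid_size = num_blocks * threads_per_block
    global_hist = [0] * num_bins
    global_updates = 0

    for block in range(num_blocks):
        # Phase 1: zeroed private histogram (stands in for shared memory)
        local = [0] * num_bins

        # Phase 2: every thread of the block walks its grid-stride slice
        for t in range(threads_per_block):
            gid = block * threads_per_block + t
            for i in range(gid, len(data), grid_size):
                local[data[i]] += 1

        # Phase 3: merge non-empty bins into the global histogram
        for b in range(num_bins):
            if local[b] > 0:
                global_hist[b] += local[b]
                global_updates += 1

    return global_hist, global_updates

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
hist, updates = simulate_privatized(data, num_bins=10, num_blocks=2, threads_per_block=2)
print(hist)
print(f"global updates: {updates} (vs {len(data)} for the naive kernel)")
```

Even on this tiny input the merge step issues fewer global updates than there are elements; the gap grows with input size because the merge cost is bounded by blocks times bins.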

---

## Part 4: Performance Comparison

In [None]:
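def benchmark_histograms(n, num_bins=256, iterations=50):
    """Compare histogram implementations.

    Note: the GPU timings include the per-iteration histogram reset
    (a small host-to-device copy), not just the kernel itself.
    """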
    data = np.random.randint(0, num_bins, n).astype(np.int32)
    d_data = cuda.to_device(data)
    
    blocks, threads = 256, 256
    
    # CPU (NumPy)
    start = time.perf_counter()
    for _ in range(iterations):
        _ = numpy_histogram(data, num_bins)
    cpu_time = (time.perf_counter() - start) / iterations * 1000
    
    # GPU Global Atomic
    d_hist1 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
    histogram_global_atomic[blocks, threads](d_data, d_hist1, n, num_bins)
    cuda.synchronize()
    
    start = time.perf_counter()
    for _ in range(iterations):
        d_hist1 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
        histogram_global_atomic[blocks, threads](d_data, d_hist1, n, num_bins)
    cuda.synchronize()
    global_time = (time.perf_counter() - start) / iterations * 1000
    
    # GPU Shared Atomic
    d_hist2 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
    histogram_shared_atomic[blocks, threads](d_data, d_hist2, n, num_bins)
    cuda.synchronize()
    
    start = time.perf_counter()
    for _ in range(iterations):
        d_hist2 = cuda.to_device(np.zeros(num_bins, dtype=np.int32))
        histogram_shared_atomic[blocks, threads](d_data, d_hist2, n, num_bins)
    cuda.synchronize()
    shared_time = (time.perf_counter() - start) / iterations * 1000
    
    return cpu_time, global_time, shared_time

# Benchmark
sizes = [1_000_000, 5_000_000, 10_000_000, 50_000_000]

print(f"Histogram Benchmark (256 bins)")
print(f"{'='*70}")
print(f"{'Size':>12} | {'CPU (ms)':>10} | {'Global (ms)':>12} | {'Shared (ms)':>12} | {'Speedup':>8}")
print(f"{'-'*70}")

for n in sizes:
    cpu_t, global_t, shared_t = benchmark_histograms(n)
    speedup = cpu_t / shared_t
    print(f"{n:>12,} | {cpu_t:>10.2f} | {global_t:>12.2f} | {shared_t:>12.2f} | {speedup:>7.1f}x")
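
Raw milliseconds are easier to compare across sizes when converted to throughput. The sketch below assumes 4 bytes per element because the benchmark uses `int32` input; `throughput` is a hypothetical helper, and the 2.0 ms figure is an illustrative placeholder, not a measured result.

```python
def throughput(n_elements, elapsed_ms, bytes_per_element=4):
    """Convert an elapsed time in ms to (elements/s, GB/s of input read)."""
    seconds = elapsed_ms / 1000.0
    elements_per_s = n_elements / seconds
    gb_per_s = n_elements * bytes_per_element / seconds / 1e9
    return elements_per_s, gb_per_s

# Illustrative numbers only: 10M int32 elements in a made-up 2.0 ms
elems, gbps = throughput(10_000_000, 2.0)
print(f"{elems:.3e} elements/s, {gbps:.1f} GB/s")
```

Comparing the GB/s figure against the device's memory bandwidth gives a rough sense of how far a kernel is from being memory-bound.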

---

## Part 5: Real-Value Histograms

### Handling Continuous Data

In [None]:
@cuda.jit
def histogram_float(data, hist, n, num_bins, min_val, max_val):
    """
    Histogram for floating-point data with specified range.
    
    Maps [min_val, max_val) to bins [0, num_bins)
    """
    shared_hist = cuda.shared.array(256, dtype=np.int32)
    
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # Initialize shared memory
    if tid < num_bins:
        shared_hist[tid] = 0
    cuda.syncthreads()
    
    # Calculate bin width
    bin_width = (max_val - min_val) / num_bins
    
    # Count
    for i in range(gid, n, stride):
        val = data[i]
        
        # Calculate bin index
        if val >= min_val and val < max_val:
            bin_idx = int((val - min_val) / bin_width)
            bin_idx = min(bin_idx, num_bins - 1)  # Handle edge case
            cuda.atomic.add(shared_hist, bin_idx, 1)
    
    cuda.syncthreads()
    
    # Merge to global
    if tid < num_bins:
        if shared_hist[tid] > 0:
            cuda.atomic.add(hist, tid, shared_hist[tid])

In [None]:
# Test float histogram
n = 1_000_000
num_bins = 50

# Generate normal distribution
data = np.random.randn(n).astype(np.float32)
min_val, max_val = -4.0, 4.0

d_data = cuda.to_device(data)
d_hist = cuda.to_device(np.zeros(num_bins, dtype=np.int32))

histogram_float[256, 256](d_data, d_hist, n, num_bins, min_val, max_val)

gpu_hist = d_hist.copy_to_host()

# Visualize
print(f"Histogram of Normal Distribution (N={n:,})")
print(f"Range: [{min_val}, {max_val})")
print()

max_count = max(gpu_hist)
bin_width = (max_val - min_val) / num_bins

for i in range(0, num_bins, 5):  # Show every 5th bin
    bin_start = min_val + i * bin_width
    bar_len = int(gpu_hist[i] / max_count * 30)
    print(f"{bin_start:>6.1f}: {'█' * bar_len} {gpu_hist[i]:,}")
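
The bin-index calculation inside `histogram_float` can be checked in isolation on the CPU. This is a minimal sketch of the same mapping; `bin_index` is a hypothetical helper, and 8 bins over `[-4, 4)` are used here so that the bin width is exactly 1.0 and the arithmetic is easy to follow.

```python
def bin_index(val, min_val, max_val, num_bins):
    """Map val in [min_val, max_val) to a bin in [0, num_bins), else None."""
    # Out-of-range values (and NaN, which fails both comparisons) are rejected
    if not (min_val <= val < max_val):
        return None
    bin_width = (max_val - min_val) / num_bins
    idx = int((val - min_val) / bin_width)
    # Clamp guards against rounding pushing a value just below max_val too far
    return min(idx, num_bins - 1)

for v in (-4.5, -4.0, 0.0, 3.999, 4.0):
    print(v, "->", bin_index(v, -4.0, 4.0, 8))
```

The lower bound is inclusive and the upper bound is exclusive, so `-4.0` lands in the first bin while `4.0` is rejected.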

---

## Part 6: 2D Histogram (Joint Distribution)

In [None]:
@cuda.jit
def histogram_2d(x_data, y_data, hist, n, 
                 x_bins, y_bins, 
                 x_min, x_max, y_min, y_max):
    """
    2D histogram for joint distribution.
    
    hist has shape (x_bins, y_bins) flattened to 1D
    """
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    x_width = (x_max - x_min) / x_bins
    y_width = (y_max - y_min) / y_bins
    
    for i in range(tid, n, stride):
        x = x_data[i]
        y = y_data[i]
        
        if x >= x_min and x < x_max and y >= y_min and y < y_max:
            x_bin = int((x - x_min) / x_width)
            y_bin = int((y - y_min) / y_width)
            
            x_bin = min(x_bin, x_bins - 1)
            y_bin = min(y_bin, y_bins - 1)
            
            # Flatten 2D index
            flat_idx = x_bin * y_bins + y_bin
            cuda.atomic.add(hist, flat_idx, 1)

In [None]:
# Test 2D histogram
n = 1_000_000
x_bins, y_bins = 20, 20

# Correlated normal distributions
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]  # Correlation = 0.8
xy = np.random.multivariate_normal(mean, cov, n).astype(np.float32)
x_data, y_data = xy[:, 0], xy[:, 1]

d_x = cuda.to_device(x_data)
d_y = cuda.to_device(y_data)
d_hist = cuda.to_device(np.zeros(x_bins * y_bins, dtype=np.int32))

histogram_2d[256, 256](d_x, d_y, d_hist, n,
                       x_bins, y_bins,
                       -4.0, 4.0, -4.0, 4.0)

hist_2d = d_hist.copy_to_host().reshape(x_bins, y_bins)

# Simple ASCII visualization
print("2D Histogram (Correlated Normal, ρ=0.8)")
print("="*50)

chars = " ░▒▓█"
max_val = hist_2d.max()

for i in range(x_bins-1, -1, -1):  # Reverse for proper orientation
    row = ""
    for j in range(y_bins):
        level = int(hist_2d[i, j] / max_val * (len(chars) - 1))
        row += chars[level]
    print(row)

print(f"\nPeak count: {max_val:,}")
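
The flattened indexing used by `histogram_2d` is ordinary row-major layout, which is also why the host can simply `reshape` the result. The sketch below shows the mapping and its inverse; `flatten` and `unflatten` are hypothetical helper names.

```python
def flatten(x_bin, y_bin, y_bins):
    """Row-major flat index for bin (x_bin, y_bin)."""
    return x_bin * y_bins + y_bin

def unflatten(flat_idx, y_bins):
    """Inverse of flatten: recover (x_bin, y_bin)."""
    return divmod(flat_idx, y_bins)

x_bins, y_bins = 20, 20
idx = flatten(3, 7, y_bins)
print(idx, unflatten(idx, y_bins))

# Every 2D bin maps to a distinct flat index and back again
round_trip_ok = all(
    unflatten(flatten(x, y, y_bins), y_bins) == (x, y)
    for x in range(x_bins) for y in range(y_bins)
)
print(round_trip_ok)
```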

---

## Part 7: Sparse Histogram (Large Bin Count)

In [None]:
# When num_bins > shared memory size, we need a different approach

@cuda.jit
def histogram_large_bins(data, hist, n, num_bins):
    """
    Histogram for a large number of bins (too many for shared memory).
    Falls back to direct global atomics; contention is often lower here
    because updates are spread over many bins.
    """
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # Grid-stride loop: one global atomic per in-range element
    for i in range(tid, n, stride):
        val = data[i]
        if 0 <= val < num_bins:
            cuda.atomic.add(hist, val, 1)

# For very sparse histograms, consider hash-based approaches
print("Large Bin Strategies:")
print("="*50)
print("1. If bins fit in shared memory: Use privatization")
print("2. If bins exceed shared memory: Direct global atomics")
print("3. If very sparse: Hash table or sorting")
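
For the "very sparse" case, the sorting strategy can be sketched on the CPU as sort followed by run-length counting, which stores only the bins that actually occur. This is a sequential illustration under that assumption; `sparse_histogram` is a hypothetical helper, and on a GPU the same idea is usually expressed as a parallel sort followed by a reduce-by-key step.

```python
from itertools import groupby

def sparse_histogram(data):
    """Return sorted (value, count) pairs for the values present in data."""
    # After sorting, equal values are adjacent, so each run is one bin
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted(data))]

# Values drawn from a huge range, but only three distinct bins are occupied
data = [1_000_003, 42, 42, 999_999_999, 42, 1_000_003]
pairs = sparse_histogram(data)
print(pairs)
```

Memory use scales with the number of distinct values rather than with the size of the value range, which is the point of this approach.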

---

## Part 8: Practical Applications

### Image Histogram

In [None]:
@cuda.jit
def image_histogram_grayscale(image, hist, height, width):
    """
    Compute histogram of grayscale image (0-255).
    """
    shared_hist = cuda.shared.array(256, dtype=np.int32)
    
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    n = height * width
    
    # Initialize
    if tid < 256:
        shared_hist[tid] = 0
    cuda.syncthreads()
    
    # Count pixels
    for i in range(gid, n, stride):
        row = i // width
        col = i % width
        pixel = image[row, col]
        cuda.atomic.add(shared_hist, pixel, 1)
    
    cuda.syncthreads()
    
    # Merge
    if tid < 256:
        cuda.atomic.add(hist, tid, shared_hist[tid])

In [None]:
# Simulate a grayscale image
height, width = 1080, 1920

# Create gradient with some noise
image = np.zeros((height, width), dtype=np.uint8)
for i in range(height):
    image[i, :] = np.clip(i * 256 // height + np.random.randint(-20, 20, width), 0, 255)

d_image = cuda.to_device(image)
d_hist = cuda.to_device(np.zeros(256, dtype=np.int32))

image_histogram_grayscale[256, 256](d_image, d_hist, height, width)

hist = d_hist.copy_to_host()

print(f"Image Histogram ({width}x{height} image)")
print("="*60)

# Show distribution
max_count = max(hist)
for i in range(0, 256, 32):
    segment_sum = sum(hist[i:i+32])
    bar_len = int(segment_sum / (max_count * 10) * 40)
    print(f"{i:3d}-{i+31:3d}: {'█' * bar_len} {segment_sum:,}")
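
The image kernel walks a 2D image with a 1D grid-stride loop, converting each flat index to `(row, col)`. The sketch below replays that index arithmetic on a tiny image to show that every pixel is visited exactly once; the sizes and the `grid_size` value are illustrative assumptions.

```python
height, width = 3, 4
grid_size = 5   # hypothetical total thread count, fewer than the 12 pixels

visited = [[0] * width for _ in range(height)]

# Each simulated thread starts at its global id and strides by the grid size
for gid in range(grid_size):
    for i in range(gid, height * width, grid_size):
        row, col = i // width, i % width
        visited[row][col] += 1

for row in visited:
    print(row)
```

Because the stride equals the total thread count, no two threads ever map to the same flat index, so the only contention in the kernel is on the histogram bins themselves.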

---

## 🎯 Exercises

### 🔷 CUDA C++ Exercises (Primary)

In [None]:
%%writefile histogram_exercises.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Error checking macro
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            printf("CUDA Error: %s at line %d\n", cudaGetErrorString(err), __LINE__); \
            exit(1); \
        } \
    } while(0)

// ============================================================
// Exercise 1: Weighted Histogram
// ============================================================
// Instead of counting +1 per element, add weight[i]
// Uses shared memory for local accumulation

__global__ void histogramWeighted(const int* data, const float* weights, 
                                   float* hist, int n, int numBins) {
    extern __shared__ float sharedHist[];
    
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    // Initialize shared histogram
    for (int i = tid; i < numBins; i += blockDim.x) {
        sharedHist[i] = 0.0f;
    }
    __syncthreads();
    
    // Accumulate weights in shared memory
    for (int i = gid; i < n; i += stride) {
        int bin = data[i];
        if (bin >= 0 && bin < numBins) {
            atomicAdd(&sharedHist[bin], weights[i]);
        }
    }
    __syncthreads();
    
    // Merge to global histogram
    for (int i = tid; i < numBins; i += blockDim.x) {
        if (sharedHist[i] != 0.0f) {  // skip untouched bins; negative sums must still be merged
            atomicAdd(&hist[i], sharedHist[i]);
        }
    }
}

// ============================================================
// Exercise 2: Histogram with Overflow Bins
// ============================================================
// hist[0] = underflow (values < minVal)
// hist[1..numBins] = normal bins
// hist[numBins+1] = overflow (values >= maxVal)

__global__ void histogramWithOverflow(const float* data, int* hist, int n,
                                       int numBins, float minVal, float maxVal) {
    __shared__ int sharedHist[258];  // Supports numBins <= 256, plus 2 for under/overflow
    
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int totalBins = numBins + 2;
    
    // Initialize shared histogram
    for (int i = tid; i < totalBins; i += blockDim.x) {
        sharedHist[i] = 0;
    }
    __syncthreads();
    
    float binWidth = (maxVal - minVal) / numBins;
    
    // Bin data with overflow handling
    for (int i = gid; i < n; i += stride) {
        float val = data[i];
        int bin;
        
        if (val < minVal) {
            bin = 0;  // Underflow bin
        } else if (val >= maxVal) {
            bin = numBins + 1;  // Overflow bin
        } else {
            bin = 1 + (int)((val - minVal) / binWidth);
            bin = min(bin, numBins);  // Handle edge case
        }
        
        atomicAdd(&sharedHist[bin], 1);
    }
    __syncthreads();
    
    // Merge to global
    for (int i = tid; i < totalBins; i += blockDim.x) {
        if (sharedHist[i] > 0) {
            atomicAdd(&hist[i], sharedHist[i]);
        }
    }
}

// ============================================================
// Exercise 3: RGB Color Histogram
// ============================================================
// Compute separate histograms for R, G, B channels
// Image stored as interleaved RGB (R0,G0,B0,R1,G1,B1,...)

__global__ void rgbHistogram(const unsigned char* image, 
                              int* histR, int* histG, int* histB,
                              int numPixels) {
    __shared__ int sharedR[256], sharedG[256], sharedB[256];
    
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    // Initialize shared histograms
    for (int i = tid; i < 256; i += blockDim.x) {
        sharedR[i] = 0;
        sharedG[i] = 0;
        sharedB[i] = 0;
    }
    __syncthreads();
    
    // Accumulate in shared memory
    for (int i = gid; i < numPixels; i += stride) {
        unsigned char r = image[i * 3 + 0];
        unsigned char g = image[i * 3 + 1];
        unsigned char b = image[i * 3 + 2];
        
        atomicAdd(&sharedR[r], 1);
        atomicAdd(&sharedG[g], 1);
        atomicAdd(&sharedB[b], 1);
    }
    __syncthreads();
    
    // Merge to global
    for (int i = tid; i < 256; i += blockDim.x) {
        if (sharedR[i] > 0) atomicAdd(&histR[i], sharedR[i]);
        if (sharedG[i] > 0) atomicAdd(&histG[i], sharedG[i]);
        if (sharedB[i] > 0) atomicAdd(&histB[i], sharedB[i]);
    }
}

// ============================================================
// Test Functions
// ============================================================
void testWeightedHistogram() {
    printf("=== Exercise 1: Weighted Histogram ===\n");
    
    // Simple test: data = [0, 1, 1, 2], weights = [1.0, 2.0, 3.0, 4.0]
    const int N = 4;
    const int numBins = 8;
    int h_data[] = {0, 1, 1, 2};
    float h_weights[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_hist[8] = {0};
    
    int* d_data;
    float* d_weights, *d_hist;
    
    CHECK_CUDA(cudaMalloc(&d_data, N * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&d_weights, N * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&d_hist, numBins * sizeof(float)));
    
    CHECK_CUDA(cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_weights, h_weights, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemset(d_hist, 0, numBins * sizeof(float)));
    
    histogramWeighted<<<1, 32, numBins * sizeof(float)>>>(d_data, d_weights, d_hist, N, numBins);
    CHECK_CUDA(cudaDeviceSynchronize());
    
    CHECK_CUDA(cudaMemcpy(h_hist, d_hist, numBins * sizeof(float), cudaMemcpyDeviceToHost));
    
    printf("Weighted histogram: ");
    for (int i = 0; i < numBins; i++) {
        if (h_hist[i] > 0) printf("bin[%d]=%.1f ", i, h_hist[i]);
    }
    printf("\nExpected: bin[0]=1.0 bin[1]=5.0 bin[2]=4.0\n");
    
    bool correct = (h_hist[0] == 1.0f && h_hist[1] == 5.0f && h_hist[2] == 4.0f);
    printf("Test %s\n\n", correct ? "PASSED ✓" : "FAILED ✗");
    
    cudaFree(d_data);
    cudaFree(d_weights);
    cudaFree(d_hist);
}

void testOverflowHistogram() {
    printf("=== Exercise 2: Histogram with Overflow Bins ===\n");
    
    const int N = 100;
    float h_data[N];
    
    // Generate data: some under 0, some in [0,10), some >= 10
    for (int i = 0; i < N; i++) {
        h_data[i] = (float)(i - 20) / 5.0f;  // Range: -4 to 15.8
    }
    
    int numBins = 10;
    int totalBins = numBins + 2;
    int h_hist[12] = {0};
    
    float* d_data;
    int* d_hist;
    
    CHECK_CUDA(cudaMalloc(&d_data, N * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&d_hist, totalBins * sizeof(int)));
    
    CHECK_CUDA(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemset(d_hist, 0, totalBins * sizeof(int)));
    
    histogramWithOverflow<<<4, 64>>>(d_data, d_hist, N, numBins, 0.0f, 10.0f);
    CHECK_CUDA(cudaDeviceSynchronize());
    
    CHECK_CUDA(cudaMemcpy(h_hist, d_hist, totalBins * sizeof(int), cudaMemcpyDeviceToHost));
    
    printf("Histogram with overflow (range [0, 10), %d bins):\n", numBins);
    printf("  Underflow (<0): %d\n", h_hist[0]);
    for (int i = 1; i <= numBins; i++) {
        printf("  Bin %d [%.1f-%.1f): %d\n", i-1, (i-1)*1.0f, i*1.0f, h_hist[i]);
    }
    printf("  Overflow (>=10): %d\n", h_hist[numBins + 1]);
    
    int total = 0;
    for (int i = 0; i < totalBins; i++) total += h_hist[i];
    printf("Total count: %d (expected %d)\n", total, N);
    printf("Test %s\n\n", (total == N) ? "PASSED ✓" : "FAILED ✗");
    
    cudaFree(d_data);
    cudaFree(d_hist);
}

void testRGBHistogram() {
    printf("=== Exercise 3: RGB Color Histogram ===\n");
    
    const int numPixels = 10000;
    unsigned char* h_image = (unsigned char*)malloc(numPixels * 3);
    
    // Create test image with known color distribution
    srand(42);
    for (int i = 0; i < numPixels * 3; i++) {
        h_image[i] = rand() % 256;
    }
    
    int h_histR[256] = {0}, h_histG[256] = {0}, h_histB[256] = {0};
    
    // CPU reference
    int cpuR[256] = {0}, cpuG[256] = {0}, cpuB[256] = {0};
    for (int i = 0; i < numPixels; i++) {
        cpuR[h_image[i*3 + 0]]++;
        cpuG[h_image[i*3 + 1]]++;
        cpuB[h_image[i*3 + 2]]++;
    }
    
    unsigned char* d_image;
    int *d_histR, *d_histG, *d_histB;
    
    CHECK_CUDA(cudaMalloc(&d_image, numPixels * 3));
    CHECK_CUDA(cudaMalloc(&d_histR, 256 * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&d_histG, 256 * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&d_histB, 256 * sizeof(int)));
    
    CHECK_CUDA(cudaMemcpy(d_image, h_image, numPixels * 3, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemset(d_histR, 0, 256 * sizeof(int)));
    CHECK_CUDA(cudaMemset(d_histG, 0, 256 * sizeof(int)));
    CHECK_CUDA(cudaMemset(d_histB, 0, 256 * sizeof(int)));
    
    rgbHistogram<<<64, 256>>>(d_image, d_histR, d_histG, d_histB, numPixels);
    CHECK_CUDA(cudaDeviceSynchronize());
    
    CHECK_CUDA(cudaMemcpy(h_histR, d_histR, 256 * sizeof(int), cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaMemcpy(h_histG, d_histG, 256 * sizeof(int), cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaMemcpy(h_histB, d_histB, 256 * sizeof(int), cudaMemcpyDeviceToHost));
    
    // Verify
    bool correct = true;
    for (int i = 0; i < 256; i++) {
        if (h_histR[i] != cpuR[i] || h_histG[i] != cpuG[i] || h_histB[i] != cpuB[i]) {
            correct = false;
            break;
        }
    }
    
    printf("Sample RGB histogram values:\n");
    printf("  Value 128: R=%d, G=%d, B=%d\n", h_histR[128], h_histG[128], h_histB[128]);
    printf("  Value 0:   R=%d, G=%d, B=%d\n", h_histR[0], h_histG[0], h_histB[0]);
    printf("  Value 255: R=%d, G=%d, B=%d\n", h_histR[255], h_histG[255], h_histB[255]);
    printf("Test %s\n\n", correct ? "PASSED ✓" : "FAILED ✗");
    
    cudaFree(d_image);
    cudaFree(d_histR);
    cudaFree(d_histG);
    cudaFree(d_histB);
    free(h_image);
}

int main() {
    printf("╔══════════════════════════════════════════════════════════════╗\n");
    printf("║              CUDA Histogram Exercises                        ║\n");
    printf("╚══════════════════════════════════════════════════════════════╝\n\n");
    
    // Print device info
    cudaDeviceProp prop;
    CHECK_CUDA(cudaGetDeviceProperties(&prop, 0));
    printf("Device: %s\n", prop.name);
    printf("Compute Capability: %d.%d\n\n", prop.major, prop.minor);
    
    testWeightedHistogram();
    testOverflowHistogram();
    testRGBHistogram();
    
    printf("════════════════════════════════════════════════════════════════\n");
    printf("                    All exercises completed!\n");
    printf("════════════════════════════════════════════════════════════════\n");
    
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o histogram_exercises histogram_exercises.cu && ./histogram_exercises

### 🔶 Python/Numba Exercises (Optional)

### Exercise 1: Weighted Histogram

In [None]:
# TODO: Implement weighted histogram
# Instead of counting +1 per element, add weight[i]

@cuda.jit
def histogram_weighted(data, weights, hist, n, num_bins):
    """Weighted histogram: sum weights instead of counting."""
    shared_hist = cuda.shared.array(256, dtype=np.float32)
    
    tid = cuda.threadIdx.x
    gid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    # TODO: Initialize shared memory
    # TODO: Accumulate weights in shared memory
    # TODO: Merge to global
    pass

# Test: data = [0, 1, 1, 2], weights = [1.0, 2.0, 3.0, 4.0]
# Result: hist[0]=1.0, hist[1]=5.0, hist[2]=4.0
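
A CPU reference can be handy for checking a GPU solution to this exercise. The sketch below is one possible reference, not the intended kernel; `weighted_histogram_reference` is a hypothetical helper name.

```python
def weighted_histogram_reference(data, weights, num_bins):
    """Sum weights[i] into bin data[i]; out-of-range bins are ignored."""
    hist = [0.0] * num_bins
    for val, w in zip(data, weights):
        if 0 <= val < num_bins:
            hist[val] += w
    return hist

ref = weighted_histogram_reference([0, 1, 1, 2], [1.0, 2.0, 3.0, 4.0], 8)
print(ref[:3])
```

When comparing against a GPU result with arbitrary float weights, a tolerance-based comparison is safer than exact equality, since atomic additions can happen in any order.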

### Exercise 2: Histogram with Overflow Bins

In [None]:
# TODO: Add underflow and overflow bins
# hist[0] = count of values < min_val
# hist[1..num_bins] = normal bins
# hist[num_bins+1] = count of values >= max_val

@cuda.jit
def histogram_with_overflow(data, hist, n, num_bins, min_val, max_val):
    """Histogram with underflow/overflow bins."""
    pass

# Total output size = num_bins + 2
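
As with the previous exercise, a CPU reference makes it easy to validate a kernel. This sketch follows the bin layout described in the comments above and reuses the test data from the CUDA C++ version of the exercise; `overflow_histogram_reference` is a hypothetical helper name.

```python
def overflow_histogram_reference(data, num_bins, min_val, max_val):
    """hist[0]=underflow, hist[1..num_bins]=normal bins, hist[num_bins+1]=overflow."""
    hist = [0] * (num_bins + 2)
    bin_width = (max_val - min_val) / num_bins
    for val in data:
        if val < min_val:
            hist[0] += 1
        elif val >= max_val:
            hist[num_bins + 1] += 1
        else:
            idx = 1 + int((val - min_val) / bin_width)
            hist[min(idx, num_bins)] += 1   # clamp for rounding at the top edge
    return hist

# Same data as the CUDA C++ test: values from -4.0 to 15.8 in steps of 0.2
data = [(i - 20) / 5.0 for i in range(100)]
ref = overflow_histogram_reference(data, 10, 0.0, 10.0)
print(ref)
```

Every input lands in exactly one bin, so the counts should always sum to the number of elements.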

### Exercise 3: RGB Color Histogram

In [None]:
# TODO: Compute separate histograms for R, G, B channels
# image is shape (height, width, 3)

@cuda.jit
def rgb_histogram(image, hist_r, hist_g, hist_b, height, width):
    """Compute histograms for each RGB channel."""
    pass

# Each histogram should have 256 bins
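
A CPU reference for this exercise might look like the following sketch. It assumes the image is a nested sequence of shape `(height, width, 3)` with channel values in 0..255; `rgb_histogram_reference` is a hypothetical helper name.

```python
def rgb_histogram_reference(image):
    """Return (hist_r, hist_g, hist_b), each a list of 256 counts."""
    hist_r, hist_g, hist_b = [0] * 256, [0] * 256, [0] * 256
    for row in image:
        for r, g, b in row:
            hist_r[r] += 1
            hist_g[g] += 1
            hist_b[b] += 1
    return hist_r, hist_g, hist_b

# Tiny 2x2 test image: two red pixels, one mid-green, one dark blue
image = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 128, 0), (0, 0, 64)],
]
hist_r, hist_g, hist_b = rgb_histogram_reference(image)
print(hist_r[255], hist_g[128], hist_b[64])
```

Each channel histogram should sum to the pixel count, which is a quick check to run on the GPU output as well.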

---

## Summary

### Histogram Implementation Strategies

```
┌─────────────────────────────────────────────────────────────┐
│                    HISTOGRAM STRATEGIES                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Small bins (<= 256):                                       │
│  └─ Shared memory privatization (fastest)                   │
│                                                             │
│  Medium bins (< 4096):                                      │
│  └─ Multiple shared histograms per block                    │
│                                                             │
│  Large bins:                                                │
│  └─ Direct global atomics (fallback)                        │
│                                                             │
│  Very sparse:                                               │
│  └─ Sort + unique count                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Performance Tips

1. **Always use shared memory** for bins that fit
2. **Initialize shared memory in parallel**
3. **Skip merge for zero counts** (minor optimization)
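4. **Consider data distribution** - skewed data (many values landing in a few bins) is the worst case for atomic contention; uniform data spreads updates across bins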

### Key Takeaways

1. **Histograms are atomic-heavy** - need optimization
2. **Privatization is essential** for performance
3. **Shared memory atomics** are typically much faster than global atomics under contention (the exact factor depends on the GPU and the data)
4. **2D histograms** work the same way with flattened indices

---

## Week 4 Complete! 🎉

### What You Learned

| Day | Topic | Key Skills |
|-----|-------|------------|
| 1 | Parallel Reduction | Tree reduction, multi-pass |
| 2 | Warp Primitives | Shuffle, no-sync reduction |
| 3 | Atomic Operations | Thread-safe updates, CAS |
| 4 | Histogram | Privatization, shared atomics |

### Next Steps

📋 **Day 5:** Complete the checkpoint quiz

📋 **Week 5 Preview:** Matrix Operations
- Matrix-vector multiplication
- Matrix-matrix multiplication (tiled)
- Memory access optimization
- Cache blocking