<img src="Images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;">

# Exercise: Optimize Histogram

You can allocate shared memory with the `__shared__` memory space specifier.

<img src="Images/shared.png" alt= "Shared" width=900>

Use shared memory to optimize the performance of the histogram kernel. You will implement the algorithm in two stages:
1. Compute a privatized histogram for each thread block.
2. Contribute the privatized histogram to the global histogram.
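
The two stages above can be modeled on the host before touching the kernel. The following is a minimal sketch, not part of the exercise: the function name `reference_histogram` and its `block_size` parameter are hypothetical, each run of `block_size` consecutive cells stands in for one thread block, and every temperature is assumed to lie in `[0, num_bins * bin_width)`.

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int num_bins = 10;
constexpr float bin_width = 10;

// Hypothetical host-side model of the two-stage algorithm.
// Assumes every temperature falls in [0, num_bins * bin_width).
std::vector<int> reference_histogram(const std::vector<float>& temperatures,
                                     std::size_t block_size = 256)
{
  std::vector<int> histogram(num_bins, 0);

  for (std::size_t begin = 0; begin < temperatures.size(); begin += block_size)
  {
    std::size_t end = std::min(begin + block_size, temperatures.size());

    // Stage 1: privatized histogram for this "block", zero-initialized
    int block_histogram[num_bins] = {};
    for (std::size_t cell = begin; cell < end; ++cell)
    {
      int bin = static_cast<int>(temperatures[cell] / bin_width);
      block_histogram[bin]++;
    }

    // Stage 2: contribute the privatized histogram to the global one
    for (int bin = 0; bin < num_bins; ++bin)
    {
      histogram[bin] += block_histogram[bin];
    }
  }

  return histogram;
}
```

Comparing the output of your kernel against a sequential model like this one is a simple way to check that the shared memory version still counts every cell exactly once.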

<details>
<summary>Original code in case you need it.</summary>

```c++
%%writefile Sources/shmem.cu
#include "dli.cuh"

constexpr int num_bins = 10;
constexpr float bin_width = 10;

// 1. Remove `block_histograms` from kernel parameters
__global__ void histogram_kernel(cuda::std::span<float> temperatures,
                                 cuda::std::span<int> block_histograms,
                                 cuda::std::span<int> histogram) 
{
  // 2. Allocate `block_histogram` in shared memory and initialize it to 0
  cuda::std::span<int> block_histogram =
      block_histograms.subspan(blockIdx.x * histogram.size(), histogram.size());

  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  int bin = static_cast<int>(temperatures[cell] / bin_width);

  cuda::atomic_ref<int, cuda::thread_scope_block> 
    block_ref(block_histogram[bin]);
  block_ref.fetch_add(1);
  __syncthreads();

  if (threadIdx.x < num_bins) 
  {
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(histogram[threadIdx.x]);
    ref.fetch_add(block_histogram[threadIdx.x]);
  }
}

void histogram(cuda::std::span<float> temperatures,
               cuda::std::span<int> block_histograms,
               cuda::std::span<int> histogram, cudaStream_t stream) {
  int block_size = 256;
  int grid_size = cuda::ceil_div(temperatures.size(), block_size);
  histogram_kernel<<<grid_size, block_size, 0, stream>>>(
      temperatures, block_histograms, histogram);
}
```
    
</details>

```c++
%%writefile Sources/shmem.cu
#include "dli.cuh"

constexpr int num_bins = 10;
constexpr float bin_width = 10;

// 1. Remove `block_histograms` from kernel parameters
__global__ void histogram_kernel(cuda::std::span<float> temperatures,
                                 cuda::std::span<int> block_histograms,
                                 cuda::std::span<int> histogram) 
{
  // 2. Allocate `block_histogram` in shared memory and initialize it to 0
  cuda::std::span<int> block_histogram =
      block_histograms.subspan(blockIdx.x * histogram.size(), histogram.size());

  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  int bin = static_cast<int>(temperatures[cell] / bin_width);

  cuda::atomic_ref<int, cuda::thread_scope_block> 
    block_ref(block_histogram[bin]);
  block_ref.fetch_add(1);
  __syncthreads();

  if (threadIdx.x < num_bins) 
  {
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(histogram[threadIdx.x]);
    ref.fetch_add(block_histogram[threadIdx.x]);
  }
}

void histogram(cuda::std::span<float> temperatures,
               cuda::std::span<int> block_histograms,
               cuda::std::span<int> histogram, cudaStream_t stream) {
  int block_size = 256;
  int grid_size = cuda::ceil_div(temperatures.size(), block_size);
  histogram_kernel<<<grid_size, block_size, 0, stream>>>(
      temperatures, block_histograms, histogram);
}
```

```python
import Sources.dli
Sources.dli.run("Sources/shmem.cu")
```

If you’re unsure how to proceed, consider expanding the section below for guidance. Use the hint only after giving the problem a genuine attempt.

<details>
  <summary>Hints</summary>
  
  - You can allocate shared memory using the `__shared__` keyword
</details>

Open this section only after you’ve made a serious attempt at solving the problem. Once you’ve completed your solution, compare it with the reference provided here to evaluate your approach and identify any potential improvements.

<details>
  <summary>Solution</summary>

  Key points:

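  - Allocate a shared memory array
  - Zero-initialize the array and call `__syncthreads()` before any thread updates a bin
  - Call `__syncthreads()` again before contributing the block's counts to the global histogram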

  Solution:
  ```c++
  __shared__ int block_histogram[num_bins];

  if (threadIdx.x < num_bins) 
  {
    block_histogram[threadIdx.x] = 0;
  }
  __syncthreads();

  int cell = blockIdx.x * blockDim.x + threadIdx.x;
  int bin = static_cast<int>(temperatures[cell] / bin_width);

  cuda::atomic_ref<int, cuda::thread_scope_block> 
    block_ref(block_histogram[bin]);
  block_ref.fetch_add(1, cuda::memory_order_relaxed);
  __syncthreads();

  if (threadIdx.x < num_bins) 
  {
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(histogram[threadIdx.x]);
    ref.fetch_add(block_histogram[threadIdx.x], cuda::memory_order_relaxed);
  }
  ```

  You can find the full solution [here](Solutions/shmem.cu).
</details>

---
Congratulations!  Move on to the [next section](../03.06-Cooperative-Algorithms/03.06.01-Cooperative.ipynb).
