# Chapter 9: Parallel Histogram
This chapter discusses techniques for computing a histogram in parallel. The following cell profiles the basic implementation of the parallel histogram.
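
The source of `Chapter-9/basic.cu` is not reproduced in this notebook, so the following is only a minimal sketch of what a basic histogram kernel typically looks like. It assumes byte-valued input with one bin per byte value (`NUM_BINS = 256`); the names `histo_basic` and `histogram_basic` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: one bin per possible byte value

// Each thread handles one input element and updates the histogram
// in global memory. The update must be atomic because many threads
// may increment the same bin at the same time.
__global__ void histo_basic(const unsigned char* data, unsigned int length,
                            unsigned int* histo) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        atomicAdd(&histo[data[i]], 1u);
    }
}

// Hypothetical host wrapper: copies the input to the GPU, launches the
// kernel, and returns the histogram. CUDA error checking is omitted.
std::vector<unsigned int> histogram_basic(const std::vector<unsigned char>& input) {
    const unsigned int n = (unsigned int)input.size();
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;  // one thread per element
    histo_basic<<<blocks, threads>>>(d_data, n, d_histo);

    cudaMemcpy(result.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_histo);
    return result;
}
```

In this sketch every increment is an atomic operation on global memory, so threads that hit the same bin are serialized at global-memory latency.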

In [2]:
!nvcc -arch sm_86 Chapter-9/basic.cu -o basic
!nsys profile --stats=true -o basic ./basic

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU values.
Generating '/tmp/nsys-report-8a3d.qdstrm'
Failed to create '/home/spire-zk/PMPP_notebooks/basic.nsys-rep': File exists.
Use `--force-overwrite true` to overwrite existing files.
Failed to create '/home/spire-zk/PMPP_notebooks/basic.sqlite': File exists.
Use `--force-overwrite true` to overwrite existing files.
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /tmp/nsys-report-6c44.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  -----

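The basic implementation can be improved by giving each thread block its own private copy of the histogram in shared memory and merging the private copies into the global histogram at the end. This approach is called `privatization`.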
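
Since `Chapter-9/shared_memory.cu` is not shown here, the following is a hedged sketch of a privatized kernel rather than the notebook's actual code. It assumes byte-valued input with one bin per byte value (`NUM_BINS = 256`); `histo_private` and `histogram_private` are hypothetical names, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: one bin per possible byte value

// Each block accumulates into a private histogram in shared memory,
// then merges it into the global histogram once at the end.
__global__ void histo_private(const unsigned char* data, unsigned int length,
                              unsigned int* histo) {
    __shared__ unsigned int histo_s[NUM_BINS];

    // Cooperatively zero the private copy.
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        histo_s[b] = 0;
    }
    __syncthreads();

    // Atomic updates now contend only within the block, in shared memory.
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length) {
        atomicAdd(&histo_s[data[i]], 1u);
    }
    __syncthreads();

    // Merge: one global atomic per non-empty bin per block.
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int count = histo_s[b];
        if (count > 0) {
            atomicAdd(&histo[b], count);
        }
    }
}

// Hypothetical host wrapper; CUDA error checking is omitted.
std::vector<unsigned int> histogram_private(const std::vector<unsigned char>& input) {
    const unsigned int n = (unsigned int)input.size();
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    histo_private<<<blocks, threads>>>(d_data, n, d_histo);

    cudaMemcpy(result.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_histo);
    return result;
}
```

The design trade-off in this sketch is that the cost of zeroing and merging the private copy is paid once per block, in exchange for moving the per-element atomics to shared memory.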

In [3]:
!nvcc -arch sm_86 Chapter-9/shared_memory.cu -o shared
!nsys profile --stats=true -o shared ./shared

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU values.
Generating '/tmp/nsys-report-17b6.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/shared.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     63.3      226,172,849         10  22,617,284.9  4,119,102.0   260,258  168,978,225  52,213,296.8  poll                  
     36.1      128,

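The shared-memory version can be further improved with thread coarsening: each thread processes several input elements, so fewer blocks are launched and fewer private histograms have to be merged into the global one. The following cell profiles the thread-coarsened version.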
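
As `Chapter-9/coarsening.cu` is not included in this notebook, the sketch below only illustrates one common form of coarsening, in which each thread handles a contiguous chunk of `CFACTOR` elements. The coarsening factor of 4, the 256-bin byte input, and the names `histo_coarsened` and `histogram_coarsened` are all assumptions; error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: one bin per possible byte value
#define CFACTOR 4     // assumption: elements processed per thread

// Privatized histogram with contiguous thread coarsening:
// thread t processes elements [t*CFACTOR, t*CFACTOR + CFACTOR).
__global__ void histo_coarsened(const unsigned char* data, unsigned int length,
                                unsigned int* histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        histo_s[b] = 0;
    }
    __syncthreads();

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int start = tid * CFACTOR;
    for (unsigned int k = 0; k < CFACTOR; ++k) {
        unsigned int i = start + k;
        if (i < length) {
            atomicAdd(&histo_s[data[i]], 1u);
        }
    }
    __syncthreads();

    // Fewer blocks are launched, so fewer merges reach global memory.
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int count = histo_s[b];
        if (count > 0) {
            atomicAdd(&histo[b], count);
        }
    }
}

// Hypothetical host wrapper; CUDA error checking is omitted.
std::vector<unsigned int> histogram_coarsened(const std::vector<unsigned char>& input) {
    const unsigned int n = (unsigned int)input.size();
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int per_block = threads * CFACTOR;  // elements covered by one block
    const unsigned int blocks = (n + per_block - 1) / per_block;
    histo_coarsened<<<blocks, threads>>>(d_data, n, d_histo);

    cudaMemcpy(result.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_histo);
    return result;
}
```

With this contiguous partitioning, neighboring threads read addresses that are `CFACTOR` elements apart in each iteration.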

In [4]:
!nvcc -arch sm_86 Chapter-9/coarsening.cu -o coarsening
!nsys profile --stats=true -o coarsening ./coarsening

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU values.
Generating '/tmp/nsys-report-e162.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/coarsening.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     63.0      222,937,969         10  22,293,796.9  4,074,168.5   260,860  165,933,624  51,273,680.3  poll                  
     36.3      

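However, the memory accesses in the thread-coarsened approach above are not coalesced. To achieve memory coalescing, each thread can instead iterate over the input with a stride of `blockDim.x * gridDim.x`, so that consecutive threads read consecutive elements in every iteration.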
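
The strided iteration can be sketched as follows. This is not the content of `Chapter-9/coalescing.cu`, which is not shown here; it assumes byte-valued input with 256 bins and a launch sized so that each thread performs roughly `CFACTOR` iterations. The names `histo_interleaved` and `histogram_interleaved` are hypothetical, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: one bin per possible byte value
#define CFACTOR 4     // assumption: approximate iterations per thread

// Privatized histogram with interleaved partitioning: in every iteration,
// consecutive threads read consecutive input elements.
__global__ void histo_interleaved(const unsigned char* data, unsigned int length,
                                  unsigned int* histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        histo_s[b] = 0;
    }
    __syncthreads();

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;  // total threads in the grid
    for (unsigned int i = tid; i < length; i += stride) {
        atomicAdd(&histo_s[data[i]], 1u);
    }
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int count = histo_s[b];
        if (count > 0) {
            atomicAdd(&histo[b], count);
        }
    }
}

// Hypothetical host wrapper; CUDA error checking is omitted.
std::vector<unsigned int> histogram_interleaved(const std::vector<unsigned char>& input) {
    const unsigned int n = (unsigned int)input.size();
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int per_block = threads * CFACTOR;
    const unsigned int blocks = (n + per_block - 1) / per_block;
    histo_interleaved<<<blocks, threads>>>(d_data, n, d_histo);

    cudaMemcpy(result.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_histo);
    return result;
}
```

Because the loop bound is `length` rather than a fixed iteration count, the kernel in this sketch stays correct for any grid size; the grid size only affects how many elements each thread processes.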

In [5]:
!nvcc -arch sm_86 Chapter-9/coalescing.cu -o coalescing
!nsys profile --stats=true -o coalescing ./coalescing

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU values.
Generating '/tmp/nsys-report-a299.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/coalescing.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     63.1      223,302,542         10  22,330,254.2  4,179,081.5   255,361  166,172,299  51,336,832.2  poll                  
     36.3      

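If the input contains many repeated values, so that a thread often updates the same bin several times in a row, the kernel can benefit from aggregation: each thread accumulates consecutive updates to the same bin locally and issues a single atomic add for the whole run. The following cell profiles the aggregated version.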
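
A minimal sketch of aggregation on top of a privatized, strided kernel is given below. `Chapter-9/aggregation.cu` is not reproduced in this notebook, so this is an illustration under assumptions: byte-valued input with 256 bins, a coarsening factor of 4, and the hypothetical names `histo_aggregated` and `histogram_aggregated`. Error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumption: one bin per possible byte value
#define CFACTOR 4     // assumption: approximate iterations per thread

// Each thread keeps a running count for the bin it saw last and only
// issues an atomic add when the bin changes or the loop ends.
__global__ void histo_aggregated(const unsigned char* data, unsigned int length,
                                 unsigned int* histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        histo_s[b] = 0;
    }
    __syncthreads();

    unsigned int accumulator = 0;  // pending updates for prev_bin
    int prev_bin = -1;             // -1 means no bin seen yet
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = tid; i < length; i += stride) {
        int bin = data[i];
        if (bin == prev_bin) {
            ++accumulator;  // same bin as before: aggregate locally
        } else {
            if (accumulator > 0) {
                atomicAdd(&histo_s[prev_bin], accumulator);  // flush the finished run
            }
            accumulator = 1;
            prev_bin = bin;
        }
    }
    if (accumulator > 0) {
        atomicAdd(&histo_s[prev_bin], accumulator);  // flush the last run
    }
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int count = histo_s[b];
        if (count > 0) {
            atomicAdd(&histo[b], count);
        }
    }
}

// Hypothetical host wrapper; CUDA error checking is omitted.
std::vector<unsigned int> histogram_aggregated(const std::vector<unsigned char>& input) {
    const unsigned int n = (unsigned int)input.size();
    std::vector<unsigned int> result(NUM_BINS, 0);
    if (n == 0) return result;

    unsigned char* d_data = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int per_block = threads * CFACTOR;
    const unsigned int blocks = (n + per_block - 1) / per_block;
    histo_aggregated<<<blocks, threads>>>(d_data, n, d_histo);

    cudaMemcpy(result.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_histo);
    return result;
}
```

The extra branch and bookkeeping in this sketch add work per element, so aggregation is only likely to pay off when the data really does contain long runs that map to the same bin.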

In [6]:
!nvcc -arch sm_86 Chapter-9/aggregation.cu -o aggregation
!nsys profile --stats=true -o aggregation ./aggregation

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU values.
Generating '/tmp/nsys-report-7f86.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/aggregation.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     62.3      227,306,043         10  22,730,604.3  4,304,633.0   276,248  169,704,312  52,402,980.2  poll                  
     37.0     