# Chapter 10: Reduction

The following demonstrates the basic parallel reduction. It uses 512 threads to sum 1024 floating-point values. The grid size is 1, that is, the whole reduction is done by a single thread block.
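The idea described above can be sketched as follows. This is a minimal illustration, not the contents of `Chapter-10/basic.cu`: the names `basicSumKernel` and `sumBasic` are hypothetical, the input length is assumed to be a power of two that fits in one block (at most 2048 values), and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one block of n/2 threads reduces n floats in place.
__global__ void basicSumKernel(float* data, float* result) {
    unsigned int i = 2 * threadIdx.x;  // each thread owns one even index
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        // Only threads whose index is a multiple of the stride stay active,
        // so the active threads are scattered across warps.
        if (threadIdx.x % stride == 0) {
            data[i] += data[i + stride];
        }
        __syncthreads();  // finish this level before starting the next
    }
    if (threadIdx.x == 0) {
        *result = data[0];
    }
}

// Hypothetical host wrapper. Assumes n is a power of two and n <= 2048.
float sumBasic(const float* h_in, unsigned int n) {
    float* d_data = nullptr;
    float* d_result = nullptr;
    float h_result = 0.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    // Grid size 1; for n = 1024 this launches 512 threads.
    basicSumKernel<<<1, n / 2>>>(d_data, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Calling `sumBasic` on an array of 1024 ones should return 1024. Note that the kernel overwrites the device copy of the input while reducing.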

In [1]:
!nvcc -arch sm_86 Chapter-10/basic.cu -o basic
!nsys profile --stats=true -o basic ./basic

         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.
Try the 'nsys status --environment' command to learn more.

Collecting data...
GPU value matches CPU value
Generating '/tmp/nsys-report-412f.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/basic.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)        Name     
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  --------------
     63.1      241,697,423         11  21,972,493.0  1,462,281.0    13,112  184,952,263  54,738,190.1  poll          
     36.4      139,529,740        564     247,

The approach above suffers from thread divergence at every reduction level. The next approach uses a convergent kernel to sum the same 1024 floating-point values with 512 threads. It also uses a grid of size 1 to perform the parallel reduction.
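A convergent kernel of this kind might look like the sketch below. It is an assumption about the general technique rather than a copy of `Chapter-10/convergent_reduction.cu`; `convergentSumKernel` and `sumConvergent` are hypothetical names, and the input length is again assumed to be a power of two no larger than 2048. The key change is that the stride starts at the block size and halves each level, so the active threads are always the contiguous range `0..stride-1`.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one block of n/2 threads reduces n floats in place.
__global__ void convergentSumKernel(float* data, float* result) {
    unsigned int i = threadIdx.x;
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        // Active threads form a contiguous prefix of the block, so whole
        // warps are either fully active or fully idle until stride < 32.
        if (threadIdx.x < stride) {
            data[i] += data[i + stride];
        }
        __syncthreads();  // finish this level before starting the next
    }
    if (threadIdx.x == 0) {
        *result = data[0];
    }
}

// Hypothetical host wrapper. Assumes n is a power of two and n <= 2048.
float sumConvergent(const float* h_in, unsigned int n) {
    float* d_data = nullptr;
    float* d_result = nullptr;
    float h_result = 0.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    convergentSumKernel<<<1, n / 2>>>(d_data, d_result);  // grid size 1
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Because adjacent threads also read adjacent addresses, this index pattern tends to give better-coalesced global memory accesses than the scattered pattern of the basic version.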

In [2]:
!nvcc -arch sm_86 Chapter-10/convergent_reduction.cu -o convergent
!nsys profile --stats=true -o convergent ./convergent

Collecting data...
GPU value matches CPU value
Generating '/tmp/nsys-report-aca8.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/convergent.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     62.8      223,646,562         10  22,364,656.2  4,210,397.0   252,661  166,497,182  51,431,797.5  poll                  
     36.5      13

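The following approach uses shared memory to compute the parallel reduction. The threads are convergent, as in the previous approach. The grid size is still 1, that is, the reduction is done by a single thread block. Note that in the run recorded below, the GPU result (2400) does not match the CPU result (2560).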
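One way to write a shared-memory reduction is sketched below. It is not the contents of `Chapter-10/shared_memory.cu`, and the names `sharedSumKernel`, `sumShared`, and `BLOCK_DIM` are hypothetical. The sketch assumes exactly `2 * BLOCK_DIM` input values: each thread adds two inputs from global memory into a shared array, and all later levels operate on shared memory only.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 512  // assumed block size; the input holds 2 * BLOCK_DIM floats

// Hypothetical kernel: the input is read once, partial sums live in shared memory.
__global__ void sharedSumKernel(const float* input, float* result) {
    __shared__ float partial[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    // First level: each thread combines two global values into shared memory.
    partial[t] = input[t] + input[t + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();  // make the previous level's writes visible
        if (t < stride) {
            partial[t] += partial[t + stride];
        }
    }
    if (t == 0) {
        *result = partial[0];
    }
}

// Hypothetical host wrapper. Assumes h_in holds exactly 2 * BLOCK_DIM floats.
float sumShared(const float* h_in) {
    const unsigned int n = 2 * BLOCK_DIM;
    float* d_in = nullptr;
    float* d_result = nullptr;
    float h_result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    sharedSumKernel<<<1, BLOCK_DIM>>>(d_in, d_result);  // grid size 1
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return h_result;
}
```

Placing `__syncthreads()` at the top of the loop body, outside the `if`, ensures every thread reaches the barrier and that each level sees the completed results of the level before it. Unlike the in-place versions, this sketch leaves the input array unmodified.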

In [3]:
!nvcc -arch sm_86 Chapter-10/shared_memory.cu -o shared
!nsys profile --stats=true -o shared ./shared

Collecting data...
CPU and GPU value do not match. CPU = 2560.000000, GPU = 2400.000000.
Generating '/tmp/nsys-report-a0ab.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/shared.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     63.0      225,025,156         10  22,502,515.6  4,217,977.0   201,767  168,351,016  51,995,328.9  po

The next approach uses shared memory and multiple thread blocks to compute the sum. It uses an atomic add to merge the partial results from the individual blocks.
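A multi-block version along these lines could be sketched as follows. This is an illustration under stated assumptions, not the contents of `Chapter-10/segmented_reduction.cu`: `segmentedSumKernel`, `sumSegmented`, and `BLOCK_DIM` are hypothetical names, and the input length is assumed to be a multiple of `2 * BLOCK_DIM`. Each block reduces its own segment in shared memory, then thread 0 of the block adds the block's partial sum to the global result with `atomicAdd`.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 256  // assumed block size; each block covers 2 * BLOCK_DIM floats

// Hypothetical kernel: every block reduces one segment, then merges atomically.
__global__ void segmentedSumKernel(const float* input, float* result) {
    __shared__ float partial[BLOCK_DIM];
    unsigned int segment = 2 * blockDim.x * blockIdx.x;  // start of this block's segment
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    partial[t] = input[i] + input[i + BLOCK_DIM];
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();  // make the previous level's writes visible
        if (t < stride) {
            partial[t] += partial[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(result, partial[0]);  // merge this block's partial sum
    }
}

// Hypothetical host wrapper. Assumes n is a nonzero multiple of 2 * BLOCK_DIM.
float sumSegmented(const float* h_in, unsigned int n) {
    float* d_in = nullptr;
    float* d_result = nullptr;
    float h_result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));  // the accumulator must start at zero
    segmentedSumKernel<<<n / (2 * BLOCK_DIM), BLOCK_DIM>>>(d_in, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return h_result;
}
```

With `BLOCK_DIM` set to 256, an input of 1024 values is split across two blocks. Because the order in which blocks reach `atomicAdd` is not fixed, the floating-point result can vary slightly between runs for general inputs, so a comparison against a CPU sum may need a tolerance.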

In [4]:
!nvcc -arch sm_86 Chapter-10/segmented_reduction.cu -o segmented
!nsys profile --stats=true -o segmented ./segmented

Collecting data...
GPU value matches CPU value
Generating '/tmp/nsys-report-ffb5.qdstrm'
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/spire-zk/PMPP_notebooks/segmented.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  ----------------------
     62.9      221,397,538         10  22,139,753.8  4,188,332.0   229,106  164,052,304  50,666,725.1  poll                  
     36.5      128