# Compilation

In [None]:
!make

nvcc -arch=sm_75 -o histogram histogram.cu


# Testing Optimizations Against CPU

## For Uniform Distribution

In [None]:
!echo "Kernel 0"
!./histogram 102400000 0 0 1 0 # Kernel 0
!echo "Kernel 1"
!./histogram 102400000 0 0 1 1 # Kernel 1
!echo "Kernel 2"
!./histogram 102400000 0 0 1 2 # Kernel 2

Kernel 0
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 204.821110 ms
GPU h-to-d time: 87.765217 ms
GPU kernel time: 12.290955 ms
GPU t-to-h time: 0.030041 ms
GPU transfer and kernel time: 100.086212 ms
Kernel 1
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 378.595829 ms
GPU h-to-d time: 94.969988 ms
GPU kernel time: 153.832912 ms
GPU t-to-h time: 0.042915 ms
GPU transfer and kernel time: 248.845816 ms
Kernel 2
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 204.872131 ms
GPU h-to-d time: 86.741924 ms
GPU kernel time: 8.389950 ms
GPU t-to-h time: 0.036001 ms
GPU transfer and kernel time: 95.167875 ms


## For Normal Distribution

In [None]:
!echo "Kernel 0"
!./histogram 102400000 1 0 1 0 # Kernel 0
!echo "Kernel 1"
!./histogram 102400000 1 0 1 1 # Kernel 1
!echo "Kernel 2"
!./histogram 102400000 1 0 1 2 # Kernel 2

Kernel 0
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 206.622839 ms
GPU h-to-d time: 104.830980 ms
GPU kernel time: 13.186216 ms
GPU t-to-h time: 0.039816 ms
GPU transfer and kernel time: 118.057013 ms
Kernel 1
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 376.600981 ms
GPU h-to-d time: 92.118025 ms
GPU kernel time: 138.606071 ms
GPU t-to-h time: 0.038862 ms
GPU transfer and kernel time: 230.762959 ms
Kernel 2
The input length is 102400000
All results are same between GPU and CPU! :) 
CPU Time: 383.775949 ms
GPU h-to-d time: 93.061924 ms
GPU kernel time: 8.436918 ms
GPU t-to-h time: 0.036001 ms
GPU transfer and kernel time: 101.534843 ms
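
To put the timings in perspective, the short sketch below recomputes two ratios from the Kernel 2 uniform-distribution run above: how much faster the GPU is than the CPU, and how much of the GPU time is spent on the host-to-device copy. The numbers are copied from the output; the helper names `speedup` and `share` are illustrative and not part of the project. For this run the kernel alone is roughly 24 times faster than the CPU, but the end-to-end speedup is only about 2 times because the host-to-device transfer accounts for about 91% of the GPU time.

```python
# Timings in milliseconds, copied from the Kernel 2 uniform-distribution run.
cpu_ms = 204.872131
h_to_d_ms = 86.741924
kernel_ms = 8.389950
total_gpu_ms = 95.167875


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def share(part_ms, whole_ms):
    """Return the fraction of the whole that is spent in one part."""
    return part_ms / whole_ms


print(f"CPU vs. GPU kernel only:  {speedup(cpu_ms, kernel_ms):.1f}x")
print(f"CPU vs. GPU end to end:   {speedup(cpu_ms, total_gpu_ms):.1f}x")
print(f"h-to-d share of GPU time: {share(h_to_d_ms, total_gpu_ms):.0%}")
```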


# More Metrics to Determine Which Kernel to Use
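In the timings above, Kernel 2 has the shortest GPU kernel time for both distributions (about 8.4 ms, compared with 12 to 13 ms for Kernel 0 and 139 to 154 ms for Kernel 1). We now profile each kernel with `ncu` to confirm that Kernel 2 is the best choice among our current implementations.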
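
The profiling commands in the cells below all follow one pattern, which can be generated rather than typed by hand. The following sketch builds those command lines. The argument meanings are assumptions inferred from the commands and outputs in this notebook (input length, distribution with 0 for uniform and 1 for normal, a save-histogram flag, a compare-with-CPU flag, kernel id); the function name `histogram_cmd` and its parameter names are hypothetical.

```python
# Assumed encoding of the second argument, inferred from the section headings.
DISTRIBUTIONS = {"uniform": 0, "normal": 1}


def histogram_cmd(n, distribution, kernel, save=False, compare_cpu=False,
                  profile=False):
    """Build the argument list for one ./histogram run.

    The positional order (n, distribution, save, compare_cpu, kernel) is an
    assumption based on the commands shown in this notebook.
    """
    args = [
        "./histogram",
        str(n),
        str(DISTRIBUTIONS[distribution]),
        str(int(save)),
        str(int(compare_cpu)),
        str(kernel),
    ]
    # Prefix with ncu when the run should be profiled.
    return ["ncu"] + args if profile else args


# Print the six profiling commands used in the sections below.
for dist in DISTRIBUTIONS:
    for kernel in (0, 1, 2):
        print(" ".join(histogram_cmd(102400000, dist, kernel, profile=True)))
```

Each list can be passed directly to `subprocess.run` if the commands should be executed from a script instead of a notebook cell.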

## For Uniform Distribution

### Kernel 0: No Shared Memory, Only Atomics

In [None]:
!ncu ./histogram 102400000 0 0 0 0

The input length is 102400000
==PROF== Connected to process 870 (/content/histogram)
==PROF== Profiling "histogram_kernel_v0" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 114.427090 ms
GPU kernel time: 807.105064 ms
GPU t-to-h time: 0.039101 ms
GPU transfer and kernel time: 921.571255 ms
==PROF== Disconnected from process 870
[870] histogram@127.0.0.1
  histogram_kernel_v0(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         5.00
    SM Frequency                    Mhz       585.00
    Elapsed Cycles                cycle    7,063,191
    Memory Throughput                 %        46.49
    DRA

### Kernel 1: Shared Memory + Atomics, 1 Thread Does More Work
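
The kernel source is not reproduced in this notebook, so the following is only a plain-Python model of the general idea that "Shared Memory + Atomics" usually refers to: each thread block accumulates into its own private histogram and merges it into the global histogram once at the end, instead of every element updating the global bins directly. `NUM_BINS` and `BLOCK_SIZE` are hypothetical values chosen for illustration, and the loops stand in for what threads would do in parallel with atomic adds.

```python
NUM_BINS = 8    # hypothetical bin count, for illustration only
BLOCK_SIZE = 4  # hypothetical number of elements handled per block


def histogram_global(values):
    """Model of the global-only approach: every element updates global bins."""
    bins = [0] * NUM_BINS
    for v in values:
        bins[v] += 1  # on the GPU this would be an atomic add to global memory
    return bins


def histogram_privatized(values):
    """Model of the privatized approach: per-block bins, merged at the end."""
    global_bins = [0] * NUM_BINS
    for start in range(0, len(values), BLOCK_SIZE):
        private = [0] * NUM_BINS  # stands in for a block's shared-memory bins
        for v in values[start:start + BLOCK_SIZE]:
            private[v] += 1       # atomic add to shared memory on the GPU
        for b, count in enumerate(private):
            global_bins[b] += count  # one global update per bin per block
    return global_bins


data = [0, 3, 3, 7, 1, 3, 0, 7, 7, 2]
print(histogram_global(data))
print(histogram_privatized(data))
```

Both functions return the same counts; the difference on a GPU is how many updates contend for global memory.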

In [None]:
!ncu ./histogram 102400000 0 0 0 1

The input length is 102400000
==PROF== Connected to process 968 (/content/histogram)
==PROF== Profiling "histogram_kernel_v1" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 108.314037 ms
GPU kernel time: 1926.940918 ms
GPU t-to-h time: 0.058889 ms
GPU transfer and kernel time: 2035.313845 ms
==PROF== Disconnected from process 968
[968] histogram@127.0.0.1
  histogram_kernel_v1(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- -------------
    Metric Name             Metric Unit  Metric Value
    ----------------------- ----------- -------------
    DRAM Frequency                  Ghz          5.00
    SM Frequency                    Mhz        585.00
    Elapsed Cycles                cycle    89,799,647
    Memory Throughput                 %         35.4

### Kernel 2: Shared Memory + Atomics, Threads Do Similar Amount of Work

In [None]:
!ncu ./histogram 102400000 0 0 0 2

The input length is 102400000
==PROF== Connected to process 1068 (/content/histogram)
==PROF== Profiling "histogram_kernel_v2" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 112.285852 ms
GPU kernel time: 640.719891 ms
GPU t-to-h time: 0.040054 ms
GPU transfer and kernel time: 753.045797 ms
==PROF== Disconnected from process 1068
[1068] histogram@127.0.0.1
  histogram_kernel_v2(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         5.00
    SM Frequency                    Mhz       585.00
    Elapsed Cycles                cycle    4,778,803
    Memory Throughput                 %        50.20
    

## For Normal Distribution

### Kernel 0: No Shared Memory, Only Atomics

In [None]:
!ncu ./histogram 102400000 1 0 0 0

The input length is 102400000
==PROF== Connected to process 1162 (/content/histogram)
==PROF== Profiling "histogram_kernel_v0" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 108.165979 ms
GPU kernel time: 559.966087 ms
GPU t-to-h time: 0.038862 ms
GPU transfer and kernel time: 668.170929 ms
==PROF== Disconnected from process 1162
[1162] histogram@127.0.0.1
  histogram_kernel_v0(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         4.99
    SM Frequency                    Mhz       585.00
    Elapsed Cycles                cycle    7,575,156
    Memory Throughput                 %        46.08
    

### Kernel 1: Shared Memory + Atomics, 1 Thread Does More Work

In [None]:
!ncu ./histogram 102400000 1 0 0 1

The input length is 102400000
==PROF== Connected to process 1292 (/content/histogram)
==PROF== Profiling "histogram_kernel_v1" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 114.621878 ms
GPU kernel time: 1886.146069 ms
GPU t-to-h time: 0.036955 ms
GPU transfer and kernel time: 2000.804901 ms
==PROF== Disconnected from process 1292
[1292] histogram@127.0.0.1
  histogram_kernel_v1(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- -------------
    Metric Name             Metric Unit  Metric Value
    ----------------------- ----------- -------------
    DRAM Frequency                  Ghz          5.00
    SM Frequency                    Mhz        585.00
    Elapsed Cycles                cycle    89,796,792
    Memory Throughput                 %         3

### Kernel 2: Shared Memory + Atomics, Threads Do Similar Amount of Work

In [None]:
!ncu ./histogram 102400000 1 0 0 2

The input length is 102400000
==PROF== Connected to process 1412 (/content/histogram)
==PROF== Profiling "histogram_kernel_v2" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 97.759008 ms
GPU kernel time: 532.022953 ms
GPU t-to-h time: 0.044107 ms
GPU transfer and kernel time: 629.826069 ms
==PROF== Disconnected from process 1412
[1412] histogram@127.0.0.1
  histogram_kernel_v2(unsigned int *, unsigned int *, unsigned int, unsigned int) (100000, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         5.00
    SM Frequency                    Mhz       585.00
    Elapsed Cycles                cycle    4,777,769
    Memory Throughput                 %        50.26
    D

# Histogram Graphs
We compare the GPU and CPU histograms for the specified input sizes. `generate_results.py` runs the histogram binary for each size and writes the results to `*.txt` files (included in the submitted code), which can then be plotted with `python3 plot_histogram.py`.

## Generating the Results

In [None]:
!python3 generate_results.py

Generating Graphs for UNIFORM distribution
Current N=1024
Running: ./histogram 1024 0 1 1 2
The input length is 1024
All results are same between GPU and CPU! :) 
Saved Histogram!
CPU Time: 0.004053 ms
GPU h-to-d time: 1.283884 ms
GPU kernel time: 0.213861 ms
GPU t-to-h time: 0.038147 ms
GPU transfer and kernel time: 1.535892 ms

-----------------------------------
Current N=10240
Running: ./histogram 10240 0 1 1 2
The input length is 10240
All results are same between GPU and CPU! :) 
Saved Histogram!
CPU Time: 0.041008 ms
GPU h-to-d time: 0.054121 ms
GPU kernel time: 0.197887 ms
GPU t-to-h time: 0.028133 ms
GPU transfer and kernel time: 0.280142 ms

-----------------------------------
Current N=102400
Running: ./histogram 102400 0 1 1 2
The input length is 102400
All results are same between GPU and CPU! :) 
Saved Histogram!
CPU Time: 0.425100 ms
GPU h-to-d time: 0.141144 ms
GPU kernel time: 0.197887 ms
GPU t-to-h time: 0.029087 ms
GPU transfer and kernel time: 0.368118 ms

---------
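
The log above can be summarized by parsing its timing lines. The sketch below extracts the CPU time and the GPU end-to-end time for each input size and reports the first size at which the GPU is faster. It uses only the three sizes visible in the output above (the rest of the log is cut off); the function names `parse_runs` and `first_gpu_win` are hypothetical and not part of `generate_results.py`.

```python
import re

# Timing lines copied from the three runs shown above.
SAMPLE_LOG = """\
Current N=1024
CPU Time: 0.004053 ms
GPU transfer and kernel time: 1.535892 ms
Current N=10240
CPU Time: 0.041008 ms
GPU transfer and kernel time: 0.280142 ms
Current N=102400
CPU Time: 0.425100 ms
GPU transfer and kernel time: 0.368118 ms
"""

SIZE_RE = re.compile(r"Current N=(\d+)")
TIME_RE = re.compile(r"(CPU Time|GPU transfer and kernel time): ([\d.]+) ms")


def parse_runs(log):
    """Return one dict per run with its size and the two timings in ms."""
    runs = []
    for line in log.splitlines():
        size = SIZE_RE.match(line)
        if size:
            runs.append({"n": int(size.group(1))})
            continue
        timing = TIME_RE.match(line)
        if timing and runs:
            runs[-1][timing.group(1)] = float(timing.group(2))
    return runs


def first_gpu_win(runs):
    """Return the first size where GPU end-to-end time beats the CPU time."""
    for run in runs:
        if run["GPU transfer and kernel time"] < run["CPU Time"]:
            return run["n"]
    return None


print(first_gpu_win(parse_runs(SAMPLE_LOG)))
```

Among the three sizes shown, the GPU (including transfers) first overtakes the CPU at N=102400; for the two smaller inputs the fixed transfer and launch costs dominate.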

# Shared Memory and Occupancy for N=1024
We profile Kernel 2 with N=1024 for both the uniform and the normal distribution.
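
The launch configurations reported by `ncu` in this notebook are consistent with one input element per thread and 1024 threads per block: N=102400000 gives a grid of 100000 blocks, and N=1024 gives a single block. The sketch below reproduces that arithmetic; the one-element-per-thread mapping is an assumption inferred from the reported grid sizes, and `blocks_needed` is a hypothetical helper name.

```python
THREADS_PER_BLOCK = 1024  # block size shown in the ncu launch configurations


def blocks_needed(n, threads_per_block=THREADS_PER_BLOCK):
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


for n in (1024, 102400000):
    print(f"N={n}: grid of {blocks_needed(n)} block(s)")
```

Since a thread block executes on a single SM, a one-block launch can keep at most one SM busy, which is worth keeping in mind when reading the throughput and occupancy figures for N=1024.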

In [None]:
!ncu ./histogram 1024 0 0 0 2

The input length is 1024
==PROF== Connected to process 1847 (/content/histogram)
==PROF== Profiling "histogram_kernel_v2" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 1.667023 ms
GPU kernel time: 560.741901 ms
GPU t-to-h time: 0.031233 ms
GPU transfer and kernel time: 562.440157 ms
==PROF== Disconnected from process 1847
[1847] histogram@127.0.0.1
  histogram_kernel_v2(unsigned int *, unsigned int *, unsigned int, unsigned int) (1, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         4.84
    SM Frequency                    Mhz       575.76
    Elapsed Cycles                cycle        3,354
    Memory Throughput                 %         1.79
    DRAM Through

In [None]:
!ncu ./histogram 1024 1 0 0 2

The input length is 1024
==PROF== Connected to process 1911 (/content/histogram)
==PROF== Profiling "histogram_kernel_v2" - 0: 0%....50%....100% - 9 passes
==PROF== Profiling "convert_kernel" - 1: 0%....50%....100% - 9 passes
GPU h-to-d time: 0.061989 ms
GPU kernel time: 406.075954 ms
GPU t-to-h time: 0.036001 ms
GPU transfer and kernel time: 406.173944 ms
==PROF== Disconnected from process 1911
[1911] histogram@127.0.0.1
  histogram_kernel_v2(unsigned int *, unsigned int *, unsigned int, unsigned int) (1, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         4.88
    SM Frequency                    Mhz       580.94
    Elapsed Cycles                cycle        3,384
    Memory Throughput                 %         1.79
    DRAM Through