# **Analysis Report**

| Property                | Value                     |
|-------------------------|---------------------------|
| Duration                | 10.565 ms (10,565,388 ns) |
| Grid Size               | [ 14,20,1 ]               |
| Block Size              | [ 32,16,1 ]               |
| Registers/Thread        | 63                        |
| Shared Memory/Block     | 11.953 KiB                |
| Shared Memory Requested | 16 KiB                    |
| Shared Memory Executed  | 16 KiB                    |
| Shared Memory Bank Size | 4 B                       |

## [0] GeForce GT 730

| Property                       | Value                   |
|--------------------------------|-------------------------|
| Compute Capability             | 2.1                     |
| Max. Threads per Block         | 1024                    |
| Max. Shared Memory per Block   | 48 KiB                  |
| Max. Registers per Block       | 32768                   |
| Max. Grid Dimensions           | [ 65535, 65535, 65535 ] |
| Max. Block Dimensions          | [ 1024, 1024, 64 ]      |
| Max. Warps per Multiprocessor  | 48                      |
| Max. Blocks per Multiprocessor | 8                       |
| Number of Multiprocessors      | 2                       |
| Multiprocessor Clock Rate      | 1.4 GHz                 |
| Concurrent Kernel              | true                    |
| Max IPC                        | 4                       |
| Threads per Warp               | 32                      |
| Global Memory Bandwidth        | 22.4 GB/s               |
| Global Memory Size             | 2 GiB                   |
| Constant Memory Size           | 64 KiB                  |
| L2 Cache Size                  | 128 KiB                 |
| Memcpy Engines                 | 1                       |
| PCIe Generation                | 2                       |
| PCIe Link Rate                 | 5 Gbit/s                |
| PCIe Link Width                | 4                       |

# 1. Compute, Bandwidth, or Latency Bound

The first step in analyzing an individual kernel is to determine whether its performance is bound by computation, memory bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "horizontal\_diffusion\_gpu" is most likely limited by both compute and memory bandwidth. You should first examine the information in the "Compute Resources" section to determine how it is limiting performance.

## 1.1. Kernel Performance Is Bound By Compute And Memory Bandwidth

For device "GeForce GT 730", compute and memory utilization are balanced. These utilization levels indicate that kernel performance is good, but that additional performance improvement may be possible if either or both of the compute and memory utilization levels are increased.



# 2. Compute Resources

GPU compute resources limit the performance of a kernel when those resources are insufficient or poorly utilized. Compute resources are used most efficiently when instructions do not overuse a function unit. The results below indicate that compute performance may be limited by overuse of a function unit.

## 2.1. GPU Utilization Is Limited By Function Unit Usage

Different types of instructions are executed on different function units within each SM. Performance can be limited if a function unit is over-used by the instructions executed by the kernel. The following results show that the kernel's performance is potentially limited by overuse of the following function units: Arithmetic.

Load/Store - Load and store instructions for local, shared, global, constant, etc. memory.

Arithmetic - All arithmetic instructions including integer and floating-point add and multiply, logical and binary operations, etc.

Control-Flow - Direct and indirect branches, jumps, and calls.

Texture - Texture operations.



## 2.2. Instruction Execution Counts

The following chart shows the mix of instructions executed by the kernel. The instructions are grouped into classes and for each class the chart shows the percentage of thread execution cycles that were devoted to executing instructions in that class. The "Inactive" result shows the thread executions that did not execute any instruction because the thread was predicated or inactive due to divergence.



## 2.3. Floating-Point Operation Counts

The following chart shows the mix of floating-point operations executed by the kernel. The operations are grouped into classes and for each class the chart shows the percentage of thread execution cycles that were devoted to executing operations in that class. The results do not sum to 100% because non-floating-point operations executed by the kernel are not shown in this chart.



# 3. Memory Bandwidth

Memory bandwidth limits the performance of a kernel when one or more memories in the GPU cannot provide data at the rate requested by the kernel. The results below indicate that the kernel is limited by the bandwidth available to the device memory.

## 3.1. GPU Utilization Is Limited By Memory Bandwidth

The following table shows the memory bandwidth used by this kernel for the various types of memory on the device. The table also shows the utilization of each memory type relative to the maximum throughput supported by the memory. The results show that the kernel's performance is potentially limited by the bandwidth available from one or more of the memories on the device.

Optimization: Try the following optimizations for the memory with high bandwidth utilization.

L1/Shared Memory - If possible, use 64-bit accesses to shared memory and 8-byte bank mode to achieve 2x throughput. Resolve alignment and access pattern issues for global loads and stores.

L2 Cache - Align and block kernel data to maximize L2 cache efficiency.

Texture Cache - Reallocate texture cache data to shared or global memory.

Device Memory - Resolve alignment and access pattern issues for global loads and stores.

System Memory (via PCIe) - Make sure performance critical data is placed in device or shared memory.

| Memory                                                          | Transactions | Bandwidth    |
|-----------------------------------------------------------------|--------------|--------------|
| **L1/Shared Memory**                                            |              |              |
| Local Loads                                                     | 0            | 0 B/s        |
| Local Stores                                                    | 0            | 0 B/s        |
| Shared Loads                                                    | 66528        | 809.584 MB/s |
| Shared Stores                                                   | 83160        | 1.012 GB/s   |
| Global Loads                                                    | 3401372      | 41.392 GB/s  |
| Global Stores                                                   | 270204       | 2.001 GB/s   |
| Atomic                                                          | 0            | 0 B/s        |
| L1/Shared Total                                                 | 3821264      | 45.214 GB/s  |
| **L2 Cache**                                                    |              |              |
| L1 Reads                                                        | 3803312      | 11.571 GB/s  |
| L1 Writes                                                       | 657636       | 2.001 GB/s   |
| Texture Reads                                                   | 0            | 0 B/s        |
| Atomic                                                          | 0            | 0 B/s        |
| Total                                                           | 4460948      | 13.571 GB/s  |
| **Texture Cache**                                               |              |              |
| Reads                                                           | 0            | 0 B/s        |
| **Device Memory**                                               |              |              |
| Reads                                                           | 3665691      | 11.152 GB/s  |
| Writes                                                          | 655878       | 1.995 GB/s   |
| Total                                                           | 4321569      | 13.147 GB/s  |
| **System Memory** [ PCIe configuration: Gen2 x4, 5 Gbit/s ]     |              |              |
| Reads                                                           | 0            | 0 B/s        |
| Writes                                                          | 0            | 0 B/s        |

The Utilization column of the original report is a bar on an Idle / Low / Medium / High / Max scale for each memory type; the bar levels are not available in this text.
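Because the utilization bars are not readable here, a rough figure for device memory can be derived from numbers the report does state. The sketch below divides the measured device memory bandwidth by the "Global Memory Bandwidth" from the device table. The helper name `utilization` is illustrative, and the profiler may compute its own utilization scale differently, so treat the result as an estimate only.

```python
# Values copied from the report; the helper name is illustrative only.
PEAK_DEVICE_BANDWIDTH_GBPS = 22.4  # "Global Memory Bandwidth" in the device table


def utilization(measured_gbps: float, peak_gbps: float) -> float:
    """Return measured bandwidth as a fraction of the peak bandwidth."""
    if peak_gbps <= 0:
        raise ValueError("peak bandwidth must be positive")
    return measured_gbps / peak_gbps


device_reads_gbps = 11.152   # Device Memory, Reads
device_writes_gbps = 1.995   # Device Memory, Writes
device_total_gbps = device_reads_gbps + device_writes_gbps

device_utilization = utilization(device_total_gbps, PEAK_DEVICE_BANDWIDTH_GBPS)
print(f"Device memory: {device_total_gbps:.3f} GB/s, "
      f"{device_utilization:.1%} of peak")
# prints: Device memory: 13.147 GB/s, 58.7% of peak
```

The sum of reads and writes reproduces the 13.147 GB/s total listed in the table, which is a useful sanity check on the transcription.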

# 4. Instruction and Memory Latency

Instruction and memory latency limit the performance of a kernel when the GPU does not have enough work to keep busy. The performance of latency-limited kernels can often be improved by increasing occupancy. Occupancy is a measure of how many warps the kernel has active on the GPU, relative to the maximum number of warps supported by the GPU. Theoretical occupancy provides an upper bound while achieved occupancy indicates the kernel's actual occupancy. The results below indicate that occupancy can be improved by reducing the number of registers used by the kernel.
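As a small worked example of the occupancy definition above, the sketch below divides active warps by the device's maximum warps per SM. The warp counts (15.47 achieved, 16 theoretical) and the limit of 48 are taken from this report; the function name `occupancy` is only illustrative.

```python
MAX_WARPS_PER_SM = 48  # "Max. Warps per Multiprocessor" in the device table


def occupancy(active_warps: float, max_warps: int = MAX_WARPS_PER_SM) -> float:
    """Occupancy is the ratio of active warps to the maximum warps per SM."""
    return active_warps / max_warps


theoretical = occupancy(16)     # upper bound: 1 block of 16 warps per SM
achieved = occupancy(15.47)     # average active warps measured by the profiler

print(f"theoretical {theoretical:.1%}, achieved {achieved:.1%}")
# prints: theoretical 33.3%, achieved 32.2%
```

The achieved value sits close to the theoretical bound, which suggests that the bound itself, rather than scheduling inefficiency, is what keeps occupancy low.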

## 4.1. GPU Utilization Is Limited By Register Usage

The kernel uses 63 registers for each thread (32256 registers for each block). This register usage is likely preventing the kernel from fully utilizing the GPU. Device "GeForce GT 730" provides up to 32768 registers for each block. Because the kernel uses 32256 registers for each block, each SM is limited to simultaneously executing 1 block (16 warps). Chart "Varying Register Count" below shows how changing register usage will change the number of blocks that can execute on each SM.

Optimization: Use the -maxrregcount flag or the \_\_launch\_bounds\_\_ qualifier to decrease the number of registers used by each thread. This will increase the number of blocks that can execute on each SM. On devices with Compute Capability 5.2, turning global cache off can increase the occupancy limited by register usage.
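To get a feel for how far the register count would have to drop, the following sketch estimates the largest per-thread register count that still lets a given number of 512-thread blocks fit in one SM's register file. It assumes that the SM register file equals the 32768-register per-block limit reported above, and that registers are allocated per warp in units of 64, as on Compute Capability 2.x devices. The function name and parameters are illustrative, not part of any NVIDIA tool.

```python
def max_registers_per_thread(target_blocks: int,
                             warps_per_block: int = 16,      # 512 threads / 32
                             registers_per_sm: int = 32768,  # assumed SM register file
                             alloc_unit: int = 64,           # assumed per-warp granularity
                             warp_size: int = 32,
                             per_thread_cap: int = 63) -> int:
    """Largest registers/thread that lets target_blocks blocks reside on one SM."""
    if target_blocks < 1:
        raise ValueError("target_blocks must be at least 1")
    resident_warps = target_blocks * warps_per_block
    per_warp = registers_per_sm // resident_warps
    per_warp -= per_warp % alloc_unit          # round down to the allocation unit
    return min(per_warp // warp_size, per_thread_cap)


for blocks in (1, 2, 3):
    print(blocks, "block(s) per SM ->", max_registers_per_thread(blocks),
          "registers/thread")
# prints:
# 1 block(s) per SM -> 63 registers/thread
# 2 block(s) per SM -> 32 registers/thread
# 3 block(s) per SM -> 20 registers/thread
```

Under these assumptions, a cap of 32 registers per thread would be needed for two resident blocks. Two caveats apply: forcing the count down can make the compiler spill registers to local memory, and the occupancy table in this report also lists a shared-memory block limit of 1, so a lower register count alone may not raise the number of resident blocks.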

Grid Size: [ 14,20,1 ] (280 blocks), Block Size: [ 32,16,1 ] (512 threads)

| Variable             | Achieved | Theoretical | Device Limit |
|----------------------|----------|-------------|--------------|
| **Occupancy Per SM** |          |             |              |
| Active Blocks        |          | 1           | 8            |
| Active Warps         | 15.47    | 16          | 48           |
| Active Threads       |          | 512         | 1536         |
| Occupancy            | 32.2%    | 33.3%       | 100%         |
| **Warps**            |          |             |              |
| Threads/Block        |          | 512         | 1024         |
| Warps/Block          |          | 16          | 32           |
| Block Limit          |          | 3           | 8            |
| **Registers**        |          |             |              |
| Registers/Thread     |          | 63          | 63           |
| Registers/Block      |          | 32768       | 32768        |
| Block Limit          |          | 1           | 8            |
| **Shared Memory**    |          |             |              |
| Shared Memory/Block  |          | 12240       | 16384        |
| Block Limit          |          | 1           | 8            |
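The per-resource block limits in the table above can be reproduced with a short calculation. The sketch below is a simplified model, not the profiler's actual algorithm: it assumes registers are allocated per warp in units of 64 (which would explain why 63 registers per thread appear as 32768 registers per block), and it ignores any shared memory allocation granularity. All names are illustrative.

```python
import math


def block_limits(threads_per_block: int = 512,
                 regs_per_thread: int = 63,
                 smem_per_block: int = 12240,   # bytes, from the table
                 warp_size: int = 32,
                 max_warps_per_sm: int = 48,
                 max_blocks_per_sm: int = 8,
                 regs_per_sm: int = 32768,      # assumed SM register file
                 smem_per_sm: int = 16384,      # 16 KiB shared memory configuration
                 reg_alloc_unit: int = 64) -> dict:
    """Return the number of blocks per SM allowed by each resource."""
    warps_per_block = math.ceil(threads_per_block / warp_size)
    # Round each warp's register demand up to the assumed allocation unit.
    regs_per_warp = math.ceil(regs_per_thread * warp_size / reg_alloc_unit) * reg_alloc_unit
    regs_per_block = regs_per_warp * warps_per_block
    return {
        "warps": max_warps_per_sm // warps_per_block,
        "registers": (regs_per_sm // regs_per_block
                      if regs_per_block else max_blocks_per_sm),
        "shared_memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks_per_sm),
        "device": max_blocks_per_sm,
    }


limits = block_limits()
active_blocks = min(limits.values())
print(limits, "-> active blocks per SM:", active_blocks)
# prints: {'warps': 3, 'registers': 1, 'shared_memory': 1, 'device': 8} -> active blocks per SM: 1
```

The model returns the same block limits as the table (3, 1, 1, and 8), and it makes visible that registers and shared memory both cap the SM at one block with the current configuration.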

## 4.2. Occupancy Charts

The following charts show how varying different components of the kernel will impact theoretical occupancy.





[Chart: Varying Shared Memory Usage]

