
Use conditional branching carefully

philippemilletresearch edited this page Jan 29, 2019 · 5 revisions

Guideline Information

| Item | Value |
| --- | --- |
| Guideline Number | 26 |
| Guideline Responsible (Name, Affiliation) | Boitumelo Ruf, Fraunhofer |
| Guideline Reviewer (Name, Affiliation) | Magnus Jahre, NTNU |
| Guideline Audience (Category) | Application developers |
| Guideline Expertise (Category) | Hardware designers, System architects |
| Guideline Keywords (Category) | Code optimization, GPU, FPGA |

Guideline advice

Conditional branching, such as if-then-else, is vital to most image processing applications, e.g. when finding the maximum similarity between pixels or handling image borders when filters exceed the dimensions of the image.

In terms of processing speed and performance overhead, conditional branching is cheap on CPUs and FPGAs. However, if the HLS tool cannot group branches, i.e. if branches are likely to diverge, conditional branching can result in a resource overhead when optimizing code for FPGAs.

When leveraging the processing power of GPUs through GPGPU (general-purpose computation on a GPU), conditional branching is to be used with caution. If branches diverge within a warp, i.e. if the condition evaluates to true for some threads and to false for others, the instructions are executed twice, resulting in a processing overhead [1][2].

Insights that led to the guideline

CPUs are designed for general-purpose processing and are equipped with optimization strategies such as branch prediction, which allow a fast response to conditional input. In order to achieve parallel processing on CPUs, the programmer instantiates different threads and processes which can run concurrently on the different processing cores. The CPU scheduler is free to pause certain threads in order to react to important interrupts and inputs; hence, it is not guaranteed that all threads will run synchronously. Furthermore, due to its flexibility, a CPU, unlike a GPU, is able to process only the branch for which the conditional directive resolved to true.

FPGAs can also cope well with conditional branching in terms of processing speed, as HLS will create a different path for each conditional branch. However, if the branches cannot be grouped efficiently, the use of many conditional branches leads to a resource overhead on FPGAs.

In order to achieve a high degree of parallelism and high data throughput, GPUs run numerous (>100) instances of a kernel on a large number of processing units. The key aspect of this processing model is that each instantiation of the kernel performs the same computation, but on a different subset of the data. The GPGPU programming model calls this paradigm Single-Instruction-Multiple-Threads (SIMT), which is similar to Single-Instruction-Multiple-Data (SIMD). SIMT processing requires that all threads within one warp (a group of threads running on one processor and sharing resources) run synchronously. This means that when a kernel contains conditional branching, all branches are evaluated and processed in order to keep the threads from diverging; in the end, the result of the branch whose conditional expression evaluated to true is selected. Hence, conditional branching with large bodies should not be used to save processing time, as all branches will be processed anyway. Furthermore, a divergence of the branches, which occurs when the condition evaluates to true for some threads of a warp and to false for others, results in a processing overhead, as the instructions are executed twice. See [1] and [2] for more information.
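The "both branches are evaluated" behaviour can be sketched on the host side: for small branch bodies, GPU compilers commonly replace the branch with a predicated select, so every thread executes the same instruction stream. The function names below are illustrative, not part of the guideline or any specific API.

```cpp
#include <cassert>

// Divergent form: threads of a warp that take different sides of the
// branch force both paths to be executed one after the other on a GPU.
int clamp_divergent(int v, int lo, int hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// Predicated form: both candidate results are available and a select
// picks one, so all threads follow a single, branch-free instruction stream.
int clamp_predicated(int v, int lo, int hi) {
    int r = v;
    r = (v < lo) ? lo : r;  // typically compiled to a conditional select, not a jump
    r = (v > hi) ? hi : r;
    return r;
}
```

Both functions are equivalent; the predicated variant merely trades a jump for the unconditional evaluation of both candidates, which is exactly what a warp does when its threads diverge.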

Recommended implementation method of the guideline along with a solid motivation for the recommendation

Avoid conditional branching with potentially divergent branches. Use multiple loops to perform different operations in different areas of the image.
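The "multiple loops" advice can be sketched with a 1-D smoothing filter: instead of testing for the image border on every pixel, handle the borders in dedicated statements and keep the interior loop branch-free. The function names and the three-tap filter are illustrative assumptions, not taken from the guideline.

```cpp
#include <cstddef>
#include <vector>

// Branch-heavy version: every iteration tests for the image border.
std::vector<float> smooth_branchy(const std::vector<float>& in) {
    std::size_t n = in.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        float left  = (i == 0)     ? in[i] : in[i - 1];  // border test on every pixel
        float right = (i == n - 1) ? in[i] : in[i + 1];
        out[i] = (left + in[i] + right) / 3.0f;
    }
    return out;
}

// Split version: explicit border handling plus a branch-free interior loop.
std::vector<float> smooth_split(const std::vector<float>& in) {
    std::size_t n = in.size();
    std::vector<float> out(n);
    if (n == 0) return out;
    if (n == 1) { out[0] = in[0]; return out; }
    out[0] = (in[0] + in[0] + in[1]) / 3.0f;                  // left border
    for (std::size_t i = 1; i + 1 < n; ++i)                   // interior: no branches
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    out[n - 1] = (in[n - 2] + in[n - 1] + in[n - 1]) / 3.0f;  // right border
    return out;
}
```

Both versions compute the same result; the split version keeps the hot interior loop free of per-pixel conditionals, which also helps HLS tools build a simpler pipeline.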

When accelerating code with GPGPUs, instantiate different kernels instead of using if-then-else statements for image areas that need specific processing.
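The idea of per-region kernels can be sketched on the CPU with one routine per image area, mirroring separate GPU kernel launches over disjoint index ranges. A single fused kernel would branch per pixel on "am I on the border?", which diverges; region-specific routines keep each launch uniform. The routines and the operations they perform are hypothetical examples.

```cpp
#include <cstddef>
#include <vector>

// "Kernel" for the interior region: uniform work, no per-pixel branch.
void darken_interior(std::vector<int>& img, std::size_t w, std::size_t h) {
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x)
            img[y * w + x] /= 2;
}

// "Kernel" for the border region, launched separately over border pixels only.
void zero_border(std::vector<int>& img, std::size_t w, std::size_t h) {
    for (std::size_t x = 0; x < w; ++x) {
        img[x] = 0;                   // top row
        img[(h - 1) * w + x] = 0;     // bottom row
    }
    for (std::size_t y = 0; y < h; ++y) {
        img[y * w] = 0;               // left column
        img[y * w + (w - 1)] = 0;     // right column
    }
}
```

On a GPU, each routine would become its own kernel launch; within each launch, all threads follow the same control flow, so no warp diverges.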

Instantiation of the recommended implementation method in the reference platform

This guideline applies to almost all accelerators, and particularly to GPGPUs and FPGAs. Accelerators are often based on long pipeline chains and can manage large chunks of data with less hardware involved than standard CPUs. This must be taken into account in particular during the development of the algorithm, as branches will cut the execution pipeline and will also affect the data to be served to the application and therefore its distribution in the system.

Evaluation of the guideline in reference applications

No evaluation was done as part of the Tulipp project, as the guideline is common practice when employing GPGPU. The authors of [3], however, performed a thorough evaluation of the effect of divergent threads. The Tulipp use cases followed this method in the development of their applications on GPGPU and FPGA.

References

[1] CUDA C Programming Guide
[2] Efficient control flow restructuring for GPUs
[3] Benchmarking the Cost of Thread Divergence in CUDA
Guideline Review

Related guidelines

none
