Here we demonstrate how to use Thrust's "backend systems" which control how Thrust algorithms get mapped to and executed on the parallel processors available to the application. There are two basic ways to access Thrust's systems: by specifying the global "device" system associated with types like
thrust::device_vector, or by selecting a specific container associated with a particular system, such as
thrust::cuda::vector. These two approaches are complementary and may be used together within the same program.
Here, we demonstrate how to switch between the CUDA (default), OpenMP, TBB, and standard C++ "device" backend systems. This is a global setting which applies to all types associated with the device system. In the following we'll consider the
monte_carlo sample program, but any of the example programs would also do. Note that absolutely no changes to the source code are required to switch the device system.
First, download the source code for the
$ wget http://thrust.googlecode.com/hg/examples/monte_carlo.cu
Now let's time the program, which estimates pi by random sampling:
$ time ./monte_carlo pi is around 3.14164 real 0m0.222s user 0m0.120s sys 0m0.100s
We can switch to the OpenMP device system with the following compiler options (no changes to the source code!)
$ nvcc -O2 -o monte_carlo monte_carlo.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp
By default OpenMP runs one thread for each of the available cores, which is 4 on this particular system. Notice that the 'real' or wall-clock time is almost exactly one 1/4th the 'user' or CPU time, suggesting that
monte_carlo is completely compute-bound and scales well .
$ time ./monte_carlo pi is around 3.14163 real 0m2.090s user 0m8.333s sys 0m0.000s
We can override OpenMP's default behavior and instruct it to only use two threads using the
OMP_NUM_THREADS environment variable. Notice that the real time has doubled while the user time remains the same.
$ export OMP_NUM_THREADS=2 $ time ./monte_carlo pi is around 3.14163 real 0m4.168s user 0m8.333s sys 0m0.000s
When only a single thread is used the real and user times agree.
$ export OMP_NUM_THREADS=1 $ time ./monte_carlo pi is around 3.14164 real 0m8.333s user 0m8.333s sys 0m0.000s
We can switch to the TBB device system with the following compiler options
$ nvcc -O2 -o monte_carlo monte_carlo.cu -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB -ltbb $ time ./monte_carlo pi is around 3.14 real 0m1.216s user 0m9.425s sys 0m0.040s
Because both the OpenMP and TBB systems use similar algorithm implementations to utilize the CPU, their timings are similar.
When using either the OpenMP or TBB systems,
nvcc isn't required. In general,
nvcc is only required when targeting Thrust at CUDA. For example, we could compile the previous code directly with
g++ with this command line:
$ g++ -O2 -o monte_carlo monte_carlo.cpp -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp -I<path-to-thrust-headers>
Note that we've copied
monte_carlo.cpp so that
g++ recognizes that it's a c++ source file. The
-fopenmp command line argument instructs
g++ to enable OpenMP directives. Without this option, the compilation will fail. The
-lgomp command line argument instructs
g++ to link against the OpenMP library. Without this option, linking will fail.
If necessary, we can explicitly select the CUDA backend like so:
$ nvcc -O2 -o monte_carlo monte_carlo.cu -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA