Skip to content

The single-core kernel of GUPS (aka RandomAccess) from the HPC Challenge benchmark

License

Notifications You must be signed in to change notification settings

technion-csl/gups

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TL;DR

The single-core kernel of GUPS (aka RandomAccess) from the HPC Challenge benchmark suite.

How to build and run

Simply invoke make to build the gups executable, which runs the multi-threaded version (implemented with OpenMP).

To run GUPS on a 1GB array (== 2^27 cells of unsigned 64-bit integers):

$ OMP_NUM_THREADS=1 ./gups --log2_length 27

To run GUPS with four threads on a 2GB array:

$ OMP_NUM_THREADS=3 ./gups --log2_length 28

Motivation

According to its website, the HPC Challenge (HPCC) suite measures a range memory access patterns and consists of seven benchmarks: HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and Communication bandwidth and latency. Strangely, the suite links all these benchmarks into a single, monolithic executable that runs them one by one. This repo re-implements one of the seven benchmarks -- RandomAccess -- so it can be run alone. Additionally, this repo cleans the original code and remove MPI dependencies. The original code can be cloned from the official github repo: https://github.com/icl-utk-edu/hpcc . My code mostly borrows from the file RandomAccess/core_single_cpu.c in this repo. I therefore included the original copyright notice.

Summary of changes

The changes with respect to the original code are:

  1. Specify the array size through a command-line argument.
  2. Specify the number of iterations through a command-line argument. A single iteration goes over the array 4x (like in the original HPCC implementation) without running the initialization phase (array[i] = i). The default number of iterations is set to one, compatible to the HPCC implementation.
  3. Scrape off the MPI related functions.
  4. Migrate the code to C++ and build it with g++.
  5. Introduce a new-command line flag, "--verify", that enables the final verification step. Verification is disabled by default because the comments in core_single_cpu.c say it is optional.

Validation against the original HPCC benchmark

make compare validates the performance of this new implementation against the reference HPCC implementation. My tests on my Intel i7-6600U CPU (Skylake) produced the following performance numbers (giga updates per second):

ours reference
repeat1 0.0295 0.0297
repeat2 0.0305 0.0321
repeat3 0.0304 0.0299
repeat4 0.0311 0.0301
repeat5 0.0308 0.0302
----------- ----------- -----------
average 0.0305 0.0304
std_dev 0.0006 0.0010

In conclusion: the new implementation is similar to the original HPCC implementation because the difference between the average numbers is well below the standard deviation. (This statistical procedure is similar to the Z-test or the t-test).

Scalability tests

GUPS seems like an embarrassingly parallel workload because it can issue many threads that randomly accesses the big array. However, it turned out that the parellel implementation provided in the HPCC suite does not scale well with the number of threads; The table below shows that GUPS throughput is ~50% higher with 4 threads. Additionally, Intel VTune reported that the application achieves 33% of the available memory bandwidth.

My first guess was that the problem is false sharing of the random number array, which is accessed in the main loop like this:

#pragma omp parallel for
    for (j=0; j<128; j++) {
      ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
      Table[ran[j] & (TableSize-1)] ^= ran[j];
    }
  }
}

To fix the problem, I padded the ran array such that each cell lies in a separate 64-byte cache line. The performance results on Intel Xeon E5-2660 v4 CPU (Broadwell) were similar, as shown below. This trend was true for other machines and array sizes. It seems like the problem lies somewhere else...

threads original no false sharing
1 0.0283 0.0287
2 0.0393 0.0382
3 0.0406 0.0402
4 0.0419 0.0425

About

The single-core kernel of GUPS (aka RandomAccess) from the HPC Challenge benchmark

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published