PRESTA MPI Bandwidth and Latency Benchmark v1.2 README File
***********************************************************

Table of contents
=================

o Code Description
o Benchmark Test Summary
o Benchmark Test Parameters
o Benchmark Test Descriptions
o Building Instructions
o Files in this Distribution

Code Description
================

This benchmark consists of 5 individual tests intended to evaluate the
performance of inter-process communication with MPI.  The provided
applications test inter-process communication latency and bandwidth for
standard MPI message passing operations as well as the MPI-2 RMA/one-sided
operations.  Additionally, collective operation performance is measured for
MPI_Allreduce in one test and several collective operations in the final
application.  These benchmarks are written in C with MPI message-passing
primitives.

Benchmark Test Summary
======================

This section identifies each included test and defines the measurements
provided by each test.

com      : This test measures total unidirectional and bidirectional
           inter-process bandwidth for pairs of MPI processes, varying both
           message size and number of concurrently communicating process
           pairs.  Intra-SMP or inter-SMP communication, as well as
           bandwidth out of node and bisectional bandwidth, can be measured
           with proper allocation of the processes.

laten    : This test measures latency between MPI processes by measuring
           the time to execute MPI_Send/MPI_Recv ping-pong communication
           between MPI process pairs, varying the number of concurrently
           communicating process pairs.

rma      : This test measures MPI-2 RMA bandwidth and epoch length for both
           unidirectional and bidirectional access patterns.  The number of
           operations per epoch can be specified by the user.

allred   : This test measures the MPI_Allreduce operation in a simulated
           application environment with alternating computation and
           communication phases.

globalop : This test, also referred to as the Global-Op benchmark, times
           MPI operation loops with and without a simulated compute phase
           for MPI_Barrier, MPI_Reduce, MPI_Bcast, MPI_Allreduce, and
           MPI_Reduce/MPI_Bcast.

Benchmark Test Parameters
=========================

Operations Per Time Sample :

Timing results are obtained in each of these tests through the use of the
MPI_Wtime function.  To address variation in MPI_Wtime resolution, all of
the tests, with the exception of globalop, allow the specification of the
number of operations to be performed between MPI_Wtime calls as a
command-line argument.  The number of MPI_Wtime clock ticks per measurement
is provided in the output of each test.  This mechanism is provided as a
direct means of tuning the test behavior to the granularity of the
MPI_Wtime function.

Note that for the com, laten, and rma tests, each time sample will include
the overhead of performing a Barrier operation by the included butterfly
barrier.  This default behavior is provided as a means of ensuring
concurrent communication during the test.  Use of the '-n' command-line
argument will suppress the use of the barrier.  When using the default
behavior, the effect of the barrier can be minimized by using larger
numbers of operations per sample.
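As a rough illustration of this sampling scheme, a single time sample could
be structured as in the sketch below.  This is not the benchmark's actual
code: the names (time_sample, do_one_operation, ops_per_sample) are
illustrative, and MPI_Barrier stands in for the included butterfly barrier.

  #include <mpi.h>

  /* One time sample: synchronize, then time 'ops_per_sample' operations
   * between a pair of MPI_Wtime calls.  do_one_operation() stands in for
   * whatever the test measures (e.g., an exchange with the partner rank). */
  double time_sample(int ops_per_sample, int use_barrier,
                     void (*do_one_operation)(void))
  {
      double start, stop;

      if (use_barrier)
          MPI_Barrier(MPI_COMM_WORLD);  /* stand-in for the butterfly barrier */

      start = MPI_Wtime();
      for (int i = 0; i < ops_per_sample; i++)
          do_one_operation();
      stop = MPI_Wtime();

      /* Average cost of one operation; barrier overhead is amortized over
       * the ops_per_sample operations in the sample. */
      return (stop - start) / ops_per_sample;
  }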
Participating Processes and Allocation :

For the com, laten, and rma tests, the allocation of concurrently
communicating process pairs and the number of these pairs can be specified
in several ways.  The intent of the tests is to measure communication for
the following configurations:

1) Communication within an SMP, increasing the number of concurrently
   communicating process pairs.
2) Communication between two SMPs, increasing the number of concurrently
   communicating process pairs.
3) Communication between multiple SMPs, increasing the number of SMPs with
   concurrently communicating process pairs.
4) Communication between multiple SMPs, with active processes on each SMP,
   increasing the number of communicating processes on each SMP.

The following command-line flags provide the same functionality across the
com, laten, and rma tests:

-f [process count source file]
-p [allocate processes: c(yclic) or b(lock)]    default=b
-t [processes per SMP]
-i : print process pair information             default=false
-r : partner processes with nearby rank         default=false

Process pair partnering is done by default by selecting 1 process from each
half of the rank ids in MPI_COMM_WORLD.  Pair partners are identified by:

  (MPI_COMM_WORLD Rank ID + Size of MPI_COMM_WORLD/2) % Size of MPI_COMM_WORLD

However, if the number of processes per SMP is provided, the processes can
be paired by nearest off-SMP rank through the use of the '-r' flag.

The default behavior of the tests is to allocate process pairs
incrementally based on rank, beginning with 1 process pair.  For example,
an iteration for 4 processes of a 16-process job with 8 processes per SMP
would include the rank pairs (0,8) and (1,9).  Test configuration #3
mentioned above uses this allocation pattern and begins with
2*processes-per-SMP active processes.

Since test configuration #4 increases the number of active processes per
SMP, the participating processes must be identified in a cyclic manner.
Specifying the number of processes per SMP and the '-p c' option results in
cyclic allocation of the active processes.  For example, an iteration for 8
processes of a 32-process job with 8 processes per SMP would include rank
pairs (0,16), (8,24), (1,17), and (9,25).  Test configuration #4 starts
with the number of communicating processes equal to the number of SMPs and
increases the process count by the number of SMPs each iteration.

The process counts can also be specified in a source file with the '-f'
option.  Each count must be on a separate line.

Message Sizes :

The com and rma tests vary message size based on the command-line arguments
message size start, message size stop, and message size factor.  Start and
stop are specified in bytes.  The message size factor is the value by which
the current message size is multiplied to derive the message size for the
next iteration.
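To make these rules concrete, the sketch below shows the default
pair-partnering formula quoted above and the geometric message-size
progression in plain C.  It is illustrative only, not taken from the
benchmark sources; the start/stop/factor values shown are the documented
defaults of the com and rma tests (-b 32, -e 8388608, -s 2).

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Default pairing rule: each rank is partnered with the rank half a
       * communicator away. */
      int partner = (rank + size / 2) % size;

      /* Message-size progression: start at 'start' bytes and multiply by
       * 'factor' until 'stop' is exceeded. */
      long start = 32, stop = 8388608, factor = 2;
      if (rank == 0)
          for (long bytes = start; bytes <= stop; bytes *= factor)
              printf("rank %d <-> rank %d, %ld bytes\n", rank, partner, bytes);

      MPI_Finalize();
      return 0;
  }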
Benchmark Test Descriptions
===========================

com :

The "com" test is intended to illustrate at what point the interconnect
between communicating MPI processes becomes saturated for both
unidirectional and bidirectional communication.  As indicated above, com
iterates over a range of message sizes for each set of communicating
process pairs.  The entire size/pair iteration is performed first for
unidirectional and then for bidirectional communication.  The bandwidth and
average operation time are calculated based on the longest time sample of
any of the participating processes.  Intra- and inter-SMP performance can
be evaluated based on process allocation.  The com test requires that the
'number of operations between measurements' argument (-o) be specified.

com : MPI interprocess communication bandwidth benchmark
syntax: com [OPTION]...

-b [message start size]                         default=32
-e [message stop size]                          default=8388608
-f [process count source file]
-o [number of operations between measurements]
-p [allocate processes: c(yclic) or b(lock)]    default=b
-s [message size increase factor]               default=2
-t [processes per SMP]
-h : print use information
-i : print process pair information             default=false
-n : do not use barrier within measurement      default=barrier used
-r : partner processes with nearby rank         default=false

laten :

The application "laten" measures inter-process latency by performing
ping-pong communication using MPI_Send and MPI_Recv.  As indicated above,
the laten test varies the number of concurrently communicating process
pairs for each measurement.  Communication latency is determined based on
the longest process-pair execution of the specified number of 0-byte
MPI_Send/MPI_Recv operations performed.  The average time of a single
MPI_Send/MPI_Recv iteration is calculated and divided by two.  Intra- and
inter-SMP performance can be evaluated based on process allocation.  The
laten test requires that the 'number of operations between measurements'
argument (-o) be specified.

laten : MPI interprocess communication latency benchmark
syntax: laten [OPTION]...

-f [process count source file]
-o [number of operations between measurements]
-p [allocate processes: c(yclic) or b(lock)]    default=b
-t [processes per SMP]
-h : print use information
-i : print process pair information             default=false
-n : do not use barrier within measurement      default=barrier used
-r : partner processes with nearby rank         default=false

rma :

The application "rma" measures the performance of unidirectional and
bidirectional communication using the MPI-2 RMA operations MPI_Put and
MPI_Get.  This test varies message size and concurrent communicating
process pairs in the same manner as the com test.  Note that each process
pair performs operations on a window specific to the two processes.

Note that this test has large memory requirements: for 128 operations per
epoch with operation sizes of 8MB for a bidirectional test, the memory
requirement for 1 process would be 128 * 8 * 2 = 2048 MB.  For multiple
concurrently communicating process pairs, this can quickly lead to
excessive paging.

The rma test requires that the 'number of operations per epoch' argument
(-o) and the 'number of epochs between measurements' (-c) argument values
be specified.

rma : MPI interprocess communication benchmark for one-sided operations
syntax: rma [OPTION]...

-b [message start size]                         default=32
-c [number of epochs between measurements]
-e [message stop size]                          default=8388608
-f [process count source file]
-o [number of operations per epoch]
-p [allocate processes: c(yclic) or b(lock)]    default=b
-s [message size increase factor]               default=2
-t [processes per SMP]
-h : print use information
-i : print process pair information             default=false
-n : do not use barrier within measurement      default=barrier used
-r : partner processes with nearby rank         default=false
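For orientation, one unidirectional access epoch of the kind the rma test
times might look like the sketch below.  It assumes fence synchronization
and illustrative names; this README does not state which MPI-2
synchronization mode the benchmark actually uses, so treat this only as an
outline of "operations per epoch".  Note that the source buffer and the
target window each need ops_per_epoch * bytes bytes, which is where the
memory requirement quoted above comes from.

  #include <mpi.h>

  /* One timed epoch: 'ops_per_epoch' MPI_Put operations of 'bytes' bytes
   * each, bracketed by fence synchronization (illustrative sketch only). */
  double put_epoch(MPI_Win win, int partner, char *src, int bytes,
                   int ops_per_epoch)
  {
      double start = MPI_Wtime();

      MPI_Win_fence(0, win);                    /* open the epoch */
      for (int i = 0; i < ops_per_epoch; i++)
          MPI_Put(src + (long)i * bytes, bytes, MPI_CHAR,
                  partner, (long)i * bytes, bytes, MPI_CHAR, win);
      MPI_Win_fence(0, win);                    /* close the epoch */

      return MPI_Wtime() - start;
  }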
allred :

The application "allred" measures the MPI_Allreduce operation in a
simulated application-like context.  A series of computation operations is
timed, followed by a series of timed compute/Allreduce iterations.  The
computation-only time is subtracted from the compute/Allreduce time,
resulting in the MPI_Allreduce operation time.  The Allreduce operations
use a fixed operation size of 4 MPI_DOUBLEs.

This application takes 3 arguments: the number of compute iterations, the
number of loop iterations between time samples, and the total number of
samples taken.  To avoid overlap of Allreduce operations, the "Time between
Allreduce" field should be at least as great as the "Op mean" value.  This
can be tuned with the number of compute iterations argument.  If the "Op
mean" results of this test appear inconsistent, the sample and stddev/mean
values can be examined to determine whether inconsistent data is being
gathered.

globalop :

The application "globalop" measures the time to complete loops of several
MPI collective operations, including MPI_Barrier, MPI_Reduce, MPI_Bcast,
MPI_Allreduce, and MPI_Reduce/MPI_Bcast.  The operation loops are timed
with and without a call to a simulated compute-phase function.
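Both allred and globalop separate communication cost from an interleaved
compute phase.  The sketch below outlines the allred-style scheme: time a
compute-only loop, time a compute-plus-Allreduce loop, and take the
difference.  It is illustrative only; simulated_compute, the loop counts,
and the MPI_SUM reduction are stand-ins rather than the benchmark's code.

  #include <mpi.h>

  static void simulated_compute(int n)      /* placeholder compute phase */
  {
      volatile double x = 0.0;
      for (int i = 0; i < n; i++)
          x += i * 0.5;
  }

  double allreduce_cost(int iterations, int compute_iterations)
  {
      double in[4] = {1.0, 1.0, 1.0, 1.0}, out[4];
      double start, t_compute, t_both;

      /* compute-only phase */
      start = MPI_Wtime();
      for (int i = 0; i < iterations; i++)
          simulated_compute(compute_iterations);
      t_compute = MPI_Wtime() - start;

      /* compute + Allreduce phase; allred uses a fixed size of 4 MPI_DOUBLEs */
      start = MPI_Wtime();
      for (int i = 0; i < iterations; i++) {
          simulated_compute(compute_iterations);
          MPI_Allreduce(in, out, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      }
      t_both = MPI_Wtime() - start;

      /* subtracting the compute-only time leaves the Allreduce time */
      return (t_both - t_compute) / iterations;
  }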
Building Instructions
=====================

Configure the Makefile to use the appropriate commands and flags as
required to compile MPI applications on the target system.

Files in this Distribution
==========================

Makefile
README
README.html
README.rfp
README.rfp.html
allred.c
com.c
globalop.c
laten.c
rma.c
util.c
util.h

Last modified on April 8, 2002 by Chris Chambreau

For information contact:  Chris Chambreau -- chcham@llnl.gov

UCRL-CODE-2001-028