s370_perf: IBM System/370 Instruction Timing Benchmark

Table of contents

Overview
Description
Tests
Parameters
Configuration file
Output
Usage
See also
source code: codes/s370_perf.asm

Overview

s370_perf determines the time per instruction of IBM System/370 instructions in 24-bit mode. It

covers almost all non-privileged instructions
tests load and store instructions for aligned and unaligned data
tests instructions with length fields (e.g. MVC) for several lengths
tests decimal instructions (e.g. AP) for two digit counts
tests conditional branches for the taken and fall-through cases
tests branches (e.g. BC,BAL) for short (same 4k page) and far (different page) targets
tests instructions with memory interlock (CS,CDS,TS) in the typical lock taken and the lock missed data configurations

The code was developed and tested using the Hercules emulator and MVS 3.8J as packaged with tk4-. It should work for CPU speeds from below 1 to well above 1000 MIPS, and should thus run without modifications on a wide range of platforms:

older hardware implementations, like P/390 boards
Hercules on slow systems, like Raspberry Pi
Hercules on contemporary PC processors
and even contemporary z Systems

Description

s370_perf contains about 220 test routines, each targeting one S/370 instruction, plus about 80 additional tests to verify the consistency of the measured instruction times. The core of an instruction test looks like

T100L   LR  R2,R1
        LR  R2,R1
        LR  R2,R1
        ... total of 100 repeats ...
        BCTR  R15,R11

The instruction under test is repeated, usually 50 times for fast instructions, and that sequence is wrapped in a BCTR loop. The instruction repeat count is called the group count and is listed in the ig column of the output.

The repeat count of the BCTR loop is called the local repeat count and is listed in the lr column of the output. The default values are chosen such that all test routines have roughly the same execution time of about 5 msec on a reference system. They can be changed via a configuration file.

The s370_perf main program executes each test GMUL times. This global multiplier is common to all tests and is typically chosen such that each test runs for about one second in a benchmark run, either explicitly via the /Gnnn option or automatically via the /GAUT option.

In some cases a register used in the inner loop must be re-initialized for each loop iteration to avoid arithmetic overflows, like

T220L   LA    R2,1
        SLA   R2,1
        SLA   R2,1
        SLA   R2,1
        ... total of 30 repeats ...
        BCTR  R15,R11

The loop overhead will be subtracted later by the analysis tool s370_perf_ana, based on the loop type listed in the lt column of the output.

Some instructions modify registers or memory such that a setup is needed for each invocation of the instruction, e.g. ED requires an MVC to set up the edit pattern, which is overwritten by the edit result. In those cases the test loop looks like

T410L   MVC   0(10,R3),T410V3
        ED    0(10,R3),T410V1+3
        MVC   0(10,R3),T410V3
        ED    0(10,R3),T410V1+3
        ... total of 10 repeats of the sequence ...
        BCTR  R15,R11

In those cases the test gives the time for the instruction sequence. The time for the targeted instruction, ED in the example, is again determined by the analysis tool s370_perf_ana by subtracting the independently measured instruction time(s) of the additional instructions, MVC in the example.

Last but not least, the s370_perf main program allows the execution of tests to be enabled or disabled via the /Ennn, /Dnnn and /Tnnn options. Almost all of the instruction tests are enabled by default, while most of the auxiliary tests T9xx are disabled by default. The configuration of all available tests can be listed with the /OPTT option.

Tests

Each test has a unique identifier, usually called tag, of the form Tddd. The tests are grouped into classes

Test 1xx -- load/store/move
Test 2xx -- binary/logical
Test 3xx -- flow control
Test 4xx -- packed/decimal
Test 5xx -- floating point
Test 6xx -- miscellaneous instructions
Test 7xx -- mix sequence
Test 9xx -- auxiliary tests

Most tests are self-explanatory and target a single instruction, but some deserve further commentary:

unaligned memory access
T15x - MVC
T17x - MVCL
T25x - TRT
T27x - CLC
T28x - CLCL
T29x - CS+CDS
T301+T302 - BC branch taken / not taken
far page branches
T311,T312,T315 - BCT,BCTR,BXLE
T320+T321 - BALR close and far
T330 - BALR;SAVE;RETURN
T42x - AP+SP+MP+DP
T61x - EX
T62x - TS
T7xx - mix sequences
T9xx - auxiliary tests

unaligned memory access

s370_perf has several tests addressing unaligned memory access

Test	Description	hword	word	dword	Comment
T103	L R,m (unal)	-	yes	-	cross word border
T105	LH R,m (unal3)	yes	yes	-	cross word border
T111	ST R,m (unal)	-	yes	-	cross word border
T113	STH R,m (unal1)	yes	no	-	cross half-word border
T114	STH R,m (unal3)	yes	yes	-	cross word border
T502	LE R,m (unal)	-	yes	-	cross word border
T509	STE R,m (unal)	-	yes	-	cross word border
T532	LD R,m (unal)	-	-	yes	cross double word border
T539	STD R,m (unal)	-	-	yes	cross double word border

In test T113 STH does a write across a halfword border, while in test T114 STH does a write across a word border. In T114 the access can even cross a page border, so the two cases might exhibit quite different performance characteristics.

T15x - MVC

The MVC instruction is tested for a wide range of transfer sizes between 5 and 250 characters, and also for two scenarios with overlapping source and destination areas:

in T156 the destination buffer is offset by +1 byte to the source buffer. This is sometimes used to fill a buffer with a character.
in T157 the destination buffer is offset by -24 bytes to the source buffer, effectively shifting the buffer 24 bytes to the left.
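
The effect of the two overlapping cases can be illustrated with a small model. The following is a minimal Python sketch, assuming only that MVC moves its operand one byte at a time from left to right; the function name `mvc` and the buffer contents are made up for illustration and are not taken from s370_perf.

```python
def mvc(storage: bytearray, dst: int, src: int, length: int) -> None:
    """Model MVC: copy `length` bytes one at a time, left to right."""
    for i in range(length):
        storage[dst + i] = storage[src + i]

# T156-style overlap: destination = source + 1, so the first byte propagates.
fill = bytearray(b"*abcdefgh")
mvc(fill, dst=1, src=0, length=8)
print(fill.decode())            # *********

# T157-style overlap: destination is 24 bytes below the source (shift left).
shift = bytearray(range(64))
mvc(shift, dst=0, src=24, length=40)
print(list(shift[:4]))          # [24, 25, 26, 27]
```

In the +1 case every byte read has just been written by the previous step, which is why the first character ends up replicated through the whole destination.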

T17x - MVCL

The MVCL instruction is tested, like MVC in T15x, for a wide range of copy transfer sizes between 10 and 4096 bytes, for three zero-fill padding cases, and also for two scenarios with overlapping source and destination areas:

in T178 the destination buffer is offset by +1 byte to the source buffer. As with T156 for MVC, this can be used to fill an area, but padding is likely more efficient.
in T179 the destination buffer is offset by -100 bytes to the source buffer, effectively shifting the buffer 100 bytes to the left.

T25x - TRT

The TRT instruction is tested for different operand sizes and function tables

Test	Description	size	zeros	reads
T255	TRT m,m (10c,zero)	10	10	10
T256	TRT m,m (100c,zero)	100	100	100
T257	TRT m,m (250c,zero)	250	250	250
T258	TRT m,m (250c,10b)	250	10	11
T259	TRT m,m (250c,100b)	250	100	101

In tests T255-T257 the function table lookup is always zero, so all input bytes are checked. In tests T258 and T259 the function table is set up such that the first 10 or 100 lookups are zero, respectively, and the 11th or 101st is non-zero. The number of operand bytes and function table reads is indicated in the table above.

T27x - CLC

The CLC instruction is tested for a range of buffer sizes (10 to 250) and also for fully matching eq and completely different ne buffers. Because the ne case can be detected at the very first byte comparison it's natural to expect that the ne tests have the same instruction time for all sizes, while the eq tests show a time which increases with buffer size.

T28x - CLCL

The CLCL instruction is tested for two buffer sizes (10 and 4096) and different locations of the first non-matching byte (10, 100, 250, 1024 and 4096). Like for CLC in T27x it is natural to assume that the instruction time mainly depends on the number of bytes to test before a mismatch is detected. The tests T284 and T285 are disabled by default because they are very slow on Hercules.

T29x - CS+CDS

The CS and CDS instructions implement the compare-and-swap paradigm, in short

    opcode:    CS  R1,R3,D2(B2)      or    CS  OP1,OP3,OP2
    action:    if (OP1==OP2) then OP2 := OP3 else OP1 := OP2

In a multi-CPU configuration this involves interlocked memory updates and access serialization. For different implementations the overhead for interlock and serialization can vary strongly with the data pattern, therefore three cases are tested

Test	Description	Comment
T290	CS R,R,m (eq,eq)	OP1==OP2 && OP3==OP1
T291	CS R,R,m (eq,ne)	OP1==OP2 && OP3!=OP1
T292	CS R,R,m (ne)	OP1!=OP2

Likewise T295-T297 for CDS.

The OP1!=OP2 case is considered the lock missed case.
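
The three data patterns can be made concrete with a small model of the action line above. This is a minimal Python sketch of the compare-and-swap semantics only (no interlocking); the function name and the operand values are hypothetical and chosen just to hit the three cases.

```python
def compare_and_swap(op1: int, op3: int, op2: int) -> tuple[int, int, int]:
    """Return (new OP1, new OP2, condition code) for one CS execution."""
    if op1 == op2:
        return op1, op3, 0      # equal: OP3 is stored into OP2, cc=0
    return op2, op2, 1          # unequal: OP2 is loaded into OP1, cc=1

# Illustrative operand values (assumptions), one per tested case
print(compare_and_swap(5, 5, 5))   # T290 (eq,eq): (5, 5, 0)
print(compare_and_swap(5, 7, 5))   # T291 (eq,ne): (5, 7, 0)
print(compare_and_swap(5, 7, 6))   # T292 (ne):    (6, 6, 1)
```

In the (eq,eq) case the stored value equals the old one, in the (eq,ne) case memory really changes, and in the (ne) case only the register is updated.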

T301+T302 - BC branch taken / not taken

The BC instruction is tested in both the branch not taken (T301) and the branch taken (T302) case. The latter is implemented as a branch maze. In most implementations the branch taken case will have a significantly larger instruction time than the not taken (or fall through) case.

far page branches

The basic branch instruction tests, like T301, use a branch target located in the same page as the branch instruction. Several tests address the case where branch instruction and branch target are located in different pages, so that the branch crosses a page border:

T303 is similar to T302, tests BC (far)
T305 is similar to T304, tests BR (far)
T321 is similar to T320, tests BALR (far)
T323 is similar to T322, tests BAL (far)

T311,T312,T315 - BCT,BCTR,BXLE

The loop instructions BCT, BCTR and BXLE are tested with empty loop bodies, like

T315L    LA    R3,0               index begin
         LA    R4,1               index increment
         LA    R5,99              index end
T315LL   EQU   *                  no inner loop body
         BXLE  R3,R4,T315LL       will be executed 100 times

As in real applications, the branch taken case is much more frequent than the fall through case at the end of the loop.
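
The count of 100 executions in the T315 loop can be checked with a short simulation. This is a minimal Python sketch, assuming the usual BXLE behaviour (add the increment to the index, then branch while the index is less than or equal to the limit); the function name is made up for illustration.

```python
def run_bxle_loop(index: int, increment: int, limit: int) -> tuple[int, int]:
    """Return (branches taken, fall-throughs) for an empty BXLE loop."""
    taken = not_taken = 0
    while True:
        index += increment          # BXLE first adds the increment
        if index <= limit:          # and branches while index <= limit
            taken += 1
        else:
            not_taken += 1          # final execution falls through
            break
    return taken, not_taken

# Register values from the T315 example: R3=0, R4=1, R5=99
print(run_bxle_loop(0, 1, 99))      # (99, 1)
```

With these register values the BXLE is executed 100 times, 99 of them with the branch taken.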

T320+T321 - BALR close and far

The BALR instruction can only be tested together with a BR. T321 is set up such that each branch crosses a page border.

T330 - BALR;SAVE;RETURN

This test covers the standard MVS calling sequence, starting with a L and BALR on the caller side and standard save area handling with a full (14,12) save and restore and save area linkage update at the callee side. The test returns the time for the full sequence of 11 instructions.

T42x - AP+SP+MP+DP

The decimal packed arithmetic instructions are tested with two number sizes, 10 digits and 30 digits. The tests with 30 digit numbers not only involve larger operands, but also values with a higher number of significant digits. The available tests are

Instruction	10d test	30d test
AP	T420	T421
SP	T422	T423
MP	T424	T425
DP	T426	T427

T61x - EX

The EX instruction can only be tested together with another instruction being modified and executed by EX. Two tests are provided

T610 - EX with TM m,i
T611 - EX with XI m,i

T62x - TS

The TS instruction implements the test-and-set paradigm. In a multi-CPU configuration this involves interlocked memory updates and access serialization. For different implementations the overhead for interlock and serialization can vary strongly with the data pattern, therefore both cases are tested

Test	Description	Comment
T620	TS m (zero)	lock taken case
T621	TS m (ones)	lock missed case

T7xx - mix sequences

The goal of all previous tests is to determine the time of a single instruction, usually by repeating the instruction under test many times. Real workloads of course consist of sequences of different instructions. Four simple tests are provided to test non-trivial instruction sequences

T700 - sequence of RR type instructions
T701 - sequence of RX type instructions
T702 - like T701, but code+data in different pages
T703 - similar to T700, but non-optimizable

See the auxiliary tests for more detailed tests of the additivity of instruction times.

T700 - mix int RR

The test T700 contains a sequence of 38 integer RR type instructions plus two BC where the branch isn't taken. The test returns the average execution time of the involved instructions. This test makes it possible to check whether the instruction times are additive on a given system: simply compare the T700 time with the average of the times of the involved instructions. See the T95x tests for an in-depth study of this instruction sequence.

T701+T702 - mix int RX

Similar goal as T700, using a sequence of 21 integer RX type instructions. In T701 the accessed operands are in the same page as the code, while in T702 the accessed operands are in a different page than the code.

T703 - mix int RR noopt

Similar goal as T700, now with an instruction sequence where each calculated value is used. This prevents emulators that use an optimizing binary translator from removing part of the code.

T9xx - auxiliary tests

The tests T90x, T92x and T95x make it possible to test whether the instruction times are additive, or in other words, whether the time for a sequence of instructions is the sum of the measured instruction times. The tests report the time for a whole instruction sequence and are best analysed with the raw view generated by s370_perf_ana with the -raw option.

T90x - LR R,R count tests

The tests T900 to T915 are similar to the T100 test, but use different repeat counts of the LR R,R instruction, with ig ranging from 1 to 72 (T100 uses 100). These tests report the time for the bundle of ig LR instructions and not the time for a single instruction as T100 does. The measured time should increase in proportion to the ig count if instruction times are additive, so this can be used to check whether the loop overhead is subtracted correctly.
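
One way to look at such a series is a straight-line fit of bundle time against ig. The following is a minimal Python sketch of that idea and not the method used by s370_perf_ana; the ig values and times are synthetic, made-up numbers (3 ns per LR plus a 2 ns residual overhead), not measurements.

```python
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic illustration data (NOT measured): bundle time in ns versus ig
ig_counts = [1, 2, 4, 8, 16, 32, 72]
bundle_ns = [2.0 + 3.0 * ig for ig in ig_counts]

per_inst, residual = fit_line(ig_counts, bundle_ns)
print(f"time per LR: {per_inst:.3f} ns, residual overhead: {residual:.3f} ns")
```

If the loop overhead has been subtracted correctly and the times are additive, the intercept of such a fit should be close to zero and the slope should agree with the T100 result.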

T92x - L R,m count tests

Similar goal as T90x, using L R,m. To be compared with T102.

T95x - T700 partial sequence tests

The tests T952 to T990 contain the first 2,3,...,40 instructions of the T700 test; they are therefore truncated versions of T700. The tests report the time for the whole sequence and not the average as T700 does. The measured time should increase continuously if instruction times are additive, so this sequence of tests can be used to check whether this is actually true for a given system.

Parameters

The run time behavior of s370_perf is controlled by options passed with the JCL EXEC card PARM mechanism. The PARM string is a list of 4 letter options, each starting with a /. Valid options are:

Option	Description
/OWTO	enable step by step MVS console messages
/ODBG	enable debug trace output for test steps
/OTGA	enable debug trace output for /GAUT processing
/OPCF	print configuration file entries
/OPTT	print test descriptor table
/ORIP	run tests in place (default is relocate)
/GAUT	automatic determination of `GMUL`, aim is 1 sec per test
/Gnnn	set `GMUL` to nnn
/GnnK	set `GMUL` to nn * 1000
/Cnnn	select test used for /GAUT
/Ennn	enable test Tnnn
/Dnnn	disable test Tnnn
/Tnnn	select test Tnnn
/TCOR	select tests required for corrections

/OWTO

Enables step by step MVS console messages, sent with WTO as 'job status' messages to the operator console. This might be useful to see the s370_perf run steps in the context of all other system activities in the console log, but on systems with a real operator this might not be too welcome. The messages are sent at the end of each test step and look like

13.49.45 JOB 5226  +s370_perf: done T100
13.49.46 JOB 5226  +s370_perf: done T101
13.49.47 JOB 5226  +s370_perf: done T102

/ODBG

Enable debug trace output for test steps. Gives the start and stop time of the test step as retrieved with STCK and all other information to double check the calculation of the instruction timing.

/OTGA

Enable debug trace output for /GAUT processing in the format

--  GAUT:         1 :   D3BCC375  081DF080 :   D3BCC375  098220C0 :       5699
--  GAUT:         3 :   D3BCC375  09832880 :   D3BCC375  0E14E0C1 :      18716
--  GAUT:         9 :   D3BCC375  0E166081 :   D3BCC375  1934C0C1 :      45542
--  GAUT:        27 :   D3BCC375  193620C1 :   D3BCC375  3A9848C1 :     136738
--  GAUT:        81 :   D3BCC375  3A9A2841 :   D3BCC375  9EDCF041 :     410669
PERF002I run with GMUL=        197

The first number gives the current GMUL, the next two give the time retrieved with STCK before and after an execution of the T102 test, and the last gives their difference divided by 4096, which is the elapsed time in usec.
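
The arithmetic can be reproduced from the first trace line. This is a minimal Python sketch with a hypothetical function name; the last two lines are only an assumption that GMUL is obtained by scaling the final measurement linearly to 1 sec, since the actual /GAUT logic is not described here.

```python
def stck_elapsed_usec(before: tuple[int, int], after: tuple[int, int]) -> float:
    """Elapsed usec between two STCK values given as (high word, low word)."""
    t0 = (before[0] << 32) | before[1]
    t1 = (after[0] << 32) | after[1]
    return (t1 - t0) / 4096     # dividing the 64-bit difference by 4096 gives usec

# Values copied from the first GAUT trace line (GMUL=1)
first = stck_elapsed_usec((0xD3BCC375, 0x081DF080), (0xD3BCC375, 0x098220C0))
print(round(first))             # 5699

# Assumption: linear scaling of the last trace line (GMUL=81, 410669 usec) to 1 sec
print(int(81 * 1_000_000 / 410669))   # 197
```

The linear estimate happens to agree with the GMUL reported in the PERF002I message of the example.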

/OPCF

Prints configuration file entries (comments are skipped) in the format

  PERF010I config: T151    1      2000
  PERF010I config: T152    1      4000
  PERF010I config: T154    1      4000
  ...

/OPTT

Prints the test descriptor table in the format

 ind   tag        lr  ig  lt      addr    length
   0  T100     22000 100   1  000A8D48       252
   1  T101     17000 100   1  000A8E48       452
   2  T102     13000  50   1  000A9010       256
 ...

with the columns containing

column	description
ind	table entry index
tag	name of the test. A disabled test is prefixed with `-`
lr	local repeat count, the loop count for the `BCTR` loop of the test
ig	group count, the number of times the instruction under test is replicated in the body of the test loop
lt	loop type, indicates the additional instructions used to close the loop around the instruction under test
addr	absolute address of the beginning of the test code
length	length of the test code

/ORIP

Run tests in place. The default is to relocate each test before execution into a page aligned 8 kByte buffer. This ensures that tests don't have branches across page boundaries, unless explicitly wanted. /ORIP disables this relocation and executes the code in the place where the assembler generated it. Useful for debugging tests which break after relocation.

/GAUT

Enables automatic determination of GMUL. This will set up the global repeat count such that the T102 test, or the one selected with a /Cnnn option, runs about 1 sec. Because the local repeat counts of the tests have been tuned to get about equal CPU time for all tests on the reference system, this will result in about 1 sec CPU time per test on systems with similar characteristics. Typical GMUL values resulting from /GAUT are

Host CPU	System	GMUL	Comment
ARMv5te	Herc tk3	4	Pogoplug v2
	P/390	8	CPU board
ARMv7 - BCM2835	Herc tk4- 08	11	Raspberry Pi 2 Model B
AMD Opteron 6238	Herc tk4- 08	100	older 2*12 core server
Intel Core2 Duo E8400	Herc tk4- 09 rc2	118	older desktop CPU
Intel Xeon E5-1620	Herc tk4- 08	195	typical 4 core workstation
Intel Core i7-3520M	Herc tk4- 08	202	mid-end notebook

/Gnnn and /GnnK

With /Gnnn or /GnnK, where n are decimal digits, the global multiplier GMUL will be set to nnn or nn*1000, respectively. Helpful for debugging; normal production runs usually use /GAUT.

/Cnnn

Selects the test used by /GAUT. The three digit code must match one of the test numbers. By default T102 is used.

/Ennn and /Dnnn

Enable or disable the test Tnnn. The three characters after the leading /E or /D can be either a digit 0 - 9 or a wildcard character *, which will match any digit in that position. This makes it possible to handle groups of tests, e.g. /E1** will enable all tests in the 100 to 199 range, /D5** will disable all tests in the range 500 to 599 (the floating point group). By default all tests are enabled with the exception of T284 and T285 (very slow CLCL tests) and most of the auxiliary tests T9xx. /Dnnn can be used to disable tests which cause problems. To set up a run with only a few tests use the /Tnnn option.

/Tnnn

When the first /Tnnn option is detected in the PARM list, all tests will be disabled. Each /Tnnn then re-enables the test Tnnn. This makes it possible to set up a run with only a few tests enabled. Wildcards are supported as described for /Ennn.
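
The selection rules for /Ennn, /Dnnn and /Tnnn can be summarized in a short model. This is a minimal Python sketch of the semantics described above, assuming the options are processed in the order they appear in the PARM string; the function names and the test table are hypothetical, and options such as /GAUT or /TCOR are simply skipped.

```python
import re

def tag_matches(pattern: str, tag: str) -> bool:
    """pattern: three characters, each a digit or '*'; tag: e.g. 'T102'."""
    return all(p in ("*", d) for p, d in zip(pattern, tag[1:]))

def apply_parm(parm: str, enabled: dict[str, bool]) -> dict[str, bool]:
    """Return the enable state after applying /E, /D and /T options."""
    state = dict(enabled)
    seen_t = False
    for opt in filter(None, parm.split("/")):
        kind, pattern = opt[0], opt[1:]
        if kind not in "EDT" or not re.fullmatch(r"[0-9*]{3}", pattern):
            continue                                # other options are not modelled
        if kind == "T" and not seen_t:
            state = {tag: False for tag in state}   # first /Tnnn disables all tests
            seen_t = True
        for tag in state:
            if tag_matches(pattern, tag):
                state[tag] = kind != "D"            # /E and /T enable, /D disables
    return state

# Hypothetical excerpt of the test table with assumed default states
defaults = {"T100": True, "T102": True, "T284": False, "T500": True, "T900": False}
print(apply_parm("/GAUT/E9**", defaults))
print(apply_parm("/T10*", defaults))
```

The first call enables T900 in addition to the defaults, the second leaves only T100 and T102 enabled.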

/TCOR

Inspects all enabled tests and enables all tests required by s370_perf_ana for corrections and normalizations:

all tests required by loop overhead corrections
T100 and T102 used in normalized instruction times

/TCOR is handled as the last step of test enable/disable processing, after the configuration file and all /Ennn, /Dnnn and /Tnnn options.

Configuration file

The local repeat counts for each test have been adjusted such that all tests consume roughly the same CPU time on a reference system, a Hercules emulator running on an up-to-date Intel CPU. For very different environments, e.g. a z/PDT emulator or real hardware like a P/390 system, the relative CPU consumption can be very different. In those cases the local repeat counts can be redefined with a configuration file read from SYSIN in the format

#nnn    e     lrcnt
T151    1      2000
T152    1      4000
T154    1      4000

Lines starting with # are considered comments and are ignored. Each line holds

a four character test name, like T154. No wildcards supported here.
an enable flag, 0 or 1, which overrides the test enable status
a new local repeat count for this test. If 0 is specified the old one is kept.

Note that the fields are strictly positional, the enable must be in column 9, the local repeat count right justified in columns 12 to 18. It is thus advisable to have a comment line as shown in the example above. The processing of the configuration file can be monitored with the /OPCF option, the final settings can be inspected with the /OPTT option.

Output

The output of s370_perf is a table of test step results in the form

PERF001I PARM: /GAUT
PERF002I run with GMUL=        118
PERF003I start with tests
 tag  description              :      test(s)         lr  ig  lt :    inst(usec)
T100  LR R,R                   :      0.818643     22000 100   1 :      0.003153
T101  LA R,n                   :      0.800819     17000 100   1 :      0.003992
T102  L R,m                    :      0.991196     13000  50   1 :      0.012923
T103  L R,m (unal)             :      1.078041     12000  50   1 :      0.015227
...
PERF004I done with tests

with the columns containing

column	description
tag	name of the test
description	instruction under test and conditions
test(s)	execution time of this test in sec
lr	local repeat count, the loop count for the `BCTR` loop of the test
ig	group count, the number of times the instruction under test is replicated in the body of the test loop
lt	loop type, indicates the additional instructions used to close the loop around the instruction under test, see section Loop Types.
inst(usec)	time per instruction in usec

Notes on the given instruction time:

the loop overhead is not subtracted, that will be done in post-processing with s370_perf_ana. However, the loop overhead is typically a few % only, so the numbers are a good quick estimate.
for instructions which can only be tested in context, like BALR or MVCL, the time is for the whole bundle, which is described in the description field as a ; separated list like BALR R,R; BR R.
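
The inst(usec) column of the sample output is consistent with dividing the test time by the total number of executed instructions, GMUL * lr * ig. The following minimal Python sketch checks this against the sample; the formula is inferred from the numbers shown above rather than stated explicitly, and the function name is hypothetical.

```python
def inst_usec(test_sec: float, gmul: int, lr: int, ig: int) -> float:
    """Time per instruction in usec, loop overhead not subtracted."""
    return test_sec * 1e6 / (gmul * lr * ig)

# Values taken from the sample output (GMUL=118)
print(f"{inst_usec(0.818643, 118, 22000, 100):.6f}")   # T100: 0.003153
print(f"{inst_usec(0.991196, 118, 13000, 50):.6f}")    # T102: 0.012923
```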

Loop Types

Most tests contain only the replicated instruction under test and a closing BCTR. In some cases additional initialization is needed for each inner loop iteration. The current code uses the following loop types

lt	Loop instructions	Comment
0		used for testing `BCTR` and `BCR`
1	BCTR	used for most tests
2	BCT	tests with 8k code
3	LR, BCTR
4	LA, BCTR
5	LA, XR, BCTR
6	LA, LA, LA, BCTR
7	MVC (5c), BCTR
8	MVC (15c), BCTR
9	LE, BCTR
10	LD, BCTR
11	LD, LD, BCTR

Usage

Job templates to be used with hercjis are provided in the codes directory and described in the README. A typical benchmark run with 30 jobs is started like

  cd <codes-directory>
  hercjis -r 30 s370_perf_ff.JES

If s370_perf is run without using the packed job, keep in mind that s370_perf is a fairly large assembler module, currently 6800+ lines of code with a lot of macro generated code. Both the assembler and the linkage editor fail under MVS 3.8J with the defaults of the ASMFCLG procedure as provided in tk4-. The assembler needs increased allocations of the work files, and runs substantially faster with BUFSIZE(MAX). The linkage editor needs an increased work area like SIZE=(512000,122880). A well working JCL example is

//CLG EXEC ASMFCLG,
//      MAC1='SYS2.MACLIB',
//      PARM.ASM='NOLIST,NOXREF,NORLD,NODECK,LOAD,BUFSIZE(MAX)',
//      PARM.LKED='MAP,LIST,LET,NCAL,SIZE=(512000,122880)',
//      COND.LKED=(8,LE,ASM),
//      PARM.GO='/GAUT/E9**',
//      COND.GO=((8,LE,ASM),(4,LT,LKED))
//ASM.SYSUT1 DD DSN=&&SYSUT1,UNIT=SYSDA,SPACE=(1700,(600,100))
//ASM.SYSUT2 DD DSN=&&SYSUT2,UNIT=SYSDA,SPACE=(1700,(900,200))
//ASM.SYSUT3 DD DSN=&&SYSUT3,UNIT=SYSDA,SPACE=(1700,(900,200))
//ASM.SYSGO  DD DSN=&&OBJSET,UNIT=SYSDA,SPACE=(80,(2000,500))
//ASM.SYSIN  DD *
...
