- Overview
- Description
- Tests
- Parameters
- Configuration file
- Output
- Usage
- See also
- source code: codes/s370_perf.asm
s370_perf determines the time per instruction of IBM System/370 instructions in 24-bit mode. It
- covers almost all non-privileged instructions
- tests load and store instructions for aligned and unaligned data
- tests instructions with length fields (e.g.
MVC
) for several length - tests decimal instructions (e.g.
AP
) for two digit counts - tests conditional branches for taken and fall-through case
- tests branches (e.g.
BC
,BAL
) for short (same 4k page) and far (different page) targets - tests instructions with memory interlock (
CS
,CDS
,TS
) in the typical lock taken and the lock missed data configurations
The code was developed and tested using the Hercules emulator and MVS 3.8J as packaged with tk4-. It should work for CPU speeds from below 1 to well above 1000 MIPS, thus run without modifications on a wide range of platforms
- older hardware implementations, like P/390 boards
- Hercules on slow systems, like Raspberry Pi
- Hercules on contemporary PC processors
- and even contemporary z Systems
s370_perf contains about 220 test routines, each targeting one S/370 instruction, plus about 80 additional tests to verify the consistency of the measured instruction times. The core of an instruction test looks like
T100L LR R2,R1
LR R2,R1
LR R2,R1
... total of 100 repeats ...
BCTR R15,R11
The instruction under test is repeated, usually 50 times for fast instructions,
and that sequence is wrapped in BCTR
loop. The instruction repeat count is
called group count and listed in the ig
column of the
output.
The repeat count of the BCTR
loop is called local repeat count and listed
in the lr
column of the output.
The default values are chosen such that all
test routines have roughly the same execution time of about 5 msec on a
reference system. They can be changed via a
configuration file.
The s370_perf main program executes each test GMUL
times. This
global multiplier is common for all tests and typically chosen such
that a test runs about one second for a benchmark run, either explicitly via
/Gnnn or automatically via
/GAUT option.
In some cases a register used in the inner loop must be re-initialized for each loop iteration to avoid arithmetic overflows, like
T220L LA R2,1
SLA R2,1
SLA R2,1
SLA R2,1
... total of 30 repeats ...
BCTR R15,R11
The loop overhead will be subtracted later by the analysis tool
s370_perf_ana, based on the
loop type
listed in the lt
column of the output.
Some instructions modify registers or memory such that a setup is needed for
each invocation of this instruction, e.g. ED
requires a MVC
to setup the
edit pattern which is overwritten by the edit result. In those cases the test
loop looks like
T410L MVC 0(10,R3),T410V3
ED 0(10,R3),T410V1+3
MVC 0(10,R3),T410V3
ED 0(10,R3),T410V1+3
... total of 10 repeats sequence ...
BCTR R15,R11
In those cases the test gives the time for the instruction sequence. The
time for the targeted instruction, ED
in the example, is again determined
by the analysis tool s370_perf_ana by subtracting the
independently measured instruction time(s) of the additional instructions,
MVC
in the example.
Last but not least allows the s370_perf main program to enable or disable the execution of tests via the /Ennn, /Dnnn and /Tnnn options. Almost all of the instruction tests are enabled by default, most of the auxiliary tests T9xx are disabled by default. The configuration of all available tests can be listed with the /OPTT option.
Each test has a unique identifier, usually called tag, of the form Tddd
.
The tests are grouped into classes
- Test 1xx -- load/store/move
- Test 2xx -- binary/logical
- Test 3xx -- flow control
- Test 4xx -- packed/decimal
- Test 5xx -- floating point
- Test 6xx -- miscellaneous instructions
- Test 7xx -- mix sequence
- Test 9xx -- auxiliary tests
Most tests are self-explanatory and target a single instruction, but some deserve some commentary
- unaligned memory access
- T15x - MVC
- T17x - MVCL
- T25x - TRT
- T27x - CLC
- T28x - CLCL
- T29x - CD+CDS
- T301+T302 - BC branch taken / not taken
- far page branches
- T311,T312,T315 - BCT,BCTR,BXLE
- T320+T321 - BALR close and far
- T330 - BALR;SAVE;RETURN
- T42x - AP+SP+MP+DP
- T61x - EX
- T62x - TS
- T7xx - mix sequences
- T9xx - auxiliary tests
s370_perf has several tests addressing unaligned memory access
Test | Description | hword | word | dword | Comment |
---|---|---|---|---|---|
T103 | L R,m (unal) | - | yes | - | cross word border |
T105 | LH R,m (unal3) | yes | yes | - | cross word border |
T111 | ST R,m (unal) | - | yes | - | cross word border |
T113 | STH R,m (unal1) | yes | no | - | cross half-word border |
T114 | STH R,m (unal3) | yes | yes | - | cross word border |
T502 | LE R,m (unal) | - | yes | - | cross word border |
T509 | STE R,m (unal) | - | yes | - | cross word border |
T532 | LD R,m (unal) | - | - | yes | cross double word border |
T539 | STD R,m (unal) | - | - | yes | cross double word border |
In test T113 STH
does a write across a halfword border, while in test
T114 STH
does a write across a word border. In T114 the access can even
cross a page border, so the two cases might exhibit quite different
performance characteristics.
The MVC
instruction is tested for a wide range of transfer sizes between
5 and 250 characters, and also for two scenarios with overlapping
source and destination areas:
- in T156 the destination buffer is offset by + 1 byte to the source buffer. This is sometimes used to fill buffer with a character.
- in T157 the destination buffer is offset by -24 bytes to the source buffer, effectively shifting the buffer 24 bytes to the left.
The MVCL
instruction is tested, like MVC
in T15x,
for a wide range of copy transfer sizes between 10 and 4096 bytes,
for three zero-fill padding cases, and also for two scenarios with overlapping
source and destination areas:
- in T178 the destination buffer is offset by + 1 byte to the source buffer.
Like T156 for
MVC
, can be used to fill an area, but padding is likely more efficient. - in T179 the destination buffer is offset by -100 bytes to the source buffer, effectively shifting the buffer 100 bytes to the left.
The TRT
instruction is tested for different operand sizes and
function tables
Test | Description | size | zeros | reads |
---|---|---|---|---|
T255 | TRT m,m (10c,zero) | 10 | 10 | 10 |
T256 | TRT m,m (100c,zero) | 100 | 100 | 100 |
T257 | TRT m,m (250c,zero) | 250 | 250 | 250 |
T258 | TRT m,m (250c,10b) | 250 | 10 | 11 |
T259 | TRT m,m (250c,100b) | 250 | 100 | 101 |
In tests T255-T257 the function table lookup is always zero, so all input bytes are checked. In tests T258 and T259 the function table is setup such that the first 10 or 100 lookups are zero, respectively, and the 11th or 101th is non-zero. The number of operand byte and function table reads is indicated in the table above.
The CLC
instruction is tested for a range of buffer sizes (10 to 250)
and also for fully matching eq
and completely different ne
buffers.
Because the ne
case can be detected at the very first byte comparison
it's natural to expect that the ne
tests have the same instruction time
for all sizes, while the eq
tests show a time which increases with
buffer size.
The CLCL
instruction is tested for two buffer sizes (10 and 4096)
and different locations of the first non-matching byte (10, 100, 250,
1024 and 4096). Like for CLC
in T27x it
is natural to assume that the instruction time mainly depends on the
number of bytes to test before a mismatch is detected. The tests
T284 and T285 are disabled by default because they are very slow on
Hercules.
The CD
and CDS
instructions implement the
compare-and-swap paradigm,
in short
opcode: CS R1,R3,D2(B2) or CS OP1,OP3,OP2
action: if (OP1==OP2) then OP2 := OP3 else OP1 := OP2
In a multi-CPU configuration this involves interlocked memory updates and access serialization. For different implementations the overhead for interlock and serialization can vary strongly with the data pattern, therefore three cases are tested
Test | Description | Comment |
---|---|---|
T290 | CS R,R,m (eq,eq) | OP1==OP2 && OP3==OP1 |
T291 | CS R,R,m (eq,ne) | OP1==OP2 && OP3!=OP1 |
T292 | CS R,R,m (ne) | OP1!=OP2 |
Likewise T295-T297 for CDS
.
The OP1!=OP2
is considered the lock missed case.
The BC
instruction is tested in both the
- branch not taken (T301)
- branch taken (T302) case. The later is implemented as branch maze. In most implementations the branch taken case will have a significantly larger instruction time the the not taken (or fall through) case.
The basic branch instruction tests, like T301, use a branch target located in the same page as the branch instruction. Several tests address the case where branch instruction and branch target are located in different pages, where the branch crosses a page border
- T303 is similar to T302, tests
BC (far)
- T305 is similar to T304, tests
BR (far)
- T321 is similar to T320, tests
BALR (far)
- T323 is similar to T322, tests
BAL (far)
The loop instructions BCT
, BCTR
and BXLE
are tested with empty
loop bodies, like
T315L LA R3,0 index begin
LA R4,1 index increment
LA R5,99 index end
T315LL EQU * no inner loop body
BXLE R3,R4,T315LL will be executed 100 times
As in real applications is the branch taken case much more frequent than the fall through case at end of loop.
The BALR
instruction can only be tested together with a BR
. T321 is
setup such that each branch crosses a page border.
This test covers the standard MVS calling sequence, starting with a L
and
BALR
on the caller side and standard save area handling with a full
(14,12)
save and restore and save area linkage update at the callee
side. The test returns the time for the full sequence of 11 instructions.
The decimal packed arithmetic instructions are tested with two number sizes, 10 digits and 30 digits. The tests with 30 digit numbers not only involve larger operands, but also values with a higher number of significant digits. The available tests are
Instruction | 10d test | 30d test |
---|---|---|
AP | T420 | T421 |
SP | T422 | T423 |
MP | T424 | T425 |
DP | T426 | T427 |
The EX
instruction can only be tested together with another instruction
being modified and executed by EX
. Two tests are provided
- T610 -
EX
withTM m,i
- T611 -
EX
withXI m,i
The TS
instruction implements the
test-and-set paradigm.
In a multi-CPU configuration this involves interlocked memory updates and
access serialization. For different implementations the overhead for interlock
and serialization can vary strongly with the data pattern, therefore both
cases are tested
Test | Description | Comment |
---|---|---|
T620 | TS m (zero) | lock taken case |
T621 | TS m (ones) | lock missed case |
The goal of all previous tests is to determine the time of single instruction, usually done by repeating the instruction under test many times. Real workloads of course have instruction sequences with different instructions. Four simple tests are provided to test non-trivial instruction sequences
- T700 - sequence of RR type instructions
- T701 - sequence of RX type instructions
- T702 - like T701, but code+data in different pages
- T703 - similar to T700, but non-optimizable
See auxiliary tests for a more detailed tests on the additivity of instruction times.
The test T700 contains a sequence of 38 integer RR type instructions plus
two BC
where the branch isn't taken. The test returns the average
execution time of the involved instructions. This test allows to check whether
the instruction times are additive on a given system, simply compare the
T700 time with the average of the involved instructions.
See T95x tests for an in-depth study of this
instruction sequence.
Similar goal as T700, using a sequence of 21 integer RX type instructions. In T701 the accessed operands are in the same page as the code, while in T702 the accessed operands are in a different page than the code.
Similar goal as T700, now with an instruction sequence where each calculated values is used. This prevents that emulators using an optimizing binary translator will remove part of the code.
The tests T90x, T92x and T95x allow to test whether the instruction times are additive, or in other words, whether the time for a sequence of instructions is the sum of the measured instruction times. The tests report the time for a whole instruction sequence and are best analysed with the raw view generated by s370_perf_ana with the -raw option.
The tests T900 to T915 are similar to the T100 test, but use different repeat
counts of the LR R,R
instruction, with ig
ranging from 1 to 72 (T100
uses 100). These tests report the time for the bundle ig
LR
instructions
and not time for a single instruction as T100. The measured time should
increase in proportion to the ig
count if instruction times are additive,
so this can be used to check whether the loop overhead is subtracted correctly.
Similar goal as T90x, using L R,m
. To be compared
with T102.
The tests T952 to T990 contain the first 2,3,...,40 instructions of the T700 test, they are therefore truncated versions of T700. The tests report the time for whole sequence and not the average as T700. The measured time should continuously increase if instruction times are additive, so this sequence of tests can be used to check whether this is actually true for a given system.
The run time behavior of s370_perf
is controlled by options passed with
the JCL EXEC
card PARM
mechanism. The PARM
string is a list of 4 letter
options, each starting with a /
. Valid options are:
Option | Description |
---|---|
/OWTO | enable step by step MVS console messages |
/ODBG | enable debug trace output for test steps |
/OTGA | enable debug trace output for /GAUT processing |
/OPCF | print configuration file entries |
/OPTT | print test descriptor table |
/ORIP | run tests in place (default is relocate) |
/GAUT | automatic determination of GMUL , aim is 1 sec per test |
/Gnnn | set GMUL to nnn |
/GnnK | set GMUL to nn * 1000 |
/Cnnn | select test used for /GAUT |
/Ennn | enable test Tnnn |
/Dnnn | disable test Tnnn |
/Tnnn | select test Tnnn |
/TCOR | select tests required for corrections |
Enables step by step MVS console messages, send with WTO
as 'job status'
message to the operator console. This might be useful to see the s370_perf
run steps in the context of all other system activities in the console log,
but on systems with a real operator this might not be too welcome. The
messages are send at the end of each test step and look like
13.49.45 JOB 5226 +s370_perf: done T100
13.49.46 JOB 5226 +s370_perf: done T101
13.49.47 JOB 5226 +s370_perf: done T102
Enable debug trace output for test steps. Gives the start and stop time
of the test step as retrieved with STCK
and all other information to
double check the calculation of the instruction timing.
Enable debug trace output for /GAUT processing in the format
-- GAUT: 1 : D3BCC375 081DF080 : D3BCC375 098220C0 : 5699
-- GAUT: 3 : D3BCC375 09832880 : D3BCC375 0E14E0C1 : 18716
-- GAUT: 9 : D3BCC375 0E166081 : D3BCC375 1934C0C1 : 45542
-- GAUT: 27 : D3BCC375 193620C1 : D3BCC375 3A9848C1 : 136738
-- GAUT: 81 : D3BCC375 3A9A2841 : D3BCC375 9EDCF041 : 410669
PERF002I run with GMUL= 197
The 1st number gives the current GMUL
, the next two the time retrieved with
STCK
before and after an execution of the T102 test, the last difference
divided by 4096, which is the elapsed time in units usec.
Prints configuration file entries (comments are skipped) in the format
PERF010I config: T151 1 2000
PERF010I config: T152 1 4000
PERF010I config: T154 1 4000
...
Prints the test descriptor table in the format
ind tag lr ig lt addr length
0 T100 22000 100 1 000A8D48 252
1 T101 17000 100 1 000A8E48 452
2 T102 13000 50 1 000A9010 256
...
with the columns containing
column | description |
---|---|
ind | table entry index |
tag | name of the test. A disabled test is prefixed with - |
lr | local repeat count, the loop count for the BCTR loop of the test |
ig | group count, the number time the instruction under test is replicated in the body of the test loop |
lt | loop type, indicates the additional instructions used to close the loop around the instruction under test |
addr | absolute address of the beginning of the test code |
length | length of the test code |
Run tests in place. The default is to relocate each test before execution into
a page aligned 8 kByte buffer. This ensures that test don't have branches
across page boundaries, unless explicitly wanted. /ORIP
disables this
relocation and executes the code in the place where the assembler generated
it. Useful for debugging tests with break after relocation.
Enables automatic determination of GMUL
. This will setup the global repeat
count such that the T102
test or the one selected with a
/Cnnn option
runs about 1 sec. Because the local repeat count of
each test have been tuned to get about equal CPU time for all tests on the
reference system this will result in about 1 sec CPU time for all tests
for systems with similar characteristics.
Typical GMUL
values resulting from /GAUT
are
Host CPU | System | GMUL | Comment |
---|---|---|---|
ARMv5te | Herc tk3 | 4 | Pogoplug v2 |
P/390 | 8 | CPU board | |
ARMv7 - BCM2835 | Herc tk4- 08 | 11 | Raspberry Pi 2 Model B |
AMD Opteron 6238 | Herc tk4- 08 | 100 | older 2*12 core server |
Intel Core2 Duo E8400 | Herc tk4- 09 rc2 | 118 | older desktop CPU |
Intel Xeon E5-1620 | Herc tk4- 08 | 195 | typical 4 core workstation |
Intel Core i7-3520M | Herc tk4- 08 | 202 | mid-end notebook |
With /Gnnn
or GnnK
, where n are decimal digits, the global multiplier
GMUL
will be set to nnn or nn*1000, respectively. Helpful for debugging,
normal production runs usually use /GAUT.
Selects the test used by /GAUT. The three digit
code must match one of the test numbers. By default T102
is used.
Allow to enable or disable the test Tnnn.
The three characters after the leading /T
or /D
can be either a number
0
- 9
or a wildcard character *
, which will match any number in that
position. This allows to handle groups of tests, e.g. /E1**
will enable all
tests in the 100 to 199 range, /D5**
will disable all test in the range
500 to 599 (the floating point group).
By default all tests are enabled with the exception of T284 and T285
(very slow CLCL
tests).
Can be used to disable tests which cause problems.
To setup a run with only a few tests use the
/Tnnn option.
When the first /Tnnn
option is detected in the PARM list all tests will
be disabled.
Each /Tnnn
re-enables than the test Tnnn. This allows to setup a run with
only a few tests enabled. Wildcards are supported as described for
/Ennn.
Inspects all enabled tests and enables all tests required by s370_perf_ana for corrections and normalizations:
- all tests required by loop overhead corrections
- T100 and T102 used in normalized instruction times
/TCOR
is handled as last step of test enable/disable processing, after
the configuration file and all
/Ennn,Dnnn and
/Tnnn options.
The local repeat counts for each test have been adjusted such that all
tests consume roughly the same CPU time on a reference system, a Hercules
emulator running on an up-to-date Intel CPU. For very different
environments, e.g. a z/PDT emulator or real hardware like a P/390 system,
the relative CPU consumption can be very different. In those cases the
local repeat counts can be redefined with a configuration file read from
SYSIN
in the format
#nnn e lrcnt
T151 1 2000
T152 1 4000
T154 1 4000
Lines starting with #
are considered comments and are ignored.
Each line holds
- a four character task name, like
T154
. No wildcards supported here. - an enable flag,
0
or1
, which overrides the test enable status - a new local repeat count for this test. If 0 is specified the old one is kept.
Note that the fields are strictly positional, the enable must be in column 9, the local repeat count right justified in columns 12 to 18. It is thus advisable to have a comment line as shown in the example above. The processing of the configuration file can be monitored with the /OPCF option, the final settings can be inspected with the /OPTT option.
The output of s370_perf
is a table of test step results in the form
PERF001I PARM: /GAUT
PERF002I run with GMUL= 118
PERF003I start with tests
tag description : test(s) lr ig lt : inst(usec)
T100 LR R,R : 0.818643 22000 100 1 : 0.003153
T101 LA R,n : 0.800819 17000 100 1 : 0.003992
T102 L R,m : 0.991196 13000 50 1 : 0.012923
T103 L R,m (unal) : 1.078041 12000 50 1 : 0.015227
...
PERF004I done with tests
with the columns containing
column | description |
---|---|
tag | name of the test |
description | instruction under test and conditions |
test(s) | execution time of this test in sec |
lr | local repeat count, the loop count for the BCTR loop of the test |
ig | group count, the number time the instruction under test is replicated in the body of the test loop |
lt | loop type, indicates the additional instructions used to close the loop around the instruction under test, see section Loop Types. |
inst(usec) | time per instruction in usec |
Notes on the given instruction time:
- the loop overhead is not subtracted, that will be done in post-processing with s370_perf_ana. However, the loop overhead is typically a few % only, so the numbers are a good quick estimate.
- for instructions which can only be tested in context, like
BALR
orMVCL
, the time is for the whole bundle, which is described in the description field as a;
separated list likeBALR R,R; BR R
.
Most tests contain only the replicated instruction under test and a closing
BCTR
. In some cases additional initialization is needed for each inner
loop iteration. The current code uses the following loop types
lt | Loop instructions | Comment |
---|---|---|
0 | used for testing BCTR and BCR |
|
1 | BCTR | used for most tests |
2 | BCT | tests with 8k code |
3 | LR, BCTR | |
4 | LA, BCTR | |
5 | LA, XR, BCTR | |
6 | LA, LA, LA, BCTR | |
7 | MVC (5c), BCTR | |
8 | MVC (15c), BCTR | |
9 | LE, BCTR | |
10 | LD, BCTR | |
11 | LD, LD, BCTR |
Job templates to be used with hercjis are provided in the codes directory and described in the README. A typical benchmark run with 30 jobs is started like
cd <codes-directory>
hercjis -r 30 s370_perf_ff.JES
If s370_perf is run without using the packed job, keep in mind that
s370_perf is a fairly large assembler module, currently 6800+ lines of code
with a lot of macro generated code. Both assembler nor linkage editor
fail under MVS 3.8J and the defaults of the ASMFCLG
procedure as
provided in tk4-. The assembler needs
increased allocations of the work files, and runs substantially faster
with BUFSIZE(MAX)
. The linkage editor needs an increased work area
like SIZE=(512000,122880)
. A well working JCL example is
//CLG EXEC ASMFCLG,
// MAC1='SYS2.MACLIB',
// PARM.ASM='NOLIST,NOXREF,NORLD,NODECK,LOAD,BUFSIZE(MAX)',
// PARM.LKED='MAP,LIST,LET,NCAL,SIZE=(512000,122880)',
// COND.LKED=(8,LE,ASM),
// PARM.GO='/GAUT/E9**',
// COND.GO=((8,LE,ASM),(4,LT,LKED))
//ASM.SYSUT1 DD DSN=&&SYSUT1,UNIT=SYSDA,SPACE=(1700,(600,100))
//ASM.SYSUT2 DD DSN=&&SYSUT2,UNIT=SYSDA,SPACE=(1700,(900,200))
//ASM.SYSUT3 DD DSN=&&SYSUT3,UNIT=SYSDA,SPACE=(1700,(900,200))
//ASM.SYSGO DD DSN=&&OBJSET,UNIT=SYSDA,SPACE=(80,(2000,500))
//ASM.SYSIN DD *
...
- s370_perf_ana - analyze s370_perf data
- s370_perf_sum - summarize s370_perf data
- s370_perf_sort - generate a sorted s370_perf data listing
- s370_perf_mark - derive MIPS ratings from s370_perf data