The current integer instruction rules may open an undesirable path to an optimization.
Let's focus on the integer instructions reading from memory (IADD_M, ISUB_M, IMUL_M, IMULH_M, ISMULH_M, IXOR_M).
For all these instructions, a rule states that if the src register is the same as the dst register, the src operand is replaced by the immediate value zero.
This means that, in a generated program, every such instruction with src==dst always reads from the same memory location in scratchpad L3, e.g.:
./randomx-codegen --genNative --nonce $RANDOM|grep "_M.*\L3[[0-9]"
IADD_M r5, L3[363232]
ISUB_M r4, L3[1756640]
IMUL_M r2, L3[2014304]
ISUB_M r5, L3[1946240]
The content of these memory cells changes very rarely during the program execution loop. It could be affected when the integer registers are dumped to the scratchpad at address spAddr1 or by an ISTORE write (which is limited to L1 & L2 14/16 of the time).
To simplify the handling of memory writes in the proposed optimization model, let's limit the optimization to the 7/8 of cases where imm32 > RANDOMX_SCRATCHPAD_L2, where fewer clashes will occur.
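As a rough illustration of why the address is constant and where the 7/8 figure comes from, here is a minimal sketch. It assumes the default scratchpad sizes (2 MiB for L3, 256 KiB for L2) and an 8-byte-aligned L3 mask; the function names are hypothetical and not taken from the RandomX code base.

```python
from fractions import Fraction

# Assumed default sizes in bytes (hypothetical constant names).
SCRATCHPAD_L3 = 2 * 1024 * 1024   # 2 MiB
SCRATCHPAD_L2 = 256 * 1024        # 256 KiB
L3_MASK = (SCRATCHPAD_L3 - 1) & ~7  # keep addresses 8-byte aligned


def constant_l3_address(imm32):
    """Address read by a *_M instruction when src == dst (src is taken as 0)."""
    return (0 + imm32) & L3_MASK


def is_candidate(imm32):
    """True when the constant address lies beyond the L2 region."""
    return constant_l3_address(imm32) > SCRATCHPAD_L2


def fraction_above_l2():
    """Share of the L3 address space that lies beyond the L2 region."""
    return Fraction(SCRATCHPAD_L3 - SCRATCHPAD_L2, SCRATCHPAD_L3)
```

With these assumed sizes, `fraction_above_l2()` evaluates to 7/8, and the address depends only on imm32, never on a register value.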
Let's be clear: we're talking about a possible optimization that people could make in their own implementation, through compatible changes to the VM and the use of extra CPU registers. It is not an optimization desired by the spec, and we'll discuss how such optimizations can be prevented.
To avoid these L3 memory lookups, which happen on average three times per random program, one could pre-fetch these 8-byte values when the program is generated (when the random buffer is interpreted as a program and "opcodes" are decoded into VM instructions) and store them in some unused CPU registers such as the MMX or xmm16+ AVX-512 registers (let's call them X0-X3, new VM registers). The original (0+)imm32 values used as addresses are stored as well (let's call them Y0-Y3).
Then one replaces the *_M instruction with a *_R8 equivalent (a new VM instruction), using the 8-byte pre-fetched value stored in one of these extra X0-X3 registers.
In this example we limit ourselves to the optimization of the first 4 such instructions as we're using only 4 X and 4 Y registers.
For the first instruction of the example given above, this gives:
IADD_M r5, L3[363232]
=>
IADD_R8 r5, X0 (with X0 initialized with L3[363232] during program generation)
The L3 lookup is then avoided in each of the 2048 iterations of the program.
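The rewrite step described above could be sketched as follows. This is only an illustration under stated assumptions: the `Instr` record, the `rewrite` function, the dictionary-based scratchpad and the `_R8` naming are all hypothetical, and `addr` is assumed to hold the already-masked imm32.

```python
from dataclasses import dataclass

MEM_OPS = {"IADD_M", "ISUB_M", "IMUL_M", "IMULH_M", "ISMULH_M", "IXOR_M"}
SCRATCHPAD_L2 = 256 * 1024  # assumed default size in bytes
NUM_EXTRA = 4               # X0-X3 / Y0-Y3


@dataclass
class Instr:
    op: str
    dst: int
    src: int   # register index; for *_R8 it is the index of the X register
    addr: int  # masked imm32 (only meaningful for *_M)


def rewrite(program, scratchpad):
    """Replace the first NUM_EXTRA constant-address *_M reads with *_R8."""
    x_regs, y_regs, out = [], [], []
    for ins in program:
        constant_l3 = ins.op in MEM_OPS and ins.src == ins.dst
        if constant_l3 and ins.addr > SCRATCHPAD_L2 and len(x_regs) < NUM_EXTRA:
            slot = len(x_regs)
            x_regs.append(scratchpad.get(ins.addr, 0))  # pre-fetched value
            y_regs.append(ins.addr)                     # remembered address
            out.append(Instr(ins.op[:-2] + "_R8", ins.dst, slot, ins.addr))
        else:
            out.append(ins)
    return out, x_regs, y_regs
```

Any further constant-address instructions beyond the first four are left untouched and keep reading from the scratchpad.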
To handle the rare cases where the pre-fetched values must be refreshed, one can modify the VM's handling of ISTORE writes to L3 so that the written address is compared with Y0-Y3, and do the same in the code that dumps the integer registers to the scratchpad after each loop iteration.
E.g. the ISTORE could be seen as this pseudo-code:
ISTORE: [mem] = src
=>
[mem] = src // mem[dst + imm32]=src
if dst + imm32 == Y0:
    X0 = src
if dst + imm32 == Y1:
    X1 = src
...
Alternatively, one could skip the ISTORE checks and accept the probability that a few mining attempts will be corrupted.
The best strategy and the overall gain remain to be investigated.
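For reference, the ISTORE refresh check could look like the sketch below. It is a hedged illustration only: the `istore` function, the dictionary-based scratchpad and the `x_regs`/`y_regs` lists (standing for X0-X3 and Y0-Y3) are hypothetical, and the address is assumed to be already computed and masked by the caller. The register dump at the end of each loop iteration would need an equivalent check.

```python
def istore(scratchpad, x_regs, y_regs, address, value):
    """Write value to the scratchpad and refresh any matching cached copy."""
    scratchpad[address] = value
    for slot, cached_address in enumerate(y_regs):
        if cached_address == address:
            x_regs[slot] = value  # keep the pre-fetched value coherent
```

The loop costs up to four comparisons per L3 store, which is the overhead to weigh against the saved L3 reads.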
Our recommendation:
Change the way src==dst cases are handled in the *_M instructions.
Instead of setting src=0, have some rule to deterministically choose another register. E.g. just pick the next one (modulo 8) or use a more complex rule to keep a more uniform distribution.
Actually, we don't see any security problem with allowing src==dst in the *_M instructions. Was this rule introduced because of x86 instruction limitations?
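A minimal sketch of the "next register (modulo 8)" rule, with a small helper showing why it skews the distribution. The function names are hypothetical, and the raw src index is assumed to be uniform over the 8 integer registers.

```python
from collections import Counter


def select_src(src, dst, num_regs=8):
    """Pick the next register (modulo num_regs) when src collides with dst."""
    return (dst + 1) % num_regs if src == dst else src


def src_distribution(dst, num_regs=8):
    """Count how often each register ends up as src for a given dst."""
    return Counter(select_src(s, dst, num_regs) for s in range(num_regs))
```

Under that assumption, for a given dst the following register is selected twice as often as any other and dst itself is never selected, which is the non-uniformity a more complex rule would aim to remove.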