This directory contains code for encoding and decoding A32 instructions.
We only support the regular 32-bit ISA, roughly what is generated by gcc/clang using `-marm -march=armv7ve`.
Neither Thumb nor Thumb-2 is supported, and there are no plans to support them.
`opcode_tab.py` contains the necessary encoding tables, which are also used to generate C++ code.
The implementation is far more complete than what is needed by the CodeGenA32 component and might be useful by itself.
It also contains an assembler that can directly generate Elf executables, which depends on the Elf component. Otherwise, there are no dependencies.
Note: we deviate from the official assembler notation in several places.
If you are working on this on a non-A32 platform, installing Qemu will be invaluable; see ../TestQemu.
An instruction (`Ins`) consists of two parts:
- a template describing its format, aka `Opcode`, which breaks down the instruction into a list of `Operands`.
- a list of integers, one for each `Operand` in the `Opcode`.
The integers are usually just the bits used by the operands shifted all the way to the right but for some immediate operands we process the bits further for ergonomic reasons.
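The two-part structure can be pictured with a minimal Python sketch. The class names `SketchOpcode` and `SketchIns` are hypothetical and only illustrate the idea; they are not the actual classes defined in `opcode_tab.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchOpcode:
    """Hypothetical template: a name plus an ordered list of operand names."""
    name: str
    operand_names: List[str]


@dataclass
class SketchIns:
    """Hypothetical instruction: an opcode plus one integer per operand."""
    opcode: SketchOpcode
    operands: List[int]

    def __post_init__(self):
        # The list of integers must line up one-to-one with the operands.
        if len(self.operands) != len(self.opcode.operand_names):
            raise ValueError("need exactly one integer per operand")


vcvt = SketchOpcode("vcvt.s32.f64",
                    ["PRED_28_31", "SREG_12_15_22", "DREG_0_3_5"])
ins = SketchIns(vcvt, [14, 23, 0])
pairs = list(zip(vcvt.operand_names, ins.operands))
print(pairs)
```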
A list of all opcodes and their operands can be obtained by running
./opcode_tab.py
To look up the concrete decomposition of an instruction word run
./disassembler_tool.py eefdbbc0
which will print
eefdbbc0 vcvt.s32.f64 s23, d0 # assembler notation (implicit `al` predicate not shown)
OPCODE vcvt.s32.f64 # opcode name
PRED_28_31 al (14) # 1. operand name:PRED_28_31 symbolized_val:al val:14
SREG_12_15_22 s23 (23) # 2. operand name:SREG_12_15_22 symbolized_val:s23 val:23
DREG_0_3_5 d0 (0) # 3. operand name:DREG_0_3_5 symbolized_val:d0 val:0
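The operand values above can be reproduced by hand. The following is a minimal sketch, assuming that the operand names list the bit positions involved and that the split register fields combine as in the standard A32 VFP encoding (`Vd:D` for single-precision registers, `M:Vm` for double-precision ones). The helper `bits` is hypothetical and not part of `opcode_tab.py`.

```python
def bits(word, lo, hi):
    """Return bits lo..hi (inclusive) of word, shifted all the way right."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


word = 0xEEFDBBC0

# PRED_28_31: the condition field occupies the top four bits.
pred = bits(word, 28, 31)

# SREG_12_15_22: bits 12..15 form the upper part, bit 22 is the lowest bit
# (assumed Vd:D layout).
sreg = (bits(word, 12, 15) << 1) | bits(word, 22, 22)

# DREG_0_3_5: bit 5 is the top bit, bits 0..3 form the lower part
# (assumed M:Vm layout).
dreg = (bits(word, 5, 5) << 4) | bits(word, 0, 3)

print(pred, sreg, dreg)  # 14 23 0
```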
Note: the opcode name may have two components, a base name and possibly a variant suffix, separated by an underscore:
- `add_regreg` (src2 is reg shifted by another reg)
- `add_regimm` (src2 is reg shifted by immediate)
- `add_imm` (src2 is immediate)
We re-use the "official" instruction names as base names as much as possible. The variant component is necessary for disambiguation since the official instruction names are heavily overloaded. The different addressing modes are reflected as variants as well.
An `Operand` represents:
- the predicate
- a register
- an immediate value
- a register mask
- a shift direction
- etc.
The order of the `Operands` roughly corresponds to the order in the assembler notation with the following exceptions:
- the predicate is always explicit and is the first operand. "al" represents the always predicate.
- written registers precede read registers. This affects (v)str instructions, where the "storee" is moved to the end, and (v)ldm instructions, where the register mask is moved to the front.
- the register list for ldm and stm is expressed as a 16-bit integer immediate
- register ranges for vldm and vstm are expressed as a start register followed by an immediate count
- the lr register in the bl instruction is made explicit
- store and load instructions do not use square brackets, exclamation marks, or minus signs to indicate the various addressing modes. Instead this information is encoded in the opcode variant.
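As an illustration of the register-list immediate, here is a minimal sketch that assumes the usual A32 ldm/stm layout, where bit i of the 16-bit mask stands for register ri. The helper names are hypothetical and not part of this directory's code.

```python
def reg_list_to_mask(regs):
    """Turn register numbers (0..15) into a 16-bit mask, bit i for ri."""
    mask = 0
    for r in regs:
        if not 0 <= r <= 15:
            raise ValueError(f"bad register number: {r}")
        mask |= 1 << r
    return mask


def mask_to_reg_list(mask):
    """Inverse of reg_list_to_mask: list the register numbers set in mask."""
    return [r for r in range(16) if mask & (1 << r)]


# {r4, r5, lr} where lr is r14
mask = reg_list_to_mask([4, 5, 14])
print(hex(mask))               # 0x4030
print(mask_to_reg_list(mask))  # [4, 5, 14]
```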
`opcode_tab.Assemble()` converts an int to an `opcode_tab.Ins`.
`opcode_tab.Disasemble()` does the inverse.
`symbolic.InsSymbolize()` converts an `opcode_tab.Ins` into a more human-friendly form. (Reminder: we deviate from the official notation.)
`symbolic.InsFromSybolized()` does the inverse.
Standard Notation
```
add r0, r1, r2, asr #3
ldrb lr, [ip, r3, lsl #1]
ldrsheq r6, [r4,#-26]
bl exit
```
Our Notation
```
add_regimm al r0 r1 r2 asr 3
ldrb_reg_add al lr ip r3 lsl 1
ldrsh_imm_sub eq r6 r4 26
bl al lr exit
```
Note:
- the predicate is the first operand
- some `Operands` are implicit and do not correspond to any bits in the instruction, e.g. the `bl` instruction implicitly writes register `lr`.
- the instruction variant is separated with a "_"
- operands are separated by white space NOT commas.
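The rules above make a line in our notation easy to take apart. The following is a minimal sketch, assuming that base names never contain an underscore; `split_line` is a hypothetical helper and not the parser in `symbolic`.

```python
def split_line(line):
    """Split a line in our notation into (base, variant, predicate, operands)."""
    tokens = line.split()  # operands are separated by white space
    name, pred, operands = tokens[0], tokens[1], tokens[2:]
    # The variant, if any, follows the first underscore (assumption:
    # base names themselves contain no underscore).
    base, _, variant = name.partition("_")
    return base, variant, pred, operands


print(split_line("ldrb_reg_add al lr ip r3 lsl 1"))
print(split_line("bl al lr exit"))
```

The first call yields the base name `ldrb` with variant `reg_add`; the second has an empty variant because `bl` has no suffix.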
TODO: mention "opcode classes" which are flags attached to opcode to quickly answer queries like: is this a store?
The authoritative version of the encoder/decoder is written in Python using a table driven approach.
There is a C++17 version of the code which is mostly generated from the Python code. It is mostly a proof of concept and not suitable for adversarial environments. Whenever the Python code changes, the C++ code can be regenerated like so:
make opcode_gen.cc opcode_gen.h
Other language ports are encouraged to mimic the approach of generating code from the Python tables.
Run `make test` before any commit.
To materialize an Instruction for JITing, you first need to identify the proper Opcode and then populate the Operands for each Field.
`arm_table.py` has a query feature which may be helpful here, e.g. `./arm_table.py mov` will show information for all opcodes with name `mov`, including operand fields and bit ranges.
`jit_example.c` gives an example for generating one specific instruction.
The Opcodes covered by the encoder/decoder are what one would expect to see a C compiler emit, i.e. basic integer and floating point instructions.
The file `arm_test.dis`, which was generated by `objdump -d`, has examples for most currently supported Opcodes.
Missing instructions will be supported as needed.
ISA
- arm.com https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf
- asmdb https://github.com/asmjit/asmdb/blob/master/armdata.js
- asmgrid https://asmjit.com/asmgrid/
- PeachPy https://github.com/Maratyszcza/PeachPy
- https://godbolt.org/ check target code generated by various compilers
- online emulator https://cpulator.01xz.net/
- online instruction decoder http://armconverter.com/
- crc32 instruction CRC32C https://developer.arm.com/docs/ddi0597/f/base-instructions-alphabetic-order/crc32c-crc32c#a1
Linking
OS