CpuA32

A32 (aka ARM32) Encoder/Decoder

This directory contains code for encoding and decoding A32 instructions.

We only support the regular 32-bit ISA, roughly what gcc/clang generate with -marm -march=armv7ve. Neither Thumb nor Thumb-2 is supported, and there are no plans to add them.

opcode_tab.py contains the necessary encoding tables, which are also used to generate C++ code.

The implementation is far more complete than what is needed by the CodeGenA32 component and might be useful by itself.

It also contains an assembler that can directly generate Elf executables, which depends on the Elf component. Otherwise, there are no dependencies.

Note: we deviate from the official assembler notation in several places.

If you are working on this on a non-A32 platform, installing Qemu will be invaluable; see ../TestQemu.

Concepts

An instruction (Ins) consists of two parts:

  1. a template describing its format, aka Opcode, which breaks down the instruction into a list of Operands.
  2. a list of integers, one for each Operand in the Opcode.
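The two parts can be pictured with a small data model. This is a minimal sketch only: the class layout and field names below are assumptions for illustration and are not the actual definitions in opcode_tab.py. The sample values are taken from the vcvt.s32.f64 decomposition shown in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opcode:
    """Template describing one instruction format (hypothetical stand-in)."""
    name: str                 # e.g. "vcvt.s32.f64"
    operand_names: List[str]  # one entry per Operand, in order


@dataclass
class Ins:
    """An instruction: an Opcode plus one integer per Operand."""
    opcode: Opcode
    operands: List[int]

    def __post_init__(self):
        # Enforce "one integer for each Operand in the Opcode".
        if len(self.operands) != len(self.opcode.operand_names):
            raise ValueError("operand count does not match opcode")


VCVT = Opcode("vcvt.s32.f64", ["PRED_28_31", "SREG_12_15_22", "DREG_0_3_5"])
ins = Ins(VCVT, [14, 23, 0])  # al, s23, d0
```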

The integers are usually just the bits used by the operands shifted all the way to the right, but for some immediate operands we process the bits further for ergonomic reasons.
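As an illustration of what "processing the bits further" can mean, consider the A32 rotated immediate used by data-processing instructions: a 12-bit field holding a 4-bit rotation and an 8-bit value. Which immediates this code base actually post-processes is not stated here, so treat the choice of example and the helper name `decode_rotated_imm` as assumptions.

```python
def decode_rotated_imm(field: int) -> int:
    """Expand a 12-bit A32 rotated immediate (rot4:imm8) to its 32-bit value."""
    imm8 = field & 0xFF
    rot = ((field >> 8) & 0xF) * 2  # rotate right by twice the 4-bit rotation
    return ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF


# The raw field 0x4ff is hard to read; the expanded value is not.
print(hex(decode_rotated_imm(0x4FF)))  # 0xff000000
```

Storing the expanded value in the operand list lets a user write the constant they mean instead of its rotation encoding.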

A list of all opcodes and their operands can be obtained by running

./opcode_tab.py 

To look up the concrete decomposition of an instruction word run

 ./disassembler_tool.py eefdbbc0

which will print

eefdbbc0 vcvt.s32.f64 s23, d0      # assembler notation (implicit `al` predicate not shown)
OPCODE vcvt.s32.f64                # opcode name 
    PRED_28_31          al (14)    # 1. operand name:PRED_28_31     symbolized_val:al  val:14 
    SREG_12_15_22       s23 (23)   # 2. operand name:SREG_12_15_22  symbolized_val:s23 val:23 
    DREG_0_3_5          d0 (0)     # 3. operand name:DREG_0_3_5     symbolized_val:d0  val:0
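The operand values above can be reproduced by hand. The sketch below assumes that the numbers in an operand name are bit positions (for example, SREG_12_15_22 combines bits 12..15 with bit 22 as the lowest bit, and DREG_0_3_5 combines bits 0..3 with bit 5 as the highest bit), which matches the A32 VFP register encoding. The helper names are hypothetical and not part of disassembler_tool.py.

```python
def bits(word: int, lo: int, hi: int) -> int:
    """Return bits lo..hi (inclusive) of word, shifted all the way to the right."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_vcvt_operands(word: int) -> list:
    """Extract [predicate, sreg, dreg] from a vcvt.s32.f64 instruction word."""
    pred = bits(word, 28, 31)
    # Single-precision register: Vd (bits 12..15) is the high part, D (bit 22) the low bit.
    sreg = (bits(word, 12, 15) << 1) | bits(word, 22, 22)
    # Double-precision register: M (bit 5) is the high bit, Vm (bits 0..3) the low part.
    dreg = (bits(word, 5, 5) << 4) | bits(word, 0, 3)
    return [pred, sreg, dreg]


print(decode_vcvt_operands(0xEEFDBBC0))  # [14, 23, 0]
```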

Opcode Names

Note that an opcode name has up to two components: a base name and, optionally, a variant suffix, separated by an underscore:

  • add_regreg (src2 is reg shifted by another reg)
  • add_regimm (src2 is reg shifted by immediate)
  • add_imm (src2 is immediate)

We re-use the "official" instruction names as base names as much as possible. The variant component is necessary for disambiguation since the official instruction names are heavily overloaded. The different addressing modes are reflected as variants as well.
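Splitting a name into its two components is straightforward. The sketch below assumes that base names never contain an underscore, so the first underscore marks the start of the variant; `split_opcode_name` is a hypothetical helper, not a function of this code base.

```python
def split_opcode_name(name: str) -> tuple:
    """Split an opcode name into (base name, variant); variant may be empty."""
    base, _, variant = name.partition("_")
    return base, variant


for name in ["add_regreg", "add_imm", "ldrb_reg_add", "vcvt.s32.f64"]:
    print(name, "->", split_opcode_name(name))
```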

Operands

An Operand represents:

  • the predicate
  • a register
  • an immediate value
  • a register mask
  • a shift direction
  • etc.

The order of the Operands roughly corresponds to the order in the assembler notation with the following exceptions:

  • the predicate is always explicit and the first operand. "al" represents the always predicate.
  • written registers precede read registers. This affects (v)str instructions, where the stored register (the "storee") is moved to the end, and (v)ldm instructions, where the register mask is moved to the front.
  • the register-list for ldm and stm is expressed as a 16bit integer immediate
  • register ranges for vldm and vstm are expressed as a start-register followed by an immediate count
  • the lr register in the bl instruction is made explicit
  • store and load instructions do not use square brackets, exclamation marks, or minus signs to indicate the various addressing modes. Instead, this information is encoded in the opcode variant.
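To make the register-list convention for ldm and stm concrete: in the A32 encoding, bit n of the 16-bit list stands for register rn. The helpers below are a hypothetical sketch of converting between the two views and are not part of this code base.

```python
def reg_list_to_mask(regs) -> int:
    """Convert register numbers (0..15) into a 16-bit ldm/stm register mask."""
    mask = 0
    for r in regs:
        if not 0 <= r <= 15:
            raise ValueError(f"bad register number: {r}")
        mask |= 1 << r
    return mask


def mask_to_reg_list(mask: int) -> list:
    """Convert a 16-bit register mask back into a sorted list of register numbers."""
    return [r for r in range(16) if mask & (1 << r)]


# {r4, r5, lr} where lr is r14
print(hex(reg_list_to_mask([4, 5, 14])))  # 0x4030
```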

Python API

opcode_tab.Disassemble() converts an int to an opcode_tab.Ins. opcode_tab.Assemble() does the inverse.

symbolic.InsSymbolize() converts an opcode_tab.Ins into a more human-friendly form. (Reminder: we deviate from the official notation.) symbolic.InsFromSymbolized() does the inverse.

Examples

Standard Notation

```
add r0, r1, r2, asr #3
ldrb lr, [ip, r3, lsl #1]
ldrsheq r6, [r4,#-26]
bl exit
```

Our Notation

```
add_regimm al r0 r1 r2 asr 3
ldrb_reg_add al lr ip r3 lsl 1
ldrsh_imm_sub eq r6 r4 26
bl al lr exit
```  

Note:

  • the predicate is the first operand
  • some Operands are implicit and do not correspond to any bits in the instruction, e.g. the bl instruction implicitly writes register lr.
  • the instruction variant is separated with a "_"
  • operands are separated by white space NOT commas.
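Because the predicate is always the first operand and operands are separated by white space, a line in our notation can be tokenized with a plain split. The following is a hypothetical sketch of that idea, not the parser used by the assembler; the predicate set is the standard list of A32 condition codes.

```python
PREDICATES = {"eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
              "hi", "ls", "ge", "lt", "gt", "le", "al"}


def parse_line(line: str) -> tuple:
    """Split a line in our notation into (opcode name, predicate, other operands)."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least an opcode name and a predicate")
    name, pred, *operands = tokens
    if pred not in PREDICATES:
        raise ValueError(f"first operand must be a predicate, got {pred!r}")
    return name, pred, operands


print(parse_line("ldrb_reg_add al lr ip r3 lsl 1"))
print(parse_line("bl al lr exit"))
```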

TODO: mention "opcode classes", which are flags attached to opcodes to quickly answer queries like: is this a store?

Implementation

The authoritative version of the encoder/decoder is written in Python using a table driven approach.

There is a C++17 version of the code, most of which is generated from the Python code. It is mostly a proof of concept and not suitable for adversarial environments. After changing the Python code, update the C++ code like so:

make opcode_gen.cc opcode_gen.h

Other language ports are encouraged to mimic the approach of generating code from the Python tables.

Testing

Run

make test

before any commit.

Use as a JIT

To materialize an Instruction for JITing you need to first identify the proper Opcode and then populate the Operands for each Field.

arm_table.py has a query feature which may be helpful here, e.g. ./arm_table.py mov will show information for all opcodes with name mov including operand fields and bit ranges.

jit_example.c gives an example for generating one specific instruction.
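The same two steps can be sketched in Python for the vcvt.s32.f64 instruction decomposed earlier in this README. This is an illustration under stated assumptions, not the approach taken in jit_example.c: the constant `VCVT_S32_F64_FIXED` was derived here by clearing the three operand fields in the example word eefdbbc0, the function name is hypothetical, and the byte order assumes a little-endian target.

```python
import struct

# Fixed (non-operand) bits of vcvt.s32.f64, derived from 0xeefdbbc0 by
# clearing bits 28..31, 12..15, 22, 0..3 and 5 (assumption, see lead-in).
VCVT_S32_F64_FIXED = 0x0EBD0BC0


def encode_vcvt_s32_f64(pred: int, sreg: int, dreg: int) -> int:
    """Populate the operand fields of vcvt.s32.f64 and return the 32-bit word."""
    if not (0 <= pred <= 15 and 0 <= sreg <= 31 and 0 <= dreg <= 31):
        raise ValueError("operand out of range")
    word = VCVT_S32_F64_FIXED
    word |= pred << 28         # PRED_28_31
    word |= (sreg >> 1) << 12  # SREG_12_15_22: high four bits
    word |= (sreg & 1) << 22   # SREG_12_15_22: low bit
    word |= dreg & 0xF         # DREG_0_3_5: low four bits
    word |= (dreg >> 4) << 5   # DREG_0_3_5: high bit
    return word


word = encode_vcvt_s32_f64(14, 23, 0)  # al, s23, d0
code = struct.pack("<I", word)         # bytes to place in the JIT buffer
print(hex(word), code.hex())           # 0xeefdbbc0 c0bbfdee
```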

Opcode coverage

The Opcodes covered by the encoder/decoder are what one would expect a C compiler to emit, i.e. basic integer and floating point instructions. The file arm_test.dis, which was generated by objdump -d, has examples for most currently supported Opcodes. Missing instructions will be supported as needed.

Cross Testing with QEMU

See ../TestQemu/README.md
