August is an assembler written from scratch in Ink for me to learn about assemblers, linkers, executable file formats, and compiler backends. It currently supports assembling and linking (in a single step) x86_64 ELF binaries for Linux, and might in the future support ELF executables for ARM, RISC-V, and x86 architectures. In the far long term, August might also become a code generation backend for a compiler written in Ink for some small subset of C if I feel adventurous. But for now, August is an educational project that assembles a subset of x86_64 to a Linux ELF binary.
August currently supports the following features:
- A good portable subset of the integer x86_64 instruction set
- Support for arguments as immediates, registers, and labels
- Embedded read-only data segments
- Symbol tables for debugging and disassembly
You can see some example assembly code that August can assemble and link under
August provides a CLI, `./src/cli.ink`, that currently takes a single assembly program and emits a single statically-linked x86_64 ELF executable. Under the hood, August reads the assembly program, parses it into a simple representation of symbols and sections in the source, assembles it into machine code, and links it all together with a minimal ELF linker.
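To make the "parse into symbols and sections" step concrete, here is a minimal sketch in Python (not Ink, and not August's actual parser). It assumes the syntax visible in the Hello World example in this README: `;` starts a comment, `section NAME` switches sections, and a token ending in `:` defines a label. The names `tokenize` and `parse` are hypothetical, and defaulting to `.text` before any `section` line is an assumption.

```python
import shlex


def tokenize(line):
    """Split one source line into tokens, dropping `;` comments.

    Quoted strings stay together as a single token with the quotes removed.
    """
    lexer = shlex.shlex(line, posix=True)
    lexer.commenters = ";"        # treat `;` as the comment character
    lexer.whitespace_split = True  # split on whitespace only
    return list(lexer)


def parse(source):
    """Return (sections, labels).

    sections: section name -> list of token lists, in source order
    labels:   label name -> (section name, index of the next item in it)
    """
    sections = {".text": []}  # assumption: .text is the default section
    labels = {}
    current = ".text"
    for line in source.splitlines():
        tokens = tokenize(line)
        if not tokens:
            continue  # blank or comment-only line
        if tokens[0] == "section":
            current = tokens[1]
            sections.setdefault(current, [])
            continue
        if tokens[0].endswith(":"):
            # A label points at whatever item comes next in this section.
            labels[tokens[0][:-1]] = (current, len(sections[current]))
            tokens = tokens[1:]
            if not tokens:
                continue
        sections[current].append(tokens)
    return sections, labels
```

Calling `parse` on the Hello World source would give one token list per instruction or data directive, grouped by section, plus a label table that a later pass can turn into addresses.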
At the moment, the assembler and linker are pretty tightly integrated. The ELF linker assumes that only two sections are used, `.text` and `.rodata`, and the assembler generates code with that assumption. The virtual address table for the generated executable is also currently hard-coded into the linker and relied on by the assembler when resolving symbols.
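As a rough illustration of what resolving symbols against a hard-coded address table can look like, here is a small Python sketch. It is not August's code: the `SECTION_BASE` table and `resolve` function are hypothetical names, and the two base addresses are simply the ones that appear in the `objdump` output in this README (`0x401000` for code, `0x6b5000` for the string data).

```python
# Assumed fixed load addresses, taken from the disassembly in this README.
SECTION_BASE = {
    ".text": 0x401000,
    ".rodata": 0x6B5000,
}


def resolve(symbols):
    """Map each symbol to an absolute virtual address.

    symbols: name -> (section name, byte offset within that section)
    """
    return {
        name: SECTION_BASE[section] + offset
        for name, (section, offset) in symbols.items()
    }
```

With fixed bases like these, the assembler can substitute a label's final address directly into an instruction's immediate, which is why the two components end up coupled.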
Here's a transcript of a shell session that demonstrates what August can do today. We take a bare-bones Hello World program for Linux on x86_64, assemble it with August, run it, and dump the generated assembly with
```
$ cat test/asm/004-sym.asm
; Hello World

section .text

; implicit _start:
    mov eax 0x1     ; write syscall
    mov edi 0x1     ; stdout
    mov esi msg     ; string to print
    mov edx len     ; length
    syscall

exit:
    mov eax 60      ; exit syscall
    mov edi 0       ; exit code
    syscall

section .rodata

msg:
    db "Hello, World!" 0xa
len:
    eq 14
```
Assemble the program with August, then run the emitted executable, which prints "Hello, World!" and exits cleanly.
```
$ august test/asm/004.asm ./hello-world
executable written.
$ ./hello-world
Hello, World!
$ echo $?
0
```
If we disassemble the generated executable, we find the assembly we began with.
```
$ objdump -d ./hello-world

./hello-world:     file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <_start>:
  401000:   b8 01 00 00 00      mov    eax,0x1
  401005:   bf 01 00 00 00      mov    edi,0x1
  40100a:   be 00 50 6b 00      mov    esi,0x6b5000
  40100f:   ba 0e 00 00 00      mov    edx,0xe
  401014:   0f 05               syscall

0000000000401016 <exit>:
  401016:   b8 3c 00 00 00      mov    eax,0x3c
  40101b:   bf 00 00 00 00      mov    edi,0x0
  401020:   0f 05               syscall
...
```
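The bytes in this listing follow the standard x86 encoding for `mov r32, imm32`: a one-byte opcode `0xB8` plus the register number, followed by the 32-bit immediate in little-endian order. The Python sketch below reproduces them. It is an illustration only, not August's encoder (which is written in Ink), and the names `REG32`, `encode_mov_imm32`, and `SYSCALL` are hypothetical.

```python
import struct

# Standard x86 register numbers for the eight legacy 32-bit registers.
REG32 = {
    "eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
    "esp": 4, "ebp": 5, "esi": 6, "edi": 7,
}

# `syscall` is a fixed two-byte instruction.
SYSCALL = bytes([0x0F, 0x05])


def encode_mov_imm32(reg, imm):
    """Encode `mov reg, imm` as opcode (0xB8 + register) + little-endian imm32."""
    opcode = 0xB8 + REG32[reg]
    return bytes([opcode]) + struct.pack("<I", imm & 0xFFFFFFFF)
```

For example, `encode_mov_imm32("esi", 0x6B5000).hex()` yields `be00506b00`, matching the third instruction in the disassembly.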
The instruction encoding is handled by the `./src/asm.ink` library within the project. Currently, August can assemble simple programs that work with 32-bit registers and the ALU, handle branches and jumps, make system calls and function calls per the x86 calling convention, and read from or write to memory. Even with these basic building blocks, we can write programs that do interesting things like loop, manipulate memory, and make recursive calls. You can check out some examples in
August uses a library for constructing ELF executable files located at `./src/elf.ink`. The ELF generated by the ELF library in August currently makes use of three sections:
- `.text`, containing the program text, i.e. translated x64 assembly
- `.rodata`, containing read-only data loaded into process memory as read-only
- `.shstrtab`, containing the section names referenced by the section headers
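For readers new to ELF, a section header string table is just a run of NUL-terminated names that starts with an empty string; each section header stores a byte offset into it. The Python sketch below builds one for the three section names above. The function name `build_shstrtab` is hypothetical and this is not how August's Ink library is necessarily structured.

```python
def build_shstrtab(names):
    """Return (table bytes, name -> offset) for a list of section names.

    The table begins with a NUL byte so that offset 0 means "no name".
    """
    table = b"\x00"
    offsets = {}
    for name in names:
        offsets[name] = len(table)          # name starts at the current end
        table += name.encode("ascii") + b"\x00"
    return table, offsets
```

Each section header's `sh_name` field would then hold the offset recorded for that section's name.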
The content of the `.text` and `.rodata` sections can be provided to the ELF library, which will return a fully linked ELF binary as the result. All labels found in the assembly code are treated as local function symbols and placed into the generated symbol table.
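To give a feel for what an ELF library has to emit first, here is a Python sketch of the 64-byte ELF64 file header for a little-endian x86_64 executable, using the field layout from the ELF specification. It is not August's implementation; `elf64_header` and its parameters are hypothetical, and the table offsets and counts are left to the caller rather than guessed.

```python
import struct

ET_EXEC = 2       # e_type: executable file
EM_X86_64 = 62    # e_machine: AMD x86-64
EV_CURRENT = 1    # e_version


def elf64_header(entry, phoff, shoff, phnum, shnum, shstrndx):
    """Pack an ELF64 file header (always 64 bytes)."""
    e_ident = (
        b"\x7fELF"            # magic number
        + bytes([2, 1, 1, 0])  # 64-bit class, little-endian, version 1, System V ABI
        + bytes(8)             # ABI version and padding
    )
    return struct.pack(
        "<16sHHIQQQIHHHHHH",
        e_ident,
        ET_EXEC,
        EM_X86_64,
        EV_CURRENT,
        entry,      # e_entry: virtual address where execution starts
        phoff,      # e_phoff: file offset of the program header table
        shoff,      # e_shoff: file offset of the section header table
        0,          # e_flags: unused on x86_64
        64,         # e_ehsize: size of this header
        56,         # e_phentsize: size of one program header
        phnum,      # e_phnum
        64,         # e_shentsize: size of one section header
        shnum,      # e_shnum
        shstrndx,   # e_shstrndx: index of the .shstrtab section header
    )
```

The program headers, section contents, and section headers would follow this header in the file, at the offsets passed in.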
References and further reading
The ELF file format is quite well documented, especially in source bases of various linkers, assemblers, and kernels, but the available reference material for implementing an ELF linker is not...what you would call super accessible. In the process of building August, I've found the following references particularly helpful.
- `man elf` on Linux and the `elf` header file in the kernel sources, which provide the canonical reference for implementations of ELF files
- A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux, which breaks down the ELF format for executable files at a high level
- LWN's write-up of the Linux kernel's view of ELF executables, with another breakdown of ELF executables
- Solaris's documentation on ELF object files, a good in-depth reference
- Notes on the ELF specification, which is long but very, very comprehensive, occasionally useful for studying edge cases
In writing an x86/x64 assembler, the following were especially helpful to get me up to speed.
- The x86asm.net ISA reference, which is comprehensive enough for a toy assembler and easy to navigate once you get used to the compact notation
- Encoding x86 Instructions, which was a helpful guide to understanding how x86 and x64 instructions are encoded
- The x64 cheat sheet for a handy list of the core x86/x64 instruction set
- The Calling Conventions article on OSDev Wiki
To work on August, you obviously need Ink installed. Inkfmt is also useful for auto-formatting code, which you can run with `make format` or
When I work on August (especially the instruction encoder), I usually have two other panes open, running:
- `ls test/asm/*.asm lib/*.ink src/*.ink | entr -cr make` so every file change assembles and runs a program to test
- `ls ./b.out | entr -cr objdump -d -Mintel ./b.out` so that every time the executable is re-compiled, I can see the disassembly of the executable and check it against the intended assembly code.
There is a growing test suite for the assembler / x86 instruction encoder, which you can run with `make check` or