A quick and dirty self-hosting C compiler.
main, an executable that takes in an assembly filename on the command line, reads in a C file from
stdout and prints compiled NASM x86-64 assembly to the filename.
my_cc, taking a C filename and an assembly filename on the command line (much better).
To self-compile again, for fun:
my_my_cc, which should be identical to
my_cc, thus proving that this compiler is actually self-hosting.
To compile, say, test.c:
mkdir tmp cc -E test.c -o tmp/test.c my_cc tmp/test.c tmp/test.asm nasm -o tmp/test.o -f your_object_format tmp/test.asm ld -lc tmp/test.o -o test /path/to/libc rm -rf tmp
To rid yourself of garbage:
Implemented behavior includes:
- Typed local and global variables, with declaration permitted anywhere and lexical scoping
- The data types
void, structs, typedefs, enums, and pointers
- Function calls
- All operators with correct precedence except for
-, prefix increment/decrement,
%, bitwise operators, comma, and combined assignment
Not implemented behavior includes, but is not limited to:
- Inline assembly
- Correct pointer arithmetic
- Statically allocated structs or arrays
- Static variables
- Gracefully recovering from errors
Hey, I wrote a compiler in this, you'll be fine.
Basically, I was bored, so I decided to write a self-hosting compiler. Since I was in 15-122 at Carnegie Mellon (which is taught in C), and since C fits in that sweet spot in the intersection of possible to write code in and possible to compile, I decided to write it in C. (Alternatively, some might say that C sits in the sweet spot of unwritable and unreadable, in which case I doomed myself to failure.)
I was inspired by Noam Nisan and Shimon Schocken's fantastic The Elements of Computing Systems, which takes you all the way from a NAND gate to programming Tetris and includes four chapters on simple compilers, and Jack Crenshaw's Let's Build a Compiler series, partially building a Pascal self-compiler using the KISS principle.
I basically chose the smallest subset of C that I wouldn't go crazy writing in. I mean, could you imagine writing a compiler without structs or typing?
The compiler comes in three units, as is typical: a tokenizer, a parser, and a code generator. The tokenizer produces a stream of tokens. The parser parses this stream into an internal AST representation, which the code generator then walks down to directly produce x86-64 assembler, leading to correct but inefficient code.
Everything works on 64-bit Mac OS X, but I'm not sure how the C calling conventions change across operating systems and machines. At least on my computer, I can call out to libc, etc.
Improvements that would be easy to implement include:
- Implementing the rest of C
- Basic optimizations like constant folding
Improvements that would be hard to implement and would likely entail rewriting a significant body of code include:
- More involved optimizations like propagation analysis, SSA, loop unrolling and exchanging
- More graceful error-checking