If you need a cheap and cheerful assembler for a tiny project on an architecture
that doesn't get much support from "real" assemblers, you might be able to hack
your opcodes into `tsasm` and get on with life. That's what I wrote it for.
- No macro facilities right now.
- All labels must be unique everywhere.
- Only one source code file; no includes. Write shorter code.
- No `.equ`, `.asg`, or `.set` directives or similar. Use numbers.
./main.py --listing=prog.lst --arch=ibm5100 prog.asm a.out
For now, `tsasm` just dumps bytes to an output file. There are no notions of
sections, headers, framing, metadata, symbol tables, linking, executable
formats, or anything else like that. Your assembly language statements are
turned into bytes, in the locations that your code says that they belong. As
such, if the first line of your program is something like
ORG $8000
then your output is going to have 32KiB of $00 bytes before you start to see any bytes for the code you wrote.
If you're not going to un-dump the generated output straight into RAM and jump right into it, why are you coding in assembly?
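To make the padding concrete, here is a minimal sketch of a flat-image writer. It is not `tsasm`'s actual output code; `flat_image` and its `chunks` argument are hypothetical names used only for illustration.

```python
def flat_image(chunks):
    """Lay out (address, data) pairs in one flat byte string starting at 0.

    Hypothetical helper: every location that no chunk covers is left as a
    $00 byte, which is the behaviour described above.
    """
    size = max(addr + len(data) for addr, data in chunks)
    image = bytearray(size)  # bytearray(n) starts out as n zero bytes.
    for addr, data in chunks:
        image[addr:addr + len(data)] = data
    return bytes(image)


# Two bytes of "code" placed at $8000, as if preceded by `ORG $8000`.
image = flat_image([(0x8000, b'\x01\x02')])
padding = image.index(b'\x01\x02')
print(len(image), padding, padding // 1024)  # 32770 32768 32
```

The 32768 leading zero bytes are the 32KiB of padding: the file is a picture of memory from address 0 upward, not a relocatable object.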
`tsasm` requires Python 3.6 or greater.
`tsasm` was written by someone who knows how to write a lexer and a parser the
right way but couldn't be bothered to reach for the dragon book this time. You
get regexp-based hacks instead, so the code won't be suitable for an undergrad
comp sci course, regrettably. Here, have a "grammar":
- COMMENTS start with a bare `;` and go to the end of the line. They are preceded by CODE STUFF.
- CODE STUFF can start with a LABEL, or not, and carry on with zero or more CODEY BITS.
- LABELS start with `_` or a letter and continue with `_` or alphanumeric characters. They end with `:`.
- CODEY BITS are separated by bare whitespace or bare commas. They can be composed of characters that aren't any of those, or of STRINGS.
- STRINGS are `"`- or `'`-delimited strings (you know the kind). `\` is your escape character.
Additionally, data structures inside the assembler call the first CODEY BIT an "opcode" and the remaining ones "args".
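The "grammar" above can be sketched in a few lines of regexp-flavoured Python. This is an illustration written for this README, not the splitter that `tsasm` actually uses; `split_statement`, `_BIT`, and `_LABEL` are made-up names, and the sketch takes a shortcut by never letting a quote character appear inside an unquoted CODEY BIT.

```python
import re

# One CODEY BIT: a quoted STRING (with backslash escapes) or a run of
# characters that are not whitespace, commas, quotes, or semicolons.
_BIT = re.compile(r'''"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|[^\s,;"']+''')
_LABEL = re.compile(r'\s*([A-Za-z_][A-Za-z0-9_]*):')


def split_statement(line):
    """Return (label, opcode, args, comment) for one source line."""
    label, pos = None, 0
    m = _LABEL.match(line)
    if m:
        label, pos = m.group(1), m.end()
    bits, comment = [], None
    while pos < len(line):
        ch = line[pos]
        if ch == ';':                  # A bare semicolon starts the comment.
            comment = line[pos + 1:].strip()
            break
        if ch.isspace() or ch == ',':  # Bare separators between CODEY BITS.
            pos += 1
            continue
        m = _BIT.match(line, pos)
        if not m:
            raise ValueError('Cannot parse: ' + line[pos:])
        bits.append(m.group(0))
        pos = m.end()
    opcode = bits[0] if bits else None
    return label, opcode, bits[1:], comment


print(split_statement('mloop: MOVB R5, (R3)+ ; Copy next character.'))
# ('mloop', 'MOVB', ['R5', '(R3)+'], 'Copy next character.')
```

Because strings are matched as a unit, a `;` or `,` inside quotes is not "bare" and does not end the statement or split the argument.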
That's about all the structure that `tsasm` imposes out of the box. From there,
architecture-specific code generation modules impose additional structure as
they see fit. `tsasm` does provide these modules with some built-in ways to
interpret certain CODEY BITS, including:
- ADDRESS literals, with popular formats like `1234` and `1234d` (decimal numbers), `$12AB` and `12ABh` (hexadecimal), `1234o` and `1234q` (octal), `1010b` (binary), `"!"` and `'!'` (single characters), plus whatever else Python 3 will accept as the first argument in `int(s, base=0)`.
- NUMBERS, which are just integers for now, and which are the same format as ADDRESS literals but preceded by the `#` character.
- REGISTERS, which are the same format as ADDRESS literals but preceded by some architecture-specific case-insensitive prefix, like `R` if the arch has an orthogonal register file or `A` or `D` if it doesn't. It's possible for the prefix not to be followed by a number, as might happen if you use a cursed architecture with registers with strange names like `eax`.
- DEREFERENCES, which warmly embrace an ADDRESS or a REGISTER in brackets, like `(R2)` or `($E842)`. Additionally, any number of `+` or `-` characters can be arranged on the outside of the brackets to denote (pre/post)-(in/de)crementation. Some architectures may support some notion of 'crementation by -0.5 or +0.5, and for these you can use `~` or `'` respectively.
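As a rough illustration of the integer formats listed for ADDRESS literals, the sketch below converts each of them to a Python `int`. It is not `tsasm`'s own parser: `parse_integer` and `_FORMS` are hypothetical names, treating the suffix letters as case-insensitive is an assumption, and escape sequences inside character literals are ignored.

```python
import re

# Prefix/suffix forms from the list above, tried in order. Treating the
# suffix letters as case-insensitive is an assumption of this sketch.
_FORMS = [
    (re.compile(r'([0-9]+)d?', re.I), 10),         # 1234, 1234d
    (re.compile(r'\$([0-9a-f]+)', re.I), 16),      # $12AB
    (re.compile(r'([0-9][0-9a-f]*)h', re.I), 16),  # 12ABh
    (re.compile(r'([0-7]+)[oq]', re.I), 8),        # 1234o, 1234q
    (re.compile(r'([01]+)b', re.I), 2),            # 1010b
]


def parse_integer(text):
    """Convert one integer literal to an int, or raise ValueError."""
    if len(text) == 3 and text[0] == text[2] and text[0] in '"\'':
        return ord(text[1])                        # "!" or '!'
    for pattern, base in _FORMS:
        m = pattern.fullmatch(text)
        if m:
            return int(m.group(1), base)
    return int(text, base=0)                       # 0x12AB, 0o17, 0b1010...


print([parse_integer(s) for s in ('1234d', '$12AB', '12ABh', '1234q', '1010b', "'!'")])
# [1234, 4779, 4779, 668, 10, 33]
```

A NUMBER would be the same thing after stripping a leading `#`, and a REGISTER the same thing after stripping the architecture's prefix.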
When parsing any numeric value in an ADDRESS, a NUMBER, or even a REGISTER for
some reason, `tsasm` will detect if the raw text is a LABEL and substitute in
the label's value.
Anyway, code generation modules don't have to use the parsing facilities that support these idioms, or may only use them on a per-opcode basis.
Here is some example parity-calculating code I cranked out for an IBM 5100 computer with its 16-bit PALM processor.
; Compute even parity for each byte in the string "HELLO WORLD!".
; Count in R4 the number of bytes with a parity bit of 1, then halt.
; Meditate on living with an 8-bit ALU that can't multiply.
ARCH ibm5100 ; Use the ibm5100.py code generation module.
ORG $800 ; This program starts at $800.
; Main program.
LBI R4, #0 ; Load 0 into LSByte of R4 (parity bit count).
LWI R3, #Data ; Load address of the string into R3.
mloop: MOVB R5, (R3)+ ; Copy next string character into R5.
SNS R5 ; Skip next line unless character is $FF.
HALT ; All done!
CALL Parity,R2 ; Compute even parity of R5 byte.
CLR R5, #$FE ; Isolate just the parity bit.
ADD R4, R5 ; Add the bit to the parity bit count.
BRA mloop ; Deal with the next byte.
; Compute even parity of the lower byte in R5.
; Trashes R5-R7. Return address must be in R2.
; The computed parity bit is the least-significant bit of R5.
; Based on the algorithm described at
; http://graphics.stanford.edu/~seander/bithacks.html#ParityParallel
Parity:
MOVE R6, R5 ; Copy byte into R6 and do a logical right...
SWAP R6 ; ...shift of four bits: SWAP is a 4-bit...
CLR R6, #$F0 ; ...rotate, CLR clears indicated bits.
XOR R5, R6 ; XOR byte with the result.
CLR R5, #$F0 ; The lower nybble that obtains counts the...
LBI R7, #$69 ; ...number of bits to right-shift a magic...
LBI R6, #$96 ; ...value $6996 to get even parity in bit 1.
ploop: SZ R5 ; We can only shift by 1 bit. Are we done?
RET R2 ; Yes, back to caller!
MLH R6, R7 ; Move magic word upper-byte to Hi(R6).
SHR R6 ; Shift Lo(R6) 1-right, carrying in a Hi bit.
SHR R7 ; Shift the magic word upper byte too.
DEC R5, R5 ; Decrement shifts remaining.
BRA ploop ; Back to top of loop.
; $00 is ' ' for the 5100, so we use $FF as a string terminator.
Data: .db "HELLO WORLD!",$FF
The `codegen` subdirectory contains architecture-specific code generation
modules. An assembly language program specifies which architecture to use
with an `ARCH` (or `.ARCH`, or `CPU`, or `.CPU`, or variants of same with
lower-case letters) statement. For example,
.cpu ibm5100
causes the assembler to load the `codegen/ibm5100.py` module. Any such module
defines two functions: `encode_str` and `get_codegen`. The former has this
signature:
encode_str(data: Text) -> bytes
and is used to turn data from string constants into strings of bytes as
appropriate for the architecture. Note that `tsasm` allows the specification of
single-character strings as integer literals in addresses or numbers, which may
have important consequences for your encoding. On the other hand, if your
strings must all be ASCII bytes and your numbers are all 8-bit, no problem.
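For the easy case, where every string is plain ASCII, `encode_str` could be as small as the sketch below. This is written for a hypothetical ASCII-only architecture and is not taken from any module that ships with `tsasm`; a machine like the 5100, where $00 is a space, would need its own character table instead.

```python
from typing import Text


def encode_str(data: Text) -> bytes:
    """Encode string constants for a hypothetical ASCII-only architecture."""
    try:
        return data.encode('ascii')
    except UnicodeEncodeError as error:
        # Re-raise with a message aimed at the author of the assembly code.
        raise ValueError(
            'Non-ASCII character in string: {!r}'.format(data)) from error


print(encode_str('HELLO WORLD!'))  # b'HELLO WORLD!'
```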
The `get_codegen` function returns a dict from "casefolded" (see `str.casefold`)
opcodes (i.e. the first CODEY BIT in a statement) to functions that generate hex
data for those opcodes:
get_codegen() -> Dict[Text, Callable[[Context, Op], Tuple[Context, Op]]]
where `Context` and `Op` are classes defined in the `data.py` module. It might
build out this dict with entries like this one:
'xor': _codegen_xor,
where `_codegen_xor` is another function in the module that specialises in
generating binary code for the `XOR` opcode. A heavily-annotated implementation
of `_codegen_xor` for the old IBM PALM processor could look like this:
def _codegen_xor(context: Context, op: Op) -> Tuple[Context, Op]:
  # The first thing we'll want to do is parse the arguments in this statement,
  # the details of which are stored in `op`. The PALM ALU can only operate on
  # registers, so both arguments have to be registers. We can use
  # `parse_args_if_able` from the `data` module to do most of the hard work.
  # Note how we're replacing `op` here: both `context` and `op` are namedtuples
  # and cannot be mutated.
  op = op._replace(args=parse_args_if_able(
      _PARSE_OPTIONS, context, op, Type.REGISTER, Type.REGISTER))
  # If all arguments were parsed successfully, we can try generating some code.
  # Reasons for args not being parsed successfully include using labels that
  # haven't been bound to a value yet because they are defined later in the
  # file. In those cases, we'll just have to wait for the next code generation
  # pass to try again.
  if all_args_parsed(op.args):  # This is another `data` module helper.
    # Check to make sure the registers are in bounds. ValueError is the
    # exception to use to object to a user's code.
    if not 0 <= op.args[0].integer <= 15: raise ValueError('Bad 1st register.')
    if not 0 <= op.args[1].integer <= 15: raise ValueError('Bad 2nd register.')
    # With good arguments we are ready to make code. This means supplying
    # hexadecimal strings as the `hex` field of `op`. For the IBM 5100, xor
    # instructions take the form 0ab7, where a and b are registers.
    op = op._replace(hex='0{:X}{:X}7'.format(
        op.args[0].integer, op.args[1].integer))
    # Since we've generated all the code that we need to generate for this
    # statement, we set the `todo` field of `op` to None. If we didn't do this,
    # `_codegen_xor` would be called again for this statement in the next pass.
    # The assembler will continue making passes through the code until all
    # statements have None todos.
    op = op._replace(todo=None)
  # Even if we weren't able to generate code, we know that the code we would
  # generate would occupy two bytes. We advance our current location in the code
  # to reflect this:
  context = context.advance_by_bytes(2)
  return context, op
Some additional details for understanding the example:
- `_PARSE_OPTIONS` is a `data.ParseOptions` namedtuple, and for `_codegen_xor` it mainly serves to tell `parse_args_if_able` that register names start with `R`.
- All `op.args` entries are of type `data.Arg` and include type information in the form of `data.Type` values. These values are `enum.Flag`s, allowing the type to include `Type.UNPARSED`, a qualifier indicating that parsing the argument could not be completed (e.g. due to a label being undefined).
- `data.Type` values can also tell `parse_args_if_able` what type of args might be acceptable at a certain position. Multiple single types can be combined with `|`, e.g. `Type.REGISTER | Type.DEREF_REGISTER`.
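If `enum.Flag` is new to you, the sketch below shows why it suits this job. The `Type` class here is a stand-in written for this README: `REGISTER`, `DEREF_REGISTER`, and `UNPARSED` are names used above, but the other members, the ordering, and the values are assumptions rather than the real contents of `data.py`.

```python
import enum


class Type(enum.Flag):
    """Stand-in for `data.Type`; the real members may differ."""
    ADDRESS = enum.auto()
    NUMBER = enum.auto()
    REGISTER = enum.auto()
    DEREF_REGISTER = enum.auto()
    UNPARSED = enum.auto()


# What a handler might accept at one argument position.
acceptable = Type.REGISTER | Type.DEREF_REGISTER

# An argument that looks like a register but mentions an unbound label.
pending = Type.REGISTER | Type.UNPARSED

print(bool(pending & acceptable))      # True: this kind of arg is allowed.
print(bool(pending & Type.UNPARSED))   # True: but it needs another pass.
print(bool(Type.NUMBER & acceptable))  # False: numbers don't belong here.
```

Because flags combine with `|` and test with `&`, one value can say both what kind of argument something is and whether parsing it has finished.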
For further information that will help you understand how to write code
generation modules, study the data structures and documentation in `data.py`
and have a look through any of the existing modules.
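The comments in `_codegen_xor` mention that the assembler keeps making passes until every statement's `todo` is None. A toy version of that loop, with a forward reference that only resolves on the second pass, might look like the sketch below. Everything in it is hypothetical: the `Op` and `Context` namedtuples are cut-down stand-ins for the classes in `data.py`, the two-opcode instruction set and its encodings are invented, and `assemble` is not `tsasm`'s real driver.

```python
from collections import namedtuple

# Hypothetical stand-ins: the real classes live in data.py and carry more.
Op = namedtuple('Op', 'label opcode arg hex todo')
Context = namedtuple('Context', 'pos labels')


def _codegen_nop(context, op):
    op = op._replace(hex='00', todo=None)
    return context._replace(pos=context.pos + 1), op


def _codegen_jmp(context, op):
    # Invented encoding: one opcode byte, then a one-byte target address.
    if op.arg in context.labels:  # Has the label been bound yet?
        op = op._replace(
            hex='4C{:02X}'.format(context.labels[op.arg]), todo=None)
    # The size is known even when the target is not.
    return context._replace(pos=context.pos + 2), op


def get_codegen():
    return {'nop': _codegen_nop, 'jmp': _codegen_jmp}


def assemble(source, max_passes=10):
    handlers = get_codegen()
    ops = [Op(label, opcode, arg, hex=None, todo=handlers[opcode.casefold()])
           for label, opcode, arg in source]
    labels = {}
    for _ in range(max_passes):
        context = Context(pos=0, labels=labels)
        for i, op in enumerate(ops):
            if op.label is not None:
                labels[op.label] = context.pos  # Bind label to location.
            if op.todo is not None:
                context, ops[i] = op.todo(context, op)
            else:
                # Already encoded: just step past its bytes.
                context = context._replace(pos=context.pos + len(op.hex) // 2)
        if all(op.todo is None for op in ops):
            return bytes.fromhex(''.join(op.hex for op in ops))
    raise ValueError('Unresolved labels after {} passes.'.format(max_passes))


program = [(None, 'JMP', 'end'), (None, 'NOP', None), ('end', 'NOP', None)]
print(assemble(program).hex())  # 4c030000
```

On the first pass `JMP end` cannot be encoded, but it still advances the position by two bytes, so `end` gets bound to the right address and the second pass finishes the job.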
This is not an official Google product.