The project goal was to have some fun by implementing an interpreted programming language without a single keyWORD. All language-specific instructions are symbols found on a modern keyboard.
To simplify the syntax scanner, a program is considered to be a sequence of "tokens" separated by a space, a tab or a newline. When a token is met while scanning the program, it is "evaluated". Some tokens are evaluated immediately into some value, some need one or more further tokens (which are evaluated too). In simple words -- it's a prefix notation syntax.
If a token starts with a digit it's considered to be an integer.
If a token starts with a letter (any case) it's considered to be a symbol. Symbols may include letters, digits, underbars and dots inside. No other characters are allowed.
If a token is `(` it's a list, if a token is `[` it's a map, and if a token is `{` it's a command block. Each of them starts the respective set (or collection), which terminates at the matching closing bracket.
Otherwise a token is considered to be a command.
Two exceptions are made to the space-separation rule above. If a token starts with `"` or `'`, it's considered to be a string constant, and the token ends at the next `"` or `'` respectively. All spaces, tabs and even newlines are included in the string as is. No backslash escapes are supported (yet); backslashes are included in the string as is. The second exception is a token starting with `#`. It's a comment, and it terminates at the next `#` (not at the end of the line).
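The scanning rules above can be sketched in a few lines of Python. This is only an illustration, not the actual cyvm scanner: the function name `tokenize` and the `("string", ...)` / `("word", ...)` tuples are made up for this example, and it assumes that a quote or `#` appearing in the middle of a word does not start a new token.

```python
def tokenize(source):
    """Split program text into tokens following the rules described above."""
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in " \t\n":
            # Whitespace only separates tokens.
            i += 1
        elif ch in "\"'":
            # A string runs to the next matching quote; no escapes are processed.
            end = source.find(ch, i + 1)
            if end == -1:
                raise ValueError("unterminated string")
            tokens.append(("string", source[i + 1:end]))
            i = end + 1
        elif ch == "#":
            # A comment runs to the next '#', not to the end of the line.
            end = source.find("#", i + 1)
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 1
        else:
            # Anything else is a whitespace-delimited word (assumption: quotes
            # and '#' inside a word are kept as part of the word).
            start = i
            while i < n and source[i] not in " \t\n":
                i += 1
            tokens.append(("word", source[start:i]))
    return tokens
```

For example, a comment spanning two lines is dropped entirely, while the spaces inside a string are preserved.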
Some tokens may evaluate into a boolean value or into a special NOVALUE thing. The latter is the default evaluation result for any token unless explicitly documented otherwise.
Tokens may also result in a stream value, which represents an open file, pipe, socket, etc.
To sum up -- a token can be an integer, a string or a symbol, can define a list, a map or a command block, can result in a stream or a NOVALUE, or can be a command (or a comment, but these are ignored).
Integers and strings are evaluated immediately into the respective value. The quotes of a string are not included in its value.
A list is evaluated by evaluating all the subsequent tokens until the `)` one, which denotes the end of the list. "Subsequent evaluation" literally means evaluation, e.g. `( + 1 2 )` means a list of one element being the sum of 1 and 2.
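To make the prefix evaluation concrete, here is a minimal Python sketch that handles only integers, `+` and lists. The `evaluate` function and its `(value, next_position)` return convention are assumptions made for this example and say nothing about how cyvm is actually implemented.

```python
def evaluate(tokens, pos=0):
    """Evaluate one token starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    pos += 1
    if tok[0].isdigit():
        # A token starting with a digit is an integer.
        return int(tok), pos
    if tok == "+":
        # A command pulls in and evaluates as many further tokens as it needs.
        left, pos = evaluate(tokens, pos)
        right, pos = evaluate(tokens, pos)
        return left + right, pos
    if tok == "(":
        # A list evaluates subsequent tokens until the closing ')'.
        items = []
        while tokens[pos] != ")":
            item, pos = evaluate(tokens, pos)
            items.append(item)
        return items, pos + 1
    raise ValueError(f"unsupported token: {tok}")
```

With this sketch, `evaluate("( + 1 2 )".split())` consumes all five tokens and produces a one-element list holding 3.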
Maps always use strings as keys and are evaluated by evaluating all the subsequent tokens in pairs, considering the first one to be a string and the second one to be any other token. The first one is thus the key and the second one is the value.
For simplicity in describing and using objects, the key does not have to be a string token (i.e. it does not have to start with `"` or `'`); it can also be a symbol without dots. In either case the key token is treated as just a string. IOW the next two tokens mean the same:
    [ "a" 1 ]
    [ a 1 ]
Mind the space before the final `]`, as the latter must be a separate token.
Like lists, a command block token grabs all the subsequent tokens until the `}` one. Note that it only grabs the tokens as is, without evaluating them. To evaluate the tokens in a command block you should pass control to it with one of the commands described below.
Symbols are like variable names. They are always evaluated into the value they refer to (except in a single case when a new symbol is declared). Symbols can be considered to be pointers to other objects.
Characters `_` and `.` are special in a symbol name. If a symbol starts with a `_` it's one of the service symbols. These are:
- `_` -- a cursor. This is a symbol that refers to the current element in the loop, map/filter and swap commands (see below)
- `_-` and `_+` -- false and true constants respectively
- `_?` -- a random number (unlimited, use the mod operation to trim it)
- `__` -- current namespace (see below)
- `_<`, `_>` and `_>!` -- stdin, stdout and stderr streams respectively
An underbar anywhere else inside a symbol name is considered to be just a part of the name.
A dot (`.`) inside a symbol is used to split the symbol name into pieces, each of which is considered to be a key in the collection (list or map) that has been referenced by the previous part of the symbol. For example, the symbol `a.b` means: find the object named "a", then find an object named "b" in it, provided `a` refers to a map.
Note that the value "b" is treated as the key value. If you want to treat "b" as another symbol that contains the key value, you should use two dots as the separator. For example, `a..b` means: find the object "a", then find the symbol "b", get what it refers to (it should be a string), then find in "a" the value by the obtained string key.
A dot can follow a cursor, for example the symbol `_.a` means: take the cursor object, treat it as a map and find the object "a" in it.
For lists the "key" is expected to be a number, for example `a.0` means: find the list named "a" and get the 0th element from it. Respectively, `a..b` means: find the list "a", then get what `b` refers to (should be a number), then get the b-th element.
There's also a separate `..` token that evaluates the next two tokens and effectively results in referencing the `1st..2nd` value. IOW `a..b` is equivalent to `.. a b`, with the exception that `b` in the second case can be any evaluatable token.
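The dotted lookup rules can be illustrated with a short Python sketch in which maps are dicts and lists are Python lists. The `resolve` helper is hypothetical, it looks up the indirect key (the part after `..`) in the root map, and it simply raises on a missing key or index instead of producing NOVALUE.

```python
def resolve(name, root):
    """Resolve a dotted symbol such as 'a.b', 'a.0' or 'a..b' against root."""
    parts = name.split(".")
    value = root[parts[0]]
    indirect = False
    for part in parts[1:]:
        if part == "":
            # An empty piece comes from '..': the next piece names a symbol
            # whose value is the key, rather than being the key itself.
            indirect = True
            continue
        key = root[part] if indirect else part
        indirect = False
        if isinstance(value, list):
            # For lists the key is expected to be a number.
            key = int(key)
        value = value[key]
    return value
```

For instance, with `root = {"a": {"k": 2}, "b": "k"}`, resolving `a..b` first reads `b` (the string "k") and then fetches `a["k"]`.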
To declare a new symbol there's a `=` command, find its description below.
As was said above, commands may need more subsequent tokens to be evaluated. Here's how.
Token `!` makes a new symbol. It takes a symbol token and dereferences all of it except the trailing component. The result should be a map. Then it evaluates the next token and creates a new element in that map, with the key being the trailing component of the symbol and the value being the result of the second token.
The question is -- if all the symbols are created in some map, where would a symbol without dots be created? The answer is -- there's an implicit root map, so if the symbol name doesn't have any dots in it, the new entry is created in this root map, thus resulting in something that looks like a global variable.
For example, `! x 1` makes a variable named "x" being the number 1. Respectively,
    ! x [ ]
    ! x.a "string"
makes a new empty map named "x", then makes an entry with the key "a" in it being a string with the value "string".
Since all symbols start from the root map, it's possible to change this map into some other map. This is done with the `!!` command. It evaluates the next map token and sets it as the root one. All the service variables mentioned above remain accessible in any map, and `__` can be used to return back to the original namespace.
Tokens: `+` `-` `/` `*` `%`. Evaluate two more tokens. If the next tokens are numbers they are added, subtracted, etc. The result is a number as well. Otherwise some magic comes up:
- `+` on two strings concatenates those and results in a new string
- `+` on two lists splices them and results in a new list
- `-` on a list and a number removes the n-th element from the list
- `-` on a map and a string removes the respective key from the map
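A rough Python model of this type-dependent behaviour of `+` and `-` is shown below. The function names are invented for the example. It assumes 0-based list indices (matching the "0th element" wording used for `a.0`) and it returns new collections from `-`, which the text does not specify; the real command may modify its argument in place.

```python
def add(a, b):
    """Model of '+': numbers are added, strings concatenated, lists spliced."""
    for kind in (int, str, list):
        if type(a) is kind and type(b) is kind:
            return a + b
    raise TypeError("unsupported operands for +")


def sub(a, b):
    """Model of '-': subtraction, or removal from a list or a map."""
    if type(a) is int and type(b) is int:
        return a - b
    if type(a) is list and type(b) is int:
        # Drop the b-th element (0-based index is an assumption).
        return a[:b] + a[b + 1:]
    if type(a) is dict and type(b) is str:
        # Drop the entry with key b.
        return {key: value for key, value in a.items() if key != b}
    raise TypeError("unsupported operands for -")
```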
Tokens: `&` `|` `^` `!`. All but `!` need two more boolean tokens, `!` needs one. They do what they are expected to and result in a boolean value. Note that for `&` and `|` both argument tokens ARE evaluated before doing the operation, it's not like && and || in C.
Tokens: `(` `)` `(<` `(>` `(<>` `+(` `+)` `-(` `-)`. Tokens `(` and `)` mark the list start and end respectively. Other tokens typically need at least one more list token.
`(>`, `(<` and `(<>` cut the list and result in a new one. They evaluate one list token and one number (or two numbers for `(<>`). The resulting list is cut from head, tail or both respectively. The right index is included, the left index is not. The rule of thumb is that `>` results in the right (tail) part of the list and `<` results in the left (head) part of the list.
`+(` and `+)` are push to head and push to tail respectively. They evaluate a list token and one more token of any type, then add the latter value to the former list. The result is NOVALUE, the list from the 1st token is modified in place.
Similarly, `-(` and `-)` are pops from head or tail. They evaluate one list token and modify it in place. Both result in whatever is popped from the list, or in a NOVALUE if the list was empty.
Tokens: `%%` `%~` `%/` `%^` `%$`. All evaluate one more string token to work on.
`%%` results in a new formatted string. Formatting means scanning the argument string and getting the `\(...)` pieces from it. Each `...` piece is then considered to be a symbol, which is dereferenced, and the string representation of it is inserted instead of the whole `\(...)` block. E.g.
    %% "Hello, \(name)!"
will dereference the "name" symbol and, if it's a string "world", will result in a new string "Hello, world!"
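The substitution can be sketched in Python with a regular expression. The `format_string` name is hypothetical, and for brevity the sketch looks each name up directly in a dict instead of resolving dotted symbols.

```python
import re


def format_string(template, namespace):
    """Replace every \\(name) piece with the string form of namespace[name]."""
    def substitute(match):
        # group(1) is the text between '\(' and ')'.
        return str(namespace[match.group(1)])

    # A literal backslash, a literal '(', the symbol name, a literal ')'.
    return re.sub(r"\\\(([^)]*)\)", substitute, template)
```

Calling `format_string(r"Hello, \(name)!", {"name": "world"})` yields the "Hello, world!" string from the example above.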
`%~` is the atoi (or strtol) command. It results in a number value corresponding to the argument string.
`%/` splits the argument string using spaces and tabs as separators. The result is a list of strings.
`%^` and `%$` evaluate one more string and result in a boolean value meaning whether the first argument respectively starts or ends with the second one.
Check tokens: `=` `!=` `>` `>=` `<` `<=`. Evaluate two more tokens of the same type, compare them and result in a boolean value.
Ifs and loops tokens: `?` `~`. All evaluate one or more command blocks and may pass control to them. All typically result in NOVALUE, but if one of the return commands is met in the command block, the if/loop token is evaluated into its argument (see more details further).
The `?` evaluates the next token. If it's a boolean value, then it's an if case; if it's a command block, then it's a select case.
In the if case two more command block tokens are evaluated for the then and else branches respectively. One of them is then called depending on the 1st token value.
In the select case the evaluated argument is considered to consist of `bool:block` pairs. The first boolean token evaluated into a true value passes control to the respective block token, then the whole `?` evaluation stops.
The loop is `~`. It evaluates the next token. For a list or a map it grabs one more token and calls this command block for each list element or for each map key. For a command block it calls the block repeatedly until it produces any result (via `:=`, `:-` or `:+`), even if that result is NOVALUE.
Tokens: `;` and `;;`. Evaluate the next token and print it on the screen. The former one accepts only strings and prints them as is, the latter one accepts any other token and prints some string representation of it, e.g. strings are printed with quotes around and a newline at the end.
Tokens: `{` `}` `.` `:` `:=` `:+` `:-`. The `{` and `}` denote the start and end of a command block, the `.` token is a shortcut for the `{ }` pair and means a no-op block (useful as `?` and `??` sub-blocks). Tokens `:`, `:=`, `:+` and `:-` switch between blocks.
The `:` one evaluates a command block token and a map token, sets the map as the namespace, passes control to the command block, then restores the namespace back. It's thus the way to do a function call with arguments. Using the `__` namespace makes something like goto.
The `:=` is like return or break in C. It evaluates the next token, then completes execution of the current command block and (!) makes the evaluated argument be the result of evaluation of the token that caused the execution of this command block. E.g. `: { := 1 }` makes the first `:` be evaluated into 1. The `? _+ { := 1 } .` makes `?` be evaluated into 1, since it's `?` that caused execution of the block with `:=`.
The `:+` token is used to propagate the "return" one more code block up. It evaluates the next token; if it's a NOVALUE then `:+` does nothing, otherwise it acts as `:=`, stopping execution of the current block and evaluating its caller into the value.
The `:-` works the other way -- it stops execution of the current command block if the argument is NOVALUE, otherwise it does nothing. The result of this token is always NOVALUE.
Tokens: `<~` `<~>` `~>` `<->` `->>` `<->>` to open a stream and `<<` `>>` to read from and write to a stream.
Opening tokens evaluate the string token and open the file with that name. The modes are r, r+, w, w+, a and a+ (see the fopen man page) respectively. The result is a stream value.
Reading from a stream evaluates a stream token and a string token that shows what to read. Currently the following formats are supported:
- "ln" results in a string with a single line read from the file (excluding the trailing newline character).
- "b", "s", "i" and "l" read 8, 16, 32 and 64-bit numbers respectively.
Tokens: `@` `$` `.|` `-.` `=.` `=^`.
The `@` evaluates a list token and one more token and checks whether the latter one is present in the list. It results in a boolean value.
The `$` evaluates the next token, checks it to be a list, map or string and results in a number value equal to the size of the argument.
The `.|` evaluates the next two tokens. If the first one is NOVALUE, then it results in the 2nd value, otherwise it results in the 1st value. The token's meaning is "default value".
The `-.` evaluates the next token and results in NOVALUE if it's empty, i.e. a NOVALUE itself or a false bool, zero number, empty string, list or map. Otherwise it results in the mentioned token value. The token's meaning is "novalue if empty".
The `=.` token evaluates the next value and results in true if it's NOVALUE or false otherwise.
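The three NOVALUE helpers can be modelled in Python as follows. The sentinel object and the function names are assumptions made for this illustration; values of other kinds (e.g. streams) are treated as never empty.

```python
NOVALUE = object()  # stand-in for the language's NOVALUE


def default(first, second):
    """Model of '.|': the second value is used only if the first is NOVALUE."""
    return second if first is NOVALUE else first


def novalue_if_empty(value):
    """Model of '-.': empty values collapse into NOVALUE."""
    if value is NOVALUE:
        return NOVALUE
    if isinstance(value, (bool, int, str, list, dict)) and not value:
        # False, 0, "", [] and {} all count as empty.
        return NOVALUE
    return value


def is_novalue(value):
    """Model of '=.': true only for NOVALUE itself."""
    return value is NOVALUE
```

Chaining the first two gives a "fallback for empty input" idiom: `default(novalue_if_empty(""), "fallback")` results in "fallback", while a zero passed straight to `default` is kept as is.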
The `=^` token is the re-declare one. It works like the `=` one, but unlike it, `=^` results in the old value of the symbol being declared. If the symbol didn't exist before, the result is NOVALUE. When evaluating the value for the new symbol, the cursor is available and points to the symbol's old value. E.g. `= x =^ y + _ 1` is equivalent to C `x = y++`.
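The re-declare semantics can be mimicked in Python with a dict as the namespace. Here `redeclare` is a hypothetical helper, and the `compute` callback plays the role of the value token, receiving the old value the way the cursor would expose it.

```python
NOVALUE = object()  # stand-in for the language's NOVALUE


def redeclare(namespace, name, compute):
    """Store compute(old) under name and return the old value (or NOVALUE)."""
    old = namespace.get(name, NOVALUE)
    # The callback sees the old value, like the cursor in the description above.
    namespace[name] = compute(old)
    return old


ns = {"y": 5}
# Mirrors the 'x = y++' example: x gets the old y, y is incremented.
ns["x"] = redeclare(ns, "y", lambda cursor: cursor + 1)
```

After these lines `ns["x"]` holds 5 and `ns["y"]` holds 6.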
The ` (backtick) token is an "eval" analogue. It gets the next string token, then resolves and evaluates its value one more time. E.g. the
    ` "+" 1 2
is the same as just `+ 1 2`.
The `cyvm` binary syntax is
- `cyvm -c` prints the commands and their short meaning
- `cyvm -r file.cy` runs the program in a file
- `cyvm -t file.cy` prints the found tokens without evaluating them
When running a program, execution starts by evaluating the very first token. A special name `Args` is available in the root map and is the list of CLI arguments (including the .cy file name).
Each token has a location in the file in the `@line.number` format. The `-t` option prints tokens and their locations. When an error in evaluation occurs, cyvm prints the failing token location and exits.
Memory management is absent. There are no explicit malloc/free commands; objects are allocated transparently and are never freed :) Garbage collection, well, may come some day.
Error handling is not there yet. In some situations (e.g. out-of-bounds list access or getting a missing key from a map) the result is NOVALUE, in all other cases cyvm just exits.
Doing recursion looks clumsy. Look at this line in the examples/qsort-rec.cy file:
    -> split [ split split words b ]
It does the recursive call, but to make the split function call itself it pushes its own name into the namespace it will live in.
The current set of commands is subject to change. The plan is to make some symbol reflect the command type. E.g.
- `?` for conditional ops (if, select)
- `~` for loops of all kinds
- `(` and `)` for lists
- `[` and `]` for maps
- `{` and `}` for command blocks
- `%` for strings
- `!` for errors (not there yet) (now used for declare)
- `<` and `>` for streams
- math and bool are very common and shouldn't change