🚧 Under Construction 👷
A state machine compiler with no runtime dependency. Define a grammar using a subset of regular expression notation, then compile it into a blazing-fast state machine. Acorn
supports lexers or custom string-based state machines.
Add this to your application's shard.yml
:
dependencies:
acorn:
github: "rmosolgo/acorn"
-
Define the grammar in a build file:
# ./build/my_lexer.cr require "acorn" class MyLexer < Acorn::Lexer # optional, rename the generated class: # name("Namespace::MyLexer") token :letter "a-z" token :number "0-9" generate "./src/my_lexer.cr" end
-
Generate the lexer:
crystal run ./build/my_lexer.cr
-
Use the compiled lexer:
require "./src/my_lexer.cr" MyLexer.scan(input) # => Array(Tuple(Symbol, String))
Tokens are defined with a small regular expression language:
Feature | Example |
---|---|
Character | a , 1 , ❤️ |
Sequence | ab , 123 |
Alternation | a|b |
(ab)|c |
|
. |
|
One of | [abc] |
[^abc] |
|
Escape | \[ , \. |
Unicode character range | a-z , 0-9 |
Zero-or-more | a* |
One-or-more | a+ |
Zero-or-one | a? |
Specific number | a{3} |
Between numbers | a{3,4} |
At least | a{3,} |
An Acorn
module is a Crystal program that generates code. To get a lexer, you have to run the Acorn
module. Then, your main program should use the generated code.
For example, if you define a lexer:
# build/my_lexer.cr
class MyLexer < Acorn::Lexer
# ...
generate("./app/my_lexer.cr")
end
You should run the file with Crystal to generate the specified file:
crystal run build/my_lexer.cr
Then, your main program should require
the generated file:
# my_app.cr
require "app/my_lexer"
MyLexer.scan(input) # => Array(Tuple(Symbol, String))
The generated code has no dependency on Acorn
, so you only need this library during development.
Acorn
returns an of array tokens. Each token is a tuple with:
Symbol
: the name of this token in the lexer definitionString
: the segment of input which matched the pattern: line number and column number where the token began{Int32, Int32}
: line number and column number where the token ended{Int32, Int32}
Line numbers and column numbers are 1-indexed, so the first character in the input is 1:1
.
Acorn
lexers are actually a special case of state machine. You can specify a custom machine, too.
class MyMachine < Acorn::Machine
to bring in the macrosalias Accumulator = ...
to specify the data that will be modified during the process- It will be initialized with
.new
- It will be returned from
.scan
- It will be initialized with
- use
action :name, "pattern" { |acc, str, ts, te| ... }
to define patternsacc
is an instance of yourAccumulator
str
is the original inputts
is the index instr
where this token begante
is the index instr
where this token ended (if the match is one character long,ts == te
)
- optionally,
name("MyNamespace::MachineName")
to rename the generated Crystal class
- rebuild fixtures with
crystal run spec/prepare.cr
crystal spec
Goals:
- Great runtime performance
- Linear time complexity
- Reasonable code size (
.cr
file and memory usage) - No runtime dependency
- Plain-Crystal API (source files are
.cr
, not a special format) - Easy to write a lexer, possible to write something else
Non-goals:
- Great compile-time performance (choose simplicity over performance)
- Fancy regexp features
- Finish regexp language (
(...)
,.
,[^...]
)- A move on
a
is also a move on:any
, how is that handled?
- A move on
- Better line/col awareness:
- Add line/col to tokens
- Add line/col to errors
LGPLv3