Considering supporting OCaml #449
Do you think re2c will be able to provide some advantage over these? The main strengths of re2c are generating fast code, providing a flexible user interface, and supporting submatch extraction based on TDFA. Do you see any of these features lacking in existing OCaml generators? Porting re2c to a functional language looks like an interesting challenge, but I'm trying to understand if it will be useful in practice.
The first thing would be to understand what you are going to generate and to construct the desired output by hand. Take a look at the standard re2c examples: they represent various use cases and different modes of using re2c. The examples will be among the first programs ported to the new language, so it's good to have an idea of how to express each of them in OCaml. Once you have an understanding of how this can be done, |
I've found the flexible UI to re2c to be the killer feature. Being able to
My gut feeling is to take advantage of OCaml's
I can mention this in that issue, but my worry is that the syntax file approach can no longer special-case behavior per-target. And while I understand you may not want to commit to a stable AST, there could be a third option. re2c can have its own internal IR for language-agnostic optimizations, and lower to a separate, stable IR that users can codegen from. |
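To make the "separate, stable IR" idea more concrete, here is a minimal sketch in OCaml. Everything in it is hypothetical and not part of re2c: the `transition`, `kind` and `state` types, the `emit_state` and `emit_lexer` functions, and the `peek`/`skip` helpers that the emitted text refers to (they stand in for user-supplied primitives such as YYPEEK and YYSKIP). It only illustrates how a backend could lower a small, language-agnostic description of an automaton into one tail-called function per state.

```ocaml
(* Hypothetical stable IR: these are NOT re2c's actual data structures. *)
type transition = {
  lo : char;      (* lower bound of the character range, inclusive *)
  hi : char;      (* upper bound of the character range, inclusive *)
  skip : bool;    (* advance the cursor before jumping? *)
  target : int;   (* index of the destination state *)
}

type kind =
  | Final of string                  (* semantic action, emitted verbatim *)
  | Branch of transition list * int  (* character ranges and a default target *)

type state = { id : int; kind : kind }

(* Emit the body of one state as OCaml source text (without "let rec"/"and"). *)
let emit_state (s : state) : string =
  match s.kind with
  | Final action -> Printf.sprintf "yy%d st = %s" s.id action
  | Branch (ts, default) ->
    let case t =
      if t.skip
      then Printf.sprintf "  | %C..%C -> skip st; yy%d st" t.lo t.hi t.target
      else Printf.sprintf "  | %C..%C -> yy%d st" t.lo t.hi t.target
    in
    String.concat "\n"
      ([ Printf.sprintf "yy%d st =" s.id; "  match peek st with" ]
       @ List.map case ts
       @ [ Printf.sprintf "  | _ -> yy%d st" default ])

(* Chain all states into one mutually recursive definition. *)
let emit_lexer (states : state list) : string =
  match List.map emit_state states with
  | [] -> ""
  | first :: rest ->
    String.concat "\n"
      (("let rec " ^ first) :: List.map (fun s -> "and " ^ s) rest)

(* The automaton for [1-9][0-9]* expressed in the hypothetical IR. *)
let number_automaton = [
  { id = 0; kind = Branch ([ { lo = '1'; hi = '9'; skip = true; target = 2 } ], 1) };
  { id = 1; kind = Final "false" };
  { id = 2; kind = Branch ([ { lo = '0'; hi = '9'; skip = true; target = 2 } ], 3) };
  { id = 3; kind = Final "true" };
]

let () = print_endline (emit_lexer number_automaton)
```

The point of the sketch is only that a user-facing IR of this shape would let a backend special-case behavior per target in ordinary code, which is the concern raised about a pure syntax-file approach.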
FWIW, I think supporting OCaml would be extremely cool. sedlex is pretty clunky, and re2c is much nicer. Maybe if #450 were implemented it would be fairly straightforward to do this. Btw, I disagree that OCaml lacks |
@smasher164 Any interest in working on this yourself (or with assistance)? |
FYI I'm slowly working on #450, which (if my experiment works out) should allow adding new backends with one config file. Aside from that, the first thing to start for any new backend is to manually construct examples of the code that re2c should generate (look at the examples to get the idea of what should be covered). |
I could probably manually create OCaml code examples if that would be of help for you in making #450 a reality. It's a very different language than the ones re2c currently supports so it might be a good test. |
This is approximately what re2c needs to generate for the simplest example 01_basic.re (the
open Printf
let rec lex s yystate cursor =
match yystate with
| 0 ->
let yych = s.[cursor] in
let c = cursor + 1 in
(match yych with
| '1'..'9' -> lex s 2 c
| _ -> lex s 1 c)
| 1 -> false
| 2 ->
let yych = s.[cursor] in
(match yych with
| '0'..'9' ->
let c = cursor + 1 in
lex s 2 c
| _ -> lex s 3 cursor)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
if not (lex "1234\x00" 0 0) then raise (Failure "error")
let _ = main ()
It is based on the loop/switch mode, with a recursive function call instead of the loop. We could probably use goto/label mode as well with every state as a separate function and It should be possible to use this new codegen mode with other languages, as they all have recursive functions. |
The most natural way to represent a state transition in OCaml is with a tail recursive function call; it will be much faster than most of the other options. |
To echo @pmetzger, the way I would approach codegen here would be to just use tail recursive calls, and treat them like you treat
let lex(s: string): bool =
let cursor : int ref = ref 0 in
let yych : char ref = ref s.[!cursor] in
let rec yy1(): bool =
cursor := !cursor + 1;
false
and yy2(): bool =
cursor := !cursor + 1;
yych := s.[!cursor];
(match !yych with
| '0'..'9' -> (yy2[@tailcall])()
| _ -> (yy3[@tailcall])())
and yy3(): bool = true in
match !yych with
| '1'..'9' -> (yy2[@tailcall])()
| _ -> (yy1[@tailcall])()
We have an outer So basically,
|
Note that the |
If your approach will require less work during codegen, then I think it makes sense to go down that route. Tbh I just assumed that the goto/label model was more convenient for you. |
The advantage of the "tail recursive function per state" approach is that it's going to have the highest possible performance. The code generator will turn it all into gotos (and then into jump instructions in machine language). This would also be the case for other functional languages like Haskell. Using the tailcall annotation in OCaml is not required, btw, but it ensures that the compiler will complain if the call is not, in fact, a tail call, thus avoiding mistakes. |
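A small illustration of the annotation: `[@tailcall]` asks the compiler to check that the annotated call is in tail position. In the stock compiler a violation is normally reported as a warning rather than a hard error, unless warnings are promoted to errors. The `count_digits` function below is a made-up example, not re2c output.

```ocaml
(* Hypothetical example: count the leading decimal digits of a string. *)
let count_digits (s : string) : int =
  let n = String.length s in
  let rec go i acc =
    if i < n && s.[i] >= '0' && s.[i] <= '9'
    then (go [@tailcall]) (i + 1) (acc + 1)  (* call is in tail position *)
    else acc
  in
  go 0 0

let () =
  (* A long input does not grow the stack, because [go] is tail recursive. *)
  let long = String.make 1_000_000 '7' in
  assert (count_digits long = 1_000_000);
  assert (count_digits "42abc" = 2)
```

Wrapping the annotated call in further work, for example `1 + (go [@tailcall]) (i + 1) acc`, moves it out of tail position, and that is the kind of mistake the annotation is meant to flag.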
Right, I also thought about performance. Trying to compare the generated code:
Program 1.ml uses recursive closures:
open Printf
type state = {
str: string;
mutable cur: int;
}
let lex st =
let rec yy0() =
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> (yy2 [@tailcall]) ()
| _ -> (yy1 [@tailcall]) ())
and yy1() = false
and yy2() =
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) ()
| _ -> (yy3 [@tailcall]) ())
and yy3() = true
in yy0 ()
let main () =
let st = { str = "1234\x00"; cur = 0; } in
if not (lex st) then raise (Failure "error")
let _ = main ()
Program 2.ml uses one recursive function with a match on
open Printf
type state = {
str: string;
mutable cur: int;
mutable state: int;
}
let rec lex st =
let yystate = st.state in
match yystate with
| 0 ->
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 1; (lex [@tailcall]) st)
| 1 -> false
| 2 ->
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 3; (lex [@tailcall]) st)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
let st = { str = "1234\x00"; cur = 0; state = 0; } in
if not (lex st) then raise (Failure "error")
let _ = main ()
Let's see what instructions are generated by ocamlopt:
Here's the second one:
re2c can generate both examples. Which one is faster? I haven't measured it yet. |
Will be curious about your measurements. Also, which compiler are you using? flambda will be more aggressive here. |
Some measurements with ocamlopt:
a.ml:
type state = {
str: string;
mutable cur: int;
}
let lex st =
let rec yy0() =
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> (yy2 [@tailcall]) ()
| _ -> (yy1 [@tailcall]) ())
and yy1() = false
and yy2() =
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) ()
| _ -> (yy3 [@tailcall]) ())
and yy3() = true
in yy0 ()
let main () =
let s = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\x00" in
for i = 1 to 100000000 do
let st = { str = s; cur = 0; } in
if not (lex st) then raise (Failure "error")
done
let _ = main ()
b.ml:
type state = {
str: string;
mutable cur: int;
mutable state: int;
}
let rec lex st =
let yystate = st.state in
match yystate with
| 0 ->
let yych = st.str.[st.cur] in
st.cur <- st.cur + 1;
(match yych with
| '1'..'9' -> st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 1; (lex [@tailcall]) st)
| 1 -> false
| 2 ->
let yych = st.str.[st.cur] in
(match yych with
| '0'..'9' ->
st.cur <- st.cur + 1;
st.state <- 2; (lex [@tailcall]) st
| _ -> st.state <- 3; (lex [@tailcall]) st)
| 3 -> true
| _ -> raise (Failure "internal lexer error")
let main () =
let s = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\x00" in
for i = 1 to 100000000 do
let st = { str = s; cur = 0; state = 0 } in
if not (lex st) then raise (Failure "error")
done
let _ = main ()
Compiled as:
Run time:
So, many recursive closures are considerably faster in this example. |
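One way to repeat such a comparison without external tools is a small in-process timer. This is only a sketch: `time_it` and the stand-in lexer `lex_digits` are hypothetical names, the iteration count is arbitrary, and `Sys.time` reports processor time, so the numbers will not match wall-clock tools such as `time` or `perf stat`.

```ocaml
(* Hypothetical helper: run [f ()] [n] times and return the elapsed
   processor time in seconds, as reported by Sys.time. *)
let time_it (n : int) (f : unit -> bool) : float =
  let t0 = Sys.time () in
  for _i = 1 to n do
    if not (f ()) then failwith "lexer rejected the input"
  done;
  Sys.time () -. t0

(* Stand-in for the lexer under test: matches the prefix [1-9][0-9]*
   of a NUL-terminated string, one tail-called function per state. *)
let lex_digits (s : string) : bool =
  let rec yy2 i =
    match s.[i] with
    | '0'..'9' -> (yy2 [@tailcall]) (i + 1)
    | _ -> true
  in
  match s.[0] with
  | '1'..'9' -> yy2 1
  | _ -> false

let () =
  let input = String.make 100 '1' ^ "\x00" in
  let secs = time_it 100_000 (fun () -> lex_digits input) in
  Printf.printf "100000 runs: %.3fs\n" secs
```

To compare the two variants, the same `time_it` call would be applied to each `lex` function in turn on identical input.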
perf stat on the same binaries shows more instructions and in particular more branches for b (which should be explained by the one extra conditional jump on
|
ocamlopt built with flambda support and -O3 gives approximately the same results, only both binaries are slightly faster (15.400s vs 21.624s). |
I would have expected both to be a bit faster; somewhat surprised that it isn't more than that, but perhaps this isn't something flambda is particularly good at optimizing. Anyway, I suppose this means that as suspected, the "goto" version is faster than the "big switch" version, and I suspect that the difference would be even bigger for very large state machines. |
("goto" in this case being a reference to "Lambda: The Ultimate Goto". Tail recursion is by far my favorite goto.) |
OCaml support has been added in experimental branch E.g. for the basic example:
(* re2ocaml $INPUT -o $OUTPUT -i *)
type state = {
str: string;
mutable cur: int;
}
/*!re2c
re2c:define:YYFN = ["lex;bool", "st;state"];
re2c:define:YYCTYPE = int;
re2c:define:YYPEEK = "Char.code st.str.[st.cur]";
re2c:define:YYSKIP = "st.cur <- st.cur + 1;";
re2c:yyfill:enable = 0;
number = [1-9][0-9]*;
number { true }
* { false }
*/
let main () =
let st = {str = "1234\x00"; cur = 0}
in if not (lex st) then raise (Failure "error")
let _ = main ()
re2ocaml generates the following code:
(* Generated by re2c *)
(* re2ocaml $INPUT -o $OUTPUT -i *)
type state = {
str: string;
mutable cur: int;
}
let rec yy0 (st : state) : bool =
let yych = Char.code st.str.[st.cur] in
st.cur <- st.cur + 1;
match yych with
| 0x31|0x32|0x33|0x34|0x35|0x36|0x37|0x38|0x39 -> (yy2 [@tailcall]) st
| _ -> (yy1 [@tailcall]) st
and yy1 (st : state) : bool =
false
and yy2 (st : state) : bool =
let yych = Char.code st.str.[st.cur] in
match yych with
| 0x30|0x31|0x32|0x33|0x34|0x35|0x36|0x37|0x38|0x39 ->
st.cur <- st.cur + 1;
(yy2 [@tailcall]) st
| _ -> (yy3 [@tailcall]) st
and yy3 (st : state) : bool =
true
and lex (st : state) : bool =
(yy0 [@tailcall]) st
let main () =
let st = {str = "1234\x00"; cur = 0}
in if not (lex st) then raise (Failure "error")
let _ = main ()
|
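Because `re2c:yyfill:enable = 0` disables buffer refilling, the generated code relies on the terminating `\x00` to stop before reading past the end of the string. A minimal sketch of what happens without the sentinel, assuming default compiler settings (bounds checks on, no `-unsafe`): the hand-written `lex` below mirrors the generated states, and `lex_checked` is a hypothetical wrapper added only for the demonstration.

```ocaml
type state = { str : string; mutable cur : int }

(* Hand-written mirror of the generated states yy0..yy3. *)
let lex (st : state) : bool =
  let rec yy0 () =
    let yych = st.str.[st.cur] in
    st.cur <- st.cur + 1;
    match yych with
    | '1'..'9' -> (yy2 [@tailcall]) ()
    | _ -> false
  and yy2 () =
    match st.str.[st.cur] with
    | '0'..'9' -> st.cur <- st.cur + 1; (yy2 [@tailcall]) ()
    | _ -> true
  in
  yy0 ()

(* Hypothetical wrapper: turn an out-of-bounds read into [None]. *)
let lex_checked (s : string) : bool option =
  try Some (lex { str = s; cur = 0 })
  with Invalid_argument _ -> None

let () =
  assert (lex_checked "1234\x00" = Some true);  (* sentinel stops the scan *)
  assert (lex_checked "x\x00" = Some false);
  assert (lex_checked "1234" = None)            (* no sentinel: read past the end *)
```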
That's amazing @skvadrik, I'm excited to play with this and get back to you! |
@skvadrik Should I publicize this among the OCaml community? I think some folks would find it interesting. |
Sounds great, but let's wait until the API is finalized. I think I'll implement a few more language backends via syntax configs, and I also plan to tweak a few things in OCaml backend. I don't expect major changes though. |
Python might be interesting, FWIW. |
OCaml is quite popular in language development, and has a dearth of lexer generators with good Unicode support. It's a bit different from the imperative languages that re2c currently supports, i.e. it doesn't support labelled break, continue, goto, or the ability to early return. However, it does have good support for tail-call optimization, exceptions, algebraic effects, and of course if-else (for the nested-ifs code generation).
Is there interest in supporting such a backend? If one were interested in contributing support, where should one look? src/codegen?
Thanks
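For instance, the missing early return can be emulated with a local exception. This is only an illustrative sketch (the `first_non_digit` function and the `Found` exception are made-up names), not a proposal for what re2c should emit:

```ocaml
(* Hypothetical example: emulate an early return from a loop with an exception. *)
exception Found of int

let first_non_digit (s : string) : int option =
  try
    for i = 0 to String.length s - 1 do
      match s.[i] with
      | '0'..'9' -> ()        (* keep scanning *)
      | _ -> raise (Found i)  (* plays the role of "return i" inside the loop *)
    done;
    None
  with Found i -> Some i

let () =
  assert (first_non_digit "123a5" = Some 3);
  assert (first_non_digit "2024" = None)
```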