A tiny regular expression evaluator based on Abstract Syntax Tree (AST) in Python.
This project provides a simple regex engine that first parses a regular expression into an AST and then evaluates it against a given text.
.: Any single character*: Zero or more repetitions of the preceding character+: One or more repetitions of the preceding character?: Zero or one repetition of the preceding character|: Alternation (OR)(): Grouping^: Start of the string anchor$: End of the string anchor
You can execute the regex match from the command line using the matchbox module. It supports both full matching and searching.
# Full match (the entire text must match the pattern)
python -m matchbox <pattern> <text>
# Search (find a substring that matches the pattern)
python -m matchbox <pattern> <text> --searchExamples:
# Full match
python -m matchbox "a(b|c)*d" "acccbd"
# ✅ Matched
# Search
python -m matchbox "a*b" "xxaaabyy" --search
# ✅ MatchedUse the --trace option to visualize the backtracking process of the AST evaluator. This is useful for debugging your patterns. The output is colorized using the rich library.
python -m matchbox "a(b|c)d" "acd" --traceYou can also run an interactive prompt.
python main.pyExample:
Mini Regex Engine (matchbox)
Enter regex: a*b
Enter text: aaab
✅ MATCH!
This regex engine is composed of three main components:
-
Lexer (
lexer.py): The lexer takes a regular expression string and breaks it down into a sequence of tokens (e.g.,CHAR,*,|). -
Parser (
parser.py): The parser consumes the token stream from the lexer and builds an Abstract Syntax Tree (AST). The nodes of this tree (ast_nodes.py) represent the structure of the regular expression (e.g.,ConcatNode,OrNode,RepeatNode). -
Evaluator (
evaluator.py): The evaluator traverses the AST to determine if the pattern matches the input text. Thefullmatchandsearchfunctions are the entry points, which recursively call thematchmethod on each node. This AST-based approach allows for handling complex nested structures and provides clear evaluation paths, which can be traced for debugging.
mini-regex/
├── main.py # Entry point for interactive mode
├── requirements.txt # Dependencies
├── matchbox/
│ ├── __main__.py # Entry point for CLI mode
│ ├── lexer.py # Lexical analysis
│ ├── parser.py # AST construction
│ ├── ast_nodes.py # AST node definitions
│ └── evaluator.py # AST-based regex evaluation
└── tests/
├── test_evaluator.py # Tests for the AST-based evaluator
├── test_parser.py # Tests for the parser
└── test_matchbox.py # Legacy tests (can be merged or removed)
To run the test suite, install the dependencies and run pytest.
pip install -r requirements.txt
pytestThis project uses black for code formatting and ruff for linting to ensure code quality and consistency.
# Format code with black
.venv/bin/black .
# Lint and auto-fix with ruff
.venv/bin/ruff check --fix .Currently, both tools run with their default settings as there is no specific configuration file.
To customize their behavior, you can create a pyproject.toml file in the project root. For example:
# pyproject.toml
[tool.black]
line-length = 88
[tool.ruff]
line-length = 88
select = ["E", "F", "I"] # Enable specific linting rules