-
Notifications
You must be signed in to change notification settings - Fork 9
feat(tdl): Add support to generate an ANTLR parser from the TaskDefLang's grammar. #201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds a minimal ANTLR4 grammar (TaskDefLang.g4), its generated C++ lexer/parser/visitor sources, build/task integrations and lint exclusions, a GitHub Actions workflow to verify generated code is committed, and links spider_tdl against the antlr4 runtime. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Dev as Developer/CI
participant GA as GitHub Actions
participant Task as go-task
participant ANTLR as ANTLR Tool
participant Git as Git
Dev->>GA: Trigger workflow (PR / push / schedule / manual)
GA->>Task: run build:tdl-generate-parsers
Task->>ANTLR: Generate C++ lexer/parser/visitor
ANTLR-->>Task: antlr_generated files
Task->>Task: Prune non-source files & compute checksum
GA->>Git: git diff --exit-code src/spider/tdl/parser/antlr_generated
alt changes detected
Git-->>GA: Diff non-empty
GA-->>Dev: Fail job (regeneration not committed)
else no changes
Git-->>GA: Clean
GA-->>Dev: Success
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (11)
src/spider/tdl/parser/TaskDefLang.g4 (2)
3-3: Confirm intent: parser only accepts empty/whitespace input (start: EOF).As written, any non-whitespace input will fail to parse. If this is a bootstrap placeholder to wire up generation/CI, all good. If the expectation is to parse any TDL constructs now, this rule will block that.
5-5: Whitespace skipping strategy looks fine; consider whether comments will be needed soon.Using skip drops whitespace completely, which is typical. If you’ll want to ignore comments as well (e.g., // or /* */), plan lexer rules now to avoid churn in generated files later.
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp (2)
8-14: Duplicate using directive is harmless but redundant.There are two consecutive “using namespace antlr4;” directives (Lines 8 and 13). It’s benign, but if you ever post-process or lint generated code locally, this could trigger duplicate-using warnings.
If you control the ANTLR invocation/template that emits this block, consider deduping the using directives in the generator settings/templates instead of hand-editing generated files.
104-106: Grammar path in header comment differs from repo path.Header states “Generated from tdl/parser/TaskDefLang.g4” while the file is under src/spider/tdl/parser. This is only cosmetic but can cause confusion in diffs/discussions. If feasible, generate from a consistent CWD so the path in comments matches the repo layout.
taskfiles/lint.yaml (2)
65-66: Broaden the exclude glob to be future-proof.Use a recursive glob so any nested structure under the generated dir stays excluded.
- EXCLUDE_PATTERNS: &cpp_exclude_patterns - - "tdl/parser/antlr_generated/*" + EXCLUDE_PATTERNS: &cpp_exclude_patterns + - "**/tdl/parser/antlr_generated/**"
107-116: Apply the same exclude patterns to the examples clang-tidy pass (consistency).Functionally this pass only scans examples and won’t hit the TDL path, but applying the same excludes keeps behaviour consistent if paths change later.
- task: ":utils:cpp-lint:clang-tidy-find" vars: FLAGS: - "--config-file '{{.ROOT_DIR}}/.clang-tidy'" - "-p '{{.G_EXAMPLES_COMPILE_COMMANDS_DB}}'" + EXCLUDE_PATTERNS: *cpp_exclude_patterns INCLUDE_FILENAME_PATTERNS: ["*.cpp", "*.h", "*.hpp"] OUTPUT_DIR: "{{.G_LINT_CLANG_TIDY_DIR}}" ROOT_PATHS: - "{{.G_EXAMPLES_DIR}}" VENV_DIR: "{{.G_LINT_VENV_DIR}}".github/workflows/tdl-generated-code-checks.yaml (2)
31-34: Pin go-task CLI installation for determinism.Installing latest via npm can break the workflow unexpectedly. Prefer pinning a specific @go-task/cli version or using the official setup action and pinning it to a commit.
Would you like me to propose a concrete change (either pinning a specific @go-task/cli version or switching to the official setup action) once you confirm which approach you prefer?
39-47: Use git diff --exit-code for a simpler, clearer check.This avoids grep and returns a non-zero exit code on changes automatically.
- name: "Check if the generated parsers are the latest" shell: "bash" - run: - | - git status --porcelain \ - src/spider/tdl/parser/antlr_generated \ - | grep . > /dev/null \ - && exit 1 \ - || exit 0 + run: | + set -euo pipefail + git diff --exit-code -- src/spider/tdl/parser/antlr_generatedtaskfiles/build.yaml (3)
38-41: Add ANTLR tool jar tosourcesto avoid stale caches when generator version changesRight now the task’s up-to-date check only depends on the grammar file. If the ANTLR jar changes (e.g., tool upgrade), Taskfile may skip this task and miss a necessary regen. Include the jar path in
sourcesto bust the cache when the generator changes.Apply this diff:
sources: - - "{{.G_SRC_SPIDER_DIR}}/tdl/parser/TaskDefLang.g4" + - "{{.G_SRC_SPIDER_DIR}}/tdl/parser/TaskDefLang.g4" + - "{{.G_ANTLR_JAR_FILE}}"
31-37: Confirm the intent of running checksum validation before generationRunning
:utils:checksum:validateas a dep makes the task fail early if committed artifacts don’t match the recorded checksum, which is perfect for CI “enforce-committed-artifacts” behaviour. For local “regen-and-update” workflows, however, this will block regeneration until the checksum is fixed manually.If that dual-purpose is intended, all good. If not, consider splitting into:
- tdl-generate-parsers: regenerates and computes checksum
- tdl-check-generated: only validates checksum (no generation)
Then have CI call the check task, and devs call the generate task.
42-55: Make cleanup idempotent and robust when the output directory is absent
rm -rf "{{.OUTPUT_DIR}}/"*can fail if the directory doesn’t exist or if the glob doesn’t expand. Also, prefer-deleteover-exec rmfor simplicity and safety.Refactor as below:
- Ensure the output dir exists (
mkdir -p)- Remove the directory entirely, then recreate it (avoids brittle globs)
- Use
find ... -deleteto prune non-source files- - |- - rm -rf "{{.OUTPUT_DIR}}/"* \ - && java -jar {{.G_ANTLR_JAR_FILE}} \ + - |- + rm -rf "{{.OUTPUT_DIR}}" + mkdir -p "{{.OUTPUT_DIR}}" + java -jar {{.G_ANTLR_JAR_FILE}} \ -Dlanguage=Cpp \ -no-listener \ -visitor \ -package spider::tdl::parser::antlr_generated \ -o tdl/parser/antlr_generated \ -Xexact-output-dir \ tdl/parser/TaskDefLang.g4 \ - && find "{{.OUTPUT_DIR}}" \ - -type f -and -not \( -name "*.cpp" -o -name "*.h" \) \ - -exec rm {} \; + && find "{{.OUTPUT_DIR}}" -type f ! -name "*.cpp" ! -name "*.h" -delete
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
.github/workflows/tdl-generated-code-checks.yaml(1 hunks)src/spider/tdl/parser/TaskDefLang.g4(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.cpp(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.h(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.cpp(1 hunks)src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.h(1 hunks)taskfiles/build.yaml(1 hunks)taskfiles/deps.yaml(0 hunks)taskfiles/lint.yaml(3 hunks)
💤 Files with no reviewable changes (1)
- taskfiles/deps.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-17T19:44:06.132Z
Learnt from: sitaowang1998
PR: y-scope/spider#168
File: lint-tasks.yaml:62-65
Timestamp: 2025-07-17T19:44:06.132Z
Learning: ANTLR-generated C++ code in the G_SRC_DSL_DIR (src/stdl) should not be included in linting tasks because it's auto-generated code that doesn't follow manual coding standards.
Applied to files:
taskfiles/lint.yaml
🧬 Code Graph Analysis (6)
src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.h (2)
src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h (1)
spider(11-25)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h (1)
spider(10-64)
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.h (1)
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp (18)
TaskDefLangLexer(95-98)TaskDefLangLexer(100-102)getGrammarFileName(104-106)getGrammarFileName(104-104)getRuleNames(108-110)getRuleNames(108-108)getChannelNames(112-114)getChannelNames(112-112)getModeNames(116-118)getModeNames(116-116)getVocabulary(120-122)getVocabulary(120-120)getSerializedATN(124-126)getSerializedATN(124-124)getATN(128-130)getATN(128-128)initialize(135-141)initialize(135-135)
src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp (2)
src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h (1)
TaskDefLangParser(13-64)src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp (12)
initialize(135-141)initialize(135-135)getATN(128-130)getATN(128-128)getGrammarFileName(104-106)getGrammarFileName(104-104)getRuleNames(108-110)getRuleNames(108-108)getVocabulary(120-122)getVocabulary(120-120)getSerializedATN(124-126)getSerializedATN(124-124)
src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h (3)
src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.h (2)
spider(11-28)TaskDefLangVisitor(17-26)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h (1)
spider(10-64)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp (1)
StartContext(117-119)
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp (2)
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.h (1)
TaskDefLangLexer(13-49)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp (10)
getGrammarFileName(98-100)getGrammarFileName(98-98)getRuleNames(102-104)getRuleNames(102-102)getVocabulary(106-108)getVocabulary(106-106)getSerializedATN(110-112)getSerializedATN(110-110)getATN(94-96)getATN(94-94)
src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h (4)
src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h (1)
spider(11-25)src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.h (1)
spider(10-51)src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.h (1)
spider(11-28)src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp (21)
TaskDefLangParser(83-83)TaskDefLangParser(85-88)TaskDefLangParser(90-92)getGrammarFileName(98-100)getGrammarFileName(98-98)getATN(94-96)getATN(94-94)getRuleNames(102-104)getRuleNames(102-102)getVocabulary(106-108)getVocabulary(106-106)getSerializedATN(110-112)getSerializedATN(110-110)StartContext(117-119)getRuleIndex(126-128)getRuleIndex(126-126)EOF(121-123)EOF(121-121)accept(131-136)accept(131-131)start(138-138)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint
🔇 Additional comments (11)
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.h (1)
13-21: Generated header looks standard and consistent with the grammar.Token enum aligns with the single lexer rule (SPACE), overrides match the ANTLR C++ API, and the static initialize hook is present. Nothing to flag here.
src/spider/tdl/parser/antlr_generated/TaskDefLangLexer.cpp (1)
95-102: Interpreter lifecycle is correctly managed.Constructor initializes the LexerATNSimulator and the destructor deletes it. No leaks or double-deletes evident.
src/spider/tdl/parser/antlr_generated/TaskDefLangParser.h (2)
13-21: Parser interface wiring looks correct for a minimal grammar.Token and rule enums, StartContext, and start() entrypoint are all consistent with start: EOF. No issues.
56-60: Static init hook is useful; verify it’s called consistently across all entrypoints.You’re exposing initialize(), which is good. Ensure upstream code constructs the lexer/parser in the same way as here so the static state is consistently initialised before use.
src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.cpp (1)
5-9: Visitor skeleton is fine for now.No implementation needed yet given the trivial grammar. Ready to extend once parser rules are added.
src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.cpp (1)
5-8: Generated stub looks correct.Includes and namespace usage are appropriate for a generated .cpp; nothing to fix here.
src/spider/tdl/parser/antlr_generated/TaskDefLangVisitor.h (1)
7-28: Visitor interface looks good.Header guards, includes, namespace and the pure virtual visitStart signature align with the generated parser types.
src/spider/tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h (1)
17-25: Base visitor default implementation is correct.Delegating to visitChildren(ctx) is the expected default for generated visitors.
taskfiles/build.yaml (2)
48-49: Namespace configuration matches generated codeThe
-package spider::tdl::parser::antlr_generatedoption is reflected in the generated sources (e.g.,using namespace spider::tdl::parser::antlr_generated;). Looks consistent.
56-60: Checksum computed last is correctKeeping
:utils:checksum:computeas the last command guarantees it fingerprints the final pruned set of generated sources. Good call.src/spider/tdl/parser/antlr_generated/TaskDefLangParser.cpp (1)
1-171: Generated artefact looks standard and consistent with ANTLR 4.13.2 outputThis matches typical ANTLR C++ parser scaffolding: static data holder, thread-safe initialization, Start rule implementation, and Visitor integration. Nothing stands out as customized or risky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/spider/CMakeLists.txt (1)
258-263: Wiring generated and shared sources into spider_tdl looks correctTarget composition is clear. Consider marking the generated directory as a SYSTEM include to suppress warnings from generated headers, if they get noisy.
You can add this (outside the shown hunk):
# Treat generated headers as system to reduce warning noise from third-party generated code target_include_directories(spider_tdl SYSTEM PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/tdl/parser/antlr_generated)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
CMakeLists.txt(1 hunks)src/spider/CMakeLists.txt(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: non-storage-unit-tests (ubuntu-22.04)
- GitHub Check: non-storage-unit-tests (ubuntu-24.04)
- GitHub Check: lint
🔇 Additional comments (1)
src/spider/CMakeLists.txt (1)
200-212: Explicitly listing generated sources looks goodKeeping generated files explicit (and cached) avoids non-determinism from globbing. This is friendly to build reproducibility and the “generated-code-checks” workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/spider/CMakeLists.txt (1)
200-212: Optionally mute warnings for committed ANTLR-generated artefacts to keep -Werror builds greenNice to see the generated files committed and scoped via CACHE INTERNAL. If your CI uses strict warning levels, consider downgrading warnings on these third‑party generated sources only.
Apply this diff right after the variable definition:
set(SPIDER_TDL_ANTLR_GENERATED_SOURCES tdl/parser/antlr_generated/TaskDefLangBaseVisitor.cpp tdl/parser/antlr_generated/TaskDefLangBaseVisitor.h tdl/parser/antlr_generated/TaskDefLangLexer.cpp tdl/parser/antlr_generated/TaskDefLangLexer.h tdl/parser/antlr_generated/TaskDefLangParser.cpp tdl/parser/antlr_generated/TaskDefLangParser.h tdl/parser/antlr_generated/TaskDefLangVisitor.cpp tdl/parser/antlr_generated/TaskDefLangVisitor.h CACHE INTERNAL "spider task definition language ANTLR generated source files" ) +set_source_files_properties( + ${SPIDER_TDL_ANTLR_GENERATED_SOURCES} + PROPERTIES + COMPILE_OPTIONS + "$<$<OR:$<CXX_COMPILER_ID:GNU>,$<CXX_COMPILER_ID:Clang>>:-w>;$<$<CXX_COMPILER_ID:MSVC>:/W0>" +)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
src/spider/CMakeLists.txt(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:39:33.009Z
Learnt from: LinZhihao-723
PR: y-scope/spider#201
File: src/spider/CMakeLists.txt:268-274
Timestamp: 2025-08-20T19:39:33.009Z
Learning: The standard practice for linking ANTLR4 C++ runtime in CMake is to use `antlr4_static` directly, as shown in official ANTLR4 demos and used consistently in projects like CLP core and spider.
Applied to files:
src/spider/CMakeLists.txt
📚 Learning: 2025-08-20T19:39:33.009Z
Learnt from: LinZhihao-723
PR: y-scope/spider#201
File: src/spider/CMakeLists.txt:268-274
Timestamp: 2025-08-20T19:39:33.009Z
Learning: Using `antlr4_static` directly is the official, documented way to link the ANTLR4 C++ runtime statically in CMake, as shown in official ANTLR4 examples and documentation. There is no need to use variables like `ANTLR4_RUNTIME_TARGET` - the target names `antlr4_static` and `antlr4_shared` are the official exported targets from the ANTLR4 CMake package.
Applied to files:
src/spider/CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: lint
- GitHub Check: non-storage-unit-tests (ubuntu-22.04)
- GitHub Check: non-storage-unit-tests (ubuntu-24.04)
🔇 Additional comments (3)
src/spider/CMakeLists.txt (3)
258-264: Good separation of private sources; ANTLR bits kept implementation‑privateUsing a single PRIVATE block for both generated and shared sources is clean and keeps ANTLR headers out of the public interface. No issues here.
268-275: Linking against antlr4_static is the right call hereThis aligns with upstream ANTLR CMake usage and your project’s established pattern (per your prior note and our retrieved learnings). Looks good.
268-275: Verify ANTLR4CPP_STATIC is propagated when linking staticallyMost ANTLR CMake packages export ANTLR4CPP_STATIC via the antlr4_static target’s interface, but some environments don’t. Quick check recommended; if missing, add the definition on spider_tdl.
Drop this CMake snippet temporarily to confirm and auto‑patch if needed:
# Diagnostic: ensure antlr4_static propagates ANTLR4CPP_STATIC get_target_property(_antlr_defs antlr4_static INTERFACE_COMPILE_DEFINITIONS) message(STATUS "antlr4_static INTERFACE_COMPILE_DEFINITIONS='${_antlr_defs}'") if(NOT _antlr_defs) message(WARNING "antlr4_static exports no INTERFACE_COMPILE_DEFINITIONS; setting ANTLR4CPP_STATIC on spider_tdl") target_compile_definitions(spider_tdl PRIVATE ANTLR4CPP_STATIC) else() list(FIND _antlr_defs "ANTLR4CPP_STATIC" _idx) if(_idx EQUAL -1) message(WARNING "ANTLR4CPP_STATIC not present; setting it on spider_tdl") target_compile_definitions(spider_tdl PRIVATE ANTLR4CPP_STATIC) endif() endif()Note: I used the retrieved learnings to avoid re‑suggesting the previously debated change about using a variable instead of antlr4_static.
Description
This PR adds a task to generate the parser for TaskDefLang, from a g4 grammar file. Similar to CLP core, we don't generate the parser at build time. Instead, we commit the generated parser code to the codebase and create a dedicated workflow to check that the generated parser matches the committed code for every commit.
Compared to CLP core, which needs to generate both kql and sql parsers, the current TDL only needs one. Therefore, the task file is modified accordingly so that it only applies to a single grammar file.
Since the generated files are also C++ source files, we update the linting tasks to skip the generated files for both formatting and static checks.
Even though the parser is not used, we still add it into the TDL CMake target and build it with the antlr4_runtime library. This ensures the generated code can be compiled.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Chores