Skip to content

feat(clp-package): Add clp-config.yaml JSON schema generation and package the config schema.#1654

Merged
junhaoliao merged 1 commit intoy-scope:mainfrom
junhaoliao:json-schema
Nov 26, 2025
Merged

feat(clp-package): Add clp-config.yaml JSON schema generation and package the config schema.#1654
junhaoliao merged 1 commit intoy-scope:mainfrom
junhaoliao:json-schema

Conversation

@junhaoliao
Copy link
Member

@junhaoliao junhaoliao commented Nov 24, 2025

Description

This change introduces automated generation and packaging of the CLP configuration
JSON schema.

Key updates:

  • Adds a new Python script generate-config-schemas.py which:
    • Uses pydantic and ClpConfig to generate JSON schema.
    • Injects a $schema field via a custom schema generator.
  • Adds a yaml-language-server: $schema directive to clp-config.yaml for editor / LS support.
  • Introduces a new generate-config-schemas Taskfile task that:
    • Clears and recreates the schema build directory.
    • Runs tools/scripts/generate-config-schemas.py to produce
      clp-config.schema.json.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensured the generated schema is included in the packaged output under build/clp-package/usr/share/config-schemas/.
  • Tested with PyCharm that auto completion & schema checking were working for package and logs_input properties.

Summary by CodeRabbit

Release Notes

  • New Features

    • Configuration schemas are now automatically generated and included in package distributions
  • Chores

    • Enhanced build workflow to generate and integrate configuration schemas during the packaging process

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 24, 2025

Walkthrough

Changes introduce automated JSON Schema generation for CLP configuration files. A new Python script generates schemas from the Pydantic CLP config model, integrated into the build system as a new task. The generated schema is referenced in the configuration file via a language server directive, and packaged as an artifact.

Changes

Cohort / File(s) Summary
Configuration schema generation setup
components/package-template/src/etc/clp-config.yaml, tools/scripts/generate-config-schemas.py
Added YAML language server directive referencing schema file. Introduced new Python script that generates JSON Schema for CLP config using custom Pydantic schema generator, writing output to specified directory.
Build system integration
taskfile.yaml
Added G_CONFIG_SCHEMA_BUILD_DIR variable for schema artifact storage. Created new generate-config-schemas task to produce schemas. Extended package task to depend on schema generation and copy generated schemas to usr/share/config-schemas/ in package output.

Sequence Diagram(s)

sequenceDiagram
    participant Build as Build Task
    participant Generate as generate-config-schemas
    participant Script as generate-config-schemas.py
    participant Schema as JSON Schema
    participant Package as package Task

    Build->>Generate: execute task
    Generate->>Generate: create OUTPUT_DIR
    Generate->>Script: invoke with OUTPUT_DIR
    Script->>Script: load ClpConfig model
    Script->>Script: generate schema via GenerateJsonSchemaWithSchemaKey
    Script->>Schema: inject $schema field
    Script->>Schema: write clp-config.schema.json
    Build->>Package: dependency ready
    Package->>Package: copy schema to usr/share/config-schemas/
    Package->>Package: include in artifact
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • taskfile.yaml requires attention to verify correct task sequencing, variable usage, and output directory handling during packaging
  • generate-config-schemas.py needs review of the custom Pydantic schema generator implementation and schema field injection logic
  • clp-config.yaml schema path reference should be validated against the actual generated output location

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically summarizes the main changes: adding JSON schema generation for clp-config.yaml and packaging the schema.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@junhaoliao junhaoliao changed the title feat(clp-package): Generate and package JSON schema file for clp-config.yaml. feat(clp-package): Add clp-config.yaml JSON schema generation and package the config schema. Nov 24, 2025
@junhaoliao junhaoliao marked this pull request as ready for review November 24, 2025 17:04
@junhaoliao junhaoliao requested a review from a team as a code owner November 24, 2025 17:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f79349 and 51c7772.

📒 Files selected for processing (3)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • taskfile.yaml (5 hunks)
  • tools/scripts/generate-config-schemas.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: gibber9809
Repo: y-scope/clp PR: 955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.
📚 Learning: 2025-05-28T18:33:30.155Z
Learnt from: anlowee
Repo: y-scope/clp PR: 925
File: taskfiles/deps/main.yaml:97-106
Timestamp: 2025-05-28T18:33:30.155Z
Learning: In the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .cmake settings files that are consumed by the main CMake build process. These files set variables like ANTLR_RUNTIME_HEADER to point to dependency locations for use during compilation.

Applied to files:

  • taskfile.yaml
📚 Learning: 2025-09-25T19:26:32.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1335
File: taskfiles/docker-images.yaml:15-15
Timestamp: 2025-09-25T19:26:32.436Z
Learning: In the CLP project's Taskfile, the `:package` task creates the G_PACKAGE_BUILD_DIR directory structure, so any task that depends on `:package` (like `docker-images:package`) can safely assume this directory exists without needing additional mkdir commands.

Applied to files:

  • taskfile.yaml
📚 Learning: 2025-07-25T21:29:48.947Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1126
File: .gitignore:5-5
Timestamp: 2025-07-25T21:29:48.947Z
Learning: In the CLP project, the .clang-format file is maintained in the yscope-dev-utils submodule and copied over to the main CLP repository, so it should be ignored in .gitignore to prevent accidental commits of the copied file and maintain the single source of truth in the submodule.

Applied to files:

  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.

Applied to files:

  • components/package-template/src/etc/clp-config.yaml
🧬 Code graph analysis (1)
tools/scripts/generate-config-schemas.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • ClpConfig (623-861)
🪛 Ruff (0.14.5)
tools/scripts/generate-config-schemas.py

26-26: Dynamically typed expressions (typing.Any) are disallowed in *args

(ANN401)


26-26: Dynamically typed expressions (typing.Any) are disallowed in **kwargs

(ANN401)


26-26: Dynamically typed expressions (typing.Any) are disallowed in generate

(ANN401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: rust-checks
🔇 Additional comments (6)
tools/scripts/generate-config-schemas.py (3)

1-10: LGTM! Inline script metadata is well-structured.

The uv inline script metadata correctly specifies dependencies and uses an editable install for the local clp_py_utils package. This approach ensures the script can be executed directly with proper dependency management.


23-34: LGTM! Clean extension of Pydantic's schema generator.

The custom generator correctly injects the $schema field by accessing self.schema_dialect from the parent class. The static analysis warnings about Any types are false positives—they're necessary when overriding parent methods with dynamic signatures.


37-47: Consider adding basic error handling.

The script logic is correct but lacks error handling for potential failures during schema generation or file writing. While the script will naturally exit with an error if something goes wrong, explicit error handling with informative messages would improve debuggability.

Consider wrapping the schema generation and file writing in a try-except block:

     output_dir = args.output_dir
     output_dir.mkdir(exist_ok=True)
 
-    schema = ClpConfig.model_json_schema(schema_generator=GenerateJsonSchemaWithSchemaKey)
-    with Path.open(output_dir / "clp-config.schema.json", "w") as f:
-        json.dump(schema, f, indent=2)
+    try:
+        schema = ClpConfig.model_json_schema(schema_generator=GenerateJsonSchemaWithSchemaKey)
+        with Path.open(output_dir / "clp-config.schema.json", "w") as f:
+            json.dump(schema, f, indent=2)
+    except Exception as e:
+        print(f"Error generating schema: {e}", file=sys.stderr)
+        sys.exit(1)
⛔ Skipped due to learnings
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.
components/package-template/src/etc/clp-config.yaml (1)

1-1: LGTM! Schema directive is correctly configured.

The yaml-language-server directive properly references the generated schema file using the correct relative path. This will enable IDE/LSP features like autocompletion and validation for developers editing the configuration file.

taskfile.yaml (2)

25-25: LGTM! Build directory variable follows project conventions.

The G_CONFIG_SCHEMA_BUILD_DIR variable is properly defined and follows the project's naming conventions for build directory paths.


113-164: LGTM! Package task correctly integrates schema generation.

The modifications properly integrate the config schema into the packaging workflow:

  • The generated schema file is listed as a source dependency
  • The generate-config-schemas task is a dependency, ensuring schemas are generated before packaging
  • The rsync command correctly copies the schema to the package output directory

The trailing slash on the rsync source path correctly copies the directory contents rather than the directory itself.

Comment on lines +231 to +238
generate-config-schemas:
vars:
OUTPUT_DIR: "{{.G_CONFIG_SCHEMA_BUILD_DIR}}"
cmds:
- |-
rm -rf "{{.OUTPUT_DIR}}"
mkdir "{{.OUTPUT_DIR}}"
./tools/scripts/generate-config-schemas.py "{{.OUTPUT_DIR}}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding sources/generates for task caching.

The task implementation is correct, but it currently runs unconditionally. Adding sources and generates fields would allow Task to skip schema generation when inputs haven't changed, improving build performance.

Consider adding sources and generates:

   generate-config-schemas:
     vars:
       OUTPUT_DIR: "{{.G_CONFIG_SCHEMA_BUILD_DIR}}"
+    sources:
+      - "{{.TASKFILE}}"
+      - "components/clp-py-utils/clp_py_utils/clp_config.py"
+      - "tools/scripts/generate-config-schemas.py"
+    generates:
+      - "{{.OUTPUT_DIR}}/clp-config.schema.json"
     cmds:
       - |-
         rm -rf "{{.OUTPUT_DIR}}"
         mkdir "{{.OUTPUT_DIR}}"
         ./tools/scripts/generate-config-schemas.py "{{.OUTPUT_DIR}}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
generate-config-schemas:
vars:
OUTPUT_DIR: "{{.G_CONFIG_SCHEMA_BUILD_DIR}}"
cmds:
- |-
rm -rf "{{.OUTPUT_DIR}}"
mkdir "{{.OUTPUT_DIR}}"
./tools/scripts/generate-config-schemas.py "{{.OUTPUT_DIR}}"
generate-config-schemas:
vars:
OUTPUT_DIR: "{{.G_CONFIG_SCHEMA_BUILD_DIR}}"
sources:
- "{{.TASKFILE}}"
- "components/clp-py-utils/clp_py_utils/clp_config.py"
- "tools/scripts/generate-config-schemas.py"
generates:
- "{{.OUTPUT_DIR}}/clp-config.schema.json"
cmds:
- |-
rm -rf "{{.OUTPUT_DIR}}"
mkdir "{{.OUTPUT_DIR}}"
./tools/scripts/generate-config-schemas.py "{{.OUTPUT_DIR}}"
🤖 Prompt for AI Agents
In taskfile.yaml around lines 231 to 238, the generate-config-schemas task
currently always runs; add Task caching by declaring a sources list (all inputs
that affect schema generation, e.g., the generator script path
tools/scripts/generate-config-schemas.py, any config schema source directories
or templates, and repo files that influence output) and a generates list
pointing to the produced artifacts (the OUTPUT_DIR and a representative file
pattern under it, e.g., {{.G_CONFIG_SCHEMA_BUILD_DIR}}/** or a specific
generated schema file) so the task runtime can skip execution when sources are
unchanged and the generates are present.

@junhaoliao junhaoliao merged commit 65ac5d7 into y-scope:main Nov 26, 2025
30 checks passed
cmds:
- |-
rm -rf "{{.OUTPUT_DIR}}"
mkdir "{{.OUTPUT_DIR}}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use mkdir -p to create any missing parent paths because the build directory may not be available when the task is run

otherwise, mkdir may complain that the path does not exist

@coderabbitai create an issue to track

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the mkdir -p usage in generate-config-schemas task
gh issue create \
  --title "Use 'mkdir -p' in generate-config-schemas task to create parent directories" \
  --body "## Description

The \`generate-config-schemas\` task in \`taskfile.yaml\` currently uses \`mkdir\` without the \`-p\` flag, which may fail if the parent \`build\` directory does not exist when the task is run.

Other tasks in the taskfile consistently use \`mkdir -p\` to ensure parent directories are created (e.g., \`init\`, \`package\`, \`webui\`, \`python-libs\`, \`download-and-extract-tar\`).

## Context

Raised by @junhaoliao in PR #1654:
https://github.com/y-scope/clp/pull/1654#discussion_r2562824787

## Location

File: \`taskfile.yaml\`
Task: \`generate-config-schemas\`" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1669 to track this improvement: #1669

You're absolutely right that mkdir -p should be used to ensure parent directories are created if they don't exist. This aligns with the pattern used consistently throughout the rest of the taskfile in tasks like init, package, webui, python-libs, and download-and-extract-tar.


🧠 Learnings used
Learnt from: anlowee
Repo: y-scope/clp PR: 925
File: taskfiles/deps/main.yaml:97-106
Timestamp: 2025-05-28T18:33:30.155Z
Learning: In the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .cmake settings files that are consumed by the main CMake build process. These files set variables like ANTLR_RUNTIME_HEADER to point to dependency locations for use during compilation.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1335
File: taskfiles/docker-images.yaml:15-15
Timestamp: 2025-09-25T19:26:32.436Z
Learning: In the CLP project's Taskfile, the `:package` task creates the G_PACKAGE_BUILD_DIR directory structure, so any task that depends on `:package` (like `docker-images:package`) can safely assume this directory exists without needing additional mkdir commands.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Learnt from: hoophalab
Repo: y-scope/clp PR: 1293
File: taskfiles/codegen.yaml:26-33
Timestamp: 2025-09-10T22:13:51.907Z
Learning: ANTLR automatically creates the output directory specified by the -o flag, so explicit directory creation with mkdir -p is not needed before running ANTLR commands.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants