feat(tools): C/C++ stdlib registry generator + Linux overlay #679

Open
shivasurya wants to merge 3 commits into main from shiva/cpp-phase2-pr01-generator

Conversation

@shivasurya
Owner

PR-01 of the C/C++ Phase 2 stdlib-resolution stack. Closes the 88% unresolved-call gap from Phase 1 by giving us a generator that walks installed system headers and emits per-header JSON manifests the loader (PR-02) will consume.

What's in this PR

  • graph/callgraph/core/clike_stdlib_types.go — public schema (CStdlibRegistry, CStdlibManifest, CStdlibHeader, function/class/typedef/constant/param types). Stable contract between this PR's generator and PR-02's loader.
  • tools/internal/clikeextract/ — new internal package, one concern per file:
    • walker.go — DiscoverHeaderSources(linux/c, linux/cpp), glibc/libstdc++ probing, deterministic header walk with bits/ and internal/ skip rules.
    • c_extractor.go, cpp_extractor.go — tree-sitter walkers reusing graph/clike helpers; C++ adds template-parameter capture, namespace-stack tracking, and a local findFunctionDeclarator that handles C++ reference_declarator (Phase 1's clike helper doesn't, blocking T&-returning methods like vector::operator[] / vector::at).
    • overlay.go — yaml.v3 loader + MergeOverlay with strict validation (language match, exactly-one-of function/method/typedef/constant, skip-rule shape).
    • emitter.go — per-header JSON write + sha256 checksums + deterministic manifest.json.
    • normalize.go — strip __attribute__, _GLIBCXX_*, _LIBCPP_*, canonicalize std::__cxx11:: -> std::, SanitizeHeaderName.
    • extractor.go — Run() orchestrator (continue-on-parse-failure, mirrors goextract).
  • tools/c_stdlib_overlay.yaml (28 entries) — security-critical sinks: format-string (printf family with __attribute__((format))), command-injection (system, popen, exec*), buffer-overflow (strcpy, gets, sprintf), allocation/tainted-source markers.
  • tools/cpp_stdlib_overlay.yaml (55 entries) — STL methods whose template return types tree-sitter cannot substitute: vector / basic_string / unique_ptr / shared_ptr / optional / map / unordered_map; std::move, std::forward; throws annotations on at() / value().
  • tools/generate_clike_stdlib_registry.go — thin //go:build cpf_generate_stdlib_registry entry-point wiring CLI flags into clikeextract.Extractor.
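
The normalization step listed for normalize.go can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name normalizeType and the regular expressions are assumptions, and the sketch only handles bare _GLIBCXX_* / _LIBCPP_* tokens (not macros that take arguments) and at most one level of nested parentheses inside __attribute__((...)).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// attributeRe matches __attribute__((...)) with at most one level of nested
// parentheses, which covers forms like __attribute__((format(printf, 1, 2))).
var attributeRe = regexp.MustCompile(`__attribute__\s*\(\((?:[^()]|\([^()]*\))*\)\)`)

// vendorMacroRe matches bare _GLIBCXX_* and _LIBCPP_* tokens.
var vendorMacroRe = regexp.MustCompile(`\b(?:_GLIBCXX_|_LIBCPP_)[A-Za-z0-9_]*`)

// spaceRe collapses the whitespace left behind by the removals above.
var spaceRe = regexp.MustCompile(`\s+`)

// normalizeType is a hypothetical helper: it strips compiler and vendor noise
// from a declaration string and canonicalizes the libstdc++ inline namespace.
func normalizeType(s string) string {
	s = attributeRe.ReplaceAllString(s, " ")
	s = vendorMacroRe.ReplaceAllString(s, " ")
	s = strings.ReplaceAll(s, "std::__cxx11::", "std::")
	return strings.TrimSpace(spaceRe.ReplaceAllString(s, " "))
}

func main() {
	fmt.Println(normalizeType("int printf(const char *fmt, ...) __attribute__((format(printf, 1, 2)))"))
	fmt.Println(normalizeType("std::__cxx11::basic_string<char> _GLIBCXX_NOEXCEPT"))
}
```

Running the sketch prints the two declarations with the attribute, the vendor macro, and the __cxx11 inline namespace removed.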

Why a separate clikeextract package (not flat in tools/)

Mirrors the existing tools/internal/goextract/ precedent — thin entry-point + heavy logic in an internal package, one file per concern, fully testable under regular go test ./.... The tech spec sketched everything flat in tools/; this layout is cleaner and consistent with the rest of the repo.

End-to-end smoke (this host's /usr/include + /usr/include/c++/13)

Target     Headers  Functions  Classes  Typedefs  Constants  Overlay
linux/c       1875       8467        -      1598      38270       27
linux/cpp      121        564      497       112         21       55
Both manifests round-trip through a parse. The C output exceeds the spec's "~80 headers / ~1800 functions" budget because the walker captures the full POSIX/sys surface in addition to libc.

Out of scope (per PR-01 plan)

  • Loader, file:// or HTTP — PR-02
  • Engine resolver integration in c_builder.go/cpp_builder.go — PR-02
  • Windows/Darwin paths in walker — PR-03 (entry-point returns explicit PR-03-deferred error)
  • GitHub Actions CI workflow — PR-03
  • --diagnose-stdlib, resolution-report enhancements — PR-04

Verification

  • go test ./... — all tests pass, no regressions on Phase 1
  • golangci-lint run ./... — zero issues across entire repo
  • gradle buildGo — clean build
  • ✅ 91.5% coverage on new clikeextract package, 100% on new clike_stdlib_types.go
  • ✅ Generator runs end-to-end: go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go --target=linux --language={c,cpp} --output-dir=... produces valid manifests on a real Ubuntu host.

Test plan

  • Reviewer pulls the branch locally
  • go test ./tools/internal/clikeextract/... ./graph/callgraph/core/ — passes
  • go run -tags cpf_generate_stdlib_registry ./tools/generate_clike_stdlib_registry.go --target=linux --language=c --output-dir=/tmp/cpf-c — produces manifest.json + per-header JSONs
  • Spot-check 5 entries against cppreference.com
  • Confirm gradle buildGo and gradle lintGo succeed (Python lint failure on unrelated files is pre-existing)

🤖 Generated with Claude Code

shivasurya and others added 3 commits May 3, 2026 17:45
Adds the schema contract consumed by both the PR-01 generator (this
stack) and the loader landing in PR-02:

- CStdlibRegistry / NewCStdlibRegistry — root in-memory container per
  (platform, language) axis, with accessors HasHeader, GetHeader,
  GetFunction, GetClass, GetMethod that mirror the existing
  GoStdlibRegistry surface.
- CStdlibManifest + CStdlibHeaderEntry + CStdlibStatistics — the
  top-level manifest.json shape; HasHeader / GetHeaderEntry helpers
  for the loader's lazy-fetch path.
- CStdlibHeader — per-header content; one type works for both C and
  C++ (C++-only fields are tagged omitempty so C output stays clean).
- CStdlibFunction / CStdlibParam / CStdlibTypedef / CStdlibConstant
  / CppStdlibClass / CppStdlibConstructor — leaf entries.
- Source / language / platform string constants so consumers don't
  hard-code "header" / "overlay" / "merged" / "linux" / "c" / "cpp".

JSON tags are snake_case to match the Python and Go stdlib registry
files already on disk; nolint:tagliatelle directives match the pattern
in go_stdlib_types.go.
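
A rough sketch of how one header type can serve both languages with snake_case tags and omitempty. The struct and field names below are illustrative assumptions standing in for CStdlibHeader, CStdlibFunction, CStdlibParam and CppStdlibClass; the real schema in clike_stdlib_types.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Param, Function, Class and Header are hypothetical stand-ins for the
// schema types named in this commit; field names are assumptions.
type Param struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type Function struct {
	Name       string  `json:"name"`
	ReturnType string  `json:"return_type"`
	Params     []Param `json:"params"`
	Source     string  `json:"source"`
}

type Class struct {
	Name string `json:"name"`
}

type Header struct {
	Header    string              `json:"header"`
	Language  string              `json:"language"`
	Functions map[string]Function `json:"functions"`
	// C++-only field: omitempty drops it from C output when the map is nil.
	Classes map[string]Class `json:"classes,omitempty"`
}

// encodeHeader serializes a header entry to compact JSON.
func encodeHeader(h Header) (string, error) {
	b, err := json.Marshal(h)
	return string(b), err
}

// sampleCHeader returns a small C header entry with no classes.
func sampleCHeader() Header {
	return Header{
		Header:   "stdio.h",
		Language: "c",
		Functions: map[string]Function{
			"puts": {
				Name:       "puts",
				ReturnType: "int",
				Params:     []Param{{Name: "s", Type: "const char *"}},
				Source:     "header",
			},
		},
	}
}

func main() {
	out, err := encodeHeader(sampleCHeader())
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // no "classes" key appears for the C header
}
```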

100% test coverage on the new file via 12 round-trip + accessor tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tools/internal/clikeextract — the Go package that walks installed
C/C++ system headers and emits per-header JSON registry files. Mirrors
the existing tools/internal/goextract layout (thin entry-point binary
plus a fully-tested internal package), one concern per file:

- doc.go — package docs (pipeline overview, reuse rules)
- config.go — Config + GeneratorVersion / SchemaVersion / RegistryVersion
  / DefaultBaseURL constants. Validate() rejects unsupported targets and
  languages early.
- normalize.go — strip __attribute__((...)), _GLIBCXX_*, _LIBCPP_*,
  __THROW, _Nonnull etc.; canonicalize std::__cxx11:: -> std::; private-
  symbol detection (single-underscore lowercase, double-underscore);
  SanitizeHeaderName for output filenames.
- walker.go — DiscoverHeaderSources for linux/c (glibc) and linux/cpp
  (libstdc++), system-tag detection, deterministic header walking with
  bits/ / internal/ skip rules. Windows/Darwin paths return an explicit
  PR-03-deferred error so the surface is forward-compatible.
- overlay.go — yaml.v3-based loader for c_stdlib_overlay.yaml /
  cpp_stdlib_overlay.yaml. Validates language match, exactly-one-of
  function/method/typedef/constant, and skip-rule shape. MergeOverlay
  applies overrides in place and returns the count for statistics.
- c_extractor.go — C function / typedef / preproc-def extraction over
  the tree-sitter AST. Reuses graph/clike helpers (ExtractFunctionInfo,
  ExtractTypeString, ExtractParameters). Conservative #define handling:
  emit constants only when the body parses as a literal.
- cpp_extractor.go — C++ classes (with template_parameter_list capture),
  methods, namespace-qualified free functions, constructors. Adds a
  local findFunctionDeclarator that handles C++ reference_declarator
  wrappers (Phase 1's clike helper does not — this is reachable for
  T&-returning methods like vector::operator[] / vector::at).
- emitter.go — per-header JSON write with sha256 checksums, statistics
  tally, top-level manifest.json. Output is deterministic across runs
  (sorted by header name) and idempotent.
- extractor.go — Run() orchestrator stitching discover -> walk -> extract
  -> merge -> emit. Continue-on-parse-failure pattern matches goextract;
  fatal errors only on missing search dirs, invalid overlay, unwritable
  output dir.
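
The determinism claim for emitter.go (sorted by header name, sha256 checksums) can be illustrated with a small in-memory sketch. The manifest shape, the "sha256:" prefix, and the function names here are assumptions for illustration; writing files to disk is omitted.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// manifestEntry and manifest are hypothetical shapes, not the PR's schema.
type manifestEntry struct {
	Header   string `json:"header"`
	Checksum string `json:"checksum"`
}

type manifest struct {
	Headers []manifestEntry `json:"headers"`
}

// checksum returns a prefixed hex sha256 digest of data.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// buildManifest maps header name -> serialized per-header JSON into a
// manifest. Go map iteration order is random, so names are sorted first
// to make the output byte-identical across runs.
func buildManifest(files map[string][]byte) ([]byte, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	m := manifest{Headers: make([]manifestEntry, 0, len(names))}
	for _, name := range names {
		m.Headers = append(m.Headers, manifestEntry{
			Header:   name,
			Checksum: checksum(files[name]),
		})
	}
	return json.MarshalIndent(m, "", "  ")
}

// sampleFiles returns two tiny per-header payloads.
func sampleFiles() map[string][]byte {
	return map[string][]byte{
		"string.h": []byte(`{"header":"string.h"}`),
		"stdio.h":  []byte(`{"header":"stdio.h"}`),
	}
}

func main() {
	out, err := buildManifest(sampleFiles())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Sorting before serialization is what makes repeated runs comparable with a plain diff, which is one way to check the idempotence property described above.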

testdata/c/{stdio.h,string.h,unistd.h,inline.h} and
testdata/cpp/{vector,string,utility} provide synthetic fixture headers
for unit + integration tests. End-to-end TestRunFixtureLinux{C,Cpp}
exercise the full pipeline.
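
As an illustration of how the bits/ and internal/ skip rules could be exercised against a synthetic fixture tree, here is a sketch using an in-memory filesystem. The names walkHeaders and fixtureFS are hypothetical, and the ".h" suffix filter is a C-only simplification (C++ standard headers such as vector have no extension).

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// skipDirs lists directory names that are pruned during the walk.
var skipDirs = map[string]bool{"bits": true, "internal": true}

// walkHeaders returns the .h files under fsys, skipping pruned directories.
// fs.WalkDir visits entries in lexical order, so the result is deterministic.
func walkHeaders(fsys fs.FS) ([]string, error) {
	var headers []string
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if skipDirs[d.Name()] {
				return fs.SkipDir // do not descend into bits/ or internal/
			}
			return nil
		}
		if strings.HasSuffix(p, ".h") {
			headers = append(headers, p)
		}
		return nil
	})
	return headers, err
}

// fixtureFS builds a tiny synthetic include tree in memory.
func fixtureFS() fs.FS {
	return fstest.MapFS{
		"stdio.h":         {Data: []byte("int puts(const char *s);\n")},
		"sys/stat.h":      {Data: []byte("int chmod(const char *p, int m);\n")},
		"bits/types.h":    {Data: []byte("typedef int __pid_t;\n")},
		"internal/impl.h": {Data: []byte("void __impl(void);\n")},
		"README":          {Data: []byte("not a header\n")},
	}
}

func main() {
	headers, err := walkHeaders(fixtureFS())
	if err != nil {
		panic(err)
	}
	fmt.Println(headers)
}
```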

Coverage: 91.5% on the new package across 99 test cases. Remaining
gaps are defensive nil paths and tree-sitter shapes the synthetic
fixtures don't reach (operator_name, destructor_name fallbacks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing generator binary plus the two hand-curated YAML
overlay files that augment tree-sitter extraction with security tags,
template return types, and skip rules:

- tools/generate_clike_stdlib_registry.go — //go:build cpf_generate_stdlib_registry
  entry-point binary. Mirrors the layout of generate_go_stdlib_registry.go:
  flag-parse target / language / output-dir / overlay / base-url, hand
  off to clikeextract.NewExtractor(cfg).Run().
- tools/c_stdlib_overlay.yaml — 28 hand-curated entries covering
  format-string sinks (printf family with __attribute__((format))),
  command-injection sinks (system, popen, exec*), buffer-overflow sinks
  (strcpy, gets, sprintf), allocation sources (malloc, calloc), tainted
  sources (getenv, read), plus skip rules for compiler-internal symbols.
- tools/cpp_stdlib_overlay.yaml — 55 entries covering STL methods whose
  template return types tree-sitter cannot substitute: vector at /
  operator[] / data, basic_string c_str / data, unique_ptr/shared_ptr
  get/reset/operator*, optional value/value_or, map/unordered_map find/
  insert/operator[]/at, std::move / std::forward, ostream/istream stream
  operators. Throws annotations on at() (std::out_of_range), value()
  (std::bad_optional_access).
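
The overlay validation rules mentioned earlier (language match, exactly one of function/method/typedef/constant) might look roughly like this. The overlayEntry struct and both function names are assumptions, and YAML decoding is left out so the sketch depends only on the standard library.

```go
package main

import "fmt"

// overlayEntry is a hypothetical, already-decoded overlay entry.
// An empty string means the key was absent in the YAML.
type overlayEntry struct {
	Function string
	Method   string
	Typedef  string
	Constant string
}

// validateEntry enforces the exactly-one-of rule for the entry kind.
func validateEntry(e overlayEntry) error {
	set := 0
	for _, v := range []string{e.Function, e.Method, e.Typedef, e.Constant} {
		if v != "" {
			set++
		}
	}
	if set != 1 {
		return fmt.Errorf("entry must set exactly one of function/method/typedef/constant, got %d", set)
	}
	return nil
}

// validateOverlay checks the overlay's declared language against the
// language being generated, then validates every entry.
func validateOverlay(fileLang, wantLang string, entries []overlayEntry) error {
	if fileLang != wantLang {
		return fmt.Errorf("overlay language %q does not match requested language %q", fileLang, wantLang)
	}
	for i, e := range entries {
		if err := validateEntry(e); err != nil {
			return fmt.Errorf("entry %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	entries := []overlayEntry{
		{Function: "system"},
		{Function: "strcpy", Typedef: "size_t"}, // invalid: two kinds set
	}
	fmt.Println(validateOverlay("c", "c", entries[:1]))
	fmt.Println(validateOverlay("c", "c", entries))
	fmt.Println(validateOverlay("cpp", "c", nil))
}
```

Failing fast on a malformed overlay fits the orchestrator's stated policy of treating an invalid overlay as a fatal error rather than a skippable one.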

End-to-end smoke against this host's /usr/include + /usr/include/c++/13
produced 1875 C headers (8467 functions) and 121 C++ headers (497
classes, 564 functions) — both manifests parsed and statistics check
out.

Run with:

  go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go \
      --target=linux --language=c --output-dir=/tmp/cpf-c
  go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go \
      --target=linux --language=cpp --output-dir=/tmp/cpf-cpp

Output is local-only in this PR; remote deployment + CDN URL come in
PR-03. Loader + engine integration come in PR-02.
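
For readers who want a feel for the entry-point wiring, the following sketch parses the same three flags used in the commands above and rejects unsupported values early. It is an assumption-laden illustration: the config struct, parseArgs, and the error messages are hypothetical, and the real binary also accepts overlay and base-url flags that are not modeled here.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config is a hypothetical subset of the generator's configuration.
type config struct {
	Target    string
	Language  string
	OutputDir string
}

// validate rejects unsupported targets and languages before any work starts.
func (c config) validate() error {
	if c.Target != "linux" {
		return fmt.Errorf("target %q is not supported in this sketch", c.Target)
	}
	switch c.Language {
	case "c", "cpp":
	default:
		return fmt.Errorf("language %q is not supported", c.Language)
	}
	if c.OutputDir == "" {
		return errors.New("output-dir is required")
	}
	return nil
}

// parseArgs parses CLI arguments into a config and validates it.
func parseArgs(args []string) (config, error) {
	fs := flag.NewFlagSet("generate_clike_stdlib_registry", flag.ContinueOnError)
	var c config
	fs.StringVar(&c.Target, "target", "", "target platform")
	fs.StringVar(&c.Language, "language", "", "c or cpp")
	fs.StringVar(&c.OutputDir, "output-dir", "", "directory for generated JSON")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	return c, c.validate()
}

func main() {
	cfg, err := parseArgs([]string{"--target=linux", "--language=c", "--output-dir=/tmp/cpf-c"})
	fmt.Println(cfg, err)

	_, err = parseArgs([]string{"--target=windows", "--language=c", "--output-dir=/tmp/cpf-c"})
	fmt.Println(err)
}
```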

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shivasurya added the enhancement and go labels May 3, 2026
@shivasurya self-assigned this May 3, 2026
@safedep

safedep Bot commented May 3, 2026

SafeDep Report Summary


No dependency changes detected. Nothing to scan.


This report is generated by SafeDep Github App

@github-actions

github-actions Bot commented May 3, 2026

Code Pathfinder Security Scan

Pass Critical High Medium Low Info

No security issues detected.

Metric Value
Files Scanned 29
Rules 205

Powered by Code Pathfinder

@codecov

codecov Bot commented May 3, 2026

Codecov Report

❌ Patch coverage is 88.61712% with 121 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.55%. Comparing base (90305d9) to head (682fa02).

Files with missing lines                                  Patch %  Lines
...ngine/tools/internal/clikeextract/cpp_extractor.go     78.87%   41 Missing and 23 partials ⚠️
...-engine/tools/internal/clikeextract/c_extractor.go     86.00%   17 Missing and 11 partials ⚠️
sast-engine/tools/internal/clikeextract/overlay.go        93.33%   8 Missing and 6 partials ⚠️
sast-engine/tools/internal/clikeextract/emitter.go        86.48%   5 Missing and 5 partials ⚠️
...st-engine/tools/internal/clikeextract/extractor.go     92.50%   2 Missing and 1 partial ⚠️
sast-engine/tools/internal/clikeextract/walker.go         98.23%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #679      +/-   ##
==========================================
+ Coverage   85.43%   85.55%   +0.11%     
==========================================
  Files         187      196       +9     
  Lines       27278    28341    +1063     
==========================================
+ Hits        23305    24247     +942     
- Misses       3082     3156      +74     
- Partials      891      938      +47     
