
Refactor issue reconciliation logic and update prompt instructions #160

Merged
rostilos merged 7 commits into main from 1.5.4-rc
Mar 5, 2026

Conversation


rostilos commented Mar 4, 2026

  • Enhanced the reconciliation process in reconciliation.py to prioritize updated positional data from LLM while preserving original identity and description fields.
  • Improved logging for reconciled issues to provide clearer insights into changes made during reconciliation.
  • Updated line number and code snippet instructions in constants_shared.py to clarify requirements for issue reporting, emphasizing the importance of accurate code snippets and scope definitions.
  • Replaced deprecated line number instructions with new comprehensive guidelines in prompt_builder.py and prompt_constants.py.
  • Removed obsolete backup file prompt_constants.py.bak to streamline the codebase.
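The merge policy in the first bullet (take the LLM's refreshed positional data, keep the original identity and description) can be sketched as follows. This is an illustrative Python sketch, not the actual reconciliation.py code; the field names and the reconcile_issue helper are hypothetical.

```python
# Illustrative sketch of the merge policy described above. Positional
# fields are taken from the LLM's updated issue; identity and description
# fields are always preserved from the original. Field names hypothetical.

POSITIONAL_FIELDS = ("line_number", "end_line_number", "code_snippet", "scope")
PRESERVED_FIELDS = ("issue_id", "title", "description", "severity")

def reconcile_issue(original: dict, updated: dict) -> dict:
    merged = dict(original)
    for field in POSITIONAL_FIELDS:
        if updated.get(field) is not None:
            merged[field] = updated[field]  # prefer fresh positional data
    for field in PRESERVED_FIELDS:
        merged[field] = original.get(field)  # identity never drifts
    return merged

original = {"issue_id": "I1", "title": "NPE risk", "description": "d",
            "severity": "MAJOR", "line_number": 10, "code_snippet": "old"}
updated = {"issue_id": "??", "title": "renamed", "line_number": 14,
           "code_snippet": "new"}
merged = reconcile_issue(original, updated)
```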

Summary by CodeRabbit

  • New Features

    • AST-powered scope detection (FILE/FUNCTION/BLOCK/LINE) and richer symbol/context enrichment for more accurate issue anchoring.
    • Server-side snippet anchoring with automated line corrections and confidence scoring.
    • Improved deterministic reconciliation to better track/restore issues across edits.
    • RAG/context retrieval now uses AST-derived identifiers, tiered selection, and safer file-content fallbacks.
  • Chores

    • New AST parsing module and DB schema extensions to store scope and end-line metadata.
  • Tests

    • Expanded tests for parsing, anchoring, reconciliation, and RAG flows.


coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the `reviews.auto_review.auto_pause_after_reviewed_commits` setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new Tree-sitter based AST parser module and integrates AST-aware enrichment, anchoring, and deterministic reconciliation across the Java analysis engine and Python RAG/orchestration pipelines; updates models, DB migrations, services, tests, and build configuration to consume AST-derived scope, scope-aware hashing, and snippet anchoring.

Changes

  • AST parser module & resources
    Files: java-ecosystem/libs/ast-parser/pom.xml, java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/..., java-ecosystem/libs/ast-parser/src/main/resources/queries/.../scopes.scm
    New module providing AstParserFacade, ParsedTree, ParserPool, ScopeResolver, SymbolExtractor, ScopeAwareHasher APIs and implementations, ScopeInfo/ScopeKind/SupportedLanguage types, per-language Tree-sitter queries, tests and internal hashing/resolver/extractor logic.

  • Analysis-engine wiring & processors
    Files: java-ecosystem/libs/analysis-engine/pom.xml, .../module-info.java, .../processor/analysis/BranchAnalysisProcessor.java, .../PullRequestAnalysisProcessor.java, .../service/AstScopeEnricher.java, .../service/IssueReconciliationEngine.java
    Adds codecrow-ast-parser dependency and module require; introduces AstScopeEnricher service and IssueReconciliationEngine; processors and services now accept and wire these new dependencies, perform non-fatal AST enrichment, and rework cache-hit handling (CacheHitType) and flow integration.

  • Branch mapping & reconciliation
    Files: java-ecosystem/libs/analysis-engine/.../branch/BranchIssueMappingService.java, BranchIssueReconciliationService.java
    Dedup key now includes normalized title; reconciliation logic delegated to IssueReconciliationEngine and made AST-aware (line remapping, snippet verification, deterministic sweep/classification).

  • Anchoring & tracking utilities
    Files: java-ecosystem/libs/core/.../anchoring/SnippetAnchoringService.java, SnippetLocator.java, tracking/LineHashSequence.java, tracking/IssueFingerprint.java, tracking/ReconcilableIssue.java, tracking/TrackingConfidence.java
    New snippet anchoring/locator strategies, public LocateResult/AnchorResult types; LineHashSequence.getRangeHash added; IssueFingerprint.normalizeTitle made public; new ReconcilableIssue interface; TrackingConfidence extended with UNANCHORED_FP_MATCH.

  • Models, DTOs & DB migrations
    Files: java-ecosystem/libs/core/.../CodeAnalysisIssue.java, branch/BranchIssue.java, core/dto/analysis/issue/IssueDTO.java, core/model/codeanalysis/IssueScope.java, .../db/migration/V2.*.sql
    Adds issueScope, scopeStartLine, endLineNumber and drift fields to models and DTOs; CodeAnalysisIssue/BranchIssue implement ReconcilableIssue; DB migrations add/backfill new columns and widen tracking_confidence.

  • PR/file enrichment & pipeline-agent clients
    Files: java-ecosystem/libs/analysis-engine/.../pr/PrFileEnrichmentService.java, services/pipeline-agent/.../*AiClientService.java
    PrFileEnrichmentService adds fetchFileContentsOnly; pipeline-agent services switch enrichment requests to use commit hashes and add non-fatal fallback to fetch raw file contents when AST/enrichment is empty or fails.

  • PR tracking & issue tracking flows
    Files: java-ecosystem/libs/analysis-engine/.../PrIssueTrackingService.java, PrIssueTrackingService tests
    PrIssueTrackingService wired to reconciliationEngine and astScopeEnricher; adds post-tracking snippet verification and deterministic sweep phases; trackingConfidence now uses enum values.

  • Python RAG / splitter / orchestrator changes
    Files: python-ecosystem/.../rag/*, stage_1_file_review.py, context_helpers.py, splitter/*, models/scoring_config.py
    RAG accepts additional_identifiers from AST enrichment; adds transitive parent lookups and tiered deterministic/context budgeting; metadata extraction made AST-aware; adds information-density scoring and ecosystem-aware scoring; many prompt/prompt-constants updates for snippet+scope.

  • Core service anchoring & context hashing
    Files: java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java
    Adds server-side snippet anchoring (SnippetAnchoringService) and new computeContextHash helper using range/snippet/AST-aware logic; adjusts fingerprint/context hashing strategy.

  • Tests, small fixes & frontend pointer
    Files: java-ecosystem/libs/ast-parser/src/test/..., analysis-engine tests, core DTO tests, web-server/FileViewService.java, frontend submodule update
    New/updated tests for AST facade and SupportedLanguage; updated tests to wire AstScopeEnricher; FileViewService null-safes trackingConfidence display; frontend submodule pointer bumped in repo.

  • Build & root POM
    Files: java-ecosystem/pom.xml, java-ecosystem/libs/ast-parser/.gitignore
    Root POM adds ast-parser module, tree-sitter JNI language dependencies and version properties; new module .gitignore added.
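Several cohorts above mention scope-aware hashing (ScopeAwareHasher, computeContextHash). The real implementations are not shown on this page; as a hedged sketch, the idea is to fingerprint an issue by its enclosing scope and a normalized snippet rather than by absolute line numbers, so the hash survives line drift:

```python
import hashlib

# Hedged sketch of scope-aware hashing: hash the file path, the enclosing
# scope's identity, and a whitespace-normalized snippet instead of the
# absolute line number, so unrelated edits that shift lines (or reindent
# code) do not change the fingerprint. Function and inputs are hypothetical.

def scope_aware_hash(file_path: str, scope_name: str, snippet: str) -> str:
    normalized = " ".join(snippet.split())  # collapse indentation/whitespace
    payload = f"{file_path}\x00{scope_name}\x00{normalized}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same issue before and after a reformat that changed only whitespace:
h_before = scope_aware_hash("Foo.java", "Foo#bar", "int x = compute();")
h_after = scope_aware_hash("Foo.java", "Foo#bar", "int   x = compute();")
h_other_scope = scope_aware_hash("Foo.java", "Foo#baz", "int x = compute();")
```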

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant PRProcessor as PullRequestAnalysisProcessor
    participant Enricher as PrFileEnrichmentService
    participant Parser as AstParserFacade
    participant ScopeEnricher as AstScopeEnricher
    participant Reconciler as IssueReconciliationEngine
    participant DB as Database

    Client->>PRProcessor: request PR analysis (commit hash)
    PRProcessor->>Enricher: request enrichment (commit)
    Enricher->>Parser: parse files
    Parser-->>Enricher: ParsedTree(s)
    Enricher-->>PRProcessor: enrichment data (files + parsed trees)

    PRProcessor->>PRProcessor: run AI analysis -> issues
    PRProcessor->>ScopeEnricher: enrichWithAstScopes(issues, fileContents)
    ScopeEnricher->>Parser: resolveInnermostScope(...)
    Parser-->>ScopeEnricher: ScopeInfo
    ScopeEnricher-->>PRProcessor: enriched issues

    PRProcessor->>Reconciler: verifySnippetAnchors / classifyByContent
    Reconciler-->>PRProcessor: classifications
    PRProcessor->>DB: persist results
    DB-->>PRProcessor: ack

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Poem

🐰 I hopped through parse trees, whiskers twitching bright,

I found each scope and anchored snippets tight.
I chased drifting lines and stitched their rightful place,
then left a trail of hashes — tidy, neat, and chaste.
A carrot for the build: the AST got grace!



coderabbitai bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTO.java (1)

53-55: ⚠️ Potential issue | 🟡 Minor

fromBranchIssue contract no longer matches implementation.

Line 54 says this factory “never dereferences to CodeAnalysisIssue”, but Line 111 now does bi.getOriginIssue().getId(). Please either source originIssueId from a scalar field on BranchIssue or update the method contract to reflect the dereference behavior.

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTO.java`
around lines 53 - 55, The Javadoc for IssueDTO.fromBranchIssue claims it "never
dereferences to CodeAnalysisIssue" but the implementation calls
bi.getOriginIssue().getId(); fix by either (A) sourcing originIssueId from a
scalar on BranchIssue (e.g., replace bi.getOriginIssue().getId() with
bi.getOriginIssueId()) if such accessor exists, or (B) update the
fromBranchIssue Javadoc to remove the "never dereferences..." guarantee and
document that it reads origin via bi.getOriginIssue(), and make the code
null-safe (check bi.getOriginIssue() for null before calling getId()) so no NPEs
occur. Ensure references to BranchIssue, fromBranchIssue, and
originIssueId/originIssue are adjusted consistently.
🟡 Minor comments (4)
java-ecosystem/libs/ast-parser/src/main/resources/queries/rust/scopes.scm (lines 11-24) (1)

11-24: ⚠️ Potential issue | 🟡 Minor

Add union_item to complete Rust type-scope coverage.

Line 11–24 captures most Rust type-like declarations, but union items are currently absent. Since union_item is a valid tree-sitter Rust node, unions will not receive @class.def scope attribution, causing missed scope tracking inside union bodies.

Proposed query update
 (trait_item
   name: (type_identifier) `@name`) `@class.def`
 
+(union_item
+  name: (type_identifier) `@name`) `@class.def`
+
 (impl_item) `@class.def`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/rust/scopes.scm`
around lines 11 - 24, The query misses Rust union declarations—add a pattern for
union_item so unions get the `@class.def` scope: include a line matching
(union_item name: (type_identifier) `@name`) `@class.def` alongside the existing
struct_item/enum_item/trait_item/type_item/impl_item entries (i.e., mirror the
style of those patterns to ensure union bodies receive `@class.def` and their
names are captured as `@name`).
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrFileEnrichmentService.java (lines 205-219) (1)

205-219: ⚠️ Potential issue | 🟡 Minor

Recompute total size after truncation for accurate stats.

Line 219 currently uses totalSize computed before applying truncateToSizeLimit(...), so reported bytes can be higher than what was actually returned.

Proposed fix
             long totalSize = contentDtos.stream()
                     .filter(f -> !f.skipped())
                     .mapToLong(FileContentDto::sizeBytes)
                     .sum();
             if (totalSize > maxTotalSizeBytes) {
                 contentDtos = truncateToSizeLimit(contentDtos, maxTotalSizeBytes, skipReasons);
+                totalSize = contentDtos.stream()
+                        .filter(f -> !f.skipped())
+                        .mapToLong(FileContentDto::sizeBytes)
+                        .sum();
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrFileEnrichmentService.java`
around lines 205 - 219, Recompute totalSize after calling truncateToSizeLimit so
stats reflect the returned files: after possibly reassigning contentDtos in
PrFileEnrichmentService (where truncateToSizeLimit is used), recalculate
totalSize from contentDtos (filtering out skipped via FileContentDto::skipped
and summing FileContentDto::sizeBytes) and use that recomputed value when
constructing PrEnrichmentDataDto.EnrichmentStats; ensure filesEnriched is also
based on the post-truncation contentDtos.
java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ScopeQueryRegistry.java (lines 42-80) (1)

42-80: ⚠️ Potential issue | 🟡 Minor

Potential race condition and repeated I/O for missing queries.

Two issues with the current implementation:

  1. Thread-safety: The fast-path read at lines 44-47 accesses the EnumMap without synchronization, while writes happen inside the synchronized block. EnumMap is not thread-safe—concurrent reads during a write can cause undefined behavior.

  2. Missing queries cause repeated I/O: When a query file doesn't exist (line 59-62) or is empty (line 67-70), the method returns Optional.empty() without caching the negative result. Subsequent calls will repeatedly attempt to load the file.

🔒 Proposed fix for thread-safety and negative caching
 public final class ScopeQueryRegistry {
 
     private static final Logger log = LoggerFactory.getLogger(ScopeQueryRegistry.class);
 
-    private final Map<SupportedLanguage, String> cache = new EnumMap<>(SupportedLanguage.class);
+    // Use ConcurrentHashMap for thread-safe access without full synchronization
+    private final Map<SupportedLanguage, Optional<String>> cache = new ConcurrentHashMap<>();
 
     public Optional<String> getQuery(SupportedLanguage language) {
-        // Fast path: already cached
-        String cached = cache.get(language);
-        if (cached != null) {
-            return Optional.of(cached);
+        // Thread-safe check-and-load with computeIfAbsent
+        return cache.computeIfAbsent(language, this::loadQuery);
+    }
+
+    private Optional<String> loadQuery(SupportedLanguage language) {
+        String path = language.getScopeQueryPath();
+        try (InputStream is = getClass().getClassLoader().getResourceAsStream(path)) {
+            if (is == null) {
+                log.debug("No scope query found for {} at classpath:{}", language, path);
+                return Optional.empty();
+            }
+            String query = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))
+                    .lines()
+                    .collect(Collectors.joining("\n"));
+
+            if (query.isBlank()) {
+                log.warn("Empty scope query file for {} at classpath:{}", language, path);
+                return Optional.empty();
+            }
+
+            log.debug("Loaded scope query for {} ({} chars)", language, query.length());
+            return Optional.of(query);
+        } catch (IOException e) {
+            log.error("Failed to load scope query for {}: {}", language, e.getMessage());
+            return Optional.empty();
         }
-        // ... rest of synchronized block
     }

You'll also need to add the import:

import java.util.concurrent.ConcurrentHashMap;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ScopeQueryRegistry.java`
around lines 42 - 80, Replace the non-thread-safe EnumMap and ad-hoc sync in
getQuery with a concurrent cache and negative-result sentinel: change the cache
to a ConcurrentHashMap<SupportedLanguage,String> (add import
java.util.concurrent.ConcurrentHashMap), define a private static final String
NEGATIVE_MARKER = "<NO_QUERY>"; in getQuery first read cache.get(language) and
if non-null return Optional.empty() when it equals NEGATIVE_MARKER or
Optional.of(cached) otherwise; when the resource is missing or the file is
blank, put( language, NEGATIVE_MARKER ) before returning Optional.empty(); on
successful load put the real query string as before.
java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java (lines 249-273) (1)

249-273: ⚠️ Potential issue | 🟡 Minor

Robust UTF-8 handling, but edge case with column offsets.

The extractNodeText method handles the UTF-8 byte offset conversion well for multi-line nodes. However, for single-line nodes, the column offsets from tree-sitter are byte-based (UTF-8), not character indices. If a line contains multi-byte UTF-8 characters before the node, line.substring(startCol, endCol) could fail or produce incorrect results since Java strings use UTF-16.

🐛 Proposed fix for column offset handling
     private static String extractNodeText(TSNode node, String source) {
         int startByte = node.getStartByte();
         int endByte = node.getEndByte();
-        // tree-sitter byte offsets are UTF-8 byte positions.
-        // For short identifier names (ASCII), byte offset == char offset.
-        // For safety, use the line/column approach for single-line nodes.
         if (node.getStartPoint().getRow() == node.getEndPoint().getRow()) {
             String[] lines = source.split("\\r?\\n", -1);
             int row = node.getStartPoint().getRow();
             if (row < lines.length) {
                 String line = lines[row];
-                int startCol = node.getStartPoint().getColumn();
-                int endCol = node.getEndPoint().getColumn();
-                if (startCol <= line.length() && endCol <= line.length()) {
-                    return line.substring(startCol, endCol);
-                }
+                // tree-sitter columns are byte offsets; convert to char indices
+                byte[] lineBytes = line.getBytes(java.nio.charset.StandardCharsets.UTF_8);
+                int startCol = node.getStartPoint().getColumn();
+                int endCol = node.getEndPoint().getColumn();
+                if (startCol <= lineBytes.length && endCol <= lineBytes.length && startCol < endCol) {
+                    return new String(lineBytes, startCol, endCol - startCol,
+                            java.nio.charset.StandardCharsets.UTF_8);
+                }
             }
         }
         // Multi-line node text: use byte offsets with UTF-8 conversion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java`
around lines 249 - 273, extractNodeText currently treats single-line tree-sitter
column offsets as Java char indices, but tree-sitter columns are UTF-8 byte
offsets; update extractNodeText to convert the startCol/endCol byte offsets into
Java char indices before calling line.substring(...) by computing the UTF-8 byte
array for the line and decoding the prefix lengths (startCol and endCol) to
determine the corresponding char counts (e.g., decode new String(lineBytes, 0,
startCol, UTF_8).length() and same for endCol), then use those char indices for
substring; keep the existing multi-line byte-based branch as-is as a fallback.
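The byte-offset vs. char-index mismatch flagged above is easy to demonstrate outside Java. A small Python illustration of the prefix-decode conversion the prompt describes (decode the first N bytes of the line, take its character length); the line content here is invented:

```python
# Tree-sitter reports columns as UTF-8 byte offsets; Java (UTF-16) and
# Python (code points) index by characters. Decoding the byte prefix of
# length byte_col yields the equivalent character index, assuming the
# offset falls on a character boundary (tree-sitter offsets always do).

def byte_col_to_char_col(line: str, byte_col: int) -> int:
    return len(line.encode("utf-8")[:byte_col].decode("utf-8"))

line = "価格 = price()"  # two 3-byte characters before the '='
byte_col = line.encode("utf-8").index(b"=")      # 7: 3 + 3 + 1 bytes
char_col = byte_col_to_char_col(line, byte_col)  # 3: two chars + a space
```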
🧹 Nitpick comments (18)
java-ecosystem/libs/ast-parser/.gitignore (1)

38-40: Consider anchoring local file ignores to module root.

If these are intended to ignore only root-local files, prefer /index.ts, /.env, and /server.log to avoid unintentionally ignoring same-named files in nested directories.

Suggested tweak
-index.ts
-.env
-server.log
+/index.ts
+/.env
+/server.log
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/.gitignore` around lines 38 - 40, The
.gitignore currently lists unanchored patterns "index.ts", ".env", and
"server.log" which can match files in nested directories; update those entries
to anchored forms "/index.ts", "/.env", and "/server.log" so they only ignore
files at the module root (replace the unanchored patterns in the .gitignore with
the anchored ones to avoid unintentionally ignoring same-named files in
submodules).
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/IssueFingerprint.java (1)

38-41: Redundant case-insensitivity flags.

The pattern uses both the inline (?i) flag and Pattern.CASE_INSENSITIVE in the compile call. Either one alone is sufficient.

♻️ Suggested fix
     private static final Pattern LINE_REF_PATTERN = Pattern.compile(
-            "(?i)(?:(?:on|at)\\s+)?lines?\\s*\\d+(?:\\s*-\\s*\\d+)?|L\\d+",
-            Pattern.CASE_INSENSITIVE
+            "(?:(?:on|at)\\s+)?lines?\\s*\\d+(?:\\s*-\\s*\\d+)?|L\\d+",
+            Pattern.CASE_INSENSITIVE
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/IssueFingerprint.java`
around lines 38 - 41, The LINE_REF_PATTERN constant in IssueFingerprint uses
redundant case-insensitivity markers; remove one of them (prefer removing the
inline (?i) from the regex) so only Pattern.CASE_INSENSITIVE remains in the
Pattern.compile call; update the Pattern.compile invocation for LINE_REF_PATTERN
accordingly to avoid duplicate flags while preserving the same regex behavior.
java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/model/SupportedLanguageTest.java (2)

15-75: Add normalization edge-case tests (uppercase + dot-prefixed extensions).

Current cases don’t verify common user input forms like ".py" or "JAVA". Adding these will harden behavior expectations and prevent regressions in normalization logic.

Suggested test additions
 `@ParameterizedTest`
 `@CsvSource`({
         "java, JAVA",
+        "JAVA, JAVA",
+        ".java, JAVA",
         "py, PYTHON",
+        "PY, PYTHON",
+        ".py, PYTHON",
         "pyi, PYTHON",
         "js, JAVASCRIPT",
         "mjs, JAVASCRIPT",
@@
 })
 void fromExtension_resolves_known_extensions(String ext, String expected) {
     Optional<SupportedLanguage> result = SupportedLanguage.fromExtension(ext);
     assertThat(result).isPresent().hasValue(SupportedLanguage.valueOf(expected));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/model/SupportedLanguageTest.java`
around lines 15 - 75, Add parameterized test cases to cover normalization
edge-cases: extend fromExtension_resolves_known_extensions to include
dot-prefixed extensions (e.g., ".py", ".JAVA" with leading dot and mixed case)
and uppercase-only inputs (e.g., "JAVA") to assert
SupportedLanguage.fromExtension normalizes input; likewise add cases to
fromFilePath_resolves_from_file_path that include paths with uppercase
extensions and dot-prefixed filenames (e.g., "src/Example.PY", ".env.py"
variants) to assert SupportedLanguage.fromFilePath handles these formats. Use
the existing test methods (fromExtension_resolves_known_extensions and
fromFilePath_resolves_from_file_path) and the
SupportedLanguage.fromExtension/fromFilePath APIs and
SupportedLanguage.valueOf(expected) to validate outcomes. Ensure you include
both success and unsupported variations (e.g., ".txt" uppercase) where
appropriate to keep consistency with the NullAndEmptySource/ValueSource negative
tests.

89-95: Assert scope query resources actually exist, not just path format.

The current check validates string shape only. It should also verify the resource is present on the classpath to catch missing query files early.

Suggested strengthening for the assertion
 `@Test`
 void each_language_has_scope_query_path() {
     for (SupportedLanguage lang : SupportedLanguage.values()) {
         String path = lang.getScopeQueryPath();
         assertThat(path).startsWith("queries/").endsWith("/scopes.scm");
+        assertThat(SupportedLanguageTest.class.getClassLoader().getResource(path))
+                .as("Missing scope query resource for %s", lang)
+                .isNotNull();
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/model/SupportedLanguageTest.java`
around lines 89 - 95, The test each_language_has_scope_query_path currently only
validates the string format; update it to also load the resource returned by
SupportedLanguage.getScopeQueryPath() from the classpath and assert it exists
(non-null/available) so missing query files fail the test. Use
SupportedLanguageTest.class.getResource or getResourceAsStream with the path
value (from getScopeQueryPath()) and assert the returned URL/InputStream is not
null (and close streams if opened); keep the existing format assertions and add
the resource existence assertion for each SupportedLanguage.
java-ecosystem/libs/ast-parser/src/main/resources/queries/go/scopes.scm (1)

14-17: Add type_alias capture to avoid missing Go type scopes.

Lines 14–17 currently capture only type_spec. In tree-sitter-go, type aliases (e.g., type A = B) are represented as type_alias nodes within type_declaration. Without capturing this node kind, type alias declarations will not be mapped to @class.def.

Proposed patch
 (type_declaration
   (type_spec
     name: (type_identifier) `@name`)) `@class.def`
+
+(type_declaration
+  (type_alias
+    name: (type_identifier) `@name`)) `@class.def`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/go/scopes.scm`
around lines 14 - 17, The current query only matches type_spec inside
type_declaration, so Go type alias nodes (type_alias) are missed; update the
pattern that binds `@class.def` to also accept a type_alias child by adding an
alternative for (type_alias name: (type_identifier) `@name`) or by matching either
(type_spec ...) or (type_alias ...) within the (type_declaration ...) pattern so
both regular type declarations and aliases capture `@name` and are labeled
`@class.def`.
java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssuesSummaryDTOTest.java (1)

129-163: Reduce test fixture brittleness from the long positional constructor.

createIssueDTO(...) now relies on a very long IssueDTO constructor call with many trailing nulls. Consider a small test fixture builder/factory so future IssueDTO field additions don’t silently misalign arguments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssuesSummaryDTOTest.java`
around lines 129 - 163, The test helper createIssueDTO currently constructs
IssueDTO with a long positional constructor and many trailing nulls, which is
brittle; replace it with a small test fixture builder or factory (e.g.,
TestIssueBuilder or IssueDTOTestBuilder) that exposes fluent setters for only
the fields used in tests and builds an IssueDTO, or if IssueDTO has a builder
use IssueDTO.builder() inside createIssueDTO to set id, severity, category,
title/description and defaults for others; update createIssueDTO to delegate to
that builder and update tests to use the new helper so adding fields to IssueDTO
won't break argument ordering.
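The builder the comment asks for is a standard fixture pattern. Since the Java IssueDTO constructor is not reproduced on this page, here is the pattern as a Python sketch (keyword defaults instead of trailing positional nulls); every name in it is hypothetical:

```python
# Fixture-builder sketch: tests set only the fields they care about and
# defaults cover the rest, so adding a new field to the DTO cannot
# silently misalign positional arguments. All names are hypothetical.
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class IssueFixture:
    id: int = 1
    severity: str = "MINOR"
    category: str = "BUG"
    title: str = "test issue"
    line_number: Optional[int] = None
    issue_scope: Optional[str] = None

def an_issue(**overrides) -> IssueFixture:
    return replace(IssueFixture(), **overrides)

issue = an_issue(severity="CRITICAL", line_number=42)
```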
java-ecosystem/libs/ast-parser/src/main/resources/queries/python/scopes.scm (1)

2-2: Align declared captures with actual patterns.

Line 2 lists @namespace.def, but this file currently has no namespace capture pattern. Either add one (e.g., module-level) or remove it from the header to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/python/scopes.scm`
at line 2, The header capture list declares `@namespace.def` but there is no
corresponding capture pattern in scopes.scm; either add a module-level namespace
capture pattern (e.g., a tree-sitter pattern that matches module or top-level
file nodes and tags them as `@namespace.def`) or remove `@namespace.def` from the
capture list in the header so declared captures match the actual patterns;
update the capture list and/or add the new pattern within the same file
(scopes.scm) to keep captures consistent.
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java (1)

273-286: Identical fallback pattern duplicated across VCS services.

The fallback logic in BitbucketAiClientService is identical to GitHubAiClientService. Consider extracting the fallback logic into a shared utility method in a base class or helper service to reduce duplication and ensure consistent behavior across all VCS providers.

The same VcsClient reuse opportunity noted for the GitHub service applies here as well.

Also applies to: 493-505

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java`
around lines 273 - 286, The fallback file-content logic duplicated in
BitbucketAiClientService (the block that checks enrichmentData.hasData() &&
!changedFiles.isEmpty(), calls vcsClientProvider.getClient(vcsConnection), and
invokes enrichmentService.fetchFileContentsOnly(...)) should be extracted into a
shared helper method (e.g., in a common base class or
VcsEnrichmentFallbackService) to remove duplication and ensure consistent
behavior across VCS providers; refactor BitbucketAiClientService and
GitHubAiClientService to call that new method, and in the helper ensure the
VcsClient is obtained once (reuse the result of
vcsClientProvider.getClient(vcsConnection)) and that exceptions are logged with
the same non-critical warning behavior.
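The extraction the comment describes is a routine pull-up of duplicated fallback logic. A hedged Python sketch of the shape such a shared helper could take (the names and signature are hypothetical, mirroring the behavior described above):

```python
import logging

log = logging.getLogger("enrichment")

# Sketch of a single shared fallback helper, reused by each VCS client
# service: if enrichment produced no data but files changed, fall back to
# contents-only fetching, and treat failures as non-critical (log + keep
# the empty result). Names and signature are hypothetical.

def fetch_contents_fallback(enrichment_data, changed_files, client, fetch_fn):
    if enrichment_data or not changed_files:
        return enrichment_data  # nothing to do
    try:
        return fetch_fn(client, changed_files)  # contents only, no AST
    except Exception as exc:  # non-fatal by design
        log.warning("Enrichment fallback failed (non-critical): %s", exc)
        return enrichment_data

result = fetch_contents_fallback(
    {}, ["a.py"], object(),
    lambda c, files: {f: "<content>" for f in files},
)
```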
java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java (2)

257-270: Consider reusing the VcsClient from the previous enrichment attempt.

The fallback block creates a new VcsClient at line 261, but a client was already created at line 242 during the initial enrichment attempt. If enrichmentData.hasData() is false, it means the initial enrichment was attempted (lines 240-254), so the client could be reused.

Additionally, the null check for enrichmentService on line 262 is redundant here—if the code reaches this fallback block, either the enrichment service was used (line 243) and returned empty data, or the enrichment was skipped entirely (in which case changedFiles would need to be non-empty for the fallback to trigger, but enrichmentData would still be PrEnrichmentDataDto.empty()).

♻️ Suggested refactor to reuse VcsClient

Consider extracting the VcsClient creation outside both blocks and reusing it:

+        VcsClient vcsClient = null;
         // Enrich PR with full file contents and dependency graph
         PrEnrichmentDataDto enrichmentData = PrEnrichmentDataDto.empty();
         if (enrichmentService != null && enrichmentService.isEnrichmentEnabled() && !changedFiles.isEmpty()) {
             try {
-                VcsClient vcsClient = vcsClientProvider.getClient(vcsConnection);
+                vcsClient = vcsClientProvider.getClient(vcsConnection);
                 enrichmentData = enrichmentService.enrichPrFiles(
                         vcsClient,
                         // ...

Then in the fallback:

         if (!enrichmentData.hasData() && !changedFiles.isEmpty()) {
             try {
-                VcsClient vcsClient = vcsClientProvider.getClient(vcsConnection);
-                enrichmentData = (enrichmentService != null)
-                        ? enrichmentService.fetchFileContentsOnly(
+                if (vcsClient == null) {
+                    vcsClient = vcsClientProvider.getClient(vcsConnection);
+                }
+                enrichmentData = enrichmentService.fetchFileContentsOnly(
                                 vcsClient, vcsInfo.owner(), vcsInfo.repoSlug(),
-                                currentCommitHash, changedFiles)
-                        : PrEnrichmentDataDto.empty();
+                                currentCommitHash, changedFiles);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java`
around lines 257 - 270, The fallback creates a new VcsClient even though one was
already created for the initial enrichment attempt; change the code to create
the VcsClient once (call vcsClientProvider.getClient(vcsConnection) before the
initial enrichment attempt and store it in a local variable) and reuse that same
VcsClient in the fallback branch when calling
enrichmentService.fetchFileContentsOnly; also remove the redundant null-check of
enrichmentService in the fallback (since reaching the fallback implies either
enrichment was attempted or enrichmentData is empty), and keep using
PrEnrichmentDataDto.empty() only as the ultimate default if enrichmentService is
unavailable.

475-487: Same VcsClient reuse opportunity applies here.

This fallback block mirrors the PR analysis fallback. The same refactoring suggestions apply—consider reusing the VcsClient created at line 460 and removing the redundant enrichmentService != null check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java`
around lines 475 - 487, The fallback block creates a new VcsClient and
redundantly checks enrichmentService; instead reuse the VcsClient instance
created earlier (avoid calling vcsClientProvider.getClient(vcsConnection) again)
by moving/using that existing variable so both PR-analysis and fallback paths
share the same VcsClient, and remove the unnecessary "enrichmentService != null"
ternary in the fallback (call enrichmentService.fetchFileContentsOnly(...)
directly, or use PrEnrichmentDataDto.empty() only if enrichmentService truly can
be null and you've handled that at a single shared spot). Target symbols:
enrichmentData, changedFiles, VcsClient,
vcsClientProvider.getClient(vcsConnection),
enrichmentService.fetchFileContentsOnly(...), PrEnrichmentDataDto.empty().
java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/ScopeInfo.java (1)

20-41: Clean record design with good documentation.

The record is well-documented and provides useful utility methods.

One minor consideration: there's no validation that startLine <= endLine. If an invalid scope is constructed (e.g., startLine=10, endLine=5), lineSpan() would return a negative value and containsLine() would always return false. This is likely fine if callers are trusted, but adding a compact constructor with validation could prevent subtle bugs:

🛡️ Optional: Add validation in compact constructor
 public record ScopeInfo(
         ScopeKind kind,
         String name,
         int startLine,
         int endLine,
         int parentIndex
 ) {
+    public ScopeInfo {
+        if (startLine < 1) throw new IllegalArgumentException("startLine must be >= 1");
+        if (endLine < startLine) throw new IllegalArgumentException("endLine must be >= startLine");
+    }
+
     /** Convenience constructor for scopes without a known parent. */
     public ScopeInfo(ScopeKind kind, String name, int startLine, int endLine) {
         this(kind, name, startLine, endLine, -1);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/ScopeInfo.java`
around lines 20 - 41, Add a compact constructor to the ScopeInfo record that
validates startLine <= endLine and throws an IllegalArgumentException (or
similar) when violated; update the primary constructor logic in ScopeInfo (the
record header) by adding a compact constructor which checks the two ints and
ensures valid ranges so that lineSpan() and containsLine() behave correctly for
all instances (refer to the record ScopeInfo and its methods lineSpan and
containsLine when locating where to add the validation).
java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/SupportedLanguage.java (1)

54-61: Fail fast on duplicate extension registration.

Line 58 silently overwrites previous language mappings if an extension is duplicated. That makes enum maintenance error-prone. Prefer throwing on duplicates at class init.

Proposed change
     static {
         Map<String, SupportedLanguage> idx = new HashMap<>();
         for (SupportedLanguage lang : values()) {
             for (String ext : lang.extensions) {
-                idx.put(ext, lang);
+                SupportedLanguage previous = idx.putIfAbsent(ext, lang);
+                if (previous != null) {
+                    throw new IllegalStateException(
+                        "Duplicate extension mapping: ." + ext + " -> " + previous + ", " + lang
+                    );
+                }
             }
         }
         EXTENSION_INDEX = Collections.unmodifiableMap(idx);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/SupportedLanguage.java`
around lines 54 - 61, The static initializer for SupportedLanguage silently
overwrites duplicate extensions when building EXTENSION_INDEX; change the loop
in the static block to detect duplicates by checking if idx.containsKey(ext) and
throw an IllegalStateException (or similar runtime exception) that includes the
extension and both the previously-registered SupportedLanguage and the current
SupportedLanguage; ensure you still build EXTENSION_INDEX as an unmodifiable map
after validating all entries.
java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTOTest.java (1)

46-49: Add assertions for the newly added DTO fields.

The tests pass values for detectionSource, issueScope, endLineNumber, and originIssueId, but these fields are never asserted. Please assert them explicitly so constructor/mapping regressions are caught.

Also applies to: 101-104, 148-151

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTOTest.java`
around lines 46 - 49, Add explicit assertions in IssueDTOTest to verify the
newly added DTO fields: detectionSource, issueScope, endLineNumber, and
originIssueId. Locate the test cases that construct IssueDTO (references in
IssueDTOTest) where parameters include "PR_ANALYSIS", "LINE", null, null and add
assertions that dto.getDetectionSource() equals "PR_ANALYSIS",
dto.getIssueScope() equals "LINE", dto.getEndLineNumber() equals the expected
null or numeric value used in the test, and dto.getOriginIssueId() equals the
expected null or value. Repeat the same explicit assertions for the other test
blocks around the same file (the other occurrences noted at the ranges matching
the second and third construct sites) so constructor/mapping regressions are
covered.
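As a rough sketch of what those assertions could look like: the record below is a hypothetical minimal stand-in for the real IssueDTO (accessor names are assumed, not taken from the project source), with plain throw-based checks standing in for the project's actual test-framework assertions.

```java
// Hypothetical minimal stand-in for the real IssueDTO; only the four new fields are
// modeled, and the accessor names are assumed rather than copied from the project.
record IssueDTO(String detectionSource, String issueScope, Integer endLineNumber, Long originIssueId) {}

class IssueDtoFieldAssertions {
    static void check(boolean condition, String field) {
        if (!condition) throw new AssertionError("unexpected value for " + field);
    }

    public static void main(String[] args) {
        // Mirrors the test's construct site: "PR_ANALYSIS", "LINE", null, null.
        IssueDTO dto = new IssueDTO("PR_ANALYSIS", "LINE", null, null);

        // Assert every newly added field so constructor/mapping regressions actually fail.
        check("PR_ANALYSIS".equals(dto.detectionSource()), "detectionSource");
        check("LINE".equals(dto.issueScope()), "issueScope");
        check(dto.endLineNumber() == null, "endLineNumber");
        check(dto.originIssueId() == null, "originIssueId");
        System.out.println("all new-field assertions passed");
    }
}
```

The same four checks, repeated at the other two construct sites with their respective expected values, close the coverage gap.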
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/ReconcilableIssue.java (1)

99-99: Parenthesize anchoring logic for maintainability.

Line 99 currently relies on operator precedence. Explicit grouping will make future changes safer.

Proposed readability tweak
-        return line != null && line > 1 || (snippet != null && !snippet.isBlank());
+        return (line != null && line > 1) || (snippet != null && !snippet.isBlank());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/ReconcilableIssue.java`
at line 99, The return expression in ReconcilableIssue that currently reads
"return line != null && line > 1 || (snippet != null && !snippet.isBlank());"
relies on operator precedence; update it to make the anchoring logic explicit by
grouping the line check and snippet check with parentheses (e.g., wrap the line
conditions in (line != null && line > 1) and keep the snippet check grouped) so
the intent is clear and future edits are safer.
java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java (1)

770-774: Duplicate logging of snippet anchoring corrections.

The SnippetAnchoringService.anchor() method at line 88-89 already logs "Snippet anchoring corrected..." when the anchored line differs from the hint line. This code duplicates that log message, resulting in two identical log entries for the same correction.

🧹 Proposed fix to remove duplicate logging
                         if (anchor.shouldOverrideLine()) {
                             int oldLine = issue.getLineNumber() != null ? issue.getLineNumber() : 0;
                             issue.setLineNumber(anchor.startLine());
-
-                            if (oldLine != anchor.startLine()) {
-                                log.info("Snippet anchoring corrected {}:{} → {} (strategy={}, confidence={})",
-                                        filePath, oldLine, anchor.startLine(),
-                                        anchor.matchStrategy(), anchor.confidence());
-                            }
                         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java`
around lines 770 - 774, The log entry in CodeAnalysisService that prints
"Snippet anchoring corrected ..." when oldLine != anchor.startLine() is
duplicating a message already emitted by SnippetAnchoringService.anchor() (the
log at lines ~88-89 in SnippetAnchoringService); remove the redundant log block
in CodeAnalysisService (the if (oldLine != anchor.startLine()) {
log.info("Snippet anchoring corrected ...") } section) so only the anchoring
service emits that correction message, leaving other surrounding logic intact.
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrIssueTrackingService.java (1)

319-375: Variable name sweepRevived is misleading.

The variable sweepRevived (line 323) tracks issues where the anchor is still present but AI omitted them — these are flagged as potential false resolutions, not actually "revived" into the new analysis. The log output correctly uses sweepWarnings (line 378), but the variable name is confusing.

🧹 Proposed rename for clarity
         // ── Post-tracking: deterministic sweep for unmatched prev issues ──
         // Verify that "resolved" previous issues actually have their anchors gone.
         // If an anchor is still present, the issue was likely just missed by the AI
         // in the new review, not actually fixed. Log this for audit.
-        int sweepRevived = 0;
+        int sweepWarnings = 0;
         if (newFileContents != null && !newFileContents.isEmpty()) {
             // ...
                     for (SweepResult sr : sweepResults) {
                         if (!sr.resolved()) {
                             // ...
-                            sweepRevived++;
+                            sweepWarnings++;
                         }
                     }
             // ...
         }

         log.info("PR tracking for analysis {}: {} matched, {} new, {} resolved " +
                         "(unanchored: {} resolved, {} persisting, snippetCorrected: {}, sweepWarnings: {}) " +
                         "(previous analysis={})",
                 newAnalysis.getId(), matched, newOnly, resolved,
-                unanchoredResolved, unanchoredPersisting, snippetCorrected, sweepRevived,
+                unanchoredResolved, unanchoredPersisting, snippetCorrected, sweepWarnings,
                 previousAnalysis.getId());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrIssueTrackingService.java`
around lines 319 - 375, The local counter sweepRevived misnames what it tracks;
rename it to something clearer like sweepOmittedAnchorCount (or
sweepWarnings/sweepFalseResolutionsCount) in PrIssueTrackingService so the
declaration, the increment (sweepRevived++), and any later uses (e.g., the log
that expects sweepWarnings) all refer to the new name; update the variable
identifier wherever referenced in this method (declaration, the ++ inside the
SweepResult loop, and the final summary/log usage) to keep semantics but fix the
misleading name.
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AstScopeEnricher.java (1)

142-165: Potential resource leak in parseAll — add usage guidance.

The method returns ParsedTree objects that must be closed by the caller, but there's no try-finally or explicit close pattern demonstrated. If any tree fails to be closed, native tree-sitter resources will leak.

Consider adding a helper method that auto-closes all trees or documenting a recommended pattern:

📚 Add a resource-safe batch API or usage example

Consider providing a companion method that manages cleanup:

/**
 * Execute a function with parsed trees, automatically closing all trees afterward.
 * Avoids resource leaks when working with multiple files.
 */
public <T> T withParsedTrees(Map<String, String> fileContents,
                             java.util.function.Function<Map<String, ParsedTree>, T> action) {
    Map<String, ParsedTree> trees = parseAll(fileContents);
    try {
        return action.apply(trees);
    } finally {
        trees.values().forEach(ParsedTree::close);
    }
}

Or update the Javadoc with an explicit usage pattern:

/**
 * ...
 * <pre>{@code
 * Map<String, ParsedTree> trees = enricher.parseAll(fileContents);
 * try {
 *     enricher.enrichWithParsedTrees(issues, fileContents, trees);
 * } finally {
 *     trees.values().forEach(ParsedTree::close);
 * }
 * }</pre>
 */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AstScopeEnricher.java`
around lines 142 - 165, parseAll returns ParsedTree instances that require
explicit closing; add a resource-safe API and/or usage guidance: implement a
companion method (e.g., withParsedTrees) that calls parseAll(fileContents),
invokes a user-supplied java.util.function.Function or Consumer, and always
closes all ParsedTree instances in a finally block
(trees.values().forEach(ParsedTree::close)); alternatively update parseAll
Javadoc to show the try/finally pattern when calling enrichWithParsedTrees so
callers know to close ParsedTree::close to avoid native resource leaks.
java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/IssueReconciliationEngine.java (1)

504-624: Extract shared anchor-matching/classification logic to prevent drift.

The AST and non-AST classifiers duplicate most of the flow; this already diverged on the empty-content guard. A shared core method with a pluggable context-hash strategy would reduce future regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/IssueReconciliationEngine.java`
around lines 504 - 624, The classifyByContent AST and non-AST paths duplicate
anchor-matching and classification logic (including the empty-content guard)
causing drift; extract that shared flow into a helper like
matchAnchorAndClassify(issue, currentHashes, scopeResolverOptional,
ContextHashStrategy) that returns a ContentClassification, keep the current
classifyByContent overloads thin, and implement two small ContextHashStrategy
implementations: one that calls computeContextHashFromAst(currentHashes,
updatedLine, parsedTree, scopeResolver, issue.getEndLineNumber(),
issue.getCodeSnippet()) for the AST path and one that no-ops/returns null for
the non-AST path. Move the empty-content guard into the shared helper and reuse
hasAnyReliableAnchor, ContentClassification, Classification, and the
updatedLine/updatedLineHash/contentFound logic so both callers share identical
behavior.
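A loose illustration of the suggested shape, with hypothetical names (`ContextHashStrategy`, `classifyShared`) and deliberately simplified classification rules that are not the engine's real logic: the empty-content guard lives once in the shared core, and the AST/non-AST difference collapses into a strategy.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical pluggable strategy: the AST path computes a scope-aware context hash,
// the non-AST path simply reports that no hash is available.
@FunctionalInterface
interface ContextHashStrategy {
    Optional<String> contextHash(int line);
}

class ReconciliationSketch {
    enum Classification { MATCHED, RESOLVED, NEEDS_AI }

    /** Shared core: both classifyByContent overloads delegate here, so the guard exists once. */
    static List<Classification> classifyShared(List<Integer> issueLines,
                                               List<String> currentLines,
                                               ContextHashStrategy strategy) {
        if (currentLines.isEmpty()) {
            // No content => cannot deterministically resolve; route everything to the AI.
            return issueLines.stream().map(l -> Classification.NEEDS_AI).toList();
        }
        // Simplified stand-in for the real anchor-matching rules.
        return issueLines.stream()
                .map(line -> strategy.contextHash(line).isPresent()
                        ? Classification.MATCHED
                        : Classification.RESOLVED)
                .toList();
    }

    public static void main(String[] args) {
        ContextHashStrategy astPath = line -> Optional.of("h" + line);
        ContextHashStrategy nonAstPath = line -> Optional.empty();

        // Empty content: both paths now agree on NEEDS_AI.
        System.out.println(classifyShared(List.of(3), List.of(), astPath));        // [NEEDS_AI]
        // With content present, only the strategy differs between the two paths.
        System.out.println(classifyShared(List.of(3), List.of("int x;"), nonAstPath)); // [RESOLVED]
    }
}
```

With this split, a future change to the guard or to the anchor-matching flow lands in one method instead of two divergent copies.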

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3825d9b8-81c2-4d55-bff4-e25d1c59600d

📥 Commits

Reviewing files that changed from the base of the PR and between 3c465a5 and 2099d2f.

📒 Files selected for processing (77)
  • frontend
  • java-ecosystem/libs/analysis-engine/pom.xml
  • java-ecosystem/libs/analysis-engine/src/main/java/module-info.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/BranchAnalysisProcessor.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/AstScopeEnricher.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/IssueReconciliationEngine.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/branch/BranchIssueMappingService.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/branch/BranchIssueReconciliationService.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrFileEnrichmentService.java
  • java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/pr/PrIssueTrackingService.java
  • java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/processor/analysis/BranchAnalysisProcessorTest.java
  • java-ecosystem/libs/analysis-engine/src/test/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessorTest.java
  • java-ecosystem/libs/ast-parser/.gitignore
  • java-ecosystem/libs/ast-parser/pom.xml
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/AstParserFacade.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/api/AstParseException.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/api/AstParser.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/api/ScopeAwareHasher.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/api/ScopeResolver.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/api/SymbolExtractor.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/DefaultScopeAwareHasher.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ParserPool.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ScopeQueryRegistry.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterAstParser.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterSymbolExtractor.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/ParsedTree.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/ScopeInfo.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/ScopeKind.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/SupportedLanguage.java
  • java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/model/SymbolInfo.java
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/bash/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/c/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/c_sharp/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/cpp/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/go/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/haskell/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/java/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/javascript/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/php/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/python/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/ruby/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/rust/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/scala/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/tsx/scopes.scm
  • java-ecosystem/libs/ast-parser/src/main/resources/queries/typescript/scopes.scm
  • java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/AstParserFacadeTest.java
  • java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/model/SupportedLanguageTest.java
  • java-ecosystem/libs/core/src/main/java/module-info.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTO.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/branch/BranchIssue.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/CodeAnalysisIssue.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/model/codeanalysis/IssueScope.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/service/CodeAnalysisService.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/anchoring/SnippetAnchoringService.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/anchoring/SnippetLocator.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/IssueFingerprint.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/LineHashSequence.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/ReconcilableIssue.java
  • java-ecosystem/libs/core/src/main/java/org/rostilos/codecrow/core/util/tracking/TrackingConfidence.java
  • java-ecosystem/libs/core/src/main/resources/db/migration/managed/V2.0.0__add_issue_scope_and_end_line.sql
  • java-ecosystem/libs/core/src/main/resources/db/migration/managed/V2.1.0__add_scope_start_line.sql
  • java-ecosystem/libs/core/src/main/resources/db/migration/managed/V2.2.0__widen_tracking_confidence_columns.sql
  • java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssueDTOTest.java
  • java-ecosystem/libs/core/src/test/java/org/rostilos/codecrow/core/dto/analysis/issue/IssuesSummaryDTOTest.java
  • java-ecosystem/pom.xml
  • java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/bitbucket/service/BitbucketAiClientService.java
  • java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/github/service/GitHubAiClientService.java
  • java-ecosystem/services/pipeline-agent/src/main/java/org/rostilos/codecrow/pipelineagent/gitlab/service/GitLabAiClientService.java
  • java-ecosystem/services/web-server/src/main/java/org/rostilos/codecrow/webserver/analysis/service/FileViewService.java
  • python-ecosystem/inference-orchestrator/model/output_schemas.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py
  • python-ecosystem/inference-orchestrator/utils/prompts/constants_shared.py
  • python-ecosystem/inference-orchestrator/utils/prompts/prompt_builder.py
  • python-ecosystem/inference-orchestrator/utils/prompts/prompt_constants.py
  • python-ecosystem/inference-orchestrator/utils/prompts/prompt_constants.py.bak
💤 Files with no reviewable changes (1)
  • python-ecosystem/inference-orchestrator/utils/prompts/prompt_constants.py.bak

Comment on lines +167 to +174
 CacheHitType cacheHit = postAnalysisCacheIfExist(project, pullRequest, request.getCommitHash(), request.getPullRequestId(),
         reportingService, request.getPlaceholderCommentId(), request.getTargetBranchName(), request.getSourceBranchName());
 if (cacheHit != CacheHitType.NONE) {
     publishAnalysisCompletedEvent(project, request, correlationId, startTime,
             AnalysisCompletedEvent.CompletionStatus.SUCCESS, 0, 0, null);
-    return Map.of("status", "cached_by_commit", "cached", true);
+    String cacheStatus = cacheHit == CacheHitType.COMMIT_HASH ? "cached_by_commit" : "cached";
+    return Map.of("status", cacheStatus, "cached", true);
 }

⚠️ Potential issue | 🟠 Major

Record analyzed commits before returning on cache hits.

Both cache-hit branches return before commit bookkeeping, so cached executions can skip analyzed_commit recording entirely.

🔧 Proposed fix
             if (cacheHit != CacheHitType.NONE) {
+                markPrCommitsAnalyzed(project, request.getSourceBranchName(), request.getCommitHash(), null);
                 publishAnalysisCompletedEvent(project, request, correlationId, startTime,
                         AnalysisCompletedEvent.CompletionStatus.SUCCESS, 0, 0, null);
                 String cacheStatus = cacheHit == CacheHitType.COMMIT_HASH ? "cached_by_commit" : "cached";
                 return Map.of("status", cacheStatus, "cached", true);
             }
@@
             if (postDiffFingerprintCacheIfExist(request, diffFingerprint, project, pullRequest, aiRequest, reportingService)) {
+                markPrCommitsAnalyzed(project, request.getSourceBranchName(), request.getCommitHash(), null);
                 publishAnalysisCompletedEvent(project, request, correlationId, startTime,
                         AnalysisCompletedEvent.CompletionStatus.SUCCESS, 0, 0, null);
                 return Map.of("status", "cached_by_fingerprint", "cached", true);
             }

Also applies to: 197-201

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java`
around lines 167 - 174, The cache-hit branches in PullRequestAnalysisProcessor
(after calling postAnalysisCacheIfExist) return early and skip recording
analyzed commits; modify the flow so that before returning on a cache hit you
invoke the same analyzed-commit bookkeeping used in the normal path (i.e., call
the method(s) that record analyzed commits for the project/pull request/commit
hash and related metadata) and only then call publishAnalysisCompletedEvent and
return the cache status; apply the same change to the other cache-hit branch
around the logic currently at or near the code handling the second cache check
(the block corresponding to lines ~197-201).

Comment on lines +467 to +474
        try {
            reportingService.postAnalysisResults(cloned, project,
                    request.getPullRequestId(), pullRequest.getId(),
                    request.getPlaceholderCommentId());
        } catch (IOException e) {
            log.error("Failed to post fingerprint-cached results to VCS: {}", e.getMessage(), e);
        }
        return true;

⚠️ Potential issue | 🟠 Major

Avoid reporting cache-hit success when posting to VCS fails.

These IOException handlers log and still return a cache hit, so the caller publishes success even when no result is posted.

🔧 Proposed fix
     ) {
@@
         try {
             reportingService.postAnalysisResults(cloned, project,
                     request.getPullRequestId(), pullRequest.getId(),
                     request.getPlaceholderCommentId());
+            return true;
         } catch (IOException e) {
             log.error("Failed to post fingerprint-cached results to VCS: {}", e.getMessage(), e);
+            return false;
         }
-        return true;
     }
@@
         if (cachedAnalysis.isPresent()) {
             try {
                 reportingService.postAnalysisResults(cachedAnalysis.get(),
                         project,
                         prId,
                         pullRequest.getId(),
                         placeholderCommentId);
             } catch (IOException e) {
                 log.error("Failed to post cached analysis results to VCS: {}", e.getMessage(), e);
+                return CacheHitType.NONE;
             }
             return CacheHitType.EXACT;
         }
@@
             try {
                 reportingService.postAnalysisResults(
                         cloned,
                         project,
                         prId,
                         pullRequest.getId(),
                         placeholderCommentId
                 );
             } catch (IOException e) {
                 log.error("Failed to post commit-hash cached results to VCS: {}", e.getMessage(), e);
+                return CacheHitType.NONE;
             }
             return CacheHitType.COMMIT_HASH;
         }

Also applies to: 497-506, 524-535

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/processor/analysis/PullRequestAnalysisProcessor.java`
around lines 467 - 474, The catch blocks that handle IOException around
reportingService.postAnalysisResults currently log the error but still return
true (treating as a cache-hit success); change those handlers to log the error
and return false (or otherwise signal failure) so a failed post does not report
success. Update the catch in PullRequestAnalysisProcessor where
reportingService.postAnalysisResults(...) is called (and the other two similar
try/catch sites mentioned) so that the catch (IOException e) { log.error(...);
return false; } replaces the current behavior, ensuring the method does not
claim success when posting to VCS fails.

Comment on lines +515 to +516
List<ContentClassification> results = new ArrayList<>();


⚠️ Potential issue | 🟠 Major

Add the same empty-content guard to the AST-aware classifier.

When currentHashes is empty, this overload can classify anchored issues as RESOLVED; the non-AST method correctly routes them to NEEDS_AI.

🔧 Proposed fix
     public List<ContentClassification> classifyByContent(
             List<? extends ReconcilableIssue> issues,
             LineHashSequence currentHashes,
             ParsedTree parsedTree,
             ScopeResolver scopeResolver) {
@@
         List<ContentClassification> results = new ArrayList<>();
+
+        // Keep parity with non-AST path: no content => cannot deterministically resolve.
+        if (currentHashes.getLineCount() == 0) {
+            for (ReconcilableIssue issue : issues) {
+                results.add(new ContentClassification(issue, Classification.NEEDS_AI, null, null));
+            }
+            return results;
+        }
 
         for (ReconcilableIssue issue : issues) {

Also applies to: 614-617

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/analysis-engine/src/main/java/org/rostilos/codecrow/analysisengine/service/IssueReconciliationEngine.java`
around lines 515 - 516, The AST-aware classification path in
IssueReconciliationEngine incorrectly marks anchored issues RESOLVED when
currentHashes is empty; add the same empty-content guard used in the non-AST
overload so that when currentHashes.isEmpty() the code does not treat issues as
resolved but instead routes them to NEEDS_AI (or returns the same default
handling as the non-AST method). Locate the AST-aware classifier overload that
references the currentHashes variable (and the surrounding logic around where
List<ContentClassification> results is created) and insert a check for
currentHashes.isEmpty() that mirrors the non-AST implementation’s behavior
(return/mark NEEDS_AI for anchored issues) — also apply the same guard in the
other similar block around lines 614-617.

Comment on lines +84 to +107
public TSParser borrow(SupportedLanguage language) {
    if (closed) {
        throw new AstParseException("ParserPool is closed");
    }

    BlockingQueue<TSParser> queue = pools.computeIfAbsent(language,
            lang -> new ArrayBlockingQueue<>(poolSize));

    // Try to get an existing parser without blocking
    TSParser parser = queue.poll();
    if (parser != null) {
        return parser;
    }

    // No parser available — create a new one if below capacity
    synchronized (queue) {
        // Re-check after acquiring lock
        parser = queue.poll();
        if (parser != null) {
            return parser;
        }
        // Create new parser
        return createParser(language);
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

wc -l java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ParserPool.java

Repository: rostilos/CodeCrow

Length of output: 166


🏁 Script executed:

cat -n java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ParserPool.java

Repository: rostilos/CodeCrow

Length of output: 7189


🏁 Script executed:

cat -n java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ParserPool.java | sed -n '75,82p'

Repository: rostilos/CodeCrow

Length of output: 565


🏁 Script executed:

python3 << 'EOF'
# Verify EnumMap thread-safety concern
# EnumMap is documented as NOT thread-safe for concurrent mutation
# but we're calling computeIfAbsent and clear concurrently in borrow/close paths

print("EnumMap thread-safety analysis:")
print("- EnumMap is NOT synchronized")
print("- concurrent computeIfAbsent() calls from multiple borrow() threads can race")
print("- concurrent clear() call from close() can race with active borrows")
print("- Solution: use ConcurrentHashMap or synchronize all EnumMap access")
print("\nTimeoutMs usage analysis:")
print("- timeoutMs field initialized in constructor")
print("- Never passed to any blocking queue operation")
print("- Should use: queue.poll(timeoutMs, TimeUnit.MILLISECONDS)")
print("- Current code: queue.poll() — always non-blocking")
EOF

Repository: rostilos/CodeCrow

Length of output: 564


Implement timeout enforcement and fix thread-safety in borrow() and shared maps.

The implementation violates its own contract in three ways:

  1. Timeout is never enforced: The javadoc (lines 77–78) promises "blocks until one is returned or timeout is reached," but borrow() calls queue.poll() without timeout arguments. When the queue is empty, it immediately creates a new parser without blocking or timing out. Use queue.poll(timeoutMs, TimeUnit.MILLISECONDS) to enforce the promised behavior.

  2. No per-language creation cap: The ArrayBlockingQueue capacity limits only idle parsers, not total created ones. Multiple concurrent borrows can each trigger createParser(), creating more than poolSize parsers in flight at once. Add a per-language counter (e.g., AtomicInteger) to enforce the hard cap.

  3. EnumMap is not thread-safe: pools and grammars (lines 49, 52) are unsynchronized EnumMap instances accessed concurrently via computeIfAbsent() in borrow/getGrammar, get() in release, and clear() in close. This risks ConcurrentModificationException. Replace with ConcurrentHashMap or synchronize all map accesses.
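The three fixes can be combined in one shape. Below is a minimal, hypothetical sketch (a `BoundedPool` with `String` standing in for `TSParser`, not the project's actual class): a `ConcurrentHashMap` replaces the `EnumMap`s, a per-key `AtomicInteger` enforces the creation cap, and `poll(timeoutMs, TimeUnit.MILLISECONDS)` honors the documented blocking behavior once the cap is reached.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: shows the timeout + hard-cap + thread-safe-map pattern.
public class BoundedPool {
    private final int poolSize;
    private final long timeoutMs;
    private final ConcurrentMap<String, BlockingQueue<String>> pools = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, AtomicInteger> created = new ConcurrentHashMap<>();

    public BoundedPool(int poolSize, long timeoutMs) {
        this.poolSize = poolSize;
        this.timeoutMs = timeoutMs;
    }

    public String borrow(String language) throws InterruptedException {
        BlockingQueue<String> queue =
                pools.computeIfAbsent(language, l -> new ArrayBlockingQueue<>(poolSize));
        String parser = queue.poll();  // fast path: reuse an idle instance
        if (parser != null) {
            return parser;
        }
        AtomicInteger count = created.computeIfAbsent(language, l -> new AtomicInteger());
        // Atomically reserve a creation slot so no more than poolSize exist in flight.
        while (true) {
            int current = count.get();
            if (current >= poolSize) {
                break;  // at cap: fall through to the blocking wait
            }
            if (count.compareAndSet(current, current + 1)) {
                return "parser-" + language + "-" + current;  // stand-in for createParser()
            }
        }
        // At capacity: block up to timeoutMs for a release, as the javadoc promises.
        parser = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (parser == null) {
            throw new IllegalStateException("Timed out borrowing parser for " + language);
        }
        return parser;
    }

    public void release(String language, String parser) {
        BlockingQueue<String> queue = pools.get(language);
        if (queue == null || !queue.offer(parser)) {
            created.get(language).decrementAndGet();  // dropped: free a creation slot
        }
    }
}
```

In the real `ParserPool`, `release()` and `close()` would likewise decrement the per-language counter whenever a parser is discarded rather than re-queued, so the cap stays accurate.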

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ParserPool.java`
around lines 84 - 107, borrow() currently bypasses the promised timeout, allows
unlimited concurrent parser creation, and accesses non-thread-safe EnumMaps; to
fix: replace the EnumMap instances pools and grammars with thread-safe
ConcurrentHashMap (or wrap all accesses) and update borrow(SupportedLanguage) to
use queue.poll(timeoutMs, TimeUnit.MILLISECONDS) (or the existing timeout field)
so it blocks and returns null on timeout (then throw an appropriate timeout
AstParseException); to enforce the hard cap (poolSize) add a per-language
AtomicInteger counter map (e.g., createdCounts.computeIfAbsent(language, ...))
that you increment atomically before calling createParser() and decrement on
parser release/close, and guard parser creation by checking createdCount.get() <
poolSize (use compare-and-set or synchronized block on the per-language
counter/queue to avoid races) so createParser() cannot produce more than
poolSize instances for a language; ensure release() and close() decrement the
counters and return parsers to the queue consistently.

Comment on lines +13 to +15
(block) @function.def

(do_block) @function.def

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how scope captures are interpreted by resolver logic and compare patterns across languages.

rg -n -C3 '@function\.def|@block\.def|@class\.def|@namespace\.def' \
  java-ecosystem/libs/ast-parser/src/main/resources/queries

rg -n -C4 'function\.def|block\.def|class\.def|namespace\.def' \
  java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java \
  java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ScopeQueryRegistry.java

Repository: rostilos/CodeCrow

Length of output: 50373


🏁 Script executed:

#!/bin/bash
# Search for resolver logic that handles scope types
rg -A5 -B5 'ScopeType|@function\.def|@block\.def' \
  java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ \
  --type java -m 20

# Also check if there's any special handling based on scope capture type
rg 'function\.def|block\.def' \
  java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/ \
  -m 10 --type java

Repository: rostilos/CodeCrow

Length of output: 4790


Change Ruby block and do_block from @function.def to @block.def.

Ruby blocks are not function/method/lambda definitions—they are passed-to-method constructs similar to control-flow blocks. The resolver explicitly documents that @function.def is for "function/method/lambda definitions" and @block.def is for "control-flow blocks," and since the resolver handles these captures differently, misclassifying blocks affects scope reconciliation precision. This change aligns Ruby with patterns across all other supported languages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/ruby/scopes.scm`
around lines 13 - 15, The captures for Ruby block nodes are misclassified:
change the capture assigned to the tree-sitter node names block and do_block
from `@function.def` to `@block.def` so they are treated as control-flow blocks
rather than function/method/lambda definitions; update the entries for the
symbols "block" and "do_block" in scopes.scm to use "@block.def" instead of
"@function.def" so the resolver will reconcile scopes correctly.

(match_expression) @block.def

; ── Packages ─────────────────────────────────────────────────────────
(package_clause) @namespace.def

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Compare namespace captures across language scope queries =="
fd 'scopes\.scm$' java-ecosystem/libs/ast-parser/src/main/resources/queries \
  | while read -r f; do
      echo "---- $f"
      rg -n '@namespace\.def|@name|package|namespace' "$f" || true
    done

echo
echo "== Check resolver/symbol extractor expectations for `@name` =="
rg -n -C3 -P '@name|@namespace\.def|ScopeQueryRegistry|TreeSitterScopeResolver|TreeSitterSymbolExtractor|ScopeResolver' \
  java-ecosystem/libs || true

Repository: rostilos/CodeCrow

Length of output: 50373


🏁 Script executed:

#!/bin/bash
# Check Scala scopes.scm structure around package_clause
head -30 java-ecosystem/libs/ast-parser/src/main/resources/queries/scala/scopes.scm | cat -n

echo "=== Compare with Go's approach ==="
grep -A2 'package_clause' java-ecosystem/libs/ast-parser/src/main/resources/queries/go/scopes.scm

Repository: rostilos/CodeCrow

Length of output: 1202


🏁 Script executed:

#!/bin/bash
# Search for Scala package-related tests or tree-sitter grammar info
echo "=== Search for Scala package tests or examples ==="
find java-ecosystem -name "*scala*" -o -name "*Scala*" | grep -i test | head -10

echo
echo "=== Look for any Scala test fixtures with packages ==="
fd -e scala . java-ecosystem | head -20

Repository: rostilos/CodeCrow

Length of output: 167


🏁 Script executed:

#!/bin/bash
# Check if there are any AST parser tests for Scala that might show expected structure
echo "=== Search for Scala tests in ast-parser test directory ==="
find java-ecosystem/libs/ast-parser/src/test -type f -name "*.java" | xargs grep -l -i scala | head -5

echo
echo "=== Search for any Scala source in test resources ==="
find java-ecosystem/libs/ast-parser/src/test -type f -name "*.scala" 2>/dev/null | head -10

echo
echo "=== Check AstParserFacadeTest for language patterns ==="
grep -n "scala\|Scala" java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/AstParserFacadeTest.java | head -20

Repository: rostilos/CodeCrow

Length of output: 342


🏁 Script executed:

#!/bin/bash
# Check test file for Scala support
echo "=== Check SupportedLanguageTest for Scala ==="
grep -n -C3 -i scala java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/model/SupportedLanguageTest.java

echo
echo "=== Check if there's a pattern by looking at one confirmed language (Go) ==="
rg -n "package_identifier|go|Go" java-ecosystem/libs/ast-parser/src/test -A2 | head -30

Repository: rostilos/CodeCrow

Length of output: 1233


🏁 Script executed:

#!/bin/bash
# Look for Scala grammar or parser documentation within the repo
echo "=== Search for tree-sitter or grammar references ==="
find java-ecosystem -type f -name "*.md" -o -name "*.txt" | xargs grep -l -i "tree-sitter\|grammar" | head -5

echo
echo "=== Check for Scala-related documentation ==="
find java-ecosystem -type f \( -name "*.md" -o -name "*.txt" \) | xargs grep -l -i scala | head -5

echo
echo "=== Look for any comments about package_clause structure in Scala ==="
rg -n "package" java-ecosystem/libs/ast-parser/src/main/resources/queries/scala/

Repository: rostilos/CodeCrow

Length of output: 395


🏁 Script executed:

#!/bin/bash
# Check the test scopes documentation
echo "=== Check TEST_SCOPES.md for Scala examples ==="
grep -n -C10 -i scala java-ecosystem/tests/integration-tests/TEST_SCOPES.md | head -60

echo
echo "=== Check if Scala has test cases ==="
grep -n "package" java-ecosystem/tests/integration-tests/TEST_SCOPES.md | head -20

Repository: rostilos/CodeCrow

Length of output: 2104


🏁 Script executed:

#!/bin/bash
# Let's look for actual scope resolution tests in the test directory
echo "=== Search for scope tests that might include namespace/package examples ==="
find java-ecosystem/libs/ast-parser/src/test -name "*.java" -type f | xargs grep -l "namespace\|package\|@namespace.def" | head -5

echo
echo "=== Check for any test data or fixtures ==="
find java-ecosystem/libs/ast-parser/src/test/resources -type f 2>/dev/null | head -20

Repository: rostilos/CodeCrow

Length of output: 394


🏁 Script executed:

#!/bin/bash
# Check AstParserFacadeTest for namespace/package tests
echo "=== Search for package/namespace scope tests in AstParserFacadeTest ==="
grep -n "NAMESPACE\|package\|Package" java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/AstParserFacadeTest.java | head -30

echo
echo "=== Check if any Scala-specific scope tests exist ==="
grep -n -i "scala" java-ecosystem/libs/ast-parser/src/test/java/org/rostilos/codecrow/astparser/AstParserFacadeTest.java

Repository: rostilos/CodeCrow

Length of output: 274


🏁 Script executed:

#!/bin/bash
# Let me search for references to how Scala tree-sitter grammar is being used
echo "=== Look for any tree-sitter grammar files or references ==="
find java-ecosystem -type f \( -name "*.json" -o -name "*.scm" \) | xargs grep -l "package" | head -10

echo
echo "=== Check resolver code to understand how `@name` is used ==="
find java-ecosystem/libs/ast-parser/src -type f -name "*.java" | xargs grep -l "TreeSitterSymbolExtractor\|ScopeResolver\|@name" | head -10

Repository: rostilos/CodeCrow

Length of output: 935


🏁 Script executed:

#!/bin/bash
# Check TreeSitterSymbolExtractor to understand how `@name` is used with namespaces
echo "=== Check TreeSitterSymbolExtractor for `@name` handling ==="
grep -n -A5 -B5 "@name\|namespace" java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterSymbolExtractor.java | head -60

echo
echo "=== Check TreeSitterScopeResolver for namespace handling ==="
grep -n -A5 -B5 "namespace\|NAMESPACE" java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java | head -40

Repository: rostilos/CodeCrow

Length of output: 4386


🏁 Script executed:

#!/bin/bash
# Perfect - let me verify the exact issue by checking how name extraction works
echo "=== Look for the full ScopeInfo creation to see what happens when `@name` is missing ==="
grep -n -A10 "ScopeInfo(" java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java | head -40

echo
echo "=== Check if empty name causes any issues ==="
grep -n "isEmpty\|isBlank\|null" java-ecosystem/libs/ast-parser/src/main/java/org/rostilos/codecrow/astparser/internal/TreeSitterScopeResolver.java | head -20

Repository: rostilos/CodeCrow

Length of output: 1757


🏁 Script executed:

#!/bin/bash
# Let me verify the exact structure by checking the Scala grammar or looking for documentation
echo "=== Check if there's any Scala grammar documentation in the repo ==="
find java-ecosystem -type f \( -name "*.md" -o -name "*.txt" \) -exec grep -l "package_clause\|Scala" {} \;

echo
echo "=== Search for any comments or docs about tree-sitter Scala ==="
rg -i "tree.?sitter.*scala|scala.*tree.?sitter" java-ecosystem --type md --type txt | head -10

Repository: rostilos/CodeCrow

Length of output: 194


🏁 Script executed:

#!/bin/bash
# Let me check if there's a web source or if we can infer from tree-sitter documentation
echo "=== Check the header comment pattern - it says `@name` should be captured ==="
head -3 java-ecosystem/libs/ast-parser/src/main/resources/queries/scala/scopes.scm

echo
echo "=== Compare all language headers ==="
head -2 java-ecosystem/libs/ast-parser/src/main/resources/queries/*/scopes.scm | grep -A1 "^==>"

Repository: rostilos/CodeCrow

Length of output: 2172


Add @name capture to Scala package_clause to match declared contract.

The header (line 2) declares that scopes.scm captures `@name`, but line 28's `(package_clause) @namespace.def` does not capture it. Go's equivalent rule shows the correct pattern: `(package_clause (package_identifier) @name) @namespace.def`. Most languages that define `@namespace.def` (C#, C++, PHP, Ruby, Rust, TypeScript) also explicitly capture the namespace name. Update the Scala rule to include the package identifier child node as `@name` to match the declared contract and align with resolver expectations for stable scope identity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/scala/scopes.scm`
at line 28, The Scala tree-sitter rule in scopes.scm currently marks
(package_clause) as `@namespace.def` but does not capture the package identifier
as `@name`; update the rule so the child package_identifier node is captured as
`@name` (e.g., change the pattern for package_clause to include
(package_identifier) `@name`) so `@namespace.def` contains the package identifier
capture and matches the declared header contract and resolver expectations.

Comment on lines +1 to +33
; TSX scope queries — same as TypeScript (TSX extends TypeScript grammar)
; Captures: @function.def, @class.def, @block.def, @namespace.def, @name

; ── Functions ────────────────────────────────────────────────────────
(function_declaration
name: (identifier) @name) @function.def

(method_definition
name: (property_identifier) @name) @function.def

(arrow_function) @function.def

(function) @function.def

; ── Classes & Interfaces ─────────────────────────────────────────────
(class_declaration
name: (type_identifier) @name) @class.def

(interface_declaration
name: (type_identifier) @name) @class.def

(enum_declaration
name: (identifier) @name) @class.def

; ── Control-flow blocks ──────────────────────────────────────────────
(if_statement) @block.def
(for_statement) @block.def
(for_in_statement) @block.def
(while_statement) @block.def
(do_statement) @block.def
(try_statement) @block.def
(catch_clause) @block.def
(switch_statement) @block.def

🛠️ Refactor suggestion | 🟠 Major

TSX scope queries are missing several constructs present in the TypeScript version.

The comment states "same as TypeScript" but several constructs are missing compared to typescript/scopes.scm:

  • generator_function_declaration (functions)
  • type_alias_declaration (classes)
  • finally_clause (control-flow)
  • Entire namespace.def / module section

This inconsistency could cause different scope resolution behavior between .ts and .tsx files.

📝 Suggested additions for parity
 (function) @function.def

+; Generator functions
+(generator_function_declaration
+  name: (identifier) @name) @function.def
+
 ; ── Classes & Interfaces ─────────────────────────────────────────────
 (class_declaration
   name: (type_identifier) @name) @class.def
@@ ...
 (enum_declaration
   name: (identifier) @name) @class.def

+(type_alias_declaration
+  name: (type_identifier) @name) @class.def
+
 ; ── Control-flow blocks ──────────────────────────────────────────────
 (if_statement) @block.def
 (for_statement) @block.def
 (for_in_statement) @block.def
+(for_of_statement) @block.def
 (while_statement) @block.def
 (do_statement) @block.def
 (try_statement) @block.def
 (catch_clause) @block.def
+(finally_clause) @block.def
 (switch_statement) @block.def
+
+; ── Namespaces / Modules ─────────────────────────────────────────────
+(module
+  name: (identifier) @name) @namespace.def
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@java-ecosystem/libs/ast-parser/src/main/resources/queries/tsx/scopes.scm`
around lines 1 - 33, The TSX scopes query must be brought to parity with the
TypeScript version: add the missing nodes used for scope captures — include
generator_function_declaration alongside
function_declaration/method_definition/arrow_function/function in the Functions
section, add type_alias_declaration to the Classes & Interfaces section (to
capture `@class.def` for type aliases), add finally_clause to the Control-flow
blocks (so finally is tagged `@block.def`), and reintroduce the namespace/module
section so that namespace and module nodes (captured as `@namespace.def`) are
present; update the captures list accordingly to ensure `@namespace.def` is
emitted just like in typescript/scopes.scm.

Comment on lines +22 to +24
WHEN (line_number IS NOT NULL AND line_number > 1 AND code_snippet IS NOT NULL AND code_snippet <> '')
THEN 'LINE'
ELSE 'FILE'

⚠️ Potential issue | 🟠 Major

Backfill rule misclassifies real Line 1 issues as FILE scope.

Using line_number > 1 means valid issues on Line 1 with a snippet are backfilled as FILE instead of LINE.

Proposed fix
-    WHEN (line_number IS NOT NULL AND line_number > 1 AND code_snippet IS NOT NULL AND code_snippet <> '')
+    WHEN (line_number IS NOT NULL AND code_snippet IS NOT NULL AND code_snippet <> '')
         THEN 'LINE'
@@
-    WHEN (line_number IS NOT NULL AND line_number > 1 AND code_snippet IS NOT NULL AND code_snippet <> '')
+    WHEN (line_number IS NOT NULL AND code_snippet IS NOT NULL AND code_snippet <> '')
         THEN 'LINE'

Also applies to: 38-40

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@java-ecosystem/libs/core/src/main/resources/db/migration/managed/V2.0.0__add_issue_scope_and_end_line.sql`
around lines 22 - 24, The backfill WHEN clause currently uses "line_number > 1"
and therefore misclassifies valid issues on line 1 as FILE; update the condition
in V2.0.0__add_issue_scope_and_end_line.sql to use "line_number >= 1" (i.e.,
WHEN (line_number IS NOT NULL AND line_number >= 1 AND code_snippet IS NOT NULL
AND code_snippet <> '') THEN 'LINE' ELSE 'FILE') and apply the same change to
the second identical occurrence later in the file (the block around lines 38-40)
so line 1 with a snippet is correctly backfilled as LINE.

Comment on lines +568 to +577
new_line = new_data.get('line')
prev_line = prev_data.get('line') or prev_data.get('lineNumber')
# Use the LLM's line if it provided a meaningful one (> 1),
# otherwise fall back to previous data, then default to 1.
if new_line is not None and int(new_line) > 1:
    merged_line = new_line
elif prev_line is not None and int(prev_line) > 0:
    merged_line = prev_line
else:
    merged_line = new_line or prev_line or 1

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual file content at the specified lines
wc -l python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py

Repository: rostilos/CodeCrow

Length of output: 150


🏁 Script executed:

# Read the specific lines mentioned in the review
sed -n '560,590p' python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py

Repository: rostilos/CodeCrow

Length of output: 1836


🏁 Script executed:

# Run the verification script provided in the review
rg -n 'int\(new_line\)|int\(prev_line\)' python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py

python <<'PY'
samples = ["", "N/A", None, "12"]
for s in samples:
    try:
        print(f"{s!r} -> {int(s)}")
    except Exception as e:
        print(f"{s!r} -> {type(e).__name__}")
PY

Repository: rostilos/CodeCrow

Length of output: 256


🏁 Script executed:

# Check if there are any safeguards already in place around line field handling
rg -B 5 -A 10 "new_line = new_data.get\('line'\)" python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py

Repository: rostilos/CodeCrow

Length of output: 833


Guard numeric coercion for merged line selection.

Direct int(...) conversion on model output can throw and abort reconciliation when the line field is malformed or non-numeric. The is not None check does not prevent ValueError or TypeError when new_line or prev_line contain empty strings, non-numeric values, or unexpected types from LLM output or stored data.

🐛 Proposed fix
             new_line = new_data.get('line')
             prev_line = prev_data.get('line') or prev_data.get('lineNumber')
-            # Use the LLM's line if it provided a meaningful one (> 1),
-            # otherwise fall back to previous data, then default to 1.
-            if new_line is not None and int(new_line) > 1:
-                merged_line = new_line
-            elif prev_line is not None and int(prev_line) > 0:
-                merged_line = prev_line
-            else:
-                merged_line = new_line or prev_line or 1
+            # Use the LLM's line if it provided a meaningful one (> 1),
+            # otherwise fall back to previous data, then default to 1.
+            def _safe_int(value):
+                try:
+                    return int(value)
+                except (TypeError, ValueError):
+                    return None
+
+            new_line_num = _safe_int(new_line)
+            prev_line_num = _safe_int(prev_line)
+
+            if new_line_num is not None and new_line_num > 1:
+                merged_line = new_line_num
+            elif prev_line_num is not None and prev_line_num > 0:
+                merged_line = prev_line_num
+            else:
+                merged_line = 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
             new_line = new_data.get('line')
             prev_line = prev_data.get('line') or prev_data.get('lineNumber')
             # Use the LLM's line if it provided a meaningful one (> 1),
             # otherwise fall back to previous data, then default to 1.
-            if new_line is not None and int(new_line) > 1:
-                merged_line = new_line
-            elif prev_line is not None and int(prev_line) > 0:
-                merged_line = prev_line
-            else:
-                merged_line = new_line or prev_line or 1
+            def _safe_int(value):
+                try:
+                    return int(value)
+                except (TypeError, ValueError):
+                    return None
+
+            new_line_num = _safe_int(new_line)
+            prev_line_num = _safe_int(prev_line)
+
+            if new_line_num is not None and new_line_num > 1:
+                merged_line = new_line_num
+            elif prev_line_num is not None and prev_line_num > 0:
+                merged_line = prev_line_num
+            else:
+                merged_line = 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/reconciliation.py`
around lines 568 - 577, The model may return non-numeric or malformed strings
for the line fields causing int(...) to raise; update the merged-line selection
in reconciliation.py (around variables new_line, prev_line, merged_line) to
safely coerce to integers using a guarded parse (e.g., a small helper like
parse_int_or_none or try/except inline) that returns None for non-coercible
values, then use those parsed integers for comparisons (check parsed_new > 1,
parsed_prev > 0) and fall back to the original values or 1 only after safe
parsing to avoid ValueError/TypeError aborting reconciliation.

…on density scoring

- Added support for enrichment identifiers in fetch_batch_rag_context to improve context retrieval by including extends, implements, and calls from AST metadata.
- Updated DeterministicContextRequest model to accept additional identifiers for enhanced lookups.
- Modified query handling in deterministic context service to inject enrichment identifiers, ensuring comprehensive context retrieval.
- Implemented AST-based extraction for docstrings and signatures in MetadataExtractor, improving accuracy over regex.
- Introduced information density metric in scoring configuration to penalize low-signal chunks, enhancing relevance in RAG results.
- Updated scoring logic to incorporate information density, ensuring better ranking of context results based on meaningful structure.
- Enhanced transitive parent resolution in deterministic context service to include additional identifiers, improving inheritance chain visibility.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (6)
python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py (1)

164-169: Consider guarding against zero threshold to avoid silent penalty disablement.

The density penalty logic is mathematically correct. However, if density_threshold is set to 0.0 via the environment variable, the condition information_density < self.density_threshold will never be true for valid densities (≥0), silently disabling the penalty entirely.

While not a bug (division by zero is prevented by the condition), this could be surprising behavior. Consider adding a validation or log warning during initialization if threshold ≤ 0.

💡 Optional: Add threshold validation in __init__ or get_scoring_config
 def get_scoring_config() -> ScoringConfig:
     """Get the global ScoringConfig instance."""
     global _scoring_config
     if _scoring_config is None:
         _scoring_config = ScoringConfig()
+        if _scoring_config.density_threshold <= 0:
+            logger.warning(
+                f"RAG_DENSITY_THRESHOLD={_scoring_config.density_threshold} disables density penalty"
+            )
         logger.info("ScoringConfig initialized (content-type boost + density penalty)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py`
around lines 164 - 169, The density penalty can be silently disabled when
density_threshold ≤ 0; add validation where density_threshold is set (e.g.,
ScoringConfig.__init__ or get_scoring_config) to guard against non-positive
values: if density_threshold <= 0, either raise a ValueError or log a warning
and replace it with a small positive epsilon (e.g., 1e-6) so the interpolation
using information_density / density_threshold remains meaningful; ensure the
message references density_threshold and density_floor so callers know which
config was adjusted.
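As a minimal sketch of the guarded shape this prompt suggests (the names `density_threshold` and `density_floor` come from the prompt; the actual `ScoringConfig` interpolation may differ):

```python
def density_penalty(density: float, threshold: float, floor: float) -> float:
    # Guard: a non-positive threshold would silently disable the penalty,
    # because density < threshold is never true for valid densities (>= 0).
    # Clamp to a small epsilon and warn instead (sketch; names assumed).
    if threshold <= 0:
        print(f"warning: density_threshold={threshold} <= 0; substituting epsilon")
        threshold = 1e-6

    if density >= threshold:
        return 1.0  # dense enough: no penalty

    # Linear interpolation between the floor and 1.0 below the threshold.
    return floor + (1.0 - floor) * (density / threshold)

print(density_penalty(0.5, 1.0, 0.2))  # halfway below threshold -> 0.6
print(density_penalty(2.0, 1.0, 0.2))  # above threshold -> 1.0
```

With the guard in place, a misconfigured `RAG_DENSITY_THRESHOLD=0` degrades loudly instead of silently turning the penalty off.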
python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py (1)

169-176: Rename unused loop variable def_name to _def_name.

The loop only uses def_chunks; def_name is not referenced in the loop body.

Proposed fix
-        for def_name, def_chunks in related_definitions.items():
+        for _def_name, def_chunks in related_definitions.items():
             for chunk in def_chunks:
                 meta = chunk.get("metadata", {})
                 if isinstance(meta.get("extends"), list):
                     transitive_parents.update(meta["extends"])
                 if meta.get("parent_class"):
                     transitive_parents.add(meta["parent_class"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
around lines 169 - 176, The loop over related_definitions uses an unused loop
variable def_name; rename it to _def_name in the for statement (for _def_name,
def_chunks in related_definitions.items()) to signal it's intentionally unused
and satisfy linters, leaving the body that references def_chunks,
transitive_parents, and chunk.get("metadata", {}) unchanged.
python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py (1)

392-412: Rename unused loop variable fp to _fp.

The loop iterates over enrichment_metadata.items() but only uses meta, not fp.

Proposed fix
     # ── Strategy 1: AST metadata from enrichment ──
     # enrichment_metadata maps file_path → ParsedFileMetadataDto fields
     if enrichment_metadata:
-        for fp, meta in enrichment_metadata.items():
+        for _fp, meta in enrichment_metadata.items():
             # Class/interface definitions → find other implementations
             for name in (meta.get("semantic_names") or meta.get("semanticNames") or [])[:10]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 392 - 412, The loop over enrichment_metadata.items() declares an
unused variable fp; rename it to _fp to signal intentional unused status and
avoid lint warnings — update the for loop (currently "for fp, meta in
enrichment_metadata.items():") to "for _fp, meta in
enrichment_metadata.items():" and verify there are no references to fp elsewhere
in this block where _add and meta are used.
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py (1)

364-369: Avoid re-walking the AST for the same capture.

ts_node is already resolved here, but _extract_rich_ast_details(...) performs another _find_node_at_position(...) for the same range. Reuse the resolved node to reduce per-chunk traversal overhead.

♻️ Proposed refactor
-                self._extract_rich_ast_details(chunk, tree, main_cap, lang_name)
+                self._extract_rich_ast_details(
+                    chunk, tree, main_cap, lang_name, ts_node=ts_node
+                )
-    def _extract_rich_ast_details(
+    def _extract_rich_ast_details(
         self,
         chunk: ASTChunk,
         tree: Any,
         captured_node: Any,
-        lang_name: str
+        lang_name: str,
+        ts_node: Optional[Any] = None
     ) -> None:
@@
-        node = self._find_node_at_position(
-            tree.root_node, 
-            captured_node.start_byte, 
-            captured_node.end_byte
-        )
+        node = ts_node or self._find_node_at_position(
+            tree.root_node,
+            captured_node.start_byte,
+            captured_node.end_byte
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`
around lines 364 - 369, The code currently calls _find_node_at_position to get
ts_node and then later re-walks the AST inside _extract_rich_ast_details for the
same range; change the flow so the already-resolved ts_node is passed into
_extract_rich_ast_details (or into the metadata extractor calls) to avoid a
second _find_node_at_position call. Update callers (_extract_rich_ast_details,
_metadata_extractor.extract_docstring, and
_metadata_extractor.extract_signature) to accept an optional ts_node parameter
and use it when provided, falling back to finding the node only if ts_node is
None, and ensure chunk.docstring and chunk.signature assignment reuses the
resolved ts_node.
python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py (2)

347-351: Use an immutable type for _BODY_NODE_TYPES

_BODY_NODE_TYPES is a mutable class attribute. Freezing it avoids accidental cross-instance/test mutation.

Proposed fix
@@
-from typing import Dict, Any, List, Optional
+from typing import Dict, Any, List, Optional, ClassVar, FrozenSet
@@
-    _BODY_NODE_TYPES = {
+    _BODY_NODE_TYPES: ClassVar[FrozenSet[str]] = frozenset({
         'block', 'statement_block', 'compound_statement', 'class_body',
         'declaration_list', 'field_declaration_list', 'enum_body',
         'interface_body', 'match_block', 'block_node',
-    }
+    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`
around lines 347 - 351, _BODY_NODE_TYPES is defined as a mutable set which risks
accidental mutation across instances/tests; replace it with an immutable
collection (e.g., frozenset or tuple) to freeze the values. Update the
declaration of _BODY_NODE_TYPES to use frozenset({...}) (or a tuple literal) so
that code referencing _BODY_NODE_TYPES elsewhere (e.g., in splitter/metadata
logic) reads from an immutable constant and cannot be modified at runtime.

385-387: Avoid blind except Exception in AST extraction paths

Line 385 and Line 499 catch everything, which can hide real parser/logic defects and make regressions harder to detect.

Proposed fix
@@
-        except Exception as e:
-            logger.debug(f"AST docstring extraction failed for {language}: {e}")
+        except (AttributeError, TypeError, UnicodeDecodeError, ValueError) as e:
+            logger.debug("AST docstring extraction failed for %s: %s", language, e, exc_info=True)
             return None
@@
-        except Exception as e:
-            logger.debug(f"AST signature extraction failed for {language}: {e}")
+        except (AttributeError, TypeError, UnicodeDecodeError, ValueError) as e:
+            logger.debug("AST signature extraction failed for %s: %s", language, e, exc_info=True)
             return None

Also applies to: 499-501

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`
around lines 385 - 387, The try/except in the AST docstring extraction silently
catches all exceptions (e.g., in the block that logs "AST docstring extraction
failed for {language}: {e}"), which can hide real bugs; replace the blind
"except Exception" with a narrow set of expected exceptions (for example
SyntaxError, ValueError, AttributeError) and log those, but re-raise any
unexpected exceptions (or omit catching them) so real parser/logic defects
surface; apply the same change to the second occurrence around the other AST
extraction block so both extraction paths only handle expected errors and let
unknown exceptions propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`:
- Line 377: The log call uses an unnecessary f-string prefix: change
logger.warning(f"No RAG chunks included after tiered selection") to a plain
string logger.warning("No RAG chunks included after tiered selection"); locate
the call to logger.warning in context_helpers.py (the one logging "No RAG chunks
included after tiered selection") and remove the leading "f" so the message is a
normal string literal.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`:
- Around line 416-424: The current logic only inspects the immediate previous
sibling and drops earlier comment lines; change it to walk backwards collecting
all consecutive comment nodes before the node (following prev_sibling /
prev_named_sibling links) into a list, decode each node.text (bytes vs str) in
order, join them with newlines, then pass the combined text to
_clean_comment_text; use the existing helper _is_comment_node to detect comment
nodes and preserve the original decoding logic for each node.text so this
replaces the single-prev logic around the variables prev and text and still
calls _clean_comment_text at the end.
- Around line 474-487: The signature slice uses body_start (an offset relative
to node) but then slices content_bytes from 0, which corrupts signatures when a
wrapper node (e.g., decorated_definition) is present; update the slice to use
absolute byte positions: compute content_bytes once, then take sig_bytes =
content_bytes[node.start_byte: body_child.start_byte] (or equivalently
content_bytes[node.start_byte: node.start_byte + body_start]) and decode that,
and apply the same change for the other occurrence around lines 503-519; adjust
uses of body_start/sig_bytes accordingly in the _find_node_with_body handling to
ensure the slice starts at node.start_byte rather than 0.
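The off-by-offset bug in that prompt can be sketched in isolation. `FakeNode` is a hypothetical stand-in; real tree-sitter nodes expose `start_byte`/`end_byte` as absolute offsets into the file's bytes, which is exactly why slicing from 0 with a node-relative offset corrupts the signature:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a tree-sitter node (absolute byte offsets).
@dataclass
class FakeNode:
    start_byte: int
    end_byte: int

def signature_slice_buggy(content_bytes: bytes, node: FakeNode, body_child: FakeNode) -> str:
    # Node-relative offset sliced from 0: wrong whenever the node
    # does not start at byte 0 (e.g. a decorated_definition wrapper).
    body_start = body_child.start_byte - node.start_byte
    return content_bytes[:body_start].decode('utf-8', errors='replace').rstrip().rstrip('{').rstrip()

def signature_slice(content_bytes: bytes, node: FakeNode, body_child: FakeNode) -> str:
    # Absolute offsets: slice from node.start_byte to the body's start.
    sig_bytes = content_bytes[node.start_byte: body_child.start_byte]
    return sig_bytes.decode('utf-8', errors='replace').rstrip().rstrip('{').rstrip()

source = b"@decorator\ndef f(x):\n    return x\n"
# Suppose the function definition node starts after the decorator (byte 11)
# and its body starts at byte 21 (illustrative offsets).
func = FakeNode(start_byte=11, end_byte=len(source))
body = FakeNode(start_byte=21, end_byte=len(source))

print(signature_slice_buggy(source, func, body))  # "@decorator" (corrupted)
print(signature_slice(source, func, body))        # "def f(x):"
</imports>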

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`:
- Line 1124: The line-span calculation uses exclusive math causing inflated
densities; change the computation of line_span (where metadata.get('end_line')
and metadata.get('start_line') are used) to account for inclusive end_line by
computing max(metadata.get('end_line', 1) - metadata.get('start_line', 0) + 1,
1) and keep using that value in the density calculation and any downstream
ranking logic so chunks reflect correct inclusive line counts.
- Around line 930-932: The density is being computed against the parent span
because metadata still contains the original start_line/end_line; update the
code so _compute_information_density uses the fragment's actual span (not the
parent) — either by setting metadata['start_line'] and metadata['end_line'] to
the fragment's local start/end before calling self._compute_information_density,
or by changing _compute_information_density to accept and compute from the
fragment's own offsets/length (e.g., pass fragment_start and fragment_end or
fragment_text_length) so each fragment's information_density is normalized by
its own span rather than the parent span.
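The two splitter.py corrections can be sketched together. These helpers are assumptions for illustration, not the repository's actual implementation: `line_span` shows the inclusive line count, and `information_density` uses a hypothetical tokens-per-line proxy normalized by the fragment's own span rather than the parent's:

```python
def line_span(start_line: int, end_line: int) -> int:
    # end_line is inclusive: lines 10..12 cover 3 lines, not 2,
    # so add 1 and clamp at a minimum of 1.
    return max(end_line - start_line + 1, 1)

def information_density(text: str, start_line: int, end_line: int) -> float:
    # Hypothetical token-per-line proxy; the real metric may differ.
    return len(text.split()) / line_span(start_line, end_line)

# Exclusive math would report 2 for a 3-line chunk.
print(line_span(10, 12))  # 3

# A one-line fragment inside a 100-line parent: normalizing by the
# parent's span would dilute the density to 0.04; using the fragment's
# own span yields the meaningful per-fragment value.
print(information_density("return a + b", 42, 42))  # 4.0
```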

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`:
- Around line 38-39: Remove the unused pr_changed_files parameter from the
method signature (it currently appears alongside additional_identifiers); update
the method declaration to drop pr_changed_files and the Optional[List[str]] type
for it, adjust any defaulting logic if present, and update any callers to stop
passing that argument. Ensure only additional_identifiers remains in the
signature (with its existing type), and run static checks to remove any
now-unused imports or type hints related solely to pr_changed_files.

---

Nitpick comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`:
- Around line 392-412: The loop over enrichment_metadata.items() declares an
unused variable fp; rename it to _fp to signal intentional unused status and
avoid lint warnings — update the for loop (currently "for fp, meta in
enrichment_metadata.items():") to "for _fp, meta in
enrichment_metadata.items():" and verify there are no references to fp elsewhere
in this block where _add and meta are used.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`:
- Around line 347-351: _BODY_NODE_TYPES is defined as a mutable set which risks
accidental mutation across instances/tests; replace it with an immutable
collection (e.g., frozenset or tuple) to freeze the values. Update the
declaration of _BODY_NODE_TYPES to use frozenset({...}) (or a tuple literal) so
that code referencing _BODY_NODE_TYPES elsewhere (e.g., in splitter/metadata
logic) reads from an immutable constant and cannot be modified at runtime.
- Around line 385-387: The try/except in the AST docstring extraction silently
catches all exceptions (e.g., in the block that logs "AST docstring extraction
failed for {language}: {e}"), which can hide real bugs; replace the blind
"except Exception" with a narrow set of expected exceptions (for example
SyntaxError, ValueError, AttributeError) and log those, but re-raise any
unexpected exceptions (or omit catching them) so real parser/logic defects
surface; apply the same change to the second occurrence around the other AST
extraction block so both extraction paths only handle expected errors and let
unknown exceptions propagate.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`:
- Around line 364-369: The code currently calls _find_node_at_position to get
ts_node and then later re-walks the AST inside _extract_rich_ast_details for the
same range; change the flow so the already-resolved ts_node is passed into
_extract_rich_ast_details (or into the metadata extractor calls) to avoid a
second _find_node_at_position call. Update callers (_extract_rich_ast_details,
_metadata_extractor.extract_docstring, and
_metadata_extractor.extract_signature) to accept an optional ts_node parameter
and use it when provided, falling back to finding the node only if ts_node is
None, and ensure chunk.docstring and chunk.signature assignment reuses the
resolved ts_node.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py`:
- Around line 164-169: The density penalty can be silently disabled when
density_threshold ≤ 0; add validation where density_threshold is set (e.g.,
ScoringConfig.__init__ or get_scoring_config) to guard against non-positive
values: if density_threshold <= 0, either raise a ValueError or log a warning
and replace it with a small positive epsilon (e.g., 1e-6) so the interpolation
using information_density / density_threshold remains meaningful; ensure the
message references density_threshold and density_floor so callers know which
config was adjusted.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`:
- Around line 169-176: The loop over related_definitions uses an unused loop
variable def_name; rename it to _def_name in the for statement (for _def_name,
def_chunks in related_definitions.items()) to signal it's intentionally unused
and satisfy linters, leaving the body that references def_chunks,
transitive_parents, and chunk.get("metadata", {}) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d990bef0-a421-4b70-a64c-c0a857168740

📥 Commits

Reviewing files that changed from the base of the PR and between 2099d2f and dd0c72a.

📒 Files selected for processing (10)
  • python-ecosystem/inference-orchestrator/service/rag/rag_client.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/api/models.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/api/routers/query.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/services/pr_context.py


    if not formatted_parts and not duplication_parts:
        logger.warning(f"No RAG chunks included (total: {len(chunks)}, skipped_stale: {skipped_stale})")
        logger.warning(f"No RAG chunks included after tiered selection")

⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

The f-string has no placeholders, so the f prefix is extraneous.

Proposed fix
-        logger.warning(f"No RAG chunks included after tiered selection")
+        logger.warning("No RAG chunks included after tiered selection")
🧰 Tools
🪛 Ruff (0.15.2)

[error] 377-377: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`
at line 377, The log call uses an unnecessary f-string prefix: change
logger.warning(f"No RAG chunks included after tiered selection") to a plain
string logger.warning("No RAG chunks included after tiered selection"); locate
the call to logger.warning in context_helpers.py (the one logging "No RAG chunks
included after tiered selection") and remove the leading "f" so the message is a
normal string literal.

Comment on lines +416 to +424
        prev = node.prev_sibling if hasattr(node, 'prev_sibling') else None
        if prev is None:
            prev = getattr(node, 'prev_named_sibling', None)

        if prev is None or not self._is_comment_node(prev):
            return None

        text = prev.text.decode('utf-8', errors='replace') if isinstance(prev.text, bytes) else str(prev.text)
        return self._clean_comment_text(text)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the target file
git ls-files | grep -E "metadata\.py$"

Repository: rostilos/CodeCrow

Length of output: 133


🏁 Script executed:

# Check if the file exists and read the relevant section
head -500 python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py | tail -150

Repository: rostilos/CodeCrow

Length of output: 6461


🏁 Script executed:

# Get more context around lines 416-424, including the full method
sed -n '380,450p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 3183


🏁 Script executed:

# Search for method definition and class context
rg -B 30 "_extract_preceding_comment_docstring" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py | head -80

Repository: rostilos/CodeCrow

Length of output: 2702


🏁 Script executed:

# Check what _is_comment_node does
rg -A 10 "_is_comment_node" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 894


🏁 Script executed:

# Look for tests related to comment extraction
fd -t f "test.*\.py$" | xargs rg -l "comment|docstring" | head -5

Repository: rostilos/CodeCrow

Length of output: 119


🏁 Script executed:

# Search for test cases related to comment extraction
rg -r "preceding_comment|_extract_preceding_comment_docstring" --type py python-ecosystem/rag-pipeline/

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look for test files that might show multi-line comment scenarios
fd -t f "test.*\.py" python-ecosystem/rag-pipeline/ | xargs rg -l "comment|docstring" 2>/dev/null | head -10

Repository: rostilos/CodeCrow

Length of output: 119


🏁 Script executed:

# Check how tree-sitter grammars emit comments - look for any grammar config or documentation
fd -t f -e "json" -e "toml" -e "md" python-ecosystem/ | xargs rg -l "grammar|tree-sitter" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 733


🏁 Script executed:

# Search for actual usage and examples of comment parsing
rg -B 5 -A 5 "_extract_preceding_comment_docstring\|_extract_docstring_from_node" python-ecosystem/rag-pipeline/ --type py | head -100

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check if there are any integration tests with actual code samples
fd -t f -e "py" -e "js" -e "java" python-ecosystem/rag-pipeline/tests/ | head -10 | xargs ls -lh 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 1078


🏁 Script executed:

# Search for test cases with correct syntax
fd . python-ecosystem/rag-pipeline/tests/ -name "*.py" | head -10

Repository: rostilos/CodeCrow

Length of output: 291


🏁 Script executed:

# Look for any test files that exercise comment extraction
rg "extract.*comment|test.*comment" python-ecosystem/rag-pipeline/ -t py

Repository: rostilos/CodeCrow

Length of output: 508


🏁 Script executed:

# Check if there are sample code files used in tests
fd . python-ecosystem/rag-pipeline/ -name "*.py" -o -name "*.js" -o -name "*.java" | grep -E "(test|sample|example)" | head -15

Repository: rostilos/CodeCrow

Length of output: 291


🏁 Script executed:

# Search documentation on how multi-line comments are handled
rg "multi.*line|block.*comment" python-ecosystem/rag-pipeline/ -A 3 -B 3

Repository: rostilos/CodeCrow

Length of output: 1687


🏁 Script executed:

# Look for tree-sitter grammar definitions or tree structure inspection
fd . python-ecosystem/ -name "*.scm" -o -name "*grammar*" 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check the full context of the method including how it's called
sed -n '350,500p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 6516


🏁 Script executed:

# Check how tree-sitter actually represents consecutive line comments
# by looking for any test data or examples
fd . python-ecosystem/ -type f \( -name "*.py" -o -name "*.js" -o -name "*.go" -o -name "*.java" \) | xargs grep -l "^//" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 231


🏁 Script executed:

# Look for documentation or comments about tree-sitter node structure
rg "prev_sibling|prev_named_sibling|line_comment" python-ecosystem/rag-pipeline/ -B 2 -A 2

Repository: rostilos/CodeCrow

Length of output: 1441


🏁 Script executed:

# Check if there are any actual tests that demonstrate multi-line comment scenarios
find python-ecosystem/rag-pipeline -type f -name "*.py" | xargs grep -l "comment" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 206


🏁 Script executed:

# Check the full file structure to understand class hierarchy and usage
wc -l python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 137


🏁 Script executed:

# Look for how prev_sibling vs prev_named_sibling differs in tree-sitter usage
rg "prev_sibling|prev_named_sibling" python-ecosystem/ -t py -B 3 -A 3

Repository: rostilos/CodeCrow

Length of output: 1109


🏁 Script executed:

# Search for any documentation or comments about tree-sitter node structures
rg "tree-sitter|grammar|comment.*node" python-ecosystem/rag-pipeline/ -A 2 | head -50

Repository: rostilos/CodeCrow

Length of output: 4330


🏁 Script executed:

# Check if there are actual test fixtures or examples showing comment structures
find python-ecosystem/rag-pipeline -type f -name "*.md" -o -name "*.txt" | xargs grep -l "comment\|tree-sitter" 2>/dev/null

Repository: rostilos/CodeCrow

Length of output: 107


🏁 Script executed:

# Look at how _clean_comment_text actually processes multi-line text
sed -n '440,480p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 1721


🏁 Script executed:

# Check how the method is called - to understand what node types are passed
rg "_extract_docstring_from_node\|_extract_preceding_comment_docstring" python-ecosystem/rag-pipeline/src/ -B 5 -A 2

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look for any comments or documentation mentioning multi-line behavior
rg "multi.*line|consecutive|adjacent.*comment" python-ecosystem/rag-pipeline/

Repository: rostilos/CodeCrow

Length of output: 199


🏁 Script executed:

# Check how tree-sitter represents consecutive line comments by examining splitter usage
sed -n '1,100p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 3393


🏁 Script executed:

# Look for any actual code being tested or processed
find python-ecosystem/rag-pipeline -type f \( -name "*.py" -o -name "*.js" -o -name "*.java" \) -path "*/test*" 2>/dev/null | head -5

Repository: rostilos/CodeCrow

Length of output: 257


🏁 Script executed:

# Check if _clean_comment_text actually handles multi-line content within a single node
grep -A 20 "_clean_comment_text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py | head -40

Repository: rostilos/CodeCrow

Length of output: 893


🏁 Script executed:

# Search for any tests or example snippets with comments
rg "//|#|/\*" python-ecosystem/rag-pipeline/tests/ --type py -B 2 -A 2 | head -80

Repository: rostilos/CodeCrow

Length of output: 4639


🏁 Script executed:

# Check the tree-sitter Python grammar documentation or actual behavior examples
python3 << 'EOF'
# Try to understand tree-sitter comment node behavior
import subprocess
result = subprocess.run(['pip', 'list'], capture_output=True, text=True)
print("tree-sitter" in result.stdout)
EOF

Repository: rostilos/CodeCrow

Length of output: 66


🏁 Script executed:

# Make one final check: search for any comments or issues mentioning this behavior
rg "consecutive|adjacent|multiple.*comment|line.*comment.*node" python-ecosystem/rag-pipeline/ -i

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check if there's any documentation about tree-sitter comment behavior
rg "emit|node.*per.*line|comment.*node" python-ecosystem/ README.md -A 2 -B 2 2>/dev/null | head -30

Repository: rostilos/CodeCrow

Length of output: 3116


Collect all preceding comment nodes to avoid losing multi-line doc comments

Lines 416–424 capture only the immediate previous comment node. For tree-sitter grammars that emit one node per line comment, this discards earlier lines in a multi-line comment block.

Proposed fix
@@
-        prev = node.prev_sibling if hasattr(node, 'prev_sibling') else None
-        if prev is None:
-            prev = getattr(node, 'prev_named_sibling', None)
-
-        if prev is None or not self._is_comment_node(prev):
-            return None
-
-        text = prev.text.decode('utf-8', errors='replace') if isinstance(prev.text, bytes) else str(prev.text)
-        return self._clean_comment_text(text)
+        prev = getattr(node, 'prev_named_sibling', None) or getattr(node, 'prev_sibling', None)
+        if prev is None:
+            return None
+
+        parts = []
+        curr = prev
+        while curr is not None and self._is_comment_node(curr):
+            text = curr.text.decode('utf-8', errors='replace') if isinstance(curr.text, bytes) else str(curr.text)
+            parts.append(text)
+            curr = getattr(curr, 'prev_named_sibling', None)
+
+        if not parts:
+            return None
+        return self._clean_comment_text('\n'.join(reversed(parts)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`
around lines 416 - 424, The current logic only inspects the immediate previous
sibling and drops earlier comment lines; change it to walk backwards collecting
all consecutive comment nodes before the node (following prev_sibling /
prev_named_sibling links) into a list, decode each node.text (bytes vs str) in
order, join them with newlines, then pass the combined text to
_clean_comment_text; use the existing helper _is_comment_node to detect comment
nodes and preserve the original decoding logic for each node.text so this
replaces the single-prev logic around the variables prev and text and still
calls _clean_comment_text at the end.

Comment on lines +474 to +487
        target = self._find_node_with_body(ts_node)
        if target is None:
            # No body found — use first line as signature
            first_line = content.split('\n')[0].strip()
            return first_line if len(first_line) > 5 else None

        node, body_child = target
        body_start = body_child.start_byte - node.start_byte

        if body_start > 0:
            # Signature = everything before the body
            content_bytes = content.encode('utf-8') if isinstance(content, str) else content
            sig_bytes = content_bytes[:body_start]
            sig = sig_bytes.decode('utf-8', errors='replace').rstrip().rstrip('{').rstrip()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the metadata.py file around the cited lines
cat -n python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py | sed -n '470,525p'

Repository: rostilos/CodeCrow

Length of output: 2661


🏁 Script executed:

# Let's also find the _find_node_with_body method to understand its implementation
rg -n "_find_node_with_body" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -B2 -A15

Repository: rostilos/CodeCrow

Length of output: 1733


🏁 Script executed:

# Check how this method is called and what content is passed
rg -n "_extract_signature_from_node" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -B5 -A5

Repository: rostilos/CodeCrow

Length of output: 918


🏁 Script executed:

# Look at the full context of the method containing lines 474-487
rg -n "def _extract_signature_from_node" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -A50

Repository: rostilos/CodeCrow

Length of output: 2363


🏁 Script executed:

# Check if there are decorators or wrapper content handling in the code
rg -n "decorated|wrapper" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -i -B2 -A2

Repository: rostilos/CodeCrow

Length of output: 309


🏁 Script executed:

# Find all call sites of extract_signature and _extract_signature_from_node to understand what content is
rg -n "extract_signature\(|_extract_signature_from_node" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/ -B3 -A1

Repository: rostilos/CodeCrow

Length of output: 3147


🏁 Script executed:

# Look at how extract_signature is called - let's trace the full context
rg -n "def extract_signature" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -A20

Repository: rostilos/CodeCrow

Length of output: 1043


🏁 Script executed:

# Check what ts_node.text gives us - this is important to understand the byte boundaries
rg -n "ts_node\.text|node\.text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -B2 -A2

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at _is_body_node to understand what constitutes a body
rg -n "_is_body_node" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -A10

Repository: rostilos/CodeCrow

Length of output: 2100


🏁 Script executed:

# Check if content parameter is the full file or node-specific by examining the broader context
rg -n "class MetadataExtractor\|def __init__" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -A30 | head -60

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Critical: understand if content represents the full file or just the chunk
# Let's examine how chunks are created in splitter.py
rg -n "main_cap\.text\|ChunkMetadata" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B5 -A5 | head -80

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check if node.text is ever used - this would reveal byte boundary assumptions
rg -n "node\.text\|ts_node\.text\|\.text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py | head -40

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Understand the byte offset semantics - are they relative to file or chunk?
# Let's trace get_node_in_range if it exists
rg -n "get_node_in_range\|def.*node.*range" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/ -A10

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check imports and understand tree-sitter usage
head -50 python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 1609


🏁 Script executed:

# Check _get_node_in_range to understand offset semantics
rg -n "_get_node_in_range" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/ -A15

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at the full context of how ts_node is passed to extract_signature
sed -n '360,375p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 916


🏁 Script executed:

# Check if there's any offset tracking or adjustment
sed -n '445,465p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 1184


🏁 Script executed:

# Look at the tree-sitter usage to understand if start_byte is absolute or relative
rg -n "start_byte\|end_byte" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py -B1 -A1

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check what main_cap is and whether it matches the node text
rg -n "main_cap\|Capture" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B2 -A2 | head -60

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at _find_node_at_position to understand how ts_node is located
rg -n "_find_node_at_position" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -A20

Repository: rostilos/CodeCrow

Length of output: 3349


🏁 Script executed:

# Understand the content parameter better - is it the chunk or the full file?
# Let's look at how content is constructed in both call sites
sed -n '340,380p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 2077


🏁 Script executed:

# Let's also see the second call site context
sed -n '440,470p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 1582


🏁 Script executed:

# Check if there's any offset adjustment being made between content and node
rg -n "offset\|start_byte" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B2 -A2 | head -50

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Look at how source_bytes is used in the second call site
sed -n '430,460p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 1569


🏁 Script executed:

# Check the exact relationship between ts_node and content in both call sites
# First, let's understand what happens when _find_node_with_body returns a child
rg -n "source_bytes\|chunk\.content\|main_cap\.text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B2 -A2 | head -70

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Critical: Let's check if the first call site uses main_cap.text which is node.text
rg -n "main_cap\.text\|node\.text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B3 -A3

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Verify if ts_node in first call site matches the content scope
# If ts_node is located via _find_node_at_position with main_cap's start/end bytes,
# then ts_node should have the same span as main_cap
# Let's verify that the spans match
sed -n '318,330p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 652


🏁 Script executed:

# Let's find get_node_text to understand how it extracts text
rg -n "def get_node_text\|get_node_text =" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -A10

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Let's verify: in the first call site, main_cap.text is what type exactly?
# Let's see the definition of the Capture object or how main_cap is created
rg -n "\.text" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py | head -20

Repository: rostilos/CodeCrow

Length of output: 891


🏁 Script executed:

# Let's trace: when _extract_signature_from_node is called with ts_node and content,
# what is the relationship? Does content always match ts_node's byte span?
# Let's check if there's ever an offset adjustment
sed -n '365,375p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 713


🏁 Script executed:

# Let me verify the core issue: when _find_node_with_body returns (child, grandchild),
# does the child's byte offsets still work with the content that was extracted for ts_node?
# Let's look at actual body node detection
sed -n '503,520p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 734


🏁 Script executed:

# Now let me verify the core logic: when _find_node_with_body returns (child, grandchild),
# what is the relationship to ts_node and content?

# Let's create a test to show the issue:
python3 << 'EOF'
# Simulating the scenario:
# ts_node: decorated function at bytes 0-100 with content "def foo():\n    pass"
# child: actual function_definition starting at byte 10 (after decorator)
# grandchild: body starting at byte 30 (after function signature)

# If content = ts_node.text (the whole decorated function)
# And _find_node_with_body returns (child, grandchild)
# Then:
node_start = 0  # ts_node.start_byte
child_start = 10  # child.start_byte (inside ts_node)
grandchild_start = 30  # grandchild.start_byte

body_start = grandchild_start - child_start  # 30 - 10 = 20
print(f"body_start (relative to child): {body_start}")

# But content is from ts_node (bytes 0-100)
# So sig_bytes = content_bytes[:20] extracts bytes 0-20
# Which might include part of the decorator, not the signature!
print(f"But content is from ts_node (bytes 0-100)")
print(f"sig_bytes takes [:20] from full decorated function")
print(f"This is WRONG because we need bytes [{child_start}:{child_start+body_start}] = bytes 10:30")
EOF

Repository: rostilos/CodeCrow

Length of output: 246


🏁 Script executed:

# Verify: when does _find_node_with_body return a child instead of ts_node?
# It returns child when ts_node doesn't have a body child directly but its child does
# This happens with decorated or wrapper nodes

# Let's check if there's actual test coverage or evidence of decorated functions
rg -n "decorator\|@.*\(" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/ -i

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Let's verify by looking at what _is_body_node considers a body
sed -n '367,370p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py

Repository: rostilos/CodeCrow

Length of output: 289


🏁 Script executed:

# Let me verify the exact byte offset scenario one more time
# to ensure the bug is real. Let's trace through with actual code.

# When _find_node_with_body returns (child, grandchild) instead of (ts_node, body_child),
# the issue is clear. Let me verify this can actually happen:

python3 << 'EOF'
# Scenario: decorated function in Python
# tree-sitter parse tree:
# decorated_definition (ts_node, bytes 0-50)
#   ├── decorator (bytes 0-12) "@decorator\n"
#   └── function_definition (child, bytes 12-50) "def foo():\n    pass"
#       ├── parameters (bytes 16-18) "()"
#       └── block (grandchild, bytes 20-50) ":\n    pass"

# In _find_node_with_body:
# - Check ts_node (decorated_definition) children: decorator, function_definition
# - Is decorator a body? No. Is function_definition a body? No.
# - Then check ts_node.children[1] (function_definition) children: parameters, block
# - Is block a body? YES!
# - Return (function_definition, block) where:
#   - function_definition.start_byte = 12
#   - block.start_byte = 20

# Now in _extract_signature_from_node:
# body_start = 20 - 12 = 8
# content = "@decorator\ndef foo():" (full decorated_definition)
# content_bytes[:8] = "@decora" <- WRONG! Should be "def foo(" from child

ts_node_start = 0
ts_node_end = 50
child_start = 12
grandchild_start = 20

body_start = grandchild_start - child_start
print(f"body_start calculated: {body_start}")
print(f"Slicing content_bytes[:body_start] = content_bytes[:{body_start}]")
print(f"But content is from ts_node (bytes {ts_node_start}:{ts_node_end})")
print(f"Should actually slice from child: content_bytes[{child_start-ts_node_start}:{child_start-ts_node_start}+{body_start}]")
print(f"= content_bytes[{child_start-ts_node_start}:{child_start-ts_node_start + body_start}]")
EOF

Repository: rostilos/CodeCrow

Length of output: 261


🏁 Script executed:

# Now let me check if this scenario actually occurs in practice
# by looking at tree-sitter grammar files for Python

find python-ecosystem/rag-pipeline -name "*.scm" | head -5

Repository: rostilos/CodeCrow

Length of output: 456


🏁 Script executed:

# Check if there are any test cases that demonstrate decorated functions
rg -n "decorated\|@" python-ecosystem/rag-pipeline/tests -i -A3 -B3 2>/dev/null | head -40

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Final verification: check if content matches ts_node span in both call sites
# by examining how main_cap.text is used
sed -n '348,375p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 1430


🏁 Script executed:

# Let me verify what node types are actually processed in the second call site
sed -n '405,440p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 1598


🏁 Script executed:

# Check what all_types contains
rg -n "all_types\|class_types\|function_types" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -B2 -A2

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Let me verify if decorated_definition is included in the node types being processed
rg -n "decorated" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -i

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Let me check what _get_semantic_node_types returns to see if decorated functions are included
rg -n "_get_semantic_node_types" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -A50 | head -70

Repository: rostilos/CodeCrow

Length of output: 3732


🏁 Script executed:

# Let me verify if the bug is real by understanding node relationships better
# Check if decorated_definition or similar wrapper nodes can be in the processing
rg -n "def.*get_semantic\|decorated\|function_definition\|class_definition" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py -i | head -30

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Let's trace through the exact scenario to confirm the bug:
# When node returned by _find_node_with_body is a child of ts_node,
# and content comes from ts_node, what happens?

python3 << 'EOF'
# Real scenario from the code:
# First call site: content = main_cap.text (which is the ts_node span)
#                  ts_node = _find_node_at_position(...main_cap.start_byte...main_cap.end_byte)
#                  So ts_node.start_byte == 0 (relative to content)

# Second call site: content = get_node_text(node) (which is node.start_byte:node.end_byte)
#                  ts_node = node
#                  So ts_node.start_byte == 0 (relative to content)

# In BOTH cases, when _find_node_with_body returns (child, grandchild):
# - child.start_byte is now absolute (file bytes)
# - But it's being used relative to content start

# The key insight: In both call sites, content starts at position 0
# When a child node is returned, its start_byte is still absolute
# body_start = grandchild.start_byte - child.start_byte
# This is correct RELATIVE to child
# But sig_bytes = content_bytes[:body_start] assumes content starts where child starts
# Which is WRONG in first call site where content starts where ts_node starts!

print("First call site issue confirmed:")
print("- content = main_cap.text starts at ts_node.start_byte (absolute)")
print("- If _find_node_with_body returns (child, grandchild):")
print("  body_start = grandchild.start_byte - child.start_byte (correct child-relative)")
print("  sig_bytes = content_bytes[:body_start] (WRONG - ignores child offset from ts_node)")
EOF

Repository: rostilos/CodeCrow

Length of output: 380


🏁 Script executed:

# Now verify the second call site doesn't have the same issue
# by checking if ts_node passed equals the node being processed

sed -n '454,456p' python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 299


🏁 Script executed:

# Let me confirm the bug one more time with concrete understanding:
# In second call site, ts_node = node, and content = get_node_text(node)
# So content starts at position 0 relative to itself

# When _find_node_with_body returns (child, grandchild):
# - child is a child of ts_node
# - child.start_byte is absolute file position
# - grandchild.start_byte is absolute file position
# - body_start = grandchild.start_byte - child.start_byte (correct relative to child)
# - But content is from ts_node, not from child
# - So we need: content_bytes[child.start_byte - ts_node.start_byte : child.start_byte - ts_node.start_byte + body_start]
# - Currently: content_bytes[:body_start] is WRONG

# Let me verify the actual scenario one final time by checking node types
rg -n "decorated_definition\|function_definition\|class_definition" python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Confirm: the semantic types processed in fallback don't include decorated_definition
# so how can the issue occur?

# Let me trace: if a decorated function is in the source code, 
# what node type is matched by the semantic types?

cat python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py | sed -n '1361,1400p'

Repository: rostilos/CodeCrow

Length of output: 1857


🏁 Script executed:

# I need to understand: if we process function_definition nodes,
# and a function_definition can be a child of decorated_definition,
# then when traverse() encounters function_definition, it passes that to extract_signature
# But _find_node_with_body will look at function_definition's children, not decorated_definition's

# So the bug might not trigger in the second call site since we already have function_definition
# Let me verify the first call site more carefully

# In first call site: main_cap is from a query match (probably the full decorated_definition)
# Then ts_node = _find_node_at_position(...main_cap bounds...)
# Could return the decorated_definition
# Then extract_signature receives both decorated_definition content and that node
# If _find_node_with_body looks at decorated_definition's children, it might find function_definition
# which doesn't have a body child, then look at function_definition's children
# and find the block!

# This is the actual scenario where the bug occurs
python3 << 'EOF'
print("FIRST CALL SITE BUG SCENARIO:")
print("1. Query matches decorated_definition (bytes 0-50)")
print("2. main_cap.text = '@decorator\\ndef foo():\\n    pass'")
print("3. ts_node = decorated_definition (bytes 0-50)")
print("4. extract_signature called with (content, 'python', ts_node=decorated_definition)")
print("5. _find_node_with_body(decorated_definition) checks:")
print("   - decorated_definition.children = [decorator, function_definition]")
print("   - decorator is not a body, function_definition is not a body")
print("   - Check function_definition.children = [def, identifier, parameters, block]")
print("   - block is a body! Return (function_definition, block)")
print("6. node = function_definition (child of decorated_definition)")
print("7. body_start = block.start_byte - function_definition.start_byte = 20")
print("8. content_bytes[:20] takes from decorated_definition's start (WRONG!)")
print("   Should skip the decorator first")
EOF

Repository: rostilos/CodeCrow

Length of output: 889


Signature extraction can be corrupted when _find_node_with_body() returns a child node

When _find_node_with_body() returns a child node (e.g., from a decorated function where it descends to find the body), line 486 slices from content_bytes without accounting for the child's offset relative to the wrapper node. If content spans a decorated definition, the extracted signature can include decorator text or be truncated.

Example: For `@decorator\ndef foo():\n    pass` matched as `decorated_definition`, if `_find_node_with_body` returns `(function_definition, block)`, then `body_start = block.start_byte - function_definition.start_byte`, but `sig_bytes = content_bytes[:body_start]` slices from position 0 of the decorated wrapper, not from the function definition's start.

This occurs in the first call site (lines 365-369) where queries can match wrapper nodes.

Proposed fix
             node, body_child = target
             body_start = body_child.start_byte - node.start_byte
 
             if body_start > 0:
-                # Signature = everything before the body
-                content_bytes = content.encode('utf-8') if isinstance(content, str) else content
-                sig_bytes = content_bytes[:body_start]
+                # Signature = everything before the body (node-relative)
+                # Adjust offset if node is not the original ts_node
+                node_offset = 0
+                if node is not ts_node and hasattr(ts_node, 'start_byte') and hasattr(node, 'start_byte'):
+                    node_offset = node.start_byte - ts_node.start_byte
+                content_bytes = content.encode('utf-8') if isinstance(content, str) else content
+                sig_bytes = content_bytes[node_offset:node_offset + body_start]
                 sig = sig_bytes.decode('utf-8', errors='replace').rstrip().rstrip('{').rstrip()

Also applies to lines 503-519.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/metadata.py`
around lines 474 - 487, The signature slice uses body_start (an offset relative
to node) but then slices content_bytes from 0, which corrupts signatures when a
wrapper node (e.g., decorated_definition) is present; update the slice to use
absolute byte positions: compute content_bytes once, then take sig_bytes =
content_bytes[node.start_byte: body_child.start_byte] (or equivalently
content_bytes[node.start_byte: node.start_byte + body_start]) and decode that,
and apply the same change for the other occurrence around lines 503-519; adjust
uses of body_start/sig_bytes accordingly in the _find_node_with_body handling to
ensure the slice starts at node.start_byte rather than 0.
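The slicing mistake is reproducible without tree-sitter. Below is a minimal sketch with hypothetical byte offsets for the `@decorator` example above; `SimpleNamespace` stands in for real tree-sitter nodes, which expose `start_byte` the same way:

```python
from types import SimpleNamespace

source = b"@decorator\ndef foo():\n    pass"

# Hypothetical node offsets: the decorator occupies bytes 0-10,
# the function_definition starts at byte 11, its block body at byte 21.
ts_node = SimpleNamespace(start_byte=0)       # decorated_definition (wrapper)
node = SimpleNamespace(start_byte=11)         # function_definition (child)
body_child = SimpleNamespace(start_byte=21)   # block body

content_bytes = source  # content spans the whole decorated_definition
body_start = body_child.start_byte - node.start_byte  # 10, relative to node

buggy_sig = content_bytes[:body_start]  # slices from the wrapper's start
node_offset = node.start_byte - ts_node.start_byte
fixed_sig = content_bytes[node_offset:node_offset + body_start]

print(buggy_sig)  # b'@decorator' -- decorator text, not the signature
print(fixed_sig)  # b'def foo():' -- the actual signature
```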

Comment on lines +930 to +932
# Compute information density for fragment
metadata['information_density'] = self._compute_information_density(metadata)


⚠️ Potential issue | 🟠 Major

Fragment density is calculated against parent span, which under-scores split chunks.

In oversized splitting, metadata keeps original start_line/end_line, so density for each fragment is divided by the full parent span. This can incorrectly push relevant fragments toward near-zero density.

🛠️ Proposed fix
-            metadata['information_density'] = self._compute_information_density(metadata)
+            density_metadata = dict(metadata)
+            density_metadata['start_line'] = 1
+            density_metadata['end_line'] = sub_chunk.count('\n') + 1
+            metadata['information_density'] = self._compute_information_density(density_metadata)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py`
around lines 930 - 932, The density is being computed against the parent span
because metadata still contains the original start_line/end_line; update the
code so _compute_information_density uses the fragment's actual span (not the
parent) — either by setting metadata['start_line'] and metadata['end_line'] to
the fragment's local start/end before calling self._compute_information_density,
or by changing _compute_information_density to accept and compute from the
fragment's own offsets/length (e.g., pass fragment_start and fragment_end or
fragment_text_length) so each fragment's information_density is normalized by
its own span rather than the parent span.
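The effect is easy to demonstrate with a toy density function. `information_density` below is a hypothetical stand-in for `_compute_information_density`, and the 40-tokens-per-line normalizer is an arbitrary assumption, not the real formula:

```python
def information_density(metadata: dict, token_count: int = 100) -> float:
    # Toy density: tokens relative to the chunk's inclusive line span.
    span = max(metadata.get('end_line', 1) - metadata.get('start_line', 1) + 1, 1)
    return min(token_count / (span * 40), 1.0)

parent_meta = {'start_line': 1, 'end_line': 200}  # span inherited from parent
sub_chunk = "def helper():\n    return 1\n"

# Buggy: the fragment is scored against the 200-line parent span.
buggy = information_density(parent_meta)

# Fixed: score against the fragment's own span, as in the proposed fix.
density_meta = dict(parent_meta)
density_meta['start_line'] = 1
density_meta['end_line'] = sub_chunk.count('\n') + 1
fixed = information_density(density_meta)

print(buggy)  # 0.0125 -- near-zero, penalizing a relevant fragment
print(fixed)  # ~0.83
```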


Returns a float in [0.0, 1.0] representing the density.
"""
line_span = max(metadata.get('end_line', 1) - metadata.get('start_line', 0), 1)

⚠️ Potential issue | 🟠 Major

Fix inclusive line-span math in density calculation.

Line 1124 uses end_line - start_line, but this file treats end_line as inclusive. That inflates density for most chunks and skews ranking.

🧮 Proposed fix
-        line_span = max(metadata.get('end_line', 1) - metadata.get('start_line', 0), 1)
+        line_span = max(
+            metadata.get('end_line', 1) - metadata.get('start_line', 1) + 1,
+            1
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        line_span = max(metadata.get('end_line', 1) - metadata.get('start_line', 0), 1)
+        line_span = max(
+            metadata.get('end_line', 1) - metadata.get('start_line', 1) + 1,
+            1
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/splitter/splitter.py` at
line 1124, The line-span calculation uses exclusive math causing inflated
densities; change the computation of line_span (where metadata.get('end_line')
and metadata.get('start_line') are used) to account for inclusive end_line by
computing max(metadata.get('end_line', 1) - metadata.get('start_line', 0) + 1,
1) and keep using that value in the density calculation and any downstream
ranking logic so chunks reflect correct inclusive line counts.
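The off-by-one can be checked in isolation; the helper names below are illustrative, not the splitter's actual API:

```python
def line_span_exclusive(start_line: int, end_line: int) -> int:
    # Current behavior: treats end_line as exclusive.
    return max(end_line - start_line, 1)

def line_span_inclusive(start_line: int, end_line: int) -> int:
    # Proposed behavior: end_line is inclusive, so lines 10..10 span 1 line.
    return max(end_line - start_line + 1, 1)

# A chunk covering lines 10-14 contains 5 lines.
print(line_span_exclusive(10, 14))  # 4 -- undercounts, inflating density
print(line_span_inclusive(10, 14))  # 5
print(line_span_inclusive(10, 10))  # 1 -- single-line chunk handled correctly
```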

Comment on lines +38 to +39
pr_changed_files: Optional[List[str]] = None,
additional_identifiers: Optional[List[str]] = None

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "pr_changed_files" python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py

Repository: rostilos/CodeCrow

Length of output: 121


🏁 Script executed:

cat -n python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py | sed -n '25,75p'

Repository: rostilos/CodeCrow

Length of output: 2454


🏁 Script executed:

cat -n python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py | sed -n '75,120p'

Repository: rostilos/CodeCrow

Length of output: 2157


🏁 Script executed:

cat -n python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py | sed -n '120,160p'

Repository: rostilos/CodeCrow

Length of output: 2598


🏁 Script executed:

rg -n "get_deterministic_context" python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py --max-count=5

Repository: rostilos/CodeCrow

Length of output: 98


🏁 Script executed:

rg "get_deterministic_context" --type py -A 3 | head -50

Repository: rostilos/CodeCrow

Length of output: 4349


🏁 Script executed:

cat -n python-ecosystem/inference-orchestrator/utils/dependency_graph.py | sed -n '1,50p'

Repository: rostilos/CodeCrow

Length of output: 2082


🏁 Script executed:

rg "rag_response = self.rag_client.get_deterministic_context" -A 10 python-ecosystem/inference-orchestrator/utils/dependency_graph.py

Repository: rostilos/CodeCrow

Length of output: 563


Remove unused pr_changed_files parameter from method signature.

The pr_changed_files parameter (line 38) is defined in the method signature but is never used in the method body and is not passed by any known callers. Remove it to eliminate confusion about its purpose.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 38-38: Unused method argument: pr_changed_files

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
around lines 38 - 39, Remove the unused pr_changed_files parameter from the
method signature (it currently appears alongside additional_identifiers); update
the method declaration to drop pr_changed_files and the Optional[List[str]] type
for it, adjust any defaulting logic if present, and update any callers to stop
passing that argument. Ensure only additional_identifiers remains in the
signature (with its existing type), and run static checks to remove any
now-unused imports or type hints related solely to pr_changed_files.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (1)
python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py (1)

377-377: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary f-string prefix.

The f-string has no placeholders, so the f prefix is extraneous.

Proposed fix
-        logger.warning(f"No RAG chunks included after tiered selection")
+        logger.warning("No RAG chunks included after tiered selection")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`
at line 377, The call to logger.warning uses an unnecessary f-string prefix in
the message "No RAG chunks included after tiered selection"; update the
logger.warning invocation (the logger.warning call in this module) to use a
plain string literal without the f prefix so the message is logged as "No RAG
chunks included after tiered selection".
🧹 Nitpick comments (4)
python-ecosystem/inference-orchestrator/utils/prompts/constants_stage_2.py (1)

93-93: Comprehensive anchoring constraint with correct guidance.

The constraint clearly establishes anchoring requirements and correctly directs the LLM to source codeSnippet from Stage 1 findings or the diff. The warning about invisible issues aligns with the downstream behavior where issues without proper anchoring may be silently skipped (per CrossFileAnalysisResult Optional field handling in stage_2_cross_file.py).

Consider breaking this long constraint into multiple lines or bullet points for improved maintainability:

📝 Optional: Format for readability
-- CRITICAL ANCHORING: For each cross_file_issue, you MUST set "primary_file" to the single most relevant file where the issue should be annotated in the PR diff. You MUST set "line" to a specific line number in that file. You MUST set "codeSnippet" to the EXACT verbatim line of source code from the Stage 1 findings (codeSnippet field) or from the diff that best represents the issue. Issues without a codeSnippet are INVISIBLE to developers.
+- CRITICAL ANCHORING: For each cross_file_issue:
+  • "primary_file" MUST be the single most relevant file where the issue should be annotated in the PR diff
+  • "line" MUST be a specific line number in that file
+  • "codeSnippet" MUST be the EXACT verbatim line of source code from Stage 1 findings (codeSnippet field) or from the diff that best represents the issue
+  • Issues without a codeSnippet are INVISIBLE to developers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/inference-orchestrator/utils/prompts/constants_stage_2.py`
at line 93, The long single-line constraint string beginning with "CRITICAL
ANCHORING:" in constants_stage_2.py should be split into multiple lines or
bullet points for readability while preserving the exact required wording and
meaning; locate the string literal containing "CRITICAL ANCHORING: For each
cross_file_issue..." and refactor it into a small multiline string or a list of
bullet strings (keeping "primary_file", "line", and "codeSnippet" requirements
intact) so the content is easier to maintain without changing semantics.
python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py (1)

364-364: Simplify single-element tuple check.

The tuple ("duplication",) with a single element could be simplified to a direct string comparison for clarity.

Proposed simplification
-        is_duplication = source in ("duplication",)
+        is_duplication = source == "duplication"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`
at line 364, Replace the membership test using a single-element tuple with a
direct string equality check: change the assignment to is_duplication = source
== "duplication" (update where is_duplication is set in context_helpers.py) to
improve clarity and performance.
python-ecosystem/inference-orchestrator/service/review/orchestrator/orchestrator.py (1)

551-556: Edge case: empty affected_files with no primary_file produces "cross-file" as file path.

When cfi.primary_file is empty and cfi.affected_files is also empty, primary_file defaults to "cross-file", which is not a valid file path and may cause downstream issues when annotations are posted.

Consider logging a warning or using a more explicit sentinel:

💡 Proposed defensive logging
         primary_file = (
             cfi.primary_file
             if cfi.primary_file
             else (cfi.affected_files[0] if cfi.affected_files else "cross-file")
         )
+        if primary_file == "cross-file":
+            logger.warning(f"Cross-file issue {cfi.id} has no primary_file or affected_files")
         other_files = [f for f in cfi.affected_files if f != primary_file]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/orchestrator.py`
around lines 551 - 556, The current logic that sets primary_file from
cfi.primary_file or cfi.affected_files falls back to the string "cross-file"
when both are empty (primary_file variable), which is not a valid path and can
break downstream annotation posting; update the code around the primary_file
assignment in orchestrator.py (the cfi usage) to detect the empty-case and
either (a) set primary_file to a clear sentinel like None or "UNKNOWN_FILE" and
ensure downstream callers handle that, or (b) log a warning using the existing
logger and skip posting file-specific annotations/raise a controlled error so
invalid paths are never used; include a log message referencing cfi (e.g.,
including cfi.id or summary) so the missing-file condition is traceable.
python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py (1)

150-166: Consider simplifying bidirectional setdefault for text/content.

Lines 161-162 fill text from content and content from text via setdefault. Because setdefault only populates missing keys, a key already present as an empty string is never overwritten, and if both keys are empty both remain empty. This is defensive but could be clearer:

Optional simplification
-                            merged.setdefault("text", chunk.get("content", ""))
-                            merged.setdefault("content", chunk.get("text", ""))
+                            # Ensure both text and content are populated from whichever is available
+                            chunk_text = chunk.get("text") or chunk.get("content", "")
+                            merged["text"] = chunk_text
+                            merged["content"] = chunk_text
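A minimal sketch of why the bidirectional setdefault can leave stale empty values in place: setdefault only fills keys that are absent, so a key already present with "" is never overwritten. The dict shapes below are illustrative, not the real chunk structure.

```python
def merge_bidirectional(chunk: dict) -> dict:
    # Mirrors the current pattern: each field defaults from its sibling.
    merged = dict(chunk)
    merged.setdefault("text", chunk.get("content", ""))
    merged.setdefault("content", chunk.get("text", ""))
    return merged

def merge_deterministic(chunk: dict) -> dict:
    # Mirrors the proposed simplification: one primary value feeds both keys.
    merged = dict(chunk)
    primary_text = chunk.get("text") or chunk.get("content", "")
    merged["text"] = primary_text
    merged["content"] = primary_text
    return merged

# Key absent: setdefault fills it from the sibling field.
print(merge_bidirectional({"content": "x = 1"}))  # text becomes "x = 1"
# Key present but empty: setdefault leaves the empty string in place...
print(merge_bidirectional({"text": "", "content": "x = 1"}))
# ...while the deterministic version repopulates both fields.
print(merge_deterministic({"text": "", "content": "x = 1"}))
```

The deterministic version does overwrite an existing non-empty "content" with "text", so it trades the defensive behavior for predictability.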
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 150 - 166, The current bidirectional setdefault for "text" and
"content" is confusing; replace the two setdefault calls with a single,
deterministic assignment: derive a primary_text = chunk.get("text") or
chunk.get("content", "") and set merged["text"] = primary_text and
merged["content"] = merged.get("content", primary_text) (or simply
merged["content"] = primary_text if you prefer sameness), preserving other
fields (merged, related_defs, context["relevant_code"]) and ensuring you don't
overwrite existing non-empty values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python-ecosystem/inference-orchestrator/model/multi_stage.py`:
- Around line 66-68: The CrossFileIssue anchoring fields are currently
permissive; change primary_file, line and codeSnippet to required, validated
fields so Pydantic will raise on missing anchors: replace the permissive
definitions for primary_file, line and codeSnippet in multi_stage.py with
non-optional required Fields (e.g., primary_file as a non-empty string
Field(..., min_length=1), line as an int Field(...), and codeSnippet as a
non-empty string Field(..., min_length=1)) so deserialization fails instead of
silently accepting empty/default anchors and allowing downstream fallback to
drop issues.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/loader.py`:
- Around line 12-16: The regex _HASH_ASSET_PATTERN uses a minimum length of 7
for the hash group which excludes documented examples like "9fca3e" (6 chars);
update the quantifier in _HASH_ASSET_PATTERN from {7,} to {6,} so the pattern
r'[._-]([a-zA-Z0-9]{6,})\.(js|css|mjs|cjs)$' will match 6+ char content hashes
(e.g., vendor~lib.9fca3e.mjs) and continue to cover longer hashes.
- Around line 235-239: The exclusion counter semantics are wrong: the block that
skips generated assets (uses _is_generated_asset(file_path.name) and
logger.debug) increments excluded_count, but the final load_from_directory
summary still says exclusions are “by patterns”; fix by stopping the
generated-asset branch from incrementing the pattern counter and either
introduce a separate generated_excluded_count (increment that in the
_is_generated_asset branch) or change the final summary label to a generic
“excluded” count; update the final log in load_from_directory (which currently
references excluded_count) to report both counts or the new unified label so the
message accurately reflects exclusions.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py`:
- Line 161: The parameter has_signature in scoring_config.py is intentionally
kept for API compatibility but triggers Ruff ARG002; to silence the warning
without renaming it, explicitly mark it as consumed in the function/class body
that declares it (e.g., add a deliberate no-op use such as deleting it or
assigning it to a throwaway variable) so the linter recognizes it as used;
reference the has_signature parameter in the appropriate function/method/class
where it's declared (keeping the name unchanged) and add a brief comment like
"intentionally consumed to preserve API" to make the intent clear.
- Around line 133-141: The environment integer parsing for the Field
default_factories (the lambda using RAG_MAX_CHUNKS_PER_SOURCE_FILE and the
oversized_chunk_threshold default from RAG_OVERSIZED_CHUNK_THRESHOLD) must be
hardened: replace the direct int(os.getenv(...)) calls with a small safe parser
(e.g., parse_env_int or parse_positive_int_env) that catches
ValueError/TypeError, returns the provided fallback when parsing fails, and
enforces bounds (ensure max chunks is at least 1 and oversized_chunk_threshold
is a positive integer > 0, or clamp to sensible defaults); update the
default_factory lambdas for the Field that references
RAG_MAX_CHUNKS_PER_SOURCE_FILE and the oversized_chunk_threshold Field to call
this parser so invalid env values don’t crash initialization and downstream
scoring gets valid, bounded integers.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/services/pr_context.py`:
- Line 349: The inline comment "Linear penalty: at threshold → 1.0, at
3×threshold → floor" contains the multiplication sign "×" which triggers Ruff
RUF003; update that comment in pr_context.py (the "Linear penalty..." comment)
to use a plain "x" (e.g., "3xthreshold") so the lint rule is satisfied while
preserving the intended meaning.
- Around line 370-379: The per-file capping uses metadata.get('path', '') which
can be missing and collapse unrelated chunks; update the logic in the capping
block (where max_per_file = scoring_config.max_chunks_per_source_file, iterating
over filtered, building file_counts and capped) to derive a stable file
identifier by checking metadata.get('file_path') first and falling back to
metadata.get('path') (and finally a safe default) before incrementing
file_counts; use that stable identifier for bucketing so unrelated chunks aren’t
merged.

---

Duplicate comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`:
- Line 377: The call to logger.warning uses an unnecessary f-string prefix in
the message "No RAG chunks included after tiered selection"; update the
logger.warning invocation (the logger.warning call in this module) to use a
plain string literal without the f prefix so the message is logged as "No RAG
chunks included after tiered selection".

---

Nitpick comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py`:
- Line 364: Replace the membership test using a single-element tuple with a
direct string equality check: change the assignment to is_duplication = source
== "duplication" (update where is_duplication is set in context_helpers.py) to
improve clarity and performance.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/orchestrator.py`:
- Around line 551-556: The current logic that sets primary_file from
cfi.primary_file or cfi.affected_files falls back to the string "cross-file"
when both are empty (primary_file variable), which is not a valid path and can
break downstream annotation posting; update the code around the primary_file
assignment in orchestrator.py (the cfi usage) to detect the empty-case and
either (a) set primary_file to a clear sentinel like None or "UNKNOWN_FILE" and
ensure downstream callers handle that, or (b) log a warning using the existing
logger and skip posting file-specific annotations/raise a controlled error so
invalid paths are never used; include a log message referencing cfi (e.g.,
including cfi.id or summary) so the missing-file condition is traceable.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`:
- Around line 150-166: The current bidirectional setdefault for "text" and
"content" is confusing; replace the two setdefault calls with a single,
deterministic assignment: derive a primary_text = chunk.get("text") or
chunk.get("content", "") and set merged["text"] = primary_text and
merged["content"] = merged.get("content", primary_text) (or simply
merged["content"] = primary_text if you prefer sameness), preserving other
fields (merged, related_defs, context["relevant_code"]) and ensuring you don't
overwrite existing non-empty values.

In `@python-ecosystem/inference-orchestrator/utils/prompts/constants_stage_2.py`:
- Line 93: The long single-line constraint string beginning with "CRITICAL
ANCHORING:" in constants_stage_2.py should be split into multiple lines or
bullet points for readability while preserving the exact required wording and
meaning; locate the string literal containing "CRITICAL ANCHORING: For each
cross_file_issue..." and refactor it into a small multiline string or a list of
bullet strings (keeping "primary_file", "line", and "codeSnippet" requirements
intact) so the content is easier to maintain without changing semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f3e8452-f8fa-4df7-87a9-af24ad4d00a1

📥 Commits

Reviewing files that changed from the base of the PR and between dd0c72a and cb54057.

📒 Files selected for processing (11)
  • python-ecosystem/inference-orchestrator/model/multi_stage.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/context_helpers.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/orchestrator.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_2_cross_file.py
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_helpers.py
  • python-ecosystem/inference-orchestrator/utils/prompts/constants_stage_2.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/core/loader.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/models/config.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/services/pr_context.py

Comment on lines +66 to +68
primary_file: str = Field(default="", description="The single most relevant file where this issue should be annotated")
line: Optional[int] = Field(default=None, description="Line number in primary_file where the issue is most evident")
codeSnippet: Optional[str] = Field(default=None, description="Verbatim code line from primary_file that anchors this issue")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the actual current state of the CrossFileIssue class in multi_stage.py
cat -n python-ecosystem/inference-orchestrator/model/multi_stage.py | head -100

Repository: rostilos/CodeCrow

Length of output: 3709


🏁 Script executed:

# Search for the class definition and field usages
rg -n 'class CrossFileIssue|primary_file|codeSnippet' --type=py -A 3 python-ecosystem/inference-orchestrator/

Repository: rostilos/CodeCrow

Length of output: 17927


🏁 Script executed:

# Check json_utils.py for deserialization logic
cat -n python-ecosystem/inference-orchestrator/service/review/orchestrator/json_utils.py | head -100

Repository: rostilos/CodeCrow

Length of output: 4278


🏁 Script executed:

# Check orchestrator.py for fallback logic
rg -n 'primary_file|codeSnippet|fallback|line.*1' --type=py -B 2 -A 2 python-ecosystem/inference-orchestrator/service/review/orchestrator/orchestrator.py | head -150

Repository: rostilos/CodeCrow

Length of output: 1930


🏁 Script executed:

# Search for stage-2 prompts and structured output usage
rg -n 'with_structured_output|CrossFileAnalysisResult|primary_file.*required|line.*required' --type=py python-ecosystem/inference-orchestrator/

Repository: rostilos/CodeCrow

Length of output: 2503


🏁 Script executed:

# Search for CrossFileIssue instantiation patterns
rg -n 'CrossFileIssue\(' --type=py -B 2 -A 5 python-ecosystem/inference-orchestrator/

Repository: rostilos/CodeCrow

Length of output: 721


🏁 Script executed:

# Look for test files and fixtures with CrossFileIssue
find python-ecosystem/inference-orchestrator -name '*test*' -o -name '*fixture*' | head -20

Repository: rostilos/CodeCrow

Length of output: 244


🏁 Script executed:

# Check if there are any test files or data examples
fd -e yaml -e json -e py --path-separator / | grep -i test | head -20

Repository: rostilos/CodeCrow

Length of output: 1275


🏁 Script executed:

# Search for any intentional empty/None assignments to these fields
rg -n 'primary_file\s*=\s*("")|line\s*=\s*(None|1)|codeSnippet\s*=\s*("")' --type=py python-ecosystem/inference-orchestrator/

Repository: rostilos/CodeCrow

Length of output: 43


🏁 Script executed:

# Check stage_2 prompt file for detailed instructions
cat -n python-ecosystem/inference-orchestrator/utils/prompts/constants_stage_2.py | head -150

Repository: rostilos/CodeCrow

Length of output: 5822


🏁 Script executed:

# Search for where CrossFileIssue results are actually processed/parsed
rg -n 'cross_file_issues' --type=py -B 2 -A 3 python-ecosystem/inference-orchestrator/ | head -100

Repository: rostilos/CodeCrow

Length of output: 3820


🏁 Script executed:

# Check if there are mock/test data that show the expected structure
rg -n 'primary_file.*line.*codeSnippet' --type=py -B 1 -A 1 python-ecosystem/

Repository: rostilos/CodeCrow

Length of output: 944


🏁 Script executed:

# Look at stage_2_cross_file.py to see the complete flow
cat -n python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_2_cross_file.py

Repository: rostilos/CodeCrow

Length of output: 10977


Enforce strict anchoring fields in CrossFileIssue schema to prevent silent quality degradation.

The schema allows missing primary_file, line, and codeSnippet via permissive defaults (empty string, None, None), which contradicts the prompt requirement (Stage 2, Line 93: "Issues without a codeSnippet are INVISIBLE to developers"). When the LLM structured output path fails and the fallback parse_llm_response is triggered, Pydantic silently fills defaults, allowing issues to be created with no anchor fields. The downstream fallback in orchestrator.py:573-574 then converts these to line=1 and empty snippet, causing the issue to be discarded despite passing deserialization.

Proposed schema hardening
 class CrossFileIssue(BaseModel):
     """Issue spanning multiple files (Stage 2)."""
     id: str
     severity: str
     category: str
     title: str
-    primary_file: str = Field(default="", description="The single most relevant file where this issue should be annotated")
-    line: Optional[int] = Field(default=None, description="Line number in primary_file where the issue is most evident")
-    codeSnippet: Optional[str] = Field(default=None, description="Verbatim code line from primary_file that anchors this issue")
+    primary_file: str = Field(..., min_length=1, description="The single most relevant file where this issue should be annotated")
+    line: int = Field(..., ge=1, description="Line number in primary_file where the issue is most evident")
+    codeSnippet: str = Field(..., min_length=1, description="Verbatim code line from primary_file that anchors this issue")
     affected_files: List[str]

This prevents the fallback deserialization path (json_utils.py:25) from accepting incomplete data and forces errors to surface in the LLM repair loop rather than degrading silently.
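A library-agnostic sketch of the validation the hardened schema would enforce: required, non-empty anchors fail fast at deserialization time instead of silently defaulting. The field names mirror CrossFileIssue, but the validator itself is illustrative, not the actual Pydantic model.

```python
def validate_anchors(payload: dict) -> dict:
    """Reject payloads whose anchoring fields are missing, empty, or invalid."""
    errors = []
    primary_file = payload.get("primary_file")
    if not isinstance(primary_file, str) or len(primary_file) < 1:
        errors.append("primary_file must be a non-empty string")
    line = payload.get("line")
    if not isinstance(line, int) or line < 1:
        errors.append("line must be an integer >= 1")
    snippet = payload.get("codeSnippet")
    if not isinstance(snippet, str) or len(snippet) < 1:
        errors.append("codeSnippet must be a non-empty string")
    if errors:
        # Mirrors Pydantic raising ValidationError on a bad payload.
        raise ValueError("; ".join(errors))
    return payload

valid = {"primary_file": "a.py", "line": 3, "codeSnippet": "x = 1"}
assert validate_anchors(valid) is valid
try:
    validate_anchors({"primary_file": "", "line": None, "codeSnippet": None})
except ValueError as exc:
    print(exc)  # all three anchor checks fail
```

With the permissive defaults, the same bad payload would deserialize cleanly and only be dropped later in the orchestrator fallback.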

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
primary_file: str = Field(default="", description="The single most relevant file where this issue should be annotated")
line: Optional[int] = Field(default=None, description="Line number in primary_file where the issue is most evident")
codeSnippet: Optional[str] = Field(default=None, description="Verbatim code line from primary_file that anchors this issue")
primary_file: str = Field(..., min_length=1, description="The single most relevant file where this issue should be annotated")
line: int = Field(..., ge=1, description="Line number in primary_file where the issue is most evident")
codeSnippet: str = Field(..., min_length=1, description="Verbatim code line from primary_file that anchors this issue")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/inference-orchestrator/model/multi_stage.py` around lines 66
- 68, The CrossFileIssue anchoring fields are currently permissive; change
primary_file, line and codeSnippet to required, validated fields so Pydantic
will raise on missing anchors: replace the permissive definitions for
primary_file, line and codeSnippet in multi_stage.py with non-optional required
Fields (e.g., primary_file as a non-empty string Field(..., min_length=1), line
as an int Field(...), and codeSnippet as a non-empty string Field(...,
min_length=1)) so deserialization fails instead of silently accepting
empty/default anchors and allowing downstream fallback to drop issues.

Comment on lines +12 to +16
# Detects build-tool-generated assets with content hashes in their filenames.
# Examples: index-D25HpPdh.js, main.a1b2c3d4.css, vendor~lib.9fca3e.mjs
_HASH_ASSET_PATTERN = re.compile(
r'[._-]([a-zA-Z0-9]{7,})\.(js|css|mjs|cjs)$'
)

⚠️ Potential issue | 🟠 Major

Regex length threshold misses documented hashed assets

On Line 15, {7,} conflicts with your own example (vendor~lib.9fca3e.mjs, 6 chars). Those assets will not be filtered and can still get indexed.

Suggested fix
-_HASH_ASSET_PATTERN = re.compile(
-    r'[._-]([a-zA-Z0-9]{7,})\.(js|css|mjs|cjs)$'
-)
+_HASH_ASSET_PATTERN = re.compile(
+    r'[._-]([a-zA-Z0-9]{6,})\.(js|css|mjs|cjs)$'
+)
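The off-by-one is easy to confirm directly: the documented 6-char hash "9fca3e" only matches once the quantifier is relaxed from {7,} to {6,}, while longer hashes match under both.

```python
import re

strict = re.compile(r'[._-]([a-zA-Z0-9]{7,})\.(js|css|mjs|cjs)$')
relaxed = re.compile(r'[._-]([a-zA-Z0-9]{6,})\.(js|css|mjs|cjs)$')

# 6-char content hash from the docstring example: missed by the strict pattern.
print(bool(strict.search("vendor~lib.9fca3e.mjs")))   # False
print(bool(relaxed.search("vendor~lib.9fca3e.mjs")))  # True
# Longer hashes keep matching after the change.
print(bool(relaxed.search("index-D25HpPdh.js")))      # True
print(bool(relaxed.search("main.a1b2c3d4.css")))      # True
```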
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Detects build-tool-generated assets with content hashes in their filenames.
# Examples: index-D25HpPdh.js, main.a1b2c3d4.css, vendor~lib.9fca3e.mjs
_HASH_ASSET_PATTERN = re.compile(
r'[._-]([a-zA-Z0-9]{7,})\.(js|css|mjs|cjs)$'
)
# Detects build-tool-generated assets with content hashes in their filenames.
# Examples: index-D25HpPdh.js, main.a1b2c3d4.css, vendor~lib.9fca3e.mjs
_HASH_ASSET_PATTERN = re.compile(
r'[._-]([a-zA-Z0-9]{6,})\.(js|css|mjs|cjs)$'
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/loader.py` around lines
12 - 16, The regex _HASH_ASSET_PATTERN uses a minimum length of 7 for the hash
group which excludes documented examples like "9fca3e" (6 chars); update the
quantifier in _HASH_ASSET_PATTERN from {7,} to {6,} so the pattern
r'[._-]([a-zA-Z0-9]{6,})\.(js|css|mjs|cjs)$' will match 6+ char content hashes
(e.g., vendor~lib.9fca3e.mjs) and continue to cover longer hashes.

Comment on lines +235 to +239
if _is_generated_asset(file_path.name):
logger.debug(f"Generated asset, skipping: {relative_path}")
excluded_count += 1
continue


⚠️ Potential issue | 🟡 Minor

Exclusion counter semantics changed, but log label still says “by patterns”

This block increments excluded_count for generated assets too, so the final message in load_from_directory is now misleading when it says exclusions are only “by patterns” (Line 280).

Suggested fix
-logger.info(f"Loaded {len(documents)} documents from {repo_path} (excluded {excluded_count} files by patterns)")
+logger.info(f"Loaded {len(documents)} documents from {repo_path} (excluded {excluded_count} files by filters)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/core/loader.py` around lines
235 - 239, The exclusion counter semantics are wrong: the block that skips
generated assets (uses _is_generated_asset(file_path.name) and logger.debug)
increments excluded_count, but the final load_from_directory summary still says
exclusions are “by patterns”; fix by stopping the generated-asset branch from
incrementing the pattern counter and either introduce a separate
generated_excluded_count (increment that in the _is_generated_asset branch) or
change the final summary label to a generic “excluded” count; update the final
log in load_from_directory (which currently references excluded_count) to report
both counts or the new unified label so the message accurately reflects
exclusions.

Comment on lines +133 to +141
default_factory=lambda: int(os.getenv("RAG_MAX_CHUNKS_PER_SOURCE_FILE", "2")),
description="Max chunks kept from a single source file in results"
)

# Oversized chunk penalty — penalize very large chunks that waste token budget
oversized_chunk_threshold: int = Field(
default_factory=lambda: int(os.getenv("RAG_OVERSIZED_CHUNK_THRESHOLD", "4000")),
description="Chunk text length above which size penalty starts"
)

⚠️ Potential issue | 🟠 Major

Harden integer env parsing and bounds validation.

Direct int(os.getenv(...)) can crash initialization on invalid values, and non-positive oversized_chunk_threshold can break downstream scoring math.

🛠️ Proposed fix
-import os
-from typing import List
+import os
+from typing import List, Optional
@@
 def _parse_float_env(env_var: str, default: float) -> float:
@@
         logger.warning(f"Invalid float value for {env_var}: {value}, using default {default}")
         return default
+
+
+def _parse_int_env(env_var: str, default: int, min_value: Optional[int] = None) -> int:
+    """Parse int from environment variable with optional lower-bound validation."""
+    value = os.getenv(env_var)
+    if not value:
+        return default
+    try:
+        parsed = int(value)
+    except ValueError:
+        logger.warning(f"Invalid int value for {env_var}: {value}, using default {default}")
+        return default
+    if min_value is not None and parsed < min_value:
+        logger.warning(f"Out-of-range int value for {env_var}: {parsed}, using default {default}")
+        return default
+    return parsed
@@
     max_chunks_per_source_file: int = Field(
-        default_factory=lambda: int(os.getenv("RAG_MAX_CHUNKS_PER_SOURCE_FILE", "2")),
+        default_factory=lambda: _parse_int_env("RAG_MAX_CHUNKS_PER_SOURCE_FILE", 2, min_value=0),
         description="Max chunks kept from a single source file in results"
     )
@@
     oversized_chunk_threshold: int = Field(
-        default_factory=lambda: int(os.getenv("RAG_OVERSIZED_CHUNK_THRESHOLD", "4000")),
+        default_factory=lambda: _parse_int_env("RAG_OVERSIZED_CHUNK_THRESHOLD", 4000, min_value=1),
         description="Chunk text length above which size penalty starts"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py`
around lines 133 - 141, The environment integer parsing for the Field
default_factories (the lambda using RAG_MAX_CHUNKS_PER_SOURCE_FILE and the
oversized_chunk_threshold default from RAG_OVERSIZED_CHUNK_THRESHOLD) must be
hardened: replace the direct int(os.getenv(...)) calls with a small safe parser
(e.g., parse_env_int or parse_positive_int_env) that catches
ValueError/TypeError, returns the provided fallback when parsing fails, and
enforces bounds (ensure max chunks is at least 1 and oversized_chunk_threshold
is a positive integer > 0, or clamp to sensible defaults); update the
default_factory lambdas for the Field that references
RAG_MAX_CHUNKS_PER_SOURCE_FILE and the oversized_chunk_threshold Field to call
this parser so invalid env values don’t crash initialization and downstream
scoring gets valid, bounded integers.

        has_semantic_names: bool = False,
        has_docstring: bool = False,
-       has_signature: bool = False
+       has_signature: bool = False,

⚠️ Potential issue | 🟡 Minor

Silence intentional compatibility arg warning (ARG002).

has_signature is intentionally retained for API compatibility, but currently unused and flagged by Ruff. Mark it as intentionally consumed.

🛠️ Proposed fix
     def calculate_boosted_score(
@@
         """
         Calculate final score for a result.
@@
         """
+        _ = has_signature  # kept for backward-compatible signature
         # Factor 1: content type quality
         content_boost = self.content_type_boost.get(content_type)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 161-161: Unused method argument: has_signature

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/models/scoring_config.py` at
line 161, The parameter has_signature in scoring_config.py is intentionally kept
for API compatibility but triggers Ruff ARG002; to silence the warning without
renaming it, explicitly mark it as consumed in the function/class body that
declares it (e.g., add a deliberate no-op use such as deleting it or assigning
it to a throwaway variable) so the linter recognizes it as used; reference the
has_signature parameter in the appropriate function/method/class where it's
declared (keeping the name unchanged) and add a brief comment like
"intentionally consumed to preserve API" to make the intent clear.

text_len = len(result.get('text', ''))
threshold = scoring_config.oversized_chunk_threshold
if text_len > threshold:
# Linear penalty: at threshold → 1.0, at 3×threshold → floor

⚠️ Potential issue | 🟡 Minor

Replace ambiguous × in comment to avoid Ruff RUF003.

Use plain x in the comment text so lint remains clean.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 349-349: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python-ecosystem/rag-pipeline/src/rag_pipeline/services/pr_context.py` at
line 349, The inline comment "Linear penalty: at threshold → 1.0, at 3×threshold
→ floor" contains the multiplication sign "×" which triggers Ruff RUF003; update
that comment in pr_context.py (the "Linear penalty..." comment) to use a plain
"x" (e.g., "3xthreshold") so the lint rule is satisfied while preserving the
intended meaning.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py (1)

721-749: Minor inconsistency in path matching logic.

The path matching on line 729 (meta.path.endswith(bp)) is unidirectional, whereas lines 227-229 use bidirectional matching (meta.path.endswith(bp) or bp.endswith(meta.path)). This could cause edge cases where enrichment metadata is matched for duplication queries but not for identifier extraction, or vice versa.

Consider aligning the matching logic for consistency:

Suggested alignment
         for meta in request.enrichmentData.fileMetadata:
-            if meta.path in batch_file_paths or any(meta.path.endswith(bp) for bp in batch_file_paths):
+            if meta.path in batch_file_paths or any(
+                meta.path.endswith(bp) or bp.endswith(meta.path) for bp in batch_file_paths
+            ):
                 _ids.update(meta.extendsClasses or [])
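A small sketch of why the unidirectional check misses one direction: enrichment metadata may store a longer repo-rooted path than the batch, or vice versa. The paths below are hypothetical.

```python
def matches_unidirectional(meta_path: str, batch_path: str) -> bool:
    # Current form at line 729: only checks one suffix direction.
    return meta_path.endswith(batch_path)

def matches_bidirectional(meta_path: str, batch_path: str) -> bool:
    # Form used at lines 227-229: either path may be a suffix of the other.
    return meta_path.endswith(batch_path) or batch_path.endswith(meta_path)

# Metadata path longer than batch path: both forms match.
print(matches_unidirectional("repo/src/app/models.py", "app/models.py"))  # True
# Batch path longer than metadata path: only the bidirectional form matches.
print(matches_unidirectional("app/models.py", "repo/src/app/models.py"))  # False
print(matches_bidirectional("app/models.py", "repo/src/app/models.py"))   # True
```

Note that bare endswith can still match across path-segment boundaries (e.g. "ymodels.py" vs "models.py"); a segment-aware comparison would be stricter, but aligning the two call sites is the minimal consistency fix.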
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 721 - 749, The path-matching in the enrichment extraction block is
unidirectional (uses meta.path.endswith(bp)) which differs from the
bidirectional check used elsewhere; update the conditional inside the loop that
builds _ids (the if checking meta.path against batch_file_paths) to the same
bidirectional form (meta.path.endswith(bp) or bp.endswith(meta.path)) so
enrichment_identifiers is populated consistently with the rest of the code that
references request.enrichmentData.fileMetadata and batch_file_paths.
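In isolation, the bidirectional check recommended above behaves like this (the helper name and paths are hypothetical, for illustration only):

```python
def paths_match(meta_path: str, batch_path: str) -> bool:
    """Bidirectional suffix match: either path may be the more qualified one."""
    return meta_path.endswith(batch_path) or batch_path.endswith(meta_path)

# A relative batch path matches a fully qualified metadata path, and vice versa.
assert paths_match("repo/src/service/Foo.java", "service/Foo.java")
assert paths_match("service/Foo.java", "repo/src/service/Foo.java")
assert not paths_match("service/Foo.java", "service/Bar.java")
```

Note that a plain string `endswith` can still over-match on partial file names (`"src/OtherFoo.java"` ends with `"Foo.java"`), so a component-wise path comparison is stricter if that matters here.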
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`:
- Around line 422-440: The for-loop over enrichment_metadata currently uses the
key variable fp but never references it; update the loop to indicate the key is
intentionally unused by either changing "for fp, meta in
enrichment_metadata.items()" to "for _, meta in enrichment_metadata.items()" or
iterate only over values with "for meta in enrichment_metadata.values()",
leaving the rest of the logic (checks on meta and calls to _add) unchanged; this
should be applied where enrichment_metadata, fp, meta, and _add appear in
stage_1_file_review.py.

---

Nitpick comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`:
- Around line 721-749: The path-matching in the enrichment extraction block is
unidirectional (uses meta.path.endswith(bp)) which differs from the
bidirectional check used elsewhere; update the conditional inside the loop that
builds _ids (the if checking meta.path against batch_file_paths) to the same
bidirectional form (meta.path.endswith(bp) or bp.endswith(meta.path)) so
enrichment_identifiers is populated consistently with the rest of the code that
references request.enrichmentData.fileMetadata and batch_file_paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb515947-34fb-424b-b279-b4036b9f423a

📥 Commits

Reviewing files that changed from the base of the PR and between cb54057 and 3b95d0b.

📒 Files selected for processing (1)
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py

Comment on lines +422 to +440
    if enrichment_metadata:
        for fp, meta in enrichment_metadata.items():
            # Class/interface definitions → find other implementations
            for name in (meta.get("semantic_names") or meta.get("semanticNames") or [])[:10]:
                if name and len(name) > 3:
                    _add(f"existing implementation of {name}")

            # Parent types → find sibling implementations
            for parent in (meta.get("extends") or meta.get("extendsClasses") or []):
                if parent and len(parent) > 3:
                    _add(f"class extending {parent} implementation")
            for iface in (meta.get("implements") or meta.get("implementsInterfaces") or []):
                if iface and len(iface) > 3:
                    _add(f"class implementing {iface}")

            # Called functions → find other callers (may reveal shared logic)
            for call in (meta.get("calls") or [])[:8]:
                if call and len(call) > 4:
                    _add(f"usage of {call} implementation")

⚠️ Potential issue | 🟡 Minor

Unused loop variable fp.

The loop iterates over enrichment_metadata.items() but only uses the value (meta), not the key (fp). Use _ to indicate an intentionally unused variable.

Proposed fix
-    if enrichment_metadata:
-        for fp, meta in enrichment_metadata.items():
+    if enrichment_metadata:
+        for _, meta in enrichment_metadata.items():

Alternatively, iterate over just the values:

-    if enrichment_metadata:
-        for fp, meta in enrichment_metadata.items():
+    if enrichment_metadata:
+        for meta in enrichment_metadata.values():
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 423-423: Loop control variable fp not used within loop body

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 422 - 440, The for-loop over enrichment_metadata currently uses the
key variable fp but never references it; update the loop to indicate the key is
intentionally unused by either changing "for fp, meta in
enrichment_metadata.items()" to "for _, meta in enrichment_metadata.items()" or
iterate only over values with "for meta in enrichment_metadata.values()",
leaving the rest of the logic (checks on meta and calls to _add) unchanged; this
should be applied where enrichment_metadata, fp, meta, and _add appear in
stage_1_file_review.py.

@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow analyze

@codecrow-ai codecrow-ai bot deleted a comment from codecrow-local bot Mar 5, 2026
@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow analyze

1 similar comment
@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow analyze

@codecrow-local codecrow-local bot deleted a comment from codecrow-ai bot Mar 5, 2026
@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow analyze

@codecrow-local codecrow-local bot deleted a comment from codecrow-ai bot Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py (4)

169-169: Rename unused loop variable to _def_name.

The loop variable def_name is not used within the loop body. Per Python convention, prefix with underscore to indicate intentional non-use.

♻️ Proposed fix
-        for def_name, def_chunks in related_definitions.items():
+        for _def_name, def_chunks in related_definitions.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
at line 169, Rename the unused loop variable def_name to _def_name in the loop
inside the relevant function in deterministic_context.py (the line with "for
def_name, def_chunks in related_definitions.items()") so it follows Python
convention for intentionally unused variables; keep def_chunks unchanged and
ensure no other references to def_name remain elsewhere in the function.

178-180: Remove ineffective set subtraction.

transitive_parents -= changed_file_paths subtracts file paths (e.g., "src/foo/Bar.java") from parent type names (e.g., "BaseClass"). These are disjoint sets in practice, so this line has no effect and adds confusion.

♻️ Proposed fix
         # Remove names already looked up to avoid redundant queries
         transitive_parents -= all_to_find
-        transitive_parents -= changed_file_paths  # Skip changed file paths
         transitive_parents = {p for p in transitive_parents if p and len(p) > 1}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
around lines 178 - 180, The subtraction transitive_parents -= changed_file_paths
is ineffective because changed_file_paths contains file path strings while
transitive_parents contains parent type names, so remove that line; instead
ensure transitive_parents only excludes all_to_find and empty/one-character
values by keeping the existing transitive_parents -= all_to_find and the
filtering line transitive_parents = {p for p in transitive_parents if p and
len(p) > 1}; delete any reference to changed_file_paths in this block (e.g., the
line using changed_file_paths) to avoid confusion.
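The no-op can be demonstrated directly: subtracting a disjoint set leaves the operand unchanged (the values below are made up):

```python
# Type names on one side, file paths on the other — disjoint in practice.
transitive_parents = {"BaseClass", "AbstractRepository"}
changed_file_paths = {"src/foo/Bar.java", "src/foo/Baz.java"}

before = set(transitive_parents)
transitive_parents -= changed_file_paths  # removes nothing; no element overlaps
assert transitive_parents == before
```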

145-153: Simplify the enrichment delta calculation.

The pre_count/post_count approach computes the union of three sets, but only identifiers_to_find is modified. This makes the delta calculation unnecessarily complex and potentially confusing.

♻️ Proposed simplification
         if additional_identifiers:
-            pre_count = len(identifiers_to_find | imports_raw | extends_raw)
+            added_count = 0
             for name in additional_identifiers:
                 name = name.strip()
                 if name and len(name) > 1:
-                    identifiers_to_find.add(name)
-            post_count = len(identifiers_to_find | imports_raw | extends_raw)
-            logger.info(f"Enrichment injection: {post_count - pre_count} new identifiers "
+                    if name not in identifiers_to_find:
+                        identifiers_to_find.add(name)
+                        added_count += 1
+            logger.info(f"Enrichment injection: {added_count} new identifiers "
                        f"from {len(additional_identifiers)} additional_identifiers")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
around lines 145 - 153, The delta calculation is overcomplicated because only
identifiers_to_find changes; replace the union-based pre_count/post_count with
counts of identifiers_to_find alone: record pre_count = len(identifiers_to_find)
before adding cleaned items from additional_identifiers (keep the same
strip/len>1 filter and add to identifiers_to_find), then compute post_count =
len(identifiers_to_find) and log post_count - pre_count in the logger.info
message (reference symbols: identifiers_to_find, additional_identifiers,
imports_raw, extends_raw, logger.info).
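Because only identifiers_to_find is mutated, counting that set alone yields the same delta as the three-way union; a minimal sketch with made-up values:

```python
identifiers_to_find = {"Foo", "Bar"}
additional_identifiers = ["Baz", " ", "Foo", "Q"]

pre_count = len(identifiers_to_find)
for name in additional_identifiers:
    name = name.strip()
    if name and len(name) > 1:  # same filter as the original code
        identifiers_to_find.add(name)
post_count = len(identifiers_to_find)

# Only "Baz" is new: "Foo" is a duplicate, " " and "Q" fail the filter.
assert post_count - pre_count == 1
```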

428-429: Consider making the batch limit configurable.

The batch limit of 50 for transitive parents is hardcoded in lines 429 and 436. Other query methods follow the established pattern of using self.config for similar limits: max_identifiers_per_query, max_parent_classes_per_query, and max_namespaces_per_query are all defined in the config model with environment variable support.

♻️ Proposed change

Add max_transitive_parents_per_query to the config model and use it here:

-            batch = list(transitive_parents)[:50]
+            batch = list(transitive_parents)[:self.config.max_transitive_parents_per_query]
             results, _ = self.qdrant_client.scroll(
                 collection_name=collection_name,
                 scroll_filter=Filter(must=[
                     branch_filter,
                     FieldCondition(key="primary_name", match=MatchAny(any=batch))
                 ]),
-                limit=50 * len(branches),
+                limit=self.config.max_transitive_parents_per_query * len(branches),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`
around lines 428 - 429, Replace the hardcoded 50-limit on transitive parent
batching with a configurable value: add a new config field (e.g.,
max_transitive_parents_per_query) to the existing config model (with env var
support consistent with max_identifiers_per_query, max_parent_classes_per_query,
max_namespaces_per_query) and then use that value when slicing
transitive_parents in deterministic_context.py (replace
list(transitive_parents)[:50] and the similar slice at the other occurrence with
list(transitive_parents)[: self.config.max_transitive_parents_per_query]);
ensure the config default mirrors current behavior if unset.
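One way to wire the proposed field with environment-variable support — a dataclass sketch, since the thread does not show the actual config model (the field and variable names are assumptions mirroring the existing limits):

```python
import os
from dataclasses import dataclass, field

def _env_int(name: str, default: int) -> int:
    """Read an integer limit from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

@dataclass
class RetrievalConfig:
    # Hypothetical field, styled after max_identifiers_per_query et al.
    max_transitive_parents_per_query: int = field(
        default_factory=lambda: _env_int("MAX_TRANSITIVE_PARENTS_PER_QUERY", 50)
    )

config = RetrievalConfig()
batch = list(range(100))[: config.max_transitive_parents_per_query]
assert len(batch) == 50  # default mirrors the previously hardcoded limit
```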
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`:
- Around line 845-846: The path comparison is asymmetric because it only checks
meta.path.endswith(bp); to fix, normalize both sides before matching: compute
normalized/absolute forms for meta.path and each entry in batch_file_paths
(e.g., via pathlib.Path(...).resolve() or
os.path.normpath(os.path.abspath(...))) and then compare exact equality or Path
equivalence instead of endswith; update the block that populates _ids (the
branch using meta.path in batch_file_paths or meta.path.endswith(bp) and the
similar logic around lines handling enrichment at 905-906) to use the normalized
comparisons so enrichment lookups include matches regardless of relative vs
absolute paths.
- Around line 640-664: When scoping chunks by diff tokens you currently only set
is_relevant from token hits and later cap/move non-matches to file_level, which
discards deterministic transitive-parent chunks (marked by _match_priority==2 or
_match_type=="transitive_parent"). Fix: when evaluating each chunk in def_chunks
(use variables meta, diff_tokens, is_relevant) treat a chunk as relevant if it
matches tokens OR chunk.get("_match_priority")==2 or
chunk.get("_match_type")=="transitive_parent"; ensure you set is_relevant=True
in that case so those chunks go into diff_relevant rather than file_level, and
do not overwrite or demote the chunk's existing _match_priority/_match_type
later (the code around where _match_priority/_match_type is reset at the earlier
location referenced). Also ensure the per-definition capping (capped =
def_chunks[:max_per_def]) and subsequent annotated placement
(annotated["_diff_relevant"], diff_relevant, file_level, max_file_level)
preserves these transitive_parent items before slicing.

---

Nitpick comments:
In
`@python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py`:
- Line 169: Rename the unused loop variable def_name to _def_name in the loop
inside the relevant function in deterministic_context.py (the line with "for
def_name, def_chunks in related_definitions.items()") so it follows Python
convention for intentionally unused variables; keep def_chunks unchanged and
ensure no other references to def_name remain elsewhere in the function.
- Around line 178-180: The subtraction transitive_parents -= changed_file_paths
is ineffective because changed_file_paths contains file path strings while
transitive_parents contains parent type names, so remove that line; instead
ensure transitive_parents only excludes all_to_find and empty/one-character
values by keeping the existing transitive_parents -= all_to_find and the
filtering line transitive_parents = {p for p in transitive_parents if p and
len(p) > 1}; delete any reference to changed_file_paths in this block (e.g., the
line using changed_file_paths) to avoid confusion.
- Around line 145-153: The delta calculation is overcomplicated because only
identifiers_to_find changes; replace the union-based pre_count/post_count with
counts of identifiers_to_find alone: record pre_count = len(identifiers_to_find)
before adding cleaned items from additional_identifiers (keep the same
strip/len>1 filter and add to identifiers_to_find), then compute post_count =
len(identifiers_to_find) and log post_count - pre_count in the logger.info
message (reference symbols: identifiers_to_find, additional_identifiers,
imports_raw, extends_raw, logger.info).
- Around line 428-429: Replace the hardcoded 50-limit on transitive parent
batching with a configurable value: add a new config field (e.g.,
max_transitive_parents_per_query) to the existing config model (with env var
support consistent with max_identifiers_per_query, max_parent_classes_per_query,
max_namespaces_per_query) and then use that value when slicing
transitive_parents in deterministic_context.py (replace
list(transitive_parents)[:50] and the similar slice at the other occurrence with
list(transitive_parents)[: self.config.max_transitive_parents_per_query]);
ensure the config default mirrors current behavior if unset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5fc52eab-2faf-4306-8683-8ecf70608469

📥 Commits

Reviewing files that changed from the base of the PR and between 3b95d0b and 85fa0c1.

📒 Files selected for processing (2)
  • python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py
  • python-ecosystem/rag-pipeline/src/rag_pipeline/services/deterministic_context.py

Comment on lines +640 to +664
        # Check semantic_names / primary_name from chunk metadata
        for chunk in def_chunks:
            meta = chunk.get("metadata", {})
            names = set()
            if meta.get("primary_name"):
                names.add(meta["primary_name"])
            for sn in (meta.get("semantic_names") or []):
                names.add(sn)
            if names & diff_tokens:
                is_relevant = True
                break

        # Cap chunks per definition
        capped = def_chunks[:max_per_def]

        for chunk in capped:
            annotated = dict(chunk)
            annotated["_def_name"] = def_name
            annotated["_diff_relevant"] = is_relevant
            if is_relevant:
                diff_relevant.append(annotated)
            else:
                file_level.append(annotated)

    result = diff_relevant + file_level[:max_file_level]

⚠️ Potential issue | 🟠 Major

Preserve transitive-parent deterministic chunks during diff scoping.

Line 640–664 classifies relevance only by token hits, then non-matches are file-level-capped; Line 241 also overwrites match type for non-diff items. This can discard/demote transitive_parent chunks (_match_priority: 2) produced by deterministic retrieval.

Proposed fix
@@ def _scope_deterministic_to_diff(...):
-        if not diff_tokens:
+        has_structural_priority = any(
+            chunk.get("_match_type") == "transitive_parent"
+            or chunk.get("_match_priority") == 2
+            for chunk in def_chunks
+        )
+
+        if has_structural_priority:
+            is_relevant = True
+        elif not diff_tokens:
             # No diff text to filter against (e.g., binary) — keep everything
             is_relevant = True
@@
-                        if not is_diff_rel:
+                        if not is_diff_rel and not merged.get("_match_type"):
                             merged["_match_type"] = "file_level_dep"

Also applies to: 239-242

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 640 - 664, When scoping chunks by diff tokens you currently only
set is_relevant from token hits and later cap/move non-matches to file_level,
which discards deterministic transitive-parent chunks (marked by
_match_priority==2 or _match_type=="transitive_parent"). Fix: when evaluating
each chunk in def_chunks (use variables meta, diff_tokens, is_relevant) treat a
chunk as relevant if it matches tokens OR chunk.get("_match_priority")==2 or
chunk.get("_match_type")=="transitive_parent"; ensure you set is_relevant=True
in that case so those chunks go into diff_relevant rather than file_level, and
do not overwrite or demote the chunk's existing _match_priority/_match_type
later (the code around where _match_priority/_match_type is reset at the earlier
location referenced). Also ensure the per-definition capping (capped =
def_chunks[:max_per_def]) and subsequent annotated placement
(annotated["_diff_relevant"], diff_relevant, file_level, max_file_level)
preserves these transitive_parent items before slicing.
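The relevance override described above can be isolated into a small predicate (a sketch; the field names follow the chunk annotations mentioned in the review, everything else is illustrative):

```python
def is_chunk_relevant(chunk: dict, diff_tokens: set) -> bool:
    """Token-hit relevance, with an override so deterministic
    transitive-parent chunks are never demoted to file level."""
    if chunk.get("_match_priority") == 2 or chunk.get("_match_type") == "transitive_parent":
        return True
    meta = chunk.get("metadata", {})
    names = {meta.get("primary_name"), *(meta.get("semantic_names") or [])}
    return bool((names - {None}) & diff_tokens)

tokens = {"PaymentService"}
# Structural-priority chunks survive even with no token overlap.
assert is_chunk_relevant({"_match_type": "transitive_parent", "metadata": {}}, tokens)
assert is_chunk_relevant({"metadata": {"primary_name": "PaymentService"}}, tokens)
assert not is_chunk_relevant({"metadata": {"primary_name": "Unrelated"}}, tokens)
```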

Comment on lines +845 to +846
            if meta.path in batch_file_paths or any(meta.path.endswith(bp) for bp in batch_file_paths):
                _ids.update(meta.extendsClasses or [])

⚠️ Potential issue | 🟠 Major

Fix asymmetric path matching for enrichment lookups.

Line 845 and Line 905 only check meta.path.endswith(bp). If one side is absolute and the other is relative, enrichment data is skipped, which drops identifiers/outlines for the batch.

Proposed fix
-            if meta.path in batch_file_paths or any(meta.path.endswith(bp) for bp in batch_file_paths):
+            if meta.path in batch_file_paths or any(
+                meta.path.endswith(bp) or bp.endswith(meta.path)
+                for bp in batch_file_paths
+            ):
                 _ids.update(meta.extendsClasses or [])
                 _ids.update(meta.implementsInterfaces or [])
                 _ids.update(meta.calls or [])
                 if meta.parentClass:
                     _ids.add(meta.parentClass)

@@
-            if meta.path in batch_file_paths or any(meta.path.endswith(bp) for bp in batch_file_paths):
+            if meta.path in batch_file_paths or any(
+                meta.path.endswith(bp) or bp.endswith(meta.path)
+                for bp in batch_file_paths
+            ):
                 outline = f"File: {meta.path}\n"

Also applies to: 905-906

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@python-ecosystem/inference-orchestrator/service/review/orchestrator/stage_1_file_review.py`
around lines 845 - 846, The path comparison is asymmetric because it only checks
meta.path.endswith(bp); to fix, normalize both sides before matching: compute
normalized/absolute forms for meta.path and each entry in batch_file_paths
(e.g., via pathlib.Path(...).resolve() or
os.path.normpath(os.path.abspath(...))) and then compare exact equality or Path
equivalence instead of endswith; update the block that populates _ids (the
branch using meta.path in batch_file_paths or meta.path.endswith(bp) and the
similar logic around lines handling enrichment at 905-906) to use the normalized
comparisons so enrichment lookups include matches regardless of relative vs
absolute paths.
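Rather than resolving against the working directory (repo paths are usually repo-relative, so `Path.resolve()` could anchor them to the wrong root), a component-wise normalization removes the asymmetry and also avoids string-suffix false positives; a hedged sketch with hypothetical helper names:

```python
from pathlib import PurePosixPath

def norm(p: str) -> tuple:
    """Normalize a repo path to its components, dropping '.' and a leading '/'."""
    return tuple(part for part in PurePosixPath(p).parts if part not in (".", "/"))

def paths_equivalent(a: str, b: str) -> bool:
    """True when one normalized path is a component-wise suffix of the other."""
    pa, pb = norm(a), norm(b)
    shorter, longer = sorted((pa, pb), key=len)
    return len(shorter) > 0 and longer[-len(shorter):] == shorter

assert paths_equivalent("service/Foo.java", "./repo/service/Foo.java")
assert paths_equivalent("/repo/service/Foo.java", "service/Foo.java")
# Component-wise matching rejects the partial-filename trap that endswith allows.
assert not paths_equivalent("service/Foo.java", "service/OtherFoo.java")
```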

@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow ask Column length for tracking_confidence Enum - are you sure regarding this issue? since we're not having enums for 30+ chars length

@codecrow-ai codecrow-ai bot deleted a comment from codecrow-local bot Mar 5, 2026
@codecrow-local codecrow-local bot deleted a comment from codecrow-ai bot Mar 5, 2026
@codecrow-local
Copy link

codecrow-local bot commented Mar 5, 2026

@rostilos

/codecrow ask Column length for tracking_confidence Enum - are you sure regarding this issue? since ...

💬 CodeCrow Answer

@rostilos
Copy link
Owner Author

rostilos commented Mar 5, 2026

/codecrow ask Column length for tracking_confidence Enum - are you sure regarding this issue? since ...

@codecrow-local codecrow-local bot deleted a comment from codecrow-ai bot Mar 5, 2026
@codecrow-local
Copy link

codecrow-local bot commented Mar 5, 2026

@rostilos

/codecrow ask Column length for tracking_confidence Enum - are you sure regarding this issue? since ...

💬 CodeCrow Answer

The column length for tracking_confidence in the code_analysis_issue table was indeed an issue, and it has been addressed in PR #160.

Previously, the column was defined as VARCHAR(10), which caused "value too long" errors when the system attempted to save the new tracking confidence level UNANCHORED_FP_MATCH (19 characters).

Changes in PR #160:

  1. Database Migration: A new Flyway migration file V2.2.0__widen_tracking_confidence_columns.sql was added to widen the column to VARCHAR(30) and update the associated CHECK constraints for both code_analysis_issue and branch_issue tables.
    ALTER TABLE code_analysis_issue
        ALTER COLUMN tracking_confidence TYPE VARCHAR(30);
    
    ALTER TABLE code_analysis_issue
        ADD CONSTRAINT chk_code_analysis_issue_tracking_confidence
        CHECK (tracking_confidence IS NULL
            OR tracking_confidence IN ('EXACT', 'SHIFTED', 'EDITED', 'WEAK', 'NONE', 'UNANCHORED_FP_MATCH'));
  2. Entity Update: The CodeAnalysisIssue Java entity was updated to use the TrackingConfidence Enum with @Enumerated(EnumType.STRING) and a length of 30 to match the database schema.
  3. New Enum Value: The TrackingConfidence enum was expanded to include UNANCHORED_FP_MATCH, which is used when an issue has no reliable line anchor (e.g., architectural issues) but matches a previous issue via its fingerprint.

These changes ensure that the system can safely persist all tracking confidence levels without encountering database truncation errors.
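As a sanity check, a hypothetical Python mirror of the enum (the actual entity is Java, per the answer above) confirms the arithmetic:

```python
from enum import Enum

class TrackingConfidence(Enum):
    EXACT = "EXACT"
    SHIFTED = "SHIFTED"
    EDITED = "EDITED"
    WEAK = "WEAK"
    NONE = "NONE"
    UNANCHORED_FP_MATCH = "UNANCHORED_FP_MATCH"

COLUMN_LENGTH = 30  # VARCHAR(30) after the V2.2.0 migration

longest = max(TrackingConfidence, key=lambda m: len(m.value))
assert longest is TrackingConfidence.UNANCHORED_FP_MATCH
assert len(longest.value) == 19  # would have overflowed the old VARCHAR(10)
assert all(len(m.value) <= COLUMN_LENGTH for m in TrackingConfidence)
```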

…Map for thread safety and improve source line handling in symbol extraction
@codecrow-local
Copy link

codecrow-local bot commented Mar 5, 2026

🔄 CodeCrow is analyzing this PR...

This may take a few minutes depending on the size of the changes.
This comment will be updated with the analysis results when complete.

@rostilos rostilos merged commit 4853f35 into main Mar 5, 2026
1 check passed