cpf/enhancement: Add pattern registry with hardcoded code injection example #329

shivasurya · 2025-10-29T01:16:00Z

Implements pattern matching infrastructure for security analysis with one example pattern (code injection via eval). Additional patterns will be loaded from queries in future PRs. Includes pattern types (source-sink, missing-sanitizer, dangerous-function) and matching algorithms with 92.4% test coverage.
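As a rough illustration of what a registry holding one hardcoded dangerous-function pattern might look like, here is a minimal sketch. The three pattern type names come from the description above; the `Pattern` fields, the `Registry` type, and `MatchDangerous` are hypothetical names chosen for this example, not the API in `patterns.go`.

```go
package main

import (
	"fmt"
	"sort"
)

// PatternType mirrors the three categories named in the PR description.
type PatternType string

const (
	SourceSink        PatternType = "source-sink"
	MissingSanitizer  PatternType = "missing-sanitizer"
	DangerousFunction PatternType = "dangerous-function"
)

// Pattern is a hypothetical rule shape; the field names are assumptions.
type Pattern struct {
	ID                 string
	Type               PatternType
	DangerousFunctions []string
}

// Registry is a hypothetical in-memory store keyed by pattern ID.
type Registry struct {
	patterns map[string]Pattern
}

func NewRegistry() *Registry {
	return &Registry{patterns: make(map[string]Pattern)}
}

// Add registers a pattern, replacing any existing pattern with the same ID.
func (r *Registry) Add(p Pattern) {
	r.patterns[p.ID] = p
}

// MatchDangerous returns the sorted IDs of dangerous-function patterns
// whose function list contains the given callee name.
func (r *Registry) MatchDangerous(callee string) []string {
	var hits []string
	for id, p := range r.patterns {
		if p.Type != DangerousFunction {
			continue
		}
		for _, fn := range p.DangerousFunctions {
			if fn == callee {
				hits = append(hits, id)
				break
			}
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	reg := NewRegistry()
	// One hardcoded example pattern: code injection via eval.
	reg.Add(Pattern{
		ID:                 "code-injection-eval",
		Type:               DangerousFunction,
		DangerousFunctions: []string{"eval"},
	})
	fmt.Println(reg.MatchDangerous("eval"))
	fmt.Println(reg.MatchDangerous("print"))
}
```

A real matcher would presumably work on resolved call sites from the call graph rather than bare callee strings; the string comparison here only keeps the sketch self-contained.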

Checklist:

Tests passing (gradle testGo)?
Lint passing (`golangci-lint run`; this requires golangci-lint)?

Add foundational data structures for Python call graph construction:

New Types:
- CallSite: Represents function call locations with arguments and resolution status
- CallGraph: Maps functions to callees with forward/reverse edges
- ModuleRegistry: Maps Python file paths to module paths
- ImportMap: Tracks imports per file for name resolution
- Location: Source code position tracking
- Argument: Function call argument metadata

Features:
- 100% test coverage with comprehensive unit tests
- Bidirectional call graph edges (forward and reverse)
- Support for ambiguous short names in module registry
- Helper functions for module path manipulation

This establishes the foundation for 3-pass call graph algorithm:
- Pass 1 (next PR): Module registry builder
- Pass 2 (next PR): Import extraction and resolution
- Pass 3 (next PR): Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
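The bidirectional edges mentioned in this commit can be sketched as two maps kept in sync by a single `AddEdge` call. The field and method names below are assumptions for illustration and may differ from the actual `CallGraph` type.

```go
package main

import "fmt"

// CallGraph here is a simplified stand-in: it stores only edges,
// keyed by fully qualified function names.
type CallGraph struct {
	Edges        map[string][]string // caller -> callees
	ReverseEdges map[string][]string // callee -> callers
}

func NewCallGraph() *CallGraph {
	return &CallGraph{
		Edges:        make(map[string][]string),
		ReverseEdges: make(map[string][]string),
	}
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// AddEdge records caller -> callee in both directions and ignores duplicates.
func (cg *CallGraph) AddEdge(caller, callee string) {
	if contains(cg.Edges[caller], callee) {
		return
	}
	cg.Edges[caller] = append(cg.Edges[caller], callee)
	cg.ReverseEdges[callee] = append(cg.ReverseEdges[callee], caller)
}

func main() {
	cg := NewCallGraph()
	cg.AddEdge("myapp.views.index", "myapp.utils.sanitize")
	cg.AddEdge("myapp.api.create", "myapp.utils.sanitize")
	fmt.Println(cg.Edges["myapp.views.index"])
	fmt.Println(cg.ReverseEdges["myapp.utils.sanitize"])
}
```

Keeping the reverse map up to date at insertion time makes "who calls this function?" a direct lookup, which is the usual reason for storing both directions.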

Implement the first pass of the call graph construction algorithm: building a complete registry of Python modules by walking the directory tree.

New Features:
- BuildModuleRegistry: Walks directory tree and maps file paths to module paths
- convertToModulePath: Converts file system paths to Python import paths
- shouldSkipDirectory: Filters out venv, __pycache__, build dirs, etc.

Module Path Conversion:
- Handles regular files: myapp/views.py → myapp.views
- Handles packages: myapp/utils/__init__.py → myapp.utils
- Supports deep nesting: myapp/api/v1/endpoints/users.py → myapp.api.v1.endpoints.users
- Cross-platform: Normalizes Windows/Unix path separators

Performance Optimizations:
- Skips 15+ common non-source directories (venv, __pycache__, .git, dist, build, etc.)
- Avoids scanning thousands of dependency files
- Indexes both full module paths and short names for ambiguity detection

Test Coverage: 93%
- Comprehensive unit tests for all conversion scenarios
- Integration tests with real Python project structure
- Edge case handling: empty dirs, non-Python files, deep nesting, permissions
- Error path testing: walk errors, invalid paths, system errors
- Test fixtures: test-src/python/simple_project/ with realistic structure
- Documented: Remaining 7% are untestable OS-level errors (filepath.Abs failures)

This establishes Pass 1 of 3:
- ✅ Pass 1: Module registry (this PR)
- Next: Pass 2 - Import extraction and resolution
- Next: Pass 3 - Call graph construction

Related: Phase 1 - Call Graph Construction & 3-Pass Algorithm

Base Branch: shiva/callgraph-infra-1 (PR #1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
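The path-to-module conversion rules listed in this commit can be approximated in a few lines. This is a sketch under assumptions: `toModulePath` and its `(root, file)` signature are hypothetical, and the real `convertToModulePath` may handle more edge cases.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// toModulePath converts a Python file path under root into a dotted
// module path, e.g. myapp/utils/__init__.py -> myapp.utils.
func toModulePath(root, file string) (string, error) {
	rel, err := filepath.Rel(root, file)
	if err != nil {
		return "", err
	}
	// Normalize separators so Windows and Unix paths convert the same way.
	rel = filepath.ToSlash(rel)
	rel = strings.TrimSuffix(rel, ".py")
	// A package's __init__.py maps to the package itself.
	rel = strings.TrimSuffix(rel, "/__init__")
	return strings.ReplaceAll(rel, "/", "."), nil
}

func main() {
	files := []string{
		"proj/myapp/views.py",
		"proj/myapp/utils/__init__.py",
		"proj/myapp/api/v1/endpoints/users.py",
	}
	for _, f := range files {
		mod, err := toModulePath("proj", f)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", f, mod)
	}
}
```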

This PR implements comprehensive import extraction for Python code using tree-sitter AST parsing. It handles all three main import styles:

1. Simple imports: `import module`
2. From imports: `from module import name`
3. Aliased imports: `import module as alias` and `from module import name as alias`

The implementation uses direct AST traversal instead of tree-sitter queries for better compatibility and control. It properly handles:
- Multiple imports per line (`from json import dumps, loads`)
- Nested module paths (`import xml.etree.ElementTree`)
- Whitespace variations
- Invalid/malformed syntax (fault-tolerant parsing)

Key functions:
- ExtractImports(): Main entry point that parses code and builds ImportMap
- traverseForImports(): Recursively traverses AST to find import statements
- processImportStatement(): Handles simple and aliased imports
- processImportFromStatement(): Handles from-import statements with proper module name skipping to avoid duplicate entries

Test coverage: 92.8% overall, 90-95% for import extraction functions

Test fixtures include:
- simple_imports.py: Basic import statements
- from_imports.py: From import statements with multiple names
- aliased_imports.py: Aliased imports (both simple and from)
- mixed_imports.py: Mixed import styles

All tests passing, linting clean, builds successfully.

This is Pass 2 Part A of the 3-pass call graph algorithm.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This PR implements comprehensive relative import resolution for Python using a 3-pass algorithm. It extends the import extraction system from PR #3 to handle Python's relative import syntax with dot notation.

Key Changes:

1. **Added FileToModule reverse mapping to ModuleRegistry**
   - Enables O(1) lookup from file path to module path
   - Required for resolving relative imports
   - Updated AddModule() to maintain bidirectional mapping

2. **Implemented resolveRelativeImport() function**
   - Handles single dot (.) for current package
   - Handles multiple dots (.., ...) for parent/grandparent packages
   - Navigates package hierarchy using module path components
   - Clamps excessive dots to root package level
   - Falls back gracefully when file not in registry

3. **Enhanced processImportFromStatement() for relative imports**
   - Detects relative_import nodes in tree-sitter AST
   - Extracts import_prefix (dots) and optional module suffix
   - Resolves relative paths to absolute module paths before adding to ImportMap

4. **Comprehensive test coverage (94.5% overall)**
   - Unit tests for resolveRelativeImport with various dot counts
   - Integration tests with ExtractImports
   - Tests for deeply nested packages
   - Tests for mixed absolute and relative imports
   - Real fixture files with project structure

Relative Import Examples:
- `from . import utils` → "currentpackage.utils"
- `from .. import config` → "parentpackage.config"
- `from ..utils import helper` → "parentpackage.utils.helper"
- `from ...db import query` → "grandparent.db.query"

Test Fixtures:
- Created myapp/submodule/handler.py with all relative import styles
- Created supporting package structure with __init__.py files
- Tests verify correct resolution across package hierarchy

All tests passing, linting clean, builds successfully.

This is Pass 2 Part B of the 3-pass call graph algorithm.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
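The dot-counting logic described above can be sketched as follows. Several things here are assumptions made for the example: the function name `resolveRelative`, the choice to pass the importing file's module path directly (instead of looking it up in the registry), the assumption that this path names a regular module and not a package `__init__.py`, and the reading of "clamps to root package level" as "never climb above the top-level package".

```go
package main

import (
	"fmt"
	"strings"
)

// resolveRelative turns a relative import (dots plus optional module suffix)
// into an absolute module path, given the importing module's own path.
func resolveRelative(currentModule string, dots int, suffix string) string {
	parts := strings.Split(currentModule, ".")
	// Drop the module's own name to get its containing package.
	pkg := parts[:len(parts)-1]

	// One dot means "current package"; each extra dot climbs one level.
	up := dots - 1
	if up > len(pkg)-1 {
		up = len(pkg) - 1 // clamp: stay at the top-level package (assumption)
	}
	if up < 0 {
		up = 0
	}
	base := pkg[:len(pkg)-up]

	if suffix == "" {
		return strings.Join(base, ".")
	}
	if len(base) == 0 {
		return suffix
	}
	return strings.Join(base, ".") + "." + suffix
}

func main() {
	cur := "myapp.submodule.handler"
	fmt.Println(resolveRelative(cur, 1, "utils")) // from .utils import ...
	fmt.Println(resolveRelative(cur, 2, "config")) // from ..config import ...
	fmt.Println(resolveRelative(cur, 5, "db"))     // excessive dots are clamped
}
```

For `from . import utils`, the imported name would then be appended to the resolved package, which is how the examples above arrive at "currentpackage.utils".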

This PR implements call site extraction from Python source code using tree-sitter AST parsing. It builds on the import resolution work from PRs #3 and #4 to prepare for call graph construction in Pass 3.

## Changes

### Core Implementation (callsites.go)

1. **ExtractCallSites()**: Main entry point for extracting call sites
   - Parses Python source with tree-sitter
   - Traverses AST to find all call expressions
   - Returns slice of CallSite objects with location information

2. **traverseForCalls()**: Recursive AST traversal
   - Tracks function context while traversing
   - Updates context when entering function definitions
   - Finds and processes call expressions

3. **processCallExpression()**: Call site processing
   - Extracts callee name (function/method being called)
   - Parses arguments (positional and keyword)
   - Creates CallSite with source location
   - Parameters for importMap and caller reserved for Pass 3

4. **extractCalleeName()**: Callee name extraction
   - Handles simple identifiers: foo()
   - Handles attributes: obj.method(), obj.attr.method()
   - Recursively builds dotted names

5. **extractArguments()**: Argument parsing
   - Extracts all positional arguments
   - Preserves keyword arguments as "name=value" in Value field
   - Tracks argument position and variable status

6. **convertArgumentsToSlice()**: Helper for struct conversion
   - Converts []*Argument to []Argument for CallSite struct

### Comprehensive Tests (callsites_test.go)

Created 17 test functions covering:
- Simple function calls: foo(), bar()
- Method calls: obj.method(), self.helper()
- Arguments: positional, keyword, mixed
- Nested calls: foo(bar(x))
- Multiple functions in one file
- Class methods
- Chained calls: obj.method1().method2()
- Module-level calls (no function context)
- Source location tracking
- Empty files
- Complex arguments: expressions, lists, dicts, lambdas
- Nested method calls: obj.attr.method()
- Real file fixture integration

### Test Fixture (simple_calls.py)

Created realistic test file with:
- Function definitions with various call patterns
- Method calls on objects
- Calls with arguments (positional and keyword)
- Nested calls
- Class methods with self references

## Test Coverage

- Overall: 93.3%
- ExtractCallSites: 90.0%
- traverseForCalls: 93.3%
- processCallExpression: 83.3%
- extractCalleeName: 91.7%
- extractArguments: 87.5%
- convertArgumentsToSlice: 100.0%

## Design Decisions

1. **Keyword argument handling**: Store as "name=value" in Value field
   - Tree-sitter provides full keyword_argument node content
   - Preserves complete argument information for later analysis
   - Separating name/value would require additional parsing

2. **Caller context tracking**: Parameter reserved but not used yet
   - Will be populated in Pass 3 during call graph construction
   - Enables linking call sites to their containing functions

3. **Import map parameter**: Reserved for Pass 3 resolution
   - Will be used to resolve qualified names to FQNs
   - Enables cross-file call graph construction

4. **Location tracking**: Store exact position for each call site
   - File, line, column information
   - Enables precise error reporting and code navigation

## Testing Strategy

- Unit tests for each extraction function
- Integration tests with tree-sitter AST
- Real file fixture for end-to-end validation
- Edge cases: empty files, no context, nested structures

## Next Steps (PR #6)

Pass 3 will use this call site data to:
1. Build the complete call graph structure
2. Resolve call targets to function definitions
3. Link caller and callee through edges
4. Handle disambiguation for overloaded names

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
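To illustrate the "recursively builds dotted names" step without pulling in tree-sitter, the sketch below uses a tiny hypothetical `Node` type in place of a real AST node. The node kinds and field names are assumptions for this example only; the actual `extractCalleeName` works on tree-sitter nodes.

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for an AST node.
// Kind is "identifier" or "attribute"; for an attribute, Object is the
// receiver expression and Name is the attribute being accessed.
type Node struct {
	Kind   string
	Name   string
	Object *Node
}

// calleeName builds a dotted name such as "obj.attr.method" by recursing
// through attribute receivers. Unknown node kinds yield an empty string.
func calleeName(n *Node) string {
	if n == nil {
		return ""
	}
	switch n.Kind {
	case "identifier":
		return n.Name
	case "attribute":
		receiver := calleeName(n.Object)
		if receiver == "" {
			return n.Name
		}
		return receiver + "." + n.Name
	default:
		return ""
	}
}

func main() {
	// Represents the callee of obj.attr.method().
	callee := &Node{
		Kind: "attribute",
		Name: "method",
		Object: &Node{
			Kind:   "attribute",
			Name:   "attr",
			Object: &Node{Kind: "identifier", Name: "obj"},
		},
	}
	fmt.Println(calleeName(callee))
	fmt.Println(calleeName(&Node{Kind: "identifier", Name: "foo"}))
}
```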

This PR completes the 3-pass algorithm for building Python call graphs by implementing the final pass that resolves call targets and constructs the complete graph structure with edges linking callers to callees.

## Changes

### Core Implementation (builder.go)

1. **BuildCallGraph()**: Main entry point for Pass 3
   - Indexes all function definitions from code graph
   - Iterates through all Python files in the registry
   - Extracts imports and call sites for each file
   - Resolves each call site to its target function
   - Builds edges and stores call site details
   - Returns complete CallGraph with all relationships

2. **indexFunctions()**: Function indexing
   - Scans code graph for all function/method definitions
   - Maps each function to its FQN using module registry
   - Populates CallGraph.Functions map for quick lookup

3. **getFunctionsInFile()**: File-scoped function retrieval
   - Filters code graph nodes by file path
   - Returns only function/method definitions in that file
   - Used for finding containing functions of call sites

4. **findContainingFunction()**: Call site parent resolution
   - Determines which function contains a given call site
   - Uses line number comparison with nearest-match algorithm
   - Finds function with highest line number ≤ call line
   - Returns empty string for module-level calls

5. **resolveCallTarget()**: Core resolution logic
   - Handles simple names: sanitize() → myapp.utils.sanitize
   - Handles qualified names: utils.sanitize() → myapp.utils.sanitize
   - Resolves through import maps first
   - Falls back to same-module resolution
   - Validates FQNs against module registry
   - Returns (FQN, resolved bool) tuple

6. **validateFQN()**: FQN validation
   - Checks if a fully qualified name exists in registry
   - Handles both modules and functions within modules
   - Validates parent module for function FQNs

7. **readFileBytes()**: File reading helper
   - Reads source files for parsing
   - Handles absolute path conversion

### Comprehensive Tests (builder_test.go)

Created 15 test functions covering:

**Resolution Tests:**
- Simple imported function resolution
- Qualified import resolution (module.function)
- Same-module function resolution
- Unresolved method calls (obj.method)
- Non-existent function handling

**Validation Tests:**
- Module existence validation
- Function-in-module validation
- Non-existent module handling

**Helper Function Tests:**
- Function indexing from code graph
- Functions-in-file filtering
- Containing function detection with edge cases

**Integration Tests:**
- Simple single-file call graph
- Multi-file call graph with imports
- Real test fixture integration

## Test Coverage

- Overall: 91.8%
- BuildCallGraph: 80.8%
- indexFunctions: 87.5%
- getFunctionsInFile: 100.0%
- findContainingFunction: 100.0%
- resolveCallTarget: 85.0%
- validateFQN: 100.0%
- readFileBytes: 75.0%

## Algorithm Overview

Pass 3 ties together all previous work:

### Pass 1 (PR #2): BuildModuleRegistry
- Maps file paths to module paths
- Enables FQN generation

### Pass 2 (PRs #3-5): Import & Call Site Extraction
- ExtractImports: Maps local names to FQNs
- ExtractCallSites: Finds all function calls in AST

### Pass 3 (This PR): Call Graph Construction
- Resolves call targets using import maps
- Links callers to callees with edges
- Validates resolutions against registry
- Stores detailed call site information

## Resolution Strategy

The resolver uses a multi-step approach:

1. **Simple names** (no dots):
   - Check import map first
   - Fall back to same-module lookup
   - Return unresolved if neither works

2. **Qualified names** (with dots):
   - Split into base + rest
   - Resolve base through imports
   - Append rest to get full FQN
   - Try current module if not imported

3. **Validation**:
   - Check if target exists in registry
   - For functions, validate parent module exists
   - Mark resolution success/failure

## Design Decisions

1. **Containing function detection**:
   - Uses nearest-match algorithm based on line numbers
   - Finds function with highest line number ≤ call line
   - Handles module-level calls by returning empty FQN

2. **Resolution priority**:
   - Import map takes precedence over same-module
   - Explicit imports always respected even if unresolved
   - Same-module only tried when not in imports

3. **Validation vs Resolution**:
   - Resolution finds FQN from imports/context
   - Validation checks if FQN exists in registry
   - Both pieces of information stored in CallSite

4. **Error handling**:
   - Continues processing even if some files fail
   - Marks individual call sites as unresolved
   - Returns partial graph instead of failing completely

## Next Steps

The call graph infrastructure is now complete. Future PRs will:
- PR #7: Add CFG data structures for control flow analysis
- PR #8: Implement pattern matching for security rules
- PR #9: Integrate into main initialization pipeline
- PR #10: Add comprehensive documentation and examples
- PR #11: Performance optimizations (caching, pooling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
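The resolution strategy above (imports first, then same-module, then validation) can be sketched with plain maps. This is a simplification under stated assumptions: the import map is a `map[string]string` from local name to FQN, and validation is a lookup in a hypothetical `known` set of FQNs, whereas the PR validates against the module registry. The signature is illustrative and is not the one in `builder.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveCallTarget returns the candidate FQN for a callee and whether that
// FQN is present in the known set.
func resolveCallTarget(callee, currentModule string, imports map[string]string, known map[string]bool) (string, bool) {
	// Split "utils.sanitize" into base "utils" and rest "sanitize".
	// For a simple name, base is the whole callee and qualified is false.
	base, rest, qualified := strings.Cut(callee, ".")

	// Imports take precedence, even if the result fails validation.
	if fqn, ok := imports[base]; ok {
		if qualified {
			fqn += "." + rest
		}
		return fqn, known[fqn]
	}

	// Otherwise assume the name lives in the current module.
	fqn := currentModule + "." + callee
	return fqn, known[fqn]
}

func main() {
	imports := map[string]string{
		"sanitize": "myapp.utils.sanitize", // from myapp.utils import sanitize
		"utils":    "myapp.utils",          // from myapp import utils
	}
	known := map[string]bool{
		"myapp.utils.sanitize": true,
		"myapp.views.helper":   true,
	}
	for _, callee := range []string{"sanitize", "utils.sanitize", "helper", "obj.method"} {
		fqn, ok := resolveCallTarget(callee, "myapp.views", imports, known)
		fmt.Printf("%-15s -> %-25s resolved=%v\n", callee, fqn, ok)
	}
}
```

A call such as `obj.method()` falls through to the same-module branch and comes back unresolved, which matches the "Unresolved method calls (obj.method)" case in the test list.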

This PR implements Control Flow Graph (CFG) data structures to enable intra-procedural analysis of execution paths through functions. CFGs are essential for security analysis patterns like taint tracking and detecting missing sanitization on all paths.

## Changes

### Core Implementation (cfg.go)

1. **BlockType**: Enumeration of basic block types
   - Entry: Function entry point
   - Exit: Function exit point
   - Normal: Sequential execution block
   - Conditional: Branch blocks (if/else)
   - Loop: Loop header blocks (while/for)
   - Switch: Switch/match statement blocks
   - Try/Catch/Finally: Exception handling blocks

2. **BasicBlock**: Represents a single basic block
   - ID: Unique identifier within CFG
   - Type: Block category for analysis
   - StartLine/EndLine: Source code location
   - Instructions: CallSites occurring in this block
   - Successors: Blocks that can execute next
   - Predecessors: Blocks that can execute before
   - Condition: Condition expression (for conditional blocks)
   - Dominators: Blocks that always execute before this one

3. **ControlFlowGraph**: Complete CFG for a function
   - FunctionFQN: Fully qualified function name
   - Blocks: Map of block ID to BasicBlock
   - EntryBlockID/ExitBlockID: Special block identifiers
   - CallGraph: Reference for inter-procedural analysis

4. **CFG Operations**:
   - NewControlFlowGraph(): Creates CFG with entry/exit blocks
   - AddBlock(): Adds basic block to CFG
   - AddEdge(): Connects blocks with control flow edges
   - GetBlock(): Retrieves block by ID
   - GetSuccessors(): Returns successor blocks
   - GetPredecessors(): Returns predecessor blocks

5. **Dominator Analysis**:
   - ComputeDominators(): Calculates dominator sets using iterative data flow
   - IsDominator(): Checks if one block dominates another
   - Used to verify sanitization always occurs before usage

6. **Path Analysis**:
   - GetAllPaths(): Enumerates all execution paths from entry to exit
   - dfsAllPaths(): DFS-based path enumeration
   - Used for exhaustive security analysis

7. **Helper Functions**:
   - intersect(): Set intersection for dominator computation
   - slicesEqual(): Compare string slices for fixed-point detection

### Comprehensive Tests (cfg_test.go)

Created 23 test functions covering:

**Construction Tests:**
- CFG creation with entry/exit blocks
- Basic block creation with all fields
- Block addition to CFG

**Edge Management Tests:**
- Adding edges between blocks
- Duplicate edge handling
- Non-existent block edge handling

**Graph Navigation Tests:**
- Block retrieval by ID
- Successor block retrieval
- Predecessor block retrieval

**Dominator Analysis Tests:**
- Linear CFG dominators (A→B→C)
- Branching CFG dominators (if/else merge)
- Dominator checking

**Path Analysis Tests:**
- All paths in linear CFG
- All paths in branching CFG

**Helper Function Tests:**
- Set intersection operations
- Slice equality checking

**Complex Integration Test:**
- Realistic function CFG with branches
- Multiple blocks and paths
- Dominator relationships verification

## Test Coverage

- Overall: 92.7%
- NewControlFlowGraph: 100.0%
- AddBlock: 100.0%
- AddEdge: 100.0%
- GetBlock: 100.0%
- GetSuccessors: 87.5%
- GetPredecessors: 87.5%
- ComputeDominators: 100.0%
- IsDominator: 75.0%
- GetAllPaths: 100.0%
- dfsAllPaths: 91.7%
- intersect: 100.0%
- slicesEqual: 100.0%

## Design Decisions

1. **Entry/Exit blocks always created**:
   - Simplifies analysis by providing single entry/exit points
   - Standard CFG construction practice

2. **Dominator computation uses iterative algorithm**:
   - Simple fixed-point iteration
   - Converges quickly for most real-world CFGs
   - More efficient than other dominator algorithms for small graphs

3. **Path enumeration with cycle detection**:
   - Avoids infinite loops in cyclic CFGs
   - Uses visited tracking during DFS
   - WARNING: Can be exponential for complex CFGs

4. **Blocks store CallSites as instructions**:
   - Links CFG to call graph for inter-procedural analysis
   - Enables tracking tainted data through function calls

5. **Condition stored as string**:
   - Simple representation for conditional blocks
   - Could be enhanced with AST expression nodes later

## Use Cases

CFGs enable several security analysis patterns:

**Taint Analysis:**
- Track data flow through execution paths
- Detect if tainted data reaches sensitive sinks

**Sanitization Verification:**
- Use dominators to check if sanitization always occurs
- Detect missing sanitization on some paths

**Dead Code Detection:**
- Find unreachable blocks
- Identify code that never executes

**Inter-Procedural Analysis:**
- Combine CFG with call graph
- Track data flow across function boundaries

## Example CFG

```python
def process_user(user_id):
    user = get_user(user_id)      # Block 1 (entry)
    if user.is_admin():           # Block 2 (conditional)
        grant_access()            # Block 3 (true branch)
    else:
        deny_access()             # Block 4 (false branch)
    log_action(user)              # Block 5 (merge point)
    return                        # Block 6 (exit)
```

CFG Structure:

```
Entry → Block1 → Block2 → Block3 → Block5 → Exit
                        ↘ Block4 ↗
```

Dominators:
- Block1 dominates all blocks (always executes)
- Block2 dominates Block3, Block4, Block5
- Block3 does NOT dominate Block5 (false branch skips it)
- Block4 does NOT dominate Block5 (true branch skips it)

## Next Steps

Future PRs will:
- PR #8: Implement pattern registry for security rules
- Use CFG to detect missing sanitization patterns
- Implement taint tracking across CFG paths
- Combine CFG with call graph for full analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
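The dominator relationships in the example CFG above can be reproduced with a small fixed-point iteration. The sketch below is an independent illustration of the standard iterative data-flow formulation, using maps as sets; the function name, its parameters, and the set representation are assumptions and do not reflect how `ComputeDominators()` in `cfg.go` is written.

```go
package main

import (
	"fmt"
	"sort"
)

// sameSet reports whether two sets contain exactly the same keys.
func sameSet(a, b map[string]bool) bool {
	if len(a) != len(b) {
		return false
	}
	for k := range a {
		if !b[k] {
			return false
		}
	}
	return true
}

// computeDominators returns, for each node, the set of nodes that dominate it.
// dom(entry) = {entry}; dom(n) = {n} ∪ ⋂ dom(p) over all predecessors p of n.
func computeDominators(entry string, nodes []string, preds map[string][]string) map[string]map[string]bool {
	dom := make(map[string]map[string]bool, len(nodes))
	for _, n := range nodes {
		dom[n] = make(map[string]bool)
		if n == entry {
			dom[n][n] = true
			continue
		}
		// Start every other node at "dominated by everything" and shrink.
		for _, m := range nodes {
			dom[n][m] = true
		}
	}
	for changed := true; changed; {
		changed = false
		for _, n := range nodes {
			if n == entry {
				continue
			}
			next := map[string]bool{n: true}
			for i, p := range preds[n] {
				if i == 0 {
					for k := range dom[p] {
						next[k] = true
					}
					continue
				}
				// Intersect with the next predecessor's dominators.
				for k := range next {
					if k != n && !dom[p][k] {
						delete(next, k)
					}
				}
			}
			if !sameSet(next, dom[n]) {
				dom[n] = next
				changed = true
			}
		}
	}
	return dom
}

func main() {
	nodes := []string{"Entry", "Block1", "Block2", "Block3", "Block4", "Block5", "Exit"}
	preds := map[string][]string{
		"Block1": {"Entry"},
		"Block2": {"Block1"},
		"Block3": {"Block2"},
		"Block4": {"Block2"},
		"Block5": {"Block3", "Block4"},
		"Exit":   {"Block5"},
	}
	dom := computeDominators("Entry", nodes, preds)
	var names []string
	for k := range dom["Block5"] {
		names = append(names, k)
	}
	sort.Strings(names)
	fmt.Println("dominators of Block5:", names)
}
```

Block3 and Block4 drop out of Block5's dominator set during the intersection step, which is the property a missing-sanitizer check would rely on: a sanitizer call that sits only in one branch does not dominate the merge point.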


safedep · 2025-10-29T01:16:03Z

SafeDep Report Summary

No dependency changes detected. Nothing to scan.

This report is generated by SafeDep Github App

codecov · 2025-10-29T01:17:03Z

Codecov Report

❌ Patch coverage is 87.96296% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.01%. Comparing base (a09ce3a) to head (d349f28).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sourcecode-parser/graph/callgraph/patterns.go | 87.96% | 8 Missing and 5 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   75.58%   76.01%   +0.42%     
==========================================
  Files          29       30       +1     
  Lines        3006     3114     +108     
==========================================
+ Hits         2272     2367      +95     
- Misses        667      675       +8     
- Partials       67       72       +5


shivasurya and others added 8 commits October 25, 2025 22:47

shivasurya self-assigned this Oct 29, 2025

shivasurya added the enhancement (New feature or request) and go (Pull requests that update go code) labels Oct 29, 2025

Base automatically changed from shiva/callgraph-infra-7 to main October 29, 2025 02:43

Merge branch 'main' into shiva/callgraph-infra-8

d349f28

shivasurya merged commit b312533 into main Oct 29, 2025
5 checks passed

shivasurya deleted the shiva/callgraph-infra-8 branch October 29, 2025 02:47
