Create code_graph.py extraction boundary module #7

@verkligheten

Parent Epic

Part of #5 — Integrate Graphify for zero-cost code entity extraction

Task

Create agent_notes/services/code_graph.py — a boundary module that encapsulates all Graphify interaction. No Graphify types leak into the rest of the codebase; every function works with plain Python dicts and Path objects.

Location

agent_notes/services/code_graph.py (new file; follows the existing pattern of wiki_backend.py, memory_backend.py, credentials.py)

Functions

1. graphify_available() -> bool

def graphify_available() -> bool:
    """Return True if the graphifyy package is importable."""
    try:
        import graphify.extract  # noqa: F401
        return True
    except ImportError:
        return False
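
A caller could gate the extraction step on this check. The sketch below is illustrative only: `maybe_extract` and its `extractor` parameter are hypothetical names, not part of the planned module.

```python
from pathlib import Path
from typing import Callable, Optional


def graphify_available() -> bool:
    """Return True if the graphify import package can be imported."""
    try:
        import graphify.extract  # noqa: F401
        return True
    except ImportError:
        return False


def maybe_extract(folder: Path, extractor: Callable[[Path], dict]) -> Optional[dict]:
    """Hypothetical caller: run `extractor` only when Graphify is importable."""
    if not graphify_available():
        return None  # caller continues the ingest without code entities
    return extractor(folder)
```

Returning None keeps the "Graphify missing" case distinct from "Graphify ran and found nothing", which would produce an empty graph dict instead.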

2. extract_code_graph(folder_path, *, extensions=None, skip_dirs=None) -> dict

Core extraction function. Runs tree-sitter parsing via Graphify's Python API.

Parameters:

  • folder_path: Path — directory to scan
  • extensions: set[str] | None — allowed code extensions (default: _CODE_EXTENSIONS)
  • skip_dirs: set[str] | None — directories to skip (reuse wiki_backend._SKIP_DIRS)

Returns:

{
    "nodes": [
        {"id": "auth_userservice", "label": "UserService", "source_file": "auth.py",
         "source_location": "L42", "type": "class"}
    ],
    "edges": [
        {"source": "auth_userservice", "target": "payments_gateway",
         "relation": "calls", "confidence": "EXTRACTED"}
    ],
    "communities": {0: ["auth_userservice", "auth_login"], 1: ["payments_gateway"]},
    "cohesion": {0: 0.85, 1: 0.72},
    "god_nodes": [{"label": "UserService", "degree": 12}],
    "stats": {"files_parsed": 5, "nodes": 23, "edges": 41, "communities": 3}
}
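
One way to pin down this shape without leaking Graphify types is a set of TypedDict definitions. This is a sketch based only on the example above; the class names are hypothetical, and the real god_nodes entries may carry more fields than label and degree.

```python
from typing import TypedDict


class GraphNode(TypedDict):
    id: str
    label: str
    source_file: str
    source_location: str
    type: str


class GraphEdge(TypedDict):
    source: str
    target: str
    relation: str
    confidence: str


class GodNode(TypedDict):
    label: str
    degree: int


class GraphStats(TypedDict):
    files_parsed: int
    nodes: int
    edges: int
    communities: int


class CodeGraph(TypedDict):
    nodes: list[GraphNode]
    edges: list[GraphEdge]
    communities: dict[int, list[str]]  # community ID -> member node IDs
    cohesion: dict[int, float]         # community ID -> cohesion score
    god_nodes: list[GodNode]
    stats: GraphStats
```

TypedDict instances are ordinary dicts at runtime, so callers still receive plain Python data.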

Implementation logic:

from pathlib import Path

def extract_code_graph(folder_path: Path, *, extensions=None, skip_dirs=None) -> dict:
    # Lazy imports: importing this module must not require Graphify.
    from graphify.extract import collect_files, extract
    from graphify.build import build_from_json
    from graphify.cluster import cluster, score_all
    from graphify.analyze import god_nodes

    # Step 1: Collect code files
    code_files = collect_files(folder_path)

    # Step 2: Filter by extension (documented default: _CODE_EXTENSIONS)
    if extensions is None:
        extensions = _CODE_EXTENSIONS
    code_files = [f for f in code_files if f.suffix in extensions]

    # Step 3: Filter by skip_dirs if specified
    if skip_dirs:
        code_files = [f for f in code_files
                      if not any(d in f.parts for d in skip_dirs)]

    if not code_files:
        return _empty_graph()

    # Step 4: Extract AST (zero API cost)
    extraction = extract(code_files)
    if not extraction.get("nodes"):
        return _empty_graph()

    # Step 5: Build graph
    G = build_from_json(extraction)

    # Step 6: Community detection
    communities = cluster(G)
    cohesion = score_all(G, communities)
    gods = god_nodes(G)

    # Step 7: Convert to plain dict
    nodes = [
        {
            "id": n,
            "label": G.nodes[n].get("label", n),
            "source_file": G.nodes[n].get("source_file", ""),
            "source_location": G.nodes[n].get("source_location", ""),
            "type": G.nodes[n].get("file_type", "code"),
        }
        for n in G.nodes
    ]
    edges = [
        {
            "source": u,
            "target": v,
            "relation": d.get("relation", "related"),
            "confidence": d.get("confidence", "EXTRACTED"),
        }
        for u, v, d in G.edges(data=True)
    ]

    return {
        "nodes": nodes,
        "edges": edges,
        "communities": {k: list(v) for k, v in communities.items()},
        "cohesion": {k: v for k, v in cohesion.items()},
        "god_nodes": gods,
        "stats": {
            "files_parsed": len(code_files),
            "nodes": len(nodes),
            "edges": len(edges),
            "communities": len(communities),
        },
    }
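
Steps 2 and 3 can be isolated into a pure helper that is testable without Graphify. The sketch below is one possible shape, with a hypothetical name (`filter_code_files`). It matches skip_dirs only against directory components below the scan root, so a parent directory that happens to be called "build" or "venv" does not exclude the whole repository.

```python
from pathlib import Path
from typing import Iterable, Optional


def filter_code_files(
    files: Iterable[Path],
    root: Path,
    extensions: Optional[set] = None,
    skip_dirs: Optional[set] = None,
) -> list:
    """Keep files that match `extensions` and do not sit inside a skipped directory."""
    kept = []
    for f in files:
        if extensions is not None and f.suffix not in extensions:
            continue
        try:
            rel_parts = f.relative_to(root).parts
        except ValueError:
            # File is not under root; fall back to the full path.
            rel_parts = f.parts
        # Only directory components count, not the file name itself.
        if skip_dirs and any(part in skip_dirs for part in rel_parts[:-1]):
            continue
        kept.append(f)
    return kept
```

Usage: `filter_code_files(collect_files(folder_path), folder_path, extensions, skip_dirs)` would replace the two list comprehensions, assuming collect_files() returns Path objects as the code above implies.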

3. graph_to_wiki_terms(graph_data) -> dict

Maps Graphify nodes and communities to wiki-compatible entity and concept names.

Mapping rules:

Graphify node | Condition | Wiki type | Example
--- | --- | --- | ---
class | any degree | entity | "UserService"
function (top-level) | degree >= 3 | entity | "process_payment"
function (method) | - | skip | stays inside class page
module / file | degree >= 2 | entity | "auth"
Leiden community | size >= 2 | concept | "Authentication System"
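
The node rows of this mapping can be sketched as a small decision function. The `kind` values and the `is_method` flag are assumptions made for illustration; the node dicts returned by extract_code_graph() only carry a single "type" field, so the real implementation has to derive them somehow (for example from "contains" edges).

```python
from collections import Counter
from typing import Optional


def node_degrees(edges: list) -> Counter:
    """Count how many edge endpoints touch each node ID (undirected degree)."""
    degree = Counter()
    for e in edges:
        degree[e["source"]] += 1
        degree[e["target"]] += 1
    return degree


def wiki_type(kind: str, degree: int, is_method: bool = False) -> Optional[str]:
    """Return "entity" for nodes that get a wiki page, or None to skip them."""
    if kind == "class":
        return "entity"  # classes qualify at any degree
    if kind == "function":
        if is_method:
            return None  # methods stay inside their class page
        return "entity" if degree >= 3 else None
    if kind in ("module", "file"):
        return "entity" if degree >= 2 else None
    return None


def community_is_concept(members: list) -> bool:
    """A community becomes a concept only when it has at least two members."""
    return len(members) >= 2
```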

Community naming algorithm:

  1. Collect source_file values from all community member nodes
  2. Extract common path prefix (e.g., auth/, payments/)
  3. If prefix gives a meaningful directory name → use it title-cased
  4. Otherwise → use the highest-degree node's label + "Module" suffix
  5. Deduplicate against existing concept names
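
A minimal sketch of these five steps follows. Several details are assumptions, not decisions from this issue: the GENERIC_DIRS set used to judge whether a directory name is "meaningful", the numeric suffix used for deduplication, and the function names themselves.

```python
from pathlib import PurePosixPath

# Hypothetical list of directory names too generic to name a concept.
GENERIC_DIRS = {"src", "lib", "app"}


def common_dir_parts(files: list) -> list:
    """Return the directory components shared by every file path."""
    parts_list = [PurePosixPath(f).parent.parts for f in files]
    common = []
    for group in zip(*parts_list):
        if len(set(group)) != 1:
            break
        common.append(group[0])
    return common


def name_community(members: list, degree: dict, existing: set) -> str:
    """members: node dicts with "id", "label" and "source_file"."""
    # Steps 1-2: collect source files and find their common directory.
    files = [m["source_file"] for m in members if m.get("source_file")]
    common = common_dir_parts(files)
    leaf = common[-1].strip("/") if common else ""
    if leaf and leaf.lower() not in GENERIC_DIRS:
        # Step 3: title-case the directory name.
        name = leaf.replace("_", " ").replace("-", " ").title()
    else:
        # Step 4: fall back to the highest-degree member.
        top = max(members, key=lambda m: degree.get(m["id"], 0))
        name = f"{top['label']} Module"
    # Step 5: deduplicate with a numeric suffix.
    candidate, n = name, 2
    while candidate in existing:
        candidate = f"{name} {n}"
        n += 1
    return candidate
```

With members in auth/service.py and auth/views/login.py this yields "Auth"; with members spread over top-level files it falls back to a name such as "UserService Module".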

Returns:

{
    "entities": ["UserService", "PaymentGateway", "process_payment"],
    "concepts": ["Authentication", "Payment Processing"],
    "edges_by_entity": {
        "UserService": [
            {"target": "PaymentGateway", "relation": "calls"},
            {"target": "login", "relation": "contains"}
        ]
    }
}
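
The edges_by_entity part of this structure could be built as sketched below. The helper name is hypothetical, and the sketch assumes labels are unique across nodes; if two nodes share a label, the later one wins in the lookup table and the real implementation would need a disambiguation rule.

```python
def build_edges_by_entity(graph_data: dict, entities: list) -> dict:
    """Group outgoing edges under the label of each entity node."""
    label_of = {n["id"]: n["label"] for n in graph_data["nodes"]}
    result = {name: [] for name in entities}
    for e in graph_data["edges"]:
        src = label_of.get(e["source"])
        tgt = label_of.get(e["target"])
        # Targets need not be entities themselves (e.g. a contained method).
        if src in result and tgt is not None:
            result[src].append({"target": tgt, "relation": e["relation"]})
    return result
```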

Implementation detail — filtering trivial nodes:

  • Skip nodes whose label starts with _ (private/internal)
  • Skip nodes whose label is __init__, __main__, setup
  • Skip "rationale" type nodes (Graphify extracts # NOTE: comments as rationale nodes)
  • Skip file-level module nodes that are just containers (only have "contains" edges out)
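
These four rules can be sketched as a single predicate. Two points are assumptions: the "type" values used to recognise rationale and module nodes, and the reading of "just containers" as "has outgoing edges, and all of them are contains edges".

```python
SKIP_LABELS = {"__init__", "__main__", "setup"}


def is_trivial(node: dict, edges: list) -> bool:
    """Return True for nodes that should not become wiki entities."""
    label = node.get("label", "")
    if label.startswith("_") or label in SKIP_LABELS:
        return True
    if node.get("type") == "rationale":
        return True
    if node.get("type") in ("module", "file"):
        outgoing = [e for e in edges if e["source"] == node["id"]]
        # A pure container only ever "contains" other nodes.
        if outgoing and all(e["relation"] == "contains" for e in outgoing):
            return True
    return False
```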

4. save_graph_json(wiki_root, slug, graph_data) -> Path

import json
from pathlib import Path

def save_graph_json(wiki_root: Path, slug: str, graph_data: dict) -> Path:
    """Write the graph to raw/<slug>-graph.json and return the path."""
    raw_dir = wiki_root / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / f"{slug}-graph.json"
    # default=str serializes non-JSON values such as Path objects as strings.
    path.write_text(json.dumps(graph_data, indent=2, default=str), encoding="utf-8")
    return path

Storage rationale: raw/ is the immutable source material directory. The graph is derived from source code — it belongs with source data. .obsidianignore already excludes raw/ from Obsidian indexing.
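
One consequence of storing the graph as JSON: object keys are always strings, so the integer community IDs in "communities" and "cohesion" come back as "0", "1", and so on. A reader of the file may want to restore them. The loader below is a hypothetical counterpart to save_graph_json(), sketched under that assumption.

```python
import json
from pathlib import Path


def load_graph_json(path: Path) -> dict:
    """Read a saved graph and restore integer community IDs."""
    data = json.loads(path.read_text(encoding="utf-8"))
    for key in ("communities", "cohesion"):
        # json.dumps wrote the int keys as strings; convert them back.
        data[key] = {int(k): v for k, v in data.get(key, {}).items()}
    return data
```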

5. Helper: _empty_graph() -> dict

def _empty_graph() -> dict:
    return {
        "nodes": [], "edges": [],
        "communities": {}, "cohesion": {},
        "god_nodes": [],
        "stats": {"files_parsed": 0, "nodes": 0, "edges": 0, "communities": 0},
    }

6. Constant: _CODE_EXTENSIONS

_CODE_EXTENSIONS = {
    ".py", ".ts", ".js", ".tsx", ".jsx",
    ".go", ".rs", ".java", ".cpp", ".c", ".h",
    ".rb", ".swift", ".kt", ".cs", ".scala",
    ".php", ".lua", ".groovy", ".jl",
    ".f90", ".pas",
}

This matches Graphify's supported tree-sitter languages.

Potential Issues

  1. Graphify's collect_files() vs our file walking: collect_files() has its own filtering logic. We may get different file sets than wiki_ingest_folder(). Solution: use our own file list from the walk loop where possible, or at minimum filter collect_files() output with our _SKIP_DIRS and extensions.

  2. NetworkX graph iteration order: G.nodes and G.edges(data=True) iteration order is insertion-order in Python 3.7+, but community assignment is non-deterministic (Leiden uses randomization). This is fine — we only need consistent node IDs, not consistent community assignment.

  3. Large repositories: extract() on a 1000+ file repo could take 10-30 seconds (tree-sitter is fast but not instant). This is acceptable for a one-time ingest operation, but document that large repos may take a moment.

  4. extract() with cache_root: The v7 API supports extract(code_files, cache_root=Path(".")) for caching parsed results. We should pass a cache path to avoid re-parsing on --update runs. Use wiki_root / "raw" as cache root.

  5. Import safety: All Graphify imports are lazy (inside function bodies), so import agent_notes never fails even when graphifyy isn't installed.
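
For the first issue, one option is to produce the file list ourselves and hand it straight to extract(). The walk below is a sketch with a hypothetical name (`walk_code_files`); it is not the existing wiki_ingest_folder() loop, which this issue does not show. It prunes skipped directories in place so os.walk never descends into them, and sorts names so the result is stable across runs.

```python
import os
from pathlib import Path


def walk_code_files(root: Path, extensions: set, skip_dirs: set) -> list:
    """Collect code files under `root`, never entering skipped directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place assignment tells os.walk which subdirectories to visit.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix in extensions:
                found.append(path)
    return found
```

Pruning during the walk also avoids listing large trees such as node_modules at all, which filtering the output of collect_files() afterwards cannot do.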

Dependencies
