Wire Graphify extraction into wiki_ingest_folder() #8

@verkligheten

Parent Epic

Part of #5 — Integrate Graphify for zero-cost code entity extraction

Task

Modify wiki_ingest_folder() in agent_notes/services/wiki_backend.py to automatically run Graphify extraction when code files are present and the package is available.

File

agent_notes/services/wiki_backend.py — function wiki_ingest_folder() (lines 338-441)

Current Flow

wiki_ingest_folder(folder_path)
    ├── Walk files, filter by extension/.gitignore/_SKIP_DIRS
    ├── Concatenate with "--- FILE: <rel> ---" markers
    ├── Chunk if > 2MB
    └── Call wiki_ingest(concepts=caller_provided, entities=caller_provided)

Problem: concepts and entities are almost always None when called programmatically — the caller (LLM agent or CLI) doesn't know what's in the code yet.

New Flow

wiki_ingest_folder(folder_path)
    ├── Walk files, filter by extension/.gitignore/_SKIP_DIRS
    ├── Track has_code flag during walk
    ├── Concatenate with "--- FILE: <rel> ---" markers
    │
    ├── [NEW] if has_code and graphify_available():
    │   ├── extract_code_graph(folder_path, extensions, skip_dirs)
    │   ├── graph_to_wiki_terms(graph_data)
    │   ├── save_graph_json(wiki_root, slug, graph_data)
    │   └── Merge discovered terms with caller-provided ones
    │
    ├── Chunk if > 2MB
    └── Call wiki_ingest(concepts=merged, entities=merged)

Implementation Details

Step 1: Add _CODE_EXTENSIONS constant (near line 306)

_CODE_EXTENSIONS = {
    ".py", ".ts", ".js", ".tsx", ".jsx",
    ".go", ".rs", ".java", ".cpp", ".c", ".h",
    ".rb", ".swift", ".kt", ".cs", ".scala",
    ".php", ".lua", ".groovy",
}

Step 2: Track has_code during file walk (inside the for loop, lines 364-387)

Add before the loop:

has_code = False

Inside the loop, after the extension filter passes (after line 374):

# Compare case-insensitively so that e.g. ".PY" also counts as code.
if file.suffix.lower() in _CODE_EXTENSIONS:
    has_code = True
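
As a self-contained illustration of Step 2, the sketch below walks a folder and sets the flag as soon as a code file passes the extension filter. The function name walk_with_code_flag and the trimmed extension sets are hypothetical stand-ins; the real loop in wiki_ingest_folder() also applies .gitignore and _SKIP_DIRS filtering, which is omitted here.

```python
from pathlib import Path

# Trimmed stand-ins for the module-level constants (assumed shapes).
_CODE_EXTENSIONS = {".py", ".ts", ".go"}
_ALLOWED_EXTENSIONS = _CODE_EXTENSIONS | {".md", ".yaml"}


def walk_with_code_flag(folder: Path) -> tuple[list[Path], bool]:
    """Return the files that pass the extension filter and whether any is code."""
    has_code = False
    kept: list[Path] = []
    for file in sorted(folder.rglob("*")):
        suffix = file.suffix.lower()
        if not file.is_file() or suffix not in _ALLOWED_EXTENSIONS:
            continue  # filtered out, so it cannot set the flag
        kept.append(file)
        if suffix in _CODE_EXTENSIONS:
            has_code = True
    return kept, has_code
```

Because the flag is only set after the filter passes, a folder whose code files are all excluded by the caller's extension list still reports has_code as False.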

Step 3: Insert Graphify extraction block (after line 389, before line 391)

    # ── Graphify auto-extraction (zero-cost entity discovery) ────────
    graphify_concepts: list[str] = []
    graphify_entities: list[str] = []

    if has_code:
        try:
            from .code_graph import (
                graphify_available,
                extract_code_graph,
                graph_to_wiki_terms,
                save_graph_json,
            )

            if graphify_available():
                graph_data = extract_code_graph(
                    folder_path,
                    extensions=(
                        allowed_exts & _CODE_EXTENSIONS
                        if allowed_exts != _DEFAULT_EXTENSIONS
                        else None
                    ),
                    skip_dirs=_SKIP_DIRS,
                )
                if graph_data["stats"]["nodes"] > 0:
                    wiki_terms = graph_to_wiki_terms(graph_data)
                    graphify_entities = wiki_terms["entities"]
                    graphify_concepts = wiki_terms["concepts"]

                    # Persist graph alongside raw content
                    _slug_name = slug if 'slug' in dir() else _slug(title or folder_path.name)
                    save_graph_json(wiki_root, _slug_name, graph_data)
        except Exception:
            pass  # Graphify failure must never break ingestion

Step 4: Merge terms before wiki_ingest() calls

Add helper function:

def _merge_unique(base: list[str], extra: list[str]) -> list[str]:
    """Merge two lists preserving order, removing duplicates (case-insensitive)."""
    seen = {x.lower() for x in base}
    result = list(base)
    for item in extra:
        if item.lower() not in seen:
            seen.add(item.lower())
            result.append(item)
    return result

Before both wiki_ingest() calls (lines 421 and 432), merge:

    merged_concepts = _merge_unique(concepts or [], graphify_concepts)
    merged_entities = _merge_unique(entities or [], graphify_entities)

Then pass concepts=merged_concepts, entities=merged_entities instead of concepts=concepts, entities=entities.

Step 5: Add graph.json reference to source page (optional enhancement)

In wiki_ingest(), if a graph.json was saved, add its path to the sources list in the source page frontmatter. This is optional — the graph.json is discoverable by convention (raw/<slug>-graph.json).
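
A minimal sketch of how the optional reference could be computed, assuming the raw/ directory sits directly under wiki_root and that the frontmatter sources list holds paths relative to the wiki root. Both helper names (graph_json_path, sources_with_graph) are hypothetical.

```python
from pathlib import Path


def graph_json_path(wiki_root: Path, slug: str) -> Path:
    """Conventional location of the saved graph: raw/<slug>-graph.json."""
    return wiki_root / "raw" / f"{slug}-graph.json"


def sources_with_graph(sources: list[str], wiki_root: Path, slug: str) -> list[str]:
    """Return sources plus the graph path, if the graph file actually exists."""
    path = graph_json_path(wiki_root, slug)
    if not path.is_file():
        return list(sources)  # nothing was saved, leave the list unchanged
    rel = path.relative_to(wiki_root).as_posix()
    return list(sources) if rel in sources else [*sources, rel]
```

Checking for the file on disk keeps wiki_ingest() independent of whether the Graphify block ran, so no extra flag has to be threaded through the call.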

Insertion Points (approximate line references)

| What | Where | Line |
| --- | --- | --- |
| _CODE_EXTENSIONS constant | After _DEFAULT_EXTENSIONS | ~306 |
| has_code = False | Before for file in sorted(...) | ~363 |
| has_code = True | Inside loop, after extension check | ~375 |
| Graphify extraction block | After raw_content = "".join(parts) | ~390 |
| _merge_unique() helper | Before wiki_ingest_folder() or as module-level | ~337 |
| Merged args to wiki_ingest() (chunked) | Replace concepts=concepts | ~428 |
| Merged args to wiki_ingest() (single) | Replace concepts=concepts | ~439 |

Edge Cases

  1. Folder with no code files (only .md/.yaml): has_code stays False, Graphify block skipped entirely. Zero overhead.
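  2. Graphify not installed: graphify_available() returns False and the block is skipped. The only cost is the availability check itself. Note that Python does not cache failed imports, so the check repeats its module lookup on every call unless graphify_available() memoizes its result.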
  3. Graphify extraction returns empty: stats.nodes == 0 check skips term mapping. Falls through to original behavior.
  4. Graphify crashes: except Exception: pass catches everything. Ingestion continues without entity discovery.
  5. Caller provides entities AND Graphify discovers more: _merge_unique() combines both, deduplicating case-insensitively. Caller's entities come first (higher priority).
  6. Very large folder (1000+ files): extract() may take 10-30s. This is acceptable for a one-time ingest. Tree-sitter is O(n) in file size.

Testing

See #11 for test specifications.

Dependencies
