# Athanor — Stage 2: Gap Finder

**Input:** Stage 1 outputs — `concept_graph.json` + `candidate_gaps.json`  
**Process:** Claude analyses each semantically close / graph-distant concept pair → structured research question  
**Output:** `outputs/gaps/gap_report.json` — ranked research questions ready for Stage 3

**Pipeline:** `Load graph → Enrich candidates → Claude analysis → Score & rank → Visualize → Save`
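
As a rough illustration of what "semantically close / graph-distant" means, the sketch below flags concept pairs whose embedding cosine similarity is high while their shortest-path distance in the concept graph is large or undefined. The toy embeddings, edge list, thresholds, and function names are all hypothetical; Stage 1's actual selection logic is not shown in this notebook and may differ.

```python
import math
from collections import deque
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hop_distance(adj, src, dst):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def candidate_pairs(embeddings, edges, min_sim=0.8, min_hops=3):
    """Pairs that are similar in embedding space but far apart in the graph."""
    adj = {}
    for a, b in edges:  # undirected adjacency
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    out = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        hops = hop_distance(adj, a, b)
        if sim >= min_sim and (hops is None or hops >= min_hops):
            out.append((a, b, round(sim, 3), hops))
    return sorted(out, key=lambda t: t[2], reverse=True)


# Hypothetical toy data, for illustration only.
toy_embeddings = {
    "channel capacity": [1.0, 0.1],
    "rate distortion": [0.9, 0.2],
    "entropy": [0.1, 1.0],
    "kolmogorov complexity": [0.95, 0.15],
}
toy_edges = [("channel capacity", "entropy"), ("entropy", "rate distortion")]

pairs = candidate_pairs(toy_embeddings, toy_edges)
for a, b, sim, hops in pairs:
    print(f"{a} <-> {b}: sim={sim}, hops={hops}")
```

In this toy graph, "channel capacity" and "rate distortion" are similar but only two hops apart, so they are not flagged at the default `min_hops=3`; the disconnected "kolmogorov complexity" node pairs with both.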


In [None]:
import os, sys, json, time
from pathlib import Path

sys.path.insert(0, str(Path("..").resolve()))

from dotenv import load_dotenv
load_dotenv(Path("..") / ".env")

from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich.text import Text
import plotly.graph_objects as go
import plotly.express as px

console = Console()
print("✓ Imports OK")


## 1. Configuration

Edit this block to change domain or tune the analysis.

In [None]:
CONFIG = {
    # Must match Stage 1
    "domain": "information theory",
    "claude_model": "claude-opus-4-5",

    # How many candidate gaps to send to Claude (controls API cost; ~$0.05–0.15 each)
    "max_gaps_to_analyse": 15,

    # Stage 1 outputs (read)
    "graph_path": Path("..") / "outputs" / "graphs" / "concept_graph.json",
    "gaps_path":  Path("..") / "outputs" / "graphs" / "candidate_gaps.json",

    # Stage 2 outputs (write)
    "output_dir": Path("..") / "outputs" / "gaps",
}

CONFIG["output_dir"].mkdir(parents=True, exist_ok=True)

if not os.environ.get("ANTHROPIC_API_KEY"):
    console.print("[bold red]ANTHROPIC_API_KEY not set![/]")
else:
    console.print(f"[bold green]✓ Config ready[/] — model: {CONFIG['claude_model']}, max gaps: {CONFIG['max_gaps_to_analyse']}")


## 2. Load Stage 1 Outputs

Read the concept graph and candidate gaps produced by Stage 1.  
**Stage 1 must be run first** — if files are missing, run `stage1_literature_mapper.ipynb`.

In [None]:
from athanor.graph.models import ConceptGraph
from athanor.gaps.models import CandidateGap

# ── Load concept graph ────────────────────────────────────────────────────────
if not CONFIG["graph_path"].exists():
    console.print(f"[bold red]Missing:[/] {CONFIG['graph_path']}\nRun Stage 1 first.")
    raise FileNotFoundError(CONFIG["graph_path"])

concept_graph = ConceptGraph.model_validate_json(
    CONFIG["graph_path"].read_text()
)

# Build label → concept lookup for context enrichment
concept_map = {c.label: c for c in concept_graph.concepts}

console.print(f"[bold green]✓ Concept graph loaded[/] — {len(concept_graph.concepts)} concepts, {len(concept_graph.edges)} edges")

# ── Load candidate gaps ───────────────────────────────────────────────────────
if not CONFIG["gaps_path"].exists():
    console.print(f"[bold red]Missing:[/] {CONFIG['gaps_path']}\nRun Stage 1 first.")
    raise FileNotFoundError(CONFIG["gaps_path"])

raw_gaps = json.loads(CONFIG["gaps_path"].read_text())
candidate_gaps = [CandidateGap(**g) for g in raw_gaps]

console.print(f"[bold green]✓ Candidate gaps loaded[/] — {len(candidate_gaps)} pairs")


## 3. Enrich Candidates with Concept Context

Inject descriptions and provenance from the concept graph into each gap.  
This is the context Claude uses to reason about *why* the gap exists.
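
The provenance split amounts to plain set algebra over each concept's source papers. The standalone sketch below shows the idea on toy paper IDs; `split_provenance` and the IDs are hypothetical names used only for illustration, and the `limit` of 4 mirrors the truncation used in the enrichment cell.

```python
def split_provenance(papers_a, papers_b, limit=4):
    """Split two paper lists into shared and exclusive provenance."""
    a, b = set(papers_a), set(papers_b)
    return {
        "shared_papers": sorted(a & b),      # papers mentioning both concepts
        "papers_a": sorted(a - b)[:limit],   # papers mentioning only concept A
        "papers_b": sorted(b - a)[:limit],   # papers mentioning only concept B
    }


# Hypothetical paper IDs.
result = split_provenance(["p1", "p2", "p3"], ["p3", "p4"])
print(result)
```

A pair with high similarity and an empty `shared_papers` list is plausibly the most interesting case, since it suggests the two concepts have not been discussed together in the source corpus.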

In [None]:
import networkx as nx

G = concept_graph.to_networkx()
enriched: list[CandidateGap] = []

for gap in candidate_gaps:
    ca = concept_map.get(gap.concept_a)
    cb = concept_map.get(gap.concept_b)

    # Shared papers: papers in which both concepts appear
    papers_a = set(ca.source_papers) if ca else set()
    papers_b = set(cb.source_papers) if cb else set()
    shared = sorted(papers_a & papers_b)

    enriched.append(
        CandidateGap(
            concept_a=gap.concept_a,
            concept_b=gap.concept_b,
            similarity=gap.similarity,
            graph_distance=gap.graph_distance,
            description_a=ca.description if ca else "",
            description_b=cb.description if cb else "",
            shared_papers=shared,
            papers_a=sorted(papers_a - papers_b)[:4],
            papers_b=sorted(papers_b - papers_a)[:4],
        )
    )

# Sort by similarity desc (most semantically close = highest priority gap)
enriched.sort(key=lambda g: g.similarity, reverse=True)

table = Table(title="Top 10 Enriched Candidate Gaps", show_lines=True)
table.add_column("Concept A", style="cyan")
table.add_column("Concept B", style="cyan")
table.add_column("Sim", style="green")
table.add_column("Dist", style="red")
table.add_column("Shared Papers", style="yellow")

for g in enriched[:10]:
    table.add_row(
        g.concept_a,
        g.concept_b,
        f"{g.similarity:.3f}",
        str(g.graph_distance) if g.graph_distance < 999 else "∞",
        str(len(g.shared_papers)),
    )

console.print(table)


## 4. Gap Analysis via Claude

For each gap, Claude produces:
- A precise, testable **research question**
- **Why** the community has missed this connection
- The **opportunity** at this intersection
- A concrete **methodology** sketch
- Scores: novelty · tractability · impact (1–5 each)

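Results are cached to disk, so re-running this cell makes no further API calls after the first run. The cache check looks only at whether the file exists: delete `gap_report_cache.json` after changing the domain, model, or gap count.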
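
One possible way to make the cache safer across configuration changes is to include a fingerprint of the relevant settings in the cache filename. The sketch below is an assumption-laden illustration using only the standard library: `config_fingerprint` and `load_or_compute` are hypothetical helpers, not part of `athanor`, and the stand-in `fake_analysis` replaces the real Claude calls.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(domain, model, max_gaps):
    """Short, stable hash of the settings that affect the analysis."""
    payload = json.dumps(
        {"domain": domain, "model": model, "max_gaps": max_gaps}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def load_or_compute(cache_dir, fingerprint, compute):
    """Return (result, cache_hit); call compute() only on a cache miss."""
    path = Path(cache_dir) / f"gap_report_cache_{fingerprint}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    result = compute()
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result, False


calls = []


def fake_analysis():
    """Stand-in for the expensive API-backed analysis."""
    calls.append(1)
    return {"analyses": ["placeholder"]}


with tempfile.TemporaryDirectory() as tmp:
    fp = config_fingerprint("information theory", "claude-opus-4-5", 15)
    first, hit_first = load_or_compute(tmp, fp, fake_analysis)
    second, hit_second = load_or_compute(tmp, fp, fake_analysis)

    # A different domain yields a different fingerprint, hence a cache miss.
    other_fp = config_fingerprint("graph theory", "claude-opus-4-5", 15)
    third, hit_third = load_or_compute(tmp, other_fp, fake_analysis)

print(hit_first, hit_second, hit_third, len(calls))
```

The trade-off is that old cache files accumulate in the output directory, one per distinct configuration.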

In [None]:
from athanor.gaps import GapFinder, GapReport, GapAnalysis
from tqdm.notebook import tqdm

cache_path = CONFIG["output_dir"] / "gap_report_cache.json"

if cache_path.exists():
    console.print("[yellow]Loading cached gap report…[/]")
    gap_report = GapReport.model_validate_json(cache_path.read_text())
else:
    finder = GapFinder(
        domain=CONFIG["domain"],
        model=CONFIG["claude_model"],
        api_key=os.environ["ANTHROPIC_API_KEY"],
        max_gaps=CONFIG["max_gaps_to_analyse"],
    )

    # Wrap with tqdm for progress visibility
    gaps_to_analyse = enriched[: CONFIG["max_gaps_to_analyse"]]
    gap_report = GapReport(
        domain=CONFIG["domain"],
        query=concept_graph.query,
        n_candidates=len(enriched),
        n_analyzed=len(gaps_to_analyse),
    )

    for i, gap in enumerate(tqdm(gaps_to_analyse, desc="Analysing gaps")):
        console.print(f"  [{i+1}/{len(gaps_to_analyse)}] [cyan]{gap.concept_a}[/] ↔ [cyan]{gap.concept_b}[/]")
        analysis = finder._analyse_one(gap)
        if analysis:
            gap_report.analyses.append(analysis)
        time.sleep(0.5)

    # Cache
    cache_path.write_text(gap_report.model_dump_json(indent=2))
    console.print(f"\n[green]✓ Cached to {cache_path}[/]")

console.print(
    f"\n[bold green]✓ Gap report ready[/] — "
    f"{len(gap_report.analyses)} analyses from {gap_report.n_candidates} candidates"
)


## 5. Score & Rank

Composite score = `impact × 0.4 + novelty × 0.35 + tractability × 0.25`  
Weighted toward impact because high-novelty/low-impact gaps aren't the goal.
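
To make the weighting concrete, here is a small worked example of the formula above. It re-implements the stated weights as a standalone function; the real `composite_score` lives on the `athanor` analysis objects, whose implementation is not shown here, and the two example gaps are invented.

```python
WEIGHTS = {"impact": 0.40, "novelty": 0.35, "tractability": 0.25}


def composite_score(novelty, tractability, impact):
    """Weighted sum of the three 1-5 scores; weights sum to 1."""
    return (
        WEIGHTS["impact"] * impact
        + WEIGHTS["novelty"] * novelty
        + WEIGHTS["tractability"] * tractability
    )


# Hypothetical gaps, for illustration only.
gaps = {
    "novel, low impact": dict(novelty=5, tractability=3, impact=2),
    "solid all-round": dict(novelty=3, tractability=4, impact=4),
}

scores = {name: composite_score(**s) for name, s in gaps.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Because the weights sum to 1 and each input lies in 1 to 5, the composite also lies in 1 to 5. Here the all-round gap scores 3.65 against 3.30 for the highly novel but low-impact one, which is the ordering the weighting is meant to produce.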

In [None]:
ranked = gap_report.ranked  # sorted by composite_score desc

score_table = Table(
    title=f"All Gap Analyses — {CONFIG['domain']} (ranked by composite score)",
    show_lines=True,
)
score_table.add_column("#", style="dim", width=3)
score_table.add_column("Concept A", style="cyan")
score_table.add_column("Concept B", style="cyan")
score_table.add_column("Score", style="bold green")
score_table.add_column("Nov", style="magenta")
score_table.add_column("Trc", style="blue")
score_table.add_column("Imp", style="red")
score_table.add_column("Comp?", style="yellow")
score_table.add_column("Research Question", style="white")

for i, a in enumerate(ranked, 1):
    score_table.add_row(
        str(i),
        a.concept_a,
        a.concept_b,
        f"{a.composite_score:.2f}",
        str(a.novelty),
        str(a.tractability),
        str(a.impact),
        "✓" if a.computational else "✗",
        a.research_question[:80] + ("…" if len(a.research_question) > 80 else ""),
    )

console.print(score_table)


## 6. Top 5 Gaps — Deep Dive

In [None]:
for i, a in enumerate(gap_report.top, 1):
    label = f"#{i} | Score {a.composite_score:.2f} | Nov:{a.novelty} Trc:{a.tractability} Imp:{a.impact} {'[COMP]' if a.computational else '[WET]'}"
    body = Text()
    body.append(f"Gap:  {a.concept_a}  ↔  {a.concept_b}\n", style="bold cyan")
    body.append(f"\nRQ:   {a.research_question}\n", style="bold white")
    body.append(f"\nWhy unexplored:\n{a.why_unexplored}\n", style="dim")
    body.append(f"\nOpportunity:\n{a.intersection_opportunity}\n", style="green")
    body.append(f"\nMethodology:\n{a.methodology}\n", style="yellow")
    body.append(f"\nKeywords: {', '.join(a.keywords)}", style="magenta")
    console.print(Panel(body, title=label, border_style="green" if i == 1 else "blue"))


## 7. Visualize the Gap Landscape

Scatter plot of all analysed gaps.  
- **X axis:** tractability (how easy is it to investigate?)  
- **Y axis:** novelty (how unstudied is this?)  
- **Size:** impact score  
- **Color:** composite score  
- **Ideal gaps:** top-right, large — novel, tractable, high-impact
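
The quadrant reading of the plot can also be expressed as a small helper. This is a sketch under assumptions: `quadrant` is a hypothetical function, the midpoint of 3 matches the dashed guide lines drawn in the plotting cell, and treating a score of exactly 3 as "not high" is an arbitrary choice.

```python
def quadrant(tractability, novelty, midpoint=3):
    """Name the plot quadrant for a gap scored on the 1-5 scales."""
    easy = tractability > midpoint
    novel = novelty > midpoint
    if easy and novel:
        return "priority"
    if novel:
        return "novel but hard"
    if easy:
        return "easy but known"
    return "low priority"


for trc, nov in [(5, 5), (1, 5), (5, 1), (3, 3)]:
    print(f"tractability={trc}, novelty={nov} -> {quadrant(trc, nov)}")
```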

In [None]:
import plotly.graph_objects as go

analyses = gap_report.ranked

fig = go.Figure()

fig.add_trace(go.Scatter(
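    # Small per-point offset so gaps with identical integer scores don't overlap.
    # The offset grows with rank index, so it is a deterministic nudge rather than random jitter.
    x=[a.tractability + (i * 0.05) for i, a in enumerate(analyses)],
    y=[a.novelty     + (i * 0.03) for i, a in enumerate(analyses)],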
    mode="markers+text",
    text=[f"{a.concept_a[:12]}↔{a.concept_b[:12]}" for a in analyses],
    textposition="top center",
    textfont=dict(size=8),
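    hovertext=[
        f"<b>{a.concept_a} ↔ {a.concept_b}</b><br>"
        f"Score: {a.composite_score:.2f}<br>"
        f"Nov:{a.novelty} Trc:{a.tractability} Imp:{a.impact}<br>"
        # Append the ellipsis only when the question was actually truncated
        f"<i>{a.research_question[:100]}{'…' if len(a.research_question) > 100 else ''}</i>"
        for a in analyses
    ],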
    hoverinfo="text",
    marker=dict(
        size=[6 + a.impact * 4 for a in analyses],
        color=[a.composite_score for a in analyses],
        colorscale="Viridis",
        showscale=True,
        colorbar=dict(title="Composite Score"),
        line=dict(width=1, color="white"),
        symbol=["circle" if a.computational else "diamond" for a in analyses],
    ),
))

# Quadrant annotation
for x, y, label in [
    (4.5, 4.5, "★ PRIORITY"),
    (1.5, 4.5, "Novel but hard"),
    (4.5, 1.5, "Easy but known"),
    (1.5, 1.5, "Low priority"),
]:
    fig.add_annotation(x=x, y=y, text=label, showarrow=False,
                       font=dict(color="rgba(200,200,200,0.4)", size=11))

fig.update_layout(
    title=f"Gap Landscape — {CONFIG['domain']} (● = computational, ◆ = requires wet lab)",
    xaxis=dict(title="Tractability →", range=[0.5, 5.8], dtick=1),
    yaxis=dict(title="Novelty →", range=[0.5, 5.8], dtick=1),
    height=600,
    plot_bgcolor="#111",
    paper_bgcolor="#111",
    font=dict(color="white"),
)

# Add quadrant lines
fig.add_hline(y=3, line_dash="dash", line_color="rgba(255,255,255,0.15)")
fig.add_vline(x=3, line_dash="dash", line_color="rgba(255,255,255,0.15)")

fig.show()

# Save
landscape_path = CONFIG["output_dir"] / "gap_landscape.html"
fig.write_html(str(landscape_path))
console.print(f"[green]✓ Gap landscape saved → {landscape_path}[/]")


## 8. Save GapReport — Stage 3 Input

In [None]:
report_path = CONFIG["output_dir"] / "gap_report.json"
report_path.write_text(gap_report.model_dump_json(indent=2))

console.print(f"[bold green]✓ GapReport saved → {report_path}[/]")
console.print(f"\n[bold]Summary:[/]")
console.print(f"  Candidates evaluated: {gap_report.n_candidates}")
console.print(f"  Analyses produced:    {len(gap_report.analyses)}")
console.print(f"  Top score:            {gap_report.ranked[0].composite_score:.2f}" if gap_report.analyses else "  (no analyses)")

if gap_report.analyses:
    best = gap_report.ranked[0]
    console.print(f"\n[bold green]Top gap:[/] {best.concept_a} ↔ {best.concept_b}")
    console.print(f"[white]{best.research_question}[/]")


## Stage 2 Complete ✓

**Outputs produced:**

| File | Contents |
|------|----------|
| `outputs/gaps/gap_report_cache.json` | Cached GapReport (parsed Claude analyses) |
| `outputs/gaps/gap_report.json` | Ranked GapReport — **Stage 3 input** |
| `outputs/gaps/gap_landscape.html` | Interactive novelty × tractability scatter |

---

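**To change domain:** update `CONFIG["domain"]` and point `CONFIG["graph_path"]` / `CONFIG["gaps_path"]` at a different Stage 1 run.  
Then delete `gap_report_cache.json` to force re-analysis: the cache is loaded whenever the file exists, so a stale report from the previous domain would otherwise be reused.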

---

**Stage 3** will read the top gaps from `gap_report.json` and for each one:  
- Propose a concrete experiment design  
- Generate synthetic data / computational test where possible  
- Flag what requires wet lab or real-world validation and why  
- Close the loop: hypothesis → evidence → refined hypothesis
