# Athanor — Stage 1: Literature Mapper

**Domain-agnostic automated science infrastructure**

**Pipeline:**
`Ingest (arXiv) → Parse → Embed → Claude concept extraction → Merge graph → Analyze → Visualize`

**Test domain:** Information theory (swap the query in §2 to change domains)

**Outputs:**
- `data/raw/` — cached arXiv JSON
- `outputs/graphs/` — interactive HTML graph + JSON graph dump
- In-notebook plotly figure


## 1. Install & Import Libraries

In [None]:
import subprocess, sys

def pip(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

pip(
    "arxiv>=2.1.0",
    "anthropic>=0.40.0",
    "sentence-transformers>=3.0.0",
    "networkx>=3.3",
    "pyvis>=0.3.2",
    "plotly>=5.23.0",
    "pydantic>=2.0.0",
    "scipy>=1.13.0",
    "python-dotenv>=1.0.0",
    "rich>=13.7.0",
    "tqdm>=4.66.0",
    "scikit-learn>=1.5.0",
    "ipywidgets>=8.1.0",
)

print("✓ All packages installed")


In [None]:
import os, sys, json, time, re, logging
from pathlib import Path
from typing import List, Dict, Optional, Tuple

import numpy as np
import networkx as nx
import plotly.graph_objects as go
from tqdm.notebook import tqdm
from rich.console import Console
from rich.table import Table

# Make the athanor package importable from notebooks/
sys.path.insert(0, str(Path("..").resolve()))

console = Console()
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s — %(message)s")

print("✓ Imports OK")


## 2. Configuration

**Edit this cell to change domain.** Everything downstream reads from `CONFIG`.

In [None]:
from dotenv import load_dotenv
load_dotenv(Path("..") / ".env")

# ── SWAP THIS BLOCK TO CHANGE DOMAIN ─────────────────────────────────────────
CONFIG = {
    # Domain identity
    "domain": "information theory",

    # arXiv search query — drives everything
    "arxiv_query": "information theory entropy channel capacity mutual information",

    # How many papers to fetch (keep ≤ 20 for first runs to control API cost)
    "max_papers": 15,

    # Sentence-transformer model for embeddings (local, no API cost)
    "embedding_model": "all-MiniLM-L6-v2",

    # Claude model for concept extraction
    # claude-opus-4-5 = highest quality | claude-haiku-4-5 = fastest/cheapest
    "claude_model": "claude-opus-4-5",

    # Output paths
    "data_dir": Path("..") / "data" / "raw",
    "output_dir": Path("..") / "outputs" / "graphs",

    # Sparse-connection detection threshold
    # Pairs with cosine similarity >= this and graph distance > 2 are flagged
    "sparse_sim_threshold": 0.45,
}

# Create output directories
for d in (CONFIG["data_dir"], CONFIG["output_dir"]):
    d.mkdir(parents=True, exist_ok=True)

# Validate API key
if not os.environ.get("ANTHROPIC_API_KEY"):
    console.print("[bold red]ANTHROPIC_API_KEY not set![/] Copy .env.example → .env and add your key.")
else:
    console.print(f"[bold green]✓ Config ready[/] — domain: [cyan]{CONFIG['domain']}[/], papers: {CONFIG['max_papers']}, model: {CONFIG['claude_model']}")
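
Switching domains comes down to overriding a few `CONFIG` keys. The sketch below shows one way to do that without mutating the original dict. `with_domain` is a hypothetical helper, the longevity query string is only an illustration, and the per-domain cache folder is an assumption intended to keep extractions from one domain from being reloaded for another.

```python
from pathlib import Path

# Stand-in for the notebook's CONFIG, reduced to the keys a domain swap touches.
BASE_CONFIG = {
    "domain": "information theory",
    "arxiv_query": "information theory entropy channel capacity mutual information",
    "max_papers": 15,
    "data_dir": Path("..") / "data" / "raw",
}

def with_domain(config: dict, domain: str, arxiv_query: str) -> dict:
    """Return a copy of config retargeted at another domain (hypothetical helper)."""
    slug = domain.lower().replace(" ", "_")
    return {
        **config,
        "domain": domain,
        "arxiv_query": arxiv_query,
        # Assumption: one cache folder per domain, so stale caches are never reused.
        "data_dir": config["data_dir"] / slug,
    }

# Illustrative query only; pick terms that suit the target field.
bio_config = with_domain(BASE_CONFIG, "longevity biology", "aging senescence lifespan extension")
print(bio_config["domain"], bio_config["data_dir"])
```

Because the helper returns a new dict, the original configuration stays available for comparison runs.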


## 3. Fetch Corpus from arXiv

Results are cached locally in `data/raw/`, so re-running this cell after the first fetch reads from disk instead of querying arXiv again.

In [None]:
from athanor.ingest import ArxivClient, parse_papers

client = ArxivClient(cache_dir=CONFIG["data_dir"])

papers_raw = client.fetch(
    query=CONFIG["arxiv_query"],
    max_results=CONFIG["max_papers"],
    use_cache=True,
)

# ── summary table ─────────────────────────────────────────────────────────────
table = Table(title=f"Fetched {len(papers_raw)} papers — {CONFIG['domain']}", show_lines=True)
table.add_column("ID", style="cyan", no_wrap=True)
table.add_column("Title", style="white")
table.add_column("Published", style="magenta")
table.add_column("Authors", style="green")

for p in papers_raw:
    table.add_row(
        p.arxiv_id,
        p.title[:70] + ("…" if len(p.title) > 70 else ""),
        p.published,
        ", ".join(p.authors[:2]) + (f" +{len(p.authors)-2}" if len(p.authors) > 2 else ""),
    )

console.print(table)
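
The caching behaviour described above can be sketched as follows. This is not the `ArxivClient` implementation, whose internals are not shown here. `cached_fetch`, the hash-based file name, and the `fetch_fn` callback are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

def cached_fetch(query: str, max_results: int, cache_dir: Path,
                 fetch_fn: Callable[[str, int], list]) -> list:
    """Return cached results for (query, max_results); call fetch_fn only on a miss."""
    # Assumed key scheme: a short hash of the query and result count.
    key = hashlib.sha256(f"{query}|{max_results}".encode("utf-8")).hexdigest()[:16]
    cache_file = cache_dir / f"arxiv_{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    results = fetch_fn(query, max_results)
    cache_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results

# A fake fetcher that records each call, standing in for a real arXiv request.
calls = []

def fake_fetch(query: str, max_results: int) -> list:
    calls.append(query)
    return [{"arxiv_id": f"0000.{i:05d}", "title": f"{query} #{i}"} for i in range(max_results)]

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("entropy", 2, Path(tmp), fake_fetch)
    second = cached_fetch("entropy", 2, Path(tmp), fake_fetch)  # served from disk

print(len(calls), first == second)
```

Keying the cache on the query means a changed query triggers a fresh fetch, while an unchanged one never touches the network.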


## 4. Parse & Prepare Text

In [None]:
parsed = parse_papers(papers_raw)

console.print(f"[bold green]✓ Parsed {len(parsed)} papers[/]")
console.print("\n[bold]Sample — first paper text digest:[/]")
console.print(parsed[0]["text"][:600] + "…")


## 5. Embed Papers

Generates dense vector representations using a local sentence-transformer model.  
No API calls — runs fully offline after first model download.

In [None]:
from athanor.embed import Embedder
from sklearn.metrics.pairwise import cosine_similarity

embedder = Embedder(model_name=CONFIG["embedding_model"])

texts = [p["text"] for p in parsed]
embeddings = embedder.embed(texts)  # shape (N, D)

# Pairwise cosine similarity — used downstream for sparse-connection detection
sim_matrix = cosine_similarity(embeddings)

# Attach embeddings to parsed dicts for provenance
for i, p in enumerate(parsed):
    p["embedding"] = embeddings[i]

console.print(f"[bold green]✓ Embeddings computed[/] — shape: {embeddings.shape}")
console.print(f"   Similarity matrix: {sim_matrix.shape}, range [{sim_matrix.min():.3f}, {sim_matrix.max():.3f}]")
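
For readers unfamiliar with the similarity matrix, here is a minimal pure-Python sketch of pairwise cosine similarity on toy vectors. It assumes non-zero vectors and is meant only to show what each matrix entry means; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity between vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy 2-D "embeddings": two orthogonal axes and one diagonal vector.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sim = similarity_matrix(toy)
for row in sim:
    print([round(x, 3) for x in row])
```

The diagonal is 1 (each vector against itself), orthogonal vectors score 0, and vector length does not affect the result.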


## 6. Concept Extraction via Claude

For each paper, Claude extracts:
- **concepts** — canonical scientific terms with descriptions  
- **edges** — typed relationships between concepts with evidence  

This step is the AI backbone of the pipeline. The prompt is domain-agnostic: the same prompt is used unchanged whether the corpus covers longevity biology or quantum gravity.

In [None]:
from athanor.graph import ConceptExtractor

extractor = ConceptExtractor(
    model=CONFIG["claude_model"],
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

# ── extract from every paper ──────────────────────────────────────────────────
paper_extractions = []  # list of (arxiv_id, concepts, edges)

cache_path = CONFIG["data_dir"] / "extractions_cache.json"

if cache_path.exists():
    console.print("[yellow]Loading cached extractions…[/]")
    raw_cache = json.loads(cache_path.read_text())
    # Rebuild objects from cache dicts
    from athanor.graph.models import Concept, Edge
    for item in raw_cache:
        concepts = [Concept(**c) for c in item["concepts"]]
        edges = [Edge(**e) for e in item["edges"]]
        paper_extractions.append((item["arxiv_id"], concepts, edges))
else:
    for paper in tqdm(parsed, desc="Extracting concepts"):
        concepts, edges = extractor.extract(
            text=paper["text"],
            arxiv_id=paper["arxiv_id"],
        )
        paper_extractions.append((paper["arxiv_id"], concepts, edges))
        time.sleep(0.5)  # gentle rate limiting

    # Cache to disk
    serialisable = [
        {
            "arxiv_id": arxiv_id,
            "concepts": [c.model_dump() for c in concepts],
            "edges": [e.model_dump() for e in edges],
        }
        for arxiv_id, concepts, edges in paper_extractions
    ]
    cache_path.write_text(json.dumps(serialisable, indent=2))
    console.print(f"[green]✓ Extractions cached to {cache_path}[/]")

total_concepts = sum(len(c) for _, c, _ in paper_extractions)
total_edges = sum(len(e) for _, _, e in paper_extractions)
console.print(f"[bold green]✓ Extraction complete[/] — {total_concepts} concepts, {total_edges} edges across {len(paper_extractions)} papers")
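
The cache file stores one record per paper with `arxiv_id`, `concepts`, and `edges`. The sketch below mimics that round trip with standard-library dataclasses. `ConceptSketch` and `EdgeSketch` are stand-ins, not the real `athanor.graph.models` classes: their field names come from the attributes this notebook reads (`label`, `description`, `source`, `target`, `relation`, `evidence`, `weight`), while the types and defaults are assumptions.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ConceptSketch:
    label: str
    description: str

@dataclass
class EdgeSketch:
    source: str
    target: str
    relation: str
    evidence: str = ""   # assumed default
    weight: float = 1.0  # assumed default

concept = ConceptSketch("mutual information", "Information shared between two random variables.")
edge = EdgeSketch("mutual information", "channel capacity", "bounds")

# Serialise one paper's extraction the way the cache cell does, then rebuild it.
record = {
    "arxiv_id": "0000.00000",  # placeholder ID, not a real paper
    "concepts": [asdict(concept)],
    "edges": [asdict(edge)],
}
restored = json.loads(json.dumps(record))
concepts = [ConceptSketch(**c) for c in restored["concepts"]]
edges = [EdgeSketch(**e) for e in restored["edges"]]
print(concepts[0].label, "->", edges[0].target)
```

In the notebook, Pydantic's `model_dump()` plays the role that `asdict` plays here.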


## 7. Build the Concept Graph

Merge per-paper extractions into a unified graph:
- Concepts with the same label (case-insensitive) are deduplicated
- Edges accumulate weight across papers
- Provenance (which papers contributed each node/edge) is preserved
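
The three merge rules above can be sketched in a few lines. This is an illustration of the idea, not `GraphBuilder._merge` itself. `merge_extractions` is hypothetical, concepts are reduced to plain labels, and edges are assumed to be directed `(source, target, weight)` tuples.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

EdgeTuple = Tuple[str, str, float]
PaperExtraction = Tuple[str, List[str], List[EdgeTuple]]

def merge_extractions(papers: List[PaperExtraction]):
    """Merge per-paper labels and edges, tracking which papers contributed each."""
    node_sources: Dict[str, Set[str]] = defaultdict(set)
    edge_weight: Dict[Tuple[str, str], float] = defaultdict(float)
    edge_sources: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for arxiv_id, labels, edges in papers:
        for label in labels:
            # Rule 1: case-insensitive deduplication of concept labels.
            node_sources[label.strip().lower()].add(arxiv_id)
        for source, target, weight in edges:
            key = (source.strip().lower(), target.strip().lower())
            edge_weight[key] += weight        # Rule 2: weights accumulate.
            edge_sources[key].add(arxiv_id)   # Rule 3: provenance is kept.
    return dict(node_sources), dict(edge_weight), dict(edge_sources)

# Toy input with placeholder paper IDs.
papers = [
    ("paper-a", ["Entropy", "Channel Capacity"], [("Entropy", "Channel Capacity", 1.0)]),
    ("paper-b", ["entropy", "Mutual Information"], [("entropy", "channel capacity", 0.5)]),
]
nodes, weights, sources = merge_extractions(papers)
print(sorted(nodes), weights)
```

Here "Entropy" and "entropy" collapse into one node credited to both papers, and the shared edge ends up with weight 1.5.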

In [None]:
from athanor.graph import GraphBuilder

# GraphBuilder.build() takes parsed papers and would call Claude again,
# so we assemble the graph by hand from the extractions computed in §6.

from athanor.graph.models import ConceptGraph, Concept, Edge

# Collect all concepts and edges
all_concepts: List = []
all_edges: List = []
for arxiv_id, concepts, edges in paper_extractions:
    all_concepts.extend(concepts)
    all_edges.extend(edges)

# Build via GraphBuilder internal merge (reuse the logic)
builder = GraphBuilder(extractor=extractor)
merged_concepts, merged_edges, _ = builder._merge(all_concepts, all_edges)

concept_graph = ConceptGraph(
    domain=CONFIG["domain"],
    query=CONFIG["arxiv_query"],
    concepts=merged_concepts,
    edges=merged_edges,
)
builder._compute_centrality(concept_graph)

# Persist graph JSON
graph_json_path = CONFIG["output_dir"] / "concept_graph.json"
graph_json_path.write_text(concept_graph.model_dump_json(indent=2))

G = concept_graph.to_networkx()
console.print(f"[bold green]✓ Concept graph built[/]")
console.print(f"   Nodes: {G.number_of_nodes()}")
console.print(f"   Edges: {G.number_of_edges()}")
console.print(f"   Graph density: {nx.density(G):.4f}")
# to_undirected() keeps this call valid if to_networkx() returns a directed graph
console.print(f"   Connected components: {nx.number_connected_components(G.to_undirected())}")
console.print(f"   Saved JSON: {graph_json_path}")


## 8. Graph Analysis & Sparse Connection Detection

Two signals for Stage 2 gap-finding:

1. **Bridge edges** (low Jaccard overlap between the endpoints' neighborhoods): edges that connect otherwise distant clusters
2. **Embedding-near / graph-far pairs**: concepts that are semantically similar but more than two hops apart in the graph, or not connected at all. These are the *implicit gaps*
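
To make the first signal concrete, here is a minimal sketch of ranking edges by the Jaccard overlap of their endpoints' neighborhoods. It assumes `sparse_connections` works along these lines, as the description suggests, but it is not that method's code. `jaccard_overlap` and `rank_bridge_edges` are hypothetical names, and the graph is a toy undirected adjacency map.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Dict[str, Set[str]]

def jaccard_overlap(adj: Adjacency, u: str, v: str) -> float:
    """Shared neighbors of u and v divided by all their neighbors, excluding u and v."""
    nu = adj[u] - {v}
    nv = adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def rank_bridge_edges(adj: Adjacency) -> List[Tuple[float, str, str]]:
    """Return (overlap, u, v) for every undirected edge, lowest overlap first."""
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    return sorted((jaccard_overlap(adj, u, v), u, v) for u, v in edges)

# Two triangles (a-b-c and d-e-f) joined by the single edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
ranked = rank_bridge_edges(adj)
print(ranked[0])
```

The edge `c`-`d` comes out first because its endpoints share no neighbors, which is exactly the bridge pattern the first signal looks for.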

In [None]:
# ── Graph metrics ─────────────────────────────────────────────────────────────
# Caveat: NetworkX reads `weight` as a path length in betweenness centrality,
# so edges with more accumulated support count as longer, not stronger.
betweenness = nx.betweenness_centrality(G, weight="weight", normalized=True)
degree_cent  = nx.degree_centrality(G)
clustering   = nx.clustering(G, weight="weight")

# Top concepts by betweenness (hubs that bridge clusters)
top_hubs = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:10]

hub_table = Table(title="Top 10 Hub Concepts (betweenness centrality)", show_lines=True)
hub_table.add_column("Concept", style="cyan")
hub_table.add_column("Betweenness", style="magenta")
hub_table.add_column("Degree", style="green")
hub_table.add_column("Clustering", style="yellow")
for concept, score in top_hubs:
    hub_table.add_row(
        concept,
        f"{score:.4f}",
        f"{degree_cent.get(concept, 0):.4f}",
        f"{clustering.get(concept, 0):.4f}",
    )
console.print(hub_table)

# ── Bridge edges (sparse connections in graph structure) ───────────────────────
bridge_edges = concept_graph.sparse_connections(top_k=10)
console.print(f"\n[bold]Bridge edges (structurally sparse connections):[/]")
for e in bridge_edges[:5]:
    console.print(f"  [cyan]{e.source}[/] ──[{e.relation}]→ [cyan]{e.target}[/]  (w={e.weight:.2f})")
    if e.evidence:
        console.print(f"    [dim]Evidence: {e.evidence[:120]}[/]")

# ── Embedding-near / graph-far pairs (implicit gaps) ──────────────────────────
console.print("\n[bold]Embedding-near / graph-far pairs (candidate research gaps):[/]")

# Embed each concept's label and description (separate from the paper embeddings in §5)
labels = [c.label for c in concept_graph.concepts]
concept_texts = [c.label + ". " + c.description for c in concept_graph.concepts]
concept_embeddings = embedder.embed(concept_texts)
concept_sim = cosine_similarity(concept_embeddings)

threshold = CONFIG["sparse_sim_threshold"]
candidate_gaps = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if concept_sim[i, j] < threshold:
            continue
        # Are they graph-far? A missing node or a missing path counts as disconnected.
        try:
            dist = nx.shortest_path_length(G, labels[i], labels[j])
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            dist = 999
        if dist > 2:
            candidate_gaps.append((concept_sim[i, j], dist, labels[i], labels[j]))

candidate_gaps.sort(key=lambda g: g[0], reverse=True)  # highest similarity first

gap_table = Table(title="Top Candidate Gaps (semantically similar, structurally distant)", show_lines=True)
gap_table.add_column("Concept A", style="cyan")
gap_table.add_column("Concept B", style="cyan")
gap_table.add_column("Similarity", style="green")
gap_table.add_column("Graph Distance", style="red")
for sim, dist, a, b in candidate_gaps[:10]:
    gap_table.add_row(a, b, f"{sim:.3f}", str(dist) if dist < 999 else "∞")
console.print(gap_table)

# Store for Stage 2
candidate_gaps_export = [
    {"concept_a": a, "concept_b": b, "similarity": float(sim), "graph_distance": dist}
    for sim, dist, a, b in candidate_gaps
]
gaps_path = CONFIG["output_dir"] / "candidate_gaps.json"
gaps_path.write_text(json.dumps(candidate_gaps_export, indent=2))
console.print(f"\n[green]✓ {len(candidate_gaps)} candidate gaps saved → {gaps_path}[/]")


## 9. Visualization

Two renders:
- **pyvis** — force-directed interactive HTML (best for exploration). Red dashed edges = bridge connections.
- **plotly** — in-notebook static/interactive. Node size = betweenness centrality.

In [None]:
from athanor.viz import GraphVisualizer
from IPython.display import IFrame, display

viz = GraphVisualizer(output_dir=CONFIG["output_dir"])

# ── pyvis interactive HTML ────────────────────────────────────────────────────
html_path = viz.pyvis_html(
    concept_graph,
    filename="concept_graph.html",
    highlight_sparse=True,
)
console.print(f"[bold green]✓ Interactive graph saved → {html_path}[/]")

# Render inline in notebook (works in classic Jupyter; in VS Code open the file)
display(IFrame(str(html_path), width="100%", height="800px"))


In [None]:
# ── Plotly in-notebook figure ─────────────────────────────────────────────────
fig = viz.plotly_figure(concept_graph)
fig.show()

# Also export as standalone HTML for sharing
plotly_path = CONFIG["output_dir"] / "concept_graph_plotly.html"
fig.write_html(str(plotly_path))
console.print(f"[green]✓ Plotly graph saved → {plotly_path}[/]")


## Stage 1 Complete ✓

**Outputs produced:**

| File | Contents |
|------|----------|
| `data/raw/arxiv_*.json` | Cached paper metadata |
| `data/raw/extractions_cache.json` | Claude concept extractions per paper |
| `outputs/graphs/concept_graph.json` | Full merged concept graph (Pydantic model) |
| `outputs/graphs/candidate_gaps.json` | Embedding-near / graph-far pairs → Stage 2 input |
| `outputs/graphs/concept_graph.html` | Interactive pyvis visualization |
| `outputs/graphs/concept_graph_plotly.html` | Plotly visualization |

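**To change domain:** edit the `CONFIG` block in §2, delete the cache files (`data/raw/arxiv_*.json` and `data/raw/extractions_cache.json`), then re-run the notebook. If `extractions_cache.json` is left in place, §6 reloads the previous domain's extractions.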

---

**Stage 2** will read `candidate_gaps.json` and ask Claude: *"What research question does this gap imply, and why hasn't it been answered?"*
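
As a rough sketch of how Stage 2 might consume `candidate_gaps.json`, the snippet below turns gap records into prompts. The record keys match what §8 writes. `build_gap_prompts` and the prompt wording around the quoted question are hypothetical, and the sample records are made-up values for illustration, not pipeline output.

```python
import json

QUESTION = "What research question does this gap imply, and why hasn't it been answered?"

def build_gap_prompts(gaps_json: str, top_k: int = 3) -> list:
    """Build one prompt per gap, taking the top_k most similar pairs."""
    gaps = json.loads(gaps_json)
    gaps.sort(key=lambda g: g["similarity"], reverse=True)
    prompts = []
    for gap in gaps[:top_k]:
        # 999 is the sentinel Stage 1 writes for pairs with no connecting path.
        if gap["graph_distance"] >= 999:
            distance = "no path"
        else:
            distance = f"{gap['graph_distance']} hops"
        prompts.append(
            f"Concepts '{gap['concept_a']}' and '{gap['concept_b']}' are semantically similar "
            f"(cosine {gap['similarity']:.2f}) but far apart in the concept graph ({distance}). "
            + QUESTION
        )
    return prompts

# Made-up sample records in the candidate_gaps.json shape.
sample = json.dumps([
    {"concept_a": "channel capacity", "concept_b": "polar codes",
     "similarity": 0.52, "graph_distance": 3},
    {"concept_a": "rate-distortion theory", "concept_b": "information bottleneck",
     "similarity": 0.71, "graph_distance": 999},
])
prompts = build_gap_prompts(sample, top_k=1)
print(prompts[0])
```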

**Stage 3** will take the best questions and propose tractable experiment designs — computational first, flagging what needs wet lab.
