## Setup Checklist
- Run `poetry run python -m pipelines.build_lore_embeddings` followed by `poetry run python -m pipelines.build_rag_index`.
- Confirm `data/embeddings/faiss_index.bin`, `data/embeddings/rag_metadata.parquet`, and `data/embeddings/rag_index_meta.json` exist.
- The cells below rely on `rag.query.query_lore`, which reloads the FAISS index for each query.
- Update the narrative notes whenever new embeddings are generated so they reflect the current corpus snapshot.

In [None]:
from __future__ import annotations

import json
from pathlib import Path

import pandas as pd

from pipelines.build_rag_index import DEFAULT_INDEX, LEGACY_INDEX

ARTIFACT_DIR = Path("data/embeddings")
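# Prefer the current index location; fall back to the legacy path only if
# the current file is missing and the legacy file exists.
INDEX_PATH = DEFAULT_INDEX
if not INDEX_PATH.exists() and LEGACY_INDEX.exists():
    INDEX_PATH = LEGACY_INDEX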

METADATA_PATH = ARTIFACT_DIR / "rag_metadata.parquet"
INFO_PATH = ARTIFACT_DIR / "rag_index_meta.json"

if not INDEX_PATH.exists():
    raise FileNotFoundError("Build the RAG index pipeline before running this notebook.")
if not METADATA_PATH.exists():
    raise FileNotFoundError("Missing rag_metadata.parquet – rerun the pipeline.")

metadata_df = pd.read_parquet(METADATA_PATH)
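# The JSON sidecar is treated as optional here; if it is absent, the summary
# below reports the dimension as "unknown".
info = json.loads(INFO_PATH.read_text(encoding="utf-8")) if INFO_PATH.exists() else {}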

summary = pd.DataFrame(
    {
        "metric": ["vectors", "dimension", "unique categories", "unique sources"],
        "value": [
            len(metadata_df),
            info.get("dimension", "unknown"),
            metadata_df["category"].nunique(),
            metadata_df["source"].nunique(),
        ],
    }
)
summary

In [None]:
from IPython.display import display

category_counts = metadata_df["category"].value_counts().head(12)
source_counts = metadata_df["source"].value_counts().head(12)
text_type_counts = metadata_df["text_type"].value_counts().head(12)

display(category_counts.to_frame("count"))
display(source_counts.to_frame("count"))
display(text_type_counts.to_frame("count"))

In [None]:
QUERIES = [
    "Radahn gravity comet",
    "Scarlet rot and decay",
    "fungus mushroom armor",
    "thorns death gloam-eyed queen",
    "messmer flame serpent blood",
]
QUERIES

In [None]:
from rag.query import query_lore

results_by_query: dict[str, list] = {}
for phrase in QUERIES:
    matches = query_lore(phrase, top_k=5)
    results_by_query[phrase] = matches

len(results_by_query)

In [None]:
records: list[dict[str, object]] = []
for phrase, matches in results_by_query.items():
    for rank, match in enumerate(matches, start=1):
        text_excerpt = match.text[:240] + ("…" if len(match.text) > 240 else "")
        records.append(
            {
                "query": phrase,
                "rank": rank,
                "score": round(match.score, 5),
                "category": match.category,
                "text_type": match.text_type,
                "source": match.source,
                "canonical_id": match.canonical_id,
                "text": text_excerpt,
            }
        )

evaluation_df = pd.DataFrame(records)
evaluation_df

## Per-query Observations
### Radahn gravity comet
- Top hits consistently surface Starscourge Radahn lore plus DLC comet references, mixing GitHub API descriptions with Impalers quotes about meteorfall.
- Coverage spans weapon arts (Starscourge Greatsword) and boss summaries, giving multiple perspectives without repeating identical strings.
### Scarlet rot and decay
- Retrieval pivots between Malenia's rot bloom, Caelid geography, and status effect mechanics, which suggests the embeddings link thematically related rot language.
- Redundancy is low; Kaggle DLC blurbs appear once while Impalers excerpts provide longer narrative flavor.
### fungus mushroom armor
- Results elevate Mushroom Crown, Ancestral Follower gear, and the Rotting Mushroom set with both base-game stats and flavor text.
- Fun surprise: a Spirit Ash entry referencing toxic spores emerged, indicating cross-entity context works.
### thorns death gloam-eyed queen
- Pulls multiple Gloam-Eyed Queen notes (black-flame thorns, deathbirds) plus Black Knife weapon descriptions, confirming lore linkage despite sparse canon text.
- GitHub fallback snippets are shorter, so they rank below Impalers prose. That is acceptable but worth tracking.
### messmer flame serpent blood
- Mix of Messmer boss lore, bloodflame serpent incantations, and DLC armor entries demonstrates cross-source coverage (Kaggle DLC + Impalers HTML).
- Novel connection: a great rune blurb referencing serpentine flame cult shows the embeddings cluster related symbolism.
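
The redundancy and cross-source remarks above can be spot-checked with a small helper over the `records` list built earlier. This is a minimal sketch, not part of the pipeline: `source_mix` and `repeated_texts` are hypothetical helper names, the toy rows and their source labels are made up for illustration, and exact-match comparison on the truncated excerpt is only a rough proxy for redundancy.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def source_mix(records: list[dict[str, object]]) -> dict[str, dict[str, int]]:
    """Count how many hits each source contributes to each query."""
    mix: dict[str, Counter] = defaultdict(Counter)
    for row in records:
        mix[str(row["query"])][str(row["source"])] += 1
    return {query: dict(counts) for query, counts in mix.items()}


def repeated_texts(records: list[dict[str, object]]) -> dict[str, int]:
    """Count hits per query whose excerpt repeats an earlier one.

    Comparison is case-insensitive and ignores surrounding whitespace.
    """
    seen: dict[str, Counter] = defaultdict(Counter)
    for row in records:
        seen[str(row["query"])][str(row["text"]).strip().lower()] += 1
    return {
        query: sum(n - 1 for n in counts.values())
        for query, counts in seen.items()
    }


# Toy rows shaped like the notebook's `records`; the values are illustrative.
demo_records = [
    {"query": "q", "rank": 1, "source": "kaggle_dlc", "text": "A"},
    {"query": "q", "rank": 2, "source": "impalers", "text": "a "},
    {"query": "q", "rank": 3, "source": "impalers", "text": "B"},
]
demo_mix = source_mix(demo_records)
demo_repeats = repeated_texts(demo_records)
```

Calling `source_mix(records)` and `repeated_texts(records)` after the evaluation cell would give per-query numbers to set against the notes in this section.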

## Retrieval Assessment
**Strengths**
- High topical precision for all probes; every query returned relevant lore within top-3 without manual filtering.
- Category breadth is evident: items, bosses, incantations, and NPC quotes all surface, so Layer 2 is semantically searchable.
- Impalers excerpts routinely appear by rank 2–3, providing narrative richness that the Community Corpus can build upon.

**Weak Spots**
- Some Kaggle Base descriptions are terse and are occasionally outranked by DLC/Impalers entries even when category filters demand otherwise.
- Text types with formulaic structure (e.g., status effect tooltips) cluster tightly and produce similar scores; a diversity penalty may be needed to avoid near-duplicate results.
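
One way to prototype a diversity penalty is a greedy, MMR-style re-rank over the returned matches. The sketch below rests on several assumptions: `Hit` is a hypothetical stand-in for the objects returned by `query_lore`, a higher `score` is assumed to mean a better match (invert it first if the index returns distances), and token-level Jaccard overlap is swapped in for embedding similarity because the match objects shown here expose text rather than vectors.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a query_lore match (text and score only)."""

    text: str
    score: float


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def diversify(hits: list[Hit], top_k: int = 5, penalty: float = 0.5) -> list[Hit]:
    """Greedily pick hits, penalising overlap with hits already selected.

    Assumes higher score is better. `penalty` scales how strongly textual
    overlap with an already-selected hit lowers a candidate's score.
    """
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    selected: list[Hit] = []
    while remaining and len(selected) < top_k:

        def adjusted(hit: Hit) -> float:
            overlap = max((jaccard(hit.text, s.text) for s in selected), default=0.0)
            return hit.score - penalty * overlap

        best = max(remaining, key=adjusted)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: the second hit nearly duplicates the first, so it is skipped.
demo_hits = [
    Hit("Scarlet rot buildup causes damage", 0.90),
    Hit("Scarlet rot buildup causes damage over time", 0.89),
    Hit("Malenia blooms in Caelid", 0.80),
]
demo_top2 = diversify(demo_hits, top_k=2)
```

The `penalty` value is a tuning knob rather than a recommendation; setting it to `0` reproduces the original score ordering.
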

**Preprocessing & Weighting Ideas**
- Consider boosting `text_type == "quote"` rows and Impalers passages when the query resembles figurative language; down-weight highly mechanical stat blurbs.
- Strip repeated prefix phrases ("Weapon Skill:", "Incantation of the") before embedding to reduce redundant tokens.
- Store token counts in metadata to optionally down-rank very short lines.
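
The prefix-stripping and token-count ideas above could look roughly like this. It is a sketch under stated assumptions: the prefix list contains only the two phrases named above, `token_count` uses whitespace splitting rather than the embedding model's tokenizer, and `length_weight` with its `min_tokens` and `floor` defaults is a hypothetical down-ranking rule, not something the pipeline implements today.

```python
from __future__ import annotations

import re

# Only the two prefixes named in the notes; extend after auditing the corpus.
PREFIXES = ("Weapon Skill:", "Incantation of the")
_PREFIX_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(p) for p in PREFIXES) + r")\s*",
    re.IGNORECASE,
)


def strip_prefix(text: str) -> str:
    """Remove one leading boilerplate prefix, if present."""
    return _PREFIX_RE.sub("", text, count=1)


def token_count(text: str) -> int:
    """Whitespace token count (an approximation of model tokens)."""
    return len(text.split())


def length_weight(n_tokens: int, min_tokens: int = 8, floor: float = 0.5) -> float:
    """Multiplier in [floor, 1.0] that shrinks linearly for short lines."""
    if n_tokens >= min_tokens:
        return 1.0
    return floor + (1.0 - floor) * n_tokens / min_tokens


demo_clean = strip_prefix("Weapon Skill: Starcaller Cry")
demo_weight = length_weight(token_count(demo_clean))
```

If the metadata table carries the raw passage text, the token count could be stored as an extra column at build time and the weight applied at query time, so changing the threshold would not require re-embedding.
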

**Impalers Coverage**
- Verified that at least one Impalers excerpt appears for each query, often providing the most descriptive context, which indicates that Layer 2 successfully merges the HTML sources.
- Future action: tag Impalers passages explicitly so Community Corpus annotators can cite provenance quickly.
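
The per-query Impalers check can be made repeatable with a helper over the `records` list from the evaluation cell. This is a sketch, assuming that Impalers rows can be recognised by a substring of the `source` value; the `needle` default and the toy source labels are hypothetical and should be replaced with whatever labels the metadata actually uses.

```python
from __future__ import annotations


def impalers_coverage(
    records: list[dict[str, object]], needle: str = "impalers"
) -> dict[str, int | None]:
    """Map each query to the best (lowest) rank of an Impalers hit, or None."""
    best_rank: dict[str, int | None] = {}
    for row in records:
        query = str(row["query"])
        best_rank.setdefault(query, None)
        if needle in str(row["source"]).lower():
            rank = int(row["rank"])
            current = best_rank[query]
            if current is None or rank < current:
                best_rank[query] = rank
    return best_rank


# Toy rows shaped like the notebook's `records`; the values are illustrative.
demo_rows = [
    {"query": "q1", "rank": 1, "source": "kaggle_dlc"},
    {"query": "q1", "rank": 2, "source": "Impalers_html"},
    {"query": "q1", "rank": 3, "source": "impalers_html"},
    {"query": "q2", "rank": 1, "source": "github_api"},
]
demo_coverage = impalers_coverage(demo_rows)
missing = [q for q, rank in demo_coverage.items() if rank is None]
```

An empty `missing` list after running `impalers_coverage(records)` would match the coverage statement above; any query listed there is worth re-checking after the next embedding rebuild.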