# Local HF Pipeline - End-to-End Test

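This notebook validates that the **fully local** KohakuRAG pipeline works
without any hosted API or network calls once the model weights are cached
locally (the first load of an uncached model downloads it). It tests: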

1. Local embeddings (`LocalHFEmbeddingModel` via sentence-transformers)
2. Local LLM chat (`HuggingFaceLocalChatModel` via transformers)
3. Full RAG pipeline: index documents, retrieve, and answer

**Prerequisites:**
- Kernel: `kohaku-gb10` (or your project venv)
- Dependencies installed: `pip install -r local_requirements.txt`
- Vendored packages installed: `pip install -e vendor/KohakuVault && pip install -e vendor/KohakuRAG`

## Step 1 - Verify imports

In [None]:
import torch
import transformers
import sentence_transformers

print(f"torch:                {torch.__version__}")
print(f"transformers:         {transformers.__version__}")
print(f"sentence-transformers: {sentence_transformers.__version__}")
print(f"CUDA available:       {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:                  {torch.cuda.get_device_name(0)}")
print()

import kohakurag
import kohakuvault
print(f"kohakurag:  {kohakurag.__file__}")
print(f"kohakuvault: {kohakuvault.__file__}")
print("\nAll imports OK")

## Step 2 - Test local embeddings

In [None]:
from kohakurag.embeddings import LocalHFEmbeddingModel

# Use a small, fast model for testing
embedder = LocalHFEmbeddingModel(model_name="BAAI/bge-base-en-v1.5")
print("Embedding model loaded: BAAI/bge-base-en-v1.5")
print(f"Embedding dimension:    {embedder.dimension}")

In [None]:
import numpy as np

test_texts = [
    "Solar panels convert sunlight into electricity.",
    "Photovoltaic cells generate power from solar radiation.",
    "The capital of France is Paris.",
]

vecs = await embedder.embed(test_texts)
print(f"Embedding shape: {vecs.shape}")
print(f"Dtype:           {vecs.dtype}")

# Dot product equals cosine similarity here because the vectors are assumed to be unit-length
sim_01 = float(np.dot(vecs[0], vecs[1]))
sim_02 = float(np.dot(vecs[0], vecs[2]))
print(f"\nSimilarity (solar vs photovoltaic): {sim_01:.4f}  (should be high)")
print(f"Similarity (solar vs Paris):         {sim_02:.4f}  (should be low)")
assert sim_01 > sim_02, "Semantic similarity check failed!"
print("\nEmbedding sanity check PASSED")
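
The check above treats a dot product as cosine similarity, which only holds when the embeddings are unit-length. If that is uncertain for a given model, a minimal sketch of an explicit cosine computation might look like the following. `cosine_similarity` is a hypothetical helper written in plain NumPy, not part of kohakurag, and the vectors are toy values rather than real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity that does not assume unit-length inputs."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Toy vectors (not real embeddings)
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])   # same direction as u, different length
w = np.array([-4.0, 3.0])  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Swapping `np.dot(vecs[0], vecs[1])` for `cosine_similarity(vecs[0], vecs[1])` would make the sanity check independent of whether the embedder normalizes its output.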

## Step 3 - Test local LLM chat

This loads a local HF model for generation. The default is `Qwen/Qwen2.5-7B-Instruct`.

**Note:** If this is too large for your GPU, change `LLM_MODEL_ID` to a smaller model
like `Qwen/Qwen2.5-1.5B-Instruct` or `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.
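
To judge whether a model is likely to fit, a rough back-of-the-envelope estimate of the weight size can help. The sketch below is an approximation under stated assumptions: it uses the nominal parameter counts from the model names (the real counts differ somewhat), and it covers the weights only, not the KV cache, activations, or framework overhead. `estimate_weight_gib` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}

def estimate_weight_gib(num_params, dtype):
    """Rough size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# Nominal parameter counts taken from the model names (assumption)
for name, params in [("7B", 7e9), ("1.5B", 1.5e9), ("1.1B", 1.1e9)]:
    print(f"{name}: ~{estimate_weight_gib(params, 'bf16'):.1f} GiB in bf16")
```

Leave headroom beyond this figure, since generation needs additional memory on top of the weights.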

In [None]:
# Configure the LLM model - adjust if needed for your hardware
LLM_MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # change to smaller model if OOM
LLM_DTYPE = "bf16"  # "bf16", "fp16", or "auto"

print(f"Will load: {LLM_MODEL_ID} ({LLM_DTYPE})")

In [None]:
from kohakurag.llm import HuggingFaceLocalChatModel

chat = HuggingFaceLocalChatModel(
    model=LLM_MODEL_ID,
    dtype=LLM_DTYPE,
    max_new_tokens=256,
    temperature=0.0,  # greedy for reproducibility
)
print(f"LLM loaded: {LLM_MODEL_ID}")

In [None]:
response = await chat.complete(
    "What is 2 + 2? Answer with just the number.",
    system_prompt="You are a helpful assistant. Be concise.",
)
print(f"LLM response: {response!r}")
assert "4" in response, f"Expected '4' in response, got: {response}"
print("LLM sanity check PASSED")

## Step 4 - Full RAG pipeline with train_QA.csv

This step loads real WattBot questions from `data/train_QA.csv`, creates
a small document corpus from our sample data, indexes it with proper
hierarchy (document -> paragraph -> sentence), and runs retrieval + QA.

In [None]:
import csv
from pathlib import Path

# Load train_QA.csv
qa_path = Path("../data/train_QA.csv")
if not qa_path.exists():
    qa_path = Path("data/train_QA.csv")  # fallback if running from repo root

with qa_path.open(newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    qa_rows = list(reader)

print(f"Loaded {len(qa_rows)} questions from {qa_path.name}")
print(f"Columns: {list(qa_rows[0].keys())}")
print(f"\nFirst 5 questions:")
for row in qa_rows[:5]:
    print(f"  [{row['id']}] {row['question'][:90]}...")
    print(f"         expected: {row['answer_value']} ({row.get('answer_unit', '')})")

In [None]:
from kohakurag.types import NodeKind, StoredNode
from kohakurag.embeddings import average_embeddings
from kohakurag.pipeline import RAGPipeline
from kohakurag.datastore import InMemoryNodeStore

# Sample documents (sustainable AI topics that overlap with train_QA questions)
documents = [
    {
        "id": "patterson2021",
        "title": "Carbon Emissions and Large Neural Networks",
        "sentences": [
            "Training GPT-3 (175B parameters) was estimated to emit approximately 552 tonnes of CO2.",
            "Training GShard-600B used 24 MWh and produced 4.3 net tCO2e.",
            "Smaller models like Llama-2-7B require roughly 30x less compute.",
            "Techniques such as mixed-precision training and gradient checkpointing can further reduce energy consumption by 20-30%.",
        ],
    },
    {
        "id": "wu2021b",
        "title": "Sustainable AI and Data Center Efficiency",
        "sentences": [
            "Modern data centers consume approximately 1-2% of global electricity.",
            "Hyperscale data centers in 2020 achieved more than 40% higher efficiency compared to traditional data centers.",
            "Google reported a PUE (Power Usage Effectiveness) of 1.10 across its fleet in 2023.",
            "Liquid cooling systems can reduce energy usage by up to 40% compared to traditional air cooling.",
        ],
    },
    {
        "id": "li2025b",
        "title": "Water Consumption of AI Systems",
        "sentences": [
            "GPT-3 needs to drink a 500ml bottle of water for roughly 10 to 50 medium-length responses.",
            "The estimated total operational water consumption for training GPT-3 in Microsoft's U.S. data centers was 5.439 million liters.",
            "Microsoft committed to being carbon negative by 2030.",
            "Azure data centers in Sweden run on 100% renewable energy.",
        ],
    },
    {
        "id": "strubell2019",
        "title": "Energy and Policy Considerations for Deep Learning",
        "sentences": [
            "Authors should report training time and computational resources required for reproducibility.",
            "Tracking the runtime of a training job is an important step for estimating compute cost in GPU-based or cloud environments.",
            "The financial cost of training a large transformer model can exceed $1 million.",
        ],
    },
]

# Build hierarchical nodes: document -> paragraph -> sentence
# The pipeline expects parent nodes to exist when walking the hierarchy
nodes = []

for doc in documents:
    doc_id = doc["id"]
    
    # Embed all sentences at once
    sent_vecs = await embedder.embed(doc["sentences"])
    
    # Create sentence nodes
    sent_node_ids = []
    for s_idx, (sent, vec) in enumerate(zip(doc["sentences"], sent_vecs)):
        sent_id = f"{doc_id}:p0:s{s_idx}"
        sent_node_ids.append(sent_id)
        nodes.append(StoredNode(
            node_id=sent_id,
            parent_id=f"{doc_id}:p0",
            kind=NodeKind.SENTENCE,
            title=doc["title"],
            text=sent,
            metadata={"document_id": doc_id},
            embedding=vec,
            child_ids=[],
        ))
    
    # Create paragraph node (parent of sentences) with averaged embedding
    para_vec = average_embeddings([v for v in sent_vecs])
    nodes.append(StoredNode(
        node_id=f"{doc_id}:p0",
        parent_id=doc_id,
        kind=NodeKind.PARAGRAPH,
        title=doc["title"],
        text=" ".join(doc["sentences"]),
        metadata={"document_id": doc_id},
        embedding=para_vec,
        child_ids=sent_node_ids,
    ))
    
    # Create document node (root) with averaged embedding
    nodes.append(StoredNode(
        node_id=doc_id,
        parent_id=None,
        kind=NodeKind.DOCUMENT,
        title=doc["title"],
        text=doc["title"],
        metadata={"document_id": doc_id},
        embedding=para_vec,  # same as paragraph for single-paragraph docs
        child_ids=[f"{doc_id}:p0"],
    ))

# Create in-memory store and index
store = InMemoryNodeStore()
await store.upsert_nodes(nodes)
print(f"Indexed {len(nodes)} nodes ({len(documents)} docs) into in-memory store")
print(f"  Hierarchy: document -> paragraph -> sentences")
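
Because the pipeline expects parent nodes to exist when it walks the hierarchy, it can be worth checking the links before indexing. The following is a minimal sketch of such a check. `SimpleNode` is a stand-in carrying only the linkage fields used above (`node_id`, `parent_id`, `child_ids`), and `find_broken_links` is a hypothetical helper, not a kohakurag function.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SimpleNode:
    """Stand-in for StoredNode with only the linkage fields."""
    node_id: str
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)

def find_broken_links(nodes):
    """Return a list of problems; an empty list means every link resolves."""
    by_id = {n.node_id: n for n in nodes}
    problems = []
    for n in nodes:
        if n.parent_id is not None:
            parent = by_id.get(n.parent_id)
            if parent is None:
                problems.append(f"{n.node_id}: missing parent {n.parent_id}")
            elif n.node_id not in parent.child_ids:
                problems.append(f"{n.node_id}: not listed under {n.parent_id}")
        for cid in n.child_ids:
            child = by_id.get(cid)
            if child is None:
                problems.append(f"{n.node_id}: missing child {cid}")
            elif child.parent_id != n.node_id:
                problems.append(f"{cid}: parent_id does not point to {n.node_id}")
    return problems

good = [
    SimpleNode("doc1", None, ["doc1:p0"]),
    SimpleNode("doc1:p0", "doc1", ["doc1:p0:s0"]),
    SimpleNode("doc1:p0:s0", "doc1:p0"),
]
bad = good[:2]  # sentence node dropped, so the paragraph points at nothing

print(find_broken_links(good))
print(find_broken_links(bad))
```

Since the check only reads `node_id`, `parent_id`, and `child_ids`, the same function should also accept the `nodes` list built above, assuming `StoredNode` exposes those constructor arguments as attributes.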

In [None]:
# Assemble pipeline with local components
pipeline = RAGPipeline(
    store=store,
    embedder=embedder,
    chat_model=chat,
    top_k=3,
)
print("Pipeline assembled (local embedder + local LLM + in-memory store)")

In [None]:
# Test retrieval with a real WattBot question
question = qa_rows[0]["question"]  # First question from train_QA.csv

result = await pipeline.retrieve(question, top_k=3)
print(f"Question: {question}")
print(f"Retrieved {len(result.matches)} matches:\n")
for i, match in enumerate(result.matches):
    print(f"  [{i+1}] score={match.score:.4f}  node={match.node.node_id}")
    print(f"      {match.node.text[:120]}...\n")
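
For intuition about what the scores mean, here is a minimal brute-force top-k sketch in plain NumPy. How `RAGPipeline.retrieve` ranks nodes internally is not shown in this notebook, so treat this as an illustration of similarity ranking in general rather than the library's method. `top_k_by_dot` is a hypothetical helper and the vectors are toy values.

```python
import numpy as np

def top_k_by_dot(query_vec, matrix, k):
    """Return (row index, score) pairs for the k rows with the highest dot product."""
    scores = matrix @ query_vec
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for node embeddings
matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
query = np.array([0.0, 1.0])

print(top_k_by_dot(query, matrix, k=2))
```

With unit-length vectors the dot product is the cosine similarity, so higher scores indicate closer matches.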

In [None]:
# Test full QA (retrieve + generate)
answer = await pipeline.answer(question)

print(f"Question: {answer['question']}\n")
print(f"Expected: {qa_rows[0]['answer_value']}")
print(f"\nResponse:\n{answer['response']}")

## Step 5 - Structured QA (JSON output)

This tests the `run_qa` method with the same prompt templates used in production.

In [None]:
import json

system_prompt = (
    "You must answer strictly based on the provided context snippets. "
    "Do NOT use external knowledge. If the context does not support an answer, "
    "output 'is_blank' for answer_value. Respond in JSON with keys: "
    "explanation, answer, answer_value, ref_id."
)

user_template = """Question: {question}

Context:
{context}

Additional info: {additional_info_json}

Return STRICT JSON:
- explanation: 1-2 sentences
- answer: short answer
- answer_value: numeric/categorical value or "is_blank"
- ref_id: list of document ids used

JSON Answer:"""

# Use a question from train_QA that should match our sample docs
# q009: "What were the net CO2e emissions from training the GShard-600B model?"
gshard_row = next(r for r in qa_rows if "GShard" in r["question"])

structured_result = await pipeline.run_qa(
    question=gshard_row["question"],
    system_prompt=system_prompt,
    user_template=user_template,
    additional_info={"answer_unit": gshard_row.get("answer_unit", "")},
    top_k=3,
)

print(f"Question:     {gshard_row['question']}")
print(f"Expected:     {gshard_row['answer_value']} ({gshard_row.get('answer_unit', '')})")
print(f"Answer:       {structured_result.answer.answer}")
print(f"Answer value: {structured_result.answer.answer_value}")
print(f"Ref IDs:      {structured_result.answer.ref_id}")
print(f"Explanation:  {structured_result.answer.explanation}")
print(f"\nRaw LLM output:\n{structured_result.raw_response[:500]}")
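
Small local models often wrap the JSON in extra prose or a code fence. When the structured fields above come back empty, inspecting `raw_response` with a tolerant parser can help diagnose why. The sketch below is one possible approach using only the standard library. `parse_json_answer` is a hypothetical helper, the sample string is made up for illustration, and this is not necessarily how `run_qa` parses output internally.

```python
import json

REQUIRED_KEYS = ("explanation", "answer", "answer_value", "ref_id")

def parse_json_answer(raw):
    """Return the first JSON object in `raw` that has all required keys, else None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and all(k in obj for k in REQUIRED_KEYS):
            return obj
        start = raw.find("{", start + 1)
    return None

# Made-up model output with surrounding chatter
sample = (
    "Sure! Here is the answer:\n"
    '{"explanation": "Stated in the context.", "answer": "4.3 tCO2e", '
    '"answer_value": "4.3", "ref_id": ["patterson2021"]}\n'
    "Let me know if you need more."
)

parsed = parse_json_answer(sample)
print(parsed["answer_value"], parsed["ref_id"])
print(parse_json_answer("no json here"))
```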

## Step 6 - Offline validation

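Confirm the pipeline does not depend on any hosted API by unsetting the API keys
and re-running a query. This rules out the API-backed providers; on its own it
does not prove that no network traffic occurred.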

In [None]:
import os

# Clear any API keys to prove we're fully local
for key in ["OPENROUTER_API_KEY", "OPENAI_API_KEY", "JINA_API_KEY"]:
    os.environ.pop(key, None)

# Re-run a query - should work without any API keys
offline_answer = await pipeline.answer(
    "What percentage of global electricity do data centers use?"
)
print(f"Offline response:\n{offline_answer['response']}")
print("\nOFFLINE VALIDATION PASSED - no API keys needed!")
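
Clearing API keys shows that no key-based service is required, but a stricter check is to make outbound connections fail loudly. The sketch below patches `socket.socket.connect` for the duration of a `with` block. `block_network` is a hypothetical helper, and the guard is an assumption-laden approximation: it does not cover `connect_ex` or network code that bypasses Python's `socket` module.

```python
import socket
from contextlib import contextmanager

@contextmanager
def block_network():
    """Make any socket.connect() call raise while the block is active."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        raise RuntimeError(f"network call attempted: {address!r}")

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the real method, even if the body raised
        socket.socket.connect = original_connect

# Demonstration: the guard raises before any packet is sent
with block_network():
    sock = socket.socket()
    try:
        sock.connect(("127.0.0.1", 9))
    except RuntimeError as exc:
        print(f"blocked: {exc}")
    finally:
        sock.close()
```

In this notebook the query could be wrapped as `with block_network(): offline_answer = await pipeline.answer(...)`. Separately, Hugging Face libraries honor the `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` environment variables, which should be set before the libraries are imported so that only cached files are used.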

## Summary

If every cell in this notebook ran successfully, including Step 7 below, your local HF pipeline is working:

| Component | Provider | Model |
|-----------|----------|-------|
| Embeddings | `LocalHFEmbeddingModel` | `BAAI/bge-base-en-v1.5` |
| LLM | `HuggingFaceLocalChatModel` | Configured above |
| Vector store | `InMemoryNodeStore` | (in-memory, no DB needed) |

**What was tested:**
- Steps 1-3: Individual component verification (imports, embeddings, LLM)
- Step 4: Full RAG pipeline over a small hand-written sample corpus, queried with a real WattBot question
- Step 5: Structured JSON QA (production format)
- Step 6: Offline validation (no API keys)
- Step 7 (below): Batch of WattBot questions from `data/train_QA.csv`, including one expected to return `is_blank`

To use with the full production pipeline (KVaultNodeStore + pre-indexed docs),
set `llm_provider = "hf_local"` and `embedding_model = "hf_local"` in your config.

## Step 7 - Batch test with multiple WattBot questions

Run several questions from `train_QA.csv` through the pipeline, including
ones that should match our sample docs and ones that won't (testing "is_blank").

In [None]:
# Pick questions that test different scenarios
sample_questions = [
    # Should match: hyperscale data centers (wu2021b)
    next(r for r in qa_rows if "Hyperscale" in r["question"]),
    # Should match: water consumption (li2025b)
    next(r for r in qa_rows if "water consumption" in r["question"].lower() and "training GPT-3" in r["question"]),
    # Should match: tracking runtime (strubell2019)
    next(r for r in qa_rows if "runtime" in r["question"].lower() and "training job" in r["question"].lower()),
    # Should NOT match: elephant (tests is_blank)
    next(r for r in qa_rows if "elephant" in r["question"].lower()),
]

print(f"Running {len(sample_questions)} WattBot questions through local pipeline...\n")
print("=" * 70)

for row in sample_questions:
    qid = row["id"]
    question = row["question"]
    expected = row["answer_value"]
    unit = row.get("answer_unit", "")

    result = await pipeline.run_qa(
        question=question,
        system_prompt=system_prompt,
        user_template=user_template,
        additional_info={"answer_unit": unit},
        top_k=3,
    )

    print(f"\n[{qid}] {question[:85]}...")
    print(f"  Expected:  {expected} ({unit})")
    print(f"  Got:       {result.answer.answer_value}")
    print(f"  Answer:    {result.answer.answer}")
    print(f"  Ref IDs:   {result.answer.ref_id}")
    print("-" * 70)

print("\nBatch WattBot test completed (pipeline ran without errors)")
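
The batch loop above only confirms that the pipeline runs; it does not score the answers. A minimal sketch of a comparison helper is shown below. `values_match` is hypothetical, and the relative tolerance is an assumption chosen for illustration, not the official WattBot scoring rule.

```python
def values_match(expected, got, rel_tol=1e-3):
    """Compare numerically when both values parse as floats, else as case-insensitive text."""
    exp_s, got_s = str(expected).strip(), str(got).strip()
    try:
        exp_f, got_f = float(exp_s), float(got_s)
    except ValueError:
        # At least one side is not a number (for example "is_blank")
        return exp_s.lower() == got_s.lower()
    scale = max(abs(exp_f), abs(got_f), 1e-12)
    return abs(exp_f - got_f) <= rel_tol * scale

print(values_match("4.3", 4.30))
print(values_match("is_blank", "IS_BLANK"))
print(values_match("4.3", "5"))
```

Inside the loop, `values_match(expected, result.answer.answer_value)` could be tallied to report how many of the sampled questions were answered as expected.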