### Set up the environment

In a terminal in Jupyter lab, run the following commands:

```bash
pwd
```

```output
/workspace2
```

```bash
git clone https://github.com/qualiaMachine/GB10_Tests.git
```

Change to the WattBot RAG directory:
```bash
cd GB10_Tests/WattBot
```

Create cache and temp directories on `/workspace2` (the large disk) and tell uv, pip, Hugging Face, and PyTorch to use them instead of the small home volume.
```bash
# create cache and temp directories on /workspace2 (large disk)
mkdir -p \
  /workspace2/.cache/uv \
  /workspace2/.cache/pip \
  /workspace2/.cache/huggingface \
  /workspace2/.cache/torch \
  /workspace2/.tmp \
  /workspace2/.tmp/uv \
  /workspace2/.tmp/hf

# -------------------------
# System-wide temp (many libs respect these)
# -------------------------
export TMPDIR=/workspace2/.tmp
export TEMP=/workspace2/.tmp
export TMP=/workspace2/.tmp

# -------------------------
# Python / package managers
# -------------------------
export UV_CACHE_DIR=/workspace2/.cache/uv
export PIP_CACHE_DIR=/workspace2/.cache/pip

# (optional) uv can also use its own temp dir if you want separation
# export TMPDIR=/workspace2/.tmp/uv

# -------------------------
# Hugging Face / PyTorch
# -------------------------
export HF_HOME=/workspace2/.cache/huggingface
export TRANSFORMERS_CACHE=/workspace2/.cache/huggingface
export HF_DATASETS_CACHE=/workspace2/.cache/huggingface
export TORCH_HOME=/workspace2/.cache/torch

```


Install uv
```bash
pip install uv
```

Create venv
```bash
uv venv # creates .venv folder
```

Activate venv
```bash
source .venv/bin/activate
```

Install requirements
```bash
uv pip install -r requirements.txt
```

Add venv as named kernel in Jupyter lab
```bash
python -m ipykernel install \
  --user \
  --name wattbot \
  --display-name "wattbot"
```


After waiting a minute, you can select the "wattbot" kernel in Jupyter lab notebooks to use this environment and run the `GB10_Tests/WattBot/02_RunAI_RAG_WattBot.ipynb` notebook.
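
Once the kernel is selected, one quick way to confirm that a notebook is really running inside the `.venv` environment is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch; the helper name `running_in_venv` is an illustrative assumption and is not part of the WattBot repository.

```python
import sys


def running_in_venv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """Return True when the interpreter lives in a virtual environment.

    In a venv, sys.prefix points at the environment, while sys.base_prefix
    points at the Python installation that was used to create it.
    """
    return prefix != base_prefix


print("Interpreter:", sys.executable)
print("Running in a virtual environment:", running_in_venv())
```

If this prints `False` in a notebook, the notebook is probably still attached to the default kernel rather than "wattbot".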

In [1]:
# !rm -rf /workspace2/.cache/huggingface/* # so we don't fill our full 1TB when we re-run this code...


In [2]:
# !mkdir -p /workspace2/.cache/huggingface/hub # add folder back in so we don't install models into home by accident


In [3]:
import os
from pathlib import Path

WS = Path("/workspace2")
TMP_BASE = WS / ".tmp"
CACHE_BASE = WS / ".cache"

# Make sure directories exist
for p in [
    TMP_BASE,
    TMP_BASE / "uv",
    TMP_BASE / "hf",
    CACHE_BASE / "uv",
    CACHE_BASE / "pip",
    CACHE_BASE / "huggingface",
    CACHE_BASE / "torch",
]:
    p.mkdir(parents=True, exist_ok=True)

# System-wide temp
os.environ["TMPDIR"] = str(TMP_BASE)
os.environ["TEMP"]  = str(TMP_BASE)
os.environ["TMP"]   = str(TMP_BASE)

# uv / pip caches
os.environ["UV_CACHE_DIR"]  = str(CACHE_BASE / "uv")
os.environ["PIP_CACHE_DIR"] = str(CACHE_BASE / "pip")

# Hugging Face / PyTorch caches
os.environ["HF_HOME"] = str(CACHE_BASE / "huggingface")
os.environ["TRANSFORMERS_CACHE"] = str(CACHE_BASE / "huggingface")
os.environ["HF_DATASETS_CACHE"]  = str(CACHE_BASE / "huggingface")
os.environ["TORCH_HOME"] = str(CACHE_BASE / "torch")

# Optional: general cache root for other libs
os.environ.setdefault("XDG_CACHE_HOME", str(CACHE_BASE))

print("Set TMPDIR =", os.environ["TMPDIR"])
print("Set HF_HOME =", os.environ["HF_HOME"])


Set TMPDIR = /workspace2/.tmp
Set HF_HOME = /workspace2/.cache/huggingface


Run the cell below to confirm that the temp and cache directories set above all live within `/workspace2`.

In [4]:
import os

keys = [
    "TMPDIR", "TEMP", "TMP",
    "UV_CACHE_DIR", "PIP_CACHE_DIR",
    "HF_HOME", "TRANSFORMERS_CACHE", "HF_DATASETS_CACHE",
    "TORCH_HOME", "XDG_CACHE_HOME",
]

for k in keys:
    print(f"{k:20} =", os.environ.get(k))


TMPDIR               = /workspace2/.tmp
TEMP                 = /workspace2/.tmp
TMP                  = /workspace2/.tmp
UV_CACHE_DIR         = /workspace2/.cache/uv
PIP_CACHE_DIR        = /workspace2/.cache/pip
HF_HOME              = /workspace2/.cache/huggingface
TRANSFORMERS_CACHE   = /workspace2/.cache/huggingface
HF_DATASETS_CACHE    = /workspace2/.cache/huggingface
TORCH_HOME           = /workspace2/.cache/torch
XDG_CACHE_HOME       = /workspace2/.cache
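
Reading the printed list by eye works, but the same check can be automated. The sketch below flags any variable that is unset or points outside a chosen root. The helper name `vars_outside_root` and the choice of keys are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Dict, List, Mapping


def vars_outside_root(env: Mapping[str, str], keys: List[str], root: str) -> Dict[str, str]:
    """Return the variables that are unset or that point outside `root`."""
    root_path = Path(root)
    problems: Dict[str, str] = {}
    for key in keys:
        value = env.get(key)
        if not value:
            problems[key] = "<unset>"
            continue
        path = Path(value)
        # Accept the root itself or any path nested below it
        if path != root_path and root_path not in path.parents:
            problems[key] = value
    return problems


cache_keys = ["TMPDIR", "UV_CACHE_DIR", "PIP_CACHE_DIR", "HF_HOME", "TORCH_HOME"]
bad = vars_outside_root(os.environ, cache_keys, "/workspace2")
print("All cache variables point into /workspace2" if not bad else f"Check these: {bad}")
```

The comparison is purely textual (no symlink resolution), which is usually enough for catching a variable that still points at the home volume.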


---
title: "RAG with a Notebook GPU"
teaching: 30
exercises: 15
---

:::::::::::::::::::::::::::::::::::::: questions

- How can we run a basic Retrieval-Augmented Generation (RAG) pipeline entirely from a single GPU-backed SageMaker notebook?
- How do we go from raw PDFs and CSV files to a searchable embedding space for WattBot documents?
- How can we generate WattBot-style answers (including citations and evidence) that follow the competition's scoring conventions?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Verify that our SageMaker notebook instance has a working GPU and compatible Python environment.
- Load the WattBot metadata and question-answer files from local storage and inspect their structure.
- Download all referenced PDFs from `metadata.csv` and turn them into a collection of text pages with useful metadata attached.
- Implement a simple, explicit "from scratch" text-chunking and embedding pipeline without relying on FAISS or production vector DBs.
- Build a small retrieval helper that finds the most relevant chunks for a question using cosine similarity in embedding space.
- Wire the retriever to a local Qwen 7B-style generator to produce WattBot-format answers (including `answer`, `ref_id`, `ref_url`, and `supporting_materials`).
- Add a second LLM pass that generates short explanations and marks whether the evidence comes from text, figures, tables, or a combination.

::::::::::::::::::::::::::::::::::::::::::::::::



## Working with AWS for RAG Experiments 

In the previous episode, we briefly introduced several approaches for implementing RAG in AWS. Here, we are simply selecting a good GPU instance that can handle whatever RAG system we want to build. This approach is:

- Very easy to set up on the AWS side (just select a GPU instance and you're good to move on)
- Ideal for learning retrieval and generation steps  
- Great for experimentation and debugging  

However, it is **not the most cost-efficient method**. In upcoming episodes we will introduce more efficient and production-aligned GPU strategies, including:

- On-demand GPU tasks  
- Fully managed asynchronous jobs  
- Serverless or streaming LLM inference  
- SageMaker batch transform & RAG pipelines  
- Embedding jobs that run only when needed  

Those techniques bring you closer to best practice for scalable and budget-friendly research computing.

**Remember to Shut Down Your AWS Instance**: GPU notebook instances continue billing **even when idle**.  Always:

- Save your work  
- Shut down or stop the instance when not in use
- Verify the status in the AWS console  

This habit prevents accidental ongoing GPU charges.


## Overview: WattBot RAG on a single notebook GPU

In this episode we build a **minimal but realistic RAG pipeline** from the [WattBot 2025](https://www.kaggle.com/competitions/WattBot2025/overview) challenge that runs entirely from a single GPU-backed SageMaker notebook.

In this episode we will:

1. **Work directly with the WattBot data.**
   - Use `train_QA.csv` and `metadata.csv` from the competition dataset.
   - Download all referenced PDFs (our RAG corpus) using the URLs in `metadata.csv`.
2. **Implement the core RAG steps explicitly in code.**
   - Read PDFs, extract per-page text, and attach document metadata.
   - Chunk text into overlapping segments suitable for embedding.
   - Embed chunks with a sentence-transformer (`thenlper/gte-large`).
   - Implement cosine-similarity search over the embedding matrix.
3. **Connect to a local Qwen-style generator.**
   - Use a quantized 7B model on a GPU-backed instance (e.g., `ml.g5.xlarge`).
   - Construct WattBot-style answers that we can compare against `train_QA.csv`.
4. **Add an explanation pass.**
   - Use an LLM to look at the retrieved evidence, the answer, and citations.
   - Generate a short explanation and label the **evidence type**: `[Quote]`, `[Table]`, `[Figure]`, or `[Mixed]`.


## Notebook + dataset setup

For this episode, we assume you are running on an AWS SageMaker notebook instance with a GPU, such as:

- `ml.g5.xlarge` (recommended) or
- `ml.g4dn.xlarge` (may work with smaller models / more aggressive quantization).

See [Instances for ML](https://carpentries-incubator.github.io/ML_with_AWS_SageMaker/instances-for-ML.html) for further guidance.


### Step 1 – Download `data.zip` locally

We'll use the **WattBot 2025** dataset. Download the workshop data archive to your laptop or desktop:

- Open this link in your browser: https://github.com/carpentries-incubator/ML_with_AWS_SageMaker/blob/main/data/data.zip
- Save `data.zip` somewhere you can find it easily and unzip the folder contents

This archive should include a `data/wattbot/` folder containing:

- `metadata.csv` – index of all WattBot papers.
- `train_QA.csv` – labeled questions + ground truth answers.

### Step 2 – Create a WattBot S3 bucket

In the AWS console:

1. Go to **S3**.
2. Create a new bucket named something like:  
   `teamname-yourname-wattbot`
3. Keep **Block all public access** enabled.
4. (Optional, but recommended) Add tags so we can track costs:  
   - `Project = your-team-name`  
   - `Name = your-name`  
   - `Purpose = RAG-demo`

### Step 3 – Upload the WattBot files to S3

1. In your new bucket, click **Upload**.
2. Drag the `data/wattbot/` folder contents from `data.zip` into the upload dialog.
3. Upload it so that your bucket contains paths like:

   - `metadata.csv`
   - `train_QA.csv`

We'll pull these files from S3 into the notebook in the next steps.


###  Verify GPU and basic environment

In [5]:
!nvidia-smi || echo "No GPU detected – please switch to a GPU-backed instance (e.g., ml.g5.xlarge) before running this notebook."

Fri Jan 23 18:01:03 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:89:00.0 Off |                    0 |
| N/A   50C    P8             34W /  450W |       3MiB /  97887MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 6000 Blac...    On  |   00

In [6]:
# also verify you've selected the conda_pytorch_p310 kernel
import torch
print("torch cuda available:", torch.cuda.is_available())
print("num gpus:", torch.cuda.device_count())

torch cuda available: True
num gpus: 2


## Load data

In [7]:
import os
import json
import time
import math
# from typing import List, Dict, Any

# import boto3
import pandas as pd
import numpy as np


In [8]:
# Local working directory in the notebook instance
local_data_dir = "./data"

print("Local data dir:", local_data_dir)

Local data dir: ./data


In [9]:
# Download metadata.csv and train_QA.csv
metadata_key = "metadata.csv"
train_qa_key = "train_QA.csv"

metadata_path = os.path.join(local_data_dir, metadata_key)
train_qa_path = os.path.join(local_data_dir, train_qa_key)


## Step 1 – Imports, paths, and safe CSV loading

In [10]:
import os
import time
import json
import math
import zipfile
from typing import List, Dict, Any, Tuple

import requests
import numpy as np
import pandas as pd

import torch
from torch import nn

from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer



In [11]:
def smart_read_csv(path: str) -> pd.DataFrame:
    """Try several encodings when reading a CSV file.

    Some CSVs (especially those with special characters in author names or titles)
    may not be valid UTF-8. This helper rotates through common encodings and raises
    the last error only if all fail.
    """
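    # "latin1" and "ISO-8859-1" are the same codec and can decode any byte sequence,
    # so it goes last; otherwise the encodings listed after it would never be tried.
    encodings = ["utf-8", "cp1252", "latin1"]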
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except Exception as e:
            last_error = e
    if last_error is not None:
        raise last_error
    raise RuntimeError(f"Unable to read CSV at {path}")


train_df = smart_read_csv(train_qa_path)
metadata_df = smart_read_csv(metadata_path)

print("train_QA.csv columns:", train_df.columns.tolist())
print("metadata.csv columns:", metadata_df.columns.tolist())
print("\nNumber of training QAs:", len(train_df))
print("Number of metadata rows:", len(metadata_df))

train_df.head(15)

train_QA.csv columns: ['id', 'question', 'answer', 'answer_value', 'answer_unit', 'ref_id', 'ref_url', 'supporting_materials', 'explanation']
metadata.csv columns: ['id', 'type', 'title', 'year', 'citation', 'url']

Number of training QAs: 41
Number of metadata rows: 32


Unnamed: 0,id,question,answer,answer_value,answer_unit,ref_id,ref_url,supporting_materials,explanation
0,q003,What is the name of the benchmark suite presen...,The ML.ENERGY Benchmark,ML.ENERGY Benchmark,is_blank,['chung2025'],['https://arxiv.org/pdf/2505.06371'],"We present the ML.ENERGY Benchmark, a benchmar...",Quote
1,q009,What were the net CO2e emissions from training...,4.3 tCO2e,4.3,tCO2e,['patterson2021'],['https://arxiv.org/pdf/2104.10350'],"""Training GShard-600B used 24 MWh and produced...",Quote
2,q054,What is the model size in gigabytes (GB) for t...,64.7 GB,64.7,GB,['chen2024'],['https://arxiv.org/pdf/2405.01814'],Table 3: Large language models used for evalua...,Table 3
3,q062,What was the total electricity consumption of ...,Unable to answer with confidence based on the ...,is_blank,MWh,is_blank,is_blank,is_blank,is_blank
4,q075,True or False: Hyperscale data centers in 2020...,TRUE,1,is_blank,"['wu2021b','patterson2021']","['https://arxiv.org/abs/2108.06738','https://a...","Wu 2021, body text near Fig. 1: ""…between trad...",The >40% statement is explicit in Wu. Patterso...
5,q078,For every medium-length GPT-3 completion (prom...,0.02 to 0.1 bottles,"[0.02,0.1]",500 mL bottles,['li2025b'],['https://arxiv.org/pdf/2304.03271'],"""Additionally, GPT-3 needs to -drink- (i.e., c...",The paper states that one 500ml bottle is cons...
6,q091,From a sample of 60 papers from top AI confere...,55%,55,percent,['schwartz2019'],['https://arxiv.org/pdf/1907.10597'],"""A large majority of the papers target accurac...",Requires calculation (75-20)
7,q102,True or False: The AI Act makes energy consump...,FALSE,0,is_blank,['ebert2024'],['https://arxiv.org/pdf/2410.06681'],Section 4.3 Transparency: 'Where the Act does ...,Quote
8,q105,What is the projected maximum batch size (in s...,28 samples per batch,28,samples,['xia2024'],['https://arxiv.org/pdf/2408.04693'],Figure 13,Figure
9,q106,What was the approximate speedup in inference ...,2x,2,multiplier,['samsi2024'],['https://arxiv.org/pdf/2310.03003'],"""anywhere from a 2 times (7B) … increase … on ...",Quote
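
The preview shows that `ref_id` and `ref_url` hold list-like strings, `answer_value` mixes numbers, ranges, and free text, and `is_blank` marks missing values. A minimal sketch of how these fields could be parsed follows. It assumes the cells are stored as Python-literal strings exactly as displayed, and the helper names `parse_list_field` and `parse_answer_value` are hypothetical.

```python
import ast
from typing import List, Optional, Union

BLANK = "is_blank"  # sentinel used in train_QA.csv for missing values


def parse_list_field(value: str) -> List[str]:
    """Turn a cell such as "['wu2021b','patterson2021']" into a list of strings."""
    text = str(value).strip()
    if text == BLANK:
        return []
    parsed = ast.literal_eval(text)
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


def parse_answer_value(value: str) -> Optional[Union[float, List[float], str]]:
    """Return a float, a list of floats (for ranges), the raw text, or None if blank."""
    text = str(value).strip()
    if text == BLANK:
        return None
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # free-text answers such as "ML.ENERGY Benchmark"
    if isinstance(parsed, (int, float)):
        return float(parsed)
    if isinstance(parsed, (list, tuple)) and all(isinstance(x, (int, float)) for x in parsed):
        return [float(x) for x in parsed]
    return text


print(parse_list_field("['wu2021b','patterson2021']"))
print(parse_answer_value("[0.02,0.1]"))
print(parse_answer_value("is_blank"))
```

`ast.literal_eval` is used rather than `eval` so that only plain literals are accepted from the CSV.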


## Step 2 – Download all PDFs from `metadata.csv`

Next we will:

1. Read the `url` column from `metadata.csv`.
2. Download each PDF via HTTP and save it locally as `<id>.pdf` under `pdfs/`.
3. Report any failures (e.g., missing or malformed URLs) at the end.
4. Upload zipped version of corpus to S3

In [12]:
import os
from typing import List, Tuple
import pandas as pd
import requests

PDF_DIR = os.path.join(local_data_dir, "pdfs")
os.makedirs(PDF_DIR, exist_ok=True)

def download_all_pdfs_from_urls(
    metadata: pd.DataFrame,
    local_pdf_dir: str,
    url_col: str = "url",
    id_col: str = "id",
    timeout: int = 20,
) -> None:
    """Download all PDFs referenced in `metadata` using their URLs.

    - Saves each file as `<id>.pdf` in `local_pdf_dir`.
    - Skips download if the file already exists.
    - Strips whitespace from the URL.
    - Skips rows with missing or non-HTTP URLs.
    - Prints a short summary of any failures.
    """
    os.makedirs(local_pdf_dir, exist_ok=True)
    errors: List[Tuple[str, str]] = []

    print(f"Saving PDFs to: {local_pdf_dir}\n")

    for _, row in metadata.iterrows():
        doc_id = str(row[id_col]).strip()
        local_path = os.path.join(local_pdf_dir, f"{doc_id}.pdf")

        # Skip if file already exists
        if os.path.exists(local_path):
            print(f"Skipping {doc_id}: already exists")
            continue

        raw_url = row.get(url_col, None)
        if not isinstance(raw_url, str):
            errors.append((doc_id, "URL is not a string"))
            continue

        pdf_url = raw_url.strip()
        if not pdf_url.startswith("http"):
            errors.append((doc_id, f"Invalid URL: {pdf_url!r}"))
            continue

        try:
            print(f"Downloading {doc_id} from {pdf_url} ...")
            resp = requests.get(pdf_url, timeout=timeout, allow_redirects=True)
            resp.raise_for_status()

            content_type = resp.headers.get("Content-Type", "")
            if "pdf" not in content_type.lower() and not pdf_url.lower().endswith(".pdf"):
                print(f"  Warning: Content-Type does not look like PDF ({content_type})")

            with open(local_path, "wb") as f:
                f.write(resp.content)

        except Exception as e:
            print(f"  -> FAILED for {doc_id}: {e}")
            errors.append((doc_id, str(e)))

    if errors:
        print("\nSome PDFs could not be downloaded:")
        for doc_id, err in errors:
            print(f"  {doc_id}: {err}")
    else:
        print("\nAll PDFs downloaded successfully!")


download_all_pdfs_from_urls(
    metadata_df,
    PDF_DIR,
    url_col="url",
    id_col="id",
    timeout=20,
)

len(os.listdir(PDF_DIR))


Saving PDFs to: ./data/pdfs

Skipping amazon2023: already exists
Skipping chen2024: already exists
Skipping chung2025: already exists
Skipping cottier2024: already exists
Skipping dodge2022: already exists
Skipping ebert2024: already exists
Skipping erben2023: already exists
Skipping fernandez2025: already exists
Skipping griggs2024: already exists
Skipping han2024: already exists
Skipping jegham2025: already exists
Skipping khan2025: already exists
Skipping kim2025: already exists
Skipping li2025a: already exists
Skipping li2025b: already exists
Skipping luccioni2023: already exists
Skipping luccioni2024: already exists
Skipping luccioni2025a: already exists
Skipping luccioni2025b: already exists
Skipping luccioni2025c: already exists
Skipping morrison2025: already exists
Skipping patterson2021: already exists
Skipping rubei2025: already exists
Skipping samsi2024: already exists
Skipping schwartz2019: already exists
Skipping shen2024: already exists
Skipping stone2022: already exists


32
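
The download function only warns when the `Content-Type` header does not look like a PDF, so an HTML error page can still end up saved as `<id>.pdf`. One way to catch this afterwards is to look for the `%PDF-` header marker near the start of each file. The sketch below uses hypothetical helper names (`looks_like_pdf`, `find_suspect_pdfs`) and demonstrates them on two throwaway files.

```python
import tempfile
from pathlib import Path
from typing import List


def looks_like_pdf(path: str, probe_bytes: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of the file."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    return b"%PDF-" in head


def find_suspect_pdfs(pdf_dir: str) -> List[str]:
    """List the *.pdf files in `pdf_dir` that do not start like a PDF."""
    return sorted(
        p.name for p in Path(pdf_dir).glob("*.pdf") if not looks_like_pdf(str(p))
    )


# Small demonstration with one valid-looking file and one HTML page saved as .pdf
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.pdf").write_bytes(b"%PDF-1.7\n% toy file")
    Path(tmp, "bad.pdf").write_bytes(b"<html>Not found</html>")
    suspects = find_suspect_pdfs(tmp)

print("Suspect files:", suspects)
```

In the notebook, calling `find_suspect_pdfs(PDF_DIR)` would apply the same check to the downloaded corpus.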

## Step 3 – Turn PDFs into page-level "documents"

Next, we convert each PDF into a list of **page-level records**. Each record stores:

- `text`: page text (as extracted by `pypdf`).
- `doc_id`: short ID from `metadata.csv` (e.g., `strubell2019`).
- `title`: title of the document.
- `url`: original PDF URL.
- `page_num`: zero-based page index.
- `page_label`: label used inside the PDF (often 1-based).

Later, we will **chunk these pages** into smaller overlapping segments for embedding.

### Why we page-chunk first

We split the PDF into **pages before chunking** because pages give us a stable, easy-to-interpret unit.  
This helps with:

- **Keeping metadata** (doc ID, URL, page labels) tied to the text.  
- **Debugging retrieval**: it's much easier to understand what the model saw if we know which page(s) were used.  
- **Cleaning text** before making smaller overlapping chunks.  
- **Flexibility later**: once pages are structured, we can try different chunk sizes or strategies without re-extracting the PDF.

In short: **pages first → then chunks** keeps the workflow cleaner and easier to reason about.

In [13]:
# !pip install pypdf

In [14]:
from pypdf import PdfReader

def pdfs_to_page_docs(metadata: pd.DataFrame, pdf_dir: str) -> List[Dict[str, Any]]:
    """Load each PDF into a list of page-level dictionaries.

    Each dict has keys: text, doc_id, title, url, page_num, page_label, total_pages.
    """
    page_docs: List[Dict[str, Any]] = []

    for _, row in metadata.iterrows():
        doc_id = str(row["id"]).strip()
        title = str(row.get("title", "")).strip()
        url = str(row.get("url", "")).strip()

        pdf_path = os.path.join(pdf_dir, f"{doc_id}.pdf")
        if not os.path.exists(pdf_path):
            print(f"Missing PDF for {doc_id}, skipping.")
            continue

        try:
            reader = PdfReader(pdf_path)
        except Exception as e:
            print(f"Failed to read {pdf_path}: {e}")
            continue

        total_pages = len(reader.pages)
        for i, page in enumerate(reader.pages):
            try:
                text = page.extract_text() or ""
            except Exception as e:
                print(f"Failed to extract text from {doc_id} page {i}: {e}")
                text = ""

            text = text.strip()
            if not text:
                # Still keep the page so we know it exists, but mark it as empty
                text = "[[EMPTY PAGE TEXT – see original PDF for tables/figures]]"

            page_docs.append(
                {
                    "text": text,
                    "doc_id": doc_id,
                    "title": title,
                    "url": url,
                    "page_num": i,
                    "page_label": str(i + 1),
                    "total_pages": total_pages,
                }
            )

    return page_docs


page_docs = pdfs_to_page_docs(metadata_df, PDF_DIR)
print(f"Loaded {len(page_docs)} page-level records from {len(metadata_df)} PDFs.")
page_docs[0] if page_docs else None

Failed to extract text from patterson2021 page 1: 'bbox'
Failed to extract text from patterson2021 page 4: 'bbox'
Loaded 639 page-level records from 32 PDFs.


{'text': 'Amazon \nSustainability \nReport\n2023',
 'doc_id': 'amazon2023',
 'title': '2023 Amazon Sustainability Report',
 'url': 'https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf',
 'page_num': 0,
 'page_label': '1',
 'total_pages': 98}
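
Because pages with no extractable text are kept with a placeholder, it can help to see how many such pages each document has before chunking. The sketch below is one possible summary. The function name `summarize_pages` and the toy records are assumptions; only the `doc_id` and `text` keys and the placeholder prefix come from the code above.

```python
from collections import Counter
from typing import Any, Dict, List

EMPTY_PREFIX = "[[EMPTY PAGE TEXT"  # start of the placeholder used for empty pages


def summarize_pages(page_docs: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count total pages and placeholder (empty) pages per document."""
    pages: Counter = Counter()
    empty: Counter = Counter()
    for rec in page_docs:
        pages[rec["doc_id"]] += 1
        if rec["text"].startswith(EMPTY_PREFIX):
            empty[rec["doc_id"]] += 1
    return {
        doc_id: {"pages": n, "empty_pages": empty[doc_id]}
        for doc_id, n in pages.items()
    }


# Toy records standing in for the real `page_docs` list
toy_pages = [
    {"doc_id": "doc_a", "text": "Some extracted text"},
    {"doc_id": "doc_a", "text": "[[EMPTY PAGE TEXT: placeholder]]"},
    {"doc_id": "doc_b", "text": "More text"},
]
summary = summarize_pages(toy_pages)
for doc_id, stats in summary.items():
    print(doc_id, stats)
```

A document with many empty pages is a hint that its content is mostly scanned images, tables, or figures that plain text extraction cannot see.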

## Step 4 ‚Äì Simple, explicit text chunking

RAG systems typically break documents into **chunks** so that:

- Each chunk is long enough to carry meaningful context.
- No chunk is so long that it blows up the embedding/LLM context window.

For this workshop we will implement a **simple sliding-window chunker** that operates on characters:

- `chunk_size_chars`: maximum characters per chunk (e.g., 1,000–1,500).
- `chunk_overlap_chars`: overlap between consecutive chunks (e.g., 200).

In your own work, you may wish to plug in more sophisticated *semantic chunking* methods (e.g., splitting on headings, section titles, or sentence boundaries). For now, we'll keep the implementation explicit and easy to debug.

In [15]:
def split_text_into_chunks(
    text: str,
    chunk_size_chars: int = 1200,
    chunk_overlap_chars: int = 200,
) -> List[str]:
    """Split `text` into overlapping character-based chunks.

    This is a simple baseline; more advanced versions might:
    - split on sentence boundaries, or
    - merge short paragraphs and respect section headings.
    """
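    if not 0 <= chunk_overlap_chars < chunk_size_chars:
        # Otherwise the window never advances and the loop below cannot finish
        raise ValueError("chunk_overlap_chars must be >= 0 and smaller than chunk_size_chars")

    text = text.strip()
    if not text:
        return []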

    chunks: List[str] = []
    start = 0
    text_len = len(text)

    while start < text_len:
        end = min(start + chunk_size_chars, text_len)
        chunk = text[start:end]
        chunks.append(chunk)
        if end == text_len:
            break
        # Move the window forward, keeping some overlap
        start = end - chunk_overlap_chars

    return chunks


def make_chunked_docs(
    page_docs: List[Dict[str, Any]],
    chunk_size_chars: int = 1200,
    chunk_overlap_chars: int = 200,
) -> List[Dict[str, Any]]:
    """Turn page-level records into smaller overlapping text chunks.

    Each chunk keeps a pointer back to its document and page metadata.
    """
    chunked: List[Dict[str, Any]] = []
    for page in page_docs:
        page_text = page["text"]
        chunks = split_text_into_chunks(
            page_text,
            chunk_size_chars=chunk_size_chars,
            chunk_overlap_chars=chunk_overlap_chars,
        )
        for idx, chunk_text in enumerate(chunks):
            chunked.append(
                {
                    "text": chunk_text,
                    "doc_id": page["doc_id"],
                    "title": page["title"],
                    "url": page["url"],
                    "page_num": page["page_num"],
                    "page_label": page["page_label"],
                    "total_pages": page["total_pages"],
                    "chunk_idx_in_page": idx,
                }
            )
    return chunked
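
With a sliding window, each new chunk starts `chunk_size_chars - chunk_overlap_chars` characters after the previous one, so the number of chunks per page can be predicted from the (stripped) text length. The sketch below works through that arithmetic and cross-checks it against a loop that mirrors the chunker. Both function names are illustrative assumptions.

```python
def expected_chunk_count(
    text_len: int,
    chunk_size_chars: int = 1200,
    chunk_overlap_chars: int = 200,
) -> int:
    """Closed-form chunk count for a text of `text_len` characters."""
    if not 0 <= chunk_overlap_chars < chunk_size_chars:
        raise ValueError("overlap must be >= 0 and smaller than the chunk size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size_chars:
        return 1
    stride = chunk_size_chars - chunk_overlap_chars
    # Integer ceiling of (text_len - overlap) / stride
    return -(-(text_len - chunk_overlap_chars) // stride)


def count_by_sliding_window(text_len: int, size: int, overlap: int) -> int:
    """Brute-force count using the same window logic as the chunker."""
    count, start = 0, 0
    while start < text_len:
        end = min(start + size, text_len)
        count += 1
        if end == text_len:
            break
        start = end - overlap
    return count


for n in (0, 500, 1200, 1201, 3000):
    assert expected_chunk_count(n) == count_by_sliding_window(n, 1200, 200)
    print(f"{n:>5} chars -> {expected_chunk_count(n)} chunk(s)")
```

For example, a 3,000-character page with the defaults yields three chunks covering characters 0–1200, 1000–2200, and 2000–3000.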

In [16]:
import os, json

chunks_s3_key = 'chunks.jsonl'
chunks_jsonl_path = os.path.join(local_data_dir, chunks_s3_key)

def save_chunked_docs_jsonl(path, chunks):
    with open(path, "w", encoding="utf-8") as f:
        for rec in chunks:
            json.dump(rec, f, ensure_ascii=False)
            f.write("\n")


def load_chunked_docs_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# -------------------------------------------------------------------
# Cached chunking logic
# -------------------------------------------------------------------
if os.path.exists(chunks_jsonl_path):
    print(f"Found existing chunk file: {chunks_jsonl_path}")
    chunked_docs = load_chunked_docs_jsonl(chunks_jsonl_path)
    print("Loaded chunked docs:", len(chunked_docs))
else:
    print("No chunk file found. Running chunking step...")
    chunked_docs = make_chunked_docs(page_docs)
    save_chunked_docs_jsonl(chunks_jsonl_path, chunked_docs)
    print(f"Saved chunked docs to {chunks_jsonl_path}")

# Show first chunk
print("Raw pages:", len(page_docs))
print("Chunked docs:", len(chunked_docs))
chunked_docs[0] if chunked_docs else None

Found existing chunk file: ./data/chunks.jsonl
Loaded chunked docs: 2874
Raw pages: 639
Chunked docs: 2874


{'text': 'Amazon \nSustainability \nReport\n2023',
 'doc_id': 'amazon2023',
 'title': '2023 Amazon Sustainability Report',
 'url': 'https://sustainability.aboutamazon.com/2023-amazon-sustainability-report.pdf',
 'page_num': 0,
 'page_label': '1',
 'total_pages': 98,
 'chunk_idx_in_page': 0}
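
One caveat with this caching pattern: the cached file is reused even if the chunk size or overlap is changed later, because the file name does not record those settings. A possible workaround, sketched here under the assumption of a hypothetical helper `chunk_cache_path`, is to encode the parameters in the file name so that each configuration gets its own cache.

```python
import os


def chunk_cache_path(
    data_dir: str,
    chunk_size_chars: int = 1200,
    chunk_overlap_chars: int = 200,
) -> str:
    """Build a cache file path that records the chunking parameters."""
    filename = f"chunks_size{chunk_size_chars}_overlap{chunk_overlap_chars}.jsonl"
    return os.path.join(data_dir, filename)


print(chunk_cache_path("./data"))            # default settings
print(chunk_cache_path("./data", 800, 100))  # a different configuration
```

Deleting the cached file before re-running the cell achieves the same thing with less code.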

## Step 5 ‚Äì Build an embedding matrix

Now we embed each chunk into a vector using a **sentence-transformer** model. For WattBot, a strong and relatively efficient choice is:

### `thenlper/gte-large` (Recommended baseline embedder)

- Size / parameters: ~335M parameters, roughly 1.3–1.4 GB of weights in FP32 (about 0.7 GB in BF16/FP16). Fits cleanly on T4 (16 GB), L4, A10G, A10, A100, and all `g5.*` instances. It offers noticeably better retrieval quality than smaller 100M–150M models without requiring high-end GPU memory, runs comfortably on `g4dn.xlarge`, `g5.xlarge`, or `g5.2xlarge` during workshops, and lets participants see meaningful improvements from chunking and retrieval methods without excessive compute cost.

- Intended use:  General-purpose retrieval and semantic search across academic PDFs, sustainability reports, and mixed-domain long-form documents. Stronger semantic coherence than gte-base or MiniLM, but still lightweight enough for workshop hardware.

- Throughput expectations (rough estimates; actual rates depend on chunk length, batch size, and precision):
  - CPU only: workable for small corpora (<2k chunks) but slow for anything larger.  
  - GPU (T4, L4, A10G, A100) with batch sizes around 64–128:  
    - 20k–40k chunks/min on L4 or A10G  
    - 10k–15k chunks/min on T4  
    - 50k+ chunks/min on A100  
      
We will:

1. Load the embedding model on GPU if available.
2. Encode all chunks in batches.
3. Store the resulting matrix as a `torch.Tensor` or `numpy.ndarray` along with the original `chunked_docs` list.

Later, we'll implement a small retrieval helper that does cosine-similarity search over this matrix; no additional indexing library is required.

In [17]:
import numpy as np
import time
from sentence_transformers import SentenceTransformer

# We'll use a stronger embedding model now that we have a GPU.
# This model has ~335M parameters and benefits from GPU acceleration,
# but is still reasonable to run on a single 24 GB GPU.
embedding_model_id = "thenlper/gte-large"

use_cuda_for_embeddings = torch.cuda.is_available()
print("CUDA available for embeddings:", use_cuda_for_embeddings)

# Single shared embedder object that we can pass around.
embedder = SentenceTransformer(
    embedding_model_id,
    device="cuda" if use_cuda_for_embeddings else "cpu"
)

CUDA available for embeddings: True


In [18]:
import torch, time
print("CUDA:", torch.cuda.is_available())
print("capability:", torch.cuda.get_device_capability(0))
print("name:", torch.cuda.get_device_name(0))

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print("matmul ok, seconds:", time.perf_counter()-t0, "mean:", y.mean().item())


CUDA: True
capability: (12, 0)
name: NVIDIA RTX PRO 6000 Blackwell Server Edition
matmul ok, seconds: 0.023830278776586056 mean: -0.0036396931391209364


In [19]:
print(embedder.device)


cuda:0


In [20]:
import torch
print(torch.cuda.get_arch_list())


['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
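
The architecture list is useful when compared with the device capability printed earlier: a PyTorch build can run on a GPU if it ships compiled kernels for that exact architecture (`sm_XY`), or PTX (`compute_XY`) for the same or an older architecture that the driver can JIT-compile. The sketch below is a simplified heuristic for that comparison, not PyTorch's own logic. The function name `arch_list_covers` is an assumption, and binary compatibility within a major version (for example `sm_80` kernels on an `sm_86` GPU) is deliberately ignored.

```python
from typing import List, Tuple


def arch_list_covers(capability: Tuple[int, int], arch_list: List[str]) -> bool:
    """Heuristic: does the build's arch list cover a GPU with this capability?"""
    major, minor = capability
    target = major * 10 + minor  # (12, 0) -> 120, (8, 6) -> 86
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # skip variants such as "sm_90a"
        if kind == "sm" and int(number) == target:
            return True  # exact compiled kernels are available
        if kind == "compute" and int(number) <= target:
            return True  # PTX that can be JIT-compiled for this GPU
    return False


# Values copied from the outputs above
arch_list = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120", "compute_120"]
print("Capability (12, 0) covered:", arch_list_covers((12, 0), arch_list))
```

In a live session the inputs would come from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`.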


In [21]:
import torch, time
torch.manual_seed(0)

# conv (cuDNN)
x = torch.randn(16, 3, 224, 224, device="cuda")
w = torch.randn(64, 3, 7, 7, device="cuda")
torch.cuda.synchronize()
t0 = time.perf_counter()
y = torch.nn.functional.conv2d(x, w, stride=2, padding=3)
torch.cuda.synchronize()
print("conv ok:", y.mean().item(), "sec:", time.perf_counter()-t0)

# layernorm
x = torch.randn(1024, 4096, device="cuda")
ln = torch.nn.LayerNorm(4096).cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
y = ln(x)
torch.cuda.synchronize()
print("layernorm ok:", y.mean().item(), "sec:", time.perf_counter()-t0)


conv ok: -0.003982101567089558 sec: 0.038530409801751375
layernorm ok: -1.6916601452976465e-10 sec: 0.0034753610379993916


In [22]:
def embed_texts(embedder, docs, batch_size: int = 128) -> np.ndarray:
    """Embed all chunk texts into a dense matrix of shape (N, D)."""
    texts = [d["text"] for d in docs]
    all_embeddings = []
    start = time.time()
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        emb = embedder.encode(
            batch,
            batch_size=batch_size,
            convert_to_numpy=True,
            show_progress_bar=False,
            normalize_embeddings=True,
        )
        all_embeddings.append(emb)
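    # Use the model's own embedding size for the empty case (1024 for gte-large, not 768)
    dim = embedder.get_sentence_embedding_dimension()
    embeddings = np.vstack(all_embeddings) if all_embeddings else np.zeros((0, dim))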
    print(f"Computed embeddings for {len(texts)} chunks in {time.time() - start:.1f}s")
    return embeddings

In [23]:
import time
import torch

# Make sure any previous GPU work is finished
if torch.cuda.is_available():
    torch.cuda.synchronize()

t0 = time.perf_counter()

chunk_embeddings = embed_texts(embedder, chunked_docs)

# Synchronize again so timing includes actual GPU compute
if torch.cuda.is_available():
    torch.cuda.synchronize()

dt = time.perf_counter() - t0

# Save for later metrics export
timing_embedding_s = dt
embedding_batch_size = 128  # matches embed_texts default in this notebook

print("chunk_embeddings shape:", chunk_embeddings.shape)
print(f"Embedding time: {dt:.2f} seconds")
print(f"Docs/sec: {len(chunked_docs) / dt:.2f}")

# Computed embeddings for 2874 chunks in 76.4s
# chunk_embeddings shape: (2874, 1024)
# Embedding time: 76.41 seconds
# Docs/sec: 37.61

Computed embeddings for 2874 chunks in 9.2s
chunk_embeddings shape: (2874, 1024)
Embedding time: 9.17 seconds
Docs/sec: 313.36


### 6. Build a simple retrieval step (cosine similarity)

We are **not** using a heavy vector database in this first episode.

Instead, we:

1. Embed each chunk with `thenlper/gte-large` (done above).
2. Embed each question.
3. Compute cosine similarity between the question embedding and all chunk embeddings.
4. Take the top-k most similar chunks as our retrieved context.

This keeps the retrieval logic completely transparent for teaching, while still matching the *spirit* of
production systems that use FAISS, Chroma, Weaviate, etc.
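
As a hand-checkable illustration of steps 2 to 4, here is a minimal sketch on toy 2-D vectors. The vectors are made up for illustration. It relies on the fact that, for L2-normalized vectors, cosine similarity reduces to a dot product.

```python
import numpy as np

# Toy 2-D "embeddings" (made up for illustration), L2-normalized row-wise.
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)

# Toy query vector, normalized the same way.
query = np.array([2.0, 1.0])
query = query / np.linalg.norm(query)

# For unit-length vectors, cosine similarity is just the dot product.
sims = chunks @ query

# Indices of the k most similar chunks, best first.
k = 2
top_idx = np.argsort(-sims)[:k]
print("top indices:", top_idx.tolist())
print("scores:", np.round(sims[top_idx], 3).tolist())
```

The diagonal chunk `[1, 1]` ranks first because it points in the direction closest to the query; vector length plays no role once everything is normalized.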

#### When might FAISS or a vector database be worth exploring?

For small-to-medium experiments (a few thousand to maybe tens of thousands of chunks), this "plain NumPy + cosine
similarity" approach is usually enough. You might consider FAISS or a full vector DB when:

- **Your corpus gets big**  
  Once you're in the hundreds of thousands to millions of chunks, brute-force similarity search can become slow
  and memory-hungry. FAISS and friends provide *approximate nearest neighbor* search that scales much better.

- **You need low-latency, repeated queries**  
  If many users (or a web app) will hit your RAG system concurrently, you'll want:
  - fast indexing,
  - efficient caching, and
  - sub-second query latency.  
  Vector DBs are designed for this use case.

- **You need rich filtering or metadata search**  
  Vector DBs often support:
  - filtering by metadata (e.g., `paper = "chung2025"`, `year > 2021`),
  - combining keyword + vector search ("hybrid search"),
  - role-based access control and multi-tenant setups.

- **You want to share an index across services**  
  If multiple notebooks, microservices, or teams need to reuse the **same embedding index**, a shared FAISS index or
  hosted vector DB is much easier to manage than passing around `.npy` files.

- **You need GPU-accelerated or distributed search**  
  FAISS can use GPUs and sharding to speed up search on very large embedding collections. This is overkill for our
  teaching demo (and the Wattbot project in general), but very relevant for production-scale systems.

In this episode we deliberately stick with a simple in-memory index so the retrieval step is easy to inspect and
debug. In later episodes (or your own projects), you can **swap out the retrieval layer** for FAISS or a vector DB
without changing the overall RAG architecture: the model still sees "top-k retrieved chunks" as context.
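
To put rough numbers on "memory-hungry", the sketch below estimates the size of a dense embedding matrix. It assumes 4-byte float32 values (the usual NumPy output of an embedder, but an assumption here), and the one-million-chunk corpus is hypothetical.

```python
def embedding_matrix_bytes(n_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold an (n_chunks, dim) dense matrix in memory."""
    return n_chunks * dim * bytes_per_value

# This notebook's corpus: 2,874 chunks of 1,024-dimensional embeddings.
small = embedding_matrix_bytes(2_874, 1_024)

# A hypothetical corpus of one million chunks with the same embedder.
large = embedding_matrix_bytes(1_000_000, 1_024)

print(f"2,874 chunks:     {small / 1e6:.1f} MB")
print(f"1,000,000 chunks: {large / 1e9:.1f} GB")
```

Each brute-force query also touches every row of the matrix, so query cost grows linearly with the number of chunks. That linear growth is what approximate nearest neighbor indexes are meant to avoid.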

In [24]:
from typing import List, Dict, Any

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute cosine similarity between rows of a and rows of b."""
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return np.dot(a_norm, b_norm.T)

def retrieve_top_k(
    query_embedding: np.ndarray,
    chunk_embeddings: np.ndarray,
    chunked_docs: List[Dict[str, Any]],
    k: int = 5,
) -> List[Dict[str, Any]]:
    """Return top-k most similar chunks for a query embedding."""
    if chunk_embeddings.shape[0] == 0:
        return []

    # query_embedding is 1D (D,)
    sims = cosine_similarity_matrix(query_embedding.reshape(1, -1), chunk_embeddings)[0]
    top_idx = np.argsort(-sims)[:k]

    results: List[Dict[str, Any]] = []
    for idx in top_idx:
        doc = chunked_docs[idx]
        results.append(
            {
                "score": float(sims[idx]),
                "text": doc["text"],
                "doc_id": doc["doc_id"],
                "page_num": doc["page_num"],
                "title": doc["title"],
                "url": doc["url"],
            }
        )
    return results

In [25]:
# Quick sanity check for `retrieve_top_k` on the first training question
first_row = train_df.iloc[0]
test_question = first_row["question"]
print("Sample question:", test_question)

test_q_emb = embedder.encode(
    [test_question],
    convert_to_numpy=True,
    normalize_embeddings=True,
)[0]

test_retrieved = retrieve_top_k(
    query_embedding=test_q_emb,
    chunk_embeddings=chunk_embeddings,
    chunked_docs=chunked_docs,
    k=3,
)

print(f"Top {len(test_retrieved)} retrieved chunks:")
for r in test_retrieved:
    snippet = r["text"].replace("\n", " ")
    if len(snippet) > 160:
        snippet = snippet[:160] + "..."
    print(f"- score={r['score']:.3f} | doc_id={r['doc_id']} | page={r['page_num']} | snippet={snippet}")

Sample question: What is the name of the benchmark suite presented in a recent paper for measuring inference energy consumption?
Top 3 retrieved chunks:
- score=0.922 | doc_id=chung2025 | page=0 | snippet=The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization Jae-Won Chung Jeff J. Ma Ruofan Wu Jiachen Liu Oh Jun Kweon Yuxuan Xia Z...
- score=0.921 | doc_id=chung2025 | page=9 | snippet=e generalizability and reproducibility of the results (Section 2.1). The ML.ENERGY Benchmark is the first inference energy benchmark for modern generative AI mo...
- score=0.917 | doc_id=chung2025 | page=9 | snippet=ially, requires direct access to the system under test to physically install the power analyzer, which significantly limits who can run the benchmarks (Section ...


### 7. Load the Qwen model for answer generation

For this episode we use **Qwen2.5-7B-Instruct** via the Hugging Face `transformers` library.

- Parameter count: ~7 billion.
- VRAM needs: ~14–16 GB in bfloat16 (less with 4-bit quantization); fine for `ml.g5.xlarge` or a similar single-GPU instance.
- Intended use here: short, grounded answers plus a normalized `answer_value`.

We will:

1. Call Qwen once to propose an answer and supporting evidence.
2. Call Qwen a **second time** with a smaller prompt to generate a short explanation (<= 100 characters).
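
The VRAM figure above follows from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough weights-only estimate using the 7.61B parameter count that appears later in this notebook. The bytes-per-parameter table is an idealized assumption; real usage is higher because of the KV cache, activations, and quantization metadata.

```python
# Idealized storage cost per parameter (assumption: ignores quantization metadata).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "8bit": 1.0, "4bit": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory for model weights only, in decimal GB."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

n_params_7b = 7_610_000_000  # Qwen2.5-7B-Instruct

for precision in ("bf16", "8bit", "4bit"):
    print(f"{precision:>5}: ~{weight_memory_gb(n_params_7b, precision):.1f} GB")
```

The same function gives a quick feasibility check before trying a larger checkpoint on a given GPU.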

In [26]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

DTYPE_MAP = {
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
    "fp32": torch.float32,
}

def load_textgen_model(
    model_id: str,
    quant: str | None = None,          # None | "8bit" | "4bit"
    dtype: str = "bf16",
    device_map: str | None = None,     # None | "auto"  (kept for compatibility)
    trust_remote_code: bool = True,
    force_gpu: bool = False,           # NEW: force everything onto cuda:0
):
    dtype = DTYPE_MAP[dtype]
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=trust_remote_code)

    quant_cfg = None
    if quant == "8bit":
        quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
    elif quant == "4bit":
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=dtype,  # correct keyword; the misspelled one is silently ignored
        )
    elif quant is None:
        pass
    else:
        raise ValueError("quant must be None, '8bit', or '4bit'")

    # Decide device_map behavior
    # - force_gpu=True: prevent CPU/disk dispatch (fixes your 72B bnb error)
    # - otherwise: honor device_map param (defaulting to "auto" for quantized loads)
    if force_gpu:
        dm = {"": 0}  # put the entire model on cuda:0
    else:
        if quant is not None:
            dm = device_map or "auto"
        else:
            dm = device_map  # usually None

    # Load model
    if dm is not None:
        # accelerate/device_map path → DO NOT pass device= to pipeline
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=dtype,
            quantization_config=quant_cfg,
            device_map=dm,
            trust_remote_code=trust_remote_code,
            low_cpu_mem_usage=True,
        )
        gen_pipe = pipeline("text-generation", model=model, tokenizer=tok)
    else:
        # simple single-GPU path
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=dtype,
            quantization_config=quant_cfg,
            device_map=None,
            trust_remote_code=trust_remote_code,
            low_cpu_mem_usage=True,
        ).to("cuda")
        gen_pipe = pipeline("text-generation", model=model, tokenizer=tok, device=0)

    return tok, model, gen_pipe

def call_chat(tok, gen_pipe, system_prompt: str, user_prompt: str, max_new_tokens: int = 384) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    prompt_text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    out = gen_pipe(prompt_text, max_new_tokens=max_new_tokens, do_sample=False)
    full = out[0]["generated_text"]
    return full[len(prompt_text):].strip()


In [27]:
import os, torch
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())
assert torch.cuda.device_count() == 2, "Expected 2 GPUs visible (check Run:AI GPU allocation and env)."


CUDA_VISIBLE_DEVICES = None
torch.cuda.device_count() = 2


In [28]:
# model_id = "Qwen/Qwen2.5-3B-Instruct"
# model_id = "Qwen/Qwen2.5-7B-Instruct"
# model_id = "Qwen/Qwen2.5-32B-Instruct"
model_id = "Qwen/Qwen2.5-72B-Instruct"
quant="4bit"
t_load_start = time.perf_counter()
tok, model, gen_pipe = load_textgen_model(
    model_id,
    quant=quant,
    dtype="bf16",       
    device_map="balanced",
    force_gpu=False,
)
torch.cuda.synchronize()
t_load = time.perf_counter() - t_load_start
print(f"Model load time: {t_load:.2f} seconds")
# Save for later metrics export
timing_model_load_s = t_load
run_meta = {
    "model_id": model_id,
    "quant": quant,
    "dtype": "bf16",
    "device_map": "balanced",
    "force_gpu": False,
}


print(call_chat(tok, gen_pipe, "You are helpful.", "One sentence: what is RAG?", max_new_tokens=64))


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/37 [00:00<?, ?it/s]

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Model load time: 18.07 seconds
RAG stands for "Retrieval-Augmented Generation," a technique that combines information retrieval with language generation to improve the accuracy and relevance of generated text.


In [29]:
model_id

'Qwen/Qwen2.5-72B-Instruct'

In [30]:
# --- NEW: model footprint helpers (disk + memory) ---
import os, sys, subprocess, threading
from pathlib import Path

def _dir_size_bytes(path: Path) -> int:
    total = 0
    for p in path.rglob("*"):
        try:
            if p.is_file():
                total += p.stat().st_size
        except FileNotFoundError:
            pass
    return total

def _fmt_bytes(n: int) -> str:
    # human readable
    units = ["B","KB","MB","GB","TB"]
    f = float(n)
    for u in units:
        if f < 1024 or u == units[-1]:
            return f"{f:.2f} {u}"
        f /= 1024

# def hf_repo_disk_bytes(repo_id: str) -> int | None:
#     """Best-effort: compute on-disk footprint for a HF repo already present in cache.

#     Returns bytes or None if we can't locate it.
#     """
#     # Respect HF_HOME if set; otherwise default cache locations.
#     hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
#     hub_dir = hf_home / "hub"
#     if not hub_dir.exists():
#         return None

#     # HF cache names repos like: models--org--name
#     safe = repo_id.replace("/", "--")
#     repo_dir = hub_dir / f"models--{safe}"
#     if not repo_dir.exists():
#         # Could be a dataset, or different cache layout
#         return None

#     # The bulk is usually in snapshots/<hash>/
#     # We'll sum the whole repo folder to include refs + blobs + snapshots.
#     return _dir_size_bytes(repo_dir)

def gpu_mem_snapshot_gb() -> dict:
    """Current allocated/reserved on each visible CUDA device."""
    out = {}
    if not torch.cuda.is_available():
        return out
    for d in range(torch.cuda.device_count()):
        out[str(d)] = {
            "allocated_gb": float(torch.cuda.memory_allocated(d) / 1e9),
            "reserved_gb": float(torch.cuda.memory_reserved(d) / 1e9),
        }
    return out

# Record model disk footprint (after from_pretrained has pulled weights)
# model_disk_bytes = hf_repo_disk_bytes(model_id)
# print("Model on-disk footprint (HF cache):", None if model_disk_bytes is None else _fmt_bytes(model_disk_bytes))


# Record baseline GPU memory right after load
model_gpu_mem_after_load = gpu_mem_snapshot_gb()
print("GPU mem after load:", model_gpu_mem_after_load)


run_meta["gpu_mem_after_load"] = model_gpu_mem_after_load


GPU mem after load: {'0': {'allocated_gb': 26.983323648, 'reserved_gb': 50.832867328}, '1': {'allocated_gb': 15.64194048, 'reserved_gb': 25.541214208}}


In [31]:
"""
DISK SIZE (WEIGHTS): CANONICAL QWEN 2.5 ONLY (NO FILE LOOKUPS)

What this code does:
- Estimates on-disk weight size using a hard-coded map of canonical parameter counts
  from the Qwen 2.5 model cards.

What this code explicitly does NOT do (because it failed / is unreliable in our environment):
- It does NOT locate Hugging Face cache snapshots.
- It does NOT sum actual .safetensors/.bin files on disk.
- It does NOT rely on snapshot_download(..., local_files_only=True), because we hit
  LocalEntryNotFoundError / couldn't reliably resolve the cached snapshot folder.

Why we do it this way:
- For standard Hugging Face Qwen repos, the cached weights are typically stored as FP16/BF16.
- bitsandbytes quantization (e.g., '4bit') affects runtime memory, not the canonical files on disk.
- Given we could not reliably find the cached files, the most defensible option is:
      disk_weight_bytes_est = canonical_params_from_docs * bytes_per_param_on_disk

Important caveats:
- This is an estimate of weights only (not tokenizer/config, not multiple snapshots).
- If you use a pre-quantized repo (GPTQ/AWQ), this estimate may overstate disk usage.
"""

# Canonical parameter counts from Qwen 2.5 docs (model cards)
QWEN25_CANONICAL_PARAMS = {
    "Qwen/Qwen2.5-3B-Instruct":  3_090_000_000,   # 3.09B
    "Qwen/Qwen2.5-7B-Instruct":  7_610_000_000,   # 7.61B
    "Qwen/Qwen2.5-32B-Instruct": 32_500_000_000,  # 32.5B
    "Qwen/Qwen2.5-72B-Instruct": 72_700_000_000,  # 72.7B
}

def fmt_bytes(n: int) -> str:
    f = float(n)
    for u in ["B", "KB", "MB", "GB", "TB", "PB"]:
        if f < 1024:
            return f"{f:.2f} {u}"
        f /= 1024
    return f"{f:.2f} EB"

# Standard HF Qwen repos store weights on disk as FP16/BF16
BYTES_PER_PARAM_ON_DISK = 2  # set to 4 only if you know the repo stores FP32 weights

canonical_n_params = QWEN25_CANONICAL_PARAMS[model_id]
disk_bytes_est = canonical_n_params * BYTES_PER_PARAM_ON_DISK

print("Model:", model_id)
print("Canonical params (from docs):", f"{canonical_n_params:,}")
print("Bytes per param on disk (assumed):", BYTES_PER_PARAM_ON_DISK)
print("Estimated disk size (weights only):", fmt_bytes(disk_bytes_est))

run_meta["canonical_n_params"] = canonical_n_params
run_meta["disk_bytes_est"] = disk_bytes_est


Model: Qwen/Qwen2.5-72B-Instruct
Canonical params (from docs): 72,700,000,000
Bytes per param on disk (assumed): 2
Estimated disk size (weights only): 135.41 GB


In [32]:
# Quick sanity check for `call_chat`
test_system_prompt = "You are a concise assistant who answers simple questions clearly."
test_user_prompt = "What is 2 + 2? Answer in one short sentence."
print(call_chat(tok, gen_pipe, test_system_prompt, test_user_prompt))


2 + 2 is 4.


### 8. Build prompts for answers and explanations

We keep the prompts **very explicit**:

- The first call asks Qwen to return JSON with:
  - `answer` (short text),
  - `answer_value` (normalized scalar or category),
  - `ref_id` (one or more doc ids, e.g. `"jegham2025"`),
  - `supporting_materials` (short quote or paraphrase).

- The second call asks Qwen to generate a **single sentence explanation** (<= 100 characters).
  We will prepend an evidence type tag (e.g. `[text]` or `[text+table]`) in code rather than
  asking the model to output it.
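
To make the first format concrete, here is a minimal sketch of a response in the requested shape and a check that it parses. The JSON string is hand-written for illustration and is not real model output. The key names (including `is_blank` and `supporting_materials`) follow the prompt used later in this notebook.

```python
import json

# Hand-written example in the shape the first prompt asks for (not real model output).
raw = """{
  "answer": "The ML.ENERGY Benchmark",
  "answer_value": "ML.ENERGY Benchmark",
  "is_blank": false,
  "ref_id": ["chung2025"],
  "supporting_materials": "The ML.ENERGY Benchmark is the first inference energy benchmark"
}"""

REQUIRED_KEYS = {"answer", "answer_value", "is_blank", "ref_id", "supporting_materials"}

parsed = json.loads(raw)
missing = REQUIRED_KEYS - parsed.keys()

print("missing keys:", sorted(missing))
print("ref_id:", ", ".join(parsed["ref_id"]))
```

Checking for missing keys up front makes it easier to tell a malformed response apart from a genuinely blank answer.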

In [33]:
def format_context_for_prompt(retrieved_chunks):
    """Format retrieved chunks so the LLM can see where text came from."""
    blocks = []
    for r in retrieved_chunks:
        header = f"[DOC {r['doc_id']} | page {r['page_num']} | score {r['score']:.3f}]"
        blocks.append(header + "\n" + r["text"])
    return "\n\n".join(blocks)

explanation_system_prompt = (
    "You are helping annotate how an answer is supported by a research paper. "
    "You will see a question, an answer, and the supporting text used. "
    "Your job is to (1) choose the MAIN type of evidence and "
    "(2) give a VERY short explanation (<= 100 characters). "
    "Valid evidence types are: text, figure, table, text+figure, table+figure, etc. "
    "Respond in the strict format: evidence_type: explanation"
)

def build_explanation_prompt(question, answer, supporting_materials, ref_id_list):
    ref_str = ", ".join(ref_id_list) if ref_id_list else "unknown"
    return f"""Question: {question}

Answer: {answer}

Supporting materials:
{supporting_materials}

Cited document ids: {ref_str}

Remember:
- evidence_type in [text, figure, table, text+figure, table+figure, etc.]
- explanation <= 100 characters
- Format: evidence_type: explanation
"""

### 9. Run over the full WattBot training set

Now we:

1. Iterate over **all** questions in `train_QA.csv`.
2. Retrieve the top-\(k\) chunks for each question.
3. Ask Qwen for an answer proposal (JSON).
4. Derive:
   - `answer` and `answer_value` from the JSON,
   - `answer_unit` **copied directly from the ground truth** (never guessed),
   - `ref_id` from the JSON,
   - `ref_url` by mapping `ref_id` to `metadata.csv`,
   - `supporting_materials` from the JSON,
   - `evidence_type` from the supporting text,
   - `explanation` via a second Qwen call, prefixed with `[evidence_type]`.
5. Save `wattbot_solutions.csv` in the project folder.
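
The `ref_id` to `ref_url` mapping in step 4 is a plain dictionary lookup. A minimal sketch, assuming `metadata.csv` has `id` and `url` columns; the rows and URLs below are placeholders, and `resolve_ref_urls` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for metadata.csv; rows and URLs are placeholders, not real metadata.
metadata_csv = io.StringIO(
    "id,url\n"
    "chung2025,https://example.org/chung2025\n"
    "jegham2025,https://example.org/jegham2025\n"
)

# Build the lookup once: document id -> URL.
docid_to_url = {row["id"].strip(): row["url"] for row in csv.DictReader(metadata_csv)}

def resolve_ref_urls(ref_ids):
    """Map each cited doc id to its URL, using 'is_blank' for unknown ids."""
    return [docid_to_url.get(rid, "is_blank") for rid in ref_ids]

print(resolve_ref_urls(["chung2025", "not_in_metadata"]))
```

This sketch returns one URL per id, whereas the notebook's version keeps only the first id that matches; either convention works as long as it matches the training file.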

In [34]:
import re
from decimal import Decimal

def normalize_answer_value(raw_answer_value, answer_text, answer_unit, is_blank):
    """
    Normalize answer_value into the conventions used by train_QA:
      - 'is_blank' for unanswerable questions
      - plain numeric strings without units, commas, or scientific notation
      - booleans as 1/0
      - categorical strings (e.g., 'ML.ENERGY Benchmark') unchanged
      - ranges like '[0.02,0.1]' preserved as-is
    """
    s = str(raw_answer_value).strip()
    if is_blank:
        return "is_blank"
    if not s or s.lower() == "is_blank":
        return "is_blank"

    # Preserve ranges like [0.02,0.1]
    if s.startswith("[") and s.endswith("]"):
        return s

    lower = s.lower()

    # Booleans -> 1/0
    if lower in {"true", "false"}:
        return "1" if lower == "true" else "0"

    # Pure categorical (no digits) -> leave as-is
    if not any(ch.isdigit() for ch in s):
        return s

    # Try to extract the first numeric token from either the raw string or the answer text
    txt_candidates = [s, str(answer_text)]
    match = None
    for txt in txt_candidates:
        if not txt:
            continue
        match = re.search(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", str(txt).replace(",", ""))
        if match:
            break

    if not match:
        # Fallback: strip obvious formatting characters
        cleaned = s.replace(",", "").replace("%", "").strip()
        return cleaned or "is_blank"

    num_str = match.group(0)

    # Format without scientific notation, trim trailing zeros
    try:
        d = Decimal(num_str)
        normalized = format(d.normalize(), "f")
    except Exception:
        try:
            f = float(num_str)
            normalized = ("%.15f" % f).rstrip("0").rstrip(".")
        except Exception:
            normalized = num_str

    return normalized

### Running the full RAG pipeline for one question

At this point we have all the building blocks we need:

- an **embedder** to turn questions into vectors  
- a **retriever** (`retrieve_top_k`) to grab the most relevant text chunks  
- a **chat helper** (`call_chat`) to talk to Qwen and get JSON back  
- a small post-processing helper (`normalize_answer_value`) to clean numbers

In the next few cells we tie these pieces together. We keep the code split into
small helper functions so learners can follow each step:

1. Retrieve context for a question.  
2. Ask the LLM for an answer, references, and a quote.  
3. Clean and normalize the structured fields (answer_value, ref_id, is_blank).  
4. Ask a second LLM call for a short explanation and evidence type.


### 🔍 Retrieving Relevant Context
This function embeds the question and fetches the top-K most relevant text chunks. It's the first step of the RAG pipeline and determines what evidence the LLM can see.

In [35]:
# Build a lookup from document id -> URL using metadata
docid_to_url = {
    str(row["id"]).strip(): row["url"]
    for _, row in metadata_df.iterrows()
    if isinstance(row.get("url", None), str)
}

def retrieve_context_for_question(question, embedder, chunk_embeddings, chunked_docs, top_k: int = 8):
    """Embed the question and retrieve the top-k most similar chunks."""
    q_emb = embedder.encode(
        [question],
        convert_to_numpy=True,
        normalize_embeddings=True,
    )[0]
    retrieved = retrieve_top_k(q_emb, chunk_embeddings, chunked_docs, k=top_k)
    context = format_context_for_prompt(retrieved)
    return retrieved, context

### First LLM Step: Producing an Answer
Here we prompt the model to:
- Decide if the question is answerable
- Extract a numeric/categorical answer
- Identify supporting evidence
- Select relevant document IDs
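
One detail to watch when reading the answerability flag back from model JSON: a model may emit `"false"` as a string rather than a JSON boolean, and Python treats any non-empty string as truthy. The sketch below shows the pitfall and one possible guard. `coerce_flag` is a hypothetical helper, and the set of accepted "true" spellings is an assumption.

```python
import json

def coerce_flag(value) -> bool:
    """Interpret a JSON-ish truth value; strings are compared by content."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        return value.strip().lower() in {"true", "1", "yes"}
    return bool(value)

# Hand-written example where the flag arrives as a string instead of a boolean.
parsed = json.loads('{"is_blank": "false"}')

print(bool(parsed["is_blank"]))         # True  (any non-empty string is truthy)
print(coerce_flag(parsed["is_blank"]))  # False (compared by content)
```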

In [36]:
def answer_phase_for_question(qid, question, answer_unit, context):
    """
    First LLM call: ask Qwen for an answer, answer_value, is_blank, ref_ids,
    and a short supporting quote. Then normalize these fields.
    """
    # System prompt: what role Qwen should play
    system_prompt_answer = (
        "You answer questions about AI energy, carbon, and water from academic papers.\n"
        "You are given:\n"
        "- a question\n"
        "- retrieved text chunks from the relevant paper(s)\n\n"
        "You must:\n"
        "1. Decide if the question can be answered from the provided context.\n"
        "2. If answerable, extract a concise numeric or short-text answer_value.\n"
        "3. Use the provided answer_unit EXACTLY as given (do NOT invent units).\n"
        "4. Select one or more document ids as ref_id from the supplied chunks.\n"
        "5. Copy a short supporting quote (<= 300 chars) into supporting_materials.\n"
        "6. If the context is insufficient, mark is_blank = true and set all\n"
        "   other fields to 'is_blank' except answer_unit (keep it as provided).\n"
        "Return a JSON object with fields:\n"
        "  answer (string)\n"
        "  answer_value (string)\n"
        "  is_blank (true or false)\n"
        "  ref_id (list of doc_id strings)\n"
        "  supporting_materials (string)\n"
    )

    context_block = context if context.strip() else "[NO CONTEXT FOUND]"

    # User prompt: question + unit hint + retrieved context
    user_prompt_answer = f"""Question: {question}
Expected answer unit: {answer_unit}

Retrieved context:
{context_block}

Return JSON ONLY with keys:
  answer (string)
  answer_value (string)
  is_blank (true or false)
  ref_id (list of doc_id strings)
  supporting_materials (string)
"""

    raw_answer = call_chat(tok, gen_pipe, system_prompt_answer, user_prompt_answer, max_new_tokens=384)

    # Try to parse JSON from the model output
    parsed = {
        "answer": "",
        "answer_value": "is_blank",
        "is_blank": True,
        "ref_id": [],
        "supporting_materials": "is_blank",
    }
    try:
        first_brace = raw_answer.find("{")
        last_brace = raw_answer.rfind("}")
        if first_brace != -1 and last_brace != -1:
            json_str = raw_answer[first_brace : last_brace + 1]
        else:
            json_str = raw_answer
        candidate = json.loads(json_str)
        if isinstance(candidate, dict):
            parsed.update(candidate)
    except Exception as e:
        print(f"JSON parse error for question {qid}: {e}")
        # fall back to defaults in `parsed`

    # Normalize parsed fields
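    # Models sometimes emit the flag as a string; bool("false") would be True.
    is_blank_raw = parsed.get("is_blank", False)
    if isinstance(is_blank_raw, str):
        is_blank = is_blank_raw.strip().lower() in {"true", "1", "yes"}
    else:
        is_blank = bool(is_blank_raw)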
    ref_ids = parsed.get("ref_id") or []
    if isinstance(ref_ids, str):
        ref_ids = [ref_ids]
    ref_ids = [str(r).strip() for r in ref_ids if str(r).strip()]

    answer = str(parsed.get("answer", "")).strip()
    answer_value = str(parsed.get("answer_value", "")).strip() or "is_blank"
    answer_value = normalize_answer_value(
        raw_answer_value=answer_value,
        answer_text=answer,
        answer_unit=answer_unit,
        is_blank=is_blank,
    )
    supporting_materials = str(parsed.get("supporting_materials", "")).strip()

    # If context is empty, force is_blank behaviour but keep a useful diagnostic supporting_materials.
    if not context.strip():
        is_blank = True
        answer = ""
        answer_value = "is_blank"
        ref_ids = []
        supporting_materials = "[NO CONTEXT FOUND]"

    # String formatting for ref_id / ref_url to match training style
    if not ref_ids:
        ref_id_str = "is_blank"
        ref_url_str = "is_blank"
    else:
        ref_id_str = str(ref_ids)

        # Resolve ref_url via metadata
        ref_url = "is_blank"
        for rid in ref_ids:
            if rid in docid_to_url:
                ref_url = docid_to_url[rid]
                break
        if not ref_url:
            ref_url = "is_blank"
        ref_url_str = str([ref_url])

    return answer, answer_value, is_blank, ref_ids, supporting_materials, ref_id_str, ref_url_str


### Second LLM Step: Explanation and Evidence Type
Now that we have an answer, we produce a short explanation and classify the evidence type. This step matches WattBot's expected metadata.
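
The model is asked to reply as `evidence_type: explanation`. As a standalone illustration of handling that format, here is a condensed sketch; `parse_explanation` is a hypothetical helper, and the sample reply is hand-written rather than real model output.

```python
def parse_explanation(raw: str, max_chars: int = 100) -> tuple[str, str]:
    """Split 'evidence_type: explanation' from the first non-empty line."""
    first_line = next((ln.strip() for ln in raw.splitlines() if ln.strip()), "")
    etype, sep, expl = first_line.partition(":")
    if not sep:
        # No colon found: fall back to a generic type and keep the text.
        return "other", first_line[:max_chars]
    return etype.strip().lower() or "other", expl.strip()[:max_chars]

# Hand-written reply; anything after the first non-empty line is ignored.
evidence_type, explanation = parse_explanation(
    "Table: Value read from the results table.\nExtra commentary to ignore."
)
print(f"[{evidence_type}] {explanation}")
```

Truncating to `max_chars` enforces the 100-character limit even when the model ignores it.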

In [37]:
def explanation_phase_for_question(question, answer, supporting_materials, ref_ids, is_blank):
    """
    Second LLM call: ask Qwen to label an evidence_type and provide a short
    explanation tying the answer back to the supporting materials.
    """
    # Even when is_blank=True, we still generate an explanation describing why the
    # supporting materials do not answer the question with confidence.
    expl_user_prompt = build_explanation_prompt(
        question=question,
        answer=answer,
        supporting_materials=supporting_materials,
        ref_id_list=ref_ids,
    )
    raw_expl = call_chat(
        tok, 
        gen_pipe, 
        explanation_system_prompt,
        expl_user_prompt,
        max_new_tokens=64,
    )

    # Take the first non-empty line (we expect something like "text: short reason")
    first_line = ""
    for line in raw_expl.splitlines():
        if line.strip():
            first_line = line.strip()
            break

    if ":" in first_line:
        etype, expl = first_line.split(":", 1)
        evidence_type = etype.strip().lower() or "other"
        explanation = expl.strip()
    else:
        evidence_type = "other"
        explanation = first_line.strip() or "short justification"

    # Keep explanations short for the CSV
    if len(explanation) > 100:
        explanation = explanation[:100]

    return evidence_type, explanation


### Orchestration: `run_single_qa`
This high-level function ties together retrieval, answering, normalization, and explanation into one full pass over a single question.



### Handling unanswerable questions

Some WattBot questions truly **cannot** be answered from the retrieved papers.  
We use a simple hybrid rule to detect these cases:

- We look at the **top retrieval score** (cosine similarity).  
- We also use the LLM's own `is_blank` flag from the first JSON response.  

If **either** of these says the evidence is too weak, we mark the question as unanswerable
and set all relevant fields (`answer_value`, `ref_id`, `supporting_materials`) to `is_blank`.

The `THRESHOLD` inside `run_single_qa` controls how strict this behaviour is:

- lower values → fewer questions marked unanswerable  
- higher values → more questions marked unanswerable  

You can change `THRESHOLD` and then re-run the notebook and `Score.py` to see
how this trade-off affects your final WattBot score.
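
The hybrid rule and the effect of the threshold can be sketched in a few lines. The `(top_score, llm_is_blank)` pairs below are hypothetical, not results from this notebook, and `is_unanswerable` is an illustrative helper name.

```python
def is_unanswerable(top_score: float, llm_is_blank: bool, threshold: float) -> bool:
    """Hybrid rule: weak retrieval OR the LLM's own flag marks the question blank."""
    return llm_is_blank or top_score < threshold

# Hypothetical (top_score, llm_is_blank) pairs for four questions.
cases = [(0.92, False), (0.81, False), (0.74, True), (0.55, False)]

counts = []
for threshold in (0.1, 0.6, 0.85):
    n_blank = sum(is_unanswerable(score, flag, threshold) for score, flag in cases)
    counts.append(n_blank)
    print(f"threshold={threshold:.2f} -> {n_blank}/{len(cases)} marked unanswerable")
```

With similarity scores like the ~0.92 seen in the retrieval sanity check earlier, a threshold of 0.1 will rarely trigger on its own, so at that setting the decision rests mostly on the LLM's flag.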

In [38]:
def run_single_qa(
    row,
    embedder,
    chunk_embeddings,
    chunked_docs,
    top_k: int = 8,
):
    """Run retrieval + Qwen for a single training QA row.

    This is the high-level orchestration function that calls three helpers:

    1. retrieve_context_for_question  -> get relevant text chunks
    2. answer_phase_for_question      -> generate answer from context, returning citations and supporting materials
    3. explanation_phase_for_question -> evidence type + short explanation
    """

    # Confidence threshold for retrieval.
    # If the top similarity score is below this value, we treat the question
    # as unanswerable, even if the LLM tried to produce an answer.
    THRESHOLD = 0.1

    qid = row["id"]
    question = row["question"]
    answer_unit = row.get("answer_unit", "")

    # 1. Retrieval step
    retrieved, context = retrieve_context_for_question(
        question=question,
        embedder=embedder,
        chunk_embeddings=chunk_embeddings,
        chunked_docs=chunked_docs,
        top_k=top_k,
    )

    top_score = retrieved[0]["score"] if retrieved else 0.0

    # 2. Answer + refs + supporting materials (LLM's view)
    (
        answer,
        answer_value,
        is_blank_llm,
        ref_ids,
        supporting_materials,
        ref_id_str,
        ref_url_str,
    ) = answer_phase_for_question(
        qid=qid,
        question=question,
        answer_unit=answer_unit,
        context=context,
    )

    # Hybrid is_blank decision:
    # - if retrieval is weak (top_score < THRESHOLD)
    # - OR the LLM marks is_blank = true
    # then we treat the question as unanswerable.
    is_blank = bool(is_blank_llm) or (top_score < THRESHOLD)

    if is_blank:
        # Always keep a diagnostic trail (retrieved refs + supporting materials + explanation)
        answer = ""
        answer_value = "is_blank"

        # If the answer phase did not return usable refs/materials, fall back to retrieval output.
        if (not ref_ids) and retrieved:
            ref_ids = []
            for r in retrieved:
                rid = str(r.get("doc_id", "")).strip()
                if rid and rid not in ref_ids:
                    ref_ids.append(rid)
        if (not supporting_materials) or (supporting_materials == "is_blank"):
            if retrieved:
                supporting_materials = "\n".join([f"{r.get('doc_id','')}: {str(r.get('text','')).strip()[:300]}" for r in retrieved[:3]])
            else:
                supporting_materials = "[NO SUPPORTING MATERIALS FOUND]"

        # String formatting for ref_id / ref_url
        if not ref_ids:
            ref_id_str = "[]"
            ref_url_str = "[]"
        else:
            ref_id_str = str(ref_ids)
            ref_url = "is_blank"
            for rid in ref_ids:
                if rid in docid_to_url:
                    ref_url = docid_to_url[rid]
                    break
            if not ref_url:
                ref_url = "is_blank"
            ref_url_str = str([ref_url])

    # Always copy answer_unit from train_QA.csv (do NOT let the LLM invent it)
    answer_unit = row.get("answer_unit", "")

    # 3. Explanation + evidence_type
    evidence_type, explanation = explanation_phase_for_question(
        question=question,
        answer=answer,
        supporting_materials=supporting_materials,
        ref_ids=ref_ids,
        is_blank=is_blank,
    )

    return {
        "id": qid,
        "question": question,
        "answer": answer,
        "answer_value": answer_value,
        "answer_unit": answer_unit,
        "is_blank": "true" if is_blank else "false",
        "ref_id": ref_id_str,
        "ref_url": ref_url_str,
        "supporting_materials": supporting_materials,
        "evidence_type": evidence_type,
        "explanation": explanation,
    }
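
The hybrid `is_blank` rule used in `run_single_qa` can be pulled out into a small helper so it is easy to test on its own. This is a minimal sketch: the helper name `hybrid_is_blank` and the threshold value `0.25` are illustrative assumptions, not settings taken from this notebook (the real `THRESHOLD` is defined elsewhere).

```python
def hybrid_is_blank(is_blank_llm, top_score, threshold):
    """Treat a question as unanswerable if the LLM says so OR retrieval is weak."""
    return bool(is_blank_llm) or (top_score < threshold)


# Illustrative threshold only; the notebook's THRESHOLD may differ.
EXAMPLE_THRESHOLD = 0.25

cases = [
    (False, 0.80),  # strong retrieval, LLM answered  -> answerable
    (True, 0.80),   # strong retrieval, LLM abstained -> blank
    (False, 0.10),  # weak retrieval, LLM answered    -> blank
]
for llm_flag, score in cases:
    print(llm_flag, score, hybrid_is_blank(llm_flag, score, EXAMPLE_THRESHOLD))
```

Because the two conditions are combined with OR, the rule is conservative: an answer produced by the LLM is still discarded whenever the top retrieval score falls below the threshold.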


In [39]:
# --- NEW: runtime monitors (CPU RSS, GPU peak memory, GPU energy/power) ---
import time, subprocess, threading
from dataclasses import dataclass

try:
    import psutil
except ImportError:
    !pip -q install psutil
    import psutil

@dataclass
class MonitorResults:
    wall_s: float
    cpu_rss_peak_bytes: int | None
    gpu_peak_mem: dict
    gpu_energy_wh: dict | None
    gpu_power_samples: dict | None

class CPURSSMonitor:
    def __init__(self, interval_s: float = 0.25):
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thr = None
        self.peak_rss = 0
        self._proc = psutil.Process()

    def _loop(self):
        while not self._stop.is_set():
            try:
                rss = self._proc.memory_info().rss
                if rss > self.peak_rss:
                    self.peak_rss = rss
            except Exception:
                pass
            time.sleep(self.interval_s)

    def start(self):
        self.peak_rss = 0
        self._stop.clear()
        self._thr = threading.Thread(target=self._loop, daemon=True)
        self._thr.start()

    def stop(self):
        self._stop.set()
        if self._thr is not None:
            self._thr.join(timeout=2.0)
        return self.peak_rss

def _nvidia_smi_query(fields: list[str]) -> str:
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(fields)}",
        "--format=csv,noheader,nounits",
    ]
    return subprocess.check_output(cmd, text=True).strip()

class GPUPowerSampler:
    """Samples GPU power.draw (W) and integrates to Wh. Works even when energy counters are unavailable."""
    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s
        self._stop = threading.Event()
        self._thr = None
        self.samples = {}  # gpu_index -> list[(t, watts)]

    def _loop(self):
        # Initialize sample lists
        n = torch.cuda.device_count() if torch.cuda.is_available() else 0
        self.samples = {str(i): [] for i in range(n)}
        t0 = time.time()
        last_t = t0
        while not self._stop.is_set():
            try:
                # one line per GPU in index order
                out = _nvidia_smi_query(["index", "power.draw"])
                now = time.time()
                for line in out.splitlines():
                    idx_s, p_s = [x.strip() for x in line.split(",")]
                    self.samples.setdefault(idx_s, []).append((now, float(p_s)))
                last_t = now
            except Exception:
                pass
            time.sleep(self.interval_s)

    def start(self):
        self._stop.clear()
        self._thr = threading.Thread(target=self._loop, daemon=True)
        self._thr.start()

    def stop(self):
        self._stop.set()
        if self._thr is not None:
            self._thr.join(timeout=2.0)
        # Integrate to Wh per GPU
        energy_wh = {}
        for idx, s in self.samples.items():
            if len(s) < 2:
                continue
            e_ws = 0.0
            for (t1, p1), (t2, p2) in zip(s[:-1], s[1:]):
                dt = max(0.0, t2 - t1)
                # trapezoid integration
                e_ws += (p1 + p2) / 2.0 * dt
            energy_wh[idx] = e_ws / 3600.0
        return energy_wh, self.samples

class NVMLTotalEnergy:
    """Uses NVML total_energy_consumption counter if supported (best)."""
    def __init__(self):
        self.ok = False
        self._handles = []
        self._start_mj = {}
        try:
            import pynvml
            self.pynvml = pynvml
            pynvml.nvmlInit()
            n = pynvml.nvmlDeviceGetCount()
            self._handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(n)]
            # Probe support
            for i, h in enumerate(self._handles):
                _ = pynvml.nvmlDeviceGetTotalEnergyConsumption(h)  # mJ
            self.ok = True
        except Exception:
            self.ok = False

    def start(self):
        if not self.ok:
            return
        self._start_mj = {}
        for i, h in enumerate(self._handles):
            self._start_mj[str(i)] = float(self.pynvml.nvmlDeviceGetTotalEnergyConsumption(h))

    def stop(self):
        if not self.ok:
            return None
        out_wh = {}
        for i, h in enumerate(self._handles):
            end_mj = float(self.pynvml.nvmlDeviceGetTotalEnergyConsumption(h))
            start_mj = self._start_mj.get(str(i), end_mj)
            delta_mj = max(0.0, end_mj - start_mj)
            out_wh[str(i)] = (delta_mj / 1000.0) / 3600.0  # mJ->J->Wh
        return out_wh

def start_run_monitors():
    # reset GPU peak memory stats
    if torch.cuda.is_available():
        for d in range(torch.cuda.device_count()):
            torch.cuda.reset_peak_memory_stats(d)

    cpu_mon = CPURSSMonitor(interval_s=0.25)
    cpu_mon.start()

    nvml_energy = NVMLTotalEnergy()
    sampler = None
    if nvml_energy.ok:
        nvml_energy.start()
    else:
        sampler = GPUPowerSampler(interval_s=0.5)
        sampler.start()

    t0 = time.perf_counter()
    return t0, cpu_mon, nvml_energy, sampler

def stop_run_monitors(t0, cpu_mon, nvml_energy, sampler):
    wall_s = time.perf_counter() - t0

    # peak GPU memory (allocated + reserved) per device
    gpu_peak = {}
    if torch.cuda.is_available():
        for d in range(torch.cuda.device_count()):
            gpu_peak[str(d)] = {
                "max_allocated_gb": float(torch.cuda.max_memory_allocated(d) / 1e9),
                "max_reserved_gb": float(torch.cuda.max_memory_reserved(d) / 1e9),
            }

    cpu_peak = cpu_mon.stop()

    if nvml_energy.ok:
        energy_wh = nvml_energy.stop()
        samples = None
    else:
        energy_wh, samples = sampler.stop() if sampler is not None else (None, None)

    return MonitorResults(
        wall_s=wall_s,
        cpu_rss_peak_bytes=cpu_peak,
        gpu_peak_mem=gpu_peak,
        gpu_energy_wh=energy_wh,
        gpu_power_samples=samples,
    )

def _maybe_fmt_bytes(n):
    return None if n is None else _fmt_bytes(int(n))

print("Monitors ready. Note: GPU energy is best-effort; some GPUs/drivers don't expose an energy counter.")


Monitors ready. Note: GPU energy is best-effort; some GPUs/drivers don't expose an energy counter.
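
When no NVML energy counter is available, `GPUPowerSampler.stop()` integrates the sampled power draw with the trapezoid rule and converts watt-seconds to watt-hours. The standalone sketch below reproduces that arithmetic on made-up samples; the function name `trapezoid_energy_wh` and the sample values are illustrative, not measured data.

```python
def trapezoid_energy_wh(samples):
    """samples: list of (timestamp_s, watts) in time order. Returns energy in Wh."""
    energy_ws = 0.0
    for (t1, p1), (t2, p2) in zip(samples[:-1], samples[1:]):
        dt = max(0.0, t2 - t1)
        # Area of one trapezoid: average power over the interval times its duration
        energy_ws += (p1 + p2) / 2.0 * dt
    return energy_ws / 3600.0  # watt-seconds (joules) -> watt-hours


# A constant 100 W draw for 36 s is 3600 Ws, i.e. 1 Wh.
constant = [(0.0, 100.0), (18.0, 100.0), (36.0, 100.0)]
print(trapezoid_energy_wh(constant))

# A linear ramp from 0 W to 200 W over 10 s averages 100 W.
ramp = [(0.0, 0.0), (10.0, 200.0)]
print(trapezoid_energy_wh(ramp))
```

Since the sampler polls at a fixed interval, power spikes shorter than that interval can be missed, so the result is best read as an estimate.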


In [40]:
# -------------------------------------------------------------------
# Run over max_N training questions (this can take a while!)
# -------------------------------------------------------------------
all_results = []
error_log = []
max_N = np.inf

# Start monitors for this full RAG run (wall time, CPU RSS peak, GPU peak mem, GPU energy)
t0, cpu_mon, nvml_energy, power_sampler = start_run_monitors()
if torch.cuda.is_available():
    torch.cuda.synchronize()

for idx, row in train_df.iterrows():
    if idx >= max_N:
        break

    qnum = idx + 1
    question = row["question"]
    expected_value = row.get("answer_value", "")
    expected_refids = row.get("ref_id", "")

    print(f"########################################################################################################")
    print(f"QUESTION {qnum}: {question}")

    res = run_single_qa(
        row=row,
        embedder=embedder,
        chunk_embeddings=chunk_embeddings,
        chunked_docs=chunked_docs,
        top_k=8,
    )

    answer = res["answer"]
    answer_value = res["answer_value"]
    ref_ids = res["ref_id"]
    explanation = res["explanation"]
    print(f"EXPECTED answer_value: {expected_value}")
    print(f"PREDICTED answer_value: {answer_value}")
    print(f"EXPECTED ref_id(s): {expected_refids}")
    print(f"RECOVERED ref_id(s): {ref_ids}")
    print(f"EXPLANATION: {explanation}")
    print(f"ANSWER (raw): {answer}")

    all_results.append(res)

if torch.cuda.is_available():
    torch.cuda.synchronize()

mon = stop_run_monitors(t0, cpu_mon, nvml_energy, power_sampler)
# --- determine energy method used ---
if mon.gpu_energy_wh is None:
    gpu_energy_method = "none"
elif mon.gpu_power_samples is None:
    gpu_energy_method = "nvml_total_energy"
else:
    gpu_energy_method = "power_sampling"

# --- save to metadata ---
run_meta["gpu_energy_wh"] = mon.gpu_energy_wh
run_meta["gpu_energy_method"] = gpu_energy_method
# Save for later metrics export
timing_generation_s = float(mon.wall_s)
print(f"Generation time (total): {timing_generation_s:.2f} seconds")
print(f"Questions/minute: {len(all_results) / (timing_generation_s/60):.4f}")

solutions_df = pd.DataFrame(all_results)

_model_tag = str(model_id).replace("/", "__")

solutions_path = os.path.join(local_data_dir, f"train_solutions_NVIDIA{torch.cuda.device_count()}_{quant}_{_model_tag}.csv")
solutions_df.to_csv(solutions_path, index=False)

print(f"Saved solutions for {len(solutions_df)} questions to: {solutions_path}")

# -------------------------------------------------------------------
# Save run metadata + timing summary (for cross-machine comparisons)
# -------------------------------------------------------------------
import json, sys, platform
import transformers, sentence_transformers

timings_summary = {
    "timing_model_load_s": float(globals().get("timing_model_load_s", float("nan"))),
    "timing_embedding_s": float(globals().get("timing_embedding_s", float("nan"))),
    "timing_generation_s": float(globals().get("timing_generation_s", float("nan"))),
    "num_chunks": int(len(chunked_docs)),
    "num_questions": int(len(all_results)),
    "chunks_per_s": float(len(chunked_docs) / globals().get("timing_embedding_s", float("nan"))),
    "questions_per_s": float(len(all_results) / globals().get("timing_generation_s", float("nan"))) if globals().get("timing_generation_s", 0) else None,

    # --- NEW: footprint + energy ---
    "cpu_rss_peak_bytes": int(mon.cpu_rss_peak_bytes) if mon.cpu_rss_peak_bytes is not None else None,
    "cpu_rss_peak_human": _maybe_fmt_bytes(mon.cpu_rss_peak_bytes),
    "gpu_peak_mem": mon.gpu_peak_mem,                 # per GPU: max_allocated_gb / max_reserved_gb
    "gpu_energy_wh": mon.gpu_energy_wh,               # per GPU (best-effort)
}

run_metadata = {
    **globals().get("run_meta", {}),
    "embedding_model_id": globals().get("embedding_model_id", None),
    "embedding_batch_size": int(globals().get("embedding_batch_size", 128)),
    "top_k": 8,
    "answer_max_new_tokens": 384,
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "torch": torch.__version__,
    "torch_cuda": torch.version.cuda,
    "cuda_capability": str(torch.cuda.get_device_capability(0)) if torch.cuda.is_available() else None,
    "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "transformers": transformers.__version__,
    "sentence_transformers": sentence_transformers.__version__,
}

metrics_out = {
    "run_metadata": run_metadata,
    "timings": timings_summary,
}

timings_path = solutions_path.replace(".csv", "_run_metrics.json")
with open(timings_path, "w") as f:
    json.dump(metrics_out, f, indent=2)

print(f"Saved run metrics to: {timings_path}")
print(f"Number of questions with errors (filled as blank): {len(error_log)}")

solutions_df.head(20)


########################################################################################################
QUESTION 1: What is the name of the benchmark suite presented in a recent paper for measuring inference energy consumption?
EXPECTED answer_value: ML.ENERGY Benchmark
PREDICTED answer_value: ML.ENERGY Benchmark
EXPECTED ref_id(s): ['chung2025']
RECOVERED ref_id(s): ['chung2025']
EXPLANATION: Directly stated in the supporting text.
ANSWER (raw): ML.ENERGY Benchmark
########################################################################################################
QUESTION 2: What were the net CO2e emissions from training the GShard-600B model?
EXPECTED answer_value: 4.3
PREDICTED answer_value: 4.3
EXPECTED ref_id(s): ['patterson2021']
RECOVERED ref_id(s): ['patterson2021']
EXPLANATION: Table 4 shows GShard-600B's emissions as 4.3 tCO2e.
ANSWER (raw): 4.3 tCO2e
########################################################################################################
QUESTION 3: Wha

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


EXPECTED answer_value: is_blank
PREDICTED answer_value: is_blank
EXPECTED ref_id(s): is_blank
RECOVERED ref_id(s): ['is_blank']
EXPLANATION: The provided text does not contain the specific information required to answer the question.
ANSWER (raw): 
########################################################################################################
QUESTION 5: True or False: Hyperscale data centers in 2020 achieved more than 40% higher efficiency compared to traditional data centers.
EXPECTED answer_value: 1
PREDICTED answer_value: 1
EXPECTED ref_id(s): ['wu2021b','patterson2021']
RECOVERED ref_id(s): ['wu2021b']
EXPLANATION: Explanation states hyperscale data centers have over 40% higher efficiency compared to traditional o
ANSWER (raw): True
########################################################################################################
QUESTION 6: For every medium-length GPT-3 completion (prompt= 800 words; response 150 - 300 words), roughly how many 500 mL bottles of wat

Unnamed: 0,id,question,answer,answer_value,answer_unit,is_blank,ref_id,ref_url,supporting_materials,evidence_type,explanation
0,q003,What is the name of the benchmark suite presen...,ML.ENERGY Benchmark,ML.ENERGY Benchmark,is_blank,False,['chung2025'],['https://arxiv.org/pdf/2505.06371'],"We present the ML.ENERGY Benchmark, a benchmar...",text,Directly stated in the supporting text.
1,q009,What were the net CO2e emissions from training...,4.3 tCO2e,4.3,tCO2e,False,['patterson2021'],['https://arxiv.org/pdf/2104.10350'],GShard-600B’s emissions (Table 4) are 4.3 tCO2...,table,Table 4 shows GShard-600B's emissions as 4.3 t...
2,q054,What is the model size in gigabytes (GB) for t...,,is_blank,GB,True,['is_blank'],['is_blank'],samsi2024: generation\nlength of 256. The 7B m...,table,TABLE II provides hardware requirements for LL...
3,q062,What was the total electricity consumption of ...,,is_blank,MWh,True,['is_blank'],['is_blank'],"patterson2021: 46 for TPU v2, 123 fo...",text,The provided text does not contain the specifi...
4,q075,True or False: Hyperscale data centers in 2020...,True,1,is_blank,False,['wu2021b'],['https://arxiv.org/pdf/2108.06738'],"Furthermore, between traditional and highly op...",text,Explanation states hyperscale data centers hav...
5,q078,For every medium-length GPT-3 completion (prom...,10 - 50,10,500 mL bottles,False,['li2025b'],['https://arxiv.org/pdf/2304.03271'],"GPT-3 needs to 'drink' (i.e., consume) a 500ml...",text,Explanation given directly in the supporting t...
6,q091,From a sample of 60 papers from top AI confere...,55,55,percent,False,['schwartz2019'],['https://arxiv.org/pdf/1907.10597'],"As shown in Figure 2, in all conferences we co...",figure,Figure 2 shows 75% of CVPR papers target accur...
7,q102,True or False: The AI Act makes energy consump...,False,0,is_blank,False,['ebert2024'],['https://arxiv.org/pdf/2410.06681'],Where the Act does mandate the disclosure of e...,text,Energy consumption data is restricted to autho...
8,q105,What is the projected maximum batch size (in s...,28,28,samples,False,['xia2024'],['https://arxiv.org/pdf/2408.04693'],"For GPU memory capacities of 100GB and 120GB, ...",text,Explanation from text for 100GB GPU memory cap...
9,q106,What was the approximate speedup in inference ...,2,2,multiplier,False,['samsi2024'],['https://arxiv.org/pdf/2310.03003'],we see anywhere from a 2 times (7B) to a 1.25 ...,text,The text states a 2 times increase in inferenc...


In [41]:
_model_tag

'Qwen__Qwen2.5-72B-Instruct'

In [42]:
run_metadata

{'model_id': 'Qwen/Qwen2.5-72B-Instruct',
 'quant': '4bit',
 'dtype': 'bf16',
 'device_map': 'balanced',
 'force_gpu': False,
 'gpu_mem_after_load': {'0': {'allocated_gb': 26.983323648,
   'reserved_gb': 50.832867328},
  '1': {'allocated_gb': 15.64194048, 'reserved_gb': 25.541214208}},
 'canonical_n_params': 72700000000,
 'disk_bytes_est': 145400000000,
 'gpu_energy_wh': {'0': 39.95516601621872, '1': 28.312784113056484},
 'gpu_energy_method': 'power_sampling',
 'embedding_model_id': 'thenlper/gte-large',
 'embedding_batch_size': 128,
 'top_k': 8,
 'answer_max_new_tokens': 384,
 'python': '3.12.3',
 'platform': 'Linux-5.15.0-164-generic-x86_64-with-glibc2.39',
 'torch': '2.9.1+cu130',
 'torch_cuda': '13.0',
 'cuda_capability': '(12, 0)',
 'gpu_name': 'NVIDIA RTX PRO 6000 Blackwell Server Edition',
 'transformers': '4.57.6',
 'sentence_transformers': '5.2.0'}

### Compare answers to ground truth

WattBot evaluates each prediction using a weighted score that combines three components. Most of the credit (0.75) comes from the `answer_value`, which must match the ground truth after normalization (numeric answers must be within ±0.1% relative tolerance; categorical values must match exactly). An additional 0.15 comes from `ref_id`, where partial credit is given based on the Jaccard overlap between your cited document IDs and the ground-truth set. The final 0.10 comes from correctly marking unanswerable questions: if a question is truly unanswerable, you must set `answer_value`, `ref_id`, and `supporting_materials` to `is_blank`. Any other combination scores zero for this component.

| Component      | Weight | What counts as correct |
|----------------|--------|------------------------|
| answer_value   | 0.75   | Numeric within ±0.1% relative tolerance; categorical exact match; `is_blank` if unanswerable |
| ref_id         | 0.15   | Jaccard overlap with the ground-truth reference set (case-insensitive) |
| is_NA          | 0.10   | Of the truly unanswerable questions, the fraction correctly identified as unanswerable |
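
As a worked example of the weighting and the numeric tolerance, the sketch below combines three component scores into one number and checks two numeric predictions against a ground-truth value. The helper names `weighted_wattbot_score` and `within_rel_tol` are assumptions for illustration; the component inputs are the rounded means printed by the scoring cell later in this notebook.

```python
def weighted_wattbot_score(answer_mean, ref_mean, na_recall):
    """Combine the three components using the weights from the table above."""
    return 0.75 * answer_mean + 0.15 * ref_mean + 0.10 * na_recall


def within_rel_tol(gt, pred, rel_tol=1e-3):
    """True if pred is within rel_tol (0.1% by default) of a non-zero gt."""
    return abs(pred - gt) / abs(gt) <= rel_tol


score = weighted_wattbot_score(0.6585, 0.6829, 1.0)
print(f"{score:.4f}")  # 0.6963

print(within_rel_tol(4.3, 4.304))  # relative error ~0.09%: accepted
print(within_rel_tol(4.3, 4.31))   # relative error ~0.23%: rejected
```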

In [43]:
model_id

'Qwen/Qwen2.5-72B-Instruct'

In [44]:
import pandas as pd
import numpy as np

def _to_bool_flag(x):
    """Convert typical truthy/falsey strings to bool."""
    if isinstance(x, str):
        s = x.strip().lower()
        # s is already lowercased, so only lowercase spellings need to be listed
        if s in {"1", "true", "yes"}:
            return True
        if s in {"0", "false", "no"}:
            return False
    return bool(x)

def _parse_float_or_none(x):
    try:
        return float(str(x).strip())
    except Exception:
        return None

def _answer_value_correct(gt_val, pred_val, rel_tol=1e-3):
    """
    gt_val, pred_val: values from answer_value columns.
    rel_tol = 0.001 => 0.1% relative tolerance.
    """
    gt_str = str(gt_val).strip()
    pred_str = str(pred_val).strip()
    
    # If either is 'is_blank', treat as categorical
    if gt_str.lower() == "is_blank" or pred_str.lower() == "is_blank":
        return gt_str.lower() == pred_str.lower()
    
    gt_num = _parse_float_or_none(gt_val)
    pred_num = _parse_float_or_none(pred_val)
    
    # If both numeric, use relative tolerance
    if gt_num is not None and pred_num is not None:
        if gt_num == 0:
            return abs(pred_num - gt_num) <= rel_tol  # small absolute tolerance around 0
        rel_err = abs(pred_num - gt_num) / max(abs(gt_num), 1e-12)
        return rel_err <= rel_tol
    
    # Otherwise, fall back to normalized string match
    return gt_str.lower() == pred_str.lower()

def _ref_id_jaccard(gt_ref, pred_ref):
    """
    Jaccard overlap between sets of ref_ids.
    Strings may contain semicolon-separated IDs, or 'is_blank'.
    Case-insensitive.
    """
    def to_set(s):
        if s is None:
            return set()
        s = str(s).strip()
        if not s or s.lower() == "is_blank":
            return set()
        parts = [p.strip().lower() for p in s.split(";") if p.strip()]
        return set(parts)
    
    gt_set = to_set(gt_ref)
    pred_set = to_set(pred_ref)
    
    if not gt_set and not pred_set:
        return 1.0
    union = gt_set | pred_set
    if not union:
        return 0.0
    inter = gt_set & pred_set
    return len(inter) / len(union)

def compute_wattbot_score(
    train_qa_path="train_QA.csv",
    preds_path="train_solutions_qwen.csv",
    id_col="id",
    gt_answer_col="answer_value",
    gt_ref_col="ref_id",
    gt_is_na_col="is_NA",   # can also pass "is_blank" or None
    pred_answer_col="answer_value",
    pred_ref_col="ref_id",
    pred_is_na_col=None,    # can pass "is_blank", or leave None to auto
    n_examples=10,          # how many incorrect examples to print
):
    """
    Compare your solutions to train_QA.csv using a WattBot-style score.

    NA logic:
    - If an explicit NA column is found/used (e.g. is_NA), we use it via _to_bool_flag.
    - If you pass gt_is_na_col="is_blank" or pred_is_na_col="is_blank",
      we *derive* NA from answer_value == "is_blank" instead of expecting a real column.
    - If no NA column is available at all, we derive from answer_value == "is_blank".

    Also prints up to `n_examples` rows where the model is not perfect
    (answer_score < 1, ref_id_score < 1, or is_NA_score < 1).
    """
    gt = pd.read_csv(train_qa_path)
    preds = pd.read_csv(preds_path)
    
    # Inner join on id to be strict
    merged = gt.merge(preds, on=id_col, suffixes=("_gt", "_pred"))
    if merged.empty:
        raise ValueError("No overlapping ids between ground truth and predictions.")

    # ----- ground truth NA flags -----
    if gt_is_na_col is not None and gt_is_na_col in merged.columns:
        # Use explicit column (e.g. "is_NA")
        gt_is_na_series = merged[gt_is_na_col].map(_to_bool_flag)
    elif gt_is_na_col is not None and gt_is_na_col.lower() == "is_blank":
        # Special meaning: derive NA from answer_value_gt == "is_blank"
        gt_is_na_series = merged[f"{gt_answer_col}_gt"].astype(str).str.lower().eq("is_blank")
        merged["gt_is_blank_flag"] = gt_is_na_series
    else:
        # Fallback: if we have is_NA or is_blank col, use it; else derive
        if "is_NA" in merged.columns:
            gt_is_na_series = merged["is_NA"].map(_to_bool_flag)
        elif "is_blank" in merged.columns:
            gt_is_na_series = merged["is_blank"].map(_to_bool_flag)
        else:
            gt_is_na_series = merged[f"{gt_answer_col}_gt"].astype(str).str.lower().eq("is_blank")
            merged["gt_is_blank_flag"] = gt_is_na_series

    # ----- prediction NA flags -----
    if pred_is_na_col is not None and pred_is_na_col in merged.columns:
        pred_is_na_series = merged[pred_is_na_col].map(_to_bool_flag)
    elif pred_is_na_col is not None and pred_is_na_col.lower() == "is_blank":
        # Same convention: derive from answer_value_pred
        pred_is_na_series = merged[f"{pred_answer_col}_pred"].astype(str).str.lower().eq("is_blank")
        merged["pred_is_blank_flag"] = pred_is_na_series
    else:
        # Auto-detect or derive if no NA column in preds
        if "is_NA" in merged.columns:
            pred_is_na_series = merged["is_NA"].map(_to_bool_flag)
        elif "is_blank" in merged.columns:
            pred_is_na_series = merged["is_blank"].map(_to_bool_flag)
        else:
            pred_is_na_series = merged[f"{pred_answer_col}_pred"].astype(str).str.lower().eq("is_blank")
            merged["pred_is_blank_flag"] = pred_is_na_series

    ans_scores = []
    ref_scores = []
    na_scores = []  # will be NaN for non-NA ground truth rows

    for idx, row in merged.iterrows():
        gt_ans = row[f"{gt_answer_col}_gt"]
        pred_ans = row[f"{pred_answer_col}_pred"]
        gt_ref = row[f"{gt_ref_col}_gt"]
        pred_ref = row[f"{pred_ref_col}_pred"]

        gt_is_na = bool(gt_is_na_series.iloc[idx])
        pred_is_na = bool(pred_is_na_series.iloc[idx])

        # 1) answer_value component (unchanged)
        ans_correct = _answer_value_correct(gt_ans, pred_ans)
        ans_scores.append(1.0 * ans_correct)

        # 2) ref_id Jaccard (unchanged)
        ref_j = _ref_id_jaccard(gt_ref, pred_ref)
        ref_scores.append(ref_j)

        # 3) NA component: only score on GT-unanswerable rows
        #    "Out of the unanswerables, did you successfully NOT answer?"
        if gt_is_na:
            na_scores.append(1.0 if pred_is_na else 0.0)
        else:
            na_scores.append(np.nan)

    merged["answer_score"] = ans_scores
    merged["ref_id_score"] = ref_scores
    merged["is_NA_score"] = na_scores  # NaN for answerable rows

    # NA recall over ONLY the GT-unanswerable subset
    na_recall = merged["is_NA_score"].mean()  # mean ignores NaN by default in pandas

    # Overall score uses NA recall (global), not per-row matching
    overall_score = (
        0.75 * merged["answer_score"].mean()
        + 0.15 * merged["ref_id_score"].mean()
        + 0.10 * (0.0 if pd.isna(na_recall) else na_recall)
    )

    print(f"Rows compared: {len(merged)}")
    print(f"Mean answer_value score: {merged['answer_score'].mean():.4f}")
    print(f"Mean ref_id score:       {merged['ref_id_score'].mean():.4f}")
    print(f"NA recall (GT NA only):  {(0.0 if pd.isna(na_recall) else na_recall):.4f}")
    print(f"Overall WattBot score:   {overall_score:.4f}")

    # Store the overall (run-level) score on every row for convenience.
    # Note: this column is constant across rows, so it cannot rank individual rows.
    merged["wattbot_score"] = overall_score
    
    # ----- Show some incorrect examples -----
    incorrect = merged[
        (merged["answer_score"] < 1.0)
        | (merged["ref_id_score"] < 1.0)
        | (merged["is_NA_score"] < 1.0)
    ]
    
    if not incorrect.empty and n_examples > 0:
        print("\nExamples of incorrect / partially correct responses "
              f"(up to {n_examples} rows):\n")
        # wattbot_score is constant, so this sort does not rank rows; it simply takes up to n_examples imperfect rows
        for _, row in incorrect.sort_values("wattbot_score").head(n_examples).iterrows():
            q = row["question_gt"] if "question_gt" in row.index else None
            print("-" * 80)
            print(f"id: {row[id_col]}")
            if q is not None:
                print(f"Question: {q}")
            print(f"GT answer_value:   {row[f'{gt_answer_col}_gt']}")
            print(f"Pred answer_value: {row[f'{pred_answer_col}_pred']}")
            print(f"GT ref_id:         {row[f'{gt_ref_col}_gt']}")
            print(f"Pred ref_id:       {row[f'{pred_ref_col}_pred']}")
            print(f"answer_score: {row['answer_score']:.3f}, "
                  f"ref_id_score: {row['ref_id_score']:.3f}, "
                  f"is_NA_score: {row['is_NA_score']:.3f}")
        print("-" * 80)
    
    return merged
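
The `ref_id` values printed in this notebook look like Python list literals (for example `['chung2025']`), while `_ref_id_jaccard` splits on semicolons. With list-style input the whole string becomes a single token, so `is_blank` versus `['is_blank']` scores 0, and so do partially overlapping lists. A more tolerant parser could look like the sketch below. The names `parse_ref_ids` and `ref_id_jaccard` are hypothetical, and swapping this in would change the `ref_id` scores reported in this notebook.

```python
import re


def parse_ref_ids(value):
    """Extract a lowercase set of ref IDs from 'a;b', "['a','b']", or 'is_blank'."""
    if value is None:
        return set()
    tokens = re.findall(r"[a-z0-9_.\-]+", str(value).lower())
    # 'is_blank' and 'nan' (how a missing CSV cell prints) both mean "no references"
    return {t for t in tokens if t not in {"is_blank", "nan"}}


def ref_id_jaccard(gt_ref, pred_ref):
    gt_set, pred_set = parse_ref_ids(gt_ref), parse_ref_ids(pred_ref)
    if not gt_set and not pred_set:
        return 1.0  # both blank counts as full agreement
    return len(gt_set & pred_set) / len(gt_set | pred_set)


print(ref_id_jaccard("is_blank", "['is_blank']"))
print(ref_id_jaccard("['wu2021b','patterson2021']", "['wu2021b']"))
print(ref_id_jaccard("chung2025;patterson2021", "['chung2025']"))
```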

In [45]:
results_df = compute_wattbot_score(
    train_qa_path="./data/train_QA.csv",
    preds_path=solutions_path,
    gt_is_na_col="is_blank",   # or "is_NA" / None depending on how you mark NAs
    n_examples=10,
)

Rows compared: 41
Mean answer_value score: 0.6585
Mean ref_id score:       0.6829
NA recall (GT NA only):  1.0000
Overall WattBot score:   0.6963

Examples of incorrect / partially correct responses (up to 10 rows):

--------------------------------------------------------------------------------
id: q054
Question: What is the model size in gigabytes (GB) for the LLaMA-33B model?
GT answer_value:   64.7
Pred answer_value: is_blank
GT ref_id:         ['chen2024']
Pred ref_id:       ['is_blank']
answer_score: 0.000, ref_id_score: 0.000, is_NA_score: 1.000
--------------------------------------------------------------------------------
id: q062
Question: What was the total electricity consumption of all Google Cloud TPU pods worldwide in 2023, in megawatt-hours?
GT answer_value:   is_blank
Pred answer_value: is_blank
GT ref_id:         is_blank
Pred ref_id:       ['is_blank']
answer_score: 1.000, ref_id_score: 0.000, is_NA_score: 1.000
-----------------------------------------------------
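
In the output above, q054 has a ground-truth answer of 64.7 but still receives `is_NA_score: 1.000` rather than NaN. One plausible explanation (an inference from the output, not something the notebook states) is that `is_blank` exists only in the predictions file, so after the merge the unsuffixed `is_blank` column holds the model's flags and is picked up as the ground-truth NA flag. The sketch below derives both flags from the suffixed answer columns instead. It uses plain dictionaries in place of the merged DataFrame, with two rows copied from the output above; the helper names are hypothetical.

```python
import math

rows = [
    {"id": "q054", "answer_value_gt": "64.7", "answer_value_pred": "is_blank"},
    {"id": "q062", "answer_value_gt": "is_blank", "answer_value_pred": "is_blank"},
]


def is_blank_value(value):
    return str(value).strip().lower() == "is_blank"


def na_score(row):
    """1.0 or 0.0 on ground-truth-unanswerable rows, NaN on answerable rows."""
    if not is_blank_value(row["answer_value_gt"]):
        return math.nan
    return 1.0 if is_blank_value(row["answer_value_pred"]) else 0.0


scores = {row["id"]: na_score(row) for row in rows}
scored = [s for s in scores.values() if not math.isnan(s)]
na_recall = sum(scored) / len(scored) if scored else math.nan
print(scores, na_recall)
```

With this derivation q054 is excluded from NA recall (its score is NaN) and only q062 counts. In `compute_wattbot_score`, the equivalent would be to compute the ground-truth flag directly from `answer_value_gt`, as the function already does in its derive-from-answer branches.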

## Recap and next steps

In this episode, we:

- Loaded a small corpus of AI / ML energy papers into our notebook environment.
- Split long documents into manageable chunks and cached those chunks to disk so we don’t have to re-run the chunking step every time.
- Created vector embeddings for each chunk and used similarity search to retrieve relevant context for a given question.
- Used an LLM to generate answers from retrieved context and wrote results out to a CSV for later scoring and analysis.
- Handled unanswerable questions with an `is_blank` flag so the system can explicitly say “I don’t know” when the evidence isn’t there.
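
The retrieval step summarized above boils down to cosine similarity between a query embedding and every chunk embedding. Here is a dependency-free sketch with made-up 3-dimensional vectors (real embeddings are much longer, and `top_k_chunks` is a hypothetical helper name rather than a function from this notebook):

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, similarity) pairs for the k most similar chunks."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy embeddings: chunk 0 points almost the same way as the query, chunk 2 is orthogonal.
chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

for idx, score in top_k_chunks(query_vec, chunk_vecs, k=2):
    print(idx, round(score, 3))
```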

This is just a first pass at a RAG pipeline: it works, but there’s a lot of headroom to improve both accuracy and robustness. Some natural next steps:

- **Increase the size/quality of models used for embedding and generation**: Try stronger embedding models (e.g., larger sentence-transformers or domain-tuned embeddings) and more capable LLMs for answer generation, especially if you have GPU budget.

- **Add a reranking step**: Instead of sending the top-k raw nearest neighbors directly to the LLM, use a cross-encoder or reranker model to re-score those candidates and send only the best ones.

- **Handle figures and tables more carefully**: Many key numbers live in tables, figure captions, or plots. Consider:
  - OCR / table-parsing tools (e.g., `pytesseract`, table extractors, PDF parsers).
  - Multimodal models that can embed or interpret figures and diagrams, not just text.
  - Separate chunking strategies for captions, tables, and main text.

- **Enrich chunks with metadata**: Attach metadata like section headings (e.g., *Methods*, *Results*), paper ID, year, or paragraph type. You can:
  - Filter or boost chunks by metadata at retrieval time.
  - Use metadata in the prompt so the LLM knows where evidence is coming from.

- **Look for LLMs tuned for scientific literature**: Experiment with models that are explicitly trained or finetuned on scientific text (e.g., arXiv / PubMed) so they:
  - Parse equations and technical language more reliably.
  - Are less likely to hallucinate when reading dense scientific prose.
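
To make the reranking idea from the list above concrete, here is a minimal sketch of the control flow: retrieve a generous candidate set, re-score each candidate against the question, and keep only the best few. The `token_overlap` scorer is a toy stand-in for a real cross-encoder, and every name and passage here is hypothetical.

```python
def token_overlap(question, passage):
    """Toy relevance score: fraction of question tokens that appear in the passage."""
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(question, candidates, score_fn, keep=2):
    """Re-score retrieved candidates with score_fn and keep the top `keep`."""
    ranked = sorted(candidates, key=lambda c: score_fn(question, c["text"]), reverse=True)
    return ranked[:keep]


# Made-up candidates standing in for the output of first-stage retrieval.
candidates = [
    {"doc_id": "doc_a", "text": "Data centers use water for cooling"},
    {"doc_id": "doc_b", "text": "Training emissions of the model were reported in tCO2e"},
    {"doc_id": "doc_c", "text": "Emissions are listed in a table"},
]
question = "What were the emissions from training the model"

for cand in rerank(question, candidates, token_overlap, keep=2):
    print(cand["doc_id"], cand["text"])
```

In practice `score_fn` would wrap a reranker model, and only the kept candidates would be placed in the LLM prompt.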

As you iterate, the goal is to treat this notebook as a baseline RAG “workbench”: you can swap in better models, smarter retrieval strategies, and richer document preprocessing without changing the overall pipeline structure.

In the next episodes, we will repeat largely the same RAG pipeline using slightly different approaches on AWS (processing jobs and Bedrock).



::::::::::::::::::::::::::::::::::::: keypoints

- **Notebook setup**: Start by provisioning a GPU-backed notebook instance
  (e.g., `ml.g5.xlarge`) so that both the embedding model and Qwen2.5-7B
  can run comfortably.
- **Local-first RAG**: For teaching (and small corpora), we avoid an external vector database
  and instead perform cosine similarity search over in-memory embeddings.
- **Ground-truth units**: The `answer_unit` column is always copied directly
  from `train_QA.csv`, never guessed by the LLM.
- **Two-stage LLM use**: One call focuses on *answering and citing*; a second,
  lighter call produces a short explanation tagged with an evidence type.
- **WattBot conventions**: We respect the Kaggle competition format,
  using `is_blank` for unanswerable questions and for missing fields.
- **Scalability path**: The same logic can later be swapped to FAISS/Chroma
  and larger models, while preserving the interface used here.
::::::::::::::::::::::::::::::::::::::::::::::::

In [46]:
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.utils.collect_env.get_pretty_env_info())


2.9.1+cu130
13.0
PyTorch version: 2.9.1+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: version 3.31.0
Libc version: glibc-2.39

Python version: 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-164-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.6.85
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Server Edition

Nvidia driver version: 580.82.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_run

## 