# Run RAG (Colab Version)

- **Goal**: Build a RAG system from the following three components:
1. Document and query embedder (an existing model can be used)
2. Document retriever (implement sparse, dense, and hybrid retrieval)
3. Document reader, a.k.a. question-answering system (an existing model can be used)

- **Inputs**: Expected data layout (the example cells below read questions from data/test_small/questions.txt instead):
1. data/corpus/chunks.clean.jsonl
2. data/test/questions.txt
3. data/test/reference_answers.json

- **Outputs**:
1. "index" folder
    - bm25.pkl
    - dense_meta.json
    - faiss.index
    - metas.pkl
    - texts.pkl
2. "system_outputs" folder
    - Files are named in the form system_output_.json
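
Before running the cells, it can help to confirm that the input files listed above are in place. The following is a minimal sketch, not part of rag.py: the helper name `missing_inputs` and the constant `EXPECTED_INPUTS` are hypothetical, and the paths are simply the ones from the input list.

```python
from pathlib import Path

# Paths taken from the "Inputs" list above, relative to the project root.
EXPECTED_INPUTS = [
    "data/corpus/chunks.clean.jsonl",
    "data/test/questions.txt",
    "data/test/reference_answers.json",
]

def missing_inputs(root=".", expected=EXPECTED_INPUTS):
    """Return the expected input paths that are not regular files under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).is_file()]

if __name__ == "__main__":
    missing = missing_inputs(".")
    print("All inputs found" if not missing else f"Missing inputs: {missing}")
```

On Colab, `root` would be the mounted project folder, for example `/content/drive/MyDrive/rag-pgh`.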

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
print(os.listdir('/content/drive/MyDrive/rag-pgh'))


# Install the required packages, then check that they import.
# Uncomment the %pip line on a fresh runtime; faiss-gpu-cu12 is the FAISS build that works best on Colab.
# %pip install -q rank-bm25 faiss-gpu-cu12 sentence-transformers transformers torch numpy scikit-learn tqdm regex

import faiss, numpy, torch
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
print("OK ✅")



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
['run_rag.ipynb', 'rag.py', 'data', '__pycache__', 'index']
OK ✅


In [None]:
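import json
import os  # used by nb_answer to export the Hugging Face token
import warnings
from types import SimpleNamespace
from huggingface_hub import login
warnings.filterwarnings("ignore")  # silence library warnings in the notebook output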
# If rag.py isn't in the working dir, uncomment the next two lines:
# import sys
# sys.path.append('/content/drive/MyDrive/rag-pgh')
%cd /content/drive/MyDrive/rag-pgh
import rag
from rag import cmd_index, cmd_answer, load_index  # uses rag.py

def nb_index(
    chunks="data/corpus/chunks.clean.jsonl",
    embedder="sentence-transformers/all-MiniLM-L6-v2",
    outdir="index",
    batch=256,
):
    """Index a chunk set using the specified embedder."""
    args = SimpleNamespace(chunks=chunks, embedder=embedder, outdir=outdir, batch=batch)
    cmd_index(args)
    texts, metas, _, _ = load_index(outdir)
    return {"indexdir": outdir, "n_chunks": len(texts), "embedder": embedder}


def nb_answer(
    questions="data/test_small/questions.txt",
    indexdir="index",
    reader="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # choose any HF open-weight model ≤32B
    k=8,
    mode="hybrid",                 # "sparse" | "dense" | "hybrid"
    fusion="weighted",             # "weighted" | "rrf"
    alpha=0.5,                     # weight for dense in weighted fusion
    pool=50,                       # candidate pool per retriever before fusion
    rrfK=60,                       # RRF constant
    max_new_tokens=64,
    temp=0.0,
    system_out="system_outputs/system_output_.json",
    include_sources=False,         # False for submission, True for demos
    hf_token=None,                 # pass an hf_... token for gated or private models; not needed for public ones
):
    """
    Answer a question set using the saved indexes and a Hugging Face reader.
    If hf_token is provided, authenticate with the Hugging Face Hub in-notebook.
    """
    # Optional Hugging Face authentication for gated or private models
    if hf_token:
        # HF_TOKEN is the environment variable huggingface_hub reads for later Hub calls
        os.environ["HF_TOKEN"] = hf_token
        try:
            login(token=hf_token, add_to_git_credential=False)
        except Exception as e:
            print(f"[warn] Hugging Face login failed: {e}")

    args = SimpleNamespace(
        questions=questions,
        indexdir=indexdir,
        reader=reader,
        k=k,
        mode=mode,
        fusion=fusion,
        alpha=alpha,
        pool=pool,
        rrfK=rrfK,
        max_new_tokens=max_new_tokens,
        temp=temp,
        system_out=system_out,
        include_sources=include_sources,
    )
    cmd_answer(args)
    with open(system_out, "r", encoding="utf-8") as f:
        return json.load(f)


/content/drive/MyDrive/rag-pgh
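
The `fusion`, `alpha`, and `rrfK` parameters of `nb_answer` control how sparse and dense results are combined in hybrid mode. The implementation in rag.py is not shown here, so the following is only a sketch of the two standard techniques those names suggest: weighted score fusion (assuming min-max normalization per retriever, with `alpha` as the dense weight) and reciprocal rank fusion. The function names and the `{doc_id: score}` input format are assumptions for illustration.

```python
from collections import defaultdict

def minmax(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(sparse, dense, alpha=0.5, k=8):
    """Rank docs by alpha * dense + (1 - alpha) * sparse on normalized scores."""
    s, d = minmax(sparse), minmax(dense)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    # Ties are broken by doc id so the ordering is deterministic.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))[:k]

def rrf_fusion(sparse, dense, rrf_k=60, k=8):
    """Reciprocal rank fusion: each retriever adds 1 / (rrf_k + rank), rank starting at 1."""
    fused = defaultdict(float)
    for scores in (sparse, dense):
        ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda doc: (-fused[doc], doc))[:k]

# Toy candidate pools (made-up scores, for illustration only).
sparse = {"a": 10.0, "b": 5.0, "c": 0.0}
dense = {"b": 0.9, "c": 0.8, "d": 0.1}
print(weighted_fusion(sparse, dense, alpha=0.5))  # ['b', 'a', 'c', 'd']
print(rrf_fusion(sparse, dense, rrf_k=60))        # ['b', 'c', 'a', 'd']
```

Weighted fusion depends on score magnitudes, which is why some normalization is needed before mixing BM25 and cosine scores. RRF uses only ranks, so it needs no normalization but ignores how large the score gaps are.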


In [None]:
# (1) build indices
idx_info = nb_index(
    chunks="data/corpus/chunks.clean.jsonl",
    embedder="sentence-transformers/all-MiniLM-L6-v2",
    outdir="index"
)
idx_info


Loaded 2092 chunks
Building BM25...
Building dense index with sentence-transformers/all-MiniLM-L6-v2 ...



Saved indexes to index


{'indexdir': 'index',
 'n_chunks': 2092,
 'embedder': 'sentence-transformers/all-MiniLM-L6-v2'}

In [None]:
# Check the pickle files to confirm that the indices were built
import pickle
with open(os.path.join(idx_info["indexdir"], "texts.pkl"), "rb") as f:
    texts = pickle.load(f)
display(texts[:3])  # first three chunk texts

with open(os.path.join(idx_info["indexdir"], "bm25.pkl"), "rb") as f:
    bm25_data = pickle.load(f)  # BM25 payload (tokenized docs), not embeddings
display(bm25_data['tokenized_docs'][0][:10])  # first ten tokens of the first chunk

['Event - Downtown Pittsburgh Come Swing with Me Oct 3, 2025 - Oct 5, 2025 Heinz Hall 600 Penn Avenue Pittsburgh, PA 15222 From Sinatra and the Rat Pack to Louis Prima and Bobby Darin, the young, soulful crooner Paul Loren takes you on a musical journey through The Great American Songbook and highlights some of the most iconic voices of all time along the way.',
 "Kennywood | Visit Pittsburgh Minutes from Downtown, Kennywood Park is one of Pittsburgh's best-loved historic landmarks. But don't let the history fool you! Three beloved, landmark wooden roller coasters are paired with Pennsylvania's fastest (Phantom's Revenge) and tallest (Steel Curtain) coasters, with dozens of other great attractions for all ages. Young children flock to Kiddieland and Thomas Town, featuring the world-famous children's characters, while all ages cherish a ride on the park's historic carousel and other classic attractions. Plus unique food, entertaining games, and engaging special events throughout the yea

['event',
 'downtown',
 'pittsburgh',
 'come',
 'swing',
 'with',
 'me',
 'oct',
 '3',
 '2025']
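
The tokens above are lowercased with punctuation removed. The tokenizer that rag.py actually uses is not shown, so the following is only a guess at an equivalent: a hypothetical `simple_tokenize` that lowercases the text and keeps runs of letters and digits. It reproduces the ten tokens displayed for the first chunk.

```python
import re

# Runs of lowercase ASCII letters or digits; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def simple_tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

sample = "Event - Downtown Pittsburgh Come Swing with Me Oct 3, 2025"
print(simple_tokenize(sample))
# ['event', 'downtown', 'pittsburgh', 'come', 'swing', 'with', 'me', 'oct', '3', '2025']
```

Whatever tokenizer is used, queries must be tokenized the same way as the indexed chunks, or BM25 term matching will miss.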

In [None]:
# (2) answer with TinyLlama (ungated model)
answers = nb_answer(
    questions="data/test_small/questions.txt",
    indexdir="index",
    reader="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    mode="hybrid",
    fusion="weighted",
    alpha=0.5,
    k=8,      # top 8 documents
    system_out="system_outputs/system_output_.json",
    include_sources=False
)

list(answers.items())[:5]  # peek at a few


Loaded index with 2092 chunks


`torch_dtype` is deprecated! Use `dtype` instead!

Q1 sources:
  [1] Carnegie Mellon Named a Top 20 US University — https://www.cmu.edu/news/stories/archives/2025/september/carnegie-mellon-named-a-top-20-us-university
  [2] CMU125 - CMU125 - Carnegie Mellon University — https://www.cmu.edu/125/index.html
  [3] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [4] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [5] Campus Life | Carnegie Mellon University — https://www.cmu.edu/campus-life
  [6] About — https://www.ml.cmu.edu/about/
  [7] Undergraduate Admission - Undergraduate Admission - Carnegie Mellon University — https://www.cmu.edu/admission/
  [8] CMU125 - CMU125 - Carnegie Mellon University — https://www.cmu.edu/125/index.html


Answering:  33%|███▎      | 1/3 [01:41<03:23, 101.96s/it]

Q2 sources:
  [1] Carnegie Mellon Named a Top 20 US University — https://www.cmu.edu/news/stories/archives/2025/september/carnegie-mellon-named-a-top-20-us-university
  [2] Campus Life | Carnegie Mellon University — https://www.cmu.edu/campus-life
  [3] Undergraduate Admission - Undergraduate Admission - Carnegie Mellon University — https://www.cmu.edu/admission/
  [4] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [5] CMU125 - CMU125 - Carnegie Mellon University — https://www.cmu.edu/125/index.html
  [6] History | Carnegie Mellon University — https://www.cmu.edu/about/history.html
  [7] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [8] Division of Student Affairs — https://www.cmu.edu/student-affairs/


Answering:  67%|██████▋   | 2/3 [03:37<01:49, 109.85s/it]

Q3 sources:
  [1] Carnegie Mellon Named a Top 20 US University — https://www.cmu.edu/news/stories/archives/2025/september/carnegie-mellon-named-a-top-20-us-university
  [2] CMU125 - CMU125 - Carnegie Mellon University — https://www.cmu.edu/125/index.html
  [3] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [4] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [5] Legal | Carnegie Mellon University — https://www.cmu.edu/legal/
  [6] History | Carnegie Mellon University — https://www.cmu.edu/about/history.html
  [7] Giving Opportunities | Engage with CMU — https://www.cmu.edu/engage/give/opportunities/index.html
  [8] Home - Commencement - Carnegie Mellon University — https://www.cmu.edu/commencement/


Answering: 100%|██████████| 3/3 [05:44<00:00, 114.72s/it]


Wrote system_outputs/system_output_.json


[('1',
  'Carnegie Mellon University was founded in the spring of 2006 as the world’s first machine learning academic department.'),
 ('2',
  'Answer: Carnegie Mellon University was founded by Andrew Carnegie, a self-educated "working boy" who emigrated from Scotland in 1848 and settled in Pittsburgh, Pennsylvania.'),
 ('3',
  'The original name of Carnegie Mellon University was the Carnegie Institute of Technology.')]
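
To compare answers like these against data/test/reference_answers.json, a common choice is exact match plus token-level F1. The sketch below is not part of rag.py. It assumes the reference file maps each question id to a single reference string, which the notebook does not confirm, and the helper names are hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_scores(answers, references):
    """Average EM and F1 over question ids present in both dicts (assumed format)."""
    ids = [qid for qid in answers if qid in references]
    if not ids:
        return {"em": 0.0, "f1": 0.0, "n": 0}
    em = sum(exact_match(answers[q], references[q]) for q in ids) / len(ids)
    f1 = sum(token_f1(answers[q], references[q]) for q in ids) / len(ids)
    return {"em": em, "f1": f1, "n": len(ids)}
```

With the reference file loaded through `json.load`, `mean_scores(answers, references)` would summarize a run. If the file stores several acceptable answers per question, the per-question score would need to take the maximum over them.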