

# Week6-2 (from CSV): Multi-Hop QA

**Goal:** Use the graph built from `week5_corpus.csv` to answer *multi-hop* questions, with a clear hop-by-hop trace and supporting evidence for each hop.
This notebook assumes you have already run **Week6-1_GraphRAG_Build_fromCSV.ipynb**, which creates `fromcsv_outputs/graph_store.json`.

**Outputs**
- Hop-level traces inline
- (Optional) A results row logged to `/content/ablation_results_graph.csv`



## 0) Configuration
We first look for `fromcsv_outputs/graph_store.json`. If it is not found, we fall back to `persist_paths.graph_store` in `rag_graph_run_config.json`, and finally to `./graph_store.json`.


In [1]:

from pathlib import Path
import json, os

# Defaults
DEFAULT_GRAPH = "fromcsv_outputs/graph_store.json"
RUN_CFG_PATH = "rag_graph_run_config.json"

# Load run config if present
cfg = {}
if Path(RUN_CFG_PATH).exists():
    cfg = json.loads(Path(RUN_CFG_PATH).read_text())

graph_store = DEFAULT_GRAPH if Path(DEFAULT_GRAPH).exists() else cfg.get("persist_paths", {}).get("graph_store", "./graph_store.json")
neighbor_hops = cfg.get("graph", {}).get("neighbor_hops", 2)
embed_model = cfg.get("embedding_model", "sentence-transformers/all-MiniLM-L6-v2")

print("Graph store:", graph_store)
print("Neighbor hops:", neighbor_hops)
print("Embedding model:", embed_model)


Graph store: fromcsv_outputs/graph_store.json
Neighbor hops: 2
Embedding model: sentence-transformers/all-MiniLM-L6-v2
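
The configuration cell reads three keys from `rag_graph_run_config.json`. As a rough sketch of the shape that code expects, the snippet below builds an illustrative config and applies the same fallback order. The values are made-up placeholders rather than the contents of any real config file, and `resolve_settings` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Illustrative config only: the keys mirror what the configuration cell reads,
# but the values are placeholders.
example_cfg = {
    "persist_paths": {"graph_store": "./graph_store.json"},
    "graph": {"neighbor_hops": 2},
    "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
}

def resolve_settings(cfg: dict, default_graph: str = "fromcsv_outputs/graph_store.json") -> dict:
    """Hypothetical helper: apply the same fallback order as the configuration cell."""
    if Path(default_graph).exists():
        graph_store = default_graph
    else:
        graph_store = cfg.get("persist_paths", {}).get("graph_store", "./graph_store.json")
    return {
        "graph_store": graph_store,
        "neighbor_hops": cfg.get("graph", {}).get("neighbor_hops", 2),
        "embed_model": cfg.get("embedding_model", "sentence-transformers/all-MiniLM-L6-v2"),
    }

# With a default path that does not exist, the config value is used.
print(json.dumps(resolve_settings(example_cfg, default_graph="does/not/exist.json"), indent=2))
# With an empty config as well, the hard-coded defaults apply.
print(resolve_settings({}, default_graph="does/not/exist.json"))
```

Keeping the fallback in one function makes it easier to see which source a setting came from when the graph file is missing.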



## 1) Load Graph-RAG + MultiHop
We import your module versions if available; otherwise we fall back to minimal inline implementations.


In [2]:

from typing import Dict, Any, List
import traceback

GraphRAG, MultiHopQA = None, None

try:
    from modules.graph_rag import GraphRAG as _GraphRAG
    GraphRAG = _GraphRAG
    print("Using modules.graph_rag.GraphRAG")
except Exception as e:
    print("Could not import modules.graph_rag.GraphRAG:", e)

try:
    from modules.multi_hop import MultiHopQA as _MultiHopQA
    MultiHopQA = _MultiHopQA
    print("Using modules.multi_hop.MultiHopQA")
except Exception as e:
    print("Could not import modules.multi_hop.MultiHopQA:", e)

# Fallbacks (only if modules not found)
if GraphRAG is None:
    import json, networkx as nx
    from sentence_transformers import SentenceTransformer
    class GraphRAG:
        def __init__(self, model_name: str, graph_store_path: str, neighbor_hops: int = 2):
            self.model = SentenceTransformer(model_name)
            self.graph_store_path = graph_store_path
            self.hops = neighbor_hops
            self.graph = nx.MultiDiGraph()
            self._load_graph()
        def _load_graph(self):
            if not Path(self.graph_store_path).exists():
                raise FileNotFoundError(self.graph_store_path)
            data = json.loads(Path(self.graph_store_path).read_text())
            for n in data.get("nodes", []):
                self.graph.add_node(n["id"], **n)
            for e in data.get("edges", []):
                self.graph.add_edge(e["src"], e["dst"], **e)
        def answer_with_graph(self, query: str) -> Dict[str, Any]:
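            # Extremely minimal seed/evidence flow for demo purposes.
            # NOTE: `query` is not used for retrieval here. The first evidence span
            # of the first 10 nodes is returned regardless of the question.
            # (For advanced behavior, use the modules version.)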
            spans = []
            for nid, d in list(self.graph.nodes(data=True))[:10]:
                for ev in d.get("evidence", [])[:1]:
                    spans.append(f"\"{ev['span'][:180]}...\" ({ev['doc_id']})")
            return {"query": query, "seeds": [], "support": spans, "graph_stats": {"nodes": self.graph.number_of_nodes(), "edges": self.graph.number_of_edges()}}

if MultiHopQA is None:
    from dataclasses import dataclass
    @dataclass
    class HopResult:
        sub_question: str
        answer: str
        support: List[str]
    class MultiHopQA:
        def __init__(self, graph: GraphRAG, max_hops: int = 2):
            self.graph = graph
            self.max_hops = max_hops
        def decompose(self, question: str) -> List[str]:
            if " and " in question:
                return [p.strip() for p in question.split(" and ") if p.strip()]
            if "," in question:
                return [p.strip() for p in question.split(",") if p.strip()]
            return [question]
        def run(self, question: str) -> Dict[str, Any]:
            subs = self.decompose(question)[: self.max_hops]
            hops = []
            for sq in subs:
                out = self.graph.answer_with_graph(sq)
                ans = f"See {len(out['support'])} evidence spans"
                hops.append(HopResult(sub_question=sq, answer=ans, support=out["support"]))
            final = " → ".join([h.answer for h in hops]) if hops else "No answer"
            return {"question": question, "final_answer": final, "hops": [h.__dict__ for h in hops]}


Could not import modules.graph_rag.GraphRAG: No module named 'modules'
Could not import modules.multi_hop.MultiHopQA: No module named 'modules'
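
The fallback `GraphRAG._load_graph` reads only a few fields from `graph_store.json`: each node needs an `id`, each edge needs `src` and `dst`, and `answer_with_graph` additionally expects node `evidence` entries with `span` and `doc_id`. A minimal sketch of a pre-flight check for those fields follows. `check_graph_store` is a hypothetical helper, the sample data is made up, and the store written by Week6-1 may carry extra fields that this check ignores.

```python
from typing import Any, Dict, List

def check_graph_store(data: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: list problems that would trip up the fallback loader."""
    problems = []
    node_ids = set()
    for i, node in enumerate(data.get("nodes", [])):
        if "id" not in node:
            problems.append(f"node[{i}] has no 'id'")
            continue
        node_ids.add(node["id"])
        for j, ev in enumerate(node.get("evidence", [])):
            for key in ("span", "doc_id"):
                if key not in ev:
                    problems.append(f"node {node['id']!r} evidence[{j}] has no {key!r}")
    for i, edge in enumerate(data.get("edges", [])):
        for key in ("src", "dst"):
            if key not in edge:
                problems.append(f"edge[{i}] has no {key!r}")
            elif edge[key] not in node_ids:
                # Not fatal: networkx creates missing endpoints implicitly,
                # but they will have no attributes or evidence.
                problems.append(f"edge[{i}] {key}={edge[key]!r} is not a declared node")
    return problems

# Tiny made-up store for illustration
sample = {
    "nodes": [
        {"id": "a", "evidence": [{"span": "some text", "doc_id": "doc1"}]},
        {"id": "b", "evidence": [{"span": "no doc id here"}]},
    ],
    "edges": [{"src": "a", "dst": "b"}, {"src": "a", "dst": "c"}],
}
for problem in check_graph_store(sample):
    print("-", problem)
```

In this notebook it could be called as `check_graph_store(json.loads(Path(graph_store).read_text()))` before constructing `GraphRAG`, so a malformed store fails with a readable message instead of a `KeyError` mid-query.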



## 2) Initialize engines


In [3]:

gr = GraphRAG(embed_model, graph_store_path=graph_store, neighbor_hops=neighbor_hops)
mh = MultiHopQA(gr, max_hops=2)  # adjust hops as needed
print("Ready.")


Ready.



## 3) Helper: pretty print a multi-hop run


In [4]:

from typing import Any, Dict

def show_run(result: Dict[str, Any]):
    print("\nQuestion:", result["question"])
    print("Final Answer:", result["final_answer"])
    print("\nTrace:")
    for i, hop in enumerate(result["hops"], start=1):
        print(f"  Hop {i}: {hop['sub_question']}")
        print(f"    Answer: {hop['answer']}")
        print("    Evidence (top 5):")
        for s in hop["support"][:5]:
            print("     -", s[:180])



## 4) Run an example multi-hop query
Per the assignment, try: **“Which author proposed Method B, and which dataset did they evaluate it on?”**
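
With the fallback `MultiHopQA`, this question is split on `" and "`, so the first hop keeps its trailing comma (`Which author proposed Method B,`). A small sketch of a variant that trims leftover punctuation is shown here. `decompose_clean` is a hypothetical name and is not part of `modules.multi_hop`; it is still a purely string-based heuristic.

```python
import re
from typing import List

def decompose_clean(question: str, max_hops: int = 2) -> List[str]:
    """Hypothetical variant of the fallback decompose(): split, then tidy each part."""
    # Split on an optional comma followed by "and" or "and then",
    # or on a bare comma.
    parts = re.split(r",?\s+and(?:\s+then)?\s+|,\s*", question)
    cleaned = [p.strip(" ,;") for p in parts]
    # Drop empty fragments and cap the number of hops.
    return [p for p in cleaned if p][:max_hops]

print(decompose_clean("Which author proposed Method B, and which dataset did they evaluate it on?"))
# ['Which author proposed Method B', 'which dataset did they evaluate it on?']
print(decompose_clean("Find the paper about HomeRepairGPT, and then tell me the dataset they used."))
# ['Find the paper about HomeRepairGPT', 'tell me the dataset they used.']
```

Note that the second sub-question still contains an unresolved pronoun ("they"). Resolving it would require feeding the first hop's answer into the second hop, which this sketch does not attempt.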


In [5]:

q = "Which author proposed Method B, and which dataset did they evaluate it on?"
res = mh.run(q)
show_run(res)



Question: Which author proposed Method B, and which dataset did they evaluate it on?
Final Answer: See 10 evidence spans → See 10 evidence spans

Trace:
  Hop 1: Which author proposed Method B,
    Answer: See 10 evidence spans
    Evidence (top 5):
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credi


## 5) Compare to single-hop baseline (sanity check)
This retrieves evidence once for the full question with the graph retriever, with no decomposition into hops.
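
One caveat when reading this comparison: the fallback `GraphRAG.answer_with_graph` never uses `query`, so the baseline and every hop return the same spans. A minimal sketch of query-aware ranking by token overlap is given here as one possible stand-in. It is a simple lexical heuristic, not the retrieval used by `modules.graph_rag` (whose behaviour is not shown in this notebook), and `tokenize` and `rank_spans` are hypothetical helpers.

```python
import re
from typing import List, Set, Tuple

def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; deliberately naive."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_spans(query: str, spans: List[str], top_k: int = 5) -> List[Tuple[float, str]]:
    """Hypothetical helper: score each span by the fraction of query tokens it contains."""
    q_tokens = tokenize(query)
    scored = []
    for span in spans:
        overlap = len(q_tokens & tokenize(span))
        score = overlap / len(q_tokens) if q_tokens else 0.0
        scored.append((score, span))
    # Highest score first; list.sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Made-up spans for illustration only
spans = [
    "Credits and library catalogue information",
    "Method B was evaluated on the example dataset",
    "The author describes Method B in chapter two",
]
for score, span in rank_spans("Which dataset was Method B evaluated on?", spans, top_k=2):
    print(f"{score:.2f}  {span}")
```

In this notebook it could be applied as `rank_spans(q, single["support"])` to check whether any returned span actually mentions the question's terms.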


In [6]:

single = gr.answer_with_graph(q)
print("Single-hop baseline evidence (top 10):")
for s in single.get("support", [])[:10]:
    print("-", s[:180])
print("\nGraph stats:", single.get("graph_stats"))


Single-hop baseline evidence (top 10):
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
- "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC


## 6) Provide at least one case where single-hop fails but multi-hop succeeds
Try crafting a compositional query specific to your corpus (replace with your domain terms). Example:


In [7]:

q2 = "Find the paper about HomeRepairGPT, and then tell me the dataset they used."
res2 = mh.run(q2)
show_run(res2)



Question: Find the paper about HomeRepairGPT, and then tell me the dataset they used.
Final Answer: See 10 evidence spans → See 10 evidence spans

Trace:
  Hop 1: Find the paper about HomeRepairGPT,
    Answer: See 10 evidence spans
    Evidence (top 5):
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com Credits
     - "►BLACKS DECKER'A .. .. - _ DAVENPORT PUBLIC LIBRARY 32^ MAIN STREET DAVENPORT, IOWA 52801-1400 Creative Publishing international MINNEAPOLIS, MINNESOTA www.creativepub.com 


## 7) (Optional) Log to ablation CSV
Append a row describing this run to `/content/ablation_results_graph.csv` (the path used by the cell below) so you can plot it later.


In [12]:
import csv, time, os
abl_path = "/content/ablation_results_graph.csv"
row = ["Multi-Hop", "self_ask_2hops_fromcsv", 6, neighbor_hops, "", "", f"ran={time.strftime('%Y-%m-%d %H:%M:%S')}"]

if os.path.exists(abl_path):
    with open(abl_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)
    print("Appended to", abl_path)
else:
    with open(abl_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # reuse one writer for the header and the data row
        writer.writerow(["track","variant","retrieval_top_k","neighbor_hops","exact_match","f1","notes"])
        writer.writerow(row)
    print("Created", abl_path, "and wrote a row.")

Created /content/ablation_results_graph.csv and wrote a row.
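
The `exact_match` and `f1` columns are left empty in the row written by this cell. If gold answers are available for your questions, those columns could be filled with SQuAD-style answer metrics. The sketch below is one common formulation, not a scorer prescribed by the assignment; `normalize`, `exact_match`, and `token_f1` are hypothetical helper names and the example strings are made up.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up prediction/gold pair for illustration
pred, gold = "the example dataset", "Example Dataset v2"
print("EM:", exact_match(pred, gold))
print("F1:", round(token_f1(pred, gold), 3))
```

Averaging these scores over a small set of question/gold pairs would give values to place in the two empty slots of `row` before writing it.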
