This project uses the Stanford Question Answering Dataset,
https://www.kaggle.com/datasets/thedevastator/the-stanford-question-answering-dataset, to
build a RAG & RAGA system for question answering on the 500 Wikipedia topics. The project
follows these steps:
1. Load the dataset into a vector database
2. Build the RAG & RAGA system using BERT
3. Repeat step 2 using GPT
4. Repeat step 2 using Ollama
5. Write up the comparison and conclusions

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

# Load your CSV file
csv_path = "/content/drive/MyDrive/RAG_project/train.csv"
df = pd.read_csv(csv_path)

# Preview
df.head()

| Unnamed: 0 | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': array(['Saint Bernadette Soubirous'],... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': array(['a copper statue of Christ'], ... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': array(['the Main Building'], dtype=ob... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': array(['a Marian place of prayer and ... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': array(['a golden statue of the Virgin... |
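
The `answers` column is read back as a string that embeds a NumPy array, so the reference answers cannot be used for scoring directly. The sketch below pulls the answer texts out with regular expressions. The full cell layout in `sample` is an assumption based on the truncated preview, the value after `answer_start` is illustrative only, and `parse_answer_texts` is a hypothetical helper name.

```python
import re

# Captures the list inside "'text': array([...]".
# Assumption: no answer text contains a "]" character.
_TEXT_ARRAY = re.compile(r"'text':\s*array\(\[(.*?)\]", re.DOTALL)
# Captures single- or double-quoted strings; escaped quotes are not handled.
_QUOTED = re.compile(r"""(['"])(.*?)\1""", re.DOTALL)

def parse_answer_texts(cell):
    """Return the answer strings found in one `answers` cell, or [] if none match."""
    match = _TEXT_ARRAY.search(cell)
    if match is None:
        return []
    return [text for _quote, text in _QUOTED.findall(match.group(1))]

# Illustrative cell; the real column may differ in detail.
sample = ("{'text': array(['Saint Bernadette Soubirous'], dtype=object), "
          "'answer_start': array([515])}")
print(parse_answer_texts(sample))
```

If the format matches, `df['answers'].map(parse_answer_texts)` would give a list of reference answers per row, which allows evaluating on questions drawn from the dataset itself.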


## Load and Chunk Data

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

# Use the column containing the Wikipedia text
texts = df['context'].dropna().tolist()  # Change 'context' to your column name

# Convert to LangChain Documents
documents = [Document(page_content=text) for text in texts]

# Chunking
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Show one chunk
chunks[0]


Document(metadata={}, page_content='Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.')
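
The `df.head()` preview shows the same `context` paragraph repeated for every question asked about it, so indexing `df['context']` row by row stores each paragraph several times and retrieval can return identical passages. A minimal sketch of de-duplicating while keeping the original order follows; `unique_in_order` is a hypothetical helper name and the sample strings are placeholders.

```python
def unique_in_order(items):
    """Return items with duplicates removed, preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

rows = ["Paragraph A", "Paragraph A", "Paragraph B", "Paragraph A"]
print(unique_in_order(rows))  # ['Paragraph A', 'Paragraph B']
```

In this notebook it could be applied as `texts = unique_in_order(df['context'].dropna().tolist())` before the documents are built.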

## Generate Embeddings

In [None]:
!pip install -q sentence-transformers

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
texts_chunked = [doc.page_content for doc in chunks]
embeddings = embedding_model.encode(texts_chunked, show_progress_bar=True)


Batches:   0%|          | 0/2738 [00:00<?, ?it/s]

## **Load into FAISS Vector Database**

In [None]:
!pip install -q faiss-cpu

In [None]:

import faiss
import numpy as np
import pickle

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Save FAISS index and chunks
faiss.write_index(index, "/content/drive/MyDrive/squad_vector.index")

with open("/content/drive/MyDrive/squad_chunks.pkl", "wb") as f:
    pickle.dump(texts_chunked, f)


In [None]:
import os

print(os.path.exists("/content/drive/MyDrive/squad_vector.index"))   # Should be True
print(os.path.exists("/content/drive/MyDrive/squad_chunks.pkl"))     # Should be True


True
True
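
`faiss.IndexFlatL2` performs an exhaustive search: it compares the query with every stored vector and returns the `k` smallest squared Euclidean distances. The NumPy sketch below reproduces that behaviour on toy data to make the `D, I` return values easier to read. `flat_l2_search` is a hypothetical name, and this illustrates the idea rather than FAISS's implementation.

```python
import numpy as np

def flat_l2_search(corpus, queries, k):
    """Exhaustive nearest-neighbour search by squared Euclidean distance."""
    corpus = np.asarray(corpus, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise squared distances, shape (n_queries, n_corpus).
    diff = queries[:, None, :] - corpus[None, :, :]
    dists = (diff ** 2).sum(axis=2)
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, order, axis=1), order

corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
D, I = flat_l2_search(corpus, [[0.9, 0.0]], k=2)
print(I)  # row 0 lists corpus indices, nearest first
print(D)  # matching squared distances
```

As with `index.search`, row `i` of `I` holds the corpus positions for query `i`, which is why the notebook reads `I[0]` for a single question.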


In [None]:
# --- RAG Pipeline---

import faiss
import pickle
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Load models
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
qa_model = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

# Load FAISS and chunked texts
index = faiss.read_index("/content/drive/MyDrive/squad_vector.index")
with open("/content/drive/MyDrive/squad_chunks.pkl", "rb") as f:
    texts_chunked = pickle.load(f)

# Retrieve top-k contexts
def retrieve_context(question, k=3):
    q_embedding = embedding_model.encode([question])
    D, I = index.search(np.array(q_embedding), k)
    return [texts_chunked[i] for i in I[0]]

# RAG QA with BERT and context tracking
def rag_bert_with_context(question, k=3):
    contexts = retrieve_context(question, k)
    results = []
    for ctx in contexts:
        result = qa_model({'question': question, 'context': ctx})
        result['used_context'] = ctx  # manually add context
        results.append(result)
    best = max(results, key=lambda x: x['score'])
    return best['answer'], best['used_context']

# Questions + ground truths
questions = [
    "What is the capital of France?",
    "Who discovered penicillin?",
    "When did the Cold War end?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

# Run QA and collect results
answers, contexts = [], []
for q in questions:
    ans, ctx = rag_bert_with_context(q)
    answers.append(ans)
    contexts.append([ctx])  # wrap for ragas

# Display
for i in range(len(questions)):
    print(f"\nQ{i+1}: {questions[i]}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][0][:300]}...")


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Q1: What is the capital of France?
Answer: Paris
Ground Truth: Paris
Context snippet: Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometr...

Q2: Who discovered penicillin?
Answer: Alexander Fleming
Ground Truth: Alexander Fleming
Context snippet: The effects of some types of mold on infection had been noticed many times over the course of history (see: History of penicillin). In 1928, Alexander Fleming noticed the same effect in a Petri dish, where a number of disease-causing bacteria were killed by a fungus of the genus Penicillium. Fleming...

Q3: When did the Cold War end?
Answer: 1989
Ground Truth: 1991
Context snippet: The Cold War saw periods of both heightened tension and relative calm. International crises arose, such as the Berlin

In [None]:
!pip install python-Levenshtein




## Evaluation without an OpenAI API key

In [None]:
import numpy as np
import re

def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()

    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0

    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)

    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f1

# If you still want Levenshtein (optional)
try:
    from Levenshtein import distance as lev_distance
    def levenshtein_distance(s1, s2):
        return lev_distance(normalize(s1), normalize(s2))
except ImportError:
    def levenshtein_distance(s1, s2):
        return None  # or skip it

# Evaluate
em_scores, f1_scores, prec_scores, recall_scores, lev_dists = [], [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)
    dist = levenshtein_distance(pred, ref)
    lev_dists.append(dist if dist is not None else -1)

# Print results
print(f"\n🔍 Local Evaluation Metrics:")
print(f"✅ Exact Match (EM):     {np.mean(em_scores):.2f}")
print(f"✅ Precision Score:      {np.mean(prec_scores):.2f}")
print(f"✅ Recall Score:         {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:             {np.mean(f1_scores):.2f}")
if lev_dists[0] != -1:
    print(f"✅ Avg Edit Distance:    {np.mean(lev_dists):.2f}")



🔍 Local Evaluation Metrics:
✅ Exact Match (EM):     0.67
✅ Precision Score:      0.67
✅ Recall Score:         0.67
✅ F1 Score:             0.67
✅ Avg Edit Distance:    0.67
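
`token_metrics` above counts overlap with Python sets, so a repeated token is counted once in the numerator but every time in the denominators, and only the first reference (`gt[0]`) is scored. The sketch below shows a multiset variant in the style of SQuAD evaluation, taking the best score over all references. It keeps this notebook's `normalize` rather than the official normalization, and `token_prf` and `best_f1` are hypothetical names.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse non-word characters to single spaces."""
    return re.sub(r"\W+", " ", text.strip().lower())

def token_prf(pred, ref):
    """Token-level precision, recall and F1 using multiset overlap."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)

def best_f1(pred, refs):
    """Highest F1 of the prediction against any reference answer."""
    return max((token_prf(pred, ref)[2] for ref in refs), default=0.0)

print(token_prf("New York New York", "New York New York"))
print(best_f1("Fleming", ["Alexander Fleming", "Fleming"]))
```

With the set-based version, an exact match containing repeated tokens scores below 1.0; the multiset version avoids that.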


### **bert-large-uncased-whole-word-masking-finetuned-squad** with **all-mpnet-base-v2** (SentenceTransformer)

In [None]:
!pip install python-Levenshtein


In [None]:
# --- RAG + BERT (Improved Retrieval + QA Filtering in One Cell) ---

import pandas as pd
import numpy as np
import faiss
import pickle
import re
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

# Step 1: Prepare the corpus
df = pd.read_csv("/content/drive/MyDrive/RAG_project/train.csv")  # Adjust path if needed
texts = df['context'].dropna().tolist()
documents = [Document(page_content=text) for text in texts]

# Step 2: Chunk the texts (larger chunks for better context)
text_splitter = CharacterTextSplitter(chunk_size=700, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
texts_chunked = [doc.page_content for doc in chunks]

# Step 3: Encode the chunks using a stronger retriever model
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(texts_chunked, show_progress_bar=True)

# Step 4: Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Step 5: Save the index and chunked texts
# (same paths as before, so this overwrites the files written by the all-MiniLM-L6-v2 cell)
faiss.write_index(index, "/content/drive/MyDrive/squad_vector.index")
with open("/content/drive/MyDrive/squad_chunks.pkl", "wb") as f:
    pickle.dump(texts_chunked, f)

# Step 6: Load QA model
qa_model = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

# Step 7: Define RAG pipeline with context retrieval + filtering
def retrieve_context(question, k=5):
    q_embedding = embedding_model.encode([question])
    D, I = index.search(np.array(q_embedding), k)
    return [texts_chunked[i] for i in I[0]]

def rag_bert_with_context(question, k=5, score_threshold=0.2):
    contexts = retrieve_context(question, k)
    candidates = []
    for ctx in contexts:
        result = qa_model(question=question, context=ctx)
        if result['score'] >= score_threshold:
            result['used_context'] = ctx
            candidates.append(result)
    if candidates:
        best = max(candidates, key=lambda x: x['score'])
        return best['answer'], best['used_context']
    else:
        return "No confident answer found.", ""

# Step 8: Run QA
questions = [
    "What is the capital of France?",
    "Who discovered penicillin?",
    "When did the Cold War end?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_bert_with_context(q)
    answers.append(ans)
    contexts.append([ctx])  # wrap for later use

# Step 9: Evaluate
def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

try:
    from Levenshtein import distance as lev_distance
    def levenshtein_distance(s1, s2):
        return lev_distance(normalize(s1), normalize(s2))
except ImportError:
    def levenshtein_distance(s1, s2):
        return None

em_scores, f1_scores, prec_scores, recall_scores, lev_dists = [], [], [], [], []
for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)
    dist = levenshtein_distance(pred, ref)
    lev_dists.append(dist if dist is not None else -1)

# Step 10: Print results
for i in range(len(questions)):
    print(f"\nQ{i+1}: {questions[i]}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][0][:300]}...")

print(f"\n🔍 Local Evaluation Metrics:")
print(f"✅ Exact Match (EM):     {np.mean(em_scores):.2f}")
print(f"✅ Precision Score:      {np.mean(prec_scores):.2f}")
print(f"✅ Recall Score:         {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:             {np.mean(f1_scores):.2f}")
if lev_dists[0] != -1:
    print(f"✅ Avg Edit Distance:    {np.mean(lev_dists):.2f}")


Batches:   0%|          | 0/2738 [00:00<?, ?it/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Q1: What is the capital of France?
Answer: Paris
Ground Truth: Paris
Context snippet: Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometr...

Q2: Who discovered penicillin?
Answer: Florey and Chain
Ground Truth: Alexander Fleming
Context snippet: Florey and Chain succeeded in purifying the first penicillin, penicillin G, in 1942, but it did not become widely available outside the Allied military before 1945. Later, Norman Heatley developed the back extraction technique for efficiently purifying penicillin in bulk. The chemical structure of p...

Q3: When did the Cold War end?
Answer: late 1980s and the early 1990s
Ground Truth: 1991
Context snippet: The Cold War drew to a close in the late 1980s and the early 1990s. The United States under 

### **deepset/roberta-base-squad2** with **multi-qa-MiniLM-L6-cos-v1**

In [None]:
# --- Install Required Packages ---
!pip install sentence-transformers transformers faiss-cpu python-Levenshtein pandas langchain --quiet

# --- Imports ---
import pandas as pd
import numpy as np
import faiss
import pickle
import re
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from Levenshtein import distance as lev_distance
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

# --- Step 1: Load and Chunk Wikipedia Contexts ---
csv_path = "/content/drive/MyDrive/RAG_project/train.csv"
df = pd.read_csv(csv_path)
texts = df['context'].dropna().tolist()
documents = [Document(page_content=text) for text in texts]

# Chunk settings: chunk_overlap must be smaller than chunk_size,
# otherwise consecutive chunks have no room for new text
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
texts_chunked = [doc.page_content for doc in chunks]

# --- Step 2: Generate Embeddings using QA-optimized model ---
embedding_model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
embeddings = embedding_model.encode(texts_chunked, show_progress_bar=True)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Save index and chunks (optional)
faiss.write_index(index, "/content/drive/MyDrive/squad_vector.index")
with open("/content/drive/MyDrive/squad_chunks.pkl", "wb") as f:
    pickle.dump(texts_chunked, f)

# --- Step 3: Load Strong QA Model (RoBERTa) ---
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

# --- Step 4: RAG Answering (No Threshold, k=5) ---
def retrieve_context(question, k=5):
    q_embedding = embedding_model.encode([question])
    D, I = index.search(np.array(q_embedding), k)
    return [texts_chunked[i] for i in I[0]]

def rag_bert_with_context(question, k=5):
    contexts = retrieve_context(question, k)
    results = []
    for ctx in contexts:
        result = qa_model({'question': question, 'context': ctx})
        print(f"🧪 QA | Score: {result['score']:.3f} | Answer: {result['answer'][:50]}")
        result['context'] = ctx
        results.append(result)
    best = max(results, key=lambda x: x['score'])
    return best['answer'], best['context']

# --- Step 5: Sample Questions + Ground Truths ---
questions = [
    "What is the capital of France?",
    "Who discovered penicillin?",
    "When did the Cold War end?"
]
ground_truths = [["Paris"], ["Alexander Fleming"], ["1991"]]

# --- Step 6: Run QA ---
answers, contexts = [], []
for q in questions:
    ans, ctx = rag_bert_with_context(q)
    answers.append(ans)
    contexts.append([ctx])

# --- Step 7: Evaluation ---
def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Compute metrics
em_scores, f1_scores, prec_scores, recall_scores, lev_dists = [], [], [], [], []
refs = [gt[0] for gt in ground_truths]

for pred, ref in zip(answers, refs):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)
    lev_dists.append(lev_distance(normalize(pred), normalize(ref)))

# --- Step 8: Print Final Results ---
print(f"\n🔍 Local Evaluation Metrics:")
print(f"✅ Exact Match (EM):     {np.mean(em_scores):.2f}")
print(f"✅ Precision Score:      {np.mean(prec_scores):.2f}")
print(f"✅ Recall Score:         {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:             {np.mean(f1_scores):.2f}")
print(f"✅ Avg Edit Distance:    {np.mean(lev_dists):.2f}")

# --- Step 9: Show Q&A with Context ---
for i in range(len(questions)):
    print(f"\nQ{i+1}: {questions[i]}")
    print(f"Predicted: {answers[i]}")
    print(f"Ground Truth: {refs[i]}")
    print(f"Context Snippet: {contexts[i][0][:300]}...")


Batches:   0%|          | 0/2738 [00:00<?, ?it/s]

Device set to use cuda:0


🧪 QA | Score: 0.342 | Answer: Paris
🧪 QA | Score: 0.342 | Answer: Paris
🧪 QA | Score: 0.342 | Answer: Paris
🧪 QA | Score: 0.020 | Answer: Paris
🧪 QA | Score: 0.020 | Answer: Paris
🧪 QA | Score: 0.136 | Answer: Fleming
🧪 QA | Score: 0.136 | Answer: Fleming
🧪 QA | Score: 0.136 | Answer: Fleming
🧪 QA | Score: 0.136 | Answer: Fleming
🧪 QA | Score: 0.136 | Answer: Fleming
🧪 QA | Score: 0.011 | Answer: 1948–1949
🧪 QA | Score: 0.011 | Answer: 1948–1949
🧪 QA | Score: 0.011 | Answer: 1948–1949
🧪 QA | Score: 0.011 | Answer: 1948–1949
🧪 QA | Score: 0.011 | Answer: 1948–1949

🔍 Local Evaluation Metrics:
✅ Exact Match (EM):     0.33
✅ Precision Score:      0.67
✅ Recall Score:         0.50
✅ F1 Score:             0.56
✅ Avg Edit Distance:    5.33

Q1: What is the capital of France?
Predicted: Paris
Ground Truth: Paris
Context Snippet: Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world's capitals, has never been destroyed b
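
The index in this cell is `IndexFlatL2`, while the name `multi-qa-MiniLM-L6-cos-v1` suggests a model tuned for cosine similarity. For unit-length vectors the two agree, because the squared L2 distance equals `2 - 2 * cosine`. The sketch below checks that identity on random data, under the assumption that the embeddings are L2-normalized (for example by passing `normalize_embeddings=True` to `SentenceTransformer.encode`). `l2_normalize` is a hypothetical helper name.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(100, 8)))  # stand-in for chunk embeddings
query = l2_normalize(rng.normal(size=(1, 8)))     # stand-in for a question embedding

cosine = corpus @ query[0]
sq_l2 = ((corpus - query) ** 2).sum(axis=1)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
print(np.allclose(sq_l2, 2 - 2 * cosine))
# Smallest distance and largest cosine select the same neighbours.
print(np.array_equal(np.argsort(sq_l2)[:5], np.argsort(-cosine)[:5]))
```

If the embeddings are not normalized, the two rankings can differ, and an inner-product index over normalized vectors would be the closer match to cosine similarity.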

## **BERT with Weaviate cloud**

In [None]:
!pip install --upgrade pip



In [None]:
!pip install numpy pandas sentence-transformers transformers huggingface-hub



In [None]:
# -y skips the confirmation prompt, which a notebook cell cannot answer
!pip uninstall -y weaviate-client
!pip install "weaviate-client>=3.26.7,<4.0.0"

## **Weaviate** with **all-MiniLM-L6-v2** and **bert-large-uncased-whole-word-masking-finetuned-squad**

In [None]:
# IMPORTS
import weaviate
import pandas as pd
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np
import re
from weaviate.util import generate_uuid5

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"  # your cluster REST endpoint
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"  # your file path
CLASS_NAME = "QAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CHUNKING ===
def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i+chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

# === STEP 3: EMBEDDING ===
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(all_chunks, show_progress_bar=True)

# === STEP 4: SETUP WEAVIATE CONNECTION ===
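import os
from weaviate import Client

# Read credentials from the environment instead of hard-coding them in the notebook.
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are placeholder variable names; set them before running.
client = Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"],
    )
)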

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD DATA TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        properties = {"content": text}
        batch.add_data_object(properties, CLASS_NAME, vector=vector)

# === STEP 7: QA RETRIEVAL SETUP ===
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

def retrieve_context(question, k=3):
    query_vector = model.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(k).do()
    return [item["content"] for item in result["data"]["Get"][CLASS_NAME]]

def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    answers = []
    for context in contexts:
        result = qa_pipeline({'question': question, 'context': context})
        result['context'] = context
        answers.append(result)
    best = max(answers, key=lambda x: x['score'])
    return best['answer'], best['context']

# === STEP 8: EVALUATION ===
questions = [
    "What is the capital of France?",
    "Who discovered penicillin?",
    "When did the Cold War end?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)

em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Batches:   0%|          | 0/6033 [00:00<?, ?it/s]


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0



Q1: What is the capital of France?
Answer: Paris
Ground Truth: Paris
Context snippet: , Paris....

Q2: Who discovered penicillin?
Answer: Dorothy Crowfoot Hodgkin
Ground Truth: Alexander Fleming
Context snippet: Florey and Chain succeeded in purifying the first penicillin, penicillin G, in 1942, but it did not become widely available outside the Allied military before 1945. Later, Norman Heatley developed the back extraction technique for efficiently purifying penicillin in bulk. The chemical structure of p...

Q3: When did the Cold War end?
Answer: 1989
Ground Truth: 1991
Context snippet: The Cold War saw periods of both heightened tension and relative calm. International crises arose, such as the Berlin Blockade (1948–1949), the Korean War (1950–1953), the Berlin Crisis of 1961, the Vietnam War (1959–1975), the Cuban Missile Crisis (1962), the Soviet war in Afghanistan (1979–1989) a...

📊 EVALUATION METRICS:
✅ Exact Match: 0.33
✅ Precision:   0.33
✅ Recall:      0.33
✅ F1 Score:    
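
The character-offset `chunk_text` used in this cell can cut words in half and leave tiny tail fragments, such as the `, Paris.` context shown for Q1. The sketch below splits on word boundaries instead and carries a few trailing words into the next chunk, so a short tail still has some context. `chunk_words` is a hypothetical name and the default sizes simply mirror the values used above.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters on word boundaries.

    Each chunk after the first starts with trailing words of the previous
    chunk, totalling at most `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, current, length = [], [], 0
    for word in text.split():
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + added > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing words into the next chunk as overlap.
            carry, carry_len = [], 0
            for prev in reversed(current):
                if carry_len + len(prev) + 1 > overlap:
                    break
                carry.insert(0, prev)
                carry_len += len(prev) + 1
            current, length = carry, len(" ".join(carry))
            added = len(word) + (1 if current else 0)
        current.append(word)
        length += added
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_words("aa bb cc dd ee", chunk_size=10, overlap=4))
# ['aa bb cc', 'cc dd ee']
```

A single word longer than `chunk_size` is kept whole rather than split, so the size limit is a target for ordinary prose, not a hard guarantee.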

## Weaviate with **CLEAN + FILTERED CHUNKING** and **Reranking** with **bert-large-uncased-whole-word-masking-finetuned-squad**

In [None]:
# IMPORTS
import weaviate
import pandas as pd
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np
import re
from weaviate.util import generate_uuid5

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "QAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + FILTERED CHUNKING ===
def clean_chunk(text):
    """Remove noise and skip very short or whitespace-only text."""
    text = re.sub(r'\s+', ' ', text).strip()
    return text if len(text.split()) > 5 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

# === STEP 3: EMBEDDING ===
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(all_chunks, show_progress_bar=True)

# === STEP 4: SETUP WEAVIATE CONNECTION ===
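import os

# Read credentials from the environment instead of hard-coding them in the notebook.
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names; set them before running.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)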

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD DATA TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        properties = {"content": text}
        batch.add_data_object(properties, CLASS_NAME, vector=vector)

# === STEP 7: QA RETRIEVAL SETUP ===
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(question, k=3, fetch_k=10):
    query_vector = model.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(fetch_k).do()
    candidates = [item["content"] for item in result["data"]["Get"][CLASS_NAME]]

    # Re-rank top-k using CrossEncoder
    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)
    top_indices = np.argsort(scores)[-k:][::-1]

    return [candidates[i] for i in top_indices]


def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    answers = []
    for context in contexts:
        result = qa_pipeline({'question': question, 'context': context})
        result['context'] = context
        answers.append(result)
    best = max(answers, key=lambda x: x['score'])
    return best['answer'], best['context']

# === STEP 8: EVALUATION ===
questions = [
    "What is the capital of France?",
    "Who discovered penicillin?",
    "When did the Cold War end?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

from collections import Counter

def normalize(text):
    # Strip after substitution so trailing punctuation ("Paris.") does not leave a stray space
    return re.sub(r'\W+', ' ', text.lower()).strip()

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Multiset overlap: a plain set intersection under-counts repeated tokens
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)

em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Batches:   0%|          | 0/5901 [00:00<?, ?it/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0






Q1: What is the capital of France?
Answer: Paris
Ground Truth: Paris
Context snippet: is also based in the city. Paris also holds the headquarters of the La Poste, France's national postal carrier....

Q2: Who discovered penicillin?
Answer: Fleming
Ground Truth: Alexander Fleming
Context snippet: eria and had low toxicity in humans. Furthermore, its activity was not inhibited by biological constituents such as pus, unlike the synthetic sulfonamides. The discovery of such a powerful antibiotic was unprecedented, and the development of penicillin led to renewed interest in the search for antib...

Q3: When did the Cold War end?
Answer: September 1949
Ground Truth: 1991
Context snippet: In simple terms, the Cold War could be viewed as an expression of the ideological struggle between communism and capitalism. The United States faced a new uncertainty beginning in September 1949, when it lost its monopoly on the atomic bomb. American intelligence agencies discovered that the Soviet ...

📊
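
The second context snippet above begins mid-word ("eria and had low toxicity...") because `chunk_text` slices on fixed character offsets. One possible alternative, shown here as a minimal sketch rather than the notebook's method, is to break only at spaces. The function name `chunk_words` and the `min_words` parameter are hypothetical; `chunk_size` and `overlap` keep the same character-based meaning as in the cell above.

```python
import re


def chunk_words(text, chunk_size=500, overlap=50, min_words=5):
    """Split text into chunks of at most chunk_size characters, cutting only at spaces.

    A single word longer than chunk_size still ends up in one oversized chunk.
    """
    words = re.sub(r"\s+", " ", text).strip().split()
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing words (up to `overlap` characters) into the next chunk.
            tail = []
            while current and len(" ".join([current[-1]] + tail)) <= overlap:
                tail.insert(0, current.pop())
            current = tail
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    # Same filter idea as clean_chunk: drop chunks that are too short to be useful.
    return [c for c in chunks if len(c.split()) > min_words]


# Synthetic text (illustrative only): 400 distinct tokens.
sample = " ".join(f"token{i}" for i in range(400))
chunks = chunk_words(sample)
print(len(chunks), "chunks; second chunk starts with:", chunks[1][:40])
```

Each chunk now starts and ends on a whole word, at the cost of chunk lengths that vary slightly below `chunk_size`.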

## Weaviate with **CLEAN + FILTERED CHUNKING** and **Reranking** with **deepset/roberta-base-squad2**


In [None]:
!pip install numpy pandas sentence-transformers transformers huggingface-hub


In [None]:

!pip install "weaviate-client>=3.26.7,<4.0.0"



In [None]:
# IMPORTS
import weaviate
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline
import numpy as np
import re

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "QAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + FILTERED CHUNKING ===
def clean_chunk(text):
    """Remove noise and skip very short or whitespace-only text."""
    text = re.sub(r'\s+', ' ', text).strip()
    # Only accept chunk if longer than 30 words (avoid noisy/small chunks)
    return text if len(text.split()) > 30 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

print(f"Total chunks extracted: {len(all_chunks)}")

# === STEP 3: EMBEDDING ===
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(all_chunks, show_progress_bar=True)

# === STEP 4: SETUP WEAVIATE CONNECTION ===
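import os

# Read credentials from the environment instead of hard-coding them in the notebook.
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names; set them before running.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)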

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD DATA TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        props = {"content": text}
        batch.add_data_object(props, CLASS_NAME, vector=vector)

print("Data added to Weaviate successfully.")

# === STEP 7: QA + RERANKER SETUP ===
# Extractive QA model: RoBERTa-base fine-tuned on SQuAD 2.0
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(question, k=3, fetch_k=10):
    query_vector = bi_encoder.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(fetch_k).do()

    candidates = []
    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        candidates = [item["content"] for item in result["data"]["Get"][CLASS_NAME]]

    if not candidates:
        return []

    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)
    top_indices = np.argsort(scores)[-k:][::-1]

    return [candidates[i] for i in top_indices]

def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    if not contexts:
        return "No relevant context found.", ""
    answers = []
    for context in contexts:
        result = qa_pipeline({'question': question, 'context': context})
        result['context'] = context
        answers.append(result)
    best = max(answers, key=lambda x: x['score'])
    return best['answer'], best['context']

questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

from collections import Counter

def normalize(text):
    # Strip after substitution so trailing punctuation ("Paris.") does not leave a stray space
    return re.sub(r'\W+', ' ', text.lower()).strip()

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Multiset overlap: a plain set intersection under-counts repeated tokens
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)

em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Total chunks extracted: 151796


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Batches:   0%|          | 0/4744 [00:00<?, ?it/s]

Data added to Weaviate successfully.



Device set to use cuda:0






Q1: What is the capital city of France?
Answer: Paris
Ground Truth: Paris
Context snippet: Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometr...

Q2: Which British scientist discovered penicillin in 1928?
Answer: Alexander Fleming
Ground Truth: Alexander Fleming
Context snippet: The effects of some types of mold on infection had been noticed many times over the course of history (see: History of penicillin). In 1928, Alexander Fleming noticed the same effect in a Petri dish, where a number of disease-causing bacteria were killed by a fungus of the genus Penicillium. Fleming...

Q3: What year did the Soviet Union collapse, marking the end of the Cold War?
Answer: 1991
Ground Truth: 1991
Context snippet: The Cold War drew to a close in the l
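
`retrieve_context` first fetches `fetch_k` candidates by vector similarity and then keeps the `k` that the cross-encoder scores highest. The selection step can be tried in isolation. The sketch below uses made-up scores in place of `reranker.predict(pairs)`, and the helper name `top_k_by_score` is hypothetical.

```python
import numpy as np


def top_k_by_score(candidates, scores, k=3):
    """Return the k candidates with the highest scores, best first."""
    scores = np.asarray(scores)
    # argsort is ascending, so take the last k indices and reverse them.
    order = np.argsort(scores)[-k:][::-1]
    return [candidates[i] for i in order]


# Hypothetical reranker scores for five retrieved chunks (illustrative values only).
candidates = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
scores = [0.12, 4.70, -1.30, 2.05, 3.40]

top3 = top_k_by_score(candidates, scores, k=3)
print(top3)  # ['chunk B', 'chunk E', 'chunk D']
```

If `k` exceeds the number of candidates, the slice simply returns all of them in descending score order, so no extra bounds check is needed.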

## deepset/roberta-base-squad2 with more questions

In [None]:
# IMPORTS
import weaviate
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline
import numpy as np
import re

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "QAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + FILTERED CHUNKING ===
def clean_chunk(text):
    """Remove noise and skip very short or whitespace-only text."""
    text = re.sub(r'\s+', ' ', text).strip()
    # Only accept chunk if longer than 30 words (avoid noisy/small chunks)
    return text if len(text.split()) > 30 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

print(f"Total chunks extracted: {len(all_chunks)}")

# === STEP 3: EMBEDDING ===
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(all_chunks, show_progress_bar=True)

# === STEP 4: SETUP WEAVIATE CONNECTION ===
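import os

# Read credentials from the environment instead of hard-coding them in the notebook.
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names; set them before running.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)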

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD DATA TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        props = {"content": text}
        batch.add_data_object(props, CLASS_NAME, vector=vector)

print("Data added to Weaviate successfully.")

# === STEP 7: QA + RERANKER SETUP ===
# Extractive QA model: RoBERTa-base fine-tuned on SQuAD 2.0
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(question, k=3, fetch_k=10):
    query_vector = bi_encoder.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(fetch_k).do()

    candidates = []
    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        candidates = [item["content"] for item in result["data"]["Get"][CLASS_NAME]]

    if not candidates:
        return []

    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)
    top_indices = np.argsort(scores)[-k:][::-1]

    return [candidates[i] for i in top_indices]

def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    if not contexts:
        return "No relevant context found.", ""
    answers = []
    for context in contexts:
        result = qa_pipeline({'question': question, 'context': context})
        result['context'] = context
        answers.append(result)
    best = max(answers, key=lambda x: x['score'])
    return best['answer'], best['context']

# === STEP 8: EVALUATION ===
# questions = [
#     "What is the capital of France?",
#     "Who discovered penicillin?",
#     "When did the Cold War end?"
# ]
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
    "Which school at Notre Dame was established in 1921?",
    "In what year was the College of Science at Notre Dame founded?",
    "Which building is the center of the College of Arts and Letters?",
    "What religious structure is located on the campus of the University of Notre Dame?",
    "Who designed the Basilica of the Sacred Heart at Notre Dame?"
]

ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"],
    ["College of Commerce"],
    ["1865"],
    ["O'Shaughnessy Hall"],
    ["Basilica of the Sacred Heart"],
    ["Fr. Sorin"]
]

from collections import Counter

def normalize(text):
    # Strip after substitution so trailing punctuation ("Paris.") does not leave a stray space
    return re.sub(r'\W+', ' ', text.lower()).strip()

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Multiset overlap: a plain set intersection under-counts repeated tokens
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)

em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Total chunks extracted: 151796


Batches:   0%|          | 0/4744 [00:00<?, ?it/s]

Data added to Weaviate successfully.


Device set to use cuda:0



Q1: What is the capital city of France?
Answer: Paris
Ground Truth: Paris
Context snippet: Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometr...

Q2: Which British scientist discovered penicillin in 1928?
Answer: Alexander Fleming
Ground Truth: Alexander Fleming
Context snippet: The effects of some types of mold on infection had been noticed many times over the course of history (see: History of penicillin). In 1928, Alexander Fleming noticed the same effect in a Petri dish, where a number of disease-causing bacteria were killed by a fungus of the genus Penicillium. Fleming...

Q3: What year did the Soviet Union collapse, marking the end of the Cold War?
Answer: 1991
Ground Truth: 1991
Context snippet: The Cold War drew to a close in the l
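
The evaluation cells in this section differ only in the QA model and the question list, yet each one repeats the full load, embed and index pipeline. One way to make the later model comparison easier is a small harness that accepts any `question -> (answer, context)` callable. The sketch below is an assumption about how that could look: `evaluate` and `fake_rag_qa` are hypothetical names, the stub stands in for `rag_qa` so the code runs without Weaviate or a model, and the second accepted reference for the Basilica question is invented for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse runs of non-word characters into single spaces."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def evaluate(qa_fn, questions, ground_truths):
    """Mean exact match, counting a hit if the answer equals ANY accepted reference."""
    hits = 0
    for question, refs in zip(questions, ground_truths):
        answer, _context = qa_fn(question)
        hits += any(normalize(answer) == normalize(ref) for ref in refs)
    return hits / len(questions)


# Hypothetical stand-in for rag_qa; returns canned answers and an empty context.
def fake_rag_qa(question):
    canned = {
        "What is the capital city of France?": "Paris",
        "Who designed the Basilica of the Sacred Heart at Notre Dame?": "Father Sorin",
    }
    return canned.get(question, ""), ""


questions = [
    "What is the capital city of France?",
    "Who designed the Basilica of the Sacred Heart at Notre Dame?",
]
# Each question may list several accepted answers (the second entry is illustrative).
ground_truths = [["Paris"], ["Fr. Sorin", "Father Sorin"]]

score = evaluate(fake_rag_qa, questions, ground_truths)
print(f"Exact Match: {score:.2f}")
```

Passing `rag_qa` in place of `fake_rag_qa` would score the real pipeline, and checking every accepted reference avoids penalising an answer only because it differs from `gt[0]`.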

In [2]:
!pip install numpy pandas sentence-transformers transformers huggingface-hub



In [3]:
!pip uninstall -y weaviate-client
!pip install "weaviate-client>=3.26.7,<4.0.0"

## deepset/roberta-large-squad2 with more questions

In [None]:
# IMPORTS
import weaviate
import pandas as pd
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline
import numpy as np
import re

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "QAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + FILTERED CHUNKING ===
def clean_chunk(text):
    """Remove noise and skip very short or whitespace-only text."""
    text = re.sub(r'\s+', ' ', text).strip()
    # Only accept chunk if longer than 30 words (avoid noisy/small chunks)
    return text if len(text.split()) > 30 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

print(f"Total chunks extracted: {len(all_chunks)}")

# === STEP 3: EMBEDDING ===
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(all_chunks, show_progress_bar=True)

# === STEP 4: SETUP WEAVIATE CONNECTION ===
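import os

# Read credentials from the environment instead of hard-coding them in the notebook.
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names; set them before running.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)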

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD DATA TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        props = {"content": text}
        batch.add_data_object(props, CLASS_NAME, vector=vector)

print("Data added to Weaviate successfully.")

# === STEP 7: QA + RERANKER SETUP ===
# Extractive QA model: RoBERTa-large fine-tuned on SQuAD 2.0
qa_pipeline = pipeline("question-answering", model="deepset/roberta-large-squad2")

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(question, k=3, fetch_k=10):
    query_vector = bi_encoder.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(fetch_k).do()

    candidates = []
    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        candidates = [item["content"] for item in result["data"]["Get"][CLASS_NAME]]

    if not candidates:
        return []

    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)
    top_indices = np.argsort(scores)[-k:][::-1]

    return [candidates[i] for i in top_indices]

def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    if not contexts:
        return "No relevant context found.", ""
    answers = []
    for context in contexts:
        result = qa_pipeline({'question': question, 'context': context})
        result['context'] = context
        answers.append(result)
    best = max(answers, key=lambda x: x['score'])
    return best['answer'], best['context']

questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
    "Which school at Notre Dame was established in 1921?",
    "In what year was the College of Science at Notre Dame founded?",
    "Which building is the center of the College of Arts and Letters?",
    "What religious structure is located on the campus of the University of Notre Dame?",
    "Who designed the Basilica of the Sacred Heart at Notre Dame?",
    "Which saint is the golden statue atop the Main Building modeled after?",
    "What color is the dome at the University of Notre Dame?",
]

ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"],
    ["College of Commerce"],
    ["1865"],
    ["O'Shaughnessy Hall"],
    ["Basilica of the Sacred Heart"],
    ["Fr. Sorin"],
    ["Virgin Mary"],
    ["golden"]
]

from collections import Counter

def normalize(text):
    # Strip after substitution so trailing punctuation ("Paris.") does not leave a stray space
    return re.sub(r'\W+', ' ', text.lower()).strip()

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Multiset overlap: a plain set intersection under-counts repeated tokens
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
for q in questions:
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)

em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for pred, ref in zip(answers, [gt[0] for gt in ground_truths]):
    em_scores.append(exact_match(pred, ref))
    p, r, f1 = token_metrics(pred, ref)
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Total chunks extracted: 151796


Data added to Weaviate successfully.


Device set to use cuda:0



Q1: What is the capital city of France?
Answer: Paris
Ground Truth: Paris
Context snippet: Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometr...

Q2: Which British scientist discovered penicillin in 1928?
Answer: Alexander Fleming
Ground Truth: Alexander Fleming
Context snippet: The effects of some types of mold on infection had been noticed many times over the course of history (see: History of penicillin). In 1928, Alexander Fleming noticed the same effect in a Petri dish, where a number of disease-causing bacteria were killed by a fungus of the genus Penicillium. Fleming...

Q3: What year did the Soviet Union collapse, marking the end of the Cold War?
Answer: 1991
Ground Truth: 1991
Context snippet: The Cold War drew to a close in the l
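
The `token_metrics` helper in the cell above intersects token *sets*, so a token repeated in the prediction is counted once in the overlap but every time in the denominator. A minimal sketch of a multiset variant, counting overlap with `collections.Counter` in the style of SQuAD-type token F1, is shown below. The name `token_metrics_multiset` is hypothetical and not part of the notebook; the normalization is copied from the cell above.

```python
import re
from collections import Counter

def normalize(text):
    # Same normalization as the notebook: lowercase, non-word runs become spaces
    return re.sub(r'\W+', ' ', text.strip().lower())

def token_metrics_multiset(pred, ref):
    """Precision, recall and F1 where repeated tokens are matched at most
    as often as they occur in both strings (hypothetical helper)."""
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Counter & Counter keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For short, non-repeating answers such as `Paris` the two variants agree; they differ only when a token occurs more than once in the prediction or the reference.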

### GPT2 with Weaviate

In [None]:
# === IMPORTS ===
import weaviate
import pandas as pd
import numpy as np
import re
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "GPT2Chunks"


# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + CHUNK ===
def clean_chunk(text):
    text = re.sub(r'\s+', ' ', text).strip()
    return text if len(text.split()) > 5 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))
print(f"✅ Total chunks: {len(all_chunks)}")

# === STEP 3: EMBEDDING ===
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(all_chunks, show_progress_bar=True)

# === STEP 4: CONNECT TO WEAVIATE ===
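import os

# Credentials are read from the environment instead of being hardcoded;
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)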

# === STEP 5: RESET CLASS ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: LOAD TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        batch.add_data_object({"content": text}, CLASS_NAME, vector=vector)

print("✅ Data added to Weaviate")

# === STEP 7: RERANKING SETUP ===
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_context(question, k=3, fetch_k=15):
    query_vector = bi_encoder.encode([question])[0]
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": query_vector}) \
        .with_limit(fetch_k).do()
    candidates = [item["content"] for item in result["data"]["Get"][CLASS_NAME]]
    pairs = [[question, doc] for doc in candidates]
    scores = reranker.predict(pairs)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [candidates[i] for i in top_indices]

# === STEP 8: GPT2 QA SETUP ===
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")
model.eval()

def generate_answer_gpt2(question, context):
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=256, do_sample=False, pad_token_id=tokenizer.eos_token_id)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer.replace(prompt, "").strip()

def rag_qa(question, k=3):
    contexts = retrieve_context(question, k)
    # Only the top-ranked context is used, so generate a single answer
    # instead of one per retrieved chunk.
    return generate_answer_gpt2(question, contexts[0]), contexts[0]

# === STEP 9: EVALUATION ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for q, gt in zip(questions, ground_truths):
    ans, ctx = rag_qa(q)
    answers.append(ans)
    contexts.append(ctx)
    em_scores.append(exact_match(ans, gt[0]))
    p, r, f1 = token_metrics(ans, gt[0])
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][:300]}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


✅ Total chunks: 188816


✅ Data added to Weaviate

Q1: What is the capital city of France?
Answer: Paris is the capital of France.
Question: What is the capital city of the United States?
Answer: The capital city of the United States is the capital of the United States.
Question: What is the capital city of the United Kingdom?
Answer: The capital city of the United Kingdom is the capital of the United Kingdom.
Question: What is the capital city of the United States?
Answer: The capital city of the United States is the capital of the United States.
Question: What is the capital city of the United States?
Answer: The capital city of the United States is the capital of the United States.
Question: What is the capital city of the United States?
Answer: The capital city of the United States is the capital of the United States.
Question: What is the capital city of the United States?
Answer: The capital city of the United States is the capital of the United States.
Question: What is the capital city of the United St
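
The output above shows GPT-2 answering and then inventing further `Question:`/`Answer:` pairs until `max_length` is reached, so exact match fails for this answer even though its first line is correct. One possible post-processing step, sketched here under the assumption that the answer is the text before the first newline, is to cut the continuation there. `trim_answer` and its `stop_markers` parameter are hypothetical and not part of the notebook.

```python
def trim_answer(generated, stop_markers=("\n",)):
    """Keep only the text before the earliest stop marker (hypothetical helper)."""
    text = generated.strip()
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()

raw = (
    "Paris is the capital of France.\n"
    "Question: What is the capital city of the United States?\n"
    "Answer: The capital city of the United States is the capital of the United States."
)
print(trim_answer(raw))
```

Applied to the decoded continuation before scoring, this keeps the first generated line and drops the invented follow-up questions; it does not shorten a verbose first line.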

## GPT2-medium

In [None]:
# === IMPORTS ===
import weaviate
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "GPTQAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CLEAN + FILTERED CHUNKING ===
def clean_chunk(text):
    text = re.sub(r'\s+', ' ', text).strip()
    return text if len(text.split()) > 30 else None

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    seen = set()
    for i in range(0, len(text), chunk_size - overlap):
        raw_chunk = text[i:i + chunk_size]
        chunk = clean_chunk(raw_chunk)
        if chunk and chunk not in seen:
            chunks.append(chunk)
            seen.add(chunk)
    return chunks

all_chunks = []
for text in texts:
    all_chunks.extend(chunk_text(text))

print(f"Total chunks extracted: {len(all_chunks)}")

# === STEP 3: EMBEDDINGS USING sentence-transformers ===
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # fast and good quality

def get_local_embedding(text):
    return embedder.encode(text)

embeddings = [get_local_embedding(text) for text in all_chunks]

# === STEP 4: CONNECT TO WEAVIATE ===
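import os

# Credentials are read from the environment instead of being hardcoded;
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)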

# === STEP 5: CREATE SCHEMA ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        props = {"content": text}
        batch.add_data_object(props, CLASS_NAME, vector=vector.tolist())

print("✅ Data added to Weaviate.")

# === STEP 7: RAG RETRIEVER + GPT-2 MEDIUM QA ===

# Load GPT-2 medium model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

if torch.cuda.is_available():
    model.to("cuda")

def retrieve_local_context(question, k=3):
    question_vector = get_local_embedding(question)
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": question_vector.tolist()}) \
        .with_limit(k).do()

    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        return [item["content"] for item in result["data"]["Get"][CLASS_NAME]]
    return []

def generate_gpt2_answer(question, contexts):
    prompt = f"Context:\n{chr(10).join(contexts)}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer.encode(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = inputs.to("cuda")
    outputs = model.generate(
        inputs,
        max_length=inputs.shape[1] + 100,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract only the answer part after "Answer:"
    answer = generated.split("Answer:")[-1].strip()
    return answer

# === STEP 8: TEST EXAMPLES ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
]

for q in questions:
    ctxs = retrieve_local_context(q, k=3)
    answer = generate_gpt2_answer(q, ctxs)
    print(f"\nQuestion: {q}\nAnswer: {answer}\nContexts used: {ctxs}")


Total chunks extracted: 151796


✅ Data added to Weaviate.




Question: What is the capital city of France?
Answer: Paris. The city's name derives from the French word "paris", which means "city" or "town". It was founded by Louis XIV in 1492, and was named after his wife, Marie Antoinette, who was born in Paris and died there in 1793. It is one of Europe's oldest cities, dating back to the 12th century, when it was first settled by the Normans. Today, it is home to more than 200 million people, making it the
Contexts used: ["Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world's capitals, has never been destroyed by catastrophe or war. In modernising its infrastructure through the centuries, Paris has preserved even its earliest history in its street map.[citation needed] At its origin, before the Middle Ages, the city was composed around several islands and sandbanks in a bend of the Seine; of those, two remain today: the île Saint-Louis, th", "Most French rulers since 


Question: Which British scientist discovered penicillin in 1928?
Answer: The answer is: none of them. None of the British scientists who were involved in the discovery were British. They were scientists from the United States, Canada, Australia, New Zealand, South Africa, India, China, Japan, France, Germany, Italy, Spain, Portugal, Switzerland, the Netherlands, Belgium, Denmark, Sweden, Norway, Finland, Iceland, Russia, Czechoslovakia, Poland, Hungary, Romania, Bulgaria, Yugoslavia, Greece, Turkey, Egypt, Morocco, Algeria, Tunisia
Contexts used: ['Florey and Chain succeeded in purifying the first penicillin, penicillin G, in 1942, but it did not become widely available outside the Allied military before 1945. Later, Norman Heatley developed the back extraction technique for efficiently purifying penicillin in bulk. The chemical structure of penicillin was determined by Dorothy Crowfoot Hodgkin in 1945. Purified penicillin displayed potent antibacterial activity against a wide range o
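
The retrieved contexts above end mid-word (for example `th`) because `chunk_text` slices by raw character offsets. A sketch of a word-based sliding window that never splits a word follows. The function name `chunk_words` and the word-count sizes are assumptions chosen for illustration, not values from the notebook.

```python
def chunk_words(text, max_words=80, overlap_words=10):
    """Split text into overlapping chunks made of whole words (hypothetical helper)."""
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words")
    words = text.split()
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(" ".join(window))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Because the window is measured in words, chunk lengths in characters vary; the minimum-length filter used in `clean_chunk` could still be applied to the result.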

In [None]:
import re
import numpy as np

# === QUESTIONS AND GROUND TRUTHS ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

# === NORMALIZATION + METRICS ===
def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# === RAG QA FUNCTION ===
answers, contexts = [], []
em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for q, gt in zip(questions, ground_truths):
    ctxs = retrieve_local_context(q, k=3)
    ans = generate_gpt2_answer(q, ctxs)
    answers.append(ans)
    contexts.append(ctxs)

    em_scores.append(exact_match(ans, gt[0]))
    p, r, f1 = token_metrics(ans, gt[0])
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][0][:300] if contexts[i] else 'No context found'}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")



Q1: What is the capital city of France?
Answer: Paris. The city's name derives from the French word "paris", which means "city" or "town". It was founded by Louis XIV in 1492, and was named after his wife, Marie Antoinette, who was born in Paris and died there in 1793. It is one of Europe's oldest cities, dating back to the 12th century, when it was first settled by the Normans. Today, it is home to more than 200 million people, making it the
Ground Truth: Paris
Context snippet: Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world's capitals, has never been destroyed by catastrophe or war. In modernising its infrastructure through the centuries, Paris has preserved even its earliest history in its st...

Q2: Which British scientist discovered penicillin in 1928?
Answer: The answer is: none of them. None of the British scientists who were involved in the discovery were British. They were scientists from the Unit

## GPT2-medium with CharacterTextSplitter

In [None]:
# === IMPORTS ===
import weaviate
import pandas as pd
import numpy as np
import re
from sentence_transformers import SentenceTransformer
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
from langchain.text_splitter import CharacterTextSplitter

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "GPTQAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CHUNKING USING CharacterTextSplitter ===
chunker = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n\n"
)


all_chunks = []
for text in texts:
    chunks = chunker.split_text(text)
    # Clean chunks (remove short or empty)
    cleaned = [chunk.strip() for chunk in chunks if len(chunk.strip().split()) > 30]
    all_chunks.extend(cleaned)

print(f"Total chunks extracted: {len(all_chunks)}")

# === STEP 3: EMBEDDINGS USING sentence-transformers ===
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # fast and good quality

def get_local_embedding(text):
    return embedder.encode(text)

embeddings = [get_local_embedding(text) for text in all_chunks]

# === STEP 4: CONNECT TO WEAVIATE ===
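import os

# Credentials are read from the environment instead of being hardcoded;
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)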

# === STEP 5: CREATE SCHEMA ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD TO WEAVIATE ===
with client.batch as batch:
    batch.batch_size = 100
    for text, vector in zip(all_chunks, embeddings):
        props = {"content": text}
        batch.add_data_object(props, CLASS_NAME, vector=vector.tolist())

print("✅ Data added to Weaviate.")

# === STEP 7: LOAD GPT-2 MEDIUM MODEL & TOKENIZER ===
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

# Set pad token (important for padding & attention mask)
tokenizer.pad_token = tokenizer.eos_token

if torch.cuda.is_available():
    model.to("cuda")

# === STEP 8: DEFINE RETRIEVAL AND GENERATION FUNCTIONS ===

def retrieve_local_context(question, k=3):
    question_vector = get_local_embedding(question)
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": question_vector.tolist()}) \
        .with_limit(k).do()

    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        return [item["content"] for item in result["data"]["Get"][CLASS_NAME]]
    return []

def generate_gpt2_answer(question, contexts):
    prompt = f"Context:\n{chr(10).join(contexts)}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=924  # leave room for the 100 generated tokens within GPT-2's 1024-position limit
    )
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=inputs["input_ids"].shape[1] + 100,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = generated.split("Answer:")[-1].strip()
    return answer

# === STEP 9: EVALUATION METRICS ===

questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

answers, contexts = [], []
em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

for q, gt in zip(questions, ground_truths):
    ctxs = retrieve_local_context(q, k=3)
    ans = generate_gpt2_answer(q, ctxs)
    answers.append(ans)
    contexts.append(ctxs)

    em_scores.append(exact_match(ans, gt[0]))
    p, r, f1 = token_metrics(ans, gt[0])
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === STEP 10: PRINT RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][0][:300] if contexts[i] else 'No context found'}...")

print(f"\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Total chunks extracted: 86818
✅ Data added to Weaviate.



Q1: What is the capital city of France?
Answer: Paris is a French city with a capital of Paris. It is located in the north-eastern part of what is now France, between the Rhône River and the Mediterranean Sea. Its name derives from the French word "paris", which means "city". It was founded by Louis XIV in 1689 and became the first French capital in 1789. Since then, it has been the seat of government, commerce, finance, education, culture, science, art, literature, architecture,
Ground Truth: Paris
Context snippet: Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world's capitals, has never been destroyed by catastrophe or war. In modernising its infrastructure through the centuries, Paris has preserved even its earliest history in its st...

Q2: Which British scientist discovered penicillin in 1928?
Answer: The answer to this question depends on who you ask. If you look at the list of British scientists who hav
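
The Q1 answer above contains the ground truth `Paris` but is far longer than it, so exact match is false and token precision is low. A lenient containment check, sketched below as a possible complement to the notebook's metrics, reports whether any normalized reference appears as a contiguous run of whole tokens in the prediction. `contains_answer` is a hypothetical helper; unlike the evaluation loop, which scores only `gt[0]`, it accepts the full list of references.

```python
import re

def normalize(text):
    # Same normalization as the notebook: lowercase, non-word runs become spaces
    return re.sub(r'\W+', ' ', text.strip().lower())

def contains_answer(pred, refs):
    """True if any reference occurs as a contiguous token run in pred."""
    pred_tokens = normalize(pred).split()
    for ref in refs:
        ref_tokens = normalize(ref).split()
        n = len(ref_tokens)
        if n == 0:
            continue
        for i in range(len(pred_tokens) - n + 1):
            if pred_tokens[i:i + n] == ref_tokens:
                return True
    return False
```

Matching on whole tokens means `Parisian` does not count as containing `Paris`. The check says nothing about whether the rest of a long answer is accurate, so it is best read alongside precision rather than instead of it.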

## GPT-Neo

In [None]:
!pip install "weaviate-client>=3.26.7,<4.0.0"


In [None]:
!pip install numpy pandas sentence-transformers transformers huggingface-hub


In [None]:
# === IMPORTS ===
import weaviate
import pandas as pd
import numpy as np
import re
import torch
from sentence_transformers import SentenceTransformer
from transformers import GPTNeoForCausalLM, GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

# === CONFIG ===
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
CLASS_NAME = "GPTNeoQAChunk"

# === STEP 1: LOAD DATA ===
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# === STEP 2: CHUNKING USING RecursiveCharacterTextSplitter ===
chunker = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = []
for text in texts:
    splits = chunker.split_text(text)
    chunks.extend([chunk for chunk in splits if len(chunk.split()) > 30])

print(f"Total chunks extracted: {len(chunks)}")

# === STEP 3: EMBEDDINGS USING sentence-transformers ===
embedder = SentenceTransformer('all-MiniLM-L6-v2')

def get_local_embedding(text):
    return embedder.encode(text)

embeddings = [get_local_embedding(text) for text in chunks]

# === STEP 4: CONNECT TO WEAVIATE ===
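import os

# Credentials are read from the environment instead of being hardcoded;
# WEAVIATE_USERNAME / WEAVIATE_PASSWORD are assumed variable names.
client = weaviate.Client(
    url=WEAVIATE_URL,
    auth_client_secret=weaviate.AuthClientPassword(
        username=os.environ["WEAVIATE_USERNAME"],
        password=os.environ["WEAVIATE_PASSWORD"]
    )
)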

# === STEP 5: CREATE SCHEMA ===
if client.schema.exists(CLASS_NAME):
    client.schema.delete_class(CLASS_NAME)

schema = {
    "class": CLASS_NAME,
    "vectorizer": "none",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(schema)

# === STEP 6: ADD TO WEAVIATE IN SMALLER BATCHES ===
from tqdm import tqdm
import time

batch_size = 1000  # Upload in slices of 1000 objects to avoid memory or timeout issues
total_chunks = len(chunks)

print(f"Starting upload of {total_chunks} chunks in batches of {batch_size}...")

for i in tqdm(range(0, total_chunks, batch_size)):
    batch_chunks = chunks[i:i + batch_size]
    batch_embeddings = embeddings[i:i + batch_size]

    with client.batch as batch:
        batch.batch_size = batch_size
        for text, vector in zip(batch_chunks, batch_embeddings):
            props = {"content": text}
            batch.add_data_object(props, CLASS_NAME, vector=vector.tolist())

    time.sleep(0.1)  # Optional: throttle to prevent overloading Weaviate

print("✅ All data uploaded in batches.")


# === STEP 7: LOAD GPT-NEO ===
tokenizer = GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer.pad_token = tokenizer.eos_token
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.eval()
if torch.cuda.is_available():
    model.to("cuda")

# === RETRIEVER + ANSWER GENERATOR ===
def retrieve_local_context(question, k=3):
    question_vector = get_local_embedding(question)
    result = client.query.get(CLASS_NAME, ["content"]) \
        .with_near_vector({"vector": question_vector.tolist()}) \
        .with_limit(k).do()
    if "data" in result and "Get" in result["data"] and CLASS_NAME in result["data"]["Get"]:
        return [item["content"] for item in result["data"]["Get"][CLASS_NAME]]
    return []

def generate_gptneo_answer(question, contexts):
    prompt = f"Context:\n{chr(10).join(contexts)}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # cap generated tokens only, independent of prompt length
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated.split("Answer:")[-1].strip()

# === EVALUATION ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

answers, contexts = [], []
em_scores, f1_scores, prec_scores, recall_scores = [], [], [], []

def normalize(text):
    return re.sub(r'\W+', ' ', text.strip().lower())

def exact_match(pred, ref):
    return normalize(pred) == normalize(ref)

def token_metrics(pred, ref):
    pred_tokens = normalize(pred).split()
    ref_tokens = normalize(ref).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    common = set(pred_tokens) & set(ref_tokens)
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

for q, gt in zip(questions, ground_truths):
    ctxs = retrieve_local_context(q, k=3)
    ans = generate_gptneo_answer(q, ctxs)
    answers.append(ans)
    contexts.append(ctxs)
    em_scores.append(exact_match(ans, gt[0]))
    p, r, f1 = token_metrics(ans, gt[0])
    prec_scores.append(p)
    recall_scores.append(r)
    f1_scores.append(f1)

# === RESULTS ===
for i, q in enumerate(questions):
    print(f"\nQ{i+1}: {q}")
    print(f"Answer: {answers[i]}")
    print(f"Ground Truth: {ground_truths[i][0]}")
    print(f"Context snippet: {contexts[i][0][:300] if contexts[i] else 'No context found'}...")

print(f"\n\U0001F4CA EVALUATION METRICS:")
print(f"✅ Exact Match: {np.mean(em_scores):.2f}")
print(f"✅ Precision:   {np.mean(prec_scores):.2f}")
print(f"✅ Recall:      {np.mean(recall_scores):.2f}")
print(f"✅ F1 Score:    {np.mean(f1_scores):.2f}")


Total chunks extracted: 151405


Starting upload of 151405 chunks in batches of 1000...


100%|██████████| 152/152 [06:53<00:00,  2.72s/it]


✅ All data uploaded in batches.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Q1: What is the capital city of France?
Answer: The capital is Paris, the most populous city in France, with a population of over 1.3 million people. It is also the seat of government for the entire country, as well as the country’s largest city and largest metropolitan area. In addition, Paris is home to a number of world-renowned cultural institutions, including the Louvre Museum, Centre Pompidou, École des Beaux-Arts, Musée d'Orsay and Centre Georges
Ground Truth: Paris
Context snippet: France which lies on a nearly precisely identical latitude across the Atlantic on the French western coast. The city is the largest in the province and the second largest in the Atlantic Provinces after Halifax, Nova Scotia. Its downtown area lies to the west and north of St. John's Harbour, and the...

Q2: Which British scientist discovered penicillin in 1928?
Answer: Sir Alexander Fleming (1881–1955)
Background: Fleming was born in London, the son of a surgeon. He studied medicine at St Bartholome
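
The context snippet printed for Q1 describes a different city rather than Paris, which suggests the answer may have come from the model's own knowledge and not from the retrieved chunks. One way to surface such misses is to score each retrieved chunk against the question and drop weak matches before generation. The following is a minimal sketch in plain Python; `cosine_similarity`, `keep_relevant`, and the `min_similarity=0.4` threshold are hypothetical names and values chosen for illustration, not part of the pipeline above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keep_relevant(question_vec, chunk_vecs, chunks, min_similarity=0.4):
    """Return (chunk, score) pairs whose similarity to the question meets the threshold.

    The 0.4 default is an arbitrary illustrative value and would need tuning.
    """
    scored = [
        (chunk, cosine_similarity(question_vec, vec))
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    return [(chunk, score) for chunk, score in scored if score >= min_similarity]
```

If `keep_relevant` returns an empty list, the pipeline could report "no relevant context" instead of letting the generator answer from memory, which would make retrieval failures visible in the metrics.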

In [None]:
import re

# === Inputs ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?"
]
ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]
predictions = [
    "The capital is Paris, the most populous city in France, with a population of over 1.3 million people...",
    "Sir Alexander Fleming (1881–1955)",
    "The collapse occurred in December 1991."
]

# === Normalization function ===
def normalize(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    return text.strip()

# === Token F1 score calculation ===
def f1_score_tokens(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = set(pred_tokens) & set(gt_tokens)

    if len(common) == 0:
        return 0.0, 0.0, 0.0

    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# === Evaluation loop ===
exact_matches = 0
total_precision = 0
total_recall = 0
total_f1 = 0

for pred, gts in zip(predictions, ground_truths):
    best_em = 0
    best_p, best_r, best_f1 = 0, 0, 0
    for gt in gts:
        # Exact match
        if normalize(pred) == normalize(gt):
            best_em = 1
        # F1
        p, r, f1 = f1_score_tokens(pred, gt)
        if f1 > best_f1:
            best_p, best_r, best_f1 = p, r, f1

    exact_matches += best_em
    total_precision += best_p
    total_recall += best_r
    total_f1 += best_f1

# === Final Scores ===
n = len(predictions)
print("\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {exact_matches / n:.2f}")
print(f"✅ Precision:   {total_precision / n:.2f}")
print(f"✅ Recall:      {total_recall / n:.2f}")
print(f"✅ F1 Score:    {total_f1 / n:.2f}")



📊 EVALUATION METRICS:
✅ Exact Match: 0.00
✅ Precision:   0.21
✅ Recall:      1.00
✅ F1 Score:    0.32


In [None]:
import re

# === SAMPLE INPUTS ===
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?"
]

predictions = [
    "The capital is Paris, the most populous city in France...",
    "Sir Alexander Fleming (1881–1955)",
    "The collapse occurred in December 1991."
]

ground_truths = [
    ["Paris"],
    ["Alexander Fleming"],
    ["1991"]
]

# === NORMALIZATION ===
def normalize(text):
    return re.sub(r'\W+', ' ', text.lower()).strip()

# === METRICS INIT ===
total = len(questions)
exact_match = 0
total_precision = 0
total_recall = 0
total_f1 = 0

for pred, gts in zip(predictions, ground_truths):
    pred_norm = normalize(pred)
    best_em = 0
    best_prec = 0
    best_rec = 0
    best_f1 = 0

    for gt in gts:
        gt_norm = normalize(gt)

        # Lenient match: the normalized ground truth appears somewhere in the
        # prediction. This is a containment check, not a strict exact match.
        if gt_norm in pred_norm:
            best_em = 1

        pred_tokens = pred_norm.split()
        gt_tokens = gt_norm.split()

        common = set(pred_tokens) & set(gt_tokens)
        if not common:
            continue

        precision = len(common) / len(pred_tokens)
        recall = len(common) / len(gt_tokens)
        if precision + recall == 0:
            f1 = 0
        else:
            f1 = 2 * (precision * recall) / (precision + recall)

        if f1 > best_f1:
            best_f1 = f1
            best_prec = precision
            best_rec = recall

    exact_match += best_em
    total_precision += best_prec
    total_recall += best_rec
    total_f1 += best_f1

# === FINAL SCORES ===
exact_match_score = exact_match / total
precision_score_avg = total_precision / total
recall_score_avg = total_recall / total
f1_score_avg = total_f1 / total

# === PRINT RESULTS ===
print("\n📊 EVALUATION METRICS:")
print(f"✅ Exact Match: {exact_match_score:.2f}")
print(f"✅ Precision:   {precision_score_avg:.2f}")
print(f"✅ Recall:      {recall_score_avg:.2f}")
print(f"✅ F1 Score:    {f1_score_avg:.2f}")



📊 EVALUATION METRICS:
✅ Exact Match: 1.00
✅ Precision:   0.22
✅ Recall:      1.00
✅ F1 Score:    0.35
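
The two evaluation cells above report different Exact Match scores (0.00 and 1.00) for the same predictions because the second one accepts any prediction that contains the answer. To keep the two notions apart, they can be computed side by side. The sketch below assumes SQuAD-style scoring is the goal and counts token overlap as a multiset (via `Counter`) instead of a set; the function names `token_prf`, `strict_em`, and `contains_answer` are hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse every run of non-word characters into one space."""
    return re.sub(r"\W+", " ", text.lower()).strip()

def token_prf(prediction, reference):
    """Token-level precision, recall, and F1 using multiset overlap."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)

def strict_em(prediction, reference):
    """True only when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)

def contains_answer(prediction, reference):
    """True when the normalized reference occurs inside the normalized prediction."""
    return normalize(reference) in normalize(prediction)
```

Reporting `strict_em` and `contains_answer` as separate columns makes it clear whether a model is verbose but correct (low strict score, high containment) or simply wrong. Note that `contains_answer` is a character-level substring test, so it can match inside a longer word.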


## RAG PIPELINE for GPT3.5 and Weaviate version 4

In [5]:
!pip uninstall weaviate-client

Found existing installation: weaviate-client 3.26.7
Uninstalling weaviate-client-3.26.7:
  Would remove:
    /usr/local/lib/python3.11/dist-packages/weaviate/*
    /usr/local/lib/python3.11/dist-packages/weaviate_client-3.26.7.dist-info/*
Proceed (Y/n)? Y
  Successfully uninstalled weaviate-client-3.26.7


In [6]:
!pip install "weaviate-client>=4.15.0" openai pandas numpy sentence-transformers langchain scikit-learn tqdm

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.18.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.31.1 which is incompatible.
ydf 0.12.0 requires protobuf<6.0.0,>=5.29.1, but you have protobuf 6.31.1 which is incompatible.
tensorflow-metadata 1.17.1 requires protobuf<6.0.0,>=4.25.2; python_version >= "3.11", but you have protobuf 6.31.1 which is incompatible.
google-ai-generativelanguage 0.6.15 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.2, but you have protobuf 6.31.1 which is incompatible.
grpcio-status 1.71.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 6.31.1 which is incompatible.

In [None]:
import os
from openai import OpenAI
import pandas as pd
import numpy as np
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery
import re
import time
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# -------- CONFIG --------
# Initialize OpenAI client (v1.0.0+ format)
# Read the API key from the environment; never hard-code secrets in a notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Create one at https://platform.openai.com/account/api-keys

openai_client = OpenAI(
    api_key=OPENAI_API_KEY
)
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
# Weaviate cloud instance credentials
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
COLLECTION_NAME = "RAGChunks"

# -------- LOAD & CHUNK DATA --------
print("Loading CSV and chunking...")
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# Reduce chunk size for better performance
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = []
for text in texts[:1000]:  # Start with first 1000 texts for testing
    splits = splitter.split_text(text)
    chunks.extend([chunk for chunk in splits if len(chunk.split()) > 20])
print(f"Total chunks: {len(chunks)}")

# -------- EMBEDDING & WEAVIATE V4 SETUP --------
print("Setting up embedding model and Weaviate v4 client...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Updated V4 client configuration for 4.15.0 with username/password auth
# This Weaviate instance uses username/password authentication, not an API key
WEAVIATE_USERNAME = os.environ["WEAVIATE_USERNAME"]
WEAVIATE_PASSWORD = os.environ["WEAVIATE_PASSWORD"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WEAVIATE_URL,
    auth_credentials=weaviate.auth.AuthClientPassword(
        username=WEAVIATE_USERNAME,
        password=WEAVIATE_PASSWORD
    ),
    headers={
        "X-OpenAI-Api-Key": openai_client.api_key
    }
)

try:
    # Check if client is ready
    print(f"Weaviate is ready: {client.is_ready()}")
    logger.info("Successfully connected to Weaviate v4")

    # Delete existing collection if exists
    if client.collections.exists(COLLECTION_NAME):
        client.collections.delete(COLLECTION_NAME)
        logger.info(f"Deleted existing collection: {COLLECTION_NAME}")
        time.sleep(2)

    # Create collection with updated v4 API
    collection = client.collections.create(
        name=COLLECTION_NAME,
        vectorizer_config=Configure.Vectorizer.none(),  # We provide our own vectors
        properties=[
            Property(
                name="content",
                data_type=DataType.TEXT,
                description="The text content of the chunk"
            ),
            Property(
                name="chunk_id",
                data_type=DataType.INT,
                description="Unique identifier for the chunk"
            )
        ]
    )
    logger.info(f"Created collection: {COLLECTION_NAME}")

except Exception as e:
    logger.error(f"Failed to setup Weaviate: {e}")
    raise

# -------- IMPROVED BATCH UPLOAD WITH V4.15.0 API --------
def upload_chunks_v4_optimized(chunks, embedder, client, collection_name, batch_size=100):
    """Upload chunks to Weaviate v4.15.0 with optimized batching"""

    print(f"Uploading {len(chunks)} chunks to Weaviate v4.15.0...")
    collection = client.collections.get(collection_name)

    successful_uploads = 0
    failed_uploads = 0

    # Use the new optimized batch context manager
    with client.batch.dynamic() as batch:
        for i, chunk in enumerate(tqdm(chunks)):
            # Generate embedding
            vector = embedder.encode(chunk).tolist()

            # Add object to batch
            batch.add_object(
                collection=collection_name,
                properties={
                    "content": chunk,
                    "chunk_id": i
                },
                vector=vector
            )

            # batch.number_errors is a running total, so it is not accumulated
            # per object; failures are counted once after the batch has flushed

    # Final count
    failed_uploads = len(client.batch.failed_objects)
    successful_uploads = len(chunks) - failed_uploads

    logger.info(f"Upload complete. Successful: {successful_uploads}, Failed: {failed_uploads}")
    return successful_uploads, failed_uploads

# Alternative batch method for more control
def upload_chunks_manual_batch(chunks, embedder, client, collection_name, batch_size=50):
    """Manual batching approach for fine-grained control"""

    print(f"Uploading {len(chunks)} chunks with manual batching...")
    collection = client.collections.get(collection_name)

    successful_uploads = 0
    failed_uploads = 0

    for i in tqdm(range(0, len(chunks), batch_size)):
        batch_chunks = chunks[i:i+batch_size]

        try:
            # Prepare batch data
            batch_data = []
            vectors = embedder.encode(batch_chunks, show_progress_bar=False)

            for j, (text, vector) in enumerate(zip(batch_chunks, vectors)):
                batch_data.append({
                    "properties": {
                        "content": text,
                        "chunk_id": i + j
                    },
                    "vector": vector.tolist()
                })

            # Insert batch
            response = collection.data.insert_many(batch_data)

            # Check for errors
            if response.errors:
                error_count = len(response.errors)
                logger.warning(f"Batch {i//batch_size + 1}: {error_count} errors")
                successful_uploads += (len(batch_data) - error_count)
                failed_uploads += error_count
            else:
                successful_uploads += len(batch_data)

        except Exception as e:
            logger.error(f"Batch {i//batch_size + 1} failed: {e}")
            failed_uploads += len(batch_chunks)

    logger.info(f"Upload complete. Successful: {successful_uploads}, Failed: {failed_uploads}")
    return successful_uploads, failed_uploads

# Use the optimized batch method
successful, failed = upload_chunks_v4_optimized(chunks, embedder, client, COLLECTION_NAME)

# -------- RETRIEVE + GENERATE ANSWER (V4.15.0 API) --------
def get_embedding(text):
    return embedder.encode(text)

def retrieve_context_v4(question, k=3):
    """Retrieve relevant context using v4.15.0 API"""
    try:
        collection = client.collections.get(COLLECTION_NAME)
        vector = get_embedding(question).tolist()

        # V4.15.0 query API
        response = collection.query.near_vector(
            near_vector=vector,
            limit=k,
            return_metadata=MetadataQuery(score=True, distance=True)
        )

        # Extract content from results
        contexts = []
        for obj in response.objects:
            contexts.append(obj.properties['content'])

        return contexts

    except Exception as e:
        logger.error(f"Error retrieving context: {e}")
        return []


def generate_gpt3_answer(contexts, question):
    if not contexts:
        return "Sorry, I couldn't find relevant context to answer your question."

    try:
        prompt = f"""Answer the question using the context below. Give only the direct answer without explanation.

Context:
{chr(10).join(contexts)}

Question: {question}
Direct Answer:"""

        response = openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=50  # Reduced to encourage shorter answers
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        logger.error(f"Error generating answer: {e}")
        return "Sorry, I encountered an error while generating the answer."

# -------- EVALUATION --------
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?"
]
ground_truths = ["Paris", "Alexander Fleming", "1991"]

def normalize(text):
    return re.sub(r'\W+', ' ', text.lower().strip())

def token_metrics(pred, ref):
    pred_tokens, ref_tokens = normalize(pred).split(), normalize(ref).split()
    common = set(pred_tokens) & set(ref_tokens)
    p = len(common) / len(pred_tokens) if pred_tokens else 0
    r = len(common) / len(ref_tokens) if ref_tokens else 0
    f1 = 2 * p * r / (p + r) if p + r else 0
    return p, r, f1

def run_evaluation_v4():
    """Run evaluation with v4.15.0 API"""
    ems, precisions, recalls, f1s = [], [], [], []

    print("\nRunning evaluation...")
    for q, gt in zip(questions, ground_truths):
        try:
            ctxs = retrieve_context_v4(q)
            pred = generate_gpt3_answer(ctxs, q)
            em = normalize(pred) == normalize(gt)
            p, r, f1 = token_metrics(pred, gt)

            ems.append(em)
            precisions.append(p)
            recalls.append(r)
            f1s.append(f1)

            print(f"\nQuestion: {q}")
            print(f"Prediction: {pred}")
            print(f"Ground Truth: {gt}")
            print(f"Exact Match: {em}, Precision: {p:.2f}, Recall: {r:.2f}, F1: {f1:.2f}")

        except Exception as e:
            logger.error(f"Error evaluating question '{q}': {e}")
            ems.append(False)
            precisions.append(0)
            recalls.append(0)
            f1s.append(0)

    print("\nOverall Evaluation Metrics:")
    print(f"Exact Match: {np.mean(ems):.2f}")
    print(f"Precision: {np.mean(precisions):.2f}")
    print(f"Recall: {np.mean(recalls):.2f}")
    print(f"F1 Score: {np.mean(f1s):.2f}")

    return {
        'exact_match': np.mean(ems),
        'precision': np.mean(precisions),
        'recall': np.mean(recalls),
        'f1': np.mean(f1s)
    }

# -------- UTILITY FUNCTIONS (V4.15.0) --------
def check_weaviate_health_v4():
    """Check if Weaviate v4.15.0 is healthy"""
    try:
        ready = client.is_ready()
        live = client.is_live()
        print(f"Weaviate ready: {ready}, live: {live}")
        return ready and live
    except Exception as e:
        print(f"Weaviate health check failed: {e}")
        return False

def get_collection_info_v4():
    """Get information about the created collection"""
    try:
        collection = client.collections.get(COLLECTION_NAME)

        # Get collection configuration
        config = collection.config.get()
        print(f"Collection name: {config.name}")

        # Get object count using aggregate
        result = collection.aggregate.over_all(total_count=True)
        count = result.total_count
        print(f"Number of objects in {COLLECTION_NAME}: {count}")

        return config, count
    except Exception as e:
        print(f"Error getting collection info: {e}")
        return None, 0

# Run evaluation if upload was successful
if successful > 0:
    metrics = run_evaluation_v4()
else:
    print("Upload failed. Cannot run evaluation.")

# Check system status
print("\n" + "="*50)
print("SYSTEM STATUS CHECK")
print("="*50)
check_weaviate_health_v4()
get_collection_info_v4()

# Important: Close the client connection
try:
    client.close()
    print("Client connection closed successfully")
except Exception as e:
    print(f"Error closing client: {e}")

Loading CSV and chunking...
Total chunks: 3248
Setting up embedding model and Weaviate v4 client...
Weaviate is ready: True
Uploading 3248 chunks to Weaviate v4.15.0...


100%|██████████| 3248/3248 [00:24<00:00, 134.37it/s]



Running evaluation...

Question: What is the capital city of France?
Prediction: Paris
Ground Truth: Paris
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: Which British scientist discovered penicillin in 1928?
Prediction: Alexander Fleming
Ground Truth: Alexander Fleming
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: What year did the Soviet Union collapse, marking the end of the Cold War?
Prediction: 1991
Ground Truth: 1991
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Overall Evaluation Metrics:
Exact Match: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00

SYSTEM STATUS CHECK
Weaviate ready: True, live: True
Collection name: RAGChunks
Number of objects in RAGChunks: 3248
Client connection closed successfully
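
The three questions scored above are general-knowledge questions, so a perfect score here may say little about retrieval quality. Drawing evaluation pairs from the dataset's own `question` and `answers` columns would tie the test to the indexed text. The sketch below assumes the `answers` column is stored as a string shaped like the `df.head()` preview (`{'text': array(['...'], ...`); the helper names `extract_answer_texts` and `build_eval_pairs` are hypothetical, and the regular expressions would need adjusting if the real strings differ.

```python
import re

# Matches the contents of the first array after the 'text' key
_TEXT_ARRAY = re.compile(r"'text':\s*array\(\[(.*?)\]", re.DOTALL)
# Matches either a single-quoted or a double-quoted string
_QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")

def extract_answer_texts(raw):
    """Pull the answer strings out of a stringified SQuAD answers cell."""
    match = _TEXT_ARRAY.search(raw)
    if match is None:
        return []
    return [single or double for single, double in _QUOTED.findall(match.group(1))]

def build_eval_pairs(rows, limit=50):
    """Collect up to `limit` (question, [answers]) pairs, skipping unparseable rows."""
    pairs = []
    for question, raw_answers in rows:
        texts = extract_answer_texts(raw_answers)
        if texts:
            pairs.append((question, texts))
        if len(pairs) >= limit:
            break
    return pairs
```

With the DataFrame loaded earlier, something like `build_eval_pairs(zip(df['question'], df['answers']))` would yield pairs in the same shape as the `questions` and `ground_truths` lists. Since only the first 1000 contexts are indexed in this cell, the rows would need to be restricted to match.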


## GPT3.5 with more questions

In [12]:
!pip uninstall weaviate-client

Found existing installation: weaviate-client 4.15.0
Uninstalling weaviate-client-4.15.0:
  Would remove:
    /usr/local/lib/python3.11/dist-packages/weaviate/*
    /usr/local/lib/python3.11/dist-packages/weaviate_client-4.15.0.dist-info/*
Proceed (Y/n)? Y
  Successfully uninstalled weaviate-client-4.15.0


In [3]:
!pip install "weaviate-client>=4.15.0" openai pandas numpy sentence-transformers langchain scikit-learn tqdm

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.18.0 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.31.1 which is incompatible.
ydf 0.12.0 requires protobuf<6.0.0,>=5.29.1, but you have protobuf 6.31.1 which is incompatible.
tensorflow-metadata 1.17.1 requires protobuf<6.0.0,>=4.25.2; python_version >= "3.11", but you have protobuf 6.31.1 which is incompatible.
google-ai-generativelanguage 0.6.15 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.2, but you have protobuf 6.31.1 which is incompatible.
grpcio-status 1.71.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 6.31.1 which is incompatible.

In [29]:
import os
from openai import OpenAI
import pandas as pd
import numpy as np
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery
import re
import time
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# -------- CONFIG --------
# Initialize OpenAI client (v1.0.0+ format)
# Read the API key from the environment; never hard-code secrets in a notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Create one at https://platform.openai.com/account/api-keys

openai_client = OpenAI(
    api_key=OPENAI_API_KEY
)
WEAVIATE_URL = "https://sagdo76qtw2urrz9seupg.c0.us-east1.gcp.weaviate.cloud"
# Weaviate cloud instance credentials
CSV_PATH = "/content/drive/MyDrive/RAG_project/train.csv"
COLLECTION_NAME = "RAGChunks"

# -------- LOAD & CHUNK DATA --------
print("Loading CSV and chunking...")
df = pd.read_csv(CSV_PATH)
texts = df['context'].dropna().tolist()

# Reduce chunk size for better performance
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = []
for text in texts[:1000]:  # Start with first 1000 texts for testing
    splits = splitter.split_text(text)
    chunks.extend([chunk for chunk in splits if len(chunk.split()) > 20])
print(f"Total chunks: {len(chunks)}")

# -------- EMBEDDING & WEAVIATE V4 SETUP --------
print("Setting up embedding model and Weaviate v4 client...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Updated V4 client configuration for 4.15.0 with username/password auth
# This Weaviate instance uses username/password authentication, not an API key
WEAVIATE_USERNAME = os.environ["WEAVIATE_USERNAME"]
WEAVIATE_PASSWORD = os.environ["WEAVIATE_PASSWORD"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WEAVIATE_URL,
    auth_credentials=weaviate.auth.AuthClientPassword(
        username=WEAVIATE_USERNAME,
        password=WEAVIATE_PASSWORD
    ),
    headers={
        "X-OpenAI-Api-Key": openai_client.api_key
    }
)

try:
    # Check if client is ready
    print(f"Weaviate is ready: {client.is_ready()}")
    logger.info("Successfully connected to Weaviate v4")

    # Delete existing collection if exists
    if client.collections.exists(COLLECTION_NAME):
        client.collections.delete(COLLECTION_NAME)
        logger.info(f"Deleted existing collection: {COLLECTION_NAME}")
        time.sleep(2)

    # Create collection with updated v4 API
    collection = client.collections.create(
        name=COLLECTION_NAME,
        vectorizer_config=Configure.Vectorizer.none(),  # We provide our own vectors
        properties=[
            Property(
                name="content",
                data_type=DataType.TEXT,
                description="The text content of the chunk"
            ),
            Property(
                name="chunk_id",
                data_type=DataType.INT,
                description="Unique identifier for the chunk"
            )
        ]
    )
    logger.info(f"Created collection: {COLLECTION_NAME}")

except Exception as e:
    logger.error(f"Failed to setup Weaviate: {e}")
    raise

# -------- IMPROVED BATCH UPLOAD WITH V4.15.0 API --------
def upload_chunks_v4_optimized(chunks, embedder, client, collection_name, batch_size=100):
    """Upload chunks to Weaviate v4.15.0 with optimized batching"""

    print(f"Uploading {len(chunks)} chunks to Weaviate v4.15.0...")
    collection = client.collections.get(collection_name)

    successful_uploads = 0
    failed_uploads = 0

    # Use the new optimized batch context manager
    with client.batch.dynamic() as batch:
        for i, chunk in enumerate(tqdm(chunks)):
            # Generate embedding
            vector = embedder.encode(chunk).tolist()

            # Add object to batch
            batch.add_object(
                collection=collection_name,
                properties={
                    "content": chunk,
                    "chunk_id": i
                },
                vector=vector
            )

            # batch.number_errors is a running total, so it is not accumulated
            # per object; failures are counted once after the batch has flushed

    # Final count
    failed_uploads = len(client.batch.failed_objects)
    successful_uploads = len(chunks) - failed_uploads

    logger.info(f"Upload complete. Successful: {successful_uploads}, Failed: {failed_uploads}")
    return successful_uploads, failed_uploads

# Alternative batch method for more control
def upload_chunks_manual_batch(chunks, embedder, client, collection_name, batch_size=50):
    """Manual batching approach for fine-grained control"""

    print(f"Uploading {len(chunks)} chunks with manual batching...")
    collection = client.collections.get(collection_name)

    successful_uploads = 0
    failed_uploads = 0

    for i in tqdm(range(0, len(chunks), batch_size)):
        batch_chunks = chunks[i:i+batch_size]

        try:
            # Prepare batch data
            batch_data = []
            vectors = embedder.encode(batch_chunks, show_progress_bar=False)

            for j, (text, vector) in enumerate(zip(batch_chunks, vectors)):
                batch_data.append({
                    "properties": {
                        "content": text,
                        "chunk_id": i + j
                    },
                    "vector": vector.tolist()
                })

            # Insert batch
            response = collection.data.insert_many(batch_data)

            # Check for errors
            if response.errors:
                error_count = len(response.errors)
                logger.warning(f"Batch {i//batch_size + 1}: {error_count} errors")
                successful_uploads += (len(batch_data) - error_count)
                failed_uploads += error_count
            else:
                successful_uploads += len(batch_data)

        except Exception as e:
            logger.error(f"Batch {i//batch_size + 1} failed: {e}")
            failed_uploads += len(batch_chunks)

    logger.info(f"Upload complete. Successful: {successful_uploads}, Failed: {failed_uploads}")
    return successful_uploads, failed_uploads

# Use the optimized batch method
successful, failed = upload_chunks_v4_optimized(chunks, embedder, client, COLLECTION_NAME)

# -------- RETRIEVE + GENERATE ANSWER (V4.15.0 API) --------
def get_embedding(text):
    return embedder.encode(text)

def retrieve_context_v4(question, k=3):
    """Retrieve relevant context using v4.15.0 API"""
    try:
        collection = client.collections.get(COLLECTION_NAME)
        vector = get_embedding(question).tolist()

        # V4.15.0 query API
        response = collection.query.near_vector(
            near_vector=vector,
            limit=k,
            return_metadata=MetadataQuery(score=True, distance=True)
        )

        # Extract content from results
        contexts = []
        for obj in response.objects:
            contexts.append(obj.properties['content'])

        return contexts

    except Exception as e:
        logger.error(f"Error retrieving context: {e}")
        return []


def generate_gpt3_answer(contexts, question):
    if not contexts:
        return "Sorry, I couldn't find relevant context to answer your question."

    try:
        prompt = f"""Answer the question using the context below. Give only the short direct answer without explanation.

Context:
{chr(10).join(contexts)}

Question: {question}
Direct Answer:"""

        response = openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=50  # Reduced to encourage shorter answers
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        logger.error(f"Error generating answer: {e}")
        return "Sorry, I encountered an error while generating the answer."

# -------- EVALUATION --------
questions = [
    "What is the capital city of France?",
    "Which British scientist discovered penicillin in 1928?",
    "What year did the Soviet Union collapse, marking the end of the Cold War?",
    "Which school at Notre Dame was established in 1921?",
    "In what year was the College of Science at Notre Dame founded?",
    "Which building is the center of the College of Arts and Letters?",
    "Who founded the University of Notre Dame?",
    "Who designed the Basilica of the Sacred Heart at Notre Dame?",
    "Which saint is the golden statue atop the Main Building modeled after?",
    "What color is the dome at the University of Notre Dame?",
]

# One reference answer per question, in the same order as `questions`
ground_truths = [
    "Paris",
    "Alexander Fleming",
    "1991",
    "College of Commerce",
    "1865",
    "O'Shaughnessy Hall",
    "Father Edward Sorin",
    "Fr. Sorin",
    "The Virgin Mary",
    "gold",
]



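def normalize(text):
    """Lowercase, replace runs of non-word characters with a space, then trim."""
    # Strip AFTER the substitution: otherwise "Paris." becomes "paris " and
    # fails the exact-match comparison against "paris".
    return re.sub(r'\W+', ' ', text.lower()).strip()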

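from collections import Counter  # multiset overlap for token-level metrics

def token_metrics(pred, ref):
    """Token-level precision, recall, and F1 between a prediction and a reference."""
    pred_tokens, ref_tokens = normalize(pred).split(), normalize(ref).split()
    # Count shared tokens with multiplicity so repeated tokens are not undercounted
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    p = common / len(pred_tokens) if pred_tokens else 0
    r = common / len(ref_tokens) if ref_tokens else 0
    f1 = 2 * p * r / (p + r) if p + r else 0
    return p, r, f1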
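
# Optional, illustrative sketch (not part of the original pipeline): the official
# SQuAD evaluation script also strips punctuation and drops the articles "a",
# "an", and "the" before comparing answers, so "The Virgin Mary" and
# "Virgin Mary" would count as an exact match. The name squad_normalize is
# hypothetical and the evaluation below does not call it; using it in place of
# normalize() is an assumption about how a SQuAD-style comparison might look.
import re
import string

def squad_normalize(text):
    """Lowercase, remove punctuation, remove articles, collapse whitespace."""
    text = text.lower()
    # Remove ASCII punctuation characters outright (no replacement space)
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    # Drop standalone English articles
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse any run of whitespace into a single space and trim the ends
    return " ".join(text.split())

# Possible usage (assumption): em = squad_normalize(pred) == squad_normalize(gt)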

def run_evaluation_v4():
    """Run evaluation with v4.15.0 API"""
    ems, precisions, recalls, f1s = [], [], [], []

    print("\nRunning evaluation...")
    for q, gt in zip(questions, ground_truths):
        try:
            ctxs = retrieve_context_v4(q)
            pred = generate_gpt3_answer(ctxs, q)
            em = normalize(pred) == normalize(gt)
            p, r, f1 = token_metrics(pred, gt)

            ems.append(em)
            precisions.append(p)
            recalls.append(r)
            f1s.append(f1)

            print(f"\nQuestion: {q}")
            print(f"Prediction: {pred}")
            print(f"Ground Truth: {gt}")
            print(f"Exact Match: {em}, Precision: {p:.2f}, Recall: {r:.2f}, F1: {f1:.2f}")

        except Exception as e:
            logger.error(f"Error evaluating question '{q}': {e}")
            ems.append(False)
            precisions.append(0)
            recalls.append(0)
            f1s.append(0)

    print("\nOverall Evaluation Metrics:")
    print(f"Exact Match: {np.mean(ems):.2f}")
    print(f"Precision: {np.mean(precisions):.2f}")
    print(f"Recall: {np.mean(recalls):.2f}")
    print(f"F1 Score: {np.mean(f1s):.2f}")

    return {
        'exact_match': np.mean(ems),
        'precision': np.mean(precisions),
        'recall': np.mean(recalls),
        'f1': np.mean(f1s)
    }

# -------- UTILITY FUNCTIONS (V4.15.0) --------
def check_weaviate_health_v4():
    """Check if Weaviate v4.15.0 is healthy"""
    try:
        ready = client.is_ready()
        live = client.is_live()
        print(f"Weaviate ready: {ready}, live: {live}")
        return ready and live
    except Exception as e:
        print(f"Weaviate health check failed: {e}")
        return False

def get_collection_info_v4():
    """Get information about the created collection"""
    try:
        collection = client.collections.get(COLLECTION_NAME)

        # Get collection configuration
        config = collection.config.get()
        print(f"Collection name: {config.name}")

        # Get object count using aggregate
        result = collection.aggregate.over_all(total_count=True)
        count = result.total_count
        print(f"Number of objects in {COLLECTION_NAME}: {count}")

        return config, count
    except Exception as e:
        print(f"Error getting collection info: {e}")
        return None, 0

# Run evaluation if upload was successful
if successful > 0:
    metrics = run_evaluation_v4()
else:
    print("Upload failed. Cannot run evaluation.")

# Check system status
print("\n" + "="*50)
print("SYSTEM STATUS CHECK")
print("="*50)
check_weaviate_health_v4()
get_collection_info_v4()

# Important: Close the client connection
try:
    client.close()
    print("Client connection closed successfully")
except Exception as e:
    print(f"Error closing client: {e}")

Loading CSV and chunking...
Total chunks: 3248
Setting up embedding model and Weaviate v4 client...
Weaviate is ready: True
Uploading 3248 chunks to Weaviate v4.15.0...


100%|██████████| 3248/3248 [00:24<00:00, 131.60it/s]



Running evaluation...

Question: What is the capital city of France?
Prediction: Paris
Ground Truth: Paris
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: Which British scientist discovered penicillin in 1928?
Prediction: Alexander Fleming
Ground Truth: Alexander Fleming
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: What year did the Soviet Union collapse, marking the end of the Cold War?
Prediction: 1991
Ground Truth: 1991
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: Which school at Notre Dame was established in 1921?
Prediction: College of Commerce
Ground Truth: College of Commerce
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: In what year was the College of Science at Notre Dame founded?
Prediction: 1865
Ground Truth: 1865
Exact Match: True, Precision: 1.00, Recall: 1.00, F1: 1.00

Question: Which building is the center of the College of Arts and Letters?
Prediction: O'Shaughnessy Hall
Gr