# RAG-Powered Q&A System Notebook

This notebook demonstrates a simple Retrieval-Augmented Generation (RAG)-style question-answering system built only with open-source tools. You can run it cell by cell; no paid APIs or cloud services are required, although the QA model is downloaded from the Hugging Face Hub the first time it is loaded.

Key steps:
1. Define a small knowledge corpus.
2. Build a TF-IDF retriever to find relevant passages.
3. Use a Hugging Face extractive QA model to pull the answer out of the retrieved context.

---

## 1. Install dependencies

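You need `transformers`, `scikit-learn`, and a model backend such as PyTorch (`torch`), which the `transformers` pipeline relies on to run the model. Run this cell once:


In [1]:
!pip install transformers scikit-learn torch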






## 2. Import libraries

Import the necessary Python modules.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
import numpy as np

## 3. Define knowledge corpus

We define a small set of documents for demonstration. Feel free to add your own passages.

In [3]:
corpus = [
    "The Eiffel Tower is located in Paris and is one of the most famous landmarks in the world.",
    "The Great Wall of China is more than 13,000 miles long and was built over centuries.",
    "Python is a popular programming language known for its readability and versatility.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci housed in the Louvre Museum.",
    "The Taj Mahal is a white marble mausoleum in Agra, India, built by Mughal emperor Shah Jahan."
]
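
If you want to expand the corpus beyond a hard-coded list, one option is to keep one passage per line in a plain-text file. The sketch below assumes that layout; the helper `load_corpus` and the file name `corpus.txt` are hypothetical and not part of the notebook.

```python
from pathlib import Path


def load_corpus(path):
    """Read one passage per line, skipping blank lines (assumed file layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical usage: replace the hard-coded list with passages from a file.
# corpus = load_corpus("corpus.txt")
```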

## 4. Build TF-IDF retriever

Fit a TF-IDF vectorizer on the corpus. Each passage becomes one row of `tfidf_matrix`, ready for similarity search.


In [4]:
vectorizer = TfidfVectorizer()
# Compute TF-IDF matrix for the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)
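
To make the weighting less of a black box, here is a minimal pure-Python sketch of the computation. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalized rows). The helper names `tokenize` and `tfidf_vectors` are hypothetical, and the sketch is for illustration only; the notebook keeps using `TfidfVectorizer`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumed to mirror
    # TfidfVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "The Eiffel Tower is located in Paris.",
    "The Mona Lisa is housed in the Louvre Museum.",
]
vecs = tfidf_vectors(docs)
# "the" occurs in both documents, so within the first document it gets a
# lower weight than "eiffel", which occurs only there.
print(vecs[0]["the"] < vecs[0]["eiffel"])
```

Terms shared by many passages contribute less to the similarity score than rare, distinctive terms, which is why a query mentioning "Eiffel" lands on the right passage.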

## 5. Retriever function

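Given a query, rank the passages by cosine similarity to the query vector and return the `top_k` best matches. Note that `retrieve` always returns `top_k` passages, even when some of them share little or nothing with the query.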


In [5]:
def retrieve(query, top_k=2):
    """
    Returns the top_k most relevant passages for the query.
    """
    query_vec = vectorizer.transform([query])
    # Compute cosine similarities
    sims = cosine_similarity(query_vec, tfidf_matrix).flatten()
    # Get indices of top_k results
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [corpus[i] for i in top_indices]

# Example retrieval
print(retrieve("Where is the Eiffel Tower located?", top_k=1))


['The Eiffel Tower is located in Paris and is one of the most famous landmarks in the world.']
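
Because `retrieve` always returns `top_k` passages, weakly related text can end up in the context. One possible refinement is to drop passages whose similarity falls below a cutoff. The sketch below works on plain lists so that it runs on its own; `select_passages`, `min_score`, and the example scores are hypothetical, and a sensible cutoff would need tuning on your own data.

```python
def select_passages(scores, passages, top_k=2, min_score=0.1):
    """Return up to top_k (score, passage) pairs with score >= min_score."""
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [(s, p) for s, p in ranked[:top_k] if s >= min_score]


# Illustrative scores only, not real output from the notebook.
example_scores = [0.62, 0.04, 0.0]
example_passages = ["passage A", "passage B", "passage C"]
print(select_passages(example_scores, example_passages))
# → [(0.62, 'passage A')]
```

Inside the notebook, `scores` would be the `sims` array computed in `retrieve` and `passages` would be `corpus`.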


## 6. Load QA model

We use a free model from Hugging Face that runs locally: `distilbert-base-cased-distilled-squad`. It is downloaded and cached the first time this cell runs.


In [6]:
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    tokenizer="distilbert-base-cased-distilled-squad"
)



## 7. Q&A function

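Combine retrieval with the QA model: retrieve the top passages, join them into a single context string, and let the model extract the answer from it.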


In [7]:
def answer_question(question, top_k=2):
    """
    Returns the extracted answer and the passages used as context.
    """
    # Retrieve relevant context
    contexts = retrieve(question, top_k=top_k)
    combined = " ".join(contexts)
    # Run QA pipeline on the question and the joined passages
    result = qa_pipeline(question=question, context=combined)
    return result["answer"], contexts

# Example QA
ans, ctx = answer_question("Who built the Taj Mahal?")
print(f"Answer: {ans}\nContext used: {ctx}")



Answer: Mughal emperor Shah Jahan
Context used: ['The Taj Mahal is a white marble mausoleum in Agra, India, built by Mughal emperor Shah Jahan.', 'The Great Wall of China is more than 13,000 miles long and was built over centuries.']
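
The pipeline result is a dict that, besides `answer`, carries a `score` along with `start` and `end` character offsets into the context. A minimal sketch of using the score to abstain on low-confidence answers follows. The helper `answer_or_abstain`, the `min_score` cutoff, and the sample dicts are hypothetical; the cutoff in particular is an assumption that would need tuning.

```python
def answer_or_abstain(result, min_score=0.3):
    """Return the answer text, or None when the model's score is too low."""
    if result["score"] < min_score:
        return None
    return result["answer"]


# Hand-written dicts shaped like a question-answering pipeline result;
# the numbers are made up for illustration.
confident = {"score": 0.95, "start": 67, "end": 92, "answer": "Mughal emperor Shah Jahan"}
unsure = {"score": 0.02, "start": 0, "end": 3, "answer": "The"}
print(answer_or_abstain(confident))  # → Mughal emperor Shah Jahan
print(answer_or_abstain(unsure))     # → None
```

This is useful when the question has no answer in the corpus: an extractive model still returns some span of the context, and a low score is often the only hint that the span is not a real answer.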


## 8. Try your own questions

Replace the question text with your own, then run this cell.


In [8]:
question = "What programming language is known for readability?"
answer, used_context = answer_question(question)
print(f"Question: {question}\nAnswer: {answer}\nContext: {used_context}")

Question: What programming language is known for readability?
Answer: Python
Context: ['Python is a popular programming language known for its readability and versatility.', 'The Eiffel Tower is located in Paris and is one of the most famous landmarks in the world.']
