# Lab 8: Question answering: Similarity-based model training + conditional answering, simple RAG system

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# -------------------------------
# 1) Knowledge Base (Documents)
# -------------------------------
docs = [
    "The Apollo 11 mission landed the first humans on the Moon in 1969.",
    "Python is a versatile programming language used for web development, data science, and automation.",
    "The Great Wall of China is over 13,000 miles long and is one of the most famous landmarks in the world.",
    "Machine learning is a subset of artificial intelligence focused on building predictive models from data."
]

# -------------------------------
# 2) TF-IDF Vectorization
# -------------------------------
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# -------------------------------
# 3) Retrieval + Generation
# -------------------------------
def answer_question(query, top_k=1):
    # Vectorize query
    query_vec = vectorizer.transform([query])
    
    # Compute cosine similarity with all docs
    sims = cosine_similarity(query_vec, doc_vectors)[0]
    
    # Get top-K most relevant documents
    top_idxs = np.argsort(sims)[::-1][:top_k]
    
    # For this demo, "generate" the answer by returning the top-k documents verbatim
    answers = [docs[i] for i in top_idxs]
    
    return answers

# -------------------------------
# 4) Try it out
# -------------------------------
query1 = "Who first landed on the Moon?"
query2 = "Tell me about Python programming."
query3 = "Explain machine learning."

print("Q:", query1)
print("A:", answer_question(query1)[0], "\n")

print("Q:", query2)
print("A:", answer_question(query2)[0], "\n")

print("Q:", query3)
print("A:", answer_question(query3)[0], "\n")


Q: Who first landed on the Moon?
A: The Apollo 11 mission landed the first humans on the Moon in 1969. 

Q: Tell me about Python programming.
A: Python is a versatile programming language used for web development, data science, and automation. 

Q: Explain machine learning.
A: Machine learning is a subset of artificial intelligence focused on building predictive models from data. 
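
The lab title mentions conditional answering, but `answer_question` always returns a document, even when nothing in the knowledge base matches the query. One possible way to make the answer conditional is to return a fallback below a similarity cutoff. The sketch below is only an illustration: `answer_if_confident`, `min_sim`, `FALLBACK`, the `kb_*` names, and the use of `stop_words="english"` are assumptions that are not part of the lab code, and the cutoff value is untuned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Small stand-in knowledge base (two of the lab's documents)
kb_docs = [
    "The Apollo 11 mission landed the first humans on the Moon in 1969.",
    "Python is a versatile programming language used for web development, data science, and automation.",
]

# Assumption: drop English stop words so that shared function words
# such as "the" do not count as evidence of relevance.
kb_vectorizer = TfidfVectorizer(stop_words="english")
kb_vectors = kb_vectorizer.fit_transform(kb_docs)

FALLBACK = "Sorry, I could not find a relevant document."

def answer_if_confident(query, min_sim=0.1):
    """Return the best-matching document, or FALLBACK if its score is below min_sim.

    min_sim is a hypothetical, untuned cutoff chosen for this illustration.
    """
    sims = cosine_similarity(kb_vectorizer.transform([query]), kb_vectors)[0]
    best = int(np.argmax(sims))
    if sims[best] < min_sim:
        return FALLBACK
    return kb_docs[best]

print(answer_if_confident("Who first landed on the Moon?"))
print(answer_if_confident("What is the capital of France?"))
```

With the default `TfidfVectorizer()` used in the cell above, function words such as "the" still contribute to the score, so an off-topic query that happens to share them can clear a low cutoff. Removing stop words or raising the cutoff are two ways to reduce that effect.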



## Selecting Top-k Similar Documents

In similarity-based information retrieval or question-answering systems, after computing similarity scores between a query and a set of documents, we often want to retrieve the top-k most relevant documents.
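
To make the similarity scores concrete, here is a minimal sketch of cosine similarity computed directly with NumPy. The vectors are made-up toy values rather than real TF-IDF output, and `cosine_sim` is a hypothetical helper; the lab itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(a @ b / denom)

# Toy vectors (illustrative values only)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # no terms in common with the query
    [1.0, 1.0, 0.0],  # partial overlap
]

toy_sims = [cosine_sim(query_vec, d) for d in doc_vecs]
print(toy_sims)
```

The first document scores close to 1, the second scores 0, and the third lands in between, which is the ordering the top-k selection relies on.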

### Step-by-step:

1. **Compute Similarities**
   - Use a similarity measure (e.g., cosine similarity) to compare the query vector with each document vector.
   - Example:  
     ```python
     sims = [0.1, 0.8, 0.3, 0.5]  # similarity scores
     ```

2. **Sort Document Indices**
   - `np.argsort(sims)` returns the indices that would sort the array in ascending order.  
   - Example:  
     ```python
     np.argsort(sims)  # [0, 2, 3, 1]
     ```

3. **Reverse for Descending Order**
   - Using `[::-1]` reverses the array to get descending order (highest similarity first).  
   - Example:  
     ```python
     np.argsort(sims)[::-1]  # [1, 3, 2, 0]
     ```

4. **Select Top-k**
   - Slice the first `k` indices to get the most similar documents.  
   - Example (`top_k = 2`):  
     ```python
     top_idxs = np.argsort(sims)[::-1][:2]  # [1, 3], indices of the top 2 documents
     ```
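
The four steps above can be combined into one small runnable sketch. `top_k_indices` is a hypothetical helper name that wraps the same expression used in `answer_question`. `top_k_indices_partition` is an optional alternative that is not part of the lab: it uses `np.argpartition` to avoid fully sorting a large score array, then sorts only the `k` selected candidates.

```python
import numpy as np

def top_k_indices(sims, top_k):
    """Indices of the top_k highest scores, best first (full sort)."""
    sims = np.asarray(sims)
    order = np.argsort(sims)      # ascending order of scores
    return order[::-1][:top_k]    # reverse, then keep the first top_k

def top_k_indices_partition(sims, top_k):
    """Same result for distinct scores, without sorting the whole array."""
    sims = np.asarray(sims)
    k = min(top_k, sims.size)
    if k <= 0:
        return np.array([], dtype=int)
    # The k largest scores end up in the last k slots, in no particular order
    candidates = np.argpartition(sims, -k)[-k:]
    # Sort only those k candidates by score, highest first
    return candidates[np.argsort(sims[candidates])[::-1]]

sims = [0.1, 0.8, 0.3, 0.5]
print(top_k_indices(sims, 2))            # [1 3]
print(top_k_indices_partition(sims, 2))  # [1 3]
```

When several documents have the same score, the two functions may order the tied indices differently, so neither should be relied on for a particular tie-breaking rule.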

### Summary

- **Purpose:** Retrieve the `top_k` documents with the highest similarity to the query, ordered from most to least similar.
- **Key Function:**  
  ```python
  top_idxs = np.argsort(sims)[::-1][:top_k]
  ```
