# Solar System RAG: Advanced Pipeline & Comparison

**Goal:** Build and compare two RAG pipelines (local embeddings vs. OpenAI embeddings) over a small Solar System knowledge base.

**Structure:**
1. **Part 1: Python Cheat Sheet** - With detailed comments explaining the operations.
2. **Part 2: Exam Simulation** - 15 Exercises to build the pipeline, including A/B testing and "I don't know" constraints.

In [96]:
# Standard imports from your course materials
import os
import glob
import pickle
import numpy as np
import pandas as pd
import faiss
from dotenv import load_dotenv
from openai import OpenAI
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Part 1: Python Operations Cheat Sheet (Commented)

In [None]:
# --- 1. FILE & DIRECTORY MANAGEMENT ---
# os.makedirs ensures the folder exists. 'exist_ok=True' prevents errors if it's already there.
folder_name = "test_folder"
os.makedirs(folder_name, exist_ok=True)

# glob.glob is a pattern matcher. '*.txt' finds any file ending in .txt.
# It returns a list of file paths like ['test_folder/a.txt', 'test_folder/b.txt']
found_files = glob.glob(f"{folder_name}/*.txt")

# --- 2. STRING MANIPULATION ---
# We often need to turn a list of text chunks into one big string for the LLM.
# '\n---\n'.join(list) puts a separator between each item.
chunks = ["Fact A", "Fact B"]
context_string = "\n---\n".join(chunks)
# Result: "Fact A\n---\nFact B"

# --- 3. LIST COMPREHENSION (The 'Pythonic' Loop) ---
# Instead of writing a 3-line for-loop, we do it in one line.
prices = [10, 20, 30]
# Syntax: [OPERATION for ITEM in LIST]
doubled = [p * 2 for p in prices]
# Result: [20, 40, 60]

# --- 4. DICTIONARY SAFE ACCESS ---
# config['key'] raises a KeyError if the key is missing.
# config.get('key', default) returns the default value instead.
config = {"model": "gpt-4"}
temp = config.get("temperature", 0.7) # Returns 0.7 because 'temperature' isn't in config

---

## Part 2: Solar System Pipeline Exercises

We will build **two** configurations to compare them later:
* **Config A:** Small chunks (100 chars, 20 overlap), Local Embeddings (SentenceTransformer).
* **Config B:** Larger chunks (300 chars, 50 overlap), OpenAI Embeddings.
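
To get a feel for how chunk size and overlap change the number of chunks, here is a minimal sketch using a naive fixed-width splitter. `sliding_chunks` is a hypothetical helper, not the LangChain splitter: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its real chunk counts will differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) chars after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1000 placeholder characters standing in for a real document
sample = "".join(str(i % 10) for i in range(1000))

small = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)
large = sliding_chunks(sample, chunk_size=300, chunk_overlap=50)

print(len(small), len(large))          # 13 4
print(small[0][-20:] == small[1][:20])  # True: neighbours share the overlap
```

Smaller chunks give more, narrower retrieval targets; larger chunks give fewer targets that each carry more surrounding context.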

### Exercise 1: Setup & Directory
**Task:** Initialize OpenAI and create `data_collection`.

In [98]:
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Create the folder to store our text/pdf files
os.makedirs("data_collection", exist_ok=True)
print("Setup complete.")

Setup complete.


### Exercise 2: Basic Chat
**Task:** Verify LLM connection.

In [99]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the largest planet? Answer in 1 word."}]
)
print(response.choices[0].message.content)

Jupiter.


### Exercise 3: Generate Solar System Data
**Task:** Create 4 text files in `data_collection`. These are our "knowledge base".

In [100]:
facts = {
    "mars.txt": "Mars is the Red Planet. It has two small moons, Phobos and Deimos. It is cold and dusty.",
    "jupiter.txt": "Jupiter is the largest planet. It is a gas giant with a Great Red Spot storm. It has over 79 moons.",
    "saturn.txt": "Saturn is known for its beautiful ring system made of ice and rock. Titan is its largest moon.",
    "venus.txt": "Venus is the hottest planet (462°C) due to a thick CO2 atmosphere. It rotates backwards."
}

for name, content in facts.items():
    # os.path.join handles / vs \ separators automatically for Windows/Mac
    path = os.path.join("data_collection", name)
    with open(path, "w") as f:
        f.write(content)

print("Knowledge base created.")

Knowledge base created.


### Exercise 4: Universal Loader (TXT & PDF)
**Task:** Write a function that reads **both** `.txt` and `.pdf` files from the directory into a single string.

In [101]:
def load_data(folder):
    combined_text = ""
    
    # 1. Grab all TXT files
    # glob returns a list of paths
    for filepath in glob.glob(os.path.join(folder, "*.txt")):
        with open(filepath, "r") as f:
            combined_text += f.read() + "\n"

    # 2. Grab all PDF files
    # We use PyPDF2 to parse the binary PDF format
    for filepath in glob.glob(os.path.join(folder, "*.pdf")):
        try:
            reader = PdfReader(filepath)
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    combined_text += text + "\n"
        except Exception as e:
            print(f"Error reading {filepath}: {e}")
            
    return combined_text

raw_text = load_data("data_collection")
print(f"Loaded {len(raw_text)} characters.")

Loaded 14078 characters.
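
A per-file breakdown can help explain a total like the one above. The sketch below counts characters for each `.txt` file only. `text_file_sizes` is a hypothetical helper, the demo files are made up, and the explicit UTF-8 encoding is an assumption about how the files were written.

```python
import glob
import os
import tempfile

def text_file_sizes(folder):
    """Return {filename: character count} for every .txt file in the folder."""
    sizes = {}
    for filepath in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(filepath, "r", encoding="utf-8") as f:
            sizes[os.path.basename(filepath)] = len(f.read())
    return sizes

# Demo on a throwaway folder so the real data_collection is untouched
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = {"a.txt": "Mars is red.", "b.txt": "Venus is hot."}
    for name, content in demo_files.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    demo_sizes = text_file_sizes(demo_dir)

print(demo_sizes)  # {'a.txt': 12, 'b.txt': 13}
```

Calling `text_file_sizes("data_collection")` and comparing the sum with `len(raw_text)` shows how much of the loaded text came from other sources such as PDFs.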


### Exercise 5: Creating Config A (Local Pipeline)
**Task:** Split text into `chunks_a` (Size: 100, Overlap: 20) and create embeddings using `SentenceTransformer`.

In [102]:
# --- CONFIG A: Small chunks, Local Model ---
splitter_a = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks_a = splitter_a.split_text(raw_text)

# Initialize local model (runs on CPU/GPU)
embedder_a = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
# convert_to_numpy=True (the default) returns a NumPy array, which FAISS expects
embeddings_a = embedder_a.encode(chunks_a, convert_to_numpy=True)

print(f"[Config A] Chunks: {len(chunks_a)} | Embedding Shape: {embeddings_a.shape}")

[Config A] Chunks: 203 | Embedding Shape: (203, 384)


### Exercise 6: Creating Config B (OpenAI Pipeline)
**Task:** Split text into `chunks_b` (Size: 300, Overlap: 50) and create embeddings using `client.embeddings.create`.

In [103]:
# --- CONFIG B: Larger chunks, OpenAI Model ---
splitter_b = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks_b = splitter_b.split_text(raw_text)

# OpenAI embeddings batch call
# We send the list of strings (chunks_b) to the API
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks_b
)

# Extract the vector list from the response object
vecs = [item.embedding for item in response.data]
# Convert list of lists to NumPy array (float32 is standard for FAISS)
embeddings_b = np.array(vecs, dtype="float32")

print(f"[Config B] Chunks: {len(chunks_b)} | Embedding Shape: {embeddings_b.shape}")

[Config B] Chunks: 60 | Embedding Shape: (60, 1536)
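
A single API call is fine for 60 chunks, but a larger corpus may exceed what one request accepts. The sketch below splits the work into batches. `batched`, `embed_in_batches`, and the batch size of 100 are assumptions for illustration, and `fake_embed` is a stand-in so the example runs without an API key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunks, embed_batch, batch_size=100):
    """Call embed_batch once per batch and collect the vectors in the original order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder for demonstration only: one fake 2-d vector per chunk
def fake_embed(batch):
    return [[float(len(text)), 0.0] for text in batch]

demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)  # [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
```

In the real pipeline, `embed_batch` would wrap `client.embeddings.create` and return `[item.embedding for item in response.data]`, as in the cell above.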


### Exercise 7: Saving FAISS Indices (Persistence)
**Task:** Create and save two separate FAISS indices: `index_a.index` and `index_b.index`.

In [104]:
# Index A (Local Dim: 384)
dim_a = embeddings_a.shape[1]
index_a = faiss.IndexFlatL2(dim_a)
index_a.add(embeddings_a)
faiss.write_index(index_a, os.path.join("data_collection", "index_a.index"))

# Index B (OpenAI dim: 1536 by default for text-embedding-3-small)
dim_b = embeddings_b.shape[1]
index_b = faiss.IndexFlatL2(dim_b)
index_b.add(embeddings_b)
faiss.write_index(index_b, os.path.join("data_collection", "index_b.index"))

print("Both indices saved to disk.")

Both indices saved to disk.
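
`IndexFlatL2` does an exhaustive search: it compares the query against every stored vector and ranks by squared Euclidean distance. The sketch below reproduces that idea in plain Python to make the mechanics visible. `flat_l2_search` is a hypothetical name, and this is not how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force nearest neighbours by squared Euclidean distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = flat_l2_search(stored, [0.9, 0.1], k=2)

print(indices)  # [1, 0]: vector 1 is closest, then vector 0
```

The returned positions are row numbers in the order the vectors were added, which is why the chunk list must be kept in the same order as the embeddings.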


### Exercise 8: Saving Text Chunks
**Task:** Save both chunk lists to pickle files so we can retrieve the text later.

In [105]:
# FAISS stores only the vectors, so the chunk text must be saved separately.
with open(os.path.join("data_collection", "chunks_a.pkl"), "wb") as f:
    pickle.dump(chunks_a, f)

with open(os.path.join("data_collection", "chunks_b.pkl"), "wb") as f:
    pickle.dump(chunks_b, f)

print("Chunk text lists saved.")

Chunk text lists saved.
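
Pickle works for this exercise, but unpickling a file from an untrusted source can execute arbitrary code. Since the chunks are just a list of strings, JSON is a possible alternative. This is a minimal sketch; the helper names and the `chunks_demo.json` filename are hypothetical.

```python
import json
import os
import tempfile

def save_chunks_json(chunks, path):
    """Write a list of strings as JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)

def load_chunks_json(path):
    """Read the list of strings back from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip in a throwaway folder
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "chunks_demo.json")
    original = ["Venus is the hottest planet (462°C).", "Titan is Saturn's largest moon."]
    save_chunks_json(original, demo_path)
    restored = load_chunks_json(demo_path)

print(restored == original)  # True
```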


### Exercise 9: Loading Everything Back
**Task:** Load indices and chunks into variables to mimic a production server starting up.

In [106]:
# Load Indices
loaded_idx_a = faiss.read_index(os.path.join("data_collection", "index_a.index"))
loaded_idx_b = faiss.read_index(os.path.join("data_collection", "index_b.index"))

# Load Text
with open(os.path.join("data_collection", "chunks_a.pkl"), "rb") as f:
    loaded_chunks_a = pickle.load(f)

with open(os.path.join("data_collection", "chunks_b.pkl"), "rb") as f:
    loaded_chunks_b = pickle.load(f)

print("All data loaded into memory.")

All data loaded into memory.


### Exercise 10: Retrieval Function A (Local)
**Task:** Implement `retrieve_a(query)` using `embedder_a` and `loaded_idx_a`.

In [107]:
def retrieve_a(query, k=2):
    # 1. Embed query with Local model
    q_vec = embedder_a.encode([query], convert_to_numpy=True)
    
    # 2. Search Index A
    # search returns (distances, indices)
    _, indices = loaded_idx_a.search(q_vec, k)
    
    # 3. Retrieve Text from Chunks A
    return [loaded_chunks_a[i] for i in indices[0]]

print(retrieve_a("Is Venus hot?"))

['Venus is the hottest planet (462°C) due to a thick CO2 atmosphere. It rotates backwards.', 'heit). Because the planet has no atmosphere to retain that heat,']
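
One edge case worth knowing: when `k` is larger than the number of stored vectors, FAISS pads the result with `-1`. In Python, `loaded_chunks_a[-1]` would silently return the last chunk instead of failing. A small guard, sketched here with the hypothetical helper `chunks_for_indices`, filters those entries out.

```python
def chunks_for_indices(index_row, chunks):
    """Map one row of search indices to chunk text, skipping the -1 'no result' marker."""
    return [chunks[i] for i in index_row if 0 <= i < len(chunks)]

demo_chunks = ["chunk zero", "chunk one"]
demo_row = [1, 0, -1]  # what a result row looks like for k=3 with only two vectors stored

picked = chunks_for_indices(demo_row, demo_chunks)
print(picked)  # ['chunk one', 'chunk zero']
```

Inside `retrieve_a`, the last line could then become `return chunks_for_indices(indices[0], loaded_chunks_a)`.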


### Exercise 11: Retrieval Function B (OpenAI)
**Task:** Implement `retrieve_b(query)` using OpenAI embeddings and `loaded_idx_b`.

In [108]:
def retrieve_b(query, k=2):
    # 1. Embed query with OpenAI
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    )
    # Wrap the embedding in a list so the array has shape (1, dimension), as FAISS expects
    q_vec = np.array([resp.data[0].embedding], dtype="float32")
    
    # 2. Search Index B
    _, indices = loaded_idx_b.search(q_vec, k)
    
    # 3. Retrieve Text from Chunks B
    return [loaded_chunks_b[i] for i in indices[0]]

print(retrieve_b("Is Venus hot?"))

['Venus is the hottest planet (462°C) due to a thick CO2 atmosphere. It rotates backwards.\nSaturn is known for its beautiful ring system made of ice and rock. Titan is its largest moon.\nMars is the Red Planet. It has two small moons, Phobos and Deimos. It is cold and dusty.', 'heit). Because the planet has no atmosphere to retain that heat, \nnighttime temperatures on the surface can drop to –180  degrees \nCelsius (–290 degrees Fahrenheit).\nBecause Mercury is so close to the Sun, it is hard to directly ob -\nserve from Earth except during dawn or twilight. Mercury makes']


### Exercise 12: Advanced Prompting (Constraint)
**Task:** Create `format_prompt` that includes strict instructions: **"If the answer is not in the context, state 'I do not know'."**

In [109]:
def format_prompt(query, docs):
    context = "\n---\n".join(docs)
    return f"""You are a strict assistant. Answer ONLY based on the context below.
If the answer is not in the context, say 'I do not know'. Do not make things up.

Context:
{context}

Question: {query}
"""
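
A possible variant is to move the strict instructions into a `system` message and keep only the context and question in the `user` message. The sketch below builds such a message list. `build_messages` is a hypothetical helper, and whether this improves refusals over the single-prompt version is something to test, not a given.

```python
def build_messages(query, docs):
    """Return chat messages with the rules in 'system' and the context plus question in 'user'."""
    context = "\n---\n".join(docs)
    system = (
        "You are a strict assistant. Answer ONLY based on the context provided. "
        "If the answer is not in the context, say 'I do not know'. Do not make things up."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

demo_messages = build_messages("How hot is Venus?", ["Fact A", "Fact B"])
print(demo_messages[1]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.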

### Exercise 13: RAG Answer Generators
**Task:** Create `rag_answer_a(query)` and `rag_answer_b(query)`.

In [110]:
def rag_answer_a(query):
    docs = retrieve_a(query)
    prompt = format_prompt(query, docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def rag_answer_b(query):
    docs = retrieve_b(query)
    prompt = format_prompt(query, docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
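
The two functions above differ only in which retriever they call. One way to avoid the duplication is a small factory that takes the retriever as a parameter. This is a sketch: `make_rag_answer` is a hypothetical name, and the `fake_*` functions are stand-ins so the example runs without an API key.

```python
def make_rag_answer(retrieve, format_prompt, generate):
    """Build a query -> answer function from three interchangeable steps."""
    def rag_answer(query):
        docs = retrieve(query)
        prompt = format_prompt(query, docs)
        return generate(prompt)
    return rag_answer

# Stand-ins for demonstration only
def fake_retrieve(query):
    return ["Jupiter is the largest planet."]

def fake_format(query, docs):
    return f"{' '.join(docs)} | {query}"

def fake_generate(prompt):
    return prompt.upper()

demo_rag = make_rag_answer(fake_retrieve, fake_format, fake_generate)
demo_answer = demo_rag("Which planet is largest?")
print(demo_answer)  # JUPITER IS THE LARGEST PLANET. | WHICH PLANET IS LARGEST?
```

With the real functions, `generate` would wrap the `client.chat.completions.create` call, and the two pipelines would be built by passing `retrieve_a` or `retrieve_b`.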

### Exercise 14: Automated Question Generation
**Task:** Generate a question from a random chunk in List A to test our system.

In [111]:
import random
chunk = random.choice(loaded_chunks_a)
print(f"Source: {chunk}")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Write a short question for this text: {chunk}"}]
)
generated_q = resp.choices[0].message.content
print(f"Gen Question: {generated_q}")

Source: Inclination of Equator to Orbit  29.58 deg  
Rotation Period  16.11 hours
Gen Question: What is the inclination of the equator to its orbit?
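
Because `random.choice` picks a different chunk on every run, the generated question changes each time, and very short chunks (like the table fragment above) give the model little to work with. The sketch below makes the choice reproducible and prefers longer chunks. `pick_chunk`, the seed, and the 40-character threshold are all assumptions for illustration.

```python
import random

def pick_chunk(chunks, seed, min_chars=40):
    """Pick one chunk reproducibly, preferring chunks with at least min_chars characters."""
    # Fall back to all chunks if none is long enough
    candidates = [c for c in chunks if len(c) >= min_chars] or list(chunks)
    return random.Random(seed).choice(candidates)

demo_chunks = [
    "Rotation Period  16.11 hours",
    "Saturn is known for its beautiful ring system made of ice and rock.",
    "Venus is the hottest planet (462°C) due to a thick CO2 atmosphere.",
]

first = pick_chunk(demo_chunks, seed=7)
second = pick_chunk(demo_chunks, seed=7)
print(first == second)  # True: same seed, same chunk
```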


### Exercise 15: Side-by-Side Comparison & Out-of-Context Test
**Task:** 
1. Define a list containing valid questions AND an **out-of-context** question (e.g., about France).
2. Loop through them, getting answers from Pipeline A and Pipeline B.
3. Display a Pandas DataFrame comparing the two results.

In [113]:
test_questions = [
    "What is the great red spot on jupiter?",     # Should retrieve from Jupiter
    "How hot is Venus?",                          # Should retrieve from Venus
    "What is the capital of France?"              # Out of context -> Should say "I do not know"
]

results = []

for q in test_questions:
    # Run both pipelines
    ans_a = rag_answer_a(q)
    ans_b = rag_answer_b(q)
    
    results.append({
        "Question": q,
        "Answer A (Local)": ans_a,
        "Answer B (OpenAI)": ans_b
    })

# Create Comparison Table
df_compare = pd.DataFrame(results)
pd.set_option('display.max_colwidth', None)
df_compare

| | Question | Answer A (Local) | Answer B (OpenAI) |
|---|---|---|---|
| 0 | What is the great red spot on jupiter? | The Great Red Spot is a storm on Jupiter. | The Great Red Spot is a storm on Jupiter. |
| 1 | How hot is Venus? | Venus is 462°C. | Venus is 462°C. |
| 2 | What is the capital of France? | I do not know. | I do not know. |
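
Reading the table by eye works for three questions. For a longer test list, the out-of-context check can be automated. The sketch below flags answers that match the fallback phrase, using two rows copied from the results above. `is_refusal` is a hypothetical helper, and exact matching is an assumption: a model that phrases its refusal differently would not be detected.

```python
def is_refusal(answer):
    """True if the answer is the 'I do not know' fallback, ignoring case and trailing punctuation."""
    return answer.strip().rstrip(".!").lower() == "i do not know"

# Two rows in the same shape as the `results` list built in the loop above
demo_results = [
    {"Question": "How hot is Venus?",
     "Answer A (Local)": "Venus is 462°C.",
     "Answer B (OpenAI)": "Venus is 462°C."},
    {"Question": "What is the capital of France?",
     "Answer A (Local)": "I do not know.",
     "Answer B (OpenAI)": "I do not know."},
]

refusals = [
    (row["Question"], is_refusal(row["Answer A (Local)"]), is_refusal(row["Answer B (OpenAI)"]))
    for row in demo_results
]
print(refusals)
```

The same function could be applied to the DataFrame, for example with `df_compare["Answer A (Local)"].map(is_refusal)`, to add a refusal column per pipeline.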
