# 🚀 Embeddings: From Theory to RAG Applications

## Welcome to Week 4: Embeddings Deep Dive!

This notebook will take you on a journey from basic vector concepts to advanced RAG (Retrieval-Augmented Generation) applications. We'll build understanding progressively with lots of practical examples.

### 🎯 Learning Objectives
- Understand vector spaces and mathematical foundations
- Master embedding concepts through intuitive examples
- Learn semantic similarity and search techniques
- Apply embeddings in RAG systems
- Build practical applications step-by-step

### 📚 Prerequisites
- Basic Python knowledge
- Understanding of lists and arrays
- Curiosity to learn! 🧠

---

## 📦 Setup and Imports

Let's start by importing the necessary libraries for our journey.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

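# For embeddings
try:
    import openai  # Optional: only needed if you call OpenAI-hosted embedding models
except ImportError:
    openai = None  # The cells in this notebook run without it
from sentence_transformers import SentenceTransformer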

# Utilities
import json
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 🔧 Environment Setup

The examples in this notebook use a local Sentence Transformers model, so they run without an OpenAI API key. You only need a key if you want to experiment with OpenAI's hosted embedding models.

In [None]:
# Load a pre-trained embedding model
print("📥 Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight, fast model
print("✅ Model loaded successfully!")

## 🎯 Section 1: Understanding Vectors - The Foundation

Before we dive into embeddings, let's understand the fundamental building blocks: **vectors** and **vector spaces**.

### What is a Vector?

Think of a vector as a list of numbers that represents a point in space. Just like how your GPS coordinates (latitude, longitude) tell you where you are on Earth, a vector tells you where something is in a mathematical space.
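
Because a vector is a position, two quantities follow directly from it: how far the point is from the origin (its magnitude) and how far apart two points are. The sketch below uses two illustrative 2D points and only the standard library; the variable names are chosen for this example.

```python
import math

# Two example points, treated as 2D vectors.
point_a = [3, 4]
point_b = [1, 2]

# Magnitude (length): distance from the origin to the point.
magnitude_a = math.hypot(*point_a)

# Euclidean distance: straight-line distance between the two points.
distance_ab = math.dist(point_a, point_b)

print(f"Magnitude of A: {magnitude_a:.3f}")
print(f"Distance A to B: {distance_ab:.3f}")
```

Both ideas carry over unchanged to vectors with hundreds of dimensions; only the number of terms under the square root grows.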

Let's start with simple examples:

In [None]:
# Example 1: Simple 2D Vectors (like GPS coordinates)
print("📍 2D Vector Examples (like GPS coordinates)")

# Vector representing a point in 2D space
point_a = [3, 4]  # x=3, y=4
point_b = [1, 2]  # x=1, y=2

print(f"Point A: {point_a}")
print(f"Point B: {point_b}")

# Visualize these points
plt.figure(figsize=(8, 6))
plt.scatter([point_a[0]], [point_a[1]], color='red', s=100, label='Point A (3,4)')
plt.scatter([point_b[0]], [point_b[1]], color='blue', s=100, label='Point B (1,2)')
plt.grid(True, alpha=0.3)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('2D Vector Visualization')
plt.legend()
plt.axis('equal')
plt.show()

# Example 2: 3D Vectors (like 3D coordinates)
print("\n🌍 3D Vector Examples (like 3D coordinates)")

# Vector representing a point in 3D space
point_3d = [2, 3, 5]  # x=2, y=3, z=5
print(f"3D Point: {point_3d}")

# Example 3: High-dimensional vectors (like embeddings!)
print("\n🧠 High-dimensional Vector Example (like embeddings)")
# A 5-dimensional vector (much smaller than real embeddings which can be 768+ dimensions)
embedding_example = [0.1, -0.3, 0.8, 0.2, -0.5]
print(f"5D Embedding Vector: {embedding_example}")
print(f"Length of vector: {len(embedding_example)} dimensions")

## 🔢 Section 2: Vector Operations - The Math Behind Similarity

Now let's learn the key mathematical operations that make embeddings powerful: **dot product** and **cosine similarity**.

### Dot Product: The Foundation of Similarity

The dot product measures how much two vectors point in the same direction. It's like asking: "How similar are these two directions?"
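
The dot product can be read two ways: algebraically, as the sum of element-wise products, and geometrically, as the product of the two lengths and the cosine of the angle between them. The following sketch checks that both readings agree for one pair of 2D vectors. The helper names `dot` and `norm` are made up for this example, and plain Python is used so every arithmetic step is visible.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

a = [3, 4]
b = [1, 2]

# Algebraic definition: 3*1 + 4*2
algebraic = dot(a, b)

# Geometric definition: |a| * |b| * cos(angle between a and b)
angle = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
geometric = norm(a) * norm(b) * math.cos(angle)

print(f"Algebraic: {algebraic}")
print(f"Geometric: {geometric:.6f}")
```

The geometric form is what makes the dot product a measure of alignment: for fixed lengths, it is largest when the angle is 0° and zero when the vectors are perpendicular.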

In [None]:
# Understanding Dot Product with Intuitive Examples

# Example 1: Simple 2D vectors
vector_a = np.array([3, 4])
vector_b = np.array([1, 2])

# Manual calculation
dot_product_manual = vector_a[0] * vector_b[0] + vector_a[1] * vector_b[1]
dot_product_numpy = np.dot(vector_a, vector_b)

print(f"Vector A: {vector_a}")
print(f"Vector B: {vector_b}")
print(f"Dot Product (manual): {dot_product_manual}")
print(f"Dot Product (numpy): {dot_product_numpy}")

# Visualize the vectors
plt.figure(figsize=(10, 8))

# Plot vectors
plt.quiver(0, 0, vector_a[0], vector_a[1], angles='xy', scale_units='xy', scale=1, color='red', label='Vector A')
plt.quiver(0, 0, vector_b[0], vector_b[1], angles='xy', scale_units='xy', scale=1, color='blue', label='Vector B')

plt.grid(True, alpha=0.3)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Vector Visualization for Dot Product')
plt.legend()
plt.axis('equal')
plt.xlim(-1, 5)
plt.ylim(-1, 5)
plt.show()

# Example 2: Different scenarios
print("\n🔍 Dot Product in Different Scenarios:")

# Scenario 1: Parallel vectors (pointing same direction)
parallel_a = np.array([2, 0])
parallel_b = np.array([4, 0])
dot_parallel = np.dot(parallel_a, parallel_b)
print(f"Parallel vectors: {parallel_a} and {parallel_b}")
print(f"Dot product: {dot_parallel} (high positive value)")

# Scenario 2: Perpendicular vectors (90 degrees)
perpendicular_a = np.array([2, 0])
perpendicular_b = np.array([0, 3])
dot_perpendicular = np.dot(perpendicular_a, perpendicular_b)
print(f"\nPerpendicular vectors: {perpendicular_a} and {perpendicular_b}")
print(f"Dot product: {dot_perpendicular} (zero!)")

# Scenario 3: Opposite vectors (180 degrees)
opposite_a = np.array([2, 0])
opposite_b = np.array([-2, 0])
dot_opposite = np.dot(opposite_a, opposite_b)
print(f"\nOpposite vectors: {opposite_a} and {opposite_b}")
print(f"Dot product: {dot_opposite} (negative value)")

print("\n💡 Key Insight: Dot product tells us about the 'alignment' of vectors!")

## 📐 Section 3: Cosine Similarity - The Magic Formula

While dot product is useful, it has a problem: it depends on the magnitude (length) of vectors. We want to measure similarity regardless of size.

**Cosine Similarity** solves this by normalizing the dot product by the magnitudes of both vectors.

### The Formula:
```
cosine_similarity = dot_product / (magnitude_a * magnitude_b)
```

This gives us a value between -1 and 1:
- **1**: Vectors point in exactly the same direction (most similar)
- **0**: Vectors are perpendicular (no similarity)
- **-1**: Vectors point in opposite directions (least similar)
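
To see why the normalization matters, the sketch below compares a short vector with a longer one pointing the same way: the dot product grows with the length, while the cosine similarity stays at 1. It also guards against zero-length vectors, for which the formula would divide by zero. Returning 0.0 in that case is an assumption made for this example, and `cosine_similarity_safe` is a hypothetical helper name.

```python
import math

def cosine_similarity_safe(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

short = [1, 1]
long = [10, 10]  # same direction, ten times the length

# The dot product changes with magnitude...
dot_short = sum(x * y for x, y in zip(short, short))
dot_long = sum(x * y for x, y in zip(short, long))

# ...but the cosine similarity does not.
cos_short = cosine_similarity_safe(short, short)
cos_long = cosine_similarity_safe(short, long)

print(f"Dot products: {dot_short} vs {dot_long}")
print(f"Cosine similarities: {cos_short:.3f} vs {cos_long:.3f}")
print(f"Zero vector: {cosine_similarity_safe([0, 0], [1, 2])}")
```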

In [None]:
# Understanding Cosine Similarity

def cosine_similarity_manual(a, b):
    """Calculate cosine similarity manually"""
    dot_product = np.dot(a, b)
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)
    return dot_product / (magnitude_a * magnitude_b)

# Example 1: Same direction, different magnitudes
vector_1 = np.array([1, 1])
vector_2 = np.array([2, 2])  # Same direction, twice the magnitude

dot_product = np.dot(vector_1, vector_2)
cosine_sim = cosine_similarity_manual(vector_1, vector_2)

print(f"Vector 1: {vector_1}, Magnitude: {np.linalg.norm(vector_1):.2f}")
print(f"Vector 2: {vector_2}, Magnitude: {np.linalg.norm(vector_2):.2f}")
print(f"Dot Product: {dot_product}")
print(f"Cosine Similarity: {cosine_sim:.3f} (Perfect similarity!)")

# Example 2: Different angles
angles = [0, 45, 90, 135, 180]  # degrees
similarities = []

for angle in angles:
    # Create vectors at different angles
    rad = np.radians(angle)
    vector_a = np.array([1, 0])
    vector_b = np.array([np.cos(rad), np.sin(rad)])
    
    sim = cosine_similarity_manual(vector_a, vector_b)
    similarities.append(sim)
    print(f"Angle: {angle}°, Cosine Similarity: {sim:.3f}")

# Visualize the relationship
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(angles, similarities, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Angle (degrees)')
plt.ylabel('Cosine Similarity')
plt.title('Cosine Similarity vs Angle')
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='black', linestyle='-', alpha=0.3)
plt.axhline(y=1, color='green', linestyle='--', alpha=0.5, label='Perfect Similarity')
plt.axhline(y=-1, color='red', linestyle='--', alpha=0.5, label='Opposite')
plt.legend()

plt.subplot(1, 2, 2)
# Show some example vectors
example_vectors = [
    ([1, 0], [1, 0], '0° (Same)'),
    ([1, 0], [0.7, 0.7], '45°'),
    ([1, 0], [0, 1], '90° (Perpendicular)'),
    ([1, 0], [-0.7, 0.7], '135°'),
    ([1, 0], [-1, 0], '180° (Opposite)')
]

for i, (v1, v2, label) in enumerate(example_vectors):
    plt.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, 
               color='red', alpha=0.7, width=0.02)
    plt.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, 
               color='blue', alpha=0.7, width=0.02)
    plt.text(v2[0]*1.2, v2[1]*1.2, label, fontsize=8)

plt.grid(True, alpha=0.3)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Vector Angles and Similarity')
plt.axis('equal')
plt.xlim(-1.5, 1.5)
plt.ylim(-1.5, 1.5)

plt.tight_layout()
plt.show()

print("\n🎯 Key Insight: Cosine similarity measures direction similarity, not magnitude!")

## 🧠 Section 4: What are Embeddings? - The Bridge to AI

Now that we understand vectors and similarity, let's dive into **embeddings** - the magical way we convert text, images, or any data into vectors that capture meaning.

### The Big Idea

An embedding is a way to represent complex data (like text) as a vector of numbers in a high-dimensional space, where:
- **Similar meanings** are close together
- **Different meanings** are far apart
- **Semantic relationships** are preserved

Think of it like this: If you could plot all words in a 3D space, "king" and "queen" would be close, "king" and "apple" would be far apart, and "king" - "man" + "woman" would be close to "queen"!
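
The "king" - "man" + "woman" arithmetic can be tried on a hand-made toy space. The vectors below are invented for illustration, with axes standing for royalty, gender, and object-ness. Real learned embeddings satisfy such analogies only approximately, so this is a sketch of the idea rather than a property you can rely on.

```python
import math

# Hand-made toy vectors: [royalty, gender, object-ness]. Illustrative only.
toy = {
    "king":  [0.8, 0.2, 0.1],
    "queen": [0.8, 0.8, 0.1],
    "man":   [0.2, 0.2, 0.1],
    "woman": [0.2, 0.8, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman, computed dimension by dimension
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the words not used in the arithmetic by similarity to the result
candidates = {w: v for w, v in toy.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(f"king - man + woman is closest to: {best}")
```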

In [None]:
# Understanding Embeddings with Simple Examples

# Example 1: Simple word embeddings (conceptual)
print("🔤 Simple Word Embeddings Example")

# Let's create a simple 3D embedding space for words
word_embeddings = {
    'king': [0.8, 0.2, 0.1],      # High royalty, low gender, low object
    'queen': [0.8, 0.8, 0.1],     # High royalty, high gender, low object
    'man': [0.2, 0.2, 0.1],       # Low royalty, low gender, low object
    'woman': [0.2, 0.8, 0.1],     # Low royalty, high gender, low object
    'apple': [0.1, 0.1, 0.9],     # Low royalty, low gender, high object
    'banana': [0.1, 0.1, 0.8]     # Low royalty, low gender, high object
}

# Convert to numpy arrays for easier manipulation
embeddings_array = np.array(list(word_embeddings.values()))
words = list(word_embeddings.keys())

print("Word Embeddings (3D vectors):")
for word, embedding in word_embeddings.items():
    print(f"{word:8}: {embedding}")

# Calculate similarities
print("\n🔍 Similarity Analysis:")
for i, word1 in enumerate(words):
    for j, word2 in enumerate(words[i+1:], i+1):
        sim = cosine_similarity([embeddings_array[i]], [embeddings_array[j]])[0][0]
        print(f"{word1:8} vs {word2:8}: {sim:.3f}")

# Visualize in 3D
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown']
for i, (word, embedding) in enumerate(word_embeddings.items()):
    ax.scatter(embedding[0], embedding[1], embedding[2], 
               c=colors[i], s=100, label=word)
    ax.text(embedding[0], embedding[1], embedding[2], word, fontsize=12)

ax.set_xlabel('Royalty')
ax.set_ylabel('Gender')
ax.set_zlabel('Object-ness')
ax.set_title('Simple 3D Word Embeddings')
ax.legend()
plt.show()

print("\n💡 Notice how similar words (king/queen, apple/banana) are close together!")

## 📊 Section 5: TF-IDF - The Simple Text Embedding Method

Before we dive into modern neural embeddings, let's understand **TF-IDF** (Term Frequency-Inverse Document Frequency), a classic method that's still useful today.

### What is TF-IDF?

TF-IDF converts text into vectors based on word frequency, giving more weight to rare, important words.

- **TF (Term Frequency)**: How often a word appears in a document
- **IDF (Inverse Document Frequency)**: How rare a word is across all documents
- **TF-IDF Score**: TF × IDF = Importance of word in document
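
The three definitions above can be computed by hand on a tiny corpus. This sketch uses the textbook variant, `tf * log(N / df)`. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each document vector, so its numbers will differ from the ones here. The corpus and the `tf_idf` helper are made up for this example.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the floor",
    "the weather is sunny today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Textbook TF-IDF of `word` within one tokenized document."""
    tf = tokens.count(word) / len(tokens)  # term frequency in this document
    idf = math.log(n_docs / df[word])      # rarity across the collection
    return tf * idf

for word in ("the", "sat", "cat"):
    print(f"{word:4}: {tf_idf(word, tokenized[0]):.3f}")
```

"the" appears in every document, so its IDF is `log(1) = 0` and it contributes nothing, while "cat" appears in only one document and gets the highest score of the three.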

In [None]:
# TF-IDF Example: Document Similarity

# Sample documents
documents = [
    "The cat sat on the mat",
    "The dog sat on the floor", 
    "The cat and dog are pets",
    "The weather is sunny today",
    "The cat is sleeping on the mat"
]

print("📄 Sample Documents:")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")

# Create TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

print(f"\n🔤 Vocabulary: {feature_names}")
print(f"📊 TF-IDF Matrix Shape: {tfidf_matrix.shape}")

# Show TF-IDF scores for first document
print("\n📈 TF-IDF Scores for Document 1:")
doc1_scores = tfidf_matrix[0].toarray()[0]
for word, score in zip(feature_names, doc1_scores):
    if score > 0:
        print(f"{word:10}: {score:.3f}")

# Calculate similarities between all documents
similarities = cosine_similarity(tfidf_matrix)

print("\n🔍 Document Similarities:")
print(" " * 6, end="")
for i in range(len(documents)):
    print(f"{'Doc' + str(i+1):>7}", end="")
print()

for i in range(len(documents)):
    print(f"{'Doc' + str(i+1):<6}", end="")
    for j in range(len(documents)):
        print(f"{similarities[i][j]:7.3f}", end="")
    print()

# Visualize similarities
plt.figure(figsize=(10, 8))
sns.heatmap(similarities, 
            xticklabels=[f'Doc{i+1}' for i in range(len(documents))],
            yticklabels=[f'Doc{i+1}' for i in range(len(documents))],
            annot=True, cmap='Blues', vmin=0, vmax=1)
plt.title('Document Similarity Matrix (TF-IDF)')
plt.show()

print("\n💡 Notice how documents that share words score higher (Docs 1 and 5 share 'cat' and 'mat'), while Doc 4 shares no words with the others and scores 0!")

## 🤖 Section 6: Modern Neural Embeddings - The Power of AI

TF-IDF is good, but it has limitations:
- Doesn't understand word meanings
- "bank" (financial) and "bank" (river) are treated the same
- No understanding of context

**Neural embeddings** solve these problems by learning semantic relationships from vast amounts of text data.
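
To make these limitations concrete, the sketch below scores two sentence pairs using word overlap alone. It uses raw word counts rather than full TF-IDF weights, and a tiny hand-picked stop list, both simplifications chosen for this example. The paraphrase pair shares no words and scores 0, while the two unrelated "bank" sentences score higher simply because they share a token.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on"}  # tiny illustrative stop list

def bag_of_words(text):
    """Count the non-stop-words in a lowercased sentence."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def count_cosine(c1, c2):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Same meaning, different words
paraphrase = count_cosine(
    bag_of_words("The cat is sleeping on the mat"),
    bag_of_words("A feline is resting on the carpet"),
)

# Different meanings, one shared word
homonym = count_cosine(
    bag_of_words("The bank is closed today"),
    bag_of_words("The river bank is muddy"),
)

print(f"Paraphrase pair: {paraphrase:.3f}")
print(f"'bank' pair:     {homonym:.3f}")
```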

### How Neural Embeddings Work

1. **Training**: Model reads millions of sentences
2. **Learning**: Understands word relationships and context
3. **Output**: Each word/text gets a high-dimensional vector (384-1536 dimensions)
4. **Magic**: Similar meanings = similar vectors, regardless of exact words!

In [None]:
# Modern Neural Embeddings Example

# Sample texts with similar meanings but different words
texts = [
    "The cat is sleeping on the mat",
    "A feline is resting on the carpet",
    "The weather is sunny today",
    "It's a beautiful day with clear skies",
    "I love eating pizza",
    "Pizza is my favorite food",
    "The bank is closed today",
    "The financial institution is not open",
    "The river bank is muddy",
    "The shore of the stream is wet"
]

print("📝 Sample Texts:")
for i, text in enumerate(texts, 1):
    print(f"{i:2}. {text}")

# Generate embeddings using our model
print("\n🔄 Generating embeddings...")
embeddings = model.encode(texts)

print(f"📊 Embedding shape: {embeddings.shape}")
print(f"🔢 Each text is now a {embeddings.shape[1]}-dimensional vector!")

# Calculate similarities
similarities = cosine_similarity(embeddings)

print("\n🔍 Neural Embedding Similarities:")
print(" " * 4, end="")
for i in range(len(texts)):
    print(f"{'T' + str(i+1):>7}", end="")
print()

for i in range(len(texts)):
    print(f"{'T' + str(i+1):<4}", end="")
    for j in range(len(texts)):
        print(f"{similarities[i][j]:7.3f}", end="")
    print()

# Find most similar pairs
print("\n🎯 Most Similar Text Pairs:")
pairs = []
for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        pairs.append((i, j, similarities[i][j]))

# Sort by similarity
pairs.sort(key=lambda x: x[2], reverse=True)

for i, j, sim in pairs[:5]:
    print(f"Similarity {sim:.3f}:")
    print(f"  Text {i+1}: {texts[i]}")
    print(f"  Text {j+1}: {texts[j]}")
    print()

# Visualize in 2D using PCA
print("📈 Visualizing embeddings in 2D...")
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

plt.figure(figsize=(12, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.7)

# Add text labels
for i, (x, y) in enumerate(embeddings_2d):
    plt.annotate(f'T{i+1}', (x, y), xytext=(5, 5), textcoords='offset points')

plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Text Embeddings in 2D Space (PCA)')
plt.grid(True, alpha=0.3)
plt.show()

print("\n🎯 Key Insights:")
print("1. Similar meanings cluster together (cat/feline, weather/sunny)")
print("2. Different meanings are separated (bank financial vs bank river)")
print("3. Semantic similarity > word overlap!")

## 🔍 Section 7: Semantic Search - Finding Meaning, Not Just Words

Now let's build a **semantic search engine** - the core component of RAG systems. Instead of finding exact word matches, we find documents with similar meanings.

### How Semantic Search Works

1. **Index**: Convert all documents to embeddings
2. **Query**: Convert search query to embedding
3. **Search**: Find documents with most similar embeddings
4. **Rank**: Return results ranked by similarity

This is much more powerful than traditional keyword search!
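
The four steps can be sketched end to end without any model. In the example below, `toy_embed` is a hypothetical stand-in for a real embedding model: it just counts words from three hand-picked concept groups, so it illustrates the pipeline shape, not real semantic understanding. The documents, concept groups, and function names are all invented for this sketch.

```python
import math

# Stand-in for a real embedding model: counts words from three hand-picked
# concept groups. Hypothetical and for illustration only.
CONCEPTS = [
    {"learn", "learning", "patterns", "data"},
    {"security", "protect", "protects", "unauthorized"},
    {"cloud", "internet", "resources"},
]

def toy_embed(text):
    """Map text to a 3-dimensional vector of concept-word counts."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    words = cleaned.split()
    return [sum(w in group for w in words) for group in CONCEPTS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Machine learning algorithms can learn patterns from data.",
    "Cybersecurity protects systems from unauthorized access.",
    "Cloud computing provides resources over the internet.",
]

# 1. Index: embed every document once, up front
index = [toy_embed(doc) for doc in documents]

def search(query, top_k=2):
    # 2. Query: embed the query with the same function
    query_vec = toy_embed(query)
    # 3. Search: score every document against the query
    scores = [cosine(query_vec, doc_vec) for doc_vec in index]
    # 4. Rank: sort document indices by score, highest first
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in ranked[:top_k]]

results = search("How can we protect against unauthorized access?")
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

Swapping `toy_embed` for a real model changes only steps 1 and 2; the scoring and ranking logic stays the same.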

In [None]:
# Building a Semantic Search Engine

# Sample knowledge base (like documents in a RAG system)
knowledge_base = [
    "Python is a popular programming language for data science and machine learning.",
    "Machine learning algorithms can learn patterns from data without explicit programming.",
    "Deep learning uses neural networks with multiple layers to solve complex problems.",
    "Natural language processing helps computers understand human language.",
    "Data visualization is important for understanding and communicating insights.",
    "Artificial intelligence aims to create systems that can perform tasks requiring human intelligence.",
    "Big data refers to large datasets that are difficult to process using traditional methods.",
    "Cloud computing provides on-demand access to computing resources over the internet.",
    "Cybersecurity protects computer systems from theft, damage, and unauthorized access.",
    "Blockchain is a distributed ledger technology that ensures secure and transparent transactions."
]

print("📚 Knowledge Base Documents:")
for i, doc in enumerate(knowledge_base, 1):
    print(f"{i:2}. {doc}")

# Create embeddings for all documents
print("\n🔄 Creating document embeddings...")
doc_embeddings = model.encode(knowledge_base)

print(f"📊 Embedding shape: {doc_embeddings.shape}")

# Function for semantic search
def semantic_search(query, doc_embeddings, documents, top_k=3):
    """Perform semantic search"""
    # Encode the query
    query_embedding = model.encode([query])
    
    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Get top-k results
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'similarity': similarities[idx],
            'index': idx
        })
    
    return results

# Test different queries
test_queries = [
    "How do computers learn from data?",
    "What is the best programming language for AI?",
    "How can we protect our data?",
    "What technology is used for secure transactions?"
]

print("\n🔍 Testing Semantic Search:")
for query in test_queries:
    print(f"\n❓ Query: {query}")
    results = semantic_search(query, doc_embeddings, knowledge_base, top_k=3)
    
    for i, result in enumerate(results, 1):
        print(f"{i}. Similarity: {result['similarity']:.3f}")
        print(f"   Doc {result['index']+1}: {result['document']}")
        print()

# Interactive search demo
print("🎯 Interactive Search Demo:")
print("Try these queries or type your own:")
print("- 'neural networks'")
print("- 'data analysis'")
print("- 'computer security'")
print("- 'distributed systems'")

# You can uncomment this for interactive search
# while True:
#     query = input("\nEnter your search query (or 'quit' to exit): ")
#     if query.lower() == 'quit':
#         break
#     
#     results = semantic_search(query, doc_embeddings, knowledge_base, top_k=3)
#     print(f"\n🔍 Results for: {query}")
#     for i, result in enumerate(results, 1):
#         print(f"{i}. Similarity: {result['similarity']:.3f}")
#         print(f"   {result['document']}")
#         print()

## 🏗️ Section 8: RAG Architecture - Where Embeddings Fit In

Now let's understand how embeddings power **RAG (Retrieval-Augmented Generation)** systems.

### RAG Architecture Overview

```
User Query → Embedding → Vector Search → Relevant Documents → LLM → Answer
     ↓           ↓            ↓              ↓              ↓
   Text      Vector      Similarity      Context       Response
```

### The Role of Embeddings in RAG

1. **Document Indexing**: Convert knowledge base to embeddings
2. **Query Processing**: Convert user question to embedding
3. **Retrieval**: Find most similar documents using vector similarity
4. **Context**: Pass relevant documents to LLM for answer generation

This is why embeddings are crucial for RAG - they enable semantic retrieval!
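
The "Context" step usually means placing the retrieved passages into the prompt sent to the LLM. The sketch below shows one way this might look. The `build_prompt` function and the template wording are hypothetical choices for this example, not a standard format, and no LLM is called.

```python
def build_prompt(question, retrieved_docs):
    """Assemble an LLM prompt from a question and retrieved passages.

    The template wording is a hypothetical example, not a standard.
    """
    # Number each passage so an answer could refer back to its source
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Passages that a retrieval step might have returned
retrieved = [
    "Supervised learning uses labeled training data.",
    "Unsupervised learning finds patterns without labels.",
]

prompt = build_prompt("How do supervised and unsupervised learning differ?", retrieved)
print(prompt)
```

Restricting the model to the supplied context is a common design choice because it ties the answer to the retrieved documents, though how strictly a given LLM follows that instruction varies.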

In [None]:
# Simple RAG System Implementation

# Simulate a knowledge base about AI/ML topics
ai_knowledge_base = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed.",
    "Deep learning uses artificial neural networks with multiple layers to model and understand complex patterns in data.",
    "Natural language processing (NLP) is a branch of AI that helps computers understand, interpret, and generate human language.",
    "Computer vision is a field of AI that trains computers to interpret and understand visual information from the world.",
    "Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to achieve maximum rewards.",
    "Supervised learning uses labeled training data to teach models to make predictions or classifications.",
    "Unsupervised learning finds hidden patterns in data without predefined labels or outcomes.",
    "Transfer learning allows models to apply knowledge learned from one task to a related task.",
    "Neural networks are computing systems inspired by biological neural networks in human brains.",
    "Convolutional neural networks (CNNs) are particularly effective for image recognition and computer vision tasks."
]

print("📚 AI/ML Knowledge Base:")
for i, doc in enumerate(ai_knowledge_base, 1):
    print(f"{i:2}. {doc}")

# Create embeddings
print("\n🔄 Creating embeddings...")
kb_embeddings = model.encode(ai_knowledge_base)

# Simple RAG function
def simple_rag_system(query, knowledge_base, embeddings, top_k=2):
    """Simple RAG system implementation"""
    
    print(f"\n🤖 RAG System Processing: {query}")
    print("=" * 50)
    
    # Step 1: Convert query to embedding
    query_embedding = model.encode([query])
    print("✅ Step 1: Query converted to embedding")
    
    # Step 2: Find similar documents
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]
    
    print(f"✅ Step 2: Found {top_k} most similar documents")
    
    # Step 3: Retrieve relevant context
    relevant_docs = []
    for idx in top_indices:
        relevant_docs.append({
            'content': knowledge_base[idx],
            'similarity': similarities[idx],
            'index': idx
        })
        print(f"   📄 Doc {idx+1} (similarity: {similarities[idx]:.3f}): {knowledge_base[idx][:80]}...")
    
    print("✅ Step 3: Retrieved context from the top-ranked documents")

    # Step 4: Generate response (simulated)
    print("\n✅ Step 4: Generating response based on retrieved context...")
    
    # Simulate LLM response
    response = f"Based on the retrieved information, here's what I found about '{query}':\n\n"
    
    for i, doc in enumerate(relevant_docs, 1):
        response += f"{i}. {doc['content']}\n\n"
    
    response += "These documents were selected based on semantic similarity to your query."
    
    return response, relevant_docs

# Test the RAG system
test_questions = [
    "What is machine learning?",
    "How do neural networks work?",
    "What is the difference between supervised and unsupervised learning?",
    "How does computer vision work?"
]

print("\n🧪 Testing RAG System:")
for question in test_questions:
    response, docs = simple_rag_system(question, ai_knowledge_base, kb_embeddings)
    print(response)
    print("\n" + "="*60 + "\n")

print("🎯 Key RAG Insights:")
print("1. Embeddings enable semantic document retrieval")
print("2. Retrieved context improves LLM responses")
print("3. Similarity scores help rank relevance")
print("4. RAG combines the best of retrieval and generation!")

## 🚀 Section 9: Advanced Topics - Taking It Further

Let's explore some advanced concepts and practical considerations for embeddings in production systems.

### Dimensionality Reduction for Visualization

High-dimensional embeddings (384-1536 dimensions) are hard to visualize. We use techniques like **PCA** and **t-SNE** to reduce them to 2D/3D for visualization.

### Embedding Models Comparison

Different models have different strengths:
- **all-MiniLM-L6-v2**: Fast, good for general use
- **all-mpnet-base-v2**: Better quality, slower
- **OpenAI text-embedding-ada-002**: High quality, requires API

### Production Considerations

- **Vector Databases**: Pinecone, Weaviate, Chroma for storing embeddings
- **Indexing**: HNSW, IVF for fast similarity search
- **Caching**: Store frequently used embeddings
- **Batch Processing**: Process multiple texts efficiently
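
Caching and batch processing can be combined in a few lines. The sketch below keeps an in-memory dictionary of embeddings and sends only unseen texts to the model, in a single batched call. `EmbeddingCache` and `fake_embed_batch` are hypothetical names for this example, and the fake embedder exists only to show how many calls are made. A production system would more likely persist the cache to disk or a database.

```python
class EmbeddingCache:
    """In-memory cache so each distinct text is embedded only once."""

    def __init__(self, embed_batch):
        # embed_batch: any callable mapping a list of texts to a list of vectors
        self._embed_batch = embed_batch
        self._cache = {}

    def encode(self, texts):
        # Collect texts not seen before, preserving order and dropping duplicates
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        if missing:
            # One batched call instead of one call per text
            for text, vector in zip(missing, self._embed_batch(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


calls = []

def fake_embed_batch(texts):
    """Hypothetical stand-in for a real model: records each batch it receives."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
first = cache.encode(["alpha", "beta", "alpha"])
second = cache.encode(["beta", "gamma"])

print(f"Batches sent to the model: {calls}")
```

With the model loaded earlier in this notebook, something like `EmbeddingCache(model.encode)` should behave the same way, since `encode` accepts a list of texts and returns one vector per text.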

In [None]:
# Advanced Topics: Dimensionality Reduction and Model Comparison

# Let's create a larger dataset for visualization
topics = [
    "machine learning algorithms",
    "deep learning neural networks", 
    "natural language processing",
    "computer vision systems",
    "reinforcement learning agents",
    "supervised learning models",
    "unsupervised learning clustering",
    "transfer learning techniques",
    "artificial intelligence systems",
    "data science analytics",
    "big data processing",
    "cloud computing services",
    "cybersecurity protection",
    "blockchain technology",
    "internet of things devices"
]

print("📊 Creating embeddings for visualization...")
topic_embeddings = model.encode(topics)

print(f"Original embedding shape: {topic_embeddings.shape}")

# PCA for dimensionality reduction
print("\n📈 Applying PCA for 2D visualization...")
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(topic_embeddings)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.3f}")

# t-SNE for better clustering visualization
print("\n🎨 Applying t-SNE for better clustering...")
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_tsne = tsne.fit_transform(topic_embeddings)

# Visualize both methods
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# PCA visualization
scatter1 = ax1.scatter(embeddings_pca[:, 0], embeddings_pca[:, 1], s=100, alpha=0.7)
ax1.set_xlabel('PCA Component 1')
ax1.set_ylabel('PCA Component 2')
ax1.set_title('PCA Dimensionality Reduction')
ax1.grid(True, alpha=0.3)

# Add labels for PCA
for i, topic in enumerate(topics):
    ax1.annotate(f'{i+1}', (embeddings_pca[i, 0], embeddings_pca[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

# t-SNE visualization
scatter2 = ax2.scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1], s=100, alpha=0.7)
ax2.set_xlabel('t-SNE Component 1')
ax2.set_ylabel('t-SNE Component 2')
ax2.set_title('t-SNE Dimensionality Reduction')
ax2.grid(True, alpha=0.3)

# Add labels for t-SNE
for i, topic in enumerate(topics):
    ax2.annotate(f'{i+1}', (embeddings_tsne[i, 0], embeddings_tsne[i, 1]), 
                xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

# Show topic numbers
print("\n📋 Topic Reference:")
for i, topic in enumerate(topics, 1):
    print(f"{i:2}. {topic}")

print("\n🔍 Observations:")
print("1. PCA preserves global structure but may not show clusters clearly")
print("2. t-SNE better preserves local structure and shows clusters")
print("3. Similar topics (ML/DL/NLP) tend to cluster together")
print("4. Different domains (AI vs IoT vs Blockchain) are separated")

# Model comparison (conceptual: rough qualitative labels, not benchmark results)
print("\n🤖 Model Comparison Table:")
models_info = [
    ["all-MiniLM-L6-v2", "384", "Fast", "Good", "General purpose"],
    ["all-mpnet-base-v2", "768", "Medium", "Better", "Quality-focused"],
    ["OpenAI ada-002", "1536", "Slow", "Best", "Production-ready"]
]

print(f"{'Model':<20} {'Dim':<6} {'Speed':<8} {'Quality':<8} {'Use Case':<15}")
print("-" * 60)
for model_info in models_info:
    print(f"{model_info[0]:<20} {model_info[1]:<6} {model_info[2]:<8} {model_info[3]:<8} {model_info[4]:<15}")

print("\n💡 Choose based on your needs: speed vs quality vs cost!")

## 🎉 Section 10: Conclusion and Next Steps

Congratulations! You've now mastered the fundamentals of embeddings and their role in RAG systems.

### What We've Learned

✅ **Vector Mathematics**: Dot product, cosine similarity, vector spaces
✅ **Embedding Concepts**: Converting text to meaningful vectors
✅ **TF-IDF vs Neural**: Traditional vs modern embedding methods
✅ **Semantic Search**: Finding meaning, not just words
✅ **RAG Architecture**: How embeddings power retrieval-augmented generation
✅ **Practical Implementation**: Building real systems with embeddings

### Key Takeaways

1. **Embeddings are vectors that capture meaning** - similar meanings = similar vectors
2. **Cosine similarity between embeddings approximates semantic similarity** - perfect for finding related content
3. **Neural embeddings understand context** - much better than keyword matching
4. **RAG uses embeddings for retrieval** - enabling AI to access knowledge
5. **Vector databases store embeddings efficiently** - essential for production systems

### Next Steps

🚀 **Build Your Own RAG System**:
- Use vector databases like Pinecone or Chroma
- Implement semantic search for your documents
- Connect to LLMs for answer generation

🔧 **Explore Advanced Topics**:
- Multi-modal embeddings (text + images)
- Fine-tuning embedding models
- Optimizing for specific domains
- Production deployment strategies

📚 **Resources**:
- [Sentence Transformers Documentation](https://www.sbert.net/)
- [Hugging Face Embeddings](https://huggingface.co/models?pipeline_tag=sentence-similarity)
- [Vector Database Comparison](https://zilliz.com/comparison)
- [RAG Best Practices](https://arxiv.org/abs/2312.10997)

### Final Challenge

Try building a semantic search system for your own documents or create a RAG chatbot for a specific domain. The concepts you've learned here are the foundation of modern AI applications!

---

**Happy embedding! 🚀**

In [None]:
# 🎯 Final Challenge: Build Your Own Semantic Search

# Your turn! Create a semantic search system for your own content
print("🎯 Final Challenge: Build Your Own Semantic Search System")
print("=" * 60)

# Example: Create a knowledge base about your favorite topic
my_knowledge_base = [
    # Add your own documents here!
    "Your first document about your topic",
    "Your second document with related information", 
    "Another document that might be relevant",
    # Add more documents...
]

if len(my_knowledge_base) > 3:  # Only run if you've added content
    print("📚 Your Knowledge Base:")
    for i, doc in enumerate(my_knowledge_base, 1):
        print(f"{i}. {doc}")
    
    # Create embeddings
    print("\n🔄 Creating embeddings...")
    my_embeddings = model.encode(my_knowledge_base)
    
    # Test your search
    my_query = "Your test query here"
    print(f"\n🔍 Testing with query: {my_query}")
    
    results = semantic_search(my_query, my_embeddings, my_knowledge_base)
    for i, result in enumerate(results, 1):
        print(f"{i}. Similarity: {result['similarity']:.3f}")
        print(f"   {result['document']}")
        print()
else:
    print("💡 Add your own documents to my_knowledge_base and test your semantic search!")
    print("Example topics: cooking recipes, travel guides, technical documentation, etc.")

print("\n🎉 Congratulations on completing the embeddings tutorial!")
print("You now have the foundation to build powerful AI applications with RAG!")