# Module 4: Vector Storage & Retrieval

**Level:** Intermediate  
**Prerequisites:** Modules 1, 2, and 3 completed

---

## Learning Objectives

By the end of this module, you will be able to:

- Understand what vector databases are and why they're needed
- Choose the right vector database for your use case
- Implement vector storage using FAISS (local)
- Implement vector storage using Chroma (embedded)
- Perform efficient similarity search at scale
- Add filtering and metadata to retrieval
- Optimize retrieval performance

---

# 1. Why Vector Databases?

## 1.1 The Problem with Simple Storage

In Module 3, we stored embeddings in a Python list and did linear search:

```python
# Simple approach from Module 3
embeddings = [emb1, emb2, emb3, ...]  # List of vectors

# Search: compare query to EVERY vector
for emb in embeddings:
    similarity = cosine_similarity(query_emb, emb)
```

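**This works for small datasets but fails at scale, because every query is compared against every stored vector.** The figures below are illustrative; actual times depend on hardware and embedding dimension: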

| Documents | Vectors | Search Time |
|-----------|---------|-------------|
| 100 | 100 | ~10ms |
| 1,000 | 1,000 | ~100ms |
| 10,000 | 10,000 | ~1 second |
| 100,000 | 100,000 | ~10 seconds |
| 1,000,000 | 1,000,000 | ~100 seconds |

**Real-world RAG systems can hold millions of vectors, so we need a better solution.**
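
To make the linear cost concrete, here is a dependency-free sketch of the brute-force approach in plain Python. The function names and the toy 2-dimensional vectors are illustrative assumptions, not code from Module 3. The point is that the number of comparisons always equals the number of stored vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def linear_search(query, vectors, k=3):
    """Score the query against every stored vector and return the top k.

    Returns (results, comparisons), where results is a list of
    (index, similarity) pairs sorted from most to least similar.
    """
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k], len(vectors)

# Toy 2-dimensional "embeddings" (illustrative values only)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
results, comparisons = linear_search([1.0, 0.1], vectors, k=2)

print(results)       # best matches first
print(comparisons)   # one comparison per stored vector
```

Doubling the number of stored vectors doubles `comparisons`, which is the behaviour the table above describes.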

## 1.2 What is a Vector Database?

A vector database is a specialized database designed to:
- Store high-dimensional vectors efficiently
- Perform fast similarity search (even with millions of vectors)
- Handle metadata alongside vectors
- Support filtering and hybrid search

**Key difference from regular databases:**
- Regular DB: Exact match queries (`WHERE name = 'John'`)
- Vector DB: Similarity queries (`Find vectors most similar to query vector`)

---

# 2. Popular Vector Databases

## 2.1 Comparison Table

| Database | Type | Best For | Pros | Cons |
|----------|------|----------|------|------|
| **FAISS** | Library (local) | Prototyping, research | Fast, free, flexible | No server, no persistence by default |
| **Chroma** | Embedded | Small to medium projects | Easy setup, built for LLMs | Limited scale |
| **Pinecone** | Cloud (managed) | Production, scale | Fully managed, scalable | Costs money, cloud-only |
| **Weaviate** | Self-hosted/Cloud | Production, flexibility | Feature-rich, open source | Complex setup |
| **Qdrant** | Self-hosted/Cloud | Production, performance | Fast, great filtering | Requires setup |
| **Milvus** | Self-hosted/Cloud | Large scale, enterprise | Very scalable, mature | Complex, resource-intensive |

## 2.2 Decision Guide

**Use FAISS when:**
- Learning and prototyping
- Running locally without server
- Need maximum speed and flexibility
- Don't need persistence (or can handle it yourself)

**Use Chroma when:**
- Building small to medium RAG apps
- Want simplicity and easy setup
- Need basic persistence and metadata
- Working with LangChain or LlamaIndex

**Use Pinecone when:**
- Building production applications
- Want fully managed service (no DevOps)
- Need to scale to millions of vectors
- Have budget for managed service

**Use Weaviate/Qdrant when:**
- Need production features but want to self-host
- Want advanced filtering and hybrid search
- Have DevOps resources

**For this module, we'll focus on FAISS and Chroma (most common for learning).**

---

# 3. Hands-On: FAISS Vector Storage

## 3.1 Install and Import

In [None]:
# Install required libraries
!pip install -q faiss-cpu sentence-transformers numpy

In [None]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

print(f"FAISS version: {faiss.__version__}")
print("✅ Libraries imported successfully!")

## 3.2 Prepare Sample Data

In [None]:
# Sample documents
documents = [
    "Python is a versatile programming language used for web development and data science.",
    "Machine learning models require large amounts of training data to perform well.",
    "Neural networks are inspired by the structure of the human brain.",
    "Natural language processing enables computers to understand human language.",
    "Deep learning is a subset of machine learning using multi-layered neural networks.",
    "Data visualization helps communicate insights from complex datasets.",
    "Cloud computing provides on-demand access to computing resources.",
    "Cybersecurity protects systems and networks from digital attacks.",
    "Blockchain technology enables secure, decentralized transactions.",
    "Quantum computing uses quantum mechanics to solve complex problems."
]

print(f"Total documents: {len(documents)}")

## 3.3 Generate Embeddings

In [None]:
# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
embeddings = model.encode(documents)

print(f"Generated {len(embeddings)} embeddings")
print(f"Each embedding has {embeddings.shape[1]} dimensions")
print(f"Embeddings shape: {embeddings.shape}")

## 3.4 Create FAISS Index

In [None]:
# Get embedding dimension
dimension = embeddings.shape[1]

# Create FAISS index (IndexFlatL2 = exact search with L2 distance)
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
index.add(embeddings)

print(f"✅ FAISS index created!")
print(f"Total vectors in index: {index.ntotal}")

### 💡 Understanding FAISS Indexes

**IndexFlatL2**: Exact search using L2 (Euclidean) distance. Best for small datasets or when you need perfect accuracy.

**Other index types:**
- `IndexFlatIP`: Exact search using inner product (equals cosine similarity when the vectors are L2-normalized)
- `IndexIVFFlat`: Approximate search using clustering (faster for large datasets)
- `IndexHNSWFlat`: Graph-based approximate search (very fast)
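
The clustering idea behind `IndexIVFFlat` can be sketched in a few lines of plain Python. This is a conceptual illustration under simplifying assumptions (hand-picked centroids, tiny 2-dimensional vectors), not FAISS's implementation. In FAISS the centroids are learned by calling `train` on the index, and the number of clusters visited per query is controlled by `nprobe`.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (one 'inverted list' per centroid)."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k=2, nprobe=1):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    candidates.sort(key=lambda idx: squared_l2(query, vectors[idx]))
    return candidates[:k], len(candidates)

vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.0], [5.5, 5.0], [6.0, 6.0]]
centroids = [[0.1, 0.1], [5.5, 5.3]]   # fixed by hand here; real IVF indexes learn them with k-means

lists = build_ivf(vectors, centroids)
neighbors, scanned = ivf_search([5.1, 5.0], vectors, centroids, lists, k=2, nprobe=1)

print(neighbors)  # nearest neighbours found inside the probed cluster
print(scanned)    # only half of the vectors were compared
```

The search is approximate because a true neighbour sitting in an unprobed cluster would be missed. Raising `nprobe` trades speed for recall.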

### What is the purpose of an index?
👉 1. Store vectors efficiently

The index is a specialized data structure that stores your embeddings (vector representations of text).

👉 2. Allow fast similarity search

Flat indexes still compare the query against every vector, but in optimized native code. Approximate indexes (IVF, HNSW) go further and examine only a subset of the vectors, which keeps search fast even with millions of entries.

👉 3. Provide distance + nearest neighbors

When you search with a query vector, the index returns:

- I → indices of the closest stored vectors
- D → distances showing how similar they are

Example:

```
I = [[5, 12, 3]]   # best matches
D = [[0.12, 0.34, 0.89]]   # distances
```
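
The shape of `D` and `I` is easier to see with a tiny exact search written by hand. The sketch below is a plain-Python stand-in for a flat L2 index (the function name and data are made up for illustration). It returns one row per query, with the closest match first.

```python
def flat_l2_search(queries, vectors, k):
    """Exact search: for each query return the k smallest squared L2 distances (D)
    and the positions of the matching vectors (I), closest first."""
    D, I = [], []
    for q in queries:
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, v)), idx)
            for idx, v in enumerate(vectors)
        )[:k]
        D.append([dist for dist, _ in scored])
        I.append([idx for _, idx in scored])
    return D, I

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
D, I = flat_l2_search([[0.9, 0.0]], vectors, k=3)

print(I)  # one row of indices per query
print(D)  # matching row of distances, smallest first
```

`I[0][0]` is the position of the best match, so `documents[I[0][0]]` would be the text to show the user.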

## 3.5 Search with FAISS

In [None]:
# Query
query = "What is artificial intelligence and machine learning?"

# Embed query
query_embedding = model.encode([query])

# Search: find top 3 most similar vectors
k = 3
distances, indices = index.search(query_embedding, k)

print(f"Query: {query}\n")
print(f"Top {k} results:\n")

for i, (idx, distance) in enumerate(zip(indices[0], distances[0]), 1):
    print(f"{i}. (Distance: {distance:.4f})")
    print(f"   {documents[idx]}")
    print()

### 💡 Understanding Distances

**L2 distance** (what IndexFlatL2 uses; FAISS reports the *squared* Euclidean distance):
- **Lower = More similar** (opposite of cosine similarity!)
- 0 = Identical vectors
- Higher values = More different
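
For vectors normalized to unit length, squared L2 distance and cosine similarity are two views of the same quantity: `distance = 2 - 2 * cosine`. The short check below uses plain Python and made-up vectors to confirm this numerically. It also shows why the two metrics rank results identically once embeddings are normalized.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cosine = dot(a, b)            # inner product of unit vectors = cosine similarity
distance = squared_l2(a, b)   # squared Euclidean distance

print(round(cosine, 4))
print(round(distance, 4))
print(math.isclose(distance, 2 - 2 * cosine))
```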


## 3.6 Using Cosine Similarity with FAISS

In [None]:
# Normalize embeddings for cosine similarity
embeddings_normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Create index with inner product (equivalent to cosine for normalized vectors)
index_cosine = faiss.IndexFlatIP(dimension)
index_cosine.add(embeddings_normalized)

# Search with normalized query (normalize each row so batches of queries also work)
query_embedding_normalized = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
scores, indices = index_cosine.search(query_embedding_normalized, k)

print(f"Query: {query}\n")
print(f"Top {k} results with cosine similarity:\n")

for i, (idx, score) in enumerate(zip(indices[0], scores[0]), 1):
    print(f"{i}. (Similarity: {score:.4f})")
    print(f"   {documents[idx]}")
    print()

## 3.7 Saving and Loading FAISS Index

In [None]:
import pickle

# Save index to disk
faiss.write_index(index_cosine, "my_faiss_index.bin")
print("✅ Index saved to disk")

# Save documents separately (FAISS only stores vectors, not text)
with open("documents.pkl", "wb") as f:
    pickle.dump(documents, f)
print("✅ Documents saved")

In [None]:
# Load index from disk
loaded_index = faiss.read_index("my_faiss_index.bin")
print(f"✅ Index loaded: {loaded_index.ntotal} vectors")

# Load documents
with open("documents.pkl", "rb") as f:
    loaded_documents = pickle.load(f)
print(f"✅ Documents loaded: {len(loaded_documents)} documents")
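
Because the documents here are plain strings, JSON is a reasonable alternative to pickle: the file is human-readable, and loading it cannot execute code, which matters if the file ever comes from an untrusted source. The sketch below is a minimal illustration. The file name and the temporary directory are assumptions chosen so the example cleans up after itself.

```python
import json
import os
import tempfile

documents = ["first document", "second document", "third document"]

# Hypothetical file name; the temporary directory is removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "documents.json")

    # Save: list order must match the order in which vectors were added to the index
    with open(path, "w", encoding="utf-8") as f:
        json.dump(documents, f, ensure_ascii=False)

    # Load: position i in the list corresponds to vector i in the index
    with open(path, "r", encoding="utf-8") as f:
        loaded_documents = json.load(f)

print(loaded_documents == documents)
```

Whichever format you pick, the list order is the only link between a FAISS result index and its text, so the index file and the document file should always be saved together.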

---

# 4. Hands-On: Chroma Vector Database

## 4.1 Why Chroma?

**Chroma advantages over FAISS:**
- Built specifically for LLM applications
- Automatic persistence (saves to disk automatically)
- Stores documents AND embeddings together
- Rich metadata support and filtering

**Trade-off:** Less flexible than FAISS, not as fast for very large datasets

## 4.2 Install and Import

In [None]:
!pip install -q chromadb

In [None]:
import chromadb

print(f"ChromaDB version: {chromadb.__version__}")
print("✅ ChromaDB imported successfully!")

## 4.3 Create Chroma Client and Collection

In [None]:
# Create Chroma client (persistent storage)
# Note: ChromaDB 0.4.0+ uses PersistentClient instead of Client(Settings(...))
client = chromadb.PersistentClient(path="./chroma_db")

# Create or get collection
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"description": "Sample document collection"}
)

print(f"✅ Collection created: {collection.name}")
print(f"Current count: {collection.count()} documents")
print(f"📁 Data persisted to: ./chroma_db/")

## 4.4 Add Documents to Chroma

In [None]:
# Sample documents with metadata
documents = [
    "Python is a versatile programming language used for web development and data science.",
    "Machine learning models require large amounts of training data to perform well.",
    "Neural networks are inspired by the structure of the human brain.",
    "Natural language processing enables computers to understand human language.",
    "Deep learning is a subset of machine learning using multi-layered neural networks."
]

# Metadata for each document
metadatas = [
    {"category": "programming", "topic": "python"},
    {"category": "AI", "topic": "machine learning"},
    {"category": "AI", "topic": "neural networks"},
    {"category": "AI", "topic": "NLP"},
    {"category": "AI", "topic": "deep learning"}
]

# IDs for each document
ids = [f"doc_{i}" for i in range(len(documents))]

# Add to collection (Chroma handles embedding automatically!)
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print(f"✅ Added {len(documents)} documents to collection")
print(f"Total documents: {collection.count()}")

### 💡 Chroma Magic

Notice: We didn't manually generate embeddings! Chroma does it automatically using a default embedding model.

**You can also specify your own embedding function (we'll see this later).**

## 4.5 Query Chroma

In [None]:
# Query the collection
results = collection.query(
    query_texts=["What is artificial intelligence?"],
    n_results=3
)

In [None]:
results

In [None]:
# Query the collection
results = collection.query(
    query_texts=["What is artificial intelligence?"],
    n_results=3
)

print("Query: What is artificial intelligence?\n")
print("Top 3 results:\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    print(f"{i}. (Distance: {distance:.4f})")
    print(f"   Document: {doc}")
    print(f"   Metadata: {metadata}")
    print()

In [None]:
results['documents']

## 4.6 Filtering with Metadata

In [None]:
# Query with metadata filter
results = collection.query(
    query_texts=["Tell me about AI"],
    n_results=3,
    where={"category": "AI"}  # Only return AI documents
)

print("Query: Tell me about AI (filtered by category='AI')\n")
print("Results:\n")

for i, (doc, metadata) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0]
), 1):
    print(f"{i}. {doc}")
    print(f"   Category: {metadata['category']}, Topic: {metadata['topic']}")
    print()

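In practice, the filter value is often derived from the user's query, for example with a classifier (`classify_user_query` below is a hypothetical helper, not part of Chroma):

```python
# classify_user_query is a placeholder for your own query classifier
detected_category = classify_user_query(query)  # e.g. returns "AI"

results = collection.query(
    query_texts=[query],
    n_results=5,
    where={"category": detected_category}
)
```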
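
To build intuition for what a `where` filter does, the sketch below evaluates a Chroma-style filter against plain metadata dictionaries. It is an illustration of the filter semantics only, not Chroma's implementation. It supports just equality, `$and`, and the `$eq`, `$gte`, and `$lte` operators, and the `matches` and `compare` helpers are hypothetical names.

```python
def compare(value, op, target):
    """Apply one comparison operator from the filter."""
    if op == "$eq":
        return value == target
    if op == "$gte":
        return value >= target
    if op == "$lte":
        return value <= target
    raise ValueError(f"Unsupported operator in this sketch: {op}")

def matches(metadata, where):
    """Return True if a metadata dict satisfies a Chroma-style `where` filter."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must match
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"priority": {"$gte": 2}}
            if key not in metadata:
                return False
            if not all(compare(metadata[key], op, target) for op, target in condition.items()):
                return False
        elif metadata.get(key) != condition:
            # Shorthand equality form, e.g. {"category": "AI"}
            return False
    return True

records = [
    {"category": "AI", "priority": 1},
    {"category": "AI", "priority": 3},
    {"category": "programming", "priority": 3},
]

where = {"$and": [{"category": "AI"}, {"priority": {"$gte": 2}}]}
filtered = [r for r in records if matches(r, where)]
print(filtered)
```

Only records that pass the filter are candidates for similarity ranking, which is why a filtered query can return fewer than `n_results` items.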

## 4.7 Using Custom Embedding Function

In [None]:
from chromadb.utils import embedding_functions

# Use sentence-transformers embedding function
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create new collection with custom embedding function
collection_custom = client.get_or_create_collection(
    name="custom_embeddings",
    embedding_function=sentence_transformer_ef
)

# Add documents
collection_custom.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print(f"✅ Collection with custom embeddings created")
print(f"Documents: {collection_custom.count()}")

In [None]:
# Query the collection
results = collection_custom.query(
    query_texts=["What is artificial intelligence?"],
    n_results=3,
    include=["embeddings", "documents", "metadatas", "distances"]
)

In [None]:
results

In [None]:
# Query the collection
results = collection_custom.query(
    query_texts=["What is artificial intelligence?"],
    n_results=3,
    include=["embeddings", "documents", "metadatas", "distances"]
)

print("Query: What is artificial intelligence?\n")
print("Top 3 results:\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    print(f"{i}. (Distance: {distance:.4f})")
    print(f"   Document: {doc}")
    print(f"   Metadata: {metadata}")
    print()

## 4.8 Update and Delete Documents

In [None]:
# Update a document
collection.update(
    ids=["doc_0"],
    documents=["Python is an amazing programming language for AI and data science!"],
    metadatas=[{"category": "programming", "topic": "python", "updated": True}]
)
print("✅ Document updated")

# Delete a document
# collection.delete(ids=["doc_4"])
# print("✅ Document deleted")

print(f"\nTotal documents after update: {collection.count()}")

---

# 5. Building a Complete RAG Retriever

## 5.1 RAG Retriever with Chroma

Let's combine everything: chunking (Module 2), embeddings (Module 3), and vector storage (Module 4).

In [None]:
import re

import chromadb
from chromadb.utils import embedding_functions

class RAGRetriever:
    def __init__(self, collection_name="rag_collection", persist_dir="./rag_db"):
        """
        Initialize RAG retriever with Chroma.
        """
        # Create Chroma client (using PersistentClient for ChromaDB 0.4.0+)
        self.client = chromadb.PersistentClient(path=persist_dir)
        
        # Create collection with sentence-transformers
        embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2"
        )
        
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=embedding_fn
        )
        
        print(f"✅ RAG Retriever initialized")
        print(f"Collection: {collection_name}")
        print(f"Current documents: {self.collection.count()}")
        print(f"📁 Data persisted to: {persist_dir}/")
    
    def chunk_text(self, text, chunk_size=500):
        """
        Simple sentence-based chunking from Module 2.
        """
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        current_chunk = ""
        
        for sentence in sentences:
            if len(current_chunk) + len(sentence) > chunk_size and current_chunk:
                chunks.append(current_chunk.strip())
                current_chunk = sentence
            else:
                current_chunk += " " + sentence if current_chunk else sentence
        
        if current_chunk:
            chunks.append(current_chunk.strip())
        
        return chunks
    
    def add_document(self, text, metadata=None, source_name="unknown"):
        """
        Add a document (chunks it automatically).
        """
        # Chunk the document
        chunks = self.chunk_text(text)
        
        # Prepare data for Chroma
        ids = [f"{source_name}_chunk_{i}" for i in range(len(chunks))]
        metadatas = [
            {
                "source": source_name,
                "chunk_index": i,
                "total_chunks": len(chunks),
                **(metadata or {})
            }
            for i in range(len(chunks))
        ]
        
        # Add to collection
        self.collection.add(
            documents=chunks,
            metadatas=metadatas,
            ids=ids
        )
        
        print(f"✅ Added document '{source_name}': {len(chunks)} chunks")
        return len(chunks)
    
    def retrieve(self, query, top_k=3, filter_metadata=None):
        """
        Retrieve relevant chunks for a query.
        """
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k,
            where=filter_metadata
        )
        
        return {
            'documents': results['documents'][0],
            'metadatas': results['metadatas'][0],
            'distances': results['distances'][0]
        }
    
    def format_context(self, retrieved_results):
        """
        Format retrieved chunks for LLM prompt.
        """
        context = "Context from retrieved documents:\n\n"
        
        for i, (doc, metadata, distance) in enumerate(zip(
            retrieved_results['documents'],
            retrieved_results['metadatas'],
            retrieved_results['distances']
        ), 1):
            source = metadata.get('source', 'unknown')
            context += f"[{i}] From {source} (Relevance: {1/(1+distance):.3f}):\n"
            context += f"{doc}\n\n"
        
        return context

print("✅ RAGRetriever class defined!")
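
The chunking logic is easier to follow when run in isolation. The sketch below copies the sentence-based splitter from `chunk_text` into a standalone function and applies it to a short, made-up text with a deliberately small `chunk_size`, so the split points are visible.

```python
import re

def chunk_text(text, chunk_size=500):
    """Group whole sentences into chunks of at most roughly chunk_size characters."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if len(current_chunk) + len(sentence) > chunk_size and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " " + sentence if current_chunk else sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

text = "Vectors are stored in an index. Queries are embedded too. The nearest vectors win."
chunks = chunk_text(text, chunk_size=60)

for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Note one consequence of splitting only at sentence boundaries: a single sentence longer than `chunk_size` is kept whole, so a chunk can exceed the limit.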

## 5.2 Test the RAG Retriever

In [None]:
# Create retriever
retriever = RAGRetriever(collection_name="test_rag")

# Add sample documents
doc1 = """
Machine learning is a branch of artificial intelligence that focuses on building systems 
that can learn from data. These systems improve their performance over time without being 
explicitly programmed. Common applications include image recognition, natural language 
processing, and recommendation systems.
"""

doc2 = """
Python is a high-level programming language known for its simplicity and readability. 
It's widely used in web development, data science, automation, and artificial intelligence. 
Python's extensive library ecosystem makes it ideal for rapid development.
"""

doc3 = """
Vector databases are specialized databases designed to store and query high-dimensional 
vectors efficiently. They're essential for modern AI applications like semantic search, 
recommendation systems, and retrieval-augmented generation (RAG). Popular examples include 
FAISS, Pinecone, and Chroma.
"""

# Add documents
retriever.add_document(doc1, metadata={"category": "AI"}, source_name="ml_intro.txt")
retriever.add_document(doc2, metadata={"category": "programming"}, source_name="python_guide.txt")
retriever.add_document(doc3, metadata={"category": "databases"}, source_name="vector_db_overview.txt")

In [None]:
# Test query
query = "What are vector databases used for?"

results = retriever.retrieve(query, top_k=3)

print(f"Query: {query}\n")
print("="*80)
print(retriever.format_context(results))

---

# 6. Advanced Retrieval Techniques

## 6.1 Reranking

Retrieve more candidates (e.g., top 20), then rerank them using a more sophisticated model to get the final top k.

**Benefits:**
- Better quality than single-stage retrieval
- Can use cross-encoder models (more accurate but slower)

**How it works:**
1. Retrieval: Get top 20 candidates (fast, approximate)
2. Reranking: Score all 20 with better model
3. Return: Top 3 after reranking
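
The two stages above can be sketched in plain Python. Everything here is an illustrative assumption: the embeddings are made up, and `toy_rerank_score` is a simple word-overlap stand-in for a real cross-encoder, used only so the example runs without extra dependencies.

```python
def first_stage(query_vec, doc_vecs, n_candidates):
    """Cheap stage: rank all documents by inner product, keep the best n_candidates."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: sum(q * d for q, d in zip(query_vec, doc_vecs[i])),
        reverse=True,
    )
    return ranked[:n_candidates]

def toy_rerank_score(query, document):
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)

def retrieve_and_rerank(query, query_vec, documents, doc_vecs, n_candidates=3, top_k=2):
    """Stage 1 narrows the pool; stage 2 rescores only the survivors."""
    candidates = first_stage(query_vec, doc_vecs, n_candidates)
    reranked = sorted(candidates, key=lambda i: toy_rerank_score(query, documents[i]), reverse=True)
    return reranked[:top_k]

documents = [
    "neural networks learn from data",
    "vector databases store embeddings",
    "vector search finds similar embeddings fast",
    "cooking pasta takes ten minutes",
]
doc_vecs = [[0.2, 0.9], [0.9, 0.3], [0.8, 0.4], [0.0, 0.1]]  # made-up embeddings
query = "fast vector search"
query_vec = [1.0, 0.2]

print(first_stage(query_vec, doc_vecs, 3))                          # order after stage 1
print(retrieve_and_rerank(query, query_vec, documents, doc_vecs))   # order after reranking
```

In this toy data the first stage ranks document 1 above document 2, and the reranker swaps them. The expensive scorer only ever sees `n_candidates` documents, which is what keeps the approach affordable.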

## 6.2 Hybrid Search

Combine semantic search (embeddings) with keyword search (BM25, TF-IDF).

**Why hybrid?**
- Semantic search: Good for concepts and meaning
- Keyword search: Good for exact terms and names
- Together: Best of both worlds

**Implementation:**
1. Get results from semantic search
2. Get results from keyword search
3. Combine and rerank (e.g., weighted average)
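
Here is a minimal sketch of the weighted-average combination, assuming each search returns a dictionary of scores keyed by document ID. The scores, the IDs, and the `alpha` weight are made up for illustration. Min-max normalization is one simple way to put cosine similarities and BM25 scores on the same scale before mixing them.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the 0..1 range so different scorers are comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def hybrid_scores(semantic, keyword, alpha=0.5):
    """Weighted average: alpha * semantic + (1 - alpha) * keyword.

    Documents missing from one result list contribute 0 for that scorer.
    """
    semantic_n = min_max_normalize(semantic)
    keyword_n = min_max_normalize(keyword)
    doc_ids = set(semantic_n) | set(keyword_n)
    combined = {
        doc_id: alpha * semantic_n.get(doc_id, 0.0) + (1 - alpha) * keyword_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

semantic = {"doc_a": 0.82, "doc_b": 0.78, "doc_c": 0.40}   # e.g. cosine similarities
keyword = {"doc_b": 7.5, "doc_c": 2.5, "doc_d": 5.0}       # e.g. BM25 scores

ranking = hybrid_scores(semantic, keyword, alpha=0.5)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

`doc_b` wins because it scores well in both lists, even though `doc_a` has the highest semantic score on its own.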

## 6.3 MMR (Maximal Marginal Relevance)

Retrieve diverse results instead of very similar ones.

**Problem:** Top 3 results might be too similar (redundant)

**Solution:** MMR balances relevance and diversity
- Pick most relevant document first
- For next picks, balance relevance to query vs. difference from already selected docs

**Use when:** You want variety in retrieved context
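
The selection rule can be written as a short greedy loop. The sketch below is a plain-Python illustration with made-up 2-dimensional vectors. The `lambda_mult` parameter name is an assumption chosen for this example: 1.0 means pure relevance, and lower values give more weight to diversity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR selection.

    Each step picks the document maximizing
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already selected).
    """
    selected = []
    remaining = list(range(len(doc_vecs)))

    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected

doc_vecs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]   # docs 0 and 1 are near-duplicates
query_vec = [1.0, 0.3]

relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
balanced = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)

print(relevance_only)  # picks both near-duplicates
print(balanced)        # swaps the duplicate for a different document
```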

---

# 7. Performance Optimization

## 7.1 Index Selection

Choose the right FAISS index based on your needs:

**For accuracy (small datasets < 10k vectors):**
- `IndexFlatL2` or `IndexFlatIP`: Exact search, no approximation

**For speed (medium datasets 10k-1M vectors):**
- `IndexIVFFlat`: Clusters data, searches subset
- `IndexHNSWFlat`: Graph-based, very fast

**For memory (large datasets > 1M vectors):**
- `IndexIVFPQ`: Compressed vectors, saves memory
- Trade-off: Faster and smaller, but less accurate
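
A quick back-of-the-envelope calculation shows why compression matters at this scale. The sketch below counts only the bytes needed for the vectors themselves. It assumes float32 storage for flat indexes and, for product quantization, 8-bit codes with 48 sub-quantizers (an example setting, not a recommendation). Codebooks and other index overhead are ignored.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage for uncompressed float32 vectors (what the flat indexes keep)."""
    return n_vectors * dim * bytes_per_float

def pq_code_bytes(n_vectors, n_subquantizers, bits_per_code=8):
    """Storage for product-quantized codes only; codebooks and IVF overhead are ignored."""
    return n_vectors * n_subquantizers * bits_per_code // 8

n, d = 1_000_000, 384          # one million 384-dimensional embeddings
flat = flat_index_bytes(n, d)
pq = pq_code_bytes(n, n_subquantizers=48)   # 48 sub-vectors of 8 dimensions each (assumed setting)

print(f"Flat:  {flat / 1e9:.2f} GB")
print(f"PQ:    {pq / 1e6:.0f} MB")
print(f"Ratio: {flat // pq}x smaller")
```

The compressed codes are lossy, which is the accuracy cost mentioned in the trade-off above.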

## 7.2 Batch Processing

Add documents in batches instead of one by one for better performance.

```python
# Slow: Adding one at a time
for i, doc in enumerate(documents):
    collection.add(documents=[doc], ids=[f"doc_{i}"])

# Fast: Adding in batch
collection.add(documents=documents, ids=[f"doc_{i}" for i in range(len(documents))])
```
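
For very large collections, a single call with every document may be too big for memory or for per-call limits a backend might impose, so a middle ground is to add fixed-size batches. The `batched` helper below is a hypothetical utility written for this sketch. The `collection.add` call is left as a comment so the example runs without a database.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

documents = [f"document text {i}" for i in range(10)]
ids = [f"doc_{i}" for i in range(len(documents))]

batch_sizes = []
for doc_batch, id_batch in zip(batched(documents, 4), batched(ids, 4)):
    # In the notebook, this is where the call would go:
    # collection.add(documents=doc_batch, ids=id_batch)
    batch_sizes.append(len(doc_batch))
    print(len(doc_batch), id_batch[0], id_batch[-1])
```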

## 7.3 Dimension Reduction

Reduce embedding dimensions to save memory and improve speed.

**Example:** 768 dims → 384 dims or 256 dims

**Trade-off:** Faster and smaller, slightly lower quality
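
One simple way to reduce dimensions is a random projection: multiply every embedding by the same fixed random matrix. The sketch below uses plain Python and tiny sizes (8 → 4) so it stays readable. It illustrates just one technique. PCA is a common alternative, and the function names here are assumptions made for the example.

```python
import random

def random_projection_matrix(in_dim, out_dim, seed=0):
    """Gaussian random matrix scaled by 1/sqrt(out_dim); a fixed seed makes it reproducible."""
    rng = random.Random(seed)
    scale = 1.0 / (out_dim ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]

def project(vector, matrix):
    """Map a vector of length in_dim to length out_dim."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

in_dim, out_dim = 8, 4
matrix = random_projection_matrix(in_dim, out_dim)

embedding = [0.1 * i for i in range(in_dim)]   # made-up embedding
reduced = project(embedding, matrix)

print(len(embedding), "->", len(reduced))
```

The same matrix must be applied to both the stored embeddings and every query embedding. Otherwise the reduced vectors live in different spaces and their distances are meaningless.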

# 🎯 Practice Exercises

## Exercise 1: Chroma with Advanced Filtering

### Task
Build a document management system using Chroma with rich metadata and filtering.

### Instructions

1. Create a collection of at least 30 documents with metadata:
   ```python
   metadata = {
       "category": "...",  # e.g., "tech", "business", "science"
       "date": "...",      # e.g., "2024-01-15"
       "author": "...",    # e.g., "John Doe"
       "priority": ...    # e.g., 1, 2, 3
   }
   ```

2. Implement queries with different filters:
   - By category
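   - By date range (Chroma's comparison operators such as `$gte` apply to numeric values, so store dates as numbers, e.g. `20240115`, if you need range filters)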
   - By author
   - Combined filters (e.g., category AND date)

3. Test MMR (Maximal Marginal Relevance) if Chroma supports it

4. Compare results with and without filters

### Sample Data

```python
documents = [
    {
        "text": "Python 3.12 introduces new performance improvements...",
        "metadata": {
            "category": "tech",
            "date": "2024-01-15",
            "author": "Tech Team",
            "priority": 1
        }
    },
    # Add 29 more...
]
```

### Expected Output

```
Query: "latest technology updates"

Without filters:
1. [Result from any category]
2. [Result from any category]
3. [Result from any category]

With filter (category="tech"):
1. [Tech result]
2. [Tech result]
3. [Tech result]

With filter (category="tech" AND date>="2024-01-01"):
1. [Recent tech result]
2. [Recent tech result]
3. [Recent tech result]
```


---

# 8. Summary

## Key Takeaways

1. **Vector databases enable fast similarity search** at scale using approximate nearest neighbor algorithms.

2. **FAISS** is great for learning, prototyping, and maximum flexibility. Requires manual persistence.

3. **Chroma** is perfect for RAG applications with automatic persistence, metadata support, and simple API.

4. **Production systems** typically use managed services like Pinecone or self-hosted solutions like Weaviate/Qdrant.

5. **Metadata filtering** allows you to narrow search to specific document types or categories.

6. **Advanced techniques** like reranking, hybrid search, and MMR improve retrieval quality.

7. **Choose the right index** based on your dataset size and accuracy requirements.

## What's Next?

In **Module 5: Building Your First Complete RAG System**, you'll:
- Combine all modules into a working RAG application
- Add an LLM for generation
- Handle document uploads
- Create a simple interface
- Test with real queries

Get ready to build a complete system! 🚀