# Notebook 3: Indexing & Simple Queries

**Difficulty:** Intermediate | **Estimated Time:** 90-120 minutes

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Integrate external vector stores (Qdrant, Chroma)
2. ✅ Compare embedding models (OpenAI vs HuggingFace)
3. ✅ Persist and load indexes from storage
4. ✅ Configure query engines with different modes
5. ✅ Implement VectorIndexRetriever and VectorIndexAutoRetriever
6. ✅ Understand response synthesis modes
7. ✅ Implement streaming responses

## Prerequisites

- Completed Notebooks 1 & 2
- Understanding of vector similarity and embeddings
- Chroma installed (the imports below require it); Qdrant is optional and its example is skipped if the package is missing

## Curriculum Coverage

- **Section 3.1:** Vector Store Integration
- **Section 3.2:** Creating VectorStoreIndex
- **Section 3.3:** Embedding Models
- **Section 3.4:** Query Engines
- **Section 3.5.1-3.5.2:** VectorIndexRetriever, VectorIndexAutoRetriever

---

## 1. Setup & Imports

In [2]:
# Core LlamaIndex
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    Document,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.vector_stores import VectorStoreInfo, MetadataInfo
from llama_index.core.retrievers import VectorIndexRetriever, VectorIndexAutoRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Vector Stores
from llama_index.vector_stores.chroma import ChromaVectorStore
try:
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    QDRANT_AVAILABLE = True
except ImportError:
    QDRANT_AVAILABLE = False
    print("⚠️  Qdrant not installed. Install with: pip install llama-index-vector-stores-qdrant qdrant-client")

# Embeddings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# LLM
from llama_index.llms.openai import OpenAI

# External libraries
import chromadb
if QDRANT_AVAILABLE:
    from qdrant_client import QdrantClient

from dotenv import load_dotenv
import os
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

✅ Imports successful!


In [3]:
# Load environment and configure Settings
load_dotenv()

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=1536
)
Settings.chunk_size = 1024
Settings.chunk_overlap = 200

print("✅ Settings configured")

✅ Settings configured
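
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of overlapping windows. It assumes a plain list of tokens and a hypothetical helper named `sliding_chunks`; LlamaIndex's own splitter is token-based and sentence-aware, so real chunk boundaries will differ.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Toy input: ten placeholder tokens (illustrative only)
tokens = [f"t{i}" for i in range(10)]
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks share their boundary tokens. With the settings above, that stride would be 1024 - 200 = 824 tokens under this simplified model.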


---

## 2. Prepare Sample Data

In [4]:
# Create comprehensive sample documents
documents = [
    Document(
        text="""
        Vector databases are specialized databases designed to store and query high-dimensional vectors.
        These vectors typically represent embeddings of text, images, or other data. Vector databases
        enable efficient similarity search using algorithms like HNSW (Hierarchical Navigable Small World)
        or IVF (Inverted File Index). Popular vector databases include Qdrant, Pinecone, Weaviate, and Milvus.
        """,
        metadata={"topic": "vector_databases", "difficulty": "intermediate", "year": 2023}
    ),
    Document(
        text="""
        HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor
        search. It builds a multi-layer graph where each layer is a subset of the previous one. The algorithm
        achieves excellent query performance (sub-millisecond) with high recall. HNSW parameters include
        M (number of connections per node) and ef_construction (search width during construction).
        """,
        metadata={"topic": "algorithms", "difficulty": "advanced", "year": 2023}
    ),
    Document(
        text="""
        Embedding models convert text into dense vector representations that capture semantic meaning.
        OpenAI's text-embedding-3-small produces 1536-dimensional vectors and is optimized for retrieval tasks.
        Open-source alternatives include sentence-transformers models like all-MiniLM-L6-v2 (384 dimensions)
        and all-mpnet-base-v2 (768 dimensions). The choice of embedding model affects retrieval quality and cost.
        """,
        metadata={"topic": "embeddings", "difficulty": "beginner", "year": 2024}
    ),
    Document(
        text="""
        Qdrant is an open-source vector database written in Rust. It supports HNSW indexing, filtering,
        and hybrid search. Qdrant can run locally (Docker) or in the cloud. Key features include payload
        filtering, quantization for memory reduction, and distributed deployments. Qdrant is particularly
        well-suited for production RAG applications.
        """,
        metadata={"topic": "qdrant", "difficulty": "intermediate", "year": 2024}
    ),
    Document(
        text="""
        Chroma is a lightweight, embedded vector database designed for AI applications. It runs in-memory
        or can persist to disk. Chroma is easy to set up and integrates seamlessly with LangChain and LlamaIndex.
        It's ideal for prototyping and small-to-medium scale applications. Chroma supports metadata filtering
        and multiple distance metrics (cosine, euclidean, dot product).
        """,
        metadata={"topic": "chroma", "difficulty": "beginner", "year": 2024}
    ),
]

print(f"✅ Created {len(documents)} sample documents")
print(f"   Topics: {', '.join(set(d.metadata['topic'] for d in documents))}")

✅ Created 5 sample documents
   Topics: chroma, vector_databases, qdrant, algorithms, embeddings


---

## 3. In-Memory Vector Store (Default)

### SimpleVectorStore: LlamaIndex's Built-in Store

In [5]:
# Create index with default in-memory vector store
print("Creating VectorStoreIndex (in-memory)...")
start_time = time.time()

simple_index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,
)

elapsed = time.time() - start_time
print(f"\n✅ Index created in {elapsed:.2f} seconds")
print(f"   Vector store type: SimpleVectorStore (in-memory)")

Creating VectorStoreIndex (in-memory)...


Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/5 [00:00<?, ?it/s]


✅ Index created in 1.76 seconds
   Vector store type: SimpleVectorStore (in-memory)


### When to Use In-Memory Vector Store

**Pros:**
- ✅ No external dependencies
- ✅ Fast setup
- ✅ Good for prototyping
- ✅ Can persist to disk

**Cons:**
- ❌ Not optimized for large-scale (>100k vectors)
- ❌ Limited filtering capabilities
- ❌ No distributed support
- ❌ Slower than specialized vector DBs

**Use for**: Demos, small datasets, development
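
The scaling limit is easier to see with a sketch of brute-force search, where every query is compared against every stored vector. This illustrates the general technique with plain Python lists and is not SimpleVectorStore's actual implementation; the 3-dimensional vectors and the `top_k` helper are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy vectors (illustrative only, not real embeddings)
toy_vectors = {
    "qdrant": [0.9, 0.1, 0.0],
    "chroma": [0.8, 0.3, 0.1],
    "embeddings": [0.1, 0.9, 0.2],
}
results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for score, doc_id in results:
    print(f"{doc_id}: {score:.4f}")
```

The work per query grows linearly with the number of stored vectors, which is fine for a handful of documents and becomes the bottleneck at larger scale. Approximate indexes such as HNSW avoid scanning everything.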

---

## 4. Chroma Vector Store Integration

In [6]:
# Initialize Chroma client (in-memory)
chroma_client = chromadb.EphemeralClient()  # In-memory
# For persistence: chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection_name = "llama_index_docs"
chroma_collection = chroma_client.create_collection(collection_name)

# Create Chroma vector store
chroma_vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Create storage context
storage_context = StorageContext.from_defaults(vector_store=chroma_vector_store)

print("Creating VectorStoreIndex with Chroma...")
start_time = time.time()

chroma_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
)

elapsed = time.time() - start_time
print(f"\n✅ Chroma index created in {elapsed:.2f} seconds")
print(f"   Collection: {collection_name}")
print(f"   Documents indexed: {chroma_collection.count()}")

Creating VectorStoreIndex with Chroma...


Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/5 [00:00<?, ?it/s]


✅ Chroma index created in 1.77 seconds
   Collection: llama_index_docs
   Documents indexed: 5


### Chroma Features

- **Ease of use**: Zero configuration for local dev
- **Metadata filtering**: WHERE clause support
- **Distance metrics**: L2 (default), cosine, IP (inner product)
- **Persistence**: Optional disk storage
- **Scales to**: ~1M vectors comfortably
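
The metric names in the list refer to different ways of comparing two vectors. A minimal sketch with made-up 2-dimensional vectors, using only the standard library (the function names are illustrative and not Chroma's API):

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return inner_product(a, b) / (math.hypot(*a) * math.hypot(*b))


a = [3.0, 4.0]
scaled_a = [6.0, 8.0]  # same direction as a, twice the length

print(f"cosine similarity: {cosine_similarity(a, scaled_a):.2f}")
print(f"L2 distance:       {l2_distance(a, scaled_a):.2f}")
print(f"inner product:     {inner_product(a, scaled_a):.2f}")
```

Cosine similarity treats the two vectors as identical because it ignores magnitude, while L2 and inner product do not. For unit-normalized embeddings, all three metrics produce the same ranking.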

---

## 5. Qdrant Vector Store Integration (Optional)

In [7]:
if QDRANT_AVAILABLE:
    # Initialize Qdrant client (in-memory)
    qdrant_client = QdrantClient(location=":memory:")
    # For persistence: QdrantClient(path="./qdrant_db")
    # For cloud: QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))
    
    # Create Qdrant vector store
    qdrant_vector_store = QdrantVectorStore(
        client=qdrant_client,
        collection_name="llama_index_qdrant",
    )
    
    # Create storage context
    qdrant_storage_context = StorageContext.from_defaults(vector_store=qdrant_vector_store)
    
    print("Creating VectorStoreIndex with Qdrant...")
    start_time = time.time()
    
    qdrant_index = VectorStoreIndex.from_documents(
        documents,
        storage_context=qdrant_storage_context,
        show_progress=True,
    )
    
    elapsed = time.time() - start_time
    print(f"\n✅ Qdrant index created in {elapsed:.2f} seconds")
    print(f"   Collection: llama_index_qdrant")
else:
    print("⚠️  Skipping Qdrant example (not installed)")
    qdrant_index = None

Creating VectorStoreIndex with Qdrant...


Parsing nodes:   0%|          | 0/5 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/5 [00:00<?, ?it/s]


✅ Qdrant index created in 1.95 seconds
   Collection: llama_index_qdrant


### 🎯 ML Engineering Note: Vector Database Comparison

| Feature | SimpleVectorStore | Chroma | Qdrant |
|---------|------------------|--------|--------|
| **Setup** | Built-in | Easy | Moderate |
| **Scale** | <100k vectors | ~1M vectors | 10M+ vectors |
| **Speed** | Moderate | Fast | Very Fast |
| **Filtering** | Basic | Good | Excellent |
| **Hybrid Search** | No | No | Yes |
| **Cloud Option** | No | Planned | Yes |
| **Best For** | Prototyping | Small-medium apps | Production |

**Recommendation**: 
- Prototyping: SimpleVectorStore or Chroma
- Production (<1M docs): Chroma or Qdrant
- Production (>1M docs): Qdrant, Pinecone, or Weaviate

---

## 6. Embedding Model Comparison

### 6.1 OpenAI Embeddings

In [8]:
# Test OpenAI embedding
openai_embed = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=1536,
)

test_text = "Vector databases enable semantic search"
start_time = time.time()
openai_vector = openai_embed.get_text_embedding(test_text)
openai_time = time.time() - start_time

print(f"OpenAI Embedding (text-embedding-3-small):")
print(f"  Dimensions: {len(openai_vector)}")
print(f"  Time: {openai_time*1000:.2f}ms")
print(f"  First 5 values: {openai_vector[:5]}")

OpenAI Embedding (text-embedding-3-small):
  Dimensions: 1536
  Time: 501.56ms
  First 5 values: [-0.02636837400496006, 0.040111102163791656, 0.008521012030541897, -0.012482762336730957, -0.007683198899030685]


### 6.2 HuggingFace Embeddings (Local)

In [9]:
# Test HuggingFace embedding
print("Loading HuggingFace model (this may take a moment on first run)...")
hf_embed = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # 384 dimensions
)

start_time = time.time()
hf_vector = hf_embed.get_text_embedding(test_text)
hf_time = time.time() - start_time

print(f"\nHuggingFace Embedding (all-MiniLM-L6-v2):")
print(f"  Dimensions: {len(hf_vector)}")
print(f"  Time: {hf_time*1000:.2f}ms")
print(f"  First 5 values: {hf_vector[:5]}")

Loading HuggingFace model (this may take a moment on first run)...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]


HuggingFace Embedding (all-MiniLM-L6-v2):
  Dimensions: 384
  Time: 2593.42ms
  First 5 values: [0.04381900280714035, -0.009547049179673195, -0.020091239362955093, 0.01502758264541626, 0.01334494911134243]


### Embedding Model Comparison

In [10]:
import pandas as pd

comparison = pd.DataFrame([
    {
        "Model": "OpenAI text-embedding-3-small",
        "Dimensions": len(openai_vector),
        "Time (ms)": f"{openai_time*1000:.2f}",
        "Cost": "$0.02/1M tokens",
        "Quality": "Excellent",
        "Hosting": "API",
    },
    {
        "Model": "all-MiniLM-L6-v2",
        "Dimensions": len(hf_vector),
        "Time (ms)": f"{hf_time*1000:.2f}",
        "Cost": "Free",
        "Quality": "Good",
        "Hosting": "Local",
    },
])

print("\nEmbedding Model Comparison:")
print(comparison.to_string(index=False))


Embedding Model Comparison:
                        Model  Dimensions Time (ms)            Cost   Quality Hosting
OpenAI text-embedding-3-small        1536    501.56 $0.02/1M tokens Excellent     API
             all-MiniLM-L6-v2         384   2593.42            Free      Good   Local


### 🎯 ML Engineering Note: Embedding Selection

**OpenAI (text-embedding-3-small):**
- ✅ High retrieval quality
- ✅ Adjustable output size via the `dimensions` parameter (up to 1536)
- ✅ No model hosting needed
- ❌ Costs per API call
- ❌ Data leaves your infrastructure

**HuggingFace (all-MiniLM-L6-v2):**
- ✅ Free and open-source
- ✅ Runs locally (data privacy)
- ✅ Fast inference after warm-up (especially with GPU); the single timed call above was the model's first and is not a benchmark
- ❌ Lower quality than OpenAI
- ❌ Fixed dimensions (384)
- ❌ Requires model hosting

**Recommendation**: 
- Development: OpenAI (fast iteration)
- Production (high quality needed): OpenAI
- Production (cost-sensitive, privacy): HuggingFace + GPU
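
A back-of-the-envelope calculation makes the cost and size trade-off concrete. The sketch uses the $0.02 per 1M tokens price from the comparison table and assumes float32 storage (4 bytes per dimension). The corpus of 100,000 chunks at 500 tokens each is a made-up example, and real vector stores add index overhead on top of the raw vector bytes.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # from the comparison table in this notebook
BYTES_PER_FLOAT32 = 4                # assumption: vectors stored as float32


def embedding_cost_usd(num_chunks, tokens_per_chunk):
    """API cost of embedding the whole corpus once."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def raw_vector_megabytes(num_chunks, dimensions):
    """Size of the raw vectors only, excluding any index structures."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32 / 1_000_000


# Hypothetical corpus size, for illustration only
num_chunks, tokens_per_chunk = 100_000, 500

cost = embedding_cost_usd(num_chunks, tokens_per_chunk)
openai_mb = raw_vector_megabytes(num_chunks, 1536)
minilm_mb = raw_vector_megabytes(num_chunks, 384)

print(f"One-time OpenAI embedding cost: ${cost:.2f}")
print(f"Raw vectors at 1536 dims: {openai_mb:.1f} MB")
print(f"Raw vectors at 384 dims:  {minilm_mb:.1f} MB")
```

Under these assumptions, embedding the corpus costs about one dollar. The 1536-dimensional vectors take four times the storage of the 384-dimensional ones, roughly 614 MB versus 154 MB.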

---

## 7. Index Persistence

### Saving Index to Disk

In [11]:
# Save index to disk
persist_dir = "./storage"

print(f"Persisting index to {persist_dir}...")
simple_index.storage_context.persist(persist_dir=persist_dir)

print("\n✅ Index persisted successfully!")
print(f"   Location: {persist_dir}")

# Check what was saved
storage_path = Path(persist_dir)
if storage_path.exists():
    files = list(storage_path.glob("*"))
    print(f"   Files created: {len(files)}")
    for f in files:
        print(f"     - {f.name}")

Persisting index to ./storage...

✅ Index persisted successfully!
   Location: ./storage
   Files created: 5
     - image__vector_store.json
     - graph_store.json
     - index_store.json
     - docstore.json
     - default__vector_store.json


### Loading Index from Storage

In [11]:
# Load index from disk
print(f"Loading index from {persist_dir}...")

storage_context_loaded = StorageContext.from_defaults(persist_dir=persist_dir)
loaded_index = load_index_from_storage(storage_context_loaded)

print("✅ Index loaded successfully!")

# Test the loaded index
test_query_engine = loaded_index.as_query_engine(similarity_top_k=2)
test_response = test_query_engine.query("What is Qdrant?")

print(f"\nTest query on loaded index:")
print(f"  Query: What is Qdrant?")
print(f"  Response: {test_response}")

Loading index from ./storage...
✅ Index loaded successfully!

Test query on loaded index:
  Query: What is Qdrant?
  Response: Qdrant is an open-source vector database that is implemented in Rust. It features HNSW indexing, filtering capabilities, and supports hybrid search. Qdrant can be deployed locally using Docker or in cloud environments. Its key functionalities include payload filtering, quantization for memory efficiency, and support for distributed deployments, making it particularly suitable for production applications involving retrieval-augmented generation (RAG).


### Persistence Best Practices

1. **Version your indexes**: Include timestamp or version in directory name
2. **Backup regularly**: Especially before re-indexing
3. **Separate storage by environment**: dev/staging/prod
4. **Monitor disk usage**: Indexes can grow large
5. **Use external vector DBs for production**: Better than file-based persistence
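
Practices 1 and 3 can be combined in a small path helper. This is one possible layout and not a LlamaIndex convention; the function name `versioned_persist_dir` and the `index_<timestamp>` naming scheme are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path


def versioned_persist_dir(base_dir, environment, now=None):
    """Build a path like <base_dir>/<environment>/index_<UTC timestamp>."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return Path(base_dir) / environment / f"index_{stamp}"


# Fixed timestamp so the example is reproducible
example = versioned_persist_dir(
    "./storage",
    "dev",
    now=datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc),
)
print(example)
```

The resulting path can be passed as `persist_dir=str(example)` to `storage_context.persist(...)`, so each re-index lands in its own directory and earlier versions remain available for rollback.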

---

## 8. Query Engine Configuration

### 8.1 Basic Query Engine

In [12]:
# Create query engine with configuration
query_engine = chroma_index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)

query = "What are the main vector databases mentioned?"
response = query_engine.query(query)

print(f"Query: {query}\n")
print(f"Response:\n{response}")
print("\n" + "="*80)
print(f"\nSources used: {len(response.source_nodes)}")

Query: What are the main vector databases mentioned?

Response:
The main vector databases mentioned are Qdrant, Pinecone, Weaviate, and Milvus.


Sources used: 3


### 8.2 Response Synthesis Modes Deep Dive

In [13]:
# Test several response modes (accumulate is not exercised here)
modes = ["compact", "tree_summarize", "simple_summarize", "refine"]
test_query = "Explain HNSW algorithm"

print(f"Testing response modes with query: '{test_query}'\n")
print("="*80)

for mode in modes:
    engine = chroma_index.as_query_engine(
        similarity_top_k=2,
        response_mode=mode
    )
    
    start = time.time()
    resp = engine.query(test_query)
    elapsed = time.time() - start
    
    print(f"\nMode: {mode}")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Response length: {len(str(resp))} chars")
    print(f"  Response: {str(resp)[:200]}...")
    print("-" * 80)

Testing response modes with query: 'Explain HNSW algorithm'


Mode: compact
  Time: 4.39s
  Response length: 610 chars
  Response: HNSW, or Hierarchical Navigable Small World, is a graph-based algorithm designed for approximate nearest neighbor search. It constructs a multi-layer graph where each layer is a subset of the previous...
--------------------------------------------------------------------------------

Mode: tree_summarize
  Time: 3.96s
  Response length: 618 chars
  Response: HNSW, or Hierarchical Navigable Small World, is a graph-based algorithm designed for approximate nearest neighbor search. It constructs a multi-layer graph where each layer represents a subset of the ...
--------------------------------------------------------------------------------

Mode: simple_summarize
  Time: 3.51s
  Response length: 588 chars
  Response: HNSW (Hierarchical Navigable Small World) is a graph-based algorithm designed for approximate nearest neighbor search. It constructs a multi-la

### Response Mode Characteristics

| Mode | Speed | Quality | Best For |
|------|-------|---------|----------|
| **compact** | Fast | Good | General use |
| **tree_summarize** | Moderate | Excellent | Long contexts |
| **simple_summarize** | Very Fast | Basic | Simple queries |
| **refine** | Slow | Excellent | High quality needed |
| **accumulate** | Slow | Varied | Multiple perspectives |
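
One way to see why `refine` tends to be slower than `compact` is to count LLM calls. The sketch below is a conceptual model and not LlamaIndex's implementation: `CallCounter` is a hypothetical stand-in for an LLM, the 120-character budget is arbitrary, and real synthesizers size prompts in tokens and use their own templates.

```python
class CallCounter:
    """Hypothetical stand-in for an LLM that only counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer#{self.calls}"


def refine(chunks, llm):
    """One LLM call per chunk, each call revising the previous answer."""
    answer = None
    for chunk in chunks:
        answer = llm(f"Existing answer: {answer}\nNew context: {chunk}")
    return answer


def pack(chunks, max_chars):
    """Greedily merge chunks into as few groups as fit within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def compact(chunks, llm, max_chars):
    """Pack chunks first, then refine over the smaller number of groups."""
    return refine(pack(chunks, max_chars), llm)


chunks = ["a" * 50, "b" * 50, "c" * 50, "d" * 50]
refine_llm, compact_llm = CallCounter(), CallCounter()
refine(chunks, refine_llm)
compact(chunks, compact_llm, max_chars=120)
print(f"refine calls: {refine_llm.calls}, compact calls: {compact_llm.calls}")
```

With four 50-character chunks and a 120-character budget, `refine` makes four calls while `compact` makes two. With only two short retrieved nodes, as in the test cell, the gap between modes can be small.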

### 8.3 Streaming Responses

In [15]:
# Create streaming query engine
streaming_engine = chroma_index.as_query_engine(
    similarity_top_k=2,
    streaming=True,
)

query = "What is the difference between Qdrant and Chroma?"
print(f"Query: {query}\n")
print("Streaming response:")
print("-" * 80)

response = streaming_engine.query(query)

# Stream tokens
for text in response.response_gen:
    print(text, end="", flush=True)

print("\n" + "="*80)

Query: What is the difference between Qdrant and Chroma?

Streaming response:
--------------------------------------------------------------------------------
Qdrant and Chroma are both vector databases, but they cater to different needs and use cases. Chroma is lightweight and designed for easy setup, making it suitable for prototyping and small-to-medium scale applications. It runs in-memory or can persist to disk and integrates well with tools like LangChain and LlamaIndex. Chroma supports metadata filtering and various distance metrics.

On the other hand, Qdrant is an open-source database written in Rust, offering more advanced features such as HNSW indexing, filtering, and hybrid search. It can be deployed locally using Docker or in the cloud and is optimized for production applications, particularly in retrieval-augmented generation (RAG) scenarios. Qdrant also includes features like payload filtering, quantization for memory efficiency, and support for distributed deployments.


### Why Streaming?

**Benefits:**
- ✅ **Better UX**: Users see progress immediately
- ✅ **Lower perceived latency**: First token arrives faster
- ✅ **Interruptible**: Can stop generation early

**Use cases:**
- Chatbots and conversational interfaces
- Long-form content generation
- User-facing applications
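
The perceived-latency benefit can be illustrated without an LLM: a generator yields tokens as they become available, so the consumer can act on the first token before the full response exists. `fake_token_stream` is a hypothetical stand-in that sleeps only to simulate generation time.

```python
import time


def fake_token_stream(text, delay_seconds=0.01):
    """Yield one word at a time, pausing to mimic token generation."""
    for token in text.split():
        time.sleep(delay_seconds)
        yield token + " "


start = time.perf_counter()
first_token_at = None
pieces = []

for token in fake_token_stream("Qdrant and Chroma are both vector databases"):
    if first_token_at is None:
        first_token_at = time.perf_counter() - start  # time to first token
    pieces.append(token)

total = time.perf_counter() - start
full_text = "".join(pieces).strip()

print(f"Time to first token: {first_token_at * 1000:.0f} ms")
print(f"Time to full text:   {total * 1000:.0f} ms")
```

The total time is unchanged by streaming; what improves is how soon the user sees something. A consumer can also `break` out of the loop early, which corresponds to the interruptibility benefit listed above.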

---

## 9. VectorIndexRetriever

### Manual Retrieval Control

In [16]:
# Create custom retriever
retriever = VectorIndexRetriever(
    index=chroma_index,
    similarity_top_k=3,
)

# Retrieve nodes directly (no LLM synthesis)
query_str = "vector database algorithms"
retrieved_nodes = retriever.retrieve(query_str)

print(f"Query: {query_str}\n")
print(f"Retrieved {len(retrieved_nodes)} nodes:\n")

for i, node in enumerate(retrieved_nodes, 1):
    print(f"Node {i}:")
    print(f"  Score: {node.score:.4f}")
    print(f"  Topic: {node.metadata.get('topic')}")
    print(f"  Difficulty: {node.metadata.get('difficulty')}")
    print(f"  Text (first 150 chars): {node.text[:150]}...")
    print()

Query: vector database algorithms

Retrieved 3 nodes:

Node 1:
  Score: 0.4810
  Topic: vector_databases
  Difficulty: intermediate
  Text (first 150 chars): Vector databases are specialized databases designed to store and query high-dimensional vectors.
        These vectors typically represent embeddings ...

Node 2:
  Score: 0.2966
  Topic: algorithms
  Difficulty: advanced
  Text (first 150 chars): HNSW (Hierarchical Navigable Small World) is a graph-based algorithm for approximate nearest neighbor
        search. It builds a multi-layer graph wh...

Node 3:
  Score: 0.2869
  Topic: qdrant
  Difficulty: intermediate
  Text (first 150 chars): Qdrant is an open-source vector database written in Rust. It supports HNSW indexing, filtering,
        and hybrid search. Qdrant can run locally (Doc...



### Custom Query Engine from Retriever

In [17]:
# Build query engine from custom retriever
custom_query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    response_mode="compact",
)

response = custom_query_engine.query("Explain the HNSW algorithm")
print(f"Response from custom retriever:\n{response}")

Response from custom retriever:
HNSW, or Hierarchical Navigable Small World, is a graph-based algorithm designed for approximate nearest neighbor search. It constructs a multi-layer graph where each layer is a subset of the previous one, allowing for efficient navigation through the data. The algorithm is known for its excellent query performance, achieving response times in the sub-millisecond range while maintaining high recall rates. Key parameters of HNSW include M, which determines the number of connections per node, and ef_construction, which defines the search width during the construction phase of the graph. This structure enables effective and rapid similarity searches in high-dimensional spaces.


---

## 10. VectorIndexAutoRetriever

### Natural Language Metadata Filtering

In [19]:
# Define metadata schema for auto-retriever
vector_store_info = VectorStoreInfo(
    content_info="Technical documentation about vector databases and embeddings",
    metadata_info=[
        MetadataInfo(
            name="topic",
            type="str",
            description="The main topic of the document (e.g., 'qdrant', 'chroma', 'embeddings')",
        ),
        MetadataInfo(
            name="difficulty",
            type="str",
            description="Difficulty level: 'beginner', 'intermediate', or 'advanced'",
        ),
        MetadataInfo(
            name="year",
            type="int",
            description="Year of publication (2023 or 2024)",
        ),
    ],
)

# Create auto-retriever
auto_retriever = VectorIndexAutoRetriever(
    chroma_index,
    vector_store_info=vector_store_info,
    similarity_top_k=3,
)

print("✅ VectorIndexAutoRetriever configured")

✅ VectorIndexAutoRetriever configured


In [20]:
# Query with natural language filters
query_with_filter = "Tell me about beginner-level topics"

print(f"Query: {query_with_filter}\n")
print("Auto-retriever will automatically extract metadata filters from the query!\n")

retrieved = auto_retriever.retrieve(query_with_filter)

print(f"Retrieved {len(retrieved)} nodes:\n")
for i, node in enumerate(retrieved, 1):
    print(f"Node {i}:")
    print(f"  Topic: {node.metadata.get('topic')}")
    print(f"  Difficulty: {node.metadata.get('difficulty')}")
    print(f"  Score: {node.score:.4f}")
    print()

Query: Tell me about beginner-level topics

Auto-retriever will automatically extract metadata filters from the query!

Retrieved 2 nodes:

Node 1:
  Topic: chroma
  Difficulty: beginner
  Score: 0.2776

Node 2:
  Topic: embeddings
  Difficulty: beginner
  Score: 0.2651



### Auto-Retriever Advantages

**How it works:**
1. LLM extracts metadata filters from natural language query
2. Applies filters to vector store
3. Performs similarity search on filtered subset
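
The three steps above can be sketched in plain Python. Here the LLM extraction step is replaced by a hypothetical keyword lookup (`extract_filters`), and the similarity scores are hard-coded placeholders, not real embedding distances. Only the metadata values mirror the sample documents in this notebook.

```python
# Metadata mirrors the notebook's sample documents; scores are placeholders.
DOCS = [
    {"topic": "vector_databases", "difficulty": "intermediate", "score": 0.30},
    {"topic": "algorithms", "difficulty": "advanced", "score": 0.25},
    {"topic": "embeddings", "difficulty": "beginner", "score": 0.27},
    {"topic": "qdrant", "difficulty": "intermediate", "score": 0.22},
    {"topic": "chroma", "difficulty": "beginner", "score": 0.28},
]


def extract_filters(query):
    """Step 1 stand-in: the real retriever asks an LLM to infer filters."""
    filters = {}
    for level in ("beginner", "intermediate", "advanced"):
        if level in query.lower():
            filters["difficulty"] = level
    return filters


def retrieve(query, docs, top_k):
    filters = extract_filters(query)
    # Step 2: keep only documents whose metadata matches every filter
    candidates = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    # Step 3: rank the filtered subset by similarity and truncate
    ranked = sorted(candidates, key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]


hits = retrieve("Tell me about beginner-level topics", DOCS, top_k=3)
for hit in hits:
    print(hit["topic"], hit["difficulty"], hit["score"])
```

Because the filter is applied before ranking, fewer than `top_k` results can come back. That matches the notebook output, where `similarity_top_k=3` returned only the two beginner documents.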

**Pros:**
- ✅ Natural language interface to metadata
- ✅ No manual filter construction
- ✅ User-friendly for non-technical users

**Cons:**
- ❌ Extra LLM call (cost + latency)
- ❌ May not extract complex filters correctly
- ❌ Requires well-defined metadata schema

---

## 11. Summary: What You Learned

### ✅ Completed Learning Objectives

1. **Vector Store Integration**: Used SimpleVectorStore, Chroma, and Qdrant
2. **Embedding Comparison**: Tested OpenAI vs HuggingFace embeddings
3. **Index Persistence**: Saved and loaded indexes from disk
4. **Query Engines**: Configured different response modes and streaming
5. **Retrievers**: Implemented VectorIndexRetriever and VectorIndexAutoRetriever
6. **Production Patterns**: Learned trade-offs for real-world deployment

### Key Concepts Mastered

- **Vector databases**: Chroma (easy), Qdrant (production), SimpleVectorStore (dev)
- **Embedding models**: OpenAI (high quality, API), HuggingFace (free, local)
- **Response modes**: compact, tree_summarize, simple_summarize, refine, accumulate
- **Streaming**: Better UX for user-facing applications
- **Custom retrievers**: Direct control over retrieval logic
- **Auto-retriever**: Natural language metadata filtering


---

## 🎯 Practice Exercises

1. **Vector DB Comparison**: Create the same index in SimpleVectorStore, Chroma, and Qdrant. Compare query speeds.
2. **Embedding Experiment**: Build two indexes with different embeddings (OpenAI vs HuggingFace). Which retrieves better?
3. **Response Modes**: Test all response modes on a complex query. Which gives the best answer?
4. **Persistence**: Create an index, persist it, load it in a new session, verify it works.
5. **Auto-Retriever**: Add more metadata fields and test natural language filtering.

---

## Additional Resources

- **Vector Stores**: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/
- **Embeddings**: https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/
- **Query Engines**: https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/
- **Retrievers**: https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/