# Lab 25: Vector Store Implementation with OpenAI Embeddings and Chroma

## Learning Objectives
In this lab, you will learn how to:
- Integrate OpenAI embeddings with Chroma vector database for efficient document storage and retrieval
- Create a complete vector store from text documents using automated embedding generation
- Perform semantic similarity searches to find relevant documents based on contextual meaning
- Understand how vector databases enable intelligent document retrieval for RAG applications
- Compare search results for different query types and analyze semantic matching capabilities

## Overview
This lab demonstrates the practical implementation of a vector store using Chroma database with OpenAI embeddings. You'll learn how to automatically embed documents, store them in a searchable vector database, and perform similarity searches that understand semantic relationships rather than just keyword matching. This foundation is essential for building production-ready RAG systems that can intelligently retrieve relevant information.

In [None]:
# Import Essential Libraries for Vector Store Implementation
# This lab demonstrates how to create a complete vector store using OpenAI embeddings and Chroma database
# for efficient document storage and semantic similarity search in RAG applications

# OpenAIEmbeddings: Converts text into high-dimensional vectors that capture semantic meaning
# - Configured in this lab with OpenAI's text-embedding-3-large model (set via the model argument below)
# - Enables semantic understanding beyond simple keyword matching
from langchain_openai import OpenAIEmbeddings

# Chroma: Open-source vector database optimized for AI applications
# - Provides efficient storage and retrieval of embedded documents
# - Supports similarity search with various distance metrics
# - Ideal for building RAG systems with fast document retrieval
from langchain_chroma import Chroma

In [None]:
# OpenAI API Configuration
# Set up authentication for OpenAI services to access embedding models
# The embedding service requires a valid API key for generating high-quality vector representations

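import os
from getpass import getpass

# Configure the OpenAI API key used for embedding generation
# Avoid hardcoding the key in the notebook: reuse the OPENAI_API_KEY environment
# variable if it is already set, otherwise prompt for it without echoing the input
# The embeddings will be generated using OpenAI's text-embedding-3-large model,
# which produces 3072-dimensional vectors by default
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")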

In [None]:
# Initialize OpenAI Embeddings Model
# Create an embeddings instance using OpenAI's most advanced embedding model

# text-embedding-3-large is the larger of OpenAI's third-generation embedding models
# Key features:
# - 3072 dimensions by default for rich semantic representation
# - Stronger similarity and retrieval performance than the smaller text-embedding-3-small
# - Handles multilingual text as well as English
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("✅ OpenAI Embeddings initialized with text-embedding-3-large model")
print("📊 This model generates 3072-dimensional vectors for semantic search")

In [None]:
# Create Sample Documents for Vector Store
# Define a collection of sports-related documents to demonstrate semantic search capabilities
# These documents cover different sports (cricket and football) with varying themes

# Sample documents with diverse sports content:
# - Cricket World Cup content (documents 1 and 3)
# - Football World Cup content (documents 2 and 4)
# - Mix of themes: championships, highlights, player stories
# This variety will help demonstrate how semantic search finds contextually relevant matches
docs = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns", 
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
    "From Underdogs to Contenders: Football World Cup Surprises and Breakout Stars"
]

print(f"📄 Created {len(docs)} sample documents for vector store")
print("🏏 Documents include cricket and football World Cup content")
print("🔍 These will be embedded and stored for semantic similarity search")

In [None]:
# Create Vector Store with Automated Embedding Generation
# Use Chroma's from_texts() method to automatically embed and store documents
# This combines embedding generation and vector database storage in one operation

# Chroma.from_texts() performs several operations:
# 1. Generates embeddings for each document using the specified embedding model
# 2. Creates a Chroma collection (held in memory unless persist_directory is passed)
# 3. Stores the embedded documents with their original text
# 4. Creates indexes for efficient similarity search
# 5. Returns a searchable vector store ready for queries

vectorstore = Chroma.from_texts(texts=docs, embedding=embeddings)

print("🚀 Vector store created successfully!")
print("📦 All documents have been embedded and stored in Chroma database")
print("🔍 Vector store is ready for similarity search operations")
print(f"📊 Stored {len(docs)} documents as 3072-dimensional vectors")
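
The embed-and-store steps listed in the cell above can be illustrated with a toy, in-memory stand-in. This is only a conceptual sketch: `ToyVectorStore` and `toy_embed` are hypothetical names invented here, the hashed bag-of-words "embedding" carries no semantic meaning, and nothing below reflects how Chroma is implemented internally.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[float]


def toy_embed(text: str, dims: int = 16) -> Vector:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words counts."""
    vec = [0.0] * dims
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vec


@dataclass
class ToyVectorStore:
    """Toy illustration only: keeps texts and their vectors in parallel lists."""
    texts: List[str] = field(default_factory=list)
    vectors: List[Vector] = field(default_factory=list)

    @classmethod
    def from_texts(cls, texts: List[str], embed: Callable[[str], Vector]) -> "ToyVectorStore":
        store = cls()
        for text in texts:
            store.vectors.append(embed(text))  # generate the embedding
            store.texts.append(text)           # keep the original text alongside it
        return store


sample_texts = [
    "Thrilling Finale Awaits: The Countdown to the Cricket World Cup Championship",
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
]
toy_store = ToyVectorStore.from_texts(sample_texts, toy_embed)
print(len(toy_store.texts), "texts stored,", len(toy_store.vectors[0]), "dimensions each")
```

A real vector database additionally builds an index over the vectors so that searches do not have to scan every entry.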

In [None]:
# Perform Semantic Similarity Search - Cricket Context
# Search for documents related to "Rohit Sharma" (famous cricket player)
# This demonstrates how vector search understands semantic relationships

# The similarity search process:
# 1. Converts query "Rohit Sharma" into an embedding vector
# 2. Compares this vector against all stored document vectors
# 3. Finds documents with highest semantic similarity scores
# 4. Returns the most relevant matches based on contextual understanding

print("🏏 Searching for documents related to 'Rohit Sharma' (cricket context)")
print("🔍 Expected: Cricket-related documents should rank higher due to semantic similarity")
print()

results = vectorstore.similarity_search('Rohit Sharma', k=2)

print("📋 Top 2 most similar documents:")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    
print("\n💡 Check: cricket documents should rank first because Rohit Sharma is a cricketer, even though his name appears in none of the documents")
print("🧠 A pure keyword search would find no match for this query, which is what semantic search adds")
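
The four-step search process described in the cell above can be sketched as a brute-force scan using cosine similarity. The 3-dimensional vectors below are made up for illustration (real embeddings here have 3072 dimensions), and `cosine_similarity` and `top_k` are hypothetical helper names. Chroma itself uses an approximate nearest-neighbour index and a configurable distance metric, so treat this as a conceptual model rather than a description of its internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors; the axes loosely stand for (cricket, football, general sport)
indexed = [
    ("cricket final", [0.9, 0.1, 0.3]),
    ("football semi-final", [0.1, 0.9, 0.3]),
    ("cricket highlights", [0.8, 0.0, 0.5]),
    ("football surprises", [0.0, 0.8, 0.5]),
]
query = [1.0, 0.0, 0.2]  # pretend embedding of a cricket-related query

for score, text in top_k(query, indexed, k=2):
    print(f"{score:.3f}  {text}")
```

With these toy vectors, both cricket entries score higher than either football entry, mirroring the behaviour the lab expects from the real embeddings.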

In [None]:
# Perform Semantic Similarity Search - Football Context  
# Search for documents related to "Lionel Messi" (famous football player)
# This demonstrates how the same vector store adapts to different query contexts

# Comparing search contexts:
# - "Rohit Sharma" → Cricket association → Cricket documents prioritized
# - "Lionel Messi" → Football association → Football documents prioritized
# This shows how embeddings capture domain-specific semantic relationships

print("⚽ Searching for documents related to 'Lionel Messi' (football context)")
print("🔍 Expected: Football-related documents should rank higher due to semantic similarity")
print()

results = vectorstore.similarity_search('Lionel Messi', k=2)

print("📋 Top 2 most similar documents:")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    
print("\n💡 Check: football documents should rank first because Lionel Messi is a footballer, even though his name appears in none of the documents")
print("🧠 The same vector store returns different documents when the query comes from a different semantic context")
print("⚖️ Compare these results with the Rohit Sharma search to see context-aware retrieval")
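
Rather than judging the two result lists by eye, the comparison can be made explicit with a small check. The sketch below assumes a hypothetical helper, `keyword_hit_rate`, that reports what fraction of returned texts mention a given sport. The sample list is an illustrative top-2, not actual output from the searches above.

```python
from typing import Iterable


def keyword_hit_rate(texts: Iterable[str], keyword: str) -> float:
    """Fraction of texts containing the keyword (case-insensitive)."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if keyword.lower() in text.lower())
    return hits / len(texts)


# Illustrative result set only; substitute the page_content of real search results
example_top2 = [
    "Global Giants Clash: Football World Cup Semi-Finals Set the Stage for Epic Showdowns",
    "Record Crowds and Unforgettable Moments: Highlights from the Cricket World Cup",
]
print("football hit rate:", keyword_hit_rate(example_top2, "football"))
print("cricket hit rate:", keyword_hit_rate(example_top2, "cricket"))
```

In the notebook, the same function could be called as `keyword_hit_rate([doc.page_content for doc in results], "Football")` after each search, which makes it easy to confirm whether the expected sport dominates the top results.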

## Key Takeaways and Next Steps

### What You've Learned
1. **Vector Store Integration**: Successfully integrated OpenAI embeddings with Chroma database for automated document storage and retrieval
2. **Semantic Search Capabilities**: Demonstrated how vector databases understand contextual relationships (cricket vs football contexts)
3. **Foundation for Production**: Built a working in-memory vector store; persisting it to disk and testing it on larger collections are the remaining steps toward production use
4. **Query Adaptation**: Observed how the same vector store returns different documents depending on the semantic domain of the query

### Technical Insights
- **Embedding Quality**: OpenAI's text-embedding-3-large model provides rich 3072-dimensional semantic representations
- **Chroma Efficiency**: The `from_texts()` method simplifies the embedding and storage process into a single operation
- **Semantic Understanding**: Vector search goes beyond keyword matching to understand meaning and context
- **Contextual Retrieval**: Search results adapt based on the semantic domain of the query (sports player associations)
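
To put the 3072-dimensional figure in perspective, the raw size of the stored vectors can be estimated with simple arithmetic. The sketch below assumes each value is stored as a 4-byte float and ignores index overhead, document text, and metadata, so real storage will be larger. `embedding_bytes` is a hypothetical helper name.

```python
def embedding_bytes(num_docs: int, dims: int = 3072, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming fixed-width floating-point values."""
    return num_docs * dims * bytes_per_value


for n in (4, 1_000, 100_000):
    size = embedding_bytes(n)
    print(f"{n:>7} docs -> {size:>13,} bytes ({size / 1024**2:.1f} MiB)")
```

Under these assumptions each vector takes 12,288 bytes, so the four documents in this lab need about 48 KiB while 100,000 documents need roughly 1.1 GiB for the vectors alone.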

### Real-World Applications
- **Document Knowledge Bases**: Build intelligent search systems for company documents, manuals, and reports
- **Customer Support**: Create semantic search for FAQ systems that understand user intent
- **Research Tools**: Enable researchers to find relevant papers and articles based on conceptual similarity
- **Content Recommendation**: Develop systems that suggest related content based on semantic relationships

### Next Steps for RAG Development
1. **Scale Up**: Add hundreds or thousands of documents to test performance at scale
2. **Hybrid Search**: Combine semantic search with keyword search for optimal retrieval
3. **Metadata Filtering**: Add document metadata for more sophisticated filtering capabilities
4. **Integration**: Connect this vector store to LLM chains for complete RAG question-answering systems
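
As a starting point for the hybrid search step, one common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This is one option among several, not something prescribed by this lab. In the sketch below, the document IDs and both input rankings are hypothetical, and the constant `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each appearance contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from two independent retrievers
semantic_ranking = ["doc3", "doc1", "doc2"]
keyword_ranking = ["doc1", "doc4", "doc3"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one retriever are still kept, just ranked lower.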