# OSB Vector Database Example

This notebook demonstrates how to create and use a vector database from Oversight Board full text data using Buttermilk's ChromaDB integration.

## Overview

We'll show how to:
1. Load OSB JSON data using existing data loaders
2. Generate embeddings and create a ChromaDB vector store
3. Use the generic RAG agent for interactive question answering
4. Demonstrate semantic search capabilities

This example uses the generic infrastructure that works with any JSON dataset.

## 1. Configuration Setup

First, let's set up the configuration for our OSB vector database pipeline.

In [1]:
from rich import print
from rich.pretty import pprint
import asyncio
import json
from pathlib import Path
import hydra
from hydra import compose, initialize_config_dir
from omegaconf import DictConfig, OmegaConf

# Buttermilk imports (duplicates removed)
from buttermilk import logger
from buttermilk._core.config import AgentConfig, DataSourceConfig
from buttermilk._core.dmrc import get_bm, set_bm
from buttermilk._core.types import Record
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk.data.loaders import create_data_loader
from buttermilk.data.vector import ChromaDBEmbeddings, DefaultTextSplitter, InputDocument, list_to_async_iterator
from buttermilk.utils.nb import init

# Initialize Buttermilk
cfg = init(job="osb_vectorise", overrides=["+storage=osb"])
bm = get_bm()

print("🚀 Buttermilk initialized for JSON-to-Vector tutorial")
pprint(cfg.storage)


2025-06-16 20:10:38 [] INFO bm_init.py:778 Logging set up for run: platform='local' name='bm_api' job='osb_vectorise' run_id='20250616T1010Z-UP4j-docker-desktop-debian' ip=None node_name='docker-desktop' save_dir='/tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian' flow_api=None. Save directory: /tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian
2025-06-16 20:10:38 [] INFO nb.py:59 Starting interactive run for bm_api job osb_vectorise in notebook
2025-06-16 20:10:38 [] INFO save.py:641 Successfully dumped data to local disk (JSON): /tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian/tmpcw2gxev0.json.
2025-06-16 20:10:38 [] INFO save.py:215 Successfully saved data using dump_to_disk to: /tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian/tmpcw2gxev0.json.
2025-06-16 20:10:38 [] INFO bm_init.py:864 {'message': 'Successfully saved data to: /tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian/tmpcw2gxev0.json', 'uri': '/tmp/tmpclcw4qip/bm_api/osb_vectorise/20250616T1010Z-UP4j-docker-desktop-debian/tmpcw2gxev0.json', 'run_id': '20250616T1010Z-UP4j-docker-desktop-debian'}


## 2. Initialize Components

Let's create the storage, vector store, and text splitter components.

In [2]:
# Now we can use the clean BM API for all storage types
source = bm.get_storage(cfg.storage.osb_json)
vectorstore = bm.get_storage(cfg.storage.osb_vector)

# Create text splitter
chunker = DefaultTextSplitter(chunk_size=2000, chunk_overlap=500)

print("✅ All storage components initialized via BM API")
print(f"Source: {type(source).__name__}")
print(f"Vector store: {type(vectorstore).__name__}")
print(f"Text splitter: {type(chunker).__name__}")


2025-06-16 20:10:48 [] INFO vector.py:249 Loading embedding model: gemini-embedding-001
2025-06-16 20:10:52 [] INFO vector.py:257 Initializing ChromaDB client at: gs://prosocial-public/osb/chromadb
2025-06-16 20:10:52 [] INFO vector.py:262 Using ChromaDB collection: osb_fulltext
2025-06-16 20:10:52 [] INFO vector.py:154 Initialized RecursiveCharacterTextSplitter (chunk_size=2000, chunk_overlap=500)
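
To make the `chunk_size=2000, chunk_overlap=500` settings concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplification: the log above shows that `DefaultTextSplitter` initializes a `RecursiveCharacterTextSplitter`, which also tries to break on separators. The helper name `sliding_chunks` is hypothetical and not part of Buttermilk.

```python
def sliding_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 500) -> list:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 5000
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
```

With these defaults each window starts 1500 characters after the previous one, so the last 500 characters of a chunk reappear at the start of the next. A larger overlap preserves more context across chunk boundaries but produces more embeddings to generate and store.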


In [3]:
# Ensure ChromaDB is ready - this handles both creation and reading scenarios
await vectorstore.ensure_cache_initialized()

print("✅ ChromaDB collection ready for use")
print(f"📊 Collection '{vectorstore.collection_name}' statistics:")
print(f"   📁 Directory: {vectorstore.persist_directory}")
print(f"   🧠 Model: {vectorstore.embedding_model}")
print(f"   📐 Dimensions: {vectorstore.dimensionality}")
print(f"   📦 Current count: {vectorstore.collection.count()} embeddings")


2025-06-16 20:11:01 [] INFO vector.py:281 🔄 Downloading remote ChromaDB: gs://prosocial-public/osb/chromadb
2025-06-16 20:11:01 [] INFO utils.py:645 Downloading ChromaDB from gs://prosocial-public/osb/chromadb to /home/debian/.cache/buttermilk/chromadb/gs___prosocial-public_osb_chromadb
2025-06-16 20:11:03 [] ERROR utils.py:676 Failed to download ChromaDB from gs://prosocial-public/osb/chromadb: Remote ChromaDB directory does not exist: gs://prosocial-public/osb/chromadb

OSError: ChromaDB download failed: Remote ChromaDB directory does not exist: gs://prosocial-public/osb/chromadb

## Safe Create vs Read Behavior

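The `ensure_cache_initialized()` method is designed to handle both scenarios with the same call. Note that in the run captured above, the remote directory `gs://prosocial-public/osb/chromadb` did not exist, so the download step raised an `OSError` before either path below was reached.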

### 🆕 **First Run (Creation)**
- Downloads remote ChromaDB if needed
- Creates new collection with proper schema
- Logs: "🆕 Creating new collection 'osb_fulltext'"

### 📖 **Subsequent Runs (Reading)** 
- Uses existing cached ChromaDB
- Validates collection compatibility
- Logs: "📖 Found existing collection 'osb_fulltext'" 

### 🔒 **Safety Features**
- ✅ Never overwrites existing collections
- ✅ Same config works for create and read
- ✅ Schema validation with helpful warnings
- ✅ Clear logging of all operations

This means you can:
1. **Run this notebook** to create embeddings
2. **Use the same config** in production to read embeddings
3. **Make no config changes** between the two scenarios
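
The create-versus-read pattern described above can be sketched with nothing but the standard library. This is not Buttermilk's implementation: `ensure_collection` and its JSON manifest are hypothetical stand-ins that only illustrate the idea of "create on first run, validate and reuse afterwards, never overwrite". For brevity the sketch raises on a schema mismatch, whereas the notebook describes warnings.

```python
import json
import tempfile
from pathlib import Path


def ensure_collection(store_dir: Path, name: str, dimensionality: int) -> str:
    """Create the collection manifest on first use, otherwise validate and reuse it."""
    manifest = store_dir / f"{name}.json"
    if manifest.exists():
        existing = json.loads(manifest.read_text())
        if existing["dimensionality"] != dimensionality:
            raise ValueError(
                f"Collection '{name}' has {existing['dimensionality']} dimensions, "
                f"expected {dimensionality}"
            )
        return "found"  # read path: nothing is overwritten
    store_dir.mkdir(parents=True, exist_ok=True)
    manifest.write_text(json.dumps({"name": name, "dimensionality": dimensionality}))
    return "created"  # first run: the collection did not exist yet


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "chromadb"
    first = ensure_collection(store, "osb_fulltext", 768)
    second = ensure_collection(store, "osb_fulltext", 768)
    print(first, second)
```

Because the same function serves both paths, the caller does not need a separate "create" and "read" configuration, which is the property the list above relies on.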

In [None]:
# Load live OSB data from GCS
print("📥 Loading live OSB data from GCS...")

# Create data loader from storage config
# (create_data_loader and DataSourceConfig were already imported in the first cell)

# Convert storage config to data source config
source_config = DataSourceConfig(**cfg.storage.osb_json)
data_loader = create_data_loader(source_config)

print(f"🔗 Data source: {source_config.path}")
print(f"📋 Field mapping: {source_config.columns}")

# Load documents (limit for demo, remove limit for full production run)
documents = []
doc_limit = 5  # Set to None for full dataset

print(f"📚 Loading {doc_limit or 'all'} documents from live dataset...")

async for record in data_loader.load():
    # Convert Record to InputDocument format for vector processing
    doc = InputDocument(
        record_id=record.record_id,
        title=record.metadata.get('title', 'Untitled OSB Case'),
        full_text=record.content if isinstance(record.content, str) else str(record.content),
        metadata=record.metadata,
        file_path="",
    )
    documents.append(doc)
    
    print(f"   📄 Loaded: {doc.record_id} - {doc.title[:60]}...")
    
    if doc_limit and len(documents) >= doc_limit:
        break

print(f"\n✅ Loaded {len(documents)} live OSB documents for vector processing")
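
The limit handling in the loop above can be factored into a small reusable helper. The sketch below is an assumption-laden illustration: `fake_loader` is a hypothetical stand-in for `data_loader.load()`, and `take` is not a Buttermilk function. Unlike the `if doc_limit and ...` check above, it distinguishes `None` (no limit) from `0` (take nothing).

```python
import asyncio
from collections.abc import AsyncIterator
from typing import Optional


async def fake_loader() -> AsyncIterator:
    """Hypothetical stand-in for `data_loader.load()` that yields ten records."""
    for i in range(10):
        yield {"record_id": f"rec-{i}"}


async def take(records: AsyncIterator, limit: Optional[int]) -> list:
    """Collect up to `limit` records; `None` means consume the whole stream."""
    collected = []
    if limit is not None and limit <= 0:
        return collected
    async for record in records:
        collected.append(record)
        if limit is not None and len(collected) >= limit:
            break  # stop pulling from the loader once the limit is reached
    return collected


first_five = asyncio.run(take(fake_loader(), 5))
everything = asyncio.run(take(fake_loader(), None))
print(len(first_five), len(everything))
```

Inside a notebook, where an event loop is already running, you would write `await take(...)` instead of `asyncio.run(...)`.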


## Configuration-Driven Multi-Field Vector Store

This notebook demonstrates a **configuration-driven approach** for multi-field vector embeddings that works across any data source.

### 🧠 **The Problem**
Traditional vector stores only embed the main content, leaving rich metadata unsearchable:
```python
# Traditional approach - metadata trapped
doc.content = "Long text..."        # → Gets embedded ✅
doc.metadata.summary = "Key points"  # → Not searchable ❌
```

### 🎯 **Our Solution: Configuration-Driven Multi-Field Embeddings**
Define which fields to embed in the storage configuration:
```yaml
# conf/storage/osb.yaml
osb_vector:
  type: chromadb
  # ... basic config
  multi_field_embedding:
    content_field: "content"
    additional_fields:
      - source_field: "summary"
        chunk_type: "summary"
        min_length: 50
      - source_field: "title"
        chunk_type: "title"
        min_length: 10
```
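
As a rough illustration of what such a configuration implies, the sketch below expands one record into typed chunks. It is not Buttermilk's code: `expand_record` is a hypothetical helper, the main content is kept as a single chunk instead of being split, and it assumes each `chunk_type` ends up in the chunk metadata under the `content_type` key that the search examples filter on.

```python
def expand_record(record: dict, config: dict) -> list:
    """Turn one record into typed chunks according to a multi-field config."""
    chunks = [{"text": record[config["content_field"]], "content_type": "content"}]
    for field in config.get("additional_fields", []):
        value = record.get(field["source_field"]) or ""
        # Skip fields that are missing or shorter than the configured minimum
        if len(value) >= field.get("min_length", 0):
            chunks.append({"text": value, "content_type": field["chunk_type"]})
    return chunks


config = {
    "content_field": "content",
    "additional_fields": [
        {"source_field": "summary", "chunk_type": "summary", "min_length": 50},
        {"source_field": "title", "chunk_type": "title", "min_length": 10},
    ],
}
record = {
    "content": "Full decision text ...",
    "summary": "Too short",         # 9 characters, below min_length=50, so skipped
    "title": "Example case title",  # 18 characters, meets min_length=10
}
chunk_types = [c["content_type"] for c in expand_record(record, config)]
print(chunk_types)
```

The `min_length` threshold keeps very short or empty fields from producing low-information embeddings.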

### 🔍 **Search Capabilities**

| Search Type | Use Case | Query Filter |
|-------------|----------|--------------|
| **Summary-Only** | High-level concepts | `where={"content_type": "summary"}` |
| **Title-Only** | Topic matching | `where={"content_type": "title"}` |
| **Content-Only** | Detailed analysis | `where={"content_type": "content"}` |
| **Cross-Field** | Comprehensive search | No filter = search everything |
| **Hybrid** | Semantic + exact match | `query + where={"case_number": "2024"}` |
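
The filters in the table are plain dictionaries passed as `where=`. When a hybrid search needs more than one condition, ChromaDB (to my understanding) expects them combined under an explicit `$and` operator, not as several keys in one dictionary. The helper below is a hypothetical convenience that only builds the dictionary; it does not call ChromaDB.

```python
from typing import Optional


def build_where(**conditions) -> Optional[dict]:
    """Build a ChromaDB-style `where` filter from equality conditions."""
    clauses = [{key: value} for key, value in conditions.items()]
    if not clauses:
        return None            # no filter: search every content type
    if len(clauses) == 1:
        return clauses[0]      # e.g. {"content_type": "summary"}
    return {"$and": clauses}   # several conditions are combined explicitly


summary_only = build_where(content_type="summary")
hybrid = build_where(content_type="summary", case_number="2024")
print(summary_only)
print(hybrid)
```

The result can be passed straight to a query, for example `collection.query(query_texts=[...], where=hybrid)`. If the helper returns `None`, omit the `where` argument.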

### 🏗️ **Benefits**
- ✅ **Configuration-Driven**: No hardcoded field names
- ✅ **Data Source Agnostic**: Works with any Record structure
- ✅ **Same Config**: Creation and reading use identical configuration
- ✅ **Extensible**: Easy to add new field types for any dataset

In [ ]:
async def create_production_vector_store():
    """Production pipeline: Process live OSB data with configuration-driven multi-field embedding."""
    
    print("🏭 Starting production vector store (configuration-driven)...")
    print(f"📊 Processing {len(documents)} live OSB documents")
    
    successful_embeddings = 0
    failed_embeddings = 0
    total_chunks = 0
    
    for i, doc in enumerate(documents):
        print(f"\n📄 [{i+1}/{len(documents)}] Processing: {doc.title[:50]}...")
        
        try:
            # Simple process call - multi-field chunking happens automatically via config!
            processed_doc = await vectorstore.process(doc)
            if processed_doc:
                successful_embeddings += 1
                chunk_count = len(processed_doc.chunks)
                total_chunks += chunk_count
                
                # Count chunk types for display
                chunk_types = {}
                for chunk in processed_doc.chunks:
                    content_type = chunk.metadata.get('content_type', 'content')
                    chunk_types[content_type] = chunk_types.get(content_type, 0) + 1
                
                breakdown = ", ".join([f"{count} {ctype}" for ctype, count in chunk_types.items()])
                print(f"   ✅ Embedded {chunk_count} chunks: {breakdown}")
            else:
                failed_embeddings += 1
                print(f"   ❌ Failed to process document")
                
        except Exception as e:
            failed_embeddings += 1
            print(f"   ❌ Error processing document: {e}")
    
    # Final results
    final_count = vectorstore.collection.count()
    
    print(f"\n🎉 Configuration-Driven Vector Store Created!")
    print(f"   📊 Documents processed: {successful_embeddings + failed_embeddings}")
    print(f"   ✅ Successfully embedded: {successful_embeddings}")
    print(f"   ❌ Failed: {failed_embeddings}")
    print(f"   📦 Total chunks: {total_chunks}")
    print(f"   🔢 Total embeddings in collection: {final_count}")
    print(f"   🏪 Collection: '{vectorstore.collection_name}'")
    print(f"   ⚙️  Multi-field config: {vectorstore.multi_field_config is not None}")
    
    return {
        'successful_docs': successful_embeddings,
        'failed_docs': failed_embeddings, 
        'total_chunks': total_chunks,
        'final_embedding_count': final_count
    }

# Create the production vector store using configuration
results = await create_production_vector_store()

# Test configuration-driven multi-field search capabilities
print("🔍 Testing Configuration-Driven Multi-Field Search...")

# The content_type values come from our configuration:
# - "content" (main content field)
# - "summary" (from additional_fields config)
# - "title" (from additional_fields config)

# 1. Search summaries only (high-level concepts)
print("\n🎯 1. SUMMARY-ONLY SEARCH:")
print("   Query: 'platform safety measures'")
summary_results = vectorstore.collection.query(
    query_texts=["platform safety measures"],
    where={"content_type": "summary"},  # Based on config: source_field="summary"
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

if summary_results["ids"] and summary_results["ids"][0]:
    for i, (doc, metadata, distance) in enumerate(zip(
        summary_results["documents"][0], 
        summary_results["metadatas"][0], 
        summary_results["distances"][0]
    )):
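        similarity = 1 - distance  # a true similarity only if the collection uses cosine distance (applies to the searches below too)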
        title = metadata.get("title", "Untitled")
        print(f"   📋 Result {i+1}: {title[:40]}... (similarity: {similarity:.3f})")
        print(f"      📝 Summary: {doc[:80]}...")

# 2. Search titles only (specific topics)
print("\n🎯 2. TITLE-ONLY SEARCH:")
print("   Query: 'content moderation'")
title_results = vectorstore.collection.query(
    query_texts=["content moderation"],
    where={"content_type": "title"},  # Based on config: source_field="title"
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

if title_results["ids"] and title_results["ids"][0]:
    for i, (doc, metadata, distance) in enumerate(zip(
        title_results["documents"][0], 
        title_results["metadatas"][0], 
        title_results["distances"][0]
    )):
        similarity = 1 - distance
        case_number = metadata.get("case_number", "Unknown")
        print(f"   📋 Result {i+1}: {case_number} (similarity: {similarity:.3f})")
        print(f"      🏷️  Title: {doc}")

# 3. Search main content only (detailed analysis)
print("\n🎯 3. CONTENT-ONLY SEARCH:")
print("   Query: 'automated detection algorithms'")
content_results = vectorstore.collection.query(
    query_texts=["automated detection algorithms"],
    where={"content_type": "content"},  # Based on config: content_field="content"
    n_results=2,
    include=["documents", "metadatas", "distances"]
)

if content_results["ids"] and content_results["ids"][0]:
    for i, (doc, metadata, distance) in enumerate(zip(
        content_results["documents"][0], 
        content_results["metadatas"][0], 
        content_results["distances"][0]
    )):
        similarity = 1 - distance
        title = metadata.get("title", "Untitled")
        chunk_seq = metadata.get("chunk_sequence", "?")
        print(f"   📋 Result {i+1}: {title[:30]}... chunk {chunk_seq} (similarity: {similarity:.3f})")
        print(f"      📝 Content: {doc[:100]}...")

# 4. Cross-field search (search all content types)
print("\n🎯 4. CROSS-FIELD SEARCH:")
print("   Query: 'user protection mechanisms'")
all_results = vectorstore.collection.query(
    query_texts=["user protection mechanisms"],
    n_results=5,  # No content_type filter = search everything
    include=["documents", "metadatas", "distances"]
)

if all_results["ids"] and all_results["ids"][0]:
    for i, (doc, metadata, distance) in enumerate(zip(
        all_results["documents"][0], 
        all_results["metadatas"][0], 
        all_results["distances"][0]
    )):
        similarity = 1 - distance
        content_type = metadata.get("content_type", "unknown")
        title = metadata.get("title", "Untitled")
        print(f"   📋 Result {i+1}: {content_type.upper()} from '{title[:30]}...' (similarity: {similarity:.3f})")
        print(f"      📝 Text: {doc[:80]}...")

print("\n✅ Configuration-driven multi-field search demonstrated!")
print("🎯 This approach works with ANY data source - just update the config!")
print("📝 For other datasets, modify conf/storage/[dataset].yaml:")
print("   additional_fields:")
print("     - source_field: 'abstract'  # For academic papers")
print("     - source_field: 'keywords'  # For any dataset with keywords")
print("     - source_field: 'category'  # For categorized content")
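
The searches above report `similarity = 1 - distance`. That conversion matches cosine similarity only when the collection is configured for cosine distance; the notebook does not show which distance space `osb_fulltext` uses, and ChromaDB's default is, to my knowledge, squared L2. The sketch below uses plain Python and made-up two-dimensional vectors to show how the two distances relate for unit-length vectors.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [1.0, 0.0]
b = [0.6, 0.8]  # unit vector whose cosine similarity with `a` is about 0.6

sim_from_cosine = 1 - cosine_distance(a, b)   # the intended similarity
sim_from_l2 = 1 - squared_l2(a, b) / 2        # same value, valid for unit vectors only
naive_from_l2 = 1 - squared_l2(a, b)          # what `1 - distance` yields under squared L2

print(round(sim_from_cosine, 3), round(sim_from_l2, 3), round(naive_from_l2, 3))
```

For unit vectors, squared L2 equals `2 - 2 * cos`, so `1 - distance` understates the similarity and can even go negative. If the scores look off, check the collection's distance setting before interpreting them.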

In [None]:
# Create data source configuration for the RAG agent.
# Reuse the storage config the vector store was initialized with, so the collection name,
# embedding model, and dimensionality match the stored embeddings.
data_config = DataSourceConfig(**cfg.storage.osb_vector)

# Create agent configuration
agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent",
    description="OSB Research Assistant",
    data={"osb_vector": data_config},
    variants={"model": "gemini-1.5-flash"},
    parameters={"template": "rag_research", "n_results": 10, "no_duplicates": False, "max_queries": 3},
)

# Initialize the RAG agent
rag_agent = RagAgent(**agent_config.model_dump())
print("RAG agent initialized successfully")


## 6. Semantic Search Examples

Let's demonstrate the semantic search capabilities of our OSB vector database.

In [None]:
async def search_osb_database(queries):
    """Search the OSB database with multiple queries."""
    print("\n=== OSB Database Search Results ===")

    results = await rag_agent.fetch(queries)

    for i, (query, result) in enumerate(zip(queries, results)):
        print(f"\n--- Query {i+1}: {query} ---")
        print(f"Found {len(result.results)} relevant chunks")

        if result.results:
            # Show the top result
            top_result = result.results[0]
            print(f"\nTop Result:")
            print(f"Document: {top_result.document_title}")
            print(f"Case Number: {top_result.metadata.get('case_number', 'N/A')}")
            print(f"Text: {top_result.full_text[:300]}...")
        else:
            print("No results found")


# Example search queries
search_queries = [
    "What are the challenges with automated content moderation?",
    "How effective are age verification systems?",
    "What techniques are used to spread misinformation?",
]

await search_osb_database(search_queries)


## 7. Interactive Chat Interface

Now let's create an interactive interface to chat with our OSB knowledge base.

In [None]:
async def chat_with_osb(user_question):
    """Interactive chat with OSB knowledge base."""
    print(f"\n🔍 User Question: {user_question}")

    # Search for relevant context
    search_results = await rag_agent.fetch([user_question])

    if search_results and search_results[0].results:
        context = search_results[0]
        print(f"\n📚 Found {len(context.results)} relevant documents")

        # Display relevant chunks
        print("\n📋 Relevant Information:")
        for i, result in enumerate(context.results[:3]):  # Show top 3
            print(f"\n{i+1}. {result.document_title} ({result.metadata.get('case_number', 'N/A')})")
            print(f"   {result.full_text[:200]}...")

        # In a real implementation, this would be sent to an LLM for synthesis
        print("\n🤖 AI Response: [In a real implementation, the retrieved context would be sent to an LLM to generate a synthesized response]")
    else:
        print("\n❌ No relevant information found in the OSB database")


# Example chat interactions
example_questions = [
    "What are the main issues with current content moderation approaches?",
    "What recommendations exist for age verification?",
    "How do platforms detect and counter misinformation?",
]

for question in example_questions:
    await chat_with_osb(question)
    print("\n" + "=" * 80)
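
The placeholder response above stops short of the synthesis step. One common way to prepare that step is to format the retrieved chunks as numbered sources ahead of the question. The sketch below shows only the prompt assembly; `build_prompt`, the chunk dictionary shape, and the instruction wording are all hypothetical, and no LLM is called.

```python
def build_prompt(question: str, chunks: list, max_chars: int = 1500) -> str:
    """Format retrieved chunks as numbered sources followed by the question."""
    lines = ["Answer the question using only the sources below. Cite sources by number.", ""]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk['title']} ({chunk.get('case_number', 'N/A')})")
        lines.append(chunk["text"][:max_chars])  # truncate long chunks to bound prompt size
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Placeholder chunks; real ones would come from the retrieval step
chunks = [
    {"title": "Example case A", "case_number": "N/A", "text": "Placeholder excerpt A."},
    {"title": "Example case B", "text": "Placeholder excerpt B."},
]
prompt = build_prompt("What recommendations exist for age verification?", chunks)
print(prompt)
```

Numbering the sources lets the model cite them and lets you trace each claim in the answer back to a specific chunk.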


## 8. Vector Store Analysis

Let's analyze our vector store to understand what we've created.

In [None]:
# Get collection statistics
collection = vectorstore.collection
count = collection.count()

print(f"\n=== OSB Vector Store Statistics ===")
print(f"Collection Name: {vectorstore.collection_name}")
print(f"Total Chunks: {count}")
print(f"Embedding Dimensions: {vectorstore.dimensionality}")
print(f"Embedding Model: {vectorstore.embedding_model}")

# Get a sample of metadata to understand the structure
sample_results = collection.get(limit=3, include=["metadatas", "documents"])

print(f"\n=== Sample Metadata Structure ===")
if sample_results["metadatas"]:
    sample_metadata = sample_results["metadatas"][0]
    print("Available metadata fields:")
    for key, value in sample_metadata.items():
        print(f"  - {key}: {type(value).__name__} = {str(value)[:50]}...")

print(f"\n=== Storage Locations ===")
print(f"ChromaDB Directory: {vectorstore.persist_directory}")
print(f"Embeddings Directory: {vectorstore.arrow_save_dir}")
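
A three-row sample shows the metadata shape but not how chunks are distributed. The sketch below counts metadata keys and `content_type` values over a result in the flat shape that `collection.get(...)` returns. The sample data is invented for illustration and `summarize_metadata` is a hypothetical helper.

```python
from collections import Counter


def summarize_metadata(get_result: dict) -> dict:
    """Count how often each metadata key and each content_type appears."""
    metadatas = get_result.get("metadatas") or []
    keys = Counter(key for meta in metadatas for key in meta)
    types = Counter(meta.get("content_type", "unknown") for meta in metadatas)
    return {"keys": dict(keys), "content_types": dict(types)}


# Invented data in the flat shape returned by `collection.get(...)`
sample = {
    "ids": ["a", "b", "c"],
    "metadatas": [
        {"title": "Case A", "content_type": "content"},
        {"title": "Case A", "content_type": "summary"},
        {"title": "Case B", "content_type": "content", "case_number": "X-1"},
    ],
}
summary = summarize_metadata(sample)
print(summary)
```

Against the real store you could pass `collection.get(include=["metadatas"])`, keeping in mind that this loads every metadata record into memory.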


## 9. Advanced Search Examples

Let's explore some advanced search patterns and filtering capabilities.

In [None]:
# Direct ChromaDB queries with metadata filtering
async def advanced_search_examples():
    """Demonstrate advanced search capabilities."""
    print("\n=== Advanced Search Examples ===")

    # 1. Search with metadata filtering
    print("\n1. Search within specific case:")
    results = collection.query(
        query_texts=["content moderation challenges"], n_results=5, where={"case_number": "OSB-2024-001"}, include=["documents", "metadatas"]
    )
    print(f"   Found {len(results['ids'][0]) if results['ids'] else 0} results in OSB-2024-001")

    # 2. Similarity search across all documents
    print("\n2. General similarity search:")
    results = collection.query(query_texts=["artificial intelligence and safety"], n_results=5, include=["documents", "metadatas", "distances"])

    if results["ids"] and results["ids"][0]:
        print(f"   Found {len(results['ids'][0])} results")
        for i, (doc, metadata, distance) in enumerate(zip(results["documents"][0][:3], results["metadatas"][0][:3], results["distances"][0][:3])):
            print(f"   Result {i+1} (similarity: {1-distance:.3f}): {metadata.get('title', 'N/A')}")
            print(f"     {doc[:100]}...")

    # 3. Multi-query search
    print("\n3. Multi-query search:")
    multi_queries = ["platform safety measures", "user protection mechanisms", "digital safety standards"]

    for query in multi_queries:
        results = collection.query(query_texts=[query], n_results=2, include=["metadatas"])
        count = len(results["ids"][0]) if results["ids"] else 0
        print(f"   '{query}': {count} results")


await advanced_search_examples()
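
The multi-query example issues one query per text. `collection.query` also accepts several texts in `query_texts` and returns one inner list per query, so related queries can be sent together and their hits merged. The sketch below works on an invented result dictionary in that nested shape; `merge_query_results` is a hypothetical helper that keeps the smallest distance seen for each chunk id.

```python
def merge_query_results(results: dict) -> list:
    """Flatten nested query results and keep the smallest distance per id."""
    best = {}
    for ids, distances in zip(results["ids"], results["distances"]):
        for chunk_id, distance in zip(ids, distances):
            if chunk_id not in best or distance < best[chunk_id]:
                best[chunk_id] = distance
    # Closest chunks first
    return sorted(best.items(), key=lambda item: item[1])


# Invented result for two query texts, in the nested shape `collection.query(...)` returns
results = {
    "ids": [["c1", "c2"], ["c2", "c3"]],
    "distances": [[0.10, 0.40], [0.25, 0.30]],
}
merged = merge_query_results(results)
print(merged)
```

Deduplicating by id matters when several phrasings of a question retrieve the same chunk, which is likely for the three closely related queries above.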


## 10. Production Considerations

Here are key considerations for using this in production:

In [None]:
print(
    """
=== Production Deployment Checklist ===

🔧 Configuration:
   ✓ Use GCS for persist_directory: gs://your-bucket/chromadb
   ✓ Configure appropriate chunk_size for your content
   ✓ Set concurrency based on your compute resources
   ✓ Use production embedding models (text-embedding-004/005)

📊 Performance:
   ✓ Monitor embedding generation costs
   ✓ Implement caching for frequently accessed data
   ✓ Use batch processing for large datasets
   ✓ Configure appropriate timeout values

🔒 Security:
   ✓ Secure GCS bucket access with proper IAM
   ✓ Implement data access controls
   ✓ Audit vector store queries
   ✓ Protect sensitive metadata

🚀 Scalability:
   ✓ Plan for vector store size growth
   ✓ Implement horizontal scaling for embeddings
   ✓ Monitor query performance
   ✓ Set up proper logging and monitoring

🔄 Maintenance:
   ✓ Plan for data updates and reindexing
   ✓ Implement backup strategies
   ✓ Version control for embeddings and metadata
   ✓ Regular quality assessments
"""
)

# Show next steps
print(
    """
=== Next Steps ===

1. Scale to Full Dataset:
   - Use the osb_vectorize.yaml configuration
   - Run: uv run python -m buttermilk.data.vector +run=osb_vectorize

2. Deploy RAG Flow:
   - Use the osb_rag.yaml flow configuration
   - Run: uv run python -m buttermilk.runner.cli +flow=osb_rag +run=api

3. Integrate with Frontend:
   - Use the Buttermilk web interface
   - Connect to WebSocket endpoints for real-time chat

4. Monitor and Optimize:
   - Track query performance
   - Monitor embedding costs
   - Tune chunk sizes and retrieval parameters
"""
)
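
The checklist item "Set concurrency based on your compute resources" can be illustrated with a bounded-concurrency pattern built on `asyncio.Semaphore`. This is a generic sketch: `embed_document` is a hypothetical stand-in for `vectorstore.process(doc)`, and whether the real method is safe to call concurrently is an assumption you would need to verify.

```python
import asyncio


async def embed_document(doc_id: str) -> str:
    """Hypothetical stand-in for `vectorstore.process(doc)`."""
    await asyncio.sleep(0.01)  # simulate the latency of an embedding call
    return doc_id


async def process_all(doc_ids: list, concurrency: int = 4) -> tuple:
    """Process documents concurrently, never running more than `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(doc_id: str) -> str:
        async with semaphore:
            return await embed_document(doc_id)

    # return_exceptions=True records failures without cancelling the other tasks
    outcomes = await asyncio.gather(*(guarded(d) for d in doc_ids), return_exceptions=True)
    succeeded = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [d for d, o in zip(doc_ids, outcomes) if isinstance(o, BaseException)]
    return succeeded, failed


done, failed = asyncio.run(process_all([f"doc-{i}" for i in range(10)]))
print(len(done), len(failed))
```

In a notebook, replace `asyncio.run(...)` with `await process_all(...)`. The `concurrency` value is a trade-off between throughput and the rate limits of the embedding service.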


## 🚀 Production Deployment Guide

This vector store is now ready for production use. Here's how to deploy and use it:

### 📋 **For Full Dataset Processing**
```python
# In the data-loading cell, change this line:
doc_limit = 5  # Set to None for full dataset

# To:
doc_limit = None  # Processes all OSB documents
```

### 🏭 **Production Usage Examples**

#### **Option 1: RAG Agent Integration**
```python
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk._core.config import AgentConfig, DataSourceConfig

# Same config as creation - no changes needed!
data_config = DataSourceConfig(**cfg.storage.osb_vector)

agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent", 
    description="OSB Knowledge Assistant",
    data={"osb_vector": data_config},
    parameters={"n_results": 10, "max_queries": 3}
)

rag_agent = RagAgent(**agent_config.model_dump())
```

#### **Option 2: Direct ChromaDB Access**
```python
# Create vector store instance (reads existing embeddings)
production_vectorstore = bm.get_storage(cfg.storage.osb_vector)
await production_vectorstore.ensure_cache_initialized()

# Perform semantic search
results = production_vectorstore.collection.query(
    query_texts=["platform safety policies"],
    n_results=5
)
```

#### **Option 3: Flow Integration**
```yaml
# conf/flows/osb_rag.yaml
defaults:
  - base_flow

orchestrator: buttermilk.orchestrators.groupchat.AutogenOrchestrator
data: osb_vector  # References the same storage config
agents: [rag_agent, host/sequencer]
```

### 🔒 **Production Considerations**
- ✅ **Persistent Storage**: Vector store saved to `gs://prosocial-public/osb/chromadb`  
- ✅ **Config Reuse**: Same `osb.yaml` works for both creation and reading
- ✅ **Scalability**: ChromaDB handles thousands of documents efficiently
- ✅ **Monitoring**: Check collection count and performance metrics
- ✅ **Updates**: Re-run this notebook to add new OSB documents

### 💡 **Next Steps**
1. **Scale Up**: Remove `doc_limit` to process full OSB dataset
2. **Deploy**: Use in RAG agents, search APIs, or analytical workflows  
3. **Monitor**: Track embedding quality and search relevance
4. **Iterate**: Add new documents by re-running the pipeline