# OSB Vector Database Example

This notebook demonstrates how to create and use a vector database from Oversight Board full text data using Buttermilk's ChromaDB integration.

## Overview

We'll show how to:
1. Load OSB JSON data using existing data loaders
2. Generate embeddings and create a ChromaDB vector store
3. Use the generic RAG agent for interactive question answering
4. Demonstrate semantic search capabilities

This example uses the generic infrastructure that works with any JSON dataset.

## 1. Configuration Setup

First, let's set up the configuration for our OSB vector database pipeline.

In [1]:
from rich import print
from rich.pretty import pprint
import asyncio
import json
from pathlib import Path
import hydra
from hydra import compose, initialize_config_dir
from omegaconf import DictConfig, OmegaConf

# Buttermilk imports - updated for unified storage system
from buttermilk import logger
from buttermilk.data.vector import ChromaDBEmbeddings, DefaultTextSplitter
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk._core.config import AgentConfig
from buttermilk._core.storage_config import StorageConfig  # New unified config
from buttermilk._core.types import Record  # Enhanced Record with vector capabilities

from buttermilk.utils.nb import init
from buttermilk._core.dmrc import get_bm, set_bm

# Initialize Buttermilk
cfg = init(job="osb_vectorise", overrides=["+storage=osb", "+agents=rag_generic", "+llms=lite"])
bm = get_bm()

print("🚀 Buttermilk initialized for JSON-to-Vector tutorial")
pprint(cfg.storage)


2025-06-17 02:40:03 [] INFO bm_init.py:778 Logging set up for run: platform='local' name='bm_api' job='osb_vectorise' run_id='20250616T1640Z-UoeW-docker-desktop-debian' ip=None node_name='docker-desktop' save_dir='/tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian' flow_api=None. Save directory: /tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian
2025-06-17 02:40:03 [] INFO nb.py:59 Starting interactive run for bm_api job osb_vectorise in notebook
2025-06-17 02:40:03 [] INFO save.py:641 Successfully dumped data to local disk (JSON): /tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian/tmpxupqm897.json.
2025-06-17 02:40:03 [] INFO save.py:215 Successfully saved data using dump_to_disk to: /tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian/tmpxupqm897.json.
2025-06-17 02:40:03 [] INFO bm_init.py:864 {'message': 'Successfully saved data to: /tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian/tmpxupqm897.json', 'uri': '/tmp/tmp6olgr11o/bm_api/osb_vectorise/20250616T1640Z-UoeW-docker-desktop-debian/tmpxupqm897.json', 'run_id': '20250616T1640Z-UoeW-docker-desktop-debian'}


## 2. Initialize Components

Let's create the storage, vector store, and text splitter components.

In [2]:
# Now we can use the clean BM API for all storage types
source = bm.get_storage(cfg.storage.osb_json)
vectorstore = bm.get_storage(cfg.storage.osb_vector)

# Create text splitter
chunker = DefaultTextSplitter(chunk_size=1200, chunk_overlap=400)


2025-06-17 02:40:09 [] INFO vector.py:313 Loading embedding model: gemini-embedding-001
2025-06-17 02:40:13 [] INFO vector.py:321 Initializing ChromaDB client at: gs://prosocial-dev/data/osb/chromadb
2025-06-17 02:40:13 [] INFO vector.py:326 Using ChromaDB collection: osb_fulltext
2025-06-17 02:40:13 [] INFO vector.py:217 Initialized RecursiveCharacterTextSplitter (chunk_size=1200, chunk_overlap=400)


## Safe Create vs Read Behavior

The `ensure_cache_initialized()` method will create a new collection if required:

### 🆕 **First Run (Creation)**
- Downloads remote ChromaDB if needed
- Creates new collection with proper schema
- Logs: "🆕 Creating new collection 'osb_fulltext'"

### 📖 **Subsequent Runs (Reading)** 
- Uses existing cached ChromaDB
- Validates collection compatibility
- Logs: "📖 Found existing collection 'osb_fulltext'" 

### 🔒 **Safety Features**
- ✅ Never overwrites existing collections
- ✅ Same config works for create and read
- ✅ Schema validation with helpful warnings
- ✅ Clear logging of all operations

This means you can:
1. **Run this notebook** to create embeddings  
2. **Use same config** in production to read embeddings
3. **No config changes** needed between scenarios
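
The create-or-read behavior described above can be sketched without ChromaDB. This is a minimal stand-in, not Buttermilk's implementation: `CollectionRegistry` and its `ensure_collection` method are hypothetical names, and a plain dict plays the role of the persistent store. It only illustrates the rule that an existing collection is returned as-is and never replaced.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRegistry:
    """Hypothetical stand-in for a persistent vector store client."""

    collections: dict = field(default_factory=dict)

    def ensure_collection(self, name: str) -> tuple:
        """Return (collection, created). Existing collections are never replaced."""
        if name in self.collections:
            return self.collections[name], False
        self.collections[name] = []
        return self.collections[name], True


registry = CollectionRegistry()

# First run: the collection does not exist yet, so it is created
first, created_first = registry.ensure_collection("osb_fulltext")
first.append("chunk-1")

# Second run with the same name: the existing collection (and its data) is reused
second, created_second = registry.ensure_collection("osb_fulltext")
print(created_first, created_second, len(second))
```

Because the same call covers both cases, the calling code does not need to know whether it is running for the first time.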

In [None]:
# Ensure ChromaDB is ready - this handles both creation and reading scenarios
await vectorstore.ensure_cache_initialized()

print("✅ ChromaDB collection ready for use")
print(f"📊 Collection '{vectorstore.collection_name}' statistics:")
print(f"   📁 Directory: {vectorstore.persist_directory}")
print(f"   🧠 Model: {vectorstore.embedding_model}")
print(f"   📐 Dimensions: {vectorstore.dimensionality}")

# Get current count to see if embeddings were actually added
current_count = vectorstore.collection.count()
print(f"   📦 Current count: {current_count} embeddings")


2025-06-17 02:40:14 [] INFO vector.py:387 📋 Using existing local cache (modified 2.3 minutes ago)
2025-06-17 02:40:14 [] INFO vector.py:388 🔒 Skipping download to preserve local changes
2025-06-17 02:40:14 [] INFO vector.py:350 ✅ ChromaDB cache ready at: /home/debian/.cache/buttermilk/chromadb/gs___prosocial-dev_data_osb_chromadb
2025-06-17 02:40:14 [] INFO vector.py:456 📖 Found existing collection 'osb_fulltext'
2025-06-17 02:40:14 [] INFO vector.py:476 ✅ Collection 'osb_fulltext' ready (0 embeddings)


In [4]:
# Load live OSB data from GCS
print("📥 Loading live OSB data from GCS...")

print(f"🔗 Data source: {source.path}")

# Load documents (limit for demo, remove limit for full production run)
records = []
doc_limit = None  # Set to None for full dataset

print(f"📚 Loading {doc_limit or 'all'} documents from live dataset...")

for record in source:
    # Enhanced Record already has all needed capabilities - no conversion needed!
    # Just ensure it has full_text for vector processing
    if not record.full_text and isinstance(record.content, str):
        record.full_text = record.content

    records.append(record)

    if doc_limit and len(records) >= doc_limit:
        break

print(f"\n✅ Loaded {len(records)} live OSB documents for vector processing")


content.str
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
content.json-or-python[json=list[union[str,is-instance[Image]]],python=chain[is-instance[Sequence],function-wrap[sequence_validator()]]]
  Input should be an instance of Sequence [type=is_instance_of, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/is_instance_of
content.str
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
content.json-or-python[json=list[union[str,is-instance[Image]]],python=chain[is-instance[Sequence],function-wrap[sequence_validator()]]]
  Input should be an instance of Sequence [type=is_instance_of, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.

## Configuration-Driven Multi-Field Vector Store

This notebook demonstrates a **configuration-driven approach** for multi-field vector embeddings that works across any data source.

### 🧠 **The Problem**
Traditional vector stores only embed the main content, leaving rich metadata unsearchable:
```python
# Traditional approach - metadata trapped
record.content = "Long text..."        # → Gets embedded ✅
record.metadata.summary = "Key points"  # → Not searchable ❌
```

### 🎯 **Our Solution: Enhanced Record with Configuration-Driven Multi-Field Embeddings**
The enhanced Record class provides direct vector processing capabilities:
```yaml
# conf/storage/osb.yaml
osb_vector:
  type: chromadb
  # ... basic config
  multi_field_embedding:
    content_field: "content"
    additional_fields:
      - source_field: "summary"
        chunk_type: "summary"
        min_length: 50
      - source_field: "title"
        chunk_type: "title"
        min_length: 10
```
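
To make the configuration concrete, here is a minimal sketch of how such a config could expand one record into typed chunks. It is an illustration under assumptions, not Buttermilk's code: `FieldRule` and `expand_record` are hypothetical names, `min_length` is assumed to mean "skip values shorter than this", and the main content is kept as a single chunk rather than being split.

```python
from dataclasses import dataclass


@dataclass
class FieldRule:
    """One entry of `additional_fields` (hypothetical representation)."""

    source_field: str
    chunk_type: str
    min_length: int


def expand_record(record: dict, content_field: str, rules: list) -> list:
    """Turn one record into typed chunks: main content plus each qualifying extra field."""
    chunks = []
    content = record.get(content_field)
    if content:
        chunks.append({"text": content, "content_type": "content"})
    for rule in rules:
        value = record.get(rule.source_field)
        # Skip missing fields and values shorter than the configured minimum
        if isinstance(value, str) and len(value) >= rule.min_length:
            chunks.append({"text": value, "content_type": rule.chunk_type})
    return chunks


rules = [FieldRule("summary", "summary", 50), FieldRule("title", "title", 10)]
record = {
    "content": "Full decision text ...",
    "title": "Example case title",
    "summary": "Too short",  # below min_length=50, so no summary chunk is produced
}
typed_chunks = expand_record(record, "content", rules)
for chunk in typed_chunks:
    print(chunk["content_type"], "->", chunk["text"])
```

Each chunk carries a `content_type` tag, which is what the metadata filters in the search examples rely on.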

### 🔍 **Search Capabilities**

| Search Type | Use Case | Example Query |
|-------------|----------|---------------|
| **Summary-Only** | High-level concepts | `where={"content_type": "summary"}` |
| **Title-Only** | Topic matching | `where={"content_type": "title"}` |
| **Content-Only** | Detailed analysis | `where={"content_type": "content"}` |
| **Cross-Field** | Comprehensive search | No filter = search everything |
| **Hybrid** | Semantic + exact match | `query + where={"case_number": "2024"}` |
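
The filters in the table can be read as a metadata pre-filter applied alongside the semantic query. The sketch below imitates only the simplest case, equality on every key, with plain dicts; `matches_where` and `select` are hypothetical helpers, and ChromaDB's real `where` syntax also supports operators that are omitted here.

```python
from typing import Optional


def matches_where(metadata: dict, where: Optional[dict]) -> bool:
    """Equality-only filter: every key in `where` must match the chunk's metadata."""
    if not where:
        return True  # no filter means every chunk is a candidate
    return all(metadata.get(key) == value for key, value in where.items())


chunks = [
    {"id": "a", "content_type": "title", "case_number": "2024"},
    {"id": "b", "content_type": "summary", "case_number": "2024"},
    {"id": "c", "content_type": "content", "case_number": "2023"},
]


def select(where: Optional[dict] = None) -> list:
    """Return the ids of chunks that pass the filter."""
    return [chunk["id"] for chunk in chunks if matches_where(chunk, where)]


print(select({"content_type": "summary"}))  # summary-only
print(select({"case_number": "2024"}))      # hybrid-style exact match
print(select())                             # cross-field: no filter
```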

### 🏗️ **Benefits**
- ✅ **Enhanced Record**: Direct vector capabilities built into Record class
- ✅ **Configuration-Driven**: No hardcoded field names
- ✅ **Data Source Agnostic**: Works with any Record structure
- ✅ **Same Config**: Creation and reading use identical configuration
- ✅ **Extensible**: Easy to add new field types for any dataset

In [5]:
async def create_production_vector_store():
    """Production pipeline: Process live OSB data with configuration-driven multi-field embedding."""

    print("🏭 Starting production vector store (configuration-driven)...")
    print(f"📊 Processing {len(records)} live OSB documents")

    successful_embeddings = 0
    failed_embeddings = 0
    total_chunks = 0

    for i, record in enumerate(records):

        try:
            processed_record = await vectorstore.process_record(record)
            if processed_record:
                successful_embeddings += 1
                chunk_count = len(processed_record.chunks)
                total_chunks += chunk_count

                # Count chunk types for display
                chunk_types = {}
                for chunk in processed_record.chunks:
                    content_type = chunk.metadata.get("content_type", "content")
                    chunk_types[content_type] = chunk_types.get(content_type, 0) + 1

            else:
                failed_embeddings += 1

        except Exception as e:
            failed_embeddings += 1
            print(f"   ❌ Error processing record: {e}")

    # Final results
    final_count = vectorstore.collection.count()

    print(f"\n🎉 Configuration-Driven Vector Store Created!")
    print(f"   📊 Records processed: {successful_embeddings + failed_embeddings}")
    print(f"   ✅ Successfully embedded: {successful_embeddings}")
    print(f"   ❌ Failed: {failed_embeddings}")
    print(f"   📦 Total chunks: {total_chunks}")
    print(f"   🔢 Total embeddings in collection: {final_count}")
    print(f"   🏪 Collection: '{vectorstore.collection_name}'")
    print(f"   ⚙️  Multi-field config: {vectorstore.multi_field_config is not None}")

    return {
        "successful_records": successful_embeddings,
        "failed_records": failed_embeddings,
        "total_chunks": total_chunks,
        "final_embedding_count": final_count,
    }


# Create the production vector store using configuration and enhanced Records
results = await create_production_vector_store()


2025-06-17 02:40:53 [] INFO vector.py:217 Initialized RecursiveCharacterTextSplitter (chunk_size=1200, chunk_overlap=400)
2025-06-17 02:40:53 [] INFO vector.py:752 Created chunks for record BUN-QBBLZ8WI: {'content': 1, 'title': 1}
2025-06-17 02:40:54 [] INFO vector.py:217 Initialized RecursiveCharacterTextSplitter (chunk_size=1200, chunk_overlap=400)
2025-06-17 02:40:54 [] INFO vector.py:752 Created chunks for record FB-4294T386: {'content': 1, 'title': 1}
2025-06-17 02:40:55 [] INFO vector.py:217 Initialized RecursiveCharacterTextSplitter (chunk_size=1200, chunk_overlap=400)
2025-06-17 02:40:55 [] INFO vector.py:752 Created chunks for record FB-M8D2SOGS: {'content': 1, 'title': 1}
2025-06-17 02:40:56 [] INFO vector.py:217 Initialized RecursiveCharacterTextSplitter (chunk_size=1200, chunk_overlap=400)
2025-06-17 02:40:56 [] INFO

In [None]:
vectorstore.collection.peek()


{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}

## Test configuration-driven multi-field search capabilities

In [None]:
print("🔍 Testing Configuration-Driven Multi-Field Search...")

# The content_type values come from our configuration:
# - "content" (main content field)
# - "summary" (from additional_fields config)
# - "title" (from additional_fields config)

# 1. Search summaries only (high-level concepts)
print("\n🎯 1. SUMMARY-ONLY SEARCH:")
print("   Query: 'human rights'")
summary_results = vectorstore.collection.query(
    query_texts=["human rights"],
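    # Filter disabled: as written, this query searches every chunk type.
    # Uncomment the next line to restrict results to summary chunks
    # (source_field="summary" in the config).
    # where={"content_type": "summary"},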
    n_results=3,
    include=["documents", "metadatas", "distances"],
)

if summary_results["ids"] and summary_results["ids"][0]:
    for i, (doc, metadata, distance) in enumerate(
        zip(summary_results["documents"][0], summary_results["metadatas"][0], summary_results["distances"][0])
    ):
        # 1 - distance is a similarity score only if the collection uses cosine distance
        similarity = 1 - distance
        title = metadata.get("title", "Untitled")
        print(f"   📋 Result {i+1}: {title[:40]}... (similarity: {similarity:.3f})")
        print(f"      📝 Summary: {doc[:80]}...")


## Create data source configuration for the RAG agent

In [None]:
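# AgentVariants is not imported in the first cell; this import path is an assumption
from buttermilk._core.variants import AgentVariants

rag_agent = AgentVariants(**cfg.agents.researcher)
rag_agent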


In [None]:
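# Create storage configuration for the RAG agent using unified system.
# Reuse the vectorstore's own settings so the agent queries the collection
# populated above rather than a different directory or embedding model.
storage_config = StorageConfig(
    type="chromadb",
    persist_directory=vectorstore.persist_directory,
    collection_name=vectorstore.collection_name,
    embedding_model=vectorstore.embedding_model,
    dimensionality=vectorstore.dimensionality,
)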

# Create agent configuration
agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent",
    description="OSB Research Assistant",
    data={"osb_vector": storage_config},
    variants={"model": "gemini-1.5-flash"},
    parameters={"template": "rag_research", "n_results": 10, "no_duplicates": False, "max_queries": 3},
)

# Initialize the RAG agent
rag_agent = RagAgent(**agent_config.model_dump())
print("RAG agent initialized successfully")


# Create Enhanced RAG Agent with intelligent search capabilities
from buttermilk.agents.rag.enhanced_rag_agent import EnhancedRagAgent

# IMPORTANT: Use the SAME config as your vectorstore to avoid mismatches!
storage_config = StorageConfig(
    type="chromadb", 
    persist_directory=vectorstore.persist_directory,  # Use same directory as vectorstore
    collection_name=vectorstore.collection_name,      # Use same collection name
    embedding_model=vectorstore.embedding_model,      # Use same embedding model
    dimensionality=vectorstore.dimensionality         # Use same dimensions
)

# Create Enhanced RAG agent configuration
enhanced_agent_config = AgentConfig(
    role="ENHANCED_RESEARCHER",
    agent_obj="buttermilk.agents.rag.enhanced_rag_agent.EnhancedRagAgent",
    description="Enhanced OSB Research Assistant with intelligent search capabilities",
    data={"vectorstore": storage_config},
    parameters={
        "enable_query_planning": True,      # Use LLM to analyze queries
        "enable_result_synthesis": True,    # Use LLM to synthesize results
        "search_strategies": ["semantic", "title", "summary", "hybrid", "metadata"],
        "max_search_rounds": 3,
        "confidence_threshold": 0.5,
        "max_results_per_strategy": 5,
        "include_search_explanation": True
    }
)

# Initialize the Enhanced RAG agent
enhanced_rag_agent = EnhancedRagAgent(**enhanced_agent_config.model_dump())
print("✅ Enhanced RAG agent initialized successfully")
print(f"   🧠 Query Planning: {enhanced_agent_config.parameters['enable_query_planning']}")
print(f"   🔬 Result Synthesis: {enhanced_agent_config.parameters['enable_result_synthesis']}")
print(f"   🎯 Search Strategies: {len(enhanced_agent_config.parameters['search_strategies'])}")
print(f"   📁 Directory: {storage_config.persist_directory}")
print(f"   🏪 Collection: {storage_config.collection_name}")
print(f"   🧠 Model: {storage_config.embedding_model}")
print(f"   📐 Dimensions: {storage_config.dimensionality}")

In [None]:
async def search_osb_database(queries):
    """Search the OSB database with multiple queries."""
    print("\n=== OSB Database Search Results ===")

    results = await rag_agent.fetch(queries)

    for i, (query, result) in enumerate(zip(queries, results)):
        print(f"\n--- Query {i+1}: {query} ---")
        print(f"Found {len(result.results)} relevant chunks")

        if result.results:
            # Show the top result
            top_result = result.results[0]
            print(f"\nTop Result:")
            print(f"Document: {top_result.document_title}")
            print(f"Case Number: {top_result.metadata.get('case_number', 'N/A')}")
            print(f"Text: {top_result.full_text[:300]}...")
        else:
            print("No results found")


# Example search queries
search_queries = [
    "What are the challenges with automated content moderation?",
    "How effective are age verification systems?",
    "What techniques are used to spread misinformation?",
]

await search_osb_database(search_queries)


In [None]:
async def demonstrate_enhanced_rag():
    """Demonstrate Enhanced RAG capabilities with intelligent search planning."""
    
    print("🎯 ENHANCED RAG DEMONSTRATION")
    print("=" * 60)
    
    # Test queries that showcase different capabilities
    test_queries = [
        {
            "query": "What are the main challenges with content moderation?",
            "expected_strategy": "Should use hybrid search (title + summary + content)",
            "focus": "Broad exploratory query"
        },
        {
            "query": "Find cases about misinformation detection algorithms",
            "expected_strategy": "Should use metadata + title search",
            "focus": "Specific case-focused query"
        },
        {
            "query": "How do platforms protect user privacy?",
            "expected_strategy": "Should use summary + semantic search",
            "focus": "Policy-focused query"
        }
    ]
    
    for i, test in enumerate(test_queries, 1):
        print(f"\n🔍 TEST {i}: {test['focus']}")
        print(f"Query: '{test['query']}'")
        print(f"Expected: {test['expected_strategy']}")
        print("-" * 50)
        
        try:
            # Create AgentInput for the enhanced RAG agent
            from buttermilk._core.contract import AgentInput
            
            agent_input = AgentInput(
                inputs={"query": test["query"]},
                parameters={},
                context=[],
                records=[]
            )
            
            # Process with Enhanced RAG
            result = await enhanced_rag_agent._process(message=agent_input)
            
            print(f"✅ RESULT:")
            # outputs may not be a plain string, so coerce it before slicing
            print(f"   Response: {str(result.outputs)[:200]}...")
            
            # Show metadata about the search
            metadata = result.metadata
            print(f"\n📊 SEARCH METADATA:")
            print(f"   Total Results: {metadata.get('total_results', 0)}")
            print(f"   Strategies Used: {metadata.get('strategies_used', [])}")
            print(f"   Confidence Score: {metadata.get('confidence_score', 0.0):.2f}")
            print(f"   Key Themes: {metadata.get('key_themes', [])}")
            
            if metadata.get('search_explanation'):
                print(f"   Search Strategy: {metadata['search_explanation']}")
                
        except Exception as e:
            print(f"❌ ERROR: {e}")
        
        print("\n" + "=" * 60)
    
    print("\n🎉 Enhanced RAG demonstration complete!")
    print("\nKey Benefits Demonstrated:")
    print("✅ Intelligent query analysis and search planning")
    print("✅ Multi-field search across titles, summaries, and content")
    print("✅ LLM-driven result synthesis and ranking")
    print("✅ Adaptive search strategies based on query type")
    print("✅ Comprehensive metadata and confidence scoring")
    print("✅ Smart cache management prevents overwriting local changes")
    print("✅ Automatic sync-back to remote storage after embedding operations")


# Run the enhanced RAG demonstration
await demonstrate_enhanced_rag()


## 7. Interactive Chat Interface

Now let's create an interactive interface to chat with our OSB knowledge base.

In [None]:
async def chat_with_osb(user_question):
    """Interactive chat with OSB knowledge base."""
    print(f"\n🔍 User Question: {user_question}")

    # Search for relevant context
    search_results = await rag_agent.fetch([user_question])

    if search_results and search_results[0].results:
        context = search_results[0]
        print(f"\n📚 Found {len(context.results)} relevant documents")

        # Display relevant chunks
        print("\n📋 Relevant Information:")
        for i, result in enumerate(context.results[:3]):  # Show top 3
            print(f"\n{i+1}. {result.document_title} ({result.metadata.get('case_number', 'N/A')})")
            print(f"   {result.full_text[:200]}...")

        # In a real implementation, this would be sent to an LLM for synthesis
        print("\n🤖 AI Response: [In a real implementation, the retrieved context would be sent to an LLM to generate a synthesized response]")
    else:
        print("\n❌ No relevant information found in the OSB database")


# Example chat interactions
example_questions = [
    "What are the main issues with current content moderation approaches?",
    "What recommendations exist for age verification?",
    "How do platforms detect and counter misinformation?",
]

for question in example_questions:
    await chat_with_osb(question)
    print("\n" + "=" * 80)


## 8. Vector Store Analysis

Let's analyze our vector store to understand what we've created.

In [None]:
# Get collection statistics
collection = vectorstore.collection
count = collection.count()

print(f"\n=== OSB Vector Store Statistics ===")
print(f"Collection Name: {vectorstore.collection_name}")
print(f"Total Chunks: {count}")
print(f"Embedding Dimensions: {vectorstore.dimensionality}")
print(f"Embedding Model: {vectorstore.embedding_model}")

# Get a sample of metadata to understand the structure
sample_results = collection.get(limit=3, include=["metadatas", "documents"])

print(f"\n=== Sample Metadata Structure ===")
if sample_results["metadatas"]:
    sample_metadata = sample_results["metadatas"][0]
    print("Available metadata fields:")
    for key, value in sample_metadata.items():
        print(f"  - {key}: {type(value).__name__} = {str(value)[:50]}...")

print(f"\n=== Storage Locations ===")
print(f"ChromaDB Directory: {vectorstore.persist_directory}")
print(f"Embeddings Directory: {vectorstore.arrow_save_dir}")


## 9. Advanced Search Examples

Let's explore some advanced search patterns and filtering capabilities.

In [None]:
# Direct ChromaDB queries with metadata filtering
async def advanced_search_examples():
    """Demonstrate advanced search capabilities."""
    print("\n=== Advanced Search Examples ===")

    # 1. Search with metadata filtering
    print("\n1. Search within specific case:")
    results = collection.query(
        query_texts=["content moderation challenges"],
        n_results=5,
        where={"case_number": "OSB-2024-001"},
        include=["documents", "metadatas"],
    )
    print(f"   Found {len(results['ids'][0]) if results['ids'] else 0} results in OSB-2024-001")

    # 2. Similarity search across all documents
    print("\n2. General similarity search:")
    results = collection.query(query_texts=["artificial intelligence and safety"], n_results=5, include=["documents", "metadatas", "distances"])

    if results["ids"] and results["ids"][0]:
        print(f"   Found {len(results['ids'][0])} results")
        for i, (doc, metadata, distance) in enumerate(zip(results["documents"][0][:3], results["metadatas"][0][:3], results["distances"][0][:3])):
            print(f"   Result {i+1} (similarity: {1-distance:.3f}): {metadata.get('title', 'N/A')}")
            print(f"     {doc[:100]}...")

    # 3. Multi-query search
    print("\n3. Multi-query search:")
    multi_queries = ["platform safety measures", "user protection mechanisms", "digital safety standards"]

    for query in multi_queries:
        results = collection.query(query_texts=[query], n_results=2, include=["metadatas"])
        count = len(results["ids"][0]) if results["ids"] else 0
        print(f"   '{query}': {count} results")


await advanced_search_examples()


## 10. Production Considerations

Here are key considerations for using this in production:

In [None]:
print(
    """
=== Production Deployment Checklist ===

🔧 Configuration:
   ✓ Use GCS for persist_directory: gs://your-bucket/chromadb
   ✓ Configure appropriate chunk_size for your content
   ✓ Set concurrency based on your compute resources
   ✓ Use production embedding models (text-embedding-004/005)

📊 Performance:
   ✓ Monitor embedding generation costs
   ✓ Implement caching for frequently accessed data
   ✓ Use batch processing for large datasets
   ✓ Configure appropriate timeout values

🔒 Security:
   ✓ Secure GCS bucket access with proper IAM
   ✓ Implement data access controls
   ✓ Audit vector store queries
   ✓ Protect sensitive metadata

🚀 Scalability:
   ✓ Plan for vector store size growth
   ✓ Implement horizontal scaling for embeddings
   ✓ Monitor query performance
   ✓ Set up proper logging and monitoring

🔄 Maintenance:
   ✓ Plan for data updates and reindexing
   ✓ Implement backup strategies
   ✓ Version control for embeddings and metadata
   ✓ Regular quality assessments
"""
)

# Show next steps
print(
    """
=== Next Steps ===

1. Scale to Full Dataset:
   - Use the osb_vectorize.yaml configuration
   - Run: uv run python -m buttermilk.data.vector +run=osb_vectorize

2. Deploy RAG Flow:
   - Use the osb_rag.yaml flow configuration
   - Run: uv run python -m buttermilk.runner.cli +flow=osb_rag +run=api

3. Integrate with Frontend:
   - Use the Buttermilk web interface
   - Connect to WebSocket endpoints for real-time chat

4. Monitor and Optimize:
   - Track query performance
   - Monitor embedding costs
   - Tune chunk sizes and retrieval parameters
"""
)


## 🔒 Smart Cache Management

The vector database now includes smart cache management to prevent overwriting local changes:

### **Problem Solved**
Previously, re-running embedding cells would download the remote ChromaDB cache and overwrite any local changes, losing newly added embeddings.

### **Solution: Smart Cache Management**
The system now includes intelligent cache handling. The abridged excerpt below omits how `cache_path` and `time_since_modified` are computed:

```python
async def _smart_cache_management(self, remote_path: str) -> Path:
    """Smart cache management that prevents overwriting newer local changes."""
    
    # Check if local cache was recently modified (within 1 hour)
    if time_since_modified < 3600:  # 1 hour
        logger.info("🔒 Skipping download to preserve local changes")
        return cache_path
    
    # Only download if cache is stale
    logger.info("🔄 Syncing remote ChromaDB")
    return await ensure_chromadb_cache(remote_path)
```

### **Automatic Sync-Back**
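After successful embedding operations, local changes are automatically synced to remote storage. This excerpt is also abridged; `local_path`, `remote_path`, and `time_since_modified` are defined elsewhere in the method: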

```python
async def _sync_local_changes_to_remote(self) -> None:
    """Sync local ChromaDB changes back to remote storage."""
    
    # Only sync if recently modified (indicates recent work)
    if time_since_modified < 21600:  # 6 hours
        await upload_chromadb_cache(local_path, remote_path)
        logger.info("✅ Successfully synced local changes to remote storage")
```
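
The two time windows can be tried out in isolation. The sketch below is a standalone illustration, not the library's code: the helper names are hypothetical, and it assumes the cache age is taken from the cache directory's modification time, which the excerpts above do not show.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

DOWNLOAD_SKIP_WINDOW = 3600  # 1 hour, as in the first excerpt
SYNC_BACK_WINDOW = 21600     # 6 hours, as in the second excerpt


def seconds_since_modified(path: Path, now: Optional[float] = None) -> float:
    """Age of `path` in seconds, based on its modification time."""
    now = time.time() if now is None else now
    return now - path.stat().st_mtime


def should_skip_download(age_seconds: float) -> bool:
    """Recently modified caches are kept rather than overwritten by a download."""
    return age_seconds < DOWNLOAD_SKIP_WINDOW


def should_sync_back(age_seconds: float) -> bool:
    """Recently modified caches are treated as fresh work and uploaded."""
    return age_seconds < SYNC_BACK_WINDOW


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    mtime = cache_dir.stat().st_mtime
    # Simulate checking the same cache at three later points in time
    for label, now in [("2 minutes", mtime + 120), ("3 hours", mtime + 3 * 3600), ("1 day", mtime + 86400)]:
        age = seconds_since_modified(cache_dir, now=now)
        print(f"{label}: skip_download={should_skip_download(age)}, sync_back={should_sync_back(age)}")
```

Between one and six hours after the last change, a cache falls into a middle state: it is old enough to be refreshed from remote storage but still recent enough to be uploaded.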

### **Benefits**
- ✅ **Prevents Data Loss**: Local embedding work is preserved
- ✅ **Automatic Sync**: Changes are pushed back to remote storage  
- ✅ **Time-Based Logic**: Only acts on recently modified caches
- ✅ **Transparent Operation**: Clear logging of all cache decisions
- ✅ **Production Ready**: Handles concurrent access and failures gracefully

### **Usage**
This happens automatically - no code changes needed! The smart cache management activates whenever you:
1. Run embedding operations in this notebook
2. Use the vectorstore in production flows
3. Process new documents with the vector pipeline

## 🚀 Production Deployment Guide

This vector store is now ready for production use with the unified storage system. Here's how to deploy and use it:

### 📋 **For Full Dataset Processing**
```python
# In the data-loading cell, doc_limit caps how many documents are read.
# For a quick demo run:
doc_limit = 5

# For the full dataset:
doc_limit = None  # Processes all OSB documents
```

### 🏭 **Production Usage Examples**

#### **Option 1: RAG Agent Integration**
```python
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk._core.config import AgentConfig
from buttermilk._core.storage_config import StorageConfig

# Same config as creation - no changes needed with unified storage!
storage_config = StorageConfig(**cfg.storage.osb_vector)

agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent", 
    description="OSB Knowledge Assistant",
    data={"osb_vector": storage_config},
    parameters={"n_results": 10, "max_queries": 3}
)

rag_agent = RagAgent(**agent_config.model_dump())
```

#### **Option 2: Direct Storage Access**
```python
# Create vector store instance (reads existing embeddings) using unified storage
production_vectorstore = bm.get_storage(cfg.storage.osb_vector)
await production_vectorstore.ensure_cache_initialized()

# Perform semantic search
results = production_vectorstore.collection.query(
    query_texts=["platform safety policies"],
    n_results=5
)
```

#### **Option 3: Flow Integration**
```yaml
# conf/flows/osb_rag.yaml
defaults:
  - base_flow

orchestrator: buttermilk.orchestrators.groupchat.AutogenOrchestrator
storage: osb_vector  # References the same storage config
agents: [rag_agent, host/sequencer]
```

### 🏗️ **Enhanced Record Benefits**
- ✅ **Direct Processing**: Records processed without conversion steps
- ✅ **Vector Fields**: Built-in support for chunks, embeddings, file_path
- ✅ **Unified API**: Same Record class used throughout the system
- ✅ **Type Safety**: Full Pydantic validation for vector operations

### 🔒 **Production Considerations**
- ✅ **Persistent Storage**: Vector store saved to the `persist_directory` set in `osb.yaml` (`gs://prosocial-dev/data/osb/chromadb` in the run logged above)
- ✅ **Config Reuse**: Same `osb.yaml` works for both creation and reading
- ✅ **Scalability**: ChromaDB handles thousands of documents efficiently
- ✅ **Monitoring**: Check collection count and performance metrics
- ✅ **Updates**: Re-run this notebook to add new OSB documents

### 💡 **Next Steps**
1. **Scale Up**: Remove `doc_limit` to process full OSB dataset
2. **Deploy**: Use in RAG agents, search APIs, or analytical workflows  
3. **Monitor**: Track embedding quality and search relevance
4. **Iterate**: Add new documents by re-running the pipeline

### 🔧 **Migration Benefits**
This notebook now uses:
- ✅ **StorageConfig**: Unified configuration for all storage types
- ✅ **Enhanced Record**: Built-in vector processing capabilities  
- ✅ **bm.get_storage()**: Unified storage access API
- ✅ **process_record()**: Direct Record processing without conversion

In [10]:
# Let's test the text splitter behavior with a sample text
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a test text that should definitely be split
test_text = "This is a test document. " * 100  # 2500 characters
print(f"Test text length: {len(test_text)} characters")

# Test with the same config as OSB
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=400,
    length_function=len,
    is_separator_regex=False,
    add_start_index=False,
)

chunks = text_splitter.split_text(test_text)
print(f"Number of chunks created: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {len(chunk)} characters")
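
For comparison, here is what a plain fixed-window chunker would do with the same settings. This is a simplified sketch, not what `RecursiveCharacterTextSplitter` does: the LangChain splitter prefers to break on separators such as spaces and newlines, so its chunk boundaries and counts can differ. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size windows that each share `chunk_overlap` characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window reached the end of the text
        start += stride
    return chunks


test_text = "This is a test document. " * 100  # 2500 characters
window_chunks = sliding_window_chunks(test_text, chunk_size=1200, chunk_overlap=400)
print([len(chunk) for chunk in window_chunks])
```

With `chunk_size=1200` and `chunk_overlap=400` the window advances 800 characters at a time, so a third of each full chunk repeats the end of the previous one. That repetition increases the number of embeddings generated for a given document.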
