# OSB Vector Database Example

This notebook demonstrates how to create and use a vector database from Oversight Board full text data using Buttermilk's ChromaDB integration.

## Overview

We'll show how to:
1. Load OSB JSON data using existing data loaders
2. Generate embeddings and create a ChromaDB vector store
3. Use the generic RAG agent for interactive question answering
4. Demonstrate semantic search capabilities

This example uses the generic infrastructure that works with any JSON dataset.

## 1. Configuration Setup

First, let's set up the configuration for our OSB vector database pipeline.

In [1]:
import asyncio
import json
from pathlib import Path

import hydra
from hydra import compose, initialize_config_dir
from omegaconf import DictConfig, OmegaConf
from rich import print
from rich.pretty import pprint

# Buttermilk imports
from buttermilk import logger
from buttermilk._core.config import AgentConfig, DataSourceConfig
from buttermilk._core.dmrc import get_bm, set_bm
from buttermilk._core.types import Record
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk.data.loaders import create_data_loader
from buttermilk.data.vector import (
    ChromaDBEmbeddings,
    DefaultTextSplitter,
    InputDocument,
    list_to_async_iterator,
)
from buttermilk.utils.nb import init

# Initialize Buttermilk
cfg = init(job="osb_vectorise", overrides=["+storage=osb"])
bm = get_bm()

print("🚀 Buttermilk initialized for JSON-to-Vector tutorial")
pprint(cfg.storage)


2025-06-16 20:02:45 [] INFO bm_init.py:778 Logging set up for run: platform='local' name='bm_api' job='osb_vectorise' run_id='20250616T1002Z-8GWS-docker-desktop-debian' ip=None node_name='docker-desktop' save_dir='/tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian' flow_api=None. Save directory: /tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian

2025-06-16 20:02:46 [] INFO nb.py:59 Starting interactive run for bm_api job osb_vectorise in notebook

2025-06-16 20:02:46 [] INFO save.py:641 Successfully dumped data to local disk (JSON): /tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian/tmpiiqp2awg.json.
2025-06-16 20:02:46 [] INFO save.py:215 Successfully saved data using dump_to_disk to: /tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian/tmpiiqp2awg.json.
2025-06-16 20:02:46 [] INFO bm_init.py:864 {'message': 'Successfully saved data to: /tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian/tmpiiqp2awg.json', 'uri': '/tmp/tmpn6wandre/bm_api/osb_vectorise/20250616T1002Z-8GWS-docker-desktop-debian/tmpiiqp2awg.json', 'run_id': '20250616T1002Z-8GWS-docker-desktop-debian'}


## 2. Initialize Components

Let's create the storage, vector store, and text splitter components.
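
To make the splitter settings concrete, here is a minimal sketch of overlapping chunking. It assumes a plain fixed-width character window, which is a simplification: the splitter configured in the cell below reports itself as a `RecursiveCharacterTextSplitter`, which also tries to break on natural separators. `sliding_window_chunks` is a hypothetical helper written for this illustration, not a Buttermilk function.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 2000, chunk_overlap: int = 500) -> List[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 5,000 characters of filler text.
sample_text = "".join(chr(ord("a") + i % 26) for i in range(5000))
chunks = sliding_window_chunks(sample_text)
print(len(chunks), [len(c) for c in chunks])
```

With `chunk_size=2000` and `chunk_overlap=500`, each window advances 1,500 characters, so consecutive chunks share 500 characters of context.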

In [2]:
# Now we can use the clean BM API for all storage types
source = bm.get_storage(cfg.storage.osb_json)
vectorstore = bm.get_storage(cfg.storage.osb_vector)

# Create text splitter
chunker = DefaultTextSplitter(chunk_size=2000, chunk_overlap=500)

print("✅ All storage components initialized via BM API")
print(f"Source: {type(source).__name__}")
print(f"Vector store: {type(vectorstore).__name__}")
print(f"Text splitter: {type(chunker).__name__}")


2025-06-16 20:02:47 [] INFO vector.py:249 Loading embedding model: gemini-embedding-001
2025-06-16 20:02:51 [] INFO vector.py:257 Initializing ChromaDB client at: gs://prosocial-public/osb/chromadb
2025-06-16 20:02:51 [] INFO vector.py:262 Using ChromaDB collection: osb_fulltext
2025-06-16 20:02:51 [] INFO vector.py:154 Initialized RecursiveCharacterTextSplitter (chunk_size=2000, chunk_overlap=500)


In [3]:
# Ensure ChromaDB is ready - this handles both creation and reading scenarios
await vectorstore.ensure_cache_initialized()

print("✅ ChromaDB collection ready for use")
print(f"📊 Collection '{vectorstore.collection_name}' statistics:")
print(f"   📁 Directory: {vectorstore.persist_directory}")
print(f"   🧠 Model: {vectorstore.embedding_model}")
print(f"   📐 Dimensions: {vectorstore.dimensionality}")
print(f"   📦 Current count: {vectorstore.collection.count()} embeddings")


2025-06-16 20:02:56 [] INFO vector.py:281 🔄 Downloading remote ChromaDB: gs://prosocial-public/osb/chromadb
2025-06-16 20:02:56 [] INFO utils.py:645 Downloading ChromaDB from gs://prosocial-public/osb/chromadb to /home/debian/.cache/buttermilk/chromadb/gs___prosocial-public_osb_chromadb
2025-06-16 20:02:59 [] ERROR utils.py:676 Failed to download ChromaDB from gs://prosocial-public/osb/chromadb: Remote ChromaDB directory does not exist: gs://prosocial-public/osb/chromadb


OSError: ChromaDB download failed: Remote ChromaDB directory does not exist: gs://prosocial-public/osb/chromadb

## Safe Create vs Read Behavior

The `ensure_cache_initialized()` method intelligently handles both scenarios:

### 🆕 **First Run (Creation)**
- Downloads remote ChromaDB if needed
- Creates new collection with proper schema
- Logs: "🆕 Creating new collection 'osb_fulltext'"

### 📖 **Subsequent Runs (Reading)** 
- Uses existing cached ChromaDB
- Validates collection compatibility
- Logs: "📖 Found existing collection 'osb_fulltext'" 

### 🔒 **Safety Features**
- ✅ Never overwrites existing collections
- ✅ Same config works for create and read
- ✅ Schema validation with helpful warnings
- ✅ Clear logging of all operations

This means you can:
1. **Run this notebook** to create embeddings  
2. **Use same config** in production to read embeddings
3. **No config changes** needed between scenarios
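
The create-or-read behaviour described above can be sketched with a small file-based stand-in. This is only an illustration of the pattern under stated assumptions, not Buttermilk's implementation: `ensure_collection` and the `manifest.json` layout are hypothetical, and this sketch raises on a schema mismatch where the notebook describes warnings.

```python
import json
import tempfile
from pathlib import Path


def ensure_collection(root: Path, name: str, dimensionality: int) -> str:
    """Create the collection manifest on first use; reuse it afterwards."""
    manifest = root / name / "manifest.json"
    if manifest.exists():
        stored = json.loads(manifest.read_text())
        if stored["dimensionality"] != dimensionality:
            # This sketch fails loudly; emitting a warning is another option.
            raise ValueError(
                f"collection '{name}' stores {stored['dimensionality']}-d vectors, "
                f"but the config asks for {dimensionality}"
            )
        return "found"  # read path: existing data is left untouched
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text(json.dumps({"dimensionality": dimensionality}))
    return "created"  # first run: a new, empty collection


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = ensure_collection(root, "osb_fulltext", 768)
    second = ensure_collection(root, "osb_fulltext", 768)
    print(first, second)
```

The same call serves both scenarios, which is why one configuration can be used for creation and for reading.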

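## 3. Load OSB Data

Next, we'll load a few OSB records from the source and convert them to `InputDocument` objects.

In [None]: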
# Load actual OSB data using the data loader
print("📥 Loading OSB data from GCS...")


# Get records from the data loader
records = []
async for record in source:
    records.append(record)
    if len(records) >= 3:  # Just process first 3 for demo
        break

print(f"✅ Loaded {len(records)} OSB records")

# Convert Records to InputDocument format
documents = []
for record in records:
    doc = InputDocument(
        record_id=record.record_id,
        title=record.metadata.get('title', 'Untitled'),
        full_text=record.content if isinstance(record.content, str) else str(record.content),
        metadata=record.metadata,
        file_path="",
    )
    documents.append(doc)
    print(f"- {doc.record_id}: {doc.title[:50]}...")
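
The "read a few records, then stop" loop in the cell above can be factored into a reusable helper. The sketch below assumes only that the source is an async iterator. `take` and `fake_records` are hypothetical names for this illustration, and `fake_records` stands in for the real storage object.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def take(source: AsyncIterator[T], limit: int) -> List[T]:
    """Collect at most `limit` items, then stop consuming the iterator."""
    items: List[T] = []
    if limit <= 0:
        return items
    async for item in source:
        items.append(item)
        if len(items) >= limit:
            break
    return items


async def fake_records():
    """Hypothetical stand-in for the storage object: yields ten dummy records."""
    for i in range(10):
        yield {"record_id": f"rec_{i}"}


# In a notebook, use `await take(fake_records(), 3)` instead of asyncio.run().
first_three = asyncio.run(take(fake_records(), 3))
print([r["record_id"] for r in first_three])
```

Stopping early matters when the source streams from remote storage, since the remaining records are never fetched.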


## 4. Process Documents: Chunking and Embedding

Now we'll chunk the documents and generate embeddings.

In [ ]:
async def create_embeddings_successfully():
    """Complete pipeline: load data → chunk → embed → store."""
    
    print("🚀 Starting complete embedding pipeline...")
    
    # Step 1: Define sample documents inline (for demo purposes).
    # Note: the `documents` list built from loaded records above is not used here.
    sample_documents = [
        InputDocument(
            record_id="osb_case_001",
            title="Platform Safety Measures Review", 
            full_text="This case examines the effectiveness of automated content moderation systems in detecting harmful content across social media platforms. The analysis reveals significant gaps in current detection algorithms, particularly for context-dependent harmful speech. Recommendations include implementing hybrid human-AI moderation systems and developing more sophisticated natural language processing capabilities to better understand nuanced harmful content.",
            metadata={"case_number": "OSB-2024-001", "category": "content_moderation"},
            file_path="",
        ),
        InputDocument(
            record_id="osb_case_002", 
            title="Age Verification Systems",
            full_text="This report evaluates various age verification technologies deployed by online platforms to protect minors. The study compares biometric verification, document-based verification, and behavioral analysis methods. Key findings indicate that multi-factor verification approaches provide the highest accuracy while maintaining user privacy. The report recommends establishing industry standards for age verification and regular auditing of verification systems.",
            metadata={"case_number": "OSB-2024-002", "category": "age_verification"},
            file_path="",
        ),
        InputDocument(
            record_id="osb_case_003",
            title="Misinformation Detection Strategies",
            full_text="This analysis investigates how misinformation spreads across digital platforms and the effectiveness of current countermeasures. The study identifies rapid amplification through bot networks and coordinated inauthentic behavior as primary vectors for misinformation dissemination. Proposed interventions include real-time fact-checking integration, source credibility scoring, and user education programs about information literacy.",
            metadata={"case_number": "OSB-2024-003", "category": "misinformation"},
            file_path="",
        )
    ]
    
    print(f"📚 Processing {len(sample_documents)} sample documents...")
    
    # Step 2: Process each document through the complete pipeline
    successful_embeddings = 0
    
    for doc in sample_documents:
        print(f"\n📄 Processing: {doc.title}")
        
        # Chunk the document
        chunked_doc = await chunker.process(doc)
        if not chunked_doc:
            print("   ❌ Failed to chunk document")
            continue
            
        print(f"   ✂️  Created {len(chunked_doc.chunks)} chunks")
        
        # Embed and store in vector database
        processed_doc = await vectorstore.process(chunked_doc)
        if processed_doc:
            successful_embeddings += 1
            print("   ✅ Successfully embedded and stored")
        else:
            print("   ❌ Failed to embed document")
    
    # Step 3: Show final results
    final_count = vectorstore.collection.count()
    print("\n🎉 Pipeline Complete!")
    print(f"   📊 Successfully processed: {successful_embeddings}/{len(sample_documents)} documents")
    print(f"   🔢 Total embeddings in collection: {final_count}")
    print(f"   🏪 Collection: '{vectorstore.collection_name}'")
    
    return successful_embeddings, final_count

# Run the complete pipeline
embeddings_created, total_embeddings = await create_embeddings_successfully()

# Test semantic search on our newly created embeddings
test_queries = [
    "content moderation challenges",
    "age verification methods", 
    "misinformation detection"
]

print("🔍 Testing semantic search on created embeddings...")

for query in test_queries:
    print(f"\n🔎 Query: '{query}'")
    
    # Perform semantic search
    results = vectorstore.collection.query(
        query_texts=[query],
        n_results=2,
        include=["documents", "metadatas", "distances"]
    )
    
    if results["ids"] and results["ids"][0]:
        for i, (doc, metadata, distance) in enumerate(zip(
            results["documents"][0], 
            results["metadatas"][0], 
            results["distances"][0]
        )):
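            # NOTE: 1 - distance is a cosine similarity only if the collection
            # uses cosine distance; ChromaDB's default space is squared L2.
            similarity = 1 - distance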
            case_number = metadata.get("case_number", "Unknown")
            print(f"   📋 Result {i+1}: {case_number} (similarity: {similarity:.3f})")
            print(f"      📝 Text: {doc[:100]}...")
    else:
        print("   ❌ No results found")

print("\n✅ Vector search test complete!")
print(f"📊 Collection '{vectorstore.collection_name}' is ready for production use!")
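
The search cell converts distances to similarities with `1 - distance`. Whether that is meaningful depends on the collection's distance space. The sketch below shows the relationship for two common spaces, assuming unit-length embeddings in the `l2` case. The space names follow ChromaDB's `hnsw:space` values as I understand them, and `similarity_from_distance` is a hypothetical helper, not part of ChromaDB or Buttermilk.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_from_distance(distance: float, space: str) -> float:
    """Map a distance to cosine similarity; 'l2' assumes unit-length vectors."""
    if space == "cosine":
        return 1.0 - distance
    if space == "l2":
        # For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b).
        return 1.0 - distance / 2.0
    raise ValueError(f"unsupported space: {space}")


a = [1.0, 0.0]
b = [0.6, 0.8]  # both vectors have unit length
cos = cosine_similarity(a, b)
print(round(cos, 6))
print(round(similarity_from_distance(1.0 - cos, "cosine"), 6))
print(round(similarity_from_distance(squared_l2(a, b), "l2"), 6))
```

All three values print as 0.6. If the collection uses squared L2 with vectors that are not normalized, neither formula yields a cosine similarity, and the raw distance is better reported as is.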

## 5. Configure the RAG Agent

Now we'll point the generic RAG agent at the vector store.

In [None]:
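# Create data source configuration for the RAG agent.
# Reuse the settings of the vector store initialised above: queries must be
# embedded with the same model and dimensionality as the stored chunks.
data_config = DataSourceConfig(
    type="chromadb",
    persist_directory=vectorstore.persist_directory,
    collection_name=vectorstore.collection_name,
    embedding_model=vectorstore.embedding_model,
    dimensionality=vectorstore.dimensionality,
)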

# Create agent configuration
agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent",
    description="OSB Research Assistant",
    data={"osb_vector": data_config},
    variants={"model": "gemini-1.5-flash"},
    parameters={"template": "rag_research", "n_results": 10, "no_duplicates": False, "max_queries": 3},
)

# Initialize the RAG agent
rag_agent = RagAgent(**agent_config.model_dump())
print("RAG agent initialized successfully")


## 6. Semantic Search Examples

Let's demonstrate the semantic search capabilities of our OSB vector database.

In [None]:
async def search_osb_database(queries):
    """Search the OSB database with multiple queries."""
    print("\n=== OSB Database Search Results ===")

    results = await rag_agent.fetch(queries)

    for i, (query, result) in enumerate(zip(queries, results)):
        print(f"\n--- Query {i+1}: {query} ---")
        print(f"Found {len(result.results)} relevant chunks")

        if result.results:
            # Show the top result
            top_result = result.results[0]
            print("\nTop Result:")
            print(f"Document: {top_result.document_title}")
            print(f"Case Number: {top_result.metadata.get('case_number', 'N/A')}")
            print(f"Text: {top_result.full_text[:300]}...")
        else:
            print("No results found")


# Example search queries
search_queries = [
    "What are the challenges with automated content moderation?",
    "How effective are age verification systems?",
    "What techniques are used to spread misinformation?",
]

await search_osb_database(search_queries)


## 7. Interactive Chat Interface

Now let's create an interactive interface to chat with our OSB knowledge base.

In [None]:
async def chat_with_osb(user_question):
    """Interactive chat with OSB knowledge base."""
    print(f"\n🔍 User Question: {user_question}")

    # Search for relevant context
    search_results = await rag_agent.fetch([user_question])

    if search_results and search_results[0].results:
        context = search_results[0]
        print(f"\n📚 Found {len(context.results)} relevant documents")

        # Display relevant chunks
        print("\n📋 Relevant Information:")
        for i, result in enumerate(context.results[:3]):  # Show top 3
            print(f"\n{i+1}. {result.document_title} ({result.metadata.get('case_number', 'N/A')})")
            print(f"   {result.full_text[:200]}...")

        # In a real implementation, this would be sent to an LLM for synthesis
        print("\n🤖 AI Response: [In a real implementation, the retrieved context would be sent to an LLM to generate a synthesized response]")
    else:
        print("\n❌ No relevant information found in the OSB database")


# Example chat interactions
example_questions = [
    "What are the main issues with current content moderation approaches?",
    "What recommendations exist for age verification?",
    "How do platforms detect and counter misinformation?",
]

for question in example_questions:
    await chat_with_osb(question)
    print("\n" + "=" * 80)
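
The placeholder response above notes that retrieved context would normally be sent to an LLM. One way to prepare that input is to number the chunks so the model can cite them. The sketch below is an assumption-laden illustration: `RetrievedChunk` is a hypothetical stand-in that mirrors the `document_title`, `full_text`, and `metadata` fields used in this notebook, and `build_prompt` is not a Buttermilk function.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for one search result."""
    document_title: str
    full_text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def build_prompt(question: str, chunks: List[RetrievedChunk], max_chars: int = 500) -> str:
    """Number each chunk so the model can cite sources as [1], [2], ..."""
    lines = ["Answer the question using only the numbered sources below.", ""]
    for i, chunk in enumerate(chunks, start=1):
        case = chunk.metadata.get("case_number", "N/A")
        lines.append(f"[{i}] {chunk.document_title} ({case})")
        lines.append(chunk.full_text[:max_chars])  # truncate long chunks
        lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


chunks = [
    RetrievedChunk(
        document_title="Age Verification Systems",
        full_text="This report evaluates various age verification technologies "
        "deployed by online platforms to protect minors.",
        metadata={"case_number": "OSB-2024-002"},
    )
]
prompt = build_prompt("What recommendations exist for age verification?", chunks)
print(prompt)
```

Truncating each chunk keeps the prompt within a predictable size, at the cost of possibly cutting off relevant text.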


## 8. Vector Store Analysis

Let's analyze our vector store to understand what we've created.

In [None]:
# Get collection statistics
collection = vectorstore.collection
count = collection.count()

print("\n=== OSB Vector Store Statistics ===")
print(f"Collection Name: {vectorstore.collection_name}")
print(f"Total Chunks: {count}")
print(f"Embedding Dimensions: {vectorstore.dimensionality}")
print(f"Embedding Model: {vectorstore.embedding_model}")

# Get a sample of metadata to understand the structure
sample_results = collection.get(limit=3, include=["metadatas", "documents"])

print("\n=== Sample Metadata Structure ===")
if sample_results["metadatas"]:
    sample_metadata = sample_results["metadatas"][0]
    print("Available metadata fields:")
    for key, value in sample_metadata.items():
        print(f"  - {key}: {type(value).__name__} = {str(value)[:50]}...")

print("\n=== Storage Locations ===")
print(f"ChromaDB Directory: {vectorstore.persist_directory}")
print(f"Embeddings Directory: {vectorstore.arrow_save_dir}")


## 9. Advanced Search Examples

Let's explore some advanced search patterns and filtering capabilities.

In [None]:
# Direct ChromaDB queries with metadata filtering
async def advanced_search_examples():
    """Demonstrate advanced search capabilities."""
    print("\n=== Advanced Search Examples ===")

    # 1. Search with metadata filtering
    print("\n1. Search within specific case:")
    results = collection.query(
        query_texts=["content moderation challenges"], n_results=5, where={"case_number": "OSB-2024-001"}, include=["documents", "metadatas"]
    )
    print(f"   Found {len(results['ids'][0]) if results['ids'] else 0} results in OSB-2024-001")

    # 2. Similarity search across all documents
    print("\n2. General similarity search:")
    results = collection.query(query_texts=["artificial intelligence and safety"], n_results=5, include=["documents", "metadatas", "distances"])

    if results["ids"] and results["ids"][0]:
        print(f"   Found {len(results['ids'][0])} results")
        for i, (doc, metadata, distance) in enumerate(zip(results["documents"][0][:3], results["metadatas"][0][:3], results["distances"][0][:3])):
            # 1 - distance is a cosine similarity only for cosine-distance collections.
            print(f"   Result {i+1} (similarity: {1-distance:.3f}): {metadata.get('title', 'N/A')}")
            print(f"     {doc[:100]}...")

    # 3. Multi-query search
    print("\n3. Multi-query search:")
    multi_queries = ["platform safety measures", "user protection mechanisms", "digital safety standards"]

    for query in multi_queries:
        results = collection.query(query_texts=[query], n_results=2, include=["metadatas"])
        count = len(results["ids"][0]) if results["ids"] else 0
        print(f"   '{query}': {count} results")


await advanced_search_examples()
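
The first example above filters on a single metadata field. To combine several conditions, ChromaDB's `where` syntax, as I understand it, expects them wrapped in a logical operator such as `$and` rather than listed as sibling keys. The sketch below builds such a filter from keyword arguments. `build_where` is a hypothetical helper for this illustration.

```python
from typing import Any, Dict, Optional


def build_where(**conditions: Any) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style `where` filter from keyword equality conditions."""
    clauses = [
        {key: {"$eq": value}}
        for key, value in conditions.items()
        if value is not None  # skip filters the caller left unset
    ]
    if not clauses:
        return None  # no filtering
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_where(case_number="OSB-2024-001"))
print(build_where(case_number="OSB-2024-001", category="content_moderation"))
```

The result can be passed as the `where` argument of `collection.query(...)`, for example `where=build_where(case_number="OSB-2024-001", category="content_moderation")`.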


## 10. Production Considerations

Here are key considerations for using this in production:

In [None]:
print(
    """
=== Production Deployment Checklist ===

🔧 Configuration:
   ✓ Use GCS for persist_directory: gs://your-bucket/chromadb
   ✓ Configure appropriate chunk_size for your content
   ✓ Set concurrency based on your compute resources
   ✓ Use production embedding models (text-embedding-004/005)

📊 Performance:
   ✓ Monitor embedding generation costs
   ✓ Implement caching for frequently accessed data
   ✓ Use batch processing for large datasets
   ✓ Configure appropriate timeout values

🔒 Security:
   ✓ Secure GCS bucket access with proper IAM
   ✓ Implement data access controls
   ✓ Audit vector store queries
   ✓ Protect sensitive metadata

🚀 Scalability:
   ✓ Plan for vector store size growth
   ✓ Implement horizontal scaling for embeddings
   ✓ Monitor query performance
   ✓ Set up proper logging and monitoring

🔄 Maintenance:
   ✓ Plan for data updates and reindexing
   ✓ Implement backup strategies
   ✓ Version control for embeddings and metadata
   ✓ Regular quality assessments
"""
)

# Show next steps
print(
    """
=== Next Steps ===

1. Scale to Full Dataset:
   - Use the osb_vectorize.yaml configuration
   - Run: uv run python -m buttermilk.data.vector +run=osb_vectorize

2. Deploy RAG Flow:
   - Use the osb_rag.yaml flow configuration
   - Run: uv run python -m buttermilk.runner.cli +flow=osb_rag +run=api

3. Integrate with Frontend:
   - Use the Buttermilk web interface
   - Connect to WebSocket endpoints for real-time chat

4. Monitor and Optimize:
   - Track query performance
   - Monitor embedding costs
   - Tune chunk sizes and retrieval parameters
"""
)
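
The checklist item about setting concurrency can be illustrated with a semaphore that caps how many embedding calls run at once. This is a minimal sketch, not Buttermilk's implementation: `process_concurrently` is a hypothetical helper, and `fake_embed` merely simulates a slow embedding call.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_concurrently(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    concurrency: int = 4,
) -> List[R]:
    """Run worker over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(guarded(item) for item in items)))


stats = {"in_flight": 0, "peak": 0}


async def fake_embed(doc_id: str) -> str:
    """Simulated embedding call that records peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"{doc_id}:embedded"


doc_ids = [f"doc_{i}" for i in range(10)]
# In a notebook, use `await process_concurrently(...)` instead of asyncio.run().
results = asyncio.run(process_concurrently(doc_ids, fake_embed, concurrency=3))
print(results[:2], "peak concurrency:", stats["peak"])
```

A lower limit reduces pressure on rate-limited embedding APIs, while a higher one shortens wall-clock time when the service can absorb the load.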


## Summary

This notebook demonstrated how to:

1. **Load OSB data** using Buttermilk's flexible JSON data loaders
2. **Create embeddings** with the ChromaDBEmbeddings pipeline
3. **Build a vector database** suitable for semantic search
4. **Use the generic RAG agent** for question answering
5. **Perform semantic search** with various query patterns
6. **Analyze the vector store** structure and contents

The key advantage of this approach is that it uses **generic, reusable components** that work with any JSON dataset, not just OSB data. The same patterns can be applied to any document collection for RAG applications.

The infrastructure is designed for **production use**, with features like:
- Async processing for scalability
- GCS integration for cloud storage
- Configurable chunking and embedding parameters
- Error handling and retry mechanisms
- Comprehensive logging and monitoring

This provides a solid foundation for building sophisticated RAG applications with Buttermilk.