# OSB Vector Database Example

This notebook demonstrates how to build and query a vector database from OSB (Online Safety Bureau) full-text data using Buttermilk's ChromaDB integration.

## Overview

We'll show how to:
1. Load OSB JSON data (a small inline sample here; the full dataset goes through the existing data loaders)
2. Generate embeddings and create a ChromaDB vector store
3. Use the generic RAG agent for interactive question answering
4. Demonstrate semantic search capabilities

This example uses the generic infrastructure that works with any JSON dataset.

In [None]:
# Standard library (only what the cells below actually use)
from pathlib import Path

# Buttermilk imports
from buttermilk import logger
from buttermilk.data.vector import ChromaDBEmbeddings, InputDocument, DefaultTextSplitter
from buttermilk.data.loaders.json_loader import JsonDataLoader
from buttermilk.agents.rag.rag_agent import RagAgent
from buttermilk._core.config import AgentConfig, DataSourceConfig

## 1. Configuration Setup

First, let's set up the configuration for our OSB vector database pipeline.

In [None]:
# Configuration for OSB data processing
config = {
    "vectoriser": {
        "_target_": "buttermilk.data.vector.ChromaDBEmbeddings",
        "persist_directory": "./data/osb_chromadb",  # Local for example
        "collection_name": "osb_fulltext",
        "embedding_model": "text-embedding-005",
        "dimensionality": 768,
        "arrow_save_dir": "./data/osb_embeddings",
        "concurrency": 5,  # Reduced for example
        "upsert_batch_size": 10
    },
    "chunker": {
        "_target_": "buttermilk.data.vector.DefaultTextSplitter",
        "chunk_size": 2000,
        "chunk_overlap": 500
    },
    "data_loader": {
        "uri": "gs://prosocial-public/osb/03_osb_fulltext_summaries.json",
        "field_mapping": {
            "record_id": "id",
            "content": "full_text",
            "metadata": {
                "title": "title",
                "case_number": "case_number",
                "url": "url",
                "summary": "summary"
            }
        }
    }
}

print("Configuration loaded successfully")
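
To make the `field_mapping` block above more concrete, here is a minimal sketch of how such a mapping could be applied to one raw record. The helper `apply_field_mapping` is hypothetical and is not part of Buttermilk; how `JsonDataLoader` actually interprets the mapping is not shown in this notebook.

```python
from typing import Any


# Hypothetical helper (an assumption, not Buttermilk's API): it only illustrates
# how a mapping shaped like config["data_loader"]["field_mapping"] could be applied.
def apply_field_mapping(raw: dict[str, Any], mapping: dict[str, Any]) -> dict[str, Any]:
    """Rename the fields of one raw record according to a possibly nested mapping."""
    mapped: dict[str, Any] = {}
    for target, source in mapping.items():
        if isinstance(source, dict):
            # Nested mapping, such as the "metadata" block
            mapped[target] = apply_field_mapping(raw, source)
        else:
            # A missing source field becomes None instead of raising KeyError
            mapped[target] = raw.get(source)
    return mapped


field_mapping = {
    "record_id": "id",
    "content": "full_text",
    "metadata": {"title": "title", "case_number": "case_number"},
}
raw_record = {
    "id": "osb_case_001",
    "title": "Platform Safety Measures Review",
    "case_number": "OSB-2024-001",
    "full_text": "Example full text.",
}

record = apply_field_mapping(raw_record, field_mapping)
print(record["record_id"], record["metadata"]["case_number"])
```

Returning `None` for missing fields keeps the sketch forgiving; a stricter loader might prefer to raise so that schema drift is noticed early.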

## 2. Initialize Components

Let's create the vector store and text splitter components.

In [None]:
# Reuse the settings defined in `config` so each value lives in one place.
# "_target_" is a Hydra instantiation key, not a constructor argument.
vectoriser_cfg = {k: v for k, v in config["vectoriser"].items() if k != "_target_"}
chunker_cfg = {k: v for k, v in config["chunker"].items() if k != "_target_"}

# Create the storage directories
Path(vectoriser_cfg["persist_directory"]).mkdir(parents=True, exist_ok=True)
Path(vectoriser_cfg["arrow_save_dir"]).mkdir(parents=True, exist_ok=True)

# Initialize ChromaDB vector store
vectorstore = ChromaDBEmbeddings(**vectoriser_cfg)

# Initialize text splitter
chunker = DefaultTextSplitter(**chunker_cfg)

print("Components initialized successfully")
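
The interaction between `chunk_size=2000` and `chunk_overlap=500` can be illustrated with a plain character window. This is only a sketch of the overlap idea under the assumption of fixed-width windows; `DefaultTextSplitter` may well split on separators such as paragraphs or sentences, and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += stride
    return chunks


# 5000 characters with a repeating digit pattern so overlaps are easy to inspect
text = "".join(str(i % 10) for i in range(5000))
chunks = sliding_window_chunks(text, chunk_size=2000, chunk_overlap=500)
print([len(c) for c in chunks])  # [2000, 2000, 2000]
```

With these settings each window advances by 1500 characters, so the last 500 characters of one chunk reappear at the start of the next. Larger overlaps preserve more context across chunk boundaries at the cost of more chunks to embed.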

## 3. Load OSB Data

Now let's load the OSB JSON data and convert it to InputDocument format.

In [None]:
# For this example, let's simulate loading a subset of OSB data
# In practice, you would use the JsonDataLoader with the GCS URI

sample_osb_data = [
    {
        "id": "osb_case_001",
        "title": "Platform Safety Measures Review",
        "case_number": "OSB-2024-001",
        "url": "https://example.com/case1",
        "summary": "Review of content moderation policies",
        "full_text": "This case examines the effectiveness of automated content moderation systems in detecting harmful content across social media platforms. The analysis reveals significant gaps in current detection algorithms, particularly for context-dependent harmful speech. Recommendations include implementing hybrid human-AI moderation systems and developing more sophisticated natural language processing capabilities."
    },
    {
        "id": "osb_case_002",
        "title": "Age Verification Systems",
        "case_number": "OSB-2024-002",
        "url": "https://example.com/case2",
        "summary": "Assessment of age verification technologies",
        "full_text": "This report evaluates various age verification technologies deployed by online platforms to protect minors. The study compares biometric verification, document-based verification, and behavioral analysis methods. Key findings indicate that multi-factor verification approaches provide the highest accuracy while maintaining user privacy. The report recommends establishing industry standards for age verification and regular auditing of verification systems."
    },
    {
        "id": "osb_case_003",
        "title": "Misinformation Detection",
        "case_number": "OSB-2024-003",
        "url": "https://example.com/case3",
        "summary": "Analysis of misinformation spread patterns",
        "full_text": "This analysis investigates how misinformation spreads across digital platforms and the effectiveness of current countermeasures. The study identifies rapid amplification through bot networks and coordinated inauthentic behavior as primary vectors for misinformation dissemination. Proposed interventions include real-time fact-checking integration, source credibility scoring, and user education programs about information literacy."
    }
]

# Convert to InputDocument format
documents = []
for item in sample_osb_data:
    doc = InputDocument(
        record_id=item["id"],
        title=item["title"],
        full_text=item["full_text"],
        metadata={
            "title": item["title"],
            "case_number": item["case_number"],
            "url": item["url"],
            "summary": item["summary"]
        },
        file_path=""
    )
    documents.append(doc)

print(f"Loaded {len(documents)} OSB documents")
for doc in documents:
    print(f"- {doc.record_id}: {doc.title}")

## 4. Process Documents: Chunking and Embedding

Now we'll chunk the documents and generate embeddings.

In [None]:
async def process_documents():
    """Process documents through the complete pipeline."""
    processed_docs = []
    
    for doc in documents:
        print(f"\nProcessing document: {doc.record_id}")
        
        # Step 1: Chunk the document
        chunked_doc = await chunker.process(doc)
        if not chunked_doc:
            print(f"Failed to chunk document {doc.record_id}")
            continue
            
        print(f"Created {len(chunked_doc.chunks)} chunks")
        
        # Step 2: Generate embeddings and store
        processed_doc = await vectorstore.process(chunked_doc)
        if processed_doc:
            processed_docs.append(processed_doc)
            print(f"Successfully embedded and stored document {doc.record_id}")
        else:
            print(f"Failed to process document {doc.record_id}")
    
    return processed_docs

# Run the processing pipeline
processed_documents = await process_documents()
print(f"\nProcessed {len(processed_documents)} documents successfully")
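
The loop above handles one document at a time. For larger datasets, a bounded-concurrency pattern is a common alternative; the sketch below uses only `asyncio` and a placeholder coroutine. `embed_one` is a hypothetical stand-in for the chunk-and-embed step, and whether `vectorstore.process` is safe to call concurrently is an assumption that would need checking (the `concurrency` setting may already handle this internally).

```python
import asyncio


async def embed_one(doc_id: str, semaphore: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for chunking and embedding a single document."""
    async with semaphore:  # at most `concurrency` documents are in flight
        await asyncio.sleep(0.01)  # placeholder for real I/O-bound work
        return doc_id


async def process_concurrently(doc_ids: list[str], concurrency: int = 5) -> list[str]:
    """Process documents concurrently and return the ids that succeeded."""
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [embed_one(doc_id, semaphore) for doc_id in doc_ids]
    # return_exceptions=True keeps one failure from cancelling the other tasks
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    return [o for o in outcomes if not isinstance(o, BaseException)]


doc_ids = ["osb_case_001", "osb_case_002", "osb_case_003"]
# Inside a notebook, use `await process_concurrently(doc_ids)` instead,
# because the notebook already runs an event loop.
completed = asyncio.run(process_concurrently(doc_ids))
print(completed)
```

`asyncio.gather` returns results in the order the tasks were passed in, so the output order matches the input order even though the work overlaps.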

## 5. Initialize RAG Agent

Now let's create a generic RAG agent to query our OSB vector database.

In [None]:
# Create data source configuration for the RAG agent
data_config = DataSourceConfig(
    type="chromadb",
    persist_directory="./data/osb_chromadb",
    collection_name="osb_fulltext",
    embedding_model="text-embedding-005",
    dimensionality=768
)

# Create agent configuration
agent_config = AgentConfig(
    role="RESEARCHER",
    agent_obj="RagAgent",
    description="OSB Research Assistant",
    data={"osb_vector": data_config},
    variants={"model": "gemini-1.5-flash"},
    parameters={
        "template": "rag_research",
        "n_results": 10,
        "no_duplicates": False,
        "max_queries": 3
    }
)

# Initialize the RAG agent
rag_agent = RagAgent(**agent_config.model_dump())
print("RAG agent initialized successfully")
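
The `n_results`, `no_duplicates`, and `max_queries` parameters suggest that hits from several queries may be combined. How `RagAgent` does this is not shown in this notebook, so the following is only a sketch of one possible approach: keep the closest hit per chunk and truncate to `n_results`. The function `merge_unique` and the `chunk_id`/`distance` keys are hypothetical.

```python
def merge_unique(result_lists: list[list[dict]], n_results: int) -> list[dict]:
    """Merge per-query hits, keeping the closest hit for each chunk id."""
    best: dict[str, dict] = {}
    for hits in result_lists:
        for hit in hits:
            current = best.get(hit["chunk_id"])
            # Smaller distance means a closer match, so it replaces the earlier hit
            if current is None or hit["distance"] < current["distance"]:
                best[hit["chunk_id"]] = hit
    return sorted(best.values(), key=lambda h: h["distance"])[:n_results]


# Illustrative hits for two queries; "c2" appears in both result lists
query_a = [{"chunk_id": "c1", "distance": 0.20}, {"chunk_id": "c2", "distance": 0.35}]
query_b = [{"chunk_id": "c2", "distance": 0.30}, {"chunk_id": "c3", "distance": 0.50}]

merged = merge_unique([query_a, query_b], n_results=10)
print([h["chunk_id"] for h in merged])
```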

## 6. Semantic Search Examples

Let's demonstrate the semantic search capabilities of our OSB vector database.

In [None]:
async def search_osb_database(queries):
    """Search the OSB database with multiple queries."""
    print("\n=== OSB Database Search Results ===")
    
    results = await rag_agent.fetch(queries)
    
    for i, (query, result) in enumerate(zip(queries, results)):
        print(f"\n--- Query {i+1}: {query} ---")
        print(f"Found {len(result.results)} relevant chunks")
        
        if result.results:
            # Show the top result
            top_result = result.results[0]
            print("\nTop Result:")
            print(f"Document: {top_result.document_title}")
            print(f"Case Number: {top_result.metadata.get('case_number', 'N/A')}")
            print(f"Text: {top_result.full_text[:300]}...")
        else:
            print("No results found")

# Example search queries
search_queries = [
    "What are the challenges with automated content moderation?",
    "How effective are age verification systems?", 
    "What techniques are used to spread misinformation?"
]

await search_osb_database(search_queries)
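
Under the hood, semantic search ranks stored chunks by how close their embedding vectors are to the query's embedding. The toy example below shows that ranking step with hand-written three-dimensional vectors and cosine similarity. The vectors are made up for illustration; real embeddings from `text-embedding-005` have 768 dimensions, and the vector store uses an approximate index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k most similar entries as (chunk_id, score) pairs."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Made-up vectors standing in for the embeddings of three chunks
toy_index = {
    "moderation": [0.9, 0.1, 0.0],
    "age_verification": [0.1, 0.9, 0.0],
    "misinformation": [0.0, 0.2, 0.9],
}

ranked = top_k([0.8, 0.2, 0.1], toy_index, k=2)
for chunk_id, score in ranked:
    print(f"{chunk_id}: {score:.3f}")
```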

## 7. Interactive Chat Interface

Now let's create an interactive interface to chat with our OSB knowledge base.

In [None]:
async def chat_with_osb(user_question):
    """Interactive chat with OSB knowledge base."""
    print(f"\n🔍 User Question: {user_question}")
    
    # Search for relevant context
    search_results = await rag_agent.fetch([user_question])
    
    if search_results and search_results[0].results:
        context = search_results[0]
        print(f"\n📚 Found {len(context.results)} relevant chunks")
        
        # Display relevant chunks
        print("\n📋 Relevant Information:")
        for i, result in enumerate(context.results[:3]):  # Show top 3
            print(f"\n{i+1}. {result.document_title} ({result.metadata.get('case_number', 'N/A')})")
            print(f"   {result.full_text[:200]}...")
            
        # In a real implementation, this would be sent to an LLM for synthesis
        print("\n🤖 AI Response: [In a real implementation, the retrieved context would be sent to an LLM to generate a synthesized response]")
    else:
        print("\n❌ No relevant information found in the OSB database")

# Example chat interactions
example_questions = [
    "What are the main issues with current content moderation approaches?",
    "What recommendations exist for age verification?",
    "How do platforms detect and counter misinformation?"
]

for question in example_questions:
    await chat_with_osb(question)
    print("\n" + "="*80)
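
The placeholder response above stands in for the synthesis step. As a rough sketch of what that step might start with, the function below assembles retrieved chunks into a prompt with numbered sources. `build_prompt`, the prompt wording, and the `title`/`case_number`/`text` keys are all assumptions for illustration; the `rag_research` template used by the agent is not shown in this notebook and may look quite different.

```python
def build_prompt(question: str, chunks: list[dict], max_chars: int = 500) -> str:
    """Format retrieved chunks as numbered sources followed by the question."""
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        excerpt = chunk["text"][:max_chars]  # keep each source short
        context_lines.append(f"[{i}] {chunk['title']} ({chunk['case_number']}): {excerpt}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative chunk in a hypothetical dict shape
retrieved = [
    {
        "title": "Age Verification Systems",
        "case_number": "OSB-2024-002",
        "text": "Multi-factor verification approaches provide the highest accuracy.",
    }
]

prompt = build_prompt("What recommendations exist for age verification?", retrieved)
print(prompt)
```

Numbering the sources lets the model cite them, which makes it easier to trace an answer back to a specific case.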

## 8. Vector Store Analysis

Let's analyze our vector store to understand what we've created.

In [None]:
# Get collection statistics
collection = vectorstore.collection
count = collection.count()

print("\n=== OSB Vector Store Statistics ===")
print(f"Collection Name: {vectorstore.collection_name}")
print(f"Total Chunks: {count}")
print(f"Embedding Dimensions: {vectorstore.dimensionality}")
print(f"Embedding Model: {vectorstore.embedding_model}")

# Get a sample of metadata to understand the structure
sample_results = collection.get(limit=3, include=["metadatas", "documents"])

print("\n=== Sample Metadata Structure ===")
if sample_results['metadatas']:
    sample_metadata = sample_results['metadatas'][0]
    print("Available metadata fields:")
    for key, value in sample_metadata.items():
        print(f"  - {key}: {type(value).__name__} = {str(value)[:50]}...")

print("\n=== Storage Locations ===")
print(f"ChromaDB Directory: {vectorstore.persist_directory}")
print(f"Embeddings Directory: {vectorstore.arrow_save_dir}")
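
Beyond the total count, it is often useful to see how chunks are distributed across cases. ChromaDB's `collection.get` returns flat lists under keys such as `ids` and `metadatas`, so a `Counter` over the metadata is enough. The sketch below runs on a hand-written dictionary in that shape; the ids are made up, and it assumes each chunk's metadata carries a `case_number` field, as the filtering examples in this notebook do.

```python
from collections import Counter

# Hand-written stand-in for the dict returned by collection.get(include=["metadatas"])
sample_get_result = {
    "ids": ["chunk-a", "chunk-b", "chunk-c"],
    "metadatas": [
        {"case_number": "OSB-2024-001", "title": "Platform Safety Measures Review"},
        {"case_number": "OSB-2024-001", "title": "Platform Safety Measures Review"},
        {"case_number": "OSB-2024-002", "title": "Age Verification Systems"},
    ],
}

# Count chunks per case; chunks without the field are grouped under "unknown"
chunks_per_case = Counter(
    metadata.get("case_number", "unknown") for metadata in sample_get_result["metadatas"]
)

for case_number, n_chunks in chunks_per_case.most_common():
    print(f"{case_number}: {n_chunks} chunks")
```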

## 9. Advanced Search Examples

Let's explore some advanced search patterns and filtering capabilities.

In [None]:
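# Direct ChromaDB queries with metadata filtering.
# Note: `query_texts` is embedded with the collection's own embedding function.
# If that differs from the model used at indexing time (text-embedding-005,
# 768 dimensions), embed the query yourself and pass `query_embeddings` instead.
async def advanced_search_examples():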
    """Demonstrate advanced search capabilities."""
    print("\n=== Advanced Search Examples ===")
    
    # 1. Search with metadata filtering
    print("\n1. Search within specific case:")
    results = collection.query(
        query_texts=["content moderation challenges"],
        n_results=5,
        where={"case_number": "OSB-2024-001"},
        include=["documents", "metadatas"]
    )
    print(f"   Found {len(results['ids'][0]) if results['ids'] else 0} results in OSB-2024-001")
    
    # 2. Similarity search across all documents
    print("\n2. General similarity search:")
    results = collection.query(
        query_texts=["artificial intelligence and safety"],
        n_results=5,
        include=["documents", "metadatas", "distances"]
    )
    
    if results['ids'] and results['ids'][0]:
        print(f"   Found {len(results['ids'][0])} results")
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0][:3], 
            results['metadatas'][0][:3],
            results['distances'][0][:3]
        )):
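            # Chroma returns distances (smaller means closer); `1 - distance` is only
            # a cosine similarity if the collection was created with the cosine metric.
            print(f"   Result {i+1} (distance: {distance:.3f}): {metadata.get('title', 'N/A')}")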
            print(f"     {doc[:100]}...")
    
    # 3. Multi-query search
    print("\n3. Multi-query search:")
    multi_queries = [
        "platform safety measures",
        "user protection mechanisms",
        "digital safety standards"
    ]
    
    for query in multi_queries:
        results = collection.query(
            query_texts=[query],
            n_results=2,
            include=["metadatas"]
        )
        n_hits = len(results['ids'][0]) if results['ids'] else 0
        print(f"   '{query}': {n_hits} results")

await advanced_search_examples()
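
The `where` argument used above accepts more than a single equality test; ChromaDB also supports operators such as `$in` and `$and`. To show what these filters mean, here is a small pure-Python matcher that mimics a subset of that syntax. It is an illustration only, not ChromaDB's implementation, and `matches` is a hypothetical helper.

```python
def matches(metadata: dict, where: dict) -> bool:
    """Evaluate a small subset of Chroma-style filters: equality, $in, and $and."""
    for key, condition in where.items():
        if key == "$and":
            # Every sub-filter in the list must match
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict) and "$in" in condition:
            if metadata.get(key) not in condition["$in"]:
                return False
        elif metadata.get(key) != condition:
            return False
    return True


metadatas = [
    {"case_number": "OSB-2024-001", "title": "Platform Safety Measures Review"},
    {"case_number": "OSB-2024-002", "title": "Age Verification Systems"},
    {"case_number": "OSB-2024-003", "title": "Misinformation Detection"},
]

single = [m["title"] for m in metadatas if matches(m, {"case_number": "OSB-2024-001"})]
several = [
    m["title"]
    for m in metadatas
    if matches(m, {"case_number": {"$in": ["OSB-2024-001", "OSB-2024-003"]}})
]
combined = [
    m["title"]
    for m in metadatas
    if matches(
        m,
        {"$and": [
            {"case_number": {"$in": ["OSB-2024-001", "OSB-2024-002"]}},
            {"title": "Age Verification Systems"},
        ]},
    )
]
print(single, several, combined)
```

The same filter dictionaries can be passed as `where=` to `collection.query`, where the filtering happens inside the database before the similarity ranking.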

## 10. Production Considerations

Here are key considerations for using this in production:

In [None]:
print("""
=== Production Deployment Checklist ===

🔧 Configuration:
   ✓ Use GCS for persist_directory: gs://your-bucket/chromadb
   ✓ Configure appropriate chunk_size for your content
   ✓ Set concurrency based on your compute resources
   ✓ Use production embedding models (text-embedding-004/005)

📊 Performance:
   ✓ Monitor embedding generation costs
   ✓ Implement caching for frequently accessed data
   ✓ Use batch processing for large datasets
   ✓ Configure appropriate timeout values

🔒 Security:
   ✓ Secure GCS bucket access with proper IAM
   ✓ Implement data access controls
   ✓ Audit vector store queries
   ✓ Protect sensitive metadata

🚀 Scalability:
   ✓ Plan for vector store size growth
   ✓ Implement horizontal scaling for embeddings
   ✓ Monitor query performance
   ✓ Set up proper logging and monitoring

🔄 Maintenance:
   ✓ Plan for data updates and reindexing
   ✓ Implement backup strategies
   ✓ Version control for embeddings and metadata
   ✓ Regular quality assessments
""")

# Show next steps
print("""
=== Next Steps ===

1. Scale to Full Dataset:
   - Use the osb_vectorize.yaml configuration
   - Run: uv run python -m buttermilk.data.vector +run=osb_vectorize

2. Deploy RAG Flow:
   - Use the osb_rag.yaml flow configuration
   - Run: uv run python -m buttermilk.runner.cli +flow=osb_rag +run=api

3. Integrate with Frontend:
   - Use the Buttermilk web interface
   - Connect to WebSocket endpoints for real-time chat

4. Monitor and Optimize:
   - Track query performance
   - Monitor embedding costs
   - Tune chunk sizes and retrieval parameters
""")

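The checklist item on batch processing and the `upsert_batch_size` setting both come down to grouping items into fixed-size batches before writing them. A minimal sketch of that grouping, using only the standard library, is shown below; `batched` is a hypothetical helper, and how `ChromaDBEmbeddings` batches its upserts internally is not shown in this notebook.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 25 illustrative chunk indices grouped with the notebook's upsert_batch_size of 10
batches = list(batched(range(25), 10))
print([len(b) for b in batches])  # [10, 10, 5]
```

Smaller batches reduce the work lost when a single write fails, while larger batches reduce per-request overhead; the right size depends on the dataset and the backend.
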
## Summary

This notebook demonstrated how to:

1. **Load OSB data** into `InputDocument` objects (from an inline sample here; Buttermilk's JSON data loaders handle the full dataset)
2. **Create embeddings** with the ChromaDBEmbeddings pipeline
3. **Build a vector database** suitable for semantic search
4. **Use the generic RAG agent** for question answering
5. **Perform semantic search** with various query patterns
6. **Analyze the vector store** structure and contents

The key advantage of this approach is that it uses **generic, reusable components** that work with any JSON dataset, not just OSB data. The same patterns can be applied to any document collection for RAG applications.

The infrastructure is **production-ready** with features like:
- Async processing for scalability
- GCS integration for cloud storage
- Configurable chunking and embedding parameters
- Error handling and retry mechanisms
- Comprehensive logging and monitoring

This provides a solid foundation for building sophisticated RAG applications with Buttermilk.