# OSB Vector Database Example

This notebook demonstrates how to work with the Online Safety Bureau (OSB) dataset using Buttermilk's vector database infrastructure.

## Overview

The OSB dataset contains legal case summaries and decisions from the Online Safety Bureau. This example shows how to:

1. Load OSB JSON data from Google Cloud Storage
2. Create and populate a ChromaDB vector store
3. Perform semantic search queries
4. Use the OSB expert prompt template for legal analysis

## Prerequisites

- A properly configured Buttermilk environment
- Access to Google Cloud Storage (gs://prosocial-public/osb/)
- Vertex AI embedding model access

In [None]:
# Initialize Buttermilk with OSB configuration
from buttermilk.utils.nb import init
from buttermilk._core.dmrc import get_bm, set_bm
from rich import print
import asyncio
import json
from pathlib import Path

# Initialize with OSB flow configuration
cfg = init(job="osb_example", overrides=["flows=[osb]"])
set_bm(cfg.bm)
bm = get_bm()

print("✅ Buttermilk initialized for OSB example")

## 1. Exploring the OSB Configuration

Let's examine the existing OSB configuration to understand how it's structured.

In [None]:
# Examine the OSB flow configuration
print("OSB Flow Configuration:")
print(json.dumps(cfg.flows.osb.model_dump(), indent=2, default=str))  # default=str guards against non-JSON types such as paths

print("\n" + "="*50)
print("OSB Data Source Configuration:")
if hasattr(cfg.flows.osb, 'storage') and 'cases' in cfg.flows.osb.storage:
    print(json.dumps(cfg.flows.osb.storage.cases.model_dump(), indent=2, default=str))
else:
    print("Data source configuration not found in expected location")

## 2. Loading OSB JSON Data

The OSB dataset is stored as a JSON file in Google Cloud Storage. Let's load it using Buttermilk's data loading infrastructure.

In [None]:
from buttermilk._core.config import DataSourceConfig
from buttermilk.data.loaders import create_data_loader

# Create data source configuration for OSB JSON file
osb_config = DataSourceConfig(
    type="file",
    path="gs://prosocial-public/osb/03_osb_fulltext_summaries.json",
    # Map JSON fields to Record fields if needed
    columns={
        "content": "text",  # Map 'text' field to 'content'
        "record_id": "id"   # Map 'id' field to 'record_id'
    }
)

# Create data loader
loader = create_data_loader(osb_config)

print(f"Created data loader: {type(loader).__name__}")
print(f"Loading OSB data from: {osb_config.path}")

In [None]:
# Load all records into memory, then examine the first one
osb_records = list(loader)
print(f"✅ Loaded {len(osb_records)} OSB records")

# Examine the first record
if osb_records:
    first_record = osb_records[0]
    print("\nFirst record structure:")
    print(f"Record ID: {first_record.record_id}")
    print(f"Content preview: {first_record.content[:200]}...")
    print(f"Metadata keys: {list(first_record.metadata.keys()) if first_record.metadata else 'None'}")
else:
    print("⚠️ No records loaded")
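
To make the `columns` mapping above more concrete, here is a minimal plain-Python sketch of what a field mapping of that shape could do to a single JSON row. This is not Buttermilk's loader code: the helper `apply_column_mapping`, the sample row, and the choice to collect unmapped fields under `metadata` are all assumptions made for illustration.

```python
from typing import Any


def apply_column_mapping(row: dict[str, Any], columns: dict[str, str]) -> dict[str, Any]:
    """Rename source fields to target fields (hypothetical helper).

    `columns` maps target field name -> source field name, matching the
    orientation used in the DataSourceConfig example above.
    """
    mapped = {target: row[source] for target, source in columns.items() if source in row}
    # Assumption: fields that are not mapped are kept as metadata.
    used_sources = set(columns.values())
    mapped["metadata"] = {k: v for k, v in row.items() if k not in used_sources}
    return mapped


# Hypothetical row; the real OSB file may use different field names.
row = {"id": "case-001", "text": "Summary of the decision...", "title": "Example case"}
record = apply_column_mapping(row, {"content": "text", "record_id": "id"})
print(record)
```

If the real loader treats unmapped fields differently, the `Metadata keys` line printed above is the quickest way to check.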

## 3. Setting Up ChromaDB Vector Store

Now let's create a ChromaDB vector store to index the OSB documents for semantic search.

In [None]:
from buttermilk.data.vector import ChromaDBEmbeddings, InputDocument, DefaultTextSplitter
import tempfile

# Create a temporary directory for our vector store
temp_dir = tempfile.mkdtemp()
print(f"Vector store will be created in: {temp_dir}")

# Initialize ChromaDB vector store
vector_store = ChromaDBEmbeddings(
    collection_name="osb_cases",
    persist_directory=temp_dir,
    embedding_model="text-embedding-005",  # Using the same model as OSB config
    dimensionality=768,  # text-embedding-005 returns 768-dimensional vectors by default; match your OSB config
    concurrency=5,  # Reduce for example
    arrow_save_dir=f"{temp_dir}/chunks"
)

print("✅ ChromaDB vector store initialized")

In [None]:
# Initialize the cache for remote ChromaDB (if needed)
await vector_store.ensure_cache_initialized()
print("✅ Vector store cache initialized")

## 4. Converting Records to InputDocuments

We need to convert Buttermilk Records to InputDocuments for the vector store processing pipeline.

In [None]:
# Convert Records to InputDocuments
input_docs = []

for record in osb_records[:10]:  # Process first 10 for example
    input_doc = InputDocument(
        record_id=record.record_id,
        title=record.metadata.get('title', f"OSB Case {record.record_id}") if record.metadata else f"OSB Case {record.record_id}",
        full_text=record.content,
        file_path="",  # Not from file
        metadata=record.metadata or {}
    )
    input_docs.append(input_doc)

print(f"✅ Created {len(input_docs)} InputDocuments for processing")
print(f"Example document title: {input_docs[0].title}")
print(f"Example document text length: {len(input_docs[0].full_text)} characters")

## 5. Processing Documents: Chunking and Embedding

Now we'll process the documents through the complete pipeline: chunking, embedding, and storing in ChromaDB.

In [None]:
from buttermilk.data.vector import list_to_async_iterator

# Create text splitter for chunking
text_splitter = DefaultTextSplitter(
    chunk_size=1000,  # Smaller chunks for legal text
    chunk_overlap=200
)

print("Processing documents through the vector store pipeline...")
print("This includes: chunking → embedding → storing in ChromaDB")

# Process documents one by one to show progress
successful_docs = 0
total_chunks = 0

for i, doc in enumerate(input_docs):
    print(f"\nProcessing document {i+1}/{len(input_docs)}: {doc.title}")
    
    # Step 1: Chunk the document
    chunked_doc = await text_splitter.process(doc)
    if not chunked_doc or not chunked_doc.chunks:
        print(f"  ⚠️ Failed to chunk document {doc.record_id}")
        continue
    
    print(f"  📄 Created {len(chunked_doc.chunks)} chunks")
    
    # Step 2: Embed and store in ChromaDB
    processed_doc = await vector_store.process(chunked_doc)
    if processed_doc:
        successful_docs += 1
        total_chunks += len(processed_doc.chunks)
        print(f"  ✅ Successfully embedded and stored {len(processed_doc.chunks)} chunks")
    else:
        print(f"  ❌ Failed to process document {doc.record_id}")

print("\n🎉 Pipeline complete!")
print(f"Successfully processed {successful_docs}/{len(input_docs)} documents")
print(f"Total chunks stored: {total_chunks}")
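
To build some intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of fixed-window chunking over characters. It is an illustration only: `DefaultTextSplitter` may split on separators or count tokens rather than characters, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 2500
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With these settings each new chunk advances by 800 characters, so a larger overlap means more chunks (and more embedding calls) for the same document.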

## 6. Performing Semantic Search

Now let's test the vector store by performing semantic searches on the OSB data.

In [None]:
# Test semantic search
def search_osb_cases(query: str, n_results: int = 5):
    """
    Search OSB cases using semantic similarity
    """
    try:
        results = vector_store.collection.query(
            query_texts=[query],
            n_results=n_results,
            include=["documents", "metadatas", "distances"]
        )
        
        print(f"\n🔍 Search results for: '{query}'")
        print("="*80)
        
        if results['documents'] and results['documents'][0]:
            for i, (doc, metadata, distance) in enumerate(zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            )):
                print(f"\n📄 Result {i+1} (distance: {distance:.3f}; lower is closer)")
                print(f"Document: {metadata.get('document_title', 'Unknown')}")
                print(f"Chunk: {metadata.get('chunk_index', 'Unknown')}")
                print(f"Text: {doc[:200]}...")
                print("-" * 40)
        else:
            print("No results found")
            
    except Exception as e:
        print(f"Search error: {e}")

# Test different types of legal queries
queries = [
    "harmful content moderation",
    "platform responsibility for user-generated content",
    "misinformation and disinformation",
    "child safety online"
]

for query in queries:
    search_osb_cases(query, n_results=3)
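
A note on reading the scores: the values in `results['distances']` depend on the collection's distance metric. Chroma's default space is squared L2, and `1 - distance` only corresponds to cosine similarity when the collection is configured for cosine distance. The sketch below uses plain Python (no Chroma calls) to show how the two measures differ on the same pair of vectors; the helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance (0.0 means identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
print(cosine_similarity(a, b))  # 0.0
print(squared_l2(a, b))         # 2.0, so 1 - distance would be -1.0 here
```

For ranking results within a single query either measure works, since lower distance always means a closer match; the difference matters when a score is reported as a "similarity" or compared against a threshold.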

## 7. Using the OSB Expert Prompt Template

Let's demonstrate how to use the OSB-specific prompt template for legal analysis.

In [None]:
from buttermilk.utils.templating import render_template
from buttermilk._core.types import Record

# Search for relevant cases
query = "content moderation appeals process"
search_results = vector_store.collection.query(
    query_texts=[query],
    n_results=3,
    include=["documents", "metadatas"]
)

# Convert search results to the format expected by the template
dataset = []
if search_results['documents'] and search_results['documents'][0]:
    for doc, metadata in zip(search_results['documents'][0], search_results['metadatas'][0]):
        case_data = {
            'title': metadata.get('document_title', 'Unknown Case'),
            'text': doc,
            'record_id': metadata.get('document_id', 'unknown')
        }
        dataset.append(case_data)

print(f"Found {len(dataset)} relevant cases for analysis")

# Render the OSB expert template
if dataset:
    prompt_context = {
        'dataset': dataset,
        'formatting': 'Provide a structured analysis with clear conclusions',
        'prompt': query
    }
    
    try:
        rendered_prompt = render_template('osb', prompt_context)
        print("\n📋 OSB Expert Prompt:")
        print("="*80)
        print(rendered_prompt)
    except Exception as e:
        print(f"Error rendering template: {e}")
        print("Template may not be available in current setup")
else:
    print("No cases found for template demonstration")
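
If the `osb` template is not available in your setup, the shape of the context can still be checked with a stand-in. The sketch below uses the standard library's `string.Template` to combine the same three keys (`dataset`, `formatting`, `prompt`) into a prompt. The template text, `PROMPT_TEMPLATE`, and `render_prompt` are hypothetical and are not the OSB expert template or Buttermilk's `render_template`.

```python
from string import Template

# Hypothetical stand-in template; the real OSB template will differ.
PROMPT_TEMPLATE = Template(
    "Question: $prompt\n\n"
    "Cases:\n$cases\n\n"
    "Formatting instructions: $formatting\n"
)


def render_prompt(context: dict) -> str:
    """Render a prompt from a context with 'dataset', 'formatting' and 'prompt' keys."""
    cases = "\n".join(
        f"- [{case['record_id']}] {case['title']}: {case['text']}"
        for case in context["dataset"]
    )
    return PROMPT_TEMPLATE.substitute(
        prompt=context["prompt"],
        cases=cases,
        formatting=context["formatting"],
    )


# Hypothetical context shaped like `prompt_context` above.
example_context = {
    "dataset": [{"title": "Example case", "text": "Summary text", "record_id": "case-001"}],
    "formatting": "Provide a structured analysis with clear conclusions",
    "prompt": "content moderation appeals process",
}
rendered = render_prompt(example_context)
print(rendered)
```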

## 8. Advanced Search with Filters

ChromaDB supports metadata filtering for more precise searches.

In [None]:
# Example of filtered search (if metadata contains relevant fields)
def filtered_search(query: str, filters: dict | None = None, n_results: int = 5):
    """
    Search with metadata filters
    """
    search_params = {
        "query_texts": [query],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"]
    }
    
    if filters:
        search_params["where"] = filters
    
    results = vector_store.collection.query(**search_params)
    
    print(f"\n🔍 Filtered search: '{query}'")
    if filters:
        print(f"Filters: {filters}")
    print("="*60)
    
    if results['documents'] and results['documents'][0]:
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0], 
            results['distances'][0]
        )):
            print(f"\n📄 Result {i+1} (distance: {distance:.3f}; lower is closer)")
            print(f"Text: {doc[:150]}...")
    else:
        print("No results found")

# Try different searches
filtered_search("online safety", n_results=3)

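# Example with filters (adjust based on actual metadata structure).
# Chroma metadata `where` filters use operators such as $eq, $ne and $in;
# $contains is a `where_document` operator for substring matches on chunk text.
# "Example Case Title" is a placeholder, not a real OSB case title.
# filtered_search("platform liability", filters={"document_title": {"$eq": "Example Case Title"}}, n_results=3)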
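
When filtering on more than one metadata field, Chroma expects the conditions to be combined under a logical operator such as `$and`. Here is a small sketch of a helper that builds such a `where` dictionary from keyword arguments. The helper name `build_where` and the field values are hypothetical; `document_title` and `chunk_index` are the metadata keys this notebook reads elsewhere, and they may not match your collection.

```python
def build_where(**equals):
    """Build a Chroma-style `where` filter from exact-match conditions.

    Returns None when no conditions are given, a single clause for one
    condition, and an `$and` of clauses for two or more.
    """
    clauses = [{key: {"$eq": value}} for key, value in equals.items()]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


# Hypothetical values for illustration only.
where = build_where(document_title="Example Case Title", chunk_index=0)
print(where)
```

The result can be passed as the `filters` argument of `filtered_search` above.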

## 9. Collection Statistics and Management

Let's examine what we've created and learn about managing the vector store.

In [None]:
# Get collection statistics
collection = vector_store.collection
count = collection.count()

print("📊 ChromaDB Collection Statistics:")
print(f"Collection name: {collection.name}")
print(f"Total chunks: {count}")
print(f"Storage directory: {vector_store.persist_directory}")

# Sample some records to see metadata structure
if count > 0:
    sample = collection.get(limit=3, include=["metadatas"])
    print("\nSample metadata keys:")
    for i, metadata in enumerate(sample['metadatas']):
        print(f"  Record {i+1}: {list(metadata.keys())}")

# Show storage directory contents
storage_path = Path(vector_store.persist_directory)
if storage_path.exists():
    print("\n📁 Storage directory contents:")
    for item in storage_path.iterdir():
        if item.is_file():
            size = item.stat().st_size
            print(f"  {item.name}: {size:,} bytes")
        else:
            print(f"  {item.name}/ (directory)")
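
The listing above only reports the size of top-level files, so subdirectories such as the chunk store show up without a size. As a rough sketch, the total on-disk footprint can be computed by walking the directory recursively. The helper names and the demo files below are hypothetical; the demo runs against a throwaway directory rather than the real vector store.

```python
import tempfile
from pathlib import Path


def directory_size(path: Path) -> int:
    """Total size in bytes of all regular files under `path`, recursively."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def human_readable(num_bytes: int) -> str:
    """Format a byte count using 1024-based units, capped at GB."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    root = Path(demo_dir)
    (root / "chunks").mkdir()
    (root / "index.bin").write_bytes(b"\x00" * 1024)
    (root / "chunks" / "part-0.arrow").write_bytes(b"\x00" * 512)
    total = directory_size(root)

print(human_readable(total))  # 1.5 KB
```

To measure the real store, call `directory_size(Path(vector_store.persist_directory))`.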

## 10. Cleanup and Next Steps

Finally, let's clean up and discuss next steps for production usage.

In [None]:
import shutil

print("🧹 Cleanup options:")
print(f"To remove temporary vector store: rm -rf {temp_dir}")
print("To keep the vector store for future use, move it to a permanent location")

# Uncomment to actually clean up
# shutil.rmtree(temp_dir)
# print("✅ Temporary files cleaned up")

print("\n🚀 Next Steps:")
print("1. For production: use a permanent storage location")
print("2. Process the full OSB dataset (not just first 10 records)")
print("3. Integrate with Buttermilk flows for automated processing")
print("4. Use pre-computed embeddings from gs://prosocial-public/osb/04_osb_embeddings_vertex-005.json")
print("5. Implement custom metadata filtering based on case types")
print("6. Create custom prompt templates for specific legal analysis tasks")
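
For short experiments where the vector store does not need to outlive the session, one alternative to manual cleanup is `tempfile.TemporaryDirectory`, which removes the directory when its `with` block exits. The sketch below shows only the directory lifecycle; the `osb_vectors_` prefix is an arbitrary choice, and no Buttermilk or ChromaDB objects are created here.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory(prefix="osb_vectors_") as tmp:
    store_dir = Path(tmp)
    (store_dir / "chunks").mkdir()  # stand-in for files a vector store would write
    existed_inside = store_dir.exists()
    # Create and use the vector store here, pointing it at store_dir.

# The directory and everything in it are removed once the block exits.
exists_after = store_dir.exists()
print(existed_inside, exists_after)  # True False
```

This pattern suits throwaway runs only; anything you want to query later should be written to a permanent location instead.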

## Summary

This notebook demonstrated:

✅ **Loading OSB JSON data** using Buttermilk's data loading infrastructure  
✅ **Creating a ChromaDB vector store** configured for legal text  
✅ **Processing documents** through the chunking, embedding, and storage pipeline  
✅ **Performing semantic search** to find relevant legal cases  
✅ **Using the OSB expert template** for structured legal analysis  
✅ **Filtering by metadata** and managing the collection  

### Key Benefits of Buttermilk's Vector Store Infrastructure:

- **Async processing** for scalable document ingestion
- **Cloud storage integration** with automatic caching
- **Vertex AI embeddings** for high-quality semantic search
- **Flexible configuration** through Hydra configs
- **Built-in error handling** and retry mechanisms
- **Metadata preservation** for rich filtering capabilities
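
To illustrate what bounded async processing with retries can look like, here is a generic `asyncio` sketch. It is not Buttermilk's implementation: `with_retry`, `process_all`, the retry count, and the backoff delays are assumptions chosen for the example, with `concurrency` playing a role similar to the `concurrency=5` setting used earlier.

```python
import asyncio


async def with_retry(fn, *args, attempts: int = 3, base_delay: float = 0.01):
    """Await fn(*args), retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return await fn(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def process_all(items, worker, concurrency: int = 5):
    """Run worker(item) for every item, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await with_retry(worker, item)

    # gather preserves the input order of the results
    return await asyncio.gather(*(guarded(item) for item in items))
```

In a notebook cell this would be called as `results = await process_all(docs, some_async_worker)`, where `some_async_worker` is any coroutine function that takes one item; in a plain script, wrap the call in `asyncio.run(...)`.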

### Production Considerations:

1. **Use persistent storage** locations instead of temporary directories
2. **Leverage pre-computed embeddings** to save time and costs
3. **Configure appropriate chunk sizes** based on your use case
4. **Implement monitoring** for embedding costs and performance
5. **Consider backup strategies** for valuable vector stores