# JSON to Vector Database Tutorial

This tutorial demonstrates how to convert any JSON dataset into a searchable vector database using Buttermilk's infrastructure.

## What You'll Learn

- 📄 Loading JSON data from various sources (local files, cloud storage, URLs)
- 🔧 Configuring vector stores for different data types
- ✂️ Chunking strategies for optimal search performance
- 🧠 Embedding generation with different models
- 🔍 Semantic search techniques and optimization
- 📊 Best practices for production deployments

## Prerequisites

- Buttermilk environment configured
- Basic familiarity with JSON data structures
- Access to embedding models (Vertex AI recommended)

In [1]:
# Setup and imports
import json
import tempfile
import asyncio
from pathlib import Path
from rich import print
from typing import List, Dict, Any

# Buttermilk imports
from buttermilk.utils.nb import init
from buttermilk._core.dmrc import get_bm, set_bm
from buttermilk._core.config import DataSourceConfig
from buttermilk._core.types import Record
from buttermilk.data.loaders import create_data_loader
from buttermilk.data.vector import ChromaDBEmbeddings, InputDocument, DefaultTextSplitter, list_to_async_iterator

# Initialize Buttermilk
cfg = init(job="json_tutorial")
bm = get_bm()

print("🚀 Buttermilk initialized for JSON-to-Vector tutorial")


2025-06-16 18:50:52 [] INFO bm_init.py:745 Logging set up for run: platform='local' name='bm_api' job='json_tutorial' run_id='20250616T0850Z-cHd3-docker-desktop-debian' ip=None node_name='docker-desktop' save_dir='/tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian' flow_api=None. Save directory: /tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian
2025-06-16 18:50:52 [] INFO nb.py:59 Starting interactive run for bm_api job json_tutorial in notebook
2025-06-16 18:50:52 [] INFO save.py:641 Successfully dumped data to local disk (JSON): /tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian/tmpsrg7shbs.json.
2025-06-16 18:50:52 [] INFO save.py:215 Successfully saved data using dump_to_disk to: /tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian/tmpsrg7shbs.json.
2025-06-16 18:50:52 [] INFO bm_init.py:831 {'message': 'Successfully saved data to: /tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian/tmpsrg7shbs.json', 'uri': '/tmp/tmpsv6awzna/bm_api/json_tutorial/20250616T0850Z-cHd3-docker-desktop-debian/tmpsrg7shbs.json', 'run_id': '20250616T0850Z-cHd3-docker-desktop-debian'}


## Step 1: Preparing Sample JSON Data

Let's start by creating sample JSON data to work with. The five documents below span three categories (technology, environment, lifestyle), so later searches and metadata filters have something to tell apart.

In [None]:
# Create sample JSON data representing different document types
sample_documents = [
    {
        "id": "doc_001",
        "title": "Introduction to Machine Learning",
        "content": "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed. It involves algorithms that can identify patterns in data and make predictions or classifications based on those patterns. The field has applications in image recognition, natural language processing, recommendation systems, and autonomous vehicles.",
        "category": "technology",
        "author": "Dr. Alice Smith",
        "date": "2024-01-15",
        "tags": ["AI", "machine learning", "algorithms"]
    },
    {
        "id": "doc_002", 
        "title": "Sustainable Energy Solutions",
        "content": "Renewable energy sources such as solar, wind, and hydroelectric power are becoming increasingly important in combating climate change. Solar panels convert sunlight directly into electricity through photovoltaic cells, while wind turbines harness wind energy to generate power. These technologies are becoming more efficient and cost-effective, making them viable alternatives to fossil fuels.",
        "category": "environment",
        "author": "Prof. Bob Johnson",
        "date": "2024-02-03",
        "tags": ["renewable energy", "solar", "wind", "climate"]
    },
    {
        "id": "doc_003",
        "title": "Modern Web Development Frameworks",
        "content": "Web development has evolved significantly with the introduction of modern frameworks like React, Vue.js, and Angular. These frameworks provide component-based architectures that make it easier to build complex, interactive user interfaces. React uses a virtual DOM for efficient updates, Vue.js offers a gentle learning curve with powerful features, and Angular provides a full-featured framework with TypeScript integration.",
        "category": "technology",
        "author": "Carol Developer",
        "date": "2024-01-28",
        "tags": ["web development", "React", "Vue", "Angular"]
    },
    {
        "id": "doc_004",
        "title": "Urban Gardening Techniques",
        "content": "Urban gardening allows city dwellers to grow their own food and plants in limited spaces. Techniques include container gardening, vertical gardens, and rooftop farming. These methods can help improve food security, reduce environmental impact, and provide mental health benefits. Hydroponic and aquaponic systems are particularly effective for small spaces, allowing for year-round growing without soil.",
        "category": "lifestyle",
        "author": "David Green",
        "date": "2024-02-10",
        "tags": ["gardening", "urban", "hydroponics", "sustainability"]
    },
    {
        "id": "doc_005",
        "title": "Quantum Computing Fundamentals",
        "content": "Quantum computing represents a paradigm shift from classical computing by leveraging quantum mechanical phenomena like superposition and entanglement. Unlike classical bits that exist in either 0 or 1 states, quantum bits (qubits) can exist in multiple states simultaneously. This allows quantum computers to potentially solve certain problems exponentially faster than classical computers, particularly in cryptography, optimization, and scientific simulation.",
        "category": "technology",
        "author": "Dr. Eva Quantum",
        "date": "2024-02-15",
        "tags": ["quantum computing", "qubits", "superposition", "cryptography"]
    }
]

# Save sample data to a temporary file in JSON Lines format (one object per line)
temp_dir = tempfile.mkdtemp()
json_file_path = Path(temp_dir) / "sample_documents.json"

with open(json_file_path, "w", encoding="utf-8") as f:
    for doc in sample_documents:
        f.write(json.dumps(doc) + "\n")  # one JSON object per line (JSONL)

print(f"📄 Created sample JSON data with {len(sample_documents)} documents")
print(f"📁 Saved to: {json_file_path}")
print(f"\nSample document structure:")
print(json.dumps(sample_documents[0], indent=2))
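
Because the file holds one JSON object per line rather than a single JSON array, calling `json.load` on the whole file would raise an error at the second object. The sketch below reads such a file back with only the standard library, which can serve as a sanity check before handing the path to a loader. `read_jsonl` and the two demo rows are hypothetical, introduced here for illustration; they are not part of Buttermilk.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, such as a trailing newline
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so bad records are easy to locate
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
    return rows


# Round-trip two tiny illustrative records through a temporary file
demo_rows = [{"id": "a", "content": "first"}, {"id": "b", "content": "second"}]
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
demo_path.write_text("".join(json.dumps(r) + "\n" for r in demo_rows), encoding="utf-8")

loaded_rows = read_jsonl(demo_path)
print(f"Read {len(loaded_rows)} rows; ids: {[r['id'] for r in loaded_rows]}")
```

Reading line by line also keeps memory use flat for large files, since only one record is parsed at a time.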


## Step 2: JSON Data Loading

Buttermilk's data loaders turn each JSON object into a `Record`. The three methods below cover a direct field mapping, an alternative mapping, and custom processing for nested structures.

In [None]:
# Method 1: Loading JSONL (JSON Lines) files
print("📊 Method 1: Loading JSONL file")

# Create data source configuration
jsonl_config = DataSourceConfig(
    type="file",
    path=str(json_file_path),
    # Map JSON fields to Record fields
    columns={
        "record_id": "id",     # Map 'id' to 'record_id'
        "content": "content"   # Map 'content' to 'content'
    }
)

# Create and use data loader
loader = create_data_loader(jsonl_config)
records = list(loader)

print(f"✅ Loaded {len(records)} records")
print(f"\nFirst record:")
first_record = records[0]
print(f"  Record ID: {first_record.record_id}")
print(f"  Content preview: {first_record.content[:100]}...")
print(f"  Metadata keys: {list(first_record.metadata.keys())}")
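
To make the `columns` mapping concrete, here is a plain-Python sketch of what such a mapping does conceptually. It assumes that mapped fields are renamed and that every unmapped field ends up in `metadata`; that is an assumption about the loader's behaviour for illustration only, and `apply_column_mapping` is a hypothetical helper, not Buttermilk's implementation.

```python
def apply_column_mapping(row, columns):
    """Rename source fields to target fields; keep unmapped fields as metadata.

    `columns` maps target field name -> source field name, the same
    direction as the `columns` dict in the cell above.
    """
    mapped_sources = set(columns.values())
    record = {target: row.get(source) for target, source in columns.items()}
    # Assumption: anything not explicitly mapped is preserved as metadata
    record["metadata"] = {k: v for k, v in row.items() if k not in mapped_sources}
    return record


row = {"id": "doc_001", "title": "Intro", "content": "Some text", "category": "technology"}
mapped = apply_column_mapping(row, {"record_id": "id", "content": "content"})
print(mapped)
```

If the metadata keys printed by the loader differ from this sketch, trust the loader's output: the point here is only the direction of the mapping (target on the left, source on the right).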


In [None]:
# Method 2: Custom JSON loading with different field mappings
print("\n📊 Method 2: Custom field mapping")

# Alternative mapping strategy
custom_config = DataSourceConfig(
    type="file",
    path=str(json_file_path),
    columns={
        "record_id": "id",
        "content": "title",    # Use title as primary content
        "uri": "content"       # Store full content in URI field
    }
)

loader_custom = create_data_loader(custom_config)
records_custom = list(loader_custom)

print(f"✅ Loaded {len(records_custom)} records with custom mapping")
print(f"  Content (title): {records_custom[0].content}")
print(f"  URI (full content): {records_custom[0].uri[:100]}...")


In [None]:
# Method 3: Handling complex JSON structures
print("\n📊 Method 3: Complex JSON structures")

# Create a more complex JSON structure
complex_document = {
    "document": {
        "id": "complex_001",
        "metadata": {
            "title": "Advanced Data Science",
            "author": {"name": "Dr. Jane Doe", "affiliation": "University XYZ"},
            "publication_date": "2024-03-01"
        },
        "sections": [
            {"heading": "Introduction", "content": "Data science combines statistics, programming, and domain expertise..."},
            {"heading": "Methods", "content": "We employed machine learning algorithms including random forests..."}
        ]
    }
}

# For complex structures, we often need custom processing
def process_complex_json(json_data: dict) -> Record:
    """
    Custom function to process complex JSON structures
    """
    doc = json_data["document"]
    
    # Extract and combine content from sections
    content_parts = []
    content_parts.append(doc["metadata"]["title"])
    
    for section in doc["sections"]:
        content_parts.append(f"{section['heading']}: {section['content']}")
    
    combined_content = "\n\n".join(content_parts)
    
    # Create metadata
    metadata = {
        "title": doc["metadata"]["title"],
        "author_name": doc["metadata"]["author"]["name"],
        "author_affiliation": doc["metadata"]["author"]["affiliation"],
        "publication_date": doc["metadata"]["publication_date"],
        "sections_count": len(doc["sections"])
    }
    
    return Record(
        record_id=doc["id"],
        content=combined_content,
        metadata=metadata
    )

# Process the complex document
complex_record = process_complex_json(complex_document)
print(f"✅ Processed complex JSON structure")
print(f"  Record ID: {complex_record.record_id}")
print(f"  Content: {complex_record.content[:150]}...")
print(f"  Metadata: {complex_record.metadata}")
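
Vector stores commonly restrict metadata values to scalars (strings, numbers, booleans), which is why the function above pulls nested author fields up into flat keys by hand. A more general sketch of the same idea follows; `flatten_metadata` is a hypothetical helper, and joining lists into a comma-separated string is one illustrative choice among several.

```python
SCALAR_TYPES = (str, int, float, bool)


def flatten_metadata(data, parent_key="", sep="."):
    """Flatten nested dicts into one level using dotted keys."""
    flat = {}
    for key, value in data.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix
            flat.update(flatten_metadata(value, full_key, sep))
        elif isinstance(value, (list, tuple)):
            # Keep lists of scalars as a joined string; skip lists of objects
            if all(isinstance(v, SCALAR_TYPES) for v in value):
                flat[full_key] = ", ".join(str(v) for v in value)
        elif isinstance(value, SCALAR_TYPES):
            flat[full_key] = value
        # None and any other type are dropped
    return flat


nested = {
    "title": "Advanced Data Science",
    "author": {"name": "Dr. Jane Doe", "affiliation": "University XYZ"},
    "tags": ["stats", "ml"],
    "reviewed": None,
}
flat_metadata = flatten_metadata(nested)
print(flat_metadata)
```

Dotted keys keep the origin of each value visible (`author.name`), at the cost of slightly longer filter expressions later on.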


## Step 3: Vector Store Configuration

The three configurations below cover general documents, short texts, and a high-throughput setup. They differ mainly in concurrency and batch sizes.

In [None]:
# Configuration 1: Standard setup for general documents
print("⚙️ Configuration 1: Standard Document Processing")

standard_vector_store = ChromaDBEmbeddings(
    collection_name="tutorial_standard",
    persist_directory=f"{temp_dir}/standard_vectorstore",
    embedding_model="text-embedding-005",
    dimensionality=768,  # text-embedding-005 returns at most 768 dimensions
    concurrency=5,  # Lower for tutorial
    arrow_save_dir=f"{temp_dir}/standard_chunks"
)

# Initialize cache
await standard_vector_store.ensure_cache_initialized()
print(f"✅ Standard vector store initialized")
print(f"   Collection: {standard_vector_store.collection.name}")
print(f"   Embedding model: {standard_vector_store.embedding_model}")
print(f"   Dimensionality: {standard_vector_store.dimensionality}")


In [None]:
# Configuration 2: Optimized for short texts
print("\n⚙️ Configuration 2: Short Text Optimization")

short_text_vector_store = ChromaDBEmbeddings(
    collection_name="tutorial_short_text",
    persist_directory=f"{temp_dir}/short_text_vectorstore",
    embedding_model="text-embedding-005",
    dimensionality=768,  # text-embedding-005 returns at most 768 dimensions
    concurrency=3,
    upsert_batch_size=20,  # Smaller batches for short texts
    arrow_save_dir=f"{temp_dir}/short_text_chunks"
)

await short_text_vector_store.ensure_cache_initialized()
print(f"✅ Short text vector store initialized")


In [None]:
# Configuration 3: High-performance setup
print("\n⚙️ Configuration 3: High-Performance Setup")

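# Shown as a plain dict rather than instantiated: these higher concurrency and
# batch-size values are meant for production workloads, not a tutorial run
performance_config = {
    "collection_name": "tutorial_performance",
    "persist_directory": f"{temp_dir}/performance_vectorstore",
    "embedding_model": "text-embedding-005",
    "dimensionality": 768,  # text-embedding-005 returns at most 768 dimensions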
    "concurrency": 20,           # Higher concurrency
    "upsert_batch_size": 100,    # Larger batches
    "embedding_batch_size": 10,  # Batch embeddings
    "arrow_save_dir": f"{temp_dir}/performance_chunks"
}

print(f"📋 High-performance configuration:")
for key, value in performance_config.items():
    print(f"   {key}: {value}")
    
print("\n💡 For production, consider:")
print("   - Persistent storage volumes")
print("   - SSD storage for better performance")
print("   - Monitoring embedding API costs")
print("   - Load balancing for high traffic")
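
The `concurrency` and batch-size settings interact: batching reduces the number of API calls, while concurrency caps how many of those calls are in flight at once. The sketch below illustrates that pattern with `asyncio.Semaphore`. It is not how `ChromaDBEmbeddings` is implemented; `batched`, `fake_embed_batch`, and `embed_all` are hypothetical names, and the fake embedder stands in for a real API call.

```python
import asyncio


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def fake_embed_batch(texts):
    """Stand-in for an embedding API call: one tiny vector per text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [[float(len(text))] for text in texts]


async def embed_all(texts, batch_size, concurrency):
    """Embed all texts in batches, with at most `concurrency` batches in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_batch(batch):
        async with semaphore:
            return await fake_embed_batch(batch)

    # gather() returns results in submission order, so output order matches input
    results = await asyncio.gather(*(run_batch(b) for b in batched(texts, batch_size)))
    return [vector for batch_result in results for vector in batch_result]


# In a script; inside a notebook, use `await embed_all(...)` instead
demo_vectors = asyncio.run(
    embed_all(["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2, concurrency=2)
)
print(demo_vectors)
```

Raising concurrency helps only until the embedding API's rate limit is reached, so it is worth increasing it gradually while watching for throttling errors.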


## Step 4: Text Chunking Strategies

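Chunk size trades precision against context: small chunks match specific facts more sharply, while large chunks keep related ideas together. The sample documents are short (a few hundred characters each), so expect few chunks per document here; the differences matter more on longer texts.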

In [None]:
# Strategy 1: Standard chunking
print("✂️ Chunking Strategy 1: Standard Documents")

standard_splitter = DefaultTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Test with our sample document
sample_text = sample_documents[0]["content"]
standard_chunks = standard_splitter.split_text(sample_text)

print(f"Original text length: {len(sample_text)} characters")
print(f"Number of chunks: {len(standard_chunks)}")
print(f"Average chunk size: {sum(len(chunk) for chunk in standard_chunks) / len(standard_chunks):.0f} characters")
print(f"\nFirst chunk: {standard_chunks[0][:100]}...")
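
To show what `chunk_size` and `chunk_overlap` mean mechanically, here is a minimal character-based sliding-window splitter. It is a simplified stand-in, not `DefaultTextSplitter`, which may split on separators or measure size differently; `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo_text = "abcdefghij" * 3  # 30 characters
demo_chunks = sliding_window_chunks(demo_text, chunk_size=12, chunk_overlap=4)
for index, chunk in enumerate(demo_chunks):
    print(index, repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is long enough.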


In [None]:
# Strategy 2: Small chunks for precise search
print("\n✂️ Chunking Strategy 2: Small Chunks (Precise Search)")

precise_splitter = DefaultTextSplitter(
    chunk_size=500,   # Smaller chunks
    chunk_overlap=100  # Less overlap
)

precise_chunks = precise_splitter.split_text(sample_text)
print(f"Number of chunks: {len(precise_chunks)}")
print(f"Average chunk size: {sum(len(chunk) for chunk in precise_chunks) / len(precise_chunks):.0f} characters")
print("\n📝 Use case: When you need to find very specific information")


In [None]:
# Strategy 3: Large chunks for context preservation
print("\n✂️ Chunking Strategy 3: Large Chunks (Context Preservation)")

context_splitter = DefaultTextSplitter(
    chunk_size=2000,  # Larger chunks
    chunk_overlap=400  # More overlap
)

# Use a longer text for this example
long_text = " ".join([doc["content"] for doc in sample_documents[:3]])
context_chunks = context_splitter.split_text(long_text)

print(f"Long text length: {len(long_text)} characters")
print(f"Number of chunks: {len(context_chunks)}")
print(f"Average chunk size: {sum(len(chunk) for chunk in context_chunks) / len(context_chunks):.0f} characters")
print("\n📝 Use case: When context and relationships between ideas are important")
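
The three strategies can be compared before embedding anything by estimating how many chunks each produces, which in turn drives embedding cost. The estimate below assumes a plain character-based sliding window; a splitter that respects separators will give somewhat different counts. `expected_chunk_count` is a hypothetical helper.

```python
import math


def expected_chunk_count(text_length, chunk_size, chunk_overlap):
    """Estimate chunk count for a fixed-size sliding window over `text_length` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    # Smallest n with (n - 1) * step + chunk_size >= text_length
    return math.ceil((text_length - chunk_overlap) / step)


strategies = {
    "standard": (1000, 200),
    "precise": (500, 100),
    "context": (2000, 400),
}
chunk_counts = {
    name: expected_chunk_count(5000, size, overlap)
    for name, (size, overlap) in strategies.items()
}
print(chunk_counts)
```

Multiplying the estimated count by the number of documents gives a rough number of embedding calls, which is useful when choosing between the precise and context-preserving settings.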


In [None]:
# Strategy 4: Custom chunking for structured content
print("\n✂️ Chunking Strategy 4: Custom Structured Chunking")

def custom_structured_chunker(text: str, metadata: dict) -> list[str]:
    """
    Custom chunking that respects document structure
    """
    chunks = []
    
    # Include title and metadata in first chunk
    title = metadata.get("title", "")
    author = metadata.get("author", "")
    category = metadata.get("category", "")
    
    header_chunk = f"Title: {title}\nAuthor: {author}\nCategory: {category}\n\n{text[:500]}"
    chunks.append(header_chunk)
    
    # Split remaining text into smaller chunks
    remaining_text = text[500:]
    if remaining_text:
        splitter = DefaultTextSplitter(chunk_size=800, chunk_overlap=150)
        remaining_chunks = splitter.split_text(remaining_text)
        chunks.extend(remaining_chunks)
    
    return chunks

# Test custom chunking
sample_doc = sample_documents[0]
custom_chunks = custom_structured_chunker(sample_doc["content"], sample_doc)

print(f"Number of chunks: {len(custom_chunks)}")
print(f"\nFirst chunk (with metadata):")
print(custom_chunks[0][:200] + "...")
print("\n📝 Use case: When you want to preserve document metadata and structure")


## Step 5: Document Processing Pipeline

Now let's put it all together and process our documents through the complete pipeline.

In [None]:
# Convert Records to InputDocuments
print("🔄 Converting Records to InputDocuments")

def records_to_input_docs(records: List[Record]) -> List[InputDocument]:
    """
    Convert Buttermilk Records to InputDocuments for vector processing
    """
    input_docs = []
    
    for record in records:
        # Use title from metadata if available, otherwise create one
        title = "Unknown Document"
        if record.metadata:
            title = record.metadata.get("title", f"Document {record.record_id}")
        
        input_doc = InputDocument(
            record_id=record.record_id,
            title=title,
            full_text=record.content,
            file_path="",  # Not from a file
            metadata=record.metadata or {}
        )
        input_docs.append(input_doc)
    
    return input_docs

# Convert our sample records
input_documents = records_to_input_docs(records)
print(f"✅ Created {len(input_documents)} InputDocuments")

# Show example
example_doc = input_documents[0]
print(f"\nExample InputDocument:")
print(f"  ID: {example_doc.record_id}")
print(f"  Title: {example_doc.title}")
print(f"  Text length: {len(example_doc.full_text)} characters")
print(f"  Metadata keys: {list(example_doc.metadata.keys())}")


In [None]:
# Process documents through the pipeline
print("🏭 Processing Documents Through Pipeline")
print("This includes: chunking → embedding → storing in ChromaDB")

# Use standard configuration and chunking
text_splitter = DefaultTextSplitter(chunk_size=800, chunk_overlap=150)

# Process each document
processed_count = 0
total_chunks = 0

for i, doc in enumerate(input_documents):
    print(f"\n📄 Processing document {i+1}/{len(input_documents)}: {doc.title}")
    
    try:
        # Step 1: Chunk the document
        chunked_doc = await text_splitter.process(doc)
        if not chunked_doc or not chunked_doc.chunks:
            print(f"  ⚠️ No chunks created for {doc.record_id}")
            continue
        
        print(f"  ✂️ Created {len(chunked_doc.chunks)} chunks")
        
        # Step 2: Embed and store
        processed_doc = await standard_vector_store.process(chunked_doc)
        if processed_doc:
            processed_count += 1
            total_chunks += len(processed_doc.chunks)
            print(f"  ✅ Embedded and stored {len(processed_doc.chunks)} chunks")
        else:
            print(f"  ❌ Failed to process {doc.record_id}")
            
    except Exception as e:
        print(f"  ❌ Error processing {doc.record_id}: {e}")

print(f"\n🎉 Pipeline Complete!")
print(f"Successfully processed: {processed_count}/{len(input_documents)} documents")
print(f"Total chunks stored: {total_chunks}")
print(f"Vector store collection count: {standard_vector_store.collection.count()}")


## Step 6: Semantic Search Techniques

This step covers four techniques: basic semantic search, metadata filtering, similarity thresholds, and multi-query search.

In [None]:
# Basic semantic search function
def semantic_search(query: str, n_results: int = 5, collection=None):
    """
    Perform semantic search on the vector store
    """
    if collection is None:
        collection = standard_vector_store.collection
    
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    
    print(f"\n🔍 Search: '{query}'")
    print("=" * 50)
    
    if not results['documents'] or not results['documents'][0]:
        print("No results found")
        return []
    
    search_results = []
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
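        # Treating 1 - distance as similarity assumes the collection uses cosine
        # distance; with ChromaDB's default squared-L2 metric it does not hold
        similarity = 1 - distance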
        result = {
            'rank': i + 1,
            'similarity': similarity,
            'document': doc,
            'metadata': metadata
        }
        search_results.append(result)
        
        print(f"\n📄 Result {i+1} (similarity: {similarity:.3f})")
        print(f"Document: {metadata.get('document_title', 'Unknown')}")
        print(f"Text: {doc[:150]}...")
        if 'category' in metadata:
            print(f"Category: {metadata.get('category', 'Unknown')}")
    
    return search_results

# Test different types of queries
test_queries = [
    "machine learning algorithms",
    "renewable energy solar power", 
    "web development frameworks",
    "gardening in small spaces"
]

for query in test_queries:
    results = semantic_search(query, n_results=2)
    print("\n" + "-" * 60)
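
The search function converts distance to similarity with `1 - distance`, which is exact only for cosine distance. The sketch below works through the relationship in plain Python: for unit-length vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so the matching conversion is `1 - distance / 2`. All function names are hypothetical, and the metric labels `"cosine"` and `"l2"` are used here simply as identifiers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_from_distance(distance, space):
    """Convert a distance to a cosine-style similarity for the given metric."""
    if space == "cosine":
        return 1.0 - distance  # cosine distance is defined as 1 - cosine similarity
    if space == "l2":
        return 1.0 - distance / 2.0  # valid only when both vectors have unit length
    raise ValueError(f"unsupported space: {space}")


u = normalize([1.0, 0.0])
v = normalize([1.0, 1.0])
expected = cosine_similarity(u, v)
from_cosine = similarity_from_distance(1.0 - expected, "cosine")
from_l2 = similarity_from_distance(squared_l2(u, v), "l2")
print(f"cosine: {expected:.4f}, via cosine distance: {from_cosine:.4f}, via L2: {from_l2:.4f}")
```

If the scores printed by the search cells look negative or exceed 1, checking which distance metric the collection was created with is a good first step.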


In [None]:
# Advanced search with filtering
print("\n🔍 Advanced Search: Filtering by Metadata")

def filtered_search(query: str, filters: dict = None, n_results: int = 5):
    """
    Semantic search with metadata filtering
    """
    search_params = {
        "query_texts": [query],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"]
    }
    
    if filters:
        search_params["where"] = filters
    
    results = standard_vector_store.collection.query(**search_params)
    
    print(f"\n🔍 Filtered Search: '{query}'")
    if filters:
        print(f"Filters: {filters}")
    print("=" * 50)
    
    if results['documents'] and results['documents'][0]:
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            similarity = 1 - distance
            print(f"\n📄 Result {i+1} (similarity: {similarity:.3f})")
            print(f"Document: {metadata.get('document_title', 'Unknown')}")
            print(f"Category: {metadata.get('category', 'Unknown')}")
            print(f"Text: {doc[:100]}...")
    else:
        print("No results found with current filters")

# Test filtered searches
print("🧪 Testing different filter combinations:")

# Filter by category
filtered_search(
    "programming", 
    filters={"category": "technology"},
    n_results=3
)

# Filter by author (if available in metadata)
filtered_search(
    "sustainable practices",
    filters={"author": "Prof. Bob Johnson"},
    n_results=2
)
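
The filters above each use a single field. ChromaDB's `where` filter generally expects one top-level expression, so filtering on several fields at once means wrapping the conditions in `$and` (or `$or`). The helper below builds such a dict from simple field/value pairs; `build_where` is a hypothetical name, and the operator set shown (`$eq`, `$in`, `$and`) is only a subset of what the filter syntax offers.

```python
def build_where(conditions):
    """Build a `where` filter dict from {field: value} pairs.

    A list or tuple value becomes an `$in` clause; any other value becomes `$eq`.
    Returns None when there is nothing to filter on.
    """
    clauses = []
    for field, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append({field: {"$in": list(value)}})
        else:
            clauses.append({field: {"$eq": value}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no logical wrapper
    return {"$and": clauses}


single = build_where({"category": "technology"})
combined = build_where({"category": ["technology", "environment"], "author": "Prof. Bob Johnson"})
print(single)
print(combined)
```

The result can be passed as the `filters` argument of `filtered_search`; a `None` result skips filtering, matching the `if filters:` check in that function.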


In [None]:
# Search optimization techniques
print("\n⚡ Search Optimization Techniques")

def optimized_search(query: str, similarity_threshold: float = 0.5):
    """
    Search with similarity threshold filtering
    """
    # Get more results initially
    results = standard_vector_store.collection.query(
        query_texts=[query],
        n_results=20,  # Get more to filter
        include=["documents", "metadatas", "distances"]
    )
    
    if not results['documents'] or not results['documents'][0]:
        return []
    
    # Filter by similarity threshold
    filtered_results = []
    for doc, metadata, distance in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ):
        similarity = 1 - distance
        if similarity >= similarity_threshold:
            filtered_results.append({
                'similarity': similarity,
                'document': doc,
                'metadata': metadata
            })
    
    print(f"\n🎯 Optimized Search: '{query}' (threshold: {similarity_threshold})")
    print(f"Found {len(filtered_results)} results above threshold")
    print("=" * 50)
    
    for i, result in enumerate(filtered_results[:5]):  # Show top 5
        print(f"\n📄 Result {i+1} (similarity: {result['similarity']:.3f})")
        print(f"Document: {result['metadata'].get('document_title', 'Unknown')}")
        print(f"Text: {result['document'][:100]}...")
    
    return filtered_results

# Test optimized search
high_quality_results = optimized_search("artificial intelligence", similarity_threshold=0.6)
all_results = optimized_search("artificial intelligence", similarity_threshold=0.3)

print(f"\n📊 Search Quality Comparison:")
print(f"High threshold (0.6): {len(high_quality_results)} results")
print(f"Low threshold (0.3): {len(all_results)} results")


In [None]:
# Multi-query search for better recall
print("\n🎯 Multi-Query Search for Better Recall")

def multi_query_search(queries: List[str], n_results_per_query: int = 3):
    """
    Search with multiple related queries and combine results
    """
    all_results = {}
    
    for query in queries:
        results = standard_vector_store.collection.query(
            query_texts=[query],
            n_results=n_results_per_query,
            include=["documents", "metadatas", "distances"]
        )
        
        if results['documents'] and results['documents'][0]:
            for doc, metadata, distance in zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            ):
                doc_id = metadata.get('document_id', doc[:50])
                similarity = 1 - distance
                
                # Keep the best similarity score for each document
                if doc_id not in all_results or similarity > all_results[doc_id]['similarity']:
                    all_results[doc_id] = {
                        'similarity': similarity,
                        'document': doc,
                        'metadata': metadata,
                        'matching_query': query
                    }
    
    # Sort by similarity
    sorted_results = sorted(all_results.values(), key=lambda x: x['similarity'], reverse=True)
    
    print(f"\n🔍 Multi-Query Search Results")
    print(f"Queries: {queries}")
    print(f"Unique documents found: {len(sorted_results)}")
    print("=" * 50)
    
    for i, result in enumerate(sorted_results[:5]):
        print(f"\n📄 Result {i+1} (similarity: {result['similarity']:.3f})")
        print(f"Matched query: '{result['matching_query']}'")
        print(f"Document: {result['metadata'].get('document_title', 'Unknown')}")
        print(f"Text: {result['document'][:100]}...")
    
    return sorted_results

# Test multi-query search
related_queries = [
    "machine learning",
    "artificial intelligence", 
    "AI algorithms",
    "neural networks"
]

multi_results = multi_query_search(related_queries)
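
The function above merges results by keeping each document's best similarity score. A different, commonly used way to merge ranked lists is reciprocal rank fusion (RRF), which ignores raw scores and rewards documents that rank well across many queries. The sketch below is that alternative technique, not what `multi_query_search` does; `reciprocal_rank_fusion` is a hypothetical name, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of IDs into one list of (id, score) pairs.

    Each appearance contributes 1 / (k + rank), with rank starting at 1.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# One ranked list of illustrative document IDs per query
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c"],
]
fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses ranks only, it stays meaningful when the queries produce scores on different scales, which the best-score approach does not guarantee.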


## Step 7: Performance Analysis and Best Practices

Let's analyze the performance of our vector store and discuss best practices.

In [None]:
# Vector store analysis
print("📊 Vector Store Analysis")

collection = standard_vector_store.collection
total_documents = collection.count()

# Get sample data for analysis
sample_data = collection.get(
    limit=min(10, total_documents),
    include=["documents", "metadatas", "embeddings"]
)

print(f"📈 Collection Statistics:")
print(f"  Total chunks: {total_documents}")
print(f"  Collection name: {collection.name}")
print(f"  Storage location: {standard_vector_store.persist_directory}")

if sample_data['documents']:
    # Analyze chunk sizes
    chunk_sizes = [len(doc) for doc in sample_data['documents']]
    avg_chunk_size = sum(chunk_sizes) / len(chunk_sizes)
    min_chunk_size = min(chunk_sizes)
    max_chunk_size = max(chunk_sizes)
    
    print(f"\n📏 Chunk Size Analysis:")
    print(f"  Average chunk size: {avg_chunk_size:.0f} characters")
    print(f"  Min chunk size: {min_chunk_size} characters")
    print(f"  Max chunk size: {max_chunk_size} characters")
    
    # Analyze metadata
    if sample_data['metadatas']:
        metadata_keys = set()
        for metadata in sample_data['metadatas']:
            metadata_keys.update(metadata.keys())
        
        print(f"\n🏷️ Metadata Analysis:")
        print(f"  Unique metadata keys: {sorted(metadata_keys)}")
    
    # Analyze embeddings
    if sample_data['embeddings'] and sample_data['embeddings'][0]:
        embedding_dim = len(sample_data['embeddings'][0])
        print(f"\n🧠 Embedding Analysis:")
        print(f"  Embedding dimensionality: {embedding_dim}")
        print(f"  Expected dimensionality: {standard_vector_store.dimensionality}")
        print(f"  ✅ Dimensions match" if embedding_dim == standard_vector_store.dimensionality else "❌ Dimension mismatch!")

# Storage analysis
storage_path = Path(standard_vector_store.persist_directory)
if storage_path.exists():
    print(f"\n💾 Storage Analysis:")
    total_size = 0
    file_count = 0
    
    for item in storage_path.rglob('*'):
        if item.is_file():
            size = item.stat().st_size
            total_size += size
            file_count += 1
            if size > 1024:  # Show files larger than 1 KB
                size_kb = size / 1024
                print(f"  {item.name}: {size_kb:.1f} KB")
    
    print(f"  Total files: {file_count}")
    print(f"  Total storage: {total_size / (1024 * 1024):.2f} MB")
    print(f"  Storage per chunk: {total_size / total_documents / 1024:.1f} KB" if total_documents > 0 else "  No chunks")


In [None]:
# Performance benchmarking
print("\n⚡ Performance Benchmarking")

import time

def benchmark_search(queries: List[str], n_runs: int = 3):
    """
    Benchmark search performance
    """
    times = []
    
    for run in range(n_runs):
        start_time = time.perf_counter()  # monotonic, higher resolution than time.time()
        
        for query in queries:
            collection.query(
                query_texts=[query],
                n_results=5,
                include=["documents", "metadatas"]
            )
        
        total_time = time.perf_counter() - start_time
        times.append(total_time)
    
    avg_time = sum(times) / len(times)
    avg_time_per_query = avg_time / len(queries)
    
    print(f"🏃 Benchmark Results ({n_runs} runs, {len(queries)} queries each):")
    print(f"  Average total time: {avg_time:.3f} seconds")
    print(f"  Average time per query: {avg_time_per_query:.3f} seconds")
    print(f"  Queries per second: {1/avg_time_per_query:.1f}")
    
    return avg_time_per_query

# Run benchmark
benchmark_queries = [
    "technology",
    "environment", 
    "programming",
    "sustainable energy",
    "machine learning algorithms"
]

avg_query_time = benchmark_search(benchmark_queries)

# Performance recommendations
print(f"\n💡 Performance Recommendations:")
if avg_query_time > 0.5:
    print(f"  ⚠️ Query time ({avg_query_time:.3f}s) is high. Consider:")
    print(f"     - Reducing embedding dimensionality")
    print(f"     - Using fewer chunks (larger chunk size)")
    print(f"     - Adding more RAM/faster storage")
elif avg_query_time > 0.1:
    print(f"  ✅ Query time ({avg_query_time:.3f}s) is acceptable")
    print(f"     - Consider optimization for production workloads")
else:
    print(f"  🚀 Query time ({avg_query_time:.3f}s) is excellent!")
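
Averages hide slow outliers, and outliers are often what users notice. The sketch below times each call individually and reports the median and 95th percentile alongside the mean. `measure_latency` is a hypothetical helper, and the lambda in the demo is a placeholder workload standing in for a real `collection.query` call.

```python
import math
import statistics
import time


def measure_latency(fn, inputs, n_runs=3):
    """Time fn(item) for every item, n_runs times, and summarise the samples."""
    samples = []
    for _ in range(n_runs):
        for item in inputs:
            start = time.perf_counter()
            fn(item)
            samples.append(time.perf_counter() - start)

    if not samples:
        raise ValueError("no samples collected; check inputs and n_runs")

    samples.sort()
    # Nearest-rank 95th percentile, clamped to the last index
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


# Placeholder workload; swap in a function that runs a real query
latency_stats = measure_latency(lambda q: sorted(q * 100), ["technology", "environment"])
for name, value in latency_stats.items():
    print(f"{name}: {value}")
```

With very few samples the p95 simply equals the slowest call, so the percentile becomes informative only once a few dozen queries are timed.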


In [None]:
# Best practices summary
print("\n📋 Best Practices Summary")

best_practices = {
    "Data Preparation": [
        "Clean and normalize text before processing",
        "Preserve important metadata for filtering", 
        "Consider document structure when chunking",
        "Remove or handle special characters appropriately"
    ],
    "Chunking Strategy": [
        "Balance chunk size vs. context preservation",
        "Use overlap to maintain context across chunks",
        "Consider domain-specific chunking (e.g., by paragraphs, sections)",
        "Test different chunk sizes for your use case"
    ],
    "Vector Store Configuration": [
        "Choose embedding model based on your domain",
        "Use appropriate dimensionality for your needs",
        "Configure batch sizes based on available resources",
        "Monitor embedding API costs and usage"
    ],
    "Search Optimization": [
        "Use similarity thresholds to filter low-quality results",
        "Implement metadata filtering for precise searches",
        "Consider multi-query approaches for better recall",
        "Cache frequently used search results"
    ],
    "Production Deployment": [
        "Use persistent storage volumes",
        "Implement backup and recovery strategies", 
        "Monitor performance and resource usage",
        "Plan for scaling as data grows",
        "Implement proper error handling and retries"
    ],
    "Cost Management": [
        "Use pre-computed embeddings when possible",
        "Batch embedding operations efficiently",
        "Monitor embedding API usage and costs",
        "Consider alternative embedding models for different use cases"
    ]
}

for category, practices in best_practices.items():
    print(f"\n🔧 {category}:")
    for practice in practices:
        print(f"   • {practice}")
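
To make the "multi-query approaches" item concrete: one common way to merge results from several phrasings of the same question is reciprocal rank fusion (RRF). The sketch below is library-independent. `reciprocal_rank_fusion` is a hypothetical helper, and the ID lists are made-up stand-ins for the `ids` returned by separate collection queries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Merge ranked ID lists; each hit contributes 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result IDs for three phrasings of the same question
results_per_query = [
    ["chunk_a", "chunk_b", "chunk_c"],
    ["chunk_b", "chunk_a", "chunk_d"],
    ["chunk_b", "chunk_e"],
]

fused = reciprocal_rank_fusion(results_per_query)
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.4f}")
```

Chunks that rank well across several phrasings rise to the top, which tends to improve recall at the cost of running one query per phrasing.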


## Step 8: Common Use Cases and Patterns

Let's explore some common patterns for different types of applications.

In [None]:
# Use Case 1: Document Q&A System
print("📚 Use Case 1: Document Q&A System")

def document_qa(question: str, n_context_chunks: int = 3):
    """
    Simple document Q&A using vector search + context
    """
    # Search for relevant chunks
    results = standard_vector_store.collection.query(
        query_texts=[question],
        n_results=n_context_chunks,
        include=["documents", "metadatas", "distances"]
    )
    
    if not results['documents'] or not results['documents'][0]:
        return "No relevant information found."
    
    # Combine context from top chunks
    context_parts = []
    for doc, metadata, distance in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ):
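        # Assumes the collection uses cosine distance, where similarity = 1 - distance.
        # With another metric (e.g. L2), adjust this conversion and the 0.5 threshold.
        similarity = 1 - distance
        if similarity > 0.5:  # Only use high-quality matches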
            title = metadata.get('document_title', 'Unknown Document')
            context_parts.append(f"From '{title}': {doc}")
    
    if not context_parts:
        return "No sufficiently relevant information found."
    
    # In a real system, you'd send this to an LLM
    context = "\n\n".join(context_parts)
    
    return {
        'question': question,
        'context': context,
        'sources': len(context_parts)
    }

# Test Q&A
qa_questions = [
    "What are the benefits of machine learning?",
    "How do solar panels work?",
    "What is quantum computing?"
]

for question in qa_questions:
    answer = document_qa(question)
    print(f"\n❓ Question: {question}")
    if isinstance(answer, dict):
        print(f"📖 Found context from {answer['sources']} sources")
        print(f"Context preview: {answer['context'][:200]}...")
    else:
        print(f"Answer: {answer}")
    print("-" * 40)
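
The function above stops at assembling context. As a sketch of the next step, the helper below turns a question and its retrieved context into a single grounded prompt. `build_qa_prompt`, the prompt wording, and the example payload are assumptions for illustration; the actual LLM call is left out because it depends on the model client you use.

```python
def build_qa_prompt(question: str, context: str, max_context_chars: int = 4000) -> str:
    """Combine retrieved context and a question into one grounded prompt."""
    trimmed = context[:max_context_chars]  # Crude guard against oversized prompts
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{trimmed}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up payload shaped like the dict returned by document_qa
example = {
    "question": "How do solar panels work?",
    "context": "From 'Solar Basics': Photovoltaic cells convert sunlight into electricity.",
    "sources": 1,
}

prompt = build_qa_prompt(example["question"], example["context"])
print(prompt)
```

Truncating by characters is only a rough limit; a production system would more likely budget by tokens for the chosen model.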


In [None]:
# Use Case 2: Content Recommendation System
print("\n🎯 Use Case 2: Content Recommendation System")

def recommend_similar_content(document_id: str, n_recommendations: int = 3):
    """
    Recommend similar content based on a document
    """
    # Find the original document
    original_doc = standard_vector_store.collection.get(
        where={"document_id": document_id},
        include=["documents", "metadatas"]
    )
    
    if not original_doc['documents']:
        return f"Document {document_id} not found"
    
    # Use the first stored chunk of the document as the query text
    query_text = original_doc['documents'][0]
    
    # Search for similar documents (excluding the original)
    results = standard_vector_store.collection.query(
        query_texts=[query_text],
        n_results=n_recommendations + 5,  # Get extra to filter out original
        include=["documents", "metadatas", "distances"]
    )
    
    recommendations = []
    # Track seen IDs so the original is skipped and no document is recommended twice
    seen_ids = {document_id}
    for doc, metadata, distance in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ):
        candidate_id = metadata.get('document_id')
        # Several chunks can belong to the same document; keep only the best-ranked one
        if candidate_id in seen_ids:
            continue
        seen_ids.add(candidate_id)

        similarity = 1 - distance  # Assumes cosine distance
        recommendations.append({
            'document_id': candidate_id,
            'title': metadata.get('document_title', 'Unknown'),
            'similarity': similarity,
            'category': metadata.get('category', 'Unknown'),
            'preview': doc[:100] + "..."
        })

        if len(recommendations) >= n_recommendations:
            break
    
    return recommendations

# Test recommendations
print(f"🔍 Testing content recommendations:")

# Get a document ID from our collection
sample_docs = standard_vector_store.collection.get(
    limit=1,
    include=["metadatas"]
)

if sample_docs['metadatas']:
    test_doc_id = sample_docs['metadatas'][0].get('document_id')
    recommendations = recommend_similar_content(test_doc_id)
    
    print(f"\n📄 Recommendations for document: {test_doc_id}")
    
    if isinstance(recommendations, list):
        for i, rec in enumerate(recommendations):
            print(f"\n{i+1}. {rec['title']} (similarity: {rec['similarity']:.3f})")
            print(f"   Category: {rec['category']}")
            print(f"   Preview: {rec['preview']}")
    else:
        print(recommendations)
else:
    print("No documents available for testing")
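
The `similarity = 1 - distance` conversion used in these examples holds for cosine distance. ChromaDB collections can also be created with squared L2 (`l2`) or inner-product (`ip`) distance, and which one applies here depends on how the collection was configured. The helper below is a sketch under stated assumptions: `distance_to_similarity` is a hypothetical name, the formulas should be checked against the ChromaDB documentation, and the `l2` branch is only valid for unit-normalized embeddings.

```python
def distance_to_similarity(distance: float, space: str = "cosine") -> float:
    """Convert a vector-store distance into a similarity score for the given metric."""
    if space in ("cosine", "ip"):
        # cosine: distance = 1 - cosine similarity; ip: distance = 1 - inner product
        return 1.0 - distance
    if space == "l2":
        # For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity
        return 1.0 - distance / 2.0
    raise ValueError(f"Unsupported distance space: {space!r}")


for space in ("cosine", "l2"):
    print(space, distance_to_similarity(0.5, space))
```

Applying a fixed cutoff such as `0.5` only makes sense once the conversion matches the metric; with unnormalized L2 distances the values are unbounded and a threshold needs to be chosen empirically.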


In [None]:
# Use Case 3: Content Classification and Tagging
print("\n🏷️ Use Case 3: Content Classification and Tagging")

from typing import Optional

def auto_tag_content(text: str, existing_tags: Optional[List[str]] = None):
    """
    Automatically suggest tags for content based on similar documents.

    Note: existing_tags is not yet used to filter the suggestions.
    """
    if existing_tags is None:
        existing_tags = ["technology", "environment", "lifestyle", "education", "business"]
    
    # Search for similar content
    results = standard_vector_store.collection.query(
        query_texts=[text],
        n_results=10,
        include=["documents", "metadatas", "distances"]
    )
    
    if not results['documents'] or not results['documents'][0]:
        return []
    
    # Analyze tags from similar documents
    tag_scores = {}
    
    for doc, metadata, distance in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ):
        similarity = 1 - distance  # Assumes cosine distance
        if similarity < 0.3:  # Skip low-similarity results
            continue
            
        # Extract tags from metadata
        doc_category = metadata.get('category', '')
        doc_tags = metadata.get('tags', [])
        
        # Score category
        if doc_category:
            tag_scores[doc_category] = tag_scores.get(doc_category, 0) + similarity
        
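        # Score individual tags. Some stores only keep scalar metadata, so also accept
        # a comma-separated string (an assumption about how tags were saved).
        if isinstance(doc_tags, str):
            doc_tags = [t.strip() for t in doc_tags.split(',') if t.strip()]
        if isinstance(doc_tags, list):
            for tag in doc_tags:
                tag_scores[tag] = tag_scores.get(tag, 0) + similarity * 0.5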
    
    # Sort tags by score
    suggested_tags = sorted(tag_scores.items(), key=lambda x: x[1], reverse=True)
    
    return suggested_tags[:5]  # Return top 5 suggestions

# Test auto-tagging
test_texts = [
    "Deep learning neural networks are revolutionizing computer vision and natural language processing applications.",
    "Solar energy and wind power are becoming more cost-effective alternatives to fossil fuels for electricity generation.",
    "Container gardening allows urban dwellers to grow vegetables and herbs in small apartment spaces."
]

for i, text in enumerate(test_texts):
    print(f"\n📝 Text {i+1}: {text[:80]}...")
    
    suggested_tags = auto_tag_content(text)
    
    print("🏷️ Suggested tags:")
    for tag, score in suggested_tags:
        # Scores are summed similarities, so they can exceed 1
        print(f"   {tag} (score: {score:.2f})")
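
The values returned by `auto_tag_content` are sums of similarities, so they are relative weights rather than probabilities. If a bounded number is easier to display, one option is to normalize the scores into shares that sum to 1. This is a sketch: `normalize_tag_scores` is a hypothetical helper and the example scores are made up.

```python
from typing import List, Tuple


def normalize_tag_scores(tag_scores: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rescale (tag, score) pairs so the scores sum to 1, preserving their order."""
    total = sum(score for _, score in tag_scores)
    if total <= 0:
        return []  # Nothing meaningful to normalize
    return [(tag, score / total) for tag, score in tag_scores]


# Made-up scores in the same (tag, score) shape that auto_tag_content returns
example_scores = [("technology", 2.4), ("education", 1.2), ("business", 0.4)]

for tag, share in normalize_tag_scores(example_scores):
    print(f"   {tag} (share: {share:.0%})")
```

A share describes how a tag compares with the other suggestions for the same text; it does not say how likely the tag is to be correct.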


## Step 9: Cleanup and Next Steps

Let's clean up our tutorial resources and discuss production considerations.

In [None]:
# Cleanup and summary
print("🧹 Cleanup and Summary")

# Show what was created
print(f"\n📁 Files created during tutorial:")
print(f"  JSON data: {json_file_path}")
print(f"  Vector store: {standard_vector_store.persist_directory}")
print(f"  Chunks storage: {standard_vector_store.arrow_save_dir}")

# Storage usage
storage_path = Path(temp_dir)
total_size = sum(f.stat().st_size for f in storage_path.rglob('*') if f.is_file())
print(f"  Total storage used: {total_size / (1024 * 1024):.2f} MB")

# Final statistics
final_count = standard_vector_store.collection.count()
print(f"\n📊 Final Statistics:")
print(f"  Documents processed: {len(input_documents)}")
print(f"  Chunks created: {final_count}")
print(f"  Vector store collection: {standard_vector_store.collection.name}")

print("\n🗑️ Cleanup:")
print(f"To remove tutorial files: rm -rf {temp_dir}")
print("(Uncomment the lines below to auto-cleanup)")

# Uncomment to actually clean up
# import shutil
# shutil.rmtree(temp_dir)
# print("✅ Tutorial files cleaned up")


In [None]:
# Production readiness checklist
print("\n🚀 Production Readiness Checklist")

production_checklist = {
    "✅ Essential": [
        "Set up persistent storage volumes",
        "Configure backup and recovery procedures",
        "Implement proper error handling and logging",
        "Set up monitoring and alerting",
        "Secure API keys and credentials",
        "Test with production data volumes"
    ],
    "⚡ Performance": [
        "Optimize chunk sizes for your domain",
        "Configure batch sizes based on resources",
        "Implement caching for frequent queries",
        "Consider using pre-computed embeddings",
        "Set up load balancing if needed",
        "Monitor embedding API usage and costs"
    ],
    "🔒 Security": [
        "Implement authentication and authorization",
        "Encrypt data at rest and in transit", 
        "Set up network security (VPC, firewalls)",
        "Regular security audits and updates",
        "Data privacy compliance (GDPR, etc.)",
        "Secure API endpoints"
    ],
    "📈 Scalability": [
        "Plan for horizontal scaling",
        "Implement data partitioning strategies",
        "Set up auto-scaling policies",
        "Consider distributed vector stores",
        "Plan for disaster recovery",
        "Monitor resource utilization"
    ]
}

for category, items in production_checklist.items():
    print(f"\n{category}:")
    for item in items:
        print(f"   □ {item}")

print(f"\n🎓 What You've Learned:")
learning_outcomes = [
    "How to load JSON data using Buttermilk's data loaders",
    "Different vector store configuration strategies",
    "Text chunking techniques for optimal search",
    "Embedding generation and storage workflows",
    "Semantic search implementation and optimization",
    "Common use cases and design patterns",
    "Performance analysis and best practices",
    "Production deployment considerations"
]

for outcome in learning_outcomes:
    print(f"   ✓ {outcome}")

print(f"\n📚 Additional Resources:")
print(f"   • OSB Vector Database Guide: /docs/osb_vector_database_guide.md")
print(f"   • OSB Example Notebook: /examples/osb_vector_example.ipynb")
print(f"   • Buttermilk Documentation: Check the /docs directory")
print(f"   • Vector Store Tests: /tests/flows/test_embeddings.py")
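
The "error handling" item in the checklist often comes down to retrying transient failures, for example a rate-limited embedding request. Below is a minimal sketch of exponential backoff with jitter. `retry_with_backoff` and `flaky_embedding_call` are hypothetical names, and catching every `Exception` is a simplification; real code would retry only errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Call operation, retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying at the same moment
            time.sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_embedding_call() -> str:
    """Stand-in for a remote call that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "embedding ready"


print(retry_with_backoff(flaky_embedding_call, base_delay=0.01))
print(f"Attempts needed: {attempts['count']}")
```

Capping the delay with `max_delay` keeps a long outage from producing waits of several minutes between attempts.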


## Conclusion

🎉 **Congratulations!** You've successfully completed the JSON-to-Vector Database tutorial.

### Key Takeaways

1. **Flexible Data Loading**: Buttermilk's data loading infrastructure supports various JSON formats and sources
2. **Configurable Vector Stores**: The ChromaDB integration provides persistent vector search with configurable embedding and chunking settings
3. **Chunking Strategies**: Different approaches optimize for different use cases (precision vs. context)
4. **Search Optimization**: Multiple techniques improve search quality and performance
5. **Production Ready**: Built-in support for async processing, error handling, and monitoring

### Next Steps

- **Experiment** with your own JSON datasets
- **Optimize** chunking and embedding strategies for your domain
- **Integrate** with Buttermilk flows for automated processing
- **Scale** to production with proper infrastructure
- **Monitor** performance and costs in production

### Architecture Benefits

Buttermilk's vector database infrastructure provides:

- 🔧 **Modular Design**: Mix and match components as needed
- 🚀 **Async Processing**: Scalable for large datasets
- ☁️ **Cloud Integration**: Native support for GCS, BigQuery, etc.
- 🔍 **Rich Metadata**: Powerful filtering and search capabilities
- 📊 **Observable**: Built-in logging and tracing
- 🛡️ **Robust**: Error handling and retry mechanisms

You're now ready to build search and AI applications with your JSON data! 🚀