# Vector Databases with Weaviate: Indexing and Retrieval at Scale

**Yale AI Research Techniques Workshop**  
*Text Embeddings and Classification for Entity Resolution*

## Learning Objectives
- Understand why vector databases are essential for large-scale similarity search
- Set up Weaviate with OpenAI embeddings integration
- Index library metadata using proper schema design
- Perform efficient similarity queries for entity resolution
- Apply these techniques to real bibliographic data challenges

---

## Introduction: Why Vector Databases?

In our previous notebook, we explored Word2Vec and learned that words become vectors. But what happens when you have millions of these vectors and need to find similar ones quickly?

**The Challenge**: Yale Library has 17.6 million catalog records. If each record becomes a vector with 1,536 dimensions (like OpenAI's embeddings), that's roughly 27 billion numbers to search through. A brute-force scan has to touch every one of them for each query, which quickly becomes impractical when millions of records each need their own search.

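**The Solution**: Vector databases like Weaviate use approximate nearest-neighbor indexes such as HNSW (Hierarchical Navigable Small World) to organize vectors so that similarity search stays fast. Instead of checking every single vector, HNSW follows a layered graph of neighbor links and visits only a small fraction of the data, so it can typically return the most similar vectors in milliseconds, at the cost of occasionally missing a true nearest neighbor.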

Think of it like the difference between searching through a messy pile of papers versus using a well-organized library catalog system.
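
To make the scale concrete, the arithmetic can be sketched in a few lines. The record count and dimensionality come from the text above; storing each number as a 4-byte float32 is an assumption made only for this estimate.

```python
# Back-of-the-envelope sizing for a brute-force search over the catalog.
NUM_RECORDS = 17_600_000   # catalog records (figure quoted in the text)
DIMENSIONS = 1_536         # embedding size (figure quoted in the text)
BYTES_PER_FLOAT = 4        # assumption: vectors stored as float32

# Every record contributes DIMENSIONS numbers.
total_numbers = NUM_RECORDS * DIMENSIONS

# Raw storage for the vectors alone, ignoring index and metadata overhead.
total_gigabytes = total_numbers * BYTES_PER_FLOAT / 1e9

print(f"Numbers stored: {total_numbers:,}")
print(f"Raw vector storage: {total_gigabytes:.0f} GB")
```

A brute-force query would have to read all of these numbers once, which is why an index that skips most of them matters.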

## Part 1: Setting Up Our Environment

We'll work with both OpenAI (for generating embeddings) and Weaviate (for storing and searching them). It's crucial to understand that these are separate services working together.

In [None]:
# Install required packages
!pip install weaviate-client openai python-dotenv numpy pandas matplotlib seaborn

In [9]:
# Import essential libraries
import weaviate
from weaviate.classes.config import Configure, Property, DataType, VectorDistances
from weaviate.classes.query import Filter, MetadataQuery
from weaviate.util import generate_uuid5
import openai
from openai import OpenAI
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
import json
from typing import List, Dict, Any
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"Weaviate client version: {weaviate.__version__}")

✅ Libraries imported successfully!
Weaviate client version: 4.8.0


## Part 2: Understanding the Two-Client Architecture

**Critical Concept**: We use TWO separate clients:
1. **OpenAI Client**: Converts text into embedding vectors
2. **Weaviate Client**: Stores vectors and performs similarity search

This separation allows us to use OpenAI's powerful embedding models while leveraging Weaviate's efficient vector search capabilities.

In [10]:
# Configuration - read the API key from the environment rather than hardcoding it
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your-openai-api-key-here")  # Placeholder fallback for the demo
WEAVIATE_URL = "http://localhost:8080"  # Local Weaviate instance

# For this workshop, we'll use mock data to demonstrate concepts
# In production, you'd have real API keys

def setup_openai_client():
    """Initialize OpenAI client for generating embeddings."""
    # In a real scenario, you'd use: OpenAI(api_key=OPENAI_API_KEY)
    # For demo purposes, we'll simulate this
    print("🔗 OpenAI Client: Handles text → vector conversion")
    print("   • Sends text to OpenAI's text-embedding-3-small model")
    print("   • Receives 1,536-dimensional vectors back")
    print("   • Cost: $0.02 per 1 million tokens")
    return "openai_client_placeholder"

def setup_weaviate_client():
    """Initialize Weaviate client for vector storage and search."""
    try:
        # In production, this would connect to actual Weaviate instance
        print("🗄️  Weaviate Client: Handles vector storage and similarity search")
        print("   • Stores vectors with metadata in collections")
        print("   • Performs fast similarity search using HNSW algorithm")
        print("   • Can integrate directly with OpenAI for automatic embedding")
        
        # For demo, we'll simulate the connection
        # Real code for a local instance (default ports 8080 HTTP / 50051 gRPC) would be:
        # client = weaviate.connect_to_local(
        #     headers={"X-OpenAI-Api-Key": OPENAI_API_KEY}  # Key used by the OpenAI vectorizer
        # )
        # ... and client.close() once you are finished
        
        return "weaviate_client_placeholder"
    except Exception as e:
        print(f"⚠️  Could not connect to Weaviate (expected in demo): {e}")
        print("   This is normal - we'll demonstrate with mock data")
        return None

# Initialize both clients
openai_client = setup_openai_client()
weaviate_client = setup_weaviate_client()

print("\n💡 Key Insight: These two clients work together but serve different purposes!")

🔗 OpenAI Client: Handles text → vector conversion
   • Sends text to OpenAI's text-embedding-3-small model
   • Receives 1,536-dimensional vectors back
   • Cost: $0.02 per 1 million tokens
🗄️  Weaviate Client: Handles vector storage and similarity search
   • Stores vectors with metadata in collections
   • Performs fast similarity search using HNSW algorithm
   • Can integrate directly with OpenAI for automatic embedding

💡 Key Insight: These two clients work together but serve different purposes!


## Part 3: Designing Our Schema for Library Metadata

Before we can store vectors, we need to design a schema that captures the structure of our library metadata. This is like designing the blueprint for our vector database tables.

In [11]:
def design_entity_schema():
    """Design the schema for our EntityString collection.
    
    This schema is based on Yale's actual entity resolution pipeline.
    Each record stores a text string along with its embedding vector.
    """
    
    schema_design = {
        "collection_name": "EntityString",
        "description": "Collection for entity string values with their embeddings",
        "properties": [
            {
                "name": "original_string",
                "type": "TEXT",
                "description": "The actual text (e.g., 'Schubert, Franz')"
            },
            {
                "name": "hash_value", 
                "type": "TEXT",
                "description": "Unique identifier for deduplication"
            },
            {
                "name": "field_type",
                "type": "TEXT", 
                "description": "What kind of field: 'person', 'title', 'composite'"
            },
            {
                "name": "frequency",
                "type": "INT",
                "description": "How many times this string appears in the catalog"
            }
        ],
        "vector_config": {
            "model": "text-embedding-3-small",
            "dimensions": 1536,
            "distance_metric": "COSINE"
        }
    }
    
    return schema_design

# Display our schema design
schema = design_entity_schema()
print("📋 Schema Design for EntityString Collection")
print("="*50)
print(f"Collection: {schema['collection_name']}")
print(f"Purpose: {schema['description']}")
print("\nProperties (metadata we store with each vector):")

for prop in schema['properties']:
    print(f"  • {prop['name']} ({prop['type']}): {prop['description']}")

print("\nVector Configuration:")
for key, value in schema['vector_config'].items():
    print(f"  • {key}: {value}")

print("\n💡 Why This Design?")
print("• original_string: The actual text we want to search")
print("• hash_value: Prevents storing duplicate strings")
print("• field_type: Allows filtering by person names, titles, etc.")
print("• frequency: Helps with relevance ranking")
print("• Cosine distance: Best for text embeddings (as we learned earlier)")

📋 Schema Design for EntityString Collection
Collection: EntityString
Purpose: Collection for entity string values with their embeddings

Properties (metadata we store with each vector):
  • original_string (TEXT): The actual text (e.g., 'Schubert, Franz')
  • hash_value (TEXT): Unique identifier for deduplication
  • field_type (TEXT): What kind of field: 'person', 'title', 'composite'
  • frequency (INT): How many times this string appears in the catalog

Vector Configuration:
  • model: text-embedding-3-small
  • dimensions: 1536
  • distance_metric: COSINE

💡 Why This Design?
• original_string: The actual text we want to search
• hash_value: Prevents storing duplicate strings
• field_type: Allows filtering by person names, titles, etc.
• frequency: Helps with relevance ranking
• Cosine distance: Best for text embeddings (as we learned earlier)
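
The `COSINE` setting in the vector configuration can be illustrated with a small worked example. This is a minimal sketch in plain Python with made-up three-dimensional vectors; the helper names `cosine_similarity` and `cosine_distance` are hypothetical and not part of Weaviate's API.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance is 1 minus cosine similarity (0 = same direction)."""
    return 1.0 - cosine_similarity(a, b)

v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # same direction as v, twice the length
orthogonal = [3.0, 0.0, -1.0]  # dot product with v is 3 + 0 - 3 = 0

print(f"v vs scaled:     {cosine_distance(v, scaled):.3f}")
print(f"v vs orthogonal: {cosine_distance(v, orthogonal):.3f}")
```

Because only direction matters, a vector and a rescaled copy of it have distance 0, which is one reason cosine distance is a common choice for text embeddings.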


## Part 4: Creating the Collection with Integrated OpenAI Embeddings

Weaviate can automatically generate embeddings using OpenAI's API. This means we only send text to Weaviate, and it handles the embedding generation behind the scenes.

In [12]:
def create_collection_with_openai_integration():
    """Create a Weaviate collection that automatically generates OpenAI embeddings.
    
    This demonstrates the v4 Weaviate Python client syntax.
    """
    
    # This is the actual code you'd use with a real Weaviate instance
    creation_code = '''
# Real Weaviate v4 collection creation code:
collection = weaviate_client.collections.create(
    name="EntityString",
    description="Collection for entity string values with their embeddings",
    
    # Configure automatic OpenAI embedding generation
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=1536
    ),
    
    # Configure the vector index for fast similarity search
    vector_index_config=Configure.VectorIndex.hnsw(
        ef=128,                    # Higher = more accurate, slower
        max_connections=64,        # Higher = more memory, faster search
        ef_construction=128,       # Higher = better quality index
        distance_metric=VectorDistances.COSINE
    ),
    
    # Define the metadata properties
    properties=[
        Property(name="original_string", data_type=DataType.TEXT),
        Property(name="hash_value", data_type=DataType.TEXT),
        Property(name="field_type", data_type=DataType.TEXT),
        Property(name="frequency", data_type=DataType.INT)
    ]
)
'''
    
    print("🏗️  Collection Creation Process")
    print("="*40)
    print(creation_code)
    
    print("\n🔧 Configuration Explained:")
    print("\n1. Vectorizer Configuration:")
    print("   • Tells Weaviate to use OpenAI's text-embedding-3-small")
    print("   • Automatically converts text to 1,536-dimensional vectors")
    print("   • No need to call OpenAI API manually!")
    
    print("\n2. Vector Index Configuration (HNSW):")
    print("   • ef=128: Search quality parameter (higher = more accurate)")
    print("   • max_connections=64: Memory vs speed tradeoff")
    print("   • ef_construction=128: Index building quality")
    print("   • COSINE distance: Best for text embeddings")
    
    print("\n3. Properties:")
    print("   • Metadata that gets stored alongside each vector")
    print("   • Allows filtering and retrieval of original information")
    
    return "collection_created_successfully"

# Demonstrate the collection creation process
collection_status = create_collection_with_openai_integration()

print("\n✅ Collection design complete!")
print("In a real scenario, Weaviate would now be ready to accept data.")

🏗️  Collection Creation Process

# Real Weaviate v4 collection creation code:
collection = weaviate_client.collections.create(
    name="EntityString",
    description="Collection for entity string values with their embeddings",
    
    # Configure automatic OpenAI embedding generation
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=1536
    ),
    
    # Configure the vector index for fast similarity search
    vector_index_config=Configure.VectorIndex.hnsw(
        ef=128,                    # Higher = more accurate, slower
        max_connections=64,        # Higher = more memory, faster search
        ef_construction=128,       # Higher = better quality index
        distance_metric=VectorDistances.COSINE
    ),
    
    # Define the metadata properties
    properties=[
        Property(name="original_string", data_type=DataType.TEXT),
        Property(name="hash_value", data_type=DataType.TEXT),
        ... [output truncated]

## Part 5: Preparing Library Metadata for Indexing

Let's work with realistic examples from your entity resolution dataset. We'll create sample records that represent the Franz Schubert disambiguation challenge.

In [13]:
def create_sample_library_data():
    """Create sample library metadata following the actual entity resolution format.
    
    This represents the kind of data that would be indexed in Weaviate.
    """
    
    # Sample records based on your actual entity resolution dataset
    sample_records = [
        {
            "identity": "9.1",
            "composite": "Title: Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Methode Subjects: Photography in archaeology Provision information: Mainz: P. von Zabern, 1978",
            "person": "Schubert, Franz",
            "title": "Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Methode",
            "subjects": "Photography in archaeology",
            "attribution": "ausgewählt von Franz Schubert und Susanne Grunauer-von Hoerschelmann",
            "provision": "Mainz: P. von Zabern, 1978",
            "actual_person": "Franz August Schubert (1805-1893), German artist"
        },
        {
            "identity": "12.7",
            "composite": "Title: Symphony No. 8 in B minor 'Unfinished' Subjects: Symphonies, Classical music Person: Austrian composer 1797-1828",
            "person": "Schubert, Franz, 1797-1828",
            "title": "Symphony No. 8 in B minor 'Unfinished'", 
            "subjects": "Symphonies--19th century, Classical music",
            "attribution": "composed by Franz Schubert",
            "provision": "Vienna, 1822",
            "actual_person": "Franz Schubert (1797-1828), Austrian composer"
        },
        {
            "identity": "15.3",
            "composite": "Title: Die schöne Müllerin Subjects: Songs, German Lieder, Poetry settings Person: Austrian composer",
            "person": "Schubert, Franz, 1797-1828",
            "title": "Die schöne Müllerin",
            "subjects": "Songs, German--19th century, Lieder",
            "attribution": "music by Franz Schubert; texts by Wilhelm Müller",
            "provision": "Vienna, 1823",
            "actual_person": "Franz Schubert (1797-1828), Austrian composer"
        },
        {
            "identity": "18.9",
            "composite": "Title: Medieval manuscripts and illumination Subjects: Art history, Manuscripts Person: Franz Schubert German scholar",
            "person": "Schubert, Franz",
            "title": "Medieval manuscripts and illumination techniques",
            "subjects": "Art history, Manuscript illumination, Medieval art",
            "attribution": "by Franz Schubert",
            "provision": "Munich: Beck, 1881",
            "actual_person": "Franz August Schubert (1805-1893), German artist"
        }
    ]
    
    return sample_records

def prepare_indexing_data(records):
    """Transform library records into the format needed for Weaviate indexing.
    
    This mimics the preprocessing step in your actual pipeline.
    """
    
    indexing_items = []
    
    for record in records:
        # For each record, we create multiple indexing items for different field types
        # This allows us to search by person, title, or composite fields
        
        fields_to_index = {
            'person': record['person'],
            'title': record['title'], 
            'composite': record['composite']
        }
        
        for field_type, text_value in fields_to_index.items():
            if text_value and text_value.strip():  # Only index non-empty values
                
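                # Create a hash for deduplication (simplified version).
                # Caution: Python salts str hashes per process, so hash() values
                # change between runs; a real pipeline needs a stable digest (e.g. hashlib).
                hash_value = f"hash_{hash(text_value)}_{field_type}"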
                
                indexing_items.append({
                    'original_string': text_value,
                    'hash_value': hash_value,
                    'field_type': field_type,
                    'frequency': 1,  # In real pipeline, this would be calculated
                    'source_record': record['identity'],  # For tracking
                    'actual_person': record['actual_person']  # For our analysis
                })
    
    return indexing_items

# Create and prepare our sample data
library_records = create_sample_library_data()
indexing_data = prepare_indexing_data(library_records)

print("📚 Sample Library Records Created")
print("="*40)
print(f"Total bibliographic records: {len(library_records)}")
print(f"Total indexing items: {len(indexing_data)}")
print(f"Items per record: {len(indexing_data) / len(library_records):.1f} (person + title + composite)")

# Show a sample indexing item
print("\n📋 Sample Indexing Item:")
sample_item = indexing_data[0]
for key, value in sample_item.items():
    if key == 'original_string' and len(str(value)) > 60:
        print(f"  {key}: {str(value)[:60]}...")
    else:
        print(f"  {key}: {value}")

print("\n💡 This shows how one bibliographic record becomes multiple searchable items!")

📚 Sample Library Records Created
Total bibliographic records: 4
Total indexing items: 12
Items per record: 3.0 (person + title + composite)

📋 Sample Indexing Item:
  original_string: Schubert, Franz
  hash_value: hash_-7776589006278226991_person
  field_type: person
  frequency: 1
  source_record: 9.1
  actual_person: Franz August Schubert (1805-1893), German artist

💡 This shows how one bibliographic record becomes multiple searchable items!
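
The `hash_value` in the sample output comes from Python's built-in `hash()`, which is randomized per interpreter process for strings unless `PYTHONHASHSEED` is fixed, so it will differ between runs. A stable alternative can be sketched with `hashlib`. The function name `stable_hash`, the choice of SHA-256, and the 16-character truncation are assumptions for illustration, not the pipeline's actual method.

```python
import hashlib

def stable_hash(text: str, field_type: str) -> str:
    """Return an identifier that is the same on every run and every machine."""
    # SHA-256 over the UTF-8 bytes is deterministic, unlike built-in hash().
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Truncating to 16 hex characters keeps the example readable (assumption).
    return f"hash_{digest[:16]}_{field_type}"

first = stable_hash("Schubert, Franz", "person")
second = stable_hash("Schubert, Franz", "person")
other = stable_hash("Schubert, Franz, 1797-1828", "person")

print(first == second)  # identical strings always map to the same identifier
print(first == other)   # different strings map to different identifiers
```

With a deterministic digest, re-running the preprocessing step yields the same identifiers, which deduplication across runs depends on.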


## Part 6: The Indexing Process

Now let's understand how data gets indexed into Weaviate. In your actual pipeline, this happens automatically, but let's break down each step for learning.

In [14]:
def demonstrate_batch_indexing_process(indexing_data):
    """Show how batch indexing works in Weaviate v4.
    
    This demonstrates the actual code pattern from your pipeline.
    """
    
    indexing_code = '''
# Real Weaviate v4 batch indexing code:
collection = weaviate_client.collections.get("EntityString")

# Use batch processing for efficiency
with collection.batch.fixed_size(batch_size=100) as batch_writer:
    for item in indexing_data:
        # Generate consistent UUID for deduplication
        uuid_input = f"{item['hash_value']}_{item['field_type']}"
        uuid = generate_uuid5(uuid_input)
        
        # Add object to batch (Weaviate will auto-generate embedding)
        batch_writer.add_object(
            properties={
                "original_string": item['original_string'],
                "hash_value": item['hash_value'],
                "field_type": item['field_type'],
                "frequency": item['frequency']
            },
            uuid=uuid
        )
        # Note: No vector needed! OpenAI integration handles it automatically
'''
    
    print("⚡ Batch Indexing Process")
    print("="*30)
    print(indexing_code)
    
    print("\n🔄 What Happens During Indexing:")
    print("\n1. Text Processing:")
    print("   • Weaviate receives the original_string")
    print("   • Sends text to OpenAI's embedding API")
    print("   • Receives back 1,536-dimensional vector")
    
    print("\n2. Vector Storage:")
    print("   • Vector gets stored in HNSW index structure")
    print("   • Metadata (properties) stored alongside")
    print("   • UUID ensures no duplicates")
    
    print("\n3. Index Building:")
    print("   • HNSW algorithm creates navigation structure")
    print("   • Similar vectors get connected in graph")
    print("   • Enables fast similarity search later")
    
    return True

def simulate_indexing_costs(indexing_data):
    """Calculate the cost of indexing our sample data."""
    
    # Estimate tokens for cost calculation
    total_chars = sum(len(item['original_string']) for item in indexing_data)
    estimated_tokens = total_chars // 4  # Rough estimation: 4 chars per token
    
    # OpenAI pricing
    cost_per_million_tokens = 0.02
    estimated_cost = (estimated_tokens / 1_000_000) * cost_per_million_tokens
    
    print(f"\n💰 Indexing Cost Estimate:")
    print(f"  • Total characters: {total_chars:,}")
    print(f"  • Estimated tokens: {estimated_tokens:,}")
    print(f"  • Estimated cost: ${estimated_cost:.4f}")
    print(f"  • Cost per item: ${estimated_cost/len(indexing_data):.6f}")
    
    return estimated_cost

# Demonstrate the indexing process
demonstrate_batch_indexing_process(indexing_data)
simulate_indexing_costs(indexing_data)

print("\n✅ Indexing process complete!")
print("In production, Weaviate would now contain searchable vectors for all our library metadata.")

⚡ Batch Indexing Process

# Real Weaviate v4 batch indexing code:
collection = weaviate_client.collections.get("EntityString")

# Use batch processing for efficiency
with collection.batch.fixed_size(batch_size=100) as batch_writer:
    for item in indexing_data:
        # Generate consistent UUID for deduplication
        uuid_input = f"{item['hash_value']}_{item['field_type']}"
        uuid = generate_uuid5(uuid_input)
        
        # Add object to batch (Weaviate will auto-generate embedding)
        batch_writer.add_object(
            properties={
                "original_string": item['original_string'],
                "hash_value": item['hash_value'],
                "field_type": item['field_type'],
                "frequency": item['frequency']
            },
            uuid=uuid
        )
        # Note: No vector needed! OpenAI integration handles it automatically


🔄 What Happens During Indexing:

1. Text Processing:
   • Weaviate receives the original_string
   ... [output truncated]
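
The deduplication idea behind the deterministic UUID can be sketched with the standard library alone. This example uses `uuid.uuid5` directly and a plain dictionary as a stand-in for the collection; the namespace choice and the helper name `deterministic_id` are assumptions, so the IDs it produces should not be expected to match those from Weaviate's `generate_uuid5`.

```python
import uuid

# Assumption: any fixed namespace works, as long as it never changes.
NAMESPACE = uuid.NAMESPACE_DNS

def deterministic_id(hash_value: str, field_type: str) -> uuid.UUID:
    """Derive the same UUID every time from the same input string."""
    return uuid.uuid5(NAMESPACE, f"{hash_value}_{field_type}")

items = [
    {"hash_value": "abc", "field_type": "person", "original_string": "Schubert, Franz"},
    {"hash_value": "abc", "field_type": "person", "original_string": "Schubert, Franz"},  # repeat
    {"hash_value": "def", "field_type": "title", "original_string": "Die schöne Müllerin"},
]

# The dict plays the role of the collection: writing an existing key
# replaces the earlier entry instead of adding a second copy.
store = {}
for item in items:
    store[deterministic_id(item["hash_value"], item["field_type"])] = item

print(f"Items submitted: {len(items)}, objects stored: {len(store)}")
```

Because the ID depends only on the content, indexing the same string twice addresses the same object rather than creating a duplicate.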

## Part 7: Similarity Search and Retrieval

Now comes the exciting part: finding similar entities! This is where the magic of vector search really shines. Let's explore how to query our indexed data.

In [None]:
def demonstrate_similarity_search():
    """Show how to perform similarity search in Weaviate v4.
    
    This demonstrates the querying patterns from your actual pipeline.
    """
    
    search_code = '''
# Real Weaviate v4 similarity search code:
collection = weaviate_client.collections.get("EntityString")

# Method 1: Search by text (let Weaviate handle embedding)
result = collection.query.near_text(
    query="Schubert composer Austrian music",
    filters=Filter.by_property("field_type").equal("person"),
    limit=5,
    return_properties=["original_string", "field_type", "frequency"],
    return_metadata=MetadataQuery(distance=True)
)

# Method 2: Search by vector (if you have pre-computed embedding)
result = collection.query.near_vector(
    near_vector=query_vector,  # 1536-dimensional vector
    filters=Filter.by_property("field_type").equal("composite"),
    limit=10,
    return_properties=["original_string", "hash_value"],
    return_metadata=MetadataQuery(distance=True)  # distance is the relevant metadata for vector search
)
'''
    
    print("🔍 Similarity Search Methods")
    print("="*35)
    print(search_code)
    
    print("\n🎯 Search Method Comparison:")
    print("\n1. near_text():")
    print("   • Send plain text query")
    print("   • Weaviate converts to vector automatically")
    print("   • Good for exploring and testing")
    
    print("\n2. near_vector():")
    print("   • Send pre-computed vector")
    print("   • Faster (no embedding API call)")
    print("   • Used in your production pipeline")
    
    print("\n🔧 Query Components:")
    print("   • filters: Restrict search to specific field types")
    print("   • limit: Control number of results returned")
    print("   • return_properties: Which metadata to include")
    print("   • return_metadata: Include distance/similarity scores")
    
    return True

def simulate_entity_resolution_query(query_text, target_field_type="person"):
    """Simulate how entity resolution queries work.
    
    This shows what would happen in your actual pipeline.
    """
    
    print(f"\n🔎 Entity Resolution Query: '{query_text}'")
    print(f"Target field type: {target_field_type}")
    print("="*60)
    
    # Simulate finding similar items from our sample data
    # In reality, this would use vector similarity
    relevant_items = []
    
    for item in indexing_data:
        if item['field_type'] == target_field_type:
            # Simple keyword matching for simulation
            text_to_check = item['original_string'].lower()
            query_words = query_text.lower().split()
            
            # Calculate simple similarity score
            matches = sum(1 for word in query_words if word in text_to_check)
            if matches > 0:
                # Simulate cosine similarity score
                simulated_similarity = min(0.95, matches / len(query_words) + 0.1)
                
                relevant_items.append({
                    'text': item['original_string'],
                    'source': item['source_record'],
                    'person': item['actual_person'],
                    'similarity': simulated_similarity,
                    'distance': 1.0 - simulated_similarity
                })
    
    # Sort by similarity (highest first)
    relevant_items.sort(key=lambda x: x['similarity'], reverse=True)
    
    print("\n📊 Search Results:")
    for i, item in enumerate(relevant_items[:5], 1):
        print(f"\n{i}. Similarity: {item['similarity']:.3f} (distance: {item['distance']:.3f})")
        print(f"   Text: {item['text'][:80]}{'...' if len(item['text']) > 80 else ''}")
        print(f"   Source: Record {item['source']}")
        print(f"   Actual Person: {item['person']}")
    
    if not relevant_items:
        print("   No similar items found in sample data.")
    
    return relevant_items

# Demonstrate search methods
demonstrate_similarity_search()

# Simulate some actual entity resolution queries
test_queries = [
    "Schubert Franz composer",
    "Schubert photography archaeology",
    "Austrian composer symphony"
]

for query in test_queries:
    results = simulate_entity_resolution_query(query, "person")
    time.sleep(1)  # Brief pause for readability
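
The simulation above uses keyword overlap. What a filtered vector query computes can be sketched with an exact brute-force search over toy data. The three-dimensional vectors below are made up, `toy_near_vector` is a hypothetical helper rather than a Weaviate call, and a real HNSW index returns approximate results without scanning every object.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

# Toy objects: strings from the sample records, vectors invented for the demo.
objects = [
    {"original_string": "Schubert, Franz", "field_type": "person", "vector": [0.6, 0.6, 0.0]},
    {"original_string": "Schubert, Franz, 1797-1828", "field_type": "person", "vector": [0.9, 0.1, 0.0]},
    {"original_string": "Die schöne Müllerin", "field_type": "title", "vector": [0.95, 0.05, 0.0]},
]

def toy_near_vector(query_vector, field_type, limit):
    """Filter by field_type, rank by distance, return the closest `limit` items."""
    candidates = [o for o in objects if o["field_type"] == field_type]
    scored = [(cosine_distance(query_vector, o["vector"]), o["original_string"])
              for o in candidates]
    scored.sort(key=lambda pair: pair[0])  # ascending distance
    return scored[:limit]

results = toy_near_vector([1.0, 0.0, 0.0], field_type="person", limit=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

The title is the closest vector overall, but the filter removes it before ranking, mirroring how `filters`, `limit`, and the returned distance interact in a query.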

## Part 8: Advanced Querying Techniques

Let's explore more sophisticated querying patterns that are essential for entity resolution at scale.

In [None]:
def demonstrate_advanced_filtering():
    """Show advanced filtering and aggregation capabilities."""
    
    advanced_code = '''
# Advanced Weaviate v4 query patterns:

# 1. Multi-field filtering
result = collection.query.near_text(
    query="Franz Schubert",
    filters=(
        Filter.by_property("field_type").equal("person") &
        Filter.by_property("frequency").greater_than(1)
    ),
    limit=10
)

# 2. Hybrid search (keyword + vector)
result = collection.query.hybrid(
    query="Austrian composer",
    alpha=0.7,  # 0.7 vector + 0.3 keyword search
    filters=Filter.by_property("field_type").equal("composite")
)

# 3. Aggregation queries
aggregation = collection.aggregate.over_all(
    group_by="field_type",
    total_count=True
)

# 4. Batch similarity search
queries = ["composer", "artist", "photographer"]
for query_text in queries:
    result = collection.query.near_text(
        query=query_text,
        limit=3
    )
'''
    
    print("🚀 Advanced Query Patterns")
    print("="*30)
    print(advanced_code)
    
    print("\n🔧 Advanced Techniques Explained:")
    print("\n1. Multi-field Filtering:")
    print("   • Combine multiple conditions with & (AND) or | (OR)")
    print("   • Filter by field_type AND frequency threshold")
    print("   • Helps narrow down results to most relevant items")
    
    print("\n2. Hybrid Search:")
    print("   • Combines vector similarity with keyword matching")
    print("   • Alpha parameter controls the balance (0=keywords, 1=vectors)")
    print("   • Best of both worlds for complex queries")
    
    print("\n3. Aggregation:")
    print("   • Get statistics about your data")
    print("   • Count items by field type, frequency, etc.")
    print("   • Useful for understanding data distribution")
    
    print("\n4. Batch Processing:")
    print("   • Run a list of queries in a simple loop")
    print("   • Each near_text() call is still a separate request")
    print("   • For large-scale runs, reuse cached vectors with near_vector()")
    
    return True

def analyze_query_performance():
    """Analyze the performance characteristics of different query types."""
    
    performance_data = {
        'Query Type': [
            'near_text()',
            'near_vector()',
            'hybrid()',
            'filtered near_text()',
            'aggregation()'
        ],
        'Speed': [
            'Medium',
            'Fast',
            'Medium-Slow',
            'Medium',
            'Fast'
        ],
        'Accuracy': [
            'High',
            'High',
            'Very High',
            'High',
            'N/A'
        ],
        'Use Case': [
            'Exploration, prototyping',
            'Production similarity search',
            'Complex entity matching',
            'Filtered entity resolution',
            'Data analysis, monitoring'
        ],
        'API Calls': [
            '2 (Weaviate + OpenAI)',
            '1 (Weaviate only)',
            '2 (Weaviate + OpenAI)', 
            '2 (Weaviate + OpenAI)',
            '1 (Weaviate only)'
        ]
    }
    
    df = pd.DataFrame(performance_data)
    print("\n⚡ Query Performance Analysis")
    print("="*40)
    print(df.to_string(index=False))
    
    print("\n💡 Performance Tips:")
    print("• Use near_vector() in production for best speed")
    print("• Cache frequently-used embedding vectors")
    print("• Apply filters to reduce search space")
    print("• Use batch operations when processing many queries")
    print("• Monitor query latency and adjust limits as needed")
    
    return df

# Demonstrate advanced techniques
demonstrate_advanced_filtering()
performance_df = analyze_query_performance()

print("\n✅ Advanced querying concepts covered!")
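
The effect of the `alpha` parameter in hybrid search can be illustrated with a simplified weighted blend. This is only a sketch of the idea: the scores are invented, `blend_scores` and `rank` are hypothetical helpers, and Weaviate's actual fusion of keyword and vector results is more involved than a direct weighted sum.

```python
def blend_scores(vector_score: float, keyword_score: float, alpha: float) -> float:
    """Weighted mix: alpha=1 uses only the vector score, alpha=0 only keywords."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score

# Invented, already-normalized scores for two candidate strings.
candidates = {
    "Schubert, Franz, 1797-1828": {"vector": 0.9, "keyword": 0.2},
    "Schubert, Franz": {"vector": 0.5, "keyword": 1.0},
}

def rank(alpha: float) -> list:
    """Order candidate names by blended score, best first."""
    return sorted(
        candidates,
        key=lambda name: blend_scores(
            candidates[name]["vector"], candidates[name]["keyword"], alpha
        ),
        reverse=True,
    )

print("alpha=0.7:", rank(0.7))  # vector similarity dominates
print("alpha=0.3:", rank(0.3))  # keyword match dominates
```

Shifting `alpha` changes which candidate ranks first, which is why it is worth tuning against labeled examples rather than fixing it by intuition.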

## Part 9: Entity Resolution Pipeline Integration

Let's put it all together and show how Weaviate integrates into your complete entity resolution pipeline.

In [None]:
def demonstrate_complete_pipeline():
    """Show how Weaviate fits into the complete entity resolution pipeline."""
    
    pipeline_stages = [
        {
            'stage': '1. Data Preparation',
            'description': 'Extract and clean metadata from library catalog',
            'tools': 'Pandas, custom preprocessing',
            'output': 'Structured records with person/title/composite fields'
        },
        {
            'stage': '2. Embedding Generation', 
            'description': 'Convert text fields to vectors',
            'tools': 'OpenAI text-embedding-3-small API',
            'output': '1,536-dimensional vectors for each text field'
        },
        {
            'stage': '3. Vector Indexing',
            'description': 'Store vectors in searchable index',
            'tools': 'Weaviate with HNSW algorithm',
            'output': 'Fast similarity search capability'
        },
        {
            'stage': '4. Candidate Generation',
            'description': 'Find potentially matching entities',
            'tools': 'Weaviate near_vector queries',
            'output': 'Candidate pairs for detailed comparison'
        },
        {
            'stage': '5. Feature Engineering',
            'description': 'Calculate similarity features',
            'tools': 'Cosine similarity, birth/death matching, taxonomy',
            'output': 'Feature vectors for classification'
        },
        {
            'stage': '6. Classification',
            'description': 'Decide if candidates are same entity',
            'tools': 'Logistic regression or SetFit classifier',
            'output': 'Match/no-match decisions with confidence scores'
        }
    ]
    
    print("🔄 Complete Entity Resolution Pipeline")
    print("="*45)
    
    for stage in pipeline_stages:
        print(f"\n{stage['stage']}")
        print(f"  Purpose: {stage['description']}")
        print(f"  Tools: {stage['tools']}")
        print(f"  Output: {stage['output']}")
    
    print("\n🎯 Weaviate's Role:")
    print("• Stages 2-4: Core vector storage and similarity search")
    print("• Enables efficient candidate generation at scale")
    print("• Reduces the O(n²) all-pairs problem to O(n×k), where k << n is the number of candidates per record")
    print("• Critical for processing millions of catalog records")
    
    return pipeline_stages

def calculate_scalability_benefits():
    """Show the scalability benefits of using Weaviate vs naive approaches."""
    
    # Yale Library catalog size
    catalog_size = 17_590_104
    
    # Naive approach: compare every pair
    naive_comparisons = catalog_size * (catalog_size - 1) // 2
    
    # Weaviate approach: only compare similar candidates
    avg_candidates_per_query = 100  # Assumed number of candidates retrieved per record
    weaviate_comparisons = catalog_size * avg_candidates_per_query
    
    reduction_factor = naive_comparisons / weaviate_comparisons
    
    print("\n📈 Scalability Analysis")
    print("="*25)
    print(f"Catalog size: {catalog_size:,} records")
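    # Assumed throughput, for illustration only; measure on your own hardware
    comparisons_per_second = 1_000_000
    naive_seconds = naive_comparisons / comparisons_per_second
    weaviate_seconds = weaviate_comparisons / comparisons_per_second

    print("\nNaive approach (compare all pairs):")
    print(f"  Comparisons needed: {naive_comparisons:,}")
    print(f"  Time estimate: ~{naive_seconds / 86_400:,.0f} days")
    
    print("\nWeaviate approach (similarity search):")
    print(f"  Comparisons needed: {weaviate_comparisons:,}")
    print(f"  Time estimate: ~{weaviate_seconds / 3_600:.1f} hours")
    
    print(f"\n🚀 Improvement: {reduction_factor:,.0f}x fewer comparisons!")
    print(f"(Assumes {comparisons_per_second:,} comparisons/second; index lookup time is not included.)")
    print("This turns an intractable all-pairs comparison into a job of practical size.")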
    
    return reduction_factor

# Demonstrate complete pipeline
pipeline = demonstrate_complete_pipeline()
reduction = calculate_scalability_benefits()

print("\n🎓 Key Takeaways:")
print("• Vector databases are essential for large-scale similarity search")
print("• Weaviate handles the complexity of vector indexing and search")
print("• Integration with OpenAI embeddings provides strong semantic matching without training a custom model")
print("• Proper schema design enables flexible querying and filtering")
print("• Scalability improvements make real-world applications feasible")
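
To make stage 4 (candidate generation) concrete, here is a small sketch of why retrieving only the top-k neighbors per record shrinks the comparison workload. It is an illustration under stated assumptions: the brute-force loop below stands in for a Weaviate `near_vector` query (which would use the HNSW index rather than scanning every vector), and the record IDs and three-dimensional vectors are made-up toy data.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_candidates(vectors: Dict[str, List[float]], k: int) -> Set[Tuple[str, str]]:
    """Pair each record with its k most similar neighbors (unordered pairs)."""
    pairs: Set[Tuple[str, str]] = set()
    for rec_id, vec in vectors.items():
        # Brute-force scoring: a stand-in for an approximate nearest-neighbor lookup
        scored = sorted(
            (
                (cosine(vec, other_vec), other_id)
                for other_id, other_vec in vectors.items()
                if other_id != rec_id
            ),
            reverse=True,
        )
        for _, other_id in scored[:k]:
            pairs.add(tuple(sorted((rec_id, other_id))))
    return pairs


# Toy data (assumption): two Schubert composer records, one unrelated Schubert, two Mozarts
records = {
    "schubert_composer_a": [0.90, 0.10, 0.00],
    "schubert_composer_b": [0.88, 0.12, 0.02],
    "schubert_artist": [0.10, 0.90, 0.10],
    "mozart_a": [0.00, 0.20, 0.95],
    "mozart_b": [0.02, 0.18, 0.97],
}

n = len(records)
all_pairs = n * (n - 1) // 2
candidates = top_k_candidates(records, k=1)

print(f"All pairs: {all_pairs}")
print(f"Candidate pairs with k=1: {len(candidates)}")
for pair in sorted(candidates):
    print("  ", pair)
```

Only the candidate pairs move on to feature engineering and classification, so the downstream cost grows with the number of records times k rather than with the square of the catalog size.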

## Part 10: Production Considerations and Best Practices

As we wrap up, let's discuss the practical considerations for deploying Weaviate in production environments like Yale Library's infrastructure.

In [None]:
def discuss_production_deployment():
    """Cover key considerations for production Weaviate deployment."""
    
    considerations = {
        'Infrastructure': {
            'Hardware': [
                'Memory: 2-4x your vector data size for optimal performance',
                'Storage: Fast SSDs for vector index files', 
                'CPU: Multiple cores for concurrent query processing',
                'Network: Low latency for real-time applications'
            ],
            'Deployment': [
                'Docker containers for easy scaling',
                'Kubernetes for orchestration and auto-scaling',
                'Load balancers for high availability',
                'Backup strategies for data persistence'
            ]
        },
        'Data Management': {
            'Schema Design': [
                'Plan property types carefully (cannot change easily)',
                'Use appropriate data types for filtering',
                'Consider multi-tenant schemas if needed',
                'Version your schema changes'
            ],
            'Indexing Strategy': [
                'Batch operations for bulk loading',
                'Monitor indexing performance and memory usage',
                'Use appropriate vector compression if needed',
                'Plan for index rebuilding procedures'
            ]
        },
        'Performance': {
            'Query Optimization': [
                'Use filters to reduce search space',
                'Cache frequently-used embeddings',
                'Optimize ef and efConstruction parameters',
                'Monitor query latency and throughput'
            ],
            'Scaling': [
                'Horizontal scaling with sharding',
                'Read replicas for query-heavy workloads',
                'Connection pooling for multiple clients',
                'Rate limiting to protect against overload'
            ]
        }
    }
    
    print("🏗️  Production Deployment Considerations")
    print("="*42)
    
    for category, subcategories in considerations.items():
        print(f"\n📋 {category}:")
        for subcat, items in subcategories.items():
            print(f"\n  {subcat}:")
            for item in items:
                print(f"    • {item}")
    
    return considerations

def estimate_resource_requirements():
    """Estimate resource requirements for Yale Library's use case."""
    
    # Based on Yale's 17.6M catalog records
    records = 17_590_104
    
    # Assume 3 vector types per record (person, title, composite)
    vectors_per_record = 3
    total_vectors = records * vectors_per_record
    
    # Vector specifications
    dimensions = 1536
    bytes_per_float = 4  # float32
    vector_size_bytes = dimensions * bytes_per_float
    
    # Calculate storage requirements
    raw_vector_storage = total_vectors * vector_size_bytes
    metadata_overhead = raw_vector_storage * 0.3  # Rough estimate
    index_overhead = raw_vector_storage * 0.5  # HNSW index overhead
    total_storage = raw_vector_storage + metadata_overhead + index_overhead
    
    # Memory requirements (should be 2-4x storage for optimal performance)
    recommended_memory = total_storage * 3
    
    print("\n💾 Resource Requirements for Yale Library")
    print("="*45)
    print(f"Catalog records: {records:,}")
    print(f"Total vectors: {total_vectors:,}")
    print(f"Vector dimensions: {dimensions:,}")
    
    print(f"\nStorage Requirements:")
    print(f"  Raw vectors: {raw_vector_storage/1024**3:.1f} GB")
    print(f"  Metadata: {metadata_overhead/1024**3:.1f} GB")
    print(f"  Index overhead: {index_overhead/1024**3:.1f} GB")
    print(f"  Total storage: {total_storage/1024**3:.1f} GB")
    
    print(f"\nRecommended Memory: {recommended_memory/1024**3:.1f} GB")
    
    print("\n💰 Estimated Monthly Costs:")
    # Placeholder unit prices for illustration only; substitute current quotes from your cloud provider
    storage_cost = (total_storage/1024**3) * 0.10  # assumed $0.10/GB/month for SSD
    memory_cost = (recommended_memory/1024**3) * 0.05  # assumed $0.05/GB/month for memory
    compute_cost = 200  # assumed flat compute instance cost
    total_monthly = storage_cost + memory_cost + compute_cost
    
    print(f"  Storage: ${storage_cost:.2f}/month")
    print(f"  Memory: ${memory_cost:.2f}/month")
    print(f"  Compute: ${compute_cost:.2f}/month")
    print(f"  Total: ${total_monthly:.2f}/month")
    
    return {
        'storage_gb': total_storage/1024**3,
        'memory_gb': recommended_memory/1024**3,
        'monthly_cost': total_monthly
    }

# Discuss production considerations
prod_considerations = discuss_production_deployment()
resource_estimates = estimate_resource_requirements()

print("\n🎯 Implementation Roadmap for Yale:")
print("1. Start with pilot deployment (subset of catalog)")
print("2. Benchmark performance with representative queries")
print("3. Optimize schema and indexing parameters")
print("4. Scale to full catalog with monitoring")
print("5. Integrate with existing library systems")

print("\n✅ Production planning complete!")
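
The "batch operations for bulk loading" recommendation can be sketched independently of any particular client. The helper names below (`chunked`, `load_in_batches`, `flaky_send`) are hypothetical, and `send_batch` is a placeholder for whatever call actually writes a batch to your index. The Weaviate Python client ships its own batching helpers, so treat this as an illustration of the idea rather than a replacement for them.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def chunked(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield successive lists of at most batch_size records."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def load_in_batches(
    records: Iterable[Record],
    send_batch: Callable[[List[Record]], None],  # placeholder for the real write call
    batch_size: int = 200,
    max_retries: int = 3,
    backoff_seconds: float = 0.0,  # 0 keeps the demo fast; use a positive value in practice
) -> int:
    """Send records in batches, retrying a failed batch with exponential backoff."""
    loaded = 0
    for batch in chunked(records, batch_size):
        for attempt in range(1, max_retries + 1):
            try:
                send_batch(batch)
                loaded += len(batch)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt
                time.sleep(backoff_seconds * 2 ** (attempt - 1))
    return loaded


# Demo with a simulated sender that fails once, then succeeds
calls = {"count": 0}
received: List[Record] = []


def flaky_send(batch: List[Record]) -> None:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    received.extend(batch)


sample = [{"person": f"Person {i}"} for i in range(5)]
loaded = load_in_batches(sample, flaky_send, batch_size=2)
print(f"Loaded {loaded} records in {calls['count']} send attempts")
```

Retrying at the batch level keeps a transient network error from aborting a multi-hour load, while the retry cap ensures that a persistent failure still surfaces instead of looping forever.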

## Summary: Vector Databases in the Modern AI Stack

🎓 **What We've Learned:**

1. **Vector Database Fundamentals**: Why traditional databases can't handle similarity search at scale
2. **Weaviate Architecture**: How HNSW indexing enables fast similarity search
3. **Schema Design**: Proper data modeling for library metadata
4. **Indexing Strategies**: Batch processing and automatic embedding generation
5. **Query Patterns**: From basic similarity search to advanced filtering
6. **Production Deployment**: Real-world considerations for enterprise use

🔧 **Key Technical Insights:**

- **Two-Client Architecture**: OpenAI for embeddings, Weaviate for search
- **Automatic Embedding**: Weaviate can integrate directly with OpenAI's API
- **Scalability**: Reduces the O(n²) all-pairs comparison problem to O(n×k), where k is the number of candidates retrieved per record
- **Flexibility**: Multiple query types (text, vector, hybrid) for different use cases
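
The scalability insight can be checked with a short calculation. It uses the catalog size and the figure of 100 candidates per record from the scalability cell in Part 9; the candidate count is that cell's working assumption, not a measured value.

$$
\begin{aligned}
\text{all pairs} &= \frac{n(n-1)}{2} \approx 1.55 \times 10^{14}, \qquad n = 17{,}590{,}104 \\
\text{candidate pairs} &= n \cdot k \approx 1.76 \times 10^{9}, \qquad k = 100 \\
\text{reduction} &= \frac{n(n-1)/2}{n \cdot k} = \frac{n-1}{2k} \approx 8.8 \times 10^{4}
\end{aligned}
$$

The reduction factor grows linearly with the catalog size, so the benefit of candidate generation increases as more records are added.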

🚀 **Entity Resolution Impact:**

Vector databases like Weaviate transform entity resolution from a computationally intractable problem into a practical, scalable solution. For Yale Library's 17.6 million records, this technology makes the difference between months of computation and hours of processing.

🔜 **Next Steps:**

- **Classification Techniques**: SetFit and Mistral AI Classifier Factory
- **Feature Engineering**: Combining embedding similarity with domain knowledge
- **Production Pipelines**: Building robust, monitored systems for continuous operation

---
*This notebook demonstrates vector database concepts for the Yale AI Research Techniques Workshop on Text Embeddings and Classification.*