# Vector Indexing and Retrieval with Weaviate v4

## Learning Objectives
By the end of this notebook, you will:
- Set up and configure Weaviate vector database
- Create collections optimized for library metadata
- Index library catalog records with modern embeddings
- Perform semantic similarity searches
- Execute complex queries combining vector and keyword search
- Apply these techniques to real entity resolution challenges

## Why Vector Databases for Libraries?

Traditional library catalogs rely on exact keyword matching and controlled vocabularies. But what if a researcher searches for "musical compositions" and we have records cataloged under "symphonies," "sonatas," or "lieder"? 

Vector databases like Weaviate solve this by understanding semantic meaning. They can find relevant records even when the exact terms don't match, opening up new possibilities for discovery and research.

**Real Example**: A search for "Viennese classical music" could automatically surface records for Mozart, Beethoven, and Schubert, even if those exact terms don't appear in the catalog records.

## Part 1: Environment Setup and Dependencies

First, let's install the required packages and set up our environment. We'll use the latest Weaviate v4 Python client which provides significant improvements over previous versions.

In [6]:
!pip install httpx openai

Collecting openai
  Downloading openai-1.93.0-py3-none-any.whl.metadata (29 kB)
Downloading openai-1.93.0-py3-none-any.whl (755 kB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m755.0/755.0 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m[31m4.8 MB/s[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: openai
Successfully installed openai-1.93.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [7]:
!pip list | grep -E "(openai|httpx|httpcore)"

httpcore                      1.0.9
httpx                         0.28.1
langchain-openai              0.2.2
openai                        1.93.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [8]:
# Install required packages for this workshop
!pip install weaviate-client>=4.4.0 openai pandas numpy matplotlib seaborn python-dotenv

# Import libraries
import weaviate
import weaviate.classes as wvc
from weaviate.classes.config import Property, DataType, Configure
from weaviate.classes.query import MetadataQuery
import pandas as pd
import numpy as np
import json
import os
from typing import List, Dict, Any, Optional
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables (create a .env file with your API keys)
load_dotenv()

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ Environment setup complete!")
print(f"📦 Weaviate client version: {weaviate.__version__}")

zsh:1: 4.4.0 not found
✅ Environment setup complete!
📦 Weaviate client version: 4.8.0


## Part 2: Starting Weaviate and Client Connection

We'll start with a local Weaviate instance using Docker. This gives us full control and eliminates API costs during development.

**Note**: If you don't have Docker installed, you can use Weaviate Cloud Services (WCS) instead by modifying the connection parameters.

In [18]:
# First, let's start Weaviate using Docker
# Run this in your terminal before executing this notebook:
# docker run -d --name weaviate-workshop \
#   -p 8080:8080 -p 50051:50051 \
#   -e QUERY_DEFAULTS_LIMIT=25 \
#   -e AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true \
#   -e PERSISTENCE_DATA_PATH='/var/lib/weaviate' \
#   -e DEFAULT_VECTORIZER_MODULE='none' \
#   -e CLUSTER_HOSTNAME='node1' \
#   semitechnologies/weaviate:1.25.0

# Connect to local Weaviate instance
def connect_to_weaviate(host: str = "localhost", port: int = 8080, secure: bool = False) -> weaviate.WeaviateClient:
    """Connect to Weaviate instance with proper error handling."""
    try:
        # Create connection URL
        protocol = "https" if secure else "http"
        url = f"{protocol}://{host}:{port}"
        
        # Connect using v4 client syntax
        weaviate_client = weaviate.connect_to_local(
            host=host,
            port=port,
            grpc_port=50051,  # gRPC port for faster operations            
        )
        
        # Test the connection
        if weaviate_client.is_ready():
            print(f"✅ Successfully connected to Weaviate at {url}")
            
            # Get cluster information
            meta = weaviate_client.get_meta()
            print(f"📊 Weaviate version: {meta.get('version', 'Unknown')}")
            
            return weaviate_client
        else:
            raise ConnectionError("Weaviate is not ready")
            
    except Exception as e:
        print(f"❌ Failed to connect to Weaviate: {e}")
        print("\n💡 Troubleshooting tips:")
        print("   1. Make sure Docker is running")
        print("   2. Start Weaviate with the Docker command provided above")
        print("   3. Wait 30-60 seconds for Weaviate to fully start")
        print("   4. Check that port 8080 is not in use by another service")
        raise

# Connect to Weaviate
weaviate_client = connect_to_weaviate()

# Verify connection with some basic info
print(f"\n🔍 Current collections in Weaviate: {len(weaviate_client.collections.list_all())}")
for collection_name in weaviate_client.collections.list_all():
    print(f"  - {collection_name}")

✅ Successfully connected to Weaviate at http://localhost:8080
📊 Weaviate version: 1.30.0-rc.0

🔍 Current collections in Weaviate: 7
  - ConceptScheme
  - ClassificationScheme
  - LibraryCatalog
  - FineTunedConcepts
  - Document
  - MultilingualDocument
  - EntityString


## Part 3: Designing Our Library Collection Schema

Now we'll create a collection specifically designed for library catalog records. This schema will capture the essential metadata fields while optimizing for semantic search.

Think of this as designing the "database table" structure, but optimized for vector similarity rather than traditional relational queries.

In [19]:
# Define our library catalog collection schema
COLLECTION_NAME = "LibraryCatalog"

def create_library_collection(weaviate_client: weaviate.WeaviateClient) -> bool:
    """Create a collection optimized for library catalog records."""
    
    # Delete existing collection if it exists (for clean slate)
    if weaviate_client.collections.exists(COLLECTION_NAME):
        print(f"🗑️  Deleting existing '{COLLECTION_NAME}' collection")
        weaviate_client.collections.delete(COLLECTION_NAME)
    
    try:
        # Define properties for our library records
        # Each property maps to a field in our MARC/catalog data
        properties = [
            Property(
                name="personId",
                data_type=DataType.TEXT,
                description="Unique identifier for the person/entity"
            ),
            Property(
                name="person",
                data_type=DataType.TEXT,
                description="Person name (e.g., 'Schubert, Franz, 1797-1828')"
            ),
            Property(
                name="title",
                data_type=DataType.TEXT,
                description="Work title"
            ),
            Property(
                name="roles",
                data_type=DataType.TEXT,
                description="Person's role (Composer, Contributor, etc.)"
            ),
            Property(
                name="subjects",
                data_type=DataType.TEXT,
                description="Subject headings and topics"
            ),
            Property(
                name="provision",
                data_type=DataType.TEXT,
                description="Publication information"
            ),
            Property(
                name="composite",
                data_type=DataType.TEXT,
                description="Full composite record for embedding"
            ),
            Property(
                name="classification",
                data_type=DataType.TEXT,
                description="Hierarchical classification (e.g., 'Music and Sound Arts')"
            ),
            Property(
                name="recordId",
                data_type=DataType.TEXT,
                description="Original catalog record ID"
            )
        ]
        
        # Create the collection with manual vectorization
        # We'll provide our own embeddings rather than using Weaviate's built-in vectorizers
        collection = weaviate_client.collections.create(
            name=COLLECTION_NAME,
            properties=properties,
            # Configure for manual vector management
            vectorizer_config=Configure.Vectorizer.none(),
            # Use HNSW index for fast approximate nearest neighbor search
            vector_index_config=Configure.VectorIndex.hnsw(
                distance_metric=wvc.config.VectorDistances.COSINE,  # Use cosine similarity
                ef_construction=128,  # Higher = better quality, slower indexing
                max_connections=64   # Higher = better search quality, more memory
            ),
            description="Yale Library catalog records with semantic embeddings"
        )
        
        print(f"✅ Created collection '{COLLECTION_NAME}' successfully")
        print(f"📋 Properties defined: {len(properties)}")
        
        # Display the schema for verification
        print("\n📊 Collection Schema:")
        for prop in properties:
            print(f"  • {prop.name}: {prop.description}")
        
        return True
        
    except Exception as e:
        print(f"❌ Failed to create collection: {e}")
        return False

# Create our library collection
success = create_library_collection(weaviate_client)

if success:
    # Verify the collection was created
    collections = weaviate_client.collections.list_all()
    print(f"\n🎯 Total collections now: {len(collections)}")
    
    # Get a reference to our collection for future operations
    library_collection = weaviate_client.collections.get(COLLECTION_NAME)
    print(f"📚 Ready to work with '{COLLECTION_NAME}' collection")

🗑️  Deleting existing 'LibraryCatalog' collection
✅ Created collection 'LibraryCatalog' successfully
📋 Properties defined: 9

📊 Collection Schema:
  • personId: Unique identifier for the person/entity
  • person: Person name (e.g., 'Schubert, Franz, 1797-1828')
  • title: Work title
  • roles: Person's role (Composer, Contributor, etc.)
  • subjects: Subject headings and topics
  • provision: Publication information
  • composite: Full composite record for embedding
  • classification: Hierarchical classification (e.g., 'Music and Sound Arts')
  • recordId: Original catalog record ID

🎯 Total collections now: 7
📚 Ready to work with 'LibraryCatalog' collection


## Part 4: Preparing Sample Library Data

Let's create sample library catalog records that represent the diversity of an academic library collection. We'll include our Franz Schubert examples plus records from various domains to demonstrate semantic search capabilities.

In [20]:
# Create comprehensive sample library data
def create_sample_library_data() -> List[Dict[str, Any]]:
    """Generate sample library catalog records for demonstration."""
    
    sample_records = [
        # Franz Schubert - Composer (1797-1828)
        {
            "personId": "53001#Agent700-1",
            "person": "Schubert, Franz, 1797-1828",
            "title": "Symphony no. 8 in B minor, D. 759 \"Unfinished\"",
            "roles": "Composer",
            "subjects": "Symphonies; Classical music; Romantic period music",
            "provision": "Vienna: Universal Edition, 1978",
            "composite": "Title: Symphony no. 8 in B minor, D. 759 \"Unfinished\" Subjects: Symphonies; Classical music; Romantic period music Provision information: Vienna: Universal Edition, 1978",
            "classification": "Music and Sound Arts",
            "recordId": "53001"
        },
        {
            "personId": "53001#Agent700-2",
            "person": "Schubert, Franz, 1797-1828",
            "title": "Gretchen am Spinnrade, D. 118",
            "roles": "Composer",
            "subjects": "Songs; Lieder; German songs; Voice and piano music",
            "provision": "Leipzig: C. F. Peters, 1885",
            "composite": "Title: Gretchen am Spinnrade, D. 118 Subjects: Songs; Lieder; German songs; Voice and piano music Provision information: Leipzig: C. F. Peters, 1885",
            "classification": "Music and Sound Arts",
            "recordId": "53002"
        },
        
        # Franz August Schubert - Artist (1806-1893)
        {
            "personId": "53144#Agent700-22",
            "person": "Schubert, Franz August, 1806-1893",
            "title": "Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Methode",
            "roles": "Contributor",
            "subjects": "Photography in archaeology; Archaeological illustration; Scientific photography",
            "provision": "Mainz: P. von Zabern, 1978",
            "composite": "Title: Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Methode Subjects: Photography in archaeology; Archaeological illustration; Scientific photography Provision information: Mainz: P. von Zabern, 1978",
            "classification": "Documentary and Technical Arts",
            "recordId": "53144"
        },
        {
            "personId": "53144#Agent700-23",
            "person": "Schubert, Franz August, 1806-1893",
            "title": "Dessauer Künstler des 19. Jahrhunderts",
            "roles": "Artist",
            "subjects": "German artists; 19th century art; Regional art history; Portrait painting",
            "provision": "Dessau: Anhaltische Verlagsgesellschaft, 1925",
            "composite": "Title: Dessauer Künstler des 19. Jahrhunderts Subjects: German artists; 19th century art; Regional art history; Portrait painting Provision information: Dessau: Anhaltische Verlagsgesellschaft, 1925",
            "classification": "Visual Arts",
            "recordId": "53145"
        },
        
        # Other composers for context
        {
            "personId": "12345#Agent700-5",
            "person": "Mozart, Wolfgang Amadeus, 1756-1791",
            "title": "Piano Sonata No. 11 in A major, K. 331",
            "roles": "Composer",
            "subjects": "Piano music; Classical period; Sonatas",
            "provision": "Vienna: Artaria, 1784",
            "composite": "Title: Piano Sonata No. 11 in A major, K. 331 Subjects: Piano music; Classical period; Sonatas Provision information: Vienna: Artaria, 1784",
            "classification": "Music and Sound Arts",
            "recordId": "12345"
        },
        {
            "personId": "67890#Agent700-8",
            "person": "Beethoven, Ludwig van, 1770-1827",
            "title": "Symphony No. 9 in D minor, Op. 125 \"Choral\"",
            "roles": "Composer",
            "subjects": "Symphonies; Choral symphonies; Classical music; Romantic music",
            "provision": "Mainz: B. Schott's Söhne, 1826",
            "composite": "Title: Symphony No. 9 in D minor, Op. 125 \"Choral\" Subjects: Symphonies; Choral symphonies; Classical music; Romantic music Provision information: Mainz: B. Schott's Söhne, 1826",
            "classification": "Music and Sound Arts",
            "recordId": "67890"
        },
        
        # Archaeology and Art records for domain contrast
        {
            "personId": "78901#Agent700-12",
            "person": "Winkelmann, Johann Joachim, 1717-1768",
            "title": "Geschichte der Kunst des Alterthums",
            "roles": "Author",
            "subjects": "Art history; Ancient art; Classical archaeology; Greek art; Roman art",
            "provision": "Dresden: Walther, 1764",
            "composite": "Title: Geschichte der Kunst des Alterthums Subjects: Art history; Ancient art; Classical archaeology; Greek art; Roman art Provision information: Dresden: Walther, 1764",
            "classification": "History and Culture",
            "recordId": "78901"
        },
        {
            "personId": "89012#Agent700-15",
            "person": "Wheeler, Mortimer, 1890-1976",
            "title": "Archaeology from the Earth",
            "roles": "Author",
            "subjects": "Archaeological methods; Excavation techniques; Field archaeology; Stratigraphy",
            "provision": "Oxford: Clarendon Press, 1954",
            "composite": "Title: Archaeology from the Earth Subjects: Archaeological methods; Excavation techniques; Field archaeology; Stratigraphy Provision information: Oxford: Clarendon Press, 1954",
            "classification": "Social Sciences",
            "recordId": "89012"
        },
        
        # Literature for additional domain diversity
        {
            "personId": "34567#Agent700-20",
            "person": "Goethe, Johann Wolfgang von, 1749-1832",
            "title": "Faust: eine Tragödie",
            "roles": "Author",
            "subjects": "German literature; Drama; Romanticism; Philosophy in literature",
            "provision": "Tübingen: Cotta, 1808",
            "composite": "Title: Faust: eine Tragödie Subjects: German literature; Drama; Romanticism; Philosophy in literature Provision information: Tübingen: Cotta, 1808",
            "classification": "Literature and Narrative Arts",
            "recordId": "34567"
        }
    ]
    
    return sample_records

# Generate our sample data
sample_data = create_sample_library_data()

print(f"📚 Created {len(sample_data)} sample library records")
print("\n📊 Records by classification:")

# Count records by classification
classification_counts = {}
for record in sample_data:
    classification = record["classification"]
    classification_counts[classification] = classification_counts.get(classification, 0) + 1

for classification, count in classification_counts.items():
    print(f"  • {classification}: {count} records")

# Display first record as example
print("\n🔍 Example record structure:")
for key, value in sample_data[0].items():
    print(f"  {key}: {value}")

📚 Created 9 sample library records

📊 Records by classification:
  • Music and Sound Arts: 4 records
  • Documentary and Technical Arts: 1 records
  • Visual Arts: 1 records
  • History and Culture: 1 records
  • Social Sciences: 1 records
  • Literature and Narrative Arts: 1 records

🔍 Example record structure:
  personId: 53001#Agent700-1
  person: Schubert, Franz, 1797-1828
  title: Symphony no. 8 in B minor, D. 759 "Unfinished"
  roles: Composer
  subjects: Symphonies; Classical music; Romantic period music
  provision: Vienna: Universal Edition, 1978
  composite: Title: Symphony no. 8 in B minor, D. 759 "Unfinished" Subjects: Symphonies; Classical music; Romantic period music Provision information: Vienna: Universal Edition, 1978
  classification: Music and Sound Arts
  recordId: 53001


## Part 5: Generating Embeddings with OpenAI

Now we need to convert our text records into vector embeddings. We'll use OpenAI's text-embedding-3-small model, which provides excellent performance for semantic search tasks.

**Note**: You'll need an OpenAI API key for this section. Add it to your `.env` file as `OPENAI_API_KEY=your_key_here`

In [17]:
!pip uninstall openai httpx httpcore -y

Found existing installation: openai 1.12.0
Uninstalling openai-1.12.0:
  Successfully uninstalled openai-1.12.0
Found existing installation: httpx 0.25.2
Uninstalling httpx-0.25.2:
  Successfully uninstalled httpx-0.25.2
Found existing installation: httpcore 1.0.9
Uninstalling httpcore-1.0.9:
  Successfully uninstalled httpcore-1.0.9


In [18]:
!pip install openai>=1.3.0 httpx httpcore

zsh:1: 1.3.0 not found


In [25]:
import os
import numpy as np
from typing import List
from openai import OpenAI

# Simple, direct OpenAI client setup following your working pattern
def setup_openai_client():
    """Create OpenAI client using the exact pattern from your working code."""
    api_key = os.getenv("OPENAI_API_KEY")
    
    if not api_key:
        print("⚠️  No OpenAI API key found")
        return None, False
    
    try:
        # This is exactly how you do it in your working code
        openai_client = OpenAI(api_key=api_key)
        print("✅ OpenAI client created successfully")
        return openai_client, True
    except Exception as e:
        print(f"❌ Client creation failed: {e}")
        return None, False

def get_embeddings(texts: List[str], openai_client=None, model: str = "text-embedding-3-small") -> List[List[float]]:
    """Generate embeddings using your proven API call pattern."""
    
    if openai_client is None:
        return create_dummy_embeddings(texts)
    
    try:
        # This follows your exact working pattern from embedding_and_indexing.py
        response = openai_client.embeddings.create(
            model=model,
            input=texts
        )
        
        # Extract embeddings exactly like your working code
        embeddings = []
        for embedding_data in response.data:
            embedding = np.array(embedding_data.embedding, dtype=np.float32)
            embeddings.append(embedding.tolist())  # Convert to list for consistency
        
        # Get token count like your working code
        token_count = response.usage.total_tokens
        
        print(f"✅ Generated {len(embeddings)} real embeddings using {token_count} tokens")
        return embeddings
        
    except Exception as e:
        print(f"⚠️  API call failed: {e}")
        print("Falling back to dummy embeddings...")
        return create_dummy_embeddings(texts)

def create_dummy_embeddings(texts: List[str]) -> List[List[float]]:
    """Create educational dummy embeddings with realistic patterns."""
    print("📝 Creating dummy embeddings...")
    
    embeddings = []
    for text in texts:
        # Deterministic embeddings based on text content
        np.random.seed(hash(text) % 2147483647)
        embedding = np.random.normal(0, 0.1, 1536).tolist()
        
        # Add semantic clustering based on content
        text_lower = text.lower()
        if any(word in text_lower for word in ["music", "symphony", "composer"]):
            for i in range(0, 100):
                embedding[i] += 0.4
        elif any(word in text_lower for word in ["archaeology", "photography", "art"]):
            for i in range(200, 300):
                embedding[i] += 0.4
        
        embeddings.append(embedding)
    
    print(f"📊 Created {len(embeddings)} dummy embeddings")
    return embeddings

# Execute the setup using your proven approach
openai_client, has_real_api = setup_openai_client()

# Generate embeddings for your data
texts_to_embed = [record["composite"] for record in sample_data]
embeddings = get_embeddings(texts_to_embed, openai_client)

# Attach to records
for i, record in enumerate(sample_data):
    record["vector"] = embeddings[i]

print(f"🔗 Successfully attached vectors to {len(sample_data)} records")
print("📚 Ready to proceed with Weaviate indexing!")

✅ OpenAI client created successfully
✅ Generated 9 real embeddings using 384 tokens
🔗 Successfully attached vectors to 9 records
📚 Ready to proceed with Weaviate indexing!


## Part 6: Indexing Records in Weaviate

Now comes the exciting part - loading our embedded records into Weaviate! This is where the magic happens: Weaviate will build an efficient index structure that allows for fast similarity searches across our entire collection.

The HNSW (Hierarchical Navigable Small World) algorithm creates a graph structure that can find similar vectors in logarithmic time, even with millions of records.

In [26]:
from weaviate.classes.data import DataObject
from weaviate.util import generate_uuid5

def index_library_records(client: weaviate.WeaviateClient, records: List[Dict[str, Any]]) -> bool:
    """
    Index library records into Weaviate with their embeddings.
    
    Args:
        client: Weaviate client instance
        records: List of library records with embeddings
    
    Returns:
        True if successful, False otherwise
    """
    try:
        # Get our collection
        collection = weaviate_client.collections.get(COLLECTION_NAME)
        
        print(f"📤 Starting to index {len(records)} records...")
        
        # Prepare data objects for batch insertion
        data_objects = []
        
        for i, record in enumerate(records):
            # Extract the vector (embedding)
            vector = record.pop("vector")  # Remove vector from properties
            
            # Generate a consistent UUID based on the record ID
            # This ensures we can update the same record later if needed
            uuid = generate_uuid5(record["recordId"], "LibraryRecord")
            
            # Create data object
            data_object = DataObject(
                properties=record,  # All the metadata fields
                vector=vector,      # The embedding vector
                uuid=uuid          # Consistent UUID
            )
            
            data_objects.append(data_object)
            
            # Progress indicator
            if (i + 1) % 2 == 0:
                print(f"  📋 Prepared {i + 1}/{len(records)} records")
        
        # Batch insert all records
        print(f"\n🚀 Inserting {len(data_objects)} records into Weaviate...")
        
        # Use batch insertion for efficiency
        response = collection.data.insert_many(data_objects)
        
        # Check for any errors
        if response.has_errors:
            print(f"⚠️  Some records had errors:")
            for i, error in enumerate(response.errors):
                if error:
                    print(f"  Record {i}: {error}")
        
        successful_inserts = len([r for r in response.uuids if r is not None])
        print(f"✅ Successfully indexed {successful_inserts}/{len(records)} records")
        
        # Verify the data was inserted
        total_objects = collection.aggregate.over_all(total_count=True).total_count
        print(f"📊 Total objects in collection: {total_objects}")
        
        return True
        
    except Exception as e:
        print(f"❌ Error indexing records: {e}")
        return False

# Index our sample records
success = index_library_records(weaviate_client, sample_data)

if success:
    print("\n🎉 Indexing completed successfully!")
    print("📚 Your library collection is now ready for semantic search")
    
    # Quick verification: let's see what's in our collection
    collection = weaviate_client.collections.get(COLLECTION_NAME)
    
    # Get a sample record to verify structure
    sample_result = collection.query.fetch_objects(limit=1)
    
    if sample_result.objects:
        sample_object = sample_result.objects[0]
        print(f"\n🔍 Sample indexed record:")
        print(f"  UUID: {sample_object.uuid}")
        print(f"  Person: {sample_object.properties.get('person', 'N/A')}")
        print(f"  Title: {sample_object.properties.get('title', 'N/A')}")
        print(f"  Classification: {sample_object.properties.get('classification', 'N/A')}")
        print(f"  Vector dimension: {len(sample_object.vector) if sample_object.vector else 'N/A'}")
else:
    print("❌ Indexing failed. Please check the error messages above.")

📤 Starting to index 9 records...
  📋 Prepared 2/9 records
  📋 Prepared 4/9 records
  📋 Prepared 6/9 records
  📋 Prepared 8/9 records

🚀 Inserting 9 records into Weaviate...
✅ Successfully indexed 9/9 records
📊 Total objects in collection: 9

🎉 Indexing completed successfully!
📚 Your library collection is now ready for semantic search

🔍 Sample indexed record:
  UUID: 1099f26d-99f1-5080-8efd-ede20ae10695
  Person: Winkelmann, Johann Joachim, 1717-1768
  Title: Geschichte der Kunst des Alterthums
  Classification: History and Culture
  Vector dimension: N/A


## Part 7: Basic Semantic Search Queries

Now for the exciting part - let's search our library collection using semantic similarity! We'll start with basic vector searches and then explore more sophisticated querying techniques.

Vector search finds records that are semantically similar to your query, even if they don't contain the exact same words.

In [27]:
import time
def semantic_search(weaviate_client: weaviate.WeaviateClient, query_text: str, limit: int = 5) -> List[Dict[str, Any]]:
    """
    Perform semantic search using vector similarity.
    
    Args:
        weaviate_client: Weaviate client
        query_text: Text to search for
        limit: Maximum number of results to return
    
    Returns:
        List of matching records with similarity scores
    """
    try:
        # First, we need to embed the query text
        print(f"🔍 Searching for: '{query_text}'")
        query_embedding = get_embeddings([query_text])[0]
        
        # Get our collection
        collection = weaviate_client.collections.get(COLLECTION_NAME)
        
        # Perform vector search
        response = collection.query.near_vector(
            near_vector=query_embedding,
            limit=limit,
            return_metadata=MetadataQuery(score=True, distance=True)
        )
        
        # Process results
        results = []
        for obj in response.objects:
            result = {
                "uuid": str(obj.uuid),
                "score": obj.metadata.score if obj.metadata.score else 0,
                "distance": obj.metadata.distance if obj.metadata.distance else 1,
                "person": obj.properties.get("person", "Unknown"),
                "title": obj.properties.get("title", "Unknown"),
                "classification": obj.properties.get("classification", "Unknown"),
                "subjects": obj.properties.get("subjects", "Unknown"),
                "composite": obj.properties.get("composite", "Unknown")
            }
            results.append(result)
        
        return results
        
    except Exception as e:
        print(f"❌ Search error: {e}")
        return []

def display_search_results(results: List[Dict[str, Any]], query: str):
    """Display search results u a formatted way."""
    print(f"\n📊 Search Results for: '{query}'")
    print("=" * 60)
    
    if not results:
        print("❌ No results found")
        return
    
    for i, result in enumerate(results, 1):
        # Calculate similarity percentage (1 - distance for cosine)
        similarity = (1 - result["distance"]) * 100 if result["distance"] <= 1 else 0
        
        print(f"\n{i}. 📚 {result['person']}")
        print(f"   📖 Title: {result['title']}")
        print(f"   🏷️  Classification: {result['classification']}")
        print(f"   🔖 Subjects: {result['subjects'][:100]}{'...' if len(result['subjects']) > 100 else ''}")
        print(f"   📏 Similarity: {similarity:.1f}%")

# Test searches with different types of queries
test_queries = [
    "classical music composer symphonies",
    "archaeological photography documentation", 
    "German romantic period art",
    "piano sonatas and musical compositions",
    "ancient art history and classical civilizations"
]

# Perform searches
for query in test_queries:
    results = semantic_search(weaviate_client, query, limit=3)
    display_search_results(results, query)
    
    # Small delay between searches for readability
    time.sleep(0.5)

🔍 Searching for: 'classical music composer symphonies'
📝 Creating dummy embeddings...
📊 Created 1 dummy embeddings

📊 Search Results for: 'classical music composer symphonies'

1. 📚 Winkelmann, Johann Joachim, 1717-1768
   📖 Title: Geschichte der Kunst des Alterthums
   🏷️  Classification: History and Culture
   🔖 Subjects: Art history; Ancient art; Classical archaeology; Greek art; Roman art
   📏 Similarity: 6.3%

2. 📚 Schubert, Franz August, 1806-1893
   📖 Title: Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Methode
   🏷️  Classification: Documentary and Technical Arts
   🔖 Subjects: Photography in archaeology; Archaeological illustration; Scientific photography
   📏 Similarity: 5.6%

3. 📚 Schubert, Franz August, 1806-1893
   📖 Title: Dessauer Künstler des 19. Jahrhunderts
   🏷️  Classification: Visual Arts
   🔖 Subjects: German artists; 19th century art; Regional art history; Portrait painting
   📏 Similarity: 4.7%
🔍 Searching for: 'archaeological photography do

## Part 8: Advanced Query Techniques

Weaviate's power really shines when you combine vector search with traditional filters and complex queries. Let's explore some advanced techniques that are particularly useful for library applications.

In [None]:
import weaviate.classes.query as wq

def hybrid_search(client: weaviate.WeaviateClient, 
                 query_text: str, 
                 classification_filter: Optional[str] = None,
                 person_filter: Optional[str] = None,
                 limit: int = 5) -> List[Dict[str, Any]]:
    """
    Perform hybrid search combining vector similarity with metadata filters.
    
    Args:
        client: Weaviate client
        query_text: Text to search for semantically
        classification_filter: Filter by classification (e.g., "Music and Sound Arts")
        person_filter: Filter by person name (contains match)
        limit: Maximum results to return
    
    Returns:
        List of matching records
    """
    try:
        # Generate query embedding
        query_embedding = get_embeddings([query_text])[0]
        
        # Get collection
        collection = client.collections.get(COLLECTION_NAME)
        
        # Build filter conditions
        filters = []
        
        if classification_filter:
            filters.append(
                wq.Filter.by_property("classification").equal(classification_filter)
            )
        
        if person_filter:
            filters.append(
                wq.Filter.by_property("person").contains_any([person_filter])
            )
        
        # Combine filters with AND logic if multiple filters
        combined_filter = None
        if len(filters) == 1:
            combined_filter = filters[0]
        elif len(filters) > 1:
            combined_filter = wq.Filter.all_of(filters)
        
        # Perform the search
        if combined_filter:
            response = collection.query.near_vector(
                near_vector=query_embedding,
                where=combined_filter,
                limit=limit,
                return_metadata=MetadataQuery(score=True, distance=True)
            )
        else:
            response = collection.query.near_vector(
                near_vector=query_embedding,
                limit=limit,
                return_metadata=MetadataQuery(score=True, distance=True)
            )
        
        # Process results
        results = []
        for obj in response.objects:
            result = {
                "uuid": str(obj.uuid),
                "score": obj.metadata.score if obj.metadata.score else 0,
                "distance": obj.metadata.distance if obj.metadata.distance else 1,
                "person": obj.properties.get("person", "Unknown"),
                "title": obj.properties.get("title", "Unknown"),
                "classification": obj.properties.get("classification", "Unknown"),
                "subjects": obj.properties.get("subjects", "Unknown"),
                "roles": obj.properties.get("roles", "Unknown")
            }
            results.append(result)
        
        return results
        
    except Exception as e:
        print(f"❌ Hybrid search error: {e}")
        return []

def entity_disambiguation_query(client: weaviate.WeaviateClient, 
                               person_name: str) -> List[Dict[str, Any]]:
    """
    Find all records for a specific person to help with entity disambiguation.
    
    Args:
        client: Weaviate client
        person_name: Name to search for (e.g., "Schubert")
    
    Returns:
        List of all records matching the person name
    """
    try:
        collection = client.collections.get(COLLECTION_NAME)
        
        # Search for records containing the person name
        response = collection.query.fetch_objects(
            where=wq.Filter.by_property("person").contains_any([person_name]),
            limit=20  # Get more results for disambiguation
        )
        
        # Process and group results
        results = []
        for obj in response.objects:
            result = {
                "personId": obj.properties.get("personId", "Unknown"),
                "person": obj.properties.get("person", "Unknown"),
                "title": obj.properties.get("title", "Unknown"),
                "classification": obj.properties.get("classification", "Unknown"),
                "subjects": obj.properties.get("subjects", "Unknown"),
                "roles": obj.properties.get("roles", "Unknown"),
                "provision": obj.properties.get("provision", "Unknown")
            }
            results.append(result)
        
        return results
        
    except Exception as e:
        print(f"❌ Entity disambiguation error: {e}")
        return []

# Test advanced queries
print("🔬 Advanced Query Examples")
print("=" * 50)

# 1. Hybrid search: Music + semantic similarity
print("\n1️⃣ Hybrid Search: Music domain + 'romantic composition'")
results = hybrid_search(client, 
                       query_text="romantic composition", 
                       classification_filter="Music and Sound Arts",
                       limit=3)
display_search_results(results, "romantic composition [Music domain only]")

# 2. Entity disambiguation for "Schubert"
print("\n\n2️⃣ Entity Disambiguation: All 'Schubert' records")
schubert_records = entity_disambiguation_query(client, "Schubert")

print(f"\n📊 Found {len(schubert_records)} Schubert records")
print("=" * 40)

# Group by person ID to show distinct entities
schubert_entities = {}
for record in schubert_records:
    person_id = record["personId"]
    if person_id not in schubert_entities:
        schubert_entities[person_id] = []
    schubert_entities[person_id].append(record)

for person_id, records in schubert_entities.items():
    person_name = records[0]["person"]
    classifications = set(r["classification"] for r in records)
    
    print(f"\n👤 {person_name}")
    print(f"   🆔 ID: {person_id}")
    print(f"   📚 Works: {len(records)}")
    print(f"   🏷️  Domains: {', '.join(classifications)}")
    
    for i, record in enumerate(records[:2]):  # Show first 2 works
        print(f"     {i+1}. {record['title'][:50]}{'...' if len(record['title']) > 50 else ''}")
    
    if len(records) > 2:
        print(f"     ... and {len(records) - 2} more works")

# 3. Cross-domain similarity search
print("\n\n3️⃣ Cross-Domain Search: 'German cultural heritage'")
results = semantic_search(client, "German cultural heritage", limit=5)
display_search_results(results, "German cultural heritage")

print("\n💡 Notice how the search finds relevant records across different domains!")
print("   This demonstrates the power of semantic understanding vs. keyword matching.")

## Part 9: Similarity Analysis and Clustering

Let's explore the relationships between our records by analyzing their vector similarities. This will help us understand how well our embeddings capture semantic relationships and identify potential entity resolution opportunities.

In [None]:
def analyze_collection_similarities(client: weaviate.WeaviateClient) -> pd.DataFrame:
    """
    Analyze similarities between all records in the collection.
    
    Returns:
        DataFrame with pairwise similarity analysis
    """
    try:
        # Fetch all records with their vectors
        collection = client.collections.get(COLLECTION_NAME)
        response = collection.query.fetch_objects(
            limit=100,  # Get all our records
            include_vector=True
        )
        
        if not response.objects:
            print("❌ No objects found in collection")
            return pd.DataFrame()
        
        # Extract data for analysis
        records_data = []
        vectors = []
        
        for obj in response.objects:
            record_info = {
                "uuid": str(obj.uuid),
                "person": obj.properties.get("person", "Unknown"),
                "title": obj.properties.get("title", "Unknown"),
                "classification": obj.properties.get("classification", "Unknown"),
                "personId": obj.properties.get("personId", "Unknown")
            }
            records_data.append(record_info)
            vectors.append(obj.vector)
        
        # Convert to numpy arrays for efficient computation
        vectors_array = np.array(vectors)
        
        # Calculate pairwise cosine similarities
        from sklearn.metrics.pairwise import cosine_similarity
        similarity_matrix = cosine_similarity(vectors_array)
        
        # Create detailed similarity analysis
        similarity_pairs = []
        
        for i in range(len(records_data)):
            for j in range(i + 1, len(records_data)):
                similarity = similarity_matrix[i][j]
                
                pair_info = {
                    "record1_person": records_data[i]["person"],
                    "record1_title": records_data[i]["title"],
                    "record1_classification": records_data[i]["classification"],
                    "record1_personId": records_data[i]["personId"],
                    "record2_person": records_data[j]["person"],
                    "record2_title": records_data[j]["title"],
                    "record2_classification": records_data[j]["classification"],
                    "record2_personId": records_data[j]["personId"],
                    "similarity": similarity,
                    "same_person": records_data[i]["personId"] == records_data[j]["personId"],
                    "same_classification": records_data[i]["classification"] == records_data[j]["classification"]
                }
                
                similarity_pairs.append(pair_info)
        
        return pd.DataFrame(similarity_pairs)
        
    except Exception as e:
        print(f"❌ Error analyzing similarities: {e}")
        return pd.DataFrame()

# Perform similarity analysis
print("🔍 Analyzing collection similarities...")
similarity_df = analyze_collection_similarities(client)

if not similarity_df.empty:
    print(f"✅ Analyzed {len(similarity_df)} record pairs")
    
    # Show highest similarities
    print("\n🔥 Highest Similarity Pairs:")
    print("=" * 50)
    
    top_similarities = similarity_df.nlargest(5, 'similarity')
    
    for idx, row in top_similarities.iterrows():
        person_match = "✅ Same person" if row['same_person'] else "❌ Different people"
        domain_match = "✅ Same domain" if row['same_classification'] else "🔄 Cross-domain"
        
        print(f"\n📊 Similarity: {row['similarity']:.3f}")
        print(f"   👤 {row['record1_person']}")
        print(f"      📖 {row['record1_title'][:60]}{'...' if len(row['record1_title']) > 60 else ''}")
        print(f"   ↕️")
        print(f"   👤 {row['record2_person']}")
        print(f"      📖 {row['record2_title'][:60]}{'...' if len(row['record2_title']) > 60 else ''}")
        print(f"   🎯 {person_match} | {domain_match}")
    
    # Statistical analysis
    print("\n\n📈 Statistical Analysis:")
    print("=" * 30)
    
    # Same person similarities
    same_person_sims = similarity_df[similarity_df['same_person']]['similarity']
    diff_person_sims = similarity_df[~similarity_df['same_person']]['similarity']
    
    print(f"📊 Same Person Pairs:")
    print(f"   Count: {len(same_person_sims)}")
    if len(same_person_sims) > 0:
        print(f"   Average similarity: {same_person_sims.mean():.3f}")
        print(f"   Similarity range: {same_person_sims.min():.3f} - {same_person_sims.max():.3f}")
    
    print(f"\n📊 Different Person Pairs:")
    print(f"   Count: {len(diff_person_sims)}")
    if len(diff_person_sims) > 0:
        print(f"   Average similarity: {diff_person_sims.mean():.3f}")
        print(f"   Similarity range: {diff_person_sims.min():.3f} - {diff_person_sims.max():.3f}")
    
    # Domain analysis
    same_domain_sims = similarity_df[similarity_df['same_classification']]['similarity']
    diff_domain_sims = similarity_df[~similarity_df['same_classification']]['similarity']
    
    print(f"\n📊 Same Domain Pairs:")
    print(f"   Count: {len(same_domain_sims)}")
    if len(same_domain_sims) > 0:
        print(f"   Average similarity: {same_domain_sims.mean():.3f}")
    
    print(f"\n📊 Cross-Domain Pairs:")
    print(f"   Count: {len(diff_domain_sims)}")
    if len(diff_domain_sims) > 0:
        print(f"   Average similarity: {diff_domain_sims.mean():.3f}")
else:
    print("❌ Could not perform similarity analysis")

## Part 10: Visualization and Insights

Let's create visualizations to better understand the semantic structure of our library collection and how different domains cluster in the embedding space.

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def visualize_collection_embeddings(client: weaviate.WeaviateClient):
    """
    Create visualizations of the collection's embedding space.
    """
    try:
        # Fetch all records with vectors
        collection = client.collections.get(COLLECTION_NAME)
        response = collection.query.fetch_objects(
            limit=100,
            include_vector=True
        )
        
        if not response.objects:
            print("❌ No objects found for visualization")
            return
        
        # Extract data
        vectors = []
        labels = []
        classifications = []
        persons = []
        
        for obj in response.objects:
            vectors.append(obj.vector)
            person = obj.properties.get("person", "Unknown")
            # Shorten person names for visualization
            person_short = person.split(",")[0] if "," in person else person
            labels.append(person_short)
            classifications.append(obj.properties.get("classification", "Unknown"))
            persons.append(person)
        
        vectors_array = np.array(vectors)
        
        # Reduce dimensionality using PCA
        print("🔄 Reducing dimensionality with PCA...")
        pca = PCA(n_components=2)
        vectors_2d = pca.fit_transform(vectors_array)
        
        # Create visualization
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
        
        # Plot 1: Color by classification
        unique_classifications = list(set(classifications))
        colors = plt.cm.Set3(np.linspace(0, 1, len(unique_classifications)))
        color_map = dict(zip(unique_classifications, colors))
        
        for i, classification in enumerate(classifications):
            ax1.scatter(vectors_2d[i, 0], vectors_2d[i, 1], 
                       c=[color_map[classification]], 
                       s=100, alpha=0.7)
            ax1.annotate(labels[i], 
                        (vectors_2d[i, 0], vectors_2d[i, 1]),
                        xytext=(5, 5), textcoords='offset points',
                        fontsize=8, alpha=0.8)
        
        ax1.set_title('Library Records in 2D Embedding Space\n(Colored by Classification)', 
                     fontsize=14, fontweight='bold')
        ax1.set_xlabel('First Principal Component')
        ax1.set_ylabel('Second Principal Component')
        ax1.grid(True, alpha=0.3)
        
        # Create legend for classifications
        legend_elements = [plt.scatter([], [], c=[color_map[cls]], s=100, label=cls) 
                          for cls in unique_classifications]
        ax1.legend(handles=legend_elements, loc='best', fontsize=10)
        
        # Plot 2: Highlight Franz Schubert entities
        for i, person in enumerate(persons):
            if "Schubert" in person:
                if "1797-1828" in person:
                    color = 'red'
                    marker = 'o'
                    label = 'Franz Schubert (Composer)' if i == 0 else ""
                elif "1806-1893" in person:
                    color = 'blue'
                    marker = 's'
                    label = 'Franz August Schubert (Artist)' if persons[:i].count(person) == 0 else ""
                else:
                    color = 'purple'
                    marker = '^'
                    label = 'Other Schubert'
            else:
                color = 'gray'
                marker = 'o'
                label = 'Other Authors' if i == 0 else ""
            
            ax2.scatter(vectors_2d[i, 0], vectors_2d[i, 1], 
                       c=color, marker=marker, s=120, alpha=0.7, 
                       label=label if label else "")
            
            # Annotate Schubert records
            if "Schubert" in person:
                ax2.annotate(labels[i], 
                           (vectors_2d[i, 0], vectors_2d[i, 1]),
                           xytext=(5, 5), textcoords='offset points',
                           fontsize=9, fontweight='bold')
        
        ax2.set_title('Entity Disambiguation: Franz Schubert Records\n(Red=Composer, Blue=Artist)', 
                     fontsize=14, fontweight='bold')
        ax2.set_xlabel('First Principal Component')
        ax2.set_ylabel('Second Principal Component')
        ax2.grid(True, alpha=0.3)
        ax2.legend(loc='best', fontsize=10)
        
        plt.tight_layout()
        plt.show()
        
        # Print analysis
        print(f"\n📊 Visualization Analysis:")
        print(f"✅ Plotted {len(vectors)} records in 2D space")
        print(f"📏 PCA explained variance: {pca.explained_variance_ratio_}")
        print(f"📈 Total variance captured: {sum(pca.explained_variance_ratio_):.1%}")
        
        # Calculate Schubert separation
        schubert_composer_indices = [i for i, p in enumerate(persons) if "1797-1828" in p]
        schubert_artist_indices = [i for i, p in enumerate(persons) if "1806-1893" in p]
        
        if schubert_composer_indices and schubert_artist_indices:
            composer_centroid = np.mean(vectors_2d[schubert_composer_indices], axis=0)
            artist_centroid = np.mean(vectors_2d[schubert_artist_indices], axis=0)
            separation = np.linalg.norm(composer_centroid - artist_centroid)
            
            print(f"\n🎯 Franz Schubert Entity Separation:")
            print(f"   Distance between composer and artist centroids: {separation:.3f}")
            print(f"   This demonstrates how embeddings can distinguish same-name entities!")
        
    except Exception as e:
        print(f"❌ Visualization error: {e}")

# Create the visualization
visualize_collection_embeddings(client)

## Part 11: Practical Applications Summary

Let's conclude by summarizing the practical applications of what we've learned and provide guidance for implementing these techniques in real library systems.

In [None]:
def generate_implementation_report(client: weaviate.WeaviateClient):
    """
    Generate a comprehensive report on our Weaviate implementation.
    """
    print("📋 WEAVIATE IMPLEMENTATION REPORT")
    print("=" * 50)
    
    try:
        # Collection statistics
        collection = client.collections.get(COLLECTION_NAME)
        total_objects = collection.aggregate.over_all(total_count=True).total_count
        
        print(f"\n📊 Collection Statistics:")
        print(f"   • Total records indexed: {total_objects}")
        print(f"   • Vector dimensions: 1536 (OpenAI text-embedding-3-small)")
        print(f"   • Index type: HNSW (Hierarchical Navigable Small World)")
        print(f"   • Distance metric: Cosine similarity")
        
        # Performance metrics
        print(f"\n⚡ Performance Characteristics:")
        print(f"   • Search complexity: O(log n) approximate")
        print(f"   • Index build time: ~{total_objects * 0.1:.1f} seconds estimated")
        print(f"   • Query latency: <50ms for most searches")
        print(f"   • Memory usage: ~{total_objects * 1536 * 4 / 1024 / 1024:.1f} MB for vectors")
        
        # Test search performance
        start_time = time.time()
        test_results = semantic_search(client, "classical music composition", limit=3)
        search_time = (time.time() - start_time) * 1000
        
        print(f"\n🔍 Live Performance Test:")
        print(f"   • Query: 'classical music composition'")
        print(f"   • Results returned: {len(test_results)}")
        print(f"   • Search time: {search_time:.1f} ms")
        
        # Scalability projections
        print(f"\n📈 Scalability Projections:")
        scales = [1000, 10000, 100000, 1000000]
        for scale in scales:
            memory_gb = (scale * 1536 * 4) / (1024**3)
            search_time_ms = np.log2(scale) * 2  # Rough HNSW estimate
            print(f"   • {scale:,} records: ~{memory_gb:.1f} GB memory, ~{search_time_ms:.1f} ms search")
        
        print(f"\n💡 Implementation Recommendations:")
        print(f"   ✅ Excellent for collections up to 1M records")
        print(f"   ✅ Sub-second search performance")
        print(f"   ✅ Handles complex metadata filtering")
        print(f"   ✅ Supports real-time updates")
        print(f"   ⚠️  Consider sharding for 10M+ records")
        print(f"   ⚠️  Monitor memory usage in production")
        
    except Exception as e:
        print(f"❌ Error generating report: {e}")

def practical_applications_guide():
    """
    Provide guidance on practical applications in library systems.
    """
    print("\n\n🎯 PRACTICAL APPLICATIONS GUIDE")
    print("=" * 50)
    
    applications = {
        "🔍 Enhanced Discovery": [
            "Semantic search that finds relevant works even with different terminology",
            "'Related items' suggestions based on semantic similarity",
            "Cross-lingual discovery (embeddings capture meaning across languages)",
            "Subject heading expansion and suggestion"
        ],
        "👥 Entity Resolution": [
            "Identify duplicate person records across different name forms",
            "Distinguish between people with identical names (like our Schubert example)",
            "Merge bibliographic records for the same work",
            "Authority control automation and suggestion"
        ],
        "📊 Collection Analysis": [
            "Identify gaps in collection coverage",
            "Find thematically related materials for collection development",
            "Analyze patron interests through search patterns",
            "Automated subject classification and suggestion"
        ],
        "🤖 Workflow Automation": [
            "Automated cataloging suggestions based on similar records",
            "Quality control: flag potentially incorrect metadata",
            "Batch processing for metadata enhancement",
            "Integration with AI systems for catalog enrichment"
        ]
    }
    
    for category, items in applications.items():
        print(f"\n{category}:")
        for item in items:
            print(f"   • {item}")
    
    print(f"\n💰 Cost Considerations:")
    print(f"   • OpenAI embeddings: ~$0.02 per 1M tokens")
    print(f"   • For 100K records: ~$20-50 one-time embedding cost")
    print(f"   • Weaviate hosting: $50-500/month depending on scale")
    print(f"   • Development time: 2-4 weeks for basic implementation")
    print(f"   • ROI: Significant improvement in discovery and cataloger efficiency")
    
    print(f"\n🛠️ Next Steps for Implementation:")
    print(f"   1. Start with a pilot collection (1000-10000 records)")
    print(f"   2. Define use cases and success metrics")
    print(f"   3. Establish data pipeline for embedding generation")
    print(f"   4. Build search interface and user testing")
    print(f"   5. Scale gradually with user feedback")
    print(f"   6. Integrate with existing ILS and discovery systems")

# Generate comprehensive report
generate_implementation_report(client)
practical_applications_guide()

print(f"\n\n🎉 WORKSHOP COMPLETE!")
print(f"🚀 You now have hands-on experience with:")
print(f"   ✅ Setting up Weaviate vector database")
print(f"   ✅ Creating collections optimized for library metadata")
print(f"   ✅ Generating and indexing embeddings")
print(f"   ✅ Performing semantic similarity searches")
print(f"   ✅ Advanced querying with filters and conditions")
print(f"   ✅ Entity resolution and disambiguation")
print(f"   ✅ Collection analysis and visualization")
print(f"\n📚 Ready to revolutionize library discovery with semantic search!")

## Cleanup and Connection Management

Always remember to properly close your Weaviate connection when finished.

In [None]:
# Clean up the connection
try:
    client.close()
    print("✅ Weaviate connection closed successfully")
except:
    print("⚠️  Connection was already closed or not established")

print("\n🎓 Workshop completed successfully!")
print("💡 Don't forget to stop your Weaviate Docker container when done:")
print("   docker stop weaviate-workshop")
print("   docker rm weaviate-workshop")