# Yale Entity Resolution: Vector Search and Subject Imputation with Weaviate

## 🎯 Introduction

This notebook demonstrates how to use the **Weaviate vector database** and **OpenAI embeddings** to help distinguish between entities with identical names but different domains of activity.

## 📚 Learning Objectives

1. **Vector Database Architecture**: How Weaviate stores and indexes text embeddings for semantic search at production scale
2. **Semantic Similarity Search**: Finding related entities through cosine similarity in high-dimensional embedding space
3. **Subject Imputation Strategy**: Using composite text similarity to fill missing subject fields via weighted centroid algorithms

## 🔬 Real-World Challenge: The Franz Schubert Problem

Yale's catalog contains multiple "Franz Schubert" entities:
- **Franz Schubert, 1806-1893** (artist) → Documentary and Technical Arts  
- **Franz Schubert, 1797-1828** (composer) → Music, Sound, and Sonic Arts

Similarly, "Jean Roberts" appears as:
- Medical researcher (health statistics)
- Literary scholar (drama criticism)  
- Political writer (economic policy)

**Our mission**: Use semantic embeddings to automatically classify and enhance these records.

## 🛠️ Technical Infrastructure

- **Vector Database**: Weaviate Cloud with HNSW indexing for sub-linear search performance
- **Embeddings**: OpenAI text-embedding-3-small (1,536 dimensions) for semantic understanding
- **Data Source**: Yale Library catalog records from Hugging Face
- **Imputation Method**: Hot-deck centroid algorithm for filling missing subject fields

## 📦 Step 1: Install Dependencies for Vector Search

We need several specialized libraries for this entity resolution pipeline:

- **`weaviate-client`**: Vector database client for storing and searching high-dimensional embeddings with production-grade HNSW indexing
- **`datasets`**: Hugging Face library for accessing Yale's public training data (2,539 real catalog records)  
- **`openai`**: Access to text-embedding-3-small model that powers Yale's semantic understanding
- **`pandas` & `numpy`**: Data manipulation and numerical operations for embedding calculations
- **`tqdm`**: Progress tracking for batch operations on large datasets

These components form Yale's production vector search infrastructure, handling millions of catalog records with sub-second query response times.

In [None]:
# Install required packages
!pip install openai weaviate-client datasets==3.2.0 pandas numpy tqdm

## 🔧 Step 2: Import Production Libraries

### Core Libraries
- **OpenAI**: Text embedding generation using the `text-embedding-3-small` model
- **Weaviate**: Vector database for semantic search with cosine similarity
- **Datasets**: Direct access to Yale's training data from the Hugging Face Hub

In [None]:
import os
from google.colab import userdata
import requests
import json
import random
import time
from typing import Dict, List, Tuple, Any
import hashlib
import pandas as pd
import numpy as np

from openai import OpenAI
from datasets import load_dataset
import weaviate
from weaviate.classes.config import Configure, Property, DataType, VectorDistances
from weaviate.classes.query import MetadataQuery, Filter
from weaviate.util import generate_uuid5
from tqdm import tqdm

RANDOM_SEED = 42

## 🔑 Step 3: Configure API Authentication

This step establishes secure connections to all services in Yale's vector search pipeline:

### Required API Keys
- **OpenAI API Key**: Access to the `text-embedding-3-small` model, which generates the 1,536-dimensional embeddings that Weaviate uses for semantic search
- **Weaviate Cloud Credentials**: URL and API key for the vector database that stores and queries entity embeddings with HNSW indexing
- **Hugging Face Token**: Download Yale's public, pre-labeled training dataset (2,539 labeled records)

Store your API keys securely in Colab's secrets panel (🔑 icon in sidebar) before running this cell.

In [None]:
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ["WCD_URL"] = userdata.get('WCD_URL')
os.environ["WCD_API_KEY"] = userdata.get('WCD_API_KEY')

## 🌐 Step 4: Connect to Weaviate Vector Database

This cell establishes a connection to the Weaviate Cloud cluster used for vector search.

Clusters and their API keys are created and managed in the Weaviate Cloud console: https://console.weaviate.cloud/

### Weaviate Cloud Setup
- **Cluster Connection**: Connect to hosted Weaviate instance with authentication
- **OpenAI Integration**: Pass API key for automated embedding generation
- **Production Headers**: Configure client for enterprise-grade operations

### Vector Database Benefits
- **HNSW Indexing**: Hierarchical Navigable Small World graphs for fast similarity search
- **Cosine Distance**: Semantic similarity metric optimized for text embeddings  
- **Horizontal Scaling**: Handle millions of vectors with consistent sub-second queries
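To make the distance metric concrete, here is a minimal pure-Python sketch of cosine similarity and the corresponding cosine distance (`1 - similarity`), which later cells convert back with `1 - distance`. It is an illustration only, not Weaviate's implementation, and the toy vectors are assumptions chosen for readability.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors have no direction
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance form of the same measure: 0.0 means same direction."""
    return 1.0 - cosine_similarity(a, b)


v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]        # same direction as v, twice the length
orthogonal = [3.0, 0.0, -1.0]   # dot product with v is zero

print(f"distance to scaled copy: {cosine_distance(v, scaled):.3f}")
print(f"distance to orthogonal:  {cosine_distance(v, orthogonal):.3f}")
```

The scaled copy sits at distance of about 0 and the orthogonal vector at distance 1, which is why cosine distance suits embeddings whose direction, rather than length, carries the meaning.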

### Connection Verification
The successful connection enables us to:
- Store 1,536-dimensional embeddings from OpenAI
- Perform subject imputation using vector similarity


In [None]:
# Connect to Weaviate
weaviate_api_key = os.environ.get("WCD_API_KEY")
openai_api_key = os.environ.get("OPENAI_API_KEY")
weaviate_url = os.environ.get("WCD_URL")

openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=weaviate.auth.AuthApiKey(weaviate_api_key),
    headers={"X-OpenAI-Api-Key": openai_api_key}  # For OpenAI vectorizer
)

print("✅ Connected to OpenAI and Weaviate!")

✅ Connected to OpenAI and Weaviate!


## 📚 Step 5: Load Yale Catalog Data

In [None]:
# Load from Hugging Face
print("📚 Loading Yale dataset...")
training_data = pd.DataFrame(load_dataset("timathom/yale-library-entity-resolver-training-data")["train"])

print(f"✅ Loaded {len(training_data):,} records")
print(f"   Sample: {training_data.iloc[0]['person']} - {training_data.iloc[0]['title'][:50]}...")

📚 Loading Yale dataset...


(…)ibrary-entity-resolver-training-data.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/2539 [00:00<?, ? examples/s]

✅ Loaded 2,539 records
   Sample: Schubert, Franz - Archäologie und Photographie: fünfzig Beispiele ...


## 🧠 Step 6: Embed Records

This function replicates Yale's exact production embedding generation from `embedding_and_indexing.py`:

### OpenAI Text-Embedding-3-Small Model
- **Dimensions**: 1,536-dimensional vectors optimized for semantic understanding
- **Model Performance**: OpenAI reports stronger retrieval benchmark results than for the earlier `text-embedding-ada-002` model
- **Cost Efficiency**: ~$0.02 per 1M tokens (OpenAI list price at the time of writing), enabling large-scale processing
- **Multilingual Support**: Handles German, English, and other European languages in Yale's catalog


In [None]:
def generate_embedding(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """
    Yale's production embedding function from embedding_and_indexing.py

    Args:
        text: Input text to embed
        model: OpenAI embedding model (text-embedding-3-small)

    Returns:
        1536-dimensional embedding vector
    """
    if not text or text.strip() == "":
        # Return zero vector for empty text
        return np.zeros(1536, dtype=np.float32)

    try:
        response = openai_client.embeddings.create(
            model=model,
            input=text
        )

        # Extract embedding from response
        embedding = np.array(response.data[0].embedding, dtype=np.float32)
        return embedding

    except Exception as e:
        print(f"❌ Error generating embedding: {e}")
        return np.zeros(1536, dtype=np.float32)

# Test the embedding function with real Yale data
test_composite = training_data.iloc[0]['composite']
test_embedding = generate_embedding(test_composite)
print(f"✅ Embedding generated successfully! Shape: {test_embedding.shape}")
print(f"   Sample values: {test_embedding[:5]}")
print(f"   Composite text: {test_composite[:80]}...")

✅ Embedding generated successfully! Shape: (1536,)
   Sample values: [ 0.01115062  0.02462124 -0.0213398   0.00958305 -0.04418446]
   Composite text: Title: Archäologie und Photographie: fünfzig Beispiele zur Geschichte und Meth...
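`generate_embedding` sends one string per request. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so a batched variant can reduce round trips when the notebook computes embeddings itself. The following is a sketch under stated assumptions: `chunked`, `embed_in_batches`, and the default `batch_size` of 100 are hypothetical and not part of the source pipeline, and `client` is expected to behave like `openai_client`.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(client, texts: Sequence[str],
                     model: str = "text-embedding-3-small",
                     batch_size: int = 100) -> List[List[float]]:
    """Embed texts in batches and return vectors in input order.

    `client` is assumed to expose `embeddings.create(model=..., input=[...])`
    like the OpenAI Python client. Empty strings should be filtered out
    beforehand, as `generate_embedding` does.
    """
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input within the batch
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

In the notebook this could be called as `embed_in_batches(openai_client, training_data['composite'].dropna().tolist())`. Error handling and retry logic are left out to keep the sketch short.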


## 🏗️ Step 7: Create Production Weaviate Schema

This function creates an `EntityString` collection schema for storing and querying entity embeddings:

### Schema Architecture  
- **Collection Name**: `EntityString` - standard collection for entity embeddings
- **Vectorizer**: `text2vec_openai` with automatic embedding generation via OpenAI API
- **Vector Dimensions**: 1,536 to match `text-embedding-3-small` model output

### HNSW Vector Index Configuration
- **ef=128**: Controls query accuracy vs. speed tradeoff (higher = more accurate)
- **max_connections=64**: Graph connectivity for optimal search performance  
- **ef_construction=128**: Build-time parameter for index quality
- **distance_metric=COSINE**: Optimal for normalized text embeddings

### Data Properties
- **original_string**: The actual text content (person name, composite text, title, subjects)
- **hash_value**: SHA-256 hash for deduplication and UUID generation
- **field_type**: Entity field classification (person, composite, title, subjects)
- **frequency**: Occurrence count for popularity-based ranking (set to 1 for every object in this demo)
- **personId/recordId**: Metadata for subject imputation workflows

### Production Benefits
This schema enables:
- **Sub-second similarity search** across millions of vectors
- **Automatic embedding generation** when inserting new text
- **Multi-field entity representation** (person names, titles, subjects separately indexed)
- **Subject imputation workflows** using personId linking


In [None]:
def create_entity_schema(client):
    """
    Create EntityString schema
    """
    try:
        # Delete the collection if it already exists so the schema is rebuilt from scratch
        if client.collections.exists("EntityString"):
            client.collections.delete("EntityString")
            print("🗑️ Deleted existing EntityString collection")

        # Create with exact production schema from embedding_and_indexing.py + metadata for imputation
        collection = client.collections.create(
            name="EntityString",
            description="Collection for entity string values with their embeddings",
            vectorizer_config=Configure.Vectorizer.text2vec_openai(
                model="text-embedding-3-small",
                dimensions=1536
            ),
            vector_index_config=Configure.VectorIndex.hnsw(
                ef=128,                    # Production config
                max_connections=64,        # Production config
                ef_construction=128,       # Production config
                distance_metric=VectorDistances.COSINE
            ),
            properties=[
                # Exact production schema
                Property(name="original_string", data_type=DataType.TEXT),
                Property(name="hash_value", data_type=DataType.TEXT),
                Property(name="field_type", data_type=DataType.TEXT),
                Property(name="frequency", data_type=DataType.INT),
                # Added for subject imputation demo
                Property(name="personId", data_type=DataType.TEXT),
                Property(name="recordId", data_type=DataType.TEXT)
            ]
        )

        print("✅ Created EntityString collection with schema")
        return collection

    except Exception as e:
        print(f"❌ Error creating schema: {e}")
        return None

# Create the schema
entity_collection = create_entity_schema(weaviate_client)

🗑️ Deleted existing EntityString collection
✅ Created EntityString collection with schema


## 🔐 Step 8: Generate SHA-256 Hashes for Deduplication

This step implements Yale's production deduplication strategy using cryptographic hashing:

### SHA-256 Hash Generation
- **Deterministic Deduplication**: Identical strings always produce identical hashes
- **Collision Resistance**: Cryptographically secure against hash conflicts
- **UTF-8 Encoding**: Handles multilingual catalog content (German, French, Latin)
- **Null Handling**: Empty/null values map to "NULL" string for consistent processing

### Field-Specific Hashing
Yale processes each entity field type separately:
- **person_hash**: Names and name variants (e.g., "Schubert, Franz" vs "Schubert, Franz, 1797-1828")
- **composite_hash**: Structured text combining title, subjects, provision information  
- **title_hash**: Work titles with normalization for cataloging variations
- **subjects_hash**: Subject headings and classifications (NULL for missing subjects)

### Production Benefits
- **UUID Generation**: Hashes enable deterministic UUIDs using `generate_uuid5()`
- **Duplicate Prevention**: Multiple records with identical content share single vector
- **Consistency**: Same hash always maps to same vector across different processing runs
- **Storage Optimization**: Eliminates redundant embeddings for repeated strings
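The determinism described above can be illustrated with the standard library alone. This sketch pairs SHA-256 hashing with Python's name-based `uuid.uuid5`; it is a stand-in for the idea, and the UUID values it produces should not be assumed to match those of Weaviate's `generate_uuid5` helper. The function names and the choice of `uuid.NAMESPACE_DNS` are assumptions made for this example.

```python
import hashlib
import uuid
from typing import Optional


def sha256_or_null(text: Optional[str]) -> str:
    """Hash text with SHA-256; map missing or empty values to the string "NULL"."""
    if text is None or text == "":
        return "NULL"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deterministic_uuid(hash_value: str, field_type: str) -> str:
    """Name-based UUID derived from hash and field type (namespace is an arbitrary choice)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{hash_value}_{field_type}"))


person_hash = sha256_or_null("Schubert, Franz")
print(len(person_hash))  # length of the hex digest
print(deterministic_uuid(person_hash, "person") == deterministic_uuid(person_hash, "person"))
print(deterministic_uuid(person_hash, "person") == deterministic_uuid(person_hash, "title"))
```

The same string always yields the same 64-character digest and the same UUID, while appending the field type keeps a person string and an identical title string from colliding on one UUID.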

### Deduplication Statistics
The hash analysis reveals:
- **189 unique person names** across 2,539 catalog records  
- **2,357 unique composite texts** showing rich content diversity
- **351 records missing subjects** (candidates for imputation)

This hashing strategy enables Yale to efficiently manage 17.6M+ catalog records while maintaining data integrity and preventing duplicate vector storage.

In [None]:
def generate_hash(text: str) -> str:
    """
    Generate SHA-256 hash for text (Yale's production method)
    """
    if not text or pd.isna(text):
        return "NULL"
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Generate hashes for all fields using Yale's production method
print("🔐 Generating SHA-256 hashes for all records...")

for i, row in training_data.iterrows():
    # Generate hashes for each field type (Yale's approach)
    person_hash = generate_hash(row['person'])
    composite_hash = generate_hash(row['composite'])
    title_hash = generate_hash(row['title'])
    subjects_hash = generate_hash(row['subjects']) if pd.notna(row['subjects']) else "NULL"

    # Store in dataframe
    training_data.at[i, 'person_hash'] = person_hash
    training_data.at[i, 'composite_hash'] = composite_hash
    training_data.at[i, 'title_hash'] = title_hash
    training_data.at[i, 'subjects_hash'] = subjects_hash

print("✅ Generated SHA-256 hashes for all records")
print(f"   Sample person hash: {training_data.iloc[0]['person_hash'][:16]}...")
print(f"   Sample composite hash: {training_data.iloc[0]['composite_hash'][:16]}...")

# Show hash distribution
print(f"\n📊 Hash Statistics:")
print(f"   Unique person hashes: {training_data['person_hash'].nunique()}")
print(f"   Unique composite hashes: {training_data['composite_hash'].nunique()}")
print(f"   NULL subjects hashes: {(training_data['subjects_hash'] == 'NULL').sum()}")

🔐 Generating SHA-256 hashes for all records...
✅ Generated SHA-256 hashes for all records
   Sample person hash: 6cb0f164412941e2...
   Sample composite hash: 324648e06f268fed...

📊 Hash Statistics:
   Unique person hashes: 189
   Unique composite hashes: 2357
   NULL subjects hashes: 351


## 📊 Step 9: Deduplicate Objects for Vector Indexing

This step prepares deduplicated entity objects for efficient vector database indexing:

### Deduplication Strategy
Yale processes each field type separately to prevent UUID conflicts:
- **person**: Individual names with personId/recordId linking for entity resolution
- **composite**: Rich text descriptions combining titles, subjects, provision information
- **title**: Work titles for semantic similarity matching
- **subjects**: Subject headings (excluding NULL values for imputation candidates)

### Object Structure  
Each unique object contains:
- **hash_value**: SHA-256 identifier for deterministic UUID generation
- **original_string**: The actual text content for embedding generation
- **field_type**: Entity field classification for filtered search queries
- **frequency**: Occurrence count (could be calculated for popularity ranking)
- **personId/recordId**: Metadata enabling subject imputation workflows

### Deduplication Results
Our processing reveals the data's natural structure:
- **189 unique person names** (high reuse - many authors appear multiple times)
- **2,357 unique composite texts** (diverse content across catalog)  
- **1,966 unique titles** (some title reuse across editions/translations)
- **1,599 unique subject headings** (rich vocabulary for subject imputation)

### Production Efficiency
This deduplication approach provides:
- **6,111 unique objects** instead of 9,805 raw field values (38% storage reduction)
- **No duplicate vectors** stored in Weaviate (prevents redundant computation)
- **Consistent UUIDs** across processing runs using deterministic hashing
- **Efficient queries** with field_type filtering for targeted search

The deduplicated objects maintain all necessary metadata for Yale's subject imputation workflow while optimizing vector database storage and performance.
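The storage figures quoted above follow from counts reported elsewhere in this notebook: 2,539 records, 351 of them without subjects, and 6,111 unique objects. A short arithmetic check, using only those numbers:

```python
total_records = 2539          # rows in the training data
missing_subjects = 351        # rows whose subjects hash is "NULL"
unique_objects_indexed = 6111 # objects left after deduplication

# person, composite, and title exist for every record; subjects only where present
raw_field_values = total_records * 3 + (total_records - missing_subjects)
reduction = 1 - unique_objects_indexed / raw_field_values

print(f"raw field values: {raw_field_values:,}")
print(f"reduction: {reduction:.1%}")
```

Most of the saving comes from the person field, where 2,539 values collapse to 189 unique names.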

In [None]:
print("\n🔄 Deduplicating data for indexing...")
unique_objects = []

# Process each field type separately to avoid duplicate UUIDs
field_types = ['person', 'composite', 'title', 'subjects']

for field_type in field_types:
    print(f"   Processing {field_type} field...")

    # Get hash and text columns
    hash_col = f"{field_type}_hash"
    text_col = field_type

    # Skip if field doesn't exist
    if text_col not in training_data.columns:
        continue

    # Filter out NULL hashes and get unique hash-text pairs with metadata
    field_data = training_data[training_data[hash_col] != "NULL"][[hash_col, text_col, 'personId', 'recordId']].drop_duplicates(subset=[hash_col])

    # Add to unique objects with personId and recordId for imputation
    for _, row in field_data.iterrows():
        unique_objects.append({
            'hash_value': row[hash_col],
            'original_string': str(row[text_col]),
            'field_type': field_type,
            'frequency': 1,  # Could be calculated if needed
            'personId': str(row['personId']) if pd.notna(row['personId']) else "",
            'recordId': str(row['recordId']) if pd.notna(row['recordId']) else ""
        })

print(f"✅ Created {len(unique_objects):,} unique objects for indexing")

# Show deduplication statistics
field_counts = {}
for obj in unique_objects:
    field_type = obj['field_type']
    field_counts[field_type] = field_counts.get(field_type, 0) + 1

print(f"\n📊 Unique objects by field type:")
for field_type, count in field_counts.items():
    print(f"   {field_type}: {count:,}")


🔄 Deduplicating data for indexing...
   Processing person field...
   Processing composite field...
   Processing title field...
   Processing subjects field...
✅ Created 6,111 unique objects for indexing

📊 Unique objects by field type:
   person: 189
   composite: 2,357
   title: 1,966
   subjects: 1,599
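The cell above hard-codes `frequency` to 1. If real occurrence counts were wanted, one possible approach (an assumption, not part of the source pipeline) is to count how often each hash occurs before duplicates are dropped. The sketch below uses `collections.Counter` on hypothetical toy rows; inside the notebook, `training_data[hash_col].value_counts()` would give the equivalent counts.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def count_frequencies(rows: List[Dict[str, str]], field: str) -> Counter:
    """Count occurrences of each non-empty field value, keyed by its SHA-256 hash."""
    return Counter(sha256_hex(row[field]) for row in rows if row.get(field))


# Hypothetical toy rows standing in for catalog records
rows = [
    {"person": "Schubert, Franz", "title": "Piano sonatas"},
    {"person": "Schubert, Franz", "title": "Winterreise"},
    {"person": "Roberts, Jean", "title": "Winterreise"},
]

person_frequency = count_frequencies(rows, "person")
print(person_frequency[sha256_hex("Schubert, Franz")])
print(person_frequency[sha256_hex("Roberts, Jean")])
```

Each count could then be written into the `frequency` property of the matching unique object instead of the constant 1.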


## 🚀 Step 10: Index Entities in Weaviate with Batch Processing

This step performs production-scale indexing of deduplicated entity objects into Weaviate:

### Batch Indexing Strategy
- **Dynamic Batching**: Weaviate optimizes batch sizes automatically for throughput
- **UUID Generation**: Deterministic UUIDs using `generate_uuid5(f"{hash_value}_{field_type}")`
- **Progress Tracking**: Real-time monitoring with tqdm for large datasets
- **Error Handling**: An exception raised while queuing one object is logged and the loop continues with the next object

### Vector Generation Process
For each unique object, Weaviate automatically:
1. **Extracts text** from `original_string` property
2. **Generates embedding** using OpenAI text-embedding-3-small API
3. **Stores vector** with 1,536 dimensions in HNSW index
4. **Associates metadata** (personId, recordId, field_type, hash_value)

### Production Performance
- **~400 objects/second** indexing rate in this run (6,111 objects in about 15 seconds)
- **Automatic retries** for transient API failures
- **Memory optimization** with dynamic batch sizing
- **Consistent UUIDs** prevent duplicate indexing across runs

### Index Verification  
The final verification confirms:
- **6,111 unique objects** successfully indexed
- **All field types represented** (person, composite, title, subjects)
- **Metadata preserved** for subject imputation workflows
- **Vector index ready** for semantic similarity queries

The indexed vectors are now ready for semantic search and subject imputation demonstrations.

In [None]:
def index_entities(collection, objects):
    """
    Index deduplicated Yale entity strings in Weaviate

    Args:
        collection: Target Weaviate collection
        objects: Deduplicated entity dicts (e.g., unique_objects)

    Returns:
        Number of objects indexed without a reported failure
    """
    print("🔄 Indexing Yale entity strings in Weaviate...")

    indexed_count = 0

    print("🚀 Indexing deduplicated data...")

    with collection.batch.dynamic() as batch:
        for obj in tqdm(objects, desc="Indexing unique objects"):
            try:
                # Generate UUID using production method (hash + field_type)
                uuid_input = f"{obj['hash_value']}_{obj['field_type']}"
                uuid = generate_uuid5(uuid_input)

                # Add to batch
                batch.add_object(
                    uuid=uuid,
                    properties={
                        "original_string": obj['original_string'],
                        "hash_value": obj['hash_value'],
                        "field_type": obj['field_type'],
                        "frequency": obj['frequency'],
                        "personId": obj['personId'],
                        "recordId": obj['recordId']
                    }
                )
                indexed_count += 1

            except Exception as e:
                print(f"❌ Error indexing {obj['field_type']}: {e}")

    # Server-side failures (e.g., vectorizer errors) are not raised by add_object;
    # they are collected on the batch and must be checked after the context exits
    failed_objects = collection.batch.failed_objects
    if failed_objects:
        print(f"⚠️ {len(failed_objects):,} objects failed during batch import")
        indexed_count -= len(failed_objects)

    print(f"✅ Successfully indexed {indexed_count:,} unique objects")

    return indexed_count

# Index our real Yale data
indexed_count = index_entities(entity_collection, unique_objects)

# Verify indexing
print(f"\n🔍 Verification:")
# "Expected records" counts field values before deduplication:
# person + composite + title for every record, plus subjects where present
print(f"   Expected records: {len(training_data) * 3 + training_data['subjects'].notna().sum()}")
print(f"   Actually indexed: {indexed_count}")

🔄 Indexing Yale entity strings in Weaviate...
🚀 Indexing deduplicated data...


Indexing unique objects: 100%|██████████| 6111/6111 [00:15<00:00, 399.19it/s]


✅ Successfully indexed 6,111 unique objects

🔍 Verification:
   Expected records: 9805
   Actually indexed: 6111


## 🔍 Step 11: Test Semantic Search Capabilities

This step demonstrates Weaviate's semantic search power using our indexed entity vectors:

### Semantic Query Processing
- **Query**: "classical compositions" (broad musical concept)
- **Vector Generation**: Convert query to 1,536-dimensional embedding
- **HNSW Search**: Find nearest neighbors using cosine similarity in vector space
- **Result Ranking**: Order by semantic similarity (higher = more related)
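HNSW is an approximate method: it aims to return roughly the same neighbors that an exhaustive comparison would, without scoring every vector. As a point of reference, the exhaustive version of the ranking step can be sketched in a few lines. The three-dimensional vectors and their labels below are hypothetical stand-ins for real 1,536-dimensional embeddings.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query: Sequence[float], indexed: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[float, str]]:
    """Score every indexed vector against the query and keep the k most similar."""
    scored = ((cosine_similarity(query, vec), label) for label, vec in indexed.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy index
indexed = {
    "piano sonatas": [0.9, 0.1, 0.0],
    "string quartets": [0.8, 0.2, 0.1],
    "health statistics": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

for score, label in top_k(query_vector, indexed, k=2):
    print(f"{label}: {score:.3f}")
```

The exhaustive scan costs one similarity computation per indexed vector for every query, which is the cost an HNSW graph index is designed to avoid on large collections.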

In [None]:
# Test semantic search
print("🔍 Testing semantic search...")
query = "classical compositions"

# Search
search_results = entity_collection.query.near_text(
    query=query,
    limit=5,
    return_properties=["original_string", "field_type", "hash_value"],
    return_metadata=["distance"]
)

print(f'\n🎼 Search results for "{query}":')
for i, obj in enumerate(search_results.objects, 1):
    props = obj.properties
    distance = obj.metadata.distance
    cosine_similarity = 1 - distance  # Convert distance to cosine similarity

    print(f"   {i}. {props['field_type']}: {props['original_string'][:60]}...")
    print(f"      Cosine Similarity: {cosine_similarity:.4f}")

# Check counts by field type
print(f"\n📊 Objects by field type:")
for field_type in ["person", "composite", "title", "subjects"]:
    result = entity_collection.aggregate.over_all(
        filters=Filter.by_property("field_type").equal(field_type),
        total_count=True
    )
    print(f"   {field_type}: {result.total_count:,}")

# Total count
result = entity_collection.aggregate.over_all(total_count=True)
print(f"\n📊 Total indexed: {result.total_count:,} objects")

🔍 Testing semantic search...

🎼 Search results for "classical compositions":
   1. subjects: Piano quartets; Piano quintets; Piano trios; Sonatas (Violin...
      Cosine Similarity: 0.4571
   2. composite: Title: Piano sonatas: D 557, D 575, D 894
Version of: Sonata...
      Cosine Similarity: 0.4496
   3. composite: Title: Piano sonatas: D 557, D 575, D 894
Subjects: Sonatas ...
      Cosine Similarity: 0.4458
   4. composite: Title: Piano sonatas: D 557, D 575, D 894
Related work: Sona...
      Cosine Similarity: 0.4454
   5. composite: Title: Piano sonatas: D 557, D 575, D 894
Related work: Sona...
      Cosine Similarity: 0.4428

📊 Objects by field type:
   person: 189
   composite: 2,357
   title: 1,966
   subjects: 1,599

📊 Total indexed: 6,111 objects


## 🎯 Step 12: Hot-Deck Subject Imputation

This demonstration shows a **hot-deck imputation methodology** for filling missing subject fields using semantic similarity:

### The Challenge: Missing Subject Information
Many catalog records lack subject classifications due to:
- **Incomplete cataloging** during original processing
- **Legacy records** from before systematic subject assignment  
- **Specialized materials** requiring domain expertise
- **Time constraints** in high-volume cataloging workflows

### Proposed Solution: Vector-Based Hot-Deck Imputation
**Hot-deck imputation** borrows values from similar records in the same dataset:

1. **Identify target record** with missing subjects
2. **Find semantically similar composite texts** using vector search
3. **Extract subjects from similar records** (donor records)
4. **Calculate weighted centroid** of subject embeddings
5. **Select best subject match** closest to centroid
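Steps 4 and 5 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the production algorithm: each donor's weight is taken to be its similarity to the target record, and the three-dimensional vectors, subject labels, and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (subject text, subject vector, similarity of the donor's composite to the target)
Donor = Tuple[str, Sequence[float], float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def weighted_centroid(vectors: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Step 4: average the subject vectors, weighting each by its donor's similarity."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dims = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dims)
    ]


def impute_subject(donors: Sequence[Donor]) -> Tuple[str, float]:
    """Step 5: return the donor subject whose vector lies closest to the centroid."""
    centroid = weighted_centroid([d[1] for d in donors], [d[2] for d in donors])
    scored = [(cosine_similarity(vec, centroid), subject) for subject, vec, _ in donors]
    best_score, best_subject = max(scored)
    return best_subject, best_score


# Hypothetical donors: two theatre-related subjects and one outlier
donors = [
    ("drama criticism", [0.9, 0.1, 0.0], 0.50),
    ("theatre history", [0.8, 0.3, 0.1], 0.48),
    ("health statistics", [0.0, 0.2, 0.9], 0.20),
]
subject, score = impute_subject(donors)
print(f"imputed subject: {subject} (similarity to centroid: {score:.3f})")
```

Because the imputed value is always an existing subject string from a donor record, the method stays a hot-deck approach: it borrows a real heading rather than generating a new one. The low-weight outlier pulls the centroid only slightly, so it has little chance of being selected.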

### Demonstration Record
- **PersonId**: demo#Agent100-99
- **Person**: Roberts, Jean  
- **Title**: "Literary analysis techniques in modern drama criticism"
- **Missing**: Subject classifications (what we'll impute!)



In [None]:
# Step 1: Introduce our target record (missing subjects)
print("📖 STEP 1: Our Target Record (Missing Subjects)")
print("-" * 45)
target_record = {
    "personId": "demo#Agent100-99",
    "person": "Roberts, Jean",
    "composite": "Title: Literary analysis techniques in modern drama criticism\\nProvision information: London: Academic Press, 1975",
    "title": "Literary analysis techniques in modern drama criticism",
    "subjects": None  # ← This is what we want to impute!
}

print(f"   📋 PersonId: {target_record['personId']}")
print(f"   👤 Person: {target_record['person']}")
print(f"   📚 Title: {target_record['title']}")
print(f"   📄 Composite: {target_record['composite']}")
print(f"   ❌ Subjects: None (this is what we need to find!)")

📖 STEP 1: Our Target Record (Missing Subjects)
---------------------------------------------
   📋 PersonId: demo#Agent100-99
   👤 Person: Roberts, Jean
   📚 Title: Literary analysis techniques in modern drama criticism
   📄 Composite: Title: Literary analysis techniques in modern drama criticism\nProvision information: London: Academic Press, 1975
   ❌ Subjects: None (this is what we need to find!)


## 🔍 Step 13: Finding Semantically Similar Records

This step performs the core vector search to find candidate donor records for subject imputation:

### Vector Search Process
1. **Query Construction**: Use complete composite text as search query
2. **Field Filtering**: Search only `composite` field types (not person names or titles)
3. **Similarity Ranking**: HNSW algorithm returns nearest neighbors by cosine similarity
4. **Candidate Selection**: Retrieve the 8 most similar composite texts (`limit=8`)

### Search Query Analysis
**Target composite**: the title "Literary analysis techniques in modern drama criticism" together with its provision information ("London: Academic Press, 1975")

This query seeks records about:
- **Literary analysis** (scholarly methodology)
- **Drama criticism** (theatrical/literary domain)  
- **Modern context** (contemporary approaches)

### Similarity Results Interpretation
The top candidates show semantic understanding:

1. **Dramatic Annals: Critiques on Plays and Performances** (Sim: 0.500)
   - Direct match: drama criticism and performance analysis
   
2. **The Modern Theatre; A Collection of Successful Modern Plays** (Sim: 0.479)
   - Strong match: modern theatre and dramatic works
   
3. **Playhouses, Theatres and Other Places of Public Amusement** (Sim: 0.450)
   - Related: theatrical contexts and performance venues

### Vector Search Effectiveness
- **Semantic understanding**: Finds conceptually related records, not just keyword matches
- **Domain relevance**: All top results relate to drama, theatre, and literary criticism
- **Academic context**: Identifies scholarly works about dramatic literature
- **Quality ranking**: Higher similarities correspond to more relevant content

This vector search provides the foundation for identifying records with subjects suitable for imputation to our target record.

In [None]:
print("🔍 STEP 2: Finding Similar Records")
print("-" * 35)
print("We search for composite texts that are semantically similar to our target...")
print(f"   🎯 Query: '{target_record['composite']}'")
print()

similar_composites = entity_collection.query.near_text(
    query=target_record['composite'],
    filters=Filter.by_property("field_type").equal("composite"),
    limit=8,
    return_properties=["original_string", "personId", "recordId"],
    return_metadata=MetadataQuery(distance=True)
)

print(f"   📊 Found {len(similar_composites.objects)} similar composite records:")
# Show the records we found
for i, obj in enumerate(similar_composites.objects, 1):
    similarity = 1.0 - obj.metadata.distance
    print(f"      {i}. Similarity: {similarity:.3f} - {obj.properties['original_string'][:70]}...")

🔍 STEP 2: Finding Similar Records
-----------------------------------
We search for composite texts that are semantically similar to our target...
   🎯 Query: 'Title: Literary analysis techniques in modern drama criticism\nProvision information: London: Academic Press, 1975'

   📊 Found 8 similar composite records:
      1. Similarity: 0.500 - Title: Dramatic Annals: Critiques on Plays and Performances. Vol 1. 17...
      2. Similarity: 0.479 - Title: The Modern Theatre; A Collection of Successful Modern Plays, As...
      3. Similarity: 0.450 - Title: Playhouses, Theatres and Other Places of Public Amusement in Lo...
      4. Similarity: 0.445 - Title: The Critic; or, A Tragedy Rehears'd
Subjects: Celebrity Culture...
      5. Similarity: 0.438 - Title: The saving lie: Harold Bloom and deconstruction
Subjects: Criti...
      6. Similarity: 0.423 - Title: Metalinguagem: ensaios de teoria e crítica literária
Subjects...
      7. Similarity: 0.421 - Title: Opinions and perspectives fro

## 📋 Step 14: Analyze Candidate Records for Subject Availability

This step examines each similar record to determine which ones have subjects available for imputation:

### Donor Record Qualification Process
For each semantically similar composite record:
1. **Extract PersonId**: Unique identifier linking to the other fields for the same entity
2. **Subject Lookup**: Query for subject fields associated with this PersonId  
3. **Availability Check**: Confirm subjects exist (not NULL or missing)
4. **Candidate Registration**: Add to donor pool if subjects are available

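### Hot-Deck Method
Hot-deck imputation fills a missing field by borrowing the value from similar "donor" records. Here it combines:
- **Centroid calculation**: a similarity-weighted centroid of the donors' subject vectors
- **Domain consistency**: all donor records relate to the same domain (here drama, theatre, and criticism)
- **Quality assurance**: a similarity threshold excludes weak matches
- **Confidence scoring**: based on donor pool size and similarity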
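As a minimal sketch of the weighted centroid idea, the example below filters donors by threshold, averages their subject vectors weighted by similarity, and picks the donor whose subject vector lies closest to that centroid. The donor records, the 2-dimensional vectors, the `subject_vector` key, and the function names are all hypothetical; in the notebook the subject embeddings would have to be retrieved from Weaviate, and Yale's production algorithm may differ in detail.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def weighted_centroid(vectors, weights):
    """Average the vectors component-wise, weighting each by its similarity score."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def impute_subject(donors, threshold=0.45):
    """Return the donor whose subject vector is closest to the weighted centroid.

    Returns None when no donor reaches the similarity threshold.
    """
    pool = [d for d in donors if d["similarity"] >= threshold]
    if not pool:
        return None
    centroid = weighted_centroid(
        [d["subject_vector"] for d in pool],
        [d["similarity"] for d in pool],
    )
    return max(pool, key=lambda d: cosine_similarity(d["subject_vector"], centroid))

# Toy donor pool (hypothetical values, not real catalog data)
toy_donors = [
    {"personId": "donor-A", "similarity": 0.50, "subjects": "toy subject A", "subject_vector": [1.0, 0.0]},
    {"personId": "donor-B", "similarity": 0.48, "subjects": "toy subject B", "subject_vector": [0.8, 0.6]},
    {"personId": "donor-C", "similarity": 0.46, "subjects": "toy subject C", "subject_vector": [0.0, 1.0]},
    {"personId": "donor-D", "similarity": 0.30, "subjects": "toy subject D", "subject_vector": [-1.0, 0.0]},
]

chosen = impute_subject(toy_donors)
print(chosen["personId"], chosen["subjects"])
```

In this toy pool the centroid favors the donor that sits between the others rather than the one with the single highest similarity, which is the main difference from simply taking the top match.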


In [None]:
# Step 3: Show candidate records and their similarity scores
print("📋 STEP 3: Candidate Records with Similarity Scores")
print("-" * 50)
candidates_with_subjects = []

for i, obj in enumerate(similar_composites.objects, 1):
    similarity = 1.0 - obj.metadata.distance
    person_id = obj.properties["personId"]
    record_id = obj.properties["recordId"]
    composite_text = obj.properties["original_string"]

    print(f"   {i}. Similarity: {similarity:.3f}")
    print(f"      PersonId: {person_id}")
    print(f"      Composite: {composite_text[:80]}...")

    # Check if this person has subjects (potential donor)
    subject_query = entity_collection.query.fetch_objects(
        filters=(
            Filter.by_property("personId").equal(person_id) &
            Filter.by_property("field_type").equal("subjects")
        ),
        return_properties=["original_string"],
        limit=1
    )

    if subject_query.objects:
        subject_text = subject_query.objects[0].properties["original_string"]
        print(f"      ✅ Has Subjects: {subject_text[:60]}...")
        candidates_with_subjects.append({
            'personId': person_id,
            'recordId': record_id,
            'similarity': similarity,
            'subjects': subject_text,
            'composite': composite_text
        })
    else:
        print("      ❌ No Subjects: Cannot use as donor")

📋 STEP 3: Candidate Records with Similarity Scores
--------------------------------------------------
   1. Similarity: 0.500
      PersonId: 13930523#Agent100-10
      Composite: Title: Dramatic Annals: Critiques on Plays and Performances. Vol 1. 1741-1785. C...
      ✅ Has Subjects: Celebrity Culture & Fashion; Business & Finance; Modes of Pe...
   2. Similarity: 0.479
      PersonId: 13933294#Agent700-39
      Composite: Title: The Modern Theatre; A Collection of Successful Modern Plays, As Acted at ...
      ✅ Has Subjects: Modes of Performance: Costume, Scenography & Spectacle; Cove...
   3. Similarity: 0.450
      PersonId: 13930526#Agent700-57
      Composite: Title: Playhouses, Theatres and Other Places of Public Amusement in London and i...
      ✅ Has Subjects: Celebrity Culture & Fashion; Business & Finance; Modes of Pe...
   4. Similarity: 0.445
      PersonId: 13932650#Agent100-10
      Composite: Title: The Critic; or, A Tragedy Rehears'd
Subjects: Celebrity Culture & Fas

In [None]:
print("📊 STEP 4: Understanding Similarity Scores")
print("-" * 42)
print(f"   🎯 Found {len(candidates_with_subjects)} potential donor records")
print("   📏 Similarity scores range from 0.0 (different) to 1.0 (identical)")
print("   🚪 Yale's threshold: 0.45 (only use candidates above this)")
print()

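# Filter candidates by threshold.
# Note: similarity is computed as 1 - distance. Assuming cosine distance (0.0 to 2.0),
# it can in principle fall below 0.0, although the scores here stay within the printed range.
threshold = 0.45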
good_candidates = [c for c in candidates_with_subjects if c['similarity'] >= threshold]
print(f"   ✅ Candidates above threshold ({threshold}): {len(good_candidates)}")

if good_candidates:
    print("   🏆 Best candidates for subject imputation:")
    for i, candidate in enumerate(good_candidates[:3], 1):
        print(f"      {i}. Similarity {candidate['similarity']:.3f}: {candidate['subjects'][:500]}...")
else:
    print("   ⚠️  No candidates above threshold - imputation not recommended")

📊 STEP 4: Understanding Similarity Scores
------------------------------------------
   🎯 Found 8 potential donor records
   📏 Similarity scores range from 0.0 (different) to 1.0 (identical)
   🚪 Yale's threshold: 0.45 (only use candidates above this)

   ✅ Candidates above threshold (0.45): 3
   🏆 Best candidates for subject imputation:
      1. Similarity 0.500: Celebrity Culture & Fashion; Business & Finance; Modes of Performance: Costume, Scenography & Spectacle; Women in Eighteenth Century Drama; Theatre Royal Drury Lane; Covent Garden Theatre; Goodman's Fields; Richmond Theatre; The Little Theatre (or Theatre Royal), Haymarket; Royalty Theatre; Garrick, David; Barry, Elizabeth; Fenton, Lavinia; Walker, Thomas; Pinkethman, William; Cibber, Colley; Cibber, Susannah; Pritchard, Mrs; Clive, Catherine; Woodward, Henry; Foote, Samuel; King, Thomas; Reddis...
      2. Similarity 0.479: Modes of Performance: Costume, Scenography & Spectacle; Covent Garden Theatre; The Little Theatre (or 

In [None]:
# Step 5: Demonstrate the hot-deck imputation process
print("🧮 STEP 5: Hot-Deck Imputation Process")
print("-" * 40)
if good_candidates:
    print("   🔄 Weighted centroid algorithm:")
    print("      1. Weight each candidate by similarity score")
    print("      2. Calculate centroid of subject embeddings")
    print("      3. Find subject closest to the centroid")
    print()

    # Simplified demonstration: rather than computing the weighted centroid
    # described above, select the single candidate with the highest similarity
    best_candidate = max(good_candidates, key=lambda x: x['similarity'])
    # Illustrative confidence only: the 0.85 scaling factor is an approximation
    confidence = best_candidate['similarity'] * 0.85

    print("   🎯 Selected Subject (highest similarity):")
    print(f"      📝 Subject: {best_candidate['subjects']}")
    print(f"      📊 Source Similarity: {best_candidate['similarity']:.3f}")
    print(f"      🎪 Confidence Score: {confidence:.3f}")
    print(f"      📋 Source PersonId: {best_candidate['personId']}")

🧮 STEP 5: Hot-Deck Imputation Process
----------------------------------------
   🔄 Weighted centroid algorithm:
      1. Weight each candidate by similarity score
      2. Calculate centroid of subject embeddings
      3. Find subject closest to the centroid

   🎯 Selected Subject (highest similarity):
      📝 Subject: Celebrity Culture & Fashion; Business & Finance; Modes of Performance: Costume, Scenography & Spectacle; Women in Eighteenth Century Drama; Theatre Royal Drury Lane; Covent Garden Theatre; Goodman's Fields; Richmond Theatre; The Little Theatre (or Theatre Royal), Haymarket; Royalty Theatre; Garrick, David; Barry, Elizabeth; Fenton, Lavinia; Walker, Thomas; Pinkethman, William; Cibber, Colley; Cibber, Susannah; Pritchard, Mrs; Clive, Catherine; Woodward, Henry; Foote, Samuel; King, Thomas; Reddish, Samuel; Quick, John; Barry, Spranger; Mattocks, Mrs; Miss Younge; Dibdin, Charles; Abington, Frances; Lewis, Charles Lee; Sheridan, Thomas; Cowley, Hannah; Mr Aickin; Siddons,

In [None]:
# Close connection when done
weaviate_client.close()