# Hybrid Matching

This notebook combines a deterministic, rules-based matching approach with an AI/ML (embedding-based) approach to reconcile incoming metadata for editions of Latin texts with the authority and work files in the DLL Catalog.

## Step 1: Prepare the Data

The data has already been preprocessed in the `data-preparation.ipynb` notebook. Now it needs to be loaded here and converted into the lookup dictionaries that will be used for the deterministic matching routine.

In [24]:
# Import the Pandas library for working with CSV data
import pandas as pd

# Read in the authors data
authors = pd.read_csv('../data/authors_db.csv',encoding='utf-8',quotechar='"')
# Read in the works data
works = pd.read_csv('../data/works_db.csv',encoding='utf-8',quotechar='"')

# Get basic information about the dataframes
authors.info()
works.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27290 entries, 0 to 27289
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Variant                  27290 non-null  object
 1   Authorized Name          27290 non-null  object
 2   DLL Identifier (Author)  27290 non-null  object
dtypes: object(3)
memory usage: 639.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5315 entries, 0 to 5314
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Title                    5315 non-null   object
 1   DLL Identifier (Work)    5315 non-null   object
 2   DLL Identifier (Author)  5315 non-null   object
dtypes: object(3)
memory usage: 124.7+ KB


In [25]:
# Change the names of the columns to be lower case without spaces or punctuation
authors = authors.rename(columns={'Variant':'variant_name','Authorized Name':'authorized_name','DLL Identifier (Author)':'dll_id_author'})
works = works.rename(columns={'Title':'title','DLL Identifier (Work)':'dll_id_work','DLL Identifier (Author)': 'dll_id_author'})

In [26]:
# Prepare the lookup dictionary of variant author names
variant_to_authorized = {
    row["variant_name"]: {
        "authorized_name": row["authorized_name"], 
        "author_id": row["dll_id_author"]
    }
    for _, row in authors.iterrows()
}

variant_to_authorized

{'herryson, joannes floruit=15th century a.d.': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'joannes herryson': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'john herryson': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'heryyson, joannes floruit=15th century a.d.': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'herryson, xoannes floruit=15th century a.d.': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'johannes stratford': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'john stratford': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'john stratford, 1275?-1348': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'stratford, johannes ca. 1275-1348': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'stratford, john ca 1275-1348': {'authorized_name': 'stratford, john, -1348',
  'author_id'

In [27]:
# Prepare the lookup dictionary for titles
title_to_work = {
    row["title"]: {
        "dll_id_work": row["dll_id_work"],
        "dll_id_author": row["dll_id_author"]
    }
    for _, row in works.iterrows()
}

title_to_work

{'de signis et symptomatibus aegritudinum': {'dll_id_work': 'W10655',
  'dll_id_author': 'A3919'},
 'de coniuratione porcaria dialogus': {'dll_id_work': 'W10654',
  'dll_id_author': 'A3221'},
 'alda': {'dll_id_work': 'W10653', 'dll_id_author': 'A4844'},
 'de viris illustribus': {'dll_id_work': 'W4469', 'dll_id_author': 'A4936'},
 'de philosophis': {'dll_id_work': 'W10651', 'dll_id_author': 'A4799'},
 'epigrammata super exilio': {'dll_id_work': 'W10650',
  'dll_id_author': 'A4655'},
 'porcaria': {'dll_id_work': 'W10649', 'dll_id_author': 'A3205'},
 'liber de curatione egritudinum partium totius corporis': {'dll_id_work': 'W10648',
  'dll_id_author': 'A3153'},
 'occupatio': {'dll_id_work': 'W1913', 'dll_id_author': 'A5021'},
 'liber senecae de moribus': {'dll_id_work': 'W10636',
  'dll_id_author': 'A4655'},
 'praefationes': {'dll_id_work': 'W10635', 'dll_id_author': 'A3873'},
 'orationes': {'dll_id_work': 'W335', 'dll_id_author': 'A3593'},
 'psyche et cupido': {'dll_id_work': 'W10632', '
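
Because `title_to_work` is keyed by title alone, two works that share a title would collapse into a single entry, with the later row silently overwriting the earlier one. Titles such as "de viris illustribus" are used by more than one Latin author, so this may be worth checking. The sketch below uses a small made-up sample rather than the real `works` data; `sample_works` and `title_to_candidates` are hypothetical names, and the identifiers are placeholders.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real DLL identifiers.
sample_works = pd.DataFrame(
    {
        "title": ["de viris illustribus", "de viris illustribus", "alda"],
        "dll_id_work": ["W0001", "W0002", "W0003"],
        "dll_id_author": ["A0001", "A0002", "A0003"],
    }
)

# Rows whose title occurs more than once would overwrite each other
# in a dictionary keyed by title alone.
duplicate_rows = sample_works[sample_works.duplicated(subset="title", keep=False)]
print(f"{duplicate_rows['title'].nunique()} title(s) shared by more than one work")

# Alternative lookup: map each title to a list of all candidate works.
title_to_candidates = {
    title: group[["dll_id_work", "dll_id_author"]].to_dict("records")
    for title, group in sample_works.groupby("title")
}
```

With a lookup shaped like this, the matched author ID could be used to choose among the candidates for a shared title.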

In [28]:
# Prepare lists of record dictionaries (one per row) for embedding the author names and titles
canonical_authors = authors.to_dict("records")
canonical_titles = works.to_dict("records")

## Prepare Embeddings

In [29]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the embedding model
embedding_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Extract canonical titles
canonical_titles = list(title_to_work.keys())

# Generate embeddings for canonical titles
title_embeddings = embedding_model.encode(canonical_titles)

# Store the embeddings with their respective titles
title_embeddings_dict = {
    title: embedding for title, embedding in zip(canonical_titles, title_embeddings)
}


## Prepare the FAISS Vector Stores

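Since there are many author names to keep track of, I'm going to store their embeddings in a vector index for easier and faster searching. The index itself lives in memory while the notebook runs, but it can be saved to disk and reloaded instead of being rebuilt each time.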

I'm using [FAISS (Facebook AI Similarity Search)](https://faiss.ai/) because it is reliable, open-source, and relatively easy to use. In previous versions of this experiment, I tried using [Chroma](https://www.trychroma.com/) and found that it was too buggy to use.
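
As a rough illustration of what a flat L2 index computes, the following NumPy sketch performs the same kind of exhaustive nearest-neighbor search by hand. It is not FAISS code: the vectors are made up and `brute_force_l2_search` is a hypothetical helper. It reports squared L2 distances, which is also what `IndexFlatL2` returns.

```python
import numpy as np

def brute_force_l2_search(stored, query, k=1):
    """Return (squared distances, positions) of the k stored vectors nearest to query."""
    diffs = stored - query  # broadcast the query against every stored row
    squared_distances = np.sum(diffs ** 2, axis=1)
    nearest = np.argsort(squared_distances)[:k]
    return squared_distances[nearest], nearest

# Made-up two-dimensional vectors standing in for embeddings.
stored_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]], dtype=np.float32)
query_vector = np.array([0.9, 1.2], dtype=np.float32)

distances, positions = brute_force_l2_search(stored_vectors, query_vector, k=2)
print(positions, distances)
```

The positions play the same role as the keys of `author_map` and `title_map`: they identify which stored entry each hit refers to.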

In [None]:
import faiss

# Generate author embeddings in a single batch and set up FAISS
author_embeddings = embedding_model.encode(list(variant_to_authorized.keys()))
author_embeddings = np.array(author_embeddings, dtype=np.float32)

dimension = author_embeddings.shape[1]
author_index = faiss.IndexFlatL2(dimension)
author_index.add(author_embeddings)

# Map index positions to author names
author_map = {i: name for i, name in enumerate(variant_to_authorized.keys())}

In [None]:
# Generate title embeddings in a single batch and set up FAISS
title_embeddings = embedding_model.encode(list(title_to_work.keys()))
title_embeddings = np.array(title_embeddings, dtype=np.float32)

dimension = title_embeddings.shape[1]
title_index = faiss.IndexFlatL2(dimension)
title_index.add(title_embeddings)

# Map index positions to titles
title_map = {i: title for i, title in enumerate(title_to_work.keys())}
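
If a bounded similarity score is wanted from an L2 index, one common option is to scale every vector to unit length before adding it. For unit vectors, the squared L2 distance `d` and the cosine similarity are related by `cosine = 1 - d / 2`. The NumPy sketch below checks that relationship on made-up vectors; `normalize_rows` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def normalize_rows(vectors):
    """Scale each row to unit length so squared L2 distance maps onto cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

# Made-up vectors standing in for two embeddings.
vectors = normalize_rows(np.array([[3.0, 4.0], [4.0, 3.0]], dtype=np.float32))
a, b = vectors

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

# For unit vectors: cosine similarity = 1 - (squared L2 distance) / 2
print(cosine, 1 - squared_l2 / 2)
```

With real embeddings, the same normalization would have to be applied to both the stored vectors and every query. FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place.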

### Save the vector stores to disk

In [None]:
# Save author vector store
faiss.write_index(author_index, "../author_index.faiss")

# Save title vector store
faiss.write_index(title_index, "../title_index.faiss")

print("FAISS indices saved to disk.")

import pickle

# Save author_map and title_map
with open("author_map.pkl", "wb") as f:
    pickle.dump(author_map, f)

with open("title_map.pkl", "wb") as f:
    pickle.dump(title_map, f)

print("Maps saved to disk.")

In [None]:
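# Code for loading the vector stores and maps, if needed
# (paths match the ones used when saving above; requires `import faiss` and `import pickle`)

# author_index = faiss.read_index("../author_index.faiss")
# title_index = faiss.read_index("../title_index.faiss")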

# with open("author_map.pkl", "rb") as f:
#     author_map = pickle.load(f)

# with open("title_map.pkl", "rb") as f:
#     title_map = pickle.load(f)
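
One way to keep the save and load locations in sync is to derive every path from a single directory variable. The sketch below shows the idea for the pickled maps only, with a temporary directory standing in for the real project folder; `save_map`, `load_map`, and `store_dir` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path

def save_map(mapping, path):
    """Pickle a position-to-name mapping to the given path."""
    with open(path, "wb") as f:
        pickle.dump(mapping, f)

def load_map(path):
    """Load a mapping previously written by save_map."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A temporary directory stands in for the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp)
    example_map = {0: "joannes herryson", 1: "john stratford"}
    save_map(example_map, store_dir / "author_map.pkl")
    restored_map = load_map(store_dir / "author_map.pkl")

print(restored_map == example_map)
```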

## Define Utility Functions

In [None]:
def deterministic_author_match(input_author):
    """Match author using deterministic lookups."""
    input_author_cleaned = input_author.lower().strip()
    author_info = variant_to_authorized.get(input_author_cleaned)
    if author_info:
        print(f"Deterministic author match: {author_info}")
        return author_info
    return None

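def tokenize_title(title):
    """Tokenize the title into lower-cased words, dropping surrounding punctuation."""
    import string
    tokens = [token.strip(string.punctuation).lower() for token in title.split()]
    return [token for token in tokens if token]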

def deterministic_title_match(input_title):
    """Match title using deterministic lookups or tokenized search."""
    input_title_cleaned = input_title.lower().strip()
    
    # Direct deterministic match
    title_info = title_to_work.get(input_title_cleaned)
    if title_info:
        print(f"Deterministic title match: {title_info}")
        return title_info

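    # Tokenized search: return the first input word that is itself a
    # one-word title in the lookup (multi-word titles are not found here)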
    input_tokens = tokenize_title(input_title)
    for token in input_tokens:
        if token in title_to_work:
            title_info = title_to_work[token]
            print(f"Tokenized title match: {title_info}")
            return title_info
    
    return None

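def embedding_author_match(input_author):
    """Fallback to embedding-based author matching."""
    # Lower-case the query to match the lower-cased names stored in the index
    input_embedding = embedding_model.encode([input_author.lower().strip()]).astype(np.float32)
    distances, indices = author_index.search(input_embedding, k=1)
    best_index = int(indices[0][0])
    best_match = author_map[best_index]
    # IndexFlatL2 returns squared L2 distances, which are unbounded, so
    # "1 - distance" is not a similarity. Score the hit with cosine similarity instead.
    query_vector = input_embedding[0]
    match_vector = author_index.reconstruct(best_index)
    norms = np.linalg.norm(query_vector) * np.linalg.norm(match_vector)
    similarity = float(np.dot(query_vector, match_vector) / norms)
    print(f"Embedding author match: {best_match} with similarity {similarity:.2f}")
    return variant_to_authorized.get(best_match) if similarity > 0.8 else None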

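def embedding_title_match(input_title):
    """Fallback to embedding-based title matching."""
    # Lower-case the query to match the lower-cased titles stored in the index
    input_embedding = embedding_model.encode([input_title.lower().strip()]).astype(np.float32)
    distances, indices = title_index.search(input_embedding, k=1)
    best_index = int(indices[0][0])
    best_match = title_map[best_index]
    # IndexFlatL2 returns squared L2 distances, which are unbounded, so
    # "1 - distance" is not a similarity. Score the hit with cosine similarity instead.
    query_vector = input_embedding[0]
    match_vector = title_index.reconstruct(best_index)
    norms = np.linalg.norm(query_vector) * np.linalg.norm(match_vector)
    similarity = float(np.dot(query_vector, match_vector) / norms)
    print(f"Embedding title match: {best_match} with similarity {similarity:.2f}")
    return title_to_work.get(best_match) if similarity > 0.8 else None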

def match_metadata(input_author, input_title):
    """Match metadata against canonical records."""
    # Author matching
    author_info = deterministic_author_match(input_author)
    if not author_info:
        author_info = embedding_author_match(input_author)
    if not author_info:
        author_info = {"authorized_name": "Unknown", "author_id": "Unknown"}
    
    # Title matching
    title_info = deterministic_title_match(input_title)
    if not title_info:
        title_info = embedding_title_match(input_title)
    if not title_info:
        title_info = {"dll_id_work": "Unknown", "dll_id_author": "Unknown"}
    
    # Combine results
    return {
        "author_info": author_info,
        "title_info": title_info
    }
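
The tokenized search in `deterministic_title_match` can only find titles that consist of a single word. A possible extension, sketched here on the assumption that the lookup keys are lower-cased titles, is to try every contiguous run of words, longest first. `find_title_phrase` and `sample_titles` are hypothetical names; the two sample entries are copied from the `title_to_work` output shown earlier.

```python
import string

def find_title_phrase(input_title, lookup):
    """Return (phrase, info) for the longest run of consecutive words found in lookup."""
    tokens = [t.strip(string.punctuation).lower() for t in input_title.split()]
    tokens = [t for t in tokens if t]
    # Try longer phrases first so a multi-word title wins over a one-word title.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            phrase = " ".join(tokens[start:start + length])
            if phrase in lookup:
                return phrase, lookup[phrase]
    return None

sample_titles = {
    "de philosophis": {"dll_id_work": "W10651", "dll_id_author": "A4799"},
    "alda": {"dll_id_work": "W10653", "dll_id_author": "A4844"},
}

print(find_title_phrase("Liber de Philosophis, cum notis", sample_titles))
```

Trying phrases longest first is a design choice rather than a requirement; it simply prefers the most specific title when several candidates are embedded in the same input string.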

## Process Incoming Metadata

In [None]:
# Example input metadata
incoming_metadata = [
    {"author": "Vergil", "title": "Libri Duodecim Aeneidos P. Vergilii Maronis, cum annotationibus"},
    {"author": "Joannes Herryson", "title": "De Philosophis"},
    {"author": "Unknown Author", "title": "Unknown Work"},
]

# Match each metadata record
for record in incoming_metadata:
    print("\nProcessing Record:", record)
    result = match_metadata(record["author"], record["title"])
    print("Matched Result:", result)
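
For larger batches, printed output becomes hard to review. One option is to flatten each result into a single row that can be collected into a table. The sketch below assumes results have the shape returned by `match_metadata`; `flatten_match` is a hypothetical helper, and the example result is written by hand from lookup entries shown earlier rather than produced by running the matcher.

```python
def flatten_match(record, result):
    """Flatten one match result into a single row for tabular review."""
    author_info = result["author_info"]
    title_info = result["title_info"]
    return {
        "input_author": record["author"],
        "input_title": record["title"],
        "authorized_name": author_info["authorized_name"],
        "author_id": author_info["author_id"],
        "dll_id_work": title_info["dll_id_work"],
        # True only when the matched work belongs to the matched author.
        "author_ids_agree": author_info["author_id"] != "Unknown"
        and author_info["author_id"] == title_info["dll_id_author"],
    }

# Hand-written example built from lookup entries shown earlier in the notebook.
example_record = {"author": "Joannes Herryson", "title": "De Philosophis"}
example_result = {
    "author_info": {"authorized_name": "herryson, joannes", "author_id": "A1868"},
    "title_info": {"dll_id_work": "W10651", "dll_id_author": "A4799"},
}

row = flatten_match(example_record, example_result)
print(row)
```

The `author_ids_agree` flag is an added check, not part of the original routine: because authors and titles are matched independently, a record can pair an author with a work attributed to someone else, as in this example.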