# Hybrid Matching

## Step 1: Prepare the Data

The data has already been preprocessed in the `data-preparation.ipynb` notebook. Now it needs to be loaded here and converted into the lookup dictionaries that will be used for the deterministic matching routine.

In [12]:
# Import the Pandas library for working with CSV data
import pandas as pd

# Read in the authors data
authors = pd.read_csv('../data/authors_db.csv',encoding='utf-8',quotechar='"')
# Read in the works data
works = pd.read_csv('../data/works_db.csv',encoding='utf-8',quotechar='"')

# Get basic information about the dataframes
authors.info()
works.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27290 entries, 0 to 27289
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Variant                  27290 non-null  object
 1   Authorized Name          27290 non-null  object
 2   DLL Identifier (Author)  27290 non-null  object
dtypes: object(3)
memory usage: 639.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5315 entries, 0 to 5314
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Title                    5315 non-null   object
 1   DLL Identifier (Work)    5315 non-null   object
 2   DLL Identifier (Author)  5315 non-null   object
dtypes: object(3)
memory usage: 124.7+ KB


In [13]:
# Change the names of the columns to be lower case without spaces or punctuation
authors = authors.rename(columns={'Variant':'variant_name','Authorized Name':'authorized_name','DLL Identifier (Author)':'dll_id_author'})
works = works.rename(columns={'Title':'title','DLL Identifier (Work)':'dll_id_work','DLL Identifier (Author)': 'dll_id_author'})

In [14]:
import re

def normalize_author_name(name):
    """Normalize author names for consistent matching."""
    # Convert to lowercase, strip whitespace, and remove punctuation
    # (anything that is not a word character or whitespace)
    normalized_name = re.sub(r"[^\w\s]", "", name.lower().strip())
    return re.sub(r"\s+", " ", normalized_name)  # Normalize multiple spaces
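
To make the effect of the normalization concrete, here is a small usage sketch. The input strings are hypothetical variants written for illustration (they are not read from `authors_db.csv`), and the function is repeated only so that the example runs on its own.

```python
import re

def normalize_author_name(name):
    """Same steps as above: lowercase, strip, drop punctuation, collapse spaces."""
    normalized_name = re.sub(r"[^\w\s]", "", name.lower().strip())
    return re.sub(r"\s+", " ", normalized_name)

# Hypothetical inputs, for illustration only
examples = [
    "Herryson, Joannes",
    "  JOANNES   HERRYSON ",
    "Stratford, John (ca. 1275-1348)",
]

normalized_examples = [normalize_author_name(name) for name in examples]
for raw, norm in zip(examples, normalized_examples):
    print(f"{raw!r} -> {norm!r}")
```

Word order is left untouched, so "Herryson, Joannes" and "Joannes Herryson" still produce different keys. That is why the lookup relies on the variants table listing both forms.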

In [15]:
# Prepare the lookup dictionary of variant author names
variant_to_authorized = {
    normalize_author_name(row["variant_name"]): {
        "authorized_name": row["authorized_name"], 
        "author_id": row["dll_id_author"]
    }
    for _, row in authors.iterrows()
}

variant_to_authorized

{'herryson joannes floruit15th century ad': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'joannes herryson': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'john herryson': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'heryyson joannes floruit15th century ad': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'herryson xoannes floruit15th century ad': {'authorized_name': 'herryson, joannes',
  'author_id': 'A1868'},
 'johannes stratford': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'john stratford': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'john stratford 12751348': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'stratford johannes ca 12751348': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'stratford john ca 12751348': {'authorized_name': 'stratford, john, -1348',
  'author_id': 'A1870'},
 'stratf
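
Because the dictionary is keyed on the normalized variant, two rows that normalize to the same string collapse into one entry, and the later row silently wins. The sketch below shows one way to check for that. The rows are made-up stand-ins for `authors.iterrows()` (the ID `A9999` is invented), so it says nothing about whether the real data contains such collisions.

```python
import re
from collections import defaultdict

def normalize_author_name(name):
    """Same normalization as in the notebook."""
    normalized_name = re.sub(r"[^\w\s]", "", name.lower().strip())
    return re.sub(r"\s+", " ", normalized_name)

# Hypothetical rows: (variant_name, dll_id_author)
rows = [
    ("Stratford, John", "A1870"),
    ("stratford john", "A1870"),    # same author, same key: harmless
    ("Stratford, John.", "A9999"),  # invented ID: different author, same key
    ("Joannes Herryson", "A1868"),
]

# Collect every author ID seen for each normalized key
ids_by_key = defaultdict(set)
for variant, author_id in rows:
    ids_by_key[normalize_author_name(variant)].add(author_id)

# Keys that point to more than one author are ambiguous
ambiguous = {key: sorted(ids) for key, ids in ids_by_key.items() if len(ids) > 1}
print(ambiguous)
```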

In [16]:
# Prepare the lookup dictionary for titles
title_to_work = {
    row["title"]: {
        "dll_id_work": row["dll_id_work"],
        "dll_id_author": row["dll_id_author"]
    }
    for _, row in works.iterrows()
}

title_to_work

{'de signis et symptomatibus aegritudinum': {'dll_id_work': 'W10655',
  'dll_id_author': 'A3919'},
 'de coniuratione porcaria dialogus': {'dll_id_work': 'W10654',
  'dll_id_author': 'A3221'},
 'alda': {'dll_id_work': 'W10653', 'dll_id_author': 'A4844'},
 'de viris illustribus': {'dll_id_work': 'W4469', 'dll_id_author': 'A4936'},
 'de philosophis': {'dll_id_work': 'W10651', 'dll_id_author': 'A4799'},
 'epigrammata super exilio': {'dll_id_work': 'W10650',
  'dll_id_author': 'A4655'},
 'porcaria': {'dll_id_work': 'W10649', 'dll_id_author': 'A3205'},
 'liber de curatione egritudinum partium totius corporis': {'dll_id_work': 'W10648',
  'dll_id_author': 'A3153'},
 'occupatio': {'dll_id_work': 'W1913', 'dll_id_author': 'A5021'},
 'liber senecae de moribus': {'dll_id_work': 'W10636',
  'dll_id_author': 'A4655'},
 'praefationes': {'dll_id_work': 'W10635', 'dll_id_author': 'A3873'},
 'orationes': {'dll_id_work': 'W335', 'dll_id_author': 'A3593'},
 'psyche et cupido': {'dll_id_work': 'W10632', '

In [17]:
# Prepare the dictionaries for embedding the author names and titles
canonical_authors = authors.to_dict("records")
canonical_titles = works.to_dict("records")

## Prepare Embeddings

In [6]:
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
import numpy as np

# Initialize the embedding model
embedding_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Extract canonical titles
canonical_titles = list(title_to_work.keys())

# Generate embeddings for canonical titles
title_embeddings = embedding_model.encode(canonical_titles)

# Store the embeddings with their respective titles
title_embeddings_dict = {
    title: embedding for title, embedding in zip(canonical_titles, title_embeddings)
}


## Prepare the FAISS Vector Stores

Since there are many author names to keep track of (27,290 variants), I'm going to store their embeddings in a vector store that can be searched quickly and saved to disk, instead of regenerating them and holding them in a Python dictionary every session.

I'm using [FAISS (Facebook AI Similarity Search)](https://faiss.ai/) because it is reliable, open-source, and relatively easy to use. In previous versions of this experiment, I tried using [Chroma](https://www.trychroma.com/) and found that it was too buggy to use.
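
For intuition about what a flat L2 index computes, the following is a pure-Python sketch of exact nearest-neighbour search. `IndexFlatL2` compares the query against every stored vector and reports squared Euclidean distances. The toy 3-dimensional vectors and the names attached to them are invented for illustration and are far smaller than real sentence embeddings.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nearest(query, vectors):
    """Return (index, squared distance) of the closest stored vector."""
    distances = [squared_l2(query, v) for v in vectors]
    best = min(range(len(vectors)), key=distances.__getitem__)
    return best, distances[best]

# Toy "embeddings" (invented), normalized to unit length
names = ["virgil", "ovid", "horace"]
vectors = [normalize(v) for v in ([1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.2, 0.0, 1.0])]
query = normalize([0.9, 0.3, 0.1])

best_index, best_distance = nearest(query, vectors)
cosine = sum(q * v for q, v in zip(query, vectors[best_index]))

print(names[best_index], round(best_distance, 4))
# For unit-length vectors: squared distance = 2 - 2 * cosine similarity
print(math.isclose(best_distance, 2 - 2 * cosine))
```

The identity in the last line only holds when both vectors have unit length, so whether an expression such as `1 - distance` behaves like a similarity depends on whether the embedding model's output is normalized.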

In [30]:
# import faiss

# # Generate author embeddings and set up FAISS
# author_embeddings = [embedding_model.encode(name) for name in variant_to_authorized.keys()]
# author_embeddings = np.array(author_embeddings, dtype=np.float32)

# dimension = author_embeddings.shape[1]
# author_index = faiss.IndexFlatL2(dimension)
# author_index.add(author_embeddings)

# # Map index positions to author names
# author_map = {i: name for i, name in enumerate(variant_to_authorized.keys())}

Note that it took 21m 36.6s to complete this step on my laptop's CPU.

In [31]:
# # Generate title embeddings and set up FAISS
# title_embeddings = [embedding_model.encode(title) for title in title_to_work.keys()]
# title_embeddings = np.array(title_embeddings, dtype=np.float32)

# dimension = title_embeddings.shape[1]
# title_index = faiss.IndexFlatL2(dimension)
# title_index.add(title_embeddings)

# # Map index positions to titles
# title_map = {i: title for i, title in enumerate(title_to_work.keys())}

It took 3m 10.7s to complete this step.

### Save the vector stores to disk

In [32]:
# # Save author vector store
# faiss.write_index(author_index, "../author_index.faiss")

# # Save title vector store
# faiss.write_index(title_index, "../title_index.faiss")

# print("FAISS indices saved to disk.")

# import pickle

# # Save author_map and title_map
# with open("../author_map.pkl", "wb") as f:
#     pickle.dump(author_map, f)

# with open("../title_map.pkl", "wb") as f:
#     pickle.dump(title_map, f)

# print("Maps saved to disk.")

FAISS indices saved to disk.
Maps saved to disk.


## Utility Functions

In [18]:
# Import LatinCy for parsing title strings
import spacy

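# Load the LatinCy spaCy model
try:
    nlp = spacy.load('la_core_web_lg')
except OSError as err:
    # The model must be installed separately (see the LatinCy documentation),
    # so fail with a clear message rather than continuing without it.
    raise OSError(
        "LatinCy model 'la_core_web_lg' is not installed. "
        "Install it, then re-run this cell."
    ) from err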

from spacy.lang.la import STOP_WORDS

# Augment the Latin stop words list
custom_stop_words = {"liber", "libri", "libro", "librum", "librorum", "libris", "libros"}
all_stop_words = STOP_WORDS.union(custom_stop_words)

def deterministic_author_match(input_author):
    """Match author using deterministic lookups."""
    input_author_normalized = normalize_author_name(input_author)
    author_info = variant_to_authorized.get(input_author_normalized)
    if author_info:
        print(f"Deterministic author match: {author_info}")
        return author_info
    return None


def tokenize_title(title):
    """Tokenize the title using LatinCy with robust custom lemmatization rules and augmented stop words."""
    doc = nlp(title.lower().strip())
    tokens = []
    for token in doc:
        # Correct lemmatization for specific cases
        # (the title is lowercased above, so compare against lowercase forms)
        lemma = token.lemma_
        if lemma in {"aeneidus", "aeneus"} or token.text in {"aeneidos", "aeneis"}:
            lemma = "aeneis"
        # Skip stop words (including custom stop words)
        if token.is_alpha and lemma not in all_stop_words:
            tokens.append(lemma)
    return tokens

def deterministic_title_match(input_title):
    """Match title using deterministic lookups or tokenized search."""
    input_title_cleaned = input_title.lower().strip()

    # Direct deterministic match
    title_info = title_to_work.get(input_title_cleaned)
    if title_info:
        # Copy the entry so the shared lookup dictionary is not modified
        title_info = {**title_info, "title": input_title_cleaned}
        print(f"Deterministic title match: {title_info}")
        return title_info

    # Tokenized search: return the first token that is itself a canonical title
    input_tokens = tokenize_title(input_title)
    for token in input_tokens:
        if token in title_to_work:
            # Copy the entry and explicitly set the title key
            title_info = {**title_to_work[token], "title": token}
            print(f"Tokenized title match: {title_info}")
            return title_info

    return None

def embedding_author_match(input_author):
    """Fallback to embedding-based author matching."""
    input_embedding = embedding_model.encode([input_author]).astype(np.float32)
    distances, indices = author_index.search(input_embedding, k=1)
    best_index = indices[0][0]
    best_match = author_map[best_index]
    similarity = 1 - distances[0][0]  # Heuristic score: IndexFlatL2 returns squared L2 distances, so this is not a true cosine similarity
    print(f"Embedding author match: {best_match} with similarity {similarity:.2f}")
    
    # Lower the threshold for fallback, if necessary
    return variant_to_authorized.get(best_match) if similarity > 0.75 else None

def embedding_title_match(input_title):
    """Fallback to embedding-based title matching."""
    input_embedding = embedding_model.encode([input_title]).astype(np.float32)
    distances, indices = title_index.search(input_embedding, k=1)
    best_index = indices[0][0]
    best_match = title_map[best_index]
    similarity = 1 - distances[0][0]  # Heuristic score from the squared L2 distance
    if similarity > 0.8:  # Confidence threshold
        title_info = title_to_work.get(best_match, {"dll_id_work": "Unknown", "dll_id_author": "Unknown"})
        title_info = {**title_info, "title": best_match}  # Copy the entry, then set the title key
        print(f"Embedding title match: {best_match} with similarity {similarity:.2f}")
        return title_info
    return None

def match_metadata(input_author, input_title):
    """Match metadata against canonical records."""
    # Author matching
    author_info = deterministic_author_match(input_author)
    if not author_info:
        author_info = embedding_author_match(input_author)
    if not author_info:
        author_info = {"authorized_name": "Unknown", "author_id": "Unknown"}
    
    # Title matching
    title_info = deterministic_title_match(input_title)
    if not title_info:
        title_info = embedding_title_match(input_title)
    if not title_info:
        title_info = {"title": "Unknown", "dll_id_work": "Unknown", "dll_id_author": "Unknown"}
    
    # Combine results
    return {
        "author_info": author_info,
        "title_info": title_info
    }
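
The matchers above return the bare lookup entry, so a caller cannot tell which stage produced a result. One possible way to carry that information is sketched below with toy stand-ins: the lookup table, the `fake_embedding_match` function, and its scores are all hypothetical, and the 0.75 threshold simply mirrors the one used for authors.

```python
def tag(info, method, similarity):
    """Return a copy of a lookup entry annotated with how it was matched."""
    return {**info, "match_method": method, "similarity": similarity}

# Toy lookup table standing in for variant_to_authorized
lookup = {
    "joannes herryson": {"authorized_name": "herryson, joannes", "author_id": "A1868"},
    "virgil": {"authorized_name": "virgil", "author_id": "A4830"},
}

def fake_embedding_match(name):
    """Hypothetical stand-in for a vector search: returns (best_key, score)."""
    return ("virgil", 0.91) if name.lower().startswith("verg") else ("virgil", 0.10)

def match_author(name, threshold=0.75):
    """Try the exact lookup first, then the (fake) embedding search."""
    key = name.lower().strip()
    if key in lookup:
        return tag(lookup[key], "deterministic", 1.0)
    best_key, score = fake_embedding_match(name)
    if score > threshold:
        return tag(lookup[best_key], "embedding", score)
    return tag({"authorized_name": "Unknown", "author_id": "Unknown"}, "unknown", 0.0)

results = {name: match_author(name) for name in ["Joannes Herryson", "Vergil", "Unknown Author"]}
for name, result in results.items():
    print(name, "->", result["match_method"], result["similarity"])
```

Because `tag` builds a new dictionary each time, the entries in the lookup table stay unchanged no matter how many records are processed.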

## Load the Vector Stores

In [19]:
# Load the saved vector stores (required while the cells that build them are commented out)
import faiss
import pickle

author_index = faiss.read_index("../author_index.faiss")
title_index = faiss.read_index("../title_index.faiss")

with open("../author_map.pkl", "rb") as f:
    author_map = pickle.load(f)

with open("../title_map.pkl", "rb") as f:
    title_map = pickle.load(f)

In [20]:
# Example input metadata
incoming_metadata = [
    {"author": "Vergil", "title": "Libri Duodecim Aeneidos P. Vergilii Maronis, cum annotationibus"},
    {"author": "Joannes Herryson", "title": "De Philosophis"},
    {"author": "Unknown Author", "title": "Unknown Work"},
]

# Match each metadata record
for record in incoming_metadata:
    print("\nProcessing Record:", record)
    result = match_metadata(record["author"], record["title"])
    print("Matched Result:", result)


Processing Record: {'author': 'Vergil', 'title': 'Libri Duodecim Aeneidos P. Vergilii Maronis, cum annotationibus'}
Embedding author match: virgil with similarity 0.91
Tokenized title match: {'dll_id_work': 'W3809', 'dll_id_author': 'A4830', 'title': 'aeneis'}
Matched Result: {'author_info': {'authorized_name': 'virgil', 'author_id': 'A4830'}, 'title_info': {'dll_id_work': 'W3809', 'dll_id_author': 'A4830', 'title': 'aeneis'}}

Processing Record: {'author': 'Joannes Herryson', 'title': 'De Philosophis'}
Deterministic author match: {'authorized_name': 'herryson, joannes', 'author_id': 'A1868'}
Deterministic title match: {'dll_id_work': 'W10651', 'dll_id_author': 'A4799', 'title': 'de philosophis'}
Matched Result: {'author_info': {'authorized_name': 'herryson, joannes', 'author_id': 'A1868'}, 'title_info': {'dll_id_work': 'W10651', 'dll_id_author': 'A4799', 'title': 'de philosophis'}}

Processing Record: {'author': 'Unknown Author', 'title': 'Unknown Work'}
Embedding author match: hrots

In [34]:
def process_metadata(input_df):
    """Process input dataframe and match metadata."""
    matched_authors = []
    authorized_names = []
    dll_id_authors = []
    author_match_similarities = []
    author_match_methods = []

    matched_titles = []
    dll_id_works = []
    title_match_similarities = []
    title_match_methods = []

    for _, row in input_df.iterrows():
        # Author processing
        input_author = row["author"]
        author_info = deterministic_author_match(input_author) or embedding_author_match(input_author)
        if not author_info:
            # If no match found, set a default fallback
            author_info = {
                "authorized_name": "Unknown",
                "author_id": "Unknown",
                "similarity": 0.0,
            }
            match_method = "unknown"
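        else:
            # The matchers do not set "match_method" or "similarity" on their results,
            # so every author match falls back to the "deterministic" label here.
            match_method = author_info.get("match_method", "deterministic")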

        matched_authors.append(input_author)
        authorized_names.append(author_info["authorized_name"])
        dll_id_authors.append(author_info["author_id"])
        author_match_similarities.append(author_info.get("similarity", 0.0))
        author_match_methods.append(match_method)

        # Title processing
        input_title = row["title"]
        title_info = deterministic_title_match(input_title) or embedding_title_match(input_title)
        if not title_info:
            title_info = {"title": "Unknown", "dll_id_work": "Unknown", "dll_id_author": "Unknown"}
            title_similarity = 0.0
            title_match_method = "unknown"
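        else:
            # The matchers do not set "similarity" or "match_method" on their results,
            # so every title match falls back to 1.0 and "deterministic" here.
            title_similarity = title_info.get("similarity", 1.0)
            title_match_method = title_info.get("match_method", "deterministic")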

        matched_titles.append(title_info["title"])
        dll_id_works.append(title_info["dll_id_work"])
        title_match_similarities.append(title_similarity)
        title_match_methods.append(title_match_method)

    # Create the output DataFrame
    output_df = input_df.copy()
    output_df["authorized_name"] = authorized_names
    output_df["dll_id_author"] = dll_id_authors
    output_df["author_match_similarity"] = author_match_similarities
    output_df["author_match_method"] = author_match_methods

    output_df["matched_title"] = matched_titles
    output_df["dll_id_work"] = dll_id_works
    output_df["title_match_similarity"] = title_match_similarities
    output_df["title_match_method"] = title_match_methods

    return output_df


# Example usage
input_data = {
    "author": ["Vergil", "P. Vergilius Maro"],
    "title": ["Libri Duodecim Aeneidos P. Vergilii Maronis, cum annotationibus", "Aeneis"],
    "publisher": ["Publisher A", "Publisher B"],
    "place": ["Place A", "Place B"],
    "year": ["1501", "1502"],
    "url": ["http://example.com/aeneis1", "http://example.com/aeneis2"]
}

input_df = pd.DataFrame(input_data)
output_df = process_metadata(input_df)


Embedding author match: virgil with similarity 0.91
Tokenized title match: {'dll_id_work': 'W3809', 'dll_id_author': 'A4830', 'title': 'aeneis'}
Embedding author match: mariano vittori with similarity 0.65
Deterministic title match: {'dll_id_work': 'W3809', 'dll_id_author': 'A4830', 'title': 'aeneis'}


In [26]:
output_df

Unnamed: 0,author,title,publisher,place,year,url,authorized_name,dll_id_author,author_match_similarity,author_match_method,matched_title,dll_id_work,title_match_similarity,title_match_method
0,Vergil,"Libri Duodecim Aeneidos P. Vergilii Maronis, c...",Publisher A,Place A,1501,http://example.com/aeneis1,virgil,A4830,0.0,embedded,aeneis,W3809,1.0,deterministic
1,P. Vergilius Maro,Aeneis,Publisher B,Place B,1502,http://example.com/aeneis2,virgil,A4830,1.0,title_fallback,aeneis,W3809,1.0,deterministic


In [27]:
hathi = pd.read_csv('../data/hathi.csv',encoding='utf-8',quotechar='"')

In [29]:
hathi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24799 entries, 0 to 24798
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   author     23835 non-null  object
 1   title      24799 non-null  object
 2   publisher  24788 non-null  object
 3   place      24799 non-null  object
 4   year       24799 non-null  int64 
 5   url        24799 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.1+ MB
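
The `author` column has fewer non-null values than the other columns (23,835 of 24,799), and `normalize_author_name` calls `.lower()` on its input, which fails on a missing value. A minimal sketch of one way to handle this before matching is shown below. The miniature DataFrame is invented, and filling with the string `"Unknown"` is an assumption.

```python
import pandas as pd

# Invented miniature stand-in for the HathiTrust metadata
sample = pd.DataFrame({
    "author": ["Caesar, Julius", None, "Laet, Joannes de, 1593-1649."],
    "title": ["De Bello Gallico", "Collectanea latina", "Novus orbis"],
})

# Count missing authors, then replace them with a placeholder string
missing_before = int(sample["author"].isna().sum())
sample["author"] = sample["author"].fillna("Unknown")

print(missing_before, sample["author"].tolist())
```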



In [35]:
output_df = process_metadata(hathi)

Embedding author match: nansius, franciscus, 1525?-1595 with similarity 0.72
Embedding author match: meizter, ferdinand, 1828-1915 with similarity 0.48
Embedding author match: rutgers, johannes, 1589-1625. with similarity 0.76
Embedding author match: juliano, emperador de roma with similarity 0.64
Embedding author match: nemezjanus with similarity 0.43
Embedding author match: meursius, johannes, 1579-1639 with similarity 0.69
Tokenized title match: {'dll_id_work': 'W365', 'dll_id_author': 'A4821', 'title': 'tractatus'}
Deterministic author match: {'authorized_name': 'kircher, athanasius, 1602-1680', 'author_id': 'A4106'}
Tokenized title match: {'dll_id_work': 'W2525', 'dll_id_author': 'A4610', 'title': 'distributio'}
Embedding author match: meursius, johannes, 1579-1639 with similarity 0.69
Tokenized title match: {'dll_id_work': 'W365', 'dll_id_author': 'A4821', 'title': 'tractatus'}
Embedding author match: meursius, johannes, 1579-1639 with similarity 0.69
Tokenized title match: {'dll

In [39]:
output_df.head()


Unnamed: 0,author,title,publisher,place,year,url,authorized_name,dll_id_author,author_match_similarity,author_match_method,matched_title,dll_id_work,title_match_similarity,title_match_method
0,"Du Creux, François, 1596?-1666.","Historiæ canadensis, seu Novæ-Franciæ libri de...",Apud Sebastianum Cramoisy et Sebast. Mabre-Cra...,fr,1664,https://hdl.handle.net/2027/aeu.ark:/13960/t25...,Unknown,Unknown,0.0,unknown,Unknown,Unknown,0.0,unknown
1,"Meyer, Ernst H. F. 1791-1858.",Ernesti Meyer de plantis labradoricis libri tres.,"Sumtibus Leopoldi Vossii, 1830.",gw,1830,https://hdl.handle.net/2027/aeu.ark:/13960/t5q...,Unknown,Unknown,0.0,unknown,Unknown,Unknown,0.0,unknown
2,"Laet, Joannes de, 1593-1649.","Novus orbis, seu Descriptionis Indiae Occident...","Apud Elzevirios, 1633.",ne,1633,https://hdl.handle.net/2027/aeu.ark:/13960/t61...,Unknown,Unknown,0.0,unknown,Unknown,Unknown,0.0,unknown
3,"Caesar, Julius",C. Julii Cæsaris commentariorum De Bello Galli...,"Armour and Ramsay, 1849.",quc,1849,https://hdl.handle.net/2027/aeu.ark:/13960/t6t...,Unknown,Unknown,0.0,unknown,Unknown,Unknown,0.0,unknown
4,Unknown,Collectanea latina seu ecclesiasticæ antiquita...,"[s.n.], 1853.",onc,1853,https://hdl.handle.net/2027/aeu.ark:/13960/t77...,Unknown,Unknown,0.0,unknown,Unknown,Unknown,0.0,unknown


In [38]:
output_df['title_match_similarity'].value_counts()

title_match_similarity
0.0    16838
1.0     7961
Name: count, dtype: int64
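
As `process_metadata` is written above, any matched title receives 1.0 and any unmatched title receives 0.0, so these two counts can be read as matched versus unmatched titles. The arithmetic below just converts them to a share of all records.

```python
# Counts copied from the value_counts() output above
matched = 7961
unmatched = 16838

total = matched + unmatched
match_rate = matched / total

print(f"{matched} of {total} titles matched ({match_rate:.1%})")
```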